Module 24: Retry Strategies

Why Retries Matter

Transient failures are common in distributed systems. Smart retry strategies improve reliability without overwhelming services.
Transient failures (safe to retry):
  • Network timeout
  • Connection refused (service restarting)
  • 503 Service Unavailable
  • 429 Too Many Requests
  • Database deadlock

Permanent failures (do not retry):
  • 400 Bad Request
  • 401 Unauthorized
  • 404 Not Found
  • 422 Validation Error
  • Business logic errors

Retry pitfall: the thundering herd. When many clients retry on the same fixed
schedule, their retries arrive in synchronized waves that can keep a recovering
server down:

  Without jitter:           With jitter:
  t=0:   ████████████       t=0:   ███
  t=1:   ████████████       t=0.5: ██
  t=2:   ████████████       t=1.2: ████
                            t=1.8: ██
  (thundering herd!)        (load spread out)
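The transient/permanent split above can be encoded directly. A minimal sketch (the `shouldRetryStatus` name is illustrative, not part of this module's code):

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldRetryStatus reflects the classification above: 429 and 5xx
// are transient, while most 4xx responses are permanent.
func shouldRetryStatus(code int) bool {
	switch code {
	case http.StatusTooManyRequests: // 429: the server asked us to slow down
		return true
	case http.StatusBadRequest, http.StatusUnauthorized,
		http.StatusNotFound, http.StatusUnprocessableEntity:
		return false // permanent: retrying cannot change the outcome
	default:
		return code >= 500 // 500/502/503/504 are usually transient
	}
}

func main() {
	for _, code := range []int{400, 429, 503} {
		fmt.Printf("%d retryable=%v\n", code, shouldRetryStatus(code))
	}
}
```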

Retry Strategies

1. Immediate retry
   fail → retry → retry → retry → give up
   Use: very short-lived failures.
   Risk: can overwhelm an already struggling service.

2. Fixed delay
   fail → wait 1s → retry → wait 1s → retry
   Simple, but does not adapt to failure severity.

3. Exponential backoff
   fail → 1s → retry → 2s → retry → 4s → retry → 8s
   delay = min(base * 2^attempt, maxDelay)
   Adapts to severity and reduces load over time.

4. Exponential backoff + jitter
   fail → 0.8s → retry → 2.3s → retry → 3.9s → retry
   delay = random(0, min(base * 2^attempt, maxDelay))
   Best practice: prevents the thundering herd.

5. Decorrelated jitter
   delay = min(maxDelay, random(base, lastDelay * 3))
   The approach recommended by AWS.

Retry Implementation

```go
package retry

import (
	"context"
	"errors"
	"math"
	"math/rand"
	"time"
)

// RetryableError indicates an error that can be retried
type RetryableError struct {
	Err       error
	Retryable bool
}

func (e *RetryableError) Error() string { return e.Err.Error() }
func (e *RetryableError) Unwrap() error { return e.Err }

// IsRetryable checks if an error should be retried
func IsRetryable(err error) bool {
	var retryableErr *RetryableError
	if errors.As(err, &retryableErr) {
		return retryableErr.Retryable
	}
	// Default: assume retryable for network errors
	return true
}

// Strategy defines how retries should be performed
type Strategy interface {
	NextDelay(attempt int) time.Duration
	MaxAttempts() int
}

// FixedDelay implements fixed delay retry
type FixedDelay struct {
	Delay    time.Duration
	Attempts int
}

func (f *FixedDelay) NextDelay(attempt int) time.Duration { return f.Delay }
func (f *FixedDelay) MaxAttempts() int                    { return f.Attempts }

// ExponentialBackoff implements exponential backoff
type ExponentialBackoff struct {
	InitialDelay time.Duration
	MaxDelay     time.Duration
	Multiplier   float64
	Attempts     int
}

func NewExponentialBackoff() *ExponentialBackoff {
	return &ExponentialBackoff{
		InitialDelay: 100 * time.Millisecond,
		MaxDelay:     30 * time.Second,
		Multiplier:   2.0,
		Attempts:     5,
	}
}

func (e *ExponentialBackoff) NextDelay(attempt int) time.Duration {
	delay := float64(e.InitialDelay) * math.Pow(e.Multiplier, float64(attempt))
	if delay > float64(e.MaxDelay) {
		delay = float64(e.MaxDelay)
	}
	return time.Duration(delay)
}

func (e *ExponentialBackoff) MaxAttempts() int { return e.Attempts }

// ExponentialBackoffWithJitter adds jitter to exponential backoff
type ExponentialBackoffWithJitter struct {
	ExponentialBackoff
	JitterFactor float64 // 0.0 to 1.0
}

func NewExponentialBackoffWithJitter() *ExponentialBackoffWithJitter {
	return &ExponentialBackoffWithJitter{
		ExponentialBackoff: *NewExponentialBackoff(),
		JitterFactor:       0.5,
	}
}

func (e *ExponentialBackoffWithJitter) NextDelay(attempt int) time.Duration {
	baseDelay := e.ExponentialBackoff.NextDelay(attempt)
	// Scale by a random factor in [1-jitterFactor, 1+jitterFactor)
	jitter := 1.0 - e.JitterFactor + rand.Float64()*e.JitterFactor*2
	return time.Duration(float64(baseDelay) * jitter)
}

// DecorrelatedJitter implements AWS-style decorrelated jitter
type DecorrelatedJitter struct {
	BaseDelay time.Duration
	MaxDelay  time.Duration
	Attempts  int
	lastDelay time.Duration
}

func NewDecorrelatedJitter() *DecorrelatedJitter {
	return &DecorrelatedJitter{
		BaseDelay: 100 * time.Millisecond,
		MaxDelay:  30 * time.Second,
		Attempts:  5,
	}
}

func (d *DecorrelatedJitter) NextDelay(attempt int) time.Duration {
	if attempt == 0 {
		d.lastDelay = d.BaseDelay
		return d.BaseDelay
	}
	// delay = random(base, min(lastDelay * 3, maxDelay))
	minDelay := float64(d.BaseDelay)
	maxDelay := float64(d.lastDelay) * 3
	if maxDelay > float64(d.MaxDelay) {
		maxDelay = float64(d.MaxDelay)
	}
	delay := minDelay + rand.Float64()*(maxDelay-minDelay)
	d.lastDelay = time.Duration(delay)
	return d.lastDelay
}

func (d *DecorrelatedJitter) MaxAttempts() int { return d.Attempts }

// Retryer performs retries with a given strategy
type Retryer struct {
	strategy    Strategy
	shouldRetry func(error) bool
	onRetry     func(attempt int, err error, delay time.Duration)
}

func NewRetryer(strategy Strategy) *Retryer {
	return &Retryer{
		strategy:    strategy,
		shouldRetry: IsRetryable,
	}
}

func (r *Retryer) WithRetryCheck(fn func(error) bool) *Retryer {
	r.shouldRetry = fn
	return r
}

func (r *Retryer) WithOnRetry(fn func(attempt int, err error, delay time.Duration)) *Retryer {
	r.onRetry = fn
	return r
}

// Do executes the function with retries
func (r *Retryer) Do(fn func() error) error {
	return r.DoWithContext(context.Background(), func(ctx context.Context) error {
		return fn()
	})
}

// DoWithContext executes with context support
func (r *Retryer) DoWithContext(ctx context.Context, fn func(context.Context) error) error {
	var lastErr error
	for attempt := 0; attempt < r.strategy.MaxAttempts(); attempt++ {
		err := fn(ctx)
		if err == nil {
			return nil
		}
		lastErr = err

		// Check if we should retry
		if !r.shouldRetry(err) {
			return err
		}

		// Check if we've exhausted attempts
		if attempt == r.strategy.MaxAttempts()-1 {
			break
		}

		// Calculate the delay for this attempt
		delay := r.strategy.NextDelay(attempt)
		if r.onRetry != nil {
			r.onRetry(attempt+1, err, delay)
		}

		// Wait, honoring context cancellation
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
	}
	return lastErr
}

// DoWithResult executes and returns the result
func (r *Retryer) DoWithResult(fn func() (interface{}, error)) (interface{}, error) {
	var result interface{}
	err := r.Do(func() error {
		var err error
		result, err = fn()
		return err
	})
	return result, err
}
```

HTTP Retry Client

```go
package retry

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// HTTPClient wraps http.Client with retry support
type HTTPClient struct {
	client  *http.Client
	retryer *Retryer
}

func NewHTTPClient(timeout time.Duration, strategy Strategy) *HTTPClient {
	retryer := NewRetryer(strategy)
	retryer.WithRetryCheck(isHTTPRetryable)
	return &HTTPClient{
		client:  &http.Client{Timeout: timeout},
		retryer: retryer,
	}
}

// isHTTPRetryable determines if a transport-level error should be retried
func isHTTPRetryable(err error) bool {
	// Network errors (timeouts, resets) are retryable
	return true
}

// isStatusRetryable determines if an HTTP status should be retried
func isStatusRetryable(status int) bool {
	switch status {
	case http.StatusTooManyRequests, // 429
		http.StatusBadGateway,         // 502
		http.StatusServiceUnavailable, // 503
		http.StatusGatewayTimeout:     // 504
		return true
	default:
		return status >= 500
	}
}

// Do performs an HTTP request with retries
func (c *HTTPClient) Do(req *http.Request) (*http.Response, error) {
	var resp *http.Response
	err := c.retryer.DoWithContext(req.Context(), func(ctx context.Context) error {
		// Clone the request for each attempt
		reqCopy := req.Clone(ctx)

		// Reset the body if present; GetBody allows replaying it
		if req.GetBody != nil {
			body, err := req.GetBody()
			if err != nil {
				return err
			}
			reqCopy.Body = body
		}

		var err error
		resp, err = c.client.Do(reqCopy)
		if err != nil {
			return err
		}

		// Treat retryable status codes as errors so the retryer loops
		if isStatusRetryable(resp.StatusCode) {
			// Read and discard the body so the connection can be reused
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()
			return fmt.Errorf("retryable status: %d", resp.StatusCode)
		}
		return nil
	})
	return resp, err
}

// Get performs a GET request with retries
func (c *HTTPClient) Get(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	return c.Do(req)
}

// RetryAfterClient honors the Retry-After header on rate-limited responses
type RetryAfterClient struct {
	client  *http.Client
	maxWait time.Duration
}

func NewRetryAfterClient(timeout, maxWait time.Duration) *RetryAfterClient {
	return &RetryAfterClient{
		client:  &http.Client{Timeout: timeout},
		maxWait: maxWait,
	}
}

func (c *RetryAfterClient) Do(ctx context.Context, req *http.Request) (*http.Response, error) {
	for {
		reqCopy := req.Clone(ctx)
		if req.GetBody != nil {
			body, _ := req.GetBody()
			reqCopy.Body = body
		}

		resp, err := c.client.Do(reqCopy)
		if err != nil {
			return nil, err
		}

		// Check for rate limiting
		if resp.StatusCode == http.StatusTooManyRequests {
			retryAfter := parseRetryAfter(resp.Header.Get("Retry-After"))
			if retryAfter > c.maxWait {
				return resp, nil // don't wait longer than maxWait
			}
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()

			select {
			case <-ctx.Done():
				return nil, ctx.Err()
			case <-time.After(retryAfter):
				continue
			}
		}
		return resp, nil
	}
}

func parseRetryAfter(header string) time.Duration {
	if header == "" {
		return time.Second
	}
	// Try parsing as delay-seconds
	var seconds int
	if _, err := fmt.Sscanf(header, "%d", &seconds); err == nil {
		return time.Duration(seconds) * time.Second
	}
	// Try parsing as an HTTP-date
	if t, err := time.Parse(time.RFC1123, header); err == nil {
		return time.Until(t)
	}
	return time.Second
}
```

Idempotency for Safe Retries

```go
package retry

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"net/http"
	"sync"
	"time"
)

// IdempotencyKey uniquely identifies a request
type IdempotencyKey string

// IdempotencyStore tracks processed requests
type IdempotencyStore interface {
	Get(ctx context.Context, key IdempotencyKey) (*IdempotencyRecord, error)
	Set(ctx context.Context, key IdempotencyKey, record *IdempotencyRecord) error
	Delete(ctx context.Context, key IdempotencyKey) error
}

type IdempotencyRecord struct {
	Key        IdempotencyKey
	Response   []byte
	StatusCode int
	CreatedAt  time.Time
	ExpiresAt  time.Time
}

// InMemoryIdempotencyStore is a simple in-memory implementation
type InMemoryIdempotencyStore struct {
	mu      sync.RWMutex
	records map[IdempotencyKey]*IdempotencyRecord
}

func NewInMemoryIdempotencyStore() *InMemoryIdempotencyStore {
	store := &InMemoryIdempotencyStore{
		records: make(map[IdempotencyKey]*IdempotencyRecord),
	}
	go store.cleanup()
	return store
}

func (s *InMemoryIdempotencyStore) Get(ctx context.Context, key IdempotencyKey) (*IdempotencyRecord, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	record, ok := s.records[key]
	if !ok {
		return nil, nil
	}
	if time.Now().After(record.ExpiresAt) {
		return nil, nil
	}
	return record, nil
}

func (s *InMemoryIdempotencyStore) Set(ctx context.Context, key IdempotencyKey, record *IdempotencyRecord) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.records[key] = record
	return nil
}

func (s *InMemoryIdempotencyStore) Delete(ctx context.Context, key IdempotencyKey) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.records, key)
	return nil
}

func (s *InMemoryIdempotencyStore) cleanup() {
	ticker := time.NewTicker(time.Minute)
	for range ticker.C {
		s.mu.Lock()
		now := time.Now()
		for key, record := range s.records {
			if now.After(record.ExpiresAt) {
				delete(s.records, key)
			}
		}
		s.mu.Unlock()
	}
}

// IdempotentExecutor ensures an operation's result is computed once
// and reused for duplicate requests
type IdempotentExecutor struct {
	store IdempotencyStore
	ttl   time.Duration
}

func NewIdempotentExecutor(store IdempotencyStore, ttl time.Duration) *IdempotentExecutor {
	return &IdempotentExecutor{
		store: store,
		ttl:   ttl,
	}
}

// Execute runs the operation idempotently
func (e *IdempotentExecutor) Execute(
	ctx context.Context,
	key IdempotencyKey,
	operation func(ctx context.Context) ([]byte, int, error),
) ([]byte, int, error) {
	// Check for an existing result
	record, err := e.store.Get(ctx, key)
	if err != nil {
		return nil, 0, err
	}
	if record != nil {
		// Return the cached result
		return record.Response, record.StatusCode, nil
	}

	// Execute the operation
	response, statusCode, err := operation(ctx)
	if err != nil {
		return nil, 0, err
	}

	// Store the result
	newRecord := &IdempotencyRecord{
		Key:        key,
		Response:   response,
		StatusCode: statusCode,
		CreatedAt:  time.Now(),
		ExpiresAt:  time.Now().Add(e.ttl),
	}
	if err := e.store.Set(ctx, key, newRecord); err != nil {
		// Log but don't fail: the operation itself succeeded
	}
	return response, statusCode, nil
}

// GenerateIdempotencyKey derives a key from request data
func GenerateIdempotencyKey(userID, operation string, data []byte) IdempotencyKey {
	hash := sha256.New()
	hash.Write([]byte(userID))
	hash.Write([]byte(operation))
	hash.Write(data)
	return IdempotencyKey(hex.EncodeToString(hash.Sum(nil)))
}

// CreatePaymentHandler shows usage in an HTTP handler
func CreatePaymentHandler(executor *IdempotentExecutor) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Get the idempotency key from the request header
		keyHeader := r.Header.Get("Idempotency-Key")
		if keyHeader == "" {
			http.Error(w, "Idempotency-Key header required", http.StatusBadRequest)
			return
		}
		key := IdempotencyKey(keyHeader)

		response, statusCode, err := executor.Execute(r.Context(), key,
			func(ctx context.Context) ([]byte, int, error) {
				// Actually process the payment
				result := processPayment(ctx, r)
				return result, http.StatusOK, nil
			})
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		w.WriteHeader(statusCode)
		w.Write(response)
	}
}

func processPayment(ctx context.Context, r *http.Request) []byte {
	// Payment processing logic
	return []byte(`{"status": "success"}`)
}
```
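The key derivation is what makes retries safe: hashing user, operation, and payload deterministically means a retried request maps to the same stored record. A standalone sketch (`generateKey` mirrors `GenerateIdempotencyKey` above but returns a plain string so it runs on its own):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// generateKey hashes the identifying parts of a request into a
// deterministic idempotency key.
func generateKey(userID, operation string, data []byte) string {
	h := sha256.New()
	h.Write([]byte(userID))
	h.Write([]byte(operation))
	h.Write(data)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	k1 := generateKey("user-42", "create-payment", []byte(`{"amount":100}`))
	k2 := generateKey("user-42", "create-payment", []byte(`{"amount":100}`))
	// Same inputs produce the same key, so a retried request hits
	// the cached record instead of charging the customer twice.
	fmt.Println(k1 == k2) // true
}
```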

Best Practices

1. Use exponential backoff with jitter
   • Prevents the thundering herd
   • Adapts to failure severity
   • Industry best practice

2. Set maximum retries and a timeout
   • Don't retry forever
   • Set an overall deadline
   • Fail fast when appropriate

3. Only retry transient errors
   • Network timeouts: yes
   • 503 Service Unavailable: yes
   • 400 Bad Request: no
   • 401 Unauthorized: no

4. Make operations idempotent
   • Use idempotency keys
   • Design for safe retries
   • Avoid duplicate side effects

5. Respect Retry-After headers
   • Honor rate limit guidance
   • Don't hammer services
   • Be a good citizen

6. Log and monitor retries
   • Track retry rates
   • Alert on high retry rates
   • Identify problematic dependencies

Summary

Strategies:
  • Fixed delay: simple but inflexible
  • Exponential backoff: adapts to failure severity
  • With jitter: prevents the thundering herd
  • Decorrelated jitter: best for high concurrency

Key considerations:
  • Only retry transient errors
  • Set maximum attempts and a timeout
  • Make operations idempotent
  • Monitor retry rates

Key insight:
  "Retries turn transient failures into eventual success,
   but without jitter they can turn recovery into collapse."

Tags: retry, backoff, resilience, distributed-systems