Module 24: Retry Strategies

Why Retries Matter

Transient failures are common in distributed systems. Smart retry strategies improve reliability without overwhelming services.
Transient failures (safe to retry):
  • Network timeout
  • Connection refused (service restarting)
  • 503 Service Unavailable
  • 429 Too Many Requests
  • Database deadlock

Permanent failures (do not retry):
  • 400 Bad Request
  • 401 Unauthorized
  • 404 Not Found
  • 422 Validation Error
  • Business logic errors

Retry pitfall: the thundering herd. When many clients retry on the same fixed
schedule, their retries arrive in synchronized waves that can keep a recovering
server down:

  Without jitter:           With jitter:
  t=0:   ████████████       t=0:   ███
  t=1:   ████████████       t=0.5: ██
  t=2:   ████████████       t=1.2: ████
                            t=1.8: ██
  (thundering herd!)        (load spread out)
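The transient/permanent split above can be encoded directly. A minimal sketch (the `shouldRetryStatus` name is illustrative, not part of this module's code):

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldRetryStatus reflects the classification above: 429 and 5xx
// are transient, while most 4xx responses are permanent.
func shouldRetryStatus(code int) bool {
	switch code {
	case http.StatusTooManyRequests: // 429: the server asked us to slow down
		return true
	case http.StatusBadRequest, http.StatusUnauthorized,
		http.StatusNotFound, http.StatusUnprocessableEntity:
		return false // permanent: retrying cannot change the outcome
	default:
		return code >= 500 // 500/502/503/504 are usually transient
	}
}

func main() {
	for _, code := range []int{400, 429, 503} {
		fmt.Printf("%d retryable=%v\n", code, shouldRetryStatus(code))
	}
}
```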

Retry Strategies

1. Immediate retry
   fail → retry → retry → retry → give up
   Use: very short-lived failures.
   Risk: can overwhelm an already struggling service.

2. Fixed delay
   fail → wait 1s → retry → wait 1s → retry
   Simple, but does not adapt to failure severity.

3. Exponential backoff
   fail → 1s → retry → 2s → retry → 4s → retry → 8s
   delay = min(base * 2^attempt, maxDelay)
   Adapts to severity and reduces load over time.

4. Exponential backoff + jitter
   fail → 0.8s → retry → 2.3s → retry → 3.9s → retry
   delay = random(0, min(base * 2^attempt, maxDelay))
   Best practice: prevents the thundering herd.

5. Decorrelated jitter
   delay = min(maxDelay, random(base, lastDelay * 3))
   The approach recommended by AWS.

Retry Implementation

```go
package retry

import (
	"context"
	"errors"
	"math"
	"math/rand"
	"time"
)

// RetryableError indicates an error that can be retried
type RetryableError struct {
	Err       error
	Retryable bool
}

func (e *RetryableError) Error() string { return e.Err.Error() }
func (e *RetryableError) Unwrap() error { return e.Err }

// IsRetryable checks if an error should be retried
func IsRetryable(err error) bool {
	var retryableErr *RetryableError
	if errors.As(err, &retryableErr) {
		return retryableErr.Retryable
	}
	// Default: assume retryable for network errors
	return true
}

// Strategy defines how retries should be performed
type Strategy interface {
	NextDelay(attempt int) time.Duration
	MaxAttempts() int
}

// FixedDelay implements fixed delay retry
type FixedDelay struct {
	Delay    time.Duration
	Attempts int
}

func (f *FixedDelay) NextDelay(attempt int) time.Duration { return f.Delay }
func (f *FixedDelay) MaxAttempts() int                    { return f.Attempts }

// ExponentialBackoff implements exponential backoff
type ExponentialBackoff struct {
	InitialDelay time.Duration
	MaxDelay     time.Duration
	Multiplier   float64
	Attempts     int
}

func NewExponentialBackoff() *ExponentialBackoff {
	return &ExponentialBackoff{
		InitialDelay: 100 * time.Millisecond,
		MaxDelay:     30 * time.Second,
		Multiplier:   2.0,
		Attempts:     5,
	}
}

func (e *ExponentialBackoff) NextDelay(attempt int) time.Duration {
	delay := float64(e.InitialDelay) * math.Pow(e.Multiplier, float64(attempt))
	if delay > float64(e.MaxDelay) {
		delay = float64(e.MaxDelay)
	}
	return time.Duration(delay)
}

func (e *ExponentialBackoff) MaxAttempts() int { return e.Attempts }

// ExponentialBackoffWithJitter adds jitter to exponential backoff
type ExponentialBackoffWithJitter struct {
	ExponentialBackoff
	JitterFactor float64 // 0.0 to 1.0
}

func NewExponentialBackoffWithJitter() *ExponentialBackoffWithJitter {
	return &ExponentialBackoffWithJitter{
		ExponentialBackoff: *NewExponentialBackoff(),
		JitterFactor:       0.5,
	}
}

func (e *ExponentialBackoffWithJitter) NextDelay(attempt int) time.Duration {
	baseDelay := e.ExponentialBackoff.NextDelay(attempt)
	// Scale by a random factor in [1-jitterFactor, 1+jitterFactor)
	jitter := 1.0 - e.JitterFactor + rand.Float64()*e.JitterFactor*2
	return time.Duration(float64(baseDelay) * jitter)
}

// DecorrelatedJitter implements AWS-style decorrelated jitter
type DecorrelatedJitter struct {
	BaseDelay time.Duration
	MaxDelay  time.Duration
	Attempts  int
	lastDelay time.Duration
}

func NewDecorrelatedJitter() *DecorrelatedJitter {
	return &DecorrelatedJitter{
		BaseDelay: 100 * time.Millisecond,
		MaxDelay:  30 * time.Second,
		Attempts:  5,
	}
}

func (d *DecorrelatedJitter) NextDelay(attempt int) time.Duration {
	if attempt == 0 {
		d.lastDelay = d.BaseDelay
		return d.BaseDelay
	}
	// delay = random(base, min(lastDelay * 3, maxDelay))
	minDelay := float64(d.BaseDelay)
	maxDelay := float64(d.lastDelay) * 3
	if maxDelay > float64(d.MaxDelay) {
		maxDelay = float64(d.MaxDelay)
	}
	delay := minDelay + rand.Float64()*(maxDelay-minDelay)
	d.lastDelay = time.Duration(delay)
	return d.lastDelay
}

func (d *DecorrelatedJitter) MaxAttempts() int { return d.Attempts }

// Retryer performs retries with a given strategy
type Retryer struct {
	strategy    Strategy
	shouldRetry func(error) bool
	onRetry     func(attempt int, err error, delay time.Duration)
}

func NewRetryer(strategy Strategy) *Retryer {
	return &Retryer{
		strategy:    strategy,
		shouldRetry: IsRetryable,
	}
}

func (r *Retryer) WithRetryCheck(fn func(error) bool) *Retryer {
	r.shouldRetry = fn
	return r
}

func (r *Retryer) WithOnRetry(fn func(attempt int, err error, delay time.Duration)) *Retryer {
	r.onRetry = fn
	return r
}

// Do executes the function with retries
func (r *Retryer) Do(fn func() error) error {
	return r.DoWithContext(context.Background(), func(ctx context.Context) error {
		return fn()
	})
}

// DoWithContext executes with context support
func (r *Retryer) DoWithContext(ctx context.Context, fn func(context.Context) error) error {
	var lastErr error
	for attempt := 0; attempt < r.strategy.MaxAttempts(); attempt++ {
		err := fn(ctx)
		if err == nil {
			return nil
		}
		lastErr = err

		// Check if we should retry
		if !r.shouldRetry(err) {
			return err
		}

		// Check if we've exhausted attempts
		if attempt == r.strategy.MaxAttempts()-1 {
			break
		}

		// Calculate the delay for this attempt
		delay := r.strategy.NextDelay(attempt)
		if r.onRetry != nil {
			r.onRetry(attempt+1, err, delay)
		}

		// Wait, honoring context cancellation
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
	}
	return lastErr
}

// DoWithResult executes and returns the result
func (r *Retryer) DoWithResult(fn func() (interface{}, error)) (interface{}, error) {
	var result interface{}
	err := r.Do(func() error {
		var err error
		result, err = fn()
		return err
	})
	return result, err
}
```

HTTP Retry Client

```go
package retry

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// HTTPClient wraps http.Client with retry support
type HTTPClient struct {
	client  *http.Client
	retryer *Retryer
}

func NewHTTPClient(timeout time.Duration, strategy Strategy) *HTTPClient {
	retryer := NewRetryer(strategy)
	retryer.WithRetryCheck(isHTTPRetryable)
	return &HTTPClient{
		client:  &http.Client{Timeout: timeout},
		retryer: retryer,
	}
}

// isHTTPRetryable determines if a transport-level error should be retried
func isHTTPRetryable(err error) bool {
	// Network errors (timeouts, resets) are retryable
	return true
}

// isStatusRetryable determines if an HTTP status should be retried
func isStatusRetryable(status int) bool {
	switch status {
	case http.StatusTooManyRequests, // 429
		http.StatusBadGateway,         // 502
		http.StatusServiceUnavailable, // 503
		http.StatusGatewayTimeout:     // 504
		return true
	default:
		return status >= 500
	}
}

// Do performs an HTTP request with retries
func (c *HTTPClient) Do(req *http.Request) (*http.Response, error) {
	var resp *http.Response
	err := c.retryer.DoWithContext(req.Context(), func(ctx context.Context) error {
		// Clone the request for each attempt
		reqCopy := req.Clone(ctx)

		// Reset the body if present; GetBody allows replaying it
		if req.GetBody != nil {
			body, err := req.GetBody()
			if err != nil {
				return err
			}
			reqCopy.Body = body
		}

		var err error
		resp, err = c.client.Do(reqCopy)
		if err != nil {
			return err
		}

		// Treat retryable status codes as errors so the retryer loops
		if isStatusRetryable(resp.StatusCode) {
			// Read and discard the body so the connection can be reused
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()
			return fmt.Errorf("retryable status: %d", resp.StatusCode)
		}
		return nil
	})
	return resp, err
}

// Get performs a GET request with retries
func (c *HTTPClient) Get(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	return c.Do(req)
}

// RetryAfterClient honors the Retry-After header on rate-limited responses
type RetryAfterClient struct {
	client  *http.Client
	maxWait time.Duration
}

func NewRetryAfterClient(timeout, maxWait time.Duration) *RetryAfterClient {
	return &RetryAfterClient{
		client:  &http.Client{Timeout: timeout},
		maxWait: maxWait,
	}
}

func (c *RetryAfterClient) Do(ctx context.Context, req *http.Request) (*http.Response, error) {
	for {
		reqCopy := req.Clone(ctx)
		if req.GetBody != nil {
			body, _ := req.GetBody()
			reqCopy.Body = body
		}

		resp, err := c.client.Do(reqCopy)
		if err != nil {
			return nil, err
		}

		// Check for rate limiting
		if resp.StatusCode == http.StatusTooManyRequests {
			retryAfter := parseRetryAfter(resp.Header.Get("Retry-After"))
			if retryAfter > c.maxWait {
				return resp, nil // don't wait longer than maxWait
			}
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()

			select {
			case <-ctx.Done():
				return nil, ctx.Err()
			case <-time.After(retryAfter):
				continue
			}
		}
		return resp, nil
	}
}

func parseRetryAfter(header string) time.Duration {
	if header == "" {
		return time.Second
	}
	// Try parsing as delay-seconds
	var seconds int
	if _, err := fmt.Sscanf(header, "%d", &seconds); err == nil {
		return time.Duration(seconds) * time.Second
	}
	// Try parsing as an HTTP-date
	if t, err := time.Parse(time.RFC1123, header); err == nil {
		return time.Until(t)
	}
	return time.Second
}
```

Idempotency for Safe Retries

```go
package retry

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"net/http"
	"sync"
	"time"
)

// IdempotencyKey uniquely identifies a request
type IdempotencyKey string

// IdempotencyStore tracks processed requests
type IdempotencyStore interface {
	Get(ctx context.Context, key IdempotencyKey) (*IdempotencyRecord, error)
	Set(ctx context.Context, key IdempotencyKey, record *IdempotencyRecord) error
	Delete(ctx context.Context, key IdempotencyKey) error
}

type IdempotencyRecord struct {
	Key        IdempotencyKey
	Response   []byte
	StatusCode int
	CreatedAt  time.Time
	ExpiresAt  time.Time
}

// InMemoryIdempotencyStore is a simple in-memory implementation
type InMemoryIdempotencyStore struct {
	mu      sync.RWMutex
	records map[IdempotencyKey]*IdempotencyRecord
}

func NewInMemoryIdempotencyStore() *InMemoryIdempotencyStore {
	store := &InMemoryIdempotencyStore{
		records: make(map[IdempotencyKey]*IdempotencyRecord),
	}
	go store.cleanup()
	return store
}

func (s *InMemoryIdempotencyStore) Get(ctx context.Context, key IdempotencyKey) (*IdempotencyRecord, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	record, ok := s.records[key]
	if !ok {
		return nil, nil
	}
	if time.Now().After(record.ExpiresAt) {
		return nil, nil
	}
	return record, nil
}

func (s *InMemoryIdempotencyStore) Set(ctx context.Context, key IdempotencyKey, record *IdempotencyRecord) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.records[key] = record
	return nil
}

func (s *InMemoryIdempotencyStore) Delete(ctx context.Context, key IdempotencyKey) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.records, key)
	return nil
}

func (s *InMemoryIdempotencyStore) cleanup() {
	ticker := time.NewTicker(time.Minute)
	for range ticker.C {
		s.mu.Lock()
		now := time.Now()
		for key, record := range s.records {
			if now.After(record.ExpiresAt) {
				delete(s.records, key)
			}
		}
		s.mu.Unlock()
	}
}

// IdempotentExecutor ensures an operation's result is computed once
// and reused for duplicate requests
type IdempotentExecutor struct {
	store IdempotencyStore
	ttl   time.Duration
}

func NewIdempotentExecutor(store IdempotencyStore, ttl time.Duration) *IdempotentExecutor {
	return &IdempotentExecutor{
		store: store,
		ttl:   ttl,
	}
}

// Execute runs the operation idempotently
func (e *IdempotentExecutor) Execute(
	ctx context.Context,
	key IdempotencyKey,
	operation func(ctx context.Context) ([]byte, int, error),
) ([]byte, int, error) {
	// Check for an existing result
	record, err := e.store.Get(ctx, key)
	if err != nil {
		return nil, 0, err
	}
	if record != nil {
		// Return the cached result
		return record.Response, record.StatusCode, nil
	}

	// Execute the operation
	response, statusCode, err := operation(ctx)
	if err != nil {
		return nil, 0, err
	}

	// Store the result
	newRecord := &IdempotencyRecord{
		Key:        key,
		Response:   response,
		StatusCode: statusCode,
		CreatedAt:  time.Now(),
		ExpiresAt:  time.Now().Add(e.ttl),
	}
	if err := e.store.Set(ctx, key, newRecord); err != nil {
		// Log but don't fail: the operation itself succeeded
	}
	return response, statusCode, nil
}

// GenerateIdempotencyKey derives a key from request data
func GenerateIdempotencyKey(userID, operation string, data []byte) IdempotencyKey {
	hash := sha256.New()
	hash.Write([]byte(userID))
	hash.Write([]byte(operation))
	hash.Write(data)
	return IdempotencyKey(hex.EncodeToString(hash.Sum(nil)))
}

// CreatePaymentHandler shows usage in an HTTP handler
func CreatePaymentHandler(executor *IdempotentExecutor) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Get the idempotency key from the request header
		keyHeader := r.Header.Get("Idempotency-Key")
		if keyHeader == "" {
			http.Error(w, "Idempotency-Key header required", http.StatusBadRequest)
			return
		}
		key := IdempotencyKey(keyHeader)

		response, statusCode, err := executor.Execute(r.Context(), key,
			func(ctx context.Context) ([]byte, int, error) {
				// Actually process the payment
				result := processPayment(ctx, r)
				return result, http.StatusOK, nil
			})
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		w.WriteHeader(statusCode)
		w.Write(response)
	}
}

func processPayment(ctx context.Context, r *http.Request) []byte {
	// Payment processing logic
	return []byte(`{"status": "success"}`)
}
```
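The key derivation is what makes retries safe: hashing user, operation, and payload deterministically means a retried request maps to the same stored record. A standalone sketch (`generateKey` mirrors `GenerateIdempotencyKey` above but returns a plain string so it runs on its own):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// generateKey hashes the identifying parts of a request into a
// deterministic idempotency key.
func generateKey(userID, operation string, data []byte) string {
	h := sha256.New()
	h.Write([]byte(userID))
	h.Write([]byte(operation))
	h.Write(data)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	k1 := generateKey("user-42", "create-payment", []byte(`{"amount":100}`))
	k2 := generateKey("user-42", "create-payment", []byte(`{"amount":100}`))
	// Same inputs produce the same key, so a retried request hits
	// the cached record instead of charging the customer twice.
	fmt.Println(k1 == k2) // true
}
```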

Best Practices

1. Use exponential backoff with jitter
   • Prevents the thundering herd
   • Adapts to failure severity
   • Industry best practice

2. Set maximum retries and a timeout
   • Don't retry forever
   • Set an overall deadline
   • Fail fast when appropriate

3. Only retry transient errors
   • Network timeouts: yes
   • 503 Service Unavailable: yes
   • 400 Bad Request: no
   • 401 Unauthorized: no

4. Make operations idempotent
   • Use idempotency keys
   • Design for safe retries
   • Avoid duplicate side effects

5. Respect Retry-After headers
   • Honor rate limit guidance
   • Don't hammer services
   • Be a good citizen

6. Log and monitor retries
   • Track retry rates
   • Alert on high retry rates
   • Identify problematic dependencies

Summary

Strategies:
  • Fixed delay: simple but inflexible
  • Exponential backoff: adapts to failure severity
  • With jitter: prevents the thundering herd
  • Decorrelated jitter: best for high concurrency

Key considerations:
  • Only retry transient errors
  • Set maximum attempts and a timeout
  • Make operations idempotent
  • Monitor retry rates

Key insight:
  "Retries turn transient failures into eventual success,
   but without jitter they can turn recovery into collapse."

Tags: retry, backoff, resilience, distributed-systems