Module 17: Multi-Region Architecture

Why Multi-Region?

As systems grow globally, running in a single region creates latency, availability, and compliance challenges.
1. LATENCY
   • Single region (US-East): US users 20ms, EU users 100ms, APAC users 200ms
   • Multi-region (US + EU + APAC): all users 20-40ms

2. AVAILABILITY
   • Single region: one region down = entire system down
   • Multi-region: one region down = failover to the others
   • Single region: 99.9% (8.7 hours downtime/year)
   • Multi-region: 99.99% (52 minutes downtime/year)

3. DATA RESIDENCY / COMPLIANCE
   • GDPR: EU user data must stay in the EU
   • CCPA: California data privacy requirements
   • Data sovereignty laws in many countries

4. DISASTER RECOVERY
   • Natural disasters, power outages, and network issues can take out an entire region
   • Multi-region provides geographic redundancy
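The availability numbers above can be sanity-checked: with independent regions, the combined unavailability is the product of each region's unavailability. A minimal sketch of that arithmetic (the function name is illustrative; the model assumes perfect, instantaneous failover, which is why real-world figures like 99.99% sit well below the theoretical bound):

```go
package main

import "fmt"

// combinedAvailability returns the availability of n independent regions,
// assuming perfect, instantaneous failover (an idealized upper bound).
func combinedAvailability(perRegion float64, n int) float64 {
	unavailable := 1.0
	for i := 0; i < n; i++ {
		unavailable *= 1 - perRegion
	}
	return 1 - unavailable
}

func main() {
	// Two regions at 99.9% each: the theoretical bound is 99.9999%;
	// imperfect detection and failover pull practice closer to 99.99%.
	fmt.Printf("%.6f\n", combinedAvailability(0.999, 2))
}
```

In practice failover is never instantaneous and regional failures are not fully independent (shared control planes, correlated deploys), so treat this as an upper bound, not a target.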

Multi-Region Patterns

1. ACTIVE-PASSIVE (Disaster Recovery)

   Active Region (US-East) ──sync──► Passive Region (US-West)
   All traffic hits the active region; the passive region is standby, used only on failover.

   • Primary handles all traffic
   • Secondary is a warm standby
   • Failover: manual or automated
   • Lower cost, higher RTO

2. ACTIVE-ACTIVE (Load Distribution)

   Active Region (US-East) ◄──sync──► Active Region (EU-West)
   US traffic goes to US-East; EU traffic goes to EU-West.

   • Both regions serve traffic
   • Bi-directional data sync
   • Near-instant failover
   • Higher cost, lower RTO

3. ACTIVE-ACTIVE-ACTIVE (Global)

   US-East ◄──► EU-West ◄──► APAC ◄──► US-East (full mesh)

   • All regions active
   • Mesh replication
   • Complex conflict resolution
   • Best latency globally

Data Replication Strategies

1. SYNCHRONOUS REPLICATION

   Write ──► Region A ──sync──► Region B
   Region A waits for Region B's ACK before returning OK.

   • Pros: strong consistency, no data loss
   • Cons: high latency; a region failure blocks writes
   • Use for: financial transactions, critical data

2. ASYNCHRONOUS REPLICATION

   Write ──► Region A ── returns OK immediately
   Region A replicates to Region B in the background, later.

   • Pros: low latency; a region failure doesn't block writes
   • Cons: eventual consistency, potential data loss
   • Use for: analytics, logs, non-critical data

3. SEMI-SYNCHRONOUS REPLICATION

   Write ──► Primary ──sync──► at least 1 replica
   Remaining replicas replicate asynchronously.

   • A balanced approach: durability + performance

Replication lag:
   • Same region: 1-5ms
   • Cross-region: 50-200ms
   • Cross-continent: 100-300ms

   This lag is what breaks read-after-write consistency!

Global Traffic Routing

DNS-based routing (GeoDNS):

   User (Paris) ─► DNS ─► returns EU server IP
   User (Tokyo) ─► DNS ─► returns APAC server IP

   Services: Route 53, Cloudflare, Google Cloud DNS

Routing policies:

   • Geolocation: EU users → eu-west-1; US users → us-east-1; default → us-east-1
   • Latency-based: route to the region with the lowest latency (based on health checks)
   • Weighted: us-east-1: 70%, eu-west-1: 30%
   • Failover: primary us-east-1; secondary us-west-2 (if the primary is unhealthy)

Anycast:

   • The same IP is advertised from multiple locations; the network routes to the nearest one
   • Used by Cloudflare and the major CDNs

Global Load Balancer Implementation

```go
package routing

import (
	"math"
	"math/rand"
	"net/http"
	"sync"
	"time"
)

// Region represents a deployment region.
type Region struct {
	Name      string
	Endpoint  string
	Healthy   bool
	Latency   time.Duration
	Weight    int
	Countries []string
}

// GlobalRouter routes requests to the best region.
type GlobalRouter struct {
	regions       []*Region
	healthChecker *HealthChecker
	mu            sync.RWMutex
}

func NewGlobalRouter(regions []*Region) *GlobalRouter {
	gr := &GlobalRouter{regions: regions}
	// The health checker shares the router's lock so that updates to
	// Healthy/Latency don't race with concurrent routing reads.
	gr.healthChecker = NewHealthChecker(regions, &gr.mu)
	go gr.healthChecker.Start()
	return gr
}

// RouteByGeolocation routes based on the user's country.
func (g *GlobalRouter) RouteByGeolocation(country string) *Region {
	g.mu.RLock()
	defer g.mu.RUnlock()

	// Find a healthy region that serves this country.
	for _, region := range g.regions {
		if !region.Healthy {
			continue
		}
		for _, c := range region.Countries {
			if c == country {
				return region
			}
		}
	}
	// Fall back to the first healthy region.
	for _, region := range g.regions {
		if region.Healthy {
			return region
		}
	}
	return nil
}

// RouteByLatency routes to the healthy region with the lowest latency.
// clientIP could feed a per-client geo lookup; this sketch uses the
// latency measured by health checks instead.
func (g *GlobalRouter) RouteByLatency(clientIP string) *Region {
	g.mu.RLock()
	defer g.mu.RUnlock()

	var bestRegion *Region
	minLatency := time.Duration(math.MaxInt64)
	for _, region := range g.regions {
		if !region.Healthy {
			continue
		}
		if region.Latency < minLatency {
			minLatency = region.Latency
			bestRegion = region
		}
	}
	return bestRegion
}

// RouteWithFailover routes to the primary region, falling back to the secondary.
func (g *GlobalRouter) RouteWithFailover(primary, secondary string) *Region {
	g.mu.RLock()
	defer g.mu.RUnlock()

	for _, region := range g.regions {
		if region.Name == primary && region.Healthy {
			return region
		}
	}
	for _, region := range g.regions {
		if region.Name == secondary && region.Healthy {
			return region
		}
	}
	// Last resort: return any healthy region.
	for _, region := range g.regions {
		if region.Healthy {
			return region
		}
	}
	return nil
}

// RouteWeighted picks a healthy region with probability proportional to its weight.
func (g *GlobalRouter) RouteWeighted() *Region {
	g.mu.RLock()
	defer g.mu.RUnlock()

	totalWeight := 0
	for _, region := range g.regions {
		if region.Healthy {
			totalWeight += region.Weight
		}
	}
	if totalWeight == 0 {
		return nil
	}

	r := rand.Intn(totalWeight)
	for _, region := range g.regions {
		if !region.Healthy {
			continue
		}
		r -= region.Weight
		if r < 0 {
			return region
		}
	}
	return nil
}

// HealthChecker monitors region health.
type HealthChecker struct {
	regions  []*Region
	interval time.Duration
	client   *http.Client
	mu       *sync.RWMutex // shared with the router; guards Region fields
}

func NewHealthChecker(regions []*Region, mu *sync.RWMutex) *HealthChecker {
	return &HealthChecker{
		regions:  regions,
		interval: 10 * time.Second,
		client:   &http.Client{Timeout: 5 * time.Second},
		mu:       mu,
	}
}

func (h *HealthChecker) Start() {
	ticker := time.NewTicker(h.interval)
	defer ticker.Stop()
	for range ticker.C {
		h.checkAll()
	}
}

func (h *HealthChecker) checkAll() {
	var wg sync.WaitGroup
	for _, region := range h.regions {
		wg.Add(1)
		go func(r *Region) {
			defer wg.Done()
			h.checkRegion(r)
		}(region)
	}
	wg.Wait()
}

func (h *HealthChecker) checkRegion(region *Region) {
	start := time.Now()
	resp, err := h.client.Get(region.Endpoint + "/health")

	h.mu.Lock()
	defer h.mu.Unlock()
	if err != nil {
		region.Healthy = false
		return
	}
	resp.Body.Close()
	region.Latency = time.Since(start)
	region.Healthy = resp.StatusCode == http.StatusOK
}
```

Handling Data Consistency

Challenge: a user writes in Region A, then reads in Region B.

Timeline:
   t=0:   user writes to Region A (US)
   t=1:   user gets redirected to Region B (EU)
   t=2:   user reads from Region B
   t=100: replication catches up to Region B

   Problem: the user doesn't see their own write!

Solutions:

1. STICKY SESSIONS
   Route the user to the same region for the duration of the session.
   • Pros: simple; solves read-after-write
   • Cons: uneven load; failover breaks the session

2. READ-FROM-WRITE-REGION
   After a write, read from the same region until replication catches up.
   Cookie: last_write_region=us-east-1; ts=123456
   If ts is less than 5s old, route reads to us-east-1.

3. VERSION TRACKING
   A write returns a version number; reads carry a minimum version requirement.
   Write: POST /user → {version: 42}
   Read:  GET /user?min_version=42
   If the local region is behind v42, route the read to the primary.

4. GLOBAL PRIMARY FOR WRITES
   All writes go to a single primary region; reads can go to any region.
   Trade-off: higher write latency, simpler consistency.

Session Affinity with Version Tracking

```go
package multiregion

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// RegionalService handles requests with regional awareness.
type RegionalService struct {
	regionID   string
	isPrimary  bool
	db         RegionalDB
	replicaLag time.Duration
}

// RegionalDB represents a regionally replicated database.
type RegionalDB interface {
	Write(ctx context.Context, key string, value interface{}) (version int64, err error)
	Read(ctx context.Context, key string) (interface{}, int64, error)
	ReadWithMinVersion(ctx context.Context, key string, minVersion int64) (interface{}, int64, error)
	GetCurrentVersion(ctx context.Context, key string) (int64, error)
}

// SessionToken tracks the write region and version.
type SessionToken struct {
	UserID       string `json:"user_id"`
	WriteRegion  string `json:"write_region"`
	WriteVersion int64  `json:"write_version"`
	WriteTime    int64  `json:"write_time"`
}

func (s *RegionalService) WriteHandler(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()

	var data map[string]interface{}
	if err := json.NewDecoder(r.Body).Decode(&data); err != nil {
		http.Error(w, "invalid body", http.StatusBadRequest)
		return
	}

	userID := r.Header.Get("X-User-ID")
	key := fmt.Sprintf("user:%s", userID)

	// Write to the database.
	version, err := s.db.Write(ctx, key, data)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// Create a session token recording where and when the write happened.
	token := SessionToken{
		UserID:       userID,
		WriteRegion:  s.regionID,
		WriteVersion: version,
		WriteTime:    time.Now().UnixMilli(),
	}

	// Set a cookie for session tracking.
	// Note: raw JSON in a cookie value should be base64-encoded in practice.
	tokenJSON, _ := json.Marshal(token)
	http.SetCookie(w, &http.Cookie{
		Name:     "session_token",
		Value:    string(tokenJSON),
		Path:     "/",
		MaxAge:   3600,
		HttpOnly: true,
		Secure:   true,
	})

	w.Header().Set("X-Write-Version", strconv.FormatInt(version, 10))
	w.Header().Set("X-Write-Region", s.regionID)
	json.NewEncoder(w).Encode(map[string]interface{}{
		"success": true,
		"version": version,
	})
}

func (s *RegionalService) ReadHandler(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	userID := r.Header.Get("X-User-ID")
	key := fmt.Sprintf("user:%s", userID)

	// Check for a session token.
	var token SessionToken
	if cookie, err := r.Cookie("session_token"); err == nil {
		json.Unmarshal([]byte(cookie.Value), &token)
	}

	var data interface{}
	var version int64
	var err error

	// Do we need read-after-write consistency for this user?
	if token.UserID == userID && s.needsConsistency(token) {
		// Check whether the local replica has caught up.
		localVersion, _ := s.db.GetCurrentVersion(ctx, key)
		if localVersion >= token.WriteVersion {
			// Local replica is caught up.
			data, version, err = s.db.Read(ctx, key)
		} else {
			// Wait for replication or route to the write region.
			data, version, err = s.readWithConsistency(ctx, key, token)
		}
	} else {
		// Normal read from the local replica.
		data, version, err = s.db.Read(ctx, key)
	}
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	w.Header().Set("X-Read-Version", strconv.FormatInt(version, 10))
	w.Header().Set("X-Read-Region", s.regionID)
	json.NewEncoder(w).Encode(data)
}

func (s *RegionalService) needsConsistency(token SessionToken) bool {
	// Only enforce consistency for recent writes.
	writeAge := time.Since(time.UnixMilli(token.WriteTime))
	return writeAge < 30*time.Second
}

func (s *RegionalService) readWithConsistency(
	ctx context.Context,
	key string,
	token SessionToken,
) (interface{}, int64, error) {
	// Option 1: wait for replication to catch up (with a timeout).
	deadline := time.Now().Add(5 * time.Second)
	for time.Now().Before(deadline) {
		version, _ := s.db.GetCurrentVersion(ctx, key)
		if version >= token.WriteVersion {
			return s.db.Read(ctx, key)
		}
		time.Sleep(100 * time.Millisecond)
	}
	// Option 2: read with a minimum version (may route to the primary).
	return s.db.ReadWithMinVersion(ctx, key, token.WriteVersion)
}

// ConsistencyRoutingMiddleware routes recent writers back to their write region.
func ConsistencyRoutingMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		cookie, err := r.Cookie("session_token")
		if err != nil {
			next.ServeHTTP(w, r)
			return
		}

		var token SessionToken
		json.Unmarshal([]byte(cookie.Value), &token)

		// If the write is recent, hint the load balancer to route
		// this request to the write region.
		writeAge := time.Since(time.UnixMilli(token.WriteTime))
		if writeAge < 5*time.Second {
			r.Header.Set("X-Preferred-Region", token.WriteRegion)
		}
		next.ServeHTTP(w, r)
	})
}
```

Conflict Resolution in Multi-Region

Problem: two users write to the same data in different regions.

   Region A (US): write X=1 at t=100
   Region B (EU): write X=2 at t=100
   → conflict

Solutions:

1. LAST-WRITE-WINS (LWW)
   Use a timestamp to determine the winner. Simple, but can lose data.
   Tip: use Hybrid Logical Clocks for ordering.

2. REGION PRIORITY
   Designate a primary region for each data partition: US users → US region
   writes, EU users → EU region writes. Avoids conflicts by partitioning.

3. OPERATIONAL TRANSFORMS
   For collaborative editing (Google Docs style): transform concurrent
   operations so they commute. Complex, but preserves all edits.

4. CRDTs
   Design data structures that always merge: counters, sets, registers.
   See Module 29 for details.

5. APPLICATION-LEVEL RESOLUTION
   Store both versions and let the application or user decide.
   Example: a shopping cart merges the items from both carts.
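The last-write-wins option above fits in a few lines. In this sketch (field names are illustrative) the region ID breaks timestamp ties, which is exactly what makes the t=100 conflict in the example resolve identically in every region:

```go
package main

import "fmt"

// LWWRegister is a last-write-wins register: the entry with the highest
// timestamp wins, with the region ID as a deterministic tie-breaker so
// that every region converges on the same value.
type LWWRegister struct {
	Value     string
	Timestamp int64  // ideally a hybrid logical clock, not raw wall time
	Region    string // tie-breaker for concurrent writes
}

// Merge returns the winning register. It is commutative, associative,
// and idempotent, so replicas can merge in any order and still converge.
func Merge(a, b LWWRegister) LWWRegister {
	if a.Timestamp != b.Timestamp {
		if a.Timestamp > b.Timestamp {
			return a
		}
		return b
	}
	// Same timestamp: deterministic tie-break by region ID.
	if a.Region > b.Region {
		return a
	}
	return b
}

func main() {
	us := LWWRegister{Value: "X=1", Timestamp: 100, Region: "us-east-1"}
	eu := LWWRegister{Value: "X=2", Timestamp: 100, Region: "eu-west-1"}
	// Both merge orders produce the same winner.
	fmt.Println(Merge(us, eu).Value, Merge(eu, us).Value) // prints "X=1 X=1"
}
```

Note what the tie-break buys you: without it, Region A could keep X=1 while Region B keeps X=2 forever. The losing write is still silently discarded, which is the fundamental LWW trade-off.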

Failover Strategies

1. DNS FAILOVER
   Health check fails → update DNS records.
   • Pros: simple; works for any protocol
   • Cons: DNS TTL delays (30s-5min typical); client-side DNS caching

2. LOAD BALANCER FAILOVER
   A global LB detects the unhealthy region and routes traffic elsewhere.
   • Pros: fast failover (seconds)
   • Cons: the LB is a single point of failure (use anycast / multiple LBs)

3. CLIENT-SIDE FAILOVER
   The client retries against a different region on failure.
   • Pros: fastest; no central dependency
   • Cons: client complexity; mobile app update delays

Failover timeline:
   t=0:    Region A fails
   t=10s:  health check detects the failure
   t=15s:  failover initiated
   t=20s:  DNS updated / LB reroutes
   t=30s:  clients start using Region B
   t=5min: most clients are on Region B (DNS TTL)

   Total RTO: 30s-5min depending on strategy

Automatic Failover Implementation

```go
package failover

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"sync"
	"time"
)

// Region and HealthChecker mirror the routing package. IP is the address
// published via DNS; OnUnhealthy is invoked when a region's health check fails.
type Region struct {
	Name    string
	IP      string
	Healthy bool
	Latency time.Duration
}

type HealthChecker struct {
	regions     []*Region
	OnUnhealthy func(*Region)
}

func NewHealthChecker(regions []*Region) *HealthChecker {
	return &HealthChecker{regions: regions}
}

func (h *HealthChecker) Start() {
	// Poll each region's health endpoint and invoke OnUnhealthy on
	// failures (see the routing package for a full implementation).
}

// FailoverController manages regional failover.
type FailoverController struct {
	regions        []*Region
	activeRegion   *Region
	standbyRegion  *Region
	healthChecker  *HealthChecker
	dnsUpdater     DNSUpdater
	alerter        Alerter
	failoverMu     sync.Mutex
	lastFailover   time.Time
	cooldownPeriod time.Duration
}

type DNSUpdater interface {
	UpdateRecord(ctx context.Context, record string, ip string) error
}

type Alerter interface {
	SendAlert(ctx context.Context, message string) error
}

func NewFailoverController(
	regions []*Region,
	dns DNSUpdater,
	alerter Alerter,
) *FailoverController {
	fc := &FailoverController{
		regions:        regions,
		dnsUpdater:     dns,
		alerter:        alerter,
		cooldownPeriod: 5 * time.Minute,
	}
	// Set the initial active/standby pair.
	fc.activeRegion = regions[0]
	fc.standbyRegion = regions[1]

	fc.healthChecker = NewHealthChecker(regions)
	fc.healthChecker.OnUnhealthy = fc.handleUnhealthy
	return fc
}

func (fc *FailoverController) Start() {
	go fc.healthChecker.Start()
}

func (fc *FailoverController) handleUnhealthy(region *Region) {
	fc.failoverMu.Lock()
	defer fc.failoverMu.Unlock()

	// Only fail over if the *active* region is unhealthy.
	if region != fc.activeRegion {
		return
	}
	// Respect the cooldown to avoid flapping.
	if time.Since(fc.lastFailover) < fc.cooldownPeriod {
		log.Printf("Failover skipped: in cooldown period")
		return
	}
	// Refuse to fail over to an unhealthy standby.
	if !fc.standbyRegion.Healthy {
		fc.alerter.SendAlert(context.Background(),
			"CRITICAL: Both regions unhealthy, cannot failover")
		return
	}
	fc.performFailover()
}

func (fc *FailoverController) performFailover() {
	ctx := context.Background()
	oldActive := fc.activeRegion
	newActive := fc.standbyRegion

	log.Printf("Initiating failover: %s → %s", oldActive.Name, newActive.Name)

	// Alert on-call.
	fc.alerter.SendAlert(ctx, fmt.Sprintf(
		"FAILOVER: Switching from %s to %s", oldActive.Name, newActive.Name))

	// Update DNS.
	if err := fc.dnsUpdater.UpdateRecord(ctx, "api.example.com", newActive.IP); err != nil {
		log.Printf("DNS update failed: %v", err)
		fc.alerter.SendAlert(ctx, "CRITICAL: DNS failover update failed")
		return
	}

	// Swap regions.
	fc.activeRegion = newActive
	fc.standbyRegion = oldActive
	fc.lastFailover = time.Now()

	log.Printf("Failover complete: %s is now active", newActive.Name)
	fc.alerter.SendAlert(ctx, fmt.Sprintf(
		"Failover complete: %s is now active", newActive.Name))
}

// FailoverHandler is a manual failover endpoint.
func (fc *FailoverController) FailoverHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "Method not allowed", http.StatusMethodNotAllowed)
		return
	}
	// Require authorization (placeholder token; use real auth in production).
	if r.Header.Get("X-Failover-Token") != "secret" {
		http.Error(w, "Unauthorized", http.StatusUnauthorized)
		return
	}

	fc.failoverMu.Lock()
	defer fc.failoverMu.Unlock()
	fc.performFailover()
	w.Write([]byte("Failover initiated"))
}

// StatusHandler reports the current failover state.
func (fc *FailoverController) StatusHandler(w http.ResponseWriter, r *http.Request) {
	regions := make([]map[string]interface{}, 0, len(fc.regions))
	for _, region := range fc.regions {
		regions = append(regions, map[string]interface{}{
			"name":    region.Name,
			"healthy": region.Healthy,
			"latency": region.Latency.String(),
		})
	}
	json.NewEncoder(w).Encode(map[string]interface{}{
		"active_region":  fc.activeRegion.Name,
		"standby_region": fc.standbyRegion.Name,
		"last_failover":  fc.lastFailover,
		"regions":        regions,
	})
}
```

Best Practices

1. START SIMPLE
   • Start with active-passive before active-active
   • Add regions incrementally
   • Master single-region operations first

2. DESIGN FOR EVENTUAL CONSISTENCY
   • Assume replication lag
   • Design for read-after-write scenarios
   • Use CRDTs where possible

3. TEST FAILOVER REGULARLY
   • Monthly failover drills
   • Chaos engineering (kill a region)
   • Measure RTO and RPO

4. MONITOR CROSS-REGION METRICS
   • Replication lag
   • Cross-region latency
   • Health check status
   • Error rates by region

5. DATA RESIDENCY COMPLIANCE
   • Know where data lives
   • Implement data isolation where required
   • Audit data flows

6. COST OPTIMIZATION
   • Data transfer between regions is expensive
   • Right-size secondary regions
   • Consider a cold standby for DR only

Interview Questions

  1. What's the difference between active-active and active-passive?
    • Active-passive: One region serves traffic, other is standby
    • Active-active: Both regions serve traffic, bidirectional sync
  2. How do you handle read-after-write consistency across regions?
    • Session affinity to write region
    • Version tracking with routing
    • Synchronous replication (high latency)
  3. What factors determine how many regions you need?
    • Latency requirements
    • Availability targets
    • Data residency requirements
    • Cost constraints
  4. How do you test multi-region failover?
    • Chaos engineering
    • Regular failover drills
    • Load testing during failover
    • Measure RTO/RPO
  5. Design a multi-region architecture for a global e-commerce site
    • GeoDNS for routing
    • Active-active for product catalog (reads)
    • Primary region for orders (writes)
    • Cross-region replication with conflict resolution

Summary

Patterns:
   • Active-passive: simple DR, higher RTO
   • Active-active: lower latency, complex sync

Replication:
   • Sync: strong consistency, high latency
   • Async: low latency, eventual consistency

Routing:
   • GeoDNS for region selection
   • Health checks for failover
   • Session affinity for consistency

Consistency:
   • Accept eventual consistency
   • Use CRDTs or LWW for conflicts
   • Route reads to the write region when needed

Key insight: "Going multi-region is a journey. Start with disaster recovery, then evolve to active-active as you learn."

Tags: multi-region, geo-replication, global-systems, failover, high-availability