Module 17: Multi-Region Architecture

Why Multi-Region?

As systems grow globally, running in a single region creates latency, availability, and compliance challenges.
1. LATENCY
   • Single region (US-East): US users 20ms, EU users 100ms, APAC users 200ms
   • Multi-region (US + EU + APAC): all users 20-40ms

2. AVAILABILITY
   • Single region: one region down = entire system down
   • Multi-region: one region down = failover to the others
   • Single region: 99.9% (8.7 hours downtime/year)
   • Multi-region: 99.99% (52 minutes downtime/year)

3. DATA RESIDENCY / COMPLIANCE
   • GDPR: EU user data must stay in the EU
   • CCPA: California data privacy requirements
   • Data sovereignty laws in many countries

4. DISASTER RECOVERY
   • Natural disasters, power outages, and network issues can take out an entire region
   • Multi-region provides geographic redundancy
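The availability numbers above can be sanity-checked: with independent regions, the combined unavailability is the product of each region's unavailability. A minimal sketch of that arithmetic (the function name is illustrative; the model assumes perfect, instantaneous failover, which is why real-world figures like 99.99% sit well below the theoretical bound):

```go
package main

import "fmt"

// combinedAvailability returns the availability of n independent regions,
// assuming perfect, instantaneous failover (an idealized upper bound).
func combinedAvailability(perRegion float64, n int) float64 {
	unavailable := 1.0
	for i := 0; i < n; i++ {
		unavailable *= 1 - perRegion
	}
	return 1 - unavailable
}

func main() {
	// Two regions at 99.9% each: the theoretical bound is 99.9999%;
	// imperfect detection and failover pull practice closer to 99.99%.
	fmt.Printf("%.6f\n", combinedAvailability(0.999, 2))
}
```

In practice failover is never instantaneous and regional failures are not fully independent (shared control planes, correlated deploys), so treat this as an upper bound, not a target.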

Multi-Region Patterns

1. ACTIVE-PASSIVE (Disaster Recovery)

   Active Region (US-East) ──sync──► Passive Region (US-West)
   All traffic hits the active region; the passive region is standby, used only on failover.

   • Primary handles all traffic
   • Secondary is a warm standby
   • Failover: manual or automated
   • Lower cost, higher RTO

2. ACTIVE-ACTIVE (Load Distribution)

   Active Region (US-East) ◄──sync──► Active Region (EU-West)
   US traffic goes to US-East; EU traffic goes to EU-West.

   • Both regions serve traffic
   • Bi-directional data sync
   • Near-instant failover
   • Higher cost, lower RTO

3. ACTIVE-ACTIVE-ACTIVE (Global)

   US-East ◄──► EU-West ◄──► APAC ◄──► US-East (full mesh)

   • All regions active
   • Mesh replication
   • Complex conflict resolution
   • Best latency globally

Data Replication Strategies

1. SYNCHRONOUS REPLICATION

   Write ──► Region A ──sync──► Region B
   Region A waits for Region B's ACK before returning OK.

   • Pros: strong consistency, no data loss
   • Cons: high latency; a region failure blocks writes
   • Use for: financial transactions, critical data

2. ASYNCHRONOUS REPLICATION

   Write ──► Region A ── returns OK immediately
   Region A replicates to Region B in the background, later.

   • Pros: low latency; a region failure doesn't block writes
   • Cons: eventual consistency, potential data loss
   • Use for: analytics, logs, non-critical data

3. SEMI-SYNCHRONOUS REPLICATION

   Write ──► Primary ──sync──► at least 1 replica
   Remaining replicas replicate asynchronously.

   • A balanced approach: durability + performance

Replication lag:
   • Same region: 1-5ms
   • Cross-region: 50-200ms
   • Cross-continent: 100-300ms

   This lag is what breaks read-after-write consistency!

Global Traffic Routing

DNS-based routing (GeoDNS):

   User (Paris) ─► DNS ─► returns EU server IP
   User (Tokyo) ─► DNS ─► returns APAC server IP

   Services: Route 53, Cloudflare, Google Cloud DNS

Routing policies:

   • Geolocation: EU users → eu-west-1; US users → us-east-1; default → us-east-1
   • Latency-based: route to the region with the lowest latency (based on health checks)
   • Weighted: us-east-1: 70%, eu-west-1: 30%
   • Failover: primary us-east-1; secondary us-west-2 (if the primary is unhealthy)

Anycast:

   • The same IP is advertised from multiple locations; the network routes to the nearest one
   • Used by Cloudflare and the major CDNs

Global Load Balancer Implementation

```go
package routing

import (
	"math"
	"math/rand"
	"net/http"
	"sync"
	"time"
)

// Region represents a deployment region.
type Region struct {
	Name      string
	Endpoint  string
	Healthy   bool
	Latency   time.Duration
	Weight    int
	Countries []string
}

// GlobalRouter routes requests to the best region.
type GlobalRouter struct {
	regions       []*Region
	healthChecker *HealthChecker
	mu            sync.RWMutex
}

func NewGlobalRouter(regions []*Region) *GlobalRouter {
	gr := &GlobalRouter{regions: regions}
	// The health checker shares the router's lock so that updates to
	// Healthy/Latency don't race with concurrent routing reads.
	gr.healthChecker = NewHealthChecker(regions, &gr.mu)
	go gr.healthChecker.Start()
	return gr
}

// RouteByGeolocation routes based on the user's country.
func (g *GlobalRouter) RouteByGeolocation(country string) *Region {
	g.mu.RLock()
	defer g.mu.RUnlock()

	// Find a healthy region that serves this country.
	for _, region := range g.regions {
		if !region.Healthy {
			continue
		}
		for _, c := range region.Countries {
			if c == country {
				return region
			}
		}
	}
	// Fall back to the first healthy region.
	for _, region := range g.regions {
		if region.Healthy {
			return region
		}
	}
	return nil
}

// RouteByLatency routes to the healthy region with the lowest latency.
// clientIP could feed a per-client geo lookup; this sketch uses the
// latency measured by health checks instead.
func (g *GlobalRouter) RouteByLatency(clientIP string) *Region {
	g.mu.RLock()
	defer g.mu.RUnlock()

	var bestRegion *Region
	minLatency := time.Duration(math.MaxInt64)
	for _, region := range g.regions {
		if !region.Healthy {
			continue
		}
		if region.Latency < minLatency {
			minLatency = region.Latency
			bestRegion = region
		}
	}
	return bestRegion
}

// RouteWithFailover routes to the primary region, falling back to the secondary.
func (g *GlobalRouter) RouteWithFailover(primary, secondary string) *Region {
	g.mu.RLock()
	defer g.mu.RUnlock()

	for _, region := range g.regions {
		if region.Name == primary && region.Healthy {
			return region
		}
	}
	for _, region := range g.regions {
		if region.Name == secondary && region.Healthy {
			return region
		}
	}
	// Last resort: return any healthy region.
	for _, region := range g.regions {
		if region.Healthy {
			return region
		}
	}
	return nil
}

// RouteWeighted picks a healthy region with probability proportional to its weight.
func (g *GlobalRouter) RouteWeighted() *Region {
	g.mu.RLock()
	defer g.mu.RUnlock()

	totalWeight := 0
	for _, region := range g.regions {
		if region.Healthy {
			totalWeight += region.Weight
		}
	}
	if totalWeight == 0 {
		return nil
	}

	r := rand.Intn(totalWeight)
	for _, region := range g.regions {
		if !region.Healthy {
			continue
		}
		r -= region.Weight
		if r < 0 {
			return region
		}
	}
	return nil
}

// HealthChecker monitors region health.
type HealthChecker struct {
	regions  []*Region
	interval time.Duration
	client   *http.Client
	mu       *sync.RWMutex // shared with the router; guards Region fields
}

func NewHealthChecker(regions []*Region, mu *sync.RWMutex) *HealthChecker {
	return &HealthChecker{
		regions:  regions,
		interval: 10 * time.Second,
		client:   &http.Client{Timeout: 5 * time.Second},
		mu:       mu,
	}
}

func (h *HealthChecker) Start() {
	ticker := time.NewTicker(h.interval)
	defer ticker.Stop()
	for range ticker.C {
		h.checkAll()
	}
}

func (h *HealthChecker) checkAll() {
	var wg sync.WaitGroup
	for _, region := range h.regions {
		wg.Add(1)
		go func(r *Region) {
			defer wg.Done()
			h.checkRegion(r)
		}(region)
	}
	wg.Wait()
}

func (h *HealthChecker) checkRegion(region *Region) {
	start := time.Now()
	resp, err := h.client.Get(region.Endpoint + "/health")

	h.mu.Lock()
	defer h.mu.Unlock()
	if err != nil {
		region.Healthy = false
		return
	}
	resp.Body.Close()
	region.Latency = time.Since(start)
	region.Healthy = resp.StatusCode == http.StatusOK
}
```

Handling Data Consistency

Challenge: a user writes in Region A, then reads in Region B.

Timeline:
   t=0:   user writes to Region A (US)
   t=1:   user gets redirected to Region B (EU)
   t=2:   user reads from Region B
   t=100: replication catches up to Region B

   Problem: the user doesn't see their own write!

Solutions:

1. STICKY SESSIONS
   Route the user to the same region for the duration of the session.
   • Pros: simple; solves read-after-write
   • Cons: uneven load; failover breaks the session

2. READ-FROM-WRITE-REGION
   After a write, read from the same region until replication catches up.
   Cookie: last_write_region=us-east-1; ts=123456
   If ts is less than 5s old, route reads to us-east-1.

3. VERSION TRACKING
   A write returns a version number; reads carry a minimum version requirement.
   Write: POST /user → {version: 42}
   Read:  GET /user?min_version=42
   If the local region is behind v42, route the read to the primary.

4. GLOBAL PRIMARY FOR WRITES
   All writes go to a single primary region; reads can go to any region.
   Trade-off: higher write latency, simpler consistency.

Session Affinity with Version Tracking

```go
package multiregion

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// RegionalService handles requests with regional awareness.
type RegionalService struct {
	regionID   string
	isPrimary  bool
	db         RegionalDB
	replicaLag time.Duration
}

// RegionalDB represents a regionally replicated database.
type RegionalDB interface {
	Write(ctx context.Context, key string, value interface{}) (version int64, err error)
	Read(ctx context.Context, key string) (interface{}, int64, error)
	ReadWithMinVersion(ctx context.Context, key string, minVersion int64) (interface{}, int64, error)
	GetCurrentVersion(ctx context.Context, key string) (int64, error)
}

// SessionToken tracks the write region and version.
type SessionToken struct {
	UserID       string `json:"user_id"`
	WriteRegion  string `json:"write_region"`
	WriteVersion int64  `json:"write_version"`
	WriteTime    int64  `json:"write_time"`
}

func (s *RegionalService) WriteHandler(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()

	var data map[string]interface{}
	if err := json.NewDecoder(r.Body).Decode(&data); err != nil {
		http.Error(w, "invalid body", http.StatusBadRequest)
		return
	}

	userID := r.Header.Get("X-User-ID")
	key := fmt.Sprintf("user:%s", userID)

	// Write to the database.
	version, err := s.db.Write(ctx, key, data)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// Create a session token recording where and when the write happened.
	token := SessionToken{
		UserID:       userID,
		WriteRegion:  s.regionID,
		WriteVersion: version,
		WriteTime:    time.Now().UnixMilli(),
	}

	// Set a cookie for session tracking.
	// Note: raw JSON in a cookie value should be base64-encoded in practice.
	tokenJSON, _ := json.Marshal(token)
	http.SetCookie(w, &http.Cookie{
		Name:     "session_token",
		Value:    string(tokenJSON),
		Path:     "/",
		MaxAge:   3600,
		HttpOnly: true,
		Secure:   true,
	})

	w.Header().Set("X-Write-Version", strconv.FormatInt(version, 10))
	w.Header().Set("X-Write-Region", s.regionID)
	json.NewEncoder(w).Encode(map[string]interface{}{
		"success": true,
		"version": version,
	})
}

func (s *RegionalService) ReadHandler(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	userID := r.Header.Get("X-User-ID")
	key := fmt.Sprintf("user:%s", userID)

	// Check for a session token.
	var token SessionToken
	if cookie, err := r.Cookie("session_token"); err == nil {
		json.Unmarshal([]byte(cookie.Value), &token)
	}

	var data interface{}
	var version int64
	var err error

	// Do we need read-after-write consistency for this user?
	if token.UserID == userID && s.needsConsistency(token) {
		// Check whether the local replica has caught up.
		localVersion, _ := s.db.GetCurrentVersion(ctx, key)
		if localVersion >= token.WriteVersion {
			// Local replica is caught up.
			data, version, err = s.db.Read(ctx, key)
		} else {
			// Wait for replication or route to the write region.
			data, version, err = s.readWithConsistency(ctx, key, token)
		}
	} else {
		// Normal read from the local replica.
		data, version, err = s.db.Read(ctx, key)
	}
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	w.Header().Set("X-Read-Version", strconv.FormatInt(version, 10))
	w.Header().Set("X-Read-Region", s.regionID)
	json.NewEncoder(w).Encode(data)
}

func (s *RegionalService) needsConsistency(token SessionToken) bool {
	// Only enforce consistency for recent writes.
	writeAge := time.Since(time.UnixMilli(token.WriteTime))
	return writeAge < 30*time.Second
}

func (s *RegionalService) readWithConsistency(
	ctx context.Context,
	key string,
	token SessionToken,
) (interface{}, int64, error) {
	// Option 1: wait for replication to catch up (with a timeout).
	deadline := time.Now().Add(5 * time.Second)
	for time.Now().Before(deadline) {
		version, _ := s.db.GetCurrentVersion(ctx, key)
		if version >= token.WriteVersion {
			return s.db.Read(ctx, key)
		}
		time.Sleep(100 * time.Millisecond)
	}
	// Option 2: read with a minimum version (may route to the primary).
	return s.db.ReadWithMinVersion(ctx, key, token.WriteVersion)
}

// ConsistencyRoutingMiddleware routes recent writers back to their write region.
func ConsistencyRoutingMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		cookie, err := r.Cookie("session_token")
		if err != nil {
			next.ServeHTTP(w, r)
			return
		}

		var token SessionToken
		json.Unmarshal([]byte(cookie.Value), &token)

		// If the write is recent, hint the load balancer to route
		// this request to the write region.
		writeAge := time.Since(time.UnixMilli(token.WriteTime))
		if writeAge < 5*time.Second {
			r.Header.Set("X-Preferred-Region", token.WriteRegion)
		}
		next.ServeHTTP(w, r)
	})
}
```

Conflict Resolution in Multi-Region

Problem: two users write to the same data in different regions.

   Region A (US): write X=1 at t=100
   Region B (EU): write X=2 at t=100
   → conflict

Solutions:

1. LAST-WRITE-WINS (LWW)
   Use a timestamp to determine the winner. Simple, but can lose data.
   Tip: use Hybrid Logical Clocks for ordering.

2. REGION PRIORITY
   Designate a primary region for each data partition: US users → US region
   writes, EU users → EU region writes. Avoids conflicts by partitioning.

3. OPERATIONAL TRANSFORMS
   For collaborative editing (Google Docs style): transform concurrent
   operations so they commute. Complex, but preserves all edits.

4. CRDTs
   Design data structures that always merge: counters, sets, registers.
   See Module 29 for details.

5. APPLICATION-LEVEL RESOLUTION
   Store both versions and let the application or user decide.
   Example: a shopping cart merges the items from both carts.
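The last-write-wins option above fits in a few lines. In this sketch (field names are illustrative) the region ID breaks timestamp ties, which is exactly what makes the t=100 conflict in the example resolve identically in every region:

```go
package main

import "fmt"

// LWWRegister is a last-write-wins register: the entry with the highest
// timestamp wins, with the region ID as a deterministic tie-breaker so
// that every region converges on the same value.
type LWWRegister struct {
	Value     string
	Timestamp int64  // ideally a hybrid logical clock, not raw wall time
	Region    string // tie-breaker for concurrent writes
}

// Merge returns the winning register. It is commutative, associative,
// and idempotent, so replicas can merge in any order and still converge.
func Merge(a, b LWWRegister) LWWRegister {
	if a.Timestamp != b.Timestamp {
		if a.Timestamp > b.Timestamp {
			return a
		}
		return b
	}
	// Same timestamp: deterministic tie-break by region ID.
	if a.Region > b.Region {
		return a
	}
	return b
}

func main() {
	us := LWWRegister{Value: "X=1", Timestamp: 100, Region: "us-east-1"}
	eu := LWWRegister{Value: "X=2", Timestamp: 100, Region: "eu-west-1"}
	// Both merge orders produce the same winner.
	fmt.Println(Merge(us, eu).Value, Merge(eu, us).Value) // prints "X=1 X=1"
}
```

Note what the tie-break buys you: without it, Region A could keep X=1 while Region B keeps X=2 forever. The losing write is still silently discarded, which is the fundamental LWW trade-off.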

Failover Strategies

1. DNS FAILOVER
   Health check fails → update DNS records.
   • Pros: simple; works for any protocol
   • Cons: DNS TTL delays (30s-5min typical); client-side DNS caching

2. LOAD BALANCER FAILOVER
   A global LB detects the unhealthy region and routes traffic elsewhere.
   • Pros: fast failover (seconds)
   • Cons: the LB is a single point of failure (use anycast / multiple LBs)

3. CLIENT-SIDE FAILOVER
   The client retries against a different region on failure.
   • Pros: fastest; no central dependency
   • Cons: client complexity; mobile app update delays

Failover timeline:
   t=0:    Region A fails
   t=10s:  health check detects the failure
   t=15s:  failover initiated
   t=20s:  DNS updated / LB reroutes
   t=30s:  clients start using Region B
   t=5min: most clients are on Region B (DNS TTL)

   Total RTO: 30s-5min depending on strategy

Automatic Failover Implementation

```go
package failover

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"sync"
	"time"
)

// Region and HealthChecker mirror the routing package. IP is the address
// published via DNS; OnUnhealthy is invoked when a region's health check fails.
type Region struct {
	Name    string
	IP      string
	Healthy bool
	Latency time.Duration
}

type HealthChecker struct {
	regions     []*Region
	OnUnhealthy func(*Region)
}

func NewHealthChecker(regions []*Region) *HealthChecker {
	return &HealthChecker{regions: regions}
}

func (h *HealthChecker) Start() {
	// Poll each region's health endpoint and invoke OnUnhealthy on
	// failures (see the routing package for a full implementation).
}

// FailoverController manages regional failover.
type FailoverController struct {
	regions        []*Region
	activeRegion   *Region
	standbyRegion  *Region
	healthChecker  *HealthChecker
	dnsUpdater     DNSUpdater
	alerter        Alerter
	failoverMu     sync.Mutex
	lastFailover   time.Time
	cooldownPeriod time.Duration
}

type DNSUpdater interface {
	UpdateRecord(ctx context.Context, record string, ip string) error
}

type Alerter interface {
	SendAlert(ctx context.Context, message string) error
}

func NewFailoverController(
	regions []*Region,
	dns DNSUpdater,
	alerter Alerter,
) *FailoverController {
	fc := &FailoverController{
		regions:        regions,
		dnsUpdater:     dns,
		alerter:        alerter,
		cooldownPeriod: 5 * time.Minute,
	}
	// Set the initial active/standby pair.
	fc.activeRegion = regions[0]
	fc.standbyRegion = regions[1]

	fc.healthChecker = NewHealthChecker(regions)
	fc.healthChecker.OnUnhealthy = fc.handleUnhealthy
	return fc
}

func (fc *FailoverController) Start() {
	go fc.healthChecker.Start()
}

func (fc *FailoverController) handleUnhealthy(region *Region) {
	fc.failoverMu.Lock()
	defer fc.failoverMu.Unlock()

	// Only fail over if the *active* region is unhealthy.
	if region != fc.activeRegion {
		return
	}
	// Respect the cooldown to avoid flapping.
	if time.Since(fc.lastFailover) < fc.cooldownPeriod {
		log.Printf("Failover skipped: in cooldown period")
		return
	}
	// Refuse to fail over to an unhealthy standby.
	if !fc.standbyRegion.Healthy {
		fc.alerter.SendAlert(context.Background(),
			"CRITICAL: Both regions unhealthy, cannot failover")
		return
	}
	fc.performFailover()
}

func (fc *FailoverController) performFailover() {
	ctx := context.Background()
	oldActive := fc.activeRegion
	newActive := fc.standbyRegion

	log.Printf("Initiating failover: %s → %s", oldActive.Name, newActive.Name)

	// Alert on-call.
	fc.alerter.SendAlert(ctx, fmt.Sprintf(
		"FAILOVER: Switching from %s to %s", oldActive.Name, newActive.Name))

	// Update DNS.
	if err := fc.dnsUpdater.UpdateRecord(ctx, "api.example.com", newActive.IP); err != nil {
		log.Printf("DNS update failed: %v", err)
		fc.alerter.SendAlert(ctx, "CRITICAL: DNS failover update failed")
		return
	}

	// Swap regions.
	fc.activeRegion = newActive
	fc.standbyRegion = oldActive
	fc.lastFailover = time.Now()

	log.Printf("Failover complete: %s is now active", newActive.Name)
	fc.alerter.SendAlert(ctx, fmt.Sprintf(
		"Failover complete: %s is now active", newActive.Name))
}

// FailoverHandler is a manual failover endpoint.
func (fc *FailoverController) FailoverHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "Method not allowed", http.StatusMethodNotAllowed)
		return
	}
	// Require authorization (placeholder token; use real auth in production).
	if r.Header.Get("X-Failover-Token") != "secret" {
		http.Error(w, "Unauthorized", http.StatusUnauthorized)
		return
	}

	fc.failoverMu.Lock()
	defer fc.failoverMu.Unlock()
	fc.performFailover()
	w.Write([]byte("Failover initiated"))
}

// StatusHandler reports the current failover state.
func (fc *FailoverController) StatusHandler(w http.ResponseWriter, r *http.Request) {
	regions := make([]map[string]interface{}, 0, len(fc.regions))
	for _, region := range fc.regions {
		regions = append(regions, map[string]interface{}{
			"name":    region.Name,
			"healthy": region.Healthy,
			"latency": region.Latency.String(),
		})
	}
	json.NewEncoder(w).Encode(map[string]interface{}{
		"active_region":  fc.activeRegion.Name,
		"standby_region": fc.standbyRegion.Name,
		"last_failover":  fc.lastFailover,
		"regions":        regions,
	})
}
```

Best Practices

1. START SIMPLE
   • Start with active-passive before active-active
   • Add regions incrementally
   • Master single-region operations first

2. DESIGN FOR EVENTUAL CONSISTENCY
   • Assume replication lag
   • Design for read-after-write scenarios
   • Use CRDTs where possible

3. TEST FAILOVER REGULARLY
   • Monthly failover drills
   • Chaos engineering (kill a region)
   • Measure RTO and RPO

4. MONITOR CROSS-REGION METRICS
   • Replication lag
   • Cross-region latency
   • Health check status
   • Error rates by region

5. DATA RESIDENCY COMPLIANCE
   • Know where data lives
   • Implement data isolation where required
   • Audit data flows

6. COST OPTIMIZATION
   • Data transfer between regions is expensive
   • Right-size secondary regions
   • Consider a cold standby for DR only

Interview Questions

  1. What's the difference between active-active and active-passive?
    • Active-passive: One region serves traffic, other is standby
    • Active-active: Both regions serve traffic, bidirectional sync
  2. How do you handle read-after-write consistency across regions?
    • Session affinity to write region
    • Version tracking with routing
    • Synchronous replication (high latency)
  3. What factors determine how many regions you need?
    • Latency requirements
    • Availability targets
    • Data residency requirements
    • Cost constraints
  4. How do you test multi-region failover?
    • Chaos engineering
    • Regular failover drills
    • Load testing during failover
    • Measure RTO/RPO
  5. Design a multi-region architecture for a global e-commerce site
    • GeoDNS for routing
    • Active-active for product catalog (reads)
    • Primary region for orders (writes)
    • Cross-region replication with conflict resolution

Summary

Patterns:
   • Active-passive: simple DR, higher RTO
   • Active-active: lower latency, complex sync

Replication:
   • Sync: strong consistency, high latency
   • Async: low latency, eventual consistency

Routing:
   • GeoDNS for region selection
   • Health checks for failover
   • Session affinity for consistency

Consistency:
   • Accept eventual consistency
   • Use CRDTs or LWW for conflicts
   • Route reads to the write region when needed

Key insight: "Going multi-region is a journey. Start with disaster recovery, then evolve to active-active as you learn."

Tags: multi-region, geo-replication, global-systems, failover, high-availability