Module 7: Replication Strategies

Why Replication?

Replication means keeping copies of data on multiple machines. It's fundamental to distributed systems for three reasons:

┌─────────────────────────────────────────────────────────────────┐
│                  WHY REPLICATE DATA?                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. FAULT TOLERANCE                                             │
│     ┌───────────────────────────────────────────────────────┐  │
│     │ If one node fails, others have the data               │  │
│     │                                                        │  │
│     │  [Node A] ──✗──  Data still available on B and C     │  │
│     │  [Node B] ✓                                           │  │
│     │  [Node C] ✓                                           │  │
│     └───────────────────────────────────────────────────────┘  │
│                                                                 │
│  2. SCALABILITY (Read throughput)                               │
│     ┌───────────────────────────────────────────────────────┐  │
│     │ Multiple replicas can serve read requests             │  │
│     │                                                        │  │
│     │  Client → Replica A                                   │  │
│     │  Client → Replica B   3x read throughput              │  │
│     │  Client → Replica C                                   │  │
│     └───────────────────────────────────────────────────────┘  │
│                                                                 │
│  3. LATENCY (Geographic distribution)                           │
│     ┌───────────────────────────────────────────────────────┐  │
│     │ Place data close to users                             │  │
│     │                                                        │  │
│     │  User in Tokyo → Tokyo Replica (5ms)                  │  │
│     │  User in NYC   → NYC Replica (5ms)                    │  │
│     │                                                        │  │
│     │  vs Single datacenter: 150ms cross-continent          │  │
│     └───────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

The Fundamental Challenge

┌─────────────────────────────────────────────────────────────────┐
│              THE REPLICATION CHALLENGE                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  How do you keep replicas in sync?                              │
│                                                                 │
│  Time 0: All replicas have x=1                                  │
│  ┌─────┐ ┌─────┐ ┌─────┐                                       │
│  │ x=1 │ │ x=1 │ │ x=1 │                                       │
│  └─────┘ └─────┘ └─────┘                                       │
│  Node A   Node B   Node C                                       │
│                                                                 │
│  Time 1: Client writes x=2 to Node A                            │
│  ┌─────┐ ┌─────┐ ┌─────┐                                       │
│  │ x=2 │ │ x=1 │ │ x=1 │  ← Inconsistent!                      │
│  └──┬──┘ └─────┘ └─────┘                                       │
│     │                                                           │
│     └── How/when does x=2 reach B and C?                       │
│                                                                 │
│  Options:                                                       │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Synchronous: Wait for all replicas before confirming    │   │
│  │              + Strong consistency                        │   │
│  │              - Slow (wait for slowest replica)          │   │
│  │              - Unavailable if any replica down          │   │
│  │                                                          │   │
│  │ Asynchronous: Confirm immediately, replicate later      │   │
│  │               + Fast                                     │   │
│  │               + Available even if replicas down         │   │
│  │               - May lose data if primary fails          │   │
│  │               - Stale reads from replicas               │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

1. Single-Leader Replication (Primary-Replica)

The most common replication strategy. One node handles all writes; others replicate.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│              SINGLE-LEADER REPLICATION                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│                        ┌──────────────┐                         │
│                        │    Leader    │                         │
│                        │  (Primary)   │                         │
│                        │              │                         │
│     Write ──────────► │   Handles    │                         │
│                        │   all writes │                         │
│                        └──────┬───────┘                         │
│                               │                                 │
│              ┌────────────────┼────────────────┐                │
│              │ Replication    │    Replication │                │
│              │ Stream         │    Stream      │                │
│              ▼                ▼                ▼                │
│       ┌──────────┐     ┌──────────┐     ┌──────────┐           │
│       │ Follower │     │ Follower │     │ Follower │           │
│       │ (Replica)│     │ (Replica)│     │ (Replica)│           │
│       └────┬─────┘     └────┬─────┘     └────┬─────┘           │
│            │                │                │                  │
│            ▲                ▲                ▲                  │
│            │                │                │                  │
│         Read             Read             Read                  │
│                                                                 │
│  Write path: Client → Leader → Confirm → Async replicate       │
│  Read path:  Client → Any replica (or leader)                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Synchronous vs Asynchronous Replication

┌─────────────────────────────────────────────────────────────────┐
│          SYNC vs ASYNC REPLICATION                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  SYNCHRONOUS:                                                   │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Client          Leader          Follower 1    Follower 2│   │
│  │   │               │                │              │      │   │
│  │   │──write(x=2)──►│                │              │      │   │
│  │   │               │──replicate────►│              │      │   │
│  │   │               │──replicate────────────────────►│     │   │
│  │   │               │◄──────────ack──│              │      │   │
│  │   │               │◄──────────────────────ack─────│      │   │
│  │   │◄────success───│                │              │      │   │
│  │   │               │                │              │      │   │
│  │   Total latency = Leader + max(Follower 1, Follower 2)   │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ASYNCHRONOUS:                                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Client          Leader          Follower 1    Follower 2│   │
│  │   │               │                │              │      │   │
│  │   │──write(x=2)──►│                │              │      │   │
│  │   │◄────success───│ (immediate)    │              │      │   │
│  │   │               │──replicate────►│              │      │   │
│  │   │               │──replicate────────────────────►│     │   │
│  │   │               │                │              │      │   │
│  │   Total latency = Leader only                            │   │
│  │   Risk: Data loss if Leader fails before replication     │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  SEMI-SYNCHRONOUS (Common compromise):                          │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Wait for at least 1 follower (not all)                  │   │
│  │ Guarantees data on 2 nodes before confirming            │   │
│  │ Used by: MySQL semi-sync, PostgreSQL sync_commit        │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Replication Log

┌─────────────────────────────────────────────────────────────────┐
│              REPLICATION LOG STRATEGIES                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. STATEMENT-BASED REPLICATION                                 │
│     ┌───────────────────────────────────────────────────────┐  │
│     │ Send SQL statements to replicas                       │  │
│     │ UPDATE users SET last_login = NOW() WHERE id = 123    │  │
│     │                                                        │  │
│     │ Problems:                                              │  │
│     │ • NOW() returns different times on replicas           │  │
│     │ • RAND() returns different values                     │  │
│     │ • Triggers may behave differently                     │  │
│     │ • Auto-increment in different order                   │  │
│     └───────────────────────────────────────────────────────┘  │
│                                                                 │
│  2. WRITE-AHEAD LOG (WAL) SHIPPING                              │
│     ┌───────────────────────────────────────────────────────┐  │
│     │ Send physical log entries                             │  │
│     │ "Change byte X at offset Y in file Z"                 │  │
│     │                                                        │  │
│     │ Used by: PostgreSQL, Oracle                          │  │
│     │                                                        │  │
│     │ Pros: Exact replication                               │  │
│     │ Cons: Storage-engine specific, version coupling       │  │
│     └───────────────────────────────────────────────────────┘  │
│                                                                 │
│  3. LOGICAL (ROW-BASED) REPLICATION                             │
│     ┌───────────────────────────────────────────────────────┐  │
│     │ Send logical changes                                  │  │
│     │ INSERT into users: {id: 123, name: "Alice"}           │  │
│     │ UPDATE users WHERE id=123: {name: "Alice" → "Alicia"} │  │
│     │ DELETE users WHERE id=123                             │  │
│     │                                                        │  │
│     │ Used by: MySQL ROW format, PostgreSQL logical         │  │
│     │                                                        │  │
│     │ Pros: Version independent, can transform/filter       │  │
│     │ Cons: More verbose for bulk operations                │  │
│     └───────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Leader Failover

┌─────────────────────────────────────────────────────────────────┐
│                    LEADER FAILOVER                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  When leader fails, a follower must be promoted:                │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ 1. DETECT FAILURE                                       │   │
│  │    • Heartbeat timeout (e.g., 30 seconds)              │   │
│  │    • Health check failures                              │   │
│  │    • Must avoid false positives (network hiccup)        │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ 2. CHOOSE NEW LEADER                                    │   │
│  │    • Most up-to-date replica (least replication lag)   │   │
│  │    • Consensus algorithm (Raft/Paxos)                  │   │
│  │    • Manual intervention                                │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ 3. RECONFIGURE SYSTEM                                   │   │
│  │    • Update routing to new leader                       │   │
│  │    • Followers start replicating from new leader        │   │
│  │    • Old leader (if recovers) becomes follower          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  PROBLEMS DURING FAILOVER:                                      │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Data loss: Async writes not replicated                │   │
│  │ • Split-brain: Two nodes think they're leader           │   │
│  │ • Inconsistent sequence numbers                         │   │
│  │ • Stale reads during transition                         │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  GITHUB 2012 INCIDENT:                                          │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • MySQL primary failed, promoted out-of-date replica    │   │
│  │ • Auto-increment IDs reused                             │   │
│  │ • Private repos became accessible to wrong users        │   │
│  │ • Lesson: Be careful with async replication failover    │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

2. Multi-Leader Replication

Multiple nodes can accept writes. Useful for multi-datacenter setups.

┌─────────────────────────────────────────────────────────────────┐
│              MULTI-LEADER REPLICATION                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Datacenter 1 (US)              Datacenter 2 (EU)              │
│  ┌──────────────────┐           ┌──────────────────┐           │
│  │     Leader A     │◄─────────►│     Leader B     │           │
│  │   (US Primary)   │  Async    │   (EU Primary)   │           │
│  └────────┬─────────┘  Sync     └────────┬─────────┘           │
│           │                              │                      │
│     ┌─────┴─────┐                  ┌─────┴─────┐               │
│     │           │                  │           │               │
│     ▼           ▼                  ▼           ▼               │
│  Follower   Follower            Follower   Follower            │
│                                                                 │
│  US writes go to Leader A                                       │
│  EU writes go to Leader B                                       │
│  Leaders sync with each other asynchronously                    │
│                                                                 │
│  Benefits:                                                      │
│  • Writes accepted at local datacenter (low latency)           │
│  • Datacenter failure doesn't stop writes                      │
│                                                                 │
│  Challenges:                                                    │
│  • Write conflicts (same key modified in both DCs)             │
│  • Conflict resolution required                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Conflict Resolution

┌─────────────────────────────────────────────────────────────────┐
│              CONFLICT RESOLUTION STRATEGIES                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Conflict: Both leaders modify same record concurrently         │
│                                                                 │
│  Leader A: x = "Alice"    Leader B: x = "Bob"                  │
│            at t=100                 at t=101                    │
│                                                                 │
│  After sync: Which value wins?                                  │
│                                                                 │
│  1. LAST WRITE WINS (LWW)                                       │
│     ┌───────────────────────────────────────────────────────┐  │
│     │ Higher timestamp wins                                 │  │
│     │ t=101 > t=100, so x = "Bob"                          │  │
│     │                                                        │  │
│     │ Problems:                                              │  │
│     │ • Clock skew: t=101 might actually be earlier         │  │
│     │ • Data loss: "Alice" update is silently dropped       │  │
│     │                                                        │  │
│     │ Used by: Cassandra, DynamoDB (default)               │  │
│     └───────────────────────────────────────────────────────┘  │
│                                                                 │
│  2. MERGE VALUES                                                │
│     ┌───────────────────────────────────────────────────────┐  │
│     │ Combine both values somehow                           │  │
│     │ x = "Alice" + "Bob" = "Alice, Bob"                   │  │
│     │                                                        │  │
│     │ Works for: Shopping carts (union of items)           │  │
│     │ Doesn't work for: Single-value fields                │  │
│     │                                                        │  │
│     │ Used by: CRDTs (sets, counters)                      │  │
│     └───────────────────────────────────────────────────────┘  │
│                                                                 │
│  3. KEEP BOTH (Application resolves)                            │
│     ┌───────────────────────────────────────────────────────┐  │
│     │ Store both versions with conflict marker             │  │
│     │ Application or user resolves later                   │  │
│     │                                                        │  │
│     │ x = [                                                │  │
│     │   {value: "Alice", version: [A:1]},                  │  │
│     │   {value: "Bob", version: [B:1]}                     │  │
│     │ ]                                                     │  │
│     │                                                        │  │
│     │ Used by: CouchDB, Riak                               │  │
│     └───────────────────────────────────────────────────────┘  │
│                                                                 │
│  4. CUSTOM RESOLUTION FUNCTION                                  │
│     ┌───────────────────────────────────────────────────────┐  │
│     │ Application-defined merge logic                       │  │
│     │                                                        │  │
│     │ // Example: Higher priority user wins                 │  │
│     │ resolve(a, b) = a.user.priority > b.user.priority    │  │
│     │                 ? a : b                               │  │
│     │                                                        │  │
│     │ Used by: Many custom systems                          │  │
│     └───────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Topology

┌─────────────────────────────────────────────────────────────────┐
│              MULTI-LEADER TOPOLOGIES                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  CIRCULAR:                                                      │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │       ┌────────┐                                          │ │
│  │    ┌──│  DC 1  │──┐                                       │ │
│  │    │  └────────┘  │                                       │ │
│  │    ▼              ▼                                       │ │
│  │ ┌────────┐    ┌────────┐                                  │ │
│  │ │  DC 3  │◄───│  DC 2  │                                  │ │
│  │ └────────┘    └────────┘                                  │ │
│  │                                                           │ │
│  │ Single point of failure: Any node breaks chain           │ │
│  └───────────────────────────────────────────────────────────┘ │
│                                                                 │
│  STAR (Hub-and-Spoke):                                          │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │         ┌────────┐                                        │ │
│  │         │  Hub   │                                        │ │
│  │         └───┬────┘                                        │ │
│  │     ┌───────┼───────┐                                     │ │
│  │     │       │       │                                     │ │
│  │     ▼       ▼       ▼                                     │ │
│  │ ┌──────┐┌──────┐┌──────┐                                  │ │
│  │ │ DC 1 ││ DC 2 ││ DC 3 │                                  │ │
│  │ └──────┘└──────┘└──────┘                                  │ │
│  │                                                           │ │
│  │ Single point of failure: Hub                             │ │
│  └───────────────────────────────────────────────────────────┘ │
│                                                                 │
│  ALL-TO-ALL (Mesh):                                             │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │     ┌────────┐                                            │ │
│  │     │  DC 1  │◄─────────────┐                             │ │
│  │     └───┬────┘              │                             │ │
│  │         │ ◄──────────┐      │                             │ │
│  │         ▼            │      │                             │ │
│  │     ┌────────┐◄──────┼──────┼───┐                         │ │
│  │     │  DC 2  │───────┼──────┼──►│                         │ │
│  │     └───┬────┘       │      │   │                         │ │
│  │         │            │      │   │                         │ │
│  │         ▼            │      │   │                         │ │
│  │     ┌────────┐───────┘      │   │                         │ │
│  │     │  DC 3  │◄─────────────┘   │                         │ │
│  │     └────────┘──────────────────┘                         │ │
│  │                                                           │ │
│  │ Most fault tolerant                                       │ │
│  │ But: Ordering issues (same update via different paths)    │ │
│  └───────────────────────────────────────────────────────────┘ │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

3. Leaderless Replication

No single node is special. Any node can accept writes.

┌─────────────────────────────────────────────────────────────────┐
│              LEADERLESS REPLICATION                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Client writes to multiple replicas in parallel                 │
│                                                                 │
│              Client                                             │
│                │                                                │
│       ┌────────┼────────┐                                      │
│       │        │        │                                      │
│       ▼        ▼        ▼                                      │
│    ┌─────┐  ┌─────┐  ┌─────┐                                   │
│    │Node │  │Node │  │Node │                                   │
│    │  A  │  │  B  │  │  C  │                                   │
│    └─────┘  └─────┘  └─────┘                                   │
│                                                                 │
│  Write succeeds if W nodes acknowledge                          │
│  Read queries R nodes, takes most recent value                  │
│                                                                 │
│  Configuration: N=3, W=2, R=2                                   │
│  • Write to at least 2 nodes                                   │
│  • Read from at least 2 nodes                                  │
│  • W + R > N ensures overlap (at least one has latest)         │
│                                                                 │
│  Used by: Cassandra, DynamoDB, Riak                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Quorum Reads and Writes

┌─────────────────────────────────────────────────────────────────┐
│              QUORUM CONFIGURATION                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  N = Number of replicas                                         │
│  W = Write quorum (nodes that must acknowledge write)           │
│  R = Read quorum (nodes to query for reads)                     │
│                                                                 │
│  For strong consistency: W + R > N                              │
│                                                                 │
│  Example: N=3                                                   │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Config    │ W │ R │ W+R │ Behavior                      │   │
│  │───────────┼───┼───┼─────┼──────────────────────────────│   │
│  │ Strong    │ 2 │ 2 │  4  │ Read sees latest write       │   │
│  │ Fast read │ 3 │ 1 │  4  │ Reads fast, writes slow      │   │
│  │ Fast write│ 1 │ 3 │  4  │ Writes fast, reads slow      │   │
│  │ Eventual  │ 1 │ 1 │  2  │ May read stale data          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Visual: W=2, R=2, N=3                                          │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                                                          │   │
│  │   Write (W=2):                                          │   │
│  │   ┌─────┐  ┌─────┐  ┌─────┐                             │   │
│  │   │ A ✓ │  │ B ✓ │  │ C   │  2 ACKs = success          │   │
│  │   │v=2  │  │v=2  │  │v=1  │                             │   │
│  │   └─────┘  └─────┘  └─────┘                             │   │
│  │                                                          │   │
│  │   Read (R=2):                                           │   │
│  │   ┌─────┐  ┌─────┐  ┌─────┐                             │   │
│  │   │ A   │  │ B   │  │ C ✓ │  Query B and C             │   │
│  │   │v=2  │  │v=2✓ │  │v=1  │  B has v=2, C has v=1      │   │
│  │   └─────┘  └─────┘  └─────┘  Return v=2 (higher)       │   │
│  │                                                          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Read Repair and Anti-Entropy

┌─────────────────────────────────────────────────────────────────┐
│              KEEPING REPLICAS IN SYNC                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  READ REPAIR:                                                   │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ During read, detect and fix stale replicas              │   │
│  │                                                          │   │
│  │ 1. Read from R nodes                                    │   │
│  │ 2. Compare versions                                     │   │
│  │ 3. Write latest value back to stale nodes              │   │
│  │                                                          │   │
│  │ Read from A, B, C:                                      │   │
│  │ A has v=3, B has v=3, C has v=2                        │   │
│  │ → Write v=3 back to C                                   │   │
│  │                                                          │   │
│  │ Pros: Self-healing on frequently read keys             │   │
│  │ Cons: Rarely-read keys stay stale                      │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ANTI-ENTROPY (Background repair):                              │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Periodic background process compares replicas           │   │
│  │ Uses Merkle trees for efficient comparison              │   │
│  │                                                          │   │
│  │ Merkle Tree:                                            │   │
│  │            ┌─────┐                                      │   │
│  │            │Root │ (hash of children)                   │   │
│  │            │ H1  │                                      │   │
│  │            └──┬──┘                                      │   │
│  │          ┌────┴────┐                                    │   │
│  │          ▼         ▼                                    │   │
│  │       ┌─────┐   ┌─────┐                                 │   │
│  │       │ H2  │   │ H3  │ (hash of children)              │   │
│  │       └──┬──┘   └──┬──┘                                 │   │
│  │      ┌───┴───┐ ┌───┴───┐                                │   │
│  │      ▼       ▼ ▼       ▼                                │   │
│  │    ┌───┐   ┌───┐ ┌───┐   ┌───┐                          │   │
│  │    │ A │   │ B │ │ C │   │ D │ (data blocks)            │   │
│  │    └───┘   └───┘ └───┘   └───┘                          │   │
│  │                                                          │   │
│  │ Compare root hashes → if different, drill down          │   │
│  │ Only transfer differing blocks                          │   │
│  │                                                          │   │
│  │ Used by: Cassandra, Riak, DynamoDB                     │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Sloppy Quorums and Hinted Handoff

┌─────────────────────────────────────────────────────────────────┐
│              SLOPPY QUORUMS                                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Problem: Network partition makes designated nodes unreachable  │
│                                                                 │
│  Strict quorum: Fail if can't reach W designated nodes          │
│  Sloppy quorum: Accept writes on ANY available nodes            │
│                                                                 │
│  Normal operation:                                              │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Key "user:123" belongs to nodes A, B, C                 │   │
│  │ Write goes to A, B, C                                   │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  During partition (A unreachable):                              │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Strict: Write fails (can't reach A)                     │   │
│  │ Sloppy: Write to B, C, and D (not normally responsible) │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  HINTED HANDOFF:                                                │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ When writing to "wrong" node D:                         │   │
│  │ D stores value with hint: "This belongs to A"           │   │
│  │                                                          │   │
│  │ When A comes back online:                               │   │
│  │ D sends hinted data to A                                │   │
│  │ D deletes local copy                                    │   │
│  │                                                          │   │
│  │ Result: Data eventually reaches correct nodes           │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Trade-off:                                                     │
│  • Improves write availability during partitions               │
│  • But: W + R > N no longer guarantees consistency!            │
│    (Reads might not find data on temporary nodes)              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Comparison of Strategies

┌─────────────────────────────────────────────────────────────────┐
│              REPLICATION STRATEGY COMPARISON                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│                  │ Single-Leader │ Multi-Leader │ Leaderless   │
│  ────────────────┼───────────────┼──────────────┼──────────────│
│  Write latency   │ Leader only   │ Local DC     │ Quorum (W)   │
│  Read latency    │ Any replica   │ Any replica  │ Quorum (R)   │
│  Consistency     │ Strong*       │ Eventual     │ Tunable      │
│  Conflicts       │ None          │ Yes          │ Yes          │
│  Failover        │ Required      │ Automatic    │ N/A          │
│  Complexity      │ Low           │ High         │ Medium       │
│  ────────────────┼───────────────┼──────────────┼──────────────│
│  Used by         │ PostgreSQL    │ CouchDB      │ Cassandra    │
│                  │ MySQL         │ Multi-DC DBs │ DynamoDB     │
│                  │ MongoDB       │              │ Riak         │
│                                                                 │
│  * Strong with sync replication; eventual with async           │
│                                                                 │
│  WHEN TO USE EACH:                                              │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Single-Leader: Most applications (simplicity, ACID)     │   │
│  │ Multi-Leader: Multi-datacenter with local writes        │   │
│  │ Leaderless: High availability, tunable consistency      │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Interview Questions

Conceptual Questions

What's the trade-off between synchronous and asynchronous replication?
- Sync: Strong consistency, but slower and less available
- Async: Fast and available, but may lose data on leader failure
- Semi-sync: Compromise (at least 1 follower must ACK)
How do you handle write conflicts in multi-leader replication?
- LWW (timestamp-based, may lose data)
- Merge (combine values, e.g., CRDTs)
- Keep both (let application resolve)
- Custom resolver function
Explain quorum in leaderless replication.
- W + R > N ensures read sees latest write
- W = write quorum, R = read quorum, N = replicas
- Can tune for read-heavy or write-heavy workloads
What is read repair?
- During read, compare values from R nodes
- Write latest value back to stale nodes
- Self-healing for frequently read data

System Design Questions

Design a globally distributed database with low write latency.
- Multi-leader: One leader per region
- Async replication between regions
- Conflict resolution: LWW or custom
- Trade-off: eventual consistency
How would you implement leader election for failover?
- Consensus algorithm (Raft/Paxos)
- Distributed lock (ZooKeeper, etcd)
- Fencing tokens to prevent split-brain
- Avoid auto-promotion without proper leader detection

Summary

┌─────────────────────────────────────────────────────────────────┐
│              MODULE 7 KEY TAKEAWAYS                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Replication provides: fault tolerance, scalability, latency │
│                                                                 │
│  2. Single-leader: Simple, no conflicts, but leader is SPOF     │
│     • Sync vs async replication trade-off                      │
│     • Failover is complex and error-prone                      │
│                                                                 │
│  3. Multi-leader: Local writes, high availability               │
│     • Write conflicts require resolution strategy              │
│     • Topologies: circular, star, all-to-all                   │
│                                                                 │
│  4. Leaderless: No single point of failure                      │
│     • Quorum for consistency: W + R > N                        │
│     • Read repair and anti-entropy for convergence             │
│     • Sloppy quorums trade consistency for availability        │
│                                                                 │
│  5. Choose based on requirements:                               │
│     • Single-leader: ACID, simplicity                          │
│     • Multi-leader: Multi-DC, local writes                     │
│     • Leaderless: High availability, tunable consistency       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Next Module: Partitioning and Sharding - How to split data across nodes for scalability.