Module 10: Distributed Transactions

The Problem

When a transaction spans multiple nodes or services, how do we ensure ACID properties?

┌─────────────────────────────────────────────────────────────────┐
│              THE DISTRIBUTED TRANSACTION PROBLEM                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Money transfer: Alice → Bob ($100)                            │
│                                                                 │
│  ┌─────────────────┐         ┌─────────────────┐               │
│  │   Service A     │         │   Service B     │               │
│  │   (Alice's      │         │   (Bob's        │               │
│  │    Account)     │         │    Account)     │               │
│  │                 │         │                 │               │
│  │  Balance: $500  │         │  Balance: $200  │               │
│  │  - $100         │   ???   │  + $100         │               │
│  │  = $400         │         │  = $300         │               │
│  └─────────────────┘         └─────────────────┘               │
│                                                                 │
│  What if Service A succeeds but Service B fails?               │
│  • Alice lost $100                                             │
│  • Bob didn't receive $100                                     │
│  • Money disappeared! (Violates atomicity)                     │
│                                                                 │
│  Requirements:                                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ ATOMICITY: Both succeed or both fail                    │   │
│  │ CONSISTENCY: No money created or destroyed              │   │
│  │ ISOLATION: Other transactions see consistent state      │   │
│  │ DURABILITY: Committed transactions survive crashes      │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

1. Two-Phase Commit (2PC)

The classic protocol for distributed transactions.

Protocol Overview

┌─────────────────────────────────────────────────────────────────┐
│              TWO-PHASE COMMIT PROTOCOL                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Roles:                                                         │
│  • Coordinator: Orchestrates the transaction                   │
│  • Participants: Execute the transaction parts                 │
│                                                                 │
│  PHASE 1: PREPARE (Voting)                                      │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                                                          │   │
│  │  Coordinator        Participant A      Participant B     │   │
│  │       │                   │                  │           │   │
│  │       │ ── PREPARE ──────►│                  │           │   │
│  │       │ ── PREPARE ────────────────────────► │           │   │
│  │       │                   │                  │           │   │
│  │       │ ◄── VOTE YES ─────│                  │           │   │
│  │       │ ◄── VOTE YES ─────────────────────── │           │   │
│  │       │                   │                  │           │   │
│  │                                                          │   │
│  │  Participants:                                           │   │
│  │  • Acquire locks                                         │   │
│  │  • Execute transaction (don't commit)                   │   │
│  │  • Write to transaction log                             │   │
│  │  • Vote YES (can commit) or NO (must abort)            │   │
│  │                                                          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  PHASE 2: COMMIT/ABORT (Decision)                               │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                                                          │   │
│  │  Coordinator        Participant A      Participant B     │   │
│  │       │                   │                  │           │   │
│  │       │ ── COMMIT ───────►│                  │           │   │
│  │       │ ── COMMIT ─────────────────────────► │           │   │
│  │       │                   │                  │           │   │
│  │       │ ◄── ACK ──────────│                  │           │   │
│  │       │ ◄── ACK ──────────────────────────── │           │   │
│  │       │                   │                  │           │   │
│  │                                                          │   │
│  │  Decision rules:                                         │   │
│  │  • All vote YES → COMMIT                                │   │
│  │  • Any vote NO → ABORT                                  │   │
│  │  • Timeout → ABORT                                      │   │
│  │                                                          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

2PC State Machine

┌─────────────────────────────────────────────────────────────────┐
│              2PC STATE MACHINES                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  COORDINATOR:                                                   │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                                                          │   │
│  │  ┌────────┐  send PREPARE   ┌─────────┐                 │   │
│  │  │ INIT   │ ───────────────►│ WAITING │                 │   │
│  │  └────────┘                 └────┬────┘                 │   │
│  │                                  │                       │   │
│  │                    ┌─────────────┴─────────────┐         │   │
│  │                    │                           │         │   │
│  │            all YES │                   any NO  │         │   │
│  │                    ▼                           ▼         │   │
│  │              ┌──────────┐               ┌──────────┐     │   │
│  │              │COMMITTING│               │ ABORTING │     │   │
│  │              └────┬─────┘               └────┬─────┘     │   │
│  │                   │                          │           │   │
│  │           all ACK │                  all ACK │           │   │
│  │                   ▼                          ▼           │   │
│  │              ┌──────────┐               ┌──────────┐     │   │
│  │              │COMMITTED │               │ ABORTED  │     │   │
│  │              └──────────┘               └──────────┘     │   │
│  │                                                          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  PARTICIPANT:                                                   │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                                                          │   │
│  │  ┌────────┐  PREPARE    ┌─────────┐                     │   │
│  │  │ INIT   │ ──────────► │PREPARED │                     │   │
│  │  └────────┘  (can commit)└────┬───┘                     │   │
│  │       │                       │                          │   │
│  │       │ PREPARE          ┌────┴────┐                     │   │
│  │       │ (can't commit)   │         │                     │   │
│  │       ▼              COMMIT    ABORT                     │   │
│  │  ┌────────┐              │         │                     │   │
│  │  │ABORTED │              ▼         ▼                     │   │
│  │  └────────┘         ┌────────┐ ┌────────┐               │   │
│  │                     │COMMITTED│ │ABORTED │               │   │
│  │                     └────────┘ └────────┘               │   │
│  │                                                          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

2PC Problems

┌─────────────────────────────────────────────────────────────────┐
│              2PC PROBLEMS                                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  PROBLEM 1: BLOCKING                                            │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ If coordinator fails after sending PREPARE:             │   │
│  │                                                          │   │
│  │ Coordinator: PREPARE → ✗ (crash)                        │   │
│  │ Participant: Voted YES, waiting for COMMIT/ABORT...     │   │
│  │              Holding locks... forever?                   │   │
│  │                                                          │   │
│  │ Participant is "in doubt" - can't decide alone!         │   │
│  │ Must wait for coordinator recovery.                     │   │
│  │ Locks block other transactions.                         │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  PROBLEM 2: COORDINATOR SPOF                                    │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Coordinator failure blocks entire system                │   │
│  │ All participants wait indefinitely                      │   │
│  │                                                          │   │
│  │ Mitigation: Replicate coordinator (Paxos/Raft)          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  PROBLEM 3: LATENCY                                             │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Minimum 2 round-trips:                                  │   │
│  │   1. PREPARE → all participants                         │   │
│  │   2. COMMIT → all participants                          │   │
│  │                                                          │   │
│  │ Cross-datacenter: 100ms RTT × 2 = 200ms+ per txn        │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  PROBLEM 4: DURABILITY REQUIREMENTS                             │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Must fsync logs before:                                 │   │
│  │   - Voting YES (participant)                            │   │
│  │   - Sending COMMIT (coordinator)                        │   │
│  │                                                          │   │
│  │ Disk fsync: 5-10ms each                                 │   │
│  │ Adds significant latency                                │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

2. Three-Phase Commit (3PC)

Attempts to solve the blocking problem by adding a phase.

┌─────────────────────────────────────────────────────────────────┐
│              THREE-PHASE COMMIT                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Phase 1: CAN COMMIT?                                           │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Coordinator: "Can you commit this transaction?"         │   │
│  │ Participants: Vote YES or NO                            │   │
│  │ (Same as 2PC Phase 1)                                   │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Phase 2: PRE-COMMIT                                            │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Coordinator: "Everyone agreed, prepare to commit"       │   │
│  │ Participants: ACK and enter PRE-COMMITTED state         │   │
│  │                                                          │   │
│  │ NEW: All participants know the decision will be COMMIT  │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Phase 3: DO COMMIT                                             │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Coordinator: "Commit now"                               │   │
│  │ Participants: Actually commit                           │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  How it helps:                                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ If coordinator fails in Phase 2:                        │   │
│  │ Participants know: "We're PRE-COMMITTED"                │   │
│  │ They can elect new coordinator and proceed to commit.   │   │
│  │                                                          │   │
│  │ If coordinator fails in Phase 1:                        │   │
│  │ No one is PRE-COMMITTED, safe to abort.                │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  BUT: Still has problems with network partitions!              │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ A gets PRE-COMMIT, B doesn't (partition)                │   │
│  │ A might commit, B might abort                           │   │
│  │ INCONSISTENT!                                           │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  3PC rarely used in practice - complexity not worth it.         │
│  Instead: Use Paxos-based commit or Saga pattern.              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

3. Saga Pattern

Alternative to distributed transactions: Sequence of local transactions with compensations.

┌─────────────────────────────────────────────────────────────────┐
│                    SAGA PATTERN                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Instead of:                                                    │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ BEGIN DISTRIBUTED TRANSACTION                           │   │
│  │   Debit Alice $100                                      │   │
│  │   Credit Bob $100                                       │   │
│  │ COMMIT                                                   │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Do:                                                            │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ T1: Debit Alice $100 (local commit)                     │   │
│  │     ↓                                                    │   │
│  │ T2: Credit Bob $100 (local commit)                      │   │
│  │                                                          │   │
│  │ If T2 fails → Run C1: Credit Alice $100 (compensation)  │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Each step:                                                     │
│  ┌───────────────┬────────────────────────────────────────┐    │
│  │ Transaction   │ Compensating Transaction               │    │
│  ├───────────────┼────────────────────────────────────────┤    │
│  │ Create Order  │ Cancel Order                           │    │
│  │ Reserve Stock │ Release Stock                          │    │
│  │ Charge Card   │ Refund Card                            │    │
│  │ Ship Item     │ Return Item                            │    │
│  └───────────────┴────────────────────────────────────────┘    │
│                                                                 │
│  Properties:                                                    │
│  • Each step commits immediately (no locks held)               │
│  • Failures trigger compensations (backward recovery)          │
│  • Eventually consistent (not ACID)                            │
│  • No distributed locks!                                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Saga Execution Patterns

┌─────────────────────────────────────────────────────────────────┐
│              SAGA COORDINATION                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  CHOREOGRAPHY (Event-driven):                                   │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                                                          │   │
│  │  Order      ─── OrderCreated ───►  Inventory            │   │
│  │  Service                            Service              │   │
│  │                                        │                 │   │
│  │                      ┌─StockReserved───┘                 │   │
│  │                      │                                   │   │
│  │                      ▼                                   │   │
│  │                   Payment                                │   │
│  │                   Service                                │   │
│  │                      │                                   │   │
│  │          ┌─PaymentCharged───┘                            │   │
│  │          │                                               │   │
│  │          ▼                                               │   │
│  │       Shipping                                           │   │
│  │       Service                                            │   │
│  │                                                          │   │
│  │  Services react to events, publish their own events     │   │
│  │  No central coordinator                                  │   │
│  │  Hard to understand flow, can become complex            │   │
│  │                                                          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ORCHESTRATION (Command-driven):                                │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                                                          │   │
│  │              ┌─────────────────┐                         │   │
│  │              │  Saga           │                         │   │
│  │              │  Orchestrator   │                         │   │
│  │              └────────┬────────┘                         │   │
│  │                       │                                  │   │
│  │         ┌─────────────┼─────────────┐                    │   │
│  │         │             │             │                    │   │
│  │         ▼             ▼             ▼                    │   │
│  │    ┌────────┐   ┌──────────┐  ┌──────────┐              │   │
│  │    │Inventory│   │ Payment │  │ Shipping │              │   │
│  │    │Service │   │ Service │  │ Service  │              │   │
│  │    └────────┘   └──────────┘  └──────────┘              │   │
│  │                                                          │   │
│  │  Orchestrator:                                           │   │
│  │  1. Tell Inventory: Reserve stock                       │   │
│  │  2. Tell Payment: Charge card                           │   │
│  │  3. Tell Shipping: Ship item                            │   │
│  │  4. If any fails: Run compensations                     │   │
│  │                                                          │   │
│  │  Easier to understand, but orchestrator is SPOF         │   │
│  │                                                          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Saga Failure Handling

┌─────────────────────────────────────────────────────────────────┐
│              SAGA COMPENSATION EXAMPLE                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Order Saga: Create Order → Reserve → Pay → Ship                │
│                                                                 │
│  HAPPY PATH:                                                    │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ T1: Create Order ✓                                      │   │
│  │ T2: Reserve Stock ✓                                     │   │
│  │ T3: Charge Card ✓                                       │   │
│  │ T4: Ship Item ✓                                         │   │
│  │                                                          │   │
│  │ SUCCESS!                                                │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  FAILURE AT T3:                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ T1: Create Order ✓                                      │   │
│  │ T2: Reserve Stock ✓                                     │   │
│  │ T3: Charge Card ✗ (insufficient funds)                  │   │
│  │                                                          │   │
│  │ COMPENSATE:                                              │   │
│  │ C2: Release Stock (undo T2)                             │   │
│  │ C1: Cancel Order (undo T1)                              │   │
│  │                                                          │   │
│  │ SAGA ABORTED (all effects undone)                       │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Compensation challenges:                                       │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Not all actions are compensatable                     │   │
│  │   (Can't un-send email, un-ship physical item)          │   │
│  │                                                          │   │
│  │ • Compensation might fail too                           │   │
│  │   (Need retry logic, dead letter queues)                │   │
│  │                                                          │   │
│  │ • Temporary inconsistency visible                       │   │
│  │   (Order exists briefly before compensation)            │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

4. Comparison: 2PC vs Saga

┌─────────────────────────────────────────────────────────────────┐
│              2PC vs SAGA COMPARISON                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│                     │ 2PC            │ Saga                     │
│  ───────────────────┼────────────────┼──────────────────────────│
│  Consistency        │ Strong (ACID)  │ Eventual                 │
│  Isolation          │ Full           │ None (ACD)               │
│  Blocking           │ Yes            │ No                       │
│  Coordinator SPOF   │ Yes            │ Depends on pattern       │
│  Latency            │ Higher         │ Lower                    │
│  Compensation logic │ Automatic      │ Manual (per operation)   │
│  Partial failure    │ All-or-nothing │ Compensation chain       │
│                                                                 │
│  USE 2PC WHEN:                                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Strong consistency required                           │   │
│  │ • Operations naturally atomic                           │   │
│  │ • Low latency environments (same datacenter)           │   │
│  │ • Database-level transactions                           │   │
│  │                                                          │   │
│  │ Examples: Banking within one bank, inventory systems   │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  USE SAGA WHEN:                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Long-running transactions                             │   │
│  │ • Cross-service/cross-organization                      │   │
│  │ • High availability required                            │   │
│  │ • Eventual consistency acceptable                       │   │
│  │                                                          │   │
│  │ Examples: Order fulfillment, booking systems           │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

5. Modern Approaches

Google Spanner: Globally Distributed ACID

┌─────────────────────────────────────────────────────────────────┐
│              GOOGLE SPANNER TRANSACTIONS                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  How Spanner achieves global ACID:                              │
│                                                                 │
│  1. TrueTime for timestamps                                     │
│     ┌───────────────────────────────────────────────────────┐  │
│     │ GPS + atomic clocks → bounded time uncertainty        │  │
│     │ Every transaction gets globally unique timestamp      │  │
│     └───────────────────────────────────────────────────────┘  │
│                                                                 │
│  2. Paxos per partition (shard)                                 │
│     ┌───────────────────────────────────────────────────────┐  │
│     │ Each partition: 5 replicas using Paxos               │  │
│     │ Strong consistency within partition                  │  │
│     └───────────────────────────────────────────────────────┘  │
│                                                                 │
│  3. 2PC across partitions                                       │
│     ┌───────────────────────────────────────────────────────┐  │
│     │ Cross-partition transactions use 2PC                 │  │
│     │ Coordinator is Paxos group (fault tolerant)          │  │
│     └───────────────────────────────────────────────────────┘  │
│                                                                 │
│  Result:                                                        │
│  • Global strong consistency                                   │
│  • Serializable isolation                                      │
│  • But: High latency (TrueTime wait + 2PC)                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

CockroachDB: Serializable Distributed Transactions

┌─────────────────────────────────────────────────────────────────┐
│              COCKROACHDB TRANSACTIONS                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Similar to Spanner but without specialized hardware:           │
│                                                                 │
│  1. Hybrid Logical Clocks (HLC)                                 │
│     ┌───────────────────────────────────────────────────────┐  │
│     │ Physical time + logical counter                      │  │
│     │ No GPS/atomic clocks needed                          │  │
│     │ Bounded clock skew (default 500ms)                   │  │
│     └───────────────────────────────────────────────────────┘  │
│                                                                 │
│  2. Raft per range (partition)                                  │
│     ┌───────────────────────────────────────────────────────┐  │
│     │ Data split into ranges                               │  │
│     │ Each range: Raft group (3-5 replicas)               │  │
│     └───────────────────────────────────────────────────────┘  │
│                                                                 │
│  3. Parallel commits (optimization)                             │
│     ┌───────────────────────────────────────────────────────┐  │
│     │ Commit in single round-trip when possible            │  │
│     │ Avoids coordinator bottleneck                        │  │
│     └───────────────────────────────────────────────────────┘  │
│                                                                 │
│  Trade-off vs Spanner:                                          │
│  • More flexible deployment (commodity hardware)               │
│  • Higher clock skew → more uncertainty → slower writes        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Interview Questions

Conceptual Questions

Explain 2PC and its problems.
- Two phases: Prepare (vote), Commit (decide)
- Problems: Blocking (coordinator failure), latency, SPOF
How does Saga differ from 2PC?
- Saga: Sequence of local transactions + compensations
- No locks held across steps
- Eventually consistent, not ACID
When would you use 2PC vs Saga?
- 2PC: Strong consistency needed, low latency environment
- Saga: Long-running, cross-service, high availability
What is the "in-doubt" state in 2PC?
- Participant voted YES but didn't receive decision
- Holds locks, can't decide alone
- Must wait for coordinator recovery

System Design Questions

Design a money transfer system between banks.
- Cross-bank: Saga pattern (different organizations)
- Within bank: 2PC (control both sides)
- Handle compensation: Reversal transactions
- Idempotency keys for retry safety
How would you implement order placement across inventory, payment, and shipping?
- Saga with orchestrator
- Steps: Reserve → Charge → Ship
- Compensations: Release → Refund → Return
- Handle partial failures gracefully

Summary

┌─────────────────────────────────────────────────────────────────┐
│              MODULE 10 KEY TAKEAWAYS                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Distributed transactions are hard                           │
│     • Network failures, partial failures                       │
│     • Need atomicity across nodes                              │
│                                                                 │
│  2. 2PC: Classic approach                                       │
│     • Prepare → Commit phases                                  │
│     • Strong consistency but blocking                          │
│     • Coordinator is SPOF                                      │
│                                                                 │
│  3. 3PC: Attempts to solve blocking                             │
│     • Adds pre-commit phase                                    │
│     • Still has partition problems                             │
│     • Rarely used in practice                                  │
│                                                                 │
│  4. Saga: Modern alternative                                    │
│     • Sequence of local transactions                           │
│     • Compensations for rollback                               │
│     • No distributed locks                                     │
│     • Eventually consistent                                    │
│                                                                 │
│  5. Choose based on requirements                                │
│     • Strong consistency → 2PC (or Spanner/CockroachDB)       │
│     • High availability → Saga                                 │
│     • Consider hybrid approaches                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Next Module: Caching Strategies - Write-through, write-behind, cache invalidation, and distributed caching patterns.