Module 4: Time, Clocks, and Ordering in Distributed Systems
The Fundamental Problem of Time
On a single computer, time is straightforward: one clock, one timeline. In distributed systems, time becomes one of the most challenging problems, because every node has its own clock and no two clocks agree exactly.
WHY TIME IS HARD IN DISTRIBUTED SYSTEMS

Single machine:

    Event A ──────► Event B ──────► Event C
    t=100           t=200           t=300

    One clock. One timeline. Order is clear.

Distributed system:

    Node 1: Event A (t=100) ─────────────────► Event D (t=350)
                                                   ▲
    Node 2: Event B (t=98) ──► Event C (t=320) ────┘

    Different clocks. Different timelines. What's the order?
    Did A happen before B? (100 > 98, but the readings come from
    different clocks!)
Why Clocks Differ
SOURCES OF CLOCK DIVERGENCE

1. Crystal oscillator drift
   • Typical quartz crystal: ±50 ppm (parts per million)
     = 50 microseconds per second
     ≈ 4.3 seconds per day
     ≈ 30 seconds per week
   • Temperature affects oscillation frequency; a server-room A/C
     failure accelerates clock drift

2. NTP synchronization issues
   • Accuracy: 1-50 ms over the internet, 0.1-1 ms on a LAN
   • Network delays are asymmetric
   • The NTP server itself might be wrong
   • Leap seconds cause chaos
   • NTP might step the clock backward!

3. Virtualization
   • VMs can be paused and resumed
   • Live migration → clock skew
   • Hypervisor scheduling affects clock reads

4. Leap seconds
   • Inserted at 23:59:60 UTC on June 30 or December 31
   • Some systems repeat a second; others smear it across hours
   • The 2012 leap second crashed Reddit, LinkedIn, and Yelp
1. Physical Clocks and Their Limitations
Types of Physical Clocks
PHYSICAL CLOCK TYPES

Time-of-day clock (wall clock):

    time.Now() // returns e.g. 2024-01-15 10:30:45.123456789 UTC

    • Synchronized via NTP
    • Can jump forward or backward!
    • Subject to leap seconds
    • Resolution: typically microseconds to nanoseconds

    Use for: human-readable timestamps, logging, auditing
    NEVER use for: measuring elapsed time, ordering events

Monotonic clock:

    Go:   time.Now() includes a monotonic reading
    Java: System.nanoTime()
    C:    clock_gettime(CLOCK_MONOTONIC, ...)

    • Never goes backward
    • Not synchronized across machines
    • Only meaningful for durations on the same machine
    • Resets on system reboot

    Use for: measuring elapsed time, timeouts, rate limiting
    NEVER use for: timestamps, cross-machine ordering
Real-World Clock Failures
PRODUCTION CLOCK DISASTERS

Cloudflare (2017):
  • A leap second caused rand.Seed(time.Now().UnixNano()) to generate
    the same sequence across servers
  • → DNS randomization broken
  • → Potential security vulnerability

AWS (2017):
  • S3 outage compounded by clock skew during maintenance
  • Servers removed from rotation couldn't rejoin because their
    clocks were "in the future"

Google Spanner internal incident:
  • GPS receivers lost signal for an extended period
  • Atomic clocks drifted slightly
  • TrueTime uncertainty bounds widened
  • Write latency increased (transactions had to wait out the uncertainty)

Lesson: even the best clock systems fail.
Design your system to handle clock uncertainty.
2. The Happens-Before Relationship
Leslie Lamport's 1978 paper "Time, Clocks, and the Ordering of Events in a Distributed System" introduced the fundamental concept of logical time.
Definition of Happens-Before (→)
HAPPENS-BEFORE RELATION (→)

Event A "happens before" event B (written A → B) if any of:

Rule 1: Same process, A occurs before B

    Process P: [A] ────────► [B]
    A → B (program order)

Rule 2: A is a send, B is the corresponding receive

    Process P: [send(m)] ──────┐
                               ▼
    Process Q:           [receive(m)]
    send(m) → receive(m)

Rule 3: Transitivity

    If A → B and B → C, then A → C

Concurrent events:
If neither A → B nor B → A, then A and B are concurrent (A ∥ B):

    Process P: [A]
    Process Q: [B]
    A ∥ B (no causal relationship)
Why Happens-Before Matters
HAPPENS-BEFORE IN REAL SYSTEMS

Example: bank account update. Initial balance: $1000.

    Node 1 (ATM):              Node 2 (Online Banking):
      Read balance: $1000        Read balance: $1000
      Withdraw $100              Withdraw $500
      Write: $900                Write: $500

The final balance could be $900 OR $500 — never the correct $400.
This is the lost-update problem.

With a happens-before edge between the updates:

    Node 1: Read($1000) → Withdraw → Write($900)
                                        ↓
    Node 2: (waits) ──────► Read($900) → Withdraw → Write($400)

    Final balance: $400 (correct!)
3. Lamport Clocks (Logical Clocks)
Lamport's elegant solution: Don't try to synchronize wall clocks. Instead, use logical counters that respect causality.
Algorithm
LAMPORT CLOCK ALGORITHM

Each process maintains a counter C (initially 0).

Rule 1: Before any local event, increment the counter:

    C = C + 1
    timestamp(event) = C

Rule 2: When sending a message, increment and include the counter:

    C = C + 1
    send(message, C)

Rule 3: When receiving a message with timestamp t:

    C = max(C, t) + 1

Key property:
    If A → B, then LC(A) < LC(B)    (causality is preserved)

BUT:
    LC(A) < LC(B) does NOT imply A → B    (can't detect concurrency!)
Visual Example
LAMPORT CLOCK EXECUTION

    Process A (C=0)    Process B (C=0)    Process C (C=0)

    A event:  [1]
    A sends msg(1) ══►
                       B receives: [2]    (max(0,1)+1)
                       B sends msg(2) ══►
                                          C receives: [3]  (max(0,2)+1)
    A event:  [2]
              ◄══════════════════════════ C sends msg(3)
    A receives: [4]    (max(2,3)+1)
                       B event:  [3]

Resulting Lamport timestamps:
    A: 1, 2, 4
    B: 2, 3
    C: 3

Notice: events with the same LC value might be concurrent or
causally related — Lamport clocks can't distinguish!
Lamport Clock Implementation
```go
package lamport

import "sync"

type LamportClock struct {
	mu      sync.Mutex
	counter uint64
}

func NewLamportClock() *LamportClock {
	return &LamportClock{counter: 0}
}

// Tick increments the clock for a local event.
func (lc *LamportClock) Tick() uint64 {
	lc.mu.Lock()
	defer lc.mu.Unlock()
	lc.counter++
	return lc.counter
}

// Send returns the timestamp to include in an outgoing message.
func (lc *LamportClock) Send() uint64 {
	return lc.Tick()
}

// Receive updates the clock based on a received timestamp.
func (lc *LamportClock) Receive(msgTimestamp uint64) uint64 {
	lc.mu.Lock()
	defer lc.mu.Unlock()
	if msgTimestamp > lc.counter {
		lc.counter = msgTimestamp
	}
	lc.counter++
	return lc.counter
}

// Current returns the current timestamp without incrementing.
func (lc *LamportClock) Current() uint64 {
	lc.mu.Lock()
	defer lc.mu.Unlock()
	return lc.counter
}
```
Lamport Clock Limitations
LAMPORT CLOCK LIMITATIONS

Problem: concurrent events can't be detected.

    Process A: [1] ─────────────────── [2]
    Process B: [1] ─────── [2]

Both A's first event and B's first event have timestamp 1.
Are they concurrent? Or did one cause the other?
Lamport clocks CAN'T TELL!

This matters for:
  • Conflict detection in replicated systems
  • Determining whether events need reconciliation
  • Building causally consistent systems

Solution: vector clocks.
4. Vector Clocks
Vector clocks extend Lamport clocks to detect concurrent events.
The Insight
VECTOR CLOCK INSIGHT

Lamport clock: a single integer.
  Problem: can't detect concurrency.

Vector clock: an array of integers, one per process.
  Solution: track what each process knows about every other process.

For N processes, each maintains a vector V[0..N-1], where
V[i] = "the latest event I know about from process i".

Example with 3 processes, at process A:

    V = [3, 1, 2]
         │  │  └── A knows C has performed 2 events
         │  └───── A knows B has performed 1 event
         └──────── A has performed 3 events itself
Algorithm
VECTOR CLOCK ALGORITHM

Each process i maintains a vector V (all zeros initially).

Rule 1: Before any event on process i:

    V[i] = V[i] + 1

Rule 2: When sending a message from process i:

    V[i] = V[i] + 1
    send(message, V)    // include the entire vector

Rule 3: When process i receives a message with vector V_msg:

    for each j: V[j] = max(V[j], V_msg[j])
    V[i] = V[i] + 1

Comparing vector clocks:

    V1 < V2 iff ∀i: V1[i] ≤ V2[i] AND ∃j: V1[j] < V2[j]
        (V1 happens before V2)

    V1 ∥ V2 iff neither V1 < V2 nor V2 < V1
        (V1 and V2 are concurrent)
Visual Example
VECTOR CLOCK EXECUTION

    Process A          Process B          Process C
    V=[0,0,0]          V=[0,0,0]          V=[0,0,0]

    A event:  [1,0,0]
    A sends ══►
                       B receives: [1,1,0]
                       B sends ══►
                                          C receives: [1,1,1]
    A event:  [2,0,0]
              ◄══════════════════════════ C sends
    A receives: [3,1,1]   (merge [2,0,0] with [1,1,1], then V[A]++)
                       B event:  [1,2,0]

Comparisons:
    [2,0,0] vs [1,1,1]: neither dominates → CONCURRENT
    [1,0,0] vs [3,1,1]: [1,0,0] < [3,1,1] → HAPPENS BEFORE
    [1,2,0] vs [1,1,1]: neither dominates → CONCURRENT
Vector Clock Implementation
```go
package vectorclock

import "sync"

type VectorClock struct {
	mu     sync.RWMutex
	nodeID string
	clock  map[string]uint64
}

func NewVectorClock(nodeID string) *VectorClock {
	vc := &VectorClock{
		nodeID: nodeID,
		clock:  make(map[string]uint64),
	}
	vc.clock[nodeID] = 0
	return vc
}

// Tick increments this node's counter for a local event.
func (vc *VectorClock) Tick() map[string]uint64 {
	vc.mu.Lock()
	defer vc.mu.Unlock()
	vc.clock[vc.nodeID]++
	return vc.copy()
}

// Send returns the clock to include in an outgoing message.
func (vc *VectorClock) Send() map[string]uint64 {
	return vc.Tick()
}

// Receive merges an incoming clock into the local clock.
func (vc *VectorClock) Receive(incoming map[string]uint64) map[string]uint64 {
	vc.mu.Lock()
	defer vc.mu.Unlock()

	// Merge: take the max of each component.
	for nodeID, timestamp := range incoming {
		if timestamp > vc.clock[nodeID] {
			vc.clock[nodeID] = timestamp
		}
	}
	// Increment our own counter.
	vc.clock[vc.nodeID]++
	return vc.copy()
}

// Compare returns the ordering relationship:
//
//	-1: vc < other (vc happens before other)
//	 0: concurrent
//	 1: vc > other (vc happens after other)
func (vc *VectorClock) Compare(other map[string]uint64) int {
	vc.mu.RLock()
	defer vc.mu.RUnlock()

	less := false
	greater := false

	// Consider all keys from both clocks (a missing key counts as 0).
	allKeys := make(map[string]bool)
	for k := range vc.clock {
		allKeys[k] = true
	}
	for k := range other {
		allKeys[k] = true
	}

	for k := range allKeys {
		myVal := vc.clock[k]
		otherVal := other[k]
		if myVal < otherVal {
			less = true
		}
		if myVal > otherVal {
			greater = true
		}
	}

	if less && !greater {
		return -1 // happens before
	}
	if greater && !less {
		return 1 // happens after
	}
	return 0 // concurrent
}

func (vc *VectorClock) copy() map[string]uint64 {
	result := make(map[string]uint64)
	for k, v := range vc.clock {
		result[k] = v
	}
	return result
}
```
Vector Clock Usage in Dynamo
VECTOR CLOCKS IN AMAZON DYNAMO

Client writes "value1" to key "cart":

    Node A stores: cart = "value1", VC = [A:1]

Two concurrent updates during a network partition:

    Client 1 → Node A: cart = "value2", VC = [A:2]
    Client 2 → Node B: cart = "value3", VC = [A:1, B:1]

On read, Dynamo returns BOTH versions:

    [A:2] vs [A:1, B:1]
    Neither dominates → CONFLICT

    The application must resolve it:
    • Shopping cart: merge both items
    • Counter: take the max
    • Last-write-wins: fall back to the wall clock

After resolution (C is the resolving node):

    cart = "merged_value", VC = [A:2, B:1, C:1]
5. Hybrid Logical Clocks (HLC)
HLC combines the best of physical and logical clocks.
HYBRID LOGICAL CLOCKS

Problems with pure logical clocks:
  • No relation to wall-clock time
  • Can't answer "when did this happen?"
  • Hard to implement TTLs and time-based queries

Problems with pure physical clocks:
  • Can go backward (NTP adjustments)
  • Different clocks = different times
  • Can't guarantee causality

HLC: physical time + logical counter

    HLC = (physical_time, logical_counter)

    • Uses physical time when possible
    • Falls back to the logical counter when physical time
      would violate causality
    • Always monotonically increasing
    • Stays close to the wall clock (bounded drift)
HLC Algorithm
HLC ALGORITHM

State: (l, c), where l = physical component, c = logical counter.

LOCAL EVENT or SEND:

    l' = l
    pt = physical_time()

    if pt > l':
        l = pt
        c = 0
    else:
        l = l'        // keep the old physical component
        c = c + 1     // increment the logical counter

    return (l, c)

RECEIVE message with timestamp (l_m, c_m):

    l' = l
    pt = physical_time()

    if pt > l' and pt > l_m:
        l = pt
        c = 0
    elif l' > l_m:
        l = l'
        c = c + 1
    elif l_m > l':
        l = l_m
        c = c_m + 1
    else:               // l' == l_m
        l = l'
        c = max(c, c_m) + 1

    return (l, c)
HLC Implementation
```go
package hlc

import (
	"sync"
	"time"
)

type HLC struct {
	mu      sync.Mutex
	wallMS  int64  // milliseconds since epoch
	logical uint32 // logical counter
}

func NewHLC() *HLC {
	return &HLC{
		wallMS:  time.Now().UnixMilli(),
		logical: 0,
	}
}

type Timestamp struct {
	WallMS  int64
	Logical uint32
}

func (t Timestamp) Compare(other Timestamp) int {
	if t.WallMS < other.WallMS {
		return -1
	}
	if t.WallMS > other.WallMS {
		return 1
	}
	if t.Logical < other.Logical {
		return -1
	}
	if t.Logical > other.Logical {
		return 1
	}
	return 0
}

// Now generates a timestamp for a local event.
func (h *HLC) Now() Timestamp {
	h.mu.Lock()
	defer h.mu.Unlock()

	pt := time.Now().UnixMilli()
	if pt > h.wallMS {
		h.wallMS = pt
		h.logical = 0
	} else {
		h.logical++
	}
	return Timestamp{WallMS: h.wallMS, Logical: h.logical}
}

// Update incorporates a received timestamp.
func (h *HLC) Update(received Timestamp) Timestamp {
	h.mu.Lock()
	defer h.mu.Unlock()

	pt := time.Now().UnixMilli()
	switch {
	case pt > h.wallMS && pt > received.WallMS:
		h.wallMS = pt
		h.logical = 0
	case h.wallMS > received.WallMS:
		h.logical++
	case received.WallMS > h.wallMS:
		h.wallMS = received.WallMS
		h.logical = received.Logical + 1
	default: // h.wallMS == received.WallMS
		if h.logical > received.Logical {
			h.logical++
		} else {
			h.logical = received.Logical + 1
		}
	}
	return Timestamp{WallMS: h.wallMS, Logical: h.logical}
}
```
HLC in CockroachDB
HLC IN COCKROACHDB

CockroachDB uses HLC for:

1. Transaction timestamps

    BEGIN TRANSACTION at HLC = (1705315800000, 5)
    • All reads see data as of this timestamp
    • All writes are stamped with it

2. MVCC versioning

    Key: "user:123"
    Versions:
      (1705315800000, 5): {"name": "Alice"}
      (1705315700000, 0): {"name": "Alicia"}
      (1705315600000, 3): {"name": "A"}

3. Time-travel queries

    SELECT * FROM users
    AS OF SYSTEM TIME '2024-01-15 10:00:00'

    HLC makes this possible while maintaining causality.
6. Google Spanner's TrueTime
The most sophisticated time system in production: hardware-assisted distributed time.
GOOGLE TRUETIME

TrueTime doesn't return a single timestamp.
It returns an INTERVAL: [earliest, latest].

API:

    TT.now()     → TTinterval [earliest, latest]
    TT.after(t)  → true if t has definitely passed
    TT.before(t) → true if t has definitely not arrived

Hardware infrastructure:

    ┌─────────┐  ┌─────────┐  ┌─────────┐
    │   GPS   │  │   GPS   │  │ Atomic  │   Time Masters
    │Receiver │  │Receiver │  │  Clock  │   (per datacenter)
    └────┬────┘  └────┬────┘  └────┬────┘
         └────────────┼────────────┘
                      ▼
              ┌─────────────┐
              │ Time Daemon │   (on every server)
              └─────────────┘

    GPS:      provides absolute time, but can lose signal
    Atomic:   provides stable drift, but no absolute time
    Combined: bounded uncertainty (typically 1-7 ms)
Commit-Wait Protocol
SPANNER COMMIT-WAIT

To ensure external consistency (if T1 commits before T2 starts,
then T1's timestamp < T2's timestamp), a transaction commits as
follows:

1. Acquire locks.

2. Pick commit timestamp s:

    s = TT.now().latest

3. WAIT until TT.after(s) is true (wait out the uncertainty):

    Time ───────────────────────────────────────►
         [───────── uncertainty ─────────]
         earliest        s               latest
                                         └── WAIT until here

4. Commit and release the locks.

Cost: average wait = ε (uncertainty) ≈ 4 ms.
Benefit: TRUE external consistency without coordination.

Why it works:
  • T1 commits at s1 and waits until TT.after(s1)
  • T2 starts after T1 commits, so T2 starts in real time after s1
  • T2's timestamp s2 = TT.now().latest > s1
  • Guaranteed: s1 < s2 (external consistency!)
7. Practical Applications
Ordering Events Across Services
ORDERING IN MICROSERVICES

Problem: events from different services arrive out of order.

    Order Service ──── OrderCreated (t=100) ──────────► Shipping
    Payment Service ── PaymentReceived (t=95) ────────► Shipping

Shipping receives PaymentReceived before OrderCreated.
The wall clock says Payment (95) happened before Order (100),
but causally: Order → Payment.

Solution: include causal context in each event:

    OrderCreated:
      order_id: "123"
      hlc: (1705315800000, 0)

    PaymentReceived:
      order_id: "123"
      hlc: (1705315800000, 1)        ← causally after
      caused_by: (1705315800000, 0)  ← reference

Shipping can now order the events correctly!
Distributed Debugging
DISTRIBUTED TRACING WITH TIME

Correlating logs across services requires consistent time.

Without synchronized clocks:

    Service A: 10:00:00.000 - Request received
    Service B: 10:00:00.005 - Processing
    Service A: 10:00:00.002 - Response sent
               ↑ IMPOSSIBLE! A responded before B processed?

Solution: trace IDs + HLC + span context:

    Span: {
      trace_id: "abc123",
      span_id: "span1",
      parent_span_id: null,
      hlc_start: (1705315800000, 0),
      hlc_end: (1705315800000, 5),
      service: "api-gateway"
    }

    Child span: {
      trace_id: "abc123",
      span_id: "span2",
      parent_span_id: "span1",
      hlc_start: (1705315800000, 1),   ← causally after the parent
      hlc_end: (1705315800000, 3),
      service: "user-service"
    }
Interview Questions
Conceptual Questions
1. Why can't we use wall clocks to order events in distributed systems?
- Clocks drift (crystal oscillator imprecision)
- NTP can jump backward
- Different machines have different times
- No global reference frame
2. What's the difference between Lamport clocks and vector clocks?
- Lamport: Single counter, preserves causality but can't detect concurrency
- Vector: Array of counters, can detect both causality AND concurrency
- Trade-off: Vector clocks need O(n) space per message
3. Explain the happens-before relationship.
- A → B if: same process (A before B), or A is send and B is receive, or transitive
- If neither A → B nor B → A, events are concurrent
- Foundation for all logical time algorithms
4. Why does Google Spanner wait before committing?
- Commit-wait ensures external consistency
- Wait out the TrueTime uncertainty interval
- After waiting, guaranteed that real time has passed commit timestamp
- Any subsequent transaction will have higher timestamp
System Design Questions
1. Design a conflict detection system for a collaborative editor.
- Use vector clocks to track each client's edits
- Concurrent edits detected when neither dominates
- Operational transformation or CRDT to resolve
- HLC for reasonable wall-clock ordering in UI
2. How would you implement consistent snapshots across microservices?
- Chandy-Lamport algorithm with logical clocks
- Each service records state when receiving marker
- Marker propagates through all channels
- Resulting snapshot is consistent cut
Common Mistakes
TIME-RELATED ANTI-PATTERNS

✗ Using the wall clock for distributed ordering

    // DON'T
    if (event1.timestamp < event2.timestamp) {
        // event1 happened first (WRONG!)
    }

    // DO: use logical/vector clocks
    if (vectorClock1.happensBefore(vectorClock2)) { ... }

✗ Assuming monotonic time across machines

    // DON'T
    token_expiry = remote_server.getTime() + 3600

    // DO: use duration-based or HLC-based expiry
    token_expiry_hlc = current_hlc.add(Duration::hours(1))

✗ Comparing timestamps from different sources

    // DON'T
    if (db_timestamp > cache_timestamp) { ... }

    // DO: use the same clock source, or version numbers
    if (db_version > cache_version) { ... }

✗ Ignoring clock skew in TTL calculations

    // DON'T
    cache.set(key, value, ttl=60s)  // might expire early!

    // DO: add a buffer for clock skew
    cache.set(key, value, ttl=60s + max_clock_skew)
Summary
MODULE 4 KEY TAKEAWAYS

1. Physical clocks are unreliable for ordering.
   • Drift, NTP jumps, leap seconds
   • Never use the wall clock to determine causality

2. Lamport clocks: simple but limited.
   • Preserve causality: A → B implies LC(A) < LC(B)
   • Cannot detect concurrency

3. Vector clocks: full causality tracking.
   • Detect both causality and concurrency
   • O(n) space per message
   • Used in Dynamo and Riak

4. HLC: the best of both worlds.
   • Physical time when possible
   • Logical counter for causality
   • Used in CockroachDB and MongoDB

5. TrueTime: hardware-assisted certainty.
   • Returns an interval, not a point
   • Enables the commit-wait protocol
   • Google Spanner's secret sauce
Next Module: Consistency Models - Understanding linearizability, eventual consistency, and everything in between.