Module 22: Failure Detection

Why Failure Detection Matters

In distributed systems, detecting failures quickly and accurately is essential for maintaining availability.
The Challenge:

Node A sends a message to Node B and gets no response. Is B dead? Or just slow? Or is the network at fault? From A's point of view, the following cases are indistinguishable:

• Crashed node
• Network partition
• Slow node (high load)
• Message loss

Trade-offs:

Fast detection (quick timeouts) pulls against accuracy (fewer false positives).

• False positives: declaring an alive node dead (unnecessary failovers, wasted resources)
• False negatives: failing to detect a dead node (requests sent to dead nodes, delayed recovery)

Heartbeat-Based Detection

Push-based (node sends heartbeats):

Each monitored node periodically sends a heartbeat to a monitor. When heartbeats stop arriving, the monitor declares the node dead after a timeout.

Pull-based (monitor checks nodes):

The monitor periodically sends a ping to each node and waits for a pong. When pongs stop arriving, the node is declared dead after a timeout.

Parameters:

• Heartbeat interval: how often to send heartbeats
• Timeout: when to consider a node dead
• Failure threshold: how many missed heartbeats before declaring death

Example: with interval = 1s and timeout = 5s, a node is marked dead after 5 missed heartbeats.

Basic Heartbeat Implementation

```go
package failuredetection

import (
	"context"
	"log"
	"sync"
	"time"
)

// NodeStatus represents node health status.
type NodeStatus int

const (
	StatusAlive NodeStatus = iota
	StatusSuspect
	StatusDead
)

// Node represents a monitored node.
type Node struct {
	ID           string
	Address      string
	Status       NodeStatus
	LastSeen     time.Time
	FailureCount int
}

// HeartbeatDetector implements heartbeat-based failure detection.
type HeartbeatDetector struct {
	mu                sync.RWMutex
	nodes             map[string]*Node
	heartbeatInterval time.Duration
	timeout           time.Duration
	failureThreshold  int
	onStatusChange    func(nodeID string, status NodeStatus)
	stopCh            chan struct{}
}

func NewHeartbeatDetector(
	heartbeatInterval, timeout time.Duration,
	failureThreshold int,
) *HeartbeatDetector {
	return &HeartbeatDetector{
		nodes:             make(map[string]*Node),
		heartbeatInterval: heartbeatInterval,
		timeout:           timeout,
		failureThreshold:  failureThreshold,
		stopCh:            make(chan struct{}),
	}
}

// Register adds a node to monitoring.
func (d *HeartbeatDetector) Register(nodeID, address string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.nodes[nodeID] = &Node{
		ID:       nodeID,
		Address:  address,
		Status:   StatusAlive,
		LastSeen: time.Now(),
	}
}

// Heartbeat records a heartbeat from a node.
func (d *HeartbeatDetector) Heartbeat(nodeID string) {
	d.mu.Lock()
	defer d.mu.Unlock()

	node, ok := d.nodes[nodeID]
	if !ok {
		return
	}

	node.LastSeen = time.Now()
	node.FailureCount = 0
	if node.Status != StatusAlive {
		node.Status = StatusAlive
		if d.onStatusChange != nil {
			go d.onStatusChange(nodeID, StatusAlive)
		}
	}
}

// Start begins the failure detection loop.
func (d *HeartbeatDetector) Start(ctx context.Context) {
	ticker := time.NewTicker(d.heartbeatInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-d.stopCh:
			return
		case <-ticker.C:
			d.checkNodes()
		}
	}
}

func (d *HeartbeatDetector) checkNodes() {
	d.mu.Lock()
	defer d.mu.Unlock()

	now := time.Now()
	for _, node := range d.nodes {
		elapsed := now.Sub(node.LastSeen)
		if elapsed <= d.timeout {
			continue
		}
		node.FailureCount++
		if node.FailureCount >= d.failureThreshold && node.Status != StatusDead {
			node.Status = StatusDead
			log.Printf("Node %s marked as DEAD", node.ID)
			if d.onStatusChange != nil {
				go d.onStatusChange(node.ID, StatusDead)
			}
		} else if node.Status == StatusAlive {
			node.Status = StatusSuspect
			log.Printf("Node %s is SUSPECT", node.ID)
			if d.onStatusChange != nil {
				go d.onStatusChange(node.ID, StatusSuspect)
			}
		}
	}
}

// GetStatus returns the status of a node.
func (d *HeartbeatDetector) GetStatus(nodeID string) NodeStatus {
	d.mu.RLock()
	defer d.mu.RUnlock()
	node, ok := d.nodes[nodeID]
	if !ok {
		return StatusDead
	}
	return node.Status
}

// GetAliveNodes returns all alive nodes.
func (d *HeartbeatDetector) GetAliveNodes() []string {
	d.mu.RLock()
	defer d.mu.RUnlock()
	var alive []string
	for _, node := range d.nodes {
		if node.Status == StatusAlive {
			alive = append(alive, node.ID)
		}
	}
	return alive
}

// OnStatusChange sets the callback for status changes. The lock keeps this
// safe even if it races with the detection loop, but it is simplest to set
// the callback before calling Start.
func (d *HeartbeatDetector) OnStatusChange(fn func(nodeID string, status NodeStatus)) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.onStatusChange = fn
}

// Stop stops the detector.
func (d *HeartbeatDetector) Stop() {
	close(d.stopCh)
}
```

Phi Accrual Failure Detector

A more sophisticated approach that provides a suspicion level rather than binary alive/dead.
Instead of a binary alive/dead answer, we get phi (φ), a suspicion level:

φ = -log10(P(mistake))

• φ = 1 → 10% chance we're wrong
• φ = 2 → 1% chance we're wrong
• φ = 3 → 0.1% chance we're wrong
• φ = 8 → 0.000001% chance we're wrong

How it works:

1. Track heartbeat arrival times
2. Model the inter-arrival times as a normal distribution
3. Calculate the probability of the current delay given that history
4. Convert the probability to a phi value

If the heartbeat interval varies (mean = 1s, stddev = 0.2s) and we haven't heard anything for 3 seconds, φ is high: very suspicious!

Benefits:

• Adapts to network conditions
• The application chooses its own threshold
• Used by Cassandra and Akka

Phi Accrual Implementation

```go
package failuredetection

import (
	"math"
	"sync"
	"time"
)

// PhiAccrualDetector implements the Phi Accrual failure detector.
type PhiAccrualDetector struct {
	mu sync.RWMutex

	// Per-node state
	nodes map[string]*PhiNodeState

	// Configuration
	threshold                float64       // Phi threshold for failure
	minStdDev                float64       // Minimum standard deviation (seconds)
	acceptableHeartbeatPause time.Duration // Grace period subtracted from the elapsed time
	firstHeartbeatEstimate   time.Duration
	maxSamples               int // Window of samples to keep
}

// PhiNodeState tracks state for a single node.
type PhiNodeState struct {
	intervals     []time.Duration // Heartbeat arrival intervals
	mean          float64
	stdDev        float64
	lastHeartbeat time.Time
}

func NewPhiAccrualDetector(threshold float64) *PhiAccrualDetector {
	return &PhiAccrualDetector{
		nodes:                    make(map[string]*PhiNodeState),
		threshold:                threshold,
		minStdDev:                (500 * time.Millisecond).Seconds(),
		acceptableHeartbeatPause: 0,
		firstHeartbeatEstimate:   1 * time.Second,
		maxSamples:               200,
	}
}

// Heartbeat records a heartbeat from a node.
func (d *PhiAccrualDetector) Heartbeat(nodeID string) {
	d.mu.Lock()
	defer d.mu.Unlock()

	now := time.Now()
	state, ok := d.nodes[nodeID]
	if !ok {
		d.nodes[nodeID] = &PhiNodeState{lastHeartbeat: now}
		return
	}

	// Record the interval since the last heartbeat.
	interval := now.Sub(state.lastHeartbeat)
	state.lastHeartbeat = now

	state.intervals = append(state.intervals, interval)
	if len(state.intervals) > d.maxSamples {
		state.intervals = state.intervals[1:]
	}

	// Recalculate statistics.
	state.mean, state.stdDev = d.calculateStats(state.intervals)
}

func (d *PhiAccrualDetector) calculateStats(intervals []time.Duration) (mean, stdDev float64) {
	if len(intervals) == 0 {
		return d.firstHeartbeatEstimate.Seconds(), d.minStdDev
	}

	// Mean
	var sum float64
	for _, interval := range intervals {
		sum += interval.Seconds()
	}
	mean = sum / float64(len(intervals))

	// Standard deviation
	var sumSquares float64
	for _, interval := range intervals {
		diff := interval.Seconds() - mean
		sumSquares += diff * diff
	}
	stdDev = math.Sqrt(sumSquares / float64(len(intervals)))

	// Enforce a floor so a perfectly regular history doesn't make the
	// detector infinitely sensitive.
	if stdDev < d.minStdDev {
		stdDev = d.minStdDev
	}
	return mean, stdDev
}

// Phi calculates the phi value for a node.
func (d *PhiAccrualDetector) Phi(nodeID string) float64 {
	d.mu.RLock()
	defer d.mu.RUnlock()
	return d.phiLocked(nodeID)
}

// phiLocked must be called with d.mu held (read or write).
func (d *PhiAccrualDetector) phiLocked(nodeID string) float64 {
	state, ok := d.nodes[nodeID]
	if !ok {
		return 0 // Unknown node
	}

	timeSinceLastHB := time.Since(state.lastHeartbeat).Seconds() -
		d.acceptableHeartbeatPause.Seconds()

	mean := state.mean
	if mean == 0 {
		mean = d.firstHeartbeatEstimate.Seconds()
	}
	stdDev := state.stdDev
	if stdDev == 0 {
		stdDev = d.minStdDev
	}
	return d.phi(timeSinceLastHB, mean, stdDev)
}

// phi calculates phi using the normal-distribution CDF:
// P(X > timeSince) = 1 - CDF(timeSince).
func (d *PhiAccrualDetector) phi(timeSince, mean, stdDev float64) float64 {
	y := (timeSince - mean) / stdDev
	p := 1.0 - cdf(y)
	if p == 0 {
		return 16.0 // Cap phi once p underflows to zero
	}
	return -math.Log10(p)
}

// cdf is the cumulative distribution function of the standard normal.
func cdf(x float64) float64 {
	return 0.5 * math.Erfc(-x/math.Sqrt2)
}

// IsAlive returns true if phi is below the threshold.
func (d *PhiAccrualDetector) IsAlive(nodeID string) bool {
	return d.Phi(nodeID) < d.threshold
}

// GetAliveNodes returns the nodes whose phi is below the threshold.
func (d *PhiAccrualDetector) GetAliveNodes() []string {
	d.mu.RLock()
	defer d.mu.RUnlock()
	var alive []string
	for nodeID := range d.nodes {
		// Use the lock-free helper: calling Phi here would re-acquire the
		// read lock and can deadlock against a waiting writer.
		if d.phiLocked(nodeID) < d.threshold {
			alive = append(alive, nodeID)
		}
	}
	return alive
}

// GetSuspicionLevel returns a human-readable suspicion level.
func (d *PhiAccrualDetector) GetSuspicionLevel(nodeID string) string {
	switch phi := d.Phi(nodeID); {
	case phi < 1:
		return "healthy"
	case phi < 2:
		return "slightly suspicious"
	case phi < 4:
		return "suspicious"
	case phi < 8:
		return "very suspicious"
	default:
		return "probably dead"
	}
}
```

SWIM Protocol

SWIM stands for Scalable Weakly-consistent Infection-style process group Membership protocol.
Problems with direct heartbeats:

• O(n²) messages for n nodes
• A single point of failure in the monitor

SWIM approach:

1. Each node picks a random target
2. Send PING, wait for ACK
3. If no ACK arrives, ask k other nodes to ping the target
4. If there is still no ACK, mark the target suspect
5. Gossip membership changes

Message flow:

• A pings B and gets an ack: B is alive.
• A pings B and gets no response, so A sends ping-req(B) to C. C pings B, receives an ack, and relays ack(B) back to A: B is alive via C.
• A sends ping-req(B) to C, D, and E and nothing comes back: A marks B as SUSPECT.

Benefits:

• O(1) per-node overhead
• Tolerates network partitions
• Probabilistic completeness

SWIM Implementation

```go
package failuredetection

import (
	"context"
	"math/rand"
	"sync"
	"time"
)

// SWIMNode represents a node in the SWIM protocol.
type SWIMNode struct {
	mu sync.RWMutex

	// Local node info
	id      string
	address string

	// Membership list
	members map[string]*Member

	// Configuration
	pingInterval   time.Duration
	pingTimeout    time.Duration
	indirectPings  int // Number of indirect pings (k)
	suspectTimeout time.Duration

	// Communication
	transport Transport

	// Callbacks
	onJoin   func(memberID string)
	onLeave  func(memberID string)
	onUpdate func(memberID string, status MemberStatus)

	stopCh chan struct{}
}

type Member struct {
	ID          string
	Address     string
	Status      MemberStatus
	Incarnation uint64
	LastUpdate  time.Time
}

type MemberStatus int

const (
	MemberAlive MemberStatus = iota
	MemberSuspect
	MemberDead
)

type Transport interface {
	Send(address string, msg Message) error
	Listen(handler func(msg Message))
}

type Message struct {
	Type        MessageType
	FromID      string
	TargetID    string
	Incarnation uint64
	Members     []Member // Piggybacked gossip
}

type MessageType int

const (
	MsgPing MessageType = iota
	MsgAck
	MsgPingReq
	MsgAckIndirect
	MsgSuspect
	MsgAlive
	MsgDead
)

func NewSWIMNode(id, address string, transport Transport) *SWIMNode {
	return &SWIMNode{
		id:             id,
		address:        address,
		members:        make(map[string]*Member),
		pingInterval:   1 * time.Second,
		pingTimeout:    500 * time.Millisecond,
		indirectPings:  3,
		suspectTimeout: 5 * time.Second,
		transport:      transport,
		stopCh:         make(chan struct{}),
	}
}

// Start begins the SWIM protocol.
func (n *SWIMNode) Start(ctx context.Context) {
	// Listen for messages.
	n.transport.Listen(n.handleMessage)

	// Start the probe loop and the suspect-timeout checker.
	go n.protocolLoop(ctx)
	go n.suspectChecker(ctx)
}

func (n *SWIMNode) protocolLoop(ctx context.Context) {
	ticker := time.NewTicker(n.pingInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-n.stopCh:
			return
		case <-ticker.C:
			n.probe()
		}
	}
}

func (n *SWIMNode) probe() {
	n.mu.RLock()
	members := make([]*Member, 0, len(n.members))
	for _, m := range n.members {
		if m.Status != MemberDead && m.ID != n.id {
			members = append(members, m)
		}
	}
	n.mu.RUnlock()

	if len(members) == 0 {
		return
	}

	// Pick a random member to ping.
	target := members[rand.Intn(len(members))]

	msg := Message{Type: MsgPing, FromID: n.id}
	if err := n.sendWithTimeout(target.Address, msg); err == nil {
		return // Got an ACK; the node is alive.
	}

	// Direct ping failed; try indirect probing.
	n.indirectProbe(target, members)
}

func (n *SWIMNode) indirectProbe(target *Member, members []*Member) {
	// Candidate intermediaries: everyone except the target itself.
	candidates := make([]*Member, 0, len(members))
	for _, m := range members {
		if m.ID != target.ID {
			candidates = append(candidates, m)
		}
	}
	rand.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})

	// Ask up to k random intermediaries to ping the target.
	k := n.indirectPings
	if k > len(candidates) {
		k = len(candidates)
	}
	if k == 0 {
		n.markSuspect(target.ID)
		return
	}

	var wg sync.WaitGroup
	ackCh := make(chan struct{}, k)
	for _, intermediary := range candidates[:k] {
		wg.Add(1)
		go func(intermediary *Member) {
			defer wg.Done()
			msg := Message{
				Type:     MsgPingReq,
				FromID:   n.id,
				TargetID: target.ID,
			}
			if err := n.sendWithTimeout(intermediary.Address, msg); err == nil {
				ackCh <- struct{}{}
			}
		}(intermediary)
	}

	go func() {
		wg.Wait()
		close(ackCh)
	}()

	select {
	case _, ok := <-ackCh:
		if ok {
			return // Got an indirect ACK; the target is alive.
		}
		// Channel closed with no ACKs: every intermediary failed.
		n.markSuspect(target.ID)
	case <-time.After(n.pingTimeout * 2):
		// No responses in time; mark the target as suspect.
		n.markSuspect(target.ID)
	}
}

func (n *SWIMNode) markSuspect(nodeID string) {
	n.mu.Lock()
	defer n.mu.Unlock()

	member, ok := n.members[nodeID]
	if !ok || member.Status != MemberAlive {
		return
	}
	member.Status = MemberSuspect
	member.LastUpdate = time.Now()
	if n.onUpdate != nil {
		go n.onUpdate(nodeID, MemberSuspect)
	}
	// Spread the suspicion.
	go n.broadcast(Message{
		Type:        MsgSuspect,
		FromID:      n.id,
		TargetID:    nodeID,
		Incarnation: member.Incarnation,
	})
}

func (n *SWIMNode) markDead(nodeID string) {
	n.mu.Lock()
	defer n.mu.Unlock()

	member, ok := n.members[nodeID]
	if !ok || member.Status == MemberDead {
		return
	}
	member.Status = MemberDead
	member.LastUpdate = time.Now()
	if n.onLeave != nil {
		go n.onLeave(nodeID)
	}
	go n.broadcast(Message{
		Type:        MsgDead,
		FromID:      n.id,
		TargetID:    nodeID,
		Incarnation: member.Incarnation,
	})
}

func (n *SWIMNode) suspectChecker(ctx context.Context) {
	ticker := time.NewTicker(n.suspectTimeout / 2)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-n.stopCh:
			return
		case <-ticker.C:
			n.checkSuspects()
		}
	}
}

func (n *SWIMNode) checkSuspects() {
	n.mu.RLock()
	var suspects []string
	for id, m := range n.members {
		if m.Status == MemberSuspect && time.Since(m.LastUpdate) > n.suspectTimeout {
			suspects = append(suspects, id)
		}
	}
	n.mu.RUnlock()

	for _, id := range suspects {
		n.markDead(id)
	}
}

func (n *SWIMNode) handleMessage(msg Message) {
	switch msg.Type {
	case MsgPing:
		// Reply with an ACK.
		n.transport.Send(n.getMemberAddress(msg.FromID), Message{
			Type:   MsgAck,
			FromID: n.id,
		})
	case MsgPingReq:
		// Ping the target on behalf of the requester.
		go func() {
			targetAddr := n.getMemberAddress(msg.TargetID)
			pingMsg := Message{Type: MsgPing, FromID: n.id}
			if err := n.sendWithTimeout(targetAddr, pingMsg); err == nil {
				// Relay the ACK back to the requester.
				n.transport.Send(n.getMemberAddress(msg.FromID), Message{
					Type:     MsgAckIndirect,
					FromID:   n.id,
					TargetID: msg.TargetID,
				})
			}
		}()
	case MsgSuspect:
		n.handleSuspect(msg)
	case MsgAlive:
		n.handleAlive(msg)
	case MsgDead:
		n.handleDead(msg)
	}

	// Process piggybacked membership info.
	n.processMembershipInfo(msg.Members)
}

func (n *SWIMNode) handleSuspect(msg Message) {
	n.mu.Lock()
	defer n.mu.Unlock()

	// If we are the one being suspected, refute it.
	if msg.TargetID == n.id {
		// TODO: Increment incarnation and broadcast alive
		return
	}

	member, ok := n.members[msg.TargetID]
	if !ok {
		return
	}

	// Only apply the suspicion if its incarnation is current.
	if msg.Incarnation >= member.Incarnation && member.Status == MemberAlive {
		member.Status = MemberSuspect
		member.Incarnation = msg.Incarnation
		member.LastUpdate = time.Now()
		if n.onUpdate != nil {
			go n.onUpdate(msg.TargetID, MemberSuspect)
		}
	}
}

func (n *SWIMNode) handleAlive(msg Message) {
	n.mu.Lock()
	defer n.mu.Unlock()

	member, ok := n.members[msg.TargetID]
	if !ok {
		return
	}
	// An alive announcement only wins with a strictly newer incarnation.
	if msg.Incarnation > member.Incarnation {
		member.Status = MemberAlive
		member.Incarnation = msg.Incarnation
		member.LastUpdate = time.Now()
		if n.onUpdate != nil {
			go n.onUpdate(msg.TargetID, MemberAlive)
		}
	}
}

func (n *SWIMNode) handleDead(msg Message) {
	n.mu.Lock()
	defer n.mu.Unlock()

	member, ok := n.members[msg.TargetID]
	if !ok {
		return
	}
	if msg.Incarnation >= member.Incarnation && member.Status != MemberDead {
		member.Status = MemberDead
		member.Incarnation = msg.Incarnation
		member.LastUpdate = time.Now()
		if n.onLeave != nil {
			go n.onLeave(msg.TargetID)
		}
	}
}

func (n *SWIMNode) sendWithTimeout(address string, msg Message) error {
	// Simplified: a real implementation would wait for the matching ACK
	// with a deadline rather than rely on the transport alone.
	return n.transport.Send(address, msg)
}

func (n *SWIMNode) getMemberAddress(id string) string {
	n.mu.RLock()
	defer n.mu.RUnlock()
	if m, ok := n.members[id]; ok {
		return m.Address
	}
	return "" // Unknown member; callers must tolerate a failed send.
}

func (n *SWIMNode) broadcast(msg Message) {
	n.mu.RLock()
	defer n.mu.RUnlock()
	for _, m := range n.members {
		if m.ID != n.id && m.Status != MemberDead {
			go n.transport.Send(m.Address, msg)
		}
	}
}

func (n *SWIMNode) processMembershipInfo(members []Member) {
	// Apply piggybacked membership updates.
	for _, m := range members {
		n.updateMember(m)
	}
}

func (n *SWIMNode) updateMember(m Member) {
	n.mu.Lock()
	defer n.mu.Unlock()

	existing, ok := n.members[m.ID]
	if !ok {
		// New member
		member := m
		n.members[m.ID] = &member
		if n.onJoin != nil {
			go n.onJoin(m.ID)
		}
		return
	}
	// Update if the incarnation is newer.
	if m.Incarnation > existing.Incarnation {
		*existing = m
	}
}

// GetMembers returns the current membership list.
func (n *SWIMNode) GetMembers() []Member {
	n.mu.RLock()
	defer n.mu.RUnlock()
	members := make([]Member, 0, len(n.members))
	for _, m := range n.members {
		members = append(members, *m)
	}
	return members
}
```
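The incarnation-number precedence that handleSuspect, handleAlive, and handleDead implement can be distilled into a single predicate. This standalone sketch (names are ours) is useful for reasoning about the refutation TODO above: a suspected node bumps its incarnation so its alive announcement strictly outranks the suspicion.

```go
package main

import "fmt"

type status int

const (
	alive status = iota
	suspect
	dead
)

// overrides reports whether an incoming (newSt, newInc) update should replace
// local (curSt, curInc) state: Alive needs a strictly newer incarnation to
// refute Suspect, Suspect and Dead win on an equal incarnation, and Dead is
// terminal.
func overrides(newSt status, newInc uint64, curSt status, curInc uint64) bool {
	if curSt == dead {
		return false
	}
	switch newSt {
	case alive:
		return newInc > curInc
	case suspect:
		return newInc >= curInc && curSt == alive
	case dead:
		return newInc >= curInc
	}
	return false
}

func main() {
	fmt.Println(overrides(suspect, 1, alive, 1)) // true: suspicion spreads
	fmt.Println(overrides(alive, 1, suspect, 1)) // false: stale refutation
	fmt.Println(overrides(alive, 2, suspect, 1)) // true: incarnation was bumped
}
```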

Best Practices

1. TUNE FOR YOUR ENVIRONMENT
   • Local network: shorter timeouts (100-500ms)
   • Cross-datacenter: longer timeouts (1-5s)
   • Cloud: account for variable latency

2. USE MULTIPLE DETECTION MECHANISMS
   • Heartbeats plus application-level health checks
   • Combine push and pull methods
   • Cross-validate with peer opinions

3. HANDLE NETWORK PARTITIONS
   • Don't assume failure means death
   • Use suspicion before declaring a node dead
   • Allow nodes to refute suspicion

4. AVOID CASCADE FAILURES
   • Rate-limit failover actions
   • Don't react to all failures simultaneously
   • Add jitter to detection intervals

5. MONITOR DETECTION ACCURACY
   • Track the false-positive rate
   • Measure detection latency
   • Alert on detection anomalies

Summary

Approaches:

• Heartbeat: simple, effective, configurable
• Phi Accrual: adaptive, probabilistic
• SWIM: scalable, decentralized

Key trade-offs:

• Speed vs. accuracy
• False positives vs. false negatives
• Simplicity vs. sophistication

Best practices:

• Use suspicion states
• Allow refutation
• Tune for the environment
• Monitor accuracy

Key insight: "In distributed systems, we can't tell the difference between slow and dead. Design accordingly."

Tags: failure-detection, fault-tolerance, gossip-protocol, availability