Module 22: Failure Detection

Why Failure Detection Matters

In distributed systems, detecting failures quickly and accurately is essential for maintaining availability.
The Challenge:

Node A sends a message to Node B and gets no response. Is B dead? Or just slow? Or is the network at fault? From A's point of view, the following cases are indistinguishable:

• Crashed node
• Network partition
• Slow node (high load)
• Message loss

Trade-offs:

Fast detection (quick timeouts) pulls against accuracy (fewer false positives).

• False positives: declaring an alive node dead (unnecessary failovers, wasted resources)
• False negatives: failing to detect a dead node (requests sent to dead nodes, delayed recovery)

Heartbeat-Based Detection

Push-based (node sends heartbeats):

Each monitored node periodically sends a heartbeat to a monitor. When heartbeats stop arriving, the monitor declares the node dead after a timeout.

Pull-based (monitor checks nodes):

The monitor periodically sends a ping to each node and waits for a pong. When pongs stop arriving, the node is declared dead after a timeout.

Parameters:

• Heartbeat interval: how often to send heartbeats
• Timeout: when to consider a node dead
• Failure threshold: how many missed heartbeats before declaring death

Example: with interval = 1s and timeout = 5s, a node is marked dead after 5 missed heartbeats.

Basic Heartbeat Implementation

```go
package failuredetection

import (
	"context"
	"log"
	"sync"
	"time"
)

// NodeStatus represents node health status.
type NodeStatus int

const (
	StatusAlive NodeStatus = iota
	StatusSuspect
	StatusDead
)

// Node represents a monitored node.
type Node struct {
	ID           string
	Address      string
	Status       NodeStatus
	LastSeen     time.Time
	FailureCount int
}

// HeartbeatDetector implements heartbeat-based failure detection.
type HeartbeatDetector struct {
	mu                sync.RWMutex
	nodes             map[string]*Node
	heartbeatInterval time.Duration
	timeout           time.Duration
	failureThreshold  int
	onStatusChange    func(nodeID string, status NodeStatus)
	stopCh            chan struct{}
}

func NewHeartbeatDetector(
	heartbeatInterval, timeout time.Duration,
	failureThreshold int,
) *HeartbeatDetector {
	return &HeartbeatDetector{
		nodes:             make(map[string]*Node),
		heartbeatInterval: heartbeatInterval,
		timeout:           timeout,
		failureThreshold:  failureThreshold,
		stopCh:            make(chan struct{}),
	}
}

// Register adds a node to monitoring.
func (d *HeartbeatDetector) Register(nodeID, address string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.nodes[nodeID] = &Node{
		ID:       nodeID,
		Address:  address,
		Status:   StatusAlive,
		LastSeen: time.Now(),
	}
}

// Heartbeat records a heartbeat from a node.
func (d *HeartbeatDetector) Heartbeat(nodeID string) {
	d.mu.Lock()
	defer d.mu.Unlock()

	node, ok := d.nodes[nodeID]
	if !ok {
		return
	}

	node.LastSeen = time.Now()
	node.FailureCount = 0
	if node.Status != StatusAlive {
		node.Status = StatusAlive
		if d.onStatusChange != nil {
			go d.onStatusChange(nodeID, StatusAlive)
		}
	}
}

// Start begins the failure detection loop.
func (d *HeartbeatDetector) Start(ctx context.Context) {
	ticker := time.NewTicker(d.heartbeatInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-d.stopCh:
			return
		case <-ticker.C:
			d.checkNodes()
		}
	}
}

func (d *HeartbeatDetector) checkNodes() {
	d.mu.Lock()
	defer d.mu.Unlock()

	now := time.Now()
	for _, node := range d.nodes {
		elapsed := now.Sub(node.LastSeen)
		if elapsed <= d.timeout {
			continue
		}
		node.FailureCount++
		if node.FailureCount >= d.failureThreshold && node.Status != StatusDead {
			node.Status = StatusDead
			log.Printf("Node %s marked as DEAD", node.ID)
			if d.onStatusChange != nil {
				go d.onStatusChange(node.ID, StatusDead)
			}
		} else if node.Status == StatusAlive {
			node.Status = StatusSuspect
			log.Printf("Node %s is SUSPECT", node.ID)
			if d.onStatusChange != nil {
				go d.onStatusChange(node.ID, StatusSuspect)
			}
		}
	}
}

// GetStatus returns the status of a node.
func (d *HeartbeatDetector) GetStatus(nodeID string) NodeStatus {
	d.mu.RLock()
	defer d.mu.RUnlock()
	node, ok := d.nodes[nodeID]
	if !ok {
		return StatusDead
	}
	return node.Status
}

// GetAliveNodes returns all alive nodes.
func (d *HeartbeatDetector) GetAliveNodes() []string {
	d.mu.RLock()
	defer d.mu.RUnlock()
	var alive []string
	for _, node := range d.nodes {
		if node.Status == StatusAlive {
			alive = append(alive, node.ID)
		}
	}
	return alive
}

// OnStatusChange sets the callback for status changes. The lock keeps this
// safe even if it races with the detection loop, but it is simplest to set
// the callback before calling Start.
func (d *HeartbeatDetector) OnStatusChange(fn func(nodeID string, status NodeStatus)) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.onStatusChange = fn
}

// Stop stops the detector.
func (d *HeartbeatDetector) Stop() {
	close(d.stopCh)
}
```

Phi Accrual Failure Detector

A more sophisticated approach that provides a suspicion level rather than binary alive/dead.
Instead of a binary alive/dead answer, we get phi (φ), a suspicion level:

φ = -log10(P(mistake))

• φ = 1 → 10% chance we're wrong
• φ = 2 → 1% chance we're wrong
• φ = 3 → 0.1% chance we're wrong
• φ = 8 → 0.000001% chance we're wrong

How it works:

1. Track heartbeat arrival times
2. Model the inter-arrival times as a normal distribution
3. Calculate the probability of the current delay given that history
4. Convert the probability to a phi value

If the heartbeat interval varies (mean = 1s, stddev = 0.2s) and we haven't heard anything for 3 seconds, φ is high: very suspicious!

Benefits:

• Adapts to network conditions
• The application chooses its own threshold
• Used by Cassandra and Akka

Phi Accrual Implementation

```go
package failuredetection

import (
	"math"
	"sync"
	"time"
)

// PhiAccrualDetector implements the Phi Accrual failure detector.
type PhiAccrualDetector struct {
	mu sync.RWMutex

	// Per-node state
	nodes map[string]*PhiNodeState

	// Configuration
	threshold                float64       // Phi threshold for failure
	minStdDev                float64       // Minimum standard deviation (seconds)
	acceptableHeartbeatPause time.Duration // Grace period subtracted from the elapsed time
	firstHeartbeatEstimate   time.Duration
	maxSamples               int // Window of samples to keep
}

// PhiNodeState tracks state for a single node.
type PhiNodeState struct {
	intervals     []time.Duration // Heartbeat arrival intervals
	mean          float64
	stdDev        float64
	lastHeartbeat time.Time
}

func NewPhiAccrualDetector(threshold float64) *PhiAccrualDetector {
	return &PhiAccrualDetector{
		nodes:                    make(map[string]*PhiNodeState),
		threshold:                threshold,
		minStdDev:                (500 * time.Millisecond).Seconds(),
		acceptableHeartbeatPause: 0,
		firstHeartbeatEstimate:   1 * time.Second,
		maxSamples:               200,
	}
}

// Heartbeat records a heartbeat from a node.
func (d *PhiAccrualDetector) Heartbeat(nodeID string) {
	d.mu.Lock()
	defer d.mu.Unlock()

	now := time.Now()
	state, ok := d.nodes[nodeID]
	if !ok {
		d.nodes[nodeID] = &PhiNodeState{lastHeartbeat: now}
		return
	}

	// Record the interval since the last heartbeat.
	interval := now.Sub(state.lastHeartbeat)
	state.lastHeartbeat = now

	state.intervals = append(state.intervals, interval)
	if len(state.intervals) > d.maxSamples {
		state.intervals = state.intervals[1:]
	}

	// Recalculate statistics.
	state.mean, state.stdDev = d.calculateStats(state.intervals)
}

func (d *PhiAccrualDetector) calculateStats(intervals []time.Duration) (mean, stdDev float64) {
	if len(intervals) == 0 {
		return d.firstHeartbeatEstimate.Seconds(), d.minStdDev
	}

	// Mean
	var sum float64
	for _, interval := range intervals {
		sum += interval.Seconds()
	}
	mean = sum / float64(len(intervals))

	// Standard deviation
	var sumSquares float64
	for _, interval := range intervals {
		diff := interval.Seconds() - mean
		sumSquares += diff * diff
	}
	stdDev = math.Sqrt(sumSquares / float64(len(intervals)))

	// Enforce a floor so a perfectly regular history doesn't make the
	// detector infinitely sensitive.
	if stdDev < d.minStdDev {
		stdDev = d.minStdDev
	}
	return mean, stdDev
}

// Phi calculates the phi value for a node.
func (d *PhiAccrualDetector) Phi(nodeID string) float64 {
	d.mu.RLock()
	defer d.mu.RUnlock()
	return d.phiLocked(nodeID)
}

// phiLocked must be called with d.mu held (read or write).
func (d *PhiAccrualDetector) phiLocked(nodeID string) float64 {
	state, ok := d.nodes[nodeID]
	if !ok {
		return 0 // Unknown node
	}

	timeSinceLastHB := time.Since(state.lastHeartbeat).Seconds() -
		d.acceptableHeartbeatPause.Seconds()

	mean := state.mean
	if mean == 0 {
		mean = d.firstHeartbeatEstimate.Seconds()
	}
	stdDev := state.stdDev
	if stdDev == 0 {
		stdDev = d.minStdDev
	}
	return d.phi(timeSinceLastHB, mean, stdDev)
}

// phi calculates phi using the normal-distribution CDF:
// P(X > timeSince) = 1 - CDF(timeSince).
func (d *PhiAccrualDetector) phi(timeSince, mean, stdDev float64) float64 {
	y := (timeSince - mean) / stdDev
	p := 1.0 - cdf(y)
	if p == 0 {
		return 16.0 // Cap phi once p underflows to zero
	}
	return -math.Log10(p)
}

// cdf is the cumulative distribution function of the standard normal.
func cdf(x float64) float64 {
	return 0.5 * math.Erfc(-x/math.Sqrt2)
}

// IsAlive returns true if phi is below the threshold.
func (d *PhiAccrualDetector) IsAlive(nodeID string) bool {
	return d.Phi(nodeID) < d.threshold
}

// GetAliveNodes returns the nodes whose phi is below the threshold.
func (d *PhiAccrualDetector) GetAliveNodes() []string {
	d.mu.RLock()
	defer d.mu.RUnlock()
	var alive []string
	for nodeID := range d.nodes {
		// Use the lock-free helper: calling Phi here would re-acquire the
		// read lock and can deadlock against a waiting writer.
		if d.phiLocked(nodeID) < d.threshold {
			alive = append(alive, nodeID)
		}
	}
	return alive
}

// GetSuspicionLevel returns a human-readable suspicion level.
func (d *PhiAccrualDetector) GetSuspicionLevel(nodeID string) string {
	switch phi := d.Phi(nodeID); {
	case phi < 1:
		return "healthy"
	case phi < 2:
		return "slightly suspicious"
	case phi < 4:
		return "suspicious"
	case phi < 8:
		return "very suspicious"
	default:
		return "probably dead"
	}
}
```

SWIM Protocol

SWIM stands for Scalable Weakly-consistent Infection-style process group Membership protocol.
Problems with direct heartbeats:

• O(n²) messages for n nodes
• A single point of failure in the monitor

SWIM approach:

1. Each node picks a random target
2. Send PING, wait for ACK
3. If no ACK arrives, ask k other nodes to ping the target
4. If there is still no ACK, mark the target suspect
5. Gossip membership changes

Message flow:

• A pings B and gets an ack: B is alive.
• A pings B and gets no response, so A sends ping-req(B) to C. C pings B, receives an ack, and relays ack(B) back to A: B is alive via C.
• A sends ping-req(B) to C, D, and E and nothing comes back: A marks B as SUSPECT.

Benefits:

• O(1) per-node overhead
• Tolerates network partitions
• Probabilistic completeness

SWIM Implementation

```go
package failuredetection

import (
	"context"
	"math/rand"
	"sync"
	"time"
)

// SWIMNode represents a node in the SWIM protocol.
type SWIMNode struct {
	mu sync.RWMutex

	// Local node info
	id      string
	address string

	// Membership list
	members map[string]*Member

	// Configuration
	pingInterval   time.Duration
	pingTimeout    time.Duration
	indirectPings  int // Number of indirect pings (k)
	suspectTimeout time.Duration

	// Communication
	transport Transport

	// Callbacks
	onJoin   func(memberID string)
	onLeave  func(memberID string)
	onUpdate func(memberID string, status MemberStatus)

	stopCh chan struct{}
}

type Member struct {
	ID          string
	Address     string
	Status      MemberStatus
	Incarnation uint64
	LastUpdate  time.Time
}

type MemberStatus int

const (
	MemberAlive MemberStatus = iota
	MemberSuspect
	MemberDead
)

type Transport interface {
	Send(address string, msg Message) error
	Listen(handler func(msg Message))
}

type Message struct {
	Type        MessageType
	FromID      string
	TargetID    string
	Incarnation uint64
	Members     []Member // Piggybacked gossip
}

type MessageType int

const (
	MsgPing MessageType = iota
	MsgAck
	MsgPingReq
	MsgAckIndirect
	MsgSuspect
	MsgAlive
	MsgDead
)

func NewSWIMNode(id, address string, transport Transport) *SWIMNode {
	return &SWIMNode{
		id:             id,
		address:        address,
		members:        make(map[string]*Member),
		pingInterval:   1 * time.Second,
		pingTimeout:    500 * time.Millisecond,
		indirectPings:  3,
		suspectTimeout: 5 * time.Second,
		transport:      transport,
		stopCh:         make(chan struct{}),
	}
}

// Start begins the SWIM protocol.
func (n *SWIMNode) Start(ctx context.Context) {
	// Listen for messages.
	n.transport.Listen(n.handleMessage)

	// Start the probe loop and the suspect-timeout checker.
	go n.protocolLoop(ctx)
	go n.suspectChecker(ctx)
}

func (n *SWIMNode) protocolLoop(ctx context.Context) {
	ticker := time.NewTicker(n.pingInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-n.stopCh:
			return
		case <-ticker.C:
			n.probe()
		}
	}
}

func (n *SWIMNode) probe() {
	n.mu.RLock()
	members := make([]*Member, 0, len(n.members))
	for _, m := range n.members {
		if m.Status != MemberDead && m.ID != n.id {
			members = append(members, m)
		}
	}
	n.mu.RUnlock()

	if len(members) == 0 {
		return
	}

	// Pick a random member to ping.
	target := members[rand.Intn(len(members))]

	msg := Message{Type: MsgPing, FromID: n.id}
	if err := n.sendWithTimeout(target.Address, msg); err == nil {
		return // Got an ACK; the node is alive.
	}

	// Direct ping failed; try indirect probing.
	n.indirectProbe(target, members)
}

func (n *SWIMNode) indirectProbe(target *Member, members []*Member) {
	// Candidate intermediaries: everyone except the target itself.
	candidates := make([]*Member, 0, len(members))
	for _, m := range members {
		if m.ID != target.ID {
			candidates = append(candidates, m)
		}
	}
	rand.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})

	// Ask up to k random intermediaries to ping the target.
	k := n.indirectPings
	if k > len(candidates) {
		k = len(candidates)
	}
	if k == 0 {
		n.markSuspect(target.ID)
		return
	}

	var wg sync.WaitGroup
	ackCh := make(chan struct{}, k)
	for _, intermediary := range candidates[:k] {
		wg.Add(1)
		go func(intermediary *Member) {
			defer wg.Done()
			msg := Message{
				Type:     MsgPingReq,
				FromID:   n.id,
				TargetID: target.ID,
			}
			if err := n.sendWithTimeout(intermediary.Address, msg); err == nil {
				ackCh <- struct{}{}
			}
		}(intermediary)
	}

	go func() {
		wg.Wait()
		close(ackCh)
	}()

	select {
	case _, ok := <-ackCh:
		if ok {
			return // Got an indirect ACK; the target is alive.
		}
		// Channel closed with no ACKs: every intermediary failed.
		n.markSuspect(target.ID)
	case <-time.After(n.pingTimeout * 2):
		// No responses in time; mark the target as suspect.
		n.markSuspect(target.ID)
	}
}

func (n *SWIMNode) markSuspect(nodeID string) {
	n.mu.Lock()
	defer n.mu.Unlock()

	member, ok := n.members[nodeID]
	if !ok || member.Status != MemberAlive {
		return
	}
	member.Status = MemberSuspect
	member.LastUpdate = time.Now()
	if n.onUpdate != nil {
		go n.onUpdate(nodeID, MemberSuspect)
	}
	// Spread the suspicion.
	go n.broadcast(Message{
		Type:        MsgSuspect,
		FromID:      n.id,
		TargetID:    nodeID,
		Incarnation: member.Incarnation,
	})
}

func (n *SWIMNode) markDead(nodeID string) {
	n.mu.Lock()
	defer n.mu.Unlock()

	member, ok := n.members[nodeID]
	if !ok || member.Status == MemberDead {
		return
	}
	member.Status = MemberDead
	member.LastUpdate = time.Now()
	if n.onLeave != nil {
		go n.onLeave(nodeID)
	}
	go n.broadcast(Message{
		Type:        MsgDead,
		FromID:      n.id,
		TargetID:    nodeID,
		Incarnation: member.Incarnation,
	})
}

func (n *SWIMNode) suspectChecker(ctx context.Context) {
	ticker := time.NewTicker(n.suspectTimeout / 2)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-n.stopCh:
			return
		case <-ticker.C:
			n.checkSuspects()
		}
	}
}

func (n *SWIMNode) checkSuspects() {
	n.mu.RLock()
	var suspects []string
	for id, m := range n.members {
		if m.Status == MemberSuspect && time.Since(m.LastUpdate) > n.suspectTimeout {
			suspects = append(suspects, id)
		}
	}
	n.mu.RUnlock()

	for _, id := range suspects {
		n.markDead(id)
	}
}

func (n *SWIMNode) handleMessage(msg Message) {
	switch msg.Type {
	case MsgPing:
		// Reply with an ACK.
		n.transport.Send(n.getMemberAddress(msg.FromID), Message{
			Type:   MsgAck,
			FromID: n.id,
		})
	case MsgPingReq:
		// Ping the target on behalf of the requester.
		go func() {
			targetAddr := n.getMemberAddress(msg.TargetID)
			pingMsg := Message{Type: MsgPing, FromID: n.id}
			if err := n.sendWithTimeout(targetAddr, pingMsg); err == nil {
				// Relay the ACK back to the requester.
				n.transport.Send(n.getMemberAddress(msg.FromID), Message{
					Type:     MsgAckIndirect,
					FromID:   n.id,
					TargetID: msg.TargetID,
				})
			}
		}()
	case MsgSuspect:
		n.handleSuspect(msg)
	case MsgAlive:
		n.handleAlive(msg)
	case MsgDead:
		n.handleDead(msg)
	}

	// Process piggybacked membership info.
	n.processMembershipInfo(msg.Members)
}

func (n *SWIMNode) handleSuspect(msg Message) {
	n.mu.Lock()
	defer n.mu.Unlock()

	// If we are the one being suspected, refute it.
	if msg.TargetID == n.id {
		// TODO: Increment incarnation and broadcast alive
		return
	}

	member, ok := n.members[msg.TargetID]
	if !ok {
		return
	}

	// Only apply the suspicion if its incarnation is current.
	if msg.Incarnation >= member.Incarnation && member.Status == MemberAlive {
		member.Status = MemberSuspect
		member.Incarnation = msg.Incarnation
		member.LastUpdate = time.Now()
		if n.onUpdate != nil {
			go n.onUpdate(msg.TargetID, MemberSuspect)
		}
	}
}

func (n *SWIMNode) handleAlive(msg Message) {
	n.mu.Lock()
	defer n.mu.Unlock()

	member, ok := n.members[msg.TargetID]
	if !ok {
		return
	}
	// An alive announcement only wins with a strictly newer incarnation.
	if msg.Incarnation > member.Incarnation {
		member.Status = MemberAlive
		member.Incarnation = msg.Incarnation
		member.LastUpdate = time.Now()
		if n.onUpdate != nil {
			go n.onUpdate(msg.TargetID, MemberAlive)
		}
	}
}

func (n *SWIMNode) handleDead(msg Message) {
	n.mu.Lock()
	defer n.mu.Unlock()

	member, ok := n.members[msg.TargetID]
	if !ok {
		return
	}
	if msg.Incarnation >= member.Incarnation && member.Status != MemberDead {
		member.Status = MemberDead
		member.Incarnation = msg.Incarnation
		member.LastUpdate = time.Now()
		if n.onLeave != nil {
			go n.onLeave(msg.TargetID)
		}
	}
}

func (n *SWIMNode) sendWithTimeout(address string, msg Message) error {
	// Simplified: a real implementation would wait for the matching ACK
	// with a deadline rather than rely on the transport alone.
	return n.transport.Send(address, msg)
}

func (n *SWIMNode) getMemberAddress(id string) string {
	n.mu.RLock()
	defer n.mu.RUnlock()
	if m, ok := n.members[id]; ok {
		return m.Address
	}
	return "" // Unknown member; callers must tolerate a failed send.
}

func (n *SWIMNode) broadcast(msg Message) {
	n.mu.RLock()
	defer n.mu.RUnlock()
	for _, m := range n.members {
		if m.ID != n.id && m.Status != MemberDead {
			go n.transport.Send(m.Address, msg)
		}
	}
}

func (n *SWIMNode) processMembershipInfo(members []Member) {
	// Apply piggybacked membership updates.
	for _, m := range members {
		n.updateMember(m)
	}
}

func (n *SWIMNode) updateMember(m Member) {
	n.mu.Lock()
	defer n.mu.Unlock()

	existing, ok := n.members[m.ID]
	if !ok {
		// New member
		member := m
		n.members[m.ID] = &member
		if n.onJoin != nil {
			go n.onJoin(m.ID)
		}
		return
	}
	// Update if the incarnation is newer.
	if m.Incarnation > existing.Incarnation {
		*existing = m
	}
}

// GetMembers returns the current membership list.
func (n *SWIMNode) GetMembers() []Member {
	n.mu.RLock()
	defer n.mu.RUnlock()
	members := make([]Member, 0, len(n.members))
	for _, m := range n.members {
		members = append(members, *m)
	}
	return members
}
```
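The incarnation-number precedence that handleSuspect, handleAlive, and handleDead implement can be distilled into a single predicate. This standalone sketch (names are ours) is useful for reasoning about the refutation TODO above: a suspected node bumps its incarnation so its alive announcement strictly outranks the suspicion.

```go
package main

import "fmt"

type status int

const (
	alive status = iota
	suspect
	dead
)

// overrides reports whether an incoming (newSt, newInc) update should replace
// local (curSt, curInc) state: Alive needs a strictly newer incarnation to
// refute Suspect, Suspect and Dead win on an equal incarnation, and Dead is
// terminal.
func overrides(newSt status, newInc uint64, curSt status, curInc uint64) bool {
	if curSt == dead {
		return false
	}
	switch newSt {
	case alive:
		return newInc > curInc
	case suspect:
		return newInc >= curInc && curSt == alive
	case dead:
		return newInc >= curInc
	}
	return false
}

func main() {
	fmt.Println(overrides(suspect, 1, alive, 1)) // true: suspicion spreads
	fmt.Println(overrides(alive, 1, suspect, 1)) // false: stale refutation
	fmt.Println(overrides(alive, 2, suspect, 1)) // true: incarnation was bumped
}
```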

Best Practices

1. TUNE FOR YOUR ENVIRONMENT
   • Local network: shorter timeouts (100-500ms)
   • Cross-datacenter: longer timeouts (1-5s)
   • Cloud: account for variable latency

2. USE MULTIPLE DETECTION MECHANISMS
   • Heartbeats plus application-level health checks
   • Combine push and pull methods
   • Cross-validate with peer opinions

3. HANDLE NETWORK PARTITIONS
   • Don't assume failure means death
   • Use suspicion before declaring a node dead
   • Allow nodes to refute suspicion

4. AVOID CASCADE FAILURES
   • Rate-limit failover actions
   • Don't react to all failures simultaneously
   • Add jitter to detection intervals

5. MONITOR DETECTION ACCURACY
   • Track the false-positive rate
   • Measure detection latency
   • Alert on detection anomalies

Summary

Approaches:

• Heartbeat: simple, effective, configurable
• Phi Accrual: adaptive, probabilistic
• SWIM: scalable, decentralized

Key trade-offs:

• Speed vs. accuracy
• False positives vs. false negatives
• Simplicity vs. sophistication

Best practices:

• Use suspicion states
• Allow refutation
• Tune for the environment
• Monitor accuracy

Key insight: "In distributed systems, we can't tell the difference between slow and dead. Design accordingly."

Tags: failure-detection, fault-tolerance, gossip-protocol, availability