Module 8: Partitioning and Sharding

Why Partition Data?

When a dataset becomes too large for a single machine, we must split it across multiple nodes. This is called partitioning (or sharding).

┌─────────────────────────────────────────────────────────────────┐
│                  WHY PARTITION?                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Single Node Limits:                                            │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Storage: 10TB SSD max (practical limit)               │   │
│  │ • Memory: 1-2TB RAM max                                 │   │
│  │ • CPU: ~100K queries/sec                                │   │
│  │ • IOPS: ~100K/sec (NVMe)                               │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  With Partitioning:                                             │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ 100 nodes × 10TB = 1PB storage                          │   │
│  │ 100 nodes × 100K QPS = 10M queries/sec                  │   │
│  │ Each node handles 1/100th of data                       │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│                    ┌──────────────┐                             │
│                    │   Dataset    │                             │
│                    │   (1 PB)     │                             │
│                    └──────┬───────┘                             │
│                           │                                     │
│         ┌─────────────────┼─────────────────┐                   │
│         │                 │                 │                   │
│         ▼                 ▼                 ▼                   │
│    ┌─────────┐       ┌─────────┐       ┌─────────┐             │
│    │Partition│       │Partition│       │Partition│             │
│    │   1     │       │   2     │  ...  │   N     │             │
│    │(users   │       │(users   │       │(users   │             │
│    │ A-H)    │       │ I-P)    │       │ Q-Z)    │             │
│    └─────────┘       └─────────┘       └─────────┘             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Partitioning vs Replication

┌─────────────────────────────────────────────────────────────────┐
│          PARTITIONING vs REPLICATION                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  REPLICATION: Same data on multiple nodes                       │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Node A: [Users 1-1M]                                    │   │
│  │ Node B: [Users 1-1M]  (same data, different nodes)      │   │
│  │ Node C: [Users 1-1M]                                    │   │
│  │                                                          │   │
│  │ Purpose: Fault tolerance, read scalability              │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  PARTITIONING: Different data on different nodes                │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Node A: [Users 1-333K]                                  │   │
│  │ Node B: [Users 333K-666K]  (different data)             │   │
│  │ Node C: [Users 666K-1M]                                 │   │
│  │                                                          │   │
│  │ Purpose: Storage scalability, write scalability         │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  COMBINED (Typical production setup):                           │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Partition 1: [Users 1-333K]                             │   │
│  │   ├── Replica A (leader)                                │   │
│  │   ├── Replica B                                         │   │
│  │   └── Replica C                                         │   │
│  │                                                          │   │
│  │ Partition 2: [Users 333K-666K]                          │   │
│  │   ├── Replica D (leader)                                │   │
│  │   ├── Replica E                                         │   │
│  │   └── Replica F                                         │   │
│  │                                                          │   │
│  │ Each partition is replicated for fault tolerance        │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

1. Key-Range Partitioning

Assign contiguous ranges of keys to each partition.

┌─────────────────────────────────────────────────────────────────┐
│              KEY-RANGE PARTITIONING                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Like encyclopedia volumes: A-D, E-H, I-L, ...                  │
│                                                                 │
│  Example: User IDs                                              │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Partition 1: user_id 1 - 1,000,000                      │   │
│  │ Partition 2: user_id 1,000,001 - 2,000,000              │   │
│  │ Partition 3: user_id 2,000,001 - 3,000,000              │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Example: Timestamps                                            │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Partition 1: Jan 2024                                   │   │
│  │ Partition 2: Feb 2024                                   │   │
│  │ Partition 3: Mar 2024                                   │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  PROS:                                                          │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ ✓ Range queries are efficient                           │   │
│  │   "Get all users 1-1000" → single partition             │   │
│  │ ✓ Sorted data within partition                          │   │
│  │ ✓ Easy to understand                                    │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  CONS:                                                          │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ ✗ Hotspots: Sequential IDs hammer one partition         │   │
│  │   New users (highest IDs) → all to latest partition    │   │
│  │ ✗ Manual rebalancing needed                             │   │
│  │ ✗ Uneven distribution if key distribution is skewed     │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Used by: HBase, Bigtable, MongoDB (range sharding)            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Hotspot Problem

┌─────────────────────────────────────────────────────────────────┐
│              THE HOTSPOT PROBLEM                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Time-series data partitioned by timestamp:                     │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Partition 1 (Jan): ░░░░░░ (cold - rarely accessed)      │   │
│  │ Partition 2 (Feb): ░░░░░░ (cold)                        │   │
│  │ Partition 3 (Mar): ████████████████████ (HOT!)          │   │
│  │                    All writes go here!                   │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Solutions:                                                     │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ 1. Compound partition key: (sensor_id, timestamp)       │   │
│  │    Different sensors write to different partitions      │   │
│  │                                                          │   │
│  │ 2. Add random prefix: "5_2024-03-15" instead of date   │   │
│  │    Spreads writes across partitions                     │   │
│  │    But: Range queries now need to query all prefixes   │   │
│  │                                                          │   │
│  │ 3. Use hash partitioning (loses range queries)          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Celebrity Problem (similar):                                   │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Justin Bieber posts → millions of writes to one key     │   │
│  │ All followers on one partition → hotspot               │   │
│  │                                                          │   │
│  │ Solution: Shard popular users' data differently         │   │
│  │ "bieber_0", "bieber_1", ... "bieber_99"                │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

2. Hash Partitioning

Use a hash function to determine partition.

┌─────────────────────────────────────────────────────────────────┐
│                 HASH PARTITIONING                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  partition = hash(key) % num_partitions                         │
│                                                                 │
│  Example: 4 partitions                                          │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ hash("user:123") = 2847382 % 4 = 2 → Partition 2        │   │
│  │ hash("user:456") = 1293847 % 4 = 3 → Partition 3        │   │
│  │ hash("user:789") = 8473629 % 4 = 1 → Partition 1        │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Key Distribution (if hash is good):                            │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Partition 0: ████████████ (25%)                         │   │
│  │ Partition 1: ████████████ (25%)                         │   │
│  │ Partition 2: ████████████ (25%)                         │   │
│  │ Partition 3: ████████████ (25%)                         │   │
│  │                                                          │   │
│  │ Even distribution regardless of key patterns!            │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  PROS:                                                          │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ ✓ Even distribution (prevents hotspots)                 │   │
│  │ ✓ Any key pattern works                                 │   │
│  │ ✓ Simple to implement                                   │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  CONS:                                                          │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ ✗ Range queries impossible                              │   │
│  │   "Get users 1-1000" → must query ALL partitions       │   │
│  │ ✗ Adjacent keys on different partitions                 │   │
│  │ ✗ Rebalancing on partition count change is expensive    │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Used by: Cassandra, DynamoDB, Riak                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Hash Functions

┌─────────────────────────────────────────────────────────────────┐
│              CHOOSING A HASH FUNCTION                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Requirements:                                                  │
│  • Deterministic (same key → same partition always)            │
│  • Uniform distribution                                        │
│  • Fast to compute                                             │
│                                                                 │
│  DON'T USE cryptographic hashes (too slow):                    │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ ✗ MD5, SHA-1, SHA-256                                   │   │
│  │   Designed for security, not speed                      │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  DO USE fast non-cryptographic hashes:                          │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ ✓ MurmurHash3 (used by Cassandra)                       │   │
│  │ ✓ xxHash (extremely fast)                               │   │
│  │ ✓ CityHash (Google)                                     │   │
│  │ ✓ FNV-1a (simple, decent distribution)                  │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Performance comparison (hashing 1M keys):                      │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ xxHash:     ~50ms                                       │   │
│  │ MurmurHash: ~80ms                                       │   │
│  │ MD5:        ~400ms                                      │   │
│  │ SHA-256:    ~600ms                                      │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

3. Consistent Hashing

Solves the rebalancing problem when adding/removing nodes.

┌─────────────────────────────────────────────────────────────────┐
│                 CONSISTENT HASHING                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Problem with simple hash:                                      │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ hash(key) % 3 → Partition 0, 1, or 2                    │   │
│  │ hash(key) % 4 → DIFFERENT partition assignments!        │   │
│  │                                                          │   │
│  │ Adding 1 node moves ~75% of keys!                       │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Consistent Hashing: Hash ring                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                                                          │   │
│  │                    0° (= 360°)                           │   │
│  │                        │                                 │   │
│  │               ┌────────┼────────┐                        │   │
│  │              Node A    │      Node B                     │   │
│  │            (45°)       │       (120°)                    │   │
│  │                 \      │      /                          │   │
│  │                  \     │     /                           │   │
│  │          270°─────●────●────●──────90°                   │   │
│  │                  /     │     \                           │   │
│  │                 /      │      \                          │   │
│  │            Node D      │      Node C                     │   │
│  │            (315°)      │      (200°)                     │   │
│  │               └────────┼────────┘                        │   │
│  │                        │                                 │   │
│  │                      180°                                │   │
│  │                                                          │   │
│  │  Key assignment: Walk clockwise to find first node       │   │
│  │  Key at 50° → Node B (120°)                             │   │
│  │  Key at 300° → Node D (315°)                            │   │
│  │                                                          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Adding Node E at 180°:                                         │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Only keys between 120° and 180° move to E               │   │
│  │ ~1/N of keys move (not all keys!)                       │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Virtual Nodes (Vnodes)

┌─────────────────────────────────────────────────────────────────┐
│                  VIRTUAL NODES                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Problem: With few physical nodes, distribution is uneven       │
│                                                                 │
│  4 nodes on ring:                                               │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Node A: 45° - 120° (75°, 20.8% of ring)                 │   │
│  │ Node B: 120° - 200° (80°, 22.2%)                        │   │
│  │ Node C: 200° - 315° (115°, 31.9%)   ← Uneven!          │   │
│  │ Node D: 315° - 45° (90°, 25.0%)                         │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Solution: Each physical node has multiple virtual nodes        │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Physical Node A → Virtual: A1, A2, A3, ..., A256        │   │
│  │ Physical Node B → Virtual: B1, B2, B3, ..., B256        │   │
│  │                                                          │   │
│  │ 4 physical × 256 virtual = 1024 points on ring          │   │
│  │                                                          │   │
│  │ Distribution becomes nearly uniform:                     │   │
│  │ Node A: ~25% ± 1%                                       │   │
│  │ Node B: ~25% ± 1%                                       │   │
│  │ Node C: ~25% ± 1%                                       │   │
│  │ Node D: ~25% ± 1%                                       │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Additional benefits:                                           │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Heterogeneous nodes: More vnodes for powerful nodes   │   │
│  │ • Smoother rebalancing: Move vnodes one at a time       │   │
│  │ • Better fault handling: Failed node's vnodes spread out│   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Used by: Cassandra, DynamoDB, Riak                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Consistent Hashing Implementation

go
package consistenthash

import (
    "hash/crc32"
    "sort"
    "strconv"
    "sync"
)

type ConsistentHash struct {
    mu       sync.RWMutex
    ring     []uint32          // Sorted hash values
    nodes    map[uint32]string // Hash → node mapping
    vnodes   int               // Virtual nodes per physical node
}

func New(vnodes int) *ConsistentHash {
    return &ConsistentHash{
        ring:   []uint32{},
        nodes:  make(map[uint32]string),
        vnodes: vnodes,
    }
}

func (ch *ConsistentHash) AddNode(node string) {
    ch.mu.Lock()
    defer ch.mu.Unlock()

    for i := 0; i < ch.vnodes; i++ {
        // Create virtual node identifier
        vnode := node + "_" + strconv.Itoa(i)
        hash := ch.hash(vnode)
        ch.ring = append(ch.ring, hash)
        ch.nodes[hash] = node
    }

    // Keep ring sorted
    sort.Slice(ch.ring, func(i, j int) bool {
        return ch.ring[i] < ch.ring[j]
    })
}

func (ch *ConsistentHash) RemoveNode(node string) {
    ch.mu.Lock()
    defer ch.mu.Unlock()

    for i := 0; i < ch.vnodes; i++ {
        vnode := node + "_" + strconv.Itoa(i)
        hash := ch.hash(vnode)
        delete(ch.nodes, hash)

        // Remove from ring
        for j, h := range ch.ring {
            if h == hash {
                ch.ring = append(ch.ring[:j], ch.ring[j+1:]...)
                break
            }
        }
    }
}

func (ch *ConsistentHash) GetNode(key string) string {
    ch.mu.RLock()
    defer ch.mu.RUnlock()

    if len(ch.ring) == 0 {
        return ""
    }

    hash := ch.hash(key)

    // Binary search for first node with hash >= key hash
    idx := sort.Search(len(ch.ring), func(i int) bool {
        return ch.ring[i] >= hash
    })

    // Wrap around
    if idx == len(ch.ring) {
        idx = 0
    }

    return ch.nodes[ch.ring[idx]]
}

func (ch *ConsistentHash) hash(key string) uint32 {
    return crc32.ChecksumIEEE([]byte(key))
}

4. Compound Partition Keys

Best of both worlds: hash for distribution, range for queries.

┌─────────────────────────────────────────────────────────────────┐
│              COMPOUND PARTITION KEYS                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Primary key: (partition_key, clustering_key)                   │
│                                                                 │
│  Example: Time-series sensor data                               │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Partition key: sensor_id (hashed → distributed)         │   │
│  │ Clustering key: timestamp (sorted within partition)     │   │
│  │                                                          │   │
│  │ PRIMARY KEY ((sensor_id), timestamp)                    │   │
│  │              └── hashed   └── sorted                    │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Data distribution:                                             │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Partition 0: sensor_1 data (sorted by timestamp)        │   │
│  │              [10:00, value1]                             │   │
│  │              [10:01, value2]                             │   │
│  │              [10:02, value3]                             │   │
│  │                                                          │   │
│  │ Partition 1: sensor_47 data (sorted by timestamp)       │   │
│  │              [10:00, value4]                             │   │
│  │              [10:01, value5]                             │   │
│  │                                                          │   │
│  │ Each sensor's data is together and sorted!              │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Efficient queries:                                             │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ ✓ Get sensor_1 data from 10:00 to 11:00                 │   │
│  │   → Single partition, range scan                        │   │
│  │                                                          │   │
│  │ ✓ Get latest 100 readings from sensor_1                 │   │
│  │   → Single partition, sorted access                     │   │
│  │                                                          │   │
│  │ ✗ Get all sensor data from 10:00 to 11:00               │   │
│  │   → Must query ALL partitions (scatter-gather)          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Used by: Cassandra, DynamoDB, ScyllaDB                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

5. Rebalancing Strategies

When you add or remove nodes, data must be redistributed.

┌─────────────────────────────────────────────────────────────────┐
│              REBALANCING STRATEGIES                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. FIXED PARTITIONS                                            │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Create more partitions than nodes upfront               │   │
│  │                                                          │   │
│  │ 1000 partitions, 10 nodes → 100 partitions/node         │   │
│  │ Add node 11 → move ~90 partitions (9/node)              │   │
│  │                                                          │   │
│  │ Pros: Simple, predictable                               │   │
│  │ Cons: Must choose partition count upfront               │   │
│  │       Too few → large partitions                        │   │
│  │       Too many → metadata overhead                      │   │
│  │                                                          │   │
│  │ Used by: Elasticsearch, Riak, Couchbase                 │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  2. DYNAMIC PARTITIONS                                          │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Split partitions when they get too large                │   │
│  │ Merge partitions when they get too small                │   │
│  │                                                          │   │
│  │ Partition A grows to 10GB → split into A1, A2           │   │
│  │ Partitions B, C shrink → merge into BC                  │   │
│  │                                                          │   │
│  │ Pros: Adapts to data distribution                       │   │
│  │ Cons: More complex, split during write spike            │   │
│  │                                                          │   │
│  │ Used by: HBase, MongoDB, TiDB                          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  3. PROPORTIONAL TO NODES                                       │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Fixed number of partitions per node                     │   │
│  │ Adding node creates new partitions                      │   │
│  │                                                          │   │
│  │ 256 partitions/node:                                    │   │
│  │ 4 nodes = 1024 partitions                               │   │
│  │ 5 nodes = 1280 partitions                               │   │
│  │                                                          │   │
│  │ Pros: Scales naturally                                  │   │
│  │ Cons: Partition boundaries change on add/remove         │   │
│  │                                                          │   │
│  │ Used by: Cassandra                                      │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Rebalancing Best Practices

┌─────────────────────────────────────────────────────────────────┐
│              REBALANCING BEST PRACTICES                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. AUTOMATIC vs MANUAL                                         │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Automatic: System decides when to rebalance             │   │
│  │   Risk: May rebalance at worst time (peak load)        │   │
│  │   Risk: Node flapping triggers repeated rebalance       │   │
│  │                                                          │   │
│  │ Manual: Human initiates rebalance                       │   │
│  │   Safer: Choose timing (low traffic)                    │   │
│  │   But: Requires human attention                         │   │
│  │                                                          │   │
│  │ Recommendation: Manual trigger, automatic execution     │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  2. THROTTLING                                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Rebalancing moves lots of data → network/disk load      │   │
│  │                                                          │   │
│  │ Throttle to avoid impacting production traffic:         │   │
│  │ • Limit bandwidth (e.g., 50 MB/s per node)             │   │
│  │ • Limit concurrent partition moves                      │   │
│  │ • Pause during peak hours                               │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  3. GRADUAL MIGRATION                                           │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Don't move all data at once:                            │   │
│  │                                                          │   │
│  │ 1. Copy partition to new node                           │   │
│  │ 2. Dual-write to both during transition                 │   │
│  │ 3. Switch reads to new node                             │   │
│  │ 4. Stop writes to old node                              │   │
│  │ 5. Delete from old node                                 │   │
│  │                                                          │   │
│  │ Allows rollback if issues arise                         │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

6. Request Routing

How does a client know which partition holds its data?

┌─────────────────────────────────────────────────────────────────┐
│                 REQUEST ROUTING                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Approach 1: ANY NODE (Gossip-based)                            │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                                                          │   │
│  │  Client ──► Node A ──► Node C (has data)                │   │
│  │                 │                                        │   │
│  │                 └── Forwards if not responsible          │   │
│  │                                                          │   │
│  │  All nodes know the full routing table                   │   │
│  │  Client can connect to any node                          │   │
│  │                                                          │   │
│  │  Used by: Cassandra, Riak                               │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Approach 2: ROUTING TIER                                       │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                                                          │   │
│  │  Client ──► Router ──► Correct Node                     │   │
│  │                │                                         │   │
│  │                └── Knows which partition is where        │   │
│  │                                                          │   │
│  │  Dedicated routing layer (proxy)                        │   │
│  │  Client doesn't need routing logic                      │   │
│  │                                                          │   │
│  │  Used by: MongoDB (mongos), Redis Cluster (proxy mode)  │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Approach 3: CLIENT-SIDE (Smart client)                         │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                                                          │   │
│  │  Client ──────────────► Correct Node                    │   │
│  │    │                                                     │   │
│  │    └── Has routing table, connects directly              │   │
│  │                                                          │   │
│  │  Client maintains partition-to-node mapping             │   │
│  │  Lowest latency (no proxy hop)                          │   │
│  │                                                          │   │
│  │  Used by: Cassandra (DataStax driver), Redis Cluster    │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Routing table maintenance:                                     │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • ZooKeeper/etcd: Centralized metadata (HBase, Kafka)   │   │
│  │ • Gossip protocol: Nodes share with each other          │   │
│  │ • Config server: Dedicated metadata nodes (MongoDB)     │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Interview Questions

Conceptual Questions

What's the difference between partitioning and replication?
- Partitioning: Different data on different nodes (scalability)
- Replication: Same data on multiple nodes (fault tolerance)
- Usually combined: Partitioned data is replicated
Compare key-range vs hash partitioning.
- Key-range: Sorted, supports range queries, but hotspots possible
- Hash: Even distribution, no hotspots, but no range queries
- Compound keys: Hash for distribution, sort within partition
What is consistent hashing? Why use virtual nodes?
- Consistent hashing: Only ~1/N keys move when adding node
- Virtual nodes: Even distribution, smoother rebalancing
- Each physical node has many positions on ring
How do you handle the celebrity problem (hot partition)?
- Shard hot keys with random suffix
- Application-level caching
- Denormalize data
- Custom partitioning for hot keys

System Design Questions

Design a partition strategy for an e-commerce catalog.
- Partition by product_category (range for browsing)
- Or hash by product_id (even distribution)
- Secondary index for search by name
- Consider: Read/write ratio, query patterns
How would you migrate from 10 to 15 nodes with minimal downtime?
- Use consistent hashing (minimal data movement)
- Add nodes one at a time
- Throttle rebalancing to avoid overload
- Dual-write during transition
- Rollback plan for each node

Summary

┌─────────────────────────────────────────────────────────────────┐
│              MODULE 8 KEY TAKEAWAYS                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Partitioning enables horizontal scaling                     │
│     • Beyond single-node limits                                │
│     • Combined with replication for fault tolerance            │
│                                                                 │
│  2. Key-range partitioning                                      │
│     • Supports range queries                                   │
│     • Prone to hotspots (sequential writes)                    │
│                                                                 │
│  3. Hash partitioning                                           │
│     • Even distribution                                        │
│     • No range queries                                         │
│     • Use consistent hashing for stable rebalancing            │
│                                                                 │
│  4. Compound keys: Best of both worlds                          │
│     • Hash partition key for distribution                      │
│     • Clustering key for sorting within partition              │
│                                                                 │
│  5. Rebalancing must be handled carefully                       │
│     • Throttle to avoid impacting production                   │
│     • Manual trigger often safer than automatic                │
│                                                                 │
│  6. Routing: Client needs to find correct partition             │
│     • Gossip, routing tier, or smart client                    │
│     • Centralized metadata (ZK) or distributed (gossip)        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Next Module: Consensus Algorithms - Paxos, Raft, and how distributed systems agree.