Part 36: Consistent Hashing - Distributing Data with Minimal Disruption

"When a warehouse adds a new aisle or removes an old one, you don't want to reorganize the entire inventory. Consistent hashing is the art of organizing data so that changes in capacity require moving only what's necessary—nothing more."

The Problem with Simple Hashing

Imagine you're distributing data across ten servers. The simplest approach is to hash each item's key and take the result modulo ten: server = hash(key) % 10. Item with hash 17 goes to server 7. Item with hash 42 goes to server 2. This works perfectly—until you add an eleventh server.
Now you compute server = hash(key) % 11. That item with hash 17? It now belongs to server 6, not server 7. The item with hash 42 goes to server 9, not server 2. In fact, when you change from N servers to N+1, the probability that an item stays on its current server is only 1/(N+1). For our ten-server cluster, adding one server means 91% of all items need to move.
This is catastrophic for a cache. If you have a billion cached items and add a server, you need to move 910 million items. During this migration, cache hit rates plummet. For a distributed database, the story is even worse: 910 million items need to be copied, potentially while serving requests.
Consistent hashing solves this problem. When you add a server to a consistent hashing ring, only about 1/N of the items need to move. When you remove a server, again only about 1/N of items are affected. Changes cause minimal disruption.
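The churn from modulo placement is easy to measure empirically. A small sketch (MD5 is used here only as a convenient uniform hash; the item names are made up):

```python
import hashlib

def bucket(key: str, num_servers: int) -> int:
    """Simple modulo placement: hash the key, take it mod N."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_servers

keys = [f"item-{i}" for i in range(100_000)]
moved = sum(bucket(k, 10) != bucket(k, 11) for k in keys)
print(f"{moved / len(keys):.1%} of items moved")  # close to 10/11, about 91%
```

Because 10 and 11 share no common factor, a key keeps its bucket only when its hash lands in a narrow residue range, which is why the moved fraction comes out near 10/11 rather than near zero.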

The Hash Ring

Picture a circle—a ring—representing the entire range of possible hash values. If your hash function produces 128-bit values, this ring represents all 2^128 possible values, arranged in a circle where 0 and the maximum value are adjacent.
Both servers and data items are placed on this ring. Each server is assigned a position by hashing its identifier—perhaps its hostname or IP address. Each data item is assigned a position by hashing its key.
To find which server owns a data item, start at the item's position on the ring and walk clockwise until you hit a server. That server is responsible for the item.
When a new server joins, it takes a position on the ring and becomes responsible for items between its position and the previous server's position. Only those items—roughly 1/N of the total if servers are evenly distributed—need to move to the new server.
When a server leaves, its items need to find new homes. They flow clockwise to the next server on the ring. Again, only the departing server's items—roughly 1/N of the total—are affected.
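The clockwise walk can be sketched directly, before any optimization. This toy version gives each server a single ring position and scans linearly; the server names are invented for illustration:

```python
import hashlib

def ring_pos(name: str) -> int:
    """A position on the ring: the hash interpreted as an integer."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

def owner(key: str, servers: list[str]) -> str:
    """First server clockwise from the key's position (linear scan)."""
    k = ring_pos(key)
    positions = sorted((ring_pos(s), s) for s in servers)
    for pos, server in positions:
        if pos >= k:
            return server
    return positions[0][1]  # wrapped past the top of the ring

servers = ["alpha", "beta", "gamma"]
keys = [f"item-{i}" for i in range(10_000)]
before = {k: owner(k, servers) for k in keys}
after = {k: owner(k, servers + ["delta"]) for k in keys}
moved = sum(before[k] != after[k] for k in keys)
print(f"{moved / len(keys):.1%} moved after adding a server")
```

The invariant to notice: every key that changes owner moves to the new server, never between existing servers. The moved fraction itself varies with where the new server happens to land, which is the unevenness that virtual nodes (next section) address.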

Virtual Nodes: Solving the Distribution Problem

The basic consistent hashing approach has a problem: if servers are assigned single positions on the ring, the distribution of items among servers might be uneven. By chance, some servers might get large arcs of the ring while others get small arcs. Worse, when a server leaves, all its items go to a single neighbor, potentially overloading it.
Virtual nodes solve this problem. Instead of each server having one position on the ring, each server has many positions—typically hundreds. These positions are determined by hashing variations of the server identifier: hash("server1-0"), hash("server1-1"), hash("server1-2"), and so on.
With virtual nodes, each server owns many small arcs scattered around the ring rather than one large arc. The law of large numbers kicks in: the total length of a server's arcs converges to 1/N of the ring, regardless of random variations.
When a server leaves, its virtual nodes are scattered around the ring, so its items are distributed among many other servers rather than all flowing to one neighbor. When a server joins, it takes items from many neighbors rather than heavily impacting just one.
Virtual nodes also enable weighted distribution. A more powerful server can be assigned more virtual nodes, giving it proportionally more of the ring and thus more items. A server with twice the capacity gets twice the virtual nodes and roughly twice the data.
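The balancing effect is measurable. This sketch computes the fraction of the ring each server owns, first with one position per server and then with hundreds of virtual nodes (the `server-i` naming scheme and MD5 are illustrative choices):

```python
import hashlib

def positions(server: str, vnodes: int) -> list[int]:
    """Ring positions for a server's virtual nodes."""
    return [int(hashlib.md5(f"{server}-{i}".encode()).hexdigest(), 16)
            for i in range(vnodes)]

def ownership(servers: list[str], vnodes: int) -> dict[str, float]:
    """Fraction of the ring each server owns."""
    ring_size = 2 ** 128  # MD5 output space
    ring = sorted((p, s) for s in servers for p in positions(s, vnodes))
    owned = {s: 0 for s in servers}
    # An item at position p belongs to the next vnode clockwise, so the
    # arc ending at each vnode belongs to that vnode's server.
    for (pos, _), (next_pos, next_server) in zip(ring, ring[1:] + ring[:1]):
        owned[next_server] += (next_pos - pos) % ring_size
    return {s: arc / ring_size for s, arc in owned.items()}

servers = [f"server{i}" for i in range(10)]
for v in (1, 200):
    fracs = ownership(servers, v)
    print(f"{v:>3} vnodes: min={min(fracs.values()):.3f} "
          f"max={max(fracs.values()):.3f}  (ideal 0.100)")
```

With a single position per server the min and max fractions scatter widely; with 200 virtual nodes they cluster tightly around the ideal 1/N.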

The Mechanics of Lookup

Finding the server responsible for a key requires two steps: hashing the key to get a ring position, then finding the first server position clockwise from that location.
Naively, this second step requires scanning through all server positions, which is O(N) or O(N×V) where V is the number of virtual nodes per server. For a large cluster with many virtual nodes, this is too slow.
The solution is to store server positions in a sorted data structure. A sorted array or balanced tree allows binary search. Given a key's ring position, you binary search for the smallest server position greater than or equal to the key position. If none exists (the key position is beyond all server positions), you wrap around to the smallest server position.
With binary search, lookup is O(log(N×V)), which is fast enough for most purposes. Some implementations use skip lists or other structures optimized for their specific access patterns.
Membership changes require updating this data structure. Adding a server means inserting its virtual node positions. Removing a server means deleting its positions. These operations also need to be efficient, especially if membership changes frequently.
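The sorted-array approach fits in a short class. This is a sketch, not a production implementation: the class name is invented, MD5 stands in for whatever hash the system uses, and `insort` makes additions O(N×V) per vnode, which real systems often avoid by rebuilding and re-sorting in bulk:

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes and binary-search lookup."""

    def __init__(self, vnodes: int = 100):
        self.vnodes = vnodes
        self._ring: list[tuple[int, str]] = []  # sorted (position, server)

    @staticmethod
    def _pos(name: str) -> int:
        return int(hashlib.md5(name.encode()).hexdigest(), 16)

    def add(self, server: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._pos(f"{server}-{i}"), server))

    def remove(self, server: str) -> None:
        self._ring = [(p, s) for p, s in self._ring if s != server]

    def lookup(self, key: str) -> str:
        # Smallest vnode position >= the key's position, wrapping past zero.
        i = bisect.bisect_left(self._ring, (self._pos(key), ""))
        if i == len(self._ring):
            i = 0
        return self._ring[i][1]
```

`lookup` is the binary search described above: find the first server position at or clockwise of the key, wrapping to the start of the array if the key hashes beyond every vnode.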

Replication and Consistent Hashing

Consistent hashing naturally supports replication. Instead of storing an item on just the first server clockwise from its position, you store it on the first K servers, where K is your replication factor.
When looking up an item for reading, you can contact any of the K replicas. When writing, you might require acknowledgment from some number of replicas before considering the write complete, following the quorum concepts we explored earlier.
The clockwise assignment ensures that replicas are different servers. But with virtual nodes, there's a subtlety: the next virtual node clockwise might belong to the same physical server, which wouldn't provide true redundancy. Implementations need to skip virtual nodes until they find ones belonging to distinct physical servers.
Some systems add awareness of failure domains to replica placement. Rather than just finding the next K distinct servers, they find servers in different racks, or different availability zones, or different data centers. This provides resilience against correlated failures.

Consistent Hashing in Practice: Dynamo and Its Descendants

Amazon's Dynamo paper popularized consistent hashing for distributed storage. Dynamo stores key-value pairs across a cluster of nodes, using consistent hashing to determine which nodes own which keys.
In Dynamo, the ring is divided into fixed partitions rather than using continuous positions. Each node owns some number of partitions. This discretization simplifies management—partitions are the unit of data transfer and ownership changes.
When a node joins, it takes ownership of some partitions from existing nodes. When a node leaves, its partitions are reassigned to other nodes. The mapping from partitions to nodes is maintained in a gossip-propagated membership table.
Cassandra, Riak, Voldemort, and many other distributed databases descended from Dynamo's design. They all use variations of consistent hashing for data distribution. The specifics vary—different virtual node schemes, different partition strategies, different replication placement—but the core idea remains: use consistent hashing to minimize data movement when cluster membership changes.

Beyond Key-Value: Consistent Hashing for Load Balancing

Consistent hashing isn't just for data storage. It's equally useful for load balancing, where you're distributing requests rather than data.
Consider a load balancer distributing requests to a pool of web servers. You could hash request properties—perhaps the session ID or user ID—and use consistent hashing to select a server. This provides session affinity: requests from the same user consistently go to the same server, enabling server-side session state without shared storage.
When a server fails, only sessions that were on that server need to find new homes. When a server is added, only a fraction of sessions migrate to it. This is much gentler than round-robin or random load balancing, where adding or removing servers affects the distribution of all sessions.
The same principle applies to caching layers. CDN nodes, Memcached clusters, and web caches use consistent hashing to determine which cache server handles which content. When cache servers come and go, the impact on cache hit rates is minimized.

Jump Hash: A Simpler Alternative

While consistent hashing with virtual nodes is flexible and widely used, it has overhead: maintaining the ring structure, managing virtual node mappings, and performing binary search for lookups.
Jump hash is an alternative that achieves consistent hashing with simpler mathematics. Given a key and a number of buckets, jump hash produces a bucket number. When the number of buckets increases by one, each key either stays in its current bucket or moves to the new bucket—never to any other bucket. The algorithm is remarkably simple and fast.
The limitation of jump hash is that it works with sequentially numbered buckets. You can't remove bucket 3 while keeping buckets 0, 1, 2, 4, and 5. This makes it less suitable for systems where arbitrary nodes can fail or be removed. But for systems where membership only grows by appending buckets or shrinks by dropping the highest-numbered bucket, jump hash is elegant and efficient.
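The algorithm, published by Lamping and Veach, really is just a few lines; this is a Python transliteration of their C++ routine:

```python
def jump_hash(key: int, num_buckets: int) -> int:
    """Jump consistent hash: maps a 64-bit key to a bucket in [0, num_buckets)."""
    MASK = (1 << 64) - 1
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step (constant from the paper)
        key = (key * 2862933555777941757 + 1) & MASK
        # Jump ahead to the next bucket this key would land in
        j = int((b + 1) * (1 << 31) / ((key >> 33) + 1))
    return b
```

The loop is driven entirely by the key, so the bucket sequence a key "jumps" through is fixed; growing `num_buckets` from N to N+1 either leaves the result unchanged or lets the loop take one more jump, landing the key in the new bucket N.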

Rendezvous Hashing: An Alternative Perspective

Rendezvous hashing, also called highest random weight hashing, achieves similar goals through different means. For each key, compute a score for every server based on hashing the key with the server identifier. Assign the key to the server with the highest score.
When a server is removed, keys that were assigned to it get reassigned based on which remaining server has the highest score—which is the server that had the second-highest score before. Keys assigned to other servers are unaffected because their highest-scoring server hasn't changed.
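A minimal sketch of the scoring scheme (the `server/key` concatenation and MD5 are arbitrary illustrative choices; any hash that mixes both inputs works):

```python
import hashlib

def rendezvous_owner(key: str, servers: list[str]) -> str:
    """Highest-random-weight: score every server for this key, pick the max."""
    def score(server: str) -> int:
        return int(hashlib.md5(f"{server}/{key}".encode()).hexdigest(), 16)
    return max(servers, key=score)
```

Removing a server from the list can only change the answer for keys whose maximum-scoring server was the one removed; every other key's argmax is untouched, which is exactly the minimal-movement property.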
Rendezvous hashing has some nice properties. It doesn't require a ring structure or virtual nodes. It handles weighted servers naturally by adjusting the scoring function. It provides perfect load distribution if the hash function is uniform.
The main downside is lookup cost. Computing scores for every server is O(N), compared to O(log N) for consistent hashing with a sorted structure. For small clusters, this is fine. For very large clusters, it becomes expensive.

The Philosophy of Minimal Movement

Consistent hashing embodies a design principle that applies broadly: minimize the impact of changes. When the system changes—nodes added, nodes removed, capacity adjusted—affect only what's necessary, not everything.
This principle has parallels elsewhere in systems design. In version control, a good merge affects only the changed code, not the entire codebase. In database schema changes, online schema migrations change only what's needed, not the whole table. In API design, versioning and backward compatibility minimize the impact of changes on clients.
The insight is that change is constant. Systems grow and shrink. Components fail and recover. Capacity is added and retired. A system designed assuming static membership will constantly struggle with membership changes. A system designed for change handles it gracefully.
Consistent hashing makes this graceful handling concrete and mathematical. The fraction of affected items is bounded by the fraction of changed capacity. Add 10% capacity, move about 10% of items. Remove a failed node representing 5% of capacity, move about 5% of items.
This predictability is valuable for capacity planning. You know how much data will move when you add a node. You know how much traffic will shift when a node fails. You can plan for these events rather than fearing them.

"Consistent hashing is not about hash functions or ring structures; it's about the principle that change should be proportional. When one thing changes, we should move one thing's worth of data—not reorganize the entire system."
Tags: consistent-hashing, partitioning, scalability