Part 33: Distributed Locks - Coordination Across Boundaries

"In a world where multiple processes compete for shared resources across network boundaries, the humble lock—so simple in single-machine programming—becomes a profound challenge requiring careful thought about failures, time, and the very nature of agreement."

The Need for Distributed Locking

Consider a seemingly simple problem: you have multiple instances of an application that periodically need to perform some critical task—perhaps processing accumulated data, rotating logs, or sending a scheduled batch of notifications. This task must execute exactly once per interval, not multiple times and not zero times.
On a single machine, the solution is straightforward: use a mutex or a file lock. The operating system guarantees that only one process can hold the lock at any time. But when your application runs across multiple machines, connected only by a network, there is no operating system to coordinate for you. Each machine has its own locks, its own clocks, its own view of reality.
This is where distributed locks come in. A distributed lock is a mechanism that allows multiple processes across multiple machines to coordinate so that only one of them holds the lock at any given time. It sounds simple, but as we'll see, achieving this correctly in the face of failures is surprisingly subtle.

The Fundamental Challenges

Distributed locks must contend with challenges that don't exist in single-machine locking.
Network failures are the most obvious challenge. The connection between a client and the lock service might fail at any point. The client might acquire a lock, do some work, and then find itself unable to release the lock because the network has partitioned. Meanwhile, other clients might be waiting for a lock that will never be released.
Process failures compound the problem. A client might acquire a lock and then crash before releasing it. Without some mechanism to recover, the lock would be held forever, blocking all other clients. The solution—automatic lock expiration—introduces its own complications that we'll explore shortly.
Clock skew affects any time-based mechanism. If locks expire based on time, and different machines have different notions of what time it is, a lock might expire from the perspective of the lock service while the client still believes it holds the lock. The client continues executing under the assumption it has exclusive access, but another client has already acquired the lock.
The lack of shared memory means there's no single source of truth that all processes can directly observe. Every piece of information must be communicated over the network, with all the delays and failures that entails. When you check whether you hold a lock, the answer you receive describes a past state, not necessarily the present.

The Redis Lock: A Simple Approach

The simplest approach to distributed locking uses a single Redis instance. To acquire a lock, you set a key with a unique value (identifying your client) and an expiration time. You set it only if it doesn't already exist. To release the lock, you delete the key, but only if it still holds your value.
This approach is easy to understand and implement. The expiration time handles crashed clients—their locks eventually expire, allowing others to proceed. The unique value ensures you only release your own lock, not someone else's.
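The acquire/release logic above can be sketched in a few lines. The sketch below is shown in Python against a small in-memory stand-in (`FakeRedis` is an illustration, not a real client) so it runs without a server; against real Redis, acquisition is a single `SET key value NX PX ttl` command and release is a compare-and-delete Lua script executed atomically:

```python
import time
import uuid

class FakeRedis:
    """In-memory stand-in for a single Redis instance (illustration only)."""
    def __init__(self):
        self.store = {}  # key -> (value, expiry timestamp)

    def set_nx_px(self, key, value, ttl_ms):
        # Models SET key value NX PX ttl: succeeds only if the key is absent or expired.
        entry = self.store.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return False
        self.store[key] = (value, time.monotonic() + ttl_ms / 1000.0)
        return True

    def del_if_value(self, key, value):
        # Models the compare-and-delete Lua script, which real Redis runs atomically.
        entry = self.store.get(key)
        if entry is not None and entry[0] == value and entry[1] > time.monotonic():
            del self.store[key]
            return True
        return False

def acquire(client, key, ttl_ms):
    token = str(uuid.uuid4())  # unique value identifying this acquisition
    return token if client.set_nx_px(key, token, ttl_ms) else None

def release(client, key, token):
    # Delete only if the key still holds our value, never someone else's lock.
    return client.del_if_value(key, token)
```

The unique token is what makes release safe: without it, a client whose lock had already expired could delete a lock now held by someone else.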
But this simple approach has significant limitations. Redis replication is asynchronous, so if the master fails right after you acquire a lock, the failover might lose your lock. Another client could then acquire the same lock, violating mutual exclusion. The single Redis instance is a single point of failure; if it's unavailable, no one can acquire or release locks.
For many use cases, these limitations are acceptable. If you're coordinating batch jobs where occasional double-execution is tolerable, a simple Redis lock might be fine. But for scenarios requiring strong mutual exclusion—financial transactions, inventory updates, or anything where correctness is critical—we need something more robust.

Redlock: An Attempt at Robustness

Martin Kleppmann's famous critique notwithstanding, the Redlock algorithm is an instructive case study in distributed lock design. Proposed by the creator of Redis, Redlock attempts to provide stronger guarantees by using multiple independent Redis instances.
To acquire a Redlock, the client tries to set the lock on every instance (typically five) in sequence. It measures the total time these attempts took and considers itself to hold the lock only if it succeeded on a majority of instances and the elapsed time was less than the lock's validity time. To release, it removes the lock from all instances.
The intuition is that failures of individual instances don't compromise the lock because a majority is required. Even if some instances fail or their responses are lost, as long as a majority agrees, the lock is valid.
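The acquisition step can be sketched as follows. This is a simplified illustration: `try_acquire` stands in for a per-instance SET NX PX call, and the published algorithm's retry backoff and clock-drift allowance are omitted:

```python
import time
import uuid

def redlock_acquire(clients, key, ttl_ms, try_acquire):
    """Attempt the lock on every instance; consider it held only if a
    majority succeeded and the attempts fit within the lock's lifetime."""
    token = str(uuid.uuid4())
    start = time.monotonic()
    acquired = sum(1 for c in clients if try_acquire(c, key, token, ttl_ms))
    elapsed_ms = (time.monotonic() - start) * 1000
    validity_ms = ttl_ms - elapsed_ms  # time left before the first instance expires
    if acquired >= len(clients) // 2 + 1 and validity_ms > 0:
        return token, validity_ms
    # Failed: the real algorithm now releases the lock on all instances.
    return None, 0
```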
However, this algorithm has been criticized for not providing the guarantees it claims under certain failure scenarios. Clock jumps, garbage collection pauses, or network delays can create situations where a client believes it holds a lock when it doesn't, or where two clients both believe they hold the lock. These criticisms highlight a fundamental issue: building reliable distributed locks requires either synchronized clocks (which can't be guaranteed) or consensus protocols (which are more expensive).

Fencing Tokens: Protecting Against Stale Locks

One technique that strengthens any distributed lock implementation is fencing tokens. The insight is that even if our lock mechanism occasionally allows two clients to think they hold the lock, we can prevent actual damage by requiring clients to prove their lock is current when they access the protected resource.
When a lock is acquired, the lock service issues a fencing token—a monotonically increasing number. The client includes this token in every request to the protected resource. The resource rejects any request with a token lower than one it has already seen.
Suppose Client A acquires a lock with token 42, then experiences a long garbage collection pause. The lock expires, and Client B acquires it with token 43. Client A wakes up, still believing it holds the lock, and tries to write to the resource. But the resource has already seen token 43 from Client B, so it rejects Client A's request with token 42.
Fencing tokens don't prevent lock conflicts—they prevent lock conflicts from causing incorrect outcomes. They require support from the protected resource, which must track and enforce tokens. Not all resources support this pattern, but when they do, it provides a crucial safety layer.
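The resource-side check from the Client A/B scenario above can be sketched in a few lines (hypothetical names; the essential part is only the monotonic comparison):

```python
class FencedResource:
    """Sketch of a resource that enforces fencing tokens: it rejects any
    write carrying a token lower than the highest one already seen."""
    def __init__(self):
        self.highest_token = -1
        self.data = None

    def write(self, token, value):
        if token < self.highest_token:
            return False  # stale lock holder: reject the write
        self.highest_token = token
        self.data = value
        return True
```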

ZooKeeper: Consensus-Based Coordination

For applications requiring strong guarantees, ZooKeeper provides distributed coordination based on consensus. ZooKeeper maintains a hierarchical namespace of nodes, similar to a file system, and uses the ZAB consensus protocol to replicate this namespace across servers.
Locks in ZooKeeper leverage several features: ephemeral nodes, sequential nodes, and watches. An ephemeral node exists only as long as the session that created it is active. If the client disconnects or crashes, the ephemeral node is automatically deleted. Sequential nodes have a unique sequence number appended to their name, provided by ZooKeeper.
To acquire a lock, a client creates an ephemeral sequential node under a designated lock path—something like "/locks/my-resource/lock-0000000042". The client then lists all children of the lock path and checks if its node has the lowest sequence number. If so, it holds the lock. If not, it watches the next-lowest node for deletion, then rechecks.
When the client finishes, it deletes its node, releasing the lock. If the client crashes, its session eventually expires, and ZooKeeper automatically deletes the ephemeral node, releasing the lock.
This approach provides strong guarantees because ZooKeeper's consensus protocol ensures that all clients see a consistent view of the namespace. The "herd effect" is avoided by having each client watch only the specific node ahead of it, rather than watching the lock path for any changes.
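The decision step of this recipe (am I the lowest, and if not, whom do I watch?) can be sketched as pure logic. Node names follow ZooKeeper's sequential-node convention; the actual create, list, and watch calls are omitted:

```python
def lock_status(children, my_node):
    """Given the children of the lock path and our own node name, decide
    whether we hold the lock or which predecessor to watch (sketch)."""
    # The sequence number is the numeric suffix, e.g. "lock-0000000042" -> 42.
    seq = lambda name: int(name.rsplit("-", 1)[1])
    ordered = sorted(children, key=seq)
    if ordered[0] == my_node:
        return ("held", None)
    # Watch only the node immediately ahead of us, avoiding the herd effect.
    predecessor = ordered[ordered.index(my_node) - 1]
    return ("waiting", predecessor)
```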

etcd: Modern Distributed Coordination

etcd, used as the coordination layer for Kubernetes, provides distributed locking through leases. A lease is a time-limited grant that must be periodically renewed. Keys can be associated with leases; when a lease expires, all associated keys are deleted.
To acquire a lock, a client creates a lease, then creates a key under the lock path with that lease attached. The creation is conditional: using a transaction, the client writes its entry only if the key does not already exist (etcd's lock recipe instead has each client write its own key under the lock prefix and wait until its entry has the lowest creation revision).
The client must keep the lease alive by periodically sending keepalive requests. If the client crashes or loses network connectivity, it stops sending keepalives, the lease expires, and the lock is released. This provides automatic recovery from client failures.
etcd's use of Raft consensus provides strong consistency guarantees. All operations go through the leader and are replicated to a majority before being acknowledged. This prevents the issues that can plague weaker implementations.
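The lease lifecycle can be sketched with a small in-memory model. This is illustration only: real etcd clients attach keys to a server-side lease and stream keepalives over gRPC, and the names below are assumptions:

```python
import time

class Lease:
    """In-memory sketch of an etcd-style lease: time-limited, renewed by
    keepalives; a lock tied to an expired lease is considered released."""
    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self.expires_at = time.monotonic() + ttl_s

    def keepalive(self):
        # A live client calls this periodically to extend the lease.
        self.expires_at = time.monotonic() + self.ttl_s

    def alive(self):
        return time.monotonic() < self.expires_at

def try_lock(locks, key, lease):
    """Acquire `key` unless a still-live lease already holds it."""
    current = locks.get(key)
    if current is not None and current.alive():
        return False
    locks[key] = lease
    return True
```

A crashed client simply stops calling `keepalive()`, its lease runs out, and the next `try_lock` succeeds: the automatic-recovery behavior described above.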

The Problem of Time

Time is perhaps the most subtle challenge in distributed locking. Every lock implementation that features automatic expiration relies on time in some way. But time in distributed systems is treacherous.
Clocks on different machines drift apart. NTP can correct drift, but it can also jump the clock forward or backward. A clock jump backward can make an unexpired lock appear expired from the client's perspective. A clock jump forward can make a client believe it still holds a lock that has already expired on the server.
Even without clock drift, there are delays between checking if you hold a lock and acting on that knowledge. In the time it takes to read the lock state and process the response, the lock might have expired. This is called the "check-then-act" race condition, and it's endemic to distributed systems.
Some systems try to minimize these issues by using local clocks only for lease management while relying on server-controlled sequence numbers for correctness. The client doesn't trust its own clock to determine if it holds the lock; instead, it relies on fencing tokens to ensure safety even if its timing assumptions are wrong.

When to Use Distributed Locks

Distributed locks are often overused. Many problems that seem to require locking can be solved better through other means.
If you're coordinating access to a database, the database itself likely provides better coordination primitives—transactions, row-level locks, or atomic operations. Using an external distributed lock to protect database access adds complexity and failure modes without adding value.
If you're trying to ensure exactly-once processing, idempotency might be a better approach. Design your operations so they can be safely executed multiple times with the same result. Then occasional duplicate execution—which distributed locks can't perfectly prevent anyway—doesn't matter.
If you're coordinating background jobs across instances, consider whether you truly need exactly-once semantics or whether at-least-once with idempotent processing would suffice. Job queues with acknowledgment provide this pattern more robustly than distributed locks.
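The idempotent-consumer pattern mentioned above can be sketched minimally (hypothetical names): duplicate deliveries of the same event are detected and skipped, so at-least-once delivery still produces each effect exactly once:

```python
def process_once(seen_ids, event_id, handler):
    """Apply `handler` for this event only if it hasn't been applied before."""
    if event_id in seen_ids:
        return False  # duplicate delivery: already applied, safely ignored
    handler()
    seen_ids.add(event_id)
    return True
```

In production, recording the event ID and applying the effect should happen atomically (for example, in the same database transaction), otherwise a crash between the two steps reintroduces the duplication problem.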
Distributed locks are appropriate when you have a resource that doesn't support internal coordination, when the resource can't be made idempotent, and when occasional double-execution would cause significant problems. Even then, consider whether the complexity of distributed locking is justified by the risk of double-execution.

Safety vs. Liveness

Distributed lock implementations make tradeoffs between safety and liveness. Safety means never allowing two clients to hold the lock simultaneously. Liveness means always eventually allowing a client to acquire the lock.
In the presence of arbitrary network delays and failures, you cannot guarantee both properties perfectly. This is related to the FLP impossibility result: you can't have consensus (required for safe locking) with guaranteed termination in an asynchronous system with potential failures.
Practical systems choose points on this spectrum. Some prioritize safety—better to have no client hold the lock than to have two clients hold it. Others prioritize liveness—better to risk occasional overlap than to have locks that never release.
The right choice depends on your use case. For coordinating access to a bank account, you want safety even at the cost of liveness. For rate-limiting API calls, occasional overlap might be acceptable, and liveness is more important.

Implementing Locks Safely

If you must use distributed locks, follow these principles for safety.
Always use timeouts. Never wait indefinitely for a lock. Set acquisition timeouts and hold timeouts appropriate to your operation. If you can't acquire a lock quickly, something might be wrong.
Always use unique identifiers. Each lock acquisition should be identifiable. This allows you to release only your own lock and enables debugging when things go wrong.
Always consider what happens if you don't hold the lock. Before any critical operation, verify you still hold the lock. But remember that this verification might be stale by the time you act on it. Use fencing tokens where possible.
Always handle failures explicitly. What happens if you can't release the lock? What if the lock service is unavailable? What if your process pauses longer than the lock timeout? Have answers to these questions before you deploy.
Never assume atomicity across lock acquisition and your operation. Acquiring a lock and then doing work are separate steps. Failures can happen between them. Design your system to handle partial completion.
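Put together, a usage pattern honoring these principles might look like the sketch below. The `lock` object and `FakeLock` are assumptions chosen to illustrate the interface, not a real client library:

```python
import time
import uuid

def run_with_lock(lock, key, ttl_ms, acquire_timeout_s, critical_section):
    """Bounded acquisition, unique token, explicit release handling.
    `lock` is assumed to expose acquire(key, token, ttl_ms) -> bool
    and release(key, token) -> bool."""
    token = str(uuid.uuid4())  # unique per acquisition, for safe release and debugging
    deadline = time.monotonic() + acquire_timeout_s
    while not lock.acquire(key, token, ttl_ms):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"could not acquire {key!r}")  # never wait forever
        time.sleep(0.01)
    try:
        return critical_section(token)  # pass the token on for fencing checks
    finally:
        if not lock.release(key, token):
            # Release can fail (lock expired, network down); the TTL is the
            # backstop, but this is worth logging for debugging.
            pass

class FakeLock:
    """Trivial in-memory lock, only to show the assumed interface."""
    def __init__(self):
        self.held = {}
    def acquire(self, key, token, ttl_ms):
        if key in self.held:
            return False
        self.held[key] = token
        return True
    def release(self, key, token):
        if self.held.get(key) == token:
            del self.held[key]
            return True
        return False
```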

Beyond Mutual Exclusion

Distributed locks are often used for mutual exclusion, but coordination services like ZooKeeper and etcd provide richer primitives. Leader election selects one node from a group to be the leader, handling failover when the leader fails. Distributed queues coordinate task distribution across workers. Configuration management distributes settings changes to all nodes atomically. Service discovery allows nodes to find each other dynamically.
These higher-level abstractions are often better choices than raw locks. They encapsulate common patterns, handle edge cases, and provide clearer semantics. When you reach for a distributed lock, consider whether a higher-level primitive better matches your actual need.

"A distributed lock is a promise made across a network—a promise that failures, delays, and clock skew all conspire to break. Use them sparingly, implement them carefully, and always have a plan for when the promise fails."
Tags: distributed-locks, coordination, redis, consensus