Introduction to Distributed Systems: Why Everything Breaks and How We Build Anyway

🔥 The Problem

We had a monolithic e-commerce application running on a single powerful server. For three years, it worked beautifully. Then our user base grew from 10,000 to 500,000 in eight months after a viral marketing campaign. The symptoms started slowly: checkout latency increased from 200ms to 800ms, then to 2 seconds. Database queries that once took 5ms now took 500ms. We threw hardware at it upgraded to a 128-core machine with 1TB of RAM. That bought us three months.

Then Black Friday arrived. At 9:47 AM, the server hit 100% CPU. The database connection pool exhausted at 9:51 AM. By 9:53 AM, the health check started failing. At 9:54 AM, the load balancer marked the server as unhealthy and stopped routing traffic. The entire platform went down. Customers saw a blank white page. We lost $47,000 in revenue per minute for the 23 minutes it took to restart the server and clear the queued requests.

The root cause was not a bug. The root cause was physics. A single machine, no matter how powerful, has finite resources. A single CPU can only execute so many instructions per second. A single disk can only perform so many I/O operations. A single network card can only handle so much bandwidth. When all 500,000 users tried to browse products, add items to carts, and check out simultaneously, the server simply ran out of everything.

But the deeper problem was architectural. With everything on one machine, we had a single point of failure. When that machine failed, there was no backup, no failover, no graceful degradation. The entire business stopped. Every team engineering, customer support, marketing sat helpless while one overloaded server decided our fate.

We also faced a geographic problem. Our server was in Virginia. When a user in Tokyo loaded a product page, their request traveled 10,800 kilometers to Virginia, the server processed it, and the response traveled 10,800 kilometers back. Physics tells us that light in fiber optic cable travels at roughly 200,000 kilometers per second. That means the absolute minimum round-trip time, ignoring all processing, is 108 milliseconds. In practice, with routing hops, the TCP handshake, TLS negotiation, and server processing, our Tokyo users experienced 400-600ms page loads. Our competitors with servers in Asia loaded in under 100ms. We were losing market share to physics.

The business was growing, but our infrastructure was a ticking time bomb. We needed a fundamental rethinking of how we built systems. We needed distribution.

💡 Inspiration

The inspiration came from studying how the largest internet companies operated. Netflix serves over 15% of global internet bandwidth during peak hours. Google handles over 8.5 billion searches per day. Amazon processes over $1 billion in sales on peak days. None of these companies run on a single server. They run on thousands of servers spread across the globe, failing and recovering constantly, yet appearing to users as a single seamless system.

We studied Amazon's 2004 paper on Dynamo, which described how they built a highly available key-value store that could survive entire datacenter failures. We read Google's papers on Bigtable and Spanner, describing systems that scaled to petabytes while maintaining consistency across continents. We looked at Netflix's architecture blog posts describing how they intentionally killed servers in production using Chaos Monkey to ensure their system could handle failures gracefully.

The core insight from all this research was a paradigm shift: instead of building one perfect machine that never fails, build many imperfect machines that fail constantly but work together so that the overall system stays up. Accept that hardware fails. Accept that networks are unreliable. Accept that software has bugs. Design the system so that no single failure can bring everything down.

A second insight came from Peter Deutsch's Eight Fallacies of Distributed Computing, written at Sun Microsystems in the 1990s. Deutsch observed that programmers new to distributed systems consistently made the same false assumptions: that the network is reliable, that latency is zero, that bandwidth is infinite, that the network is secure. These assumptions work fine when everything runs on one machine. They become catastrophic when you distribute.

We realized we needed to rebuild our mental model. A distributed system is not a single system that happens to run on multiple machines. A distributed system is a collection of independent machines that must cooperate despite unreliable communication, independent failures, and no shared clock. This is fundamentally different from local programming, and treating it as merely "programming but on more computers" is the root cause of most distributed system failures.

🛠️ The Solution (Overview)

A distributed system is a collection of independent computers that appears to its users as a single coherent system. The users whether human or other software interact with what seems like one unified application, but behind the scenes, many machines are collaborating to serve the request.

Think of it like a restaurant with multiple chefs. A customer orders a meal from a single waiter. The waiter takes the order to the kitchen. In the kitchen, one chef prepares the appetizer, another prepares the main course, a third prepares the dessert. Each chef works independently, using their own station and equipment. Yet to the customer, they receive what feels like a single coordinated meal. If one chef calls in sick, another can cover. If the grill breaks, the kitchen can still make salads and desserts. The customer might experience a slightly reduced menu or longer wait, but they still get fed.

We rebuilt our e-commerce platform using this model. Instead of one monolithic server, we created multiple specialized services: a user service that handled authentication and profiles, a product service that managed inventory and search, an order service that handled cart and checkout, a payment service that processed transactions, and a notification service that sent emails and push notifications.

Each service ran on multiple machines across multiple datacenters. If one machine failed, traffic automatically routed to the healthy ones. If an entire datacenter went down, traffic routed to another region. We deployed servers in Virginia, Frankfurt, Tokyo, and São Paulo, so users anywhere in the world connected to a nearby server.

The key insight that made this work was treating failure as normal, not exceptional. We assumed every network call could fail. We assumed every machine could crash at any moment. We built retries, timeouts, circuit breakers, and fallbacks into every interaction. When the payment service was slow, the order service didn't hang indefinitely it timed out after 5 seconds and showed the user a graceful error message with a retry button.

🔍 Detailed Explanation

What Exactly Is a Distributed System?

A distributed system has four defining characteristics that distinguish it from a program running on a single machine.

First: Concurrency. Multiple components execute simultaneously. In a single-threaded program, only one thing happens at a time. In a distributed system, your user service in Virginia might be authenticating user A while your product service in Tokyo is loading products for user B while your order service in Frankfurt is processing a checkout for user C. All of these happen at the exact same moment in wall-clock time. This concurrency is both a feature it's how we achieve scale and a source of complexity, because concurrent operations on shared data require careful coordination to avoid conflicts.

Second: No Global Clock. Each machine in a distributed system has its own clock, and these clocks drift. On a single machine, you can ask "what time is it?" and get a definitive answer. In a distributed system, machine A might think it's 10:00:00.000, machine B might think it's 10:00:00.037, and machine C might think it's 9:59:59.982. This matters enormously when you try to order events. If machine A writes a value at its local time 10:00:00.000 and machine B writes a different value at its local time 10:00:00.001, which write happened first? You cannot answer this question using wall-clock time alone. This is why distributed systems use logical clocks Lamport timestamps, vector clocks to establish ordering without relying on synchronized physical time.

Third: Independent Failures. In a single machine, either the whole machine works or it doesn't. In a distributed system, machine A can crash while machine B continues operating perfectly. The network between A and B can fail while both machines are healthy. A can reach B, but B cannot reach A due to asymmetric network failure. This partial failure is the defining challenge of distributed systems. You cannot assume that if your request succeeded once, it will succeed again. You cannot assume that if your request failed, the operation didn't happen maybe the operation succeeded but the response was lost.

Fourth: Message Passing. Components communicate by sending messages over the network, not by calling functions in shared memory. When you call a local function, it either returns a result or throws an exception, and this happens essentially instantaneously. When you send a message to another machine, you have no idea how long it will take to arrive. It might take 1 millisecond or 1 second or never arrive at all. The receiving machine might process your message and send a response, but that response might get lost on the way back. From your perspective, you cannot distinguish "the other machine is slow" from "the other machine is dead" from "the network between us is broken."

Why Do We Build Distributed Systems?

Given all this complexity, why would anyone choose to build a distributed system? There are three compelling reasons.

Reason One: Scalability. A single machine has physical limits. The largest commercially available servers today might have 448 CPU cores, 24 terabytes of RAM, and 300 terabytes of SSD storage. This sounds like a lot, but consider that Netflix processes over 2 billion hours of video streaming per quarter. No single machine, no matter how powerful, can handle that load. The only way to serve millions of concurrent users is to spread the work across many machines. When you need more capacity, you add more machines. This is called horizontal scaling, and in theory, it has no upper limit you can always add more machines.

Vertical scaling making a single machine more powerful hits diminishing returns. Doubling the CPU cores doesn't double the throughput, because memory bandwidth becomes the bottleneck. Doubling the RAM doesn't help if the disk can't read data fast enough. And the most powerful servers cost exponentially more than multiple less powerful ones. Ten servers with 16 cores each often handle more load than one server with 128 cores, at a fraction of the cost.

Reason Two: Fault Tolerance. A single machine is a single point of failure. If that machine dies due to hardware failure, kernel panic, power outage, or fire in the datacenter your entire system dies. With a distributed system, you can survive failures. If you have three replicas of your database and one fails, the other two continue serving requests. Users might not even notice. This isn't theoretical hardware failure is constant at scale. Google reported that in a cluster of 1,800 machines over a year, they expect around 1,000 individual disk failures, 20 machine reboots, and several network maintenance events. Systems must continue operating through this constant churn.

Reason Three: Latency. The speed of light is a fundamental physical constant. Light travels at roughly 300,000 kilometers per second in a vacuum, and roughly 200,000 kilometers per second in fiber optic cable. New York to Tokyo is about 10,800 kilometers. That means the absolute fastest a packet can travel from a user in Tokyo to a server in New York and back is about 108 milliseconds. This is physics, not software you cannot optimize your code to make light travel faster.

The solution is geographic distribution. Put servers close to users. A user in Tokyo connecting to a server in Tokyo might have a round-trip time of 5-10 milliseconds instead of 200+ milliseconds to New York. For a web application that requires several back-and-forth requests to render a page, this difference compounds dramatically. A page that takes 2 seconds to load from across the globe might load in 200 milliseconds from a nearby server. This directly impacts user experience, engagement, and conversion rates.

The Eight Fallacies of Distributed Computing

Peter Deutsch and others at Sun Microsystems identified eight false assumptions that programmers make when first building distributed systems. These fallacies cause bugs, outages, and architectural mistakes.

Fallacy 1: The Network Is Reliable. When you call a local function, it always executes (unless your whole machine crashes). When you send a network request, it might not arrive. It might arrive but the response might not return. It might arrive twice if there's a retry at the network layer you don't control. It might arrive but be corrupted. The network fails all the time: packets are dropped due to congestion, routers crash, cables are unplugged accidentally, DNS servers go down, TLS certificates expire. Every network call in your code must handle failure: timeouts, retries with exponential backoff, idempotency to handle duplicate delivery, circuit breakers to stop hammering a broken service.

Fallacy 2: Latency Is Zero. A local function call takes nanoseconds. A network call takes milliseconds roughly a million times slower. When you refactor a monolith into microservices, you often replace local function calls with network calls. A request that previously required 10 function calls, taking 100 nanoseconds total, now requires 10 network calls, taking 10-50 milliseconds total. This is a 100,000x slowdown. You must batch requests where possible, use caching aggressively, and think carefully about which boundaries require network calls.

Fallacy 3: Bandwidth Is Infinite. Network bandwidth costs money and has limits. Cloud providers charge for data transfer, especially across regions and to the internet. AWS charges $0.09 per gigabyte of data transfer to the internet. If you're serving video and each user watches 2GB, serving a million users costs $180,000 just in bandwidth. Beyond cost, bandwidth affects latency a 10MB payload takes 800 milliseconds to transfer on a 100Mbps connection, even with zero latency. Use compression, binary formats like Protocol Buffers instead of JSON, pagination, and streaming to minimize data transfer.

Fallacy 4: The Network Is Secure. Every network should be treated as hostile. Man-in-the-middle attacks can intercept unencrypted traffic. Internal networks are not safe once an attacker is inside your network, they can sniff all unencrypted traffic between your services. Use TLS for all communication, even internal. Use mutual TLS (mTLS) for service-to-service authentication. Never transmit credentials in plaintext. Assume every packet is being watched.

Fallacy 5: Topology Doesn't Change. In cloud environments, IP addresses change constantly. Servers are created and destroyed. Autoscaling adds and removes instances. Deployments roll out new containers with new IPs. Never hardcode IP addresses. Use service discovery DNS, Consul, etcd to find services dynamically. Expect that the server you're talking to might not exist in an hour.

Fallacy 6: There Is One Administrator. Your system spans multiple administrative domains. You control your application code, but AWS controls the EC2 instances. Cloudflare controls your CDN. Stripe controls payment processing. Your customer's ISP controls their network path to you. Any of these can fail independently, and you cannot fix them you can only design your system to handle their failures gracefully.

Fallacy 7: Transport Cost Is Zero. Sending data over the network consumes CPU for serialization and deserialization, memory for buffers and connection state, and money for bandwidth and load balancer usage. JSON parsing alone can consume 10-30% of CPU in microservice architectures with high message rates. Use efficient serialization (Protocol Buffers), connection pooling (reuse TCP connections), and batch requests to amortize transport costs.

Fallacy 8: The Network Is Homogeneous. Your request travels through diverse infrastructure: the client's device, their WiFi router, their ISP, several internet exchange points, your cloud provider's network, your load balancer, your server. Each hop has different characteristics, different failure modes, different security policies. Don't assume specific network features are universally available.

Partial Failures: The Defining Challenge

The most difficult aspect of distributed systems is partial failure. Some components fail while others continue operating. This creates situations that cannot occur in a single-machine system.

Consider a simple scenario: you send a request to update a user's email address. You set a timeout of 5 seconds. The timeout fires. What happened?

Possibility A: The server never received your request. The email is unchanged.

Possibility B: The server received the request, updated the email, but the response was lost on the way back. The email is changed.

Possibility C: The server received the request, started processing, but crashed mid-processing. The email might be changed, might not be, or might be in a corrupted state.

Possibility D: The server is still processing, just slowly. The email will be changed eventually.

From the client's perspective, you cannot distinguish these cases. This uncertainty is fundamental it cannot be eliminated by better error messages or more logging. The Two Generals Problem, proven unsolvable in the 1970s, shows that achieving consensus over an unreliable network is impossible with certainty.

Practical distributed systems deal with this uncertainty through several mechanisms. Idempotent operations allow safe retries updating an email to the same value twice has the same effect as doing it once. Idempotency keys let you track whether a specific request has already been processed. Timeouts prevent indefinite waiting. Health checks determine if services are alive. Circuit breakers stop sending requests to failing services. But none of these eliminate uncertainty they just make it manageable.

🏗️ Architecture

Let me describe a concrete distributed system architecture and explain each component in detail.

┌─────────────────────────────────────────────────────────────────────────────┐
│                           DISTRIBUTED E-COMMERCE SYSTEM                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                               │
│  USERS (Global)                                                               │
│    │                                                                          │
│    ▼                                                                          │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                              CDN                                      │   │
│  │  (Cloudflare/Fastly - 200+ Points of Presence)                       │   │
│  │  • Static assets cached at edge                                       │   │
│  │  • DDoS protection                                                    │   │
│  │  • TLS termination                                                    │   │
│  └───────────────────────────────┬──────────────────────────────────────┘   │
│                                  │                                           │
│         ┌────────────────────────┼────────────────────────┐                 │
│         │                        │                        │                 │
│         ▼                        ▼                        ▼                 │
│  ┌─────────────┐          ┌─────────────┐          ┌─────────────┐         │
│  │ US-EAST     │          │ EU-WEST     │          │ AP-NORTHEAST│         │
│  │ (Virginia)  │◄────────►│ (Frankfurt) │◄────────►│ (Tokyo)     │         │
│  └──────┬──────┘          └──────┬──────┘          └──────┬──────┘         │
│         │                        │                        │                 │
│  ┌──────┴──────────────────────────────────────────────────┴──────┐         │
│  │                     PER-REGION STACK                            │         │
│  │  ┌────────────┐                                                │         │
│  │  │ Load       │ (AWS ALB / NLB)                                │         │
│  │  │ Balancer   │ • Health checks every 10s                      │         │
│  │  └─────┬──────┘ • Automatic unhealthy instance removal         │         │
│  │        │                                                        │         │
│  │  ┌─────┴──────────────────────────────────────────────┐        │         │
│  │  │              API GATEWAY                            │        │         │
│  │  │  • Rate limiting (1000 req/s per user)             │        │         │
│  │  │  • Authentication (JWT verification)                │        │         │
│  │  │  • Request routing                                  │        │         │
│  │  └─────┬──────────────────────────────────────────────┘        │         │
│  │        │                                                        │         │
│  │  ┌─────┴───┬───────────┬───────────┬───────────┐               │         │
│  │  ▼         ▼           ▼           ▼           ▼               │         │
│  │ ┌─────┐ ┌─────┐    ┌─────┐    ┌─────┐    ┌─────┐              │         │
│  │ │User │ │Prod │    │Order│    │Pay  │    │Notif│              │         │
│  │ │Svc  │ │Svc  │    │Svc  │    │Svc  │    │Svc  │              │         │
│  │ │(x3) │ │(x5) │    │(x4) │    │(x2) │    │(x2) │              │         │
│  │ └──┬──┘ └──┬──┘    └──┬──┘    └──┬──┘    └──┬──┘              │         │
│  │    │       │          │          │          │                  │         │
│  │  ┌─┴───────┴──────────┴──────────┴──────────┴─┐                │         │
│  │  │           SERVICE MESH (Istio)             │                │         │
│  │  │  • mTLS between all services               │                │         │
│  │  │  • Retry policies                          │                │         │
│  │  │  • Circuit breakers                        │                │         │
│  │  │  • Distributed tracing                     │                │         │
│  │  └───────────────────────────────────────────┘                 │         │
│  │                                                                 │         │
│  │  ┌─────────────────────────────────────────────────────────┐   │         │
│  │  │              DATA LAYER                                  │   │         │
│  │  │  ┌────────────┐  ┌────────────┐  ┌────────────┐         │   │         │
│  │  │  │ PostgreSQL │  │   Redis    │  │Elasticsearch│        │   │         │
│  │  │  │ (Primary + │  │  (Cluster) │  │  (Cluster)  │        │   │         │
│  │  │  │  Replicas) │  │            │  │             │        │   │         │
│  │  │  └────────────┘  └────────────┘  └────────────┘         │   │         │
│  │  └─────────────────────────────────────────────────────────┘   │         │
│  │                                                                 │         │
│  │  ┌─────────────────────────────────────────────────────────┐   │         │
│  │  │              MESSAGE QUEUE (Kafka)                       │   │         │
│  │  │  • Async communication between services                  │   │         │
│  │  │  • Event sourcing                                        │   │         │
│  │  │  • Cross-region replication                              │   │         │
│  │  └─────────────────────────────────────────────────────────┘   │         │
│  └─────────────────────────────────────────────────────────────────┘         │
│                                                                               │
└─────────────────────────────────────────────────────────────────────────────┘

The CDN Layer. When a user types our URL, their request first hits the CDN a global network of servers positioned close to users worldwide. The CDN serves static assets (images, CSS, JavaScript) from cache without ever touching our origin servers. It also provides DDoS protection by absorbing malicious traffic at the edge before it can overwhelm our infrastructure. TLS termination happens here the CDN handles the expensive cryptographic handshake, so our backend servers don't need to. The CDN decides which regional datacenter to route the request to based on latency measurements. A user in Berlin goes to Frankfurt. A user in Singapore goes to Tokyo.

The Load Balancer. Within each region, the load balancer distributes incoming requests across multiple instances of our API gateway. It performs health checks every 10 seconds sending a request to a /health endpoint on each instance. If an instance fails three consecutive health checks, the load balancer stops sending it traffic. When the instance recovers, traffic resumes automatically. This happens without any manual intervention and usually without users noticing.

The API Gateway. The API gateway is the single entry point for all API requests. It handles cross-cutting concerns that apply to every request: rate limiting (preventing any single user from overwhelming the system), authentication (validating JWT tokens and extracting user identity), request routing (forwarding /users/* requests to the User Service, /products/* to the Product Service, etc.), and request logging (recording every request for debugging and audit purposes). Having these concerns in one place means individual services don't need to implement them.

The Microservices. Each service is independently deployable, scalable, and developable. The User Service (running 3 instances) handles authentication, profiles, and preferences. The Product Service (5 instances, because it handles the most read traffic) manages the product catalog and search. The Order Service (4 instances) handles shopping cart and checkout. The Payment Service (2 instances) processes payments and talks to Stripe. The Notification Service (2 instances) sends emails and push notifications asynchronously.

Each service owns its data completely. The User Service has its own PostgreSQL database that no other service can access directly. If the Order Service needs user information, it calls the User Service's API it cannot read the User Service's database directly. This isolation means services can evolve independently. The User Service could switch from PostgreSQL to MongoDB without any other service knowing or caring.

The Service Mesh. All service-to-service communication goes through the service mesh (Istio in our case). The mesh provides mutual TLS every request between services is encrypted and both sides authenticate each other using certificates. The mesh implements retry policies (automatically retry failed requests up to 3 times with exponential backoff) and circuit breakers (if a service fails 50% of requests in the last 10 seconds, stop sending it requests for 30 seconds to let it recover). The mesh also collects distributed traces, showing the complete path of a request across all services for debugging.

The Data Layer. PostgreSQL stores durable data user accounts, orders, product catalog. Each database has a primary for writes and multiple read replicas for read scaling. Redis provides caching (reducing database load by caching frequently-accessed data) and session storage (user sessions need fast access and can be recreated if lost). Elasticsearch powers product search, enabling full-text search and faceted filtering that would be slow in a relational database.

The Message Queue. Kafka enables asynchronous communication between services. When an order is placed, the Order Service publishes an order.created event to Kafka. The Notification Service subscribes to this topic and sends a confirmation email. The Analytics Service subscribes and updates dashboards. The Inventory Service subscribes and decrements stock. None of these services know about each other they just publish and subscribe to events. This decoupling means we can add new consumers without modifying the Order Service. Kafka also replicates events across regions, enabling cross-region data synchronization.

Request Flow: Happy Path. Let's trace a user loading their profile. (1) User's browser sends request to CDN. (2) CDN doesn't have this in cache (dynamic data), forwards to the nearest regional load balancer. (3) Load balancer selects a healthy API gateway instance. (4) API gateway validates the JWT, extracts user_id=12345, routes to User Service. (5) Service mesh intercepts the call, establishes mTLS connection. (6) User Service receives request, checks Redis cache miss. (7) User Service queries PostgreSQL read replica. (8) User Service caches result in Redis (TTL 5 minutes), returns response. (9) Response flows back through the chain to the user's browser. Total time: 50-100ms.

Request Flow: Failure Scenario. Same request, but the User Service is overloaded. (1-5) Same as above. (6) User Service is slow request times out after 2 seconds. (7) Service mesh retries with a different User Service instance. (8) Second instance responds successfully. (9) Response returns to user. User experienced slight delay but got their data.

Request Flow: Partition Scenario. The network between US-EAST and the global database coordinator fails. (1) User in Virginia makes a request. (2) Request routes to US-EAST. (3) Order Service tries to write to the database but the write requires global coordination that's unavailable. (4) Order Service returns a graceful error: "We're experiencing temporary issues. Your order has been saved and will be processed shortly." (5) The order is stored in a local queue. (6) When connectivity restores, the queue drains and the order processes. User's order completes, just with a delay.

⚖️ Tradeoffs

The distributed architecture provides scalability, fault tolerance, and low latency, but these benefits come with significant costs.

Operational Complexity. A monolith is one thing to deploy, monitor, and debug. Our distributed system has 5 services, each running multiple instances across 3 regions. That's dozens of processes to monitor. When a request fails, you might need to examine logs from 4 different services to understand what happened. Debugging requires distributed tracing. Deployments require coordination to avoid breaking inter-service dependencies. You need expertise in load balancers, service meshes, message queues, multiple databases, and container orchestration. This complexity requires a larger, more skilled operations team.

Cost. Running 3 instances of 5 services across 3 regions means 45 server instances minimum, plus load balancers, databases, caches, message queues, and networking. The CDN, DDoS protection, and cross-region data transfer all cost money. The operational team to manage this infrastructure costs money. For many applications, a single well-optimized server might cost $500/month while the distributed version costs $15,000/month. You must ensure the benefits justify this cost.

Consistency Challenges. In a monolith with a single database, you can use transactions to ensure consistency. Update the order and decrement inventory in a single atomic operation. In a distributed system, the Order Service and Inventory Service have separate databases. You cannot use a single transaction. If the order succeeds but the inventory update fails, you have inconsistent state. You must implement compensating transactions (saga pattern), accept eventual consistency, or use expensive distributed transaction protocols. Each approach has drawbacks.

Latency Overhead. Every network call adds latency typically 1-10ms within a datacenter. If a request requires calling 5 services sequentially, that's 5-50ms just in network overhead. This overhead doesn't exist in a monolith. You can mitigate with parallel calls, caching, and careful service boundary design, but the overhead never fully disappears.

Partial Failure Handling. Every inter-service call might fail. You must handle this everywhere: timeouts, retries, fallbacks, circuit breakers. The code to handle failure often exceeds the code that handles the happy path. Forgetting to handle failure in one place causes cascading failures that bring down the entire system. This requires discipline and extensive testing.

Aspect	Single Server	Distributed System
Scalability	Limited by hardware	Essentially unlimited
Fault tolerance	Single point of failure	Survives multiple failures
Latency (global users)	High (single location)	Low (geo-distributed)
Operational complexity	Low	High
Cost (at small scale)	Low	High
Development complexity	Low	High
Consistency	Easy (transactions)	Hard (eventual consistency)
Debugging	Easy (one process)	Hard (distributed tracing)

The key insight is that distributed systems are not universally better. They solve specific problems scale beyond a single machine, survive failures, serve global users at significant cost. If you don't have those problems, a monolith is often the better choice. Many successful companies run monoliths serving millions of users. The decision to distribute should be driven by specific requirements, not fashion.

📝 Summary

We began with a crisis: a single server architecture that worked perfectly until it didn't. Black Friday traffic overwhelmed our machine, and the resulting outage cost real money and customer trust. The root cause was fundamental: a single machine has finite resources and is a single point of failure.

The distributed systems paradigm offered a solution: instead of one powerful machine, use many ordinary machines working together. This approach provides three key benefits. Scalability: you can handle more load by adding more machines, with essentially no upper limit. Fault tolerance: when individual machines fail and they will fail the system continues operating. Latency: by placing servers close to users around the world, you can provide faster response times than physics would otherwise allow.

But distribution comes with inherent challenges that don't exist in single-machine systems. There is no global clock, so ordering events requires special techniques. Components fail independently, creating partial failures where some parts work and others don't. Communication happens through unreliable networks, meaning every message might be lost, delayed, or duplicated.

The Eight Fallacies of Distributed Computing enumerate the false assumptions that cause most distributed system bugs: the network is reliable (it isn't), latency is zero (it's not), bandwidth is infinite (it costs money), the network is secure (assume breach), topology doesn't change (it changes constantly), there is one administrator (there are many), transport cost is zero (it's significant), the network is homogeneous (it's diverse).

Building reliable distributed systems requires accepting uncertainty as fundamental. You cannot know with certainty whether a remote operation succeeded when a timeout occurs. You must design for idempotency, implement retries with backoff, use circuit breakers to prevent cascade failures, and accept that some operations will require eventual consistency rather than immediate consistency.

Key Takeaways:

Distributed systems are multiple computers appearing as one system to users
We build them for scalability, fault tolerance, and reduced latency not because they're simpler
The Eight Fallacies remind us that networks fail, latency exists, bandwidth costs money, and topology changes
Partial failures are the defining challenge: some parts fail while others work
Consensus over unreliable networks is provably impossible with certainty
Every network call must handle failure: timeouts, retries, circuit breakers
Idempotency enables safe retries
The complexity cost is real: only distribute when you have the specific problems distribution solves

❓ Questions to Think About

1. Your monolith handles 10,000 requests per second on a single $500/month server. A distributed architecture would cost $15,000/month. Under what circumstances should you migrate to the distributed architecture anyway?

This question forces you to think about the non-functional requirements that justify distribution: fault tolerance requirements (can you afford downtime?), latency requirements for global users, team scaling (can you deploy independently?), and future growth projections. Sometimes the $15,000/month is justified by risk mitigation or enablement of business capabilities that the monolith cannot provide.

2. You have a distributed system where Service A calls Service B, which calls Service C. Service C has a 10% failure rate. What is the effective failure rate from Service A's perspective, and how would you improve it?

This question explores failure propagation in distributed systems. If failures are independent, the compound failure rate is higher than any individual service. It leads to thinking about retries, circuit breakers, fallbacks, and whether synchronous chains of calls are the right architecture.

3. Your distributed database claims to be "strongly consistent." During a network partition, you observe that writes in region A are not visible in region B. The vendor says this is expected behavior. How do you reconcile these claims?

This question addresses the CAP theorem and the marketing-vs-reality gap in distributed database claims. It prompts thinking about what consistency guarantees actually mean and how they behave under partition.

4. A junior engineer proposes using wall-clock time to order events in your distributed system. What problems will this cause, and how would you explain the correct approach?

This tests understanding of clock drift and synchronization challenges in distributed systems, leading to discussion of Lamport clocks, vector clocks, or hybrid logical clocks.

5. Your service processes payments. A request times out after charging the customer's credit card but before recording the transaction in your database. How do you handle this situation?

This is the classic partial failure scenario in a high-stakes context. It tests understanding of idempotency, saga pattern, compensating transactions, and the fundamental uncertainty of distributed operations.

6. You're designing a system that must serve users on every continent with sub-100ms latency. What are the minimum number of regions you need, and what consistency model would you use for the data?

This combines geographic distribution with consistency tradeoffs. The speed of light limits force certain architectural decisions, and the consistency model affects the user experience and implementation complexity.

7. Your distributed system has a 99.9% success rate for requests. You receive 1 million requests per day. How many failures is that, and is a 99.9% success rate acceptable?

This question makes abstract percentages concrete: 99.9% means 1,000 failures per day. It prompts discussion of SLOs, user impact, and whether "99.9%" is actually good or terrible depending on context.

8. An outage occurs and you need to determine whether Service A or Service B is at fault. Your services log to their own local files. How would you redesign logging to make such investigations easier?

This leads to discussion of centralized logging, distributed tracing, correlation IDs, and the observability requirements of distributed systems.

9. You're tasked with adding a sixth service to your five-service architecture. The new service needs data from all five existing services. What architectural concerns should you raise, and what alternatives might you propose?

This question addresses service coupling, the distributed monolith anti-pattern, and when adding a new service might actually make the architecture worse.

10. A distributed system with three replicas can tolerate one replica failure. Why not just use two replicas then isn't one replica failure all you need to survive?

This tests understanding of quorum-based systems and why you often need 2f+1 replicas to tolerate f failures (to maintain majority during partition).