Network Fundamentals for Distributed Systems: The Unreliable Pipe That Connects Everything

🔥 The Problem

Our microservices architecture looked beautiful on paper. We had decomposed a monolith into 12 services, each with a clear responsibility, each deployable independently, each owned by a specific team. The architecture diagrams showed clean arrows between boxes. In production, those arrows were nightmares.

Service A called Service B with a 500ms timeout. About 0.1% of the time, the call would fail with a timeout error. That sounded acceptable 99.9% success rate. But Service A was called by the API gateway for every request, and Service A called Service B, which called Service C, which called Service D. Our user-facing request touched 4 services. If each had a 0.1% failure rate, the compound failure rate was roughly 0.4% meaning 1 in 250 user requests failed. At 10,000 requests per second, that was 40 failures per second. Our support queue overflowed.

We investigated the timeout failures. The services themselves were healthy CPU at 20%, memory at 50%, no errors in application logs. The problem was the network. Sometimes packets took 2ms to traverse our datacenter. Sometimes they took 450ms. Sometimes they never arrived at all. Our 500ms timeout seemed generous until we realized that a TCP connection establishment (the three-way handshake) plus TLS negotiation plus actual request/response could legitimately take 400ms in degraded conditions leaving only 100ms margin before timeout.

We also discovered the "thundering herd" problem. When Service B restarted for a deployment, it dropped all existing TCP connections. 200 instances of Service A simultaneously detected the disconnection and attempted to reconnect. Service B received 200 simultaneous connection requests. Each connection required memory allocation, TLS handshake computation, and connection pool management. Service B's CPU spiked to 100% handling just the connection storm, causing it to be slow responding to actual requests, causing more timeouts, causing more reconnection attempts, causing more load. A simple deployment became a cascading outage.

We learned that network behavior is not a detail to be abstracted away. Network behavior is a fundamental design constraint. Every choice we made timeout values, connection pooling strategy, retry policy, service decomposition boundaries required understanding how TCP actually works, how TLS adds latency, how HTTP/2 multiplexing changes connection patterns, and how packet loss affects throughput. Without this knowledge, we were blind architects drawing buildings on quicksand.

💡 Inspiration

The inspiration came from reading the source material that network engineers read. We studied RFC 793 (TCP specification from 1981) and understood that TCP was designed for a different era when networks were unreliable, bandwidth was scarce, and the goal was reliability above all else. TCP's congestion control algorithms assume that packet loss means network congestion an assumption that's wrong on modern wireless networks where packets are lost due to radio interference, not congestion.

We read papers from Google describing their development of QUIC (Quick UDP Internet Connections), which eventually became HTTP/3. Google's engineers observed that TCP's head-of-line blocking caused significant latency on lossy networks when a single packet was lost, the entire connection stalled waiting for retransmission, even though other data streams could have proceeded. They also observed that TCP's handshake added unnecessary round trips, especially painful for mobile users switching between cell towers where connections were frequently re-established.

We studied Netflix's network optimization work, where they described tuning Linux kernel parameters to handle millions of concurrent connections. Their engineers wrote about the surprising ways that default TCP settings designed for general-purpose computing performed poorly under streaming workloads. They documented how increasing socket buffer sizes dramatically improved throughput on high-bandwidth, high-latency paths.

We read Cloudflare's blog posts about operating a global network, where they explained phenomena like bufferbloat (where excessive buffering in network devices adds latency), the relationship between TCP window size and achievable throughput, and the techniques they use to optimize TLS handshakes to shave milliseconds off every connection.

The key insight was that the network is not a simple pipe that data flows through. The network is a complex system with its own behaviors, bottlenecks, and failure modes. Understanding these behaviors allows you to make informed architectural decisions rather than guessing and hoping.

🛠️ The Solution (Overview)

The network that connects distributed system components operates in layers, each providing different capabilities and abstractions. For distributed systems, the critical layers are the transport layer (TCP and UDP) and the application layer (HTTP, gRPC).

TCP, the Transmission Control Protocol, provides reliable, ordered delivery of data between applications. When you send bytes using TCP, those bytes will arrive at the destination in the same order, or you will be notified that delivery failed. TCP achieves this through acknowledgments (the receiver confirms receipt of data), retransmission (the sender resends data that wasn't acknowledged), and sequencing (each byte has a sequence number to maintain order). Think of TCP as a postal service with delivery confirmation you know your letter arrived, and if you sent multiple letters, they arrive in the order you sent them.

UDP, the User Datagram Protocol, provides unreliable, unordered delivery. When you send data using UDP, it might arrive, might not, might arrive out of order, might arrive duplicated. UDP is like dropping letters in mailboxes fast and cheap, but no guarantees. UDP is appropriate when speed matters more than reliability (real-time video, gaming) or when you'll implement your own reliability mechanism on top.

HTTP/2 and HTTP/3 are application-layer protocols that enable request-response communication over these transports. HTTP/2 runs over TCP and provides multiplexing multiple concurrent requests sharing a single TCP connection. HTTP/3 runs over QUIC, which is built on UDP with reliability implemented at the application layer, avoiding some of TCP's limitations.

The key insight for distributed systems is that these protocols have behaviors that significantly impact your application. TCP connection establishment adds latency. TLS adds more latency. TCP congestion control limits throughput on new connections. HTTP/2 multiplexing changes how many connections you need. Understanding these behaviors lets you make informed decisions about timeouts, connection pooling, and protocol selection.

🔍 Detailed Explanation

TCP: The Three-Way Handshake

Before any application data can flow, TCP must establish a connection. This requires a three-way handshake: three messages exchanged between client and server before data transfer begins.

Step 1: SYN. The client sends a SYN (synchronize) packet to the server. This packet contains a randomly chosen initial sequence number (ISN), let's say 1000. This sequence number will be used to track the bytes sent in this connection. The client is saying: "I want to connect. My first byte will be numbered 1001."

Step 2: SYN-ACK. The server responds with a SYN-ACK packet. This acknowledges the client's SYN (ACK=1001, meaning "I'm ready to receive byte 1001") and sends the server's own SYN with its own ISN (let's say 5000). The server is saying: "Acknowledged. I'm ready to receive your byte 1001. My first byte will be numbered 5001."

Step 3: ACK. The client responds with an ACK packet (ACK=5001, meaning "I'm ready to receive byte 5001"). The connection is now established. Data can flow.

Client                                           Server
   |                                                |
   |  ──── SYN (seq=1000) ────────────────────────► |
   |       "I want to connect, starting at 1000"    |
   |                                                |
   |  ◄──── SYN-ACK (seq=5000, ack=1001) ────────── |
   |       "OK, starting at 5000, expecting 1001"   |
   |                                                |
   |  ──── ACK (ack=5001) ────────────────────────► |
   |       "Got it, expecting 5001"                 |
   |                                                |
   |  ═══════ CONNECTION ESTABLISHED ═══════════════|

Latency Impact. The three-way handshake requires 1.5 round-trip times (RTT) before any application data can be sent. Within a datacenter, RTT might be 0.5ms, so the handshake adds 0.75ms negligible. Between continents, RTT might be 150ms, so the handshake adds 225ms significant for user-facing requests.

Connection Reuse. Because of this handshake cost, reusing connections is critical. Connection pooling keeps established TCP connections open and reuses them for multiple requests. Instead of opening a new connection for each request (paying the handshake cost each time), you maintain a pool of open connections and borrow from the pool when needed. This is why HTTP/1.1 introduced "keep-alive" connections, and HTTP/2 takes this further with multiplexing.

TCP Fast Open. TCP Fast Open (TFO) is an optimization that allows data to be sent in the SYN packet on subsequent connections to the same server. After the first connection, the server gives the client a cookie. On future connections, the client includes this cookie in the SYN packet along with application data. The server can process this data immediately without waiting for the full handshake. This reduces effective latency from 1.5 RTT to 1 RTT for repeat connections. However, TFO has security considerations and isn't universally deployed.

TCP Flow Control: Preventing Receiver Overwhelm

TCP's flow control prevents a fast sender from overwhelming a slow receiver. The receiver advertises a "receive window" the amount of data it's willing to accept before the sender must pause and wait for acknowledgment.

Sender's View of Data:
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ 8 │ 9 │10 │11 │12 │
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
│ACKed│    │───Sent, Awaiting ACK───│   Not Sent   │
      │                              │
      └──────── Window Size ─────────┘

The sender can send bytes 3-8 (the window).
It must wait for ACKs before sending 9-12.

How It Works. Every TCP packet from the receiver includes the current window size. If the receiver's buffer is filling up (it's processing data slowly), it advertises a smaller window. If the receiver is keeping up, it advertises a larger window. When the window reaches zero, the sender stops sending and periodically probes the receiver to check if space has freed up.

Practical Impact. On a high-latency link, the window size limits throughput. Suppose the window size is 64KB (the default on many systems) and the round-trip time is 100ms. The sender can have at most 64KB "in flight" at any time. The maximum throughput is 64KB per 100ms = 640KB/s = 5.12 Mbps. Even on a gigabit link, you're limited to 5 Mbps by the window size.

Window Scaling. To address this, TCP window scaling was introduced. With window scaling enabled, the window size can be up to 1GB instead of 64KB. The bandwidth-delay product (BDP) calculation shows the required window: a 10 Gbps link with 100ms RTT requires a 125MB window to fully utilize the link. Modern systems enable window scaling by default, but older systems or misconfigured networks might not, leading to mysterious throughput limitations.

TCP Congestion Control: Preventing Network Overwhelm

While flow control prevents overwhelming the receiver, congestion control prevents overwhelming the network itself. TCP interprets packet loss as a signal of network congestion and reduces its sending rate in response.

Slow Start. When a TCP connection begins, it doesn't know how much bandwidth is available. It starts conservatively: the sender can send only 1 segment (typically 1460 bytes) before waiting for an acknowledgment. After receiving an ACK, it doubles its sending rate 2 segments, then 4, then 8, then 16. This exponential growth continues until either packet loss occurs or the receiver's window limit is reached.

Congestion Window Growth:

cwnd|
    |                           ┌─────────────
    |                      /\   │ Congestion
    |                 /\  /  \  │ Avoidance
    |            /\  /  \/    \ │ (linear growth)
    |       /\  /  \/          \│
    |  /\  /  \/                 \
    | /  \/                       \
    |/                             \ Packet Loss!
    |  Slow Start                   cwnd halved
    |  (exponential)
    └────────────────────────────────────────► time

Congestion Avoidance. Once the congestion window reaches a threshold (typically half the window size at the last packet loss), growth switches from exponential to linear adding 1 segment per RTT instead of doubling. This conservative approach probes for additional bandwidth slowly.

Packet Loss Response. When packet loss is detected, TCP interprets it as congestion and cuts the congestion window. Classic algorithms like "Reno" halve the window. More modern algorithms like "CUBIC" (Linux default) use a cubic function to recover bandwidth more quickly after a loss.

Practical Impact: The Slow Start Penalty. A new TCP connection starts with a small congestion window (typically 10 segments = ~14KB on modern systems). On a 100ms RTT connection, slow start proceeds: send 10 segments, wait 100ms for ACK, send 20 segments, wait 100ms, send 40 segments, and so on. To transfer 1MB of data, you need about 7 round trips for slow start alone (~700ms) before the connection is fully ramped up.

This is why connection reuse is so important. An established connection maintains its congestion window between requests (if using HTTP keep-alive or HTTP/2), avoiding the slow start penalty.

BBR: A Different Approach. Google's BBR (Bottleneck Bandwidth and Round-trip propagation time) takes a fundamentally different approach. Instead of using packet loss as a signal, BBR tries to estimate the actual bottleneck bandwidth and RTT of the path, then sends at that rate. BBR performs significantly better on networks with random packet loss (like wireless) where loss doesn't indicate congestion. However, BBR can be more aggressive than traditional algorithms and may impact fairness with other traffic.

TCP TIME_WAIT: The Ghost Connections

When a TCP connection closes, the socket doesn't immediately become available for reuse. It enters the TIME_WAIT state for 2 × MSL (Maximum Segment Lifetime), typically 60-120 seconds.

Why TIME_WAIT Exists. Imagine Connection A between client:50000 and server:80 closes. Immediately, a new Connection B opens using the same ports. A delayed packet from Connection A arrives. Without TIME_WAIT, this old packet might be interpreted as belonging to Connection B, corrupting the data stream. TIME_WAIT ensures that all packets from the old connection have expired before the port can be reused.

The Problem at Scale. If you're making 10,000 connections per second and each socket spends 60 seconds in TIME_WAIT, you'll have 600,000 sockets in TIME_WAIT at any time. Each socket consumes kernel memory (~4KB on Linux). That's 2.4GB of kernel memory just for TIME_WAIT sockets. Worse, if connections are to a single destination, you might exhaust the 65,535 available ephemeral ports.

TIME_WAIT Accumulation at Scale:

Connection Rate: 10,000/second
TIME_WAIT Duration: 60 seconds
Steady-State TIME_WAIT Count: 600,000 sockets
Memory Usage: ~2.4 GB
Port Exhaustion Risk: HIGH (only 65,535 ephemeral ports)

Solutions.

Connection Pooling: Reuse connections instead of opening new ones. A pool of 100 connections, each handling hundreds of requests, creates far fewer TIME_WAIT sockets than opening a new connection per request.
Server-Side Close: TIME_WAIT happens on the side that initiates the close. If the server closes connections, TIME_WAIT accumulates on servers (which you control and can tune) rather than clients.
TCP_REUSE: Linux's tcp_tw_reuse option allows reusing TIME_WAIT sockets for outgoing connections when safe to do so (when the new connection's timestamp is greater than the last packet from the old connection).
Shorter TIME_WAIT: Reducing tcp_fin_timeout (not exactly TIME_WAIT, but related) can help, though this risks packet confusion.

HTTP/1.1: The Head-of-Line Blocking Problem

HTTP/1.1 uses TCP in a request-response pattern: send a request, wait for the complete response, then send the next request. If one request is slow, subsequent requests must wait.

HTTP/1.1 on a Single Connection:

Time ──────────────────────────────────────────────────►

  Request 1: ────────────────────────────────►
  Response 1:                                  ────────────► (slow)
  Request 2:                                                  BLOCKED
  Response 2:                                                 BLOCKED
  Request 3:                                                  BLOCKED

Request 2 cannot start until Response 1 completes!

Workaround: Multiple Connections. Browsers work around this by opening 6 parallel connections per domain. Each connection can have its own pending request, providing parallel execution. But this costs 6× the TCP handshakes, 6× the TLS handshakes, 6× the memory for socket buffers, and 6× the slow start penalties.

HTTP Pipelining: The Failed Solution. HTTP/1.1 included "pipelining" sending multiple requests without waiting for responses. However, responses had to arrive in order. If Response 1 was slow, Responses 2 and 3 still had to wait (head-of-line blocking at the application layer instead of the request layer). Pipelining caused so many problems with intermediaries that browsers disabled it.

HTTP/2: Multiplexing on a Single Connection

HTTP/2 solved HTTP/1.1's head-of-line blocking by introducing streams independent, bidirectional sequences of frames, all multiplexed over a single TCP connection.

HTTP/2 Multiplexing:

Single TCP Connection:
┌────────────────────────────────────────────────────────────────┐
│                                                                │
│  Stream 1: [Header Frame][Data Frame][Data Frame][End Stream]  │
│  Stream 3: [Header Frame][Data Frame]                          │
│  Stream 5: [Header Frame][Data Frame][Data Frame]              │
│  Stream 3: [Data Frame][End Stream]                            │
│  Stream 5: [Data Frame][End Stream]                            │
│                                                                │
│  Frames interleaved, all on one TCP connection                 │
│  Streams processed independently                                │
└────────────────────────────────────────────────────────────────┘

How It Works. Each HTTP request gets its own stream ID (odd numbers for client-initiated, even for server-initiated). Frames are tagged with their stream ID and can be interleaved arbitrarily. The receiver reassembles frames into streams. A slow response on Stream 1 doesn't block frames arriving for Stream 3.

Additional Features:

Header Compression (HPACK): HTTP headers are compressed using a dictionary shared between client and server. Common headers like
Content-Type: application/json
are replaced with small integers.
Stream Prioritization: Clients can indicate priority of streams, allowing servers to send critical resources first.
Server Push: Servers can proactively send resources the client will need (though this feature is rarely used in practice).

The Remaining Problem: TCP Head-of-Line Blocking. HTTP/2 solves application-layer head-of-line blocking but introduces a new problem. All streams share one TCP connection. If a single TCP packet is lost, TCP stalls the entire connection waiting for retransmission even though only one stream was affected. On a lossy network (like mobile), this can make HTTP/2 slower than HTTP/1.1 with multiple connections.

HTTP/3 and QUIC: UDP with Application-Layer Reliability

QUIC (Quick UDP Internet Connections) is Google's solution to TCP's limitations. It's now standardized as the transport for HTTP/3.

QUIC's Key Innovation: Stream Independence. QUIC runs over UDP and implements reliability at the application layer. Crucially, each stream has independent reliability. If a packet for Stream 1 is lost, only Stream 1 stalls Stream 3 and Stream 5 continue receiving data.

HTTP/3 over QUIC: Stream Independence

UDP Packets:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Stream 1, Frame │ │ Stream 3, Frame │ │ Stream 1, Frame │  ← Lost!
└─────────────────┘ └─────────────────┘ └─────────────────┘

Result:
• Stream 1: STALLED waiting for retransmission
• Stream 3: CONTINUES processing
• Stream 5: CONTINUES processing

vs HTTP/2 over TCP where ALL streams would stall.

Faster Connection Establishment. TCP requires a handshake (1.5 RTT) before data. TLS requires another handshake (1-2 RTT). QUIC combines transport and crypto handshakes: 1 RTT for a new connection, 0 RTT for resumed connections (sending data immediately with the first packet).

Connection Establishment Comparison:

TCP + TLS 1.3:                 QUIC:
┌──────────────────┐          ┌──────────────────┐
│ TCP SYN          │          │ QUIC Initial     │
│ TCP SYN-ACK      │ 1 RTT    │ (+ TLS + Data)   │
│ TCP ACK          │          │                  │
│ TLS ClientHello  │          │ QUIC Handshake   │ 1 RTT (0 for resumed)
│ TLS ServerHello  │ 1 RTT    │ (+ Data)         │
│ TLS Finished     │          └──────────────────┘
│ HTTP Request     │
└──────────────────┘
Total: 2-3 RTT                 Total: 1 RTT (0 resumed)

Connection Migration. TCP connections are identified by the 4-tuple: source IP, source port, destination IP, destination port. If your phone switches from WiFi to cellular, your IP changes, and all TCP connections break. QUIC connections are identified by a connection ID. When your IP changes, the connection continues QUIC recognizes it's still you.

When to Use HTTP/3:

High-latency networks where connection setup cost matters
Lossy networks where TCP head-of-line blocking hurts
Mobile clients that frequently switch networks
When reduced connection establishment latency impacts user experience

gRPC: High-Performance Service-to-Service Communication

gRPC is a remote procedure call framework that combines HTTP/2 with Protocol Buffers (a binary serialization format) and code generation.

The Stack:

┌────────────────────────────────────┐
│ Application Code                   │
│ (Your business logic)              │
├────────────────────────────────────┤
│ Generated Stubs                    │
│ (Type-safe client/server methods)  │
├────────────────────────────────────┤
│ gRPC Framework                     │
│ (Framing, deadline propagation)    │
├────────────────────────────────────┤
│ Protocol Buffers                   │
│ (Binary serialization)             │
├────────────────────────────────────┤
│ HTTP/2                             │
│ (Multiplexing, flow control)       │
├────────────────────────────────────┤
│ TLS                                │
├────────────────────────────────────┤
│ TCP                                │
└────────────────────────────────────┘

Why gRPC for Microservices:

Binary Protocol: Protocol Buffers are 3-10× smaller than JSON and faster to parse. When services exchange millions of messages per second, this matters.
Strong Typing: You define services in .proto files and generate client/server code. Compile-time checking catches mismatches that JSON wouldn't catch until runtime.
Streaming: gRPC natively supports four patterns:
- Unary: Request → Response (like REST)
- Server streaming: Request → Stream of Responses
- Client streaming: Stream of Requests → Response
- Bidirectional streaming: Stream ↔ Stream
Deadline Propagation: When Service A calls Service B with a 5-second deadline, and Service B calls Service C, the remaining deadline propagates. If 4 seconds have already elapsed, Service C knows it only has 1 second.
Built-in Load Balancing: gRPC clients can perform client-side load balancing across backend instances.

gRPC Example:

protobuf
// user.proto
syntax = "proto3";
package user;

service UserService {
  rpc GetUser(GetUserRequest) returns (User);
  rpc ListUsers(ListUsersRequest) returns (stream User);
}

message GetUserRequest {
  string user_id = 1;
}

message User {
  string user_id = 1;
  string name = 2;
  string email = 3;
}

message ListUsersRequest {
  int32 page_size = 1;
}

Run

protoc --go_out=. --go-grpc_out=. user.proto

and you get generated Go code with type-safe client stubs and server interfaces.

gRPC vs REST Tradeoffs:

Aspect	gRPC	REST
Payload size	Smaller (Protobuf)	Larger (JSON)
Parse speed	Faster	Slower
Browser support	Requires grpc-web proxy	Native
Debuggability	Harder (binary)	Easier (text)
Tooling	Specialized	Abundant
API evolution	Schema-driven	Ad-hoc

Use gRPC for internal services where performance matters and teams can share proto definitions. Use REST for public APIs where broad client support and debuggability matter.

Network Partitions: When the Network Splits

A network partition occurs when network failure divides nodes into groups that cannot communicate with each other.

Before Partition:
┌─────────────────────────────────────┐
│  A ◄───► B ◄───► C ◄───► D ◄───► E  │
│  All nodes can reach all others     │
└─────────────────────────────────────┘

During Partition:
┌─────────────────────────────────────┐
│  A ◄───► B ◄───► C    ║    D ◄───► E│
│                       ║             │
│  Partition 1: A,B,C   ║  Part 2: D,E│
│  Cannot reach D,E     ║  Cannot reach│
│                       ║  A,B,C      │
└─────────────────────────────────────┘

Types of Partitions:

Complete: No communication between partitions
Partial: Some paths work, others don't
Asymmetric: A can reach B, but B cannot reach A

Real-World Causes:

Switch or router failure
Misconfigured firewall rules
Network congestion causing massive packet loss
Submarine cable damage (2008 Mediterranean cable cut affected millions)
Cloud provider issues (AWS us-east-1 has had multiple network partitions)

Handling Strategies:

Strategy 1: Majority Quorum (CP Systems) Only the partition containing a majority of nodes continues accepting writes. Minority partitions become read-only or unavailable.

5-node cluster during partition:
[A B C] | [D E]
  3/5   |  2/5

A,B,C have majority → continue accepting writes
D,E have minority → reject writes, may serve stale reads

Used by: ZooKeeper, etcd, Consul

Strategy 2: Accept All Writes (AP Systems) Both partitions accept writes. When the partition heals, conflicts are resolved through merge strategies (last-write-wins, vector clocks, CRDTs).

[A B C] | [D E]
  ✓     |   ✓
writes  | writes

When healed: merge changes, resolve conflicts

Used by: Cassandra, DynamoDB, CouchDB

Strategy 3: Designated Primary Pre-assign one region as primary. During partition, only primary accepts writes; others become read-only.

[Primary A B] | [Secondary C D]
     ✓        |       ✗
   writes     |   read-only

Simple but primary is single point of write availability

Used by: MySQL replication, Redis Sentinel

DNS: Service Discovery Foundation

DNS (Domain Name System) translates human-readable names to IP addresses. In distributed systems, DNS is often the foundation of service discovery.

DNS Resolution for api.example.com:

Browser                    Resolver                    Root DNS         TLD DNS        Auth DNS
   │                          │                           │                │               │
   │  "api.example.com?"      │                           │                │               │
   │ ────────────────────────►│                           │                │               │
   │                          │  "Where's .com?"          │                │               │
   │                          │ ─────────────────────────►│                │               │
   │                          │  "Ask a.gtld-servers.net" │                │               │
   │                          │ ◄─────────────────────────│                │               │
   │                          │                           │                │               │
   │                          │  "Where's example.com?"   │                │               │
   │                          │ ──────────────────────────────────────────►│               │
   │                          │  "Ask ns1.example.com"    │                │               │
   │                          │ ◄──────────────────────────────────────────│               │
   │                          │                           │                │               │
   │                          │  "Where's api.example.com?"                │               │
   │                          │ ─────────────────────────────────────────────────────────►│
   │                          │  "192.0.2.1"              │                │               │
   │                          │ ◄─────────────────────────────────────────────────────────│
   │  "192.0.2.1"             │                           │                │               │
   │ ◄────────────────────────│                           │                │               │

DNS for Load Balancing: Return multiple A records; clients pick one:

api.example.com → 192.0.2.1, 192.0.2.2, 192.0.2.3

DNS for Failover: Health-checked DNS returns only healthy endpoints.

DNS for Geo-Routing: Return different IPs based on client location:

Client in Europe: api.example.com → 203.0.113.1 (Frankfurt)
Client in Asia:   api.example.com → 198.51.100.1 (Tokyo)

DNS Caveats:

TTL Compliance: Clients are supposed to honor the Time-To-Live, but many don't. Java's JVM caches DNS forever by default unless configured otherwise. Browsers have their own caching logic.
Propagation Delay: DNS changes take time to propagate. Even with a 60-second TTL, some clients won't see changes for minutes or hours.
Not for Rapid Failover: DNS isn't designed for sub-second failover. Use load balancers for that.

TLS and mTLS: Securing Communication

TLS (Transport Layer Security) encrypts communication and authenticates servers. mTLS (mutual TLS) adds client authentication.

TLS 1.3 Handshake:

Client                                        Server
   │                                             │
   │  ClientHello + KeyShare                     │
   │  (supported ciphers, key material)          │
   │ ───────────────────────────────────────────►│
   │                                             │
   │  ServerHello + KeyShare                     │
   │  Certificate (server's identity)            │
   │  CertificateVerify (signature)              │
   │  Finished                                   │
   │ ◄───────────────────────────────────────────│
   │                                             │
   │  Finished                                   │
   │ ───────────────────────────────────────────►│
   │                                             │
   │  ════════ Encrypted Communication ══════════│

TLS 1.3 improvements over TLS 1.2:

1-RTT handshake (vs 2-RTT)
0-RTT resumption for repeat connections
Removed weak ciphers
Forward secrecy required

mTLS for Service-to-Service:

In mTLS, both sides present certificates:

Service A                                     Service B
   │                                             │
   │  ClientHello                                │
   │ ───────────────────────────────────────────►│
   │                                             │
   │  ServerHello + ServerCertificate            │
   │  CertificateRequest                         │
   │ ◄───────────────────────────────────────────│
   │                                             │
   │  ClientCertificate                          │
   │  (A proves its identity to B)               │
   │ ───────────────────────────────────────────►│
   │                                             │
   │  ════════ Both sides authenticated ══════════│

Why mTLS for Microservices:

Zero Trust: Don't assume internal network is safe
Identity: Services prove who they are, not just what IP they have
Encryption: Even internal traffic is encrypted
Authorization: Based on verified service identity, not IP

Service meshes like Istio and Linkerd automate mTLS certificate management, issuing short-lived certificates and rotating them automatically.

Linux Network Tuning

Default Linux network settings are designed for general-purpose computing. High-performance distributed systems often need tuning.

Connection Handling:

bash
# Maximum connections waiting to be accepted
net.core.somaxconn = 65535

# SYN backlog (incomplete connections during handshake)
net.ipv4.tcp_max_syn_backlog = 65535

# Enable SYN cookies (DDoS protection)
net.ipv4.tcp_syncookies = 1

Buffer Sizes (for high-bandwidth links):

bash
# Maximum socket receive/send buffer
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

# Default socket receive/send buffer
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576

# TCP-specific buffers (min, default, max)
net.ipv4.tcp_rmem = 4096 1048576 16777216
net.ipv4.tcp_wmem = 4096 1048576 16777216

TIME_WAIT Handling:

bash
# Reuse TIME_WAIT sockets for new connections
net.ipv4.tcp_tw_reuse = 1

# Maximum TIME_WAIT sockets (prevent memory exhaustion)
net.ipv4.tcp_max_tw_buckets = 1440000

Keepalive (detecting dead connections):

bash
# Start keepalive probes after 600s idle
net.ipv4.tcp_keepalive_time = 600

# Send probes every 60s
net.ipv4.tcp_keepalive_intvl = 60

# Drop connection after 3 failed probes
net.ipv4.tcp_keepalive_probes = 3

🏗️ Architecture

Let me show how these network concepts come together in a real microservices architecture:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    NETWORK-AWARE MICROSERVICES ARCHITECTURE                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                               │
│  INTERNET                                                                     │
│      │                                                                        │
│      ▼                                                                        │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │  EDGE LAYER                                                             │ │
│  │  • CDN (Cloudflare): TLS termination, DDoS protection, caching         │ │
│  │  • Protocol: HTTP/3 (QUIC) for browsers, HTTP/2 fallback               │ │
│  │  • Global anycast: user → nearest PoP                                  │ │
│  └───────────────────────────────┬────────────────────────────────────────┘ │
│                                  │ HTTPS (TLS 1.3)                          │
│                                  ▼                                           │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │  LOAD BALANCER (AWS NLB/ALB)                                            │ │
│  │  • Connection draining on deploy                                        │ │
│  │  • Health checks every 10s                                              │ │
│  │  • Least-connections routing                                            │ │
│  │  • Cross-zone load balancing                                            │ │
│  └───────────────────────────────┬────────────────────────────────────────┘ │
│                                  │                                           │
│                                  ▼                                           │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │  API GATEWAY (Envoy / Kong)                                             │ │
│  │  • Protocol: HTTP/2 (connection pooling to backends)                    │ │
│  │  • Connection pool: 100 connections per backend                         │ │
│  │  • Timeout: 30s gateway timeout                                         │ │
│  │  • Rate limiting: 1000 req/s per API key                               │ │
│  └───────────────────────────────┬────────────────────────────────────────┘ │
│                                  │                                           │
│      ┌───────────────────────────┼───────────────────────────┐              │
│      │                           │                           │              │
│      ▼                           ▼                           ▼              │
│  ┌──────────┐             ┌──────────┐              ┌──────────┐           │
│  │ User Svc │◄──gRPC─────►│Order Svc │◄───gRPC────►│Payment   │           │
│  │          │             │          │              │ Svc      │           │
│  │ Pool: 50 │             │ Pool: 50 │              │ Pool: 20 │           │
│  │ Timeout: │             │ Timeout: │              │ Timeout: │           │
│  │  5s      │             │  10s     │              │  30s     │           │
│  └────┬─────┘             └────┬─────┘              └────┬─────┘           │
│       │                        │                         │                  │
│       │ TCP connection pool    │                         │                  │
│       │ to database            │                         │                  │
│       ▼                        ▼                         ▼                  │
│  ┌──────────┐             ┌──────────┐              ┌──────────┐           │
│  │PostgreSQL│             │PostgreSQL│              │  Stripe  │           │
│  │ Primary  │             │ Primary  │              │   API    │           │
│  │ + 2      │             │ + 2      │              │          │           │
│  │ Replicas │             │ Replicas │              │(External)│           │
│  └──────────┘             └──────────┘              └──────────┘           │
│                                                                               │
│  INTERNAL NETWORK DETAILS:                                                    │
│  • All internal: mTLS (mutual TLS)                                           │
│  • Service discovery: Kubernetes DNS (service.namespace.svc.cluster.local)   │
│  • Protocol: gRPC (HTTP/2 + Protobuf)                                        │
│  • Circuit breaker: 50% errors in 10s → open for 30s                        │
│  • Retry policy: 3 retries, exponential backoff, 100ms-1s                   │
│                                                                               │
└─────────────────────────────────────────────────────────────────────────────┘

Connection Flow: User loads order history.

Browser → CDN (HTTP/3 QUIC): User's browser connects to nearest CDN PoP using HTTP/3. If browser doesn't support HTTP/3, falls back to HTTP/2. QUIC's 0-RTT resumption means repeat visits establish connection instantly.
CDN → Load Balancer (HTTPS): CDN forwards to origin over persistent TLS 1.3 connection. CDN maintains a pool of connections to avoid handshake latency for each request.
Load Balancer → API Gateway (HTTP/2): Load balancer picks an API Gateway instance based on least connections. Connection is reused across requests.
API Gateway → Order Service (gRPC): Gateway maintains a pool of 100 gRPC connections to Order Service. Request uses existing connection (no handshake). Timeout set to 10 seconds.
Order Service → User Service (gRPC): Order Service needs user details to enrich the order data. Calls User Service over gRPC. Pool of 50 connections. Timeout 5 seconds. If User Service is slow, Order Service might proceed with partial data.
Services → Databases (TCP): Each service maintains a connection pool to its database. Pool size 20-50 connections. Connections are persistent and reused.

Failure Handling:

User Service down: Circuit breaker opens after 50% failures in 10s. Order Service returns order data without user enrichment. Returns partial result with warning.
Database partition: If primary is unreachable, connection fails immediately (no long timeout). Service returns error to caller. Caller can retry to a different instance.
Network congestion: TCP congestion control kicks in, reducing throughput. Requests take longer. Eventually, requests timeout. Client gets clear timeout error.

⚖️ Tradeoffs

Every network optimization involves tradeoffs.

Connection Pooling: Complexity vs Efficiency

Connection pooling eliminates handshake latency but requires careful management. Pools can leak connections if errors aren't handled properly. Pools can exhaust if concurrency exceeds pool size, causing requests to wait. Pools must handle unhealthy connections (what if a connection is open but the server behind it has crashed?). The pool size must be tuned too small and requests queue, too large and you waste resources.

HTTP/2 Multiplexing: Simplicity vs Head-of-Line Blocking Risk

HTTP/2's single connection is simpler to manage than HTTP/1.1's multiple connections. But on lossy networks, packet loss stalls all streams. For high-reliability needs on unreliable networks, you might want multiple HTTP/2 connections or HTTP/3.

gRPC: Performance vs Debuggability

gRPC's binary protocol is efficient but hard to inspect. You can't curl a gRPC endpoint. You need special tools to decode Protobuf. For public APIs where developers need to debug easily, REST is often better despite the performance cost.

Aggressive Timeouts: Fast Failure vs False Positives

Short timeouts (100ms) prevent slow services from blocking callers. But if legitimate requests occasionally take 150ms, you'll have spurious failures. You must measure actual latency distributions before setting timeouts.

TLS Everywhere: Security vs Latency

TLS adds handshake latency (1-2 RTT) and CPU cost (encryption). For internal services, some argue plain TCP is fine since the network is "trusted." But network trust is increasingly considered outdated. The security benefit usually outweighs the cost, especially with TLS 1.3's reduced overhead.

Choice	Benefit	Cost
Connection pooling	No handshake latency	Pool management complexity
HTTP/2	Multiplexing, fewer connections	TCP head-of-line blocking
HTTP/3 (QUIC)	Stream independence, faster setup	Less mature, UDP blocking
gRPC	Performance, streaming	Browser incompatibility, debugging
mTLS	Identity, encryption	Certificate management
Short timeouts	Fast failure detection	False positive failures
Large buffers	High throughput on fat pipes	Memory usage

📝 Summary

We started with a microservices architecture that looked good on paper but failed in production because we didn't understand network behavior. Every arrow between boxes on our architecture diagram represented complex network interactions: TCP handshakes, TLS negotiations, congestion control, potential packet loss, and latency variability. By treating the network as a simple pipe, we were blind to the failure modes that brought down our system.

The network stack has layers that matter for distributed systems. TCP provides reliability through acknowledgments and retransmissions but adds latency through the three-way handshake and slow start. TCP's flow control and congestion control adapt to receiver and network capacity but can limit throughput on high-latency links. TIME_WAIT state prevents packet confusion but can exhaust ports at high connection rates.

HTTP evolved to address real problems. HTTP/1.1's head-of-line blocking forced browsers to open multiple connections. HTTP/2 multiplexes streams on a single connection but inherits TCP's head-of-line blocking. HTTP/3 runs on QUIC over UDP, providing stream independence and faster connection establishment.

gRPC combines HTTP/2 with Protocol Buffers for efficient, type-safe service-to-service communication. It excels for internal services but isn't suitable for browser-facing APIs.

Network partitions are inevitable at scale. Systems must choose how to handle them: majority quorum sacrifices availability for consistency, accept-all-writes sacrifices consistency for availability, designated primary keeps it simple at the cost of single-region write availability.

TLS secures communication and should be used everywhere. mTLS adds client authentication, enabling identity-based security for service-to-service calls. Service meshes automate mTLS deployment.

Linux network tuning optimizes for high-performance workloads: larger socket buffers for high-bandwidth links, connection handling parameters for high connection rates, TIME_WAIT settings for high churn.

Key Takeaways:

TCP's three-way handshake adds 1.5 RTT latency connection pooling eliminates this for persistent connections
TCP congestion control limits throughput on new connections connection reuse maintains ramp-up state
HTTP/2 multiplexes streams but TCP packet loss stalls everything HTTP/3 provides stream independence
gRPC (HTTP/2 + Protobuf) is ideal for internal services but not for browser clients
Network partitions require explicit handling: choose CP (quorum), AP (accept writes), or primary-based
TLS everywhere; mTLS for service-to-service authentication
Tune Linux network parameters for your workload: buffers, backlogs, TIME_WAIT
Measure actual network behavior; don't assume based on theory

❓ Questions to Think About

1. Your gRPC service has a 100-connection pool to a backend. You observe that during traffic spikes, requests wait for available connections. Increasing the pool size doesn't help. What might be happening, and how would you investigate?

This explores connection pool saturation, but the twist that increasing pool size doesn't help suggests the bottleneck might be elsewhere perhaps the backend can't handle more concurrent requests, or there's a network limit. Investigating requires understanding whether the pool is actually exhausted or if requests are slow for other reasons.

2. Your HTTP/2 service shows high latency variance p50 is 20ms but p99 is 500ms. The backend service has consistent 15ms response time. What network factors could explain the variance?

This prompts thinking about packet loss causing retransmission delays, TCP slow start on fresh connections, TLS handshake on new connections, or kernel scheduling delays. Understanding where variance comes from is crucial for optimization.

3. You're choosing between HTTP/2 and HTTP/3 for a mobile app with global users. Users frequently switch between WiFi and cellular. What factors would influence your decision?

This tests understanding of QUIC's connection migration capability (survives network changes), HTTP/2's TCP connection breakage on IP change, and the relative maturity of HTTP/3 support. Real-world factors like firewall UDP blocking also matter.

4. Your service-to-service call has a 5-second timeout. During an incident, you see that calls are taking 4.9 seconds before returning success. Is this a problem, and how might it escalate?

This explores timeout margins and cascading delays. If every call in a chain uses nearly its full timeout, the overall request time becomes the sum of timeouts. A slowdown in one service can cause the entire call chain to timeout.

5. After deploying a new service version, your database connection pool reports many broken connections. What happened, and how would you prevent this in the future?

This is about connection draining during deployments. When you restart a service, in-flight requests are aborted, and connection pools detect broken connections. Solutions include graceful shutdown, connection validation, and load balancer connection draining.

6. You're designing a system where Region A must remain operational if Region B goes down, and vice versa. Your current design has a single globally-consistent database. What changes are needed?

This requires reasoning about network partitions between regions. A single consistent database can't be writable in both regions during partition. The solution requires either accepting eventual consistency or pre-designating one region as primary for writes.

7. Your team proposes increasing TCP socket buffer sizes from 64KB to 16MB to improve throughput. What could go wrong, and what would you measure before making this change?

This tests understanding of memory usage tradeoffs (16MB × thousands of connections = significant memory), whether the bottleneck is actually buffer-related (only matters for high-BDP paths), and the importance of measuring before optimizing.

8. A service uses DNS for discovery with a 60-second TTL. After a failover, it takes 5 minutes for traffic to fully shift. What's happening?

This explores DNS caching beyond TTL (JVM caches forever by default, some applications ignore TTL), recursive resolver behavior, and why DNS isn't suitable for rapid failover. Solutions include shorter TTLs, health-checked DNS, or switching to service mesh discovery.

9. Your mTLS deployment uses certificates with 1-year validity. A security audit recommends 24-hour certificates. What operational challenges does this create?

Short-lived certificates require automated renewal. If renewal fails, services lose identity and can't communicate. This requires robust certificate infrastructure, monitoring, and graceful handling of renewal failures. The tradeoff is between blast radius of a compromised certificate and operational complexity.

10. You observe that your service has 100,000 sockets in TIME_WAIT state, using 400MB of kernel memory. Your team proposes setting tcp_tw_recycle=1 to fix this. Why is this dangerous, and what alternatives would you suggest?

tcp_tw_recycle is dangerous because it uses timestamps for TIME_WAIT recycling, which breaks for clients behind NAT (which share source IPs but have different timestamps). It was removed from Linux 4.12. Better solutions: connection pooling, server-side close, tcp_tw_reuse (safer), or more source IPs.