Module 3: Serialization and Data Exchange

Why Serialization Matters in Distributed Systems

In distributed systems, data must travel between processes, machines, and data centers. Serialization converts in-memory data structures into a format that can be transmitted and reconstructed on the receiving end.
┌─────────────────────────────────────────────────────────────────┐ │ SERIALIZATION PIPELINE │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ Service A Network Service B │ │ ┌─────────┐ ┌─────────┐│ │ │ Object │ │ Object ││ │ │ in │──► Serialize ──► [Bytes] ──► Deserialize ──►│ in ││ │ │ Memory │ │ │ Memory ││ │ └─────────┘ │ └─────────┘│ │ │ │ │ ┌──────┴──────┐ │ │ │ Wire Format │ │ │ │ (JSON/Proto │ │ │ │ /Avro/etc) │ │ │ └─────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘

The Cost of Serialization

Serialization isn't free. In high-throughput systems, it can consume:
  • 10-30% of CPU cycles in microservices
  • Significant memory for intermediate buffers
  • Latency especially for large payloads
┌─────────────────────────────────────────────────────────────────┐ │ SERIALIZATION OVERHEAD IN REAL SYSTEMS │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ LinkedIn (2015): 15-20% of CPU spent on JSON serialization │ │ Twitter: Moved to Thrift, saved 10-15% CPU │ │ Uber: Protocol Buffers reduced payload size by 60% │ │ Netflix: Custom serialization for specific hot paths │ │ │ └─────────────────────────────────────────────────────────────────┘

1. JSON: The Universal Format

Why JSON Dominates

JSON (JavaScript Object Notation) became the de facto standard for web APIs because:
  • Human-readable
  • Language-agnostic
  • No schema required
  • Native JavaScript support
  • Excellent tooling
json
{ "user_id": 12345, "name": "Alice", "email": "alice@example.com", "created_at": "2024-01-15T10:30:00Z", "roles": ["admin", "developer"], "settings": { "theme": "dark", "notifications": true } }

JSON's Hidden Costs

┌─────────────────────────────────────────────────────────────────┐ │ JSON OVERHEAD ANALYSIS │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ Raw Data: │ │ ┌───────────────────────────────────────────────────────┐ │ │ │ user_id: 12345 (4 bytes as int32) │ │ │ │ name: "Alice" (5 bytes + length prefix) │ │ │ │ active: true (1 bit logically, 1 byte typically) │ │ │ └───────────────────────────────────────────────────────┘ │ │ Optimal binary: ~15 bytes │ │ │ │ JSON Representation: │ │ ┌───────────────────────────────────────────────────────┐ │ │ │ {"user_id":12345,"name":"Alice","active":true} │ │ │ └───────────────────────────────────────────────────────┘ │ │ JSON size: 47 bytes │ │ │ │ OVERHEAD: 3.1x larger than optimal binary! │ │ │ │ Sources of overhead: │ │ • Field names repeated in every message │ │ • Numbers encoded as ASCII strings │ │ • Quotes, colons, commas, braces │ │ • No native binary support (must base64 encode) │ │ │ └─────────────────────────────────────────────────────────────────┘

JSON Parsing Performance

go
// Benchmark: JSON vs alternatives in Go // Parsing 1 million messages type User struct { UserID int64 `json:"user_id"` Name string `json:"name"` Email string `json:"email"` Roles []string `json:"roles"` CreatedAt string `json:"created_at"` } // Results on M1 MacBook Pro: // // JSON (encoding/json): ~850ms, 450MB allocations // JSON (json-iterator): ~400ms, 280MB allocations // JSON (easyjson codegen): ~180ms, 120MB allocations // Protocol Buffers: ~120ms, 80MB allocations // MessagePack: ~150ms, 100MB allocations // FlatBuffers: ~20ms, 0 allocations (zero-copy)

When to Use JSON

Good for:
  • Public REST APIs (ubiquitous client support)
  • Configuration files (human readability)
  • Development/debugging (easy inspection)
  • Low-volume internal APIs
  • Browser-to-server communication
Avoid for:
  • High-throughput internal services (>10K RPS)
  • Large payloads (>100KB)
  • Binary data transfer
  • Latency-critical paths
  • Mobile apps with bandwidth constraints

2. Protocol Buffers: Google's Workhorse

Protocol Buffers (protobuf) is Google's language-neutral, platform-neutral serialization format. It's used in virtually all Google services.

How Protocol Buffers Work

protobuf
// user.proto syntax = "proto3"; package user; message User { int64 user_id = 1; // Field number 1 string name = 2; // Field number 2 string email = 3; // Field number 3 repeated string roles = 4; google.protobuf.Timestamp created_at = 5; message Settings { string theme = 1; bool notifications = 2; } Settings settings = 6; }

Wire Format Deep Dive

┌─────────────────────────────────────────────────────────────────┐ │ PROTOCOL BUFFERS WIRE FORMAT │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ Each field encoded as: [Tag][Value] │ │ │ │ Tag = (field_number << 3) | wire_type │ │ │ │ Wire Types: │ │ ┌────────┬──────────────────────────────────────────────┐ │ │ │ Type │ Description │ │ │ ├────────┼──────────────────────────────────────────────┤ │ │ │ 0 │ Varint (int32, int64, uint32, uint64, bool) │ │ │ │ 1 │ 64-bit (fixed64, sfixed64, double) │ │ │ │ 2 │ Length-delimited (string, bytes, messages) │ │ │ │ 5 │ 32-bit (fixed32, sfixed32, float) │ │ │ └────────┴──────────────────────────────────────────────┘ │ │ │ │ Example: user_id = 12345 (field 1, int64) │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ Tag: 0x08 (field 1, wire type 0) │ │ │ │ Value: 0xB9 0x60 (varint encoding of 12345) │ │ │ └─────────────────────────────────────────────────────────┘ │ │ Total: 3 bytes (vs 17 bytes in JSON for "user_id":12345) │ │ │ └─────────────────────────────────────────────────────────────────┘

Varint Encoding

┌─────────────────────────────────────────────────────────────────┐ │ VARINT ENCODING │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ Variable-length integer encoding (smaller numbers = fewer bytes)│ │ │ │ Number 1: 0x01 (1 byte) │ │ Number 127: 0x7F (1 byte) │ │ Number 128: 0x80 0x01 (2 bytes) │ │ Number 300: 0xAC 0x02 (2 bytes) │ │ Number 12345: 0xB9 0x60 (2 bytes) │ │ │ │ MSB of each byte indicates if more bytes follow: │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ 12345 in binary: 11 0000 0011 1001 │ │ │ │ │ │ │ │ Split into 7-bit groups (right to left): │ │ │ │ Group 1: 011 1001 (57) │ │ │ │ Group 2: 110 0000 (96) │ │ │ │ │ │ │ │ Add continuation bits: │ │ │ │ Byte 1: 1 0111001 = 0xB9 (has more) │ │ │ │ Byte 2: 0 1100000 = 0x60 (final) │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ Efficiency: Numbers 0-127 always fit in 1 byte! │ │ (Most IDs, counts, and flags fall in this range) │ │ │ └─────────────────────────────────────────────────────────────────┘

Schema Evolution: The Killer Feature

┌─────────────────────────────────────────────────────────────────┐ │ PROTOBUF SCHEMA EVOLUTION RULES │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ SAFE CHANGES (Backward + Forward Compatible): │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ ✓ Add new optional fields (new field numbers) │ │ │ │ ✓ Remove fields (old readers ignore unknown fields) │ │ │ │ ✓ Rename fields (wire format uses numbers, not names) │ │ │ │ ✓ Change int32 ↔ int64, uint32 ↔ uint64 │ │ │ │ ✓ Change singular to repeated (if non-packed) │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ UNSAFE CHANGES (Will Break Compatibility): │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ ✗ Change field numbers │ │ │ │ ✗ Change wire types (string → int32) │ │ │ │ ✗ Change between signed and unsigned │ │ │ │ ✗ Reuse deleted field numbers │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ RESERVED FIELDS (Prevent Reuse): │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ message User { │ │ │ │ reserved 2, 15, 9 to 11; // Don't reuse these │ │ │ │ reserved "foo", "bar"; // Don't reuse these names │ │ │ │ } │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘

Real-World Evolution Example

protobuf
// Version 1 (2020) message Order { int64 order_id = 1; int64 user_id = 2; repeated Item items = 3; int32 total_cents = 4; } // Version 2 (2021) - Added shipping info message Order { int64 order_id = 1; int64 user_id = 2; repeated Item items = 3; int32 total_cents = 4; ShippingAddress shipping = 5; // NEW FIELD string tracking_number = 6; // NEW FIELD } // Version 3 (2022) - Deprecated total_cents, use Money type message Order { int64 order_id = 1; int64 user_id = 2; repeated Item items = 3; int32 total_cents = 4 [deprecated = true]; // Keep for compatibility ShippingAddress shipping = 5; string tracking_number = 6; Money total = 7; // NEW: structured money type } // All versions can communicate: // - V1 readers ignore fields 5, 6, 7 // - V3 readers handle missing fields 5, 6, 7 gracefully // - V3 writers still populate field 4 for V1 readers

Code Generation and Usage

# Generate Go code from proto files protoc --go_out=. --go-grpc_out=. user.proto # Generated code structure # user.pb.go - Message types and serialization # user_grpc.pb.go - gRPC service stubs

3. Apache Avro: Schema Registry Power

Avro is the serialization format of choice in the Kafka ecosystem. Its killer feature: schemas stored separately from data.

Avro Schema Definition

json
{ "namespace": "com.example", "type": "record", "name": "User", "fields": [ {"name": "user_id", "type": "long"}, {"name": "name", "type": "string"}, {"name": "email", "type": ["null", "string"], "default": null}, { "name": "roles", "type": {"type": "array", "items": "string"}, "default": [] }, { "name": "settings", "type": { "type": "record", "name": "Settings", "fields": [ {"name": "theme", "type": "string", "default": "light"}, {"name": "notifications", "type": "boolean", "default": true} ] } } ] }

Schema Registry Pattern

┌─────────────────────────────────────────────────────────────────┐ │ SCHEMA REGISTRY ARCHITECTURE │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ ┌─────────────────────┐ │ │ │ Schema Registry │ │ │ │ (Confluent/Apicurio)│ │ │ └──────────┬──────────┘ │ │ │ │ │ ┌─────────────────────┼─────────────────────┐ │ │ │ 1. Register │ 3. Fetch │ │ │ │ Schema │ Schema │ │ │ │ │ │ │ │ ┌─────┴─────┐ ┌─────┴─────┐ ┌─────┴─────┐ │ │ │ Producer │ │ Kafka │ │ Consumer │ │ │ │ │────────►│ Topic │────────►│ │ │ │ └───────────┘ └───────────┘ └───────────┘ │ │ │ │ │ │ │ │ │ │ │ │ 2. Send message Messages contain 4. Deserialize │ │ with schema ID schema ID prefix using fetched │ │ (4 bytes) schema │ │ │ │ Message Wire Format: │ │ ┌────────┬──────────┬─────────────────────────────┐ │ │ │ Magic │ Schema │ Avro Binary Data │ │ │ │ Byte │ ID │ │ │ │ │ (1B) │ (4B) │ (Variable) │ │ │ └────────┴──────────┴─────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘

Why Schema Registry Matters

┌─────────────────────────────────────────────────────────────────┐ │ SCHEMA REGISTRY BENEFITS │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ 1. SCHEMA EVOLUTION VALIDATION │ │ Registry enforces compatibility rules before accepting │ │ new schema versions │ │ │ │ 2. SCHEMA REUSE │ │ Same schema used by millions of messages │ │ Only 5 bytes overhead per message (magic + ID) │ │ │ │ 3. DECOUPLED PRODUCERS/CONSUMERS │ │ Consumers can read any message if they have registry access │ │ No coordination needed during deployments │ │ │ │ 4. SCHEMA DISCOVERY │ │ Self-documenting system - registry is source of truth │ │ for all data formats │ │ │ │ Compatibility Modes: │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ BACKWARD: New schema can read old data │ │ │ │ FORWARD: Old schema can read new data │ │ │ │ FULL: Both backward and forward compatible │ │ │ │ NONE: No compatibility checking │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘

Avro vs Protocol Buffers

┌─────────────────────────────────────────────────────────────────┐ │ AVRO vs PROTOCOL BUFFERS │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ Feature │ Avro │ Protocol Buffers │ │ ─────────────────────┼───────────────────┼─────────────────────│ │ Schema in message │ No (external) │ Yes (via .proto) │ │ Dynamic typing │ Yes │ No (needs codegen) │ │ Wire format size │ Slightly smaller │ Slightly larger │ │ Field identification │ Position-based │ Number-based │ │ Best for │ Data pipelines │ RPC/APIs │ │ Ecosystem │ Hadoop/Kafka │ gRPC/Microservices │ │ Schema definition │ JSON │ .proto DSL │ │ Code generation │ Optional │ Required │ │ │ │ When to choose Avro: │ │ • Kafka-based streaming pipelines │ │ • Big data processing (Spark, Flink) │ │ • Schema evolution is frequent │ │ • Dynamic/scripting languages │ │ │ │ When to choose Protobuf: │ │ • gRPC services │ │ • Performance-critical RPC │ │ • Strong typing required │ │ • Cross-service contracts │ │ │ └─────────────────────────────────────────────────────────────────┘

4. Other Serialization Formats

MessagePack: Binary JSON

┌─────────────────────────────────────────────────────────────────┐ │ MESSAGEPACK │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ "JSON but binary" - same data model, smaller size │ │ │ │ JSON: {"compact":true,"schema":0} (27 bytes) │ │ MessagePack: 82 A7 compact C3 A6 schema 00 (18 bytes) │ │ │ │ Encoding rules: │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ Type │ Size │ Format │ │ │ ├───────────┼──────┼──────────────────────────────────────│ │ │ │ nil │ 1B │ 0xc0 │ │ │ │ false │ 1B │ 0xc2 │ │ │ │ true │ 1B │ 0xc3 │ │ │ │ int (0-127)│ 1B │ 0x00-0x7f (positive fixint) │ │ │ │ string<32 │ 1B+N │ 0xa0-0xbf (fixstr) │ │ │ │ map<16 │ 1B+N │ 0x80-0x8f (fixmap) │ │ │ └───────────┴──────┴──────────────────────────────────────┘ │ │ │ │ Best for: │ │ • Redis value storage │ │ • Real-time communication (Socket.IO) │ │ • When you want JSON compatibility with less size │ │ │ │ Used by: Redis, Fluentd, Pinterest │ │ │ └─────────────────────────────────────────────────────────────────┘

FlatBuffers: Zero-Copy Access

┌─────────────────────────────────────────────────────────────────┐ │ FLATBUFFERS │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ Revolutionary concept: Access data WITHOUT deserialization │ │ │ │ Traditional (Protobuf/JSON): │ │ ┌─────────┐ ┌──────────────┐ ┌─────────┐ │ │ │ Receive │───►│ Deserialize │───►│ Access │ │ │ │ Bytes │ │ to Objects │ │ Fields │ │ │ └─────────┘ └──────────────┘ └─────────┘ │ │ O(n) time │ │ O(n) memory │ │ │ │ FlatBuffers: │ │ ┌─────────┐ ┌─────────────────────────────┐ │ │ │ Receive │───►│ Access Fields Directly │ │ │ │ Bytes │ │ (Pointer math on raw bytes) │ │ │ └─────────┘ └─────────────────────────────┘ │ │ O(1) time for field access │ │ O(1) memory (use received buffer) │ │ │ │ Buffer Layout: │ │ ┌─────────┬────────────┬──────────┬────────────┐ │ │ │ Root │ VTable │ Inline │ Offset │ │ │ │ Offset │ (offsets) │ Fields │ Fields │ │ │ └─────────┴────────────┴──────────┴────────────┘ │ │ │ │ Perfect for: │ │ • Games (Unity, Unreal) │ │ • Mobile apps (memory constrained) │ │ • High-frequency trading │ │ • When you only access few fields per message │ │ │ │ Used by: Facebook (Messenger), Google (Android) │ │ │ └─────────────────────────────────────────────────────────────────┘

Apache Thrift

┌─────────────────────────────────────────────────────────────────┐ │ APACHE THRIFT │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ Facebook's solution: IDL + RPC + Serialization all-in-one │ │ │ │ // user.thrift │ │ struct User { │ │ 1: required i64 userId, │ │ 2: required string name, │ │ 3: optional string email, │ │ } │ │ │ │ service UserService { │ │ User getUser(1: i64 userId), │ │ void createUser(1: User user), │ │ } │ │ │ │ Multiple protocols: │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ TBinaryProtocol │ Standard binary (like Protobuf) │ │ │ │ TCompactProtocol │ Variable-length encoding │ │ │ │ TJSONProtocol │ JSON for debugging │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ Historical note: │ │ • Created at Facebook (2007) │ │ • Protobuf open-sourced by Google (2008) │ │ • Now mostly replaced by gRPC/Protobuf in new projects │ │ │ │ Still used by: Apache Cassandra, Apache Hive, Evernote │ │ │ └─────────────────────────────────────────────────────────────────┘

5. Serialization Format Comparison

Size Comparison

┌─────────────────────────────────────────────────────────────────┐ │ PAYLOAD SIZE COMPARISON │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ Same data structure (User with nested settings): │ │ │ │ Format │ Size │ Relative │ Notes │ │ ────────────────┼─────────┼──────────┼─────────────────────────│ │ JSON │ 185B │ 100% │ Baseline │ │ JSON (minified) │ 142B │ 77% │ No whitespace │ │ MessagePack │ 98B │ 53% │ Binary JSON │ │ BSON │ 125B │ 68% │ MongoDB format │ │ Protocol Buffers│ 48B │ 26% │ Google standard │ │ Avro │ 45B │ 24% │ No field names │ │ FlatBuffers │ 72B │ 39% │ Zero-copy overhead │ │ Cap'n Proto │ 56B │ 30% │ Like FlatBuffers │ │ │ │ Note: FlatBuffers larger than Protobuf but faster access │ │ │ └─────────────────────────────────────────────────────────────────┘

Performance Comparison

┌─────────────────────────────────────────────────────────────────┐ │ PERFORMANCE COMPARISON (1M messages) │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ Serialize (smaller is better): │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ JSON (stdlib) ████████████████████████████ 800ms │ │ │ │ JSON (jsoniter) ████████████████ 450ms │ │ │ │ MessagePack ██████████████ 380ms │ │ │ │ Protocol Buffers ████████ 220ms │ │ │ │ FlatBuffers █████ 140ms │ │ │ │ Cap'n Proto ████ 120ms │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ Deserialize (smaller is better): │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ JSON (stdlib) ██████████████████████████████ 900ms │ │ │ │ JSON (jsoniter) ██████████████████ 500ms │ │ │ │ MessagePack ████████████████ 420ms │ │ │ │ Protocol Buffers ██████████ 280ms │ │ │ │ FlatBuffers █ 25ms (zero-copy!) │ │ │ │ Cap'n Proto █ 20ms (zero-copy!) │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ Memory Allocations (deserialize): │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ JSON ████████████████████████ 500MB │ │ │ │ Protocol Buffers ██████████ 200MB │ │ │ │ MessagePack ████████████ 280MB │ │ │ │ FlatBuffers █ ~0MB (uses input buffer) │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘

6. Choosing the Right Format

Decision Framework

┌─────────────────────────────────────────────────────────────────┐ │ SERIALIZATION FORMAT DECISION TREE │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ Is this a public API? │ │ ├── YES ──► Is client a browser? │ │ │ ├── YES ──► JSON (universal support) │ │ │ └── NO ──► JSON or gRPC (provide both) │ │ │ │ │ └── NO (Internal service) ──► Performance critical? │ │ ├── NO ──► JSON (simplicity) │ │ │ │ │ └── YES ──► Need schema? │ │ ├── YES ──► RPC? │ │ │ ├── YES ──► gRPC │ │ │ └── NO ──► Avro │ │ │ │ │ └── NO ──► MessagePack│ │ │ │ Special cases: │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ Kafka/Streaming ──► Avro + Schema Registry │ │ │ │ Games/Real-time ──► FlatBuffers │ │ │ │ Mobile (bandwidth) ──► Protocol Buffers │ │ │ │ Analytics pipeline ──► Parquet/ORC (columnar) │ │ │ │ Config files ──► JSON/YAML (human readable) │ │ │ │ Cross-language RPC ──► gRPC + Protocol Buffers │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘

Real-World Choices

┌─────────────────────────────────────────────────────────────────┐ │ WHAT BIG TECH USES │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ Google: │ │ • Internal RPC: Protocol Buffers (everywhere) │ │ • Public APIs: JSON (Cloud APIs), gRPC (client libraries) │ │ │ │ Meta (Facebook): │ │ • Internal: Thrift (legacy), Protocol Buffers (newer) │ │ • GraphQL: JSON responses │ │ • Mobile: FlatBuffers (Messenger) │ │ │ │ Netflix: │ │ • Microservices: gRPC + Protobuf │ │ • Public API: JSON │ │ • Data pipelines: Avro │ │ │ │ LinkedIn: │ │ • Rest.li framework: JSON (PDSC schema) │ │ • Kafka: Avro │ │ │ │ Uber: │ │ • Internal: Protocol Buffers │ │ • Kafka: Avro + Schema Registry │ │ │ │ Amazon: │ │ • AWS APIs: JSON (REST), Protobuf (gRPC) │ │ • Internal: Ion (Amazon's binary+JSON format) │ │ │ └─────────────────────────────────────────────────────────────────┘

7. Backward and Forward Compatibility

The Compatibility Matrix

┌─────────────────────────────────────────────────────────────────┐ │ COMPATIBILITY TYPES │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ BACKWARD COMPATIBLE: │ │ New code can read old data │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ │ │ │ │ Old Writer ───► [Old Data] ───► New Reader ✓ │ │ │ │ │ │ │ │ Example: Adding new optional field │ │ │ │ V1: {name: "Alice"} │ │ │ │ V2: {name: "Alice", email: null} // New reader handles│ │ │ │ │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ FORWARD COMPATIBLE: │ │ Old code can read new data │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ │ │ │ │ New Writer ───► [New Data] ───► Old Reader ✓ │ │ │ │ │ │ │ │ Example: Old reader ignores unknown fields │ │ │ │ V2: {name: "Alice", email: "a@b.com"} │ │ │ │ V1 reader: {name: "Alice"} // email ignored │ │ │ │ │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ FULL COMPATIBLE: │ │ Both backward AND forward compatible │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ │ │ │ │ Old ◄────────► New │ │ │ │ Both directions work │ │ │ │ │ │ │ │ Required for zero-downtime deployments! │ │ │ │ │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘

Safe Evolution Rules

┌─────────────────────────────────────────────────────────────────┐ │ SAFE SCHEMA EVOLUTION │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ ALWAYS SAFE: │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ ✓ Add optional field with default value │ │ │ │ ✓ Remove optional field (if code handles missing) │ │ │ │ ✓ Rename field (Protobuf: uses numbers, not names) │ │ │ │ ✓ Add new enum value (at end) │ │ │ │ ✓ Widen numeric type (int32 → int64) │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ BREAKING CHANGES: │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ ✗ Remove required field │ │ │ │ ✗ Change field type (string → int) │ │ │ │ ✗ Rename field (Avro: uses names for identification) │ │ │ │ ✗ Change field number (Protobuf) │ │ │ │ ✗ Remove enum value │ │ │ │ ✗ Narrow numeric type (int64 → int32) │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ GRADUAL MIGRATION STRATEGY: │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ 1. Add new field alongside old field │ │ │ │ 2. Deploy writers that populate both fields │ │ │ │ 3. Deploy readers that prefer new field │ │ │ │ 4. Stop populating old field │ │ │ │ 5. Remove old field from schema │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘

8. Binary Data and Special Types

Handling Binary Data

┌─────────────────────────────────────────────────────────────────┐ │ BINARY DATA IN SERIALIZATION │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ Problem: JSON cannot represent arbitrary binary data │ │ │ │ Solution 1: Base64 Encoding │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ Raw bytes: [0x48, 0x65, 0x6C, 0x6C, 0x6F] │ │ │ │ Base64: "SGVsbG8=" │ │ │ │ Overhead: 33% larger │ │ │ │ │ │ │ │ { │ │ │ │ "file_content": "SGVsbG8gV29ybGQh", │ │ │ │ "content_type": "application/octet-stream" │ │ │ │ } │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ Solution 2: Use Binary Format (Protobuf/MessagePack) │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ // Protobuf │ │ │ │ message FileUpload { │ │ │ │ bytes content = 1; // Native binary support │ │ │ │ string filename = 2; │ │ │ │ } │ │ │ │ │ │ │ │ // No encoding overhead - bytes stored directly │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ Solution 3: Hybrid (gRPC streaming) │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ // Large files: use streaming │ │ │ │ rpc UploadFile(stream FileChunk) returns (UploadResult); │ │ │ │ │ │ │ │ message FileChunk { │ │ │ │ bytes data = 1; │ │ │ │ int64 offset = 2; │ │ │ │ } │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘

Date/Time Handling

┌─────────────────────────────────────────────────────────────────┐ │ DATE/TIME IN DISTRIBUTED SYSTEMS │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ JSON Options: │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ ISO 8601: "2024-01-15T10:30:00Z" │ │ │ │ Unix ms: 1705315800000 │ │ │ │ Unix sec: 1705315800 │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ Protobuf (google.protobuf.Timestamp): │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ message Timestamp { │ │ │ │ int64 seconds = 1; // Unix epoch seconds │ │ │ │ int32 nanos = 2; // Nanosecond precision │ │ │ │ } │ │ │ │ │ │ │ │ // Always UTC - no timezone ambiguity │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ Best Practices: │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ ✓ Always store in UTC │ │ │ │ ✓ Include timezone only when user-facing │ │ │ │ ✓ Use ISO 8601 for JSON APIs │ │ │ │ ✓ Use Timestamp for internal services │ │ │ │ ✗ Never trust client-provided timestamps for ordering │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘

Interview Questions

Conceptual Questions

  1. Why does LinkedIn spend 15-20% of CPU on JSON serialization?
    • Every field name repeated in every message
    • Numbers encoded as variable-length ASCII strings
    • No schema = no optimization opportunities
    • Parsing requires character-by-character scanning
    • Memory allocation for intermediate objects
  2. When would you choose Avro over Protocol Buffers?
    • Kafka-centric architecture (native integration)
    • Schema registry needed for governance
    • Dynamic languages (no codegen required)
    • Frequent schema evolution
    • When you need to read data without knowing schema at compile time
  3. Explain FlatBuffers' zero-copy deserialization.
    • Data laid out in memory exactly as it appears on wire
    • Accessing a field = pointer arithmetic on raw bytes
    • No parsing, no object construction, no allocations
    • Trade-off: Slightly larger wire format, read-only access
  4. How does Protobuf achieve backward compatibility?
    • Field numbers (not names) identify fields on wire
    • Unknown fields ignored by old readers
    • Missing fields get default values in new readers
    • Reserved fields prevent accidental reuse

System Design Questions

  1. Design a schema evolution strategy for a payment system processing 1M txns/day.
    • Use Schema Registry with FULL compatibility mode
    • Version all schemas explicitly
    • Dual-write period during migrations
    • Canary deployments to catch compatibility issues
    • Rollback strategy for each schema change
  2. Your service is 40% CPU on JSON serialization. Options?
    • Profile to confirm bottleneck
    • Try drop-in JSON library (jsoniter, simdjson)
    • Consider partial migration to Protobuf for hot paths
    • Evaluate if all fields are needed (projection)
    • Cache serialized representations where possible

Common Mistakes

┌─────────────────────────────────────────────────────────────────┐ │ SERIALIZATION ANTI-PATTERNS │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ ✗ MISTAKE: Serializing internal domain objects directly │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ // DON'T: Internal class exposed on API │ │ │ │ json.Marshal(user) // Leaks internal structure │ │ │ │ │ │ │ │ // DO: Use DTOs (Data Transfer Objects) │ │ │ │ json.Marshal(toUserResponse(user)) │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ ✗ MISTAKE: No versioning strategy │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ // DON'T: Hope for the best │ │ │ │ message User { string name = 1; } │ │ │ │ │ │ │ │ // DO: Plan for evolution │ │ │ │ // Reserved fields, schema registry, version headers │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ ✗ MISTAKE: Ignoring serialization in performance testing │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ // Benchmarks often skip ser/deser │ │ │ │ // Real production: ser/deser can be 30% of latency │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ ✗ MISTAKE: Base64 encoding large binary in JSON │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ // 10MB file → 13.3MB in JSON │ │ │ │ // Solution: multipart upload or binary format │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘

Summary

┌─────────────────────────────────────────────────────────────────┐ │ MODULE 3 KEY TAKEAWAYS │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ 1. Serialization is not free - can consume 10-30% CPU │ │ │ │ 2. JSON: Universal but inefficient │ │ - Use for public APIs and debugging │ │ - Avoid for high-throughput internal services │ │ │ │ 3. Protocol Buffers: Google's choice │ │ - Excellent for RPC and typed contracts │ │ - Schema evolution via field numbers │ │ - Mandatory for gRPC │ │ │ │ 4. Avro: Kafka's best friend │ │ - Schema registry for governance │ │ - Dynamic languages and data pipelines │ │ │ │ 5. FlatBuffers: When every microsecond counts │ │ - Zero-copy deserialization │ │ - Games, mobile, HFT │ │ │ │ 6. Compatibility is crucial │ │ - Plan for schema evolution from day 1 │ │ - Full compatibility enables zero-downtime deploys │ │ │ └─────────────────────────────────────────────────────────────────┘

Next Module: Time, Clocks, and Ordering - Understanding logical time in distributed systems.
All Blogs
Tags:serializationprotobufavrojsonschema-evolutiondistributed-systems