Data Modeling in Databases for System Design Interviews and Production Systems

A Complete Guide for Senior Engineers

By a Principal Database Architect with 20+ Years of Experience

Preface

In my two decades of designing database systems—from the core ledger at a major bank processing trillions of dollars annually, to social media platforms handling millions of posts per second, to analytics systems crunching petabytes of clickstream data—I've learned one fundamental truth: data modeling decisions made in the first weeks of a project echo for years, sometimes decades.
A poorly chosen schema becomes a millstone around your neck. It slows down queries that should be fast. It makes simple features complicated. It creates operational nightmares during scaling. And most painfully, it becomes nearly impossible to change once the system is in production with real data and real dependencies.
This book is not about memorizing normal forms for an exam. It's about developing the intuition to make good modeling decisions—decisions that will serve you in system design interviews, in architectural reviews, and most importantly, in production systems that real people depend on.
I'll share stories of failures I've witnessed and caused. I'll explain why certain decisions that seem obviously wrong are sometimes right, and why decisions that seem obviously right often lead to disaster. I'll prepare you to think like a database architect, not just to recite database facts.
Let's begin.

Part I: Foundations of Data Modeling

Chapter 1: The Two Worlds of Data Modeling

Before we touch a single table or document, we need to understand that data modeling exists in two distinct but connected worlds: the logical and the physical.

Logical Modeling: The World of Business Meaning

Logical modeling is about understanding the business domain. When we create a logical model, we're answering questions like: What entities exist in this business? What are the relationships between them? What attributes describe each entity? What constraints govern the data?
At this stage, we don't care about databases at all. We care about truth. A customer places orders. An order contains line items. A line item references a product. This is true regardless of whether we store it in PostgreSQL, MongoDB, or carved into stone tablets.
The logical model is your contract with the business. When a product manager says "customers can have multiple shipping addresses," that becomes part of the logical model. When a compliance officer says "we must retain transaction records for seven years," that's a constraint in the logical model.
I've seen teams skip logical modeling to "move fast." They open a database client, start creating tables, and figure it out as they go. This works for small systems and prototypes. For anything that will live longer than a few months or grow beyond a single developer's comprehension, it's a recipe for chaos.
One team I consulted for had built a successful e-commerce platform this way. Five years later, they had 847 tables, no documentation, and nobody who understood how everything connected. A simple feature like "show customers their lifetime spending" required querying 23 tables with logic that nobody could explain. The original developers had left. The tables had names like orders2, orders_new, orders_final, and orders_final_v2. This is the end state of skipping logical modeling.

Physical Modeling: The World of Performance

Physical modeling translates the logical model into actual database structures. Here, we make decisions driven by technology and performance: Which database engine? How do we partition data? What indexes do we create? How do we denormalize for read performance?
The same logical model can lead to radically different physical models depending on the workload. A customer entity might become:
  • A normalized table in PostgreSQL for a system that does complex queries and transactional updates
  • A denormalized document in MongoDB for a system that reads entire customer profiles frequently
  • A wide table in Cassandra for a system that needs massive write throughput across regions
  • A compressed columnar format in ClickHouse for analytics that aggregates millions of customers
Physical modeling is where experience matters most. It requires understanding how databases actually work—not their marketing materials, but their actual behavior under load, their failure modes, their operational requirements.

The Bridge Between Worlds

The relationship between logical and physical modeling is not a one-way translation. Physical constraints sometimes force changes to the logical model.
Consider a social media system where the logical model includes "users follow other users" as a many-to-many relationship. Simple enough. But when you have users with millions of followers, the physical reality intrudes. You might need to change the logical model to include the concept of "celebrity users" who are handled differently, or "follow request" as a distinct state, or "follower count" as a denormalized attribute that's eventually consistent rather than derived in real-time.
Good database architects maintain both models and understand their relationship. The logical model remains the source of truth for business meaning. The physical model documents how that meaning is actually implemented, including all the compromises and optimizations required by reality.

Chapter 2: OLTP vs OLAP - The Great Divide

One of the most fundamental distinctions in database systems is between Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). These workloads are so different that they've driven the development of entirely different database architectures.

OLTP: The World of Transactions

OLTP systems handle the transactional workload of a business. When a customer places an order, when a payment is processed, when an employee updates a record—these are OLTP operations.
OLTP workloads have characteristic patterns:
Short, focused queries: A typical OLTP query touches one record or a small set of records. "Get the profile for user 12345." "Insert this new order." "Update the status of shipment 67890." These queries should complete in milliseconds.
High concurrency: An e-commerce site might have thousands of customers browsing and buying simultaneously. Each represents concurrent database operations. The system must handle this concurrency without operations interfering with each other.
Data integrity requirements: When a customer pays for an order, we must guarantee that the payment is recorded and the order is marked as paid atomically. Partial states—payment recorded but order not marked, or vice versa—are unacceptable.
Mixed reads and writes: Unlike analytical systems, which are mostly reads, OLTP systems carry significant write traffic. Every action a user takes potentially modifies data.
OLTP systems typically use row-oriented storage because most operations need entire rows. When you retrieve a customer profile, you want all their attributes: name, email, address, preferences. Row storage keeps all these together on disk, making retrieval efficient.

OLAP: The World of Analysis

OLAP systems answer questions about aggregated data over time. How much revenue did we generate last quarter? What's the trend in customer acquisition? Which products are most popular in which regions?
OLAP workloads have very different patterns:
Long, complex queries: An analytical query might scan billions of rows, join multiple large tables, and compute aggregations. These queries might run for minutes or even hours.
Low concurrency: A few analysts or automated reports generate queries. You might have dozens or hundreds of concurrent analytical queries, not thousands or millions.
Read-mostly: OLAP systems primarily read data. Writes happen through batch loads—perhaps hourly or daily ETL jobs that bring in new data.
Columnar access patterns: Analytical queries typically access a few columns across many rows. "Sum the revenue column for all orders in 2024" doesn't need customer names or shipping addresses—it needs only the revenue column and date column from potentially billions of rows.
This is why OLAP systems use columnar storage. Instead of storing all attributes of a row together, they store all values of a column together. When a query needs only revenue and date, it reads only those columns from disk, potentially reducing I/O by 10x or more.

The Disaster of Mixing Workloads

Early in my career, I watched a company try to run analytics on their OLTP database. Their reasoning seemed sound: "Why maintain two systems? We already have all the data in PostgreSQL."
The first analytical queries worked fine. As the data grew and more analysts ran more complex queries, problems emerged. Analytical queries held locks that blocked transactional operations. The database server's memory was consumed by large query results, leaving nothing for transaction caching. The query planner made poor choices because it was optimized for small queries.
The breaking point came on a Monday morning. An analyst ran a query that scanned the entire orders table—now containing three years of data—to compute monthly revenue trends. The query ran for 47 minutes, saturating disk I/O and evicting the hot working set from cache. During this time, order transactions timed out faster than they could complete. The company lost an estimated $2 million in revenue from failed transactions.
The fix wasn't better query optimization or more hardware. The fix was architectural: separate the OLTP and OLAP workloads into different systems. The OLTP database handled transactions. A separate data warehouse, loaded nightly, handled analytics. Analytical queries could run as long as needed without affecting production.
This pattern—OLTP database plus separate OLAP data warehouse—remains the standard architecture for good reason. The workloads are fundamentally different, and trying to serve both with one system leads to neither being served well.

Modern Convergence: HTAP Systems

Recently, a new category has emerged: Hybrid Transactional/Analytical Processing (HTAP) systems. Databases like TiDB, SingleStore, and CockroachDB claim to handle both workloads.
These systems use sophisticated techniques—like maintaining both row and columnar representations of data, or using read replicas with columnar indexes—to serve mixed workloads. They're genuinely useful for certain scenarios: systems that need real-time analytics without ETL delay, or startups that can't yet justify separate OLTP and OLAP infrastructure.
But be cautious. HTAP systems are complex, and the "hybrid" nature means compromises. They're rarely as good at OLTP as dedicated OLTP systems, or as good at OLAP as dedicated OLAP systems. For systems at scale, the separation usually remains the right choice.

Chapter 3: Row-Store vs Column-Store

The difference between row-oriented and column-oriented storage is fundamental to database performance, yet many engineers don't understand it deeply. Let me fix that.

Row Storage: How Traditional Databases Work

In a row-oriented database like PostgreSQL or MySQL, data is stored row by row on disk. If you have a table with columns (id, name, email, age, created_at), each row's data is stored contiguously:
Row 1: [1, "Alice", "alice@example.com", 30, "2024-01-15"]
Row 2: [2, "Bob", "bob@example.com", 25, "2024-01-16"]
Row 3: [3, "Carol", "carol@example.com", 35, "2024-01-17"]
When you query SELECT * FROM users WHERE id = 1, the database finds row 1 (using an index) and reads all its columns in one disk operation. Fast.
When you query SELECT * FROM users WHERE id IN (1, 2, 3), the database reads three rows, potentially three disk operations. Still fast.
This is perfect for OLTP. Most queries want entire rows or small sets of rows.

The Problem: Analytical Queries

Now consider: SELECT AVG(age) FROM users
With one million users, the database must:
  1. Read every row from disk
  2. Extract only the age column from each row
  3. Compute the average
But we're reading entire rows when we only need one column. If each row is 500 bytes and the age column is 4 bytes, we're reading 125 times more data than necessary. With a billion rows, this translates to hundreds of gigabytes of unnecessary I/O.

Column Storage: The Analytical Solution

Column-oriented databases store data by column:
Column "id": [1, 2, 3, ...]
Column "name": ["Alice", "Bob", "Carol", ...]
Column "email": ["alice@example.com", "bob@example.com", ...]
Column "age": [30, 25, 35, ...]
Column "created_at": ["2024-01-15", "2024-01-16", "2024-01-17", ...]
Now SELECT AVG(age) FROM users only reads the age column. With a billion rows at 4 bytes each, that's 4 GB instead of 500 GB. The query runs 100x faster just from reduced I/O.
But it gets better. Columnar data compresses dramatically because values in the same column are similar. Ages cluster around a few decades. Dates cluster around recent periods. Countries cluster around common values. Compression ratios of 10:1 are common, making our 4 GB age column only 400 MB on disk.
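To make the compression claim concrete, here is a small sketch using Python's zlib as a stand-in for a real columnar codec. It compares a synthetic one-byte age column against the same column sorted—columnar engines commonly sort data within a part, which produces long runs of equal values. The exact ratios depend on the data and codec:

```python
import random
import zlib

random.seed(0)
n = 1_000_000

# A synthetic "age" column: one byte per row, values clustered in a narrow
# range, as ages are in real data.
ages = [random.choice(range(20, 70)) for _ in range(n)]

raw = bytes(ages)
ratio_unsorted = len(raw) / len(zlib.compress(raw, 9))

# Sorting the column creates long runs of identical values, which
# compress dramatically better.
sorted_raw = bytes(sorted(ages))
ratio_sorted = len(sorted_raw) / len(zlib.compress(sorted_raw, 9))

print(f"unsorted column: {ratio_unsorted:.1f}:1")
print(f"sorted column:   {ratio_sorted:.1f}:1")
```

The unsorted column compresses modestly (its entropy is still a few bits per value); the sorted column collapses to a handful of runs and compresses by orders of magnitude.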

The Trade-off

Row stores excel at:
  • Point queries (get one row)
  • Small range queries (get a few hundred rows)
  • Mixed read/write workloads
  • Updates to individual rows
Column stores excel at:
  • Full table scans
  • Aggregations across many rows
  • Queries that access few columns
  • Compression of large datasets
Column stores struggle with:
  • Point queries (must reconstruct row from multiple column files)
  • Updates (must modify multiple column files)
  • High-frequency inserts (batch loading is preferred)
This is why we use PostgreSQL for production transactional data and ClickHouse or BigQuery for analytics. Each is optimized for its workload.

A War Story: The Column Store Mistake

A fintech startup I advised decided to use ClickHouse for everything. They'd read about its impressive analytical performance and thought: "If it's fast for analytics, it must be fast for everything."
Their transaction processing system required:
  • Inserting individual transactions as they occurred
  • Updating transaction status frequently
  • Querying individual transactions by ID
  • Maintaining strict consistency
ClickHouse is designed for batch inserts, not individual inserts. Each insert created a new data part that had to be merged later. With thousands of transactions per minute, they accumulated thousands of parts per hour. Merge operations consumed all available CPU. Query performance degraded as the system searched across thousands of parts.
Updates were worse. ClickHouse doesn't really update—it marks old data for deletion and inserts new data. Their status update workload generated enormous amounts of "deleted" data awaiting cleanup.
After three months, they had a terabyte of dead data, queries took seconds instead of milliseconds, and they couldn't process transactions fast enough. They had to rewrite their entire transaction system to use PostgreSQL, keeping ClickHouse only for analytics.
The lesson: choose databases based on workload characteristics, not benchmark headlines.

Part II: Normalization Deep Dive

Chapter 4: The Philosophy of Normalization

Normalization is often taught as a set of mechanical rules: 1NF, 2NF, 3NF, BCNF. Students memorize definitions, apply them to textbook examples, and promptly forget everything after the exam. This is unfortunate because normalization embodies deep principles about data integrity that every database practitioner should understand intuitively.

What Normalization Actually Prevents

Before discussing normal forms, let's understand the problems normalization solves. These are called data anomalies.
Update Anomalies: Imagine a table storing customer orders with customer address denormalized into each order:
orders(order_id, customer_id, customer_name, customer_address, product, quantity)
When customer Alice moves to a new address, you must update every order she's ever placed. Miss one, and you have inconsistent data—some orders show her old address, some show the new. With millions of orders, ensuring complete updates is difficult and expensive.
Insertion Anomalies: In the same table, you cannot record a new customer until they place an order. The customer_address depends on having an order to attach it to. This seems wrong—a customer exists independent of their orders.
Deletion Anomalies: If you delete all orders for a customer (perhaps as part of data cleanup), you lose the customer's address information. Again, the customer's existence shouldn't depend on their orders.
Normalization eliminates these anomalies by organizing data so that each fact is stored once, in one place, with a clear mechanism for relating facts to each other.

First Normal Form: The Foundation

A table is in First Normal Form if it contains only atomic values—no repeating groups, no composite values that should be separate.
Consider a poorly designed product table:
products(product_id, name, colors)

Where colors might contain: "red, blue, green"
This violates 1NF because colors is not atomic—it's a list masquerading as a single value. Problems arise immediately:
  • How do you find all blue products? You need string parsing: WHERE colors LIKE '%blue%'. This is slow and error-prone (it would match "lightblue" too).
  • How do you add a color? String manipulation.
  • How do you ensure no duplicates? Complex validation.
  • How do you constrain to valid colors? Essentially impossible.
The 1NF solution:
products(product_id, name)
product_colors(product_id, color)
Now each color is a separate row. Finding blue products is simple: WHERE color = 'blue'. Adding a color is an insert. Removing a color is a delete. You can add a foreign key to a colors reference table if you want to constrain valid values.
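Both designs—and the false-positive LIKE query—can be demonstrated with Python's built-in sqlite3; the sample products are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The non-1NF design: a list masquerading as a single value.
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT, colors TEXT);
    INSERT INTO products VALUES
        (1, 'Mug',   'red, blue, green'),
        (2, 'Lamp',  'lightblue'),
        (3, 'Chair', 'red');

    -- The 1NF design: one atomic color per row.
    CREATE TABLE product_colors (product_id INTEGER, color TEXT);
    INSERT INTO product_colors VALUES
        (1, 'red'), (1, 'blue'), (1, 'green'),
        (2, 'lightblue'),
        (3, 'red');
""")

# String matching wrongly picks up the 'lightblue' lamp.
bad = conn.execute(
    "SELECT product_id FROM products WHERE colors LIKE '%blue%'"
).fetchall()

# Exact match on atomic values returns only the truly blue product.
good = conn.execute(
    "SELECT product_id FROM product_colors WHERE color = 'blue'"
).fetchall()

print("LIKE '%blue%':", [r[0] for r in bad])    # includes the false positive
print("color = 'blue':", [r[0] for r in good])
```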
First Normal Form seems obvious, yet violations are common. I've seen production databases with JSON arrays stored in text columns, comma-separated IDs, and pipe-delimited key-value pairs. Each violation creates the same problems: difficult queries, data integrity issues, and eventual regret.

Second Normal Form: Full Dependency on Keys

A table is in Second Normal Form if it's in 1NF and every non-key column depends on the entire primary key, not just part of it.
This matters for tables with composite keys. Consider:
order_items(order_id, product_id, quantity, product_name, product_price)

Primary key: (order_id, product_id)
Here, quantity depends on the full key—a specific product in a specific order. But product_name and product_price depend only on product_id, not on order_id. This is a partial dependency, violating 2NF.
The problems:
  • Product names are duplicated across every order containing that product
  • Updating a product name requires updating every order_items row
  • You can't add a product until it appears in an order
The solution separates product information:
products(product_id, product_name, product_price)
order_items(order_id, product_id, quantity)
Now each product's information exists once. Order items reference products by ID.

Third Normal Form: No Transitive Dependencies

A table is in Third Normal Form if it's in 2NF and every non-key column depends directly on the primary key, not on another non-key column.
Consider:
employees(employee_id, name, department_id, department_name, department_location)
Here, department_name and department_location depend on department_id, which depends on employee_id. This chain—employee_id → department_id → department_name—is a transitive dependency.
The problems are familiar:
  • Department information is duplicated across employees
  • Changing a department name requires updating all employees in that department
  • You can't record a department with no employees
The 3NF solution:
departments(department_id, department_name, department_location)
employees(employee_id, name, department_id)

BCNF: The Refinement

Boyce-Codd Normal Form handles a subtle case that 3NF misses: a functional dependency whose determinant is not a candidate key but whose dependent attribute is part of one.
This is relatively rare in practice, but consider a course scheduling system:
enrollments(student_id, course, instructor)
Suppose business rules dictate that each instructor teaches exactly one course (instructor → course) and that each student takes a given course from exactly one instructor ((student_id, course) → instructor). The candidate keys are (student_id, course) and (student_id, instructor). Because course is a prime attribute—part of a candidate key—the dependency instructor → course does not violate 3NF. Yet instructor/course pairs are still duplicated across enrollment rows, with all the usual anomalies.
BCNF requires that every determinant (left side of a functional dependency) be a candidate key. Here, instructor determines course, but instructor alone isn't a key. We split:
teaching(instructor, course)
enrollments(student_id, instructor)
For most practical systems, achieving 3NF is sufficient. BCNF matters in complex schemas with many interdependencies.

The Real Reason Normalization Matters: Consistency

Beyond the textbook anomalies, normalization matters because it enables strong consistency guarantees. When a fact is stored once, updates are atomic. When facts are duplicated, updates become distributed transactions across multiple locations.
In a banking system, the balance of an account should be stored once. If you denormalize it into every transaction record, updating the balance means updating potentially millions of records. Miss one, and you have inconsistent data. In finance, inconsistent data is not merely inconvenient—it's potentially illegal.
This is why transactional systems, especially financial ones, prefer normalized designs. The join overhead is worth the consistency guarantee.

Chapter 5: The Banking System Case Study

Let me walk through how normalization applies to a real system with serious consistency requirements: a banking core ledger.

The Domain

A bank needs to track:
  • Accounts (checking, savings, loan)
  • Account holders (customers, who may have multiple accounts)
  • Transactions (deposits, withdrawals, transfers)
  • Running balances
  • Statements
The stakes are high. Regulatory requirements mandate accurate records. Customers depend on balances being correct. Auditors examine transaction histories.

The Naive Design (Don't Do This)

A developer unfamiliar with financial systems might design:
accounts(
    account_id,
    customer_name,
    customer_address,
    customer_phone,
    account_type,
    balance,
    last_transaction_date,
    last_transaction_amount,
    monthly_interest_rate,
    total_fees_ytd
)
This is a disaster waiting to happen. Customer information is duplicated if they have multiple accounts. The balance is stored directly, creating risk of inconsistency with transactions. Derived data (total_fees_ytd) is stored rather than computed.

The Normalized Design

A proper banking schema separates concerns:
customers(
    customer_id PRIMARY KEY,
    name,
    address,
    phone,
    date_of_birth,
    tax_id,
    created_at
)

accounts(
    account_id PRIMARY KEY,
    customer_id REFERENCES customers,
    account_type,
    currency,
    opened_at,
    closed_at,
    status
)

transactions(
    transaction_id PRIMARY KEY,
    account_id REFERENCES accounts,
    transaction_type,
    amount,
    running_balance,
    description,
    created_at,
    posted_at
)
Notice what's NOT here: there's no balance column in the accounts table. The balance is derived from transactions. This seems inefficient—getting a balance requires summing all transactions—but it's intentional.

The Ledger Principle

In double-entry bookkeeping, every transaction affects at least two accounts. When you transfer $100 from checking to savings:
  1. Debit checking account by $100
  2. Credit savings account by $100
The sum of all debits and credits in the system must equal zero. This invariant is fundamental—if it's violated, something is wrong.
If we stored balance as a column, we'd need to update two rows atomically for every transfer. This creates race conditions and complexity. Instead, we store transactions and compute balances:
sql
SELECT SUM(
    CASE
        WHEN transaction_type = 'credit' THEN amount
        WHEN transaction_type = 'debit' THEN -amount
    END
) AS balance
FROM transactions
WHERE account_id = ?
The balance is always consistent with transactions because it IS the transactions.
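A minimal, runnable sketch of the derived-balance query using Python's built-in sqlite3, with amounts stored as integers (real ledgers use minor units such as cents); the sample rows are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (
        transaction_id   INTEGER PRIMARY KEY,
        account_id       INTEGER NOT NULL,
        transaction_type TEXT NOT NULL CHECK (transaction_type IN ('credit', 'debit')),
        amount           INTEGER NOT NULL CHECK (amount > 0)
    );
    INSERT INTO transactions (account_id, transaction_type, amount) VALUES
        (1, 'credit', 1000),
        (1, 'debit',   200),
        (1, 'credit',  500),
        (2, 'credit',  300);
""")

def balance(account_id):
    # The balance is derived, never stored: it IS the sum of the transactions.
    (result,) = conn.execute(
        """
        SELECT COALESCE(SUM(
            CASE WHEN transaction_type = 'credit' THEN amount
                 WHEN transaction_type = 'debit'  THEN -amount
            END), 0)
        FROM transactions WHERE account_id = ?
        """,
        (account_id,),
    ).fetchone()
    return result

print(balance(1))  # 1000 - 200 + 500
print(balance(2))
```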

The Running Balance Optimization

Computing balance from all historical transactions is expensive for accounts with many transactions. The running_balance column is a controlled denormalization.
Each transaction records the balance after that transaction:
transaction_id | amount | running_balance
             1 |   1000 |            1000
             2 |   -200 |             800
             3 |    500 |            1300
             4 |   -100 |            1200
The current balance is the running_balance of the most recent transaction. But we can still verify by summing amounts—if the sum doesn't match the final running_balance, we've detected corruption.
This pattern—store computed values but retain the ability to recompute and verify—is common in financial systems.
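The verification step can be sketched as a simple fold over the ledger; the function name and input shape are assumptions for illustration:

```python
def verify_running_balance(transactions):
    """Check that each stored running_balance matches the recomputed sum.

    `transactions` is a list of (amount, running_balance) pairs in posting
    order, mirroring the table above. Returns True if the ledger is
    internally consistent.
    """
    total = 0
    for amount, running_balance in transactions:
        total += amount
        if total != running_balance:
            return False  # corruption detected: stored value disagrees
    return True

ledger = [(1000, 1000), (-200, 800), (500, 1300), (-100, 1200)]
print(verify_running_balance(ledger))                        # consistent
print(verify_running_balance([(1000, 1000), (-200, 900)]))   # corrupted
```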

Why Not Shard?

Many distributed systems use sharding for scale. Banking systems rarely shard account data, and the reason is transactional complexity.
Consider a transfer between accounts that live on different shards. This requires a distributed transaction:
  1. Begin transaction on shard A
  2. Begin transaction on shard B
  3. Debit account on shard A
  4. Credit account on shard B
  5. Commit on shard A
  6. Commit on shard B
If step 5 succeeds but step 6 fails, money has vanished. You've debited one account but not credited another. Handling this correctly requires sophisticated distributed transaction protocols (two-phase commit) or saga patterns with compensation.
Banks typically keep account data in a single, highly available database cluster rather than shard it. They scale through read replicas (for balance inquiries) and by archiving old transactions to separate systems. The transactional core stays unified.

Archiving Strategy

A bank might process millions of transactions daily. Over years, the transactions table grows to billions of rows. Query performance degrades. Backups take forever.
The solution is archiving with careful consideration for query patterns:
Online transactions: Last 90 days, in the primary database. All balance queries hit this.
Recent archive: 90 days to 2 years, in a separate database or table partition. Statement generation queries hit this.
Cold archive: Beyond 2 years, compressed and stored in object storage. Only accessed for legal/audit requests.
The schema supports this through partitioning:
sql
CREATE TABLE transactions (
    -- columns
) PARTITION BY RANGE (posted_at);

CREATE TABLE transactions_2024_q1 PARTITION OF transactions
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
Old partitions can be detached and archived. New partitions are created ahead of time. The query planner automatically routes queries to the relevant partitions.
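Routing a row to its range partition is a pure function of posted_at. A small sketch of that routing, assuming the quarterly naming scheme shown above:

```python
from datetime import date

def partition_for(posted_at: date) -> str:
    """Name the quarterly partition a row belongs to, mirroring a
    RANGE (posted_at) layout. The naming scheme is an assumption."""
    quarter = (posted_at.month - 1) // 3 + 1
    return f"transactions_{posted_at.year}_q{quarter}"

print(partition_for(date(2024, 1, 15)))
print(partition_for(date(2024, 4, 1)))   # first day of Q2 lands in the next partition
```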

Isolation Levels: A Brief Detour

Financial systems typically use SERIALIZABLE or REPEATABLE READ isolation. The default READ COMMITTED allows non-repeatable and phantom reads—a transaction seeing different data at different points during its execution—which is unacceptable for financial operations.
Consider computing a balance across multiple statements while another transaction is transferring funds:
  1. Transaction A reads the debit: -100
  2. Transaction B inserts the matching credit: +100
  3. Transaction B commits
  4. Transaction A continues, but its earlier reads have already passed that point
  5. Transaction A computes a balance that is missing $100
With SERIALIZABLE isolation, this race condition is prevented. Transaction A would see a consistent snapshot, or the database would abort one transaction and retry.
The cost is reduced concurrency. SERIALIZABLE transactions may block each other or cause retries. For banking systems, correctness justifies this cost.
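Because SERIALIZABLE databases may abort conflicting transactions, application code must be written to retry them. A minimal retry-wrapper sketch, assuming a driver that signals conflicts with an exception—the SerializationError class and transfer function here are hypothetical stand-ins:

```python
class SerializationError(Exception):
    """Stands in for the driver's serialization-failure error
    (e.g. SQLSTATE 40001 in PostgreSQL)."""

def run_with_retries(txn, max_attempts=5):
    # Under SERIALIZABLE isolation the database may abort a transaction
    # that conflicts with a concurrent one; the correct response is to
    # retry the whole transaction from the start.
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationError:
            if attempt == max_attempts:
                raise

# Simulated transaction that is aborted twice before committing.
attempts = {"n": 0}
def transfer():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationError
    return "committed"

print(run_with_retries(transfer), "after", attempts["n"], "attempts")
```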

Part III: Denormalization in Distributed Systems

Chapter 6: When Normalization Isn't Enough

I've argued for normalization in transactional systems. Now I'll argue for denormalization in distributed, read-heavy systems. This isn't a contradiction—it's recognizing that different systems have different requirements.

The Read-Heavy Reality

Consider a social media platform. Users read their feed thousands of times for every post they create. They view profiles constantly but update them rarely. They scroll through comments endlessly but add comments occasionally.
When reads outnumber writes by 1000:1, optimizing for write consistency (normalization) at the cost of read performance is the wrong trade-off. We should optimize for reads, even if it complicates writes.

The Join Problem at Scale

Normalized data requires joins. In a social network, displaying a post might require:
  1. Fetch the post
  2. Join to get author information
  3. Join to get like count
  4. Join to get comment count
  5. Join to get the first few comments
  6. For each comment, join to get commenter information
That's potentially dozens of table accesses for one post. In a relational database on a single machine, the query optimizer handles this efficiently—a few milliseconds.
But in a distributed database, each join potentially crosses network boundaries. Table A is on server 1. Table B is on server 2. Each network hop adds latency. Under load, a query that takes 2ms locally might take 200ms when distributed.
When you're rendering a feed of 50 posts and fetching them serially, 200ms per post adds up to 10 seconds. Unacceptable.

The Denormalization Solution

Denormalization trades write complexity for read simplicity. Instead of joining at read time, we pre-join at write time.
A denormalized post document might look like:
json
{
  "post_id": "123",
  "content": "Hello world",
  "author": {
    "user_id": "456",
    "name": "Alice",
    "avatar_url": "https://..."
  },
  "stats": {
    "like_count": 1523,
    "comment_count": 87,
    "share_count": 23
  },
  "preview_comments": [
    {
      "comment_id": "789",
      "user": {"user_id": "111", "name": "Bob"},
      "content": "Great post!",
      "created_at": "2024-01-15T10:30:00Z"
    }
  ],
  "created_at": "2024-01-15T10:00:00Z"
}
One read fetches everything needed to render the post. No joins. Predictable latency.
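The pre-join at write time can be sketched as a plain function that assembles such a document from normalized pieces; the field names mirror the JSON example, everything else is illustrative:

```python
def build_post_document(post, author, stats, preview_comments, users):
    """Pre-join at write time: assemble everything a feed render needs
    into one document, so the read path does no joins."""
    return {
        "post_id": post["post_id"],
        "content": post["content"],
        "author": {k: author[k] for k in ("user_id", "name", "avatar_url")},
        "stats": stats,
        "preview_comments": [
            {
                "comment_id": c["comment_id"],
                "user": {k: users[c["user_id"]][k] for k in ("user_id", "name")},
                "content": c["content"],
            }
            for c in preview_comments
        ],
    }

users = {
    "456": {"user_id": "456", "name": "Alice", "avatar_url": "https://example.com/a.png"},
    "111": {"user_id": "111", "name": "Bob", "avatar_url": "https://example.com/b.png"},
}
doc = build_post_document(
    post={"post_id": "123", "content": "Hello world"},
    author=users["456"],
    stats={"like_count": 1523, "comment_count": 87, "share_count": 23},
    preview_comments=[{"comment_id": "789", "user_id": "111", "content": "Great post!"}],
    users=users,
)
print(doc["author"]["name"], doc["preview_comments"][0]["user"]["name"])
```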

The Consistency Challenge

Denormalization creates data duplication, and duplication creates consistency challenges.
When Alice changes her name, it should update everywhere: her profile, all her posts, all her comments, all places where her name appears. In a normalized system, this is one update. In a denormalized system, it's potentially millions of updates.
How do we handle this?
Accept eventual consistency: For many use cases, it's fine if Alice's old name appears for a few minutes after she changes it. Update her canonical profile immediately; propagate to denormalized copies asynchronously.
Selective denormalization: Only denormalize stable data. User IDs don't change; always store those. Avatar URLs change rarely; cache with TTL. Display names change occasionally; accept staleness.
Reference plus cache: Store the reference (user_id) and cache the denormalized data (name, avatar). The cache has a TTL; when it expires, re-fetch. This bounds staleness.
Event-driven updates: When Alice changes her name, publish an event. Subscribers update their denormalized copies. This can be fast but isn't instant, and handling failures is complex.
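The reference-plus-cache strategy can be sketched as a small TTL cache; the class and fetch interface are assumptions, and a fake clock makes the expiry observable without sleeping:

```python
import time

class ProfileCache:
    """Reference-plus-cache: store the stable user_id, cache the mutable
    display data with a TTL that bounds how stale it can get."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch          # loads the canonical profile by user_id
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}           # user_id -> (expires_at, profile)

    def get(self, user_id):
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry and entry[0] > now:
            return entry[1]          # fresh enough: serve the cached copy
        profile = self._fetch(user_id)
        self._entries[user_id] = (now + self._ttl, profile)
        return profile

# Usage with a fake clock so the expiry is observable without sleeping.
fake_now = [0.0]
db = {"456": {"name": "Alice"}}
cache = ProfileCache(fetch=lambda uid: dict(db[uid]), ttl_seconds=300,
                     clock=lambda: fake_now[0])

print(cache.get("456")["name"])      # fetched from the source
db["456"]["name"] = "Alicia"         # canonical profile changes
print(cache.get("456")["name"])      # still the stale cached value
fake_now[0] = 301.0                  # TTL elapses
print(cache.get("456")["name"])      # staleness bounded: re-fetched
```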

A Real-World Example: Facebook's Approach

Facebook (now Meta) famously denormalizes aggressively. A user's news feed is precomputed and stored, not generated on-the-fly. When a friend posts, Facebook updates the precomputed feeds of all their friends asynchronously.
This approach has trade-offs:
  • Reading your feed is fast: one or a few reads from a precomputed cache
  • Posting is complex: your post must propagate to thousands of friends' feeds
  • Consistency is eventual: your post might appear in friends' feeds at different times
  • Storage is enormous: every user has their own feed copy
For Facebook's scale (billions of users, each with hundreds of friends), this trade-off is correct. Read latency matters more than write complexity. Eventual consistency is acceptable for social content.
For a banking system, these trade-offs would be disastrous. Context determines correct design.

Chapter 7: Patterns of Denormalization

Let me describe specific denormalization patterns I've used in production systems.

Precomputed Aggregates

The problem: "Show the number of orders each customer has placed this month."
The normalized approach:
sql
SELECT customer_id, COUNT(*) FROM orders WHERE created_at >= '2024-01-01' GROUP BY customer_id;
This scans potentially millions of rows. For a dashboard showing all customers, it's expensive.
The denormalized approach: maintain a separate aggregates table:
sql
customer_monthly_stats(
  customer_id,
  year_month,
  order_count,
  total_revenue,
  updated_at
)
When an order is placed, increment the counter:
sql
INSERT INTO customer_monthly_stats (customer_id, year_month, order_count, total_revenue)
VALUES (?, ?, 1, ?)
ON CONFLICT (customer_id, year_month) DO UPDATE SET
  order_count = customer_monthly_stats.order_count + 1,
  total_revenue = customer_monthly_stats.total_revenue + EXCLUDED.total_revenue,
  updated_at = NOW();
Reading customer stats is now a simple lookup.
The trade-off: writes are more complex and slightly slower. But reads—which happen far more often—are dramatically faster. The aggregate data is always consistent if updates are atomic.

Materialized Views

Many databases support materialized views—query results stored as tables. They're essentially database-managed denormalization.
sql
CREATE MATERIALIZED VIEW customer_order_summary AS
SELECT
  c.customer_id,
  c.name,
  COUNT(o.order_id) AS order_count,
  SUM(o.total) AS lifetime_value
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name;
The view is refreshed periodically or on-demand:
sql
REFRESH MATERIALIZED VIEW customer_order_summary;
Materialized views are convenient but have limitations:
  • Refresh can be expensive for large datasets
  • Data staleness between refreshes
  • Not all databases support incremental refresh
For production systems with strict latency requirements, I typically implement custom aggregation rather than relying on materialized views.

Event-Driven Projections

In event-sourced systems, projections are a form of denormalization. The event log is the source of truth; projections are optimized read models derived from events.
Consider an e-commerce system:
  • Events: OrderPlaced, OrderShipped, OrderDelivered, OrderCanceled
  • Projection: current_orders table showing only in-progress orders
When OrderPlaced occurs, insert into current_orders. When OrderDelivered or OrderCanceled occurs, delete from current_orders.
The projection is denormalized (combines data from multiple event types), optimized for the specific read pattern (in-progress orders), and eventually consistent with the event log.

Cache-Based Denormalization

Sometimes denormalization lives not in the database but in a cache layer.
Pattern:
  1. Database stores normalized data
  2. Application denormalizes at read time
  3. Result is cached
  4. Subsequent reads hit cache
python
def get_post_with_details(post_id):
    # Try cache first
    cached = redis.get(f"post:{post_id}")
    if cached:
        return json.loads(cached)

    # Build denormalized view from normalized data
    post = db.query("SELECT * FROM posts WHERE id = ?", post_id)
    author = db.query("SELECT * FROM users WHERE id = ?", post.author_id)
    stats = db.query("SELECT * FROM post_stats WHERE post_id = ?", post_id)

    result = {"post": post, "author": author, "stats": stats}

    # Cache with TTL
    redis.setex(f"post:{post_id}", 3600, json.dumps(result))
    return result
This approach:
  • Keeps the database normalized for write consistency
  • Provides denormalized reads for performance
  • Bounds staleness through cache TTL
  • Allows easy cache invalidation when data changes
The complexity is in cache invalidation. When the author changes their name, you must invalidate all cached posts by that author. Cache invalidation strategies could fill a chapter themselves.

Part IV: Database Case Studies

Chapter 8: Social Media Platform

Let me walk through modeling a social media platform—a domain that exemplifies the challenges of read-heavy, eventually consistent distributed systems.

Post Schema: The Foundation

A post is the core entity. In a normalized relational model:
sql
posts(
  post_id BIGINT PRIMARY KEY,
  author_id BIGINT REFERENCES users,
  content TEXT,
  created_at TIMESTAMP
)
This is clean but requires joins for every read. In a document database like MongoDB:
json
{
  "_id": ObjectId("..."),
  "author_id": "12345",
  "author_snapshot": {
    "name": "Alice",
    "avatar_url": "https://..."
  },
  "content": "Hello world!",
  "created_at": ISODate("2024-01-15T10:00:00Z"),
  "stats": {
    "like_count": 0,
    "comment_count": 0,
    "share_count": 0
  }
}
The author_snapshot is denormalization—it's a copy of author data at post creation time. This enables displaying posts without joining to the users collection. The trade-off: if Alice changes her name, old posts show the old name until updated.

The Likes Challenge

Likes seem simple but create interesting scaling challenges.
Normalized approach:
sql
likes(
  like_id BIGINT PRIMARY KEY,
  post_id BIGINT REFERENCES posts,
  user_id BIGINT REFERENCES users,
  created_at TIMESTAMP,
  UNIQUE(post_id, user_id)
)
To get the like count:
SELECT COUNT(*) FROM likes WHERE post_id = ?
For a viral post with millions of likes, this query is expensive. The like count is needed for every post view—you can't afford expensive queries.
Denormalized approach: Store like_count in the post document, increment/decrement atomically:
sql
UPDATE posts SET like_count = like_count + 1 WHERE post_id = ?
But this creates contention. A viral post might receive thousands of likes per second. All those updates serialize on the same row, creating a bottleneck.
The counter aggregation pattern: Instead of updating the post directly, write likes to a separate stream or log. Aggregate counts periodically:
  1. Like events go to a Kafka topic
  2. A consumer aggregates counts per time window (say, every second)
  3. Aggregated counts are applied to posts
This trades perfect accuracy for throughput. The like count might lag reality by a few seconds. For social media, this is acceptable.
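The windowed aggregation step can be sketched in a few lines of Python. Here aggregate_window and the (post_id, delta) event shape are illustrative, not any particular consumer API; the point is that a second's worth of like events collapses into one row update per post.

```python
from collections import Counter

def aggregate_window(like_events):
    """Collapse a window of like events into per-post deltas.

    like_events: iterable of (post_id, delta) where delta is +1 (like)
    or -1 (unlike). Returns {post_id: net_change}, applied to posts as
    one update each instead of one row update per like.
    """
    deltas = Counter()
    for post_id, delta in like_events:
        deltas[post_id] += delta
    return dict(deltas)

# One second's worth of events: four likes and one unlike on post 7,
# one like on post 9 — six writes become two counter updates.
window = [(7, 1), (7, 1), (9, 1), (7, -1), (7, 1), (7, 1)]
assert aggregate_window(window) == {7: 3, 9: 1}
```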

Comments: The Hierarchy Problem

Comments often have replies, forming a tree structure. Modeling hierarchies in databases is notoriously tricky.
Option 1: Adjacency List
sql
comments(
  comment_id BIGINT PRIMARY KEY,
  post_id BIGINT REFERENCES posts,
  parent_id BIGINT REFERENCES comments,  -- NULL for top-level
  author_id BIGINT REFERENCES users,
  content TEXT,
  created_at TIMESTAMP
)
Simple to understand and implement. Getting immediate children is easy:
sql
SELECT * FROM comments WHERE parent_id = ?
But getting an entire thread (all descendants) requires recursive queries or multiple round trips. For deep threads, this is expensive.
Option 2: Materialized Path
Store the path from root to each comment:
sql
comments(
  comment_id BIGINT PRIMARY KEY,
  post_id BIGINT REFERENCES posts,
  path TEXT,   -- e.g., "1/5/12/45" meaning root=1, grandparent=5, parent=12, self=45
  depth INT,
  author_id BIGINT REFERENCES users,
  content TEXT,
  created_at TIMESTAMP
)
Getting a thread is simple:
sql
SELECT * FROM comments WHERE path LIKE '1/5/%' ORDER BY path
But moving a comment (changing its parent) requires updating paths of all descendants. And paths can get long for deep threads.
Option 3: Closure Table
Store all ancestor-descendant relationships:
sql
comments(
  comment_id BIGINT PRIMARY KEY,
  post_id BIGINT REFERENCES posts,
  author_id BIGINT REFERENCES users,
  content TEXT,
  created_at TIMESTAMP
)

comment_tree(
  ancestor_id BIGINT REFERENCES comments,
  descendant_id BIGINT REFERENCES comments,
  depth INT,
  PRIMARY KEY(ancestor_id, descendant_id)
)
When comment 45 is added under comment 12 (which is under 5, which is under 1):
  • Insert (45, 45, 0) — self-reference
  • Insert (12, 45, 1) — immediate parent
  • Insert (5, 45, 2) — grandparent
  • Insert (1, 45, 3) — root
Getting all descendants of comment 1:
sql
SELECT c.*
FROM comments c
JOIN comment_tree ct ON c.comment_id = ct.descendant_id
WHERE ct.ancestor_id = 1
ORDER BY ct.depth;
Closure tables make reads fast and support arbitrary depth. The cost is write complexity—adding a comment requires multiple inserts. For read-heavy comment systems, this trade-off often makes sense.
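The insert bookkeeping generalizes: each ancestor row of the parent yields one row for the new comment, one level deeper, plus the self-reference. A minimal Python sketch, with tuples standing in for comment_tree rows (closure_rows_for_new_comment is a hypothetical helper, not a library call):

```python
def closure_rows_for_new_comment(closure, parent_id, new_id):
    """Compute the comment_tree rows to insert for a new comment.

    closure: existing set of (ancestor_id, descendant_id, depth) tuples.
    Returns the new comment's self-reference plus one row per ancestor
    of the parent, each one level deeper.
    """
    rows = [(new_id, new_id, 0)]  # self-reference at depth 0
    if parent_id is not None:
        for ancestor, descendant, depth in closure:
            if descendant == parent_id:
                rows.append((ancestor, new_id, depth + 1))
    return rows

# Existing chain from the chapter's example: 1 -> 5 -> 12.
closure = {(1, 1, 0), (5, 5, 0), (12, 12, 0),
           (1, 5, 1), (1, 12, 2), (5, 12, 1)}
rows = closure_rows_for_new_comment(closure, parent_id=12, new_id=45)
assert sorted(rows) == [(1, 45, 3), (5, 45, 2), (12, 45, 1), (45, 45, 0)]
```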
My Production Choice:
For most social media systems, I use a hybrid:
  1. Limit comment depth to 2-3 levels. Twitter, Facebook, and Reddit all do this. Deep threading is rare, and limiting depth simplifies everything.
  2. Use adjacency list with caching. With limited depth, recursive queries are manageable. Cache rendered comment threads.
  3. Denormalize reply counts and preview replies into the parent comment for fast rendering.

Feed Generation: The Hardest Problem

The home feed—showing posts from people you follow, sorted by relevance—is the hardest data modeling problem in social media.
The naive approach: Query at read time:
sql
SELECT p.*
FROM posts p
JOIN follows f ON p.author_id = f.followed_id
WHERE f.follower_id = ?
ORDER BY p.created_at DESC
LIMIT 50;
This works for users following a few dozen people. For users following thousands, with millions of potential posts, this query is too expensive to run on every feed load.
The fan-out-on-write approach: When someone posts, immediately add the post to the feeds of all their followers:
sql
feed_items(
  user_id BIGINT,   -- whose feed
  post_id BIGINT,
  created_at TIMESTAMP,
  PRIMARY KEY(user_id, created_at DESC, post_id)
)
When Alice posts, if she has 1,000 followers, insert 1,000 rows into feed_items. Reading a feed is then trivial:
sql
SELECT * FROM feed_items WHERE user_id = ? ORDER BY created_at DESC LIMIT 50
This works until you hit celebrity users. When someone with 10 million followers posts, you'd need to insert 10 million rows. This takes time—the feed update becomes slow.
The hybrid approach: Most production systems use a combination:
  • Regular users: Fan-out-on-write. Their posts are inserted into followers' feeds immediately.
  • Celebrity users: Fan-out-on-read. Their posts are stored separately and merged into feeds at read time.
The feed query becomes: "Get my precomputed feed items, then merge with recent posts from celebrities I follow." This bounds write fan-out while keeping reads fast.
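The read-time merge can be sketched with a heap-based merge. Here merged_feed and the (created_at, post_id) tuples are illustrative, assuming both sources arrive sorted newest-first, as a feed_items scan and a per-celebrity recent-posts lookup would naturally produce.

```python
import heapq
import itertools

def merged_feed(precomputed, celebrity_posts, limit=50):
    """Merge precomputed feed items with recent celebrity posts.

    Both inputs are lists of (created_at, post_id) sorted newest-first;
    heapq.merge(reverse=True) preserves that order across both sources
    without re-sorting everything.
    """
    merged = heapq.merge(precomputed, celebrity_posts, reverse=True)
    return [post_id for _, post_id in itertools.islice(merged, limit)]

# Fan-out-on-write items for this user, plus two posts from a followed celebrity.
mine = [(105, "p3"), (100, "p1")]
celeb = [(103, "c2"), (99, "c1")]
assert merged_feed(mine, celeb, limit=3) == ["p3", "c2", "p1"]
```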

Handling Viral Posts

Viral posts stress every assumption. A post that suddenly gets millions of views and interactions creates:
  • Read hot spots on the post data
  • Write hot spots from likes and comments
  • Network traffic spikes to replicas
  • Cache pressure
Mitigation strategies:
Separate hot post handling: Detect posts that are trending and route their traffic to dedicated infrastructure—more caching, more replicas, rate limiting.
Accept stale counts: During viral periods, like counts can lag. Users rarely notice the difference between 1,234,567 and 1,234,789.
Rate limit interactions: If comments are flooding in, queue them and process with slight delay.
Graceful degradation: If systems are overloaded, stop showing exact counts and show "10K+ likes" instead.

Chapter 9: Analytics Platform (OLAP)

Let me shift to a completely different domain: an analytics system for tracking user behavior across a product.

The Star Schema

Analytics databases typically use star or snowflake schemas, radically different from OLTP normalization.
A star schema has:
  • Fact tables: Large tables recording events or measurements (page views, purchases, clicks)
  • Dimension tables: Smaller tables describing attributes of the facts (users, products, dates, locations)
                [dim_date]
                    |
[dim_user] --- [fact_page_views] --- [dim_page]
                    |
               [dim_device]
The fact table references dimension tables through foreign keys. Queries join the fact table with relevant dimensions.

Why This Structure?

The star schema optimizes for analytical queries. Consider: "How many page views did we have from US users on mobile devices in January 2024, broken down by page?"
sql
SELECT dp.page_name, COUNT(*) AS view_count
FROM fact_page_views f
JOIN dim_user du ON f.user_key = du.user_key
JOIN dim_device dd ON f.device_key = dd.device_key
JOIN dim_date ddate ON f.date_key = ddate.date_key
JOIN dim_page dp ON f.page_key = dp.page_key
WHERE du.country = 'US'
  AND dd.device_type = 'mobile'
  AND ddate.year = 2024
  AND ddate.month = 1
GROUP BY dp.page_name
ORDER BY view_count DESC;
This query scans the fact table (potentially billions of rows) but filters efficiently through dimension attributes. Columnar storage makes this fast—only the needed columns are read.

Dimension Design

Dimension tables are denormalized—they contain redundant data to avoid joins between dimension levels.
sql
dim_date(
  date_key INT PRIMARY KEY,
  full_date DATE,
  year INT,
  quarter INT,
  month INT,
  month_name VARCHAR(20),
  week INT,
  day_of_week INT,
  day_name VARCHAR(20),
  is_weekend BOOLEAN,
  is_holiday BOOLEAN
)
Notice the redundancy: month_name can be derived from month, day_name from day_of_week. We store both because analytics queries frequently group by month name or day name, and computing these at query time for billions of rows is wasteful.

Slowly Changing Dimensions

User attributes change over time. A user might change their subscription tier or move to a different country. How do you handle historical analysis?
Type 1: Overwrite
Just update the dimension row. Historical queries use the current value.
sql
UPDATE dim_user SET tier = 'premium' WHERE user_id = 123
Simple, but you lose history. If someone upgraded from free to premium, all historical analysis treats them as always having been premium.
Type 2: Add New Row
Create a new dimension row for each change, tracking validity periods:
sql
dim_user(
  user_key INT PRIMARY KEY,   -- surrogate key
  user_id INT,                -- natural key (not unique)
  name VARCHAR(100),
  tier VARCHAR(20),
  valid_from DATE,
  valid_to DATE,              -- NULL for current
  is_current BOOLEAN
)
When a user upgrades:
  1. Update old row:
    valid_to = today, is_current = false
  2. Insert new row:
    valid_from = today, valid_to = NULL, is_current = true
Historical queries join on date ranges. Current queries filter is_current = true.
This is more complex but preserves history accurately.
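The two Type 2 steps can be sketched in Python against an in-memory list of dimension rows; apply_scd2_change is a hypothetical helper standing in for the close-old-row/insert-new-row pair of SQL statements.

```python
import datetime

def apply_scd2_change(rows, user_id, new_attrs, today):
    """Apply a Type 2 slowly-changing-dimension update in place.

    rows: list of dicts with user_id, tier, valid_from, valid_to, is_current.
    Closes the current row, then appends a new current row.
    """
    for row in rows:
        if row["user_id"] == user_id and row["is_current"]:
            row["valid_to"] = today      # step 1: close the old row
            row["is_current"] = False
    rows.append({"user_id": user_id, **new_attrs,   # step 2: new current row
                 "valid_from": today, "valid_to": None, "is_current": True})

dim_user = [{"user_id": 123, "tier": "free",
             "valid_from": datetime.date(2023, 1, 1),
             "valid_to": None, "is_current": True}]
apply_scd2_change(dim_user, 123, {"tier": "premium"}, datetime.date(2024, 1, 15))
assert [r["tier"] for r in dim_user if r["is_current"]] == ["premium"]
assert dim_user[0]["valid_to"] == datetime.date(2024, 1, 15)  # history preserved
```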
Type 3: Add Columns
Store both old and new values:
sql
dim_user(
  user_key INT PRIMARY KEY,
  user_id INT,
  name VARCHAR(100),
  current_tier VARCHAR(20),
  previous_tier VARCHAR(20)
)
Limited to tracking one historical value. Rarely used.

Fact Table Design

Fact tables are where volume lives. A page view fact table might have:
sql
fact_page_views(
  view_id BIGINT,
  user_key INT REFERENCES dim_user,
  page_key INT REFERENCES dim_page,
  device_key INT REFERENCES dim_device,
  date_key INT REFERENCES dim_date,
  time_key INT REFERENCES dim_time,
  session_id VARCHAR(50),
  referrer VARCHAR(500),
  load_time_ms INT,
  is_bounce BOOLEAN
)
Design considerations:
Grain: Each row represents one page view. The grain must be clearly defined and consistent. Mixing grains (some rows are page views, some are sessions) causes analytical errors.
Measures: Numeric values you'll aggregate—load_time_ms for performance analysis, is_bounce for conversion analysis. Measures should be additive where possible.
Degenerate dimensions: session_id is stored in the fact table rather than a separate dimension because no additional session attributes are needed. It's a "degenerate" dimension.

Partitioning for Analytics

Analytical tables are partitioned by time—almost always. Queries almost always filter by date range, and partition pruning dramatically reduces data scanned.
sql
CREATE TABLE fact_page_views (
  -- columns
) PARTITION BY RANGE (date_key);

CREATE TABLE fact_page_views_2024_01
  PARTITION OF fact_page_views
  FOR VALUES FROM (20240101) TO (20240201);
A query for January 2024 data only scans the January partition—potentially 1/12 of a year's data, or 1/120 of a decade's data.

Pre-Aggregated Rollups

Some queries are too expensive even with columnar storage and partitioning. Computing daily active users across a year requires deduplicating users across 365 days. Running this on raw page views might scan trillions of rows.
Pre-aggregated rollups compute and store aggregated data:
sql
daily_active_users(
  date_key INT,
  platform VARCHAR(20),
  country VARCHAR(50),
  active_users INT,
  new_users INT,
  returning_users INT,
  PRIMARY KEY(date_key, platform, country)
)
This is populated by a nightly batch job:
sql
INSERT INTO daily_active_users
SELECT
  date_key,
  platform,
  country,
  COUNT(DISTINCT user_id) AS active_users,
  ...
FROM fact_page_views
WHERE date_key = <yesterday>
GROUP BY date_key, platform, country;
Dashboard queries hit the rollup table instead of raw facts—milliseconds instead of minutes.
The trade-off: you must decide aggregation dimensions upfront. If someone wants active users by browser (not in the rollup), they're back to the raw data.

A Real Analytics Architecture

A production analytics platform typically has layers:
  1. Raw event storage: All events, minimally processed. Often in object storage (S3) in compressed formats (Parquet). This is the source of truth.
  2. Data warehouse: Cleaned, modeled data in a star schema. Tools like BigQuery, Snowflake, or Redshift. This is what analysts query.
  3. Pre-aggregated cubes: Common aggregations precomputed. Either in the warehouse as rollup tables, or in specialized OLAP cubes.
  4. Caching layer: Query results cached for dashboards. Fresh enough for business purposes (hourly, daily).
Data flows from raw to warehouse via ETL or ELT pipelines. Late-arriving data and corrections are handled through append-only patterns—never delete or update; always add new records with timestamps.

Chapter 10: Logging & Observability System

Logging systems have unique characteristics: extreme write volume, append-only nature, time-based queries, and often short retention. Let me walk through modeling such a system.

The Write Challenge

A logging system for a large infrastructure might ingest:
  • Millions of log lines per second
  • From thousands of servers
  • With varying schemas (application logs, system logs, access logs)
  • Requiring sub-second ingestion latency
Relational databases aren't designed for this. Even with batching, inserting millions of rows per second overwhelms traditional RDBMS.

Time-Series Data Model

Logs are time-series data: each entry has a timestamp and is primarily queried by time range.
The fundamental schema:
logs(
  timestamp TIMESTAMP,
  source VARCHAR(100),
  level VARCHAR(20),
  message TEXT,
  metadata JSONB
)
Partitioned by time:
logs_2024_01_15_00   -- Hour 0 of Jan 15, 2024
logs_2024_01_15_01   -- Hour 1 of Jan 15, 2024
...
Hourly partitioning is common. Each partition is manageable in size, and queries typically span hours or days, not weeks.

Write Path Optimization

To handle massive write throughput:
Batching: Don't insert individual rows. Batch thousands of rows and insert together. Batching amortizes transaction overhead.
Asynchronous writes: Accept logs into a buffer (Kafka, memory) and flush to storage periodically. The trade-off is potential data loss if the buffer isn't persisted.
Append-only storage: Use storage optimized for appends. LSM-tree based systems (like those underlying Elasticsearch or ClickHouse) handle this well.
Avoid indexes on write path: Every index slows writes. Index what's necessary for queries, nothing more.
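The batching strategy above can be sketched as a small buffer class. LogBatcher, flush_fn, and the threshold are illustrative; a production version would also flush on a timer and persist the buffer so a crash doesn't lose buffered rows.

```python
class LogBatcher:
    """Buffer log rows and flush in batches, amortizing per-insert overhead.

    flush_fn stands in for a bulk INSERT or bulk-index call; max_batch is
    the flush threshold.
    """
    def __init__(self, flush_fn, max_batch=1000):
        self.flush_fn = flush_fn
        self.max_batch = max_batch
        self.buffer = []

    def write(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []

batches = []
b = LogBatcher(batches.append, max_batch=3)
for i in range(7):
    b.write({"line": i})
b.flush()  # flush the partial final batch
assert [len(batch) for batch in batches] == [3, 3, 1]
```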

Query Patterns and Indexing

Log queries typically filter by:
  • Time range (almost always)
  • Source/service
  • Log level (errors, warnings)
  • Keywords in message
Design indexes for these patterns:
sql
-- Composite index for common query pattern
CREATE INDEX idx_logs_time_source_level
  ON logs (timestamp, source, level);
Full-text search on messages requires a different approach. Relational indexes can't efficiently search within text. This is why logging systems often use Elasticsearch—its inverted index supports text search.

The Elasticsearch Model

Elasticsearch models logs as JSON documents:
json
{
  "@timestamp": "2024-01-15T10:30:00Z",
  "source": "api-server-01",
  "level": "ERROR",
  "message": "Connection timeout to database",
  "metadata": {
    "request_id": "abc-123",
    "user_id": "456",
    "duration_ms": 5000
  }
}
It builds inverted indexes on all text fields, enabling queries like:
message: "timeout" AND source: "api-server-*" AND @timestamp: [now-1h TO now]
This query finds all logs from the last hour, from API servers, containing "timeout" in the message—useful for debugging an incident.

Hot-Warm-Cold Architecture

Logs have temporal access patterns: recent logs are queried frequently; older logs are queried rarely.
Hot storage: Recent data (last 24-48 hours) on fast storage—SSDs, memory caches. Optimized for query performance.
Warm storage: Older data (days to weeks) on standard storage. Still queryable but slower.
Cold storage: Historical data (weeks to months) on cheap storage—object storage, compressed archives. Queried rarely, typically for compliance or forensics.
In Elasticsearch, this is managed through index lifecycle management:
  • New logs write to hot nodes
  • After 24 hours, indexes move to warm nodes
  • After 7 days, indexes move to cold nodes
  • After 30 days, indexes are deleted or archived
This optimizes cost: expensive fast storage only for data that needs it.
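The lifecycle rules above reduce to a simple age-based policy. This sketch uses the chapter's example thresholds (24 hours hot, 7 days warm, 30 days retention); they are illustrative, not defaults of any particular system.

```python
def storage_tier(index_age_hours):
    """Pick a storage tier from index age, mirroring the lifecycle above."""
    if index_age_hours < 24:
        return "hot"          # fast storage, heavily queried
    if index_age_hours < 7 * 24:
        return "warm"         # standard storage, occasional queries
    if index_age_hours < 30 * 24:
        return "cold"         # cheap storage, compliance/forensics
    return "delete"           # past retention

assert storage_tier(3) == "hot"
assert storage_tier(48) == "warm"
assert storage_tier(10 * 24) == "cold"
assert storage_tier(31 * 24) == "delete"
```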

TTL-Based Deletion

Logs rarely need indefinite retention. Defining retention periods and automatic deletion prevents unbounded growth.
sql
-- Partition-based deletion: delete logs older than 30 days
DROP TABLE logs_2023_12_15;
With time-based partitioning, deleting old data is instantaneous—just drop the partition. No scanning, no index updates.

Schema-on-Read vs Schema-on-Write

Traditional databases enforce schema on write: data must conform to the schema when inserted.
Logging systems often prefer schema-on-read: data is ingested with minimal structure, and schema is applied when querying.
Advantages of schema-on-read:
  • Flexible: different services can log different fields
  • Agile: no schema migrations needed
  • Resilient: malformed data doesn't break ingestion
Disadvantages:
  • Query-time parsing overhead
  • Harder to optimize queries
  • Data quality issues discovered at query time
A middle ground: enforce loose schema (timestamp and source required) but allow arbitrary metadata in a JSON field.
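That middle ground can be sketched as an ingestion-time validator: require timestamp and source, fold everything else into schemaless metadata. validate_log and its field defaults are hypothetical, not any real ingestion API.

```python
REQUIRED = ("timestamp", "source")

def validate_log(entry):
    """Enforce the loose schema: timestamp and source are required;
    arbitrary extra fields are allowed and land in metadata."""
    for field in REQUIRED:
        if field not in entry:
            raise ValueError(f"missing required field: {field}")
    return {
        "timestamp": entry["timestamp"],
        "source": entry["source"],
        "level": entry.get("level", "INFO"),
        "message": entry.get("message", ""),
        # Everything else becomes schemaless metadata, applied on read.
        "metadata": {k: v for k, v in entry.items()
                     if k not in ("timestamp", "source", "level", "message")},
    }

row = validate_log({"timestamp": "2024-01-15T10:30:00Z", "source": "api-01",
                    "message": "timeout", "request_id": "abc-123"})
assert row["metadata"] == {"request_id": "abc-123"}
assert row["level"] == "INFO"
```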

Chapter 11: E-Commerce Platform

E-commerce combines transactional requirements (orders must be accurate) with read-heavy patterns (product browsing) and search needs. Let me walk through the modeling challenges.

Product Catalog: The Flexibility Challenge

Products have varying attributes. A laptop has specs like RAM, storage, and screen size. A shirt has size and color. A book has author and ISBN. A single fixed schema doesn't fit.
Option 1: Entity-Attribute-Value (EAV)
sql
products(product_id, name, category_id, base_price)
product_attributes(product_id, attribute_name, attribute_value)
Flexible but terrible for queries:
sql
-- Find laptops with 16GB RAM
SELECT p.*
FROM products p
JOIN product_attributes pa ON p.product_id = pa.product_id
WHERE pa.attribute_name = 'ram' AND pa.attribute_value = '16GB';
This is slow and can't use normal indexes effectively. EAV is an anti-pattern for production e-commerce.
Option 2: Category-Specific Tables
sql
products(product_id, name, category_id, base_price)
laptops(product_id, ram, storage, screen_size, processor)
shirts(product_id, size, color, material)
books(product_id, author, isbn, page_count)
Query-friendly but explosion of tables. Adding a new category requires schema changes.
Option 3: JSON Attributes
sql
products(
  product_id,
  name,
  category_id,
  base_price,
  attributes JSONB
)
Flexible, supports varying schemas, and in PostgreSQL, JSONB can be indexed:
sql
CREATE INDEX idx_products_ram ON products ((attributes->>'ram'));
This is my preferred approach for most e-commerce systems. It balances flexibility with query performance.

Inventory: The Consistency Challenge

Inventory management requires strong consistency. If a product has 5 units and two customers try to buy 3 each, only one can succeed.
The race condition:
  1. Customer A reads inventory: 5 units
  2. Customer B reads inventory: 5 units
  3. Customer A reserves 3, writes inventory: 2 units
  4. Customer B reserves 3, writes inventory: 2 units
  5. Both think they succeeded, but we sold 6 units of a 5-unit inventory
Solution: Atomic operations
sql
UPDATE inventory
SET quantity = quantity - 3
WHERE product_id = ? AND quantity >= 3
RETURNING quantity;
If the update affects 0 rows, the reservation failed (insufficient inventory). The UPDATE is atomic—no race condition.
For high-traffic products, this creates contention. Every purchase serializes on the same row.
Solution: Inventory sharding
Split inventory into shards:
sql
inventory_shards(
  product_id,
  shard_id,
  quantity
)

-- For product 123 with 100 units:
-- (123, 0, 20), (123, 1, 20), (123, 2, 20), (123, 3, 20), (123, 4, 20)
Reservations try random shards until one succeeds. This distributes contention across shards.
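A sketch of the random-shard reservation loop, with an in-memory dict standing in for the inventory_shards table. In a real database each attempt would be a conditional UPDATE with a quantity >= ? guard, so each shard decrement stays atomic; reserve and the attempts limit are illustrative.

```python
import random

def reserve(shards, product_id, qty, attempts=5):
    """Try random shards until one has enough stock.

    shards: dict mapping (product_id, shard_id) -> quantity.
    Returns the shard that satisfied the reservation, or None on failure.
    """
    shard_ids = [s for (p, s) in shards if p == product_id]
    for _ in range(attempts):
        sid = random.choice(shard_ids)
        if shards[(product_id, sid)] >= qty:
            shards[(product_id, sid)] -= qty  # atomic UPDATE in a real DB
            return sid
    return None  # all attempts hit depleted shards

shards = {(123, s): 20 for s in range(5)}  # 100 units across 5 shards
assert reserve(shards, 123, 3) is not None
assert sum(q for (p, _), q in shards.items() if p == 123) == 97
```

Spreading the stock this way means concurrent buyers usually hit different rows, so hot-product contention drops roughly in proportion to the shard count.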

Orders: The Immutable Record

Order data should be immutable once placed. When an order records a price, that's the price regardless of subsequent product price changes.
sql
orders(
  order_id,
  customer_id,
  status,
  total_amount,
  created_at,
  updated_at
)

order_items(
  order_item_id,
  order_id,
  product_id,
  product_name,   -- snapshot at order time
  unit_price,     -- snapshot at order time
  quantity,
  line_total
)
product_name and unit_price are denormalized snapshots. The order_items table is the permanent record of what was ordered and at what price, regardless of product table changes.

Flash Sales: The Thundering Herd

Flash sales—limited-quantity products available at a specific time—create extreme load spikes. At the sale start, thousands of users simultaneously:
  1. Load the product page
  2. Add to cart
  3. Checkout
This thundering herd can overwhelm normal infrastructure.
Strategies:
Pre-warming: Before the sale, cache product data. Ensure replicas are synchronized. Scale up checkout infrastructure.
Virtual queue: Don't let everyone hit checkout simultaneously. Put users in a queue, process in batches. This trades user experience (waiting) for system stability.
Optimistic inventory: Accept orders optimistically, reconcile inventory asynchronously. Some orders might fail post-checkout, requiring customer notification and refund. This is a business decision: better to potentially over-sell and apologize than to have the site crash.
Separate infrastructure: Route flash sale traffic to dedicated infrastructure that doesn't affect regular shopping.

Cart: Redis vs Database

Shopping carts present a storage choice:
Redis (or similar cache):
  • Fast reads and writes
  • Simple data model (hash of product_id → quantity)
  • Automatic expiration (cart abandoned after 24 hours)
  • Data loss risk if Redis restarts without persistence
Database:
  • Durable
  • Can store complex cart data (saved for later, gift options)
  • Enables analytics on cart behavior
  • Slower than Redis
Hybrid approach:
  • Active carts in Redis for speed
  • Persist to database periodically (every few minutes)
  • On Redis miss, restore from database
This provides speed with durability. Cart loss is a bad user experience but not a business catastrophe.

Search Index Synchronization

Product search requires a search engine (Elasticsearch, Algolia). The search index must stay synchronized with the product database.
Change Data Capture (CDC): Capture database changes (inserts, updates, deletes) and stream to the search indexer. Tools like Debezium tap into database replication logs.
Event-driven updates: Product service publishes events (ProductCreated, ProductUpdated). Search service subscribes and updates index.
Periodic full sync: Nightly job that rebuilds the entire search index from the database. Heavy-handed but ensures consistency.
I typically use CDC for real-time sync with periodic full sync as a consistency safety net.

Chapter 12: IoT / Time-Series System

IoT systems collect data from devices—sensors, vehicles, industrial equipment. The data is time-series: each reading has a timestamp and a source device.

The Volume Challenge

Consider a fleet of 100,000 vehicles, each reporting location every 5 seconds:
  • 100,000 × 12 readings/minute = 1.2 million readings/minute
  • 1.2M × 60 × 24 = 1.7 billion readings/day
Traditional databases struggle with this volume. Time-series databases are designed for it.

Time-Based Partitioning

Every time-series system partitions by time. This is non-negotiable.
sql
device_readings(
  device_id VARCHAR(50),
  timestamp TIMESTAMP,
  reading_type VARCHAR(20),
  value DOUBLE,
  metadata JSONB,
  PRIMARY KEY (device_id, timestamp, reading_type)
) PARTITION BY RANGE (timestamp);
Queries almost always include time range:
sql
SELECT * FROM device_readings
WHERE device_id = '123'
  AND timestamp BETWEEN '2024-01-15' AND '2024-01-16';
Partition pruning ensures only relevant partitions are scanned.

Rolling Window Retention

IoT data often has limited retention requirements. Real-time dashboards need the last few hours. Troubleshooting needs a few days. Long-term analysis might need aggregates but not raw data.
Last 24 hours:   full resolution (every 5 seconds)
Last 7 days:     1-minute aggregates
Last 90 days:    1-hour aggregates
Beyond 90 days:  daily aggregates only
This dramatically reduces storage while preserving analytical value.

Downsampling

Aggregating high-resolution data into lower-resolution summaries:
sql
-- Hourly aggregates from minute data
INSERT INTO readings_hourly (device_id, hour, reading_type, avg_value, min_value, max_value, count)
SELECT
  device_id,
  DATE_TRUNC('hour', timestamp) AS hour,
  reading_type,
  AVG(value), MIN(value), MAX(value), COUNT(*)
FROM readings_minute
WHERE timestamp >= ? AND timestamp < ?
GROUP BY device_id, DATE_TRUNC('hour', timestamp), reading_type;
This runs periodically (hourly) to maintain the aggregate table.

Compression Strategies

Time-series data compresses well because adjacent values are often similar. A temperature sensor might read: 72.1, 72.2, 72.1, 72.3, 72.2...
Specialized compression:
  • Delta encoding: Store differences between consecutive values instead of absolute values
  • Run-length encoding: "72.1 repeated 50 times" instead of 50 separate values
  • Gorilla compression: A technique from Facebook's time-series database, achieving 10x compression
Time-series databases like InfluxDB, TimescaleDB, and ClickHouse implement these automatically.
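Delta encoding, the simplest of these, is easy to sketch. The differences between adjacent readings are small integers that compress far better than the absolute values; delta_encode and delta_decode below are illustrative helpers.

```python
def delta_encode(values):
    """Store the first value plus successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Reverse delta encoding by accumulating the differences."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out

# Temperatures in tenths of a degree: 72.1, 72.2, 72.1, 72.3, 72.2.
readings = [721, 722, 721, 723, 722]
encoded = delta_encode(readings)
assert encoded == [721, 1, -1, 2, -1]   # small deltas, cheap to store
assert delta_decode(encoded) == readings  # lossless round trip
```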

Efficient Recent Queries

"Show me the last hour of readings for device 123" is the most common query pattern.
Optimization: Keep recent data in memory or hot storage. TimescaleDB chunks tables by time; recent chunks stay in memory. InfluxDB's cache keeps recent data for fast access.
Query patterns to optimize:
sql
-- Latest reading per device (common for dashboards) SELECT DISTINCT ON (device_id) * FROM device_readings ORDER BY device_id, timestamp DESC; -- Last N readings for a device SELECT * FROM device_readings WHERE device_id = '123' ORDER BY timestamp DESC LIMIT 100;
Both benefit from indexes on (device_id, timestamp DESC).

Chapter 13: Search System

Search systems require specialized data structures beyond what relational databases offer. Let me explain the fundamentals.

The Inverted Index

A traditional index maps: row → data. An inverted index maps: data → rows.
Consider documents:
  • Doc 1: "The quick brown fox"
  • Doc 2: "The lazy brown dog"
  • Doc 3: "Quick foxes are rare"
The inverted index:
"the" → [Doc 1, Doc 2] "quick" → [Doc 1, Doc 3] "brown" → [Doc 1, Doc 2] "fox" → [Doc 1] "lazy" → [Doc 2] "dog" → [Doc 2] "foxes" → [Doc 3] "are" → [Doc 3] "rare" → [Doc 3]
Searching for "quick" finds [Doc 1, Doc 3]. Searching for "quick brown" intersects [Doc 1, Doc 3] with [Doc 1, Doc 2] to get [Doc 1].
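The index construction and the posting-list intersection can be sketched directly; build_inverted_index and search are illustrative helpers using naive whitespace tokenization, not a real analyzer.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Multi-term search: intersect the posting lists of every term."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "The quick brown fox",
        2: "The lazy brown dog",
        3: "Quick foxes are rare"}
index = build_inverted_index(docs)
assert search(index, "quick") == [1, 3]
assert search(index, "quick brown") == [1]
```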

Why Relational LIKE Doesn't Work

sql
SELECT * FROM documents WHERE content LIKE '%quick%';
This requires scanning every document, checking if "quick" appears. With millions of documents, this is unacceptably slow.
Even with indexes, LIKE with a leading wildcard (%quick%) can't use the index. The index is sorted by the start of the string, not arbitrary substrings.

Full-Text Search Features

Beyond simple matching, search systems provide:
Stemming: "running" and "runs" both match "run". The index stores root forms.
Tokenization: splitting text into indexable terms. Depending on the analyzer, "don't" might be indexed as "don't", "dont", or "do" + "not". Different tokenizers exist for different languages.
Ranking: Documents aren't just matches or non-matches. They're ranked by relevance. TF-IDF, BM25, and other algorithms score matches.
Phrase matching: "New York" as a phrase, not just documents containing both words separately.
Fuzzy matching: "quikc" matches "quick" (typo tolerance).

Hybrid Architecture

Most systems combine relational databases with search engines:
  • Relational database: Source of truth for structured data. Handles transactions, relationships, consistency.
  • Search engine: Optimized index for search. Eventually consistent with the database.
Data flow:
  1. Application writes to database
  2. Change event triggers index update
  3. Search engine indexes the document
  4. Search queries hit the search engine
  5. Results include IDs; application fetches full data from database
This separation allows each system to be optimized for its purpose.
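The five-step flow above can be sketched end to end with dicts standing in for both stores (all names here are illustrative; in production, step 2 runs asynchronously via CDC or a queue, so the index is eventually consistent):

```python
from collections import defaultdict

database = {}                     # source of truth: id -> document
search_index = defaultdict(set)   # inverted index: term -> ids

def index_document(doc_id, text):
    for term in text.lower().split():
        search_index[term].add(doc_id)

def write_document(doc_id, text):
    database[doc_id] = text       # 1. application writes to the database
    index_document(doc_id, text)  # 2-3. change event updates the search index

def search(query):
    # 4. the query hits the search engine and returns matching IDs...
    ids = set.intersection(*(search_index.get(t, set())
                             for t in query.lower().split()))
    # 5. ...and the application fetches full data from the database
    return {i: database[i] for i in ids}

write_document(1, "relational databases store rows")
write_document(2, "search engines store inverted indexes")
print(search("store rows"))  # {1: 'relational databases store rows'}
```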

Relevance Scoring

How does a search engine decide Doc 1 is more relevant than Doc 3?
Term Frequency (TF): Documents with more occurrences of the search term rank higher. If "quick" appears 5 times in Doc A and once in Doc B, Doc A scores higher.
Inverse Document Frequency (IDF): Rare terms matter more. "The" appears in almost every document—it's not useful for ranking. "Bioluminescent" appears rarely—documents containing it are probably relevant to searches for it.
Field weighting: Matches in the title might score higher than matches in the body.
Recency: Newer documents might rank higher.
Popularity: Documents with more views or engagement might rank higher.
Production search systems tune these factors extensively based on user behavior data.
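A minimal TF-IDF scorer makes the interplay of the first two factors concrete (a sketch of the classic formula; production engines use BM25 with document-length normalization on top):

```python
import math
from collections import Counter

def tf_idf_scores(docs, term):
    """Score each document for one term: tf * idf."""
    n_docs = len(docs)
    doc_freq = sum(1 for text in docs.values() if term in text.lower().split())
    if doc_freq == 0:
        return {}
    # rare terms get a large idf; ubiquitous terms get ~0
    idf = math.log(n_docs / doc_freq)
    scores = {}
    for doc_id, text in docs.items():
        tf = Counter(text.lower().split())[term]
        if tf:
            scores[doc_id] = tf * idf
    return scores

docs = {
    "A": "quick quick quick brown fox",
    "B": "the quick dog",
    "C": "the lazy dog",
}
# "quick" appears in 2 of 3 docs; doc A's higher term frequency wins
print(tf_idf_scores(docs, "quick"))
```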

Chapter 14: Ad-Tech / High Throughput Counters

Ad-tech systems have extreme requirements: billions of events, real-time aggregation, low latency analytics. Let me explain the data modeling challenges.

Impression and Click Tracking

Every ad shown is an impression. Every click is a click event. At scale:
  • 10 billion impressions/day
  • 100 million clicks/day
  • Must track by: ad, campaign, publisher, user segment, geography, device, time
The naive table:
sql
impressions(
  impression_id,
  ad_id,
  campaign_id,
  publisher_id,
  user_id,
  country,
  device_type,
  timestamp
)
Inserting 10 billion rows/day requires ~115,000 inserts/second sustained. Queries aggregating across dimensions touch billions of rows.

Event Streaming Architecture

Direct database writes don't scale. The standard architecture:
  1. Event ingestion: Events flow to Kafka topics (impressions, clicks)
  2. Stream processing: Flink/Spark processes streams, aggregates in windows
  3. Pre-aggregated storage: Aggregated counts go to a fast store (Redis, Cassandra)
  4. Raw event archive: Full events go to object storage for later analysis
Real-time dashboards query the pre-aggregated store. Historical analysis queries the raw archive.

High-Cardinality Counters

Counting unique users who saw an ad seems simple:
sql
SELECT COUNT(DISTINCT user_id) FROM impressions WHERE ad_id = ?
But user_id has billions of values (high cardinality). Maintaining exact distinct counts across billions of events and thousands of ads is impractical.
HyperLogLog to the rescue: A probabilistic data structure that estimates distinct counts with ~0.8% error using minimal memory. Instead of storing every user_id, HyperLogLog maintains a compact sketch.
Most analytics systems accept approximate counts. "Approximately 1.2 million unique viewers" is good enough for ad billing and optimization.
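The core of the trick fits in a short sketch. This is a bare-bones HyperLogLog with 1,024 registers (expected error ~3%, not the ~0.8% of a full-sized production sketch); real implementations such as Redis's add bias correction and sparse encodings:

```python
import hashlib

B = 10            # register index bits
M = 1 << B        # 1024 registers; expected error ~1.04/sqrt(M) ≈ 3.2%
ALPHA = 0.7213 / (1 + 1.079 / M)

def _hash64(item):
    return int.from_bytes(
        hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")

class HyperLogLog:
    def __init__(self):
        self.registers = [0] * M

    def add(self, item):
        h = _hash64(item)
        idx = h >> (64 - B)                      # top B bits pick a register
        rest = h & ((1 << (64 - B)) - 1)         # remaining 54 bits
        rank = (64 - B) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        return ALPHA * M * M / sum(2.0 ** -r for r in self.registers)

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")      # re-adding a user never changes the estimate
print(round(hll.estimate()))  # close to 100,000 using only 1,024 registers
```

Note the union property that makes this so useful for ad analytics: the sketches for two ads can be merged (register-wise max) to count unique viewers across both without re-reading any events.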

Real-Time Dashboards

Advertisers want to see campaign performance in real-time: impressions, clicks, spend, conversions.
Architecture:
  • Stream processor aggregates by campaign, by minute
  • Results stored in Redis:
    campaign:123:impressions:2024-01-15T10:30 = 45678
  • Dashboard queries Redis, sums across time buckets
  • Historical data rolls up to ClickHouse for long-term storage
This provides sub-second dashboard updates while keeping infrastructure manageable.
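The Redis key pattern above can be mimicked with a plain dict to show the read path (key names as in the example; a real deployment would use Redis INCR on write and MGET on read):

```python
from collections import defaultdict

counters = defaultdict(int)   # stand-in for Redis

def record_impression(campaign_id, minute):
    # equivalent of: INCR campaign:{id}:impressions:{minute}
    counters[f"campaign:{campaign_id}:impressions:{minute}"] += 1

def dashboard_total(campaign_id, minutes):
    # the dashboard sums across time buckets
    return sum(counters[f"campaign:{campaign_id}:impressions:{m}"]
               for m in minutes)

for _ in range(3):
    record_impression(123, "2024-01-15T10:30")
record_impression(123, "2024-01-15T10:31")

print(dashboard_total(123, ["2024-01-15T10:30", "2024-01-15T10:31"]))  # 4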

Windowed Aggregation

Counting events by time window:
Tumbling windows: Non-overlapping fixed periods. Events from 10:00-10:05 go in one bucket, 10:05-10:10 in another.
Sliding windows: Overlapping windows. "Last 5 minutes" computed continuously, including events from 10:03-10:08, then 10:04-10:09, etc.
Session windows: Grouped by user activity. Events within 30 minutes of each other are one session; gaps longer than 30 minutes start a new session.
Stream processing frameworks (Flink, Kafka Streams) provide built-in windowing. The challenge is handling late-arriving events—data that arrives after its window has closed.
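Tumbling windows are the simplest to demonstrate directly (a sketch of the bucketing logic only; Flink and Kafka Streams also handle watermarks and late-arriving events, which this omits):

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds):
    """Count events per non-overlapping window of fixed length."""
    counts = defaultdict(int)
    for ts in events:
        window_start = ts - (ts % window_seconds)  # round down to bucket start
        counts[window_start] += 1
    return dict(counts)

# event timestamps in seconds; 300-second (5-minute) tumbling windows
events = [0, 10, 250, 301, 302, 650]
print(tumbling_counts(events, 300))  # {0: 3, 300: 2, 600: 1}
```

A late event with timestamp 250 arriving after the 0-300 window has been emitted is exactly the hard case: the framework must either hold windows open for a grace period or emit corrections downstream.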

Part V: Database Selection

Chapter 15: Choosing the Right Database

"Which database should I use?" is the wrong question. The right question is: "What are my workload characteristics, and which database matches them?"

PostgreSQL: The Versatile Workhorse

PostgreSQL is my default recommendation for new projects. It's the most capable general-purpose database available.
Strengths:
  • ACID compliance with sophisticated transaction support
  • Rich SQL with CTEs, window functions, JSON operators
  • JSONB for flexible schema alongside relational tables
  • Full-text search (good enough for many use cases)
  • Extensibility (PostGIS for geo, pg_vector for embeddings)
  • Mature replication and high-availability options
  • Large ecosystem of tools and expertise
Limitations:
  • Horizontal scaling requires external tools (Citus) or careful sharding
  • Write throughput limited by single-leader architecture
  • Not optimized for analytical workloads at scale
  • Connection model (process per connection) limits concurrency
Use when:
  • Your schema is relational or hybrid relational+JSON
  • You need transactions across tables
  • You need the flexibility of SQL
  • You want one database to start with
Avoid when:
  • You need massive write throughput (>100K writes/sec)
  • You need linear horizontal scale
  • Your data is purely analytical at petabyte scale

MySQL: The Established Standard

MySQL powers a significant portion of the web. It's proven and well-understood.
Strengths:
  • Extremely well-tested and stable
  • Huge community and tooling ecosystem
  • Good read scaling via replication
  • InnoDB engine is solid for OLTP
  • Vitess provides sharding solution (YouTube scale)
Limitations:
  • Historically weaker SQL feature set than PostgreSQL (improving)
  • JSON support less sophisticated than PostgreSQL's JSONB
  • Replication can lag under heavy write load
  • Default configuration often suboptimal
Use when:
  • You have MySQL expertise
  • You're building on the LAMP stack
  • You need a database that every hosting provider supports
Avoid when:
  • You need advanced SQL features
  • You're starting fresh with no MySQL investment

MongoDB: The Document Database

MongoDB popularized the document database model. It's evolved significantly from early criticisms.
Strengths:
  • Flexible schema—great for varied document structures
  • Horizontal scaling built-in (sharding)
  • Good developer experience for object-oriented code
  • Atlas (managed service) is polished
  • Aggregation pipeline is powerful
Limitations:
  • Multi-document transactions arrived only in recent versions (4.0+), and with limitations
  • Join-like operations ($lookup) are expensive
  • Easy to create performance problems with poor data modeling
  • Storage efficiency less than relational for highly relational data
Use when:
  • Your data is document-oriented (content, catalogs, configurations)
  • Schema flexibility is genuinely needed
  • You're denormalizing anyway
Avoid when:
  • Your data is highly relational
  • You need cross-document transactions
  • Your team doesn't understand document modeling

Cassandra: The Distributed Write Machine

Cassandra is designed for massive write throughput across data centers.
Strengths:
  • Linear horizontal scale for writes
  • Multi-datacenter replication
  • Tunable consistency
  • No single point of failure
  • Handles very large datasets
Limitations:
  • Query model is extremely limited (no joins, limited filtering)
  • Data modeling requires deep understanding
  • Operational complexity is high
  • Read latency higher than alternatives for some patterns
Use when:
  • You need massive write throughput (>1M writes/sec)
  • You need multi-region active-active
  • Your query patterns fit Cassandra's model
Avoid when:
  • You need flexible queries
  • You need strong consistency
  • You don't have operational expertise

DynamoDB: The Managed NoSQL

DynamoDB provides Cassandra-like capabilities as a managed service.
Strengths:
  • Fully managed (no operations)
  • Scales automatically
  • Single-digit millisecond latency
  • Integrates with AWS ecosystem
  • Pay-per-request pricing option
Limitations:
  • Query model is limited (like Cassandra)
  • Costs can explode with inefficient access patterns
  • Vendor lock-in (AWS only)
  • Item size limit (400KB)
Use when:
  • You're on AWS and want managed infrastructure
  • Your access patterns are well-defined
  • You need extreme scale
Avoid when:
  • You need flexible queries
  • Multi-cloud is a requirement
  • You're cost-sensitive at high scale

Redis: The Speed Layer

Redis is an in-memory data structure store, typically used as cache or fast storage.
Strengths:
  • Sub-millisecond latency
  • Rich data structures (sets, sorted sets, hyperloglog)
  • Pub/sub capabilities
  • Lua scripting for atomic operations
  • Cluster mode for horizontal scale
Limitations:
  • Memory-bound (expensive at large scale)
  • Persistence options have trade-offs
  • Not a primary database for most use cases
  • Data loss possible under failures
Use when:
  • You need a cache layer
  • You need session storage
  • You need real-time leaderboards or counters
  • You need a pub/sub broker
Avoid when:
  • You need durable primary storage
  • Your dataset exceeds memory capacity
  • You need complex queries

Elasticsearch: The Search Engine

Elasticsearch excels at full-text search and log analytics.
Strengths:
  • Inverted index for fast text search
  • Powerful query DSL
  • Real-time indexing
  • Kibana for visualization
  • Good for log aggregation
Limitations:
  • Not a database (eventual consistency, no transactions)
  • Resource-intensive
  • Cluster management is complex
  • Write-heavy workloads stress the system
Use when:
  • You need full-text search
  • You're building log analytics
  • You need complex text queries
Avoid when:
  • You're using it as a primary database
  • You need transactions
  • Your use case is simple search (PostgreSQL FTS might suffice)

ClickHouse: The Analytical Speed Demon

ClickHouse is a columnar database optimized for analytical queries.
Strengths:
  • Extremely fast analytical queries
  • Efficient compression
  • Real-time ingestion
  • SQL interface
  • Cost-effective for large datasets
Limitations:
  • Not for OLTP (no transactions, limited updates)
  • Operational complexity
  • Replication model requires understanding
  • JOINs can be problematic
Use when:
  • You need sub-second analytics on billions of rows
  • You're building real-time dashboards
  • You have high-volume event data
Avoid when:
  • You need transactions
  • You need frequent updates
  • You're not doing analytics

BigQuery / Snowflake: The Warehouse Giants

These managed analytical warehouses excel at petabyte-scale analytics.
Strengths:
  • Massive scale with zero infrastructure
  • Separation of storage and compute
  • Pay-per-query pricing
  • SQL interface
  • Integrations with BI tools
Limitations:
  • Not for OLTP
  • Latency too high for interactive use (seconds to minutes)
  • Costs can surprise at high query volume
  • Vendor lock-in
Use when:
  • You have petabytes of analytical data
  • You want zero operational burden
  • Query latency of seconds is acceptable
Avoid when:
  • You need sub-second queries
  • You need real-time analytics
  • You're cost-sensitive at high volume

NewSQL: CockroachDB, TiDB, Spanner

NewSQL databases promise SQL semantics with horizontal scale.
Strengths:
  • SQL with horizontal scale
  • ACID across distributed data
  • Automatic sharding
  • Strong consistency
Limitations:
  • Higher latency than single-node PostgreSQL
  • Operational complexity varies
  • Younger ecosystems
  • Performance characteristics differ from traditional RDBMS
Use when:
  • You've outgrown single-node PostgreSQL
  • You need SQL with horizontal scale
  • You need geo-distributed data with consistency
Avoid when:
  • Single-node PostgreSQL meets your needs
  • You need the maturity of traditional databases
  • Your team lacks distributed systems expertise

Chapter 16: Scaling Strategies

Scaling is not a single technique—it's a toolbox of approaches applied appropriately.

Vertical Scaling: The First Option

Before distributing, scale up. Modern servers support:
  • Hundreds of CPU cores
  • Terabytes of RAM
  • Petabytes of storage
A well-tuned PostgreSQL on serious hardware handles more than most applications need. Vertical scaling is operationally simple—one machine, no distributed complexity.
When vertical works:
  • Your data fits on one large machine
  • Your read/write rate fits single-machine throughput
  • You can afford the hardware
When vertical fails:
  • Physical limits exceeded
  • Cost becomes prohibitive
  • Availability requirements exceed single-machine guarantees

Read Replicas: The First Distribution

Most applications read more than they write. Read replicas multiply read capacity.
Client → Load Balancer → Primary   (writes)
                       → Replica 1 (reads)
                       → Replica 2 (reads)
                       → Replica 3 (reads)
The primary handles writes; replicas handle reads. Adding replicas increases read throughput linearly.
Complications:
Replication lag: Replicas are slightly behind the primary. A user writes data, then immediately reads—from a replica that hasn't received the write yet. Strategies:
  • Read-your-writes: Route reads to primary for N seconds after writes
  • Sticky sessions: Users always read from the same replica
  • Wait for replication: Delay reads until replicas catch up
Failover: When the primary fails, a replica must be promoted. This must be automated and tested.

Sharding: The Nuclear Option

When a single primary can't handle the write load or data volume, sharding distributes data across multiple primaries.
Each shard is an independent database. Data is partitioned based on a shard key:
Range sharding: User IDs 1-1M on shard 1, 1M-2M on shard 2. Simple but can create hot shards (new users mostly on the latest shard).
Hash sharding: hash(user_id) % N determines shard. Even distribution but makes range queries across shards expensive.
Geo sharding: US users on US shard, EU users on EU shard. Good for data locality and compliance.
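Hash-based routing is a one-liner, and a quick experiment shows why rebalancing it hurts (illustrative sketch; real systems use consistent hashing or pre-split ranges precisely to limit this data movement):

```python
import hashlib

def shard_for(user_id, num_shards):
    """Stable hash of the shard key, modulo shard count."""
    h = int.from_bytes(
        hashlib.blake2b(str(user_id).encode(), digest_size=8).digest(), "big")
    return h % num_shards

# With N=4, each user maps deterministically to one shard...
placement_4 = {uid: shard_for(uid, 4) for uid in range(1000)}

# ...but growing to N=5 remaps most keys, which is why naive modulo
# sharding makes adding shards so operationally painful.
placement_5 = {uid: shard_for(uid, 5) for uid in range(1000)}
moved = sum(placement_4[u] != placement_5[u] for u in placement_4)
print(f"{moved / 1000:.0%} of keys moved")  # roughly 80%
```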
The complexity tax:
  • Cross-shard queries: Queries spanning shards require scatter-gather, dramatically increasing complexity
  • Cross-shard transactions: Essentially impossible without significant infrastructure
  • Rebalancing: Adding shards requires moving data, which is operationally risky
  • Operational burden: N shards means N times the monitoring, backups, and incident surface
I've seen teams shard prematurely and drown in complexity. I've seen teams avoid sharding too long and face scaling crises. The right time is when vertical scaling and read replicas are genuinely insufficient.

Global Secondary Indexes

In a sharded database, local indexes only search within a shard. "Find all users named Alice" requires querying every shard.
Global secondary indexes maintain cross-shard lookups:
  • Users sharded by user_id
  • Global index from email → user_id, shard
This enables efficient email lookups but creates consistency challenges: the index is in a different shard than the data.

Multi-Region Replication

For global applications, data must be in multiple regions:
  • Latency: Users should read from nearby replicas
  • Availability: Regional outages shouldn't cause global outages
  • Compliance: Data sovereignty laws require certain data to stay in-region
Patterns:
Primary in one region, read replicas globally: Simple but writes always go to one region. High write latency for distant users.
Multi-primary (active-active): Primaries in multiple regions, replicating to each other. Lower write latency but conflict resolution becomes necessary.
Partitioned by region: Each region owns its data. Users are homed to a region. Simple but cross-region queries are complex.

Part VI: Indexing Masterclass

Chapter 17: Understanding Index Structures

Indexes are how databases find data efficiently. Understanding index structures helps you design effective schemas and troubleshoot performance.

B-Trees: The Relational Standard

B-trees are balanced tree structures that maintain sorted order. They're the default index type in PostgreSQL, MySQL, and most relational databases.
A B-tree with millions of entries is only three or four levels deep; even billions of entries rarely need more than five. Finding any value requires reading a handful of pages from disk—a few milliseconds at most.
B-tree strengths:
  • Efficient point lookups (find exact value)
  • Efficient range scans (find values between X and Y)
  • Efficient ordered retrieval (sorted results without extra sorting)
  • Balanced read and write performance
B-tree operations:
  • Lookup: O(log N)
  • Insert: O(log N)
  • Delete: O(log N)
  • Range scan: O(log N + K) where K is results

LSM Trees: The Write-Optimized Alternative

Log-Structured Merge trees optimize for write throughput at the cost of read efficiency. They're used in Cassandra, RocksDB, and LevelDB.
Writes go to an in-memory buffer (memtable). When full, the memtable is written to disk as an immutable sorted file (SSTable). Periodically, SSTables are merged (compaction).
LSM strengths:
  • Very high write throughput (sequential writes)
  • Efficient disk utilization
  • Good for write-heavy workloads
LSM weaknesses:
  • Read requires checking multiple SSTables (read amplification)
  • Background compaction consumes CPU and I/O (write amplification)
  • Compaction can cause latency spikes

Bloom Filters: Probabilistic Acceleration

Bloom filters are probabilistic data structures that test set membership. They can say "definitely not in set" or "probably in set."
In databases, Bloom filters help avoid unnecessary disk reads. Before reading an SSTable to find a key, check the Bloom filter. If it says "not present," skip the read.
if bloom_filter.might_contain(key):
    actually_read_sstable(key)
else:
    skip_sstable()  # Saved a disk read!
False positives cause unnecessary reads. False negatives don't happen—if the key is present, the Bloom filter always says "might contain."
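A minimal Bloom filter makes both guarantees visible (a sketch with a fixed size and hash count; production filters derive the bit-array size and number of hashes from a target false-positive rate):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # derive k independent positions by salting the hash input
        for i in range(self.k):
            digest = hashlib.blake2b(f"{i}:{key}".encode(),
                                     digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # all k bits set -> "probably in set"; any bit clear -> "definitely not"
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
for key in ("user:1", "user:2", "user:3"):
    bf.add(key)

print(bf.might_contain("user:2"))    # True — never a false negative
print(bf.might_contain("user:999"))  # usually False; True would be a false positive
```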

Index Types You Should Know

Composite indexes: Index on multiple columns. CREATE INDEX idx ON orders(customer_id, created_at). Order matters: this index is useful for queries filtering by customer_id, or customer_id + created_at. It's not useful for queries filtering only by created_at.
Covering indexes: Indexes that contain all columns needed by a query. The query can be answered from the index alone without reading the table (index-only scan).
sql
CREATE INDEX idx ON orders(customer_id) INCLUDE (status, total);

-- Query can use index only:
SELECT status, total FROM orders WHERE customer_id = 123;
Partial indexes: Indexes on a subset of rows. Useful when queries always filter to that subset.
sql
CREATE INDEX idx ON orders(customer_id) WHERE status = 'pending';
-- Only pending orders are indexed: smaller index, faster queries
Expression indexes: Indexes on computed expressions.
sql
CREATE INDEX idx ON users(LOWER(email));
-- Now case-insensitive email lookups are fast

Write Amplification: The Hidden Cost

Every write to indexed data also updates indexes. A table with 10 indexes means each insert requires ~10 additional operations.
This is write amplification: one logical write becomes many physical writes.
Implications:
  • Insert/update performance degrades with index count
  • Indexes aren't free—add only what queries need
  • For write-heavy workloads, be aggressive about removing unused indexes
Measuring index usage:
PostgreSQL tracks index usage:
sql
SELECT indexrelname, idx_scan FROM pg_stat_user_indexes ORDER BY idx_scan ASC;
Indexes with zero or few scans are candidates for removal.

Part VII: API and Database Design Together

Chapter 18: Designing APIs That Respect Database Realities

API design and database design are interconnected. A poorly designed API forces database access patterns that kill performance. A well-designed API enables efficient database usage.

Resource Modeling

REST resources should map sensibly to data entities, but not always 1:1.
Database entities:
  • customers
  • orders
  • order_items
  • products
API resources:
  • /customers/{id}
  • /customers/{id}/orders
  • /orders/{id} (includes items as nested resource)
  • /products/{id}
The API resource /orders/{id} returns denormalized data (order + items) even though the database stores them separately. The API serves client needs; the database serves data integrity needs.

Pagination: Get This Right

Returning all results is not scalable. Pagination is mandatory for any list endpoint.
Offset pagination:
GET /orders?page=10&per_page=20
Implementation:
sql
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20 OFFSET 180;
Problems:
  • Performance: OFFSET 10000 requires reading 10000 rows to skip them
  • Inconsistency: If new orders are inserted while paginating, you'll miss or duplicate items
Cursor pagination:
GET /orders?cursor=eyJpZCI6MTIzNH0=&limit=20
Cursor encodes the position (e.g., last seen order_id). Implementation:
sql
SELECT * FROM orders WHERE id < 1234 ORDER BY id DESC LIMIT 20;
Benefits:
  • Consistent performance regardless of page depth
  • Stable results even with concurrent inserts
  • No skipped or duplicated items
Recommendation: Use cursor pagination for anything but the simplest cases. Offset pagination is acceptable for admin interfaces with small datasets.
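The cursor itself is just an encoded position: the `eyJpZCI6MTIzNH0=` in the example above is base64 of `{"id":1234}`. A sketch of the round trip, with a Python list standing in for the keyset query:

```python
import base64
import json

def encode_cursor(last_id):
    """Opaque cursor: base64-encoded JSON of the last-seen position."""
    payload = json.dumps({"id": last_id}, separators=(",", ":"))
    return base64.b64encode(payload.encode()).decode()

def decode_cursor(cursor):
    return json.loads(base64.b64decode(cursor))["id"]

def next_page(ids, cursor=None, limit=20):
    """Keyset pagination over ids sorted DESC. Stands in for:
    SELECT * FROM orders WHERE id < :cursor ORDER BY id DESC LIMIT :limit
    """
    after = decode_cursor(cursor) if cursor else float("inf")
    page = [i for i in ids if i < after][:limit]
    new_cursor = encode_cursor(page[-1]) if page else None
    return page, new_cursor

ids = sorted(range(1, 1301), reverse=True)   # order ids 1300..1
page, cursor = next_page(ids, encode_cursor(1234), limit=3)
print(page)    # [1233, 1232, 1231]
print(cursor)  # encodes {"id": 1231} for the next request
```

Keeping the cursor opaque (encoded, not a raw id in the URL) lets you change the position encoding later without breaking clients.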

Filtering Without Killing Performance

APIs often support filtering:
GET /products?category=electronics&price_min=100&price_max=500&in_stock=true
Each filter combination potentially requires a different index. With many filterable fields, index combinatorics explode.
Strategies:
Limit filter combinations: Don't allow arbitrary filtering. Define supported filter patterns and index for those.
Required filters: Require high-selectivity filters. "All products with price 100-500" scans too much; "Electronics with price 100-500" is reasonable.
Search endpoint instead: For complex filtering, use a search engine. Elasticsearch handles arbitrary filters efficiently.

Sorting: The Database Must Support It

GET /products?sort=price_asc
This requires:
sql
SELECT * FROM products ORDER BY price ASC;
Without an index on price, this is a full table scan + sort. Expensive.
Solutions:
  • Index commonly sorted columns
  • Limit allowed sort fields to indexed columns
  • Default sort matches an existing index

Rate Limiting: Protect the Database

Without rate limiting, one misbehaving client can overload your database.
Per-user limits: Each user gets N requests per minute.
Expensive query limits: Complex queries (multiple filters, large ranges) have lower limits than simple queries.
Backoff headers: When rate limited, return Retry-After header. Clients can implement exponential backoff.
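A per-user token bucket is the usual mechanism behind these limits (an in-process sketch; production systems keep the bucket state in Redis so all API servers share one view of each user's quota):

```python
import time

class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should return 429 with a Retry-After header

bucket = TokenBucket(rate=10, capacity=5)
results = [bucket.allow() for _ in range(7)]
print(results)  # burst of 5 allowed, then denied until tokens refill
```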

Idempotency: Safe Retries

Network failures happen. Clients retry. Without idempotency, retries cause duplicate effects.
POST /orders without idempotency: Two retries create two orders.
POST /orders with idempotency key:
POST /orders Idempotency-Key: abc-123-unique-key
Server stores the result keyed by idempotency key. If the same key is seen again, return the stored result instead of re-executing.
Database support:
sql
CREATE TABLE idempotency_keys (
  key        VARCHAR(100) PRIMARY KEY,
  result     JSONB,
  created_at TIMESTAMP,
  expires_at TIMESTAMP
);
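Wiring such a table into the request path looks like this (a sketch with a dict standing in for the table; `create_order` is a hypothetical handler with a side effect):

```python
import json

idempotency_keys = {}   # stand-in for the idempotency_keys table

def create_order(payload):
    # hypothetical business logic with a side effect we must not repeat
    create_order.calls += 1
    return {"order_id": create_order.calls, "items": payload["items"]}
create_order.calls = 0

def handle_post_order(idempotency_key, payload):
    # seen this key before? replay the stored result instead of re-executing
    if idempotency_key in idempotency_keys:
        return json.loads(idempotency_keys[idempotency_key])
    result = create_order(payload)
    idempotency_keys[idempotency_key] = json.dumps(result)
    return result

first = handle_post_order("abc-123", {"items": ["widget"]})
retry = handle_post_order("abc-123", {"items": ["widget"]})  # network retry
print(first == retry, create_order.calls)  # same response, one order created
```

In a real implementation the key insert and the business logic must share a transaction (or rely on the primary-key constraint) so that two concurrent retries can't both execute.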

Part VIII: Distributed System Considerations

Chapter 19: CAP and Data Modeling

The CAP theorem constrains distributed database design. Your data model must work within those constraints.

Consistency vs Availability: Real Choices

When a network partition occurs, you must choose:
  • CP: Refuse requests that can't be confirmed consistent. Some users get errors.
  • AP: Accept requests that might create inconsistency. All users get responses.
Data modeling implications:
For CP systems (banking, inventory):
  • Design for single-source-of-truth
  • Avoid denormalization that creates consistency challenges
  • Accept that some operations will fail during partitions
For AP systems (social media, product catalogs):
  • Design for eventual consistency
  • Embrace denormalization for performance
  • Build reconciliation mechanisms for conflicts

Eventual Consistency in Practice

"Eventually consistent" means:
  • Writes propagate asynchronously
  • Different clients may see different states
  • Given enough time without new writes, all replicas converge
Designing for eventual consistency:
Timestamps everywhere: Every record has created_at, updated_at. Conflict resolution can use timestamps.
Immutable events: Instead of mutable state, store immutable events. Current state is derived from events. Events are easier to replicate and reconcile.
Application-level resolution: The application knows which data can be safely merged and which requires manual resolution.

Idempotent Writes: The Distributed Must-Have

In distributed systems, exactly-once delivery is extremely difficult. At-least-once is achievable. This means writes might be delivered multiple times.
Writes must be idempotent: applying them multiple times has the same effect as applying once.
Non-idempotent:
sql
UPDATE accounts SET balance = balance + 100 WHERE id = 123;
If this executes twice, balance increases by 200.
Idempotent:
sql
INSERT INTO transactions (transaction_id, account_id, amount)
VALUES ('txn-abc-123', 123, 100)
ON CONFLICT (transaction_id) DO NOTHING;
The transaction_id is unique. Duplicate inserts are ignored. Balance derived from transactions is correct regardless of retries.

Schema Evolution: The Long Game

Distributed systems run many versions of code simultaneously during deployments. The database schema must support all versions.
Backward compatible changes:
  • Add nullable column
  • Add new table
  • Add index
Breaking changes (avoid during rolling deploys):
  • Remove column (old code still references it)
  • Rename column (old code uses old name)
  • Change column type (old code expects old type)
Safe migration pattern:
  1. Add new column (nullable)
  2. Deploy code that writes to both old and new columns
  3. Backfill new column from old column
  4. Deploy code that reads from new column
  5. Remove old column (after all old code is gone)
This takes multiple deployments but is safe.

Part IX: Interview Preparation

Chapter 20: Database Modeling Interview Questions

Let me provide questions that test deep understanding, not just trivia.

Conceptual Questions

Q1: A startup asks you to design their database. Their product is growing 10x per year, and they want to "use NoSQL because it scales better." How do you respond?
Strong answer: I'd first understand their actual requirements: data model, query patterns, consistency needs, current scale, and team expertise. "NoSQL scales better" is too simplistic—PostgreSQL handles more than most startups will ever need. NoSQL databases have specific strengths (Cassandra for write throughput, MongoDB for document flexibility) but also significant trade-offs (limited query capability, operational complexity). I'd likely recommend starting with PostgreSQL for its flexibility, then introducing specialized databases only when specific workloads demand them.
Q2: Explain why joins become problematic in distributed databases.
Strong answer: In a single-node database, joining tables requires reading from local disk—fast. In a distributed database, tables might live on different nodes. Joining requires network round trips: send keys to the other node, receive matching rows, combine results. Network latency is orders of magnitude higher than disk latency. Additionally, joins might require transferring large amounts of data between nodes. Distributed databases often denormalize to avoid joins, or they require careful schema design where joined data is co-located on the same node (by sharing partition keys).
Q3: When would you denormalize, and how do you manage the consistency implications?
Strong answer: I denormalize when read performance is critical, reads vastly outnumber writes, and eventual consistency is acceptable. For example, embedding author information in each blog post avoids joins on every read. To manage consistency: (1) Accept staleness for stable data like usernames; (2) Use event-driven updates where author changes publish events that update denormalized copies; (3) Implement TTL-based cache invalidation for time-bounded staleness; (4) Maintain reconciliation jobs that periodically verify consistency.

Scenario Questions

Q4: Design a schema for a ride-sharing application. Consider: riders, drivers, trips, payments, ratings.
Strong answer: I'd start with entity identification: Riders, Drivers, Vehicles, Trips, Payments, Ratings. For the core schema:
sql
riders(rider_id, name, email, phone, created_at)
drivers(driver_id, name, email, phone, vehicle_id, current_location, status, rating_avg)
vehicles(vehicle_id, driver_id, make, model, license_plate)
trips(trip_id, rider_id, driver_id, pickup_location, dropoff_location, status, started_at, completed_at, distance, fare)
payments(payment_id, trip_id, amount, method, status, processed_at)
ratings(rating_id, trip_id, rater_type, rating, comment, created_at)
For scale considerations: Trips will be huge, so partition by created_at. Driver location updates are very frequent—consider a separate real-time store (Redis) for current locations rather than updating the drivers table constantly. Ratings could be denormalized into drivers.rating_avg for fast access, updated asynchronously.
Q5: You're storing user activity events (page views, clicks, purchases). You have 100M daily active users generating 1000 events each per day. Design the storage.
Strong answer: That's 100B events per day, roughly 1.1M events per second sustained. This is beyond traditional RDBMS capability.
I'd use a tiered architecture:
  • Ingestion: Kafka for buffering and distribution
  • Real-time: Events streamed to Flink for real-time aggregates (metrics, counters)
  • Short-term: ClickHouse for last 7 days of events, supporting ad-hoc analysis
  • Long-term: S3 + Parquet for historical events, queryable via Athena/Presto
Schema in ClickHouse:
sql
CREATE TABLE events (
  user_id    UInt64,
  event_type String,
  event_data String,  -- JSON
  timestamp  DateTime
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (user_id, timestamp);
Partitioning by date enables efficient time-range queries and TTL-based deletion. Ordering by user_id + timestamp optimizes user-specific queries.
Q6: Your team uses PostgreSQL. The orders table has 500M rows and queries are slowing down. What do you do?
Strong answer: I'd investigate before changing anything. First, identify the slow queries using pg_stat_statements. Check if they're using indexes (EXPLAIN ANALYZE). Common issues:
  • Missing indexes on filtered columns: Add appropriate indexes
  • Full table scans: Maybe a sequential scan is actually optimal, or maybe statistics are stale (run ANALYZE)
  • Index bloat: REINDEX to rebuild fragmented indexes
  • Table bloat: VACUUM FULL to reclaim space (requires downtime) or pg_repack
If the data has time-series characteristics (orders by date), table partitioning by month would help:
sql
CREATE TABLE orders_2024_01 PARTITION OF orders FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
Old partitions can be archived; queries automatically prune to relevant partitions.
For read scaling, add read replicas and route read queries there.
Only if these don't suffice would I consider sharding or a different database.

Tradeoff Questions

Q7: Strong consistency vs eventual consistency—when do you choose each?
Strong answer: Strong consistency when correctness is more important than availability. Financial transactions, inventory management, anything where incorrect state causes significant harm. The cost is potential unavailability during failures and higher latency for coordination.
Eventual consistency when availability is paramount and temporary inconsistency is acceptable. Social media feeds, product recommendations, view counters. Users tolerate seeing slightly stale data; they don't tolerate the site being down.
The choice is business-driven: what's the cost of showing stale data vs the cost of showing an error page? For banking, an error is far better than showing a wrong balance. For Twitter, showing a post 2 seconds late is far better than showing an error.
Q8: SQL vs NoSQL—how do you decide?
Strong answer: I ask: What are the query patterns? SQL databases excel at flexible queries—if I don't know all queries upfront, SQL's query optimizer handles new patterns without schema changes. NoSQL databases require knowing access patterns upfront and designing the schema specifically for those patterns.
I also ask: What consistency do I need? If I need cross-entity transactions (transfer money between accounts), relational databases handle this naturally. NoSQL databases either don't support it or require careful design.
Finally: What's the scale? A single PostgreSQL instance handles more than most applications need. "We need NoSQL for scale" is often premature optimization. If I genuinely need millions of writes per second, then yes, I look at Cassandra or DynamoDB.
Most applications should start with PostgreSQL. It handles documents (JSONB), full-text search, and geospatial queries. Only when hitting specific limitations should you introduce specialized databases.
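The "PostgreSQL handles documents and search" claim is worth seeing concretely. A minimal sketch, using a hypothetical products table:

```sql
-- Document-style attributes in a relational table.
CREATE TABLE products (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    attrs jsonb,   -- flexible, schema-on-read attributes
    descr text
);

-- A GIN index makes JSONB containment queries fast.
CREATE INDEX products_attrs_idx ON products USING GIN (attrs);
SELECT * FROM products WHERE attrs @> '{"color": "red"}';

-- Built-in full-text search, no separate search cluster required.
SELECT * FROM products
WHERE to_tsvector('english', descr) @@ to_tsquery('wireless & charger');
```

Only when these built-ins hit their limits (relevance tuning, faceting, cross-document analytics) does a dedicated search or document store earn its operational cost.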

Red Flags Interviewers Watch For

  • "I always use [X]": Dogmatic preference for one database regardless of requirements
  • "We'll figure it out later": No thought for scale or evolution
  • Premature optimization: Sharding a database that has 1000 rows
  • Ignoring consistency: Not considering what happens during failures
  • Not questioning requirements: Accepting "we need real-time" without asking "what does real-time mean for this use case?"

How to Explain Tradeoffs Clearly

Use this structure:
  1. State the options
  2. Explain the tradeoffs of each
  3. State your recommendation
  4. Explain why given the context
Example: "We could use a normalized schema or denormalize the user information into posts. Normalized means single source of truth and easy updates, but requires joins on every read. Denormalized means fast reads but stale data when users update their profile. Given that this is a high-traffic feed where reads are 1000:1 to writes, and users rarely change their profile info, I'd denormalize and accept eventual consistency with a 5-minute cache TTL."
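The denormalized option in that example might look like the following sketch (table and column names are hypothetical): author fields are copied into each post at write time, and a background job or the cache TTL absorbs later profile edits.

```sql
-- Feed table with author info denormalized in; reads need no join.
CREATE TABLE posts (
    post_id       bigint PRIMARY KEY,
    author_id     bigint NOT NULL,
    author_name   text   NOT NULL,   -- copied from users.name
    author_avatar text,              -- copied from users.avatar_url
    body          text   NOT NULL,
    created_at    timestamptz NOT NULL DEFAULT now()
);

-- Building a feed is a single-table scan over an index:
SELECT author_name, author_avatar, body
FROM posts
WHERE author_id = ANY(ARRAY[7, 19, 23])   -- the followed users
ORDER BY created_at DESC
LIMIT 50;
```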

Part X: Final Section

Executive Summary

Data modeling is the foundation of system design. Poor models become permanent constraints; good models enable evolution and scaling.
For transactional systems, prioritize normalization and strong consistency. Accept join costs for data integrity. Use PostgreSQL or similar RDBMS. Scale with read replicas before considering distribution.
For read-heavy systems, embrace denormalization and eventual consistency. Optimize for common query patterns. Accept the consistency management overhead. Consider document databases or aggressive caching.
For analytical systems, use columnar storage with star schemas. Prioritize query performance over write efficiency. Pre-aggregate where possible. Use specialized OLAP databases.
For high-throughput event systems, design for append-only writes, time-based partitioning, and tiered storage. Accept that traditional databases won't scale; use specialized systems.
The choice between SQL and NoSQL is not ideology—it's workload fit. Know your query patterns, consistency requirements, and scale needs. Choose accordingly.

Decision-Making Framework

When designing a data model:
  1. What are the entities and relationships? Start with logical modeling. Don't think about databases yet.
  2. What are the query patterns? List the queries the system must support. Estimate their frequency and latency requirements.
  3. What are the consistency requirements? For each operation, what happens if data is stale or inconsistent?
  4. What is the scale? Current and projected. Events per second. Data volume. User concurrency.
  5. What are the operational constraints? Team expertise. Budget. Managed vs self-hosted.
Based on answers:
  • Relational + normalized for complex queries with consistency
  • Relational + denormalized for read-heavy with some complexity
  • Document DB for document-oriented data with known access patterns
  • Wide-column for massive write scale with simple queries
  • Columnar for analytical aggregations
  • Time-series for IoT/event streams

Database Selection Matrix

Requirement                        | Primary Choice      | Alternative
-----------------------------------|---------------------|------------------------------
General OLTP, flexible queries     | PostgreSQL          | MySQL
Document-oriented, known patterns  | MongoDB             | PostgreSQL (JSONB)
Massive write throughput           | Cassandra           | ScyllaDB, DynamoDB
Full-text search                   | Elasticsearch       | PostgreSQL FTS, MeiliSearch
Real-time analytics                | ClickHouse          | Apache Druid
Warehouse analytics                | Snowflake, BigQuery | Redshift
Caching, sessions                  | Redis               | Memcached
Graph relationships                | Neo4j               | PostgreSQL (recursive CTEs)
Time-series data                   | TimescaleDB         | InfluxDB
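The "PostgreSQL (recursive CTEs)" alternative for graphs deserves a sketch, since it's often dismissed. Assuming a hypothetical follows(follower_id, followee_id) edge table, a two-hop "friends of friends" traversal looks like:

```sql
-- Users reachable within 2 hops of user 42. UNION (not UNION ALL)
-- deduplicates rows, which also keeps cycles from looping forever
-- within the depth bound.
WITH RECURSIVE reachable AS (
    SELECT followee_id, 1 AS depth
    FROM follows
    WHERE follower_id = 42
  UNION
    SELECT f.followee_id, r.depth + 1
    FROM follows f
    JOIN reachable r ON f.follower_id = r.followee_id
    WHERE r.depth < 2
)
SELECT DISTINCT followee_id FROM reachable;
```

For bounded-depth traversals like this, plain PostgreSQL is often enough; a dedicated graph database pays off for deep, unbounded, or pattern-matching traversals.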

Scaling Mental Model

Start here and move right only when necessary:
Single PostgreSQL
  ↓ reads exceeding capacity
Add read replicas
  ↓ writes exceeding capacity
Vertical scaling (bigger machine)
  ↓ vertical limits reached
Table partitioning
  ↓ single-node limits reached
Application-level sharding
  ↓ operational burden too high
Managed distributed DB (CockroachDB, Spanner, etc.)
At each step, the complexity and operational burden increase significantly. Move right only when the current solution is genuinely insufficient.

30-Day Mastery Roadmap

Week 1: Foundations
  • Days 1-2: Normalization theory (1NF through BCNF) with banking schema exercise
  • Days 3-4: Denormalization patterns with social media schema exercise
  • Days 5-7: PostgreSQL deep dive—partitioning, indexing, EXPLAIN ANALYZE
Week 2: Specialized Systems
  • Days 8-9: Document databases (MongoDB data modeling)
  • Days 10-11: Wide-column databases (Cassandra data modeling)
  • Days 12-14: Analytical databases (ClickHouse, star/snowflake schemas)
Week 3: Scale & Distribution
  • Days 15-16: Replication and consistency models
  • Days 17-18: Sharding strategies and trade-offs
  • Days 19-21: Distributed transactions and sagas
Week 4: Integration & Practice
  • Days 22-23: API design and pagination
  • Days 24-25: Full system design exercises (e-commerce, social media)
  • Days 26-28: Interview practice with peers
  • Days 29-30: Review weak areas, consolidate notes
Daily Practice:
  • Read one engineering blog post about data at scale
  • Do one LeetCode database problem
  • Design one schema from scratch
Resources:
  • "Designing Data-Intensive Applications" by Martin Kleppmann
  • PostgreSQL documentation (seriously, it's excellent)
  • Database vendor engineering blogs (Netflix, Uber, Pinterest, Airbnb)

Closing Thoughts

I've been designing database systems for over two decades. I've made every mistake in this book—some of them expensively. The patterns I've described aren't academic theory; they're hard-won lessons from production systems serving millions of users.
The most important skill isn't knowing the right answer. It's knowing how to reason about trade-offs. Every design decision has consequences. The best architects understand those consequences deeply and make decisions appropriate for their specific context.
There is no perfect database or perfect schema. There are only appropriate choices for specific requirements. Your job is to understand those requirements deeply, evaluate the options honestly, and choose wisely.
May your queries be fast, your data be consistent (or eventually consistent, as appropriate), and your on-call rotations be peaceful.

"The database is not a detail. It is the foundation. Get it wrong, and you will pay for decades. Get it right, and your system will serve you faithfully as it scales beyond your wildest projections."