System Design Part 5: Caching - The Complete Guide
Introduction
Caching is often called the "silver bullet" of system design - a simple concept that dramatically improves performance. But like any powerful tool, misusing it leads to subtle bugs, stale data, and eventually, 3 AM production incidents. Let's master caching from first principles.
1. Designing a Multi-Level Cache for a Product Catalog
```
Normal Load (1K req/s):
├── L1 hit rate: 80%
├── L2 hit rate: 95% (of L1 misses)
├── DB queries: ~10/sec
└── Avg latency: 2ms

Peak Load (10K req/s):
├── L1 hit rate: 85% (hot items stay cached)
├── L2 hit rate: 98% (Redis handles more)
├── DB queries: ~30/sec
└── Avg latency: 5ms

Extreme Load (100K req/s):
├── L1 hit rate: 90% (increase L1 size)
├── L2 hit rate: 99%
├── DB queries: ~100/sec (consider read replica)
└── Avg latency: 3ms (L1 dominates)
```
2. Cache Invalidation Strategies
The Two Hardest Problems in Computer Science
"There are only two hard things in Computer Science: cache invalidation and naming things."
— Phil Karlton
Strategy 1: Time-Based Expiration (TTL)
```go
// Simple, but risks serving stale data.
func (c *Cache) SetWithTTL(key string, value interface{}, ttl time.Duration) {
    c.client.Set(ctx, key, value, ttl)
}

// Trade-off: a lower TTL means fresher data but higher DB load;
// a higher TTL means staler data but better performance.
```
Strategy 2: Versioned Caching
```go
type VersionedCache struct {
    cache *L2Cache
    db    ProductRepository
}

type VersionedProduct struct {
    Product
    Version int64 `json:"version"`
}

func (vc *VersionedCache) Get(ctx context.Context, id string) (*Product, error) {
    cached, err := vc.cache.GetVersioned(ctx, id)
    if err != nil {
        return nil, err
    }

    // Always check the current version against the database.
    currentVersion, err := vc.db.GetVersion(ctx, id)
    if err != nil {
        return nil, err
    }

    if cached != nil && cached.Version == currentVersion {
        return &cached.Product, nil
    }

    // Version mismatch or cache miss - fetch fresh.
    product, err := vc.db.GetByID(ctx, id)
    if err != nil {
        return nil, err
    }

    // Cache with the current version.
    vc.cache.SetVersioned(ctx, &VersionedProduct{
        Product: *product,
        Version: currentVersion,
    })
    return product, nil
}
```
Invalidation Decision Matrix
| Strategy | Data Freshness | Complexity | Use When |
|---|---|---|---|
| TTL | Minutes stale | Simple | Read-heavy, tolerance for staleness |
| Event-based | Seconds stale | Medium | Real-time requirements |
| Write-through | Always fresh | Medium | Consistency critical |
| Write-behind | Seconds stale | Complex | High write throughput |
| Versioned | Always fresh | Medium | Mixed read/write |
3. Cache Stampede Prevention
The Stampede Problem
```mermaid
sequenceDiagram
    participant R1 as Request 1
    participant R2 as Request 2
    participant R3 as Request 3
    participant Cache
    participant DB
    Note over Cache: Cache expires
    R1->>Cache: GET product:123
    Cache-->>R1: MISS
    R2->>Cache: GET product:123
    Cache-->>R2: MISS
    R3->>Cache: GET product:123
    Cache-->>R3: MISS
    R1->>DB: SELECT * FROM products...
    R2->>DB: SELECT * FROM products...
    R3->>DB: SELECT * FROM products...
    Note over DB: Database overwhelmed!
```
Solution 1: Locking (Single-Flight)
```go
package stampede

import (
    "context"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
    "golang.org/x/sync/singleflight"
)

type StampedeProtectedCache struct {
    cache *L2Cache
    db    ProductRepository
    group singleflight.Group
}

func (c *StampedeProtectedCache) Get(ctx context.Context, id string) (*Product, error) {
    // Try the cache first.
    if product, err := c.cache.Get(ctx, id); err == nil && product != nil {
        return product, nil
    }

    // Use singleflight so concurrent misses share a single DB call.
    result, err, _ := c.group.Do(id, func() (interface{}, error) {
        // Double-check the cache (another request might have populated it).
        if product, err := c.cache.Get(ctx, id); err == nil && product != nil {
            return product, nil
        }

        // Fetch from the database.
        product, err := c.db.GetByID(ctx, id)
        if err != nil {
            return nil, err
        }

        // Populate the cache.
        c.cache.Set(ctx, product)
        return product, nil
    })
    if err != nil {
        return nil, err
    }
    return result.(*Product), nil
}

// Distributed locking for multi-instance deployments.
type DistributedStampedeCache struct {
    cache   *L2Cache
    db      ProductRepository
    redis   *redis.Client
    lockTTL time.Duration
}

func (c *DistributedStampedeCache) Get(ctx context.Context, id string) (*Product, error) {
    // Try the cache.
    if product, err := c.cache.Get(ctx, id); err == nil && product != nil {
        return product, nil
    }

    // Try to acquire a distributed lock.
    lockKey := fmt.Sprintf("lock:product:%s", id)
    acquired, err := c.redis.SetNX(ctx, lockKey, "1", c.lockTTL).Result()
    if err != nil {
        return nil, err
    }

    if acquired {
        // We got the lock - fetch from the DB.
        defer c.redis.Del(ctx, lockKey)
        product, err := c.db.GetByID(ctx, id)
        if err != nil {
            return nil, err
        }
        c.cache.Set(ctx, product)
        return product, nil
    }

    // Another instance is fetching - wait and retry the cache.
    for i := 0; i < 10; i++ {
        time.Sleep(50 * time.Millisecond)
        if product, err := c.cache.Get(ctx, id); err == nil && product != nil {
            return product, nil
        }
    }

    // Timeout - fetch ourselves.
    return c.db.GetByID(ctx, id)
}
```
Solution 2: Probabilistic Early Expiration
```go
// XFetch algorithm - probabilistically refresh before expiration.
type XFetchCache struct {
    cache *L2Cache
    db    ProductRepository
    beta  float64 // Typically 1.0
}

type CachedValue struct {
    Product   *Product
    Delta     time.Duration // Time it took to compute the value
    ExpiresAt time.Time
}

func (c *XFetchCache) Get(ctx context.Context, id string) (*Product, error) {
    cached, err := c.cache.GetWithMeta(ctx, id)
    if err != nil {
        return nil, err
    }

    if cached != nil {
        // XFetch formula: should we refresh early?
        now := time.Now()
        ttl := cached.ExpiresAt.Sub(now)

        // gap = delta * beta * -log(random)
        gap := time.Duration(float64(cached.Delta) * c.beta * (-math.Log(rand.Float64())))
        if ttl-gap <= 0 {
            // Probabilistically refresh in the background.
            go c.refresh(context.Background(), id)
        }
        return cached.Product, nil
    }

    return c.refresh(ctx, id)
}

func (c *XFetchCache) refresh(ctx context.Context, id string) (*Product, error) {
    start := time.Now()
    product, err := c.db.GetByID(ctx, id)
    if err != nil {
        return nil, err
    }
    delta := time.Since(start)

    c.cache.SetWithMeta(ctx, &CachedValue{
        Product:   product,
        Delta:     delta,
        ExpiresAt: time.Now().Add(c.cache.ttl),
    })
    return product, nil
}
```
Solution 3: Background Refresh
```go
type BackgroundRefreshCache struct {
    cache        *L2Cache
    db           ProductRepository
    refreshQueue chan string
    softTTL      time.Duration // When to start a background refresh
    hardTTL      time.Duration // When the cache entry actually expires
}

func (c *BackgroundRefreshCache) Get(ctx context.Context, id string) (*Product, error) {
    cached, meta, err := c.cache.GetWithTTL(ctx, id)
    if err != nil {
        return nil, err
    }

    if cached != nil {
        // Check whether we should trigger a background refresh.
        if meta.TTL < c.hardTTL-c.softTTL {
            select {
            case c.refreshQueue <- id:
            default:
                // Queue full, skip this refresh.
            }
        }
        return cached, nil
    }

    // Cache miss - synchronous fetch.
    product, err := c.db.GetByID(ctx, id)
    if err != nil {
        return nil, err
    }
    c.cache.Set(ctx, product) // Uses hardTTL
    return product, nil
}

func (c *BackgroundRefreshCache) refreshWorker(ctx context.Context) {
    for {
        select {
        case <-ctx.Done():
            return
        case id := <-c.refreshQueue:
            product, err := c.db.GetByID(ctx, id)
            if err != nil {
                log.Printf("Background refresh failed for %s: %v", id, err)
                continue
            }
            c.cache.Set(ctx, product)
        }
    }
}
```
4. CDN Caching
```go
package cdn

import (
    "fmt"
    "net/http"
    "strings"
    "time"
)

type CacheConfig struct {
    Public         bool
    Private        bool
    MaxAge         time.Duration
    SMaxAge        time.Duration // CDN-specific max age (s-maxage)
    MustRevalidate bool
    NoCache        bool
    NoStore        bool
    Immutable      bool
}

func SetCacheHeaders(w http.ResponseWriter, cfg CacheConfig) {
    var directives []string

    if cfg.NoStore {
        directives = append(directives, "no-store")
    } else if cfg.NoCache {
        directives = append(directives, "no-cache")
    } else {
        if cfg.Public {
            directives = append(directives, "public")
        }
        if cfg.Private {
            directives = append(directives, "private")
        }
        if cfg.MaxAge > 0 {
            directives = append(directives, fmt.Sprintf("max-age=%d", int(cfg.MaxAge.Seconds())))
        }
        if cfg.SMaxAge > 0 {
            directives = append(directives, fmt.Sprintf("s-maxage=%d", int(cfg.SMaxAge.Seconds())))
        }
        if cfg.MustRevalidate {
            directives = append(directives, "must-revalidate")
        }
        if cfg.Immutable {
            directives = append(directives, "immutable")
        }
    }

    w.Header().Set("Cache-Control", strings.Join(directives, ", "))
}

// Middleware for different content types.
func CacheMiddleware(contentType string) func(http.Handler) http.Handler {
    var config CacheConfig

    switch contentType {
    case "product-list":
        // Cacheable at the CDN for 1 minute, in the browser for 30 seconds.
        config = CacheConfig{Public: true, MaxAge: 30 * time.Second, SMaxAge: 60 * time.Second}
    case "product-detail":
        // Longer cache for individual products.
        config = CacheConfig{Public: true, MaxAge: 60 * time.Second, SMaxAge: 300 * time.Second}
    case "user-specific":
        // Never cache user-specific data at the CDN.
        config = CacheConfig{Private: true, MaxAge: 60 * time.Second}
    case "static-asset":
        // Immutable with a long cache lifetime.
        config = CacheConfig{Public: true, MaxAge: 365 * 24 * time.Hour, Immutable: true}
    case "sensitive":
        // Never cache.
        config = CacheConfig{NoStore: true}
    }

    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            SetCacheHeaders(w, config)
            next.ServeHTTP(w, r)
        })
    }
}
```
Cache Key Design for APIs
```go
type CacheKeyBuilder struct {
    baseKey string
    params  map[string]string
}

func NewCacheKey(path string) *CacheKeyBuilder {
    return &CacheKeyBuilder{
        baseKey: path,
        params:  make(map[string]string),
    }
}

func (b *CacheKeyBuilder) WithParam(key, value string) *CacheKeyBuilder {
    b.params[key] = value
    return b
}

func (b *CacheKeyBuilder) WithVersion(v string) *CacheKeyBuilder {
    return b.WithParam("v", v)
}

func (b *CacheKeyBuilder) WithLocale(locale string) *CacheKeyBuilder {
    return b.WithParam("locale", locale)
}

func (b *CacheKeyBuilder) Build() string {
    if len(b.params) == 0 {
        return b.baseKey
    }

    // Sort params for consistent keys.
    keys := make([]string, 0, len(b.params))
    for k := range b.params {
        keys = append(keys, k)
    }
    sort.Strings(keys)

    var parts []string
    for _, k := range keys {
        parts = append(parts, fmt.Sprintf("%s=%s", k, b.params[k]))
    }
    return b.baseKey + "?" + strings.Join(parts, "&")
}

// Vary header for CDN key variation.
func SetVaryHeaders(w http.ResponseWriter, headers ...string) {
    w.Header().Set("Vary", strings.Join(headers, ", "))
}

// Example API handler.
func ProductHandler(w http.ResponseWriter, r *http.Request) {
    product := getProduct(chi.URLParam(r, "id")) // load the product for this request

    // Vary the cache key by Accept-Language and Accept-Encoding.
    SetVaryHeaders(w, "Accept-Language", "Accept-Encoding")

    // Set cache headers.
    SetCacheHeaders(w, CacheConfig{
        Public:  true,
        MaxAge:  60 * time.Second,
        SMaxAge: 300 * time.Second,
    })

    // Add an ETag for conditional requests.
    etag := generateETag(product)
    w.Header().Set("ETag", etag)

    // Honor If-None-Match.
    if r.Header.Get("If-None-Match") == etag {
        w.WriteHeader(http.StatusNotModified)
        return
    }

    // Return the product.
    json.NewEncoder(w).Encode(product)
}
```
CDN Cache Purging
```go
type CDNPurger struct {
    cloudflareClient *cloudflare.API
    fastlyClient     *fastly.Client
}

// Purge specific URLs.
func (p *CDNPurger) PurgeURLs(ctx context.Context, urls []string) error {
    // Cloudflare
    _, err := p.cloudflareClient.PurgeCache(ctx, zoneID, cloudflare.PurgeCacheRequest{
        Files: urls,
    })
    if err != nil {
        return fmt.Errorf("cloudflare purge failed: %w", err)
    }
    return nil
}

// Purge by cache tag (more efficient).
func (p *CDNPurger) PurgeByTag(ctx context.Context, tags []string) error {
    // Cloudflare cache tags
    _, err := p.cloudflareClient.PurgeCache(ctx, zoneID, cloudflare.PurgeCacheRequest{
        Tags: tags,
    })
    return err
}

// Purge everything (use sparingly!).
func (p *CDNPurger) PurgeAll(ctx context.Context) error {
    _, err := p.cloudflareClient.PurgeCache(ctx, zoneID, cloudflare.PurgeCacheRequest{
        Everything: true,
    })
    return err
}

// Usage with cache tags.
func ProductHandler(w http.ResponseWriter, r *http.Request) {
    productID := chi.URLParam(r, "id")
    product := getProduct(productID)

    // Set cache tags for targeted purging.
    w.Header().Set("Cache-Tag", strings.Join([]string{
        fmt.Sprintf("product:%s", productID),
        fmt.Sprintf("category:%s", product.CategoryID),
        "products",
    }, ","))

    // ... rest of handler
}

// When a product updates, purge by tag.
func OnProductUpdate(productID, categoryID string) {
    purger.PurgeByTag(ctx, []string{
        fmt.Sprintf("product:%s", productID),
    })
}

// When a category updates, purge all products in that category.
func OnCategoryUpdate(categoryID string) {
    purger.PurgeByTag(ctx, []string{
        fmt.Sprintf("category:%s", categoryID),
    })
}
```
Summary: Caching Best Practices
Quick Reference
What to Cache:
✅ Static content (images, CSS, JS)
✅ Database query results
✅ Computed/aggregated data
✅ Session data
✅ API responses (with care)
What NOT to Cache:
❌ User-specific sensitive data at CDN
❌ Real-time data (stock prices, live scores)
❌ Frequently changing data with low TTL
❌ Large objects that exceed memory
Cache Hierarchy Decision
| Layer | Latency | Size | Use Case |
|---|---|---|---|
| L1 (Process) | ~1μs | 100MB-1GB | Hot data, computed values |
| L2 (Redis) | ~1ms | 10GB-100GB | Sessions, frequent queries |
| L3 (CDN) | ~10ms | Unlimited | Static assets, API responses |
| Database | ~50ms | Unlimited | Source of truth |
Key Metrics to Monitor
| Metric | Target | Action if Breached |
|---|---|---|
| Hit Rate | >90% | Increase cache size, review TTL |
| Eviction Rate | <5% | Increase memory allocation |
| Memory Usage | <80% | Plan capacity increase |
| Latency p99 | <5ms | Check Redis cluster health |
| Stampede Events | 0 | Implement singleflight/locking |
This guide covers the essential caching concepts for system design interviews. Remember: caching is a trade-off between freshness and performance. Always start with the simplest solution that meets your consistency requirements.