40 - Design Distributed Cache
Previous: 39 - Design Typeahead & Autocomplete | Next: 41 - Design Ticket Booking System
1. Problem Statement
Design a distributed in-memory caching system (think Redis Cluster or Memcached) that provides sub-millisecond reads, horizontal scalability, and high availability. The cache sits between application servers and databases to absorb read-heavy traffic.
2. Requirements
Functional
| Requirement | Detail |
|---|---|
| GET/SET/DELETE | Basic key-value operations |
| TTL support | Keys expire after configurable time-to-live |
| Eviction policies | LRU, LFU, TTL-based eviction when memory full |
| Atomic operations | Increment, compare-and-swap for counters |
| Bulk operations | Multi-get for batch reads |
Non-Functional
| Requirement | Target |
|---|---|
| Read latency (p99) | < 1ms |
| Write latency (p99) | < 5ms |
| Availability | 99.99% |
| Throughput | 1M+ ops/sec aggregate across the cluster |
| Data size | TB-scale across cluster |
3. Cache Cluster Architecture
+-------------------+   +-------------------+   +-------------------+
|    Application    |   |    Application    |   |    Application    |
|  Server + Client  |   |  Server + Client  |   |  Server + Client  |
+--------+----------+   +--------+----------+   +--------+----------+
         |                       |                       |
         +-----------+-----------+-----------+-----------+
                     |                       |
             +-------v-------+       +-------v-------+
             | Cache Node 1  |       | Cache Node 2  |
             |  (keys: a-m)  |       |  (keys: n-z)  |
             |    Primary    |       |    Primary    |
             +-------+-------+       +-------+-------+
                     |                       |
             +-------v-------+       +-------v-------+
             |  Replica 1a   |       |  Replica 2a   |
             +---------------+       +---------------+
             |  Replica 1b   |       |  Replica 2b   |
             +---------------+       +---------------+
Key Components
- Cache nodes -- Hold data in memory, handle GET/SET/DELETE
- Cache client (smart client) -- Runs in application process, routes requests to correct node
- Configuration service -- ZooKeeper/etcd tracks cluster membership and key ranges
- Replication -- Each primary has 1-2 replicas for fault tolerance
4. Consistent Hashing for Key Distribution
Consistent hashing minimizes key redistribution when nodes are added or removed.
                    Node A (0)
                   /          \
        key_hash=350           key_hash=42
                 /              \
   Node D (270) --- hash ring --- Node B (90)
                 \              /
        key_hash=210           key_hash=180
                   \          /
                    Node C (180)

Ring positions (clockwise): A=0, B=90, C=180, D=270
Key with hash=42 -> walk clockwise -> first node at position >= 42 is B (90) -> lands on Node B
Virtual Nodes
Each physical node maps to 100-200 virtual positions on the ring.
| Without vnodes | With vnodes (150/node) |
|---|---|
| Uneven key distribution | Near-uniform distribution |
| Adding a node shifts ~1/N keys, all taken from one neighbor | Adding a node shifts ~1/N keys, drawn evenly from all nodes |
| Hot spots possible | Hot spots rare |
Client-Side Routing
CacheClient:
    ring: ConsistentHashRing

    get(key):
        node = ring.getNode(hash(key))
        return node.get(key)

    set(key, value, ttl):
        node = ring.getNode(hash(key))
        node.set(key, value, ttl)
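The routing pseudocode above can be sketched as a runnable Python ring. The MD5-based hash and the 150-vnode default are illustrative choices, not prescribed by the text:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable 64-bit hash so every client process sees the same ring.
    return int(hashlib.md5(key.encode()).hexdigest()[:16], 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=150):
        # Each physical node is placed at `vnodes` virtual positions.
        self._ring = {}        # ring position -> physical node
        self._positions = []   # sorted ring positions
        for node in nodes:
            for i in range(vnodes):
                pos = _hash(f"{node}#{i}")
                self._ring[pos] = node
                bisect.insort(self._positions, pos)

    def get_node(self, key: str) -> str:
        # Walk clockwise: first position >= hash(key), wrapping at the end.
        pos = _hash(key)
        idx = bisect.bisect_right(self._positions, pos)
        if idx == len(self._positions):
            idx = 0
        return self._ring[self._positions[idx]]
```

The same key always routes to the same node, and adding a fourth node to a three-node ring moves only roughly a quarter of the keys, spread evenly across the old nodes.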
5. Eviction Policies
When a node's memory is full, it must evict entries to make room.
| Policy | How It Works | Best For | Overhead |
|---|---|---|---|
| LRU | Evict least recently used | General workloads | O(1) with doubly-linked list + hashmap |
| LFU | Evict least frequently used | Skewed access patterns | O(log n) or O(1) with frequency buckets |
| TTL-based | Evict expired keys first | Time-sensitive data | O(1) with expiry heap |
| Random | Evict random keys | Uniform access | O(1), no tracking overhead |
| LRU + TTL | LRU among non-expired, TTL-expired evicted first | Most production systems | Combined overhead |
LRU Implementation (Interview Favorite)
HashMap<Key, DoublyLinkedListNode>
DoublyLinkedList (most recent at head, least recent at tail)

GET(key):
    node = map[key]
    move node to head of list        # O(1)
    return node.value

SET(key, value):
    if key in map:
        update node, move to head    # O(1)
    else:
        if at capacity:
            evict tail node          # O(1)
            remove from map          # O(1)
        insert new node at head      # O(1)
        add to map                   # O(1)
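In Python, the hashmap + doubly-linked-list combination is exactly what `collections.OrderedDict` provides, so a compact working sketch is:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()  # insertion order = recency (MRU at the end)

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)         # O(1): mark as most recently used
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)     # refresh recency on update
        elif len(self._data) >= self.capacity:
            self._data.popitem(last=False)  # O(1): evict least recently used
        self._data[key] = value
```

In an interview, it is worth saying explicitly that `move_to_end` and `popitem` are the O(1) list splices from the pseudocode above.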
6. Data Replication
| Strategy | Mechanism | Consistency | Latency Impact |
|---|---|---|---|
| Synchronous | Primary waits for replica ACK before responding | Strong | +1-2ms per write |
| Asynchronous | Primary responds immediately, replicates in background | Eventual | No impact |
| Semi-synchronous | Wait for 1 of N replicas (quorum) | Tunable | +0.5-1ms |
Replication Flow
Client SET("user:123", data)
            |
            v
      Primary Node
            |--- write to local memory
            |--- async replicate to Replica A
            |--- async replicate to Replica B
            |--- respond OK to client (async mode)
Interview Tip
For a cache, async replication is usually acceptable. A brief inconsistency window (milliseconds) is tolerable because the cache is not the source of truth -- the database is.
7. Write Policies
How the cache interacts with the backing database on writes.
+--------------------------------------------------+
|                  Write Policies                  |
+--------------------------------------------------+

Cache-Aside (Lazy Loading):
    App reads:  check cache -> miss -> read DB -> populate cache
    App writes: write DB -> invalidate cache

Write-Through:
    App writes: write cache -> cache writes DB (sync)
    Reads always hit cache (warm)

Write-Behind (Write-Back):
    App writes: write cache -> cache writes DB (async, batched)
    Risk: data loss if cache node dies before flush
| Policy | Read Perf | Write Perf | Consistency | Data Loss Risk |
|---|---|---|---|---|
| Cache-aside | Miss penalty on first read | Fast (skip cache) | Good (invalidation) | None |
| Write-through | Always warm | Slower (sync DB write) | Strong | None |
| Write-behind | Always warm | Fast (async) | Eventual | Yes (node crash) |
Interview Tip
Cache-aside is the most common in practice. Mention that write-through adds latency but guarantees cache freshness, while write-behind risks data loss.
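A minimal cache-aside sketch in Python; the `cache` and `db` objects and their method names are hypothetical stand-ins for whatever client libraries the application uses:

```python
def read(key, cache, db, ttl=300):
    # Cache-aside read: check cache first, fall back to DB, then populate.
    value = cache.get(key)
    if value is None:                # cache miss
        value = db.read(key)
        if value is not None:
            cache.set(key, value, ttl)
    return value

def write(key, value, cache, db):
    # Cache-aside write: update the source of truth, then *invalidate*
    # (not update) the cache so the next read repopulates it fresh.
    db.write(key, value)
    cache.delete(key)
```

Invalidating rather than updating on write avoids a race where two concurrent writers leave the cache holding the older value.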
8. Handling Node Failures
Scenario: Node B dies

Before failure:        After failure + rehash:
A: keys 0-89           A: keys 0-89 + keys 90-179 (absorbed B's range)
B: keys 90-179         C: keys 180-269 (unchanged)
C: keys 180-269        D: keys 270-359 (unchanged)
D: keys 270-359

With replicas:
    B's replica promoted to primary
    New replica created from promoted primary
    Minimal disruption
Failure Detection
- Heartbeat -- Nodes ping each other periodically
- Gossip protocol -- Nodes share membership state (see 44 - Gossip Protocols & Membership)
- External monitor -- ZooKeeper/etcd detects failures, triggers failover
Recovery Steps
- Detect failure (heartbeat timeout)
- Promote replica to primary
- Update hash ring in configuration service
- Clients refresh ring from config service
- Spawn new replica for promoted node
- Cache warming: preload hot keys (optional)
9. Full System Diagram
                   +---------------------+
                   |     Application     |
                   |       Servers       |
                   +----------+----------+
                              |
                   +----------v----------+
                   | Smart Cache Client  |
                   |  (consistent hash   |
                   |   ring, routing,    |
                   |   connection pool)  |
                   +----------+----------+
                              |
         +--------------------+--------------------+
         |                    |                    |
+--------v-------+   +--------v-------+   +--------v-------+
|  Cache Node A  |   |  Cache Node B  |   |  Cache Node C  |
|   (Primary)    |   |   (Primary)    |   |   (Primary)    |
|  Memory: 64GB  |   |  Memory: 64GB  |   |  Memory: 64GB  |
+--------+-------+   +--------+-------+   +--------+-------+
         |                    |                    |
+--------v-------+   +--------v-------+   +--------v-------+
|   Replica A1   |   |   Replica B1   |   |   Replica C1   |
+----------------+   +----------------+   +----------------+

+--------------------+
|   Config Service   |
|  (ZooKeeper/etcd)  |
|  - ring membership |
|  - node health     |
|  - shard mapping   |
+--------------------+

+--------------------+
|     Monitoring     |
|  - hit/miss ratio  |
|  - latency p50/p99 |
|  - memory usage    |
|  - eviction rate   |
+--------------------+
10. Memory Management
Slab Allocation (Memcached Approach)
Memory divided into slab classes by size:

Slab Class 1:  64B chunks  [████][████][████][    ]
Slab Class 2: 128B chunks  [████████][████████][        ]
Slab Class 3: 256B chunks  [████████████████][                ]
Slab Class 4: 512B chunks  [████████████████████████████████]

Key-value pair -> pick smallest slab class that fits
| Approach | Pros | Cons |
|---|---|---|
| Slab allocator | No fragmentation, predictable | Internal waste (padding) |
| malloc/free | No padding waste | Fragmentation over time |
| jemalloc (Redis) | Low fragmentation, good throughput | Complex tuning |
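The "pick the smallest slab class that fits" rule is a simple search over a geometric progression of chunk sizes. The 1.25 growth factor mirrors Memcached's default, but the exact class sizes here are illustrative:

```python
def slab_classes(min_chunk=64, max_chunk=1024 * 1024, growth=1.25):
    # Memcached-style geometric progression of chunk sizes.
    sizes, size = [], min_chunk
    while size <= max_chunk:
        sizes.append(size)
        size = int(size * growth)
    return sizes

def pick_slab(item_size, sizes):
    # Smallest chunk that fits; the leftover space is the internal
    # padding waste the table above refers to.
    for s in sizes:
        if s >= item_size:
            return s
    raise ValueError("item too large for any slab class")
```

With these defaults, a 550-byte key-value pair (the size used in the capacity estimate later) lands in a 590-byte chunk, wasting 40 bytes of padding.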
11. Cache Warming
When a new node joins or a node restarts, its cache is cold.
| Strategy | How | Trade-off |
|---|---|---|
| Passive warming | Cache fills as requests naturally arrive | Spike in DB load during warm-up |
| Active warming | Preload hot keys from another replica or backup | Faster warm-up, needs hot key list |
| Snapshot restore | Load RDB/AOF snapshot from disk | Fast restore, slightly stale data |
| Shadow traffic | Mirror production reads to new node before cutover | Smooth transition, extra resources |
12. Monitoring & Observability
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Hit ratio | Cache effectiveness | < 80% |
| Latency p50/p99 | Performance SLA | p99 > 5ms |
| Eviction rate | Memory pressure | > 1000/sec |
| Memory usage | Capacity planning | > 85% |
| Connection count | Client load | > 80% of max |
| Replication lag | Data freshness on replicas | > 100ms |
13. Comparison: Redis Cluster vs Memcached
| Feature | Redis Cluster | Memcached |
|---|---|---|
| Data structures | Strings, lists, sets, sorted sets, hashes, streams | Strings only |
| Persistence | RDB snapshots + AOF | None |
| Replication | Built-in primary-replica | None (client-side) |
| Partitioning | Hash slots (16384 slots) | Client-side consistent hashing |
| Eviction | LRU, LFU, volatile-lru, volatile-ttl, etc. | LRU only |
| Pub/Sub | Yes | No |
| Lua scripting | Yes | No |
| Multi-threaded | I/O threads (Redis 6+) | Yes (multi-threaded) |
| Memory efficiency | Higher overhead per key | Lower overhead (slab allocator) |
When to Choose Which
- Redis -- Need data structures beyond strings, persistence, or Pub/Sub
- Memcached -- Pure key-value caching, maximum simplicity, multi-threaded perf
14. Advanced: Thundering Herd Prevention
When a popular cached key expires, many concurrent requests flood the database simultaneously.
Solutions:

1. Locking: first request acquires lock, others wait
       Thread 1:   cache miss -> acquire lock -> query DB -> populate cache -> release lock
       Thread 2-N: cache miss -> wait for lock -> read from cache (now populated)

2. Early expiration: refresh cache before TTL expires
       Actual TTL = 60s, logical TTL = 50s
       First request after 50s triggers background refresh

3. Stale-while-revalidate: serve stale data while refreshing
       Return expired value immediately, trigger async refresh
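The locking approach (solution 1) can be sketched in single-process form with a per-key lock. In a real distributed deployment the lock would be a cache-level primitive such as a set-if-not-exists key, which is an assumption beyond this sketch:

```python
import threading
from collections import defaultdict

_locks = defaultdict(threading.Lock)   # one lock per key (single-process sketch)

def get_or_load(key, cache, load_from_db, ttl=60):
    value = cache.get(key)
    if value is not None:
        return value                   # fast path: cache hit
    with _locks[key]:                  # only one caller rebuilds the entry
        value = cache.get(key)         # re-check: another thread may have filled it
        if value is None:
            value = load_from_db(key)  # a single DB hit instead of a stampede
            cache.set(key, value, ttl)
    return value
```

The re-check inside the lock is the essential detail: without it, every waiting thread would still hit the database one after another once the lock is released.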
15. Capacity Estimation
Assumptions:
- 100M unique keys
- Average key: 50 bytes, average value: 500 bytes
- Total data: 100M * 550B = 55GB
- Replication factor: 2 (primary + 1 replica)
- Total memory: 55GB * 2 = 110GB
- Nodes with 64GB RAM (50GB usable for cache): ceil(110/50) = 3 nodes
Throughput:
- 1M ops/sec target
- Single Redis node: ~100K ops/sec
- Need 10 primary nodes for throughput (even if memory fits in 3)
- With replicas reading: 10 primaries + 10 replicas = 20 nodes
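The arithmetic above, as a quick sanity check using the same assumptions (100M keys, 550B per entry, replication factor 2, 50GB usable per node, 100K ops/sec per Redis node):

```python
import math

keys = 100_000_000
bytes_per_entry = 50 + 500                   # key + value
data_gb = keys * bytes_per_entry / 1e9       # total working set
total_gb = data_gb * 2                       # replication factor 2
nodes_by_memory = math.ceil(total_gb / 50)   # 50 GB usable per 64 GB node

ops_target = 1_000_000
ops_per_node = 100_000
nodes_by_throughput = math.ceil(ops_target / ops_per_node)

print(data_gb, nodes_by_memory, nodes_by_throughput)  # 55.0 3 10
```

Throughput, not memory, is the binding constraint here, which is why the design ends up with 10 primaries rather than 3.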
16. Key Trade-offs Discussion
| Decision | Option A | Option B |
|---|---|---|
| Routing | Smart client (low latency) | Proxy-based (simpler clients, extra hop) |
| Replication | Sync (strong consistency) | Async (lower latency, risk of stale reads) |
| Eviction | LRU (simple, general) | LFU (better for skewed workloads) |
| Write policy | Cache-aside (simple) | Write-through (always warm cache) |
| Persistence | None (pure cache) | RDB/AOF (faster recovery) |
| Failure handling | Rehash to other nodes | Promote replica (less disruption) |
17. Interview Checklist
- Clarified: read/write ratio, data size, latency requirements
- Explained consistent hashing with virtual nodes
- Described eviction policies (LRU implementation detail is a bonus)
- Covered replication (async for cache is fine) and failover
- Discussed write policies (cache-aside vs write-through vs write-behind)
- Addressed thundering herd problem
- Monitoring: hit ratio, latency percentiles, eviction rate
- Capacity estimation with node count
- Compared Redis Cluster vs Memcached
18. Resources
- System Design Interview (Alex Xu, Vol 1) -- Chapter on Distributed Cache
- Redis Documentation -- redis.io/docs (cluster, persistence, eviction)
- Memcached Wiki -- memcached.org
- Paper: "Scaling Memcache at Facebook" (NSDI 2013)
- YouTube: System Design Interview -- Design Distributed Cache
- YouTube: Martin Kleppmann -- Caching Strategies