40 - Design Distributed Cache
Previous: 39 - Design Typeahead & Autocomplete | Next: 41 - Design Ticket Booking System
1. Problem Statement
Design a distributed in-memory caching system (think Redis Cluster or Memcached) that provides sub-millisecond reads, horizontal scalability, and high availability. The cache sits between application servers and databases to absorb read-heavy traffic.
2. Requirements
Functional
| Requirement | Detail |
|---|---|
| GET/SET/DELETE | Basic key-value operations |
| TTL support | Keys expire after configurable time-to-live |
| Eviction policies | LRU, LFU, TTL-based eviction when memory full |
| Atomic operations | Increment, compare-and-swap for counters |
| Bulk operations | Multi-get for batch reads |
Non-Functional
| Requirement | Target |
|---|---|
| Read latency (p99) | < 1ms |
| Write latency (p99) | < 5ms |
| Availability | 99.99% |
| Throughput | 1M+ ops/sec aggregate across the cluster |
| Data size | TB-scale across cluster |
3. Cache Cluster Architecture
+-------------------+   +-------------------+   +-------------------+
|    Application    |   |    Application    |   |    Application    |
|  Server + Client  |   |  Server + Client  |   |  Server + Client  |
+--------+----------+   +--------+----------+   +--------+----------+
         |                       |                       |
         +-----------+-----------+-----------+-----------+
                     |                       |
             +-------v-------+       +-------v-------+
             | Cache Node 1  |       | Cache Node 2  |
             |  (keys: a-m)  |       |  (keys: n-z)  |
             |    Primary    |       |    Primary    |
             +-------+-------+       +-------+-------+
                     |                       |
             +-------v-------+       +-------v-------+
             |  Replica 1a   |       |  Replica 2a   |
             +---------------+       +---------------+
             |  Replica 1b   |       |  Replica 2b   |
             +---------------+       +---------------+
Key Components
- Cache nodes -- Hold data in memory, handle GET/SET/DELETE
- Cache client (smart client) -- Runs in application process, routes requests to correct node
- Configuration service -- ZooKeeper/etcd tracks cluster membership and key ranges
- Replication -- Each primary has 1-2 replicas for fault tolerance
4. Consistent Hashing for Key Distribution
Consistent hashing minimizes key redistribution when nodes are added or removed.
                    Node A (0)
                   /          \
        key_hash=350           key_hash=42
                 /              \
   Node D (270) --- hash ring --- Node B (90)
                 \              /
        key_hash=210           key_hash=180
                   \          /
                    Node C (180)

Ring positions (clockwise): A=0, B=90, C=180, D=270
Key with hash=42 -> walk clockwise -> first node at position >= 42 is B (90) -> lands on Node B
Virtual Nodes
Each physical node maps to 100-200 virtual positions on the ring.
| Without vnodes | With vnodes (150/node) |
|---|---|
| Uneven key distribution | Near-uniform distribution |
| Adding a node shifts ~1/N keys, all taken from one neighbor | Adding a node shifts ~1/N keys, drawn evenly from all nodes |
| Hot spots possible | Hot spots rare |
Client-Side Routing
CacheClient:
    ring: ConsistentHashRing

    get(key):
        node = ring.getNode(hash(key))
        return node.get(key)

    set(key, value, ttl):
        node = ring.getNode(hash(key))
        node.set(key, value, ttl)
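The routing pseudocode above can be sketched as a runnable Python ring. The MD5-based hash and the 150-vnode default are illustrative choices, not prescribed by the text:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable 64-bit hash so every client process sees the same ring.
    return int(hashlib.md5(key.encode()).hexdigest()[:16], 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=150):
        # Each physical node is placed at `vnodes` virtual positions.
        self._ring = {}        # ring position -> physical node
        self._positions = []   # sorted ring positions
        for node in nodes:
            for i in range(vnodes):
                pos = _hash(f"{node}#{i}")
                self._ring[pos] = node
                bisect.insort(self._positions, pos)

    def get_node(self, key: str) -> str:
        # Walk clockwise: first position >= hash(key), wrapping at the end.
        pos = _hash(key)
        idx = bisect.bisect_right(self._positions, pos)
        if idx == len(self._positions):
            idx = 0
        return self._ring[self._positions[idx]]
```

The same key always routes to the same node, and adding a fourth node to a three-node ring moves only roughly a quarter of the keys, spread evenly across the old nodes.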
5. Eviction Policies
When a node's memory is full, it must evict entries to make room.
| Policy | How It Works | Best For | Overhead |
|---|---|---|---|
| LRU | Evict least recently used | General workloads | O(1) with doubly-linked list + hashmap |
| LFU | Evict least frequently used | Skewed access patterns | O(log n) or O(1) with frequency buckets |
| TTL-based | Evict expired keys first | Time-sensitive data | O(1) with expiry heap |
| Random | Evict random keys | Uniform access | O(1), no tracking overhead |
| LRU + TTL | LRU among non-expired, TTL-expired evicted first | Most production systems | Combined overhead |
LRU Implementation (Interview Favorite)
HashMap<Key, DoublyLinkedListNode>
DoublyLinkedList (most recent at head, least recent at tail)

GET(key):
    node = map[key]
    move node to head of list        # O(1)
    return node.value

SET(key, value):
    if key in map:
        update node, move to head    # O(1)
    else:
        if at capacity:
            evict tail node          # O(1)
            remove from map          # O(1)
        insert new node at head      # O(1)
        add to map                   # O(1)
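In Python, the hashmap + doubly-linked-list combination is exactly what `collections.OrderedDict` provides, so a compact working sketch is:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()  # insertion order = recency (MRU at the end)

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)         # O(1): mark as most recently used
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)     # refresh recency on update
        elif len(self._data) >= self.capacity:
            self._data.popitem(last=False)  # O(1): evict least recently used
        self._data[key] = value
```

In an interview, it is worth saying explicitly that `move_to_end` and `popitem` are the O(1) list splices from the pseudocode above.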
6. Data Replication
| Strategy | Mechanism | Consistency | Latency Impact |
|---|---|---|---|
| Synchronous | Primary waits for replica ACK before responding | Strong | +1-2ms per write |
| Asynchronous | Primary responds immediately, replicates in background | Eventual | No impact |
| Semi-synchronous | Wait for 1 of N replicas (quorum) | Tunable | +0.5-1ms |
Replication Flow
Client SET("user:123", data)
            |
            v
      Primary Node
            |--- write to local memory
            |--- async replicate to Replica A
            |--- async replicate to Replica B
            |--- respond OK to client (async mode)
Interview Tip
For a cache, async replication is usually acceptable. A brief inconsistency window (milliseconds) is tolerable because the cache is not the source of truth -- the database is.
7. Write Policies
How the cache interacts with the backing database on writes.
+--------------------------------------------------+
|                  Write Policies                  |
+--------------------------------------------------+

Cache-Aside (Lazy Loading):
    App reads:  check cache -> miss -> read DB -> populate cache
    App writes: write DB -> invalidate cache

Write-Through:
    App writes: write cache -> cache writes DB (sync)
    Reads always hit cache (warm)

Write-Behind (Write-Back):
    App writes: write cache -> cache writes DB (async, batched)
    Risk: data loss if cache node dies before flush
| Policy | Read Perf | Write Perf | Consistency | Data Loss Risk |
|---|---|---|---|---|
| Cache-aside | Miss penalty on first read | Fast (skip cache) | Good (invalidation) | None |
| Write-through | Always warm | Slower (sync DB write) | Strong | None |
| Write-behind | Always warm | Fast (async) | Eventual | Yes (node crash) |
Interview Tip
Cache-aside is the most common in practice. Mention that write-through adds latency but guarantees cache freshness, while write-behind risks data loss.
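A minimal cache-aside sketch in Python; the `cache` and `db` objects and their method names are hypothetical stand-ins for whatever client libraries the application uses:

```python
def read(key, cache, db, ttl=300):
    # Cache-aside read: check cache first, fall back to DB, then populate.
    value = cache.get(key)
    if value is None:                # cache miss
        value = db.read(key)
        if value is not None:
            cache.set(key, value, ttl)
    return value

def write(key, value, cache, db):
    # Cache-aside write: update the source of truth, then *invalidate*
    # (not update) the cache so the next read repopulates it fresh.
    db.write(key, value)
    cache.delete(key)
```

Invalidating rather than updating on write avoids a race where two concurrent writers leave the cache holding the older value.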
8. Handling Node Failures
Scenario: Node B dies

Before failure:        After failure + rehash:
A: keys 0-89           A: keys 0-89 + keys 90-179 (absorbed B's range)
B: keys 90-179         C: keys 180-269 (unchanged)
C: keys 180-269        D: keys 270-359 (unchanged)
D: keys 270-359

With replicas:
    B's replica promoted to primary
    New replica created from promoted primary
    Minimal disruption
Failure Detection
- Heartbeat -- Nodes ping each other periodically
- Gossip protocol -- Nodes share membership state (see 44 - Gossip Protocols & Membership)
- External monitor -- ZooKeeper/etcd detects failures, triggers failover
Recovery Steps
- Detect failure (heartbeat timeout)
- Promote replica to primary
- Update hash ring in configuration service
- Clients refresh ring from config service
- Spawn new replica for promoted node
- Cache warming: preload hot keys (optional)
9. Full System Diagram
                   +---------------------+
                   |     Application     |
                   |       Servers       |
                   +----------+----------+
                              |
                   +----------v----------+
                   | Smart Cache Client  |
                   |  (consistent hash   |
                   |   ring, routing,    |
                   |   connection pool)  |
                   +----------+----------+
                              |
         +--------------------+--------------------+
         |                    |                    |
+--------v-------+   +--------v-------+   +--------v-------+
|  Cache Node A  |   |  Cache Node B  |   |  Cache Node C  |
|   (Primary)    |   |   (Primary)    |   |   (Primary)    |
|  Memory: 64GB  |   |  Memory: 64GB  |   |  Memory: 64GB  |
+--------+-------+   +--------+-------+   +--------+-------+
         |                    |                    |
+--------v-------+   +--------v-------+   +--------v-------+
|   Replica A1   |   |   Replica B1   |   |   Replica C1   |
+----------------+   +----------------+   +----------------+

+--------------------+
|   Config Service   |
|  (ZooKeeper/etcd)  |
|  - ring membership |
|  - node health     |
|  - shard mapping   |
+--------------------+

+--------------------+
|     Monitoring     |
|  - hit/miss ratio  |
|  - latency p50/p99 |
|  - memory usage    |
|  - eviction rate   |
+--------------------+
10. Memory Management
Slab Allocation (Memcached Approach)
Memory divided into slab classes by size:

Slab Class 1:  64B chunks  [████][████][████][    ]
Slab Class 2: 128B chunks  [████████][████████][        ]
Slab Class 3: 256B chunks  [████████████████][                ]
Slab Class 4: 512B chunks  [████████████████████████████████]

Key-value pair -> pick smallest slab class that fits
| Approach | Pros | Cons |
|---|---|---|
| Slab allocator | No fragmentation, predictable | Internal waste (padding) |
| malloc/free | No padding waste | Fragmentation over time |
| jemalloc (Redis) | Low fragmentation, good throughput | Complex tuning |
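The "pick the smallest slab class that fits" rule is a simple search over a geometric progression of chunk sizes. The 1.25 growth factor mirrors Memcached's default, but the exact class sizes here are illustrative:

```python
def slab_classes(min_chunk=64, max_chunk=1024 * 1024, growth=1.25):
    # Memcached-style geometric progression of chunk sizes.
    sizes, size = [], min_chunk
    while size <= max_chunk:
        sizes.append(size)
        size = int(size * growth)
    return sizes

def pick_slab(item_size, sizes):
    # Smallest chunk that fits; the leftover space is the internal
    # padding waste the table above refers to.
    for s in sizes:
        if s >= item_size:
            return s
    raise ValueError("item too large for any slab class")
```

With these defaults, a 550-byte key-value pair (the size used in the capacity estimate later) lands in a 590-byte chunk, wasting 40 bytes of padding.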
11. Cache Warming
When a new node joins or a node restarts, its cache is cold.
| Strategy | How | Trade-off |
|---|---|---|
| Passive warming | Cache fills as requests naturally arrive | Spike in DB load during warm-up |
| Active warming | Preload hot keys from another replica or backup | Faster warm-up, needs hot key list |
| Snapshot restore | Load RDB/AOF snapshot from disk | Fast restore, slightly stale data |
| Shadow traffic | Mirror production reads to new node before cutover | Smooth transition, extra resources |
12. Monitoring & Observability
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Hit ratio | Cache effectiveness | < 80% |
| Latency p50/p99 | Performance SLA | p99 > 5ms |
| Eviction rate | Memory pressure | > 1000/sec |
| Memory usage | Capacity planning | > 85% |
| Connection count | Client load | > 80% of max |
| Replication lag | Data freshness on replicas | > 100ms |
13. Comparison: Redis Cluster vs Memcached
| Feature | Redis Cluster | Memcached |
|---|---|---|
| Data structures | Strings, lists, sets, sorted sets, hashes, streams | Strings only |
| Persistence | RDB snapshots + AOF | None |
| Replication | Built-in primary-replica | None (client-side) |
| Partitioning | Hash slots (16384 slots) | Client-side consistent hashing |
| Eviction | LRU, LFU, volatile-lru, volatile-ttl, etc. | LRU only |
| Pub/Sub | Yes | No |
| Lua scripting | Yes | No |
| Multi-threaded | I/O threads (Redis 6+) | Yes (multi-threaded) |
| Memory efficiency | Higher overhead per key | Lower overhead (slab allocator) |
When to Choose Which
- Redis -- Need data structures beyond strings, persistence, or Pub/Sub
- Memcached -- Pure key-value caching, maximum simplicity, multi-threaded perf
14. Advanced: Thundering Herd Prevention
When a popular cached key expires, many concurrent requests flood the database simultaneously.
Solutions:

1. Locking: first request acquires lock, others wait
       Thread 1:   cache miss -> acquire lock -> query DB -> populate cache -> release lock
       Thread 2-N: cache miss -> wait for lock -> read from cache (now populated)

2. Early expiration: refresh cache before TTL expires
       Actual TTL = 60s, logical TTL = 50s
       First request after 50s triggers background refresh

3. Stale-while-revalidate: serve stale data while refreshing
       Return expired value immediately, trigger async refresh
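The locking approach (solution 1) can be sketched in single-process form with a per-key lock. In a real distributed deployment the lock would be a cache-level primitive such as a set-if-not-exists key, which is an assumption beyond this sketch:

```python
import threading
from collections import defaultdict

_locks = defaultdict(threading.Lock)   # one lock per key (single-process sketch)

def get_or_load(key, cache, load_from_db, ttl=60):
    value = cache.get(key)
    if value is not None:
        return value                   # fast path: cache hit
    with _locks[key]:                  # only one caller rebuilds the entry
        value = cache.get(key)         # re-check: another thread may have filled it
        if value is None:
            value = load_from_db(key)  # a single DB hit instead of a stampede
            cache.set(key, value, ttl)
    return value
```

The re-check inside the lock is the essential detail: without it, every waiting thread would still hit the database one after another once the lock is released.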
15. Capacity Estimation
Assumptions:
- 100M unique keys
- Average key: 50 bytes, average value: 500 bytes
- Total data: 100M * 550B = 55GB
- Replication factor: 2 (primary + 1 replica)
- Total memory: 55GB * 2 = 110GB
- Nodes with 64GB RAM (50GB usable for cache): ceil(110/50) = 3 nodes
Throughput:
- 1M ops/sec target
- Single Redis node: ~100K ops/sec
- Need 10 primary nodes for throughput (even if memory fits in 3)
- With replicas reading: 10 primaries + 10 replicas = 20 nodes
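The arithmetic above, as a quick sanity check using the same assumptions (100M keys, 550B per entry, replication factor 2, 50GB usable per node, 100K ops/sec per Redis node):

```python
import math

keys = 100_000_000
bytes_per_entry = 50 + 500                   # key + value
data_gb = keys * bytes_per_entry / 1e9       # total working set
total_gb = data_gb * 2                       # replication factor 2
nodes_by_memory = math.ceil(total_gb / 50)   # 50 GB usable per 64 GB node

ops_target = 1_000_000
ops_per_node = 100_000
nodes_by_throughput = math.ceil(ops_target / ops_per_node)

print(data_gb, nodes_by_memory, nodes_by_throughput)  # 55.0 3 10
```

Throughput, not memory, is the binding constraint here, which is why the design ends up with 10 primaries rather than 3.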
16. Key Trade-offs Discussion
| Decision | Option A | Option B |
|---|---|---|
| Routing | Smart client (low latency) | Proxy-based (simpler clients, extra hop) |
| Replication | Sync (strong consistency) | Async (lower latency, risk of stale reads) |
| Eviction | LRU (simple, general) | LFU (better for skewed workloads) |
| Write policy | Cache-aside (simple) | Write-through (always warm cache) |
| Persistence | None (pure cache) | RDB/AOF (faster recovery) |
| Failure handling | Rehash to other nodes | Promote replica (less disruption) |
17. Interview Checklist
- Clarified: read/write ratio, data size, latency requirements
- Explained consistent hashing with virtual nodes
- Described eviction policies (LRU implementation detail is a bonus)
- Covered replication (async for cache is fine) and failover
- Discussed write policies (cache-aside vs write-through vs write-behind)
- Addressed thundering herd problem
- Monitoring: hit ratio, latency percentiles, eviction rate
- Capacity estimation with node count
- Compared Redis Cluster vs Memcached
18. Resources
- System Design Interview (Alex Xu, Vol 1) -- Chapter on Distributed Cache
- Redis Documentation -- redis.io/docs (cluster, persistence, eviction)
- Memcached Wiki -- memcached.org
- Paper: "Scaling Memcache at Facebook" (NSDI 2013)
- YouTube: System Design Interview -- Design Distributed Cache
- YouTube: Martin Kleppmann -- Caching Strategies