40 - Design Distributed Cache

Previous: 39 - Design Typeahead & Autocomplete | Next: 41 - Design Ticket Booking System


1. Problem Statement

Design a distributed in-memory caching system (think Redis Cluster or Memcached) that provides sub-millisecond reads, horizontal scalability, and high availability. The cache sits between application servers and databases to absorb read-heavy traffic.


2. Requirements

Functional

| Requirement       | Detail                                        |
|-------------------|-----------------------------------------------|
| GET/SET/DELETE    | Basic key-value operations                    |
| TTL support       | Keys expire after configurable time-to-live   |
| Eviction policies | LRU, LFU, TTL-based eviction when memory full |
| Atomic operations | Increment, compare-and-swap for counters      |
| Bulk operations   | Multi-get for batch reads                     |

Non-Functional

| Requirement         | Target                  |
|---------------------|-------------------------|
| Read latency (p99)  | < 1ms                   |
| Write latency (p99) | < 5ms                   |
| Availability        | 99.99%                  |
| Throughput          | 1M+ ops/sec per node    |
| Data size           | TB-scale across cluster |

3. Cache Cluster Architecture

+-------------------+     +-------------------+     +-------------------+
| Application       |     | Application       |     | Application       |
| Server + Client   |     | Server + Client   |     | Server + Client   |
+--------+----------+     +--------+----------+     +--------+----------+
         |                         |                         |
         +------------+------------+------------+------------+
                      |                         |
               +------v--------+         +------v--------+
               | Cache Node 1  |         | Cache Node 2  |
               | (keys: a-m)   |         | (keys: n-z)   |
               | Primary       |         | Primary       |
               +------+--------+         +------+--------+
                      |                         |
               +------v--------+         +------v--------+
               | Replica 1a    |         | Replica 2a    |
               +---------------+         +---------------+
               | Replica 1b    |         | Replica 2b    |
               +---------------+         +---------------+

Key Components

  1. Cache nodes -- Hold data in memory, handle GET/SET/DELETE
  2. Cache client (smart client) -- Runs in application process, routes requests to correct node
  3. Configuration service -- ZooKeeper/etcd tracks cluster membership and key ranges
  4. Replication -- Each primary has 1-2 replicas for fault tolerance

4. Consistent Hashing for Key Distribution

Consistent hashing minimizes key redistribution when nodes are added or removed.

                    Node A (0)
                   /      \
            key_hash=350   key_hash=42
               /               \
     Node D (270) -- hash ring -- Node B (90)
               \               /
            key_hash=210   key_hash=130
                   \      /
                    Node C (180)

Ring positions: A=0, B=90, C=180, D=270
Key with hash=42 -> walks clockwise -> lands on Node B (first node at or past 42)

Virtual Nodes

Each physical node maps to 100-200 virtual positions on the ring.

| Without vnodes                                   | With vnodes (150/node)                                  |
|--------------------------------------------------|---------------------------------------------------------|
| Uneven key distribution                          | Near-uniform distribution                               |
| Adding a node moves ~1/N of keys, all taken from a single neighbor | Adding a node moves ~1/N of keys, drawn evenly from all nodes |
| Hot spots possible                               | Hot spots rare                                          |

Client-Side Routing

CacheClient:
  ring: ConsistentHashRing
  
  get(key):
    node = ring.getNode(hash(key))
    return node.get(key)
    
  set(key, value, ttl):
    node = ring.getNode(hash(key))
    node.set(key, value, ttl)
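The routing pseudocode above can be made concrete. A minimal sketch of a hash ring with virtual nodes, assuming an MD5-based position function (class and method names are illustrative, not from any specific client library):

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # Stable 32-bit position; md5 avoids Python's per-process hash randomization.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=150):
        # Each physical node contributes `vnodes` positions on the ring.
        self._ring = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    def get_node(self, key: str) -> str:
        # Walk clockwise: first virtual node at or after the key's position.
        pos = ring_hash(key)
        idx = bisect.bisect_left(self._ring, (pos,))
        if idx == len(self._ring):
            idx = 0  # wrap around the ring
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
node = ring.get_node("user:123")  # same key always routes to the same node
```

Adding or removing a node only changes ownership of the ring segments adjacent to that node's virtual positions, which is exactly the ~1/N redistribution property described above.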

5. Eviction Policies

When a node's memory is full, it must evict entries to make room.

| Policy    | How It Works                                      | Best For                | Overhead                                 |
|-----------|---------------------------------------------------|-------------------------|------------------------------------------|
| LRU       | Evict least recently used                         | General workloads       | O(1) with doubly-linked list + hashmap   |
| LFU       | Evict least frequently used                       | Skewed access patterns  | O(log n), or O(1) with frequency buckets |
| TTL-based | Evict expired keys first                          | Time-sensitive data     | O(log n) with expiry min-heap            |
| Random    | Evict random keys                                 | Uniform access          | O(1), no tracking overhead               |
| LRU + TTL | LRU among non-expired; expired keys evicted first | Most production systems | Combined overhead                        |

LRU Implementation (Interview Favorite)

HashMap<Key, DoublyLinkedListNode>
DoublyLinkedList (most recent at head, least recent at tail)

GET(key):
  if key not in map: return MISS
  node = map[key]
  move node to head of list      # O(1)
  return node.value

SET(key, value):
  if key in map:
    update node, move to head    # O(1)
  else:
    if at capacity:
      evict tail node            # O(1)
      remove from map            # O(1)
    insert new node at head      # O(1)
    add to map                   # O(1)
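The pseudocode above maps directly onto Python's `collections.OrderedDict`, which is itself backed by a hash map plus a doubly linked list. A minimal sketch:

```python
from collections import OrderedDict

class LRUCache:
    """All operations are O(1): OrderedDict keeps most recently used at the end."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None  # cache miss
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # touch "a" so it becomes most recent
cache.set("c", 3)  # at capacity: evicts "b", the least recently used
```

In an interview, implementing the doubly linked list by hand (as in the pseudocode) earns extra credit; in practice the standard-library structure is the idiomatic choice.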

6. Data Replication

| Strategy         | Mechanism                                              | Consistency | Latency Impact   |
|------------------|--------------------------------------------------------|-------------|------------------|
| Synchronous      | Primary waits for replica ACK before responding        | Strong      | +1-2ms per write |
| Asynchronous     | Primary responds immediately, replicates in background | Eventual    | No impact        |
| Semi-synchronous | Wait for ACK from at least 1 of N replicas             | Tunable     | +0.5-1ms         |

Replication Flow

Client SET("user:123", data)
  |
  v
Primary Node
  |--- write to local memory
  |--- async replicate to Replica A
  |--- async replicate to Replica B
  |--- respond OK to client (async mode)

Interview Tip

For a cache, async replication is usually acceptable. A brief inconsistency window (milliseconds) is tolerable because the cache is not the source of truth -- the database is.


7. Write Policies

How the cache interacts with the backing database on writes.

+--------------------------------------------------+
|                  Write Policies                  |
+--------------------------------------------------+

Cache-Aside (Lazy Loading):
  App reads:  check cache -> miss -> read DB -> populate cache
  App writes: write DB -> invalidate cache

Write-Through:
  App writes: write cache -> cache writes DB (sync)
  Reads always hit cache (warm)

Write-Behind (Write-Back):
  App writes: write cache -> cache writes DB (async, batched)
  Risk: data loss if cache node dies before flush

| Policy        | Read Perf                  | Write Perf             | Consistency         | Data Loss Risk   |
|---------------|----------------------------|------------------------|---------------------|------------------|
| Cache-aside   | Miss penalty on first read | Fast (skip cache)      | Good (invalidation) | None             |
| Write-through | Always warm                | Slower (sync DB write) | Strong              | None             |
| Write-behind  | Always warm                | Fast (async)           | Eventual            | Yes (node crash) |
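The cache-aside read and write paths can be sketched as follows; `DB`, `cache`, and the TTL value are stand-ins for the real database, the distributed cache client, and a configured time-to-live:

```python
import time

DB = {"user:123": {"name": "Ada"}}  # stand-in for the backing database
cache = {}                          # stand-in for the cache: key -> (value, expires_at)
TTL = 60.0

def read(key):
    # Cache-aside read: check cache, fall back to DB, populate cache on miss.
    entry = cache.get(key)
    if entry is not None and entry[1] > time.time():
        return entry[0]                       # cache hit
    value = DB.get(key)                       # cache miss: read from DB
    if value is not None:
        cache[key] = (value, time.time() + TTL)
    return value

def write(key, value):
    # Cache-aside write: update the DB, then invalidate (not update) the cache.
    # Invalidation avoids racing writers leaving a stale value behind.
    DB[key] = value
    cache.pop(key, None)
```

The next read after a write takes the miss path and repopulates the cache from the database, which is why cache-aside never risks losing data: the DB remains the source of truth.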

Interview Tip

Cache-aside is the most common in practice. Mention that write-through adds latency but guarantees cache freshness, while write-behind risks data loss.


8. Handling Node Failures

Scenario: Node B dies

Before failure:          After failure + rehash:
  A: keys 0-89            A: keys 0-89 (unchanged)
  B: keys 90-179          C: keys 90-269 (next node clockwise absorbs B's range)
  C: keys 180-269         D: keys 270-359 (unchanged)
  D: keys 270-359

With replicas:
  B's replica promoted to primary
  New replica created from promoted primary
  Minimal disruption

Failure Detection

  1. Heartbeat -- Nodes ping each other periodically
  2. Gossip protocol -- Nodes share membership state (see 44 - Gossip Protocols & Membership)
  3. External monitor -- ZooKeeper/etcd detects failures, triggers failover

Recovery Steps

  1. Detect failure (heartbeat timeout)
  2. Promote replica to primary
  3. Update hash ring in configuration service
  4. Clients refresh ring from config service
  5. Spawn new replica for promoted node
  6. Cache warming: preload hot keys (optional)

9. Full System Diagram

                    +---------------------+
                    |    Application      |
                    |    Servers          |
                    +----------+----------+
                               |
                    +----------v----------+
                    |  Smart Cache Client |
                    | (consistent hash    |
                    |  ring, routing,     |
                    |  connection pool)   |
                    +----------+----------+
                               |
          +--------------------+--------------------+
          |                    |                    |
+---------v------+   +---------v------+   +---------v------+
| Cache Node A   |   | Cache Node B   |   | Cache Node C   |
| (Primary)      |   | (Primary)      |   | (Primary)      |
| Memory: 64GB   |   | Memory: 64GB   |   | Memory: 64GB   |
+-------+--------+   +-------+--------+   +-------+--------+
        |                     |                     |
+-------v--------+   +-------v--------+   +-------v--------+
| Replica A1     |   | Replica B1     |   | Replica C1     |
+----------------+   +----------------+   +----------------+

          +--------------------+
          | Config Service     |
          | (ZooKeeper/etcd)   |
          | - ring membership  |
          | - node health      |
          | - shard mapping    |
          +--------------------+

          +--------------------+
          | Monitoring         |
          | - hit/miss ratio   |
          | - latency p50/p99  |
          | - memory usage     |
          | - eviction rate    |
          +--------------------+

10. Memory Management

Slab Allocation (Memcached Approach)

Memory divided into slab classes by size:

Slab Class 1: 64B chunks    [████][████][████][    ]
Slab Class 2: 128B chunks   [████████][████████][        ]
Slab Class 3: 256B chunks   [████████████████][                ]
Slab Class 4: 512B chunks   [████████████████████████████████]

Key-value pair -> pick smallest slab class that fits

| Approach         | Pros                                | Cons                     |
|------------------|-------------------------------------|--------------------------|
| Slab allocator   | No fragmentation, predictable       | Internal waste (padding) |
| malloc/free      | No padding waste                    | Fragmentation over time  |
| jemalloc (Redis) | Low fragmentation, good throughput  | Complex tuning           |
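The slab-class selection rule above fits in a few lines; the chunk sizes here are illustrative, not Memcached's actual defaults:

```python
SLAB_CLASSES = [64, 128, 256, 512, 1024]  # chunk sizes in bytes (illustrative)

def pick_slab_class(item_size: int) -> int:
    # Smallest chunk that fits; the leftover space is internal padding waste.
    for chunk in SLAB_CLASSES:
        if item_size <= chunk:
            return chunk
    raise ValueError("item too large for any slab class")

chunk = pick_slab_class(200)  # a 200B item lands in the 256B class: 56B of padding
```

This is where the "internal waste" trade-off in the table comes from: padding is the price paid to make allocation O(number of classes) and eliminate external fragmentation.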

11. Cache Warming

When a new node joins or a node restarts, its cache is cold.

| Strategy         | How                                               | Trade-off                           |
|------------------|---------------------------------------------------|-------------------------------------|
| Passive warming  | Cache fills as requests naturally arrive          | Spike in DB load during warm-up     |
| Active warming   | Preload hot keys from another replica or backup   | Faster warm-up, needs hot-key list  |
| Snapshot restore | Load RDB/AOF snapshot from disk                   | Fast restore, slightly stale data   |
| Shadow traffic   | Mirror production reads to new node before cutover | Smooth transition, extra resources |

12. Monitoring & Observability

| Metric           | What It Tells You           | Alert Threshold |
|------------------|-----------------------------|-----------------|
| Hit ratio        | Cache effectiveness         | < 80%           |
| Latency p50/p99  | Performance SLA             | p99 > 5ms       |
| Eviction rate    | Memory pressure             | > 1000/sec      |
| Memory usage     | Capacity planning           | > 85%           |
| Connection count | Client load                 | > 80% of max    |
| Replication lag  | Data freshness on replicas  | > 100ms         |

13. Comparison: Redis Cluster vs Memcached

| Feature           | Redis Cluster                                      | Memcached                       |
|-------------------|----------------------------------------------------|---------------------------------|
| Data structures   | Strings, lists, sets, sorted sets, hashes, streams | Strings only                    |
| Persistence       | RDB snapshots + AOF                                | None                            |
| Replication       | Built-in primary-replica                           | None (client-side)              |
| Partitioning      | Hash slots (16384 slots)                           | Client-side consistent hashing  |
| Eviction          | LRU, LFU, volatile-lru, volatile-ttl, etc.         | LRU only                        |
| Pub/Sub           | Yes                                                | No                              |
| Lua scripting     | Yes                                                | No                              |
| Multi-threaded    | I/O threads (Redis 6+)                             | Yes (fully multi-threaded)      |
| Memory efficiency | Higher overhead per key                            | Lower overhead (slab allocator) |

When to Choose Which

  • Redis -- Need data structures beyond strings, persistence, or Pub/Sub
  • Memcached -- Pure key-value caching, maximum simplicity, multi-threaded perf

14. Advanced: Thundering Herd Prevention

When a popular cached key expires, many concurrent requests flood the database simultaneously.

Solutions:

1. Locking: First request acquires lock, others wait
   Thread 1: cache miss -> acquire lock -> query DB -> populate cache -> release lock
   Thread 2-N: cache miss -> wait for lock -> read from cache (now populated)

2. Early expiration: Refresh cache before TTL expires
   Actual TTL = 60s, logical TTL = 50s
   First request after 50s triggers background refresh

3. Stale-while-revalidate: Serve stale data while refreshing
   Return expired value immediately, trigger async refresh
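Solution 1 (locking) is essentially a per-key "single-flight" guard. A minimal sketch, with `load_from_db` standing in for the expensive database query:

```python
import threading

cache = {}
locks = {}
locks_guard = threading.Lock()  # protects the per-key lock registry
db_calls = 0

def load_from_db(key):
    global db_calls
    db_calls += 1                # count DB hits to show only one gets through
    return f"value-for-{key}"    # stand-in for an expensive DB query

def get_single_flight(key):
    if key in cache:
        return cache[key]
    with locks_guard:
        lock = locks.setdefault(key, threading.Lock())
    with lock:
        if key in cache:          # re-check: another thread may have filled it
            return cache[key]
        value = load_from_db(key) # only the first thread reaches the DB
        cache[key] = value
        return value

# Ten concurrent requests for the same expired hot key -> one DB query.
threads = [threading.Thread(target=get_single_flight, args=("hot",)) for _ in range(10)]
for t in threads: t.start()
for t in threads: t.join()
```

The double-check inside the lock is the crucial detail: without it, every waiting thread would still query the database once the lock was released.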

15. Capacity Estimation

Assumptions:
- 100M unique keys
- Average key: 50 bytes, average value: 500 bytes
- Total data: 100M * 550B = 55GB
- Replication factor: 2 (primary + 1 replica)
- Total memory: 55GB * 2 = 110GB
- Nodes with 64GB RAM (50GB usable for cache): ceil(110/50) = 3 nodes

Throughput:
- 1M ops/sec target
- Single Redis node: ~100K ops/sec
- Need 10 primary nodes for throughput (even if memory fits in 3)
- With replicas reading: 10 primaries + 10 replicas = 20 nodes
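The arithmetic above as a quick sanity check (all inputs are this section's assumptions, not measured values):

```python
import math

KEYS = 100_000_000
ITEM_BYTES = 50 + 500            # avg key + avg value
REPLICATION = 2                  # primary + 1 replica
USABLE_PER_NODE_GB = 50          # of each node's 64GB RAM
OPS_TARGET = 1_000_000
OPS_PER_NODE = 100_000           # assumed single-node throughput

data_gb = KEYS * ITEM_BYTES / 1e9                                # 55 GB raw
total_gb = data_gb * REPLICATION                                 # 110 GB replicated
nodes_for_memory = math.ceil(total_gb / USABLE_PER_NODE_GB)      # 3 nodes by memory
primaries_for_throughput = math.ceil(OPS_TARGET / OPS_PER_NODE)  # 10 by throughput
total_nodes = primaries_for_throughput * REPLICATION             # 20 nodes total
```

Throughput, not memory, is the binding constraint here, which is why the cluster is sized at 20 nodes rather than 3.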

16. Key Trade-offs Discussion

| Decision         | Option A                   | Option B                                  |
|------------------|----------------------------|-------------------------------------------|
| Routing          | Smart client (low latency) | Proxy-based (simpler clients, extra hop)  |
| Replication      | Sync (strong consistency)  | Async (lower latency, risk of stale reads) |
| Eviction         | LRU (simple, general)      | LFU (better for skewed workloads)         |
| Write policy     | Cache-aside (simple)       | Write-through (always-warm cache)         |
| Persistence      | None (pure cache)          | RDB/AOF (faster recovery)                 |
| Failure handling | Rehash to other nodes      | Promote replica (less disruption)         |

17. Interview Checklist

  • Clarified: read/write ratio, data size, latency requirements
  • Explained consistent hashing with virtual nodes
  • Described eviction policies (LRU implementation detail is a bonus)
  • Covered replication (async for cache is fine) and failover
  • Discussed write policies (cache-aside vs write-through vs write-behind)
  • Addressed thundering herd problem
  • Monitoring: hit ratio, latency percentiles, eviction rate
  • Capacity estimation with node count
  • Compared Redis Cluster vs Memcached

18. Resources

  • System Design Interview (Alex Xu, Vol 1) -- Chapter on Distributed Cache
  • Redis Documentation -- redis.io/docs (cluster, persistence, eviction)
  • Memcached Wiki -- memcached.org
  • Paper: "Scaling Memcache at Facebook" (NSDI 2013)
  • YouTube: System Design Interview -- Design Distributed Cache
  • YouTube: Martin Kleppmann -- Caching Strategies
