29 - Design Key-Value Store
Previous: 28 - Design Rate Limiter | Next: 30 - Design Notification System
Why This Matters in Interviews
Designing a distributed key-value store (essentially Amazon Dynamo) tests your knowledge of consistent hashing, replication, consistency models, failure handling, and storage engine internals. It is one of the most comprehensive system design questions.
Step 1: Requirements
Functional Requirements
- put(key, value) -- store a key-value pair
- get(key) -- retrieve the value for a key
- delete(key) -- remove a key-value pair
Non-Functional Requirements
| Requirement | Target |
|---|---|
| Latency | < 10ms for reads and writes (p99) |
| Availability | 99.99% (always writable, AP system) |
| Scalability | Handle petabytes of data, millions of QPS |
| Durability | No data loss |
| Tunable consistency | Eventual to strong, configurable per operation |
| Automatic scaling | Add/remove nodes without downtime |
Step 2: Single-Server Key-Value Store
In-Memory Hash Table
+----------------------------+
| Hash Table (in RAM) |
| key1 --> value1 |
| key2 --> value2 |
| key3 --> value3 |
+----------------------------+
Pros: Fast (O(1) lookups)
Cons: Limited by RAM, data lost on crash
Improving Single Server
+------------------+ +------------------+
| Write-Ahead Log | | Hash Table |
| (WAL on disk) | ---> | (in memory) |
| append-only | | fast lookups |
+------------------+ +------------------+
|
+-------+-------+
| Data file |
| (compacted |
| periodically)|
+---------------+
WAL ensures durability: if the process crashes, replay the log to rebuild the hash table.
Compaction: Periodically merge and compress old data files to reclaim space.
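The single-server design above can be sketched as a toy WAL-backed store (illustrative only; class name, file format, and the space-separated record encoding are invented for the sketch, and keys/values must not contain spaces or newlines):

```python
import os

class SimpleKVStore:
    """Toy key-value store: in-memory dict + append-only WAL for durability."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.table = {}
        self._replay()  # rebuild the hash table from the log after a restart

    def _replay(self):
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as f:
            for line in f:
                op, _, rest = line.rstrip("\n").partition(" ")
                if op == "PUT":
                    key, _, value = rest.partition(" ")
                    self.table[key] = value
                elif op == "DEL":
                    self.table.pop(rest, None)

    def _append(self, record):
        with open(self.wal_path, "a") as f:
            f.write(record + "\n")
            f.flush()
            os.fsync(f.fileno())  # durable on disk before we acknowledge

    def put(self, key, value):
        self._append(f"PUT {key} {value}")  # WAL first, then memory
        self.table[key] = value

    def get(self, key):
        return self.table.get(key)

    def delete(self, key):
        self._append(f"DEL {key}")
        self.table.pop(key, None)
```

Restarting the store (constructing a new instance over the same WAL file) replays the log, which is exactly the crash-recovery path described above.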
Step 3: Distributed Architecture
Consistent Hashing for Data Partitioning
Hash ring with virtual nodes:
            Node A (vnode 1)
           /                \
  Node C (vnode 3)      Node A (vnode 2)
          |                  |
  Node B (vnode 2)      Node C (vnode 1)
           \                /
            Node B (vnode 1)
Key "user:123" hashes to position X on the ring
--> Walk clockwise to find first node --> Node B (vnode 1)
--> Replicate to next N-1 nodes on ring
Why virtual nodes?
- Even distribution; a node with more capacity can be assigned more vnodes
- Smoother rebalancing when nodes join/leave
- Avoids hotspots from uneven key distribution
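A minimal hash ring with virtual nodes can be sketched as follows (class and function names are my own; md5 stands in for whatever ring hash a real system uses):

```python
import bisect
import hashlib

def ring_hash(s):
    """Map a string to a 64-bit position on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest()[:16], 16)

class HashRing:
    """Consistent hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node owns `vnodes` positions on the ring.
        self.ring = sorted(
            (ring_hash(f"{node}#vnode{i}"), node)
            for node in nodes for i in range(vnodes)
        )

    def preference_list(self, key, n=3):
        """First n *distinct physical* nodes clockwise from the key's position,
        skipping vnodes owned by a physical node already in the list."""
        idx = bisect.bisect(self.ring, (ring_hash(key),))
        result = []
        for _, node in self.ring[idx:] + self.ring[:idx]:  # walk clockwise, wrap
            if node not in result:
                result.append(node)
            if len(result) == n:
                break
        return result
```

`HashRing(["A", "B", "C", "D"]).preference_list("user:123")` returns the N distinct nodes responsible for the key, which is the preference list described in the replication section below.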
Data Replication
Replication factor N = 3
Key "user:123" --> Primary: Node B
Replica 1: Node C (next on ring)
Replica 2: Node A (next unique node on ring)
Write path:
Client --> Coordinator --> Node B (primary)
--> Node C (replica)
--> Node A (replica)
Preference list: The list of N nodes responsible for a key, skipping virtual nodes owned by the same physical node.
Step 4: Consistency -- Quorum Protocol
W + R > N for Strong Consistency
N = 3 (replicas), W = 2 (write quorum), R = 2 (read quorum)
Write: Send to 3, wait for 2 ACKs --> success
Read: Send to 3, wait for 2 responses --> return latest
Since W + R = 4 > 3 = N, reads and writes overlap
--> At least one node has the latest value
--> Strong consistency (if no concurrent writes)
Tunable Consistency
| Configuration | W | R | Behavior |
|---|---|---|---|
| Strong consistency | N | 1 | Every write hits all nodes; reads from any |
| Strong consistency | N/2+1 | N/2+1 | Quorum overlap |
| High read performance | N | 1 | Fast reads from any single node; writes hit all nodes |
| High write performance | 1 | N | Fast writes; reads check all nodes |
| Eventual consistency | 1 | 1 | Fastest, but may read stale data |
A common Dynamo-style configuration is N=3, W=2, R=2, a good balance of latency and consistency. (Cassandra's out-of-the-box consistency level is lower; quorum reads/writes are opt-in.)
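A toy coordinator illustrating the quorum rules (the replica objects and their put/get signatures are invented for the sketch; versions are assumed comparable, e.g. timestamps):

```python
def quorum_write(replicas, key, value, version, w):
    """Send the write to all N replicas; succeed once w acknowledge."""
    acks = 0
    for replica in replicas:
        if replica.put(key, value, version):  # True = ack
            acks += 1
        if acks >= w:
            return True
    return False

def quorum_read(replicas, key, r):
    """Collect r responses and return the value with the highest version.
    Because W + R > N, at least one of the r responders has the latest write."""
    responses = []
    for replica in replicas:
        resp = replica.get(key)  # (version, value) or None if down/missing
        if resp is not None:
            responses.append(resp)
        if len(responses) >= r:
            break
    if len(responses) < r:
        raise RuntimeError("read quorum not met")
    return max(responses)[1]  # newest version wins
```

With N=3 and W=R=2, a write succeeds even with one replica down, and a read still intersects the write set.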
Conflict Resolution
With eventual consistency, concurrent writes to the same key can conflict.
| Strategy | How It Works | Used By |
|---|---|---|
| Last-Write-Wins (LWW) | Highest timestamp wins | Cassandra (default) |
| Vector clocks | Track causal history, detect conflicts | Dynamo (original) |
| CRDTs | Conflict-free data types that auto-merge | Riak |
| Application-level | Return all versions, app resolves | Dynamo (shopping cart) |
Step 5: Failure Detection
Gossip Protocol
Every T seconds, each node:
1. Picks a random peer
2. Sends its membership list (node -> heartbeat counter)
3. Peer merges the list with its own
Node A: {A:100, B:98, C:95, D:92}
Node B: {A:99, B:100, C:96, D:90}
After gossip:
Node A: {A:100, B:100, C:96, D:92}
Node B: {A:100, B:100, C:96, D:92}
If D's counter hasn't increased in T_fail seconds --> mark D as suspected down
Why gossip?
- Decentralized (no single failure detector)
- Eventually consistent membership view
- Scalable: O(log N) rounds to propagate to all nodes
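The merge step above, applied to the membership lists from the example (a minimal sketch; real gossip also tracks when each counter last advanced):

```python
def gossip_merge(local, remote):
    """Merge two membership views: keep the higher heartbeat counter per node."""
    return {n: max(local.get(n, 0), remote.get(n, 0))
            for n in set(local) | set(remote)}
```

After one exchange, both peers converge on the elementwise maximum, exactly as shown in the Node A / Node B example.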
Step 6: Handling Temporary Failures
Sloppy Quorum + Hinted Handoff
Normal: Key "X" --> Node A, B, C (all healthy)
Node C is down:
Key "X" --> Node A, B, D (D temporarily holds C's data)
D stores data with a "hint":
{ key: "X", value: "...", hint: "intended for Node C" }
When C recovers:
D sends hinted data to C --> D deletes the hint
This maintains availability during temporary failures.
Sloppy quorum: The first N healthy nodes on the ring handle the request, even if they aren't the designated nodes. This prioritizes availability over strict consistency.
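Hinted handoff can be sketched as follows (class and method names are invented; a real implementation persists hints durably and retries):

```python
class HintedHandoffNode:
    """A node that can hold writes intended for a down peer, tagged with a hint."""

    def __init__(self):
        self.store = {}
        self.hints = []  # writes to replay when the intended node recovers

    def put(self, key, value, intended_for=None):
        if intended_for is None:
            self.store[key] = value  # this node is a designated replica
        else:
            # sloppy-quorum write: hold the data on behalf of a down node
            self.hints.append({"key": key, "value": value, "hint": intended_for})

    def handoff(self, peers):
        """peers: name -> node for recovered peers. Replay matching hints, then drop them."""
        remaining = []
        for h in self.hints:
            target = peers.get(h["hint"])
            if target is not None:
                target.put(h["key"], h["value"])  # deliver to the recovered node
            else:
                remaining.append(h)  # intended node still down; keep the hint
        self.hints = remaining
```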
Step 7: Handling Permanent Failures
Anti-Entropy with Merkle Trees
When a replica has been down for a long time, it needs to synchronize efficiently.
Merkle tree for a key range:
Root Hash
/ \
Hash(1-4) Hash(5-8)
/ \ / \
Hash(1-2) Hash(3-4) Hash(5-6) Hash(7-8)
/ \ / \ / \ / \
k1 k2 k3 k4 k5 k6 k7 k8
Comparison between two replicas:
Root hashes differ?
--> Compare children: left subtree matches, right differs
--> Compare children: Hash(5-6) matches, Hash(7-8) differs
--> Sync only keys k7 and k8
Why Merkle trees?
- Compare entire datasets by exchanging just the root hash
- Identify exactly which keys differ in O(log N) comparisons
- Transfer only the differing keys (not the entire dataset)
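The build-and-compare procedure above can be sketched for a power-of-two key range (an assumption of this toy version; function names and the leaf encoding are my own):

```python
import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

def build_merkle(leaves):
    """Build a Merkle tree bottom-up from sorted (key, value) pairs.
    Returns a list of levels: levels[0] = leaf hashes, levels[-1] = [root].
    Assumes len(leaves) is a power of two."""
    level = [h(f"{k}={v}".encode()) for k, v in leaves]
    levels = [level]
    while len(level) > 1:
        level = [h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff_leaves(tree_a, tree_b):
    """Descend both same-shaped trees top-down, expanding only subtrees whose
    hashes disagree; return indices of leaves that differ."""
    level = len(tree_a) - 1  # start at the root level
    differing = [i for i, x in enumerate(tree_a[level]) if x != tree_b[level][i]]
    while level > 0:
        level -= 1
        children = []
        for i in differing:
            for c in (2 * i, 2 * i + 1):
                if tree_a[level][c] != tree_b[level][c]:
                    children.append(c)
        differing = children
    return differing
```

Matching subtrees are never descended into, so two replicas that differ in one key exchange only O(log N) hashes before pinpointing it.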
Step 8: Handling Data Center Outage
DC1 (us-east) DC2 (eu-west)
+----------+----------+ +----------+----------+
| Node A1 | Node B1 | | Node A2 | Node B2 |
| Node C1 | Node D1 | | Node C2 | Node D2 |
+----------+----------+ +----------+----------+
| |
+--- Async replication ----------+
Replication strategy:
N = 3, with at least 1 replica in each DC
Example: 2 replicas in local DC + 1 in remote DC
If DC1 goes down:
- Traffic routes to DC2 (DNS/load balancer)
- DC2 has enough replicas to serve reads
- Writes continue to DC2 (async replicate back when DC1 recovers)
- Anti-entropy reconciles any divergence
Step 9: Write Path
LSM-Tree Storage Engine
Write request: put("user:123", "{name: Alice}")
Step 1: Write to WAL (Write-Ahead Log) on disk [durability]
Step 2: Write to MemTable (in-memory sorted structure, e.g., red-black tree)
Step 3: When MemTable reaches threshold (e.g., 64MB):
- Flush to disk as an SSTable (Sorted String Table)
- SSTable is immutable, sorted by key
Step 4: Background compaction merges SSTables:
Level 0: [SST1] [SST2] [SST3] (may overlap)
| \ | /
v merge
Level 1: [SST-merged-1] [SST-merged-2] (no overlap within level)
| |
v merge
Level 2: [SST-merged-all] (larger, fewer files)
Write path diagram:
Client --> [WAL] --> [MemTable] --> [Immutable MemTable] --> [SSTable (disk)]
(disk) (memory) (memory, pending flush) (sorted, immutable)
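The write and read paths can be sketched as a toy LSM engine (in-memory only; the WAL, the immutable-MemTable stage, and compaction are omitted, and the class name is invented):

```python
import bisect

class LSMEngine:
    """Toy LSM tree: a MemTable flushed to immutable, sorted SSTables."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []  # newest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # A real engine appends to the WAL here first, for durability.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        """Freeze the MemTable and write it out as a sorted, immutable SSTable."""
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:            # 1. check MemTable (fast, in memory)
            return self.memtable[key]
        for sst in self.sstables:           # 2. check SSTables, newest first
            i = bisect.bisect_left(sst, (key,))  # binary search, since sorted
            if i < len(sst) and sst[i][0] == key:
                return sst[i][1]
        return None
```

Searching newest-first is why the read path returns the latest version even when older SSTables still hold stale copies of the key.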
Step 10: Read Path
Read request: get("user:123")
Step 1: Check MemTable (in-memory, fast)
Found? --> Return value
Not found? --> Continue
Step 2: Check immutable MemTable (if flush in progress)
Found? --> Return value
Not found? --> Continue
Step 3: Check Bloom filters for each SSTable
Bloom filter says "definitely not here" --> Skip this SSTable
Bloom filter says "maybe here" --> Continue to Step 4
Step 4: Search SSTable (use sparse index to find block, then scan block)
Check Level 0 SSTables (newest first)
Then Level 1, Level 2, etc.
Return first match found (newest version)
Bloom Filters
A probabilistic data structure:
- "Key is NOT in this SSTable" --> 100% certain
- "Key MIGHT be in this SSTable" --> small false positive rate
Without bloom filter: Check every SSTable (slow)
With bloom filter: Skip 99%+ of SSTables that don't contain the key
Configuration: A bloom filter with 10 bits per key has ~1% false positive rate.
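A minimal bloom filter sketch, assuming md5-derived hash positions (real engines use faster non-cryptographic hashes; the class name and sizing helper are my own):

```python
import hashlib
import math

class BloomFilter:
    """Bit array + k hash functions: no false negatives, tunable false positives."""

    def __init__(self, num_keys, bits_per_key=10):
        self.m = num_keys * bits_per_key
        # Optimal hash count is bits_per_key * ln(2); ~7 for 10 bits/key.
        self.k = max(1, round(bits_per_key * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        """False => key is definitely absent; True => key is probably present."""
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))
```

On the read path, one `might_contain` check per SSTable replaces a disk seek for almost every table that doesn't hold the key.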
Step 11: Complete Architecture
+-------------------+
| Client |
+--------+----------+
|
+--------+----------+
| Coordinator |
| (any node can be |
| coordinator) |
+--------+----------+
|
+----------------------+----------------------+
| | |
+------+------+ +------+------+ +------+------+
| Node A | | Node B | | Node C |
| | | | | |
| [WAL] | | [WAL] | | [WAL] |
| [MemTable] | <--> | [MemTable] | <--> | [MemTable] |
| [SSTables] | gossip | [SSTables] | gossip | [SSTables] |
| [Bloom | | [Bloom | | [Bloom |
| Filters] | | Filters] | | Filters] |
| [Merkle | | [Merkle | | [Merkle |
| Trees] | | Trees] | | Trees] |
+-------------+ +-------------+ +-------------+
Key Concepts Summary Table
| Component | Purpose | Mechanism |
|---|---|---|
| Consistent hashing | Partition data across nodes | Hash ring + virtual nodes |
| Quorum (W, R, N) | Tunable consistency | W + R > N for strong consistency |
| Gossip protocol | Failure detection + membership | Periodic random peer exchange |
| Sloppy quorum | Availability during failures | First N healthy nodes on ring |
| Hinted handoff | Temporary failure recovery | Store hints, replay on recovery |
| Merkle trees | Permanent failure recovery | Tree comparison to find diffs |
| WAL | Durability | Append-only log before memory write |
| MemTable | Fast writes | In-memory sorted structure |
| SSTable | Persistent storage | Immutable sorted files on disk |
| Bloom filter | Fast negative lookups | Skip SSTables that don't have key |
| Vector clocks | Conflict detection | Track causal history per key |
Interview Walkthrough Checklist
[ ] Start with single-server (hash table + WAL)
[ ] Distribute with consistent hashing (virtual nodes)
[ ] Replication (N replicas, preference list)
[ ] Consistency (quorum protocol, tunable W/R/N)
[ ] Failure detection (gossip protocol)
[ ] Temporary failure (sloppy quorum + hinted handoff)
[ ] Permanent failure (anti-entropy with Merkle trees)
[ ] Write path (WAL -> MemTable -> SSTable)
[ ] Read path (MemTable -> Bloom filter -> SSTable)
[ ] Conflict resolution (LWW, vector clocks)
[ ] Multi-datacenter replication
Interview Tips
- Build incrementally: Single server, then distribute. Don't jump to the full Dynamo architecture.
- Justify every choice. "We use consistent hashing because it minimizes data movement when nodes join/leave."
- Quorum math is a must: W + R > N and why. Draw the overlap.
- Bloom filters are a read-path optimization gold nugget. Mention the false positive rate trade-off.
- Merkle trees show deep knowledge. Explain the log(N) comparison.
- Conflict resolution is where you show distributed systems maturity. Know LWW vs vector clocks.
Resources
- DDIA Chapter 5: Replication (quorum, conflict resolution)
- DDIA Chapter 6: Partitioning (consistent hashing, rebalancing)
- Amazon Dynamo Paper: "Dynamo: Amazon's Highly Available Key-Value Store" (2007)
- System Design Interview (Alex Xu): Chapter 6 - Design a Key-Value Store
- Apache Cassandra documentation (Dynamo-inspired)
- Riak documentation (vector clocks, CRDTs)