29 - Design Key-Value Store
Previous: 28 - Design Rate Limiter | Next: 30 - Design Notification System
Why This Matters in Interviews
Designing a distributed key-value store (essentially Amazon Dynamo) tests your knowledge of consistent hashing, replication, consistency models, failure handling, and storage engine internals. It is one of the most comprehensive system design questions.
Step 1: Requirements
Functional Requirements
- put(key, value) -- store a key-value pair
- get(key) -- retrieve the value for a key
- delete(key) -- remove a key-value pair
Non-Functional Requirements
| Requirement | Target |
|---|---|
| Latency | < 10ms for reads and writes (p99) |
| Availability | 99.99% (always writable, AP system) |
| Scalability | Handle petabytes of data, millions of QPS |
| Durability | No data loss |
| Tunable consistency | Eventual to strong, configurable per operation |
| Automatic scaling | Add/remove nodes without downtime |
Step 2: Single-Server Key-Value Store
In-Memory Hash Table
+----------------------------+
| Hash Table (in RAM) |
| key1 --> value1 |
| key2 --> value2 |
| key3 --> value3 |
+----------------------------+
Pros: Fast (O(1) lookups)
Cons: Limited by RAM, data lost on crash
Improving Single Server
+------------------+ +------------------+
| Write-Ahead Log | | Hash Table |
| (WAL on disk) | ---> | (in memory) |
| append-only | | fast lookups |
+------------------+ +------------------+
|
+-------+-------+
| Data file |
| (compacted |
| periodically)|
+---------------+
WAL ensures durability: if the process crashes, replay the log to rebuild the hash table.
Compaction: Periodically merge and compress old data files to reclaim space.
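The single-server design above can be sketched as a toy WAL-backed store (illustrative only; class name, file format, and the space-separated record encoding are invented for the sketch, and keys/values must not contain spaces or newlines):

```python
import os

class SimpleKVStore:
    """Toy key-value store: in-memory dict + append-only WAL for durability."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.table = {}
        self._replay()  # rebuild the hash table from the log after a restart

    def _replay(self):
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as f:
            for line in f:
                op, _, rest = line.rstrip("\n").partition(" ")
                if op == "PUT":
                    key, _, value = rest.partition(" ")
                    self.table[key] = value
                elif op == "DEL":
                    self.table.pop(rest, None)

    def _append(self, record):
        with open(self.wal_path, "a") as f:
            f.write(record + "\n")
            f.flush()
            os.fsync(f.fileno())  # durable on disk before we acknowledge

    def put(self, key, value):
        self._append(f"PUT {key} {value}")  # WAL first, then memory
        self.table[key] = value

    def get(self, key):
        return self.table.get(key)

    def delete(self, key):
        self._append(f"DEL {key}")
        self.table.pop(key, None)
```

Restarting the store (constructing a new instance over the same WAL file) replays the log, which is exactly the crash-recovery path described above.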
Step 3: Distributed Architecture
Consistent Hashing for Data Partitioning
Hash ring with virtual nodes:
            Node A (vnode 1)
           /                \
  Node C (vnode 3)      Node A (vnode 2)
          |                  |
  Node B (vnode 2)      Node C (vnode 1)
           \                /
            Node B (vnode 1)
Key "user:123" hashes to position X on the ring
--> Walk clockwise to find first node --> Node B (vnode 1)
--> Replicate to next N-1 nodes on ring
Why virtual nodes?
- Even distribution; a node with more capacity can be assigned more vnodes
- Smoother rebalancing when nodes join/leave
- Avoids hotspots from uneven key distribution
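A minimal hash ring with virtual nodes can be sketched as follows (class and function names are my own; md5 stands in for whatever ring hash a real system uses):

```python
import bisect
import hashlib

def ring_hash(s):
    """Map a string to a 64-bit position on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest()[:16], 16)

class HashRing:
    """Consistent hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node owns `vnodes` positions on the ring.
        self.ring = sorted(
            (ring_hash(f"{node}#vnode{i}"), node)
            for node in nodes for i in range(vnodes)
        )

    def preference_list(self, key, n=3):
        """First n *distinct physical* nodes clockwise from the key's position,
        skipping vnodes owned by a physical node already in the list."""
        idx = bisect.bisect(self.ring, (ring_hash(key),))
        result = []
        for _, node in self.ring[idx:] + self.ring[:idx]:  # walk clockwise, wrap
            if node not in result:
                result.append(node)
            if len(result) == n:
                break
        return result
```

`HashRing(["A", "B", "C", "D"]).preference_list("user:123")` returns the N distinct nodes responsible for the key, which is the preference list described in the replication section below.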
Data Replication
Replication factor N = 3
Key "user:123" --> Primary: Node B
Replica 1: Node C (next on ring)
Replica 2: Node A (next unique node on ring)
Write path:
Client --> Coordinator --> Node B (primary)
--> Node C (replica)
--> Node A (replica)
Preference list: The list of N nodes responsible for a key, skipping virtual nodes owned by the same physical node.
Step 4: Consistency -- Quorum Protocol
W + R > N for Strong Consistency
N = 3 (replicas), W = 2 (write quorum), R = 2 (read quorum)
Write: Send to 3, wait for 2 ACKs --> success
Read: Send to 3, wait for 2 responses --> return latest
Since W + R = 4 > 3 = N, reads and writes overlap
--> At least one node has the latest value
--> Strong consistency (if no concurrent writes)
Tunable Consistency
| Configuration | W | R | Behavior |
|---|---|---|---|
| Strong consistency | N | 1 | Every write hits all nodes; reads from any |
| Strong consistency | N/2+1 | N/2+1 | Quorum overlap |
| High read performance | N | 1 | Fast reads from any single node; writes hit all nodes |
| High write performance | 1 | N | Fast writes; reads check all nodes |
| Eventual consistency | 1 | 1 | Fastest, but may read stale data |
A common Dynamo-style configuration is N=3, W=2, R=2, a good balance of latency and consistency. (Cassandra's out-of-the-box consistency level is lower; quorum reads/writes are opt-in.)
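A toy coordinator illustrating the quorum rules (the replica objects and their put/get signatures are invented for the sketch; versions are assumed comparable, e.g. timestamps):

```python
def quorum_write(replicas, key, value, version, w):
    """Send the write to all N replicas; succeed once w acknowledge."""
    acks = 0
    for replica in replicas:
        if replica.put(key, value, version):  # True = ack
            acks += 1
        if acks >= w:
            return True
    return False

def quorum_read(replicas, key, r):
    """Collect r responses and return the value with the highest version.
    Because W + R > N, at least one of the r responders has the latest write."""
    responses = []
    for replica in replicas:
        resp = replica.get(key)  # (version, value) or None if down/missing
        if resp is not None:
            responses.append(resp)
        if len(responses) >= r:
            break
    if len(responses) < r:
        raise RuntimeError("read quorum not met")
    return max(responses)[1]  # newest version wins
```

With N=3 and W=R=2, a write succeeds even with one replica down, and a read still intersects the write set.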
Conflict Resolution
With eventual consistency, concurrent writes to the same key can conflict.
| Strategy | How It Works | Used By |
|---|---|---|
| Last-Write-Wins (LWW) | Highest timestamp wins | Cassandra (default) |
| Vector clocks | Track causal history, detect conflicts | Dynamo (original) |
| CRDTs | Conflict-free data types that auto-merge | Riak |
| Application-level | Return all versions, app resolves | Dynamo (shopping cart) |
Step 5: Failure Detection
Gossip Protocol
Every T seconds, each node:
1. Picks a random peer
2. Sends its membership list (node -> heartbeat counter)
3. Peer merges the list with its own
Node A: {A:100, B:98, C:95, D:92}
Node B: {A:99, B:100, C:96, D:90}
After gossip:
Node A: {A:100, B:100, C:96, D:92}
Node B: {A:100, B:100, C:96, D:92}
If D's counter hasn't increased in T_fail seconds --> mark D as suspected down
Why gossip?
- Decentralized (no single failure detector)
- Eventually consistent membership view
- Scalable: O(log N) rounds to propagate to all nodes
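The merge step above, applied to the membership lists from the example (a minimal sketch; real gossip also tracks when each counter last advanced):

```python
def gossip_merge(local, remote):
    """Merge two membership views: keep the higher heartbeat counter per node."""
    return {n: max(local.get(n, 0), remote.get(n, 0))
            for n in set(local) | set(remote)}
```

After one exchange, both peers converge on the elementwise maximum, exactly as shown in the Node A / Node B example.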
Step 6: Handling Temporary Failures
Sloppy Quorum + Hinted Handoff
Normal: Key "X" --> Node A, B, C (all healthy)
Node C is down:
Key "X" --> Node A, B, D (D temporarily holds C's data)
D stores data with a "hint":
{ key: "X", value: "...", hint: "intended for Node C" }
When C recovers:
D sends hinted data to C --> D deletes the hint
This maintains availability during temporary failures.
Sloppy quorum: The first N healthy nodes on the ring handle the request, even if they aren't the designated nodes. This prioritizes availability over strict consistency.
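Hinted handoff can be sketched as follows (class and method names are invented; a real implementation persists hints durably and retries):

```python
class HintedHandoffNode:
    """A node that can hold writes intended for a down peer, tagged with a hint."""

    def __init__(self):
        self.store = {}
        self.hints = []  # writes to replay when the intended node recovers

    def put(self, key, value, intended_for=None):
        if intended_for is None:
            self.store[key] = value  # this node is a designated replica
        else:
            # sloppy-quorum write: hold the data on behalf of a down node
            self.hints.append({"key": key, "value": value, "hint": intended_for})

    def handoff(self, peers):
        """peers: name -> node for recovered peers. Replay matching hints, then drop them."""
        remaining = []
        for h in self.hints:
            target = peers.get(h["hint"])
            if target is not None:
                target.put(h["key"], h["value"])  # deliver to the recovered node
            else:
                remaining.append(h)  # intended node still down; keep the hint
        self.hints = remaining
```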
Step 7: Handling Permanent Failures
Anti-Entropy with Merkle Trees
When a replica has been down for a long time, it needs to synchronize efficiently.
Merkle tree for a key range:
Root Hash
/ \
Hash(1-4) Hash(5-8)
/ \ / \
Hash(1-2) Hash(3-4) Hash(5-6) Hash(7-8)
/ \ / \ / \ / \
k1 k2 k3 k4 k5 k6 k7 k8
Comparison between two replicas:
Root hashes differ?
--> Compare children: left subtree matches, right differs
--> Compare children: Hash(5-6) matches, Hash(7-8) differs
--> Sync only keys k7 and k8
Why Merkle trees?
- Compare entire datasets by exchanging just the root hash
- Identify exactly which keys differ in O(log N) comparisons
- Transfer only the differing keys (not the entire dataset)
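The build-and-compare procedure above can be sketched for a power-of-two key range (an assumption of this toy version; function names and the leaf encoding are my own):

```python
import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

def build_merkle(leaves):
    """Build a Merkle tree bottom-up from sorted (key, value) pairs.
    Returns a list of levels: levels[0] = leaf hashes, levels[-1] = [root].
    Assumes len(leaves) is a power of two."""
    level = [h(f"{k}={v}".encode()) for k, v in leaves]
    levels = [level]
    while len(level) > 1:
        level = [h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff_leaves(tree_a, tree_b):
    """Descend both same-shaped trees top-down, expanding only subtrees whose
    hashes disagree; return indices of leaves that differ."""
    level = len(tree_a) - 1  # start at the root level
    differing = [i for i, x in enumerate(tree_a[level]) if x != tree_b[level][i]]
    while level > 0:
        level -= 1
        children = []
        for i in differing:
            for c in (2 * i, 2 * i + 1):
                if tree_a[level][c] != tree_b[level][c]:
                    children.append(c)
        differing = children
    return differing
```

Matching subtrees are never descended into, so two replicas that differ in one key exchange only O(log N) hashes before pinpointing it.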
Step 8: Handling Data Center Outage
DC1 (us-east) DC2 (eu-west)
+----------+----------+ +----------+----------+
| Node A1 | Node B1 | | Node A2 | Node B2 |
| Node C1 | Node D1 | | Node C2 | Node D2 |
+----------+----------+ +----------+----------+
| |
+--- Async replication ----------+
Replication strategy:
N = 3, with at least 1 replica in each DC
Example: 2 replicas in local DC + 1 in remote DC
If DC1 goes down:
- Traffic routes to DC2 (DNS/load balancer)
- DC2 has enough replicas to serve reads
- Writes continue to DC2 (async replicate back when DC1 recovers)
- Anti-entropy reconciles any divergence
Step 9: Write Path
LSM-Tree Storage Engine
Write request: put("user:123", "{name: Alice}")
Step 1: Write to WAL (Write-Ahead Log) on disk [durability]
Step 2: Write to MemTable (in-memory sorted structure, e.g., red-black tree)
Step 3: When MemTable reaches threshold (e.g., 64MB):
- Flush to disk as an SSTable (Sorted String Table)
- SSTable is immutable, sorted by key
Step 4: Background compaction merges SSTables:
Level 0: [SST1] [SST2] [SST3] (may overlap)
| \ | /
v merge
Level 1: [SST-merged-1] [SST-merged-2] (no overlap within level)
| |
v merge
Level 2: [SST-merged-all] (larger, fewer files)
Write path diagram:
Client --> [WAL] --> [MemTable] --> [Immutable MemTable] --> [SSTable (disk)]
(disk) (memory) (memory, pending flush) (sorted, immutable)
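The write and read paths can be sketched as a toy LSM engine (in-memory only; the WAL, the immutable-MemTable stage, and compaction are omitted, and the class name is invented):

```python
import bisect

class LSMEngine:
    """Toy LSM tree: a MemTable flushed to immutable, sorted SSTables."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []  # newest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # A real engine appends to the WAL here first, for durability.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        """Freeze the MemTable and write it out as a sorted, immutable SSTable."""
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:            # 1. check MemTable (fast, in memory)
            return self.memtable[key]
        for sst in self.sstables:           # 2. check SSTables, newest first
            i = bisect.bisect_left(sst, (key,))  # binary search, since sorted
            if i < len(sst) and sst[i][0] == key:
                return sst[i][1]
        return None
```

Searching newest-first is why the read path returns the latest version even when older SSTables still hold stale copies of the key.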
Step 10: Read Path
Read request: get("user:123")
Step 1: Check MemTable (in-memory, fast)
Found? --> Return value
Not found? --> Continue
Step 2: Check immutable MemTable (if flush in progress)
Found? --> Return value
Not found? --> Continue
Step 3: Check Bloom filters for each SSTable
Bloom filter says "definitely not here" --> Skip this SSTable
Bloom filter says "maybe here" --> Continue to Step 4
Step 4: Search SSTable (use sparse index to find block, then scan block)
Check Level 0 SSTables (newest first)
Then Level 1, Level 2, etc.
Return first match found (newest version)
Bloom Filters
A probabilistic data structure:
- "Key is NOT in this SSTable" --> 100% certain
- "Key MIGHT be in this SSTable" --> small false positive rate
Without bloom filter: Check every SSTable (slow)
With bloom filter: Skip 99%+ of SSTables that don't contain the key
Configuration: A bloom filter with 10 bits per key has ~1% false positive rate.
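A minimal bloom filter sketch, assuming md5-derived hash positions (real engines use faster non-cryptographic hashes; the class name and sizing helper are my own):

```python
import hashlib
import math

class BloomFilter:
    """Bit array + k hash functions: no false negatives, tunable false positives."""

    def __init__(self, num_keys, bits_per_key=10):
        self.m = num_keys * bits_per_key
        # Optimal hash count is bits_per_key * ln(2); ~7 for 10 bits/key.
        self.k = max(1, round(bits_per_key * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        """False => key is definitely absent; True => key is probably present."""
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))
```

On the read path, one `might_contain` check per SSTable replaces a disk seek for almost every table that doesn't hold the key.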
Step 11: Complete Architecture
+-------------------+
| Client |
+--------+----------+
|
+--------+----------+
| Coordinator |
| (any node can be |
| coordinator) |
+--------+----------+
|
+----------------------+----------------------+
| | |
+------+------+ +------+------+ +------+------+
| Node A | | Node B | | Node C |
| | | | | |
| [WAL] | | [WAL] | | [WAL] |
| [MemTable] | <--> | [MemTable] | <--> | [MemTable] |
| [SSTables] | gossip | [SSTables] | gossip | [SSTables] |
| [Bloom | | [Bloom | | [Bloom |
| Filters] | | Filters] | | Filters] |
| [Merkle | | [Merkle | | [Merkle |
| Trees] | | Trees] | | Trees] |
+-------------+ +-------------+ +-------------+
Key Concepts Summary Table
| Component | Purpose | Mechanism |
|---|---|---|
| Consistent hashing | Partition data across nodes | Hash ring + virtual nodes |
| Quorum (W, R, N) | Tunable consistency | W + R > N for strong consistency |
| Gossip protocol | Failure detection + membership | Periodic random peer exchange |
| Sloppy quorum | Availability during failures | First N healthy nodes on ring |
| Hinted handoff | Temporary failure recovery | Store hints, replay on recovery |
| Merkle trees | Permanent failure recovery | Tree comparison to find diffs |
| WAL | Durability | Append-only log before memory write |
| MemTable | Fast writes | In-memory sorted structure |
| SSTable | Persistent storage | Immutable sorted files on disk |
| Bloom filter | Fast negative lookups | Skip SSTables that don't have key |
| Vector clocks | Conflict detection | Track causal history per key |
Interview Walkthrough Checklist
[ ] Start with single-server (hash table + WAL)
[ ] Distribute with consistent hashing (virtual nodes)
[ ] Replication (N replicas, preference list)
[ ] Consistency (quorum protocol, tunable W/R/N)
[ ] Failure detection (gossip protocol)
[ ] Temporary failure (sloppy quorum + hinted handoff)
[ ] Permanent failure (anti-entropy with Merkle trees)
[ ] Write path (WAL -> MemTable -> SSTable)
[ ] Read path (MemTable -> Bloom filter -> SSTable)
[ ] Conflict resolution (LWW, vector clocks)
[ ] Multi-datacenter replication
Interview Tips
- Build incrementally: Single server, then distribute. Don't jump to the full Dynamo architecture.
- Justify every choice. "We use consistent hashing because it minimizes data movement when nodes join/leave."
- Quorum math is a must: W + R > N and why. Draw the overlap.
- Bloom filters are a read-path optimization gold nugget. Mention the false positive rate trade-off.
- Merkle trees show deep knowledge. Explain the log(N) comparison.
- Conflict resolution is where you show distributed systems maturity. Know LWW vs vector clocks.
Resources
- DDIA Chapter 5: Replication (quorum, conflict resolution)
- DDIA Chapter 6: Partitioning (consistent hashing, rebalancing)
- Amazon Dynamo Paper: "Dynamo: Amazon's Highly Available Key-Value Store" (2007)
- System Design Interview (Alex Xu): Chapter 6 - Design a Key-Value Store
- Apache Cassandra documentation (Dynamo-inspired)
- Riak documentation (vector clocks, CRDTs)