29 - Design Key-Value Store

Previous: 28 - Design Rate Limiter | Next: 30 - Design Notification System


Why This Matters in Interviews

Designing a distributed key-value store (essentially Amazon Dynamo) tests your knowledge of consistent hashing, replication, consistency models, failure handling, and storage engine internals. It is one of the most comprehensive system design questions.


Step 1: Requirements

Functional Requirements

  1. put(key, value) -- store a key-value pair
  2. get(key) -- retrieve the value for a key
  3. delete(key) -- remove a key-value pair

Non-Functional Requirements

Requirement           Target
-------------------   -----------------------------------------------
Latency               < 10ms for reads and writes (p99)
Availability          99.99% (always writable, AP system)
Scalability           Handle petabytes of data, millions of QPS
Durability            No data loss
Tunable consistency   Eventual to strong, configurable per operation
Automatic scaling     Add/remove nodes without downtime

Step 2: Single-Server Key-Value Store

In-Memory Hash Table

+----------------------------+
|    Hash Table (in RAM)     |
| key1 --> value1            |
| key2 --> value2            |
| key3 --> value3            |
+----------------------------+

Pros: Fast (O(1) lookups)
Cons: Limited by RAM, data lost on crash

Improving Single Server

+------------------+      +------------------+
| Write-Ahead Log  |      |   Hash Table     |
| (WAL on disk)    | ---> |   (in memory)    |
| append-only      |      |   fast lookups   |
+------------------+      +------------------+
                                   |
                           +-------+-------+
                           | Data file     |
                           | (compacted    |
                           |  periodically)|
                           +---------------+

WAL ensures durability: if the process crashes, replay the log to rebuild the hash table.

Compaction: Periodically merge and compress old data files to reclaim space.
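The WAL-backed single-server design above can be sketched in a few lines. This is a minimal illustration, not a production engine; the class name, the tab-separated record format, and the recovery logic are assumptions for the example.

```python
import os

class SimpleKVStore:
    """In-memory hash table backed by a write-ahead log (illustrative format)."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.table = {}
        # Crash recovery: replay the append-only log to rebuild the table.
        if os.path.exists(wal_path):
            with open(wal_path) as f:
                for line in f:
                    op, key, value = (line.rstrip("\n").split("\t") + [None])[:3]
                    if op == "PUT":
                        self.table[key] = value
                    elif op == "DEL":
                        self.table.pop(key, None)

    def _append(self, record):
        # Durability first: the record hits disk before the in-memory update.
        with open(self.wal_path, "a") as f:
            f.write(record + "\n")
            f.flush()
            os.fsync(f.fileno())

    def put(self, key, value):
        self._append(f"PUT\t{key}\t{value}")
        self.table[key] = value

    def get(self, key):
        return self.table.get(key)

    def delete(self, key):
        self._append(f"DEL\t{key}")
        self.table.pop(key, None)
```

Restarting the process and replaying the same log file reproduces the table, which is exactly the crash-recovery property the WAL buys.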


Step 3: Distributed Architecture

Consistent Hashing for Data Partitioning

Hash ring with virtual nodes:

          Node A (vnode 1)
         /
        /
  Node C (vnode 3) ---- Node A (vnode 2)
        \              /
         \            /
      Node B (vnode 1)
         |
    Node B (vnode 2) --- Node C (vnode 1)

Key "user:123" hashes to position X on the ring
--> Walk clockwise to find first node --> Node B (vnode 1)
--> Replicate to next N-1 nodes on ring

Why virtual nodes?

  • Even distribution (a node with more capacity gets more vnodes)
  • Smoother rebalancing when nodes join/leave
  • Avoids hotspots from uneven key distribution
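A minimal consistent-hash ring with virtual nodes might look like the following; MD5 and 100 vnodes per node are illustrative choices, not recommendations.

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes (vnode count is a tunable knob)."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (position, physical node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#vnode{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def lookup(self, key):
        """Walk clockwise from the key's position to the first vnode."""
        pos = self._hash(key)
        i = bisect.bisect(self.ring, (pos,))  # first vnode at or after pos
        return self.ring[i % len(self.ring)][1]  # wrap around at the top
```

Because only the vnodes of a joining or leaving node move, rebalancing touches roughly 1/N of the keys instead of reshuffling everything.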

Data Replication

Replication factor N = 3

Key "user:123" --> Primary: Node B
                   Replica 1: Node C (next on ring)
                   Replica 2: Node A (next unique node on ring)

Write path:
  Client --> Coordinator --> Node B (primary)
                         --> Node C (replica)
                         --> Node A (replica)

Preference list: The list of N nodes responsible for a key, skipping virtual nodes owned by the same physical node.
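The preference-list walk can be sketched against a toy ring; the positions and node names below are made up for the example.

```python
# Toy ring: (position, physical node) pairs, already sorted by position.
RING = [(10, "B"), (25, "C"), (40, "B"), (55, "A"), (70, "C"), (90, "A")]

def preference_list(key_pos, n=3):
    """First n DISTINCT physical nodes clockwise from key_pos,
    skipping extra vnodes owned by a node already in the list."""
    # Start at the first vnode at or after key_pos (wrap to 0 past the top).
    start = next((j for j, (p, _) in enumerate(RING) if p >= key_pos), 0)
    nodes, i = [], 0
    while len(nodes) < n and i < len(RING):
        node = RING[(start + i) % len(RING)][1]
        if node not in nodes:
            nodes.append(node)
        i += 1
    return nodes
```

Note how a key at position 30 lands on the vnode at 40 ("B") but the list skips B's second vnode, so all n entries are distinct physical machines.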


Step 4: Consistency -- Quorum Protocol

W + R > N for Strong Consistency

N = 3 (replicas), W = 2 (write quorum), R = 2 (read quorum)

Write:  Send to 3, wait for 2 ACKs --> success
Read:   Send to 3, wait for 2 responses --> return latest

Since W + R = 4 > 3 = N, reads and writes overlap
--> At least one node has the latest value
--> Strong consistency (if no concurrent writes)

Tunable Consistency

Configuration            W       R       Behavior
----------------------   -----   -----   -----------------------------------------
Strong consistency       N       1       Every write hits all nodes; read from any
Strong consistency       N/2+1   N/2+1   Quorum overlap
High write performance   1       N       Fast writes; reads check all nodes
High read performance    N       1       Fast reads; writes go to all nodes
Eventual consistency     1       1       Fastest, but may read stale data

A common production choice (and the Dynamo paper's default): N=3, W=2, R=2 (good balance).
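The quorum rule can be sketched as coordinator-side logic. This is a simplification: versions here are plain comparable integers, where a real system would use timestamps or vector clocks.

```python
def is_strongly_consistent(n, w, r):
    """W + R > N guarantees every read quorum overlaps every write quorum
    in at least one replica, so some read sees the latest write."""
    return w + r > n

def quorum_read(replica_responses, r):
    """Wait for r responses, then return the value with the highest version.
    replica_responses: list of (version, value) pairs as they arrive."""
    if len(replica_responses) < r:
        raise TimeoutError("read quorum not met")
    # Taking the first r responses models "return as soon as r replicas answer".
    return max(replica_responses[:r])[1]
```

With N=3, W=2, R=2 the check passes (2 + 2 > 3); with W=1, R=1 it fails, which is exactly the stale-read window of eventual consistency.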

Conflict Resolution

With eventual consistency, concurrent writes to the same key can conflict.

Strategy                How It Works                               Used By
---------------------   ----------------------------------------   ----------------------
Last-Write-Wins (LWW)   Highest timestamp wins                     Cassandra (default)
Vector clocks           Track causal history, detect conflicts     Dynamo (original)
CRDTs                   Conflict-free data types that auto-merge   Riak
Application-level       Return all versions, app resolves          Dynamo (shopping cart)
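Vector-clock comparison, as in the original Dynamo, can be sketched as below; representing a clock as a dict of node -> counter is an assumption for the example.

```python
def compare(vc_a, vc_b):
    """Compare two vector clocks (dicts of node -> counter).
    Returns 'a<=b', 'b<=a', 'equal', or 'conflict' for concurrent writes."""
    a_le_b = all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in vc_a)
    b_le_a = all(vc_b.get(k, 0) <= vc_a.get(k, 0) for k in vc_b)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a<=b"   # b causally descends from a; b wins cleanly
    if b_le_a:
        return "b<=a"
    return "conflict"    # neither dominates: app must resolve (Dynamo-style)
```

The "conflict" branch is the key difference from LWW: instead of silently dropping one write, the store surfaces both versions to the application.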

Step 5: Failure Detection

Gossip Protocol

Every T seconds, each node:
  1. Picks a random peer
  2. Sends its membership list (node -> heartbeat counter)
  3. Peer merges the list with its own

Node A: {A:100, B:98, C:95, D:92}
Node B: {A:99, B:100, C:96, D:90}

After gossip:
Node A: {A:100, B:100, C:96, D:92}
Node B: {A:100, B:100, C:96, D:92}

If D's counter hasn't increased in T_fail seconds --> mark D as suspected down

Why gossip?

  • Decentralized (no single failure detector)
  • Eventually consistent membership view
  • Scalable: O(log N) rounds to propagate to all nodes
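The merge step in the gossip example above is just an element-wise max over heartbeat counters:

```python
def gossip_merge(mine, theirs):
    """Merge two membership views (node -> heartbeat counter),
    keeping the highest counter seen for each node."""
    merged = dict(mine)
    for node, heartbeat in theirs.items():
        merged[node] = max(merged.get(node, 0), heartbeat)
    return merged
```

Because max is commutative and idempotent, it does not matter in what order or how many times views are exchanged; all nodes converge on the same picture.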

Step 6: Handling Temporary Failures

Sloppy Quorum + Hinted Handoff

Normal: Key "X" --> Node A, B, C (all healthy)

Node C is down:
  Key "X" --> Node A, B, D (D temporarily holds C's data)

  D stores data with a "hint":
    { key: "X", value: "...", hint: "intended for Node C" }

When C recovers:
  D sends hinted data to C --> D deletes the hint

This maintains availability during temporary failures.

Sloppy quorum: The first N healthy nodes on the ring handle the request, even if they aren't the designated nodes. This prioritizes availability over strict consistency.
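The hint bookkeeping on the stand-in node can be sketched as follows; the class and field names are illustrative.

```python
class HintStore:
    """Sketch of hinted handoff on a node temporarily covering for another."""

    def __init__(self):
        self.data = {}
        self.hints = []  # writes held on behalf of a down node

    def write_with_hint(self, key, value, intended_node):
        # Serve the write locally, but remember who it was really for.
        self.data[key] = value
        self.hints.append({"key": key, "value": value,
                           "intended_for": intended_node})

    def handoff(self, recovered_node, send):
        """When the intended node recovers, replay its hints and drop them."""
        remaining = []
        for hint in self.hints:
            if hint["intended_for"] == recovered_node:
                send(hint["key"], hint["value"])  # deliver to recovered node
            else:
                remaining.append(hint)
        self.hints = remaining
```

Here `send` stands in for the RPC that ships the hinted write back to the recovered node.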


Step 7: Handling Permanent Failures

Anti-Entropy with Merkle Trees

When a replica has been down for a long time, it needs to synchronize efficiently.

Merkle tree for a key range:

              Root Hash
             /          \
      Hash(1-4)       Hash(5-8)
       /    \          /    \
  Hash(1-2) Hash(3-4) Hash(5-6) Hash(7-8)
   /  \     /  \     /  \     /  \
  k1  k2   k3  k4   k5  k6   k7  k8

Comparison between two replicas:
  Root hashes differ?
    --> Compare children: left subtree matches, right differs
      --> Compare children: Hash(5-6) matches, Hash(7-8) differs
        --> Sync only keys k7 and k8

Why Merkle trees?

  • Compare entire datasets by exchanging just the root hash
  • Identify exactly which keys differ in O(log N) comparisons
  • Transfer only the differing keys (not the entire dataset)
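Building and diffing the trees can be sketched like this; a power-of-two leaf count is assumed to keep the example short.

```python
import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

def build_tree(leaf_hashes):
    """Bottom-up Merkle tree over a list of leaf hashes."""
    levels = [leaf_hashes]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h((prev[i] + prev[i + 1]).encode())
                       for i in range(0, len(prev), 2)])
    return levels  # levels[0] = leaves, levels[-1] = [root hash]

def diff_leaves(tree_a, tree_b):
    """Descend from the root, following only subtrees whose hashes differ;
    returns the indices of differing leaves."""
    if tree_a[-1] == tree_b[-1]:
        return []  # root hashes match: datasets identical
    frontier = [0]  # indices of differing nodes, starting at the root
    for level in range(len(tree_a) - 2, -1, -1):
        nxt = []
        for idx in frontier:
            for child in (2 * idx, 2 * idx + 1):
                if tree_a[level][child] != tree_b[level][child]:
                    nxt.append(child)
        frontier = nxt
    return frontier
```

With one divergent key out of eight, only log2(8) = 3 levels of hashes are compared before the single differing leaf is pinpointed.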

Step 8: Handling Data Center Outage

DC1 (us-east)                    DC2 (eu-west)
+----------+----------+          +----------+----------+
| Node A1  | Node B1  |          | Node A2  | Node B2  |
| Node C1  | Node D1  |          | Node C2  | Node D2  |
+----------+----------+          +----------+----------+
         |                                |
         +--- Async replication ----------+

Replication strategy:
  N = 3, with at least 1 replica in each DC
  Example: 2 replicas in local DC + 1 in remote DC

If DC1 goes down:

  1. Traffic routes to DC2 (DNS/load balancer)
  2. DC2 has enough replicas to serve reads
  3. Writes continue to DC2 (async replicate back when DC1 recovers)
  4. Anti-entropy reconciles any divergence

Step 9: Write Path

LSM-Tree Storage Engine

Write request: put("user:123", "{name: Alice}")

Step 1: Write to WAL (Write-Ahead Log) on disk [durability]

Step 2: Write to MemTable (in-memory sorted structure, e.g., red-black tree)

Step 3: When MemTable reaches threshold (e.g., 64MB):
  - Flush to disk as an SSTable (Sorted String Table)
  - SSTable is immutable, sorted by key

Step 4: Background compaction merges SSTables:

  Level 0: [SST1] [SST2] [SST3]            (may overlap)
                \     |     /
                    merge
                      v
  Level 1: [SST-merged-1] [SST-merged-2]   (no overlap within level)
                  \          /
                     merge
                       v
  Level 2: [SST-merged-all]                (larger, fewer files)

Write path diagram:

Client --> [WAL] --> [MemTable] --> [Immutable MemTable] --> [SSTable (disk)]
           (disk)     (memory)       (memory, pending flush)    (sorted, immutable)
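A toy version of this write path, with the WAL omitted and the structure simplified (a real MemTable is a balanced tree or skip list, and flushes happen in the background):

```python
class MemTable:
    """Sketch of the LSM write path: an in-memory buffer that flushes to an
    immutable, sorted SSTable when it reaches a size threshold."""

    def __init__(self, flush_threshold=3):
        self.entries = {}
        self.flush_threshold = flush_threshold
        self.sstables = []  # newest last; each is a sorted list of (key, value)

    def put(self, key, value):
        # A real engine appends to the WAL before touching memory (omitted).
        self.entries[key] = value
        if len(self.entries) >= self.flush_threshold:
            self.flush()

    def flush(self):
        sstable = sorted(self.entries.items())  # immutable, sorted by key
        self.sstables.append(sstable)
        self.entries = {}

    def get(self, key):
        if key in self.entries:                  # check the MemTable first
            return self.entries[key]
        for sstable in reversed(self.sstables):  # then newest SSTable first
            for k, v in sstable:
                if k == key:
                    return v
        return None
```

The newest-first search order in `get` is what makes the most recent version of a key win, since older SSTables may still hold stale copies until compaction merges them away.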

Step 10: Read Path

Read request: get("user:123")

Step 1: Check MemTable (in-memory, fast)
  Found? --> Return value
  Not found? --> Continue

Step 2: Check immutable MemTable (if flush in progress)
  Found? --> Return value
  Not found? --> Continue

Step 3: Check Bloom filters for each SSTable
  Bloom filter says "definitely not here" --> Skip this SSTable
  Bloom filter says "maybe here" --> Continue to Step 4

Step 4: Search SSTable (use sparse index to find block, then scan block)
  Check Level 0 SSTables (newest first)
  Then Level 1, Level 2, etc.
  Return first match found (newest version)

Bloom Filters

A probabilistic data structure:
  - "Key is NOT in this SSTable" --> 100% certain
  - "Key MIGHT be in this SSTable" --> small false positive rate

Without bloom filter: Check every SSTable (slow)
With bloom filter: Skip 99%+ of SSTables that don't contain the key

Configuration: A bloom filter with 10 bits per key has ~1% false positive rate.
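A Bloom filter can be sketched with k independent hashes over a plain bit array; the sizes below are illustrative, not tuned values.

```python
import hashlib

class BloomFilter:
    """Bit-array Bloom filter. Roughly 10 bits per key with ~7 hash
    functions gives about a 1% false-positive rate."""

    def __init__(self, num_bits=1000, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as an arbitrary-length bit array

    def _positions(self, key):
        # Derive k positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means only "maybe present".
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```

On the read path, one filter per SSTable lets `get` skip every file whose filter answers "definitely absent", at the cost of rarely scanning a file that turns out not to hold the key.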


Step 11: Complete Architecture

                         +-------------------+
                         |      Client       |
                         +--------+----------+
                                  |
                         +--------+----------+
                         |   Coordinator     |
                         | (any node can be  |
                         |  coordinator)     |
                         +--------+----------+
                                  |
           +----------------------+----------------------+
           |                      |                      |
    +------+------+        +------+------+        +------+------+
    |   Node A    |        |   Node B    |        |   Node C    |
    |             |        |             |        |             |
    | [WAL]       |        | [WAL]       |        | [WAL]       |
    | [MemTable]  |  <-->  | [MemTable]  |  <-->  | [MemTable]  |
    | [SSTables]  | gossip | [SSTables]  | gossip | [SSTables]  |
    | [Bloom      |        | [Bloom      |        | [Bloom      |
    |  Filters]   |        |  Filters]   |        |  Filters]   |
    | [Merkle     |        | [Merkle     |        | [Merkle     |
    |  Trees]     |        |  Trees]     |        |  Trees]     |
    +-------------+        +-------------+        +-------------+

Key Concepts Summary Table

Component            Purpose                          Mechanism
------------------   ------------------------------   -----------------------------------
Consistent hashing   Partition data across nodes      Hash ring + virtual nodes
Quorum (W, R, N)     Tunable consistency              W + R > N for strong consistency
Gossip protocol      Failure detection + membership   Periodic random peer exchange
Sloppy quorum        Availability during failures     First N healthy nodes on ring
Hinted handoff       Temporary failure recovery       Store hints, replay on recovery
Merkle trees         Permanent failure recovery       Tree comparison to find diffs
WAL                  Durability                       Append-only log before memory write
MemTable             Fast writes                      In-memory sorted structure
SSTable              Persistent storage               Immutable sorted files on disk
Bloom filter         Fast negative lookups            Skip SSTables that lack the key
Vector clocks        Conflict detection               Track causal history per key

Interview Walkthrough Checklist

[ ] Start with single-server (hash table + WAL)
[ ] Distribute with consistent hashing (virtual nodes)
[ ] Replication (N replicas, preference list)
[ ] Consistency (quorum protocol, tunable W/R/N)
[ ] Failure detection (gossip protocol)
[ ] Temporary failure (sloppy quorum + hinted handoff)
[ ] Permanent failure (anti-entropy with Merkle trees)
[ ] Write path (WAL -> MemTable -> SSTable)
[ ] Read path (MemTable -> Bloom filter -> SSTable)
[ ] Conflict resolution (LWW, vector clocks)
[ ] Multi-datacenter replication

Interview Tips

  1. Build incrementally: Single server, then distribute. Don't jump to the full Dynamo architecture.
  2. Justify every choice. "We use consistent hashing because it minimizes data movement when nodes join/leave."
  3. Quorum math is a must: W + R > N and why. Draw the overlap.
  4. Bloom filters are a read-path optimization gold nugget. Mention the false positive rate trade-off.
  5. Merkle trees show deep knowledge. Explain the log(N) comparison.
  6. Conflict resolution is where you show distributed systems maturity. Know LWW vs vector clocks.

Resources

  • DDIA Chapter 5: Replication (quorum, conflict resolution)
  • DDIA Chapter 6: Partitioning (consistent hashing, rebalancing)
  • Amazon Dynamo Paper: "Dynamo: Amazon's Highly Available Key-Value Store" (2007)
  • System Design Interview (Alex Xu): Chapter 6 - Design a Key-Value Store
  • Apache Cassandra documentation (Dynamo-inspired)
  • Riak documentation (vector clocks, CRDTs)
