Distributed Consensus
Why This Matters
Consensus is how distributed nodes agree on a single value (leader, commit, state change). Understanding Raft/Paxos distinguishes senior candidates from juniors.
The Consensus Problem
Goal: Get N nodes to agree on a value, even if some nodes crash.
Requirements:
- Agreement — all non-faulty nodes decide the same value
- Validity — decided value was proposed by some node
- Termination — all non-faulty nodes eventually decide
- Integrity — each node decides at most once
FLP Impossibility Result: In an asynchronous system, no deterministic algorithm can guarantee consensus (specifically, termination) if even ONE node may crash. Real systems work around this with timeouts, i.e. a partial-synchrony assumption.
Raft Consensus Algorithm
Why Raft?
Designed to be understandable (unlike Paxos). Used by etcd, CockroachDB, TiKV, Consul.
Roles
- Leader — handles all client requests, replicates to followers
- Follower — passive, responds to leader's requests
- Candidate — node trying to become leader
Leader Election
1. Follower doesn't hear from leader within its randomized election timeout (typically 150–300 ms)
2. Becomes Candidate, increments term, votes for itself
3. Requests votes from other nodes (RequestVote RPC)
4. Wins if it receives a majority of votes (> N/2)
5. Becomes Leader, starts sending heartbeats
Split vote: Two candidates start election simultaneously → neither gets majority → timeout → retry with new term. Random timeouts make this rare.
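The election steps above can be sketched as follows. This is a minimal illustration of the timeout and vote-counting rules, not a real Raft implementation; the names (`run_election`, `request_vote`) are made up for this sketch.

```python
import random

def election_timeout_ms(low=150, high=300):
    """Randomized timeout that makes simultaneous elections (split votes) rare."""
    return random.uniform(low, high)

def run_election(candidate_id, current_term, peers, request_vote):
    """Candidate increments its term, votes for itself, and asks peers.

    `request_vote(peer, term, candidate_id)` is assumed to return True
    if the peer grants its vote for this term.
    """
    term = current_term + 1
    votes = 1                        # the candidate votes for itself
    for peer in peers:
        if request_vote(peer, term, candidate_id):
            votes += 1
    cluster_size = len(peers) + 1
    won = votes > cluster_size // 2  # strict majority: > N/2
    return term, won
```

In a 5-node cluster, `run_election("A", 1, ["B", "C", "D", "E"], grant)` needs at least 2 of the 4 peers to grant their vote: 1 (self) + 2 = 3 > 5/2.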
Log Replication
Client → Leader: "Set x = 5"
1. Leader appends to its log (uncommitted)
2. Leader sends AppendEntries RPC to all followers
3. Followers append to their logs, acknowledge
4. Leader gets majority ACK → commits entry
5. Leader responds to client: "Success"
6. Leader notifies followers of commit (next heartbeat)
Safety Properties
- Leader Completeness: Elected leader has ALL committed entries
- Log Matching: If two logs contain an entry with the same index AND term, the logs are identical in all entries up through that index
- State Machine Safety: If a node applies entry at index i, no other node applies a different entry at i
Term
- Monotonically increasing counter
- Each term has at most one leader
- Acts as a logical clock — stale leaders discover they're outdated
Paxos
Basic Paxos (Single-Decree)
Agrees on a single value.
Roles:
- Proposer — proposes values
- Acceptor — votes on proposals
- Learner — learns the decided value
Two phases:
Phase 1 — Prepare:
Proposer → Acceptors: "Prepare(n)" (n = unique proposal number)
Acceptors: If n > any number previously promised → "Promise(n)" + the highest-numbered proposal already accepted (if any)
Phase 2 — Accept:
Proposer (once a majority has promised): must propose the value of the highest-numbered accepted proposal reported in the promises; only if none was reported may it pick its own value
Proposer → Acceptors: "Accept(n, value)"
Acceptors: If no promise for a number higher than n → "Accepted(n, value)"
If a majority accepts → value is chosen!
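The acceptor side of the two phases can be sketched as a tiny state machine. This is an illustration of single-decree Paxos acceptor rules, not production code; the class and field names are invented for the sketch.

```python
class Acceptor:
    def __init__(self):
        self.promised_n = -1       # highest proposal number promised
        self.accepted_n = -1       # number of the accepted proposal, if any
        self.accepted_value = None

    def prepare(self, n):
        """Phase 1: promise not to accept any proposal numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            # Report any previously accepted proposal so the proposer
            # must re-propose its value rather than a new one.
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject",)

    def accept(self, n, value):
        """Phase 2: accept unless a higher number was promised meanwhile."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n, value)
        return ("reject",)
```

Note how `prepare` returns the previously accepted proposal: that is the mechanism that forces a later proposer to adopt a possibly-chosen value instead of overwriting it.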
Multi-Paxos
- Repeated Paxos for a sequence of decisions (log entries)
- Stable leader skips Phase 1 (already has promises)
- Essentially what Raft simplified
Why Paxos is Hard
- Original paper is notoriously difficult to understand
- Many edge cases in practice
- "Paxos Made Simple" (Lamport) is still complex
- Raft was designed specifically because Paxos was too hard to implement correctly
ZAB (ZooKeeper Atomic Broadcast)
Used by Apache ZooKeeper for coordination.
Similar to Raft:
- Leader-based
- Majority quorum
- Ordered broadcast of state changes
Differences from Raft:
- Transaction-based (zxid = epoch + counter)
- Supports atomic broadcast (ordered delivery to all)
- Has explicit recovery phase on leader change
Leader Election Patterns
Why Leader Election?
- Single writer avoids conflicts
- Coordinator for distributed tasks
- One node makes scheduling decisions
Using External Coordination Service
Node A → ZooKeeper/etcd: "Create ephemeral node /leader"
→ Success → Node A is leader
Node B → ZooKeeper/etcd: "Create ephemeral node /leader"
→ Fails (already exists) → Node B watches for leader failure
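The pattern above can be simulated without a real cluster. The sketch below stands in for ZooKeeper/etcd with an in-memory dict; `create_ephemeral` and `session_expired` are invented names for this illustration. In a real service, the ephemeral node disappears automatically when the owner's session expires, which is what lets Node B take over.

```python
class CoordinationService:
    """Toy stand-in for ZooKeeper/etcd ephemeral-node semantics."""

    def __init__(self):
        self.nodes = {}  # path -> owner

    def create_ephemeral(self, path, owner):
        """Atomic create: succeeds only if the path does not exist yet."""
        if path in self.nodes:
            return False
        self.nodes[path] = owner
        return True

    def session_expired(self, owner):
        """Owner's session dies: all of its ephemeral nodes vanish."""
        self.nodes = {p: o for p, o in self.nodes.items() if o != owner}

svc = CoordinationService()
assert svc.create_ephemeral("/leader", "A")      # A becomes leader
assert not svc.create_ephemeral("/leader", "B")  # B must watch and wait
svc.session_expired("A")                         # A crashes
assert svc.create_ephemeral("/leader", "B")      # B takes over
```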
Bully Algorithm
- Each node has a priority (ID)
- Highest-priority alive node becomes leader
- On leader failure, next highest takes over
- Simple but not partition-tolerant
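The bully rule fits in a few lines. A sketch under the simplifying assumption that each node knows which peers are currently reachable (function names are illustrative):

```python
def start_election(my_id, alive_ids):
    """Node my_id challenges all higher-ID nodes; if none are alive,
    it declares itself leader, otherwise it defers to them."""
    higher = [n for n in alive_ids if n > my_id]
    if not higher:
        return my_id       # no one outranks me: I am leader
    return max(higher)     # a higher node will win its own election
```

The partition weakness is visible here: two sides of a partition each see a different `alive_ids` set and can each crown their own leader.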
Fencing Tokens
Problem: Old leader doesn't know it's been replaced (network partition, GC pause)
Solution: Each leader gets a monotonically increasing fencing token. Storage rejects operations with old tokens.
Leader A (token 33) → gets partitioned
Leader B elected (token 34)
Leader A recovers, tries to write with token 33 → Storage rejects (33 < 34)
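The scenario above can be sketched from the storage side: every write carries the writer's token, and storage rejects anything older than the highest token it has seen. Class and method names are invented for this illustration.

```python
class FencedStorage:
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False             # stale leader: reject the write
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStorage()
assert store.write(33, "x", 1)       # leader A (token 33)
assert store.write(34, "x", 2)       # leader B elected (token 34)
assert not store.write(33, "x", 3)   # A recovers; 33 < 34, rejected
```

The key design point: correctness lives in the storage layer, so it holds even when the old leader is completely unaware it was replaced.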
Split Brain
Problem
Network partition causes two nodes to both think they're leader.
[Node A - Leader] ←PARTITION→ [Node B - Leader]
Both accept writes → DATA INCONSISTENCY
Solutions
- Quorum — need majority to operate (Raft/Paxos)
- Fencing tokens — storage rejects old leader's writes
- STONITH (Shoot The Other Node In The Head) — kill the other node
- Witness/Tiebreaker — odd number of nodes or external witness
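The quorum solution reduces to one inequality: a partition may act (elect a leader, commit writes) only if it holds a strict majority of the full cluster, so two disjoint partitions can never both qualify. A minimal sketch:

```python
def has_quorum(partition_size, cluster_size):
    """A partition can operate only with a strict majority of all nodes."""
    return partition_size > cluster_size // 2

# A 5-node cluster split 3/2: only the 3-node side may elect a leader.
assert has_quorum(3, 5)
assert not has_quorum(2, 5)
# An even split of 4 nodes (2/2) leaves NO side with quorum — one reason
# clusters are usually sized with an odd number of nodes.
assert not has_quorum(2, 4)
```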
Practical Uses of Consensus
| Use Case | System | Consensus For |
|---|---|---|
| Configuration management | ZooKeeper, etcd, Consul | Consistent config across cluster |
| Service discovery | Consul, etcd | Agree on service locations |
| Distributed locking | ZooKeeper, etcd, Redis (Redlock) | Mutual exclusion |
| Leader election | ZooKeeper, etcd | Who is the primary |
| Metadata management | Kafka (KRaft), HDFS NameNode | Cluster metadata |
| Distributed DB | CockroachDB, Spanner, TiDB | Transaction commit |
Byzantine Fault Tolerance (BFT)
Crash faults vs Byzantine faults
- Crash fault: Node stops responding (handled by Raft/Paxos)
- Byzantine fault: Node sends WRONG/MALICIOUS data
BFT Requirements
- Need 3f+1 nodes to tolerate f Byzantine faults
- PBFT (Practical Byzantine Fault Tolerance) — O(n²) messages
- Used in: Blockchain, some financial systems
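The sizing arithmetic is worth making explicit: tolerating f Byzantine nodes requires n = 3f + 1 total nodes, and PBFT-style protocols collect quorums of 2f + 1 of them, so that any two quorums overlap in at least f + 1 nodes, at least one of which is honest.

```python
def bft_cluster_size(f):
    """Minimum nodes needed to tolerate f Byzantine faults."""
    return 3 * f + 1

def bft_quorum(f):
    """PBFT-style quorum size within a 3f+1 cluster."""
    return 2 * f + 1

assert bft_cluster_size(1) == 4   # 4 nodes tolerate 1 Byzantine fault
assert bft_quorum(1) == 3
# Two quorums of 3 in a 4-node cluster overlap in >= 2 nodes,
# so at least one honest node is in both.
```

Compare with crash-fault consensus (Raft/Paxos), where 2f + 1 nodes suffice to tolerate f crashes.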
Interview tip: For most system design interviews, you only need crash fault tolerance (Raft/Paxos). BFT is mentioned only for blockchain or adversarial environments.
Resources
- 📖 DDIA Chapter 9: Consistency and Consensus
- 🔗 Raft paper — In Search of an Understandable Consensus Algorithm
- 🔗 Raft visualization (interactive)
- 🔗 Paxos Made Simple (Lamport)
- 🎥 MIT 6.824 — Raft lecture
- 🔗 ZooKeeper paper
Previous: 10 - CAP Theorem & Consistency Models | Next: 12 - Replication & Fault Tolerance