Distributed Consensus
Why This Matters
Consensus is how distributed nodes agree on a single value (leader, commit, state change). Understanding Raft/Paxos distinguishes senior candidates from juniors.
The Consensus Problem
Goal: Get N nodes to agree on a value, even if some nodes crash.
Requirements:
- Agreement — all non-faulty nodes decide the same value
- Validity — decided value was proposed by some node
- Termination — all non-faulty nodes eventually decide
- Integrity — each node decides at most once
FLP Impossibility Result: In an asynchronous system, no deterministic algorithm can guarantee consensus (specifically, termination) if even ONE node may crash. Real systems work around this with timeouts, i.e. a partial-synchrony assumption.
Raft Consensus Algorithm
Why Raft?
Designed to be understandable (unlike Paxos). Used by etcd, CockroachDB, TiKV, Consul.
Roles
- Leader — handles all client requests, replicates to followers
- Follower — passive, responds to leader's requests
- Candidate — node trying to become leader
Leader Election
1. Follower doesn't hear from leader within its randomized election timeout (typically 150–300 ms)
2. Becomes Candidate, increments term, votes for itself
3. Requests votes from other nodes (RequestVote RPC)
4. Wins if it receives a majority of votes (> N/2)
5. Becomes Leader, starts sending heartbeats
Split vote: Two candidates start election simultaneously → neither gets majority → timeout → retry with new term. Random timeouts make this rare.
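The election steps above can be sketched as follows. This is a minimal illustration of the timeout and vote-counting rules, not a real Raft implementation; the names (`run_election`, `request_vote`) are made up for this sketch.

```python
import random

def election_timeout_ms(low=150, high=300):
    """Randomized timeout that makes simultaneous elections (split votes) rare."""
    return random.uniform(low, high)

def run_election(candidate_id, current_term, peers, request_vote):
    """Candidate increments its term, votes for itself, and asks peers.

    `request_vote(peer, term, candidate_id)` is assumed to return True
    if the peer grants its vote for this term.
    """
    term = current_term + 1
    votes = 1                        # the candidate votes for itself
    for peer in peers:
        if request_vote(peer, term, candidate_id):
            votes += 1
    cluster_size = len(peers) + 1
    won = votes > cluster_size // 2  # strict majority: > N/2
    return term, won
```

In a 5-node cluster, `run_election("A", 1, ["B", "C", "D", "E"], grant)` needs at least 2 of the 4 peers to grant their vote: 1 (self) + 2 = 3 > 5/2.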
Log Replication
Client → Leader: "Set x = 5"
1. Leader appends to its log (uncommitted)
2. Leader sends AppendEntries RPC to all followers
3. Followers append to their logs, acknowledge
4. Leader gets majority ACK → commits entry
5. Leader responds to client: "Success"
6. Leader notifies followers of commit (next heartbeat)
Safety Properties
- Leader Completeness: Elected leader has ALL committed entries
- Log Matching: If two logs contain an entry with the same index AND term, the logs are identical in all entries up through that index
- State Machine Safety: If a node applies entry at index i, no other node applies a different entry at i
Term
- Monotonically increasing counter
- Each term has at most one leader
- Acts as a logical clock — stale leaders discover they're outdated
Paxos
Basic Paxos (Single-Decree)
Agrees on a single value.
Roles:
- Proposer — proposes values
- Acceptor — votes on proposals
- Learner — learns the decided value
Two phases:
Phase 1 — Prepare:
Proposer → Acceptors: "Prepare(n)" (n = unique proposal number)
Acceptors: If n > any number previously promised → "Promise(n)" + the highest-numbered proposal already accepted (if any)
Phase 2 — Accept:
Proposer (once a majority has promised): must propose the value of the highest-numbered accepted proposal reported in the promises; only if none was reported may it pick its own value
Proposer → Acceptors: "Accept(n, value)"
Acceptors: If no promise for a number higher than n → "Accepted(n, value)"
If a majority accepts → value is chosen!
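The acceptor side of the two phases can be sketched as a tiny state machine. This is an illustration of single-decree Paxos acceptor rules, not production code; the class and field names are invented for the sketch.

```python
class Acceptor:
    def __init__(self):
        self.promised_n = -1       # highest proposal number promised
        self.accepted_n = -1       # number of the accepted proposal, if any
        self.accepted_value = None

    def prepare(self, n):
        """Phase 1: promise not to accept any proposal numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            # Report any previously accepted proposal so the proposer
            # must re-propose its value rather than a new one.
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject",)

    def accept(self, n, value):
        """Phase 2: accept unless a higher number was promised meanwhile."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n, value)
        return ("reject",)
```

Note how `prepare` returns the previously accepted proposal: that is the mechanism that forces a later proposer to adopt a possibly-chosen value instead of overwriting it.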
Multi-Paxos
- Repeated Paxos for a sequence of decisions (log entries)
- Stable leader skips Phase 1 (already has promises)
- Essentially what Raft simplified
Why Paxos is Hard
- Original paper is notoriously difficult to understand
- Many edge cases in practice
- "Paxos Made Simple" (Lamport) is still complex
- Raft was designed specifically because Paxos was too hard to implement correctly
ZAB (ZooKeeper Atomic Broadcast)
Used by Apache ZooKeeper for coordination.
Similar to Raft:
- Leader-based
- Majority quorum
- Ordered broadcast of state changes
Differences from Raft:
- Transaction-based (zxid = epoch + counter)
- Supports atomic broadcast (ordered delivery to all)
- Has explicit recovery phase on leader change
Leader Election Patterns
Why Leader Election?
- Single writer avoids conflicts
- Coordinator for distributed tasks
- One node makes scheduling decisions
Using External Coordination Service
Node A → ZooKeeper/etcd: "Create ephemeral node /leader"
→ Success → Node A is leader
Node B → ZooKeeper/etcd: "Create ephemeral node /leader"
→ Fails (already exists) → Node B watches for leader failure
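The pattern above can be simulated without a real cluster. The sketch below stands in for ZooKeeper/etcd with an in-memory dict; `create_ephemeral` and `session_expired` are invented names for this illustration. In a real service, the ephemeral node disappears automatically when the owner's session expires, which is what lets Node B take over.

```python
class CoordinationService:
    """Toy stand-in for ZooKeeper/etcd ephemeral-node semantics."""

    def __init__(self):
        self.nodes = {}  # path -> owner

    def create_ephemeral(self, path, owner):
        """Atomic create: succeeds only if the path does not exist yet."""
        if path in self.nodes:
            return False
        self.nodes[path] = owner
        return True

    def session_expired(self, owner):
        """Owner's session dies: all of its ephemeral nodes vanish."""
        self.nodes = {p: o for p, o in self.nodes.items() if o != owner}

svc = CoordinationService()
assert svc.create_ephemeral("/leader", "A")      # A becomes leader
assert not svc.create_ephemeral("/leader", "B")  # B must watch and wait
svc.session_expired("A")                         # A crashes
assert svc.create_ephemeral("/leader", "B")      # B takes over
```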
Bully Algorithm
- Each node has a priority (ID)
- Highest-priority alive node becomes leader
- On leader failure, next highest takes over
- Simple but not partition-tolerant
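The bully rule fits in a few lines. A sketch under the simplifying assumption that each node knows which peers are currently reachable (function names are illustrative):

```python
def start_election(my_id, alive_ids):
    """Node my_id challenges all higher-ID nodes; if none are alive,
    it declares itself leader, otherwise it defers to them."""
    higher = [n for n in alive_ids if n > my_id]
    if not higher:
        return my_id       # no one outranks me: I am leader
    return max(higher)     # a higher node will win its own election
```

The partition weakness is visible here: two sides of a partition each see a different `alive_ids` set and can each crown their own leader.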
Fencing Tokens
Problem: Old leader doesn't know it's been replaced (network partition, GC pause)
Solution: Each leader gets a monotonically increasing fencing token. Storage rejects operations with old tokens.
Leader A (token 33) → gets partitioned
Leader B elected (token 34)
Leader A recovers, tries to write with token 33 → Storage rejects (33 < 34)
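The scenario above can be sketched from the storage side: every write carries the writer's token, and storage rejects anything older than the highest token it has seen. Class and method names are invented for this illustration.

```python
class FencedStorage:
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False             # stale leader: reject the write
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStorage()
assert store.write(33, "x", 1)       # leader A (token 33)
assert store.write(34, "x", 2)       # leader B elected (token 34)
assert not store.write(33, "x", 3)   # A recovers; 33 < 34, rejected
```

The key design point: correctness lives in the storage layer, so it holds even when the old leader is completely unaware it was replaced.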
Split Brain
Problem
Network partition causes two nodes to both think they're leader.
[Node A - Leader] ←PARTITION→ [Node B - Leader]
Both accept writes → DATA INCONSISTENCY
Solutions
- Quorum — need majority to operate (Raft/Paxos)
- Fencing tokens — storage rejects old leader's writes
- STONITH (Shoot The Other Node In The Head) — kill the other node
- Witness/Tiebreaker — odd number of nodes or external witness
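The quorum solution reduces to one inequality: a partition may act (elect a leader, commit writes) only if it holds a strict majority of the full cluster, so two disjoint partitions can never both qualify. A minimal sketch:

```python
def has_quorum(partition_size, cluster_size):
    """A partition can operate only with a strict majority of all nodes."""
    return partition_size > cluster_size // 2

# A 5-node cluster split 3/2: only the 3-node side may elect a leader.
assert has_quorum(3, 5)
assert not has_quorum(2, 5)
# An even split of 4 nodes (2/2) leaves NO side with quorum — one reason
# clusters are usually sized with an odd number of nodes.
assert not has_quorum(2, 4)
```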
Practical Uses of Consensus
| Use Case | System | Consensus For |
|---|---|---|
| Configuration management | ZooKeeper, etcd, Consul | Consistent config across cluster |
| Service discovery | Consul, etcd | Agree on service locations |
| Distributed locking | ZooKeeper, etcd, Redis (Redlock) | Mutual exclusion |
| Leader election | ZooKeeper, etcd | Who is the primary |
| Metadata management | Kafka (KRaft), HDFS NameNode | Cluster metadata |
| Distributed DB | CockroachDB, Spanner, TiDB | Transaction commit |
Byzantine Fault Tolerance (BFT)
Crash faults vs Byzantine faults
- Crash fault: Node stops responding (handled by Raft/Paxos)
- Byzantine fault: Node sends WRONG/MALICIOUS data
BFT Requirements
- Need 3f+1 nodes to tolerate f Byzantine faults
- PBFT (Practical Byzantine Fault Tolerance) — O(n²) messages
- Used in: Blockchain, some financial systems
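The sizing arithmetic is worth making explicit: tolerating f Byzantine nodes requires n = 3f + 1 total nodes, and PBFT-style protocols collect quorums of 2f + 1 of them, so that any two quorums overlap in at least f + 1 nodes, at least one of which is honest.

```python
def bft_cluster_size(f):
    """Minimum nodes needed to tolerate f Byzantine faults."""
    return 3 * f + 1

def bft_quorum(f):
    """PBFT-style quorum size within a 3f+1 cluster."""
    return 2 * f + 1

assert bft_cluster_size(1) == 4   # 4 nodes tolerate 1 Byzantine fault
assert bft_quorum(1) == 3
# Two quorums of 3 in a 4-node cluster overlap in >= 2 nodes,
# so at least one honest node is in both.
```

Compare with crash-fault consensus (Raft/Paxos), where 2f + 1 nodes suffice to tolerate f crashes.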
Interview tip: For most system design interviews, you only need crash fault tolerance (Raft/Paxos). BFT is mentioned only for blockchain or adversarial environments.
Resources
- 📖 DDIA Chapter 9: Consistency and Consensus
- 🔗 Raft paper — In Search of an Understandable Consensus Algorithm
- 🔗 Raft visualization (interactive)
- 🔗 Paxos Made Simple (Lamport)
- 🎥 MIT 6.824 — Raft lecture
- 🔗 ZooKeeper paper
Previous: 10 - CAP Theorem & Consistency Models | Next: 12 - Replication & Fault Tolerance