Distributed Consensus

Why This Matters

Consensus is how distributed nodes agree on a single value (a leader, a commit, a state change). A working understanding of Raft and Paxos is what separates senior candidates from junior ones.


The Consensus Problem

Goal: Get N nodes to agree on a value, even if some nodes crash.

Requirements:

  1. Agreement — all non-faulty nodes decide the same value
  2. Validity — decided value was proposed by some node
  3. Termination — all non-faulty nodes eventually decide
  4. Integrity — each node decides at most once

FLP Impossibility Result: In an asynchronous system with even ONE faulty node, deterministic consensus is IMPOSSIBLE. Real systems use timeouts (partial synchrony) to work around this.


Raft Consensus Algorithm

Why Raft?

Designed to be understandable (unlike Paxos). Used by etcd, CockroachDB, TiKV, Consul.

Roles

  • Leader — handles all client requests, replicates to followers
  • Follower — passive, responds to leader's requests
  • Candidate — node trying to become leader

Leader Election

1. A follower hears nothing from the leader before its election timeout fires (randomized 150-300ms)
2. It becomes a Candidate, increments its term, and votes for itself
3. It requests votes from the other nodes (RequestVote RPC)
4. It wins if it receives a majority of votes (>N/2)
5. It becomes Leader and starts sending heartbeats

Split vote: Two candidates start election simultaneously → neither gets majority → timeout → retry with new term. Random timeouts make this rare.
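The election steps above can be sketched as a toy simulation. All names here (`Node`, `handle_request_vote`, `run_election`) are illustrative, not from any real Raft implementation:

```python
import random

class Node:
    def __init__(self, node_id, cluster_size):
        self.node_id = node_id
        self.cluster_size = cluster_size
        self.term = 0
        self.voted_for = None   # who we voted for in the current term
        self.state = "follower"

    def election_timeout(self):
        # Randomized timeout (150-300ms) makes simultaneous candidacies rare
        return random.uniform(0.150, 0.300)

    def start_election(self):
        self.term += 1
        self.state = "candidate"
        self.voted_for = self.node_id  # vote for ourselves
        return self.term

    def handle_request_vote(self, candidate_id, candidate_term):
        # A higher term always resets us to follower
        if candidate_term > self.term:
            self.term = candidate_term
            self.voted_for = None
            self.state = "follower"
        # Grant at most one vote per term, first valid candidate wins it
        if candidate_term == self.term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

def run_election(candidate, peers):
    term = candidate.start_election()
    votes = 1  # our own vote
    for peer in peers:
        if peer.handle_request_vote(candidate.node_id, term):
            votes += 1
    if votes > candidate.cluster_size // 2:  # strict majority required
        candidate.state = "leader"
    return candidate.state

nodes = [Node(i, 5) for i in range(5)]
print(run_election(nodes[0], nodes[1:]))  # → leader
```

In a split vote, no candidate collects a majority, every node's randomized timeout fires again, and a fresh election starts with a higher term.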

Log Replication

Client → Leader: "Set x = 5"
  1. Leader appends to its log (uncommitted)
  2. Leader sends AppendEntries RPC to all followers
  3. Followers append to their logs, acknowledge
  4. Leader gets majority ACK → commits entry
  5. Leader responds to client: "Success"
  6. Leader notifies followers of commit (next heartbeat)
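The replication flow above can be condensed into a minimal sketch, assuming a single-shot `AppendEntries` call and no retries (class names are hypothetical):

```python
class Follower:
    def __init__(self):
        self.log = []
        self.term = 0

    def append_entries(self, leader_term, index, entry):
        if leader_term < self.term:
            return False            # reject entries from a stale leader
        self.term = leader_term
        self.log[index:] = [entry]  # overwrite any conflicting suffix
        return True

class Leader:
    def __init__(self, followers):
        self.log = []               # list of (term, command) entries
        self.commit_index = -1      # highest committed log index
        self.followers = followers
        self.term = 1

    def client_request(self, command):
        index = len(self.log)
        entry = (self.term, command)
        self.log.append(entry)                       # 1. append, uncommitted
        acks = 1                                     # leader counts itself
        for f in self.followers:                     # 2. AppendEntries to all
            if f.append_entries(self.term, index, entry):
                acks += 1                            # 3. follower acknowledged
        cluster = len(self.followers) + 1
        if acks > cluster // 2:                      # 4. majority → commit
            self.commit_index = index
            return "Success"                         # 5. reply to client
        return "Retry"

followers = [Follower() for _ in range(4)]
leader = Leader(followers)
print(leader.client_request("set x = 5"))  # → Success
```

Step 6 (notifying followers of the commit) piggybacks on the next heartbeat in real Raft and is omitted here.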

Safety Properties

  • Leader Completeness: Elected leader has ALL committed entries
  • Log Matching: If two logs contain an entry with the same index and term, they are identical in that entry and all preceding entries
  • State Machine Safety: If a node applies entry at index i, no other node applies a different entry at i

Term

  • Monotonically increasing counter
  • Each term has at most one leader
  • Acts as a logical clock — stale leaders discover they're outdated
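The logical-clock behavior can be shown in a few lines: any RPC carrying a higher term forces the receiver to step down, which is exactly how a partitioned-away leader learns it is stale (the function name is illustrative):

```python
def on_rpc(node, rpc_term):
    """If an incoming RPC carries a higher term, step down to follower."""
    if rpc_term > node["term"]:
        node["term"] = rpc_term
        node["state"] = "follower"
    return node["state"]

stale_leader = {"term": 3, "state": "leader"}
print(on_rpc(stale_leader, 4))  # → follower (a term-4 leader exists)
```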

Paxos

Basic Paxos (Single-Decree)

Agrees on a single value.

Roles:

  • Proposer — proposes values
  • Acceptor — votes on proposals
  • Learner — learns the decided value

Two phases:

Phase 1 — Prepare:

Proposer → Acceptors: "Prepare(n)" (n = unique proposal number)
Acceptors: If n is higher than any proposal number previously seen → "Promise(n)" + highest-numbered proposal already accepted, if any

Phase 2 — Accept:

Proposer → Acceptors: "Accept(n, value)"
  (only if a majority promised; value must be the highest-numbered
   accepted value reported in the promises, or the proposer's own if none)
Acceptors: If no higher-numbered promise has been made since → "Accepted(n, value)"
If a majority accept → value is chosen!
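The two phases can be sketched as a single-decree toy (class and function names are illustrative, not from any Paxos library):

```python
class Acceptor:
    def __init__(self):
        self.promised_n = -1     # highest proposal number promised
        self.accepted_n = -1     # highest proposal number accepted
        self.accepted_value = None

    def prepare(self, n):
        # Phase 1: promise n if it exceeds anything seen; report prior accept
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", None, None)

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered promise was made since
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False

def propose(acceptors, n, value):
    promises = [a.prepare(n) for a in acceptors]
    granted = [p for p in promises if p[0] == "promise"]
    if len(granted) <= len(acceptors) // 2:
        return None  # no majority promised
    # Must adopt the highest-numbered value already accepted, if any
    prior = max(granted, key=lambda p: p[1])
    if prior[1] >= 0:
        value = prior[2]
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "x=5"))  # → x=5
```

The adoption rule in `propose` is what makes Paxos safe: once a value has been chosen by a majority, any later proposer is forced to re-propose that same value.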

Multi-Paxos

  • Repeated Paxos for a sequence of decisions (log entries)
  • Stable leader skips Phase 1 (already has promises)
  • Essentially what Raft simplified

Why Paxos is Hard

  • Original paper is notoriously difficult to understand
  • Many edge cases in practice
  • "Paxos Made Simple" (Lamport) is still complex
  • Raft was designed specifically because Paxos was too hard to implement correctly

ZAB (ZooKeeper Atomic Broadcast)

Used by Apache ZooKeeper for coordination.

Similar to Raft:

  • Leader-based
  • Majority quorum
  • Ordered broadcast of state changes

Differences from Raft:

  • Transaction-based (zxid = epoch + counter)
  • Supports atomic broadcast (ordered delivery to all)
  • Has explicit recovery phase on leader change

Leader Election Patterns

Why Leader Election?

  • Single writer avoids conflicts
  • Coordinator for distributed tasks
  • One node makes scheduling decisions

Using External Coordination Service

Node A → ZooKeeper/etcd: "Create ephemeral node /leader"
  → Success → Node A is leader
Node B → ZooKeeper/etcd: "Create ephemeral node /leader"
  → Fails (already exists) → Node B watches for leader failure
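The ephemeral-node pattern above can be mimicked with an in-memory stand-in; this is a sketch of the pattern, not the real ZooKeeper or etcd client API:

```python
class CoordinationStore:
    """In-memory stand-in for ZooKeeper/etcd (hypothetical, not a real client)."""
    def __init__(self):
        self.ephemeral = {}  # path -> owning session

    def create_ephemeral(self, path, session):
        if path in self.ephemeral:
            return False  # node already exists: someone else is leader
        self.ephemeral[path] = session
        return True

    def session_expired(self, session):
        # Ephemeral nodes vanish when their owning session dies (crash/timeout)
        self.ephemeral = {p: s for p, s in self.ephemeral.items() if s != session}

store = CoordinationStore()
print(store.create_ephemeral("/leader", "node-a"))  # → True  (A is leader)
print(store.create_ephemeral("/leader", "node-b"))  # → False (B must watch)
store.session_expired("node-a")                     # A crashes
print(store.create_ephemeral("/leader", "node-b"))  # → True  (B takes over)
```

The key property is that leadership is tied to a live session: when the leader's session expires, the ephemeral node disappears and watchers race to recreate it.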

Bully Algorithm

  • Each node has a priority (ID)
  • Highest-priority alive node becomes leader
  • On leader failure, next highest takes over
  • Simple but not partition-tolerant
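The Bully algorithm reduces to a one-liner once you know which nodes are reachable (function name illustrative); the `alive` parameter is also where the partition problem hides, since each side of a partition sees a different `alive` set:

```python
def bully_elect(node_ids, alive):
    """Highest-ID node that responds becomes leader (Bully algorithm)."""
    candidates = [n for n in node_ids if n in alive]
    return max(candidates) if candidates else None

nodes = [1, 2, 3, 4, 5]
print(bully_elect(nodes, alive={1, 2, 3, 4, 5}))  # → 5
print(bully_elect(nodes, alive={1, 2, 3}))        # → 3 (after 4 and 5 fail)
```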

Fencing Tokens

Problem: Old leader doesn't know it's been replaced (network partition, GC pause)

Solution: Each leader gets a monotonically increasing fencing token. Storage rejects operations with old tokens.

Leader A (token 33) → gets partitioned
Leader B elected (token 34)
Leader A recovers, tries to write with token 33 → Storage rejects (33 < 34)
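The fencing check on the storage side is a simple monotonic comparison; a minimal sketch (the `FencedStorage` name is hypothetical):

```python
class FencedStorage:
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        # Reject any writer whose token is older than one already seen
        if token < self.highest_token:
            return "rejected"
        self.highest_token = token
        self.data[key] = value
        return "ok"

storage = FencedStorage()
print(storage.write(33, "x", 1))  # → ok       (Leader A, token 33)
print(storage.write(34, "x", 2))  # → ok       (Leader B, token 34)
print(storage.write(33, "x", 3))  # → rejected (A recovers, stale token)
```

Note the check lives in the storage layer, not the leader: a paused or partitioned leader cannot know it is stale, so the resource it writes to must enforce the ordering.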

Split Brain

Problem

Network partition causes two nodes to both think they're leader.

[Node A - Leader] ←PARTITION→ [Node B - Leader]
Both accept writes → DATA INCONSISTENCY

Solutions

  1. Quorum — need majority to operate (Raft/Paxos)
  2. Fencing tokens — storage rejects old leader's writes
  3. STONITH (Shoot The Other Node In The Head) — kill the other node
  4. Witness/Tiebreaker — odd number of nodes or external witness
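The quorum rule (solution 1) works because at most one side of a partition can hold a strict majority; a minimal check:

```python
def has_quorum(reachable, cluster_size):
    """A partition side may operate only if it holds a strict majority."""
    return reachable > cluster_size // 2

# 5-node cluster split 3 / 2 by a partition:
print(has_quorum(3, 5))  # → True  (majority side keeps serving)
print(has_quorum(2, 5))  # → False (minority side must stop: no split brain)
```

This is also why clusters use an odd number of nodes: a 4-node cluster split 2/2 leaves neither side with a quorum, so the whole system stalls.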

Practical Uses of Consensus

  Use Case                   System                             Consensus For
  Configuration management   ZooKeeper, etcd, Consul            Consistent config across cluster
  Service discovery          Consul, etcd                       Agree on service locations
  Distributed locking        ZooKeeper, etcd, Redis (Redlock)   Mutual exclusion
  Leader election            ZooKeeper, etcd                    Who is the primary
  Metadata management        Kafka (KRaft), HDFS NameNode       Cluster metadata
  Distributed DB             CockroachDB, Spanner, TiDB         Transaction commit

Byzantine Fault Tolerance (BFT)

Crash faults vs Byzantine faults

  • Crash fault: Node stops responding (handled by Raft/Paxos)
  • Byzantine fault: Node sends WRONG/MALICIOUS data

BFT Requirements

  • Need 3f+1 nodes to tolerate f Byzantine faults
  • PBFT (Practical Byzantine Fault Tolerance) — O(n²) messages
  • Used in: Blockchain, some financial systems
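The 3f+1 bound is worth being able to compute on the spot:

```python
def min_nodes_for_bft(f):
    """Tolerating f Byzantine nodes requires at least 3f + 1 total nodes."""
    return 3 * f + 1

def max_byzantine_faults(n):
    """Largest f such that n >= 3f + 1."""
    return (n - 1) // 3

print(min_nodes_for_bft(1))     # → 4
print(max_byzantine_faults(7))  # → 2
```

Compare with crash faults, where 2f+1 nodes suffice (a 3-node Raft cluster tolerates 1 crash); Byzantine tolerance needs the extra f nodes to outvote liars, not just absentees.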

Interview tip: For most system design interviews, you only need crash fault tolerance (Raft/Paxos). BFT is mentioned only for blockchain or adversarial environments.


Resources


Previous: 10 - CAP Theorem & Consistency Models | Next: 12 - Replication & Fault Tolerance