Distributed Transactions
Why This Matters
When a single operation spans multiple services or databases, how do you ensure atomicity? This is one of the hardest problems in distributed systems and a FAANG favorite.
The Problem
Order Service: Create order (DB 1)
Payment Service: Charge card (DB 2)
Inventory Service: Reserve stock (DB 3)
Notification Service: Send email (External)
If payment succeeds but inventory fails → inconsistent state!
Local ACID transactions don't span multiple databases/services.
Two-Phase Commit (2PC)
How It Works
Phase 1 — Prepare (Voting):
Coordinator → All Participants: "Prepare to commit TX-123"
Participant A: Acquires locks, writes to WAL → "Yes, ready"
Participant B: Acquires locks, writes to WAL → "Yes, ready"
Participant C: Can't proceed → "No, abort"
Phase 2 — Commit/Abort (Decision):
If ALL voted Yes:
Coordinator → All: "Commit TX-123"
Participants: Commit and release locks
If ANY voted No:
Coordinator → All: "Abort TX-123"
Participants: Rollback and release locks
Problems with 2PC
| Problem | Description |
|---|---|
| Blocking | If coordinator crashes after Phase 1, participants hold locks indefinitely |
| Single point of failure | Coordinator failure blocks all participants |
| Performance | Requires 2 round trips + held locks = high latency |
| Availability | Any participant failure = entire TX fails |
| Scalability | Doesn't scale to many participants or across WANs |
Where 2PC Is Used
- Within a single database (internal transactions)
- XA transactions (Java JTA) across 2-3 databases
- NOT across microservices at FAANG scale
Three-Phase Commit (3PC)
Adds a pre-commit phase to reduce blocking:
Phase 1: CanCommit? → Yes/No
Phase 2: PreCommit (write changes, don't commit yet)
Phase 3: DoCommit (actually commit)
Advantage: Non-blocking (participants can decide on timeout) Disadvantage: More messages, doesn't handle network partitions well, rarely used in practice
Saga Pattern ★★★
The standard solution for distributed transactions in microservices.
Concept
Break a distributed transaction into a sequence of local transactions, each with a compensating action:
T1 → T2 → T3 → T4 (all succeed = done!)
T1 → T2 → T3 (fails!) → C2 → C1 (compensate in reverse)
Orchestration Saga
A central orchestrator directs the saga:
Saga Orchestrator:
1. Tell Order Service: "Create order" → OK
2. Tell Payment Service: "Charge card" → OK
3. Tell Inventory Service: "Reserve stock" → FAILED!
4. Tell Payment Service: "Refund card" (compensate)
5. Tell Order Service: "Cancel order" (compensate)
Pros: Central control, easy to understand flow, good for complex sagas Cons: Orchestrator is SPOF, tight coupling to orchestrator
Choreography Saga
Each service reacts to events and triggers the next step:
Order Service: "OrderCreated" →
Payment Service: listens → charges card → "PaymentCompleted" →
Inventory Service: listens → reserves stock → "StockReserved" →
Done!
If Inventory fails:
Inventory: "StockReservationFailed" →
Payment Service: listens → refunds → "PaymentRefunded" →
Order Service: listens → cancels order
Pros: Decoupled, no SPOF, scales well Cons: Hard to track/debug, complex failure flows, potential for missing events
Orchestration vs Choreography
| Aspect | Orchestration | Choreography |
|---|---|---|
| Control | Centralized | Decentralized |
| Coupling | Services coupled to orchestrator | Services coupled via events |
| Visibility | Easy to see full flow | Hard to trace end-to-end |
| Complexity | Simple logic, complex coordinator | Simple services, complex interactions |
| Best for | Complex sagas (>4 steps) | Simple sagas (2-3 steps) |
Compensating Transactions
Design Principles
- Every action has a compensating action (undo)
- Compensation may not be a perfect undo (refund vs. charge)
- Design services to support compensation from the start
Examples
| Action | Compensation |
|---|---|
| Create order | Cancel order |
| Charge payment | Issue refund |
| Reserve inventory | Release reservation |
| Send email | Send "oops" email (can't unsend!) |
| Ship package | Create return label |
Semantic vs Exact Undo
- Exact undo: DELETE the row you INSERTed
- Semantic undo: Create a new "cancellation" record (for audit trail)
- Prefer semantic undo — maintains history
TCC (Try-Confirm-Cancel)
How It Works
Try Phase: Reserve resources (don't commit)
Order: PENDING, Payment: HELD, Inventory: RESERVED
Confirm Phase: Commit all reservations
Order: CONFIRMED, Payment: CHARGED, Inventory: DEDUCTED
Cancel Phase: Release all reservations (if any Try fails)
Order: CANCELLED, Payment: RELEASED, Inventory: RELEASED
Difference from Saga
- TCC reserves resources in Try phase (isolation)
- Saga executes real transactions that may need compensation
- TCC has better isolation but requires services to support reservation
Outbox Pattern
Problem
Service needs to update DB AND send an event. These aren't atomic:
1. Update DB ✅
2. Publish event ❌ (broker down) → inconsistency!
Solution: Transactional Outbox
1. In one DB transaction:
- Update business table
- INSERT event into "outbox" table
2. Separate process (CDC or poller) reads outbox → publishes to broker
3. Mark outbox entry as published
sqlBEGIN; UPDATE orders SET status = 'CONFIRMED' WHERE id = 123; INSERT INTO outbox (event_type, payload) VALUES ('OrderConfirmed', '{"id": 123}'); COMMIT;
CDC (Change Data Capture): Tools like Debezium read DB transaction log → publish to Kafka. No polling needed.
Idempotency ★★★
Why Critical
In distributed systems, messages may be delivered multiple times. Every consumer must handle duplicates safely.
Implementation
// Consumer receives message
1. Check: has idempotency_key been processed?
→ Yes: return stored result (skip processing)
→ No: process message, store result with key
// Storage
CREATE TABLE processed_messages (
idempotency_key VARCHAR PRIMARY KEY,
result JSONB,
created_at TIMESTAMP
);
Naturally Idempotent Operations
- SET x = 5 (can repeat safely)
- DELETE WHERE id = 123 (second delete is no-op)
- PUT /resource/123 (replace, not increment)
Non-Idempotent Operations (Need Protection)
- INCREMENT balance BY 100 (repeating doubles the increment)
- POST /payments (repeating creates duplicate charges)
- INSERT into table (repeating creates duplicate rows)
Interview Decision Framework
Do you need distributed transactions?
→ Can you avoid them? (restructure to single service) → DO THAT
→ Must span services?
→ 2-3 services, simple flow → Choreography Saga
→ 4+ services, complex flow → Orchestration Saga
→ Need strong isolation → TCC
→ Within single DB → Use local ACID
Resources
- 📖 DDIA Chapter 7: Transactions
- 📖 DDIA Chapter 9: Consistency and Consensus
- 📖 "Microservices Patterns" by Chris Richardson — Chapter 4: Sagas
- 🔗 Saga Pattern (Microsoft)
- 🎥 Chris Richardson — Sagas
- 🔗 Outbox Pattern (Debezium)
Previous: 12 - Replication & Fault Tolerance | Next: 14 - Clocks & Ordering