24 - Reliability Engineering

Previous: 23 - Data Pipelines & ETL | Next: 25 - Monitoring, Logging & Tracing


Why This Matters in Interviews

Every FAANG system design question implicitly asks: "What happens when things fail?" Reliability engineering is how you answer. Interviewers want to see you reason about failure modes, recovery strategies, and the business trade-offs behind uptime targets.


SLA / SLO / SLI

SLI (measurement) --> SLO (target) --> SLA (contract)

Example:
  SLI: "p99 latency of /checkout endpoint = 340ms"
  SLO: "p99 latency of /checkout < 500ms, 99.9% of the time"
  SLA: "If p99 latency exceeds 500ms for >0.1% of requests/month,
        customer receives 10% credit"
| Term | Definition | Owned By | Example |
|------|------------|----------|---------|
| SLI (Service Level Indicator) | Quantitative metric measuring service behavior | Engineering | Request latency, error rate, throughput |
| SLO (Service Level Objective) | Target value for an SLI | Engineering + Product | 99.95% availability per month |
| SLA (Service Level Agreement) | Contract with consequences if SLO is breached | Business + Legal | Financial penalties, credits |

Common SLIs

| SLI | Measurement | Good For |
|-----|-------------|----------|
| Availability | Successful requests / total requests | API services |
| Latency | p50, p95, p99 response time | User-facing services |
| Throughput | Requests per second processed | Data pipelines |
| Error rate | 5xx responses / total responses | Any HTTP service |
| Durability | Data loss events per year | Storage systems |
| Freshness | Age of most recent data update | Analytics, search index |

Nines Table

| Availability | Downtime/year | Downtime/month | Downtime/day |
|--------------|---------------|----------------|--------------|
| 99% (two 9s) | 3.65 days | 7.3 hours | 14.4 min |
| 99.9% (three 9s) | 8.77 hours | 43.8 min | 1.44 min |
| 99.95% | 4.38 hours | 21.9 min | 43.2 sec |
| 99.99% (four 9s) | 52.6 min | 4.38 min | 8.64 sec |
| 99.999% (five 9s) | 5.26 min | 26.3 sec | 0.86 sec |
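The downtime figures follow directly from the availability fraction; a quick sketch (using an average month of 365/12 days, which is what the monthly column assumes):

```python
# Allowed downtime implied by an availability target, per period.
def downtime_seconds(availability_pct: float, period_seconds: float) -> float:
    return (1 - availability_pct / 100) * period_seconds

YEAR = 365 * 24 * 3600
MONTH = YEAR / 12          # average month, matching the table

print(downtime_seconds(99.9, MONTH) / 60)    # ~43.8 minutes
```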

Error Budgets

The error budget is the complement of the SLO: the amount of failure you are allowed before breaching the target.

SLO = 99.95% availability
Error budget = 0.05% = 21.9 minutes of downtime per month

Budget remaining:  [=============-----]  78% remaining
                    ^                ^
               Used: 4.8 min    Budget: 21.9 min

  • When budget is healthy: ship features, take risks, deploy frequently
  • When budget is burning: slow down releases, focus on reliability work
  • When budget is exhausted: feature freeze, all hands on reliability

This creates a data-driven negotiation between velocity and reliability.
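The budget math and the release policy can be sketched in a few lines; the thresholds and function names here are illustrative, not standard values:

```python
def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent.

    slo is the availability target (e.g. 0.9995); the budget is the
    allowed failure fraction (1 - slo) applied to total traffic.
    """
    allowed_failures = (1 - slo) * total_requests
    return max(0.0, 1 - failed_requests / allowed_failures)

def release_policy(remaining: float) -> str:
    # Illustrative thresholds for the healthy / burning / exhausted split.
    if remaining > 0.5:
        return "ship"
    if remaining > 0.0:
        return "slow down"
    return "freeze"
```

With a 99.95% SLO and 1M requests, 100 failures spend a fifth of the 500-failure budget, so 80% remains and releases proceed normally.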


Failure Modes

| Failure Mode | Description | Example | Detection Difficulty |
|--------------|-------------|---------|----------------------|
| Crash failure | Process stops entirely | OOM kill, segfault | Easy (health check fails) |
| Omission failure | Fails to send/receive messages | Dropped packets, full queue | Medium (timeout needed) |
| Timing failure | Response outside time bounds | Slow query, GC pause | Medium (latency monitoring) |
| Response failure | Incorrect response | Bug returns wrong data | Hard (needs validation) |
| Byzantine failure | Arbitrary/malicious behavior | Corrupted memory, hacked node | Very hard (needs consensus) |

Fail-Stop vs Fail-Silent vs Byzantine

Fail-Stop:    Node crashes and everyone knows it's down
Fail-Silent:  Node stops responding but doesn't announce failure
Byzantine:    Node sends conflicting/incorrect messages to different peers

Circuit Breaker Pattern

Prevents cascading failures by stopping calls to a failing service.

States

              success (reset failure count)
                  +----+
                  |    |
                  v    |
                CLOSED ---[failures >= threshold]---> OPEN
                  ^                                     |
                  |                                     | timeout expires
                  |                                     v
                  +--------[probe succeeds]-------- HALF-OPEN
                                                        |
                                                        +--[probe fails]--> back to OPEN
| State | Behavior | Transitions |
|-------|----------|-------------|
| Closed | Requests pass through normally; failures counted | --> Open (threshold hit) |
| Open | All requests fail fast (no calls to downstream) | --> Half-Open (after timeout) |
| Half-Open | Allow limited probe requests to test recovery | --> Closed (probes succeed) or Open (probes fail) |

Implementation Considerations

  • Track failures in a rolling window (not cumulative)
  • Configure per-dependency (payment service vs image service)
  • Return fallback responses when open (cached data, defaults, graceful error)
  • Log state transitions for observability

Bulkhead Pattern

Isolates components so failure in one doesn't sink everything.

Without bulkhead:
  [Shared Thread Pool: 100 threads]
  Service A (stuck) uses 98 threads --> Service B, C starved

With bulkhead:
  [Pool A: 40 threads] [Pool B: 30 threads] [Pool C: 30 threads]
  Service A (stuck) uses 40 threads --> B and C unaffected

Types:

  • Thread pool isolation: Separate thread pools per dependency
  • Connection pool isolation: Separate DB/HTTP connection pools
  • Process isolation: Separate containers/pods per service
  • Regional isolation: Separate failure domains (AZ, region)
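Thread-pool-style isolation can be approximated with a semaphore per dependency; a sketch (pool sizes and names are illustrative):

```python
import threading

class Bulkhead:
    """Caps concurrent calls to one dependency so it can't hog shared capacity."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.Semaphore(max_concurrent)

    def call(self, fn, fallback=None):
        if not self._slots.acquire(blocking=False):
            return fallback            # pool exhausted: reject instead of queueing
        try:
            return fn()
        finally:
            self._slots.release()

# One bulkhead per dependency, sized independently:
payments = Bulkhead(max_concurrent=40)
images = Bulkhead(max_concurrent=30)
```

If the payments dependency hangs and pins all 40 of its slots, image calls still have their own 30.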

Retry Strategies

Exponential Backoff with Jitter

Attempt 1: wait 0ms        (immediate)
Attempt 2: wait ~200ms     (base * 2^1 + jitter)
Attempt 3: wait ~400ms     (base * 2^2 + jitter)
Attempt 4: wait ~800ms     (base * 2^3 + jitter)
Attempt 5: give up

Without jitter (thundering herd):
  All clients retry at: 100ms, 200ms, 400ms, 800ms  <-- synchronized spikes

With jitter:
  Client A: 87ms, 230ms, 510ms, 900ms
  Client B: 120ms, 180ms, 620ms, 1100ms   <-- spread out
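One common way to combine the two is "full jitter": draw each delay uniformly between zero and the exponential ceiling. A sketch (defaults are illustrative):

```python
import random

def backoff_delays(base_ms: float = 100, factor: float = 2,
                   max_attempts: int = 5, cap_ms: float = 30_000):
    """Yield the delay (ms) before each retry: full jitter under an exponential cap."""
    for attempt in range(1, max_attempts):
        ceiling = min(cap_ms, base_ms * factor ** attempt)
        yield random.uniform(0, ceiling)   # spreads clients out, no sync'd spikes
```

The cap prevents delays from growing unboundedly when an outage lasts many attempts.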

Retry Decision Matrix

| Condition | Should Retry? | Why |
|-----------|---------------|-----|
| 500 Internal Server Error | Yes | Likely transient |
| 503 Service Unavailable | Yes | Server overloaded temporarily |
| 429 Too Many Requests | Yes (with backoff) | Rate limited, wait and try |
| 400 Bad Request | No | Client error, retrying won't help |
| 404 Not Found | No | Resource doesn't exist |
| Timeout | Yes (carefully) | May cause duplicate if request succeeded |
| Connection refused | Yes | Server may be restarting |

Key Rules

  • Always set a max retry count (avoid infinite loops)
  • Always add jitter (avoid thundering herd)
  • Ensure idempotency (retries may duplicate the request)
  • Use retry budgets (limit total retries across all clients)

Timeout Strategies

Client --> Gateway --> Service A --> Service B --> Database

Timeouts must cascade inward:
  Client timeout:    10s
  Gateway timeout:    8s
  Service A timeout:  5s
  Service B timeout:  3s
  DB query timeout:   1s
| Strategy | Description | When to Use |
|----------|-------------|-------------|
| Connect timeout | Max time to establish connection | Always (short: 1-3s) |
| Read timeout | Max time waiting for response | Always (based on SLO) |
| Deadline propagation | Pass remaining time budget downstream | Microservice chains |
| Adaptive timeout | Adjust based on p99 of recent calls | High-traffic services |

Anti-pattern: No timeout at all. A single hung connection can exhaust the thread pool.
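Deadline propagation can be sketched as carrying one absolute expiry through the chain instead of fixed per-hop timeouts (class name and margin are illustrative):

```python
import time

class Deadline:
    """One absolute deadline shared by every hop in a request chain."""

    def __init__(self, timeout_s: float):
        self.expires_at = time.monotonic() + timeout_s

    def remaining(self) -> float:
        return max(0.0, self.expires_at - time.monotonic())

    def expired(self) -> bool:
        return self.remaining() == 0.0

# Each hop uses whatever budget is left, minus a safety margin, e.g.:
#   service_b_timeout = min(local_default_s, deadline.remaining() - 0.05)
# so downstream work is never granted more time than the caller will wait.
```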


Graceful Degradation

When the system is under stress, serve reduced functionality rather than failing entirely.

| Scenario | Degraded Response | Full Response |
|----------|-------------------|---------------|
| Recommendation service down | Show popular items | Personalized recommendations |
| Image service slow | Show placeholder/low-res | Full-res images |
| Search index stale | Serve slightly stale results | Fresh results |
| Payment gateway timeout | Queue order, process later | Instant confirmation |

Load Shedding

When demand exceeds capacity, intentionally drop low-priority work to protect critical paths.

Incoming requests:  ████████████████████  (150% capacity)

Without shedding:   All requests slow/timeout (100% degraded)
With shedding:      ██████████████ processed (critical)
                    ██████ rejected with 503 (non-critical)
                    70% served well, 30% fast-failed

Priority tiers:

  1. Critical: Payment, auth, core API (never shed)
  2. Important: Search, recommendations (shed under extreme load)
  3. Best-effort: Analytics, prefetch, background sync (shed first)
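The tiered policy reduces to a small admission function; the load thresholds below are illustrative, and `load` is normalized so 1.0 means full capacity:

```python
CRITICAL, IMPORTANT, BEST_EFFORT = 0, 1, 2

def admit(priority: int, load: float) -> bool:
    """Serve or shed a request based on its tier and current load."""
    if priority == CRITICAL:
        return True                # never shed payment/auth/core API
    if priority == IMPORTANT:
        return load < 1.2          # shed only under extreme overload
    return load < 0.8              # best-effort sheds before capacity is reached
```

Rejected requests should get a fast 503 (ideally with Retry-After) rather than queueing and timing out.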

Health Checks

Shallow vs Deep

Shallow health check:           Deep health check:
GET /health                     GET /health?deep=true
Response: {"status": "ok"}      Response: {
                                  "status": "degraded",
                                  "db": "ok",
                                  "cache": "ok",
                                  "payment_api": "timeout",
                                  "disk_space": "82%"
                                }
| Type | What It Checks | Use Case | Risk |
|------|----------------|----------|------|
| Shallow (liveness) | Process is running | Kubernetes liveness probe | May report healthy when deps are down |
| Deep (readiness) | Dependencies are reachable | Kubernetes readiness probe | Expensive; can cascade failure if dep check is slow |

Best practice: Use shallow for liveness, deep for readiness. Never let a deep health check take longer than the probe timeout.
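A sketch of the two handlers (hypothetical helpers, not a specific framework's API; real dependency probes need their own short timeouts so the deep check stays under the probe timeout):

```python
def liveness() -> dict:
    # Shallow: only proves the process is up and serving requests.
    return {"status": "ok"}

def readiness(checks: dict) -> dict:
    """Deep: aggregate per-dependency probes (name -> callable returning bool)."""
    results = {name: ("ok" if probe() else "fail") for name, probe in checks.items()}
    status = "ok" if all(v == "ok" for v in results.values()) else "degraded"
    return {"status": status, **results}
```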


Feature Flags for Reliability

if feature_flag("new_checkout_flow"):
    return new_checkout()
else:
    return legacy_checkout()  # safe fallback

Reliability use cases:

  • Kill switch: Instantly disable a feature causing incidents
  • Gradual rollout: 1% -> 5% -> 25% -> 100% (catch issues early)
  • A/B testing: Route traffic without deployments
  • Circuit breaking: Disable integration with a failing third party

Blast Radius Reduction

Limit the impact of any single failure.

| Technique | How It Helps |
|-----------|--------------|
| Cell-based architecture | Each cell serves a subset of users independently |
| Regional isolation | Failure in us-east-1 doesn't affect eu-west-1 |
| Canary deployments | New code hits 1-5% of traffic first |
| Feature flags | Disable broken features without rollback |
| Bulkheads | Isolate resource pools per service |
| Shuffle sharding | Assign customers to overlapping-but-distinct resource sets |

Dependency Management

                  +-- Service B (critical) --+
Service A --------+                          +--> Response
                  +-- Service C (optional) --+

If C is down:
  - Don't fail the whole request
  - Return partial response without C's data
  - Log the degradation for visibility

Rules:

  1. Classify every dependency as critical or optional
  2. Critical dependencies get circuit breakers + retries
  3. Optional dependencies get aggressive timeouts + fallbacks
  4. Never let an optional dependency take down a critical path
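Rule 4 in code: let critical failures propagate, but turn optional failures into a flagged partial response. The callables stand in for real service clients (which would carry their own timeouts):

```python
def assemble_response(fetch_core, fetch_optional) -> dict:
    """Critical data must succeed; optional data degrades to a partial response."""
    response = {"core": fetch_core()}   # failures here fail the whole request
    try:
        response["extras"] = fetch_optional()
    except Exception:
        response["extras"] = None       # serve without the optional data
        response["degraded"] = True     # flag it so the degradation is visible
    return response
```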

Interview Tips

  1. Always discuss failure modes in system design. "What happens when X goes down?" should be part of your answer before the interviewer asks.
  2. Error budgets show maturity. Mention the trade-off between shipping velocity and reliability.
  3. Circuit breaker + retry + timeout is a reliability triad. Know all three cold.
  4. Load shedding vs graceful degradation are different tools: shedding rejects requests, degradation serves reduced responses.
  5. SLIs before SLOs. You can't set targets without measuring first.
  6. Blast radius is a top concern at FAANG scale. Always discuss isolation strategies.

Resources

  • Google SRE Book: Chapters on SLOs, error budgets, handling overload
  • DDIA Chapter 8: The Trouble with Distributed Systems (faults, timeouts)
  • Release It! - Michael Nygard (circuit breakers, bulkheads, stability patterns)
  • Implementing SLOs - Alex Hidalgo
  • Netflix Tech Blog: "Making the Netflix API More Resilient"
  • AWS Architecture Blog: "Shuffle Sharding"
