24 - Reliability Engineering

Previous: 23 - Data Pipelines & ETL | Next: 25 - Monitoring, Logging & Tracing


Why This Matters in Interviews

Every FAANG system design question implicitly asks: "What happens when things fail?" Reliability engineering is how you answer. Interviewers want to see you reason about failure modes, recovery strategies, and the business trade-offs behind uptime targets.


SLA / SLO / SLI

SLI (measurement) --> SLO (target) --> SLA (contract)

Example:
  SLI: "p99 latency of /checkout endpoint = 340ms"
  SLO: "p99 latency of /checkout < 500ms, 99.9% of the time"
  SLA: "If p99 latency exceeds 500ms for >0.1% of requests/month,
        customer receives 10% credit"
| Term | Definition | Owned By | Example |
|------|------------|----------|---------|
| SLI (Service Level Indicator) | Quantitative metric measuring service behavior | Engineering | Request latency, error rate, throughput |
| SLO (Service Level Objective) | Target value for an SLI | Engineering + Product | 99.95% availability per month |
| SLA (Service Level Agreement) | Contract with consequences if SLO is breached | Business + Legal | Financial penalties, credits |

Common SLIs

| SLI | Measurement | Good For |
|-----|-------------|----------|
| Availability | Successful requests / total requests | API services |
| Latency | p50, p95, p99 response time | User-facing services |
| Throughput | Requests per second processed | Data pipelines |
| Error rate | 5xx responses / total responses | Any HTTP service |
| Durability | Data loss events per year | Storage systems |
| Freshness | Age of most recent data update | Analytics, search index |

Nines Table

| Availability | Downtime/year | Downtime/month | Downtime/day |
|--------------|---------------|----------------|--------------|
| 99% (two 9s) | 3.65 days | 7.3 hours | 14.4 min |
| 99.9% (three 9s) | 8.77 hours | 43.8 min | 1.44 min |
| 99.95% | 4.38 hours | 21.9 min | 43.2 sec |
| 99.99% (four 9s) | 52.6 min | 4.38 min | 8.64 sec |
| 99.999% (five 9s) | 5.26 min | 26.3 sec | 0.86 sec |
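The downtime figures follow directly from the availability fraction; a quick sketch (using an average month of 365/12 days, which is what the monthly column assumes):

```python
# Allowed downtime implied by an availability target, per period.
def downtime_seconds(availability_pct: float, period_seconds: float) -> float:
    return (1 - availability_pct / 100) * period_seconds

YEAR = 365 * 24 * 3600
MONTH = YEAR / 12          # average month, matching the table

print(downtime_seconds(99.9, MONTH) / 60)    # ~43.8 minutes
```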

Error Budgets

The error budget is the complement of the SLO: the amount of failure you are allowed before breaching the target.

SLO = 99.95% availability
Error budget = 0.05% = 21.9 minutes of downtime per month

Budget remaining:  [=============-----]  78% remaining
                    ^                ^
               Used: 4.8 min    Budget: 21.9 min

  • When budget is healthy: ship features, take risks, deploy frequently
  • When budget is burning: slow down releases, focus on reliability work
  • When budget is exhausted: feature freeze, all hands on reliability

This creates a data-driven negotiation between velocity and reliability.
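The budget math and the release policy can be sketched in a few lines; the thresholds and function names here are illustrative, not standard values:

```python
def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent.

    slo is the availability target (e.g. 0.9995); the budget is the
    allowed failure fraction (1 - slo) applied to total traffic.
    """
    allowed_failures = (1 - slo) * total_requests
    return max(0.0, 1 - failed_requests / allowed_failures)

def release_policy(remaining: float) -> str:
    # Illustrative thresholds for the healthy / burning / exhausted split.
    if remaining > 0.5:
        return "ship"
    if remaining > 0.0:
        return "slow down"
    return "freeze"
```

With a 99.95% SLO and 1M requests, 100 failures spend a fifth of the 500-failure budget, so 80% remains and releases proceed normally.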


Failure Modes

| Failure Mode | Description | Example | Detection Difficulty |
|--------------|-------------|---------|----------------------|
| Crash failure | Process stops entirely | OOM kill, segfault | Easy (health check fails) |
| Omission failure | Fails to send/receive messages | Dropped packets, full queue | Medium (timeout needed) |
| Timing failure | Response outside time bounds | Slow query, GC pause | Medium (latency monitoring) |
| Response failure | Incorrect response | Bug returns wrong data | Hard (needs validation) |
| Byzantine failure | Arbitrary/malicious behavior | Corrupted memory, hacked node | Very hard (needs consensus) |

Fail-Stop vs Fail-Silent vs Byzantine

Fail-Stop:    Node crashes and everyone knows it's down
Fail-Silent:  Node stops responding but doesn't announce failure
Byzantine:    Node sends conflicting/incorrect messages to different peers

Circuit Breaker Pattern

Prevents cascading failures by stopping calls to a failing service.

States

              success (reset failure count)
                  +----+
                  |    |
                  v    |
                CLOSED ---[failures >= threshold]---> OPEN
                  ^                                     |
                  |                                     | timeout expires
                  |                                     v
                  +--------[probe succeeds]-------- HALF-OPEN
                                                        |
                                                        +--[probe fails]--> back to OPEN
| State | Behavior | Transitions |
|-------|----------|-------------|
| Closed | Requests pass through normally; failures counted | --> Open (threshold hit) |
| Open | All requests fail fast (no calls to downstream) | --> Half-Open (after timeout) |
| Half-Open | Allow limited probe requests to test recovery | --> Closed (probes succeed) or Open (probes fail) |

Implementation Considerations

  • Track failures in a rolling window (not cumulative)
  • Configure per-dependency (payment service vs image service)
  • Return fallback responses when open (cached data, defaults, graceful error)
  • Log state transitions for observability

Bulkhead Pattern

Isolates components so failure in one doesn't sink everything.

Without bulkhead:
  [Shared Thread Pool: 100 threads]
  Service A (stuck) uses 98 threads --> Service B, C starved

With bulkhead:
  [Pool A: 40 threads] [Pool B: 30 threads] [Pool C: 30 threads]
  Service A (stuck) uses 40 threads --> B and C unaffected

Types:

  • Thread pool isolation: Separate thread pools per dependency
  • Connection pool isolation: Separate DB/HTTP connection pools
  • Process isolation: Separate containers/pods per service
  • Regional isolation: Separate failure domains (AZ, region)
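Thread-pool-style isolation can be approximated with a semaphore per dependency; a sketch (pool sizes and names are illustrative):

```python
import threading

class Bulkhead:
    """Caps concurrent calls to one dependency so it can't hog shared capacity."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.Semaphore(max_concurrent)

    def call(self, fn, fallback=None):
        if not self._slots.acquire(blocking=False):
            return fallback            # pool exhausted: reject instead of queueing
        try:
            return fn()
        finally:
            self._slots.release()

# One bulkhead per dependency, sized independently:
payments = Bulkhead(max_concurrent=40)
images = Bulkhead(max_concurrent=30)
```

If the payments dependency hangs and pins all 40 of its slots, image calls still have their own 30.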

Retry Strategies

Exponential Backoff with Jitter

Attempt 1: wait 0ms        (immediate)
Attempt 2: wait ~200ms     (base * 2^1 + jitter)
Attempt 3: wait ~400ms     (base * 2^2 + jitter)
Attempt 4: wait ~800ms     (base * 2^3 + jitter)
Attempt 5: give up

Without jitter (thundering herd):
  All clients retry at: 100ms, 200ms, 400ms, 800ms  <-- synchronized spikes

With jitter:
  Client A: 87ms, 230ms, 510ms, 900ms
  Client B: 120ms, 180ms, 620ms, 1100ms   <-- spread out
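One common way to combine the two is "full jitter": draw each delay uniformly between zero and the exponential ceiling. A sketch (defaults are illustrative):

```python
import random

def backoff_delays(base_ms: float = 100, factor: float = 2,
                   max_attempts: int = 5, cap_ms: float = 30_000):
    """Yield the delay (ms) before each retry: full jitter under an exponential cap."""
    for attempt in range(1, max_attempts):
        ceiling = min(cap_ms, base_ms * factor ** attempt)
        yield random.uniform(0, ceiling)   # spreads clients out, no sync'd spikes
```

The cap prevents delays from growing unboundedly when an outage lasts many attempts.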

Retry Decision Matrix

| Condition | Should Retry? | Why |
|-----------|---------------|-----|
| 500 Internal Server Error | Yes | Likely transient |
| 503 Service Unavailable | Yes | Server overloaded temporarily |
| 429 Too Many Requests | Yes (with backoff) | Rate limited, wait and try |
| 400 Bad Request | No | Client error, retrying won't help |
| 404 Not Found | No | Resource doesn't exist |
| Timeout | Yes (carefully) | May cause duplicate if request succeeded |
| Connection refused | Yes | Server may be restarting |

Key Rules

  • Always set a max retry count (avoid infinite loops)
  • Always add jitter (avoid thundering herd)
  • Ensure idempotency (retries may duplicate the request)
  • Use retry budgets (limit total retries across all clients)

Timeout Strategies

Client --> Gateway --> Service A --> Service B --> Database

Timeouts must cascade inward:
  Client timeout:    10s
  Gateway timeout:    8s
  Service A timeout:  5s
  Service B timeout:  3s
  DB query timeout:   1s
| Strategy | Description | When to Use |
|----------|-------------|-------------|
| Connect timeout | Max time to establish connection | Always (short: 1-3s) |
| Read timeout | Max time waiting for response | Always (based on SLO) |
| Deadline propagation | Pass remaining time budget downstream | Microservice chains |
| Adaptive timeout | Adjust based on p99 of recent calls | High-traffic services |

Anti-pattern: No timeout at all. A single hung connection can exhaust the thread pool.
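Deadline propagation can be sketched as carrying one absolute expiry through the chain instead of fixed per-hop timeouts (class name and margin are illustrative):

```python
import time

class Deadline:
    """One absolute deadline shared by every hop in a request chain."""

    def __init__(self, timeout_s: float):
        self.expires_at = time.monotonic() + timeout_s

    def remaining(self) -> float:
        return max(0.0, self.expires_at - time.monotonic())

    def expired(self) -> bool:
        return self.remaining() == 0.0

# Each hop uses whatever budget is left, minus a safety margin, e.g.:
#   service_b_timeout = min(local_default_s, deadline.remaining() - 0.05)
# so downstream work is never granted more time than the caller will wait.
```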


Graceful Degradation

When the system is under stress, serve reduced functionality rather than failing entirely.

| Scenario | Degraded Response | Full Response |
|----------|-------------------|---------------|
| Recommendation service down | Show popular items | Personalized recommendations |
| Image service slow | Show placeholder/low-res | Full-res images |
| Search index stale | Serve slightly stale results | Fresh results |
| Payment gateway timeout | Queue order, process later | Instant confirmation |

Load Shedding

When demand exceeds capacity, intentionally drop low-priority work to protect critical paths.

Incoming requests:  ████████████████████  (150% capacity)

Without shedding:   All requests slow/timeout (100% degraded)
With shedding:      ██████████████ processed (critical)
                    ██████ rejected with 503 (non-critical)
                    70% served well, 30% fast-failed

Priority tiers:

  1. Critical: Payment, auth, core API (never shed)
  2. Important: Search, recommendations (shed under extreme load)
  3. Best-effort: Analytics, prefetch, background sync (shed first)
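The tiered policy reduces to a small admission function; the load thresholds below are illustrative, and `load` is normalized so 1.0 means full capacity:

```python
CRITICAL, IMPORTANT, BEST_EFFORT = 0, 1, 2

def admit(priority: int, load: float) -> bool:
    """Serve or shed a request based on its tier and current load."""
    if priority == CRITICAL:
        return True                # never shed payment/auth/core API
    if priority == IMPORTANT:
        return load < 1.2          # shed only under extreme overload
    return load < 0.8              # best-effort sheds before capacity is reached
```

Rejected requests should get a fast 503 (ideally with Retry-After) rather than queueing and timing out.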

Health Checks

Shallow vs Deep

Shallow health check:           Deep health check:
GET /health                     GET /health?deep=true
Response: {"status": "ok"}      Response: {
                                  "status": "degraded",
                                  "db": "ok",
                                  "cache": "ok",
                                  "payment_api": "timeout",
                                  "disk_space": "82%"
                                }
| Type | What It Checks | Use Case | Risk |
|------|----------------|----------|------|
| Shallow (liveness) | Process is running | Kubernetes liveness probe | May report healthy when deps are down |
| Deep (readiness) | Dependencies are reachable | Kubernetes readiness probe | Expensive; can cascade failure if dep check is slow |

Best practice: Use shallow for liveness, deep for readiness. Never let a deep health check take longer than the probe timeout.
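A sketch of the two handlers (hypothetical helpers, not a specific framework's API; real dependency probes need their own short timeouts so the deep check stays under the probe timeout):

```python
def liveness() -> dict:
    # Shallow: only proves the process is up and serving requests.
    return {"status": "ok"}

def readiness(checks: dict) -> dict:
    """Deep: aggregate per-dependency probes (name -> callable returning bool)."""
    results = {name: ("ok" if probe() else "fail") for name, probe in checks.items()}
    status = "ok" if all(v == "ok" for v in results.values()) else "degraded"
    return {"status": status, **results}
```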


Feature Flags for Reliability

if feature_flag("new_checkout_flow"):
    return new_checkout()
else:
    return legacy_checkout()  # safe fallback

Reliability use cases:

  • Kill switch: Instantly disable a feature causing incidents
  • Gradual rollout: 1% -> 5% -> 25% -> 100% (catch issues early)
  • A/B testing: Route traffic without deployments
  • Circuit breaking: Disable integration with a failing third party

Blast Radius Reduction

Limit the impact of any single failure.

| Technique | How It Helps |
|-----------|--------------|
| Cell-based architecture | Each cell serves a subset of users independently |
| Regional isolation | Failure in us-east-1 doesn't affect eu-west-1 |
| Canary deployments | New code hits 1-5% of traffic first |
| Feature flags | Disable broken features without rollback |
| Bulkheads | Isolate resource pools per service |
| Shuffle sharding | Assign customers to overlapping-but-distinct resource sets |

Dependency Management

                  +-- Service B (critical) --+
Service A --------+                          +--> Response
                  +-- Service C (optional) --+

If C is down:
  - Don't fail the whole request
  - Return partial response without C's data
  - Log the degradation for visibility

Rules:

  1. Classify every dependency as critical or optional
  2. Critical dependencies get circuit breakers + retries
  3. Optional dependencies get aggressive timeouts + fallbacks
  4. Never let an optional dependency take down a critical path
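Rule 4 in code: let critical failures propagate, but turn optional failures into a flagged partial response. The callables stand in for real service clients (which would carry their own timeouts):

```python
def assemble_response(fetch_core, fetch_optional) -> dict:
    """Critical data must succeed; optional data degrades to a partial response."""
    response = {"core": fetch_core()}   # failures here fail the whole request
    try:
        response["extras"] = fetch_optional()
    except Exception:
        response["extras"] = None       # serve without the optional data
        response["degraded"] = True     # flag it so the degradation is visible
    return response
```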

Interview Tips

  1. Always discuss failure modes in system design. "What happens when X goes down?" should be part of your answer before the interviewer asks.
  2. Error budgets show maturity. Mention the trade-off between shipping velocity and reliability.
  3. Circuit breaker + retry + timeout is a reliability triad. Know all three cold.
  4. Load shedding vs graceful degradation are different tools: shedding rejects requests, degradation serves reduced responses.
  5. SLIs before SLOs. You can't set targets without measuring first.
  6. Blast radius is a top concern at FAANG scale. Always discuss isolation strategies.

Resources

  • Google SRE Book: Chapters on SLOs, error budgets, handling overload
  • DDIA Chapter 8: The Trouble with Distributed Systems (faults, timeouts)
  • Release It! - Michael Nygard (circuit breakers, bulkheads, stability patterns)
  • Implementing SLOs - Alex Hidalgo
  • Netflix Tech Blog: "Making the Netflix API More Resilient"
  • AWS Architecture Blog: "Shuffle Sharding"
