24 - Reliability Engineering
Previous: 23 - Data Pipelines & ETL | Next: 25 - Monitoring, Logging & Tracing
Why This Matters in Interviews
Every FAANG system design question implicitly asks: "What happens when things fail?" Reliability engineering is how you answer. Interviewers want to see you reason about failure modes, recovery strategies, and the business trade-offs behind uptime targets.
SLA / SLO / SLI
SLI (measurement) --> SLO (target) --> SLA (contract)
Example:
SLI: "p99 latency of /checkout endpoint = 340ms"
SLO: "p99 latency of /checkout < 500ms, 99.9% of the time"
SLA: "If p99 latency exceeds 500ms for >0.1% of requests/month,
customer receives 10% credit"
| Term | Definition | Owned By | Example |
|---|---|---|---|
| SLI (Service Level Indicator) | Quantitative metric measuring service behavior | Engineering | Request latency, error rate, throughput |
| SLO (Service Level Objective) | Target value for an SLI | Engineering + Product | 99.95% availability per month |
| SLA (Service Level Agreement) | Contract with consequences if SLO is breached | Business + Legal | Financial penalties, credits |
Common SLIs
| SLI | Measurement | Good For |
|---|---|---|
| Availability | Successful requests / total requests | API services |
| Latency | p50, p95, p99 response time | User-facing services |
| Throughput | Requests per second processed | Data pipelines |
| Error rate | 5xx responses / total responses | Any HTTP service |
| Durability | Data loss events per year | Storage systems |
| Freshness | Age of most recent data update | Analytics, search index |
Nines Table
| Availability | Downtime/year | Downtime/month | Downtime/day |
|---|---|---|---|
| 99% (two 9s) | 3.65 days | 7.3 hours | 14.4 min |
| 99.9% (three 9s) | 8.77 hours | 43.8 min | 1.44 min |
| 99.95% | 4.38 hours | 21.9 min | 43.2 sec |
| 99.99% (four 9s) | 52.6 min | 4.38 min | 8.64 sec |
| 99.999% (five 9s) | 5.26 min | 26.3 sec | 0.86 sec |
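The nines arithmetic above is simple enough to verify yourself. A minimal sketch (function name and constants are illustrative, not from any library; the table's per-month figures use an average month of year/12):

```python
def allowed_downtime_seconds(availability_pct, period_seconds):
    """Downtime budget implied by an availability target over a period."""
    return (1 - availability_pct / 100) * period_seconds

YEAR = 365 * 24 * 3600  # seconds in a 365-day year

print(allowed_downtime_seconds(99.9, YEAR / 12) / 60)    # ~43.8 min/month
print(allowed_downtime_seconds(99.95, YEAR / 12) / 60)   # ~21.9 min/month
```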
Error Budgets
The error budget is the complement of the SLO: how much failure is tolerable.
SLO = 99.95% availability
Error budget = 0.05% = 21.9 minutes of downtime per month
Budget remaining: [==============----] 78% remaining
                  Used: 4.8 min of 21.9 min budget
- When budget is healthy: ship features, take risks, deploy frequently
- When budget is burning: slow down releases, focus on reliability work
- When budget is exhausted: feature freeze, all hands on reliability
This creates a data-driven negotiation between velocity and reliability.
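The budget math from the example can be sketched in a few lines (function names are illustrative):

```python
def error_budget_minutes(slo_pct, period_minutes):
    """Total allowed downtime for the period implied by the SLO."""
    return (1 - slo_pct / 100) * period_minutes

def budget_remaining_pct(used_minutes, budget_minutes):
    """Percentage of the error budget still unspent."""
    return max(0.0, 1 - used_minutes / budget_minutes) * 100

MONTH_MIN = 365 * 24 * 60 / 12                     # average month, in minutes
budget = error_budget_minutes(99.95, MONTH_MIN)    # ~21.9 min
print(round(budget_remaining_pct(4.8, budget)))    # 78
```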
Failure Modes
| Failure Mode | Description | Example | Detection Difficulty |
|---|---|---|---|
| Crash failure | Process stops entirely | OOM kill, segfault | Easy (health check fails) |
| Omission failure | Fails to send/receive messages | Dropped packets, full queue | Medium (timeout needed) |
| Timing failure | Response outside time bounds | Slow query, GC pause | Medium (latency monitoring) |
| Response failure | Incorrect response | Bug returns wrong data | Hard (needs validation) |
| Byzantine failure | Arbitrary/malicious behavior | Corrupted memory, hacked node | Very hard (needs consensus) |
Fail-Stop vs Fail-Silent vs Byzantine
Fail-Stop: Node crashes and everyone knows it's down
Fail-Silent: Node stops responding but doesn't announce failure
Byzantine: Node sends conflicting/incorrect messages to different peers
Circuit Breaker Pattern
Prevents cascading failures by stopping calls to a failing service.
States
   CLOSED ----[failures >= threshold]----> OPEN
     ^                                       |
     |                                [timeout expires]
     |                                       v
     +--------[probe success]---------- HALF-OPEN
                                             |
                                      [probe failure]
                                             v
                                        back to OPEN
| State | Behavior | Transitions |
|---|---|---|
| Closed | Requests pass through normally; failures counted | --> Open (threshold hit) |
| Open | All requests fail fast (no calls to downstream) | --> Half-Open (after timeout) |
| Half-Open | Allow limited probe requests to test recovery | --> Closed (probes succeed) or Open (probes fail) |
Implementation Considerations
- Track failures in a rolling window (not cumulative)
- Configure per-dependency (payment service vs image service)
- Return fallback responses when open (cached data, defaults, graceful error)
- Log state transitions for observability
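The state machine fits in a small class. A minimal sketch, assuming a simple cumulative failure counter rather than the rolling window a production breaker would use (class and method names are illustrative):

```python
import time

class CircuitBreaker:
    """Minimal three-state breaker: closed -> open -> half-open."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, fallback=None):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"   # allow a probe request through
            else:
                return fallback            # fail fast: no downstream call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"        # trip (or re-trip) the breaker
                self.opened_at = time.monotonic()
            return fallback
        self.failures = 0
        self.state = "closed"              # success (or probe success): close
        return result
```

Configure one instance per dependency, and return cached data or a graceful error as the fallback while open.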
Bulkhead Pattern
Isolates components so failure in one doesn't sink everything.
Without bulkhead:
[Shared Thread Pool: 100 threads]
Service A (stuck) uses 98 threads --> Service B, C starved
With bulkhead:
[Pool A: 40 threads] [Pool B: 30 threads] [Pool C: 30 threads]
Service A (stuck) uses 40 threads --> B and C unaffected
Types:
- Thread pool isolation: Separate thread pools per dependency
- Connection pool isolation: Separate DB/HTTP connection pools
- Process isolation: Separate containers/pods per service
- Regional isolation: Separate failure domains (AZ, region)
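Thread-pool isolation maps directly onto the stdlib. A sketch with illustrative pool sizes and dependency names:

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per dependency: a stuck dependency can exhaust
# only its own pool, never its neighbours'.
pools = {
    "payments": ThreadPoolExecutor(max_workers=40, thread_name_prefix="payments"),
    "search":   ThreadPoolExecutor(max_workers=30, thread_name_prefix="search"),
    "images":   ThreadPoolExecutor(max_workers=30, thread_name_prefix="images"),
}

def submit(dependency, fn, *args):
    """Route work to the dependency's isolated pool."""
    return pools[dependency].submit(fn, *args)

future = submit("search", lambda q: q.upper(), "shoes")
print(future.result())  # SHOES
```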
Retry Strategies
Exponential Backoff with Jitter
Attempt 1: wait 0ms (immediate)
Attempt 2: wait ~200ms (base 100ms * 2^1 + jitter)
Attempt 3: wait ~400ms (base 100ms * 2^2 + jitter)
Attempt 4: wait ~800ms (base 100ms * 2^3 + jitter)
Attempt 5: give up
Without jitter (thundering herd):
All clients retry at: 100ms, 200ms, 400ms, 800ms <-- synchronized spikes
With jitter:
Client A: 87ms, 230ms, 510ms, 900ms
Client B: 120ms, 180ms, 620ms, 1100ms <-- spread out
Retry Decision Matrix
| Condition | Should Retry? | Why |
|---|---|---|
| 500 Internal Server Error | Yes | Likely transient |
| 503 Service Unavailable | Yes | Server overloaded temporarily |
| 429 Too Many Requests | Yes (with backoff) | Rate limited, wait and try |
| 400 Bad Request | No | Client error, retrying won't help |
| 404 Not Found | No | Resource doesn't exist |
| Timeout | Yes (carefully) | May cause duplicate if request succeeded |
| Connection refused | Yes | Server may be restarting |
Key Rules
- Always set a max retry count (avoid infinite loops)
- Always add jitter (avoid thundering herd)
- Ensure idempotency (retries may duplicate the request)
- Use retry budgets (limit total retries across all clients)
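The rules above can be tied together in one wrapper: a capped attempt count, an exponential delay with "full jitter", and a whitelist of retryable statuses matching the decision matrix (names and defaults are illustrative):

```python
import random
import time

RETRYABLE = {429, 500, 502, 503, 504}

def call_with_retries(fn, max_attempts=4, base_delay=0.1, max_delay=5.0):
    """fn returns (status_code, body); retry transient failures with backoff."""
    for attempt in range(1, max_attempts + 1):
        status, body = fn()
        if status < 400:
            return body
        if status not in RETRYABLE or attempt == max_attempts:
            raise RuntimeError(f"giving up with status {status}")
        # full jitter: sleep a uniform amount up to the exponential cap
        cap = min(max_delay, base_delay * 2 ** (attempt - 1))
        time.sleep(random.uniform(0, cap))
```

Pair this with idempotency keys on the server side, since a retried request may have already succeeded.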
Timeout Strategies
Client --> Gateway --> Service A --> Service B --> Database
Timeouts must cascade inward:
Client timeout: 10s
Gateway timeout: 8s
Service A timeout: 5s
Service B timeout: 3s
DB query timeout: 1s
| Strategy | Description | When to Use |
|---|---|---|
| Connect timeout | Max time to establish connection | Always (short: 1-3s) |
| Read timeout | Max time waiting for response | Always (based on SLO) |
| Deadline propagation | Pass remaining time budget downstream | Microservices chains |
| Adaptive timeout | Adjust based on p99 of recent calls | High-traffic services |
Anti-pattern: No timeout at all. A single hung connection can exhaust the thread pool.
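Deadline propagation means each hop passes the *remaining* time budget downstream instead of its own fixed timeout. A minimal sketch (the class and service functions are illustrative; a real stack would carry the deadline in request metadata, e.g. a gRPC deadline or a header):

```python
import time

class Deadline:
    """A shrinking time budget shared across the whole call chain."""

    def __init__(self, budget_seconds):
        self.expires_at = time.monotonic() + budget_seconds

    def remaining(self):
        return self.expires_at - time.monotonic()

    def expired(self):
        return self.remaining() <= 0

def query_db(deadline):
    if deadline.expired():
        raise TimeoutError("deadline exceeded before DB call")
    # a real driver would take min(deadline.remaining(), 1.0) as its timeout
    return "rows"

def service_a(deadline):
    if deadline.expired():
        raise TimeoutError("deadline exceeded at service A")
    return query_db(deadline)  # pass the same shrinking budget down

print(service_a(Deadline(5.0)))  # rows
```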
Graceful Degradation
When the system is under stress, serve reduced functionality rather than failing entirely.
| Scenario | Degraded Response | Full Response |
|---|---|---|
| Recommendation service down | Show popular items | Personalized recommendations |
| Image service slow | Show placeholder/low-res | Full-res images |
| Search index stale | Serve slightly stale results | Fresh results |
| Payment gateway timeout | Queue order, process later | Instant confirmation |
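The first table row reduces to a try/except around the personalized path. A sketch with illustrative names, assuming a precomputed popular-items cache is always available:

```python
POPULAR_ITEMS = ["item-1", "item-2", "item-3"]  # precomputed, cached

def recommendations_for(user_id, fetch_personalized):
    """Serve popular items if the personalization call fails or times out."""
    try:
        return {"items": fetch_personalized(user_id), "degraded": False}
    except (TimeoutError, ConnectionError):
        # reduced functionality beats a failed page
        return {"items": POPULAR_ITEMS, "degraded": True}
```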
Load Shedding
When demand exceeds capacity, intentionally drop low-priority work to protect critical paths.
Incoming requests: ████████████████████ (150% capacity)
Without shedding: All requests slow/timeout (100% degraded)
With shedding: ██████████████ processed (critical)
██████ rejected with 503 (non-critical)
70% served well, 30% fast-failed
Priority tiers:
- Critical: Payment, auth, core API (never shed)
- Important: Search, recommendations (shed under extreme load)
- Best-effort: Analytics, prefetch, background sync (shed first)
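Shedding by tier can be sketched as a single admission check run before any real work. The tier mapping and load thresholds below are illustrative:

```python
TIER = {"payment": 0, "auth": 0, "search": 1, "analytics": 2}  # 0 = critical

def admit(request_kind, load_factor):
    """load_factor = current load / capacity (1.0 = at capacity)."""
    tier = TIER.get(request_kind, 2)  # unknown work is best-effort
    if load_factor < 1.0:
        return True                   # headroom: admit everything
    if load_factor < 1.3:
        return tier <= 1              # overloaded: shed best-effort first
    return tier == 0                  # extreme load: critical only

print(admit("payment", 1.5), admit("analytics", 1.1))  # True False
```

A rejected request should get a fast 503 (ideally with `Retry-After`) rather than queuing behind critical work.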
Health Checks
Shallow vs Deep
Shallow health check:
GET /health
Response: {"status": "ok"}

Deep health check:
GET /health?deep=true
Response: {
  "status": "degraded",
  "db": "ok",
  "cache": "ok",
  "payment_api": "timeout",
  "disk_space": "82%"
}
| Type | What It Checks | Use Case | Risk |
|---|---|---|---|
| Shallow (liveness) | Process is running | Kubernetes liveness probe | May report healthy when deps are down |
| Deep (readiness) | Dependencies are reachable | Kubernetes readiness probe | Expensive; can cascade failure if dep check is slow |
Best practice: Use shallow for liveness, deep for readiness. Never let a deep health check take longer than the probe timeout.
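A deep check is just an aggregation over per-dependency probes. A sketch (names are illustrative; each real probe should enforce its own short timeout so the whole check stays under the probe deadline):

```python
def deep_health(probes):
    """probes: {dependency_name: zero-arg probe fn that raises on failure}."""
    results, status = {}, "ok"
    for name, probe in probes.items():
        try:
            probe()                      # real probes need a hard time cap
            results[name] = "ok"
        except Exception:
            results[name] = "failing"
            status = "degraded"          # a failing dep degrades, not kills
    return {"status": status, "checks": results}
```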
Feature Flags for Reliability
if feature_flag("new_checkout_flow"):
return new_checkout()
else:
return legacy_checkout() # safe fallback
Reliability use cases:
- Kill switch: Instantly disable a feature causing incidents
- Gradual rollout: 1% -> 5% -> 25% -> 100% (catch issues early)
- A/B testing: Route traffic without deployments
- Circuit breaking: Disable integration with a failing third party
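The gradual-rollout case needs stable per-user bucketing, so the same user sees the same variant on every request. A sketch (function name is illustrative; hashing flag + user id keeps buckets independent across flags):

```python
import hashlib

def flag_enabled(flag_name, user_id, rollout_pct):
    """Deterministic bucket 0-99; enabled if the bucket is below rollout_pct."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_pct

# 0% doubles as a kill switch; 100% is fully rolled out
assert not flag_enabled("new_checkout_flow", "user-42", 0)
assert flag_enabled("new_checkout_flow", "user-42", 100)
```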
Blast Radius Reduction
Limit the impact of any single failure.
| Technique | How It Helps |
|---|---|
| Cell-based architecture | Each cell serves a subset of users independently |
| Regional isolation | Failure in us-east-1 doesn't affect eu-west-1 |
| Canary deployments | New code hits 1-5% of traffic first |
| Feature flags | Disable broken features without rollback |
| Bulkheads | Isolate resource pools per service |
| Shuffle sharding | Assign customers to overlapping-but-distinct resource sets |
Dependency Management
+-- Service B (critical) --+
Service A --------| |--> Response
+-- Service C (optional) ---+
If C is down:
- Don't fail the whole request
- Return partial response without C's data
- Log the degradation for visibility
Rules:
- Classify every dependency as critical or optional
- Critical dependencies get circuit breakers + retries
- Optional dependencies get aggressive timeouts + fallbacks
- Never let an optional dependency take down a critical path
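The critical/optional split from the diagram can be sketched as response assembly: B's failure propagates, C's failure yields a partial response plus a log line (service names are illustrative):

```python
import logging

log = logging.getLogger("dependencies")

def handle_request(call_b_critical, call_c_optional):
    response = {"core": call_b_critical()}      # B down -> request fails
    try:
        response["extras"] = call_c_optional()  # C is optional
    except Exception:
        log.warning("optional dependency C degraded; serving partial response")
        response["extras"] = None               # partial, not failed
    return response
```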
Interview Tips
- Always discuss failure modes in system design. "What happens when X goes down?" should be part of your answer before the interviewer asks.
- Error budgets show maturity. Mention the trade-off between shipping velocity and reliability.
- Circuit breaker + retry + timeout is a reliability triad. Know all three cold.
- Load shedding vs graceful degradation are different tools: shedding rejects requests, degradation serves reduced responses.
- SLIs before SLOs. You can't set targets without measuring first.
- Blast radius is a top concern at FAANG scale. Always discuss isolation strategies.
Resources
- Google SRE Book: Chapters on SLOs, error budgets, handling overload
- DDIA Chapter 8: The Trouble with Distributed Systems (faults, timeouts)
- Release It! - Michael Nygard (circuit breakers, bulkheads, stability patterns)
- Implementing SLOs - Alex Hidalgo
- Netflix Tech Blog: "Making the Netflix API More Resilient"
- AWS Architecture Blog: "Shuffle Sharding"