25 - Monitoring, Logging & Tracing
Previous: 24 - Reliability Engineering | Next: 26 - Security in Distributed Systems
Why This Matters in Interviews
"How would you debug a latency spike in production?" is a question that separates juniors from seniors. The three pillars of observability -- metrics, logs, and traces -- are how you answer. FAANG interviewers expect you to design monitoring into your system, not bolt it on.
The Three Pillars of Observability
+------------------+  +------------------+  +------------------+
|     METRICS      |  |       LOGS       |  |      TRACES      |
|                  |  |                  |  |                  |
| What is happening|  | Why it happened  |  | Where it happened|
| (aggregates)     |  | (events)         |  | (request path)   |
|                  |  |                  |  |                  |
| Prometheus       |  | ELK Stack        |  | Jaeger           |
| Grafana          |  | Loki             |  | Zipkin           |
| Datadog          |  | CloudWatch Logs  |  | OpenTelemetry    |
+------------------+  +------------------+  +------------------+
         |                     |                     |
         +---------------------+---------------------+
                               |
          [Correlation ID links them all together]
| Pillar | Data Type | Cardinality | Cost | Best For |
|---|---|---|---|---|
| Metrics | Numeric time series | Low (pre-aggregated) | Low | Alerting, dashboards, trends |
| Logs | Structured/unstructured text | High (per event) | High | Root cause analysis, audit trails |
| Traces | Request-scoped spans | Medium (sampled) | Medium | Latency analysis, dependency mapping |
Metrics
RED Method (Request-Driven)
For services (anything that handles requests):
| Metric | What to Measure | Alert When |
|---|---|---|
| Rate | Requests per second | Sudden drop or spike |
| Errors | Failed requests per second (or error %) | Error rate > threshold |
| Duration | Latency distribution (p50, p95, p99) | p99 > SLO target |
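The Duration row is about distributions, not averages: a single mean hides tail latency. As a minimal stdlib sketch (sample values are illustrative), the nearest-rank method behind p50/p95/p99 looks like this:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: p in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # nearest rank = ceil(p/100 * n), clamped to a valid 1-based index
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# A few slow outliers barely move p50 but dominate p99.
latencies_ms = [12, 15, 11, 300, 14, 13, 16, 12, 500, 14]
p50 = percentile(latencies_ms, 50)   # typical request
p99 = percentile(latencies_ms, 99)   # tail request
```

In production the TSDB computes these from histogram buckets rather than raw samples, but the p50-vs-p99 gap this illustrates is exactly why the Duration alert targets p99, not the mean.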
USE Method (Resource-Driven)
For infrastructure (CPU, memory, disk, network):
| Metric | What to Measure | Alert When |
|---|---|---|
| Utilization | % of resource capacity in use | > 80% sustained |
| Saturation | Queue depth / backlog | Growing unboundedly |
| Errors | Error count on the resource | Any hardware errors |
Prometheus + Grafana Stack
+----------+    scrape      +------------+    query     +---------+
| Services | -------------> | Prometheus | -----------> | Grafana |
| /metrics | (pull-based)   | (TSDB)     |  (PromQL)    | (viz)   |
+----------+                +------------+              +---------+
                                  |
                           +--------------+
                           | Alertmanager |
                           | (routing,    |
                           |  dedup,      |
                           |  silence)    |
                           +--------------+
Why pull-based (Prometheus)?
- Service doesn't need to know about monitoring infrastructure
- Prometheus controls scrape rate (no push storms)
- Easy to detect if a target is down (scrape fails)
Key Metric Types
| Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing value | Total requests served |
| Gauge | Value that goes up and down | Current active connections |
| Histogram | Distribution of values in buckets | Request latency percentiles |
| Summary | Quantiles pre-computed by the client | Request latency quantiles reported directly (no server-side aggregation) |
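The first three types can be sketched as minimal in-memory classes (names and bucket boundaries here are illustrative; a real client such as prometheus_client adds labels, registries, and an exposition format):

```python
class Counter:
    """Monotonically increasing value, e.g. total requests served."""
    def __init__(self):
        self.value = 0

    def inc(self, n=1):
        if n < 0:
            raise ValueError("counters only go up")
        self.value += n

class Gauge:
    """Value that goes up and down, e.g. current active connections."""
    def __init__(self):
        self.value = 0

    def set(self, v):
        self.value = v

class Histogram:
    """Counts observations into fixed buckets, e.g. latency in ms."""
    def __init__(self, buckets=(5, 10, 25, 50, 100, 250, 500)):
        self.buckets = buckets
        self.counts = [0] * (len(buckets) + 1)  # final slot = +Inf bucket
        self.total = 0.0

    def observe(self, v):
        self.total += v
        for i, bound in enumerate(self.buckets):
            if v <= bound:
                self.counts[i] += 1
                return
        self.counts[-1] += 1  # larger than every bound
```

The histogram is the workhorse: because it ships only bucket counts, the server can aggregate across instances and derive percentiles, which a pre-computed summary cannot offer.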
Structured Logging
Why Structured?
Unstructured (hard to query):
"2024-03-15 10:23:45 ERROR User 12345 failed to checkout: payment timeout"
Structured (easy to query):
{
  "timestamp": "2024-03-15T10:23:45Z",
  "level": "ERROR",
  "service": "checkout",
  "user_id": "12345",
  "action": "checkout",
  "error": "payment_timeout",
  "correlation_id": "abc-123-def",
  "latency_ms": 5023
}
Structured logs let you filter: service=checkout AND error=payment_timeout AND latency_ms > 5000
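A minimal sketch of emitting one JSON object per log line with the stdlib (field names mirror the example above; the hard-coded service name and the `fields` attribute are illustrative choices, not a standard):

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "checkout",   # normally injected per service, hard-coded here
            "message": record.getMessage(),
        }
        # Merge structured fields attached via logger's `extra` parameter.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("checkout failed", extra={"fields": {
    "user_id": "12345", "error": "payment_timeout", "latency_ms": 5023,
}})
```

Because every record is a flat JSON object, the log pipeline can index each key, which is what makes the `service=checkout AND latency_ms > 5000` style of query possible.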
ELK Stack
+----------+    +-----------+    +---------------+    +------------+
| Services | -> | Logstash  | -> | Elasticsearch | -> | Kibana     |
| (stdout) |    | (parse,   |    | (index,       |    | (search,   |
|          |    |  enrich,  |    |  store,       |    |  visualize)|
|          |    |  route)   |    |  full-text)   |    |            |
+----------+    +-----------+    +---------------+    +------------+
Alternative lightweight pipeline:
Services -> Filebeat -> Elasticsearch -> Kibana
Services -> Fluentd -> Loki -> Grafana (cheaper)
Correlation IDs
A single ID that follows a request across every service it touches.
User request --> API Gateway (generates correlation_id: "abc-123")
                      |
                      +--> Auth Service      (logs with correlation_id: "abc-123")
                      +--> Order Service     (logs with correlation_id: "abc-123")
                      +--> Payment Service   (logs with correlation_id: "abc-123")
                      +--> Inventory Service (logs with correlation_id: "abc-123")
Implementation: Propagate via HTTP header (e.g., X-Request-ID) or gRPC metadata. Every log line includes it.
Query: correlation_id = "abc-123" returns the full story of one request across all services.
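A sketch of the propagation side using `contextvars`, so every log call on the request path can read the ID without threading it through function signatures (the header name follows the text; the handler shape is illustrative):

```python
import uuid
from contextvars import ContextVar

# One context variable per process; each request's value is isolated.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

def handle_request(headers: dict) -> str:
    # Reuse the inbound ID if a caller already minted one; otherwise
    # generate it at the edge (the API gateway in the diagram above).
    cid = headers.get("X-Request-ID") or str(uuid.uuid4())
    correlation_id.set(cid)
    return call_downstream()

def call_downstream() -> str:
    # Any code on this request path can read the ID implicitly.
    return f"correlation_id={correlation_id.get()}"
```

When calling other services, the same value is forwarded as the outgoing `X-Request-ID` header, which is what stitches the logs of all five services into one story.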
Distributed Tracing
Concepts: Traces and Spans
Trace (full request journey):
|<--------------------------- Trace abc-123 ---------------------------->|
Spans (individual operations):
|-- API Gateway (12ms) --|
  |-- Auth Service (3ms) --|
  |-- Order Service (45ms) --------------------------------------|
    |-- DB Query (8ms) --|
    |-- Payment Service (30ms) ----------------------|
      |-- Stripe API (25ms) -------------------|
    |-- Inventory Service (5ms) --|
| Concept | Definition |
|---|---|
| Trace | End-to-end journey of a single request, identified by a trace ID |
| Span | A single unit of work within a trace (service call, DB query, etc.) |
| Parent span | The span that initiated the current span |
| Span context | Trace ID + span ID + flags, propagated across service boundaries |
| Baggage | Key-value pairs propagated through the trace (user_id, tenant_id) |
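Span context crosses service boundaries as the W3C Trace Context `traceparent` header: version, 16-byte trace ID, 8-byte parent span ID, and flags, all hex, joined by dashes. A sketch of encoding and parsing it (helper names are illustrative):

```python
import os

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    # version "00" - 32-hex trace ID - 16-hex span ID - 2-hex flags
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    return {"trace_id": trace_id, "span_id": span_id, "sampled": flags == "01"}

def new_child_span(parent_header: str) -> str:
    # A child span keeps the trace ID (same trace) but mints a fresh
    # span ID; the parent's span ID becomes its parent reference.
    ctx = parse_traceparent(parent_header)
    child_span_id = os.urandom(8).hex()
    return make_traceparent(ctx["trace_id"], child_span_id, ctx["sampled"])
```

Note that the sampled flag rides along in the header, which is how a head-based sampling decision made at the gateway is honored by every downstream service.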
OpenTelemetry
The emerging standard for instrumentation across all three pillars.
+-------------+      +------------------+      +--------------+
| Application | ---> | OTel Collector   | ---> | Backend      |
| (SDK)       |      | (receive,        |      | - Jaeger     |
| - traces    |      |  process,        |      | - Zipkin     |
| - metrics   |      |  export)         |      | - Prometheus |
| - logs      |      +------------------+      | - Loki       |
+-------------+                                +--------------+
Why OpenTelemetry?
- Vendor-neutral instrumentation (switch backends without code changes)
- Auto-instrumentation for common frameworks (HTTP, gRPC, DB clients)
- Unified SDK for metrics + traces + logs
- W3C Trace Context standard for propagation
Jaeger vs Zipkin
| Feature | Jaeger | Zipkin |
|---|---|---|
| Origin | Uber | Twitter |
| Language | Go | Java |
| Storage | Cassandra, Elasticsearch, Kafka | Cassandra, Elasticsearch, MySQL |
| Adaptive sampling | Yes | No (fixed-rate) |
| Architecture | Collector + Query + Agent | Single binary or distributed |
| OpenTelemetry | Native support | Via exporter |
Sampling Strategies
At FAANG scale, tracing every request is too expensive. Sampling reduces volume.
| Strategy | Description | Trade-off |
|---|---|---|
| Head-based (probabilistic) | Decide at request start: trace 1% | Simple; misses rare errors |
| Tail-based | Decide after request completes | Captures errors/slow requests; higher resource cost |
| Adaptive | Adjust rate based on traffic volume | Balances cost and coverage |
| Priority | Always trace certain paths (checkout, auth) | Ensures critical path visibility |
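Head-based sampling works best when the decision is deterministic on the trace ID, so every service in the call path agrees without coordination. A sketch combining that with the priority strategy from the table (the hashing scheme, rate, and route list are illustrative):

```python
import hashlib

def should_sample(trace_id: str, rate_percent: float = 1.0) -> bool:
    # Hash the trace ID into [0, 10000) so the same ID always gets the
    # same verdict; rate_percent=1.0 keeps ~1% of traces.
    h = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return (h % 10000) < rate_percent * 100

def sample_decision(trace_id: str, route: str) -> bool:
    # Priority sampling: always trace critical paths regardless of rate.
    if route in {"/checkout", "/auth"}:
        return True
    return should_sample(trace_id, rate_percent=1.0)
```

Tail-based sampling inverts this: buffer all spans until the request finishes, then keep the trace only if it was slow or errored, which catches rare failures at the cost of buffering every in-flight request.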
Alerting Best Practices
Signal vs Noise
Good alert:
- Actionable
- Tied to user impact
- Has a runbook
- Fires rarely but meaningfully

Bad alert:
- "CPU at 75%" (so what?)
- "Disk at 80%" (for 2 seconds)
- Fires 50 times/day
- No one knows what to do
Alerting Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Alert fatigue | Too many alerts, team ignores them | Reduce to symptoms, not causes |
| Missing context | "Error rate high" with no details | Include service, endpoint, error type |
| No runbook | On-call doesn't know what to do | Every alert links to a runbook |
| Alerting on causes | "Pod restarted" (Kubernetes handles this) | Alert on user-facing symptoms |
| No severity levels | Everything pages the same way | P1 = page, P2 = Slack, P3 = ticket |
Alert Hierarchy
Page (wake someone up):
- SLO is burning fast (error budget consumption > 5x normal rate)
- User-facing errors > threshold for > 5 minutes
Notify (Slack/email):
- SLO is burning slowly (elevated but not critical)
- Background job failure rate elevated
Ticket (fix this week):
- Approaching capacity limits
- Non-critical dependency degradation
Dashboards
Four Golden Signals Dashboard (per service)
+-----------------------------+-----------------------------+
| Latency (p50, p95, p99)     | Traffic (requests/sec)      |
| [line chart over time]      | [line chart over time]      |
+-----------------------------+-----------------------------+
| Error Rate (%)              | Saturation (CPU, mem, queue)|
| [line chart + threshold]    | [gauge charts]              |
+-----------------------------+-----------------------------+
Dashboard Design Rules
- Top-level: Business metrics (orders/min, active users)
- Service-level: RED metrics per service
- Infrastructure: USE metrics per node/pod
- Drill-down: Click through from symptom to cause
SLI/SLO Monitoring
SLO: 99.9% of requests < 500ms over 30 days
Dashboard:
  Current SLI: 99.87% (below target!)
  Error budget: 9.5 min remaining this month (of 43.2 min total)
  Burn rate: 2.3x normal (will exhaust in ~3 days)
  [============================--------] 78% budget consumed
Multi-window, multi-burn-rate alerting:
- Fast burn (10x) over 5 min: page immediately
- Moderate burn (2x) over 6 hours: notify
- Slow burn (1.5x) over 3 days: create ticket
This approach from the Google SRE book minimizes false positives while catching real SLO breaches.
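The arithmetic behind the dashboard is worth being able to do on a whiteboard. A sketch for the 99.9% / 30-day SLO above (function names are illustrative):

```python
WINDOW_MIN = 30 * 24 * 60  # 30-day window in minutes (43,200)

def error_budget_minutes(slo: float) -> float:
    # 99.9% SLO -> 0.1% of the window may be bad: 43.2 minutes.
    return (1 - slo) * WINDOW_MIN

def burn_rate(bad_minutes: float, elapsed_minutes: float, slo: float) -> float:
    # Burn rate 1.0 = consuming budget at exactly the pace that
    # exhausts it when the window ends; 10x = ten times that pace.
    budget_fraction = error_budget_minutes(slo) / WINDOW_MIN
    return (bad_minutes / elapsed_minutes) / budget_fraction
```

The multi-window rule then reads directly off this function: page when the burn rate over a short window (say 5 minutes) exceeds 10, notify at 2x over 6 hours, ticket at 1.5x over 3 days.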
Anomaly Detection
| Approach | How It Works | Good For |
|---|---|---|
| Static thresholds | Alert if value > X | Stable, predictable metrics |
| Standard deviation | Alert if > N sigma from rolling mean | Metrics with known distribution |
| Seasonal decomposition | Model daily/weekly patterns, alert on deviation | Traffic with time-of-day patterns |
| ML-based | Train model on normal behavior | Complex, multi-dimensional signals |
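The standard-deviation row of the table can be sketched as a rolling detector with the stdlib (window size and sigma threshold are illustrative; production systems would also handle warm-up and seasonality):

```python
import statistics
from collections import deque

class SigmaDetector:
    """Flags values more than n_sigma from the rolling mean."""
    def __init__(self, window: int = 60, n_sigma: float = 3.0):
        self.history = deque(maxlen=window)  # recent observations only
        self.n_sigma = n_sigma

    def observe(self, value: float) -> bool:
        """Returns True if `value` is anomalous vs. the rolling window."""
        anomalous = False
        if len(self.history) >= 2:
            mean = statistics.mean(self.history)
            stdev = statistics.stdev(self.history)
            if stdev > 0 and abs(value - mean) > self.n_sigma * stdev:
                anomalous = True
        self.history.append(value)
        return anomalous
```

This is exactly where the "known distribution" caveat bites: on traffic with strong time-of-day patterns the rolling mean lags the daily cycle and fires false positives, which is what the seasonal-decomposition row addresses.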
On-Call Practices
| Practice | Why |
|---|---|
| Runbooks for every alert | Reduce cognitive load at 3 AM |
| Blameless postmortems | Encourage honest root cause analysis |
| On-call rotations | Prevent burnout (1 week on, N weeks off) |
| Escalation paths | Clear chain when primary can't resolve |
| Incident commander role | One person coordinates during major incidents |
| Follow-the-sun | Distribute on-call across time zones |
Interview Tips
- "How would you debug this?" Always answer: metrics first (what is happening?), then traces (where?), then logs (why?).
- Correlation IDs are essential. Mention them in any microservices design.
- Sampling trade-offs show depth. Don't claim you trace every request at 1M QPS.
- Alert on symptoms, not causes. "Users see errors" > "Pod restarted."
- SLO burn rate alerting is the gold standard. Reference it for bonus points.
- OpenTelemetry is the current industry direction. Show you know the convergence story.
Common Interview Questions
- "Your p99 latency spiked 3x. Walk me through how you investigate."
- "Design a monitoring system for a microservices architecture."
- "How do you avoid alert fatigue?"
- "What's the difference between monitoring and observability?"
- "How would you trace a request through 10 microservices?"
Resources
- Google SRE Book: Chapter 6 (Monitoring), Chapter 4 (SLOs)
- Distributed Systems Observability - Cindy Sridharan (free e-book)
- DDIA Chapter 1: Reliable, scalable, and maintainable applications
- OpenTelemetry documentation: https://opentelemetry.io/docs/
- Grafana Labs blog: "The RED Method" and "The USE Method"
- Google SRE Workbook: "Alerting on SLOs"
Previous: 24 - Reliability Engineering | Next: 26 - Security in Distributed Systems