25 - Monitoring, Logging & Tracing

Previous: 24 - Reliability Engineering | Next: 26 - Security in Distributed Systems


Why This Matters in Interviews

"How would you debug a latency spike in production?" is a question that separates juniors from seniors. The three pillars of observability -- metrics, logs, and traces -- are how you answer. FAANG interviewers expect you to design monitoring into your system, not bolt it on.


The Three Pillars of Observability

+------------------+    +------------------+    +------------------+
|     METRICS      |    |      LOGS        |    |     TRACES       |
|                  |    |                  |    |                  |
| What is happening|    | Why it happened  |    | Where it happened|
| (aggregates)     |    | (events)         |    | (request path)   |
|                  |    |                  |    |                  |
| Prometheus       |    | ELK Stack        |    | Jaeger           |
| Grafana          |    | Loki             |    | Zipkin           |
| Datadog          |    | CloudWatch Logs  |    | OpenTelemetry    |
+------------------+    +------------------+    +------------------+
        |                       |                       |
        +----------+------------+-----------+-----------+
                   |                        |
            [Correlation ID links them all together]

Pillar  | Data Type                    | Cardinality          | Cost   | Best For
--------|------------------------------|----------------------|--------|-------------------------------------
Metrics | Numeric time series          | Low (pre-aggregated) | Low    | Alerting, dashboards, trends
Logs    | Structured/unstructured text | High (per event)     | High   | Root cause analysis, audit trails
Traces  | Request-scoped spans         | Medium (sampled)     | Medium | Latency analysis, dependency mapping

Metrics

RED Method (Request-Driven)

For services (anything that handles requests):

Metric   | What to Measure                         | Alert When
---------|-----------------------------------------|-----------------------
Rate     | Requests per second                     | Sudden drop or spike
Errors   | Failed requests per second (or error %) | Error rate > threshold
Duration | Latency distribution (p50, p95, p99)    | p99 > SLO target
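
As a concrete illustration, the RED numbers for one service can be computed from a window of raw request records. This is a stdlib-only sketch with invented sample data, not a real metrics library:

```python
from statistics import quantiles

# Hypothetical one-minute window of (latency_ms, failed) request records.
window_seconds = 60
requests = [(90, False), (110, False), (130, False), (150, False),
            (200, False), (250, False), (400, False), (900, False),
            (1200, True), (5100, True)]

# Rate: requests per second over the window.
rate = len(requests) / window_seconds

# Errors: percentage of requests that failed.
error_pct = 100 * sum(1 for _, failed in requests if failed) / len(requests)

# Duration: latency percentiles from the raw samples.
latencies = sorted(ms for ms, _ in requests)
cuts = quantiles(latencies, n=100)          # 99 cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"rate={rate:.2f} req/s  errors={error_pct:.0f}%  "
      f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```

In production the aggregation happens in the metrics backend, but the semantics are the same: Rate and Errors are counters differentiated over time, Duration comes from a latency distribution.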

USE Method (Resource-Driven)

For infrastructure (CPU, memory, disk, network):

Metric      | What to Measure               | Alert When
------------|-------------------------------|--------------------
Utilization | % of resource capacity in use | > 80% sustained
Saturation  | Queue depth / backlog         | Growing unboundedly
Errors      | Error count on the resource   | Any hardware errors

Prometheus + Grafana Stack

+----------+     scrape      +------------+     query     +---------+
| Services | --------------> | Prometheus | ------------> | Grafana |
| /metrics |   (pull-based)  | (TSDB)     |   (PromQL)    | (viz)   |
+----------+                 +------------+               +---------+
                                   |
                            +--------------+
                            | Alertmanager |
                            | (routing,    |
                            |  dedup,      |
                            |  silence)    |
                            +--------------+

Why pull-based (Prometheus)?

  • Service doesn't need to know about monitoring infrastructure
  • Prometheus controls scrape rate (no push storms)
  • Easy to detect if a target is down (scrape fails)

Key Metric Types

Type      | Description                        | Example
----------|------------------------------------|--------------------------------------
Counter   | Monotonically increasing value     | Total requests served
Gauge     | Value that goes up and down        | Current active connections
Histogram | Distribution of values in buckets  | Request latency percentiles
Summary   | Client-side calculated percentiles | Similar to histogram but pre-computed
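
The histogram type is worth seeing concretely. The sketch below imitates the cumulative-bucket layout Prometheus histograms use and estimates percentiles by interpolating inside a bucket, roughly what `histogram_quantile()` does; the bucket bounds and observations are invented for illustration:

```python
# Hypothetical latency buckets (upper bounds in ms), Prometheus-style:
# each bucket is cumulative, counting observations <= its bound.
bounds = [50, 100, 250, 500, 1000, float("inf")]
counts = [0] * len(bounds)
total = 0

def observe(ms):
    """Record one observation into every bucket that covers it."""
    global total
    total += 1
    for i, ub in enumerate(bounds):
        if ms <= ub:
            counts[i] += 1

for ms in [30, 70, 80, 120, 180, 300, 450, 700, 950, 2000]:
    observe(ms)

def estimate_quantile(q):
    """Linearly interpolate inside the bucket holding the q-th observation."""
    rank = q * total
    for i, c in enumerate(counts):
        if c >= rank:
            lo = bounds[i - 1] if i else 0
            prev = counts[i - 1] if i else 0
            if bounds[i] == float("inf"):
                return lo  # quantile falls in the open-ended +Inf bucket
            return lo + (bounds[i] - lo) * (rank - prev) / (c - prev)

print("p50 ~", estimate_quantile(0.5))   # interpolated, not exact
print("p95 ~", estimate_quantile(0.95))
```

This is why bucket bounds matter: the estimate can never be more precise than the bucket the quantile lands in.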

Structured Logging

Why Structured?

Unstructured (hard to query):
  "2024-03-15 10:23:45 ERROR User 12345 failed to checkout: payment timeout"

Structured (easy to query):
  {
    "timestamp": "2024-03-15T10:23:45Z",
    "level": "ERROR",
    "service": "checkout",
    "user_id": "12345",
    "action": "checkout",
    "error": "payment_timeout",
    "correlation_id": "abc-123-def",
    "latency_ms": 5023
  }

Structured logs let you filter: service=checkout AND error=payment_timeout AND latency_ms > 5000
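
One way to emit structured logs like the example above is a custom JSON formatter on top of Python's standard `logging` module. The service name and the `fields` convention here are illustrative choices, not a standard API:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, so a log pipeline
    can filter on fields instead of grepping free text."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "checkout",  # hypothetical service name
            "message": record.getMessage(),
        }
        # Extra structured fields attached by the caller, if any.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.error("checkout failed", extra={"fields": {
    "user_id": "12345", "error": "payment_timeout",
    "correlation_id": "abc-123-def", "latency_ms": 5023}})
```

One JSON object per line is the shape most shippers (Filebeat, Fluentd) expect on stdout.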

ELK Stack

+----------+    +-----------+    +---------------+    +-------------+
| Services | -> | Logstash  | -> | Elasticsearch | -> | Kibana      |
| (stdout) |    | (parse,   |    | (index,       |    | (search,    |
|          |    |  enrich,  |    |  store,       |    |  visualize) |
|          |    |  route)   |    |  full-text)   |    |             |
+----------+    +-----------+    +---------------+    +-------------+

Alternative lightweight pipeline:
  Services -> Filebeat -> Elasticsearch -> Kibana
  Services -> Fluentd -> Loki -> Grafana (cheaper)

Correlation IDs

A single ID that follows a request across every service it touches.

User request --> API Gateway (generates correlation_id: "abc-123")
  |
  +--> Auth Service    (logs with correlation_id: "abc-123")
  +--> Order Service   (logs with correlation_id: "abc-123")
       +--> Payment Service (logs with correlation_id: "abc-123")
       +--> Inventory Service (logs with correlation_id: "abc-123")

Implementation: Propagate via HTTP header (e.g., X-Request-ID) or gRPC metadata. Every log line includes it.

Query: correlation_id = "abc-123" returns the full story of one request across all services.
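
A minimal in-process sketch of the propagation pattern, assuming a `contextvars`-based approach (the function names are hypothetical):

```python
import contextvars
import uuid

# The correlation ID travels in a context variable so every log line
# written during this request can include it.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

def log(msg):
    print(f'correlation_id={correlation_id.get()} msg="{msg}"')

def call_downstream():
    # Outgoing calls forward the same ID as a header.
    headers = {"X-Request-ID": correlation_id.get()}
    log(f"calling payment service with {headers}")

def handle_request(incoming_header=None):
    # Reuse the caller's X-Request-ID if present; otherwise mint one at
    # the edge (the API gateway's job in the diagram above).
    cid = incoming_header or str(uuid.uuid4())
    correlation_id.set(cid)
    log("request received")
    call_downstream()
    return cid

handle_request("abc-123")
```

In a real service this lives in middleware: read the header on the way in, set the context variable, and inject it into both log records and outgoing request headers.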


Distributed Tracing

Concepts: Traces and Spans

Trace (full request journey):
|<--------------------------- Trace abc-123 ---------------------------->|

Spans (individual operations):
|-- API Gateway (12ms) --|
   |-- Auth Service (3ms) --|
   |-- Order Service (45ms) --------------------------------------|
      |-- DB Query (8ms) --|
      |-- Payment Service (30ms) ----------------------|
         |-- Stripe API (25ms) -------------------|
      |-- Inventory Service (5ms) --|

Concept      | Definition
-------------|----------------------------------------------------------------------
Trace        | End-to-end journey of a single request, identified by a trace ID
Span         | A single unit of work within a trace (service call, DB query, etc.)
Parent span  | The span that initiated the current span
Span context | Trace ID + span ID + flags, propagated across service boundaries
Baggage      | Key-value pairs propagated through the trace (user_id, tenant_id)
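
These concepts can be made concrete with a toy model (not a real tracing SDK, just the relationships between trace ID, span ID, parent span, and propagated context):

```python
import itertools

_ids = itertools.count(1)

class Span:
    """Toy span: one unit of work, linked to its parent within a trace."""
    def __init__(self, name, context=None):
        self.name = name
        self.span_id = next(_ids)
        if context:  # continue an existing trace
            self.trace_id, self.parent_id = context
        else:        # root span: start a new trace
            self.trace_id, self.parent_id = next(_ids) * 1000, None

    def context(self):
        # Span context = trace ID + span ID: what crosses service boundaries.
        return (self.trace_id, self.span_id)

# The API gateway starts the trace; downstream services continue it.
gateway = Span("api-gateway")
order = Span("order-service", context=gateway.context())
payment = Span("payment-service", context=order.context())

print(payment.name, "trace:", payment.trace_id, "parent span:", payment.parent_id)
```

All three spans share one trace ID, and each child records which span initiated it, which is exactly what lets a tracing backend reassemble the waterfall diagram above.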

OpenTelemetry

The emerging standard for instrumentation across all three pillars.

+-------------+      +------------------+      +-------------+
| Application | ---> | OTel Collector   | ---> | Backend     |
| (SDK)       |      | (receive,        |      | - Jaeger    |
| - traces    |      |  process,        |      | - Zipkin    |
| - metrics   |      |  export)         |      | - Prometheus|
| - logs      |      +------------------+      | - Loki      |
+-------------+                                +-------------+

Why OpenTelemetry?

  • Vendor-neutral instrumentation (switch backends without code changes)
  • Auto-instrumentation for common frameworks (HTTP, gRPC, DB clients)
  • Unified SDK for metrics + traces + logs
  • W3C Trace Context standard for propagation

Jaeger vs Zipkin

Feature           | Jaeger                          | Zipkin
------------------|---------------------------------|--------------------------------
Origin            | Uber                            | Twitter
Language          | Go                              | Java
Storage           | Cassandra, Elasticsearch, Kafka | Cassandra, Elasticsearch, MySQL
Adaptive sampling | Yes                             | No (fixed-rate)
Architecture      | Collector + Query + Agent       | Single binary or distributed
OpenTelemetry     | Native support                  | Via exporter

Sampling Strategies

At FAANG scale, tracing every request is too expensive. Sampling reduces volume.

Strategy                   | Description                                  | Trade-off
---------------------------|----------------------------------------------|----------------------------------------------------
Head-based (probabilistic) | Decide at request start: trace 1%            | Simple; misses rare errors
Tail-based                 | Decide after request completes               | Captures errors/slow requests; higher resource cost
Adaptive                   | Adjust rate based on traffic volume          | Balances cost and coverage
Priority                   | Always trace certain paths (checkout, auth)  | Ensures critical path visibility
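
Head-based sampling is often implemented by hashing the trace ID rather than rolling a fresh random number per service, so every hop reaches the same keep/drop decision and sampled traces arrive complete. A sketch:

```python
import hashlib

def head_sample(trace_id: str, rate: float = 0.01) -> bool:
    """Head-based sampling decision: hash the trace ID into [0, 1) and
    keep the trace if it lands below the rate. Hashing makes the
    decision deterministic per trace across all services."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16)
    return bucket / 0x1_0000_0000 < rate

kept = sum(head_sample(f"trace-{i}") for i in range(100_000))
print(f"kept {kept} of 100,000 traces (~1%)")
```

In practice the decision is usually made once at the edge and carried in the span context's sampled flag, but hash-based deciders are a common fallback when the flag is missing.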

Alerting Best Practices

Signal vs Noise

Good alert:                          Bad alert:
- Actionable                         - "CPU at 75%" (so what?)
- Tied to user impact                - "Disk at 80%" (for 2 seconds)
- Has a runbook                      - Fires 50 times/day
- Fires rarely but meaningfully      - No one knows what to do

Alerting Anti-Patterns

Anti-Pattern       | Problem                                    | Fix
-------------------|--------------------------------------------|---------------------------------------
Alert fatigue      | Too many alerts, team ignores them         | Reduce to symptoms, not causes
Missing context    | "Error rate high" with no details          | Include service, endpoint, error type
No runbook         | On-call doesn't know what to do            | Every alert links to a runbook
Alerting on causes | "Pod restarted" (Kubernetes handles this)  | Alert on user-facing symptoms
No severity levels | Everything pages the same way              | P1 = page, P2 = Slack, P3 = ticket

Alert Hierarchy

Page (wake someone up):
  - SLO is burning fast (error budget consumption > 5x normal rate)
  - User-facing errors > threshold for > 5 minutes

Notify (Slack/email):
  - SLO is burning slowly (elevated but not critical)
  - Background job failure rate elevated

Ticket (fix this week):
  - Approaching capacity limits
  - Non-critical dependency degradation

Dashboards

Four Golden Signals Dashboard (per service)

+-----------------------------+-----------------------------+
|  Latency (p50, p95, p99)    |  Traffic (requests/sec)     |
|  [line chart over time]     |  [line chart over time]     |
+-----------------------------+-----------------------------+
|  Error Rate (%)             |  Saturation (CPU/mem/queue) |
|  [line chart + threshold]   |  [gauge charts]             |
+-----------------------------+-----------------------------+

Dashboard Design Rules

  1. Top-level: Business metrics (orders/min, active users)
  2. Service-level: RED metrics per service
  3. Infrastructure: USE metrics per node/pod
  4. Drill-down: Click through from symptom to cause

SLI/SLO Monitoring

SLO: 99.9% of requests < 500ms over 30 days

Dashboard:
  Current SLI:          99.87%  (below target!)
  Error budget:         21.6 min remaining this month (of 43.2 min total)
  Burn rate:            2.3x normal  (will exhaust in ~6.5 days)

  [==================------------------]  50% budget consumed

Multi-window, multi-burn-rate alerting:

  • Fast burn (10x) over 5 min: page immediately
  • Moderate burn (2x) over 6 hours: notify
  • Slow burn (1.5x) over 3 days: create ticket

This approach, from the Google SRE Workbook's "Alerting on SLOs" chapter, minimizes false positives while catching real SLO breaches.
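
The arithmetic behind burn-rate alerting is simple enough to check by hand; this sketch works through the 99.9% / 30-day example:

```python
# Error-budget arithmetic for a 99.9% SLO over a 30-day window.
window_min = 30 * 24 * 60              # 43,200 minutes in the window
budget_min = window_min * (1 - 0.999)  # 43.2 "bad minutes" allowed

def budget_consumed(burn_rate, hours):
    """Fraction of the whole 30-day budget consumed by sustaining
    burn_rate for the given hours (1x uses it up in exactly 30 days)."""
    return burn_rate * hours / (30 * 24)

# A sustained 10x burn exhausts the entire budget in 3 days,
# which is why a fast burn pages immediately:
days_at_10x = 30 / 10
print(f"budget: {budget_min:.1f} min; 10x burn exhausts it in {days_at_10x:.0f} days")
print(f"10x sustained for 6h consumes {budget_consumed(10, 6):.0%} of the budget")
```

The multi-window rule follows directly: a burn rate that would exhaust the budget in days must page now, while one that would take weeks can wait for a ticket.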


Anomaly Detection

Approach               | How It Works                                    | Good For
-----------------------|-------------------------------------------------|------------------------------------
Static thresholds      | Alert if value > X                              | Stable, predictable metrics
Standard deviation     | Alert if > N sigma from rolling mean            | Metrics with known distribution
Seasonal decomposition | Model daily/weekly patterns, alert on deviation | Traffic with time-of-day patterns
ML-based               | Train model on normal behavior                  | Complex, multi-dimensional signals
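
The standard-deviation approach fits in a few lines. This is a rolling z-score sketch; the window size, sigma threshold, and readings are illustrative:

```python
from collections import deque
from statistics import mean, stdev

def make_detector(window=20, n_sigma=3.0):
    """Rolling z-score detector: flag a point more than n_sigma standard
    deviations away from the mean of the previous `window` points."""
    history = deque(maxlen=window)
    def check(value):
        anomalous = (len(history) == window
                     and stdev(history) > 0
                     and abs(value - mean(history)) > n_sigma * stdev(history))
        history.append(value)
        return anomalous
    return check

check = make_detector()
# Steady traffic around 100 req/s, then a spike (invented data).
readings = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98,
            101, 99, 100, 102, 98, 101, 99, 103, 97, 100, 400]
flags = [check(v) for v in readings]
print("anomaly at index", flags.index(True))
```

Note the trade-off from the table: this works only when the metric's distribution is roughly stable; a daily traffic cycle would need seasonal decomposition first.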

On-Call Practices

Practice                 | Why
-------------------------|-----------------------------------------------
Runbooks for every alert | Reduce cognitive load at 3 AM
Blameless postmortems    | Encourage honest root cause analysis
On-call rotations        | Prevent burnout (1 week on, N weeks off)
Escalation paths         | Clear chain when primary can't resolve
Incident commander role  | One person coordinates during major incidents
Follow-the-sun           | Distribute on-call across time zones

Interview Tips

  1. "How would you debug this?" Always answer: metrics first (what is happening?), then traces (where?), then logs (why?).
  2. Correlation IDs are essential. Mention them in any microservices design.
  3. Sampling trade-offs show depth. Don't claim you trace every request at 1M QPS.
  4. Alert on symptoms, not causes. "Users see errors" > "Pod restarted."
  5. SLO burn rate alerting is the gold standard. Reference it for bonus points.
  6. OpenTelemetry is the current industry direction. Show you know the convergence story.

Common Interview Questions

  • "Your p99 latency spiked 3x. Walk me through how you investigate."
  • "Design a monitoring system for a microservices architecture."
  • "How do you avoid alert fatigue?"
  • "What's the difference between monitoring and observability?"
  • "How would you trace a request through 10 microservices?"

Resources

  • Google SRE Book: Chapter 6 (Monitoring), Chapter 4 (SLOs)
  • Distributed Systems Observability - Cindy Sridharan (free e-book)
  • DDIA Chapter 1: Reliability, scalability, maintainability
  • OpenTelemetry documentation: https://opentelemetry.io/docs/
  • Grafana Labs blog: "The RED Method" and "The USE Method"
  • Google SRE Workbook: "Alerting on SLOs"

Previous: 24 - Reliability Engineering | Next: 26 - Security in Distributed Systems