25 - Monitoring, Logging & Tracing

Previous: 24 - Reliability Engineering | Next: 26 - Security in Distributed Systems


Why This Matters in Interviews

"How would you debug a latency spike in production?" is a question that separates juniors from seniors. The three pillars of observability -- metrics, logs, and traces -- are how you answer. FAANG interviewers expect you to design monitoring into your system, not bolt it on.


The Three Pillars of Observability

+------------------+    +------------------+    +------------------+
|     METRICS      |    |      LOGS        |    |     TRACES       |
|                  |    |                  |    |                  |
| What is happening|    | Why it happened  |    | Where it happened|
| (aggregates)     |    | (events)         |    | (request path)   |
|                  |    |                  |    |                  |
| Prometheus       |    | ELK Stack        |    | Jaeger           |
| Grafana          |    | Loki             |    | Zipkin           |
| Datadog          |    | CloudWatch Logs  |    | OpenTelemetry    |
+------------------+    +------------------+    +------------------+
        |                       |                       |
        +----------+------------+-----------+-----------+
                   |                        |
            [Correlation ID links them all together]

Pillar  | Data Type                    | Cardinality          | Cost   | Best For
--------|------------------------------|----------------------|--------|-------------------------------------
Metrics | Numeric time series          | Low (pre-aggregated) | Low    | Alerting, dashboards, trends
Logs    | Structured/unstructured text | High (per event)     | High   | Root cause analysis, audit trails
Traces  | Request-scoped spans         | Medium (sampled)     | Medium | Latency analysis, dependency mapping

Metrics

RED Method (Request-Driven)

For services (anything that handles requests):

Metric   | What to Measure                         | Alert When
---------|-----------------------------------------|-----------------------
Rate     | Requests per second                     | Sudden drop or spike
Errors   | Failed requests per second (or error %) | Error rate > threshold
Duration | Latency distribution (p50, p95, p99)    | p99 > SLO target
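
As a concrete illustration, the RED numbers for one service can be computed from a window of raw request records. This is a stdlib-only sketch with invented sample data, not a real metrics library:

```python
from statistics import quantiles

# Hypothetical one-minute window of (latency_ms, failed) request records.
window_seconds = 60
requests = [(90, False), (110, False), (130, False), (150, False),
            (200, False), (250, False), (400, False), (900, False),
            (1200, True), (5100, True)]

# Rate: requests per second over the window.
rate = len(requests) / window_seconds

# Errors: percentage of requests that failed.
error_pct = 100 * sum(1 for _, failed in requests if failed) / len(requests)

# Duration: latency percentiles from the raw samples.
latencies = sorted(ms for ms, _ in requests)
cuts = quantiles(latencies, n=100)          # 99 cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"rate={rate:.2f} req/s  errors={error_pct:.0f}%  "
      f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```

In production the aggregation happens in the metrics backend, but the semantics are the same: Rate and Errors are counters differentiated over time, Duration comes from a latency distribution.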

USE Method (Resource-Driven)

For infrastructure (CPU, memory, disk, network):

Metric      | What to Measure               | Alert When
------------|-------------------------------|--------------------
Utilization | % of resource capacity in use | > 80% sustained
Saturation  | Queue depth / backlog         | Growing unboundedly
Errors      | Error count on the resource   | Any hardware errors

Prometheus + Grafana Stack

+----------+     scrape      +------------+     query     +---------+
| Services | --------------> | Prometheus | ------------> | Grafana |
| /metrics |   (pull-based)  | (TSDB)     |   (PromQL)    | (viz)   |
+----------+                 +------------+               +---------+
                                   |
                            +--------------+
                            | Alertmanager |
                            | (routing,    |
                            |  dedup,      |
                            |  silence)    |
                            +--------------+

Why pull-based (Prometheus)?

  • Service doesn't need to know about monitoring infrastructure
  • Prometheus controls scrape rate (no push storms)
  • Easy to detect if a target is down (scrape fails)

Key Metric Types

Type      | Description                        | Example
----------|------------------------------------|--------------------------------------
Counter   | Monotonically increasing value     | Total requests served
Gauge     | Value that goes up and down        | Current active connections
Histogram | Distribution of values in buckets  | Request latency percentiles
Summary   | Client-side calculated percentiles | Similar to histogram but pre-computed
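
The histogram type is worth seeing concretely. The sketch below imitates the cumulative-bucket layout Prometheus histograms use and estimates percentiles by interpolating inside a bucket, roughly what `histogram_quantile()` does; the bucket bounds and observations are invented for illustration:

```python
# Hypothetical latency buckets (upper bounds in ms), Prometheus-style:
# each bucket is cumulative, counting observations <= its bound.
bounds = [50, 100, 250, 500, 1000, float("inf")]
counts = [0] * len(bounds)
total = 0

def observe(ms):
    """Record one observation into every bucket that covers it."""
    global total
    total += 1
    for i, ub in enumerate(bounds):
        if ms <= ub:
            counts[i] += 1

for ms in [30, 70, 80, 120, 180, 300, 450, 700, 950, 2000]:
    observe(ms)

def estimate_quantile(q):
    """Linearly interpolate inside the bucket holding the q-th observation."""
    rank = q * total
    for i, c in enumerate(counts):
        if c >= rank:
            lo = bounds[i - 1] if i else 0
            prev = counts[i - 1] if i else 0
            if bounds[i] == float("inf"):
                return lo  # quantile falls in the open-ended +Inf bucket
            return lo + (bounds[i] - lo) * (rank - prev) / (c - prev)

print("p50 ~", estimate_quantile(0.5))   # interpolated, not exact
print("p95 ~", estimate_quantile(0.95))
```

This is why bucket bounds matter: the estimate can never be more precise than the bucket the quantile lands in.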

Structured Logging

Why Structured?

Unstructured (hard to query):
  "2024-03-15 10:23:45 ERROR User 12345 failed to checkout: payment timeout"

Structured (easy to query):
  {
    "timestamp": "2024-03-15T10:23:45Z",
    "level": "ERROR",
    "service": "checkout",
    "user_id": "12345",
    "action": "checkout",
    "error": "payment_timeout",
    "correlation_id": "abc-123-def",
    "latency_ms": 5023
  }

Structured logs let you filter: service=checkout AND error=payment_timeout AND latency_ms > 5000
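
One way to emit structured logs like the example above is a custom JSON formatter on top of Python's standard `logging` module. The service name and the `fields` convention here are illustrative choices, not a standard API:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, so a log pipeline
    can filter on fields instead of grepping free text."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "checkout",  # hypothetical service name
            "message": record.getMessage(),
        }
        # Extra structured fields attached by the caller, if any.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.error("checkout failed", extra={"fields": {
    "user_id": "12345", "error": "payment_timeout",
    "correlation_id": "abc-123-def", "latency_ms": 5023}})
```

One JSON object per line is the shape most shippers (Filebeat, Fluentd) expect on stdout.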

ELK Stack

+----------+    +-----------+    +---------------+    +-------------+
| Services | -> | Logstash  | -> | Elasticsearch | -> | Kibana      |
| (stdout) |    | (parse,   |    | (index,       |    | (search,    |
|          |    |  enrich,  |    |  store,       |    |  visualize) |
|          |    |  route)   |    |  full-text)   |    |             |
+----------+    +-----------+    +---------------+    +-------------+

Alternative lightweight pipeline:
  Services -> Filebeat -> Elasticsearch -> Kibana
  Services -> Fluentd -> Loki -> Grafana (cheaper)

Correlation IDs

A single ID that follows a request across every service it touches.

User request --> API Gateway (generates correlation_id: "abc-123")
  |
  +--> Auth Service    (logs with correlation_id: "abc-123")
  +--> Order Service   (logs with correlation_id: "abc-123")
       +--> Payment Service (logs with correlation_id: "abc-123")
       +--> Inventory Service (logs with correlation_id: "abc-123")

Implementation: Propagate via HTTP header (e.g., X-Request-ID) or gRPC metadata. Every log line includes it.

Query: correlation_id = "abc-123" returns the full story of one request across all services.
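
A minimal in-process sketch of the propagation pattern, assuming a `contextvars`-based approach (the function names are hypothetical):

```python
import contextvars
import uuid

# The correlation ID travels in a context variable so every log line
# written during this request can include it.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

def log(msg):
    print(f'correlation_id={correlation_id.get()} msg="{msg}"')

def call_downstream():
    # Outgoing calls forward the same ID as a header.
    headers = {"X-Request-ID": correlation_id.get()}
    log(f"calling payment service with {headers}")

def handle_request(incoming_header=None):
    # Reuse the caller's X-Request-ID if present; otherwise mint one at
    # the edge (the API gateway's job in the diagram above).
    cid = incoming_header or str(uuid.uuid4())
    correlation_id.set(cid)
    log("request received")
    call_downstream()
    return cid

handle_request("abc-123")
```

In a real service this lives in middleware: read the header on the way in, set the context variable, and inject it into both log records and outgoing request headers.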


Distributed Tracing

Concepts: Traces and Spans

Trace (full request journey):
|<--------------------------- Trace abc-123 ---------------------------->|

Spans (individual operations):
|-- API Gateway (12ms) --|
   |-- Auth Service (3ms) --|
   |-- Order Service (45ms) --------------------------------------|
      |-- DB Query (8ms) --|
      |-- Payment Service (30ms) ----------------------|
         |-- Stripe API (25ms) -------------------|
      |-- Inventory Service (5ms) --|

Concept      | Definition
-------------|----------------------------------------------------------------------
Trace        | End-to-end journey of a single request, identified by a trace ID
Span         | A single unit of work within a trace (service call, DB query, etc.)
Parent span  | The span that initiated the current span
Span context | Trace ID + span ID + flags, propagated across service boundaries
Baggage      | Key-value pairs propagated through the trace (user_id, tenant_id)
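
These concepts can be made concrete with a toy model (not a real tracing SDK, just the relationships between trace ID, span ID, parent span, and propagated context):

```python
import itertools

_ids = itertools.count(1)

class Span:
    """Toy span: one unit of work, linked to its parent within a trace."""
    def __init__(self, name, context=None):
        self.name = name
        self.span_id = next(_ids)
        if context:  # continue an existing trace
            self.trace_id, self.parent_id = context
        else:        # root span: start a new trace
            self.trace_id, self.parent_id = next(_ids) * 1000, None

    def context(self):
        # Span context = trace ID + span ID: what crosses service boundaries.
        return (self.trace_id, self.span_id)

# The API gateway starts the trace; downstream services continue it.
gateway = Span("api-gateway")
order = Span("order-service", context=gateway.context())
payment = Span("payment-service", context=order.context())

print(payment.name, "trace:", payment.trace_id, "parent span:", payment.parent_id)
```

All three spans share one trace ID, and each child records which span initiated it, which is exactly what lets a tracing backend reassemble the waterfall diagram above.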

OpenTelemetry

The emerging standard for instrumentation across all three pillars.

+-------------+      +------------------+      +-------------+
| Application | ---> | OTel Collector   | ---> | Backend     |
| (SDK)       |      | (receive,        |      | - Jaeger    |
| - traces    |      |  process,        |      | - Zipkin    |
| - metrics   |      |  export)         |      | - Prometheus|
| - logs      |      +------------------+      | - Loki      |
+-------------+                                +-------------+

Why OpenTelemetry?

  • Vendor-neutral instrumentation (switch backends without code changes)
  • Auto-instrumentation for common frameworks (HTTP, gRPC, DB clients)
  • Unified SDK for metrics + traces + logs
  • W3C Trace Context standard for propagation

Jaeger vs Zipkin

Feature           | Jaeger                          | Zipkin
------------------|---------------------------------|--------------------------------
Origin            | Uber                            | Twitter
Language          | Go                              | Java
Storage           | Cassandra, Elasticsearch, Kafka | Cassandra, Elasticsearch, MySQL
Adaptive sampling | Yes                             | No (fixed-rate)
Architecture      | Collector + Query + Agent       | Single binary or distributed
OpenTelemetry     | Native support                  | Via exporter

Sampling Strategies

At FAANG scale, tracing every request is too expensive. Sampling reduces volume.

Strategy                   | Description                                  | Trade-off
---------------------------|----------------------------------------------|----------------------------------------------------
Head-based (probabilistic) | Decide at request start: trace 1%            | Simple; misses rare errors
Tail-based                 | Decide after request completes               | Captures errors/slow requests; higher resource cost
Adaptive                   | Adjust rate based on traffic volume          | Balances cost and coverage
Priority                   | Always trace certain paths (checkout, auth)  | Ensures critical path visibility
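
Head-based sampling is often implemented by hashing the trace ID rather than rolling a fresh random number per service, so every hop reaches the same keep/drop decision and sampled traces arrive complete. A sketch:

```python
import hashlib

def head_sample(trace_id: str, rate: float = 0.01) -> bool:
    """Head-based sampling decision: hash the trace ID into [0, 1) and
    keep the trace if it lands below the rate. Hashing makes the
    decision deterministic per trace across all services."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16)
    return bucket / 0x1_0000_0000 < rate

kept = sum(head_sample(f"trace-{i}") for i in range(100_000))
print(f"kept {kept} of 100,000 traces (~1%)")
```

In practice the decision is usually made once at the edge and carried in the span context's sampled flag, but hash-based deciders are a common fallback when the flag is missing.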

Alerting Best Practices

Signal vs Noise

Good alert:                          Bad alert:
- Actionable                         - "CPU at 75%" (so what?)
- Tied to user impact                - "Disk at 80%" (for 2 seconds)
- Has a runbook                      - Fires 50 times/day
- Fires rarely but meaningfully      - No one knows what to do

Alerting Anti-Patterns

Anti-Pattern       | Problem                                    | Fix
-------------------|--------------------------------------------|---------------------------------------
Alert fatigue      | Too many alerts, team ignores them         | Reduce to symptoms, not causes
Missing context    | "Error rate high" with no details          | Include service, endpoint, error type
No runbook         | On-call doesn't know what to do            | Every alert links to a runbook
Alerting on causes | "Pod restarted" (Kubernetes handles this)  | Alert on user-facing symptoms
No severity levels | Everything pages the same way              | P1 = page, P2 = Slack, P3 = ticket

Alert Hierarchy

Page (wake someone up):
  - SLO is burning fast (error budget consumption > 5x normal rate)
  - User-facing errors > threshold for > 5 minutes

Notify (Slack/email):
  - SLO is burning slowly (elevated but not critical)
  - Background job failure rate elevated

Ticket (fix this week):
  - Approaching capacity limits
  - Non-critical dependency degradation

Dashboards

Four Golden Signals Dashboard (per service)

+-----------------------------+-----------------------------+
|  Latency (p50, p95, p99)    |  Traffic (requests/sec)     |
|  [line chart over time]     |  [line chart over time]     |
+-----------------------------+-----------------------------+
|  Error Rate (%)             |  Saturation (CPU/mem/queue) |
|  [line chart + threshold]   |  [gauge charts]             |
+-----------------------------+-----------------------------+

Dashboard Design Rules

  1. Top-level: Business metrics (orders/min, active users)
  2. Service-level: RED metrics per service
  3. Infrastructure: USE metrics per node/pod
  4. Drill-down: Click through from symptom to cause

SLI/SLO Monitoring

SLO: 99.9% of requests < 500ms over 30 days

Dashboard:
  Current SLI:          99.87%  (below target!)
  Error budget:         21.6 min remaining this month (of 43.2 min total)
  Burn rate:            2.3x normal  (will exhaust in ~6.5 days)

  [==================------------------]  50% budget consumed

Multi-window, multi-burn-rate alerting:

  • Fast burn (10x) over 5 min: page immediately
  • Moderate burn (2x) over 6 hours: notify
  • Slow burn (1.5x) over 3 days: create ticket

This approach, from the Google SRE Workbook's "Alerting on SLOs" chapter, minimizes false positives while catching real SLO breaches.
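
The arithmetic behind burn-rate alerting is simple enough to check by hand; this sketch works through the 99.9% / 30-day example:

```python
# Error-budget arithmetic for a 99.9% SLO over a 30-day window.
window_min = 30 * 24 * 60              # 43,200 minutes in the window
budget_min = window_min * (1 - 0.999)  # 43.2 "bad minutes" allowed

def budget_consumed(burn_rate, hours):
    """Fraction of the whole 30-day budget consumed by sustaining
    burn_rate for the given hours (1x uses it up in exactly 30 days)."""
    return burn_rate * hours / (30 * 24)

# A sustained 10x burn exhausts the entire budget in 3 days,
# which is why a fast burn pages immediately:
days_at_10x = 30 / 10
print(f"budget: {budget_min:.1f} min; 10x burn exhausts it in {days_at_10x:.0f} days")
print(f"10x sustained for 6h consumes {budget_consumed(10, 6):.0%} of the budget")
```

The multi-window rule follows directly: a burn rate that would exhaust the budget in days must page now, while one that would take weeks can wait for a ticket.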


Anomaly Detection

Approach               | How It Works                                    | Good For
-----------------------|-------------------------------------------------|------------------------------------
Static thresholds      | Alert if value > X                              | Stable, predictable metrics
Standard deviation     | Alert if > N sigma from rolling mean            | Metrics with known distribution
Seasonal decomposition | Model daily/weekly patterns, alert on deviation | Traffic with time-of-day patterns
ML-based               | Train model on normal behavior                  | Complex, multi-dimensional signals
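
The standard-deviation approach fits in a few lines. This is a rolling z-score sketch; the window size, sigma threshold, and readings are illustrative:

```python
from collections import deque
from statistics import mean, stdev

def make_detector(window=20, n_sigma=3.0):
    """Rolling z-score detector: flag a point more than n_sigma standard
    deviations away from the mean of the previous `window` points."""
    history = deque(maxlen=window)
    def check(value):
        anomalous = (len(history) == window
                     and stdev(history) > 0
                     and abs(value - mean(history)) > n_sigma * stdev(history))
        history.append(value)
        return anomalous
    return check

check = make_detector()
# Steady traffic around 100 req/s, then a spike (invented data).
readings = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98,
            101, 99, 100, 102, 98, 101, 99, 103, 97, 100, 400]
flags = [check(v) for v in readings]
print("anomaly at index", flags.index(True))
```

Note the trade-off from the table: this works only when the metric's distribution is roughly stable; a daily traffic cycle would need seasonal decomposition first.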

On-Call Practices

Practice                 | Why
-------------------------|-----------------------------------------------
Runbooks for every alert | Reduce cognitive load at 3 AM
Blameless postmortems    | Encourage honest root cause analysis
On-call rotations        | Prevent burnout (1 week on, N weeks off)
Escalation paths         | Clear chain when primary can't resolve
Incident commander role  | One person coordinates during major incidents
Follow-the-sun           | Distribute on-call across time zones

Interview Tips

  1. "How would you debug this?" Always answer: metrics first (what is happening?), then traces (where?), then logs (why?).
  2. Correlation IDs are essential. Mention them in any microservices design.
  3. Sampling trade-offs show depth. Don't claim you trace every request at 1M QPS.
  4. Alert on symptoms, not causes. "Users see errors" > "Pod restarted."
  5. SLO burn rate alerting is the gold standard. Reference it for bonus points.
  6. OpenTelemetry is the current industry direction. Show you know the convergence story.

Common Interview Questions

  • "Your p99 latency spiked 3x. Walk me through how you investigate."
  • "Design a monitoring system for a microservices architecture."
  • "How do you avoid alert fatigue?"
  • "What's the difference between monitoring and observability?"
  • "How would you trace a request through 10 microservices?"

Resources

  • Google SRE Book: Chapter 6 (Monitoring), Chapter 4 (SLOs)
  • Distributed Systems Observability - Cindy Sridharan (free e-book)
  • DDIA Chapter 1: Reliability, scalability, maintainability
  • OpenTelemetry documentation: https://opentelemetry.io/docs/
  • Grafana Labs blog: "The RED Method" and "The USE Method"
  • Google SRE Workbook: "Alerting on SLOs"

Previous: 24 - Reliability Engineering | Next: 26 - Security in Distributed Systems