20 - Monitoring & Logging

The Observability Stack

Three pillars of observability:

| Pillar  | Question                   | Tools                        |
|---------|----------------------------|------------------------------|
| Metrics | "How is it performing?"    | Prometheus, Grafana, Datadog |
| Logs    | "What happened?"           | EFK/ELK, Loki, CloudWatch    |
| Traces  | "Where is the bottleneck?" | Jaeger, Zipkin, Tempo        |

Metrics with Prometheus + Grafana

Architecture

┌─── Kubernetes Cluster ───────────────────────────────┐
│                                                      │
│  App Pods expose /metrics                            │
│  ┌─────┐ ┌─────┐ ┌─────┐                             │
│  │:9090│ │:9090│ │:9090│ ← Prometheus scrapes these  │
│  └──┬──┘ └──┬──┘ └──┬──┘                             │
│     └───────┼───────┘                                │
│             │                                        │
│  ┌──────────▼────────────┐                           │
│  │   Prometheus          │ ← Stores time-series data │
│  │   (scraper + TSDB)    │                           │
│  └──────────┬────────────┘                           │
│             │                                        │
│  ┌──────────▼────────────┐                           │
│  │   Grafana             │ ← Dashboards & alerts     │
│  │   (visualization)     │                           │
│  └───────────────────────┘                           │
│                                                      │
│  ┌───────────────────────┐                           │
│  │   AlertManager        │ ← Routes alerts to        │
│  │                       │   Slack, PagerDuty, etc.  │
│  └───────────────────────┘                           │
└──────────────────────────────────────────────────────┘

Install with kube-prometheus-stack

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace
```

This installs: Prometheus, Grafana, AlertManager, node-exporter, kube-state-metrics.

Key Metrics to Monitor

Cluster Level:

  • Node CPU/memory/disk usage
  • Pod count vs capacity
  • API server request latency

Application Level (RED method):

  • Rate: requests per second
  • Errors: error rate
  • Duration: request latency

Infrastructure Level (USE method):

  • Utilization: CPU/memory usage %
  • Saturation: queue depth, pending requests
  • Errors: error count
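Both methods map directly onto PromQL. A sketch, assuming the app exposes the conventional `http_requests_total` counter and `http_request_duration_seconds` histogram (metric names vary by instrumentation library):

```promql
# RED: request rate per service
sum(rate(http_requests_total[5m])) by (service)

# RED: error ratio (5xx responses / all responses)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# RED: p95 latency from a histogram
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# USE: node CPU utilization (node-exporter)
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
```

The error-ratio query is the same expression used in the `HighErrorRate` alert below.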

ServiceMonitor (Tell Prometheus What to Scrape)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
    - port: metrics
      path: /metrics
      interval: 15s
  namespaceSelector:
    matchNames:
      - production
```
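Note that the ServiceMonitor's `selector` matches Service labels (not Pod labels), and `port: metrics` refers to a named port on that Service. A minimal matching Service, with names assumed to line up with the example above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: production
  labels:
    app: api            # matched by the ServiceMonitor's selector
spec:
  selector:
    app: api
  ports:
    - name: metrics     # referenced by the ServiceMonitor's "port: metrics"
      port: 9090
      targetPort: 9090
```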

PrometheusRule (Alerts)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-alerts
spec:
  groups:
    - name: api.rules
      rules:
        - alert: HighErrorRate
          expr: |
            rate(http_requests_total{status=~"5.."}[5m])
              / rate(http_requests_total[5m]) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.service }}"
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
```
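For these alerts to actually reach Slack or PagerDuty, AlertManager needs a route and receiver. A sketch of an AlertManager config; the webhook URLs and channel names are placeholders, not real endpoints:

```yaml
# alertmanager config (sketch; api_url values are placeholders)
route:
  receiver: slack-default
  group_by: ['alertname', 'namespace']
  routes:
    - matchers:
        - severity = "critical"
      receiver: slack-critical
receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX   # placeholder
        channel: '#alerts'
  - name: slack-critical
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX   # placeholder
        channel: '#alerts-critical'
```

The route tree lets you fan out by severity, team, or namespace; unmatched alerts fall through to the top-level receiver.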

Logging

Logging Architecture

┌─── Node ────────────────────────────────┐
│                                         │
│  Pod stdout/stderr → Container Runtime  │
│                       ↓                 │
│  /var/log/containers/*.log              │
│                       ↓                 │
│  ┌─── DaemonSet ─────────────┐          │
│  │  Fluent Bit / Fluentd     │          │
│  │  (collects + ships logs)  │          │
│  └──────────┬────────────────┘          │
└─────────────┼───────────────────────────┘
              │
              ▼
┌─── Log Backend ──────────────────────┐
│  Elasticsearch / Loki / CloudWatch   │
│              ↓                       │
│  Kibana / Grafana (search + query)   │
└──────────────────────────────────────┘

EFK Stack (Elasticsearch + Fluentd + Kibana)

```bash
# Add chart repos, then install the EFK components
helm repo add elastic https://helm.elastic.co
helm repo add fluent https://fluent.github.io/helm-charts
helm install elasticsearch elastic/elasticsearch -n logging --create-namespace
helm install kibana elastic/kibana -n logging
helm install fluentd fluent/fluentd -n logging
```

Loki (Lightweight Alternative)

```bash
# Install Loki + Promtail
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack -n logging --create-namespace \
  --set grafana.enabled=false \
  --set promtail.enabled=true
```

Application Logging Best Practices

```bash
# Log to stdout/stderr (not files): K8s captures stdout/stderr automatically

# Use structured logging (JSON):
{"level":"info","ts":"2024-01-15T10:30:00Z","msg":"request handled","method":"GET","path":"/api/users","status":200,"duration_ms":45}

# Not this:
INFO 2024-01-15 10:30:00 - Request handled: GET /api/users 200 45ms
```

Why structured logs:

  • Machine-parseable
  • Easy to filter and aggregate
  • Works with any log backend
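A shell sketch of what "structured to stdout" means in practice: a tiny helper that emits one JSON object per line, which the DaemonSet collector can then parse field-by-field. The `log_json` function name is illustrative, not a standard tool:

```shell
#!/bin/sh
# Emit one JSON log line per event to stdout; in K8s the container
# runtime captures stdout and the log collector picks it up from disk.
log_json() {
  level=$1; shift
  printf '{"level":"%s","ts":"%s","msg":"%s"}\n' \
    "$level" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*"
}

log_json info "request handled"
```

Real services would use their language's structured-logging library instead; the point is that each line is a self-describing, machine-parseable record.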

Distributed Tracing

Track requests across microservices:

User → API Gateway → Auth Service → User Service → Database
         │              │               │
       Span 1        Span 2          Span 3
       ◄──────────── Trace ──────────────────►

OpenTelemetry (Standard)

```yaml
# Run the OpenTelemetry Collector as a Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib
          args: ["--config=/etc/otel/config.yaml"]
```
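The collector reads its pipeline from the config file passed via `--config` (mounted from a ConfigMap in practice). A minimal sketch that receives OTLP traces and forwards them to a backend; the `tempo:4317` endpoint is an assumption, substitute your tracing backend's address:

```yaml
# /etc/otel/config.yaml (sketch)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  otlp:
    endpoint: tempo:4317   # assumed backend address
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Apps send spans to the collector's OTLP endpoint; the `batch` processor buffers them before export to reduce backend load.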

Kubernetes Dashboard & Built-in Tools

```bash
# Built-in metrics (requires metrics-server)
kubectl top nodes
kubectl top pods
kubectl top pods --containers

# Events
kubectl get events --sort-by='.lastTimestamp'
kubectl get events -n production --field-selector type=Warning

# Logs
kubectl logs -f deployment/api --all-containers
kubectl logs -l app=api --since=1h

# Install metrics-server (if not present)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```

FAANG Interview Angle

Common questions:

  1. "How would you set up monitoring for a K8s cluster?"
  2. "How do you handle logging in a microservices architecture?"
  3. "What metrics would you monitor for an application?"
  4. "Explain distributed tracing"
  5. "How do you debug a pod that keeps crashing?"

Key answers:

  • Prometheus + Grafana for metrics, EFK/Loki for logs, Jaeger/Tempo for traces
  • DaemonSet log collectors shipping to centralized backend; structured JSON logs to stdout
  • RED (Rate, Errors, Duration) for services; USE (Utilization, Saturation, Errors) for infra
  • Traces follow a request across services; each hop is a span; all spans form a trace
  • kubectl describe pod (events), kubectl logs --previous, resource limits, probes, events

Official Links