# 20 - Monitoring & Logging

## The Observability Stack
Three pillars of observability:
| Pillar | Question | Tools |
|---|---|---|
| Metrics | "How is it performing?" | Prometheus, Grafana, Datadog |
| Logs | "What happened?" | EFK/ELK, Loki, CloudWatch |
| Traces | "Where is the bottleneck?" | Jaeger, Zipkin, Tempo |
## Metrics with Prometheus + Grafana

### Architecture
```
┌─── Kubernetes Cluster ───────────────────────────────┐
│                                                      │
│  App Pods expose /metrics                            │
│  ┌─────┐ ┌─────┐ ┌─────┐                             │
│  │:9090│ │:9090│ │:9090│  ← Prometheus scrapes these │
│  └──┬──┘ └──┬──┘ └──┬──┘                             │
│     └───────┼───────┘                                │
│             │                                        │
│  ┌──────────▼────────────┐                           │
│  │      Prometheus       │ ← Stores time-series data │
│  │   (scraper + TSDB)    │                           │
│  └──────────┬────────────┘                           │
│             │                                        │
│  ┌──────────▼────────────┐                           │
│  │       Grafana         │ ← Dashboards & alerts     │
│  │   (visualization)     │                           │
│  └───────────────────────┘                           │
│                                                      │
│  ┌───────────────────────┐                           │
│  │     AlertManager      │ ← Routes alerts to        │
│  │                       │   Slack, PagerDuty, etc.  │
│  └───────────────────────┘                           │
└──────────────────────────────────────────────────────┘
```
### Install with kube-prometheus-stack
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace
```
This installs: Prometheus, Grafana, AlertManager, node-exporter, kube-state-metrics.
### Key Metrics to Monitor

**Cluster Level:**
- Node CPU/memory/disk usage
- Pod count vs capacity
- API server request latency
**Application Level (RED method):**
- Rate: requests per second
- Errors: error rate
- Duration: request latency
**Infrastructure Level (USE method):**
- Utilization: CPU/memory usage %
- Saturation: queue depth, pending requests
- Errors: error count
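As a concrete example, the RED metrics above translate into PromQL queries like the following. This is a sketch: it assumes the app exposes a request counter named `http_requests_total` (as in the alert rules in this chapter) and a latency histogram named `http_request_duration_seconds`, both with a `service` label.

```promql
# Rate: requests per second, per service (5-minute window)
sum(rate(http_requests_total[5m])) by (service)

# Errors: fraction of responses that are 5xx
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Duration: p95 latency derived from histogram buckets
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```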
### ServiceMonitor (Tell Prometheus What to Scrape)
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
    - port: metrics
      path: /metrics
      interval: 15s
  namespaceSelector:
    matchNames:
      - production
```
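Note that a ServiceMonitor selects *Services* (not Pods) by label, and `port: metrics` refers to a named port on that Service. A matching Service might look like this sketch (names and port numbers are assumptions chosen to line up with the ServiceMonitor above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: production
  labels:
    app: api            # must match the ServiceMonitor's spec.selector.matchLabels
spec:
  selector:
    app: api
  ports:
    - name: metrics     # referenced by the ServiceMonitor's "port: metrics"
      port: 9090
      targetPort: 9090
```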
### PrometheusRule (Alerts)
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-alerts
spec:
  groups:
    - name: api.rules
      rules:
        - alert: HighErrorRate
          expr: |
            rate(http_requests_total{status=~"5.."}[5m])
              / rate(http_requests_total[5m]) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.service }}"
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
```
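Firing alerts only matter if AlertManager routes them somewhere. A minimal routing sketch that sends `severity: critical` alerts to Slack (the webhook URL and channel are placeholders, not real values):

```yaml
# alertmanager.yaml (sketch)
route:
  receiver: default
  routes:
    - matchers:
        - severity = "critical"
      receiver: slack-critical
receivers:
  - name: default
  - name: slack-critical
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX   # placeholder webhook
        channel: "#alerts"
```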
## Logging

### Logging Architecture
```
┌─── Node ────────────────────────────────┐
│                                         │
│  Pod stdout/stderr → Container Runtime  │
│            ↓                            │
│  /var/log/containers/*.log              │
│            ↓                            │
│  ┌─── DaemonSet ─────────────┐          │
│  │  Fluent Bit / Fluentd     │          │
│  │  (collects + ships logs)  │          │
│  └──────────┬────────────────┘          │
└─────────────┼───────────────────────────┘
              │
              ▼
┌─── Log Backend ──────────────────────┐
│  Elasticsearch / Loki / CloudWatch   │
│            ↓                         │
│  Kibana / Grafana (search + query)   │
└──────────────────────────────────────┘
```
### EFK Stack (Elasticsearch + Fluentd + Kibana)
```bash
# Add the chart repos, then install with Helm
helm repo add elastic https://helm.elastic.co
helm repo add fluent https://fluent.github.io/helm-charts
helm install elasticsearch elastic/elasticsearch -n logging --create-namespace
helm install kibana elastic/kibana -n logging
helm install fluentd fluent/fluentd -n logging
```
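To make the DaemonSet collector concrete, here is a minimal Fluent Bit configuration (classic INI format) implementing the tail → enrich → ship pipeline from the diagram above. The Elasticsearch host and port are assumptions for an in-cluster install:

```ini
[INPUT]
    Name     tail
    Path     /var/log/containers/*.log
    Tag      kube.*

[FILTER]
    Name     kubernetes              # enrich records with pod/namespace labels
    Match    kube.*

[OUTPUT]
    Name     es
    Match    kube.*
    Host     elasticsearch.logging.svc   # assumed in-cluster service name
    Port     9200
```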
### Loki (Lightweight Alternative)
```bash
# Install Loki + Promtail
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack -n logging --create-namespace \
  --set grafana.enabled=false \
  --set promtail.enabled=true
```
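Logs in Loki are queried from Grafana with LogQL. Two sketches, assuming pods labeled `app=api` that emit structured JSON logs with a numeric `status` field:

```logql
# All logs from app=api pods containing the string "error"
{app="api"} |= "error"

# Parse JSON logs and filter on an extracted field
{app="api"} | json | status >= 500
```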
### Application Logging Best Practices
Log to stdout/stderr, not files; Kubernetes captures stdout/stderr automatically. Use structured logging (JSON):

```json
{"level":"info","ts":"2024-01-15T10:30:00Z","msg":"request handled","method":"GET","path":"/api/users","status":200,"duration_ms":45}
```

Not this:

```
INFO 2024-01-15 10:30:00 - Request handled: GET /api/users 200 45ms
```
**Why structured logs:**
- Machine-parseable
- Easy to filter and aggregate
- Works with any log backend
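Producing that JSON format takes only a few lines in most languages. A minimal sketch in Python using only the standard library; the formatter class and the `fields` convention are illustrative choices, not a standard API:

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "level": record.levelname.lower(),
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "msg": record.getMessage(),
        }
        # Merge structured fields attached via logging's `extra` kwarg
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)  # stdout, so K8s captures it
handler.setFormatter(JsonFormatter())
log = logging.getLogger("api")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("request handled",
         extra={"fields": {"method": "GET", "path": "/api/users",
                           "status": 200, "duration_ms": 45}})
```

In production you would more likely reach for a library like `structlog` or `python-json-logger`, but the shape of the output is the same.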
## Distributed Tracing
Track requests across microservices:
```
User → API Gateway → Auth Service → User Service → Database
              │             │              │
            Span 1        Span 2         Span 3
            ◄──────────── Trace ──────────────►
```
### OpenTelemetry (Standard)
```yaml
# Run the OpenTelemetry Collector to receive and export telemetry
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib
          args: ["--config=/etc/otel/config.yaml"]
```
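The collector's behavior is driven by the file passed to `--config`. A minimal sketch that receives OTLP, batches spans, and forwards them to a Tempo/Jaeger-style OTLP endpoint (the backend endpoint is a placeholder assumption):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp:
    endpoint: tempo.monitoring.svc:4317   # placeholder backend
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```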
## Kubernetes Dashboard & Built-in Tools
```bash
# Built-in metrics (requires metrics-server)
kubectl top nodes
kubectl top pods
kubectl top pods --containers

# Events
kubectl get events --sort-by='.lastTimestamp'
kubectl get events -n production --field-selector type=Warning

# Logs
kubectl logs -f deployment/api --all-containers
kubectl logs -l app=api --since=1h

# Install metrics-server (if not present)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```
## FAANG Interview Angle

**Common questions:**
- "How would you set up monitoring for a K8s cluster?"
- "How do you handle logging in a microservices architecture?"
- "What metrics would you monitor for an application?"
- "Explain distributed tracing"
- "How do you debug a pod that keeps crashing?"
**Key answers:**
- Prometheus + Grafana for metrics, EFK/Loki for logs, Jaeger/Tempo for traces
- DaemonSet log collectors shipping to centralized backend; structured JSON logs to stdout
- RED (Rate, Errors, Duration) for services; USE (Utilization, Saturation, Errors) for infra
- Traces follow a request across services; each hop is a span; all spans form a trace
- `kubectl describe pod` (check events), `kubectl logs --previous` (logs from the last crashed container), then review resource limits and probes