20 - Monitoring & Logging

The Observability Stack

Three pillars of observability:

| Pillar  | Question                   | Tools                        |
|---------|----------------------------|------------------------------|
| Metrics | "How is it performing?"    | Prometheus, Grafana, Datadog |
| Logs    | "What happened?"           | EFK/ELK, Loki, CloudWatch    |
| Traces  | "Where is the bottleneck?" | Jaeger, Zipkin, Tempo        |

Metrics with Prometheus + Grafana

Architecture

┌─── Kubernetes Cluster ───────────────────────────────┐
│                                                      │
│  App Pods expose /metrics                            │
│  ┌─────┐ ┌─────┐ ┌─────┐                             │
│  │:9090│ │:9090│ │:9090│ ← Prometheus scrapes these  │
│  └──┬──┘ └──┬──┘ └──┬──┘                             │
│     └───────┼───────┘                                │
│             │                                        │
│  ┌──────────▼────────────┐                           │
│  │   Prometheus          │ ← Stores time-series data │
│  │   (scraper + TSDB)    │                           │
│  └──────────┬────────────┘                           │
│             │                                        │
│  ┌──────────▼────────────┐                           │
│  │   Grafana             │ ← Dashboards & alerts     │
│  │   (visualization)     │                           │
│  └───────────────────────┘                           │
│                                                      │
│  ┌───────────────────────┐                           │
│  │   AlertManager        │ ← Routes alerts to        │
│  │                       │   Slack, PagerDuty, etc.  │
│  └───────────────────────┘                           │
└──────────────────────────────────────────────────────┘

Install with kube-prometheus-stack

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace
```

This installs: Prometheus, Grafana, AlertManager, node-exporter, kube-state-metrics.

Key Metrics to Monitor

Cluster Level:

  • Node CPU/memory/disk usage
  • Pod count vs capacity
  • API server request latency

Application Level (RED method):

  • Rate: requests per second
  • Errors: error rate
  • Duration: request latency

Infrastructure Level (USE method):

  • Utilization: CPU/memory usage %
  • Saturation: queue depth, pending requests
  • Errors: error count
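Both methods map directly onto PromQL. A sketch, assuming the app exposes the conventional `http_requests_total` counter and `http_request_duration_seconds` histogram (metric names vary by instrumentation library):

```promql
# RED: request rate per service
sum(rate(http_requests_total[5m])) by (service)

# RED: error ratio (5xx responses / all responses)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# RED: p95 latency from a histogram
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# USE: node CPU utilization (node-exporter)
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
```

The error-ratio query is the same expression used in the `HighErrorRate` alert below.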

ServiceMonitor (Tell Prometheus What to Scrape)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
    - port: metrics
      path: /metrics
      interval: 15s
  namespaceSelector:
    matchNames:
      - production
```
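Note that the ServiceMonitor's `selector` matches Service labels (not Pod labels), and `port: metrics` refers to a named port on that Service. A minimal matching Service, with names assumed to line up with the example above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: production
  labels:
    app: api            # matched by the ServiceMonitor's selector
spec:
  selector:
    app: api
  ports:
    - name: metrics     # referenced by the ServiceMonitor's "port: metrics"
      port: 9090
      targetPort: 9090
```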

PrometheusRule (Alerts)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-alerts
spec:
  groups:
    - name: api.rules
      rules:
        - alert: HighErrorRate
          expr: |
            rate(http_requests_total{status=~"5.."}[5m])
              / rate(http_requests_total[5m]) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.service }}"
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
```
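For these alerts to actually reach Slack or PagerDuty, AlertManager needs a route and receiver. A sketch of an AlertManager config; the webhook URLs and channel names are placeholders, not real endpoints:

```yaml
# alertmanager config (sketch; api_url values are placeholders)
route:
  receiver: slack-default
  group_by: ['alertname', 'namespace']
  routes:
    - matchers:
        - severity = "critical"
      receiver: slack-critical
receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX   # placeholder
        channel: '#alerts'
  - name: slack-critical
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX   # placeholder
        channel: '#alerts-critical'
```

The route tree lets you fan out by severity, team, or namespace; unmatched alerts fall through to the top-level receiver.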

Logging

Logging Architecture

┌─── Node ────────────────────────────────┐
│                                         │
│  Pod stdout/stderr → Container Runtime  │
│                       ↓                 │
│  /var/log/containers/*.log              │
│                       ↓                 │
│  ┌─── DaemonSet ─────────────┐          │
│  │  Fluent Bit / Fluentd     │          │
│  │  (collects + ships logs)  │          │
│  └──────────┬────────────────┘          │
└─────────────┼───────────────────────────┘
              │
              ▼
┌─── Log Backend ──────────────────────┐
│  Elasticsearch / Loki / CloudWatch   │
│              ↓                       │
│  Kibana / Grafana (search + query)   │
└──────────────────────────────────────┘

EFK Stack (Elasticsearch + Fluentd + Kibana)

```bash
# Add chart repos, then install the EFK components
helm repo add elastic https://helm.elastic.co
helm repo add fluent https://fluent.github.io/helm-charts
helm install elasticsearch elastic/elasticsearch -n logging --create-namespace
helm install kibana elastic/kibana -n logging
helm install fluentd fluent/fluentd -n logging
```

Loki (Lightweight Alternative)

```bash
# Install Loki + Promtail
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack -n logging --create-namespace \
  --set grafana.enabled=false \
  --set promtail.enabled=true
```

Application Logging Best Practices

```bash
# Log to stdout/stderr (not files): K8s captures stdout/stderr automatically

# Use structured logging (JSON):
{"level":"info","ts":"2024-01-15T10:30:00Z","msg":"request handled","method":"GET","path":"/api/users","status":200,"duration_ms":45}

# Not this:
INFO 2024-01-15 10:30:00 - Request handled: GET /api/users 200 45ms
```

Why structured logs:

  • Machine-parseable
  • Easy to filter and aggregate
  • Works with any log backend
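A shell sketch of what "structured to stdout" means in practice: a tiny helper that emits one JSON object per line, which the DaemonSet collector can then parse field-by-field. The `log_json` function name is illustrative, not a standard tool:

```shell
#!/bin/sh
# Emit one JSON log line per event to stdout; in K8s the container
# runtime captures stdout and the log collector picks it up from disk.
log_json() {
  level=$1; shift
  printf '{"level":"%s","ts":"%s","msg":"%s"}\n' \
    "$level" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*"
}

log_json info "request handled"
```

Real services would use their language's structured-logging library instead; the point is that each line is a self-describing, machine-parseable record.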

Distributed Tracing

Track requests across microservices:

User → API Gateway → Auth Service → User Service → Database
         │              │               │
       Span 1        Span 2          Span 3
       ◄──────────── Trace ──────────────────►

OpenTelemetry (Standard)

```yaml
# Run the OpenTelemetry Collector as a Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib
          args: ["--config=/etc/otel/config.yaml"]
```
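The collector reads its pipeline from the config file passed via `--config` (mounted from a ConfigMap in practice). A minimal sketch that receives OTLP traces and forwards them to a backend; the `tempo:4317` endpoint is an assumption, substitute your tracing backend's address:

```yaml
# /etc/otel/config.yaml (sketch)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  otlp:
    endpoint: tempo:4317   # assumed backend address
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Apps send spans to the collector's OTLP endpoint; the `batch` processor buffers them before export to reduce backend load.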

Kubernetes Dashboard & Built-in Tools

```bash
# Built-in metrics (requires metrics-server)
kubectl top nodes
kubectl top pods
kubectl top pods --containers

# Events
kubectl get events --sort-by='.lastTimestamp'
kubectl get events -n production --field-selector type=Warning

# Logs
kubectl logs -f deployment/api --all-containers
kubectl logs -l app=api --since=1h

# Install metrics-server (if not present)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```

FAANG Interview Angle

Common questions:

  1. "How would you set up monitoring for a K8s cluster?"
  2. "How do you handle logging in a microservices architecture?"
  3. "What metrics would you monitor for an application?"
  4. "Explain distributed tracing"
  5. "How do you debug a pod that keeps crashing?"

Key answers:

  • Prometheus + Grafana for metrics, EFK/Loki for logs, Jaeger/Tempo for traces
  • DaemonSet log collectors shipping to centralized backend; structured JSON logs to stdout
  • RED (Rate, Errors, Duration) for services; USE (Utilization, Saturation, Errors) for infra
  • Traces follow a request across services; each hop is a span; all spans form a trace
  • kubectl describe pod (events), kubectl logs --previous, resource limits, probes, events

Official Links