# 21 - Auto-Scaling

## Three Types of Scaling in Kubernetes
| Type | What It Scales | Based On |
|---|---|---|
| HPA (Horizontal Pod Autoscaler) | Number of pods | CPU, memory, custom metrics |
| VPA (Vertical Pod Autoscaler) | Pod resource requests/limits | Historical usage |
| Cluster Autoscaler | Number of nodes | Pending pods |
## Horizontal Pod Autoscaler (HPA)
Automatically adjusts the number of pod replicas.
### Prerequisites

```bash
# Metrics Server must be installed
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify
kubectl top pods
```
### HPA Based on CPU

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70          # Scale when CPU > 70%
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 60s before scaling up
      policies:
      - type: Percent
        value: 100                      # Double pods at most
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 10                       # Remove 10% of pods at most
        periodSeconds: 60
```
### HPA Based on Multiple Metrics

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 50
  metrics:
  # CPU utilization
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Memory utilization
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  # Custom metric (requests per second from Prometheus)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  # External metric (SQS queue length)
  - type: External
    external:
      metric:
        name: sqs_queue_length
        selector:
          matchLabels:
            queue: "orders"
      target:
        type: Value
        value: "100"
```
### HPA with kubectl

```bash
# Create quickly
kubectl autoscale deployment api --min=2 --max=20 --cpu-percent=70

# Check status
kubectl get hpa
kubectl describe hpa api-hpa

# See scaling events
kubectl get events --field-selector involvedObject.name=api-hpa
```
### How HPA Works

Every 15 seconds (default), the HPA controller:

1. Fetches metrics from metrics-server (or the custom metrics API)
2. Calculates the desired replica count:
   ```
   desiredReplicas = ceil(currentReplicas × (currentMetric / targetMetric))
   ```
   Example: 3 pods at 90% CPU with a 70% target:
   `desired = ceil(3 × (90/70)) = ceil(3.86) = 4`
3. Applies the stabilization window and scaling policies
4. Updates the Deployment's replica count if needed
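The core of the calculation above can be sketched in a few lines of Python. This is a simplification, not the controller's actual code: the real HPA also averages per-pod metrics, handles missing readings, and applies the behavior policies shown earlier. The 10% tolerance band is the controller's default for suppressing tiny adjustments.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Simplified sketch of the HPA replica formula:
    desiredReplicas = ceil(currentReplicas * (current / target))."""
    ratio = current_metric / target_metric
    # When the ratio is within tolerance of 1.0, HPA skips scaling
    # to avoid flapping around the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# The worked example above: 3 pods at 90% CPU with a 70% target -> 4 pods
print(desired_replicas(3, 90, 70))  # 4
```

Note how the tolerance band means 4 pods at 72% CPU against a 70% target would *not* trigger a scale-up, even though the ratio is slightly above 1.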
## Vertical Pod Autoscaler (VPA)
Automatically adjusts pod CPU/memory requests and limits.
```bash
# Install VPA (shipped in the kubernetes/autoscaler repo; there is no
# single release manifest -- use the provided install script)
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
```
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Auto"   # Off | Initial | Recreate | Auto
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi
```
| Mode | Behavior |
|---|---|
| `Off` | Only publishes recommendations (no action) |
| `Initial` | Only sets requests on pod creation |
| `Recreate` | Evicts pods to apply new requests |
| `Auto` | Currently equivalent to `Recreate`; intended to use in-place updates once supported |
**Warning:** Don't use HPA and VPA on the same metric (e.g., both scaling on CPU). HPA scales pods, VPA resizes them; they'll fight each other.
## Cluster Autoscaler
Adds/removes nodes when pods can't be scheduled or nodes are underutilized.
```bash
# On EKS
eksctl create nodegroup --cluster=my-cluster \
  --name=workers --nodes-min=2 --nodes-max=20 --asg-access

# On GKE (built-in)
gcloud container clusters update my-cluster \
  --enable-autoscaling --min-nodes=2 --max-nodes=20
```
### How Cluster Autoscaler Works

1. **Scale up:** a pod can't be scheduled (`Pending` state)
   → Cluster Autoscaler detects this
   → Requests a new node from the cloud provider
   → Node joins the cluster → pod gets scheduled
2. **Scale down:** a node is underutilized (< 50% resources for 10+ min)
   → Cluster Autoscaler checks whether its pods can be moved elsewhere
   → Drains the node (evicts pods)
   → Removes the node from the cloud provider
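The scale-down check in step 2 can be sketched as a simple predicate. This is a hypothetical simplification: the real autoscaler also verifies that every evicted pod can be rescheduled elsewhere and respects PodDisruptionBudgets and node annotations.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu_used: float            # fraction of allocatable CPU in use
    mem_used: float            # fraction of allocatable memory in use
    underutilized_minutes: int # how long usage has stayed below threshold

def scale_down_candidates(nodes, threshold=0.5, grace_minutes=10):
    """A node is a candidate when both CPU and memory stay below the
    utilization threshold (default 50%) for the grace period (default 10 min)."""
    return [
        n.name for n in nodes
        if max(n.cpu_used, n.mem_used) < threshold
        and n.underutilized_minutes >= grace_minutes
    ]

nodes = [
    Node("node-a", 0.80, 0.60, 0),   # busy -> keep
    Node("node-b", 0.20, 0.30, 15),  # idle long enough -> candidate
    Node("node-c", 0.10, 0.10, 3),   # idle, but not for 10 min yet -> keep
]
print(scale_down_candidates(nodes))  # ['node-b']
```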
## Karpenter (Modern Alternative to Cluster Autoscaler)

```yaml
# Karpenter NodePool (AWS)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64", "arm64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values: ["m5", "m6i", "c5", "c6i"]
      nodeClassRef:
        name: default
  limits:
    cpu: "100"
    memory: 400Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
```
Karpenter advantages over Cluster Autoscaler:
- Faster scaling (seconds vs minutes)
- Right-sizes instances (picks optimal instance type)
- Supports Spot instances natively
- No node groups required
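The "right-sizing" advantage can be illustrated with a toy version of the selection problem: pick the cheapest instance type that fits the pending pods' aggregate requests. The instance specs and prices below are illustrative only (not real AWS pricing), and real Karpenter bin-packs across multiple nodes rather than choosing a single one.

```python
# name: (vCPU, memory GiB, $/hour) -- illustrative numbers, not real pricing
CATALOG = {
    "m5.large":   (2, 8,  0.096),
    "m5.xlarge":  (4, 16, 0.192),
    "c5.2xlarge": (8, 16, 0.340),
    "m5.2xlarge": (8, 32, 0.384),
}

def pick_instance(pending_cpu, pending_mem_gib):
    """Cheapest instance type that can hold the pending pods (single-node sketch)."""
    fitting = [
        (price, name)
        for name, (cpu, mem, price) in CATALOG.items()
        if cpu >= pending_cpu and mem >= pending_mem_gib
    ]
    return min(fitting)[1] if fitting else None

# 3 pending pods requesting 1 vCPU / 4Gi each -> needs 3 vCPU, 12 GiB
print(pick_instance(3, 12))  # m5.xlarge
```

Cluster Autoscaler, by contrast, can only add nodes of the fixed instance types its node groups were created with.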
## KEDA (Kubernetes Event-Driven Autoscaling)
Scale based on event sources (queues, streams, cron):
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0     # Scale to zero!
  maxReplicaCount: 100
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456/orders
      queueLength: "5"   # 1 pod per 5 messages
  - type: cron
    metadata:
      timezone: America/New_York
      start: "0 8 * * *"      # Scale up at 8 AM
      end: "0 20 * * *"       # Scale down at 8 PM
      desiredReplicas: "10"
```
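The queue trigger's replica math can be sketched as follows. This is a simplified model of the HPA that KEDA creates under the hood: one pod per `queueLength` messages, clamped to `minReplicaCount`/`maxReplicaCount`, with scale-to-zero when the backlog is empty.

```python
import math

def keda_replicas(queue_length, per_pod_target,
                  min_replicas=0, max_replicas=100):
    """Sketch of how a KEDA queue trigger maps backlog to replicas."""
    if queue_length == 0:
        # With minReplicaCount: 0, the workload scales to zero.
        return min_replicas
    desired = math.ceil(queue_length / per_pod_target)
    return max(min_replicas, min(desired, max_replicas))

print(keda_replicas(0, 5))       # 0   (scale to zero)
print(keda_replicas(42, 5))      # 9   (ceil(42/5))
print(keda_replicas(10_000, 5))  # 100 (capped at maxReplicaCount)
```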
## Scaling Strategy Summary

```
Traffic spike  → HPA adds pods (seconds)
      ↓
Pods pending   → Cluster Autoscaler/Karpenter adds nodes (minutes)
      ↓
Traffic drops  → HPA removes pods (after stabilization window)
      ↓
Nodes empty    → Cluster Autoscaler removes nodes
```
## FAANG Interview Angle

**Common questions:**
- "How does auto-scaling work in Kubernetes?"
- "What's the difference between HPA, VPA, and Cluster Autoscaler?"
- "How would you handle a sudden traffic spike?"
- "How do you scale to zero?"
- "What metrics would you use for auto-scaling?"
**Key answers:**
- HPA scales pods horizontally, VPA adjusts resources vertically, CA adds/removes nodes
- HPA: more pods for stateless. VPA: better resources for stateful. CA: more nodes for capacity
- Pre-configured HPA with an aggressive scale-up policy, Cluster Autoscaler/Karpenter for node capacity, and possibly over-provisioned headroom so pods can land while new nodes boot
- KEDA enables scale-to-zero based on external event sources
- CPU/memory for general, custom metrics (RPS, queue depth) for event-driven