21 - Auto-Scaling

Three Types of Scaling in Kubernetes

| Type | What It Scales | Based On |
|------|----------------|----------|
| HPA (Horizontal Pod Autoscaler) | Number of pods | CPU, memory, custom metrics |
| VPA (Vertical Pod Autoscaler) | Pod resource requests/limits | Historical usage |
| Cluster Autoscaler | Number of nodes | Pending pods |

Horizontal Pod Autoscaler (HPA)

Automatically adjusts the number of pod replicas.

Prerequisites

```bash
# Metrics Server must be installed
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify
kubectl top pods
```

HPA Based on CPU

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70      # Scale when average CPU > 70%
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 60s before scaling up
      policies:
      - type: Percent
        value: 100                      # At most double the pods
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 10                       # Remove at most 10% of pods
        periodSeconds: 60
```

HPA Based on Multiple Metrics

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 50
  metrics:
  # CPU utilization
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Memory utilization
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  # Custom metric (requests per second from Prometheus)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  # External metric (SQS queue length)
  - type: External
    external:
      metric:
        name: sqs_queue_length
        selector:
          matchLabels:
            queue: "orders"
      target:
        type: Value
        value: "100"
```
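When several metrics are configured, the HPA computes a desired replica count per metric and acts on the largest one, clamped to the min/max replica bounds. A minimal Python sketch of that selection (the metric numbers are hypothetical):

```python
import math

def desired_for_metric(current_replicas: int, current: float, target: float) -> int:
    """Replica proposal for a single metric."""
    return math.ceil(current_replicas * current / target)

def desired_replicas(current_replicas, metrics, min_replicas, max_replicas):
    # One proposal per metric; the HPA takes the maximum,
    # then clamps to minReplicas/maxReplicas
    proposals = [desired_for_metric(current_replicas, c, t) for c, t in metrics]
    return max(min_replicas, min(max_replicas, max(proposals)))

# 5 pods: CPU at 60 (target 70), memory at 90 (target 80),
# RPS at 1500 per pod (target 1000) -> RPS drives the scale-up
print(desired_replicas(5, [(60, 70), (90, 80), (1500, 1000)], 3, 50))  # 8
```

Because the maximum wins, a single hot metric is enough to scale out even when the others are under target.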

HPA with kubectl

```bash
# Create quickly
kubectl autoscale deployment api --min=2 --max=20 --cpu-percent=70

# Check status
kubectl get hpa
kubectl describe hpa api-hpa

# See scaling events
kubectl get events --field-selector involvedObject.name=api-hpa
```

How HPA Works

Every 15 seconds (default):

1. Fetch metrics from metrics-server (or custom metrics API)
2. Calculate desired replicas:
   desiredReplicas = ceil(currentReplicas × (currentMetric / targetMetric))
   
   Example: 3 pods at 90% CPU, target 70%
   desired = ceil(3 × (90/70)) = ceil(3.86) = 4

3. Apply stabilization window and scaling policies
4. Update Deployment replicas if needed
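The formula in step 2 can be sketched in Python. The controller also applies a tolerance band (0.1 by default, configurable via `--horizontal-pod-autoscaler-tolerance`) inside which no scaling happens; this sketch is a simplification of the real algorithm:

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Simplified HPA replica calculation."""
    ratio = current_metric / target_metric
    # Within the tolerance band, leave the replica count alone
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# 3 pods at 90% CPU, target 70% -> scale to 4
print(desired_replicas(3, 90, 70))   # 4
# 10 pods at 72% CPU, target 70% -> within tolerance, stay at 10
print(desired_replicas(10, 72, 70))  # 10
```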

Vertical Pod Autoscaler (VPA)

Automatically adjusts pod CPU/memory requests and limits.

```bash
# Install VPA
kubectl apply -f https://github.com/kubernetes/autoscaler/releases/latest/download/vertical-pod-autoscaler.yaml
```
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Auto"   # Off | Initial | Recreate | Auto
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi
```
| Mode | Behavior |
|------|----------|
| Off | Only recommends (no action) |
| Initial | Only sets requests on pod creation |
| Recreate | Restarts pods to apply new requests |
| Auto | Updates in place if possible, otherwise recreates |

Warning: Don't use HPA and VPA on the same metric (e.g., both acting on CPU). HPA adds pods while VPA resizes them, so the two controllers will fight each other.

Cluster Autoscaler

Adds/removes nodes when pods can't be scheduled or nodes are underutilized.

```bash
# On EKS
eksctl create nodegroup --cluster=my-cluster \
  --name=workers --nodes-min=2 --nodes-max=20 --asg-access

# On GKE (built-in)
gcloud container clusters update my-cluster \
  --enable-autoscaling --min-nodes=2 --max-nodes=20
```

How Cluster Autoscaler Works

1. Pod can't be scheduled (Pending state)
   → Cluster Autoscaler detects this
   → Requests new node from cloud provider
   → Node joins cluster → pod gets scheduled

2. Node underutilized (< 50% resources for 10+ min)
   → Cluster Autoscaler checks if pods can be moved
   → Drains the node (evicts pods)
   → Removes node from cloud provider
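The two decisions above can be modeled as a toy sketch, assuming the defaults mentioned (50% utilization threshold, 10 minutes of underutilization); node names and numbers are hypothetical:

```python
def scale_up_needed(pending_pods: int) -> bool:
    # Any unschedulable (Pending) pod triggers a request for a new node
    return pending_pods > 0

def scale_down_candidates(nodes, threshold=0.5, min_idle_minutes=10):
    """Nodes underutilized long enough to be drained and removed."""
    return [
        name for name, (utilization, idle_minutes) in nodes.items()
        if utilization < threshold and idle_minutes >= min_idle_minutes
    ]

nodes = {
    "node-a": (0.85, 0),    # busy -> keep
    "node-b": (0.30, 15),   # underutilized for 15 min -> candidate
    "node-c": (0.40, 5),    # underutilized, but not long enough -> keep
}
print(scale_up_needed(2))            # True
print(scale_down_candidates(nodes))  # ['node-b']
```

The real autoscaler also checks whether evicted pods can actually be rescheduled elsewhere (PodDisruptionBudgets, affinity rules) before draining a node; this sketch omits that step.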

Karpenter (Modern Alternative to Cluster Autoscaler)

```yaml
# Karpenter NodePool (AWS)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64", "arm64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values: ["m5", "m6i", "c5", "c6i"]
      nodeClassRef:
        name: default
  limits:
    cpu: "100"
    memory: 400Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
```

Karpenter advantages over Cluster Autoscaler:

  • Faster scaling (seconds vs minutes)
  • Right-sizes instances (picks optimal instance type)
  • Supports Spot instances natively
  • No node groups required

KEDA (Kubernetes Event-Driven Autoscaling)

Scale based on event sources (queues, streams, cron):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0     # Scale to zero!
  maxReplicaCount: 100
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456/orders
      queueLength: "5"   # 1 pod per 5 messages
  - type: cron
    metadata:
      timezone: America/New_York
      start: "0 8 * * *"     # Scale up at 8 AM
      end: "0 20 * * *"      # Scale down at 8 PM
      desiredReplicas: "10"
```

Scaling Strategy Summary

Traffic spike → HPA adds pods (seconds)
                  ↓
Pods pending → Cluster Autoscaler/Karpenter adds nodes (minutes)
                  ↓
Traffic drops → HPA removes pods (stabilization window)
                  ↓
Nodes empty → Cluster Autoscaler removes nodes

FAANG Interview Angle

Common questions:

  1. "How does auto-scaling work in Kubernetes?"
  2. "What's the difference between HPA, VPA, and Cluster Autoscaler?"
  3. "How would you handle a sudden traffic spike?"
  4. "How do you scale to zero?"
  5. "What metrics would you use for auto-scaling?"

Key answers:

  • HPA scales pods horizontally, VPA adjusts resources vertically, CA adds/removes nodes
  • HPA: more pods for stateless. VPA: better resources for stateful. CA: more nodes for capacity
  • Pre-configured HPA with aggressive scale-up, cluster autoscaler for node capacity, possibly over-provision
  • KEDA enables scale-to-zero based on external event sources
  • CPU/memory for general, custom metrics (RPS, queue depth) for event-driven

Official Links