# 21 - Auto-Scaling

## Three Types of Scaling in Kubernetes
| Type | What It Scales | Based On |
|---|---|---|
| HPA (Horizontal Pod Autoscaler) | Number of pods | CPU, memory, custom metrics |
| VPA (Vertical Pod Autoscaler) | Pod resource requests/limits | Historical usage |
| Cluster Autoscaler | Number of nodes | Pending pods |
## Horizontal Pod Autoscaler (HPA)
Automatically adjusts the number of pod replicas.
### Prerequisites

```bash
# Metrics Server must be installed
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify
kubectl top pods
```
### HPA Based on CPU

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70          # Scale when CPU > 70%
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 60s before scaling up
      policies:
      - type: Percent
        value: 100                      # Double pods at most
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 10                       # Remove 10% of pods at most
        periodSeconds: 60
```
### HPA Based on Multiple Metrics

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 50
  metrics:
  # CPU utilization
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Memory utilization
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  # Custom metric (requests per second from Prometheus)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  # External metric (SQS queue length)
  - type: External
    external:
      metric:
        name: sqs_queue_length
        selector:
          matchLabels:
            queue: "orders"
      target:
        type: Value
        value: "100"
```
### HPA with kubectl

```bash
# Create quickly
kubectl autoscale deployment api --min=2 --max=20 --cpu-percent=70

# Check status
kubectl get hpa
kubectl describe hpa api-hpa

# See scaling events
kubectl get events --field-selector involvedObject.name=api-hpa
```
### How HPA Works

Every 15 seconds (default), the HPA controller:

1. Fetches metrics from metrics-server (or the custom metrics API)
2. Calculates the desired replica count:
   ```
   desiredReplicas = ceil(currentReplicas × (currentMetric / targetMetric))
   ```
   Example: 3 pods at 90% CPU with a 70% target:
   `desired = ceil(3 × (90/70)) = ceil(3.86) = 4`
3. Applies the stabilization window and scaling policies
4. Updates the Deployment's replica count if needed
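The core of the calculation above can be sketched in a few lines of Python. This is a simplification, not the controller's actual code: the real HPA also averages per-pod metrics, handles missing readings, and applies the behavior policies shown earlier. The 10% tolerance band is the controller's default for suppressing tiny adjustments.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Simplified sketch of the HPA replica formula:
    desiredReplicas = ceil(currentReplicas * (current / target))."""
    ratio = current_metric / target_metric
    # When the ratio is within tolerance of 1.0, HPA skips scaling
    # to avoid flapping around the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# The worked example above: 3 pods at 90% CPU with a 70% target -> 4 pods
print(desired_replicas(3, 90, 70))  # 4
```

Note how the tolerance band means 4 pods at 72% CPU against a 70% target would *not* trigger a scale-up, even though the ratio is slightly above 1.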
## Vertical Pod Autoscaler (VPA)
Automatically adjusts pod CPU/memory requests and limits.
```bash
# Install VPA (shipped in the kubernetes/autoscaler repo; there is no
# single release manifest -- use the provided install script)
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
```
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Auto"   # Off | Initial | Recreate | Auto
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi
```
| Mode | Behavior |
|---|---|
| `Off` | Only publishes recommendations (no action) |
| `Initial` | Only sets requests on pod creation |
| `Recreate` | Evicts pods to apply new requests |
| `Auto` | Currently equivalent to `Recreate`; intended to use in-place updates once supported |
**Warning:** Don't use HPA and VPA on the same metric (e.g., both scaling on CPU). HPA scales pods, VPA resizes them; they'll fight each other.
## Cluster Autoscaler
Adds/removes nodes when pods can't be scheduled or nodes are underutilized.
```bash
# On EKS
eksctl create nodegroup --cluster=my-cluster \
  --name=workers --nodes-min=2 --nodes-max=20 --asg-access

# On GKE (built-in)
gcloud container clusters update my-cluster \
  --enable-autoscaling --min-nodes=2 --max-nodes=20
```
### How Cluster Autoscaler Works

1. **Scale up:** a pod can't be scheduled (`Pending` state)
   → Cluster Autoscaler detects this
   → Requests a new node from the cloud provider
   → Node joins the cluster → pod gets scheduled
2. **Scale down:** a node is underutilized (< 50% resources for 10+ min)
   → Cluster Autoscaler checks whether its pods can be moved elsewhere
   → Drains the node (evicts pods)
   → Removes the node from the cloud provider
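The scale-down check in step 2 can be sketched as a simple predicate. This is a hypothetical simplification: the real autoscaler also verifies that every evicted pod can be rescheduled elsewhere and respects PodDisruptionBudgets and node annotations.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu_used: float            # fraction of allocatable CPU in use
    mem_used: float            # fraction of allocatable memory in use
    underutilized_minutes: int # how long usage has stayed below threshold

def scale_down_candidates(nodes, threshold=0.5, grace_minutes=10):
    """A node is a candidate when both CPU and memory stay below the
    utilization threshold (default 50%) for the grace period (default 10 min)."""
    return [
        n.name for n in nodes
        if max(n.cpu_used, n.mem_used) < threshold
        and n.underutilized_minutes >= grace_minutes
    ]

nodes = [
    Node("node-a", 0.80, 0.60, 0),   # busy -> keep
    Node("node-b", 0.20, 0.30, 15),  # idle long enough -> candidate
    Node("node-c", 0.10, 0.10, 3),   # idle, but not for 10 min yet -> keep
]
print(scale_down_candidates(nodes))  # ['node-b']
```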
## Karpenter (Modern Alternative to Cluster Autoscaler)

```yaml
# Karpenter NodePool (AWS)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64", "arm64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values: ["m5", "m6i", "c5", "c6i"]
      nodeClassRef:
        name: default
  limits:
    cpu: "100"
    memory: 400Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
```
Karpenter advantages over Cluster Autoscaler:
- Faster scaling (seconds vs minutes)
- Right-sizes instances (picks optimal instance type)
- Supports Spot instances natively
- No node groups required
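The "right-sizing" advantage can be illustrated with a toy version of the selection problem: pick the cheapest instance type that fits the pending pods' aggregate requests. The instance specs and prices below are illustrative only (not real AWS pricing), and real Karpenter bin-packs across multiple nodes rather than choosing a single one.

```python
# name: (vCPU, memory GiB, $/hour) -- illustrative numbers, not real pricing
CATALOG = {
    "m5.large":   (2, 8,  0.096),
    "m5.xlarge":  (4, 16, 0.192),
    "c5.2xlarge": (8, 16, 0.340),
    "m5.2xlarge": (8, 32, 0.384),
}

def pick_instance(pending_cpu, pending_mem_gib):
    """Cheapest instance type that can hold the pending pods (single-node sketch)."""
    fitting = [
        (price, name)
        for name, (cpu, mem, price) in CATALOG.items()
        if cpu >= pending_cpu and mem >= pending_mem_gib
    ]
    return min(fitting)[1] if fitting else None

# 3 pending pods requesting 1 vCPU / 4Gi each -> needs 3 vCPU, 12 GiB
print(pick_instance(3, 12))  # m5.xlarge
```

Cluster Autoscaler, by contrast, can only add nodes of the fixed instance types its node groups were created with.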
## KEDA (Kubernetes Event-Driven Autoscaling)
Scale based on event sources (queues, streams, cron):
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0     # Scale to zero!
  maxReplicaCount: 100
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456/orders
      queueLength: "5"   # 1 pod per 5 messages
  - type: cron
    metadata:
      timezone: America/New_York
      start: "0 8 * * *"      # Scale up at 8 AM
      end: "0 20 * * *"       # Scale down at 8 PM
      desiredReplicas: "10"
```
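The queue trigger's replica math can be sketched as follows. This is a simplified model of the HPA that KEDA creates under the hood: one pod per `queueLength` messages, clamped to `minReplicaCount`/`maxReplicaCount`, with scale-to-zero when the backlog is empty.

```python
import math

def keda_replicas(queue_length, per_pod_target,
                  min_replicas=0, max_replicas=100):
    """Sketch of how a KEDA queue trigger maps backlog to replicas."""
    if queue_length == 0:
        # With minReplicaCount: 0, the workload scales to zero.
        return min_replicas
    desired = math.ceil(queue_length / per_pod_target)
    return max(min_replicas, min(desired, max_replicas))

print(keda_replicas(0, 5))       # 0   (scale to zero)
print(keda_replicas(42, 5))      # 9   (ceil(42/5))
print(keda_replicas(10_000, 5))  # 100 (capped at maxReplicaCount)
```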
## Scaling Strategy Summary

```
Traffic spike  → HPA adds pods (seconds)
      ↓
Pods pending   → Cluster Autoscaler/Karpenter adds nodes (minutes)
      ↓
Traffic drops  → HPA removes pods (after stabilization window)
      ↓
Nodes empty    → Cluster Autoscaler removes nodes
```
## FAANG Interview Angle

**Common questions:**
- "How does auto-scaling work in Kubernetes?"
- "What's the difference between HPA, VPA, and Cluster Autoscaler?"
- "How would you handle a sudden traffic spike?"
- "How do you scale to zero?"
- "What metrics would you use for auto-scaling?"
**Key answers:**
- HPA scales pods horizontally, VPA adjusts resources vertically, CA adds/removes nodes
- HPA: more pods for stateless. VPA: better resources for stateful. CA: more nodes for capacity
- Pre-configured HPA with an aggressive scale-up policy, Cluster Autoscaler/Karpenter for node capacity, and possibly over-provisioned headroom so pods can land while new nodes boot
- KEDA enables scale-to-zero based on external event sources
- CPU/memory for general, custom metrics (RPS, queue depth) for event-driven