The Invisible Load Spike: Why Your Autoscaler Fails When It Matters Most (And How to Fix It)

Your autoscaler is a liar. It promises to scale your workloads when demand spikes, but in production, when revenue is on the line, it chokes. The HorizontalPodAutoscaler (HPA) controller loop runs every 15 seconds, metrics-server takes another 30 seconds to stabilize, and by the time new pods reach Ready your users have already refreshed their browsers and blamed your "slow API" on Twitter.
In dev environments, you can't wait for production traffic to validate autoscaling behavior. You need synthetic, deterministic load that exercises the entire control loop: HPA decision-making, scheduler placement, kubelet startup latency, and metrics collection. The polinux/stress image provides precisely this: a maintained, security-compliant CPU and memory workload generator that exposes every weakness in your autoscaling configuration before customers do.
Why polinux/stress Is the Right Tool
The polinux/stress image is an actively maintained fork of the Linux stress utility with modern base images that pass CVE scanning. It accepts --cpu N arguments to spawn N CPU-intensive workers and --vm with --vm-bytes for memory pressure. Unlike HTTP load generators (wrk, ab, locust), it isolates resource consumption from network I/O, making it ideal for validating:
- HPA controller calculations: The controller queries metrics-server for pod-level CPU utilization, computes currentMetricValue / desiredMetricValue, and scales when that ratio deviates from 1.0 by more than the tolerance (10% by default); a worked example follows this list
- Metrics-server scraping latency: The default scraping interval and the CPU initialization period (5 minutes by default) can delay scaling decisions
- CFS throttling behavior: Containers consuming more CPU than their limit experience throttling via cpu.cfs_quota_us, visible in the container_cpu_cfs_throttled_periods_total metric
- Scheduler and kubelet latency: Measures the time from an HPA scale-up decision to pod Ready status, including image pulls and CNI setup
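To make the controller math concrete, here is a minimal shell sketch of the documented HPA scale-up formula, using the utilization numbers from the demo below (the arithmetic trick implements an integer ceiling):
# desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)
current_replicas=1
current_utilization=200   # observed CPU as a percentage of the pod's request
target_utilization=50     # HPA target
desired=$(( (current_replicas * current_utilization + target_utilization - 1) / target_utilization ))
echo "desired replicas: $desired"   # => 4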
Compared to the unmaintained vish/stress, polinux/stress provides the same functionality while meeting modern security compliance requirements.
Hands-On Demo: HPA Scale-Up and Scale-Down
Use this Killercoda scenario to follow along!
Setup: Deploy the Workload with Zero-Load Baseline
Create a namespace and deployment starting in an idle state. This establishes a performance baseline before triggering load:
# stress-deployment.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: autoscale-demo
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-stress
  namespace: autoscale-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cpu-stress
  template:
    metadata:
      labels:
        app: cpu-stress
    spec:
      containers:
      - name: stress-ctr
        image: polinux/stress
        resources:
          limits:
            cpu: "1"
            memory: "512Mi"
          requests:
            cpu: "500m"
            memory: "256Mi"
        command: ["/bin/sh", "-c"]
        args: ["sleep infinity"]
Deploy it:
kubectl apply -f stress-deployment.yaml
# Wait for pod to be ready
kubectl wait --for=condition=ready pod -l app=cpu-stress -n autoscale-demo --timeout=60s
# Verify zero-load state
kubectl top pod -n autoscale-demo
# Expected: cpu-stress-xxxxx ~1m ~10Mi (near-zero CPU usage)
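If kubectl top errors out with "metrics not available", confirm that metrics-server is installed and serving the Metrics API before proceeding (these resource names are the standard ones for the metrics-server add-on):
# Verify the Metrics API is registered and available
kubectl get apiservice v1beta1.metrics.k8s.io
# Check the metrics-server deployment itself
kubectl get deployment metrics-server -n kube-system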
Configure HPA with CPU Target
Create an HPA targeting 50% CPU utilization:
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-stress-hpa
  namespace: autoscale-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-stress
  minReplicas: 1
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60  # Faster scale-down for demo
    scaleUp:
      stabilizationWindowSeconds: 0   # Immediate scale-up
Apply and verify:
kubectl apply -f hpa.yaml
# Wait for metrics to populate (15-30 seconds)
sleep 30
# Check HPA baseline
kubectl get hpa -n autoscale-demo -w
Establish Baseline Metrics
Before triggering load, validate that metrics-server is functioning and HPA recognizes the idle state:
# Describe HPA to see events and current status
kubectl describe hpa cpu-stress-hpa -n autoscale-demo
# Expected output includes:
# Metrics:
# Resource cpu on pods (as a percentage of request): 0% (0m) / 50%
# Current number of replicas: 1
# ScalingActive condition: True
This confirms the HPA control loop is active and baseline CPU is near zero.
Trigger Scale-Up with High CPU Load
Patch the deployment to generate CPU load:
kubectl patch deployment cpu-stress -n autoscale-demo --type='json' \
-p='[
{"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/cpu", "value": "100m"},
{"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "128Mi"},
{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "500m"},
{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "256Mi"},
{"op": "replace", "path": "/spec/template/spec/containers/0/command", "value": ["stress"]},
{"op": "replace", "path": "/spec/template/spec/containers/0/args", "value": ["--cpu", "2", "--verbose"]}
]'
# Wait for rollout
kubectl rollout status deployment/cpu-stress -n autoscale-demo
This combined patch accomplishes multiple objectives:
Resource optimization for a single node:
- CPU request: 500m → 100m (allows up to 6 pods on a node with ~1 CPU available)
- CPU limit: 1000m → 500m (prevents individual pods from consuming the entire node CPU)
- Memory request: 256Mi → 128Mi (reduces memory pressure)
- Memory limit: 512Mi → 256Mi (prevents OOM issues on small nodes)
Load generation:
- Command change: ["/bin/sh", "-c"] → ["stress"]
- 2 CPU workers: Spawns 2 stress processes attempting to consume CPU continuously
- No timeout: Runs indefinitely until patched back to sleep
With the new configuration, each pod creates:
- Actual usage: ~200m per pod (2 workers generating load)
- Utilization: 200m / 100m request = 200%
- HPA action: Current utilization (200%) exceeds the target (50%), triggering scale-up
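Before relying on in-cluster behavior, you can sanity-check the image locally. This sketch assumes Docker is available on your workstation and uses stress's built-in --timeout flag so the run self-terminates:
# Run the same workload outside the cluster for 10 seconds
docker run --rm polinux/stress stress --cpu 2 --timeout 10s --verbose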
Monitor scale-up with timestamps:
# Watch HPA in real-time (updates every 15s)
kubectl get hpa cpu-stress-hpa -n autoscale-demo -w
# In another terminal, watch pods scaling
kubectl get pods -n autoscale-demo -w
Expected timeline:
- T+15-30s: Metrics-server scrapes the kubelet, updates its cache
- T+30-45s: HPA polls metrics-server, sees 200% utilization
- T+45s: HPA updates deployment replicas (calculated as ceil(1 * (200/50)) = 4)
- T+60-90s: New pods reach Running and Ready state
- T+90-120s: Metrics stabilize across all pods
Expected HPA output after scale-up:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
cpu-stress-hpa Deployment/cpu-stress 200%/50% 1 6 4
# After pods stabilize and share 1 CPU on a single node:
cpu-stress-hpa Deployment/cpu-stress 166%/50% 1 6 6
The HPA will scale to maximum replicas (6) because each pod with 2 CPU workers generates approximately 200m CPU load, creating 200% utilization (200m actual / 100m request). After scaling to 6 pods on a single-node cluster with 1 total CPU, each pod receives ~166m average, maintaining 166% utilization and keeping HPA at maxReplicas.
Verify Scaling Behavior and Resource Distribution
# Check final pod count
kubectl get pods -n autoscale-demo -l app=cpu-stress
# Check per-pod CPU usage (wait 60s for metrics stability)
sleep 60
kubectl top pod -n autoscale-demo
# Expected: Each pod consuming ~166m (1000m node capacity / 6 pods)
# View HPA events showing scale decisions
kubectl describe hpa cpu-stress-hpa -n autoscale-demo | grep -A 10 "Events:"
# Expected events:
# - ScalingReplicaSet: New size: 4; reason: cpu resource utilization above target
# - ScalingReplicaSet: New size: 6; reason: cpu resource utilization above target
# Verify updated resource configuration
kubectl get deployment cpu-stress -n autoscale-demo -o jsonpath='{.spec.template.spec.containers[0].resources}' | jq
# Expected:
# {
# "limits": {"cpu": "500m", "memory": "256Mi"},
# "requests": {"cpu": "100m", "memory": "128Mi"}
# }
# Check node allocation
kubectl describe node <worker-node> | grep -A 5 "Allocated resources:"
# Expected: cpu ~825m (225m system + 600m from 6 pods @ 100m each)
If you have Prometheus installed, validate CPU throttling:
# Throttling periods per pod (indicates hitting CPU limits)
rate(container_cpu_cfs_throttled_periods_total{namespace="autoscale-demo"}[5m])
Test Scale-Down with Stabilization Window
Return to zero-load baseline and observe scale-down behavior:
kubectl patch deployment cpu-stress -n autoscale-demo --type='json' \
-p='[
{"op": "replace", "path": "/spec/template/spec/containers/0/command", "value": ["/bin/sh", "-c"]},
{"op": "replace", "path": "/spec/template/spec/containers/0/args", "value": ["sleep infinity"]}
]'
# Wait for rollout
kubectl rollout status deployment/cpu-stress -n autoscale-demo
We configured the HPA with a 60-second scale-down stabilization window to speed up this demo. Monitor the behavior:
# Watch HPA - note TARGETS drop but REPLICAS remain at 6
kubectl get hpa cpu-stress-hpa -n autoscale-demo -w
# Expected:
# T+0s:     cpu-stress-hpa   Deployment/cpu-stress   0%/50%   1   6   6
# T+~120s:  cpu-stress-hpa   Deployment/cpu-stress   0%/50%   1   6   1
Once the 60-second window elapses (plus metrics-pipeline latency), replicas scale down to 1. This delay prevents rapid scaling oscillations during temporary load drops.
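You can confirm the window the controller is honoring by reading it back from the live object:
# Read the configured scale-down stabilization window from the live HPA
kubectl get hpa cpu-stress-hpa -n autoscale-demo \
  -o jsonpath='{.spec.behavior.scaleDown.stabilizationWindowSeconds}'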
Advanced: Multi-Metric Autoscaling with Memory
HPA supports scaling on multiple metrics simultaneously, selecting the maximum scale recommendation:
# hpa-multi-metric.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: stress-multi-hpa
  namespace: autoscale-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-stress
  minReplicas: 2
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 120  # Faster scale-down for testing
      policies:
      - type: Percent
        value: 50        # Remove max 50% of pods
        periodSeconds: 60
      - type: Pods
        value: 2         # Or max 2 pods
        periodSeconds: 60
      selectPolicy: Min  # Use most conservative policy
    scaleUp:
      stabilizationWindowSeconds: 0  # Immediate scale-up
      policies:
      - type: Percent
        value: 100       # Double pods
        periodSeconds: 15
      - type: Pods
        value: 4         # Or add 4 pods
        periodSeconds: 15
      selectPolicy: Max  # Use most aggressive policy
Update the deployment to stress both CPU and memory:
kubectl patch deployment cpu-stress -n autoscale-demo --type='json' \
-p='[
{"op": "replace", "path": "/spec/template/spec/containers/0/command", "value": ["stress"]},
{"op": "replace", "path": "/spec/template/spec/containers/0/args", "value": ["--cpu", "2", "--vm", "1", "--vm-bytes", "100M", "--verbose"]}
]'
# Apply multi-metric HPA
kubectl apply -f hpa-multi-metric.yaml
# Wait for metrics to stabilize (60 seconds)
sleep 60
# Check multi-metric status
kubectl get hpa stress-multi-hpa -n autoscale-demo
Expected behavior (these figures assume the original 500m CPU / 256Mi memory requests):
- CPU: 1000m / 500m = 200% utilization → recommends ~7 replicas (ceil(2 * 200/60))
- Memory: 200Mi / 256Mi = 78% utilization → recommends ~3 replicas
- HPA decision: Selects the maximum (7 replicas)
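To see both metric streams the controller is weighing, read the status block directly (jq is optional, purely for readability):
kubectl get hpa stress-multi-hpa -n autoscale-demo \
  -o jsonpath='{.status.currentMetrics}' | jq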
Validating Scale-Up Latency and Metrics Pipeline
Measure end-to-end latency from load spike to pod readiness:
# Reset to baseline
kubectl patch deployment cpu-stress -n autoscale-demo --type='json' \
-p='[
{"op": "replace", "path": "/spec/template/spec/containers/0/command", "value": ["/bin/sh", "-c"]},
{"op": "replace", "path": "/spec/template/spec/containers/0/args", "value": ["sleep infinity"]}
]'
# Wait for scale-down to minReplicas
sleep 180
# Trigger load spike at T=0 and record timestamp
echo "Load spike triggered at: $(date +%H:%M:%S)"
kubectl patch deployment cpu-stress -n autoscale-demo --type='json' \
-p='[
{"op": "replace", "path": "/spec/template/spec/containers/0/command", "value": ["stress"]},
{"op": "replace", "path": "/spec/template/spec/containers/0/args", "value": ["--cpu", "30", "--verbose"]}
]'
# Monitor with timestamps every 10 seconds
for i in {1..12}; do
echo "--- T+$((i*10))s at $(date +%H:%M:%S) ---"
kubectl get hpa stress-multi-hpa -n autoscale-demo --no-headers
kubectl get pods -n autoscale-demo -l app=cpu-stress --no-headers | wc -l
sleep 10
done
Typical latency breakdown:
- T+0-30s: Initial readiness delay (default 30s, set via the kube-controller-manager flag --horizontal-pod-autoscaler-initial-readiness-delay)
- T+30-45s: HPA controller detects high utilization
- T+45-60s: Scheduler places new pods, image pull (if not cached)
- T+60-90s: Pods reach Ready, CNI initialization completes
- T+90-120s: Metrics include new pods, utilization stabilizes
This exposes real control-plane latencies that mirror production behavior.
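For precise timestamps of when the controller actually acted, the HPA's own events are the ground truth (SuccessfulRescale is the reason the HPA controller emits):
# HPA scale decisions with timestamps
kubectl get events -n autoscale-demo \
  --field-selector reason=SuccessfulRescale --sort-by=.lastTimestamp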
PromQL Queries for Observability
Monitor HPA and kubelet behavior with Prometheus:
# HPA decision latency (controller reconciliation time; metric exposed by
# kube-controller-manager in Kubernetes 1.26+)
rate(horizontal_pod_autoscaler_controller_reconciliation_duration_seconds_sum[5m])
  / rate(horizontal_pod_autoscaler_controller_reconciliation_duration_seconds_count[5m])
# CPU throttling per pod (indicates hitting limits)
rate(container_cpu_cfs_throttled_periods_total{namespace="autoscale-demo"}[5m])
# Metrics-server scrape latency
histogram_quantile(0.99,
rate(metrics_server_kubelet_request_duration_seconds_bucket[5m]))
# Pod startup time (requires kube-state-metrics)
kube_pod_start_time - kube_pod_created
Production-Ready Configuration Patterns
Approach A: Conservative (Avoid Flapping)
- scaleDown.stabilizationWindowSeconds: 300 (5 min)
- scaleUp.stabilizationWindowSeconds: 60
- Target utilization: 70-80% (see the behavior sketch after this list)
Pros: Minimal pod churn, predictable costs, cache warmth preserved
Cons: Slower response to traffic drops, potential over-provisioning
When: Steady workloads with gradual changes, stateful apps with warm caches
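As a sketch, Approach A maps onto an HPA behavior block like this (values taken from the list above; tune them to your workload):
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300  # 5 min: ride out short dips
  scaleUp:
    stabilizationWindowSeconds: 60   # damp reactions to brief spikes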
Approach B: Aggressive (Low Latency)
- scaleDown.stabilizationWindowSeconds: 60
- scaleUp.stabilizationWindowSeconds: 0
- Target utilization: 40-50%
Pros: Fast scale-up, headroom for bursts, responsive to spikes
Cons: Higher pod churn, wasted capacity, cold cache restarts
When: Latency-sensitive APIs, unpredictable spikes, stateless workloads
Approach C: Custom Metrics (Advanced)
Use custom metrics (requests/sec, queue depth) instead of CPU for application-aware scaling. Requires Prometheus Adapter or KEDA.
Pros: Application semantics, not resource proxies; accounts for P99 latency
Cons: Complexity, external dependencies, custom instrumentation
When: RPS-based workloads, async job queues, SLA-driven scaling
The polinux/stress image removes the guesswork from autoscaling validation while meeting modern security requirements. By establishing a zero-load baseline and then generating deterministic CPU and memory load, you validate HPA calculations, expose metrics-server latency, measure scheduler performance, and identify throttling issues—all before production traffic exercises these code paths. Starting small and scaling gradually reveals configuration problems early, ensuring your autoscaler actually works when it matters most.
Next Steps
Expand Autoscaling Scope
Vertical Pod Autoscaler (VPA)
Move beyond horizontal scaling to automatically adjust CPU and memory requests/limits based on actual usage patterns. VPA analyzes historical consumption, peak usage, and OOM events to calculate optimal resource recommendations.
# Install VPA (separate project, not part of core Kubernetes)
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
# Create VPA for your stress workload
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: stress-vpa
  namespace: autoscale-demo
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-stress
  updatePolicy:
    updateMode: "Auto"  # Auto applies recommendations
  resourcePolicy:
    containerPolicies:
    - containerName: stress-ctr
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      maxAllowed:
        cpu: "2"
        memory: "1Gi"
EOF
VPA recommends target, lower bound, and upper bound resource values based on actual consumption trends.
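Once the VPA recommender has collected a few minutes of samples, its output lands in the object's status (field names per the autoscaling.k8s.io/v1 API):
kubectl get vpa stress-vpa -n autoscale-demo \
  -o jsonpath='{.status.recommendation.containerRecommendations}' | jq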
Cluster Autoscaler / Karpenter
Scale the node layer to ensure sufficient capacity for your autoscaling pods. Cluster Autoscaler adds/removes nodes based on pending pods and node utilization.
KEDA (Event-Driven Autoscaling)
Move beyond resource-based metrics to scale on application events like queue depth, Kafka lag, HTTP requests/sec. KEDA is a CNCF-graduated project supporting 60+ scalers.
# Install KEDA (pinned release; check the KEDA releases page for newer versions)
kubectl apply -f https://github.com/kedacore/keda/releases/download/v2.13.0/keda-2.13.0.yaml
# Example: Scale based on Prometheus metrics
cat <<EOF | kubectl apply -f -
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prometheus-scaler
spec:
  scaleTargetRef:
    name: cpu-stress
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: http_requests_total
      threshold: '100'
      query: sum(rate(http_requests_total[2m]))
EOF
This scales based on actual application semantics rather than CPU/memory proxies.
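Under the hood, KEDA manages a regular HPA on your behalf. A quick way to see it (assuming the ScaledObject above is applied in the same namespace as its target) is to filter on KEDA's ownership label:
# KEDA creates and manages an HPA named keda-hpa-<scaledobject-name>
kubectl get hpa -l scaledobject.keda.sh/name=prometheus-scaler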
Advanced HPA Configuration
Custom Metrics with Prometheus Adapter
Integrate Prometheus metrics into HPA for application-aware autoscaling:
# Install Prometheus Adapter
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter
# Configure HPA with custom metric
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: custom-metrics-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-stress
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
EOF
Scheduled Autoscaling
Pre-scale workloads for predictable traffic patterns using CronJobs that patch HPA min/max replicas:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-business-hours
spec:
  schedule: "0 8 * * 1-5"  # 8 AM weekdays
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            - kubectl patch hpa cpu-stress-hpa -n autoscale-demo --patch '{"spec":{"minReplicas":5}}'
          restartPolicy: OnFailure
This proactively scales before demand surges, eliminating cold-start latency.
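A minimal companion sketch handles the reverse direction, relaxing minReplicas after hours (the schedule is an assumed window; note the job's ServiceAccount needs RBAC permission to patch HPAs, which is not shown here):
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-after-hours
spec:
  schedule: "0 19 * * 1-5"  # 7 PM weekdays (assumed window)
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            - kubectl patch hpa cpu-stress-hpa -n autoscale-demo --patch '{"spec":{"minReplicas":1}}'
          restartPolicy: OnFailure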
Production Observability
Integrate Grafana Dashboards
Visualize HPA metrics, pod lifecycle events, and resource utilization:
- HPA metrics: kube_horizontalpodautoscaler_status_current_replicas, kube_horizontalpodautoscaler_status_desired_replicas
- Scaling events: Query Kubernetes events for ScalingReplicaSet reasons
- Pod startup latency: kube_pod_start_time - kube_pod_created
Alert on Autoscaling Issues
Set up Prometheus alerts for scaling failures:
- alert: HPAMaxedOut
  expr: kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_spec_max_replicas
  for: 15m
  annotations:
    summary: "HPA {{ $labels.horizontalpodautoscaler }} has been at max replicas for 15+ minutes"
- alert: HPAScalingDisabled
  expr: kube_horizontalpodautoscaler_status_condition{condition="ScalingActive", status="false"} == 1
  annotations:
    summary: "HPA {{ $labels.horizontalpodautoscaler }} scaling is disabled"
Cost and Performance Optimization
Manual Resource Request Tuning
After observing HPA behavior, adjust resource requests to improve efficiency:
- Over-provisioning: If HPA rarely scales and CPU stays at 20-30%, reduce requests
- Under-provisioning: If HPA constantly maxes out and pods throttle, increase requests
- Right-sizing: Match requests to P95 usage, not peak (see the PromQL sketch below)
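One way to approximate P95 usage, assuming Prometheus is scraping cAdvisor metrics, is a subquery over a day of 5-minute rates:
# P95 of 5-minute CPU usage rates over the last day, per container
quantile_over_time(0.95,
  rate(container_cpu_usage_seconds_total{namespace="autoscale-demo"}[5m])[1d:5m])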
Load Testing in CI/CD
Integrate autoscaling validation into your pipeline using k6 or Locust:
# .gitlab-ci.yml example
autoscaling-test:
  stage: test
  script:
    - kubectl apply -f hpa.yaml
    - k6 run --vus 100 --duration 5m load-test.js
    - kubectl get hpa cpu-stress-hpa -o jsonpath='{.status.currentReplicas}' | grep -v "^1$"  # Verify scale-up
This catches autoscaling regressions before production.
StormForge Resources for ML-Powered Optimization
StormForge provides machine learning-driven Kubernetes optimization that goes beyond manual tuning. Their platform automates resource rightsizing, HPA configuration, and cost optimization using observability data and experimentation.
StormForge Optimize Live
Automated Resource Recommendations
StormForge Optimize Live continuously analyzes production workloads and generates ML-based recommendations for CPU/memory requests, limits, and HPA target utilization.
Key features:
- Continuous optimization: Monitors resource usage and application behavior to identify cost-saving configurations automatically
- HPA target tuning: Recommends optimal HPA target utilization percentages (not just a static 50%)
- OOM protection: Configures temporary memory bump-ups in response to OOM events
- One-click apply: Directly applies recommendations to your cluster via the StormForge controller
Getting Started with StormForge
StormForge provides a comprehensive tutorial for integrating their platform:
- Install the StormForge agent:
# Add Helm repo
helm repo add stormforge https://registry.stormforge.io/chartrepo/library
helm repo update
# Install agent with your API token
helm install stormforge-agent stormforge/stormforge-agent \
  --namespace stormforge-system \
  --create-namespace \
  --set authorization.token=YOUR_API_TOKEN
- Deploy an optimization experiment:
# Create optimization resource
stormforge optimize create nginx-optimization \
  --application=nginx \
  --namespace=autoscale-demo
- Review recommendations:
stormforge optimize recommendations nginx-optimization
- Apply optimizations:
Visit the StormForge web UI at https://app.stormforge.io and click "Apply Recommendations" to deploy optimized resource configurations directly to your cluster.
StormForge and Karpenter Integration
For AWS EKS clusters, StormForge integrates with Karpenter for full-stack optimization:
- StormForge optimizes workload requests: ML analyzes usage patterns and adjusts CPU/memory requests
- Karpenter provisions nodes: Observes incoming pod requests and provisions right-sized nodes
- Continuous feedback loop: As StormForge reduces requests, Karpenter consolidates nodes, reducing costs
AWS provides a detailed walkthrough showing a 30-40% cost reduction using this combination.
StormForge Documentation and Tutorials
Official Documentation
- Settings guide: Configure CPU/memory optimization, HPA target ranges, and OOM handling
- Optimization tutorial: Step-by-step guide for workload rightsizing
Video Resources
- Platform overview: "Solving the Kubernetes Efficiency Problem with StormForge"
- Product demo: "StormForge Optimize Live Product Demo" showing top-down visibility and recommendation workflows
- HPA/VPA deep dive: "HPA and VPA For Pods With StormForge" comparing horizontal vs vertical scaling with automated optimization
Key Advantage Over Manual Tuning
StormForge uses ML to analyze thousands of containers simultaneously, identifying optimization opportunities invisible to manual analysis. It automates the tedious process of profiling workloads, calculating percentile-based resource needs, and updating configurations—achieving enterprise-scale optimization with minimal manual intervention.
Continuous Optimization in CI/CD
StormForge supports automated optimization schedules via CI/CD pipelines:
# Example GitLab CI integration (the weekly cron, e.g. "0 2 * * 0", is
# configured as a pipeline schedule in the GitLab UI; the rule below limits
# this job to scheduled pipelines)
stormforge-optimize:
  stage: optimize
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
  script:
    - stormforge optimize run nginx-optimization
    - stormforge optimize recommendations nginx-optimization > recommendations.yaml
    - kubectl apply -f recommendations.yaml
The progression from basic HPA demos with polinux/stress to production-grade autoscaling with VPA, KEDA, custom metrics, and ML-powered optimization platforms like StormForge provides a complete path from learning fundamentals to achieving enterprise-scale efficiency.