Tuning an nginx ingress under burst load

2026-01-18 · ~4 min read

A product launch sent us roughly 8× normal traffic for about ninety minutes. The application pods scaled fine. ingress-nginx did not. We dropped about 0.6% of requests at layer 7 before we caught up. These are the four knobs that mattered.

1. worker connections

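The default max-worker-connections is 16384 per worker. That sounds generous until you realize that upstream HTTP/1.1 keepalive connections count against the same limit as client connections, and that ingress-nginx defaults to one worker per CPU. On 4-core nodes (4 workers × 16384 = 65536 slots per controller pod) we ran out of connection slots well before saturating CPU. Raising the limit to 65536 per worker and overriding the default open-file limit (max-worker-open-files in the ConfigMap, which sets nginx's worker_rlimit_nofile) fixed it.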
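
A minimal sketch of that change in the controller ConfigMap. The ConfigMap name and namespace are assumptions (they match a common default install, but yours may differ), and the max-worker-open-files value is purely illustrative, not the number we ran with.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller   # assumed name; depends on how the controller was installed
  namespace: ingress-nginx         # assumed namespace
data:
  # ConfigMap values must be strings, so numbers are quoted.
  max-worker-connections: "65536"
  # Rendered as nginx's worker_rlimit_nofile. Illustrative value only.
  # Every connection holds a file descriptor, so keep this above
  # max-worker-connections and within the container's ulimit.
  max-worker-open-files: "131072"
```

After the controller reloads, nginx -T inside a controller pod should show the new worker_connections and worker_rlimit_nofile values.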

2. upstream keepalive

By default, the upstream keepalive pool is small. Under burst load it cannot hold the idle connections a burst needs, so nginx ends up opening new TCP connections to the backends for every burst of requests, and those handshakes pile up. Setting upstream-keepalive-connections: 1024 and upstream-keepalive-requests: 10000 in the controller ConfigMap halved P99 latency.
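
The two settings above as a ConfigMap fragment. As a sketch only: the ConfigMap name and namespace are assumptions that depend on your install.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller   # assumed name
  namespace: ingress-nginx         # assumed namespace
data:
  # Maximum idle keepalive connections to upstreams cached per worker process.
  upstream-keepalive-connections: "1024"
  # Requests served over one upstream connection before nginx closes it.
  upstream-keepalive-requests: "10000"
```

Note that the pool size applies per worker process, so the idle connections each backend may see scale with worker count and controller replicas.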

3. proxy-body-size and buffer sizes

Not strictly about throughput, but the default proxy-body-size of 1m caused a small percentage of POST requests with attachments to fail with HTTP 413 during the spike. We raised it to 25m, which is more than enough for our endpoints and well within the cluster's memory headroom. The same logic applied to proxy-buffer-size: we doubled it to handle larger response headers from one backend that liked to set many cookies.
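
Roughly what that looks like in the ConfigMap. The name and namespace are assumptions, and the 8k buffer value assumes the starting point was the documented 4k default; check your own rendered config before copying it.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller   # assumed name
  namespace: ingress-nginx         # assumed namespace
data:
  # Largest request body nginx accepts before answering 413.
  proxy-body-size: "25m"
  # Buffer for the first part of the upstream response (the headers).
  # "8k" assumes a 4k default that was doubled; verify with nginx -T.
  proxy-buffer-size: "8k"
```

If only a few endpoints accept attachments, the same limit can be set per Ingress with the nginx.ingress.kubernetes.io/proxy-body-size annotation instead of globally.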

4. HPA targets on the ingress controller itself

The biggest miss in hindsight. The application's HPA was tuned. The ingress-nginx HPA was at the default targetCPUUtilizationPercentage: 80, which on our infrastructure meant scaling decisions every ~3 minutes, too slow for a sub-2-minute traffic ramp. We changed to a custom metric (active connections per pod) with a target of half of saturation, and the controller scaled out before we noticed the spike.
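
A sketch of an HPA driven by a per-pod connection metric, using the autoscaling/v2 API. Everything specific here is an assumption: the object names, the replica bounds, the metric name (hypothetical; it has to match whatever your custom metrics adapter exposes), and the target value, which stands in for "half of the connection count at which one pod saturates".

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ingress-nginx-controller   # assumed name
  namespace: ingress-nginx         # assumed namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ingress-nginx-controller # assumed Deployment name
  minReplicas: 3                   # placeholder bounds
  maxReplicas: 12
  metrics:
    - type: Pods
      pods:
        metric:
          # Hypothetical metric name; it must be served by the custom
          # metrics API (for example through a Prometheus adapter).
          name: nginx_active_connections
        target:
          type: AverageValue
          # Placeholder: about half of one pod's measured saturation point.
          averageValue: "20000"
```

Pods-type metrics only work once a custom metrics adapter is installed in the cluster; without one the HPA reports the metric as unavailable and does not scale on it.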

what didn't matter

We spent an embarrassing amount of time looking at gzip configuration and TCP buffer sizes before realizing they weren't the bottleneck. Two lessons:

Profile before tuning. nginx -T and the controller's /metrics endpoint together tell you most of what you need.
The default config is a reasonable starting point, but it's optimized for steady, moderate traffic, not bursts.