
Why Horizontal Scaling Didn’t Improve Throughput
Horizontal scaling does not improve throughput when the service instances are not the bottleneck. Adding pods, containers, or servers can lower per-instance CPU while total useful work stays flat because every new instance sends more concurrent work into the same database, lock, queue, cache, rate limit, or downstream API.
That is the frustrating scaling failure: the application tier looks healthier after the rollout, but users do not get more completed requests. Latency gets worse, connection pools fill, retries increase, and the system spends more time waiting on shared resources than doing useful work.
For the broader reliability cluster around overload, admission control, retries, circuit breakers, and cache behavior, see the Backend Reliability hub.
More Instances Reduced CPU, Not The Bottleneck
Imagine a checkout summary endpoint that is scaled from 8 pods to 32 pods before a traffic campaign.
The first dashboard looks encouraging:
| Metric | Before scaling | After scaling |
|---|---|---|
| Application pods | 8 | 32 |
| Average pod CPU | 72% | 28% |
| Average pod memory | 61% | 40% |
| Offered traffic | 650 rps | 1,200 rps |
| Completed useful responses | 610 rps | 660 rps |
| p95 latency | 240 ms | 1,850 ms |
| Database connection wait p95 | 12 ms | 620 ms |
| Payment-settings API throttles | 0 | rising |
| Retry volume | low | high |
The team quadrupled application capacity. Completed throughput rose from 610 rps to 660 rps, about 8%.
The reason is visible only when "throughput" is separated from "goodput." Throughput is the work offered to the system. Goodput is the work completed successfully and quickly enough to be useful to callers. AWS's Builders Library article on using load shedding to avoid overload uses that distinction directly: systems can process or attempt more work while useful completed work plateaus or falls.
In this example, horizontal scaling increased offered load. It did not increase goodput because the application tier was no longer the limiting resource.
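A minimal sketch of that separation, assuming each request is recorded with a status code and a caller-observed latency. The `RequestSample` shape, the `summarizeRates` name, and the 500 ms budget are illustrative assumptions, not part of the example system.

```typescript
// Hypothetical shape for one observed request; field names are illustrative.
interface RequestSample {
  status: number;    // HTTP status code returned to the caller
  latencyMs: number; // end-to-end latency seen by the caller
}

interface RateSummary {
  offeredRps: number;    // every attempt, successful or not
  successfulRps: number; // 2xx responses, regardless of latency
  goodputRps: number;    // 2xx responses inside the latency budget
}

function summarizeRates(
  samples: RequestSample[],
  windowSeconds: number,
  latencyBudgetMs: number,
): RateSummary {
  if (windowSeconds <= 0) throw new Error("windowSeconds must be positive");
  const ok = samples.filter((s) => s.status >= 200 && s.status < 300);
  const useful = ok.filter((s) => s.latencyMs <= latencyBudgetMs);
  return {
    offeredRps: samples.length / windowSeconds,
    successfulRps: ok.length / windowSeconds,
    goodputRps: useful.length / windowSeconds,
  };
}

// Four requests in a one-second window with an assumed 500 ms budget.
const summary = summarizeRates(
  [
    { status: 200, latencyMs: 180 },
    { status: 200, latencyMs: 2100 }, // succeeded, but too slow to be useful
    { status: 503, latencyMs: 30 },
    { status: 200, latencyMs: 220 },
  ],
  1,
  500,
);
console.log(summary); // offered 4, successful 3, goodput 2
```

The slow 200 response counts as successful but not as goodput, which is the gap a dashboard built only on status codes would miss.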
The Handler Looked Stateless
The endpoint itself can look ordinary:
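async function getCheckoutSummary(req: Request) {
  // Shared resource: one database read for the user.
  const user = await db.user.findUnique({
    where: { id: req.userId },
    select: { id: true, accountId: true, region: true },
  })
  if (!user) throw new Error('user not found')

  // Shared resource: a second database read, plus the cart's item relation.
  const cart = await db.cart.findFirst({
    where: { userId: req.userId, status: 'open' },
    include: { items: true },
  })
  if (!cart) throw new Error('open cart not found')

  // Shared resource: the payment-settings service and its quota.
  const paymentSettings = await paymentClient.getSettings({
    accountId: user.accountId,
    region: user.region,
  })

  // Shared resource: the inventory service.
  const inventory = await inventoryClient.reservePreview({
    itemIds: cart.items.map((item) => item.sku),
  })

  return buildSummary({ user, cart, paymentSettings, inventory })
}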
The service stores no local session state. Any pod can handle any request. Scaling it horizontally sounds reasonable.
But each request still uses shared resources:
- two database reads
- one cart relation load
- one payment-settings dependency call
- one inventory dependency call
- one connection from the HTTP server
- one connection from the database pool
- one slice of downstream quota
Adding pods multiplies the number of concurrent callers. It does not multiply the capacity of the database, payment service, inventory service, or shared quota.
Autoscaling Adds Replicas Based On A Signal
Infrastructure can scale the workload correctly and still fail to improve user-visible throughput.
Kubernetes documents horizontal pod autoscaling as a controller that adjusts a workload's desired scale based on observed metrics such as CPU, memory, or custom metrics. The basic formula uses the ratio between current metric value and desired metric value. See Kubernetes Horizontal Pod Autoscaling.
That is useful machinery. It is not a proof that the application tier is the right scaling target.
If CPU is high because request handlers do CPU-heavy work locally, more pods may help. If CPU is low because handlers are waiting on the database, more pods may only create more waiters. If the autoscaler uses average CPU, it may even scale down while dependency latency rises because the pods are blocked on I/O rather than burning CPU.
Autoscaling answers "how many replicas should this workload have for this metric?" It does not answer "which resource is limiting goodput?"
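The documented ratio can be worked through with the numbers from the earlier table. This is only a sketch of the core formula: the real controller also applies a tolerance, replica bounds, and stabilization behavior that are ignored here, and the 50% CPU target is an assumption made for illustration.

```typescript
// Core ratio from the Kubernetes HPA documentation:
//   desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)
// Tolerance, min/max replica bounds, and stabilization windows are ignored.
function desiredReplicas(
  currentReplicas: number,
  currentMetricValue: number,
  desiredMetricValue: number,
): number {
  if (desiredMetricValue <= 0) {
    throw new Error("desiredMetricValue must be positive");
  }
  return Math.ceil(currentReplicas * (currentMetricValue / desiredMetricValue));
}

const targetCpuPercent = 50; // assumed target, not stated in the example

// Before scaling: 8 pods at 72% average CPU asks for more replicas.
console.log(desiredReplicas(8, 72, targetCpuPercent)); // 12

// After scaling: 32 pods at 28% average CPU asks for fewer replicas,
// even though p95 latency and pool wait are far worse than before.
console.log(desiredReplicas(32, 28, targetCpuPercent)); // 18
```

The formula sees only the metric it is given. Nothing in it knows that the pods are idle because they are waiting on a dependency.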
Shared Bottlenecks That Make Scaling Flatline
When horizontal scaling fails, the bottleneck is usually in a shared resource or coordination point.
| Bottleneck | What adding instances changes | What to inspect |
|---|---|---|
| Database connection limit | more pods compete for the same finite connections | pool wait time, active connections, max connections |
| Slow query or missing index | more instances run the same expensive query | query plans, rows scanned, total database time |
| Row or advisory lock contention | more workers wait on the same protected state | lock wait time, blocked transactions, hot keys |
| External API rate limit | more callers spend the same downstream quota faster | 429s, retries, per-client quota, dependency latency |
| Cache misses or invalidations | more instances produce more cold reads and churn | hit ratio by key family, value age, invalidation rate |
| Message broker partition or consumer key | more workers wait behind one ordered partition | partition lag, consumer utilization, key distribution |
| Load balancer or session affinity | some instances receive too much traffic | per-instance rps, connection age, sticky routing behavior |
| Per-tenant hot account | more fleet capacity does not split one tenant's state | tenant-level latency, row locks, queue depth by tenant |
| Logging or telemetry sink | more instances emit more side effects per request | logging latency, dropped spans, exporter queues |
The pattern is the same: the number of application workers increased, but the capacity of the serialized part of the workflow did not.
AWS's load-shedding article connects this to Amdahl's law and the Universal Scalability Law: parallelism can increase throughput only until serialization and contention dominate. That is the heart of the horizontal-scaling trap.
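The shape of that curve can be sketched with the Universal Scalability Law in its usual form. The coefficients below are hypothetical values picked only to show the curve bending; they are not measurements from the checkout example.

```typescript
// Universal Scalability Law:
//   X(N) = lambda * N / (1 + sigma * (N - 1) + kappa * N * (N - 1))
// lambda: throughput of a single instance
// sigma:  contention, the serialized fraction of the work
// kappa:  coherency cost, the crosstalk between instances
// With kappa = 0 the speedup X(N) / X(1) reduces to Amdahl's law.
function uslThroughput(
  instances: number,
  lambda: number,
  sigma: number,
  kappa: number,
): number {
  return (
    (lambda * instances) /
    (1 + sigma * (instances - 1) + kappa * instances * (instances - 1))
  );
}

// Hypothetical coefficients, chosen only to illustrate the shape.
const lambda = 80;
const sigma = 0.1;
const kappa = 0.001;

for (const pods of [1, 8, 16, 32, 64]) {
  const rps = uslThroughput(pods, lambda, sigma, kappa);
  console.log(`${pods} pods -> ${rps.toFixed(0)} rps`);
}
```

With these assumed coefficients, going from 8 to 32 pods yields well under double the throughput, and going from 32 to 64 makes it lower. Fitting real coefficients requires measured throughput at several fleet sizes.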
Connection Pools Can Multiply Pressure
A common failure is increasing every pod's connection pool without calculating fleet-wide concurrency.
Before scaling:
8 pods * 20 database connections = 160 possible database connections
After scaling:
32 pods * 20 database connections = 640 possible database connections
If the database is comfortable at 180 active application connections, the new fleet can overload it even though each pod's configuration did not change.
Increasing the pool size can make this worse:
32 pods * 40 database connections = 1,280 possible database connections
That does not create database capacity. It creates more concurrent work waiting for the same CPU, I/O, memory, locks, and query planner choices.
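The arithmetic above is easy to turn around: start from the fleet-wide budget and derive the per-pod pool size. A small sketch, where the function names are illustrative and the 180-connection figure is the comfortable level assumed in this example:

```typescript
// Total connections the fleet can open if every pod fills its pool.
function fleetConnections(pods: number, poolSizePerPod: number): number {
  return pods * poolSizePerPod;
}

// Largest per-pod pool that keeps the whole fleet inside the budget.
function maxPoolSizePerPod(fleetBudget: number, pods: number): number {
  if (pods <= 0) throw new Error("pods must be positive");
  return Math.floor(fleetBudget / pods);
}

const comfortableConnections = 180; // figure assumed in the example

console.log(fleetConnections(8, 20)); // 160
console.log(fleetConnections(32, 20)); // 640
console.log(fleetConnections(32, 40)); // 1280

// Pool size per pod that respects the budget at each fleet size.
console.log(maxPoolSizePerPod(comfortableConnections, 8)); // 22
console.log(maxPoolSizePerPod(comfortableConnections, 32)); // 5
```

A pool of 5 per pod will show up as pool wait inside the application, but that wait is visible and bounded, unlike the same pressure pushed onto the database.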
A safer scaling review asks:
| Question | Why it matters |
|---|---|
| What is the total fleet-wide connection budget? | per-pod settings multiply when replicas increase |
| What is the database's comfortable active workload? | max connections is not the same as healthy throughput |
| Which endpoints use the most connections? | hot routes can dominate the shared pool |
| What happens when pool wait time rises? | queued requests may keep HTTP connections open |
| Do retries acquire new connections? | retries can multiply pressure during partial failure |
If pool wait time rises while application CPU falls, adding more application instances is probably not the fix.
Locks Make More Workers Wait Together
Horizontal scaling can also flatten throughput through lock contention.
Consider an inventory reservation path:
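-- $1 = quantity to reserve, $2 = SKU being reserved
UPDATE inventory
SET available = available - $1
WHERE sku = $2
  -- Guard: only decrement when enough stock remains.
  AND available >= $1;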
The guarded update is correct. It protects the invariant that inventory cannot go negative.
But if a campaign sends many requests for the same SKU, more pods do not create more independent work. They create more concurrent transactions trying to update the same row or nearby index pages.
The symptoms look like this:
- database CPU is not the first limit
- p95 lock wait rises
- transaction duration rises
- connection pool wait rises after lock waits
- application instances sit idle or blocked
- retries make the hot row even hotter
This is not a reason to remove the guard. Correctness comes first. It is a reason to recognize that the bottleneck is a serialized business invariant, not application compute.
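A rough upper bound makes the ceiling concrete. The sketch below assumes an idealized model in which each request spends some time on work that runs in parallel and some time holding the lock on one hot row; it ignores retries, lock-manager overhead, and connection limits, and the timings are invented for illustration.

```typescript
// Idealized bound for requests that all update the same row.
// parallelMs: time per request that does not hold the row lock
// lockHoldMs: time per request spent holding the row lock (serialized)
function hotRowThroughputBound(
  workers: number,
  parallelMs: number,
  lockHoldMs: number,
): number {
  if (lockHoldMs <= 0) throw new Error("lockHoldMs must be positive");
  const perWorkerRps = 1000 / (parallelMs + lockHoldMs);
  const serializedCapRps = 1000 / lockHoldMs; // one lock holder at a time
  return Math.min(workers * perWorkerRps, serializedCapRps);
}

// Assumed timings: 20 ms of parallel work, 5 ms holding the lock.
for (const workers of [4, 8, 32]) {
  console.log(workers, hotRowThroughputBound(workers, 20, 5));
}
// 4 workers  -> 160 rps
// 8 workers  -> 200 rps (the serialized cap)
// 32 workers -> 200 rps (same cap, more waiters)
```

Under this model the only levers that raise the cap are a shorter lock hold time or spreading the invariant across more than one row.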
The adjacent concurrency decisions are covered in Optimistic vs Pessimistic Locking in SQL. If the broader issue is overlapping requests changing shared state, see How to Prevent Race Conditions in Backend Systems.
Queues Can Hide The Plateau
Some systems appear to absorb the extra traffic because they enqueue work.
That can make the initial scaling attempt look successful:
HTTP 202 Accepted
job enqueued
worker will process later
But if the worker bottleneck is still the same database, external API, partition key, or lock, the queue only moves the plateau downstream.
Watch these signals:
| Signal | Meaning |
|---|---|
| queue age rising while workers grow | workers are being added but are not draining work fast enough |
| retries per job rising | workers are colliding with the same bottleneck |
| dead-letter volume rising | queued work is not merely delayed, it is failing |
| one partition much hotter than rest | ordering or key distribution limits parallelism |
| worker CPU low, dependency wait high | workers are blocked, not under-provisioned |
For the operational side of delayed work, see Background Jobs in Production. If workers claim jobs from PostgreSQL, PostgreSQL Job Queues with SKIP LOCKED covers the row-claiming boundary.
Load Balancing Can Hide Uneven Capacity
Even when dependencies are healthy, scaling may fail because traffic is not actually balanced.
Common causes:
- long-lived HTTP connections
- sticky sessions
- tenant-level routing
- uneven expensive requests
- uneven cache warmth
- slow instance warmup
- readiness checks that pass before local caches or JIT warmup finish
- one availability zone receiving more expensive traffic
Fleet averages hide this.
Look at per-instance metrics:
| Metric | Useful question |
|---|---|
| requests per instance | Are a few instances handling most traffic? |
| p95 latency per instance | Are slow instances hidden by the average? |
| dependency calls per route | Are expensive routes unevenly distributed? |
| cache hit ratio per pod | Are new pods cold while old pods are warm? |
| errors by instance | Are retries concentrating on a subset of the fleet? |
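One way to surface what the fleet average hides is to compare each instance against the mean. A minimal sketch, assuming per-instance request rates are already available as a plain map; the `findHotInstances` name, the pod names, and the 1.5x threshold are illustrative assumptions.

```typescript
interface ImbalanceReport {
  meanRps: number;
  maxRps: number;
  maxToMeanRatio: number; // 1.0 means perfectly even traffic
  hotInstances: string[]; // instances above hotFactor * mean
}

function findHotInstances(
  rpsByInstance: Record<string, number>,
  hotFactor = 1.5,
): ImbalanceReport {
  const entries = Object.entries(rpsByInstance);
  if (entries.length === 0) throw new Error("no instances provided");
  const total = entries.reduce((sum, [, rps]) => sum + rps, 0);
  const meanRps = total / entries.length;
  const maxRps = Math.max(...entries.map(([, rps]) => rps));
  return {
    meanRps,
    maxRps,
    maxToMeanRatio: meanRps === 0 ? 0 : maxRps / meanRps,
    hotInstances: entries
      .filter(([, rps]) => rps > hotFactor * meanRps)
      .map(([name]) => name),
  };
}

// Hypothetical snapshot: the average is 200 rps, but one pod carries 500.
const report = findHotInstances({
  "pod-a": 100,
  "pod-b": 100,
  "pod-c": 100,
  "pod-d": 500,
});
console.log(report);
```

The same comparison works for p95 latency or cache hit ratio per pod; the point is to look at the spread, not the average.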
Google's SRE book chapter on Handling Overload discusses overload as something load balancing tries to avoid, but also notes that eventually some part of a system can become overloaded and needs graceful handling. That is the practical lesson: load balancing helps distribute work, but it does not remove the need to know the real capacity of each instance and each dependency.
Measure Goodput, Not Only More Traffic
A scaling test should show useful completed work.
Measure at least these:
| Metric | Why it matters |
|---|---|
| offered rps | how much work clients attempted |
| successful rps | how much work completed without errors |
| goodput | successful work completed within the latency budget |
| p95 and p99 latency | whether tail latency got worse |
| retry rate | whether clients multiplied load |
| database pool wait | whether app replicas are queued on shared connections |
| dependency latency and throttles | whether a downstream service became the bottleneck |
| lock wait | whether coordination, not compute, is limiting work |
| queue age | whether async work is falling behind |
| per-instance traffic | whether the load balancer is actually spreading work |
A test result like this is a warning:
offered_rps: 1200
http_2xx_rps: 900
goodput_rps: 640
p95_latency_ms: 1850
retry_rps: 260
db_pool_wait_p95: 620
payment_429_rps: 80
The service handled more attempts, but goodput plateaued. The extra work became queueing, retries, and throttling.
Google's SRE chapter on Addressing Cascading Failures emphasizes overload testing and understanding capacity limits before overload becomes a cascading failure. That advice applies directly to horizontal scaling: test until the system stops improving, then identify which resource bends first.
What To Try Instead Of More Instances
Once the bottleneck is visible, the next move depends on what limited goodput.
| Evidence | Better next move |
|---|---|
| database pool wait rises | reduce per-request queries, cap fleet-wide connections |
| one query dominates database time | inspect query plan, indexes, and result size |
| repeated relation lookups | fix N+1 or batch relation reads |
| lock wait rises on hot rows | redesign contention point, use queues, shard by key, or change invariant boundary |
| external API throttles rise | add client-side rate limits, cache safe reads, or negotiate quota |
| queue age rises | measure worker bottleneck and partition distribution |
| retry rate rises | add retry budgets and backoff, reduce wasted work |
| cache misses surge after scale-out | warm cache deliberately or reduce cache dependency |
| per-instance traffic is uneven | inspect sticky routing, connection lifetime, and readiness |
| CPU is truly saturated per instance | then horizontal or vertical compute scaling may help |
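As one illustration of the retry-budget row, the sketch below lets retries spend credit that only first attempts can earn, so retry volume stays a bounded fraction of real traffic. The class name, method names, and parameters are hypothetical, and this is one common shape for the idea rather than a reference implementation.

```typescript
// Retries spend credit earned by first attempts. Integer credits avoid
// floating-point drift when many small deposits are added together.
class RetryBudget {
  private credits = 0;
  private readonly attemptsPerRetry: number;
  private readonly maxCredits: number;

  // attemptsPerRetry: first attempts needed to earn one retry (e.g. 10)
  // maxBankedRetries: cap on retries that can be saved up during quiet periods
  constructor(attemptsPerRetry: number, maxBankedRetries: number) {
    if (attemptsPerRetry < 1 || maxBankedRetries < 1) {
      throw new Error("both parameters must be at least 1");
    }
    this.attemptsPerRetry = attemptsPerRetry;
    this.maxCredits = attemptsPerRetry * maxBankedRetries;
  }

  // Call once for every first attempt (not for retries).
  recordFirstAttempt(): void {
    this.credits = Math.min(this.maxCredits, this.credits + 1);
  }

  // Returns true if a retry may be sent; false means give up or fail fast.
  tryAcquireRetry(): boolean {
    if (this.credits < this.attemptsPerRetry) return false;
    this.credits -= this.attemptsPerRetry;
    return true;
  }
}

const budget = new RetryBudget(10, 5);
for (let i = 0; i < 10; i++) budget.recordFirstAttempt();
console.log(budget.tryAcquireRetry()); // true: ten attempts earned one retry
console.log(budget.tryAcquireRetry()); // false: the budget is spent
```

When a dependency starts failing broadly, the budget runs dry quickly and the remaining failures surface immediately instead of multiplying load on the resource that is already the bottleneck.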
If the bottleneck is a slow SQL query, start with How to Find and Fix Slow SQL Queries in Production. If the issue is ORM query explosion, use N+1 Query Problem in ORMs. If the proposed fix is "add read replicas," read Why Read Replicas Didn't Reduce Database Load before assuming the primary will cool down.
A Practical Scaling Review Checklist
Before scaling a production service horizontally, answer these questions:
- What is the current goodput, not only offered request rate?
- Which resource saturates first in a load test?
- Does per-instance CPU reflect useful work or waiting?
- What is the fleet-wide database connection budget after scaling?
- Which dependency quotas are shared by all instances?
- Which locks, tenants, partition keys, or rows serialize work?
- What happens to queue age when request volume doubles?
- Does retry volume rise during the test?
- Are requests evenly distributed across instances and zones?
- Do new instances have cold caches or expensive warmup paths?
- Can the service shed or defer excess work before dependencies collapse?
- Which metric would prove that scaling improved user-visible capacity?
Then run a test that intentionally crosses the expected limit. The first plateau is the most valuable part of the test because it tells you where the real constraint lives.
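Finding that plateau in test output can be mechanical. The sketch below assumes the load test produces one record per step with offered and goodput rates, and flags the first step where extra offered load stops turning into goodput. The `LoadStep` shape, the 0.5 threshold, and the intermediate step values are illustrative assumptions.

```typescript
interface LoadStep {
  offeredRps: number;
  goodputRps: number;
}

// Returns the first step whose marginal goodput per unit of extra offered
// load falls below minMarginalGain, or undefined if goodput keeps scaling.
function findFirstPlateau(
  steps: LoadStep[],
  minMarginalGain = 0.5,
): LoadStep | undefined {
  for (let i = 1; i < steps.length; i++) {
    const extraOffered = steps[i].offeredRps - steps[i - 1].offeredRps;
    if (extraOffered <= 0) continue; // steps are expected to increase load
    const extraGoodput = steps[i].goodputRps - steps[i - 1].goodputRps;
    if (extraGoodput / extraOffered < minMarginalGain) return steps[i];
  }
  return undefined;
}

// Illustrative ramp; only some of these values appear in the example tables.
const plateau = findFirstPlateau([
  { offeredRps: 200, goodputRps: 198 },
  { offeredRps: 400, goodputRps: 395 },
  { offeredRps: 650, goodputRps: 610 },
  { offeredRps: 900, goodputRps: 650 },
  { offeredRps: 1200, goodputRps: 660 },
]);
console.log(plateau); // the 900 rps step: 250 extra rps bought 40 rps of goodput
```

The step it returns is where to line up pool wait, lock wait, throttles, and queue age to see which one bent at the same moment.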
If the system starts timing out and retrying instead of rejecting work early, the next reliability topic is admission control. Rate Limiting and Backpressure in Microservices covers how to keep overloaded services alive by limiting admitted work.
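The simplest form of rejecting work early is a cap on in-flight requests. This is a minimal sketch under the assumption that a single per-process limit is acceptable; the class, the helper, and the limit of 2 are hypothetical, and a production limiter would usually be tuned per dependency.

```typescript
// Caps concurrent work in one process. Excess requests are rejected
// immediately instead of queueing behind a saturated dependency.
class InFlightLimiter {
  private inFlight = 0;
  private readonly maxInFlight: number;

  constructor(maxInFlight: number) {
    if (maxInFlight < 1) throw new Error("maxInFlight must be at least 1");
    this.maxInFlight = maxInFlight;
  }

  tryAcquire(): boolean {
    if (this.inFlight >= this.maxInFlight) return false;
    this.inFlight += 1;
    return true;
  }

  release(): void {
    if (this.inFlight > 0) this.inFlight -= 1;
  }
}

// Runs work under the limiter, or returns the caller's rejection value
// (for example a 429 or 503 response) without touching any dependency.
async function runOrReject<T>(
  limiter: InFlightLimiter,
  work: () => Promise<T>,
  rejected: () => T,
): Promise<T> {
  if (!limiter.tryAcquire()) return rejected();
  try {
    return await work();
  } finally {
    limiter.release();
  }
}

const limiter = new InFlightLimiter(2);
console.log(limiter.tryAcquire()); // true
console.log(limiter.tryAcquire()); // true
console.log(limiter.tryAcquire()); // false: at the limit, shed this request
limiter.release();
console.log(limiter.tryAcquire()); // true: capacity freed by the release
```

A rejected request costs almost nothing, while an admitted request that later times out has already consumed a connection, a lock wait, and possibly a retry.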
The Short Version
Horizontal scaling improves throughput only when the work is parallelizable and the bottleneck lives in the instances you add.
If every request waits on the same database, row lock, external API, queue partition, cache, or downstream quota, more instances mostly create more concurrency against the same limit. Application CPU can fall while user-visible throughput stays flat.
The useful question is not "Can we add more pods?"
The useful question is "Which resource limits goodput?"
Once that is clear, scaling becomes a targeted change: add compute when compute is limiting, tune queries when the database is limiting, control admission when overload is limiting, reduce contention when shared state is limiting, and measure success by completed useful work rather than by the number of instances running.