Why Horizontal Scaling Didn’t Improve Throughput

Horizontal scaling does not improve throughput when the service instances are not the bottleneck. Adding pods, containers, or servers can lower per-instance CPU while total useful work stays flat because every new instance sends more concurrent work into the same database, lock, queue, cache, rate limit, or downstream API.

That is the frustrating scaling failure: the application tier looks healthier after the rollout, but users do not get more completed requests. Latency gets worse, connection pools fill, retries increase, and the system spends more time waiting on shared resources than doing useful work.

For the broader reliability cluster around overload, admission control, retries, circuit breakers, and cache behavior, see the Backend Reliability hub.

More Instances Reduced CPU, Not The Bottleneck

Imagine a checkout summary endpoint that is scaled from 8 pods to 32 pods before a traffic campaign.

The first dashboard looks encouraging:

Metric	Before scaling	After scaling
Application pods	8	32
Average pod CPU	72%	28%
Average pod memory	61%	40%
Offered traffic	650 rps	1,200 rps
Completed useful responses	610 rps	660 rps
p95 latency	240 ms	1,850 ms
Database connection wait p95	12 ms	620 ms
Payment-settings API throttles	0	rising
Retry volume	low	high

The team quadrupled application capacity. Completed useful responses barely moved: 610 rps before, 660 rps after.

The reason is visible only when "throughput" is separated from "goodput." Throughput is the work offered to the system. Goodput is the work completed successfully and quickly enough to be useful to callers. AWS's Builders Library article on using load shedding to avoid overload uses that distinction directly: systems can process or attempt more work while useful completed work plateaus or falls.

In this example, horizontal scaling increased offered load. It did not increase goodput because the application tier was no longer the limiting resource.
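The distinction can be made concrete with a small calculation. The following is a minimal sketch, assuming request outcomes are available as simple records from a load test or access log; the `RequestOutcome` shape, the `summarizeLoad` name, and the latency budget are illustrative assumptions, not part of any particular tool.

```typescript
// Hypothetical record of one observed request (shape is an assumption).
interface RequestOutcome {
  ok: boolean        // true when the response was successful
  latencyMs: number  // end-to-end latency seen by the caller
}

interface LoadSummary {
  offeredRps: number     // every attempt, successful or not
  successfulRps: number  // attempts that completed without error
  goodputRps: number     // successes that also met the latency budget
}

function summarizeLoad(
  outcomes: RequestOutcome[],
  windowSeconds: number,
  latencyBudgetMs: number,
): LoadSummary {
  if (windowSeconds <= 0) throw new Error('windowSeconds must be positive')
  const successful = outcomes.filter((o) => o.ok)
  // A slow success is counted as successful but not as goodput.
  const useful = successful.filter((o) => o.latencyMs <= latencyBudgetMs)
  return {
    offeredRps: outcomes.length / windowSeconds,
    successfulRps: successful.length / windowSeconds,
    goodputRps: useful.length / windowSeconds,
  }
}
```

Tracking all three numbers side by side is what exposes the gap: offered load can climb while the goodput figure stays where it was.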

The Handler Looked Stateless

The endpoint itself can look ordinary:

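async function getCheckoutSummary(req: Request) {
  // Database read 1: the user's account and region.
  const user = await db.user.findUnique({
    where: { id: req.userId },
    select: { id: true, accountId: true, region: true },
  })
  if (!user) throw new Error('user not found')

  // Database read 2: the open cart, plus a relation load for its items.
  const cart = await db.cart.findFirst({
    where: { userId: req.userId, status: 'open' },
    include: { items: true },
  })
  if (!cart) throw new Error('open cart not found')

  // Downstream call 1: spends shared payment-settings quota.
  const paymentSettings = await paymentClient.getSettings({
    accountId: user.accountId,
    region: user.region,
  })

  // Downstream call 2: hits the shared inventory service.
  const inventory = await inventoryClient.reservePreview({
    itemIds: cart.items.map((item) => item.sku),
  })

  return buildSummary({ user, cart, paymentSettings, inventory })
}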

The service stores no local session state. Any pod can handle any request. Scaling it horizontally sounds reasonable.

But each request still uses shared resources:

two database reads
one cart relation load
one payment-settings dependency call
one inventory dependency call
one inbound HTTP connection held open while the handler waits
one connection from the database pool
one slice of downstream quota

Adding pods multiplies the number of concurrent callers. It does not multiply the capacity of the database, payment service, inventory service, or shared quota.

Autoscaling Adds Replicas Based On A Signal

Infrastructure can scale the workload correctly and still fail to improve user-visible throughput.

Kubernetes documents horizontal pod autoscaling as a controller that adjusts a workload's desired scale based on observed metrics such as CPU, memory, or custom metrics. The basic formula uses the ratio between current metric value and desired metric value. See Kubernetes Horizontal Pod Autoscaling.

That is useful machinery. It is not a proof that the application tier is the right scaling target.

If CPU is high because request handlers do CPU-heavy work locally, more pods may help. If CPU is low because handlers are waiting on the database, more pods may only create more waiters. If the autoscaler uses average CPU, it may even scale down while dependency latency rises because the pods are blocked on I/O rather than burning CPU.

Autoscaling answers "how many replicas should this workload have for this metric?" It does not answer "which resource is limiting goodput?"
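The ratio formula is small enough to sketch directly. This is a simplified illustration of the documented calculation, `ceil(currentReplicas * currentMetric / desiredMetric)`, with the default 10% tolerance band; it leaves out min/max replica bounds, stabilization windows, and the other controller behavior. The function name and the 70% CPU target used below are assumptions for illustration.

```typescript
// Simplified sketch of the HPA ratio calculation (not the full controller).
function desiredReplicas(
  currentReplicas: number,
  currentMetric: number, // e.g. observed average CPU utilization, in percent
  targetMetric: number,  // e.g. target average CPU utilization, in percent
  tolerance = 0.1,       // no change when the ratio is within 10% of 1.0
): number {
  const ratio = currentMetric / targetMetric
  if (Math.abs(ratio - 1) <= tolerance) return currentReplicas
  return Math.ceil(currentReplicas * ratio)
}
```

Plugging in the earlier dashboard with an assumed 70% CPU target, `desiredReplicas(32, 28, 70)` yields 13: the CPU signal alone would recommend removing pods while p95 latency sits at 1,850 ms. The formula is doing its job. The signal is simply not measuring the limiting resource.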

Shared Bottlenecks That Make Scaling Flatline

When horizontal scaling fails, the bottleneck is usually in a shared resource or coordination point.

Bottleneck	What adding instances changes	What to inspect
Database connection limit	more pods compete for the same finite connections	pool wait time, active connections, max connections
Slow query or missing index	more instances run the same expensive query	query plans, rows scanned, total database time
Row or advisory lock contention	more workers wait on the same protected state	lock wait time, blocked transactions, hot keys
External API rate limit	more callers spend the same downstream quota faster	429s, retries, per-client quota, dependency latency
Cache misses or invalidations	more instances produce more cold reads and churn	hit ratio by key family, value age, invalidation rate
Message broker partition or consumer key	more workers wait behind one ordered partition	partition lag, consumer utilization, key distribution
Load balancer or session affinity	some instances receive too much traffic	per-instance rps, connection age, sticky routing behavior
Per-tenant hot account	more fleet capacity does not split one tenant's state	tenant-level latency, row locks, queue depth by tenant
Logging or telemetry sink	more instances emit more side effects per request	logging latency, dropped spans, exporter queues

The pattern is the same: the number of application workers increased, but the serialized part of the workflow gained no capacity.

AWS's load-shedding article connects this to Amdahl's law and the Universal Scalability Law: parallelism can increase throughput only until serialization and contention dominate. That is the heart of the horizontal-scaling trap.
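For intuition, the Universal Scalability Law can be written as a one-line function. This is a sketch of the standard formula, relative capacity = N / (1 + α(N − 1) + βN(N − 1)), where α models contention and β models coordination cost. Setting β to 0 reduces it to Amdahl's law. The coefficient values in the usage note are made up for illustration; real values have to be fitted from load-test measurements.

```typescript
// Relative capacity of n workers compared with one worker, per the USL.
// contention (alpha): fraction of work that is serialized.
// coherency (beta): cost of workers coordinating with each other.
function relativeCapacity(n: number, contention: number, coherency: number): number {
  return n / (1 + contention * (n - 1) + coherency * n * (n - 1))
}
```

With assumed coefficients of 0.1 and 0.002, capacity peaks at roughly 21 workers (about 5.5x one worker) and then declines: 32 workers deliver slightly less than 21, and 64 deliver less than 8. Past the peak, every added instance costs more in coordination than it contributes in work.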

Connection Pools Can Multiply Pressure

A common failure is increasing every pod's connection pool without calculating fleet-wide concurrency.

Before scaling:

8 pods * 20 database connections = 160 possible database connections

After scaling:

32 pods * 20 database connections = 640 possible database connections

If the database is comfortable at 180 active application connections, the new fleet can overload it even though each pod's configuration did not change.

Increasing the pool size can make this worse:

32 pods * 40 database connections = 1,280 possible database connections

That does not create database capacity. It creates more concurrent work waiting for the same CPU, I/O, memory, locks, and query planner choices.

A safer scaling review asks:

Question	Why it matters
What is the total fleet-wide connection budget?	per-pod settings multiply when replicas increase
What is the database's comfortable active workload?	max connections is not the same as healthy throughput
Which endpoints use the most connections?	hot routes can dominate the shared pool
What happens when pool wait time rises?	queued requests may keep HTTP connections open
Do retries acquire new connections?	retries can multiply pressure during partial failure

If pool wait time rises while application CPU falls, adding more application instances is probably not the fix.
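The fleet-wide arithmetic above is worth encoding as a check in a scaling review. A minimal sketch, assuming the database's comfortable connection budget is known from load testing; both function names are illustrative.

```typescript
// Worst-case connections the fleet can open against the database.
function fleetConnections(pods: number, poolSizePerPod: number): number {
  return pods * poolSizePerPod
}

// Largest per-pod pool that keeps the whole fleet inside a fixed budget.
function maxPoolSizePerPod(databaseBudget: number, pods: number): number {
  if (pods <= 0) throw new Error('pods must be positive')
  return Math.floor(databaseBudget / pods)
}
```

With the 180-connection comfort level from the example, `maxPoolSizePerPod(180, 32)` is 5, down from 22 at 8 pods. A pool that small may just move the waiting into the application, which is the point: the budget is set by the database, not by the replica count.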

Locks Make More Workers Wait Together

Horizontal scaling can also flatten throughput through lock contention.

Consider an inventory reservation path:

UPDATE inventory
SET available = available - $1
WHERE sku = $2
  AND available >= $1;

The guarded update is correct. It protects the invariant that inventory cannot go negative.

But if a campaign sends many requests for the same SKU, more pods do not create more independent work. They create more concurrent transactions trying to update the same row or nearby index pages.

The symptoms look like this:

database CPU is not the first limit
p95 lock wait rises
transaction duration rises
connection pool wait rises after lock waits
application instances sit idle or blocked
retries make the hot row even hotter

This is not a reason to remove the guard. Correctness comes first. It is a reason to recognize that the bottleneck is a serialized business invariant, not application compute.
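A back-of-the-envelope model shows why more pods cannot help a hot row. The sketch below assumes an idealized case: updates to one SKU are fully serialized, and each transaction holds the row lock for a fixed time. The function names, the 5 ms hold time, and the 50 rps per pod in the usage note are illustrative assumptions, not measurements.

```typescript
// Upper bound on updates per second for a single row when each
// transaction holds the row lock for lockHoldMs (idealized model).
function hotRowCeilingRps(lockHoldMs: number): number {
  if (lockHoldMs <= 0) throw new Error('lockHoldMs must be positive')
  return 1000 / lockHoldMs
}

// Achievable rate for one hot SKU: whichever is smaller, what the pods
// can send or what the serialized row can absorb.
function achievableHotSkuRps(pods: number, perPodRps: number, lockHoldMs: number): number {
  return Math.min(pods * perPodRps, hotRowCeilingRps(lockHoldMs))
}
```

Under these assumptions, 2 pods reach 100 rps, 8 pods reach the 200 rps ceiling, and 32 pods still reach 200 rps. In this model the only lever that raises the ceiling is a shorter lock hold time or a different contention boundary.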

The adjacent concurrency decisions are covered in Optimistic vs Pessimistic Locking in SQL. If the broader issue is overlapping requests changing shared state, see How to Prevent Race Conditions in Backend Systems.

Queues Can Hide The Plateau

Some systems appear to absorb the extra traffic because they enqueue work.

That can make the initial scaling attempt look successful:

HTTP 202 Accepted
job enqueued
worker will process later

But if the worker bottleneck is still the same database, external API, partition key, or lock, the queue only moves the plateau downstream.

Watch these signals:

Signal	Meaning
queue age rising while workers grow	workers are added but are not draining enough work
retries per job rising	workers are colliding with the same bottleneck
dead-letter volume rising	queued work is not merely delayed, it is failing
one partition much hotter than rest	ordering or key distribution limits parallelism
worker CPU low, dependency wait high	workers are blocked, not under-provisioned
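The first signal in the table follows from simple rate arithmetic: a queue grows whenever work is enqueued faster than it drains, no matter how many workers exist. A rough sketch, assuming steady enqueue and drain rates; the function names are illustrative.

```typescript
// Projected queue depth after `seconds` at steady rates (never below zero).
function projectBacklog(
  initialDepth: number,
  enqueueRps: number,
  drainRps: number,
  seconds: number,
): number {
  return Math.max(0, initialDepth + (enqueueRps - drainRps) * seconds)
}

// Time to clear an existing backlog; Infinity when drain cannot keep up.
function secondsToDrain(depth: number, enqueueRps: number, drainRps: number): number {
  const netDrain = drainRps - enqueueRps
  return netDrain > 0 ? depth / netDrain : Infinity
}
```

If jobs arrive at 1,200 per second and workers complete 660 per second because they share the same bottleneck, the backlog reaches 32,400 jobs after one minute. Adding workers changes that only if it raises the drain rate.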

For the operational side of delayed work, see Background Jobs in Production. If workers claim jobs from PostgreSQL, PostgreSQL Job Queues with SKIP LOCKED covers the row-claiming boundary.

Load Balancing Can Hide Uneven Capacity

Even when dependencies are healthy, scaling may fail because traffic is not actually balanced.

Common causes:

long-lived HTTP connections
sticky sessions
tenant-level routing
uneven expensive requests
uneven cache warmth
slow instance warmup
readiness checks that pass before local caches or JIT warmup finish
one availability zone receiving more expensive traffic

Fleet averages hide this.

Look at per-instance metrics:

Metric	Useful question
requests per instance	Are a few instances handling most traffic?
p95 latency per instance	Are slow instances hidden by the average?
dependency calls per route	Are expensive routes unevenly distributed?
cache hit ratio per pod	Are new pods cold while old pods are warm?
errors by instance	Are retries concentrating on a subset of the fleet?
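One way to keep fleet averages from hiding skew is to compute an imbalance ratio from the per-instance request rates. This is a minimal sketch; the function name and the choice of "busiest instance divided by mean" as the metric are assumptions, and other skew measures would work as well.

```typescript
// Ratio of the busiest instance's request rate to the fleet mean.
// 1.0 means perfectly even; higher values mean a few instances carry the load.
function imbalanceRatio(rpsByInstance: number[]): number {
  if (rpsByInstance.length === 0) throw new Error('no instances')
  const total = rpsByInstance.reduce((sum, rps) => sum + rps, 0)
  const mean = total / rpsByInstance.length
  if (mean === 0) return 1 // an idle fleet is trivially balanced
  return Math.max(...rpsByInstance) / mean
}
```

A fleet reporting `[370, 10, 10, 10]` has the same average as `[100, 100, 100, 100]`, but its ratio is 3.7 instead of 1: one instance is saturated while three sit nearly idle.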

Google's SRE book chapter on Handling Overload discusses overload as something load balancing tries to avoid, but also notes that eventually some part of a system can become overloaded and needs graceful handling. That is the practical lesson: load balancing helps distribute work, but it does not remove the need to know each instance and dependency's real capacity.

Measure Goodput, Not Only More Traffic

A scaling test should show useful completed work.

Measure at least these:

Metric	Why it matters
offered rps	how much work clients attempted
successful rps	how much work completed without errors
goodput	successful work completed within the latency budget
p95 and p99 latency	whether tail latency got worse
retry rate	whether clients multiplied load
database pool wait	whether app replicas are queued on shared connections
dependency latency and throttles	whether a downstream service became the bottleneck
lock wait	whether coordination, not compute, is limiting work
queue age	whether async work is falling behind
per-instance traffic	whether the load balancer is actually spreading work

A test result like this is a warning:

offered_rps:           1200
http_2xx_rps:           900
goodput_rps:            640
p95_latency_ms:        1850
retry_rps:              260
db_pool_wait_p95_ms:    620
payment_429_rps:         80

The service handled more attempts, but goodput plateaued. The extra work became queueing, retries, and throttling.

Google's SRE chapter on Addressing Cascading Failures emphasizes overload testing and understanding capacity limits before overload becomes a cascading failure. That advice applies directly to horizontal scaling: test until the system stops improving, then identify which resource bends first.
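Finding the point where the system stops improving can be automated over stepped load-test results. The sketch below is one possible heuristic, not a standard algorithm: it flags the first step where added offered load converts into goodput at less than a chosen ratio. The `LoadStep` shape, the function name, and the 0.5 default threshold are assumptions.

```typescript
// One step of a stepped load test (hypothetical shape).
interface LoadStep {
  offeredRps: number
  goodputRps: number
}

// Returns the last step before goodput stopped keeping up with offered
// load, or undefined if every step still converted load into goodput.
function findPlateau(steps: LoadStep[], minGainRatio = 0.5): LoadStep | undefined {
  for (let i = 1; i < steps.length; i++) {
    const addedLoad = steps[i].offeredRps - steps[i - 1].offeredRps
    const addedGoodput = steps[i].goodputRps - steps[i - 1].goodputRps
    if (addedLoad > 0 && addedGoodput / addedLoad < minGainRatio) {
      return steps[i - 1]
    }
  }
  return undefined
}
```

The returned step marks the load level to investigate: the resource metrics captured between that step and the next one are where the bending resource should show up first.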

What To Try Instead Of More Instances

Once the bottleneck is visible, the next move depends on what limited goodput.

Evidence	Better next move
database pool wait rises	reduce per-request queries, cap fleet-wide connections
one query dominates database time	inspect query plan, indexes, and result size
repeated relation lookups	fix N+1 or batch relation reads
lock wait rises on hot rows	redesign contention point, use queues, shard by key, or change invariant boundary
external API throttles rise	add client-side rate limits, cache safe reads, or negotiate quota
queue age rises	measure worker bottleneck and partition distribution
retry rate rises	add retry budgets and backoff, reduce wasted work
cache misses surge after scale-out	warm cache deliberately or reduce cache dependency
per-instance traffic is uneven	inspect sticky routing, connection lifetime, and readiness
CPU is truly saturated per instance	then horizontal or vertical compute scaling may help
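As one example from the table, a retry budget caps how much extra load retries may add. The class below is a deliberately simplified sketch: it tracks cumulative counts with no time window, which a production version would need. The `RetryBudget` name and its methods are hypothetical, not a specific library's API.

```typescript
// Allows retries only while they stay under a fixed fraction of
// first-attempt requests (cumulative counts, no time window).
class RetryBudget {
  private requests = 0
  private retries = 0
  private readonly maxRetryRatio: number

  constructor(maxRetryRatio: number) {
    this.maxRetryRatio = maxRetryRatio
  }

  // Call once for every first-attempt request.
  recordRequest(): void {
    this.requests += 1
  }

  // Call before retrying; false means the caller should fail fast instead.
  tryAcquireRetry(): boolean {
    if (this.retries + 1 > this.requests * this.maxRetryRatio) return false
    this.retries += 1
    return true
  }
}
```

The design choice is that the budget is shared across requests rather than per request: when a dependency is struggling, most callers stop retrying, so the hot resource sees a bounded amount of additional work.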

If the bottleneck is a slow SQL query, start with How to Find and Fix Slow SQL Queries in Production. If the issue is ORM query explosion, use N+1 Query Problem in ORMs. If the proposed fix is "add read replicas," read Why Read Replicas Didn't Reduce Database Load before assuming the primary will cool down.

A Practical Scaling Review Checklist

Before scaling a production service horizontally, answer these questions:

What is the current goodput, not only offered request rate?
Which resource saturates first in a load test?
Does per-instance CPU reflect useful work or waiting?
What is the fleet-wide database connection budget after scaling?
Which dependency quotas are shared by all instances?
Which locks, tenants, partition keys, or rows serialize work?
What happens to queue age when request volume doubles?
Does retry volume rise during the test?
Are requests evenly distributed across instances and zones?
Do new instances have cold caches or expensive warmup paths?
Can the service shed or defer excess work before dependencies collapse?
Which metric would prove that scaling improved user-visible capacity?

Then run a test that intentionally crosses the expected limit. The first plateau is the most valuable part of the test because it tells you where the real constraint lives.

If the system starts timing out and retrying instead of rejecting work early, the next reliability topic is admission control. Rate Limiting and Backpressure in Microservices covers how to keep overloaded services alive by limiting admitted work.
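Rejecting work early can be as simple as a cap on in-flight requests. This is a minimal sketch under that assumption; the `InFlightLimiter` name is hypothetical, and choosing the limit is the hard part that the load test is meant to inform.

```typescript
// Caps concurrent work; callers that cannot acquire a slot are rejected
// immediately instead of queueing behind a saturated dependency.
class InFlightLimiter {
  private inFlight = 0
  private readonly limit: number

  constructor(limit: number) {
    if (limit <= 0) throw new Error('limit must be positive')
    this.limit = limit
  }

  // Returns true and takes a slot when capacity is available.
  tryAcquire(): boolean {
    if (this.inFlight >= this.limit) return false
    this.inFlight += 1
    return true
  }

  // Must be called once for every successful tryAcquire, including on errors.
  release(): void {
    if (this.inFlight > 0) this.inFlight -= 1
  }
}
```

A handler would call `tryAcquire()` first, return a fast rejection such as 429 or 503 when it fails, and call `release()` in a `finally` block when the request ends.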

The Short Version

Horizontal scaling improves throughput only when the work is parallelizable and the tier you are adding instances to is the bottleneck.

If every request waits on the same database, row lock, external API, queue partition, cache, or downstream quota, more instances mostly create more concurrency against the same limit. Application CPU can fall while user-visible throughput stays flat.

The useful question is not "Can we add more pods?"

The useful question is "Which resource limits goodput?"

Once that is clear, scaling becomes a targeted change: add compute when compute is limiting, tune queries when the database is limiting, control admission when overload is limiting, reduce contention when shared state is limiting, and measure success by completed useful work rather than by the number of instances running.