Why Horizontal Scaling Didn’t Improve Throughput

Horizontal scaling does not improve throughput when the service instances are not the bottleneck. Adding pods, containers, or servers can lower per-instance CPU while total useful work stays flat because every new instance sends more concurrent work into the same database, lock, queue, cache, rate limit, or downstream API.

That is the frustrating scaling failure: the application tier looks healthier after the rollout, but users do not get more completed requests. Latency gets worse, connection pools fill, retries increase, and the system spends more time waiting on shared resources than doing useful work.

For the broader reliability cluster around overload, admission control, retries, circuit breakers, and cache behavior, see the Backend Reliability hub.

More Instances Reduced CPU, Not The Bottleneck

Imagine a checkout summary endpoint that is scaled from 8 pods to 32 pods before a traffic campaign.

The first dashboard looks encouraging:

| Metric | Before scaling | After scaling |
| --- | --- | --- |
| Application pods | 8 | 32 |
| Average pod CPU | 72% | 28% |
| Average pod memory | 61% | 40% |
| Offered traffic | 650 rps | 1,200 rps |
| Completed useful responses | 610 rps | 660 rps |
| p95 latency | 240 ms | 1,850 ms |
| Database connection wait p95 | 12 ms | 620 ms |
| Payment-settings API throttles | 0 | rising |
| Retry volume | low | high |

The team quadrupled application capacity, from 8 pods to 32. Completed useful responses rose only from 610 rps to 660 rps, about 8%.

The reason is visible only when "throughput" is separated from "goodput." Throughput is the work offered to the system. Goodput is the work completed successfully and quickly enough to be useful to callers. AWS's Builders Library article on using load shedding to avoid overload uses that distinction directly: systems can process or attempt more work while useful completed work plateaus or falls.

In this example, horizontal scaling increased offered load. It did not increase goodput because the application tier was no longer the limiting resource.
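One way to put a number on that gap is to compare the growth in goodput with the growth in replicas. The sketch below is illustrative only: it treats the "Completed useful responses" row of the dashboard as goodput, and the names `ScalingSnapshot` and `scalingEfficiency` are hypothetical, not part of any library.

```typescript
// Scaling efficiency: how much of the added capacity turned into goodput.
// 1.0 means goodput grew in proportion to replicas; a value close to
// 1 / replicaRatio means the added replicas contributed almost nothing.
interface ScalingSnapshot {
  replicas: number;
  goodputRps: number; // successful responses delivered within the latency budget
}

function scalingEfficiency(
  before: ScalingSnapshot,
  after: ScalingSnapshot,
): number {
  const replicaRatio = after.replicas / before.replicas;
  const goodputRatio = after.goodputRps / before.goodputRps;
  return goodputRatio / replicaRatio;
}

// Figures taken from the dashboard above.
const checkoutEfficiency = scalingEfficiency(
  { replicas: 8, goodputRps: 610 },
  { replicas: 32, goodputRps: 660 },
);
// checkoutEfficiency is roughly 0.27: 4x the pods, about 1.08x the goodput.
```

An efficiency this close to 1/4 suggests the extra pods were almost entirely waiting on something else.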

The Handler Looked Stateless

The endpoint itself can look ordinary:

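async function getCheckoutSummary(req: Request) {
  // Database read 1: the user row (takes a connection from the shared pool).
  const user = await db.user.findUnique({
    where: { id: req.userId },
    select: { id: true, accountId: true, region: true },
  })
  if (!user) throw new Error('user not found')

  // Database read 2: the open cart, plus a relation load for its items.
  const cart = await db.cart.findFirst({
    where: { userId: req.userId, status: 'open' },
    include: { items: true },
  })
  if (!cart) throw new Error('no open cart')

  // Downstream call 1: spends shared payment-settings quota.
  const paymentSettings = await paymentClient.getSettings({
    accountId: user.accountId,
    region: user.region,
  })

  // Downstream call 2: spends shared inventory-service capacity.
  const inventory = await inventoryClient.reservePreview({
    itemIds: cart.items.map((item) => item.sku),
  })

  return buildSummary({ user, cart, paymentSettings, inventory })
}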

The service stores no local session state. Any pod can handle any request. Scaling it horizontally sounds reasonable.

But each request still uses shared resources:

  • two database reads
  • one cart relation load
  • one payment-settings dependency call
  • one inventory dependency call
  • one connection from the HTTP server
  • one connection from the database pool
  • one slice of downstream quota

Adding pods multiplies the number of concurrent callers. It does not multiply the capacity of the database, payment service, inventory service, or shared quota.

Autoscaling Adds Replicas Based On A Signal

Infrastructure can scale the workload correctly and still fail to improve user-visible throughput.

Kubernetes documents horizontal pod autoscaling as a controller that adjusts a workload's desired scale based on observed metrics such as CPU, memory, or custom metrics. The basic formula uses the ratio between current metric value and desired metric value. See Kubernetes Horizontal Pod Autoscaling.

That is useful machinery. It is not a proof that the application tier is the right scaling target.

If CPU is high because request handlers do CPU-heavy work locally, more pods may help. If CPU is low because handlers are waiting on the database, more pods may only create more waiters. If the autoscaler uses average CPU, it may even scale down while dependency latency rises because the pods are blocked on I/O rather than burning CPU.

Autoscaling answers "how many replicas should this workload have for this metric?" It does not answer "which resource is limiting goodput?"
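The ratio the Kubernetes documentation describes is `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. The sketch below applies it to the checkout example. The 50% CPU target is an assumption made for illustration, and the sketch deliberately ignores tolerance, replica bounds, and stabilization behavior that a real autoscaler applies.

```typescript
// Core ratio from the Kubernetes HPA documentation:
//   desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)
// Simplified: no tolerance band, no min/max replicas, no stabilization window.
function desiredReplicas(
  currentReplicas: number,
  currentMetricValue: number,
  desiredMetricValue: number,
): number {
  return Math.ceil((currentReplicas * currentMetricValue) / desiredMetricValue);
}

// Hypothetical target: 50% average CPU utilization.
const TARGET_CPU_PERCENT = 50;

// Before scaling: 8 pods at 72% CPU, so the formula asks for more pods.
const replicasBeforeScaleOut = desiredReplicas(8, 72, TARGET_CPU_PERCENT); // 12

// After scaling: 32 pods at 28% CPU, so the formula asks for fewer pods,
// even though p95 latency and pool wait got worse in the meantime.
const replicasAfterScaleOut = desiredReplicas(32, 28, TARGET_CPU_PERCENT); // 18
```

The formula behaves exactly as designed in both cases. It simply has no input that describes database wait or downstream throttling unless such a metric is supplied to it.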

Shared Bottlenecks That Make Scaling Flatline

When horizontal scaling fails, the bottleneck is usually in a shared resource or coordination point.

| Bottleneck | What adding instances changes | What to inspect |
| --- | --- | --- |
| Database connection limit | more pods compete for the same finite connections | pool wait time, active connections, max connections |
| Slow query or missing index | more instances run the same expensive query | query plans, rows scanned, total database time |
| Row or advisory lock contention | more workers wait on the same protected state | lock wait time, blocked transactions, hot keys |
| External API rate limit | more callers spend the same downstream quota faster | 429s, retries, per-client quota, dependency latency |
| Cache misses or invalidations | more instances produce more cold reads and churn | hit ratio by key family, value age, invalidation rate |
| Message broker partition or consumer key | more workers wait behind one ordered partition | partition lag, consumer utilization, key distribution |
| Load balancer or session affinity | some instances receive too much traffic | per-instance rps, connection age, sticky routing behavior |
| Per-tenant hot account | more fleet capacity does not split one tenant's state | tenant-level latency, row locks, queue depth by tenant |
| Logging or telemetry sink | more instances emit more side effects per request | logging latency, dropped spans, exporter queues |

The pattern is the same: the number of application workers increased, but the capacity of the serialized part of the workflow did not.

AWS's load-shedding article connects this to Amdahl's law and the Universal Scalability Law: parallelism can increase throughput only until serialization and contention dominate. That is the heart of the horizontal-scaling trap.
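Both laws are short enough to write down. The sketch below uses the standard forms: Amdahl's law as `1 / (s + (1 - s) / n)` for a serial fraction `s`, and the Universal Scalability Law as `n / (1 + α(n - 1) + βn(n - 1))`. The coefficient values are hypothetical and chosen only to show the shape of the curves; they are not measurements from the checkout example.

```typescript
// Amdahl's law: speedup with n workers when a fraction `serial` of the
// work cannot be parallelized.
function amdahlSpeedup(n: number, serial: number): number {
  return 1 / (serial + (1 - serial) / n);
}

// Universal Scalability Law: relative capacity with n workers.
// `contention` (alpha) models queueing on shared resources;
// `coherency` (beta) models the cost of keeping workers consistent.
function uslCapacity(n: number, contention: number, coherency: number): number {
  return n / (1 + contention * (n - 1) + coherency * n * (n - 1));
}

// Hypothetical coefficients for illustration only.
const SERIAL_FRACTION = 0.2;
const ALPHA = 0.2;
const BETA = 0.001;

const amdahlAt8 = amdahlSpeedup(8, SERIAL_FRACTION);   // about 3.33
const amdahlAt32 = amdahlSpeedup(32, SERIAL_FRACTION); // about 4.44
const uslAt28 = uslCapacity(28, ALPHA, BETA);
const uslAt64 = uslCapacity(64, ALPHA, BETA); // lower than uslAt28
```

With a 20% serial fraction, going from 8 to 32 workers raises the Amdahl speedup only from about 3.33 to about 4.44, and it can never exceed 5. With these USL coefficients, capacity peaks near 28 workers and then declines, which is the retrograde region where adding instances makes things worse.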

Connection Pools Can Multiply Pressure

A common failure is increasing every pod's connection pool without calculating fleet-wide concurrency.

Before scaling:

8 pods * 20 database connections = 160 possible database connections

After scaling:

32 pods * 20 database connections = 640 possible database connections

If the database is comfortable at 180 active application connections, the new fleet can overload it even though each pod's configuration did not change.

Increasing the pool size can make this worse:

32 pods * 40 database connections = 1,280 possible database connections

That does not create database capacity. It creates more concurrent work waiting for the same CPU, I/O, memory, locks, and query planner choices.
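The same arithmetic can be turned around to derive a per-pod pool size from a fleet-wide budget. This is a minimal sketch: the function names are hypothetical, and the budget of 180 is the "comfortable" figure assumed in the text, not a recommendation.

```typescript
// Fleet-wide connection math: per-pod pool size multiplies with replicas.
function fleetConnections(pods: number, poolSizePerPod: number): number {
  return pods * poolSizePerPod;
}

// Largest per-pod pool that keeps the whole fleet inside a database budget.
function maxPoolSizePerPod(connectionBudget: number, pods: number): number {
  return Math.floor(connectionBudget / pods);
}

// The "comfortable" active-connection figure used in the text.
const DB_COMFORTABLE_CONNECTIONS = 180;

const connectionsAt8Pods = fleetConnections(8, 20);   // 160, inside the budget
const connectionsAt32Pods = fleetConnections(32, 20); // 640, far over the budget
const poolSizeFor32Pods = maxPoolSizePerPod(DB_COMFORTABLE_CONNECTIONS, 32); // 5
```

A pool of 5 per pod may be too small for the handler to be useful, and that is itself a signal: the fleet is larger than the database can feed, so a connection pooler or fewer, busier pods may fit better than 32 lightly loaded ones.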

A safer scaling review asks:

| Question | Why it matters |
| --- | --- |
| What is the total fleet-wide connection budget? | per-pod settings multiply when replicas increase |
| What is the database's comfortable active workload? | max connections is not the same as healthy throughput |
| Which endpoints use the most connections? | hot routes can dominate the shared pool |
| What happens when pool wait time rises? | queued requests may keep HTTP connections open |
| Do retries acquire new connections? | retries can multiply pressure during partial failure |

If pool wait time rises while application CPU falls, adding more application instances is probably not the fix.

Locks Make More Workers Wait Together

Horizontal scaling can also flatten throughput through lock contention.

Consider an inventory reservation path:

UPDATE inventory
SET available = available - $1
WHERE sku = $2
  AND available >= $1;

The guarded update is correct. It protects the invariant that inventory cannot go negative.

But if a campaign sends many requests for the same SKU, more pods do not create more independent work. They create more concurrent transactions trying to update the same row or nearby index pages.

The symptoms look like this:

  • database CPU is not the first limit
  • p95 lock wait rises
  • transaction duration rises
  • connection pool wait rises after lock waits
  • application instances sit idle or blocked
  • retries make the hot row even hotter

This is not a reason to remove the guard. Correctness comes first. It is a reason to recognize that the bottleneck is a serialized business invariant, not application compute.

The adjacent concurrency decisions are covered in Optimistic vs Pessimistic Locking in SQL. If the broader issue is overlapping requests changing shared state, see How to Prevent Race Conditions in Backend Systems.

Queues Can Hide The Plateau

Some systems appear to absorb the extra traffic because they enqueue work.

That can make the initial scaling attempt look successful:

HTTP 202 Accepted
job enqueued
worker will process later

But if the worker bottleneck is still the same database, external API, partition key, or lock, the queue only moves the plateau downstream.

Watch these signals:

| Signal | Meaning |
| --- | --- |
| queue age rising while workers grow | workers are being added, but the queue is not draining any faster |
| retries per job rising | workers are colliding with the same bottleneck |
| dead-letter volume rising | queued work is not merely delayed, it is failing |
| one partition much hotter than rest | ordering or key distribution limits parallelism |
| worker CPU low, dependency wait high | workers are blocked, not under-provisioned |

For the operational side of delayed work, see Background Jobs in Production. If workers claim jobs from PostgreSQL, PostgreSQL Job Queues with SKIP LOCKED covers the row-claiming boundary.

Load Balancing Can Hide Uneven Capacity

Even when dependencies are healthy, scaling may fail because traffic is not actually balanced.

Common causes:

  • long-lived HTTP connections
  • sticky sessions
  • tenant-level routing
  • uneven expensive requests
  • uneven cache warmth
  • slow instance warmup
  • readiness checks that pass before local caches or JIT warmup finish
  • one availability zone receiving more expensive traffic

Fleet averages hide this.

Look at per-instance metrics:

| Metric | Useful question |
| --- | --- |
| requests per instance | Are a few instances handling most traffic? |
| p95 latency per instance | Are slow instances hidden by the average? |
| dependency calls per route | Are expensive routes unevenly distributed? |
| cache hit ratio per pod | Are new pods cold while old pods are warm? |
| errors by instance | Are retries concentrating on a subset of the fleet? |
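A quick way to check the first row is to compare the busiest instance with the fleet mean. The sketch below is one possible measure, not a standard metric: `imbalanceRatio` is a hypothetical name and the per-pod request rates are made up.

```typescript
// Ratio of the busiest instance's request rate to the fleet mean.
// 1.0 means perfectly even traffic; larger values mean the fleet
// average is hiding one or more hot instances.
function imbalanceRatio(rpsByInstance: number[]): number {
  if (rpsByInstance.length === 0) {
    throw new Error("need at least one instance");
  }
  const total = rpsByInstance.reduce((sum, rps) => sum + rps, 0);
  if (total === 0) {
    return 1; // no traffic at all, so nothing is imbalanced
  }
  const mean = total / rpsByInstance.length;
  return Math.max(...rpsByInstance) / mean;
}

// Hypothetical per-pod request rates with the same fleet total.
const evenFleetRatio = imbalanceRatio([100, 100, 100, 100]);
const skewedFleetRatio = imbalanceRatio([310, 30, 30, 30]);
```

Both fleets average 100 rps per pod, but in the skewed fleet one pod carries about three times the mean while the other three sit nearly idle.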

Google's SRE book chapter on Handling Overload discusses overload as something load balancing tries to avoid, but also notes that eventually some part of a system can become overloaded and needs graceful handling. That is the practical lesson: load balancing helps distribute work, but it does not remove the need to know each instance and dependency's real capacity.

Measure Goodput, Not Only More Traffic

A scaling test should show useful completed work.

Measure at least these:

| Metric | Why it matters |
| --- | --- |
| offered rps | how much work clients attempted |
| successful rps | how much work completed without errors |
| goodput | successful work completed within the latency budget |
| p95 and p99 latency | whether tail latency got worse |
| retry rate | whether clients multiplied load |
| database pool wait | whether app replicas are queued on shared connections |
| dependency latency and throttles | whether a downstream service became the bottleneck |
| lock wait | whether coordination, not compute, is limiting work |
| queue age | whether async work is falling behind |
| per-instance traffic | whether the load balancer is actually spreading work |

A test result like this is a warning:

offered_rps:        1200
http_2xx_rps:        900
goodput_rps:         640
p95_latency_ms:     1850
retry_rps:           260
db_pool_wait_p95:    620
payment_429_rps:      80

The service handled more attempts, but goodput plateaued. The extra work became queueing, retries, and throttling.
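The rate metrics in a result like that can be derived from raw request records. The sketch below assumes a hypothetical `RequestSample` shape produced by a load-test client, counts any 2xx response as successful, and takes the latency budget as a parameter; a real harness would choose its own definitions.

```typescript
interface RequestSample {
  status: number;    // HTTP status code returned to the caller
  latencyMs: number; // end-to-end latency seen by the caller
  isRetry: boolean;  // true when the client re-sent an earlier request
}

interface LoadSummary {
  offeredRps: number; // every attempt, including retries
  successRps: number; // 2xx responses, regardless of latency
  goodputRps: number; // 2xx responses within the latency budget
  retryRps: number;   // attempts that were retries
}

function summarize(
  samples: RequestSample[],
  windowSeconds: number,
  latencyBudgetMs: number,
): LoadSummary {
  let success = 0;
  let good = 0;
  let retries = 0;
  for (const sample of samples) {
    const ok = sample.status >= 200 && sample.status < 300;
    if (ok) success++;
    if (ok && sample.latencyMs <= latencyBudgetMs) good++;
    if (sample.isRetry) retries++;
  }
  return {
    offeredRps: samples.length / windowSeconds,
    successRps: success / windowSeconds,
    goodputRps: good / windowSeconds,
    retryRps: retries / windowSeconds,
  };
}
```

Keeping `successRps` and `goodputRps` separate is the point of the exercise: a 2xx that arrives after the caller has given up counts toward the first but not the second.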

Google's SRE chapter on Addressing Cascading Failures emphasizes overload testing and understanding capacity limits before overload becomes a cascading failure. That advice applies directly to horizontal scaling: test until the system stops improving, then identify which resource bends first.

What To Try Instead Of More Instances

Once the bottleneck is visible, the next move depends on what limited goodput.

| Evidence | Better next move |
| --- | --- |
| database pool wait rises | reduce per-request queries, cap fleet-wide connections |
| one query dominates database time | inspect query plan, indexes, and result size |
| repeated relation lookups | fix N+1 or batch relation reads |
| lock wait rises on hot rows | redesign contention point, use queues, shard by key, or change invariant boundary |
| external API throttles rise | add client-side rate limits, cache safe reads, or negotiate quota |
| queue age rises | measure worker bottleneck and partition distribution |
| retry rate rises | add retry budgets and backoff, reduce wasted work |
| cache misses surge after scale-out | warm cache deliberately or reduce cache dependency |
| per-instance traffic is uneven | inspect sticky routing, connection lifetime, and readiness |
| CPU is truly saturated per instance | then horizontal or vertical compute scaling may help |
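As one illustration of the retry-budget row, the sketch below lets retries spend credits that only first-attempt requests can earn, so retry volume stays a bounded fraction of normal traffic. It is a simplified, in-process sketch with a hypothetical `RetryBudget` class: there is no time window or cross-pod sharing, and the 10% figure in the usage note is an assumption.

```typescript
// Retry budget: retries are allowed only as a bounded percentage of
// first-attempt requests, so a struggling dependency is not hit harder.
// Credits are integers to avoid floating-point drift.
class RetryBudget {
  private static readonly COST_PER_RETRY = 100;
  private credits = 0;
  private readonly retryPercent: number;
  private readonly maxCredits: number;

  // retryPercent = 10 allows roughly one retry per ten requests.
  constructor(retryPercent: number, maxCredits: number = 1000) {
    this.retryPercent = retryPercent;
    this.maxCredits = maxCredits;
  }

  // Call once for every first-attempt request.
  recordRequest(): void {
    this.credits = Math.min(this.maxCredits, this.credits + this.retryPercent);
  }

  // Call before retrying; returns false when the budget is exhausted,
  // in which case the caller should fail fast instead of retrying.
  tryRetry(): boolean {
    if (this.credits < RetryBudget.COST_PER_RETRY) {
      return false;
    }
    this.credits -= RetryBudget.COST_PER_RETRY;
    return true;
  }
}
```

With `new RetryBudget(10)`, ten first-attempt requests earn exactly one retry. The cap on stored credits limits how large a burst of retries can follow a quiet period.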

If the bottleneck is a slow SQL query, start with How to Find and Fix Slow SQL Queries in Production. If the issue is ORM query explosion, use N+1 Query Problem in ORMs. If the proposed fix is "add read replicas," read Why Read Replicas Didn't Reduce Database Load before assuming the primary will cool down.

A Practical Scaling Review Checklist

Before scaling a production service horizontally, answer these questions:

  1. What is the current goodput, not only offered request rate?
  2. Which resource saturates first in a load test?
  3. Does per-instance CPU reflect useful work or waiting?
  4. What is the fleet-wide database connection budget after scaling?
  5. Which dependency quotas are shared by all instances?
  6. Which locks, tenants, partition keys, or rows serialize work?
  7. What happens to queue age when request volume doubles?
  8. Does retry volume rise during the test?
  9. Are requests evenly distributed across instances and zones?
  10. Do new instances have cold caches or expensive warmup paths?
  11. Can the service shed or defer excess work before dependencies collapse?
  12. Which metric would prove that scaling improved user-visible capacity?

Then run a test that intentionally crosses the expected limit. The first plateau is the most valuable part of the test because it tells you where the real constraint lives.

If the system starts timing out and retrying instead of rejecting work early, the next reliability topic is admission control. Rate Limiting and Backpressure in Microservices covers how to keep overloaded services alive by limiting admitted work.
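Rejecting work early can be as simple as a cap on in-flight requests. The sketch below is a minimal illustration, not the approach of any particular framework: `AdmissionGate` and `handleWithGate` are hypothetical names, the limit would come from load testing, and the 503 response shape is an assumption.

```typescript
// Minimal in-flight limiter: refuse new work immediately when the service
// is already handling as much as its dependencies can absorb.
class AdmissionGate {
  private inFlight = 0;
  private readonly maxInFlight: number;

  constructor(maxInFlight: number) {
    this.maxInFlight = maxInFlight;
  }

  // Returns true and takes a slot, or returns false without waiting.
  tryAcquire(): boolean {
    if (this.inFlight >= this.maxInFlight) {
      return false;
    }
    this.inFlight++;
    return true;
  }

  // Frees a slot; must be called exactly once per successful tryAcquire.
  release(): void {
    if (this.inFlight > 0) {
      this.inFlight--;
    }
  }
}

type GateResult<T> = { status: 200; body: T } | { status: 503 };

// Hypothetical wrapper: shed load with a fast 503 instead of queueing.
async function handleWithGate<T>(
  gate: AdmissionGate,
  work: () => Promise<T>,
): Promise<GateResult<T>> {
  if (!gate.tryAcquire()) {
    return { status: 503 };
  }
  try {
    return { status: 200, body: await work() };
  } finally {
    gate.release();
  }
}
```

A fast rejection costs the dependency nothing, whereas a request that is admitted and then times out has already consumed a connection, quota, and possibly a lock.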

The Short Version

Horizontal scaling improves throughput only when the work is parallelizable and the bottleneck is inside the instances you are adding.

If every request waits on the same database, row lock, external API, queue partition, cache, or downstream quota, more instances mostly create more concurrency against the same limit. Application CPU can fall while user-visible throughput stays flat.

The useful question is not "Can we add more pods?"

The useful question is "Which resource limits goodput?"

Once that is clear, scaling becomes a targeted change: add compute when compute is limiting, tune queries when the database is limiting, control admission when overload is limiting, reduce contention when shared state is limiting, and measure success by completed useful work rather than by the number of instances running.