Why Horizontal Scaling Didn’t Improve Throughput

Situation

A backend service was experiencing sustained throughput limits under peak traffic. CPU utilization on each instance remained moderate, and memory pressure was low. The reasonable expectation was straightforward: increase the number of application instances and distribute traffic more evenly.

After the cluster was scaled to several times its original size, request volume increased slightly - but overall throughput plateaued. Latency became more inconsistent, and error rates rose during traffic spikes.

The system had more compute capacity. Yet it was not processing more work.


The Reasonable Assumption

Horizontal scaling is widely understood as the safe path to growth. If individual machines are not saturated, adding more instances should allow the system to handle more concurrent requests.

Modern infrastructure reinforces this assumption:

  • Load balancers distribute traffic automatically.
  • Stateless services scale easily.
  • Container orchestration abstracts away placement.
  • Cloud platforms make scaling nearly instantaneous.

In isolation, the logic is sound: if one worker handles N requests per second, then ten workers should handle roughly 10N, assuming traffic is evenly distributed.

That assumption holds - but only when the system is actually parallelizable.
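The gap between the linear expectation and what actually happens can be sketched with a simple Amdahl-style contention model. The serial fraction below is an assumed illustrative value, not a measurement from the system described:

```javascript
// Illustrative throughput model: some fraction of each request's work
// (e.g., access to shared state) is serialized and cannot be parallelized.
// Amdahl-style relative throughput for n workers with serial fraction sigma:
//   X(n) = n / (1 + sigma * (n - 1))
function relativeThroughput(workers, serialFraction) {
  return workers / (1 + serialFraction * (workers - 1));
}

// With no shared state (sigma = 0), ten workers deliver 10x.
console.log(relativeThroughput(10, 0));    // 10
// With just 10% of each request serialized behind shared state,
// ten workers deliver only ~5.26x, and a hundred only ~9.17x.
console.log(relativeThroughput(10, 0.1));  // ~5.26
console.log(relativeThroughput(100, 0.1)); // ~9.17
```

Even a small serialized fraction caps total throughput at 1/sigma, no matter how many instances are added - which is exactly the plateau described above.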


What Actually Happened

As more instances were added:

  • Database query latency increased.
  • Lock wait times grew.
  • Cache hit rates declined.
  • Network overhead between services increased.
  • Some requests began timing out despite low CPU usage.

Instead of increasing total throughput, the system shifted pressure into shared dependencies. The bottleneck moved - but it did not disappear.

Scaling the application tier exposed limits elsewhere.


Illustrative Code Example

The application layer appeared fully stateless:

async function handleRequest(req) {
  // Two database round trips per request: every instance added to the
  // fleet multiplies load on this single shared resource.
  const user = await db.query(
    'SELECT * FROM users WHERE id = ?',
    [req.userId]
  );

  const account = await db.query(
    'SELECT * FROM accounts WHERE user_id = ?',
    [req.userId]
  );

  return process(user, account);
}

Individually, each instance handled these calls efficiently. The queries were indexed. Response times were stable under moderate load.

But when instance count doubled, the total number of concurrent queries against the database doubled as well.

Nothing in the application layer limited concurrency. The database became the coordination point.
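One way to make that coordination point explicit in the application layer is a per-instance concurrency cap. The semaphore below is a hypothetical sketch, not code from the system described, and the limit of 10 is an assumed value that would need tuning:

```javascript
// Minimal counting semaphore: caps how many operations one instance keeps
// in flight, so adding instances no longer multiplies database concurrency
// without bound. (Hypothetical sketch.)
class Semaphore {
  constructor(max) {
    this.max = max;
    this.active = 0;
    this.waiters = [];
  }
  async acquire() {
    if (this.active < this.max) {
      this.active++;
      return;
    }
    // Park until release() hands a slot over; `active` stays capped.
    await new Promise(resolve => this.waiters.push(resolve));
  }
  release() {
    const next = this.waiters.shift();
    if (next) next();      // transfer the slot directly to a waiter
    else this.active--;    // no one waiting: free the slot
  }
}

// Run any async operation under the cap.
async function withSlot(semaphore, fn) {
  await semaphore.acquire();
  try {
    return await fn();
  } finally {
    semaphore.release();
  }
}

// Example: allow at most 10 concurrent queries per instance.
const dbSlots = new Semaphore(10);
const limitedQuery = (sql, params) =>
  withSlot(dbSlots, () => db.query(sql, params));
```

Capping concurrency locally converts database overload into per-instance queueing, where backpressure is visible and controllable, instead of pushing unbounded contention into the shared tier.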


Why It Happened

Throughput Is Limited by the Slowest Shared Resource

Horizontal scaling only improves throughput when work can be parallelized without increasing contention.

In this system:

  • Every request required database access.
  • The database connection pool had finite capacity.
  • Disk I/O and lock contention increased under concurrent access.
  • Cache invalidation events amplified write pressure.

The database was not horizontally scaled in the same proportion as the application tier. Even if it had been, consistency guarantees and shared state would still introduce coordination overhead.

Scaling stateless compute does not eliminate stateful bottlenecks.
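The ceiling imposed by a shared stateful component can be estimated with back-of-the-envelope arithmetic. All figures below are assumed for illustration, not taken from the system described:

```javascript
// Rough capacity model: the database sustains a fixed number of concurrent
// queries, and each query holds a slot for its service time.
// All numbers are illustrative assumptions.
const dbMaxConcurrentQueries = 200; // e.g., connection/worker ceiling
const avgQuerySeconds = 0.02;       // 20 ms per query
const queriesPerRequest = 2;        // as in the handler shown earlier

// Little's Law: max throughput = concurrency / service time.
const maxQueriesPerSec = dbMaxConcurrentQueries / avgQuerySeconds; // 10,000
const maxRequestsPerSec = maxQueriesPerSec / queriesPerRequest;    // 5,000

console.log(maxRequestsPerSec); // 5000 req/s, regardless of app instance count
```

Under these assumptions, nothing in the application tier - not five instances, not fifty - can push the system past 5,000 requests per second.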

Concurrency Amplifies Contention

Under light load, locks are short-lived and rarely conflict.

Under heavier parallelism:

  • More transactions attempt overlapping writes.
  • Row-level locks escalate into queueing delays.
  • Transactions remain open longer due to wait times.
  • Tail latency increases non-linearly.

Individual query performance might look acceptable in isolation. The degradation emerges from interaction.

More workers increase the probability of simultaneous access to shared rows or indexes. The system spends more time waiting, not computing.
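That rising probability can be illustrated with a birthday-style estimate: if k concurrent transactions each touch one of R "hot" rows uniformly at random, the chance that at least two collide grows quickly with k. This is a simplified model, not a claim about the real workload:

```javascript
// Probability that at least two of k concurrent transactions pick the
// same row out of R equally likely hot rows (birthday-problem estimate).
function collisionProbability(k, rows) {
  let pNoCollision = 1;
  for (let i = 0; i < k; i++) {
    pNoCollision *= (rows - i) / rows;
  }
  return 1 - pNoCollision;
}

// With 1,000 hot rows: 10 concurrent writers rarely collide,
// but 100 concurrent writers collide almost every time.
console.log(collisionProbability(10, 1000));  // ~0.044
console.log(collisionProbability(100, 1000)); // ~0.994
```

Going from 10 to 100 concurrent writers is a 10x increase in parallelism but takes the collision chance from rare to near-certain - contention grows far faster than worker count.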

Coordination Costs Increase with Scale

Every distributed system has implicit coordination points:

  • Connection pools
  • Caches
  • Rate limiters
  • Message brokers
  • Shared filesystems
  • Distributed locks

When instance count grows, coordination overhead grows as well:

  • More open connections
  • More heartbeats
  • More cache invalidations
  • More network chatter

Even if CPU utilization remains low, the system may be saturated on:

  • Network bandwidth
  • I/O queues
  • Lock tables
  • Internal scheduler limits

Horizontal scaling can increase internal system traffic faster than user-visible throughput.
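Connection count is a concrete instance of this growth. The numbers below are illustrative assumptions - the pattern, not the specific limits, is the point:

```javascript
// Each app instance maintains its own connection pool, so total database
// connections grow linearly with instance count even when per-instance
// load stays flat. Figures are illustrative assumptions.
function totalDbConnections(instances, poolSizePerInstance) {
  return instances * poolSizePerInstance;
}

const dbConnectionLimit = 500; // assumed database-side ceiling

console.log(totalDbConnections(5, 20));  // 100 - well under the limit
console.log(totalDbConnections(40, 20)); // 800 - exceeds the 500-connection limit
```

Scaling from 5 to 40 instances never changed any application code, yet it pushed a shared resource past a hard limit that per-instance metrics cannot see.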

Queueing Effects Become Dominant

Throughput plateaus often coincide with queue formation.

Once a shared resource nears saturation:

  • Requests begin to queue.
  • Queued requests hold connections open.
  • Connection pools exhaust.
  • Upstream services experience timeouts.
  • Retries increase load further.

The system may appear stable at average load but collapse under bursts.

Adding more instances increases arrival rate into the bottleneck, accelerating queue formation rather than relieving it.
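The non-linear collapse near saturation follows from basic queueing theory. Treating the shared resource as an M/M/1 queue, mean time in system is W = 1 / (mu - lambda); the service rate below is an assumed figure:

```javascript
// M/M/1 mean time in system: W = 1 / (mu - lambda), where mu is the
// service rate and lambda the arrival rate at the shared resource.
// Adding instances raises lambda, not mu.
function meanLatencySeconds(arrivalRate, serviceRate) {
  if (arrivalRate >= serviceRate) return Infinity; // unstable: queue grows forever
  return 1 / (serviceRate - arrivalRate);
}

const mu = 1000; // shared resource serves 1,000 ops/s (assumed)
console.log(meanLatencySeconds(500, mu) * 1000); // 50% utilization ->   2 ms
console.log(meanLatencySeconds(900, mu) * 1000); // 90% utilization ->  10 ms
console.log(meanLatencySeconds(990, mu) * 1000); // 99% utilization -> 100 ms
```

Latency stays flat until utilization gets high, then explodes - which is why a system can look healthy at average load and collapse under a modest burst.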

The System Was Behaving Correctly

Nothing was broken.

The load balancer distributed traffic correctly.
The application instances processed requests efficiently.
The database enforced transactional guarantees as designed.

The system obeyed its constraints.

The assumption that compute was the limiting factor turned out to be incorrect. The true constraint was shared state coordination.

Horizontal scaling only helps when the constraint is compute-bound work that can be parallelized independently.


Alternatives That Didn't Work

Several intuitive adjustments were attempted:

  • Increasing connection pool size
  • Increasing database instance size
  • Adding read replicas

Each provided marginal improvement, but none changed the fundamental behavior.

Larger pools increased contention.
Larger database instances postponed saturation but did not eliminate it.
Read replicas helped read-heavy paths but did not reduce write coordination.

The bottleneck shifted slightly each time, but the system remained constrained by shared state.


Practical Takeaways

  • If every request depends on the same stateful component, scaling stateless tiers will amplify pressure on it.
  • Throughput plateaus often signal hidden coordination costs.
  • Moderate CPU usage does not imply available system capacity.
  • Lock contention grows non-linearly under concurrency.
  • Queue formation is often the first observable signal of scaling limits.
  • Horizontal scaling improves performance only when work is truly independent.

The key question is not "Can we add more instances?" but "What resource is actually limiting throughput?"


Closing Reflection

Horizontal scaling is powerful, but it does not remove systemic constraints. It redistributes them.

When a system fails to gain throughput after scaling out, the issue is rarely insufficient compute. More often, it is the cost of coordination - the invisible structure that allows distributed systems to behave consistently.

Understanding those coordination boundaries is often more important than adding more machines.