Why Read Replicas Didn’t Reduce Database Load

Situation

A production system was experiencing sustained pressure on its primary database during peak traffic. Most requests were read-heavy, and query latency on the primary was steadily increasing. The expected fix was straightforward: introduce read replicas and route read traffic away from the primary. After the rollout, read queries were visibly hitting replicas - but overall database load did not drop. In some periods, it increased. The primary database remained the bottleneck, even though additional capacity had been added specifically to relieve it.

For anyone encountering this situation, the confusion is immediate: reads moved, replicas are active - why is the primary still overloaded?

The Reasonable Assumption

The assumption behind read replicas is widely shared and rarely questioned. In most application workloads, reads significantly outnumber writes. Replicas exist to serve those reads, allowing the primary to focus on writes. From a system diagram perspective, this looks like simple load distribution: move the dominant class of traffic to additional machines.

This expectation is reinforced by documentation, architecture diagrams, and prior experience. Nothing about this assumption is careless. It reflects how read replicas often behave in smaller systems or under moderate load, where coordination costs are low and consistency constraints are easy to satisfy.
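
To make the assumption concrete, here is a minimal sketch of the expected read/write split. The pool names are illustrative, and node-postgres is used only for concreteness; the write-up does not specify the actual client.

import { Pool } from 'pg'

// Hypothetical topology: one pool for the primary, one per replica.
const primary = new Pool({ host: 'db-primary' })
const replicas = [
  new Pool({ host: 'db-replica-1' }),
  new Pool({ host: 'db-replica-2' }),
]

const pickReplica = () => replicas[Math.floor(Math.random() * replicas.length)]

// The assumed rule: statement type alone decides where a query runs.
function route(sql) {
  return /^\s*select/i.test(sql) ? pickReplica() : primary
}

The important property of this picture is that routing looks only at the statement itself, never at the context it runs in.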

What Actually Happened

After replicas were introduced, some read traffic did move off the primary. Metrics showed queries executing on replicas, and average read latency improved when measured in isolation. Despite this, the primary database continued to exhibit high CPU usage, elevated I/O wait, and increasing contention during peak periods.

In practice:

  • The primary still handled a significant fraction of reads.
  • Write latency increased slightly instead of decreasing.
  • Adding more replicas did not linearly reduce load.
  • During traffic bursts, pressure concentrated back on the primary.

The system was doing more work overall, not less.

Illustrative Code Example

The issue was not visible in obvious routing logic. Reads were correctly marked and dispatched:

// Explicitly marked read-only, so the router can dispatch it to a replica.
db.readOnly().query(
  'SELECT * FROM orders WHERE user_id = ?',
  userId
)

However, contextual guarantees quietly forced many reads back to the primary:

db.transaction(async (tx) => {
  // The write pins the whole transaction to the primary...
  await tx.query('UPDATE users SET last_seen = NOW() WHERE id = ?', id)
  // ...so this follow-up read also executes on the primary.
  return tx.query('SELECT * FROM users WHERE id = ?', id)
})

From the application’s perspective, this was still a read. From the database’s perspective, it was part of a write-bound, consistency-sensitive operation.
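
One plausible way to express that distinction is in the routing layer itself. This is a hypothetical sketch, reusing the primary pool and pickReplica helper from the earlier sketch; the context fields are assumptions, not the system's real API.

// A context-aware router: a plain SELECT is not enough to qualify for a replica.
function choosePool(sql, ctx) {
  if (!/^\s*select/i.test(sql)) return primary   // a write always goes to the primary
  if (ctx.inTransaction) return primary          // must observe the transaction's own writes
  if (ctx.requiresReadYourWrites) return primary // the session just wrote this data
  if (ctx.maxStalenessMs === 0) return primary   // the caller cannot tolerate any lag
  return pickReplica()
}

Under a rule like this, the transactional read above lands on the primary regardless of how it is labeled at the call site.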

Why It Happened

The failure was not that replicas malfunctioned. It was that the system’s actual constraints differed from the assumed ones.

Replication increases primary work, not just replica capacity. Every replica must be kept consistent with the primary. Each write now incurs additional coordination: log shipping, buffering, acknowledgment tracking, and backpressure management. As replicas are added, the primary performs more work per write, not less.
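
A back-of-envelope model makes this concrete. The numbers below are assumptions chosen for illustration, not measurements from this system.

// Illustrative cost model, not a benchmark: units are arbitrary "work" per write.
const baseWriteCost = 1.0       // executing the write itself
const perReplicaOverhead = 0.15 // log shipping, ack tracking, backpressure per replica

const primaryWorkPerWrite = (replicaCount) =>
  baseWriteCost + replicaCount * perReplicaOverhead

console.log(primaryWorkPerWrite(0)) // 1.0 - standalone primary
console.log(primaryWorkPerWrite(4)) // 1.6 - each write ~60% more expensive with 4 replicas

Under a write-heavy peak, that extra per-write cost lands exactly where capacity was supposed to be freed.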

Many reads are implicitly coupled to writes. Reads inside transactions, session-scoped queries, and abstraction-managed units of work often require primary access to preserve correctness. These reads tend to be latency-sensitive and cluster around peak load, precisely when offloading would matter most.
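
The coupling does not require an explicit transaction. A hypothetical request handler (the table, helpers, and session flag are illustrative, not the real codebase) shows the session-scoped version of the same problem:

// No transaction here, but the response must reflect the write that preceded it.
async function handleProfileUpdate(session, userId, displayName) {
  await db.query('UPDATE profiles SET display_name = ? WHERE user_id = ?', displayName, userId)

  // A replica may not have applied the update yet, so a correct router
  // pins this follow-up read to the primary for this session.
  session.requiresReadYourWrites = true
  return db.query('SELECT * FROM profiles WHERE user_id = ?', userId)
}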

Replica lag introduces fallback behavior. Under sustained load, replicas fall behind. Applications rarely tolerate stale data uniformly, so they compensate by retrying or redirecting reads to the primary. Instead of smoothing load, this concentrates it during periods of stress.
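
The fallback itself is easy to sketch. The lag threshold and the replicaLagMs helper are assumptions; the point is the shape of the decision, not the mechanism for measuring lag.

// Lag-aware routing: reads tolerate bounded staleness, otherwise fall back.
async function routeRead(sql, params, maxStalenessMs) {
  const replica = pickReplica()
  const lag = await replicaLagMs(replica) // assumed to come from replication monitoring

  if (lag <= maxStalenessMs) {
    return replica.query(sql, params)
  }

  // The fallback fires when replicas are furthest behind, which is usually
  // when the primary is already under the most write pressure.
  return primary.query(sql, params)
}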

The real bottleneck was coordination rather than execution. The system was no longer limited by raw read throughput. It was constrained by ordering, durability, and consistency guarantees. The database behaved correctly according to its rules, but those guarantees dominated the cost profile at scale.

Alternatives That Didn’t Work

Several reasonable alternatives were tried or considered:

  • Adding more replicas, which increased replication overhead.
  • Aggressively routing “safe” reads, which helped locally but failed under peak load.
  • Relaxing consistency in limited paths, which introduced subtle correctness risks.
  • Query-level routing hints, which were bypassed by higher-level abstractions (see the sketch after this list).

Each option shifted pressure without removing it.
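
The last of these failed in a particularly quiet way. A hypothetical illustration (the withUnitOfWork helper is not from the real codebase): the routing hint works when the query is issued directly, but a higher-level abstraction wraps the same query in its own unit of work and the hint never reaches the router.

// Issued directly, the hint is honored and the read goes to a replica.
db.readOnly().query('SELECT * FROM orders WHERE user_id = ?', userId)

// Wrapped by a generic helper, the same read runs inside a transaction
// and is pinned to the primary; the readOnly hint is never consulted.
const withUnitOfWork = (fn) => db.transaction(async (tx) => fn(tx))

await withUnitOfWork((tx) =>
  tx.query('SELECT * FROM orders WHERE user_id = ?', userId)
)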

Practical Takeaways

Some patterns generalized beyond this system:

  • “Read-heavy” workloads often conceal write-coupled reads.
  • Replica count can increase primary load under sustained writes.
  • Replica lag is often the first visible signal, not the root cause.
  • Load that does not move is frequently constrained by guarantees, not capacity.

These signals tend to emerge gradually, making the system feel inexplicably resistant to scaling.

Closing Reflection

Read replicas solve a narrower problem than they appear to. They distribute execution but do not eliminate coordination, and at scale, coordination often dominates cost. When load refuses to move, it is usually because the system is protecting invariants that matter more than throughput. Understanding those invariants explains why adding capacity can sometimes make the system heavier instead of lighter.