Adding Retries Can Make Outages Worse


Situation

Retries are a standard defensive technique in production systems. When a request fails due to a timeout or transient error, retrying feels like a pragmatic way to absorb temporary instability without surfacing errors to users.

This logic appears in many places: HTTP clients retrying failed calls, background jobs re-queuing work, message consumers reprocessing events, or SDKs automatically reissuing requests. Over time, retries often become deeply embedded into system behavior - sometimes implicitly, sometimes through shared libraries.

In this situation, a system was already experiencing partial degradation. Some requests were slow, some failed intermittently, and capacity was reduced but not exhausted. From the outside, it looked like a classic transient failure scenario: exactly the case retries are meant to help with.

To improve resilience, retry logic was enabled or made more aggressive.

The result was not improved stability.


The Reasonable Assumption

The underlying assumption is straightforward and widely accepted:

If a request fails temporarily, retrying increases the chance of eventual success without meaningful downside.

This assumption is reasonable for several reasons:

  • Many failures are transient: network hiccups, short GC pauses, or brief dependency restarts often resolve themselves.
  • Retrying shifts complexity away from users and into infrastructure.
  • In isolation, a single retry looks cheap compared to a failed operation.

From a local perspective, retries feel like a low-risk improvement. Each individual request either succeeds or tries again. Nothing about that suggests a system-wide failure mode.


What Actually Happened

Instead of stabilizing the system, retries accelerated its collapse.

As the system slowed down, more requests began to time out. Each timeout triggered one or more retries. Those retries added additional load to already constrained components. The extra load caused further slowdowns, which triggered even more retries.

Within minutes, a partial failure turned into a full outage.

Key symptoms appeared quickly:

  • Request volumes increased sharply without a corresponding increase in user traffic.
  • Queues grew faster after retries were enabled.
  • Latency percentiles spiked even though success rates initially looked unchanged.
  • Components that were previously healthy began failing under amplified load.

The system was not failing because of a single broken component. It was failing because the retry mechanism itself was feeding the failure.


Illustrative Code Example

// Naive retry: up to three attempts, fired immediately, with no backoff or jitter.
for (let attempt = 0; attempt < 3; attempt++) {
  try {
    const result = await callDependency()
    if (result.ok) return result
  } catch {
    // Timeouts and transient errors fall through to the next attempt.
  }
}
throw new Error("Request failed")

Each retry appeared harmless. The problem was not the loop itself, but the fact that it existed in thousands of concurrent requests across the system, all reacting to the same degradation at the same time.


Why It Happened

The failure was not caused by retries being “wrong,” but by how they interact with real systems under stress.

Load Amplification

Retries multiply demand exactly when capacity is reduced.

When a dependency slows down, the system already has less throughput available. Retrying failed requests adds new work without removing old work. A single user request can quickly become multiple internal requests, all competing for the same constrained resources.

This amplification is nonlinear. A small increase in error rate can result in a much larger increase in load.
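
A rough model makes the nonlinearity concrete. If each attempt fails independently with probability p and the loop allows up to n attempts, the expected number of attempts per user request is (1 - p^n) / (1 - p). The sketch below plugs made-up failure rates into the three-attempt loop from the earlier example; the independence assumption is optimistic during a shared outage, so real amplification is usually worse.

// Back-of-the-envelope load multiplier for a naive "up to n attempts" loop,
// assuming each attempt fails independently with probability p.
function expectedAttempts(p, n) {
  if (p >= 1) return n
  return (1 - Math.pow(p, n)) / (1 - p)
}

for (const p of [0.0, 0.1, 0.5, 0.9]) {
  console.log(`failure rate ${p}: ~${expectedAttempts(p, 3).toFixed(2)} attempts per user request`)
}
// failure rate 0:   ~1.00 attempts
// failure rate 0.1: ~1.11 attempts
// failure rate 0.5: ~1.75 attempts
// failure rate 0.9: ~2.71 attempts

At a 10% failure rate the overhead is barely visible, but as failures become widespread, each unit of user demand approaches three units of internal demand, arriving exactly when capacity is at its lowest.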

Synchronized Behavior

Retries often happen on similar schedules: fixed delays, exponential backoff with the same parameters, or immediate retries on timeout.

When many clients experience failure simultaneously, they retry simultaneously. Instead of smoothing traffic, retries can synchronize spikes, creating periodic bursts of load that prevent recovery.

From the system’s perspective, it never gets a quiet moment to catch up.
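
A small thought experiment, using made-up client counts and delays, shows why. If every client that failed at the same moment schedules its retries with identical backoff parameters, the retries return as lockstep waves rather than spreading out:

// 1,000 hypothetical clients all see a failure at t = 0 and schedule retries
// with the same exponential backoff: 100 ms base, doubling per attempt, no
// randomization. Every client computes exactly the same schedule.
const clients = 1000
const baseDelayMs = 100
const maxAttempts = 3
const arrivals = new Map() // retry time (ms) -> number of requests landing then

for (let c = 0; c < clients; c++) {
  let t = 0
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    t += baseDelayMs * 2 ** (attempt - 1)
    arrivals.set(t, (arrivals.get(t) || 0) + 1)
  }
}

console.log(arrivals) // Map(3) { 100 => 1000, 300 => 1000, 700 => 1000 }

Each wave lands before the system has absorbed the previous one, which is why the bursts block recovery instead of smoothing traffic.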

Disguised Demand

Retries obscure the true shape of traffic.

Metrics may show a stable request rate from users while internal request volumes climb rapidly. This makes it harder to reason about capacity, because internal demand is inflated by retries and no longer tracks actual usage.

Operators see “high load” without a corresponding external cause.
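
One way to keep the true shape of traffic visible, sketched below with made-up metric names, is to count first attempts and retries as separate series rather than folding them into a single request counter:

// Hypothetical instrumentation: split internal volume into user-driven demand
// (first attempts) and retry-generated demand. callDependency() is the same
// placeholder used in the earlier example.
const attemptCounters = { first: 0, retry: 0 }

async function callWithAccounting(attempt) {
  if (attempt === 0) attemptCounters.first++
  else attemptCounters.retry++
  return callDependency()
}

// attemptCounters.retry / attemptCounters.first is a direct amplification ratio:
// it climbs during degradation even while external traffic stays flat.

With that split, "high load without an external cause" stops being a mystery and becomes a measurable ratio.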

Partial Success Masking

Retries can make systems look healthier than they are - briefly.

If retries succeed after one or two attempts, success rates may remain high even as latency and resource usage degrade. For example, if a call normally completes in 200 ms but two out of three requests now succeed only on a retry issued after a one-second timeout, the success rate still reads 100% while average latency has more than quadrupled. This delays detection and response, allowing the underlying issue to worsen before intervention.

By the time failures are visible, the system may already be saturated.

Hidden Coupling

Retry logic often exists in multiple layers: client libraries, services, background workers, and infrastructure.

Each layer may independently retry, multiplying the effect. What looks like “three retries” in one component may turn into dozens of requests across the full call chain.

Because this coupling is implicit, it is rarely modeled or tested.
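
A tiny worked example, with hypothetical layers and limits, shows how quickly this compounds:

// Hypothetical call chain: an edge proxy, a service client, and a storage
// driver each allow up to 3 attempts. Attempts multiply across layers rather
// than add, so the worst case for one user action is 3 * 3 * 3 calls.
const attemptsPerLayer = [3, 3, 3]
const worstCaseCalls = attemptsPerLayer.reduce((total, n) => total * n, 1)
console.log(worstCaseCalls) // 27 requests can reach the bottom dependency

Each team sees only the three attempts it configured; the twenty-seven-call worst case exists in no single codebase, which is exactly why it goes unmodeled.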


Alternatives That Didn’t Work

Several responses were attempted during the incident, all of them reasonable in isolation.

  • Increasing timeouts reduced immediate failures but kept slow requests holding resources for longer, worsening contention.
  • Scaling up capacity helped briefly, but retries scaled with it, consuming the new headroom.
  • Disabling retries in one service had limited impact because other layers continued retrying.

None of these actions addressed the core dynamic: retries were generating more work than the system could process under degraded conditions.


Practical Takeaways

Retries are not inherently dangerous, but they have system-level effects that are easy to overlook.

Patterns worth watching for:

  • Retry logic that activates under the same conditions across many clients.
  • Success metrics that remain stable while latency and load increase.
  • Systems where retry behavior is distributed across multiple layers.
  • Degradation scenarios where demand grows as capacity shrinks.

The key is not whether retries exist, but how they interact with failure modes that affect many requests at once.


Closing Reflection

Retries are appealing because they feel like a local improvement: a small change that increases robustness without altering architecture. In practice, they are a coordination mechanism, shaping how the entire system reacts to stress.

Outages caused by retries are rarely about a single bad decision. They emerge from reasonable choices made independently, interacting in ways that are difficult to predict without observing real failures.

Understanding that dynamic is often more valuable than any specific configuration change.