Adding Retries Can Make Outages Worse

Retries make production systems more reliable when failures are rare, brief, and safe to repeat. They make outages worse when the system is already overloaded and every retry adds work to the same constrained dependency.

That is the uncomfortable part of retry logic: the code often looks harmless. A small loop around an HTTP call can become a traffic multiplier across thousands of concurrent requests, several service layers, background workers, SDKs, queues, and clients.

The question is not "should we retry?" The question is "what happens to total load when many callers retry at the same time?"

For related production failure patterns around timeouts, overload, queues, and backpressure, see the Backend Reliability hub.


Why Retries Feel Like A Safe Default

Retries solve a real problem.

Networks drop packets. Containers restart. Connections get reset. Load balancers rebalance. Short dependency pauses happen.

If a request fails once and succeeds 50 ms later, retrying hides noise from users and avoids unnecessary incident pages. That is why retry behavior appears in HTTP clients, database drivers, queue consumers, SDKs, background jobs, browser clients, and service meshes.

The local reasoning is reasonable:

  1. the request failed
  2. the failure might be transient
  3. retrying may succeed
  4. one more attempt is cheaper than surfacing an error

But local reasoning misses shared capacity. During overload, every caller performs the same reasoning at the same time.

The Amazon Builders' Library article on timeouts, retries, and backoff with jitter makes this trade-off explicit: retries can help with transient failures, but they can also increase load on a dependency that is already approaching overload.


The Retry Storm Timeline

Imagine an API service calling a payment dependency. The dependency normally handles 10,000 requests per minute. During a partial incident it can only handle 7,000.

The API retries each failed call up to two times.

| Stage | What happens | Effect |
|---|---|---|
| Normal traffic | 10,000 user requests per minute | Dependency is healthy |
| Dependency degrades | Only 7,000 can complete | 3,000 fail or time out |
| First retry | Failed requests retry | Dependency now sees up to 13,000 attempts |
| Second retry | Some first retries fail too | Attempts can rise again |
| Queues grow | More calls wait behind retry traffic | Latency increases |
| Timeouts rise | More requests look retryable | The loop feeds itself |

The original problem was reduced capacity. Retries turned reduced capacity into increased demand.
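
The timeline can be approximated with a small simulation. This is a rough sketch under stated assumptions, not a model of any real system: fresh requests arrive at a constant rate, failed requests retry in the following minute, and the dependency fails whatever exceeds its capacity evenly across attempt numbers. The function name and parameters are illustrative.

```typescript
// Simplified model: each minute `newPerMinute` fresh requests arrive, and
// requests that failed in the previous minute try again if they have attempts
// left. The dependency completes at most `capacity` attempts per minute and
// fails the rest, spread evenly across attempt numbers.
function simulateOfferedLoad(
  newPerMinute: number,
  capacity: number,
  maxAttempts: number,
  minutes: number
): number[] {
  // pending[k] = requests about to make attempt number k + 1
  let pending: number[] = new Array<number>(maxAttempts).fill(0)
  pending[0] = newPerMinute
  const offered: number[] = []

  for (let minute = 0; minute < minutes; minute++) {
    const total = pending.reduce((sum, n) => sum + n, 0)
    offered.push(Math.round(total))

    const failFraction = total > capacity ? 1 - capacity / total : 0
    const next: number[] = new Array<number>(maxAttempts).fill(0)
    next[0] = newPerMinute
    // Failures move to the next attempt slot. Failures on the final attempt
    // are surfaced to the caller and leave the system.
    pending.forEach((count, k) => {
      if (k + 1 < maxAttempts) {
        next[k + 1] = count * failFraction
      }
    })
    pending = next
  }

  return offered
}

// Numbers from the timeline: 10,000 requests per minute, capacity 7,000,
// one original attempt plus two retries.
const load = simulateOfferedLoad(10000, 7000, 3, 4)
console.log(load)
```

Under these assumptions the offered load climbs from 10,000 to 13,000 and then to about 16,000 attempts per minute, even though user traffic never changes.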

Timeout-heavy incidents often create this same pressure shape. Timeouts make failure visible; retries decide whether that visible failure creates more work. For the timeout-specific side of the problem, see When Timeouts Didn't Prevent Cascading Failures.


Small Code, Large Load Amplification

The risky version often looks like this:

async function chargeCustomer(payment: PaymentRequest) {
  for (let attempt = 0; attempt < 3; attempt++) {
    const response = await payments.charge(payment)

    if (response.ok) {
      return response
    }
  }

  throw new Error('payment charge failed')
}

That is not obviously reckless. It is short, readable, and common.

The problem is multiplication.

If one request can make three attempts, then 1,000 incoming requests can create up to 3,000 dependency calls. If retries exist at several layers, the multiplier can grow much faster.

| Layer | Attempts per request | Worst-case calls at that layer |
|---|---|---|
| Client | 3 | 3 |
| API service | 3 | 9 |
| Internal service | 3 | 27 |
| SDK or driver | 3 | 81 |

This table is intentionally simplified, but the warning is real. Retries compose multiplicatively unless the system has a single retry owner, shared attempt metadata, or budgets that stop the cascade. Google's SRE chapter on handling overload discusses retry budgets and the risk of multiple layers retrying the same failed work.
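
The multiplication in the table is a running product of attempts per layer. A minimal sketch, assuming every attempt at every layer fails and no layer coordinates with another (the function name is illustrative):

```typescript
// Worst-case calls reaching each layer when every layer retries independently
// and every attempt fails. attemptsPerLayer[i] is the total number of attempts
// (original plus retries) that layer i makes per incoming call.
function worstCaseCalls(attemptsPerLayer: number[]): number[] {
  let running = 1
  return attemptsPerLayer.map((attempts) => {
    running *= attempts
    return running
  })
}

// Client, API service, internal service, SDK or driver: three attempts each.
console.log(worstCaseCalls([3, 3, 3, 3])) // 3, 9, 27, 81

// With a single retry owner (here the API service), the other layers make
// one attempt each and the multiplier stays at 3.
console.log(worstCaseCalls([1, 3, 1, 1])) // 1, 3, 3, 3
```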


Retry Load Is Not User Load

One of the most confusing incident symptoms is rising internal request volume while external traffic stays flat.

User traffic may be stable. The dependency graph is not.

Retries create synthetic demand: work generated by the system's reaction to failure rather than by users. That demand competes with original requests for the same queues, connections, CPU, locks, and downstream capacity.

This is why success rate alone can be misleading. Retries may keep success rate acceptable for a short time while:

  • p95 and p99 latency rise
  • connection pools saturate
  • queue depth grows
  • dependency CPU increases
  • timeout errors become clustered
  • background jobs fall behind
  • dashboards show more total requests than users sent

Retries can hide pain until the system has less recovery room. By the time visible failures spike, the retry storm may already be consuming a large share of capacity.


Which Failures Should Be Retried?

Retries should be selective. Treating every failure as retryable is how small faults become load amplification.

| Failure | Retry? | Why |
|---|---|---|
| Connection reset before request reached dependency | Usually, with budget | The operation may not have started |
| HTTP 429 or explicit overload response | Usually no immediate retry | The dependency is asking for less traffic |
| HTTP 503 from a known overloaded service | Only if budget and backoff allow | More load may delay recovery |
| HTTP 400 validation error | No | The same request will fail again |
| Timeout after side-effecting call | Only if idempotent | The side effect may already have happened |
| Deadlock or serialization conflict | Often, with short bounded retry | The database may succeed on a new attempt |
| Queue processing failure | Maybe, with delay and dead-letter path | Repeated immediate attempts can block progress |

The side-effect case is especially important. A timeout does not prove the operation failed. It proves the caller stopped waiting.

If retrying the operation can duplicate a payment, create two orders, send two emails, or process the same webhook twice, the API needs idempotency before retries are safe. That design is covered in API Idempotency Keys: Prevent Duplicate Requests Safely and Webhook Idempotency and Retries in Production.
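
One way to keep these decisions explicit is a single classification function instead of scattered status checks. The sketch below is illustrative only: the `Failure` shape and the decision names are hypothetical, and how it treats status codes the table does not list (other 4xx as final, other 5xx as retryable within budget) is an assumption.

```typescript
// Hypothetical failure shape; real clients expose errors differently.
type Failure =
  | { kind: 'connection_reset_before_send' }
  | { kind: 'http'; status: number }
  | { kind: 'timeout'; idempotent: boolean }
  | { kind: 'db_conflict' } // deadlock or serialization failure

type RetryDecision = 'retry_with_budget' | 'back_off_first' | 'do_not_retry'

function classify(failure: Failure): RetryDecision {
  switch (failure.kind) {
    case 'connection_reset_before_send':
      // The operation may not have started.
      return 'retry_with_budget'
    case 'db_conflict':
      // A new attempt may succeed; keep the retry short and bounded.
      return 'retry_with_budget'
    case 'timeout':
      // A timeout only proves the caller stopped waiting.
      return failure.idempotent ? 'retry_with_budget' : 'do_not_retry'
    case 'http':
      // 429 and 503 ask for less traffic: no immediate retry.
      if (failure.status === 429 || failure.status === 503) return 'back_off_first'
      // Assumption: remaining 4xx responses are final, like a 400.
      if (failure.status >= 400 && failure.status < 500) return 'do_not_retry'
      // Assumption: remaining 5xx responses may be transient.
      return failure.status >= 500 ? 'retry_with_budget' : 'do_not_retry'
  }
}

console.log(classify({ kind: 'http', status: 400 }))
console.log(classify({ kind: 'timeout', idempotent: false }))
```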


Better Retry Logic Has A Budget

A retry budget limits how much extra work retry behavior is allowed to create.

There are several useful budget shapes:

| Budget | What it limits |
|---|---|
| Per-request attempt count | One request cannot retry forever |
| Total elapsed time | A request stops when it is no longer useful |
| Per-client retry ratio | Retries cannot exceed a share of original traffic |
| Token bucket | Retries are allowed while tokens remain, then throttled |
| Per-dependency circuit state | Retries stop when the dependency is broadly unhealthy |

Here is a simplified retry wrapper with a per-request budget and jittered backoff:

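// isRetryable, backoffMs, jitter, and sleep are helpers assumed to exist
// elsewhere in the codebase.
async function retryWithBudget<T>(
  operation: () => Promise<T>,
  options: { maxAttempts: number; deadlineMs: number }
): Promise<T> {
  const startedAt = Date.now()

  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await operation()
    } catch (error) {
      // Final errors (validation failures, unsafe side effects) are never retried.
      if (!isRetryable(error)) {
        throw error
      }

      // Stop when either budget is spent: attempt count or elapsed time.
      const remainingMs = options.deadlineMs - (Date.now() - startedAt)
      if (attempt === options.maxAttempts || remainingMs <= 0) {
        throw error
      }

      // Jittered backoff that never sleeps past the deadline.
      const delayMs = jitter(backoffMs(attempt))
      await sleep(Math.min(delayMs, remainingMs))
    }
  }

  // Reached only when maxAttempts is less than 1 and the loop never runs.
  throw new Error('retry budget exhausted')
}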
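
The wrapper above calls `isRetryable`, `backoffMs`, and `sleep` without defining them. The following is one possible shape for those helpers, offered as an assumption rather than the original implementation. The base delay, the cap, and the set of retryable errors are illustrative choices.

```typescript
// Exponential backoff for a 1-based attempt number: 100 ms, 200 ms, 400 ms,
// and so on, capped at maxMs. Defaults are illustrative.
function backoffMs(attempt: number, baseMs = 100, maxMs = 5000): number {
  return Math.min(maxMs, baseMs * 2 ** (attempt - 1))
}

// Resolves after roughly `ms` milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms))
}

// Assumption: errors may carry a numeric `status` (HTTP) or a string `code`
// (for example Node's ECONNRESET). Anything unrecognized is treated as final.
function isRetryable(error: unknown): boolean {
  if (typeof error !== 'object' || error === null) return false
  const { status, code } = error as { status?: unknown; code?: unknown }
  if (code === 'ECONNRESET') return true
  return status === 502 || status === 503 || status === 504
}

console.log([1, 2, 3, 4].map((attempt) => backoffMs(attempt))) // 100, 200, 400, 800
```

Defaulting unknown errors to non-retryable is deliberate: an allowlist keeps new failure modes from silently becoming retry traffic.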

This still needs service-level controls. Per-request budgets stop one request from retrying forever. They do not stop every request in the fleet from using its entire budget at the same time.

For that, the caller also needs a shared or local retry-rate limit.

if (!retryTokens.tryTake()) {
  throw new Error('retry budget unavailable')
}

The point is not the exact implementation. The point is that retries should consume a limited resource, just like database connections or worker slots.
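
A minimal sketch of what `retryTokens` could be, combining the token bucket and retry ratio ideas from the table. The class name, method names other than `tryTake`, and parameters are hypothetical. Each original request deposits one token and each retry costs several, so with `retryCost = 10` retries are held to roughly 10% of original traffic once the initial allowance is spent.

```typescript
// Local, in-process retry budget. Integer token counts avoid floating-point
// drift when many small deposits are added together.
class RetryTokenBucket {
  private tokens: number
  private readonly maxTokens: number
  private readonly retryCost: number

  constructor(maxTokens: number, retryCost: number) {
    this.maxTokens = maxTokens
    this.retryCost = retryCost
    this.tokens = maxTokens // start full so brief blips can still be retried
  }

  // Call once per original (non-retry) request.
  recordRequest(): void {
    this.tokens = Math.min(this.maxTokens, this.tokens + 1)
  }

  // Call before each retry. Returns false when the budget is exhausted.
  tryTake(): boolean {
    if (this.tokens < this.retryCost) return false
    this.tokens -= this.retryCost
    return true
  }
}

const retryTokens = new RetryTokenBucket(100, 10)
retryTokens.recordRequest()
console.log(retryTokens.tryTake())
```

A local bucket like this limits one process. Whether a fleet-wide limit is also needed depends on how many callers share the dependency.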


Backoff Without Jitter Can Still Synchronize Load

Exponential backoff is better than immediate retry, but it can still create synchronized bursts.

If many clients fail at the same time and all wait the same durations, they retry together:

0 ms: all clients fail
100 ms: all clients retry
300 ms: all clients retry again
700 ms: all clients retry again

The dependency receives waves of traffic instead of a smooth recovery curve.

Jitter adds randomness to spread those retries:

function jitter(baseMs: number) {
  return Math.floor(Math.random() * baseMs)
}

The function above is close to what is usually called full jitter: a uniform random delay between zero and the backoff value. Alternatives include equal jitter and decorrelated jitter, or a library that already handles these choices. The important property is that retry timing should not align across the fleet.

AWS explicitly calls out jitter as a way to avoid large retry bursts when clients react to failure together. That advice applies beyond AWS services: synchronized retry schedules are a common way to keep a recovering dependency under pressure.
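
For comparison, here is a sketch of the three jitter variants as they are commonly described. The function names, the 1-based attempt convention, and the injectable `random` parameter (added so that the functions can be tested deterministically) are assumptions of this example.

```typescript
type Random = () => number // returns a value in [0, 1), like Math.random

// Exponential ceiling for a 1-based attempt number, capped at capMs.
function ceilingMs(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * 2 ** (attempt - 1))
}

// Full jitter: anywhere between zero and the exponential ceiling.
function fullJitter(
  attempt: number, baseMs: number, capMs: number, random: Random = Math.random
): number {
  return random() * ceilingMs(attempt, baseMs, capMs)
}

// Equal jitter: half of the ceiling is fixed, the other half is random.
function equalJitter(
  attempt: number, baseMs: number, capMs: number, random: Random = Math.random
): number {
  const ceiling = ceilingMs(attempt, baseMs, capMs)
  return ceiling / 2 + random() * (ceiling / 2)
}

// Decorrelated jitter: the next delay is drawn between the base and three
// times the previous delay, so it depends on history rather than attempt count.
function decorrelatedJitter(
  previousMs: number, baseMs: number, capMs: number, random: Random = Math.random
): number {
  return Math.min(capMs, baseMs + random() * (previousMs * 3 - baseMs))
}

// Third attempt with a 100 ms base and a 10 s cap.
console.log(fullJitter(3, 100, 10000))
console.log(equalJitter(3, 100, 10000))
console.log(decorrelatedJitter(100, 100, 10000))
```

Full jitter spreads retries most widely. Equal jitter guarantees a minimum wait, which matters when an immediate retry is never useful.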


Avoid Retrying At Every Layer

Retries need ownership.

Without ownership, every layer tries to be helpful:

  • browser retries the request
  • edge retries to another origin
  • API retries the internal service
  • internal service retries the database-facing service
  • SDK retries the HTTP call
  • queue retries the job

Each layer sees only its local failure. Together they create a storm.

A safer design chooses where retries belong:

| Location | Good fit | Risk |
|---|---|---|
| Edge/client | Network blips before request reaches service | Can multiply user traffic globally |
| API layer | Cheap idempotent reads | May waste work already done downstream |
| Dependency client | Local transient dependency errors | Hidden behavior across many callers |
| Queue worker | Delayed async recovery | Poison messages can block throughput |
| Database transaction | Serialization or deadlock retry | Can repeat expensive work under contention |

For a deep synchronous call stack, retrying at one well-chosen layer is usually easier to reason about than letting every layer retry independently.

Attempt metadata helps too. If a request carries attempt=2, downstream services can distinguish original work from retry work and respond differently under load.
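
A sketch of how attempt metadata might travel and be used. The header name `x-retry-attempt`, the utilization input, and the 0.8 threshold are all hypothetical. The only idea taken from the text is that a downstream service can tell retries from original work and shed retries first.

```typescript
// Hypothetical header name for the attempt number.
const ATTEMPT_HEADER = 'x-retry-attempt'

// Caller side: tag the outgoing request with its 1-based attempt number.
function withAttempt(
  headers: Record<string, string>,
  attempt: number
): Record<string, string> {
  return { ...headers, [ATTEMPT_HEADER]: String(attempt) }
}

// Callee side: a missing or malformed value is treated as an original request.
function readAttempt(headers: Record<string, string | undefined>): number {
  const parsed = Number.parseInt(headers[ATTEMPT_HEADER] ?? '1', 10)
  return Number.isFinite(parsed) && parsed >= 1 ? parsed : 1
}

// Under load, reject retry work before original work. `utilization` is
// assumed to be a 0..1 load signal the service already tracks.
function shouldShed(
  attempt: number,
  utilization: number,
  retryShedThreshold = 0.8
): boolean {
  return attempt > 1 && utilization >= retryShedThreshold
}

const incoming = withAttempt({ accept: 'application/json' }, 2)
console.log(shouldShed(readAttempt(incoming), 0.9))
```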


Watch Retry Health Directly

Retry behavior should have its own metrics.

Useful signals include:

| Metric | Why it matters |
|---|---|
| Attempts per successful request | Shows hidden work behind success rate |
| Retry rate by dependency | Identifies which downstream path is amplifying load |
| Retry budget exhaustion | Shows when retries are being suppressed |
| Time to success after retry | Separates useful retries from slow failures |
| Error rate before and after retries | Shows whether retries improve availability |
| Timeout-to-retry ratio | Detects retry storms triggered by latency |
| Duplicate side-effect rate | Reveals unsafe retry behavior |

An alert on final error rate is too late. The useful alert often fires when retry rate rises while user traffic is flat.

That tells you the system is creating extra internal work before users see full failure.
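
As a rough illustration, two of these signals and the "retries up, traffic flat" condition can be computed from per-window counters. The counter names and the 10% thresholds are assumptions for this sketch. Real thresholds depend on the service's normal retry rate.

```typescript
// Counts collected over one metrics window for a single dependency.
interface WindowCounts {
  originalRequests: number // first attempts, one per incoming request
  retryAttempts: number
  successes: number
}

// Hidden work behind the success rate: 1.0 means no retries were needed.
function attemptsPerSuccess(w: WindowCounts): number {
  if (w.successes === 0) return Infinity
  return (w.originalRequests + w.retryAttempts) / w.successes
}

// Retries as a share of original traffic.
function retryRatio(w: WindowCounts): number {
  return w.originalRequests === 0 ? 0 : w.retryAttempts / w.originalRequests
}

// True when the retry ratio jumps while original traffic stays roughly flat.
function retryStormSuspected(
  previous: WindowCounts,
  current: WindowCounts,
  maxTrafficChange = 0.1,
  minRetryRatioIncrease = 0.1
): boolean {
  if (previous.originalRequests === 0) return false
  const trafficChange =
    Math.abs(current.originalRequests - previous.originalRequests) /
    previous.originalRequests
  const ratioIncrease = retryRatio(current) - retryRatio(previous)
  return trafficChange <= maxTrafficChange && ratioIncrease >= minRetryRatioIncrease
}

const before = { originalRequests: 10000, retryAttempts: 200, successes: 9990 }
const during = { originalRequests: 10100, retryAttempts: 3000, successes: 9900 }
console.log(retryStormSuspected(before, during))
```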

For queue-based retry behavior, also watch age of oldest job, dead-letter count, retry count distribution, and downstream error class. Those operational details are part of Background Jobs in Production.


A Practical Retry Checklist

Before adding or increasing retries, answer these questions:

  1. Is the operation idempotent?
  2. Which errors are retryable and which are final?
  3. Which layer owns the retry?
  4. What is the maximum number of attempts?
  5. What is the maximum elapsed time?
  6. Is backoff jittered?
  7. What budget limits total retry traffic?
  8. What happens when the dependency returns an overload signal?
  9. Are retry attempts visible in logs, metrics, and traces?
  10. Can retries be disabled quickly during an incident?

If the answer is "the library handles it," inspect the library defaults. Hidden retries are still retries.


The Short Version

Retries are useful when failures are transient and retry traffic is bounded. They are dangerous when they add load to a dependency that is already failing because of load.

The safe version is not "retry everything with exponential backoff." It is:

  1. retry only safe failure classes
  2. make side effects idempotent
  3. retry at one intentional layer
  4. use bounded attempts and deadlines
  5. add jitter
  6. enforce retry budgets
  7. stop retrying when retries stop helping

Retries are a load-shaping mechanism. Treat them that way, and they become a reliability tool instead of an outage amplifier.