
Adding Retries Can Make Outages Worse
Retries make production systems more reliable when failures are rare, brief, and safe to repeat. They make outages worse when the system is already overloaded and every retry adds work to the same constrained dependency.
That is the uncomfortable part of retry logic: the code often looks harmless. A small loop around an HTTP call can become a traffic multiplier across thousands of concurrent requests, several service layers, background workers, SDKs, queues, and clients.
The question is not "should we retry?" The question is "what happens to total load when many callers retry at the same time?"
For related production failure patterns around timeouts, overload, queues, and backpressure, see the Backend Reliability hub.
Why Retries Feel Like A Safe Default
Retries solve a real problem.
Networks drop packets. Containers restart. Connections get reset. Load balancers rebalance. Short dependency pauses happen.
If a request fails once and succeeds 50 ms later, retrying hides noise from users and avoids unnecessary incident pages. That is why retry behavior appears in HTTP clients, database drivers, queue consumers, SDKs, background jobs, browser clients, and service meshes.
The local reasoning is reasonable:
- the request failed
- the failure might be transient
- retrying may succeed
- one more attempt is cheaper than surfacing an error
But local reasoning misses shared capacity. During overload, every caller performs the same reasoning at the same time.
The Amazon Builders' Library article on timeouts, retries, and backoff with jitter makes this trade-off explicit: retries can help with transient failures, but they can also increase load on a dependency that is already approaching overload.
The Retry Storm Timeline
Imagine an API service calling a payment dependency. The dependency normally handles 10,000 requests per minute. During a partial incident it can only handle 7,000.
The API retries each failed call up to two times.
| Stage | What happens | Effect |
|---|---|---|
| Normal traffic | 10,000 user requests per minute | Dependency is healthy |
| Dependency degrades | Only 7,000 can complete | 3,000 fail or time out |
| First retry | Failed requests retry | Dependency now sees up to 13,000 attempts |
| Second retry | Some first retries fail too | Attempts can rise again |
| Queues grow | More calls wait behind retry traffic | Latency increases |
| Timeouts rise | More requests look retryable | The loop feeds itself |
The original problem was reduced capacity. Retries turned reduced capacity into increased demand.
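The arithmetic behind the timeline can be sketched as a small worst-case model. It assumes every request beyond capacity fails and that each failed request uses all of its retries without succeeding, which is pessimistic but matches the "up to" framing in the table. The function name and parameters are illustrative.

```typescript
// Worst-case attempts per minute reaching a degraded dependency.
// Assumption: the dependency completes at most `capacity` requests, and every
// failed request uses all of its retries without succeeding.
function worstCaseAttempts(
  originalRequests: number,
  capacity: number,
  retriesPerRequest: number
): number {
  const failed = Math.max(0, originalRequests - capacity)
  return originalRequests + failed * retriesPerRequest
}

const afterFirstRetry = worstCaseAttempts(10000, 7000, 1) // 13000
const afterSecondRetry = worstCaseAttempts(10000, 7000, 2) // 16000
```

In this model, a dependency that can complete 7,000 requests is asked to handle up to 16,000 attempts, more than double its remaining capacity.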
Timeout-heavy incidents often create this same pressure shape. Timeouts make failure visible; retries decide whether that visible failure creates more work. For the timeout-specific side of the problem, see When Timeouts Didn't Prevent Cascading Failures.
Small Code, Large Load Amplification
The risky version often looks like this:
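async function chargeCustomer(payment: PaymentRequest) {
  // Retries immediately on any non-ok response: no backoff, no budget,
  // and no check that repeating the charge is safe.
  for (let attempt = 0; attempt < 3; attempt++) {
    const response = await payments.charge(payment)
    if (response.ok) {
      return response
    }
  }
  throw new Error('payment charge failed')
}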
That is not obviously reckless. It is short, readable, and common.
The problem is multiplication.
If one request can make three attempts, then 1,000 incoming requests can create up to 3,000 dependency calls. If retries exist at several layers, the multipliers compound, because every attempt made by one layer can trigger a full set of attempts in the layer below it.
| Layer | Attempts per request | Worst-case calls at that layer |
|---|---|---|
| Client | 3 | 3 |
| API service | 3 | 9 |
| Internal service | 3 | 27 |
| SDK or driver | 3 | 81 |
This table is intentionally simplified, but the warning is real. Retries compose multiplicatively unless the system has a single retry owner, shared attempt metadata, or budgets that stop the cascade. Google's SRE chapter on handling overload discusses retry budgets and the risk of multiple layers retrying the same failed work.
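The multiplication in the table can be reproduced with a running product. This is a minimal sketch of the worst case only; the function name is hypothetical, and real systems rarely hit the full multiplier because some attempts succeed.

```typescript
// Worst-case calls reaching each layer when every layer retries independently.
// Input: attempts allowed per request at each layer, outermost first.
function worstCaseCallsPerLayer(attemptsPerLayer: number[]): number[] {
  const calls: number[] = []
  let multiplier = 1
  for (const attempts of attemptsPerLayer) {
    multiplier *= attempts
    calls.push(multiplier)
  }
  return calls
}

// Client, API service, internal service, SDK: 3 attempts each.
const stacked = worstCaseCallsPerLayer([3, 3, 3, 3]) // [3, 9, 27, 81]

// A single retry owner: only the API service retries.
const singleOwner = worstCaseCallsPerLayer([1, 3, 1, 1]) // [1, 3, 3, 3]
```

With a single retry owner, the worst case at the innermost layer drops from 81 calls to 3.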
Retry Load Is Not User Load
One of the most confusing incident symptoms is rising internal request volume while external traffic stays flat.
User traffic may be stable. The dependency graph is not.
Retries create synthetic demand: work generated by the system's reaction to failure rather than by users. That demand competes with original requests for the same queues, connections, CPU, locks, and downstream capacity.
This is why success rate alone can be misleading. Retries may keep success rate acceptable for a short time while:
- p95 and p99 latency rise
- connection pools saturate
- queue depth grows
- dependency CPU increases
- timeout errors become clustered
- background jobs fall behind
- dashboards show more total requests than users sent
Retries can hide pain until the system has less recovery room. By the time visible failures spike, the retry storm may already be consuming a large share of capacity.
Which Failures Should Be Retried?
Retries should be selective. Treating every failure as retryable is how small faults become load amplification.
| Failure | Retry? | Why |
|---|---|---|
| Connection reset before request reached dependency | Usually, with budget | The operation may not have started |
| HTTP 429 or explicit overload response | Usually no immediate retry | The dependency is asking for less traffic |
| HTTP 503 from a known overloaded service | Only if budget and backoff allow | More load may delay recovery |
| HTTP 400 validation error | No | The same request will fail again |
| Timeout after side-effecting call | Only if idempotent | The side effect may already have happened |
| Deadlock or serialization conflict | Often, with short bounded retry | The database may succeed on a new attempt |
| Queue processing failure | Maybe, with delay and dead-letter path | Repeated immediate attempts can block progress |
The side-effect case is especially important. A timeout does not prove the operation failed. It proves the caller stopped waiting.
If retrying the operation can duplicate a payment, create two orders, send two emails, or process the same webhook twice, the API needs idempotency before retries are safe. That design is covered in API Idempotency Keys: Prevent Duplicate Requests Safely and Webhook Idempotency and Retries in Production.
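One way to encode the table above is a small classifier that returns a decision instead of a boolean. This is a sketch under assumptions: the `FailureInfo` shape, the decision names, and the handling of 5xx statuses other than 503 are illustrative and not taken from any particular client library.

```typescript
type RetryDecision = 'retry-now' | 'retry-later' | 'fail'

// Hypothetical failure description; real clients expose this differently.
interface FailureInfo {
  kind: 'reset-before-send' | 'timeout' | 'http' | 'serialization-conflict'
  httpStatus?: number
  // True when repeating the operation cannot duplicate a side effect,
  // for example because the request carries an idempotency key.
  idempotent: boolean
}

function classifyFailure(failure: FailureInfo): RetryDecision {
  switch (failure.kind) {
    case 'reset-before-send':
      return 'retry-now' // the operation may not have started
    case 'serialization-conflict':
      return 'retry-now' // short, bounded retry of the transaction
    case 'timeout':
      // A timeout only proves the caller stopped waiting.
      return failure.idempotent ? 'retry-later' : 'fail'
    case 'http': {
      const status = failure.httpStatus ?? 0
      // Overload signals: no immediate retry, back off first.
      if (status === 429 || status === 503) return 'retry-later'
      // Assumption (not in the table): other 5xx only if idempotent.
      if (status >= 500) return failure.idempotent ? 'retry-later' : 'fail'
      // 4xx such as 400: the same request will fail again.
      return 'fail'
    }
  }
}
```

Any decision other than `'fail'` should still pass through attempt limits, deadlines, and whatever retry budget the caller enforces.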
Better Retry Logic Has A Budget
A retry budget limits how much extra work retry behavior is allowed to create.
There are several useful budget shapes:
| Budget | What it limits |
|---|---|
| Per-request attempt count | One request cannot retry forever |
| Total elapsed time | A request stops when it is no longer useful |
| Per-client retry ratio | Retries cannot exceed a share of original traffic |
| Token bucket | Retries are allowed while tokens remain, then throttled |
| Per-dependency circuit state | Retries stop when the dependency is broadly unhealthy |
Here is a simplified retry wrapper with a per-request budget and jittered backoff:
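// Assumes isRetryable, backoffMs, jitter, and sleep are defined elsewhere.
async function retryWithBudget<T>(
  operation: () => Promise<T>,
  options: { maxAttempts: number; deadlineMs: number }
): Promise<T> {
  const startedAt = Date.now()
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await operation()
    } catch (error) {
      // Stop when the attempt count or the time budget is used up.
      const remainingMs = options.deadlineMs - (Date.now() - startedAt)
      if (attempt === options.maxAttempts || remainingMs <= 0) {
        throw error
      }
      // Final errors (for example, validation failures) are never retried.
      if (!isRetryable(error)) {
        throw error
      }
      // Never sleep past the deadline.
      const delayMs = jitter(backoffMs(attempt))
      await sleep(Math.min(delayMs, remainingMs))
    }
  }
  // Reached only when maxAttempts is less than 1.
  throw new Error('retry budget exhausted')
}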
This still needs service-level controls. Per-request budgets stop one request from retrying forever. They do not stop every request in the fleet from using its entire budget at the same time.
For that, the caller also needs a shared or local retry-rate limit.
if (!retryTokens.tryTake()) {
  throw new Error('retry budget unavailable')
}
The point is not the exact implementation. The point is that retries should consume a limited resource, just like database connections or worker slots.
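As one possible shape for `retryTokens`, here is a minimal process-local token bucket. The class name, capacity, and refill rate are assumptions for illustration; a budget shared across a fleet would need a shared store or a per-client ratio instead.

```typescript
// Minimal token-bucket retry budget (illustrative, process-local).
class RetryTokenBucket {
  private tokens: number
  private lastRefillMs: number
  private readonly capacity: number
  private readonly refillPerSecond: number
  private readonly now: () => number

  constructor(capacity: number, refillPerSecond: number, now: () => number = () => Date.now()) {
    this.capacity = capacity
    this.refillPerSecond = refillPerSecond
    this.now = now
    this.tokens = capacity
    this.lastRefillMs = now()
  }

  // Returns true and consumes one token if a retry is allowed right now.
  tryTake(): boolean {
    this.refill()
    if (this.tokens < 1) {
      return false
    }
    this.tokens -= 1
    return true
  }

  // Adds tokens for the time elapsed since the last call, up to capacity.
  private refill(): void {
    const nowMs = this.now()
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond)
    this.lastRefillMs = nowMs
  }
}

// Example: at most 10 retries in a burst, refilled at 1 retry per second.
const retryTokens = new RetryTokenBucket(10, 1)
```

The injectable `now` function exists only to make the bucket testable with a fake clock. When the bucket is empty, callers fail fast instead of adding more load to the dependency.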
Backoff Without Jitter Can Still Synchronize Load
Exponential backoff is better than immediate retry, but it can still create synchronized bursts.
If many clients fail at the same time and all wait the same durations, they retry together:
0 ms: all clients fail
100 ms: all clients retry
300 ms: all clients retry again
700 ms: all clients retry again
The dependency receives waves of traffic instead of a smooth recovery curve.
Jitter adds randomness to spread those retries:
function jitter(baseMs: number) {
  return Math.floor(Math.random() * baseMs)
}
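The function above is already a simple form of full jitter: it picks a uniform delay between 0 and the base. A more polished implementation may add a cap, use equal or decorrelated jitter, or rely on a library that already handles these choices. The important property is that retry timing should not align across the fleet.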
AWS explicitly calls out jitter as a way to avoid large retry bursts when clients react to failure together. That advice applies beyond AWS services: synchronized retry schedules are a common way to keep a recovering dependency under pressure.
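A minimal sketch of capped exponential backoff with two jitter variants follows. `backoffMs` reuses the name from the wrapper earlier, but its body, the 100 ms base, and the 5,000 ms cap are assumptions chosen to match the timeline above; `fullJitter` and `equalJitter` are hypothetical names.

```typescript
// Exponential backoff: 100 ms, 200 ms, 400 ms, ... capped at maxMs.
// The defaults are illustrative, not recommendations.
function backoffMs(attempt: number, baseMs: number = 100, maxMs: number = 5000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt - 1))
}

// Full jitter: uniform delay in [0, ceilingMs).
function fullJitter(ceilingMs: number, random: () => number = Math.random): number {
  return Math.floor(random() * ceilingMs)
}

// Equal jitter: keep half of the ceiling, randomize the other half.
function equalJitter(ceilingMs: number, random: () => number = Math.random): number {
  const half = ceilingMs / 2
  return Math.floor(half + random() * half)
}

// Delay before the third attempt, somewhere in [0, 400) ms.
const thirdAttemptDelayMs = fullJitter(backoffMs(3))
```

Full jitter spreads retries most widely but can produce very short delays. Equal jitter guarantees at least half of the backoff at the cost of a narrower spread.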
Avoid Retrying At Every Layer
Retries need ownership.
Without ownership, every layer tries to be helpful:
- browser retries the request
- edge retries to another origin
- API retries the internal service
- internal service retries the database-facing service
- SDK retries the HTTP call
- queue retries the job
Each layer sees only its local failure. Together they create a storm.
A safer design chooses where retries belong:
| Location | Good fit | Risk |
|---|---|---|
| Edge/client | Network blips before request reaches service | Can multiply user traffic globally |
| API layer | Cheap idempotent reads | May waste work already done downstream |
| Dependency client | Local transient dependency errors | Hidden behavior across many callers |
| Queue worker | Delayed async recovery | Poison messages can block throughput |
| Database transaction | Serialization or deadlock retry | Can repeat expensive work under contention |
For a deep synchronous call stack, retrying at one well-chosen layer is usually easier to reason about than letting every layer retry independently.
Attempt metadata helps too. If a request carries attempt=2, downstream services can distinguish original work from retry work and respond differently under load.
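Attempt metadata could be carried in a request header, as sketched below. The header name `x-request-attempt`, the helper names, and the 0.8 utilization threshold are all assumptions for illustration, not a standard.

```typescript
// Hypothetical header name; the text only says the request carries attempt=2.
const ATTEMPT_HEADER = 'x-request-attempt'

// Caller side: tag each outgoing call with its attempt number.
function withAttemptHeader(
  headers: Record<string, string>,
  attempt: number
): Record<string, string> {
  return { ...headers, [ATTEMPT_HEADER]: String(attempt) }
}

// Callee side: read the attempt number, defaulting to 1 when absent or invalid.
function readAttempt(headers: Record<string, string>): number {
  const parsed = Number.parseInt(headers[ATTEMPT_HEADER] ?? '1', 10)
  return Number.isFinite(parsed) && parsed >= 1 ? parsed : 1
}

// Under load, shed retry work first so original requests keep their capacity.
// The 0.8 utilization threshold is an arbitrary illustration.
function shouldShedRetry(headers: Record<string, string>, utilization: number): boolean {
  return utilization >= 0.8 && readAttempt(headers) > 1
}
```

Rejecting retries early under load is cheaper than accepting them and timing out, and it gives first attempts a better chance of completing.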
Watch Retry Health Directly
Retry behavior should have its own metrics.
Useful signals include:
| Metric | Why it matters |
|---|---|
| Attempts per successful request | Shows hidden work behind success rate |
| Retry rate by dependency | Identifies which downstream path is amplifying load |
| Retry budget exhaustion | Shows when retries are being suppressed |
| Time to success after retry | Separates useful retries from slow failures |
| Error rate before and after retries | Shows whether retries improve availability |
| Timeout-to-retry ratio | Detects retry storms triggered by latency |
| Duplicate side-effect rate | Reveals unsafe retry behavior |
An alert on final error rate alone fires too late. The more useful alert often fires when retry rate rises while user traffic is flat.
That tells you the system is creating extra internal work before users see full failure.
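Two of the metrics from the table, and the flat-traffic alert condition, can be sketched from simple per-window counters. The field names and both thresholds are placeholders that would need tuning per service.

```typescript
// Counters for one dependency over one time window (illustrative field names).
interface WindowCounts {
  originalRequests: number // first attempts
  retryAttempts: number // attempts beyond the first
  successes: number // requests that eventually succeeded
}

// Hidden work behind the success rate: total attempts per successful request.
function attemptsPerSuccess(w: WindowCounts): number {
  if (w.successes === 0) return Infinity
  return (w.originalRequests + w.retryAttempts) / w.successes
}

// Retries as a share of original traffic.
function retryRatio(w: WindowCounts): number {
  return w.originalRequests === 0 ? 0 : w.retryAttempts / w.originalRequests
}

// True when retries grow while user traffic stays roughly flat.
// Both thresholds are placeholders, not recommendations.
function retryStormSuspected(
  previous: WindowCounts,
  current: WindowCounts,
  maxTrafficChange: number = 0.1,
  maxRetryRatio: number = 0.2
): boolean {
  const trafficChange =
    Math.abs(current.originalRequests - previous.originalRequests) /
    Math.max(1, previous.originalRequests)
  return (
    trafficChange <= maxTrafficChange &&
    retryRatio(current) > maxRetryRatio &&
    retryRatio(current) > retryRatio(previous)
  )
}
```

For example, a window that moves from 200 retries to 3,000 retries on roughly 10,000 original requests would trip this check even if the success rate barely changed.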
For queue-based retry behavior, also watch age of oldest job, dead-letter count, retry count distribution, and downstream error class. Those operational details are part of Background Jobs in Production.
A Practical Retry Checklist
Before adding or increasing retries, answer these questions:
- Is the operation idempotent?
- Which errors are retryable and which are final?
- Which layer owns the retry?
- What is the maximum number of attempts?
- What is the maximum elapsed time?
- Is backoff jittered?
- What budget limits total retry traffic?
- What happens when the dependency returns an overload signal?
- Are retry attempts visible in logs, metrics, and traces?
- Can retries be disabled quickly during an incident?
If the answer is "the library handles it," inspect the library defaults. Hidden retries are still retries.
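One way to keep the checklist honest is to make the answers explicit in configuration. The `RetryPolicy` shape and `policyGaps` function below are hypothetical and do not correspond to any library's API; they only show how most of the checklist items could become fields that can be reviewed and validated.

```typescript
// Hypothetical policy shape that mirrors the checklist.
interface RetryPolicy {
  enabled: boolean // kill switch for incidents
  owningLayer: string // the one layer allowed to retry
  operationIsIdempotent: boolean
  retryableErrors: string[] // everything else is treated as final
  maxAttempts: number
  deadlineMs: number
  jittered: boolean
  maxRetryRatio: number // retries allowed as a share of original traffic
}

// Returns the checklist items a policy fails; an empty list means none were found.
function policyGaps(policy: RetryPolicy): string[] {
  const gaps: string[] = []
  if (!policy.operationIsIdempotent) gaps.push('operation is not idempotent')
  if (policy.retryableErrors.length === 0) gaps.push('no retryable errors listed')
  if (policy.owningLayer.trim() === '') gaps.push('no owning layer')
  if (!(policy.maxAttempts >= 1 && Number.isFinite(policy.maxAttempts))) {
    gaps.push('attempts are unbounded')
  }
  if (!(policy.deadlineMs > 0 && Number.isFinite(policy.deadlineMs))) {
    gaps.push('no deadline')
  }
  if (!policy.jittered) gaps.push('backoff is not jittered')
  if (!(policy.maxRetryRatio > 0 && policy.maxRetryRatio <= 1)) gaps.push('no retry budget')
  return gaps
}
```

The `enabled` flag is not validated here because either value is legitimate; what matters is that it exists and can be flipped quickly during an incident.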
The Short Version
Retries are useful when failures are transient and retry traffic is bounded. They are dangerous when they add load to a dependency that is already failing because of load.
The safe version is not "retry everything with exponential backoff." It is:
- retry only safe failure classes
- make side effects idempotent
- retry at one intentional layer
- use bounded attempts and deadlines
- add jitter
- enforce retry budgets
- stop retrying when retries stop helping
Retries are a load-shaping mechanism. Treat them that way, and they become a reliability tool instead of an outage amplifier.