
Retry Budgets in Microservices: Stop Retrying Into Outages
Retry budgets keep microservice retries from turning a partial outage into a larger outage. They answer a practical question every backend system eventually hits: how many extra attempts are allowed before retrying stops helping availability and starts consuming the capacity the dependency needs to recover?
Retries are useful when failures are brief, random, and isolated. They are dangerous when the dependency is overloaded. In that case, every retry competes with fresh traffic, increases queueing, burns connection pools, and can keep the overloaded service unhealthy after the original trigger is gone.
This article is part of the Backend Reliability hub. It is the companion to Adding Retries Can Make Outages Worse: that article explains retry amplification; this one shows how to put an explicit budget around retries so callers stop spending unlimited downstream capacity.
What A Retry Budget Is
A retry budget is a limit on retry traffic.
The limit can exist at several levels:
| Budget | What it limits | Example |
|---|---|---|
| Per-request attempt budget | Attempts for one logical request | Try at most 3 times total |
| Per-client retry ratio | Retry traffic as a fraction of original traffic | Retries may be at most 10% of initial requests |
| Token bucket | Local retry permission under pressure | Spend one token per retry; refill slowly |
| Dependency health gate | Whether retries are allowed right now | Retry only while the dependency looks healthy |
A per-request limit prevents infinite loops. A per-client ratio prevents a whole fleet of callers from doubling traffic during a bad minute. A token bucket lets a service absorb small bursts without allowing unbounded retry storms. A dependency health gate stops clients from retrying into a service that is already telling you it cannot keep up.
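The first three appear as code later in this article. The health gate never does, so here is a minimal sketch of one; the threshold and the error-rate supplier are illustrative assumptions, not a standard API:

class DependencyHealthGate {
  constructor(
    private readonly maxErrorRate: number, // e.g. 0.2 means 20% recent errors
    private readonly errorRate: () => number // recent error rate, supplied by your metrics layer
  ) {}

  // Allow retries only while the dependency's recent error rate stays below the threshold
  retriesAllowed(): boolean {
    return this.errorRate() < this.maxErrorRate
  }
}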
Google's SRE book describes both per-request retry budgets and per-client retry budgets in its overload handling chapter. It gives one concrete client-side example: capping retry traffic at about 10% of recent request volume bounds the extra load retries can add, even when failure rates are high. See Google SRE: Handling Overload.
The exact number is less important than the rule: retries need an upper bound.
The Retry Storm A Budget Prevents
Imagine a checkout service calling an inventory service.
Normal traffic:
| Source | Rate |
|---|---|
| Initial checkout requests | 1,000 requests/sec |
| Inventory calls per checkout | 1 |
| Retry traffic | 0 |
| Total inventory traffic | 1,000 requests/sec |
Now inventory becomes slow and starts timing out 10% of calls.
If every caller retries twice immediately:
| Attempt | Additional traffic |
|---|---|
| Initial requests | 1,000/sec |
| First retry for 10% failures | 100/sec |
| Second retry if those fail | up to 100/sec |
| Total possible traffic | 1,200/sec |
That looks manageable in one layer.
Now imagine the request crosses five services and three of them retry independently. The multipliers compound. AWS's Builders' Library warns about this exact pattern: if a five-deep stack retries three times at each layer, load at the deepest dependency can increase dramatically. See Timeouts, retries, and backoff with jitter.
The problem is not that one retry is bad. The problem is that every layer thinks its retry is local.
The dependency receives the sum.
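The arithmetic behind that warning is simple: per-layer attempt limits multiply through the stack, so worst-case amplification grows exponentially with depth.

// Worst-case amplification when every layer retries independently:
// each of `depth` layers makes up to `attemptsPerLayer` attempts,
// so the deepest dependency can see attemptsPerLayer ** depth calls.
const attemptsPerLayer = 3
const depth = 5
const worstCaseCalls = attemptsPerLayer ** depth // 243 calls for 1 user request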
Use A Per-Request Budget First
The first budget is the simplest:
One logical request gets a fixed maximum number of attempts.
For example:
type RetryPolicy = {
  maxAttempts: number
  baseDelayMs: number
  maxDelayMs: number
}

const inventoryRetryPolicy: RetryPolicy = {
  maxAttempts: 3,
  baseDelayMs: 100,
  maxDelayMs: 1_000,
}
Then every call tracks the attempt count:
async function callInventory(request: ReserveInventoryRequest) {
  for (let attempt = 1; attempt <= inventoryRetryPolicy.maxAttempts; attempt++) {
    const response = await inventory.reserve(request, {
      headers: {
        // Mark retry traffic so the dependency can count or shed it separately
        'X-Retry-Attempt': String(attempt - 1),
      },
    })
    if (response.ok) {
      return response
    }
    // Give up on non-retryable failures or once the attempt budget is spent
    if (!isRetryable(response) || attempt === inventoryRetryPolicy.maxAttempts) {
      throw new InventoryUnavailable(response.status)
    }
    await sleep(jitteredBackoff(attempt, inventoryRetryPolicy))
  }
  throw new Error('unreachable: the loop always returns or throws')
}
This protects one request from retrying forever.
It does not protect the dependency from a fleet-wide retry storm.
That is why per-request attempt limits are necessary but not enough.
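One loose end: the loop above assumes a sleep helper (jitteredBackoff is defined later in this article). A minimal version:

// Resolve after the given number of milliseconds
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))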
Add A Client-Side Retry Ratio
A per-client retry ratio asks:
How much of this client's recent traffic is retry traffic?
If retries exceed the allowed ratio, the client stops retrying and returns the failure.
Example policy:
| Metric | Value |
|---|---|
| Window | 60 seconds |
| Max retry ratio | 10% |
| Counted traffic | Initial attempts and retry attempts |
| Behavior when exhausted | Do not retry; return controlled failure |
Suppose checkout sent 10,000 initial inventory calls in the last minute. With a 10% retry ratio, it can spend about 1,000 retry attempts in that same window.
That budget lets isolated failures recover while preventing retry traffic from becoming the dominant workload.
A simple in-process sketch:
class RetryBudget {
  constructor(
    private readonly maxRetryRatio: number,
    private readonly window: RollingWindow
  ) {}

  recordInitialRequest() {
    this.window.increment('initial')
  }

  trySpendRetry() {
    const initial = this.window.count('initial')
    const retries = this.window.count('retry')
    if (initial === 0) {
      return false
    }
    if (retries / initial >= this.maxRetryRatio) {
      return false
    }
    this.window.increment('retry')
    return true
  }
}
Then the caller checks the budget before sleeping and retrying:
if (!retryBudget.trySpendRetry()) {
  throw new RetryBudgetExhausted('inventory retry budget exhausted')
}
The important part is not this exact implementation. The important part is that retry permission becomes finite and measurable.
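The sketch also leaves RollingWindow undefined. A minimal per-second bucket version, purely to make the example concrete, might look like this:

// Counts labeled events over the last `windowSeconds` seconds.
// An illustrative sketch, not a production-grade sliding window.
class RollingWindow {
  private buckets = new Map<number, Map<string, number>>()

  constructor(private readonly windowSeconds: number) {}

  increment(label: string) {
    const second = Math.floor(Date.now() / 1_000)
    const bucket = this.buckets.get(second) ?? new Map<string, number>()
    bucket.set(label, (bucket.get(label) ?? 0) + 1)
    this.buckets.set(second, bucket)
    this.evict(second)
  }

  count(label: string): number {
    this.evict(Math.floor(Date.now() / 1_000))
    let total = 0
    for (const bucket of this.buckets.values()) {
      total += bucket.get(label) ?? 0
    }
    return total
  }

  private evict(nowSecond: number) {
    for (const second of this.buckets.keys()) {
      if (second <= nowSecond - this.windowSeconds) {
        this.buckets.delete(second)
      }
    }
  }
}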
Token Buckets Work Well For Local Limits
A token bucket is another practical retry budget.
It works like this:
- The bucket starts with some retry tokens.
- Each retry spends one token.
- Tokens refill at a fixed rate.
- If the bucket is empty, the caller does not retry.
AWS's Builders' Library describes limiting retries locally with a token bucket so that retries continue while tokens are available and then settle to a fixed rate once tokens are exhausted.
A simplified token bucket:
class TokenBucket {
  private tokens: number
  private lastRefillAt = Date.now()

  constructor(
    private readonly capacity: number,
    private readonly refillPerSecond: number
  ) {
    this.tokens = capacity
  }

  tryTake() {
    this.refill()
    if (this.tokens < 1) {
      return false
    }
    this.tokens -= 1
    return true
  }

  private refill() {
    const now = Date.now()
    const elapsedSeconds = (now - this.lastRefillAt) / 1_000
    const refill = elapsedSeconds * this.refillPerSecond
    this.tokens = Math.min(this.capacity, this.tokens + refill)
    this.lastRefillAt = now
  }
}
For a dependency that handles important but non-critical background work, you might use:
// Allows a burst of up to 200 retries, then settles to 20 retries/sec
const retryBucket = new TokenBucket(200, 20)
The bucket allows a short burst of retries but prevents a permanent retry flood.
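Wiring the bucket into the retry loop mirrors the ratio-budget check shown earlier; a sketch:

// Spend one token per retry; fail fast when the bucket is empty
if (!retryBucket.tryTake()) {
  throw new RetryBudgetExhausted('inventory retry tokens exhausted')
}
await sleep(jitteredBackoff(attempt, inventoryRetryPolicy))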
Retry At One Layer
Retry budgets work best when ownership is clear.
Do not let every layer retry the same failure.
Bad:
browser retries
-> api gateway retries
-> checkout retries
-> payment client retries
-> database driver retries
Better:
checkout owns retries for payment
payment client exposes attempt metadata
lower layers return clear retryable/non-retryable errors
Google's SRE overload chapter makes the same point: when multiple layers retry, a single failed request can multiply into an exponentially growing number of attempts. It recommends retrying at the layer immediately above the overloaded dependency rather than letting every layer multiply attempts.
Write the ownership rule down:
| Dependency | Retry owner | Other layers |
|---|---|---|
| Payment provider | checkout-api | Do not retry payment writes elsewhere |
| Inventory service | checkout-api | Gateway does not retry 409/503 from checkout |
| Search indexer | background worker | API request writes outbox only |
This is especially important for side effects. If the operation can create orders, payments, subscriptions, emails, or jobs, retry safety depends on idempotency. For the synchronous API boundary, see API Idempotency Keys: Prevent Duplicate Requests Safely.
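The core property is that every attempt of one logical request sends the same idempotency key, so the dependency can deduplicate the side effect. A sketch, where payment.charge and the header name are common conventions rather than a specific provider's API:

import { randomUUID } from 'node:crypto'

// One key per logical request, generated before the retry loop starts,
// then reused unchanged on every attempt. `payment.charge` is hypothetical.
const idempotencyKey = randomUUID()

const response = await payment.charge(request, {
  headers: {
    'Idempotency-Key': idempotencyKey, // identical across all retries
  },
})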
Decide What Is Retryable
Retry budgets are not a license to retry everything.
Classify failures before spending the budget:
| Failure | Retry? | Why |
|---|---|---|
| Connection reset before response | Usually | Outcome may be transient or ambiguous |
| 429 Too Many Requests with Retry-After | Usually, after delay | Server is explicitly asking for slower traffic |
| 503 Service Unavailable from dependency | Sometimes | Retry only if budget and overload policy allow it |
| 400 Bad Request | No | Same request is still invalid |
| 401 Unauthorized | No, except token refresh flow | Retrying same credentials will not help |
| 409 Conflict | Depends | Could be a business conflict, idempotency mismatch, or retryable concurrency issue |
| Timeout after side-effecting request | Only with idempotency | The server may already have committed |
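The isRetryable helper used in the earlier loop can encode this table directly. A sketch, assuming the response exposes a numeric status field:

// Classify before spending budget. Statuses follow the table above;
// the response shape is an assumption about the client library.
function isRetryable(response: { status: number }): boolean {
  if (response.status === 429) return true // retry after the server-specified delay
  if (response.status === 503) return true // retry only if the budget allows it
  if (response.status >= 500) return true // ambiguous server-side failure
  return false // 400/401/409 and other 4xx need handling, not blind retries
}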
When a dependency is overloaded, a fast controlled failure can be more reliable than another retry. Backpressure and rate limiting are the related controls for deciding how much work enters the system at all; see Rate Limiting and Backpressure in Microservices.
Add Jitter Or The Budget Still Clumps
Backoff without jitter can synchronize callers.
If 10,000 clients fail at the same moment and all retry after exactly 1 second, they create a second synchronized spike one second later.
Use jittered backoff:
function jitteredBackoff(attempt: number, policy: RetryPolicy) {
  // Full jitter: pick a uniformly random delay between 0 and the exponential cap
  const exponential = Math.min(policy.maxDelayMs, policy.baseDelayMs * 2 ** (attempt - 1))
  return Math.floor(Math.random() * exponential)
}
The AWS Builders' Library article explains why jitter matters: when retries align, they can recreate contention or overload at the same time rather than smoothing it out.
Jitter is not a replacement for a retry budget. It spreads retry traffic. The budget caps retry traffic.
You usually need both.
Make Retry Traffic Visible
A retry budget that nobody observes is only half a control.
Track these metrics:
| Metric | Why it matters |
|---|---|
| Initial request rate | Baseline demand |
| Retry request rate | Extra load created by clients |
| Retry ratio by dependency | Whether retries are consuming too much capacity |
| Budget exhaustion count | Whether clients are being forced to stop retrying |
| Attempts per request | Whether individual requests are looping |
| Success after retry | Whether retries are actually helping |
| Failure after all attempts | Whether retrying is just delaying errors |
| Dependency latency during retries | Whether retrying correlates with overload |
The most useful chart is often:
dependency request rate = initial attempts + retry attempts
retry ratio = retry attempts / initial attempts
success after retry = requests that succeeded only after retry
If retry traffic rises but success-after-retry does not, the retry policy is spending capacity without improving availability.
That is the moment to lower attempts, reduce the retry ratio, increase backoff, or stop retrying that failure mode.
Practical Checklist
Before enabling retries against a dependency, check:
- Does one layer clearly own the retry?
- Is there a per-request attempt limit?
- Is there a fleet/client retry ratio or token bucket?
- Are side-effecting operations protected by idempotency?
- Are retryable and non-retryable failures separated?
- Does overload produce a response that clients know not to hammer?
- Does backoff include jitter?
- Are retries counted separately from initial requests?
- Can you see budget exhaustion?
- Can you show that retries improve success rate during normal transient failures?
- Can you show that retries stop during overload?
If you cannot answer those questions, retries may still help in happy-path testing, but they are not production-safe yet.
Final Takeaway
Retries are not free. They spend downstream capacity.
A retry budget makes that spending explicit.
Use retries for transient failures, cap them per request, limit retry traffic across the client, add jitter, avoid retrying at multiple layers, and measure whether retries are actually improving outcomes.
When the dependency is overloaded, the most reliable retry may be the one you do not send.