Retry Budgets in Microservices: Stop Retrying Into Outages

Retry budgets keep microservice retries from turning a partial outage into a larger outage. They answer a practical question every backend system eventually hits: how many extra attempts are allowed before retrying stops helping availability and starts consuming the capacity the dependency needs to recover?

Retries are useful when failures are brief, random, and isolated. They are dangerous when the dependency is overloaded. In that case, every retry competes with fresh traffic, increases queueing, burns connection pools, and can keep the overloaded service unhealthy after the original trigger is gone.

This article is part of the Backend Reliability hub. It is the companion to Adding Retries Can Make Outages Worse: that article explains retry amplification; this one shows how to put an explicit budget around retries so callers stop spending unlimited downstream capacity.


What A Retry Budget Is

A retry budget is a limit on retry traffic.

The limit can exist at several levels:

| Budget | What it limits | Example |
| --- | --- | --- |
| Per-request attempt budget | Attempts for one logical request | Try at most 3 times total |
| Per-client retry ratio | Retry traffic as a fraction of original traffic | Retries may be at most 10% of initial requests |
| Token bucket | Local retry permission under pressure | Spend one token per retry; refill slowly |
| Dependency health gate | Whether retries are allowed right now | Retry only while the dependency looks healthy |

A per-request limit prevents infinite loops. A per-client ratio prevents a whole fleet of callers from doubling traffic during a bad minute. A token bucket lets a service absorb small bursts without allowing unbounded retry storms. A dependency health gate stops clients from retrying into a service that is already telling you it cannot keep up.
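The four levels can live together in one caller-side policy. As a sketch only (the type and field names below are introduced here for illustration, not taken from any library):

```typescript
// Illustrative shape for a caller's combined retry budget, mirroring the
// four levels described above. Names are hypothetical.
type RetryBudgetConfig = {
  maxAttemptsPerRequest: number // per-request attempt budget
  maxRetryRatio: number // per-client ratio, e.g. 0.1 for 10%
  tokenBucket: { capacity: number; refillPerSecond: number } // local burst limit
  requireHealthyDependency: boolean // dependency health gate
}

const inventoryBudget: RetryBudgetConfig = {
  maxAttemptsPerRequest: 3,
  maxRetryRatio: 0.1,
  tokenBucket: { capacity: 200, refillPerSecond: 20 },
  requireHealthyDependency: true,
}
```

Each field answers a different question: how many attempts one request gets, how much fleet traffic retries may add, how bursty local retries may be, and whether retrying is allowed at all right now.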

Google's SRE book describes both per-request and per-client retry budgets in its overload handling chapter. Its concrete client-side example is a ratio: if each client caps retries at about 10% of its requests, retry traffic stays a small, bounded fraction of load instead of multiplying it. See Google SRE: Handling Overload.

The exact number is less important than the rule: retries need an upper bound.


The Retry Storm A Budget Prevents

Imagine a checkout service calling an inventory service.

Normal traffic:

| Source | Rate |
| --- | --- |
| Initial checkout requests | 1,000 requests/sec |
| Inventory calls per checkout | 1 |
| Retry traffic | 0 |
| Total inventory traffic | 1,000 requests/sec |

Now inventory becomes slow and starts timing out 10% of calls.

If every caller retries twice immediately:

| Attempt | Additional traffic |
| --- | --- |
| Initial requests | 1,000/sec |
| First retry for 10% failures | 100/sec |
| Second retry if those fail | up to 100/sec |
| Total possible traffic | up to 1,200/sec |

That looks manageable in one layer.

Now imagine the request crosses five services and three of them retry independently. The multipliers compound. AWS's Builders' Library warns about this exact pattern: if a five-deep stack retries three times at each layer, load at the deepest dependency can increase dramatically. See Timeouts, retries, and backoff with jitter.

The problem is not that one retry is bad. The problem is that every layer thinks its retry is local.

The dependency receives the sum.
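The compounding is easy to put a number on. In the worst case, each retrying layer turns one incoming attempt into (retries + 1) outgoing attempts, and those multipliers stack. A back-of-the-envelope helper (the function name is introduced here for illustration):

```typescript
// Worst-case load multiplier at the deepest dependency when several layers
// retry independently: each layer multiplies attempts by (retries + 1).
function worstCaseAmplification(retriesPerLayer: number, retryingLayers: number): number {
  return (retriesPerLayer + 1) ** retryingLayers
}

// Three retries at each of four retrying layers:
worstCaseAmplification(3, 4) // 256x the original load, in the worst case
```

Real traffic rarely hits the worst case, because not every attempt fails at every layer, but the exponent is why layered retries are dangerous even when each layer's policy looks modest.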


Use A Per-Request Budget First

The first budget is the simplest:

One logical request gets a fixed maximum number of attempts.

For example:

type RetryPolicy = {
  maxAttempts: number
  baseDelayMs: number
  maxDelayMs: number
}

const inventoryRetryPolicy: RetryPolicy = {
  maxAttempts: 3,
  baseDelayMs: 100,
  maxDelayMs: 1_000,
}

Then every call tracks the attempt count:

async function callInventory(request: ReserveInventoryRequest) {
  for (let attempt = 1; attempt <= inventoryRetryPolicy.maxAttempts; attempt++) {
    const response = await inventory.reserve(request, {
      headers: {
        // zero-based retry count: lets the dependency measure retry traffic
        'X-Retry-Attempt': String(attempt - 1),
      },
    })

    if (response.ok) {
      return response
    }

    if (!isRetryable(response) || attempt === inventoryRetryPolicy.maxAttempts) {
      throw new InventoryUnavailable(response.status)
    }

    await sleep(jitteredBackoff(attempt, inventoryRetryPolicy))
  }

  // unreachable: the final attempt above either returns or throws
  throw new Error('unreachable')
}

This protects one request from retrying forever.

It does not protect the dependency from a fleet-wide retry storm.

That is why per-request attempt limits are necessary but not enough.


Add A Client-Side Retry Ratio

A per-client retry ratio asks:

How much of this client's recent traffic is retry traffic?

If retries exceed the allowed ratio, the client stops retrying and returns the failure.

Example policy:

| Metric | Value |
| --- | --- |
| Window | 60 seconds |
| Max retry ratio | 10% |
| Counted traffic | Initial attempts and retry attempts |
| Behavior when exhausted | Do not retry; return controlled failure |

Suppose checkout sent 10,000 initial inventory calls in the last minute. With a 10% retry ratio, it can spend about 1,000 retry attempts in that same window.

That budget lets isolated failures recover while preventing retry traffic from becoming the dominant workload.

A simple in-process sketch:

class RetryBudget {
  constructor(
    private readonly maxRetryRatio: number,
    private readonly window: RollingWindow
  ) {}

  recordInitialRequest() {
    this.window.increment('initial')
  }

  trySpendRetry() {
    const initial = this.window.count('initial')
    const retries = this.window.count('retry')

    if (initial === 0) {
      return false
    }

    if (retries / initial >= this.maxRetryRatio) {
      return false
    }

    this.window.increment('retry')
    return true
  }
}

Then the caller checks the budget before sleeping and retrying:

if (!retryBudget.trySpendRetry()) {
  throw new RetryBudgetExhausted('inventory retry budget exhausted')
}

The important part is not this exact implementation. The important part is that retry permission becomes finite and measurable.
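The sketch above leans on a `RollingWindow` with `increment` and `count`. One minimal way to back that interface, assuming per-second buckets over a fixed trailing window (again a sketch, not a production counter):

```typescript
// Minimal rolling-window counter matching the interface the RetryBudget
// sketch assumes: increment(kind) and count(kind) over the trailing window.
class RollingWindow {
  private buckets = new Map<number, Map<string, number>>()

  constructor(private readonly windowSeconds: number) {}

  increment(kind: string) {
    const second = Math.floor(Date.now() / 1_000)
    this.evict(second)
    const bucket = this.buckets.get(second) ?? new Map<string, number>()
    bucket.set(kind, (bucket.get(kind) ?? 0) + 1)
    this.buckets.set(second, bucket)
  }

  count(kind: string): number {
    this.evict(Math.floor(Date.now() / 1_000))
    let total = 0
    for (const bucket of Array.from(this.buckets.values())) {
      total += bucket.get(kind) ?? 0
    }
    return total
  }

  private evict(nowSecond: number) {
    // drop buckets that have aged out of the window
    for (const second of Array.from(this.buckets.keys())) {
      if (second <= nowSecond - this.windowSeconds) {
        this.buckets.delete(second)
      }
    }
  }
}
```

In a real service the window would typically live in the client library shared by all call sites, so every caller of the dependency draws from the same budget.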


Token Buckets Work Well For Local Limits

A token bucket is another practical retry budget.

It works like this:

  1. The bucket starts with some retry tokens.
  2. Each retry spends one token.
  3. Tokens refill at a fixed rate.
  4. If the bucket is empty, the caller does not retry.

AWS's Builders' Library describes limiting retries locally with a token bucket so that retries continue while tokens are available and then settle to a fixed rate once tokens are exhausted.

A simplified token bucket:

class TokenBucket {
  private tokens: number
  private lastRefillAt = Date.now()

  constructor(
    private readonly capacity: number,
    private readonly refillPerSecond: number
  ) {
    this.tokens = capacity
  }

  tryTake() {
    this.refill()

    if (this.tokens < 1) {
      return false
    }

    this.tokens -= 1
    return true
  }

  private refill() {
    const now = Date.now()
    const elapsedSeconds = (now - this.lastRefillAt) / 1_000
    const refill = elapsedSeconds * this.refillPerSecond

    this.tokens = Math.min(this.capacity, this.tokens + refill)
    this.lastRefillAt = now
  }
}

For a dependency that handles important but non-critical background work, you might use:

const retryBucket = new TokenBucket(200, 20) // capacity 200, refill 20 tokens/sec

The bucket allows a short burst of retries but prevents a permanent retry flood.
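Wiring the bucket into the retry decision can be as small as one predicate. A sketch, where `tryTakeToken` stands in for `TokenBucket.tryTake` (any boolean-returning limiter works):

```typescript
// Gate a retry on three conditions: the failure class, the per-request
// attempt budget, and the local token bucket.
function shouldRetry(
  failureIsRetryable: boolean,
  attempt: number,
  maxAttempts: number,
  tryTakeToken: () => boolean
): boolean {
  if (!failureIsRetryable) return false // never retry non-retryable failures
  if (attempt >= maxAttempts) return false // per-request budget exhausted
  return tryTakeToken() // spend a local token, or give up
}
```

The ordering matters: checking the failure class and the attempt cap first means tokens are only spent on retries that are actually allowed to happen.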


Retry At One Layer

Retry budgets work best when ownership is clear.

Do not let every layer retry the same failure.

Bad:

browser retries
  -> api gateway retries
    -> checkout retries
      -> payment client retries
        -> database driver retries

Better:

checkout owns retries for payment
payment client exposes attempt metadata
lower layers return clear retryable/non-retryable errors

Google's SRE overload chapter makes the same point: if multiple layers retry, a failed request can produce a combinatorial explosion. It recommends retrying at the layer immediately above the overloaded dependency rather than letting every layer multiply attempts.

Write the ownership rule down:

| Dependency | Retry owner | Other layers |
| --- | --- | --- |
| Payment provider | checkout-api | Do not retry payment writes elsewhere |
| Inventory service | checkout-api | Gateway does not retry 409/503 from checkout |
| Search indexer | background worker | API request writes outbox only |

This is especially important for side effects. If the operation can create orders, payments, subscriptions, emails, or jobs, retry safety depends on idempotency. For the synchronous API boundary, see API Idempotency Keys: Prevent Duplicate Requests Safely.
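The mechanical rule for idempotent retries: generate the key once per logical request, before the first attempt, and send the same key on every retry. A sketch (the header names and UUID key scheme are illustrative, not a specific provider's API):

```typescript
import { randomUUID } from 'node:crypto'

// Build per-attempt headers that keep the idempotency key stable across
// retries, so a retried side-effecting call cannot create a second order.
function attemptHeaders(idempotencyKey: string, attempt: number) {
  return {
    'Idempotency-Key': idempotencyKey, // same key on every attempt
    'X-Retry-Attempt': String(attempt),
  }
}

// Generated once per logical request, before attempt 0 — never per attempt.
const key = randomUUID()
```

If the key were regenerated inside the retry loop, every retry would look like a fresh request to the dependency and the idempotency protection would be lost.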


Decide What Is Retryable

Retry budgets are not a license to retry everything.

Classify failures before spending the budget:

| Failure | Retry? | Why |
| --- | --- | --- |
| Connection reset before response | Usually | Outcome may be transient or ambiguous |
| 429 Too Many Requests with Retry-After | Usually, after delay | Server is explicitly asking for slower traffic |
| 503 Service Unavailable from dependency | Sometimes | Retry only if budget and overload policy allow it |
| 400 Bad Request | No | Same request is still invalid |
| 401 Unauthorized | No, except token refresh flow | Retrying same credentials will not help |
| 409 Conflict | Depends | Could be a business conflict, idempotency mismatch, or retryable concurrency issue |
| Timeout after side-effecting request | Only with idempotency | The server may already have committed |

When a dependency is overloaded, a fast controlled failure can be more reliable than another retry. Backpressure and rate limiting are the related controls for deciding how much work enters the system at all; see Rate Limiting and Backpressure in Microservices.
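One way to encode the status-code rows of that table as a starting point (a simplification: a real classifier also needs the transport error kind, Retry-After handling, and per-operation policy, and its signature differs from the `isRetryable(response)` helper used earlier):

```typescript
// Rough status-code classifier following the table above. Ambiguous 5xx
// failures are only retried when the operation is idempotent, because the
// server may already have committed the side effect.
function isRetryableStatus(status: number, operationIsIdempotent: boolean): boolean {
  if (status === 429 || status === 503) return true // retry after backoff, if budget allows
  if (status >= 400 && status < 500) return false // other 4xx: the same request stays invalid
  if (status >= 500) return operationIsIdempotent // ambiguous server failure
  return false
}
```

Note that 409 falls into the "no" branch here; per the table it really depends on what the conflict means in the specific API, which is exactly the kind of per-dependency decision the classifier should make explicit.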


Add Jitter Or The Budget Still Clumps

Backoff without jitter can synchronize callers.

If 10,000 clients fail at the same time and all retry after exactly 1 second, they create another spike at exactly 1 second.

Use jittered backoff:

function jitteredBackoff(attempt: number, policy: RetryPolicy) {
  const exponential = Math.min(policy.maxDelayMs, policy.baseDelayMs * 2 ** (attempt - 1))

  return Math.floor(Math.random() * exponential)
}

The AWS Builders' Library article explains why jitter matters: when retries align, they can recreate contention or overload at the same time rather than smoothing it out.

Jitter is not a replacement for a retry budget. It spreads retry traffic. The budget caps retry traffic.

You usually need both.
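The two controls compose naturally: the budget decides whether a retry may happen, jitter decides when. A sketch combining them into one decision, where `trySpendRetry` stands in for any budget (ratio or token bucket) and `null` means "do not retry":

```typescript
// Returns a jittered delay in ms for the next retry, or null when either
// the per-request cap or the fleet-wide budget forbids retrying.
function nextRetryDelayMs(
  attempt: number,
  maxAttempts: number,
  baseDelayMs: number,
  maxDelayMs: number,
  trySpendRetry: () => boolean
): number | null {
  if (attempt >= maxAttempts) return null // per-request budget exhausted
  if (!trySpendRetry()) return null // fleet-wide budget exhausted
  const exponential = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1))
  return Math.floor(Math.random() * exponential) // full jitter spreads the retries
}
```

Checking the caps before spending the budget keeps the accounting honest: a retry that was never going to happen should not consume budget.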


Make Retry Traffic Visible

A retry budget that nobody observes is only half a control.

Track these metrics:

| Metric | Why it matters |
| --- | --- |
| Initial request rate | Baseline demand |
| Retry request rate | Extra load created by clients |
| Retry ratio by dependency | Whether retries are consuming too much capacity |
| Budget exhaustion count | Whether clients are being forced to stop retrying |
| Attempts per request | Whether individual requests are looping |
| Success after retry | Whether retries are actually helping |
| Failure after all attempts | Whether retrying is just delaying errors |
| Dependency latency during retries | Whether retrying correlates with overload |

The most useful chart is often:

dependency request rate = initial attempts + retry attempts
retry ratio = retry attempts / initial attempts
success after retry = requests that succeeded only after retry

If retry traffic rises but success-after-retry does not, the retry policy is spending capacity without improving availability.

That is the moment to lower attempts, reduce the retry ratio, increase backoff, or stop retrying that failure mode.
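A minimal per-dependency counter set for those signals might look like this (a sketch; in practice these would be emitted through whatever metrics client the service already uses, with the dependency name as a label):

```typescript
// Bare counters for the retry signals above, plus the derived retry ratio.
class RetryMetrics {
  initialAttempts = 0
  retryAttempts = 0
  successAfterRetry = 0
  budgetExhaustions = 0

  retryRatio(): number {
    // guard against divide-by-zero when no initial traffic has been seen
    return this.initialAttempts === 0 ? 0 : this.retryAttempts / this.initialAttempts
  }
}
```

The key design choice is counting retries separately from initial attempts at the point the request is sent; deriving the ratio later from combined totals loses exactly the signal the budget needs.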


Practical Checklist

Before enabling retries against a dependency, check:

  • Does one layer clearly own the retry?
  • Is there a per-request attempt limit?
  • Is there a fleet/client retry ratio or token bucket?
  • Are side-effecting operations protected by idempotency?
  • Are retryable and non-retryable failures separated?
  • Does overload produce a response that clients know not to hammer?
  • Does backoff include jitter?
  • Are retries counted separately from initial requests?
  • Can you see budget exhaustion?
  • Can you show that retries improve success rate during normal transient failures?
  • Can you show that retries stop during overload?

If you cannot answer those questions, retries may still help in happy-path testing, but they are not production-safe yet.


Final Takeaway

Retries are not free. They spend downstream capacity.

A retry budget makes that spending explicit.

Use retries for transient failures, cap them per request, limit retry traffic across the client, add jitter, avoid retrying at multiple layers, and measure whether retries are actually improving outcomes.

When the dependency is overloaded, the most reliable retry may be the one you do not send.