Rate Limiting and Backpressure in Microservices

Rate limiting and backpressure in microservices are not just ways to reject traffic. They are ways to keep a system inside a range where it can still do useful work.

The common failure is admitting more work than the service, its dependencies, or its workers can finish. Timeouts then fire, clients retry, queues grow, and the system spends its remaining capacity on work that may no longer matter. At that point the problem is no longer one slow endpoint. It is uncontrolled admission.

For the broader reliability cluster around timeouts, retries, circuit breakers, queues, and recovery patterns, see the Backend Reliability hub.

The Failure Rate Limiting Is Meant To Prevent

Imagine a checkout API that usually receives 250 requests per second.

During a promotion, traffic jumps to 1,200 requests per second. The checkout API scales horizontally, so the load balancer keeps accepting requests. Each checkout request calls inventory, pricing, payment risk, and email scheduling. The inventory service is the tightest dependency. It can complete about 500 useful requests per second before latency climbs sharply.

Without admission control, the first symptom may not be an error. It may be rising latency.

Minute	What happens	Why it matters
0	Promotion starts; ingress accepts the burst	No boundary asks whether downstream capacity exists
2	Inventory p95 latency rises from 80 ms to 900 ms	Requests are queueing inside the dependency
4	Checkout timeouts start firing	Callers give up, but inventory keeps working
5	Clients and upstream services retry failed checkout attempts	New work competes with already queued work
7	Worker queues grow because completed checkouts schedule emails	Async paths absorb the overload after the request
9	Circuit breakers open on some callers but not others	Traffic shifts and recovery becomes uneven
12	Goodput falls even though raw request volume remains very high	The system is busy but less useful

This is the overload pattern examined in When Timeouts Didn't Prevent Cascading Failures. Timeouts limit how long callers wait. They do not decide how much work enters the system.

AWS's Builders Library article on using load shedding to avoid overload makes the important distinction between throughput and goodput: a service can process or attempt a high volume of work while the amount of useful, timely work falls.

Rate limiting and backpressure exist to preserve goodput.
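To make the throughput and goodput distinction concrete, here is a minimal sketch that computes both from the same set of finished requests. The Outcome shape, the one-second window, and the 1,000 ms deadline are assumptions chosen for illustration, not values from a real system.

```typescript
// Hypothetical record of one finished request; field names are illustrative.
interface Outcome {
  ok: boolean // the request returned a success response
  latencyMs: number // time until the response was produced
}

// Throughput counts every completed attempt. Goodput counts only the
// successes that finished before the caller's deadline.
function summarize(outcomes: Outcome[], windowSeconds: number, deadlineMs: number) {
  const useful = outcomes.filter((o) => o.ok && o.latencyMs <= deadlineMs)
  return {
    throughputPerSecond: outcomes.length / windowSeconds,
    goodputPerSecond: useful.length / windowSeconds,
  }
}

const sample: Outcome[] = [
  { ok: true, latencyMs: 120 },
  { ok: true, latencyMs: 2400 }, // succeeded, but after the caller gave up
  { ok: false, latencyMs: 900 },
  { ok: true, latencyMs: 300 },
]

const summary = summarize(sample, 1, 1000)
// summary.throughputPerSecond is 4, summary.goodputPerSecond is 2
```

The second request is the interesting one: the server did the work, but nobody was waiting for the answer.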

Rate Limiting, Backpressure, And Load Shedding Are Different Controls

These terms often get mixed together because they all reduce traffic. The differences matter when you are deciding where to put the control.

Control	Question it answers	Typical place
Rate limiting	Is this caller, tenant, route, or dependency over budget?	API gateway, service boundary, client library
Backpressure	Can the downstream path accept more work right now?	Service-to-service calls, queues, streams, workers
Load shedding	Which work should be rejected or degraded under overload?	Server, gateway, worker, or dependency boundary

Rate limiting is usually policy based. A tenant may be allowed 100 write requests per second. An expensive endpoint may be allowed fewer requests than a cheap read endpoint. A dependency may receive only a fixed amount of traffic from each caller.

Backpressure is capacity feedback. A worker pool is full. A queue is aging. A dependency has high latency. A stream consumer cannot drain fast enough. The upstream component must slow down, reject, or degrade before the downstream path becomes unstable.

Load shedding is the survival decision. When the system is beyond safe capacity, it chooses not to do some work so it can keep doing more important work.

In a healthy design, these controls cooperate:

ingress rate limits protect shared service capacity and fairness
per-dependency limits prevent one slow downstream from consuming every worker
bounded queues prevent memory from becoming the overload buffer
load shedding keeps the service responsive enough to recover
retry budgets prevent rejected work from returning immediately as more traffic

That last point connects directly to Retry Budgets in Microservices. If overload responses cause immediate retries, admission control becomes a traffic amplifier with nicer status codes.
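A retry budget can be sketched as a ratio between first attempts and retries. The class below is a simplified assumption, not the implementation from the linked article: it keeps lifetime counters, whereas a production version would normally decay them over a sliding window.

```typescript
// Minimal retry budget: retries may be at most `ratio` of first attempts.
// Lifetime counters keep the sketch short; real budgets usually use a
// sliding time window.
class RetryBudget {
  private attempts = 0
  private retries = 0

  constructor(private readonly ratio: number) {}

  recordAttempt(): void {
    this.attempts++
  }

  // Returns true and spends budget if a retry is allowed right now.
  tryRetry(): boolean {
    if (this.retries + 1 > this.attempts * this.ratio) return false
    this.retries++
    return true
  }
}

const budget = new RetryBudget(0.25) // assumed: retries add at most 25% extra load
for (let i = 0; i < 8; i++) budget.recordAttempt()

const firstRetry = budget.tryRetry() // allowed: 1 <= 8 * 0.25
const secondRetry = budget.tryRetry() // allowed: 2 <= 2
const thirdRetry = budget.tryRetry() // refused: 3 > 2
```

When the budget is exhausted, the client surfaces the overload response instead of sending the request again.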

Edge Rate Limits Are Not Enough

An API gateway limit is useful, but it is not a complete overload strategy.

The gateway usually knows the client, route, and request rate. It may not know that one route became expensive because the cache is cold, one database query changed plan, one dependency is slow, or one tenant is hitting a rare path that creates five downstream calls per request.

For example:

Boundary	What it can see	What it may miss
CDN or WAF	IP, path, coarse request rate	authenticated tenant, downstream cost
API gateway	API key, route, method, request count	database pressure, worker saturation, dependency lag
Service instance	local CPU, in-flight work, dependency latency	global traffic distribution across all instances
Dependency client	error rate, latency, timeout rate for one path	tenant fairness or business priority
Queue producer/consumer	backlog, job age, worker drain rate	synchronous request pressure that created the backlog

A practical design uses layered admission:

A coarse edge limit blocks abusive or accidental bursts.
A per-tenant or per-account limit preserves fairness.
A route or operation budget accounts for expensive paths.
A local concurrency limit protects each service instance.
A dependency-specific limit protects scarce downstream capacity.
A queue policy decides whether to enqueue, defer, or reject async work.

The mistake is assuming one global requests-per-second number can represent capacity. Microservices rarely have one capacity. They have capacity per route, per dependency, per worker pool, per tenant, and per failure mode.
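One way to express layered admission is an ordered list of checks where the first rejection wins. This is a sketch under assumptions: the CheckoutAttempt fields and the thresholds are hypothetical, and a real service would read them from live limiters rather than from the request object.

```typescript
// One rejection from any layer; field names are illustrative.
interface Rejection {
  layer: string
  status: 429 | 503
}

// Each layer inspects the request and returns a rejection, or null to pass.
type AdmissionCheck<T> = (req: T) => Rejection | null

// Runs the layers in order and stops at the first rejection, so coarse
// checks can run before more specific ones.
function admitThroughLayers<T>(req: T, layers: AdmissionCheck<T>[]): Rejection | null {
  for (const check of layers) {
    const rejection = check(req)
    if (rejection) return rejection
  }
  return null
}

// Hypothetical request snapshot and thresholds, for illustration only.
interface CheckoutAttempt {
  tenantRate: number // recent requests per second for this tenant
  localInFlight: number // requests active on this instance
  inventoryP95Ms: number // recent p95 latency of the inventory dependency
}

const layers: AdmissionCheck<CheckoutAttempt>[] = [
  (r) => (r.tenantRate > 100 ? { layer: 'tenant', status: 429 } : null),
  (r) => (r.localInFlight >= 300 ? { layer: 'local_concurrency', status: 503 } : null),
  (r) => (r.inventoryP95Ms > 800 ? { layer: 'inventory', status: 503 } : null),
]

const decision = admitThroughLayers(
  { tenantRate: 40, localInFlight: 120, inventoryP95Ms: 950 },
  layers
)
// decision is { layer: 'inventory', status: 503 }
```

Recording which layer rejected the request is what later lets operators see which capacity was actually the constraint.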

A Concrete Admission Model

A useful admission decision should answer three questions:

Is the caller allowed to send this much work?
Is this service instance healthy enough to accept the work?
Is the downstream path needed by this request healthy enough to accept more work?

Here is a simplified TypeScript-style example. It combines a tenant token bucket, a local in-flight request cap, and a dependency pressure check.

type AdmissionResult =
  | { allowed: true }
  | {
      allowed: false
      status: 429 | 503
      reason: string
      retryAfterSeconds: number
    }

class TokenBucket {
  private tokens: number
  private lastRefillMs = Date.now()

  constructor(
    private readonly refillPerSecond: number,
    private readonly burst: number
  ) {
    this.tokens = burst
  }

  trySpend(cost = 1): boolean {
    const now = Date.now()
    const elapsedSeconds = (now - this.lastRefillMs) / 1000

    this.tokens = Math.min(this.burst, this.tokens + elapsedSeconds * this.refillPerSecond)
    this.lastRefillMs = now

    if (this.tokens < cost) return false

    this.tokens -= cost
    return true
  }
}

class InFlightLimiter {
  private active = 0

  constructor(private readonly maxActive: number) {}

  tryEnter() {
    if (this.active >= this.maxActive) return false
    this.active++
    return true
  }

  leave() {
    this.active = Math.max(0, this.active - 1)
  }
}

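// CheckoutRequest and inventoryPressure() are assumed to be defined elsewhere.
const tenantBuckets = new Map<string, TokenBucket>()
const localLimiter = new InFlightLimiter(300)

// Returns the tenant's bucket, creating it on first use. The refill rate and
// burst size are placeholder values.
function getTenantBucket(tenantId: string, buckets: Map<string, TokenBucket>): TokenBucket {
  let bucket = buckets.get(tenantId)
  if (!bucket) {
    bucket = new TokenBucket(100, 200)
    buckets.set(tenantId, bucket)
  }
  return bucket
}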

function admitCheckoutRequest(req: CheckoutRequest): AdmissionResult {
  const tenantBucket = getTenantBucket(req.tenantId, tenantBuckets)
  const requestCost = req.items.length > 20 ? 5 : 1

  if (!tenantBucket.trySpend(requestCost)) {
    return {
      allowed: false,
      status: 429,
      reason: 'tenant_rate_limit',
      retryAfterSeconds: 2,
    }
  }

  if (!localLimiter.tryEnter()) {
    return {
      allowed: false,
      status: 503,
      reason: 'checkout_inflight_limit',
      retryAfterSeconds: 1,
    }
  }

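  // Read the pressure snapshot once so both checks see the same values.
  const pressure = inventoryPressure()
  if (pressure.p95LatencyMs > 800 || pressure.queueDepth > 2000) {
    // Give back the in-flight slot taken above before rejecting.
    localLimiter.leave()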
    return {
      allowed: false,
      status: 503,
      reason: 'inventory_backpressure',
      retryAfterSeconds: 5,
    }
  }

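  // The in-flight slot stays held. The caller must call localLimiter.leave()
  // when the request finishes, including on error paths.
  return { allowed: true }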
}

This is not a complete production limiter. It is the shape of the decision.
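One detail the admission function leaves to its caller is releasing the in-flight slot once an admitted request finishes. A possible way to make that hard to forget is a wrapper that pairs entry and release. The sketch below is an assumption about how this could be wired, with SlotLimiter and runWithSlot as hypothetical names; the limiter is repeated so the example stands alone.

```typescript
// Same shape as the in-flight limiter above, repeated so the sketch is
// self-contained.
class SlotLimiter {
  private active = 0

  constructor(private readonly maxActive: number) {}

  tryEnter(): boolean {
    if (this.active >= this.maxActive) return false
    this.active++
    return true
  }

  leave(): void {
    this.active = Math.max(0, this.active - 1)
  }

  get inUse(): number {
    return this.active
  }
}

type Guarded<T> = { admitted: true; value: T } | { admitted: false; status: 503 }

// Hypothetical wrapper: takes a slot, runs the work, and always gives the
// slot back, whether the work resolves or throws.
async function runWithSlot<T>(limiter: SlotLimiter, work: () => Promise<T>): Promise<Guarded<T>> {
  if (!limiter.tryEnter()) return { admitted: false, status: 503 }
  try {
    return { admitted: true, value: await work() }
  } finally {
    limiter.leave()
  }
}
```

A handler would call runWithSlot(limiter, () => processCheckout(req)), where processCheckout is whatever the service does with admitted work, and map a non-admitted result to a 503 response.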

The important part is that 429 and 503 mean different things:

429 Too Many Requests means the caller exceeded a policy budget.
503 Service Unavailable means the service is temporarily unable to handle the request, often because of overload or maintenance.

RFC 6585 defines 429 as a status for too many requests in a given amount of time and says the response can include Retry-After. RFC 9110 defines 503 as temporary inability to handle a request due to overload or maintenance, and its Retry-After section explains how a server can suggest a delay before the client retries.

Do not use the status code as the whole policy. The response should also expose enough information for clients and operators to behave correctly:

HTTP/1.1 503 Service Unavailable
Content-Type: application/json
Retry-After: 5

{
  "error": "service_overloaded",
  "message": "Checkout is temporarily overloaded. Retry after the advertised delay.",
  "retryable": true
}

For public APIs, be careful with how much internal detail the response reveals. For internal service calls, include a machine-readable reason so clients can distinguish tenant limits, dependency backpressure, and system overload.
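On the client side, the advertised delay only helps if the caller reads it. RFC 9110 allows Retry-After to carry either a number of seconds or an HTTP date, so a client needs to handle both. The helpers below are a minimal sketch with hypothetical names. The jitter is an addition that the text above does not prescribe; it is a common way to keep clients that were rejected together from returning together.

```typescript
// Converts a Retry-After header value into a delay in milliseconds.
// Accepts a non-negative number of seconds or an HTTP date.
// Returns null when the value cannot be interpreted.
function retryAfterToDelayMs(headerValue: string, nowMs: number): number | null {
  const trimmed = headerValue.trim()
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000

  const dateMs = Date.parse(trimmed)
  if (Number.isNaN(dateMs)) return null
  return Math.max(0, dateMs - nowMs)
}

// Adds up to `maxJitterMs` of random delay on top of the advertised delay.
// `random` is injectable so the behavior can be tested deterministically.
function delayWithJitter(
  baseMs: number,
  maxJitterMs: number,
  random: () => number = Math.random
): number {
  return baseMs + Math.floor(random() * maxJitterMs)
}
```

For the response shown earlier, retryAfterToDelayMs('5', Date.now()) yields 5000, and the client would wait at least that long before trying again.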

Backpressure Should Protect Specific Resources

Backpressure becomes weak when it is only a global flag named busy.

A service can be healthy overall while one dependency is saturated. It can have spare CPU while its database connection pool is exhausted. It can process cheap reads while expensive writes are unsafe. Good backpressure protects the specific resource that is under pressure.

Examples:

Resource under pressure	Backpressure signal	Safer action
Database pool	connections in use, wait time	reject expensive writes, preserve reads if possible
Dependency client	p95 latency, timeout rate, open pool	cap calls to that dependency
Worker pool	active workers, queue age	stop accepting new background work
Memory	heap pressure, queue size	reject or shed low-priority requests early
Tenant budget	tokens exhausted	return 429 for that tenant, not for everyone
Expensive endpoint	route-specific saturation	reduce only that route's admission

This is why per-dependency concurrency limits are often more useful than one service-wide request limit.

const inventoryLimiter = new InFlightLimiter(80)

async function callInventoryWithBackpressure(req: InventoryRequest) {
  if (!inventoryLimiter.tryEnter()) {
    throw new DependencyBackpressureError('inventory dependency saturated')
  }

  try {
    return await inventoryClient.reserve(req, { timeoutMs: 700 })
  } finally {
    inventoryLimiter.leave()
  }
}

The caller can map DependencyBackpressureError to a controlled 503, a degraded response, or an async fallback depending on the product path.

This is different from the control described in Circuit Breaker Pattern in Microservices. A circuit breaker asks whether a dependency should be considered unhealthy enough to stop calling for a while. A concurrency limit asks how many calls may be in progress right now. They often belong together, but they are not the same control.

Queues Need Backpressure Too

Queues are useful because they decouple request latency from background work. They are dangerous when they become an infinite buffer for overload.

If producers can enqueue faster than consumers can drain, the queue becomes a delayed outage. The request path looks healthy for a while, but job age grows, retries accumulate, and users eventually notice stale emails, delayed webhooks, missed exports, or expired work.

A queue producer should usually ask:

How deep is the queue?
How old is the oldest runnable job?
Is the worker error rate rising?
Is the job still valuable if it starts later?
Is this job critical enough to displace lower-priority work?

For low-priority work, rejecting or degrading early can be better than accepting a job that will run too late to matter.

async function enqueueReceiptEmail(order: Order) {
  const health = await emailQueueHealth()

  if (health.oldestRunnableAgeSeconds > 300) {
    return {
      accepted: false,
      reason: 'email_queue_backpressure',
      userMessage: 'Receipt email may be delayed. Order confirmation is available in your account.',
    }
  }

  if (health.depth > 50_000 && order.priority !== 'critical') {
    return {
      accepted: false,
      reason: 'non_critical_email_shed',
      userMessage: 'Order completed. Non-critical email was skipped during high load.',
    }
  }

  await emailQueue.enqueue({
    type: 'send_receipt_email',
    orderId: order.id,
    priority: order.priority,
  })

  return { accepted: true }
}

That example is intentionally product-specific. Backpressure is not only an infrastructure setting. It is a business decision about which work stays valuable under delay.

For more on replay-safe workers, retries, dead-letter handling, and queue health, see Background Jobs in Production.

Decide What Gets Protected First

A rate limit without priority is a blunt instrument.

During overload, some work is more important than other work:

Work type	Typical priority	Possible overload behavior
Health checks	High	keep cheap and isolated from user traffic
User login	High	preserve, but cap expensive fraud checks
Checkout payment confirmation	High	preserve with strict dependency limits
Search autocomplete	Medium	degrade result quality or use cached results
Analytics event ingestion	Low to medium	sample, batch, or drop when queues are saturated
Recommendation refresh	Low	defer or skip
Bulk export	Low	queue behind interactive work or return a try-later response

The worst overload policy treats all work equally. It lets cheap, low-value, or retry-generated requests compete with critical user actions.
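A shedding order can be as simple as one threshold per priority class. The sketch below assumes a single utilization signal between 0 and 1 and placeholder thresholds; both are illustrative choices, and the right signal depends on which resource actually saturates.

```typescript
type WorkPriority = 'high' | 'medium' | 'low'

// Assumed shedding order: as utilization rises, low-priority work is
// rejected first, then medium. High-priority work is never shed here and
// is left to the hard limits. The thresholds are placeholders.
const shedAbove: Record<WorkPriority, number> = {
  low: 0.7,
  medium: 0.85,
  high: Number.POSITIVE_INFINITY,
}

// `utilization` is any 0..1 saturation signal, such as
// in-flight requests divided by the in-flight limit.
function shouldShed(priority: WorkPriority, utilization: number): boolean {
  return utilization > shedAbove[priority]
}
```

At 80% utilization this policy drops recommendation refreshes but still serves search autocomplete and checkout.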

Good policies usually include:

tenant or account isolation so one customer cannot consume all shared capacity
endpoint costs so expensive requests spend more budget than cheap requests
separate pools for critical and non-critical work
dependency-specific limits so one downstream failure does not stall the whole service
explicit shedding order so the team knows what degrades first
client retry guidance so rejected work does not return immediately

Google's SRE chapter on handling overload discusses client-side throttling and adaptive throttling as ways to stop clients from continuing to send traffic after the backend is already rejecting work. That matters because server-side limits are only half of the system. Clients must cooperate too.
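The adaptive throttling in that chapter has each client track how many requests it attempted and how many the backend accepted, then reject locally with probability max(0, (requests - K * accepts) / (requests + 1)). The sketch below implements only that formula. The function names are hypothetical, K defaults to 2 here as a tunable starting point, and the sliding window over which the two counters are kept is omitted.

```typescript
// Probability that the client should reject a request locally instead of
// sending it, given recent attempt and accept counts.
function clientRejectProbability(requests: number, accepts: number, k = 2): number {
  return Math.max(0, (requests - k * accepts) / (requests + 1))
}

// Decides one request. `random` is injectable for deterministic tests.
function shouldRejectLocally(
  requests: number,
  accepts: number,
  k = 2,
  random: () => number = Math.random
): boolean {
  return random() < clientRejectProbability(requests, accepts, k)
}
```

While the backend accepts at least 1/K of what the client sends, the probability stays at zero. Once acceptance falls below that, the client starts dropping work before it reaches the network.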

Common Mistakes

Limits Based On Average Traffic

Average traffic is not capacity.

If a service handles 200 requests per second on an ordinary day, that does not mean a 400 requests-per-second limit is safe. The expensive path may saturate at 120 requests per second when a cache is cold or a dependency is slow.

Set limits from saturation tests and production signals, not only historical averages.

One Global Limit For Every Route

One request to GET /products/:id may be cheap. One request to POST /checkout may reserve inventory, call a payment service, write several rows, and enqueue jobs.

Treating both as one token makes the policy easy to configure and hard to trust.

Unbounded Queues

An unbounded queue is not backpressure. It is a promise to fail later.

Bound queue depth, bound job age, and decide what happens when the queue cannot accept more useful work.
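Both bounds can be shown with a small in-memory queue. This is a sketch, not a replacement for a real broker: BoundedJobQueue is a hypothetical class, time is passed in explicitly to keep it testable, and what the producer does after a refused enqueue is left to the caller.

```typescript
interface QueuedJob<T> {
  payload: T
  enqueuedAtMs: number
}

// Bounded by depth at enqueue time and by age at dequeue time.
class BoundedJobQueue<T> {
  private jobs: QueuedJob<T>[] = []

  constructor(
    private readonly maxDepth: number,
    private readonly maxAgeMs: number
  ) {}

  // Returns false instead of growing past maxDepth. The producer decides
  // whether to reject, defer, or degrade.
  tryEnqueue(payload: T, nowMs: number): boolean {
    if (this.jobs.length >= this.maxDepth) return false
    this.jobs.push({ payload, enqueuedAtMs: nowMs })
    return true
  }

  // Skips jobs that waited longer than maxAgeMs and reports how many expired.
  dequeue(nowMs: number): { payload: T | undefined; expired: number } {
    let expired = 0
    while (this.jobs.length > 0) {
      const job = this.jobs.shift()!
      if (nowMs - job.enqueuedAtMs <= this.maxAgeMs) {
        return { payload: job.payload, expired }
      }
      expired++
    }
    return { payload: undefined, expired }
  }
}
```

The expired count is worth exporting as a metric: it measures work that was accepted and then thrown away.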

Per-Instance Limits That Ignore Fleet Size

A local in-flight limit protects each instance, but the fleet-wide limit changes when autoscaling adds or removes instances.

That can be fine if the limit is intentionally local, but be explicit. A fleet of 20 instances with a local limit of 300 admits up to 6,000 in-flight requests. A fleet of 5 admits 1,500. If a dependency can handle only 2,000 concurrent calls from the whole service, local limits alone are not enough.
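The arithmetic is simple enough to encode and check in a capacity review. The helper names below are hypothetical, and the second function assumes traffic is spread evenly across instances, which real load balancing only approximates.

```typescript
// Worst-case in-flight requests the fleet admits when only local limits apply.
function fleetWideInFlight(instances: number, localLimit: number): number {
  return instances * localLimit
}

// Largest local limit that keeps the whole fleet inside a shared
// dependency budget, assuming an even spread of traffic.
function localLimitForBudget(dependencyBudget: number, instances: number): number {
  return Math.floor(dependencyBudget / instances)
}

const atTwenty = fleetWideInFlight(20, 300) // 6000, above a 2,000-call budget
const atFive = fleetWideInFlight(5, 300) // 1500, inside it
const safeLocalLimit = localLimitForBudget(2000, 20) // 100 per instance
```

A static per-instance value derived this way goes stale as soon as the fleet scales, which is the argument for a dependency-specific or shared limit alongside the local one.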

Limits That Fail Open Silently

If the central rate-limit store is unavailable, should the service allow everything or reject everything?

There is no universal answer. A public login endpoint, a billing write, and an internal analytics endpoint may need different failure modes. What is dangerous is not deciding, then discovering during an incident that the limiter failed open on the path that most needed protection.
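Making the decision explicit can be as small as a per-route table consulted when the store call fails. The routes and choices below are examples, not recommendations, and checkStore is a hypothetical stand-in for the call to a central rate-limit store. It is synchronous here only to keep the sketch short.

```typescript
type FailureMode = 'fail_open' | 'fail_closed'

// Hypothetical per-route policy for when the limiter store is unavailable.
const onStoreFailure = new Map<string, FailureMode>([
  ['POST /login', 'fail_closed'],
  ['POST /analytics/events', 'fail_open'],
])

// `checkStore` returns whether the request is within its limit and may
// throw when the store cannot be reached.
function admitWithFailureMode(
  route: string,
  checkStore: () => boolean
): { allowed: boolean; degraded: boolean } {
  try {
    return { allowed: checkStore(), degraded: false }
  } catch {
    // Routes without an entry fail closed, so a missing decision shows up
    // as rejections rather than as silent unlimited admission.
    const mode = onStoreFailure.get(route) ?? 'fail_closed'
    return { allowed: mode === 'fail_open', degraded: true }
  }
}
```

The degraded flag exists so that the fallback path is counted and alerted on, rather than discovered later.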

No Metrics On Rejected Work

Fast rejection can make latency dashboards look better while users are receiving 429 or 503.

Keep admitted traffic, rejected traffic, and successful useful work separate. Otherwise load shedding can hide the incident it is trying to survive.
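Keeping those numbers separate mostly means counting them in different places. The class below is an in-memory sketch with illustrative names. A real service would export the same counters through whatever metrics library it already uses.

```typescript
// Counts offered, admitted, rejected, and useful work separately so that
// fast rejections cannot hide inside a latency or success-rate average.
class AdmissionCounters {
  offered = 0
  admitted = 0
  useful = 0
  readonly rejectedByReason = new Map<string, number>()

  recordAdmitted(): void {
    this.offered++
    this.admitted++
  }

  recordRejected(reason: string): void {
    this.offered++
    this.rejectedByReason.set(reason, (this.rejectedByReason.get(reason) ?? 0) + 1)
  }

  // Call when admitted work finished successfully and in time.
  recordUseful(): void {
    this.useful++
  }

  get rejected(): number {
    let total = 0
    this.rejectedByReason.forEach((count) => {
      total += count
    })
    return total
  }
}
```

With these four numbers, offered always equals admitted plus rejected, and the gap between admitted and useful shows how much admitted work still failed or finished too late.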

Metrics That Tell You Whether Backpressure Is Working

Do not monitor only request count and error rate.

Track the shape of admission:

Metric	What it tells you
admitted requests by route and tenant	what work entered the system
rejected requests by reason	which policy is protecting the system
429 rate and 503 rate separately	policy limit vs temporary service overload
in-flight requests by service and route	whether concurrency is approaching saturation
dependency latency before rejection starts	whether backpressure activates early enough
queue depth and oldest runnable job age	whether async paths are draining or hiding overload
goodput	useful successful work, not just attempted work
retry rate after overload responses	whether clients are respecting the signal
limiter false-positive review	whether limits are rejecting work while capacity exists

A healthy overload-control system does not make rejection disappear. It makes rejection intentional, bounded, visible, and recoverable.

The most important chart often shows offered load, admitted load, rejected load, and goodput on the same timeline. If offered load rises and admitted load stays inside capacity while goodput remains stable, the limit is doing its job. If rejected load rises while goodput falls, the system is probably shedding too late, protecting the wrong resource, or being hammered by retries.

Rollout Checklist

Start with observation before enforcement:

Identify the resource that actually saturates: CPU, database pool, dependency, worker pool, queue, memory, or tenant budget.
Add shadow-mode decisions that log what would have been rejected.
Split metrics by route, tenant, dependency, and rejection reason.
Add soft limits on non-critical paths first.
Return clear 429 or 503 responses with Retry-After where clients can use it.
Add client retry guidance and enforce retry budgets for internal clients.
Load test past the saturation point and confirm goodput plateaus instead of collapsing.
Decide which limits fail open, which fail closed, and which use a local fallback.
Document what gets shed first during overload.
Review the policy after real traffic, because the first limit is rarely the final limit.
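The shadow-mode step in the checklist can be sketched as a thin wrapper around whatever limiter is being rolled out. The names below are hypothetical, and the log callback stands in for the service's real logging or metrics call.

```typescript
type LimiterMode = 'shadow' | 'enforce'

interface ShadowDecision {
  allowed: boolean // what the caller should act on
  wouldReject: boolean // what the limiter decided
}

// In shadow mode the limiter's decision is logged but never enforced, so
// the policy can be compared against real traffic before it rejects anything.
function decideWithMode(
  mode: LimiterMode,
  limiterAllows: boolean,
  log: (message: string) => void
): ShadowDecision {
  const wouldReject = !limiterAllows
  if (wouldReject && mode === 'shadow') log('shadow_reject')
  return { allowed: mode === 'shadow' ? true : limiterAllows, wouldReject }
}
```

Switching a route from shadow to enforce then becomes a configuration change, made after the logged would-be rejections have been reviewed.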

Do not ship rate limiting as a lonely middleware change. Ship it as a policy, a client contract, and an operational signal.

The Short Version

Rate limiting controls budgets.

Backpressure communicates capacity.

Load shedding preserves useful work when the system is beyond safe capacity.

Microservices need all three because overload is rarely contained to one process. It moves through clients, queues, dependency pools, retries, and background workers.

The goal is not to reject as much traffic as possible. The goal is to admit the work the system can finish, reject or defer the work that would make recovery harder, and make those decisions visible enough that operators and clients can cooperate under stress.