
Rate Limiting and Backpressure in Microservices
Rate limiting and backpressure in microservices are not just ways to reject traffic. They are ways to keep a system inside a range where it can still do useful work.
The common failure is admitting more work than the service, its dependencies, or its workers can finish. Timeouts then fire, clients retry, queues grow, and the system spends its remaining capacity on work that may no longer matter. At that point the problem is no longer one slow endpoint. It is uncontrolled admission.
For the broader reliability cluster around timeouts, retries, circuit breakers, queues, and recovery patterns, see the Backend Reliability hub.
The Failure Rate Limiting Is Meant To Prevent
Imagine a checkout API that usually receives 250 requests per second.
During a promotion, traffic jumps to 1,200 requests per second. The checkout API scales horizontally, so the load balancer keeps accepting requests. Each checkout request calls inventory, pricing, payment risk, and email scheduling. The inventory service is the tightest dependency. It can complete about 500 useful requests per second before latency climbs sharply.
Without admission control, the first symptom may not be an error. It may be rising latency.
| Minute | What happens | Why it matters |
|---|---|---|
| 0 | Promotion starts; ingress accepts the burst | No boundary asks whether downstream capacity exists |
| 2 | Inventory p95 latency rises from 80 ms to 900 ms | Requests are queueing inside the dependency |
| 4 | Checkout timeouts start firing | Callers give up, but inventory keeps working |
| 5 | Clients and upstream services retry failed checkout attempts | New work competes with already queued work |
| 7 | Worker queues grow because completed checkouts schedule emails | Async paths absorb the overload after the request |
| 9 | Circuit breakers open on some callers but not others | Traffic shifts and recovery becomes uneven |
| 12 | Goodput falls even though raw request volume remains very high | The system is busy but less useful |
This is the overload pattern described in When Timeouts Didn't Prevent Cascading Failures. Timeouts limit how long callers wait. They do not decide how much work enters the system.
AWS's Builders Library article on using load shedding to avoid overload makes the important distinction between throughput and goodput: a service can process or attempt a high volume of work while the amount of useful, timely work falls.
Rate limiting and backpressure exist to preserve goodput.
Rate Limiting, Backpressure, And Load Shedding Are Different Controls
These terms often get mixed together because they all reduce traffic. The differences matter when you are deciding where to put the control.
| Control | Question it answers | Typical place |
|---|---|---|
| Rate limiting | Is this caller, tenant, route, or dependency over budget? | API gateway, service boundary, client library |
| Backpressure | Can the downstream path accept more work right now? | Service-to-service calls, queues, streams, workers |
| Load shedding | Which work should be rejected or degraded under overload? | Server, gateway, worker, or dependency boundary |
Rate limiting is usually policy based. A tenant may be allowed 100 write requests per second. An expensive endpoint may be allowed fewer requests than a cheap read endpoint. A dependency may receive only a fixed amount of traffic from each caller.
Backpressure is capacity feedback. A worker pool is full. A queue is aging. A dependency has high latency. A stream consumer cannot drain fast enough. The upstream component must slow down, reject, or degrade before the downstream path becomes unstable.
Load shedding is the survival decision. When the system is beyond safe capacity, it chooses not to do some work so it can keep doing more important work.
In a healthy design, these controls cooperate:
- ingress rate limits protect shared service capacity and fairness
- per-dependency limits prevent one slow downstream from consuming every worker
- bounded queues prevent memory from becoming the overload buffer
- load shedding keeps the service responsive enough to recover
- retry budgets prevent rejected work from returning immediately as more traffic
That last point connects directly to Retry Budgets in Microservices. If overload responses cause immediate retries, admission control becomes a traffic amplifier with nicer status codes.
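The retry-budget idea can be sketched as a ratio limit: retries are allowed only while they stay below a fraction of recent first attempts. This is a minimal sketch, not a production implementation. The names `RetryBudget`, `ratio`, `minRetries`, and `resetWindow` are hypothetical, and a real budget would normally use a sliding time window instead of a manual reset.

```typescript
// Hypothetical retry budget: retries may not exceed a fixed fraction of first attempts.
class RetryBudget {
  private attempts = 0
  private retries = 0
  private readonly ratio: number
  private readonly minRetries: number

  constructor(ratio: number, minRetries: number) {
    this.ratio = ratio // e.g. 0.5 allows one retry per two first attempts
    this.minRetries = minRetries // small floor so low-traffic callers can still retry
  }

  // Call once for every first attempt.
  recordAttempt(): void {
    this.attempts++
  }

  // Call before retrying; returns false when the budget is exhausted.
  tryRetry(): boolean {
    const allowed = Math.max(this.minRetries, Math.floor(this.attempts * this.ratio))
    if (this.retries >= allowed) return false
    this.retries++
    return true
  }

  // Assumed to be called at the start of each accounting window.
  resetWindow(): void {
    this.attempts = 0
    this.retries = 0
  }
}
```

With a budget like this, a burst of 503 responses produces at most a bounded number of extra requests, so rejection does not turn into amplification.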
Edge Rate Limits Are Not Enough
An API gateway limit is useful, but it is not a complete overload strategy.
The gateway usually knows the client, route, and request rate. It may not know that one route became expensive because the cache is cold, one database query changed plan, one dependency is slow, or one tenant is hitting a rare path that creates five downstream calls per request.
For example:
| Boundary | What it can see | What it may miss |
|---|---|---|
| CDN or WAF | IP, path, coarse request rate | authenticated tenant, downstream cost |
| API gateway | API key, route, method, request count | database pressure, worker saturation, dependency lag |
| Service instance | local CPU, in-flight work, dependency latency | global traffic distribution across all instances |
| Dependency client | error rate, latency, timeout rate for one path | tenant fairness or business priority |
| Queue producer/consumer | backlog, job age, worker drain rate | synchronous request pressure that created the backlog |
A practical design uses layered admission:
- A coarse edge limit blocks abusive or accidental bursts.
- A per-tenant or per-account limit preserves fairness.
- A route or operation budget accounts for expensive paths.
- A local concurrency limit protects each service instance.
- A dependency-specific limit protects scarce downstream capacity.
- A queue policy decides whether to enqueue, defer, or reject async work.
The mistake is assuming one global requests-per-second number can represent capacity. Microservices rarely have one capacity. They have capacity per route, per dependency, per worker pool, per tenant, and per failure mode.
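The layered admission list above can be sketched as an ordered chain of checks that stops at the first rejection. This is an illustrative sketch under stated assumptions: `Layer`, `admitThroughLayers`, the `signals` snapshot, and the thresholds are hypothetical, and the checks only read state. A real limiter would also reserve and release tokens or slots.

```typescript
type Verdict = { allowed: true } | { allowed: false; layer: string; status: 429 | 503 }

// Each layer inspects the request and either passes it (true) or rejects it (false).
type Layer<R> = { name: string; status: 429 | 503; check: (req: R) => boolean }

// Runs layers in order and stops at the first rejection, so coarse,
// cheap checks can run before specific, expensive ones.
function admitThroughLayers<R>(req: R, layers: Layer<R>[]): Verdict {
  for (const layer of layers) {
    if (!layer.check(req)) {
      return { allowed: false, layer: layer.name, status: layer.status }
    }
  }
  return { allowed: true }
}

// Hypothetical snapshot of the signals each layer would read in a real service.
const signals = { tenantTokens: 3, localInFlight: 120, inventoryInFlight: 80 }

const checkoutLayers: Layer<{ cost: number }>[] = [
  { name: 'tenant_budget', status: 429, check: (req) => signals.tenantTokens >= req.cost },
  { name: 'local_concurrency', status: 503, check: () => signals.localInFlight < 300 },
  { name: 'inventory_limit', status: 503, check: () => signals.inventoryInFlight < 80 },
]

// The tenant and the instance are fine, but inventory is at its limit.
const verdict = admitThroughLayers({ cost: 1 }, checkoutLayers)
```

Because each layer carries its own name and status, the rejection reason stays attributable to one boundary instead of collapsing into a single global limit.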
A Concrete Admission Model
A useful admission decision should answer three questions:
- Is the caller allowed to send this much work?
- Is this service instance healthy enough to accept the work?
- Is the downstream path needed by this request healthy enough to accept more work?
Here is a simplified TypeScript-style example. It combines a tenant token bucket, a local in-flight request cap, and a dependency pressure check.
// CheckoutRequest, getTenantBucket, and inventoryPressure are assumed to be defined elsewhere.
type AdmissionResult =
  | { allowed: true }
  | {
      allowed: false
      status: 429 | 503
      reason: string
      retryAfterSeconds: number
    }

class TokenBucket {
  private tokens: number
  private lastRefillMs = Date.now()

  constructor(
    private readonly refillPerSecond: number,
    private readonly burst: number
  ) {
    this.tokens = burst
  }

  trySpend(cost = 1): boolean {
    // Refill lazily from elapsed time, capped at the burst size.
    const now = Date.now()
    const elapsedSeconds = (now - this.lastRefillMs) / 1000
    this.tokens = Math.min(this.burst, this.tokens + elapsedSeconds * this.refillPerSecond)
    this.lastRefillMs = now
    if (this.tokens < cost) return false
    this.tokens -= cost
    return true
  }
}

class InFlightLimiter {
  private active = 0

  constructor(private readonly maxActive: number) {}

  tryEnter() {
    if (this.active >= this.maxActive) return false
    this.active++
    return true
  }

  leave() {
    this.active = Math.max(0, this.active - 1)
  }
}

const tenantBuckets = new Map<string, TokenBucket>()
const localLimiter = new InFlightLimiter(300)

// On success the caller holds an in-flight slot and must call
// localLimiter.leave() when the request finishes.
function admitCheckoutRequest(req: CheckoutRequest): AdmissionResult {
  const tenantBucket = getTenantBucket(req.tenantId, tenantBuckets)
  // Large carts spend more of the tenant budget than small ones.
  const requestCost = req.items.length > 20 ? 5 : 1

  if (!tenantBucket.trySpend(requestCost)) {
    return {
      allowed: false,
      status: 429,
      reason: 'tenant_rate_limit',
      retryAfterSeconds: 2,
    }
  }

  if (!localLimiter.tryEnter()) {
    return {
      allowed: false,
      status: 503,
      reason: 'checkout_inflight_limit',
      retryAfterSeconds: 1,
    }
  }

  // Read the dependency signal once so both checks see the same snapshot.
  const pressure = inventoryPressure()
  if (pressure.p95LatencyMs > 800 || pressure.queueDepth > 2000) {
    localLimiter.leave() // release the slot taken above
    return {
      allowed: false,
      status: 503,
      reason: 'inventory_backpressure',
      retryAfterSeconds: 5,
    }
  }

  return { allowed: true }
}
This is not a complete production limiter. It is the shape of the decision.
The important part is that 429 and 503 mean different things:
- 429 Too Many Requests means the caller exceeded a policy budget.
- 503 Service Unavailable means the service is temporarily unable to handle the request, often because of overload or maintenance.
RFC 6585 defines 429 as a status for too many requests in a given amount of time and says the response can include Retry-After. RFC 9110 defines 503 as temporary inability to handle a request due to overload or maintenance, and its Retry-After section explains how a server can suggest a delay before the client retries.
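On the client side, honoring Retry-After means handling both forms that RFC 9110 allows: a non-negative integer number of seconds, or an HTTP-date. The helper below is a minimal sketch. The name `retryAfterToDelayMs` is hypothetical, and it relies on `Date.parse` accepting the HTTP-date format, which common JavaScript engines do but which is worth verifying in your runtime.

```typescript
// Converts a Retry-After header value into a delay in milliseconds.
// Returns null when the value cannot be interpreted, so the caller can
// fall back to its own backoff policy.
function retryAfterToDelayMs(value: string, nowMs: number): number | null {
  const trimmed = value.trim()
  // Form 1: delay in whole seconds, e.g. "5".
  if (/^\d+$/.test(trimmed)) {
    return parseInt(trimmed, 10) * 1000
  }
  // Form 2: an HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed)
  if (isNaN(dateMs)) return null
  // A date in the past means the client may retry immediately.
  return Math.max(0, dateMs - nowMs)
}
```

Passing the current time in as `nowMs` keeps the function deterministic, which makes it easy to test without mocking the clock.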
Do not use the status code as the whole policy. The response should also expose enough information for clients and operators to behave correctly:
HTTP/1.1 503 Service Unavailable
Content-Type: application/json
Retry-After: 5

{
  "error": "service_overloaded",
  "message": "Checkout is temporarily overloaded. Retry after the advertised delay.",
  "retryable": true
}
For public APIs, be careful with how much internal detail the response reveals. For internal service calls, include a machine-readable reason so clients can distinguish tenant limits, dependency backpressure, and system overload.
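One way to apply that split is to build the response from the rejection and include the specific reason only for internal callers. This is a sketch under stated assumptions: `Rejection`, `HttpReply`, `toHttpReply`, and the `rate_limited` error name are hypothetical, and the reply is a plain object, not any particular framework's response type.

```typescript
type Rejection = { status: 429 | 503; reason: string; retryAfterSeconds: number }
type HttpReply = { status: number; headers: { [name: string]: string }; body: string }

function toHttpReply(rejection: Rejection, audience: 'public' | 'internal'): HttpReply {
  const body: { [key: string]: string | boolean } = {
    error: rejection.status === 429 ? 'rate_limited' : 'service_overloaded',
    retryable: true,
  }
  // Only internal callers see the machine-readable reason, so public
  // responses do not reveal which dependency is under pressure.
  if (audience === 'internal') {
    body['reason'] = rejection.reason
  }
  return {
    status: rejection.status,
    headers: {
      'Content-Type': 'application/json',
      'Retry-After': String(rejection.retryAfterSeconds),
    },
    body: JSON.stringify(body),
  }
}
```

The same rejection therefore produces the same status and Retry-After for every audience, while the amount of diagnostic detail depends on who is asking.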
Backpressure Should Protect Specific Resources
Backpressure becomes weak when it is only a global flag named busy.
A service can be healthy overall while one dependency is saturated. It can have spare CPU while its database connection pool is exhausted. It can process cheap reads while expensive writes are unsafe. Good backpressure protects the specific resource that is under pressure.
Examples:
| Resource under pressure | Backpressure signal | Safer action |
|---|---|---|
| Database pool | connections in use, wait time | reject expensive writes, preserve reads if possible |
| Dependency client | p95 latency, timeout rate, open pool | cap calls to that dependency |
| Worker pool | active workers, queue age | stop accepting new background work |
| Memory | heap pressure, queue size | reject or shed low-priority requests early |
| Tenant budget | tokens exhausted | return 429 for that tenant, not for everyone |
| Expensive endpoint | route-specific saturation | reduce only that route's admission |
This is why per-dependency concurrency limits are often more useful than one service-wide request limit.
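// At most 80 inventory calls may be in progress from this instance.
const inventoryLimiter = new InFlightLimiter(80)

async function callInventoryWithBackpressure(req: InventoryRequest) {
  // Fail fast instead of queueing behind a saturated dependency.
  if (!inventoryLimiter.tryEnter()) {
    throw new DependencyBackpressureError('inventory dependency saturated')
  }
  try {
    return await inventoryClient.reserve(req, { timeoutMs: 700 })
  } finally {
    // Always release the slot, whether the call succeeded, failed, or timed out.
    inventoryLimiter.leave()
  }
}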
The caller can map DependencyBackpressureError to a controlled 503, a degraded response, or an async fallback depending on the product path.
A concurrency limit is different from a circuit breaker (see Circuit Breaker Pattern in Microservices). A circuit breaker asks whether a dependency should be considered unhealthy enough to stop calling for a while. A concurrency limit asks how many calls may be in progress right now. They often belong together, but they are not the same control.
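To make the distinction concrete, the sketch below keeps the two controls as separate objects and consults both before a call. It is deliberately simplified, and `SimpleBreaker`, `SlotLimiter`, and `tryAcquire` are hypothetical names. The breaker opens after a number of consecutive failures and allows calls again after a cool-down, without a strict single-probe half-open state.

```typescript
// Answers: has this dependency failed enough that we should stop calling it for a while?
class SimpleBreaker {
  private consecutiveFailures = 0
  private openedAtMs: number | null = null
  private readonly failureThreshold: number
  private readonly coolDownMs: number

  constructor(failureThreshold: number, coolDownMs: number) {
    this.failureThreshold = failureThreshold
    this.coolDownMs = coolDownMs
  }

  allows(nowMs: number): boolean {
    if (this.openedAtMs === null) return true
    return nowMs - this.openedAtMs >= this.coolDownMs // cool-down elapsed: allow probes
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0
    this.openedAtMs = null
  }

  recordFailure(nowMs: number): void {
    this.consecutiveFailures++
    if (this.consecutiveFailures >= this.failureThreshold) this.openedAtMs = nowMs
  }
}

// Answers: how many calls may be in progress right now?
class SlotLimiter {
  private active = 0
  private readonly maxActive: number

  constructor(maxActive: number) {
    this.maxActive = maxActive
  }

  tryEnter(): boolean {
    if (this.active >= this.maxActive) return false
    this.active++
    return true
  }

  leave(): void {
    this.active = Math.max(0, this.active - 1)
  }
}

type Gate = 'ok' | 'breaker_open' | 'limit_reached'

// Check health first, then capacity. On 'ok' the caller holds a slot and must call leave().
function tryAcquire(breaker: SimpleBreaker, limiter: SlotLimiter, nowMs: number): Gate {
  if (!breaker.allows(nowMs)) return 'breaker_open'
  if (!limiter.tryEnter()) return 'limit_reached'
  return 'ok'
}
```

Returning distinct reasons lets the caller report an unhealthy dependency separately from a healthy but fully occupied one.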
Queues Need Backpressure Too
Queues are useful because they decouple request latency from background work. They are dangerous when they become an infinite buffer for overload.
If producers can enqueue faster than consumers can drain, the queue becomes a delayed outage. The request path looks healthy for a while, but job age grows, retries accumulate, and users eventually notice stale emails, delayed webhooks, missed exports, or expired work.
A queue producer should usually ask:
- How deep is the queue?
- How old is the oldest runnable job?
- Is the worker error rate rising?
- Is the job still valuable if it starts later?
- Is this job critical enough to displace lower-priority work?
For low-priority work, rejecting or degrading early can be better than accepting a job that will run too late to matter.
async function enqueueReceiptEmail(order: Order) {
  const health = await emailQueueHealth()

  if (health.oldestRunnableAgeSeconds > 300) {
    return {
      accepted: false,
      reason: 'email_queue_backpressure',
      userMessage: 'Receipt email may be delayed. Order confirmation is available in your account.',
    }
  }

  if (health.depth > 50_000 && order.priority !== 'critical') {
    return {
      accepted: false,
      reason: 'non_critical_email_shed',
      userMessage: 'Order completed. Non-critical email was skipped during high load.',
    }
  }

  await emailQueue.enqueue({
    type: 'send_receipt_email',
    orderId: order.id,
    priority: order.priority,
  })

  return { accepted: true }
}
That example is intentionally product-specific. Backpressure is not only an infrastructure setting. It is a business decision about which work stays valuable under delay.
For more on replay-safe workers, retries, dead-letter handling, and queue health, see Background Jobs in Production.
Decide What Gets Protected First
A rate limit without priority is a blunt instrument.
During overload, some work is more important than other work:
| Work type | Typical priority | Possible overload behavior |
|---|---|---|
| Health checks | High | keep cheap and isolated from user traffic |
| User login | High | preserve, but cap expensive fraud checks |
| Checkout payment confirmation | High | preserve with strict dependency limits |
| Search autocomplete | Medium | degrade result quality or use cached results |
| Analytics event ingestion | Low to medium | sample, batch, or drop when queues are saturated |
| Recommendation refresh | Low | defer or skip |
| Bulk export | Low | queue behind interactive work or return a "try later" response |
The worst overload policy treats all work equally. It lets cheap, low-value, or retry-generated requests compete with critical user actions.
Good policies usually include:
- tenant or account isolation so one customer cannot consume all shared capacity
- endpoint costs so expensive requests spend more budget than cheap requests
- separate pools for critical and non-critical work
- dependency-specific limits so one downstream failure does not stall the whole service
- explicit shedding order so the team knows what degrades first
- client retry guidance so rejected work does not return immediately
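An explicit shedding order can be as simple as a table of utilization thresholds, one per priority class. The sketch below assumes utilization is measured as in-flight work divided by the local limit. The threshold values and the names `shedAbove`, `utilization`, and `shouldShed` are hypothetical and would need to come from saturation testing.

```typescript
type Priority = 'high' | 'medium' | 'low'

// Hypothetical shedding order: the utilization (0 to 1) at which each class starts being rejected.
const shedAbove: { high: number; medium: number; low: number } = {
  high: 0.98,
  medium: 0.85,
  low: 0.7,
}

function utilization(inFlight: number, maxInFlight: number): number {
  return maxInFlight <= 0 ? 1 : inFlight / maxInFlight
}

// Low-priority work is shed first; high-priority work is shed only near full saturation.
function shouldShed(priority: Priority, currentUtilization: number): boolean {
  return currentUtilization >= shedAbove[priority]
}

// At 225 of 300 in-flight requests (75%), only low-priority work is shed.
const current = utilization(225, 300)
const shedLow = shouldShed('low', current)
const shedMedium = shouldShed('medium', current)
```

Keeping the order in one small table also serves as documentation: the team can read off what degrades first.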
Google's SRE chapter on handling overload discusses client-side throttling and adaptive throttling as ways to stop clients from continuing to send traffic after the backend is already rejecting work. That matters because server-side limits are only half of the system. Clients must cooperate too.
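The adaptive throttling described in that chapter has the client reject some requests locally with probability max(0, (requests - K × accepts) / (requests + 1)), where requests and accepts are counted over a recent window and K controls how aggressive the throttle is. The sketch below implements only that formula. The class and method names are hypothetical, the sliding window is omitted for brevity, and the random number is passed in so the behavior is deterministic in tests.

```typescript
class AdaptiveThrottle {
  private requests = 0 // requests the application attempted, including locally rejected ones
  private accepts = 0 // requests the backend accepted
  private readonly k: number

  constructor(k: number) {
    this.k = k // e.g. 2; lower values throttle more aggressively
  }

  rejectionProbability(): number {
    return Math.max(0, (this.requests - this.k * this.accepts) / (this.requests + 1))
  }

  // `random` is expected to be uniform in [0, 1), e.g. Math.random().
  shouldSend(random: number): boolean {
    return random >= this.rejectionProbability()
  }

  // Record the backend outcome of a request that was actually sent.
  recordSent(accepted: boolean): void {
    this.requests++
    if (accepted) this.accepts++
  }

  // Record a request that was rejected locally without being sent.
  recordLocalReject(): void {
    this.requests++
  }
}
```

While the backend accepts at least 1/K of what the client sends, the probability stays at zero and nothing is throttled. As the accept rate drops, the client sheds an increasing share of its own traffic before it reaches the network.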
Common Mistakes
Limits Based On Average Traffic
Average traffic is not capacity.
If a service handles 200 requests per second on an ordinary day, that does not mean a 400 requests-per-second limit is safe. The expensive path may saturate at 120 requests per second when a cache is cold or a dependency is slow.
Set limits from saturation tests and production signals, not only historical averages.
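One way to turn a saturation-test result into a starting concurrency limit is Little's law: average in-flight work equals throughput multiplied by average latency. The sketch below applies it to the inventory numbers from the earlier checkout example. The function names and the headroom factor are hypothetical, and the law holds for averages, so feeding it a p95 latency gives only a rough upper estimate.

```typescript
// Little's law: average in-flight work = throughput × average latency.
function estimateInFlight(throughputPerSecond: number, latencyMs: number): number {
  return (throughputPerSecond * latencyMs) / 1000
}

// Adds headroom for bursts and rounds up to a whole number of slots.
function suggestConcurrencyLimit(
  throughputPerSecond: number,
  latencyMs: number,
  headroom: number
): number {
  return Math.ceil(estimateInFlight(throughputPerSecond, latencyMs) * headroom)
}

// 500 requests per second at 80 ms keeps about 40 calls in flight.
const healthyInFlight = estimateInFlight(500, 80)
// With 50% headroom, a starting limit would be 60.
const startingLimit = suggestConcurrencyLimit(500, 80, 1.5)
// The same throughput at 900 ms would need 450 calls in flight.
const degradedInFlight = estimateInFlight(500, 900)
```

The jump from 40 to 450 in-flight calls is why a concurrency cap reacts to rising latency even when the request rate has not changed.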
One Global Limit For Every Route
One request to GET /products/:id may be cheap. One request to POST /checkout may reserve inventory, call a payment service, write several rows, and enqueue jobs.
Treating both as one token makes the policy easy to configure and hard to trust.
Unbounded Queues
An unbounded queue is not backpressure. It is a promise to fail later.
Bound queue depth, bound job age, and decide what happens when the queue cannot accept more useful work.
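A minimal in-memory sketch of both bounds is shown below. `BoundedJobQueue` and its methods are hypothetical names. The queue refuses new jobs at a maximum depth and discards jobs that are too old when a consumer polls. A production queue would normally dead-letter or at least report expired jobs, not only count them.

```typescript
type Job<T> = { payload: T; enqueuedAtMs: number }

class BoundedJobQueue<T> {
  private jobs: Job<T>[] = []
  private expired = 0
  private readonly maxDepth: number
  private readonly maxAgeMs: number

  constructor(maxDepth: number, maxAgeMs: number) {
    this.maxDepth = maxDepth
    this.maxAgeMs = maxAgeMs
  }

  // Returns false when the queue is full, so the producer must decide what to do.
  offer(payload: T, nowMs: number): boolean {
    if (this.jobs.length >= this.maxDepth) return false
    this.jobs.push({ payload, enqueuedAtMs: nowMs })
    return true
  }

  // Returns the oldest job that is still within the age bound, skipping expired ones.
  poll(nowMs: number): T | undefined {
    let job = this.jobs.shift()
    while (job !== undefined) {
      if (nowMs - job.enqueuedAtMs <= this.maxAgeMs) return job.payload
      this.expired++ // too old to be useful: count it and move on
      job = this.jobs.shift()
    }
    return undefined
  }

  depth(): number {
    return this.jobs.length
  }

  expiredCount(): number {
    return this.expired
  }
}
```

The `false` return from `offer` is the backpressure signal: it forces the producer to reject, defer, or degrade at enqueue time, before the work has silently aged out.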
Per-Instance Limits That Ignore Fleet Size
A local in-flight limit protects each instance, but the fleet-wide limit changes when autoscaling adds or removes instances.
That can be fine if the limit is intentionally local, but be explicit. A fleet of 20 instances with a local limit of 300 admits up to 6,000 in-flight requests. A fleet of 5 admits 1,500. If a dependency can handle only 2,000 concurrent calls from the whole service, local limits alone are not enough.
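The arithmetic above suggests deriving the local limit from the dependency's fleet-wide budget instead of fixing it per instance. The sketch below assumes the instance count is available (for example from service discovery) and that traffic is spread evenly across instances. `localLimitFor` is a hypothetical name.

```typescript
// Splits a fleet-wide dependency budget across instances, never exceeding
// what a single instance can safely handle on its own.
function localLimitFor(dependencyBudget: number, instanceCount: number, localCap: number): number {
  if (instanceCount <= 0) return 0
  return Math.min(localCap, Math.floor(dependencyBudget / instanceCount))
}

// 20 instances sharing a budget of 2,000 concurrent calls: 100 each.
const limitAtTwenty = localLimitFor(2000, 20, 300)
// 5 instances: the per-instance cap of 300 applies, 1,500 in total.
const limitAtFive = localLimitFor(2000, 5, 300)
```

If traffic is not evenly balanced, a static split can leave capacity unused on quiet instances, which is one reason some teams use a shared limiter or token leases.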
Limits That Fail Open Silently
If the central rate-limit store is unavailable, should the service allow everything or reject everything?
There is no universal answer. A public login endpoint, a billing write, and an internal analytics endpoint may need different failure modes. What is dangerous is not deciding, then discovering during an incident that the limiter failed open on the path that most needed protection.
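Making that decision explicit can be as small as a per-route failure mode that is consulted whenever the central check throws. This is a sketch under stated assumptions: `FailureMode`, `checkWithFailureMode`, and the callback shapes are hypothetical, and the central check is modelled as a synchronous function for brevity.

```typescript
type FailureMode = 'fail_open' | 'fail_closed' | 'local_fallback'

// Returns true to admit. Assumed to throw when the central store is unavailable.
type LimitCheck = () => boolean

function checkWithFailureMode(
  central: LimitCheck,
  mode: FailureMode,
  localFallback: LimitCheck,
  onStoreError: (mode: FailureMode) => void
): boolean {
  try {
    return central()
  } catch {
    // Always surface the failure so the chosen mode is never applied silently.
    onStoreError(mode)
    if (mode === 'fail_open') return true
    if (mode === 'fail_closed') return false
    return localFallback() // e.g. a conservative in-process limiter
  }
}
```

A route table could then map a login endpoint to `fail_closed` or `local_fallback` and an internal analytics endpoint to `fail_open`, which turns the incident-time surprise into a reviewed configuration choice.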
No Metrics On Rejected Work
Fast rejection can make latency dashboards look better while users are receiving 429 or 503.
Keep admitted traffic, rejected traffic, and successful useful work separate. Otherwise load shedding can hide the incident it is trying to survive.
Metrics That Tell You Whether Backpressure Is Working
Do not monitor only request count and error rate.
Track the shape of admission:
| Metric | What it tells you |
|---|---|
| admitted requests by route and tenant | what work entered the system |
| rejected requests by reason | which policy is protecting the system |
| 429 rate and 503 rate separately | policy limit vs temporary service overload |
| in-flight requests by service and route | whether concurrency is approaching saturation |
| dependency latency before rejection starts | whether backpressure activates early enough |
| queue depth and oldest runnable job age | whether async paths are draining or hiding overload |
| goodput | useful successful work, not just attempted work |
| retry rate after overload responses | whether clients are respecting the signal |
| limiter false-positive review | whether limits are rejecting work while capacity exists |
A healthy overload-control system does not make rejection disappear. It makes rejection intentional, bounded, visible, and recoverable.
The most important chart often shows offered load, admitted load, rejected load, and goodput on the same timeline. If offered load rises and admitted load stays inside capacity while goodput remains stable, the limit is doing its job. If rejected load rises while goodput falls, the system is probably shedding too late, protecting the wrong resource, or being hammered by retries.
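The four series on that chart can come from a small set of counters kept at the admission point. The sketch below is illustrative only. `AdmissionCounters` and its methods are hypothetical, and a real service would export these through its metrics library with route and tenant labels.

```typescript
class AdmissionCounters {
  offered = 0 // everything that arrived
  admitted = 0 // what was allowed in
  goodput = 0 // admitted work that finished successfully and in time
  private rejectedByReason: { [reason: string]: number } = {}

  recordAdmitted(): void {
    this.offered++
    this.admitted++
  }

  recordRejected(reason: string): void {
    this.offered++
    this.rejectedByReason[reason] = (this.rejectedByReason[reason] || 0) + 1
  }

  // Call when an admitted request finishes.
  recordCompleted(succeededInTime: boolean): void {
    if (succeededInTime) this.goodput++
  }

  rejected(reason: string): number {
    return this.rejectedByReason[reason] || 0
  }

  totalRejected(): number {
    return this.offered - this.admitted
  }
}
```

Counting goodput separately from admitted work is the important part: a request that was admitted but timed out still consumed capacity without producing value.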
Rollout Checklist
Start with observation before enforcement:
- Identify the resource that actually saturates: CPU, database pool, dependency, worker pool, queue, memory, or tenant budget.
- Add shadow-mode decisions that log what would have been rejected.
- Split metrics by route, tenant, dependency, and rejection reason.
- Add soft limits on non-critical paths first.
- Return clear 429 or 503 responses with Retry-After where clients can use it.
- Add client retry guidance and enforce retry budgets for internal clients.
- Load test past the saturation point and confirm goodput plateaus instead of collapsing.
- Decide which limits fail open, which fail closed, and which use a local fallback.
- Document what gets shed first during overload.
- Review the policy after real traffic, because the first limit is rarely the final limit.
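The shadow-mode step in the checklist can be sketched as a thin wrapper around the limiter's decision: in shadow mode the request is always admitted, but the would-be rejection is logged. The names `applyMode`, `Decision`, and `Mode`, and the log line format, are hypothetical.

```typescript
type Decision = { allowed: boolean; reason: string }
type Mode = 'shadow' | 'enforce'

// Returns whether the request should actually proceed.
function applyMode(decision: Decision, mode: Mode, log: (line: string) => void): boolean {
  if (decision.allowed) return true
  if (mode === 'shadow') {
    // Observe only: record what enforcement would have done, then admit.
    log('would_reject reason=' + decision.reason)
    return true
  }
  log('rejected reason=' + decision.reason)
  return false
}
```

Running in shadow mode against production traffic shows how often each limit would fire, and for which tenants and routes, before any caller sees a 429 or 503.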
Do not ship rate limiting as a lonely middleware change. Ship it as a policy, a client contract, and an operational signal.
The Short Version
Rate limiting controls budgets.
Backpressure communicates capacity.
Load shedding preserves useful work when the system is beyond safe capacity.
Microservices need all three because overload is rarely contained to one process. It moves through clients, queues, dependency pools, retries, and background workers.
The goal is not to reject as much traffic as possible. The goal is to admit the work the system can finish, reject or defer the work that would make recovery harder, and make those decisions visible enough that operators and clients can cooperate under stress.