Rate Limiting and Backpressure in Microservices: Prevent Cascading Failures Under Load

Situation

A system can have timeouts, retries, and circuit breakers configured correctly and still fail under load. The missing control is often admission: how much work is allowed into the system in the first place.

This is where rate limiting and backpressure matter. They prevent overload from spreading across service boundaries and turning partial degradation into full outage.


Why Timeouts and Retries Are Not Enough

Timeouts limit how long callers wait. Retries increase the number of attempts. Circuit breakers stop calls to unhealthy dependencies.

All three are useful, but none directly answers the core question: How much concurrent work should enter the system right now?

If request admission is unconstrained, queues grow, workers saturate, and latency rises everywhere. At that point, protective mechanisms activate late because resources are already consumed.


What Rate Limiting Does in Microservices

Rate limiting sets explicit request budgets over time. Instead of accepting unlimited traffic bursts, each boundary enforces policies such as:

  • requests per second per API key or tenant
  • global budget per service instance
  • stricter limits on expensive endpoints

The goal is not to block users randomly. The goal is to keep the system inside safe operating limits where latency and error rates remain predictable.
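
Per-tenant budgets like the first bullet above can be sketched with one token bucket per API key. This is an illustrative sketch, not a library API: the bucket logic mirrors the implementation pattern later in this article, and names like allowTenant and the default rates are assumptions.

```typescript
// One token bucket per tenant, so a noisy client exhausts only its own budget.
class SimpleBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private readonly ratePerSecond: number,
    private readonly burst: number
  ) {
    this.tokens = burst;
  }

  allow(): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.ratePerSecond);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

const tenantBuckets = new Map<string, SimpleBucket>();

// Lazily create a bucket per API key; defaults here are illustrative only.
function allowTenant(apiKey: string, ratePerSecond = 50, burst = 100): boolean {
  let bucket = tenantBuckets.get(apiKey);
  if (!bucket) {
    bucket = new SimpleBucket(ratePerSecond, burst);
    tenantBuckets.set(apiKey, bucket);
  }
  return bucket.allow();
}
```

Because each key maps to its own bucket, a global budget can still sit in front of this (a single shared bucket checked first) without one tenant starving the rest.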


What Backpressure Does (And Why It Is Different)

Backpressure signals upstream components to slow down when downstream capacity drops. It can be explicit (queue depth thresholds, concurrency caps) or protocol-level (stream flow control).

Rate limiting controls external and internal admission. Backpressure coordinates load between producers and consumers.

In practice, resilient systems need both:

  • rate limits at ingress and critical boundaries
  • backpressure inside async pipelines and service-to-service call paths
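
Inside an async pipeline, explicit backpressure can be sketched as a bounded queue whose producers suspend when it is full. The class below is illustrative (not a specific library): a full push awaits until a consumer pop frees a slot, so downstream slowness propagates upstream instead of growing memory.

```typescript
// Bounded queue: producers block (await) at capacity; consumers wake them.
class BoundedQueue<T> {
  private items: T[] = [];
  private waiters: Array<() => void> = [];

  constructor(private readonly capacity: number) {}

  async push(item: T): Promise<void> {
    while (this.items.length >= this.capacity) {
      // Backpressure point: suspend the producer until a slot opens.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    }
    this.items.push(item);
  }

  pop(): T | undefined {
    const item = this.items.shift();
    const waiter = this.waiters.shift();
    if (waiter) waiter(); // Wake one blocked producer to re-check capacity.
    return item;
  }

  get depth(): number {
    return this.items.length;
  }
}
```

The important property is that overload surfaces as a stalled producer, which is observable and bounded, rather than as an invisible, ever-growing buffer.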

Practical Implementation Pattern

Below is a minimal TypeScript example for a Node.js service that combines token-bucket admission with a concurrency cap that applies backpressure.

class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private readonly ratePerSecond: number,
    private readonly burst: number
  ) {
    this.tokens = burst;
  }

  allow(): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;

    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.ratePerSecond);
    this.lastRefill = now;

    if (this.tokens >= 1) {
      this.tokens -= 1;

      return true;
    }

    return false;
  }
}

const limiter = new TokenBucket(200, 400);
const MAX_IN_FLIGHT = 300;
let inFlight = 0;

export async function handleRequest(req, res) {
  if (!limiter.allow()) {
    res.statusCode = 429;
    res.setHeader('Retry-After', '1');
    res.end('Rate limit exceeded');

    return;
  }

  if (inFlight >= MAX_IN_FLIGHT) {
    // Backpressure: reject early before worker saturation cascades.
    res.statusCode = 503;
    res.setHeader('Retry-After', '1');
    res.end('Server busy');

    return;
  }

  inFlight++;

  try {
    // callDependencyWithTimeout is assumed to exist and enforce its own timeout.
    const result = await callDependencyWithTimeout(req);

    res.statusCode = 200;
    res.end(result);
  } catch {
    // Dependency failure: surface a gateway error instead of leaking the exception.
    res.statusCode = 502;
    res.end('Upstream error');
  } finally {
    inFlight--;
  }
}

This pattern does not maximize immediate throughput. It maximizes stability under stress.


Common Mistakes That Break Overload Protection

1) Limits That Are Too High to Matter

A limit above real capacity is functionally no limit. Tune from observed saturation points, not guesses.

2) No Per-Tenant Isolation

Without tenant-level budgets, one noisy client can consume shared capacity. Fairness is a reliability feature.

3) Unbounded Internal Queues

Large queues hide overload until latency explodes. Bound queue size and reject or defer early.

4) Retrying 429 and 503 Aggressively

If clients retry immediately, protective limits become a traffic amplifier. Use jittered exponential backoff and retry budgets.
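
Jittered exponential backoff with a capped attempt budget can be sketched as follows. This is a client-side sketch under assumptions: the helper names, base delay, and cap are illustrative, and "full jitter" means picking a uniform random delay below the exponential ceiling.

```typescript
// Full-jitter backoff: random delay in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(attempt: number, baseMs = 100, capMs = 5000): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(Math.random() * ceiling);
}

// Retry budget: at most maxAttempts calls total, with backoff between them,
// so rejected requests (429/503) are not amplified into a retry storm.
async function retryWithBudget<T>(
  fn: () => Promise<T>,
  maxAttempts = 4
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await new Promise((r) => setTimeout(r, backoffDelayMs(attempt)));
      }
    }
  }
  throw lastError;
}
```

The jitter matters as much as the exponent: without it, all clients that were rejected together retry together, recreating the original spike.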

5) No Observability on Rejections

If you only monitor success rate, you miss overload onset. Track admitted vs rejected traffic as first-class metrics.


Metrics to Track for Rate Limiting and Backpressure

At minimum, monitor:

  • request admission rate vs rejection rate (429, 503)
  • in-flight requests per service instance
  • queue depth and queue wait time
  • dependency latency (p95, p99) before and after limit changes
  • retry volume and retry success after backoff
  • tenant-level distribution (to detect unfair load concentration)

Healthy overload control looks like short, controlled degradation windows, not total collapse.
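
The first and last metrics above (admission vs rejection, per tenant) can be tracked with plain counters before any metrics backend is involved. This sketch is illustrative: the outcome labels and function names are assumptions, and in practice these counters would feed your metrics system rather than an in-process map.

```typescript
// Count admitted vs rejected requests per tenant so overload onset is
// visible before success rate degrades.
type Outcome = "admitted" | "rejected_429" | "rejected_503";

const counters = new Map<string, number>();

function recordAdmission(tenant: string, outcome: Outcome): void {
  const key = `${tenant}:${outcome}`;
  counters.set(key, (counters.get(key) ?? 0) + 1);
}

// Fraction of this tenant's traffic that was rejected (0 if no traffic yet).
function rejectionRate(tenant: string): number {
  const admitted = counters.get(`${tenant}:admitted`) ?? 0;
  const rejected =
    (counters.get(`${tenant}:rejected_429`) ?? 0) +
    (counters.get(`${tenant}:rejected_503`) ?? 0);
  const total = admitted + rejected;
  return total === 0 ? 0 : rejected / total;
}
```

A rising per-tenant rejection rate with a flat global success rate is exactly the unfair-load-concentration signal the last bullet asks for.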


How to Roll Out Safely

  1. Start in shadow mode: compute decisions, log what would be rejected.
  2. Enable soft limits first on non-critical paths.
  3. Add tenant budgets and expensive-endpoint budgets.
  4. Roll stricter limits gradually while watching saturation metrics.
  5. Document client retry behavior and required backoff policy.

Do not ship overload controls as isolated code changes. Ship them as policy plus observability plus client guidance.
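
Step 1 above, shadow mode, can be sketched as a thin wrapper: compute the limiter's decision, log what would have happened, and only enforce once a flag is flipped. The Limiter interface and function names here are illustrative assumptions.

```typescript
// Shadow-mode admission: log the decision; enforce only when the flag is on.
interface Limiter {
  allow(): boolean;
}

function admit(
  limiter: Limiter,
  enforce: boolean,
  log: (msg: string) => void
): boolean {
  const allowed = limiter.allow();
  if (!allowed) {
    log(enforce ? "rejected" : "would-reject (shadow mode)");
  }
  // In shadow mode every request is admitted; only the decision is recorded.
  return enforce ? allowed : true;
}
```

Comparing the "would-reject" log volume against real saturation metrics is what lets you tune limits before they ever turn a request away.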


FAQ

Is rate limiting the same as throttling?

They are closely related. Rate limiting defines a policy budget; throttling is how traffic is slowed or rejected when that budget is exceeded.

Should I return 429 or 503?

Use 429 Too Many Requests when a client exceeded policy limits. Use 503 Service Unavailable when the service is overloaded regardless of client identity.

Can backpressure replace circuit breakers?

No. Backpressure manages admission and flow; circuit breakers isolate unhealthy dependencies. They solve different failure modes and should be combined.


Closing Reflection

Most cascading failures are not caused by one bad dependency. They are caused by systems continuing to accept work past safe capacity.

Rate limiting and backpressure make capacity explicit. That single shift turns reliability from reactive firefighting into controlled behavior under load.