Circuit Breaker Pattern in Microservices

The circuit breaker pattern in microservices protects callers from dependencies that are already unlikely to respond successfully. It does not make the dependency healthier. It stops the caller from spending threads, connections, timeouts, and retry attempts on work that is probably going to fail.

That distinction matters. A circuit breaker is not a generic "resilience switch." It is a stateful decision at a dependency boundary: should this caller attempt the remote operation right now, or should it fail fast, use a fallback, or defer the work until recovery is more likely?

For the broader reliability cluster around timeouts, retries, admission control, queues, and outbox recovery, see the Backend Reliability hub.

The Failure Circuit Breakers Are Meant To Stop

Consider a checkout API that calls a pricing service on every request.

The pricing service starts responding slowly after a deployment. It still returns some successful responses, so the caller does not see a clean outage. Instead, p95 latency rises from 120 ms to 2.5 seconds, timeouts begin, and checkout workers spend more time waiting for pricing than doing useful work.

Without a circuit breaker, the failure can spread like this:

| Minute | What happens | Why it spreads |
| --- | --- | --- |
| 0 | Pricing latency starts rising | Checkout keeps sending all traffic |
| 2 | Checkout workers wait longer on pricing calls | Threads and connection pools stay occupied |
| 4 | Some checkout calls time out | Clients and upstream services retry |
| 5 | Retry traffic increases pricing load | The recovering dependency receives more work |
| 7 | Checkout queue depth rises | Healthy parts of checkout now wait behind bad calls |
| 9 | Other endpoints sharing the same pool slow down | The failure escapes the original request path |
| 12 | Operators see many symptoms but no single hard error | The system is degraded, not simply down |

A circuit breaker changes the failure mode. Once the pricing dependency crosses the configured failure or slow-call threshold, checkout stops calling pricing for a short period. Requests fail fast, return cached pricing where acceptable, or route into a controlled fallback path.

This is the same family of overload behavior covered in When Timeouts Didn't Prevent Cascading Failures. Timeouts bound waiting. Circuit breakers decide whether a dependency call should happen at all.

Martin Fowler's original write-up describes the pattern as wrapping a protected remote call, monitoring failures, and returning immediately once a threshold has tripped instead of continuing to make calls that are likely to fail. See Circuit Breaker.

Closed, Open, And Half-Open States

A circuit breaker usually has three operational states.

| State | What happens | Purpose |
| --- | --- | --- |
| Closed | Calls pass through to the dependency | Normal behavior while dependency looks good |
| Open | Calls are rejected immediately or routed to fallback | Protect caller and dependency during fault |
| Half-open | A small number of probe calls are allowed after a recovery wait | Test whether dependency has recovered |

Microsoft's Circuit Breaker pattern guidance describes the same model and separates it from retries: retries try an operation again because success is plausible; a circuit breaker blocks an operation that is likely to fail until recovery looks more likely. See Microsoft Learn: Circuit Breaker pattern.

The half-open state is the part teams often get wrong. It is not a gradual return to full traffic by itself. It is a probe. If every caller releases normal traffic the moment the wait period expires, a dependency that was recovering can be pushed back into failure.

What A Breaker Should Count

A useful breaker does not count every exception as a dependency failure.

For example:

| Outcome | Should it trip the breaker? | Reason |
| --- | --- | --- |
| Connection timeout to dependency | Yes | The dependency path is unavailable or too slow |
| 503 Service Unavailable from dependency | Usually yes | The dependency is explicitly overloaded or unavailable |
| Slow response above dependency latency budget | Usually yes | Slow calls can exhaust caller resources before hard errors |
| 429 Too Many Requests from dependency | Sometimes | It may indicate caller quota, dependency overload, or both |
| 404 Not Found for a product ID | No | This is a business outcome, not dependency health |
| Validation error from caller input | No | Retrying or opening a breaker does not help |
| Authorization failure | No | The dependency is reachable and behaving correctly |

The important rule is simple: the breaker should represent dependency health, not business success.

If a user enters an invalid coupon code, the coupon service is not unhealthy. If pricing returns a normal "product not found" response, the pricing service is not unhealthy. Counting those outcomes as failures will open the circuit during perfectly valid traffic.

The same caution applies to 429. If the dependency returns 429 because this caller exceeded a contract limit, the caller should slow down or fail that request. If the dependency returns 429 as an overload signal for everyone, the breaker may need to open. The response contract should make that distinction visible.
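The table and the 429 rule above can be sketched as a small classifier that runs before anything is recorded in the breaker. This is a minimal sketch, not a library API: `DependencyResponse`, `classify`, and the `overloadScope` field are hypothetical names, and it assumes the dependency's response contract says whether a 429 is about this caller's quota or about global overload. Treating other 5xx responses as health failures is also an assumption.

```typescript
// Hypothetical names throughout; nothing here belongs to a real HTTP client.
type Verdict = 'success' | 'slow' | 'health_failure' | 'not_counted'

type DependencyResponse = {
  status?: number // undefined when no response arrived at all
  timedOut: boolean
  durationMs: number
  // Assumed contract field: distinguishes caller-quota 429 from global overload.
  overloadScope?: 'caller' | 'global'
}

function classify(res: DependencyResponse, slowCallMs: number): Verdict {
  // No response or a timeout: the dependency path is unavailable or too slow.
  if (res.timedOut || res.status === undefined) return 'health_failure'

  // Explicit overload or server-side failure (5xx handling is an assumption).
  if (res.status >= 500) return 'health_failure'

  // 429 only counts when the dependency says everyone is being throttled.
  if (res.status === 429 && res.overloadScope === 'global') return 'health_failure'

  // A response that arrived but exceeded the latency budget is a slow call,
  // even if the body was a valid business answer.
  if (res.durationMs >= slowCallMs) return 'slow'

  // 404, validation errors, authorization failures, caller-quota 429:
  // business outcomes, not dependency health.
  if (res.status >= 400) return 'not_counted'

  return 'success'
}
```

Only `slow` and `health_failure` verdicts would feed the breaker's thresholds. Whether a `not_counted` outcome should still count toward the window's sample size is a design choice this sketch leaves open.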

Use Rolling Windows, Not A Single Failure Counter

A fixed "five failures opens the circuit" rule is easy to explain and easy to misconfigure.

Five failures mean different things on different routes:

| Traffic shape | Five failures means... |
| --- | --- |
| 10 calls per hour | Serious signal, but slow to detect |
| 10,000 calls per minute | Probably noise unless failures cluster |
| A route that is normally idle | Not enough data to infer dependency health |
| A dependency already timing out | Too late if every failure waits for timeout |
| A partial regional or shard issue | Too broad if all shards share one breaker |

Most production breakers need a rolling window, minimum sample size, failure-rate threshold, and slow-call threshold.

Resilience4j's circuit breaker documentation uses this kind of model: it stores outcomes in count-based or time-based sliding windows, supports failure-rate and slow-call-rate thresholds, and requires a minimum number of calls before calculating those rates. See Resilience4j CircuitBreaker.

The minimum sample size matters. If a low-traffic dependency receives only three calls and all three fail, that might be enough to alert a human, but it may not be enough to open a shared breaker for every caller. If a high-traffic dependency sees 50% slow calls over the last 30 seconds, waiting for hard failures may be too late.
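The two cases in that paragraph can be worked through with a few lines of arithmetic. The sketch below is illustrative only: `Sample`, `Thresholds`, `shouldOpen`, and the threshold values (20 calls, 50%) are assumptions chosen for the example, not recommended settings or any library's defaults.

```typescript
type Sample = { failed: boolean; slow: boolean }

type Thresholds = {
  minimumCalls: number
  failureRateToOpen: number // fraction between 0 and 1
  slowCallRateToOpen: number // fraction between 0 and 1
}

// Decide from the samples currently inside the rolling window.
function shouldOpen(samples: Sample[], t: Thresholds): boolean {
  // Too few calls: the rates are not meaningful yet.
  if (samples.length < t.minimumCalls) return false

  const failureRate = samples.filter((s) => s.failed).length / samples.length
  const slowCallRate = samples.filter((s) => s.slow).length / samples.length

  return failureRate >= t.failureRateToOpen || slowCallRate >= t.slowCallRateToOpen
}

const thresholds: Thresholds = {
  minimumCalls: 20,
  failureRateToOpen: 0.5,
  slowCallRateToOpen: 0.5,
}

// Low-traffic route: three calls, all failed. 100% failure rate, tiny sample.
const lowTraffic: Sample[] = Array.from({ length: 3 }, () => ({ failed: true, slow: false }))

// High-traffic route: 100 calls, no hard errors, every second call slow.
const highTraffic: Sample[] = Array.from({ length: 100 }, (_, i) => ({
  failed: false,
  slow: i % 2 === 0,
}))

const lowTrafficOpens = shouldOpen(lowTraffic, thresholds) // false: below minimumCalls
const highTrafficOpens = shouldOpen(highTraffic, thresholds) // true: slow-call rate is 0.5
```

The low-traffic route stays closed despite a 100% failure rate, and the high-traffic route opens without a single hard error.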

A Concrete TypeScript Circuit Breaker

This example is intentionally small. It is not a replacement for a proven library, but it shows the production shape: rolling outcomes, slow-call classification, open wait time, and limited half-open probes.

type BreakerState = 'closed' | 'open' | 'half_open'

type BreakerPolicy = {
  windowMs: number
  minimumCalls: number
  failureRateToOpen: number
  slowCallRateToOpen: number
  slowCallMs: number
  openForMs: number
  halfOpenMaxCalls: number
}

type CallOutcome = {
  at: number
  failed: boolean
  slow: boolean
}

class CircuitOpenError extends Error {
  constructor(readonly dependency: string) {
    super(`Circuit open for ${dependency}`)
  }
}

class CircuitBreaker {
  private state: BreakerState = 'closed'
  private openedAt = 0
  private halfOpenInFlight = 0
  private outcomes: CallOutcome[] = []

  constructor(
    private readonly dependency: string,
    private readonly policy: BreakerPolicy
  ) {}

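  async execute<T>(operation: () => Promise<T>): Promise<T> {
    this.transitionIfReadyForProbe()

    if (this.state === 'open') {
      throw new CircuitOpenError(this.dependency)
    }

    // Remember whether this call was admitted as a half-open probe, so the
    // in-flight counter is only decremented by calls that incremented it.
    const isProbe = this.state === 'half_open'

    if (isProbe) {
      if (this.halfOpenInFlight >= this.policy.halfOpenMaxCalls) {
        throw new CircuitOpenError(this.dependency)
      }

      this.halfOpenInFlight++
    }

    const startedAt = Date.now()

    try {
      const result = await operation()
      this.record({ failed: false, durationMs: Date.now() - startedAt })
      return result
    } catch (error) {
      // Note: this small example counts every thrown error as a dependency
      // failure. Business outcomes should be filtered out before this point.
      this.record({ failed: true, durationMs: Date.now() - startedAt })
      throw error
    } finally {
      // open() and close() already reset the counter when the state changes.
      if (isProbe && this.state === 'half_open') {
        this.halfOpenInFlight = Math.max(0, this.halfOpenInFlight - 1)
      }
    }
  }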

  private transitionIfReadyForProbe() {
    if (this.state !== 'open') return

    if (Date.now() - this.openedAt >= this.policy.openForMs) {
      this.state = 'half_open'
      this.halfOpenInFlight = 0
      this.outcomes = []
    }
  }

  private record(result: { failed: boolean; durationMs: number }) {
    const now = Date.now()
    const outcome = {
      at: now,
      failed: result.failed,
      slow: result.durationMs >= this.policy.slowCallMs,
    }

    this.outcomes.push(outcome)
    this.outcomes = this.outcomes.filter((item) => now - item.at <= this.policy.windowMs)

    if (this.state === 'half_open') {
      if (outcome.failed || outcome.slow) {
        this.open()
        return
      }

      if (this.outcomes.length >= this.policy.halfOpenMaxCalls) {
        this.close()
      }

      return
    }

    if (this.outcomes.length < this.policy.minimumCalls) return

    const failures = this.outcomes.filter((item) => item.failed).length
    const slowCalls = this.outcomes.filter((item) => item.slow).length
    const failureRate = failures / this.outcomes.length
    const slowCallRate = slowCalls / this.outcomes.length

    if (
      failureRate >= this.policy.failureRateToOpen ||
      slowCallRate >= this.policy.slowCallRateToOpen
    ) {
      this.open()
    }
  }

  private open() {
    this.state = 'open'
    this.openedAt = Date.now()
    this.halfOpenInFlight = 0
  }

  private close() {
    this.state = 'closed'
    this.outcomes = []
    this.halfOpenInFlight = 0
  }
}

One detail in the code is deliberate: slow calls can open the breaker before errors dominate.

Many dependency incidents begin as latency incidents. If the breaker waits for explicit errors, the caller may already have filled its connection pool or request workers with slow calls. Slow-call thresholds let the caller protect itself before timeout failures become the main signal.

Where To Put The Breaker

A circuit breaker belongs around a specific dependency operation, not around the whole service by default.

Bad shape:

const checkoutBreaker = new CircuitBreaker('checkout-service', policy)

Better shape:

const priceLookupBreaker = new CircuitBreaker('pricing.lookupPrice', pricingPolicy)
const inventoryReserveBreaker = new CircuitBreaker('inventory.reserveStock', inventoryPolicy)
const paymentRiskBreaker = new CircuitBreaker('risk.scorePayment', riskPolicy)

Different operations fail differently.

Inventory reservation may be critical and write-heavy. Price lookup may tolerate cached data. Risk scoring may degrade with a stricter limit. A single "checkout dependency breaker" collapses those differences into one state and can block healthy paths because one operation is unhealthy.

The same issue appears with shards, regions, and tenants. If only one shard is unhealthy, a global breaker may block calls to healthy shards. If only one region has elevated latency, a global breaker may hide routing information the system needs. Key the breaker at the boundary where failure is actually correlated.
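One way to key breakers by operation and shard (or region, or tenant) is a small registry that creates a breaker lazily per key. The sketch below assumes hypothetical names (`KeyedBreakers`, `StubBreaker`) and uses a stand-in breaker object so it stays self-contained. In practice the factory would build a real breaker, such as the `CircuitBreaker` class above.

```typescript
// Stand-in breaker for illustration only.
type StubBreaker = { key: string; state: 'closed' | 'open' }

class KeyedBreakers<B> {
  private readonly breakers = new Map<string, B>()
  private readonly create: (key: string) => B

  constructor(create: (key: string) => B) {
    this.create = create
  }

  // Return the breaker for one operation within one failure-correlation scope.
  get(operation: string, scope: string): B {
    const key = `${operation}#${scope}`
    let breaker = this.breakers.get(key)
    if (breaker === undefined) {
      breaker = this.create(key)
      this.breakers.set(key, breaker)
    }
    return breaker
  }
}

const pricing = new KeyedBreakers<StubBreaker>((key) => ({ key, state: 'closed' }))

// Shard 17 is unhealthy, so only its breaker opens.
pricing.get('pricing.lookupPrice', 'shard-17').state = 'open'

// Shard 2 has its own breaker and is unaffected.
const shard2State = pricing.get('pricing.lookupPrice', 'shard-2').state
```

This map grows without bound. If the scope is high-cardinality, such as tenant IDs, it likely needs eviction or a coarser key.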

Fallback Is A Product Decision

Opening a breaker settles only one thing: do not call this dependency right now.

It does not decide what the user, caller, or workflow should receive.

| Operation | Possible fallback | Risk |
| --- | --- | --- |
| Product recommendation | Return cached or empty recommendations | Lower personalization, usually acceptable |
| Shipping ETA lookup | Show "ETA unavailable" | User sees degraded experience |
| Price calculation | Use cached price only if business allows | Stale price can create financial or trust issues |
| Payment authorization | Usually fail or queue very carefully | Unsafe fallback can create duplicate or incorrect payment |
| Receipt email scheduling | Enqueue later if the queue is healthy | Async path must be replay-safe |
| Fraud or risk scoring | Use conservative decision or manual review | May increase false positives or block legitimate users |

Fallbacks are not free. A stale cached value may be acceptable for recommendations and unacceptable for pricing. A queued payment step may create correctness problems unless the operation is idempotent. A default response may satisfy availability metrics while hiding a broken user experience.
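A fallback wrapper can make the degradation explicit instead of hiding it behind a normal-looking response. The sketch below is one possible shape under stated assumptions: `callWithFallback` and `Served` are hypothetical names, and the local `CircuitOpenError` is a stand-in for the breaker's error. It only falls back when the breaker is open and tags the result with its source, so callers and metrics can tell live data from fallback data.

```typescript
// Stand-in for the breaker's error; matched by name to stay robust across
// compile targets.
class CircuitOpenError extends Error {
  constructor(dependency: string) {
    super(`Circuit open for ${dependency}`)
    this.name = 'CircuitOpenError'
  }
}

type Served<T> = { value: T; source: 'live' | 'fallback' }

async function callWithFallback<T>(
  live: () => Promise<T>, // the breaker-guarded dependency call
  fallback: () => T | undefined // returns undefined when no safe fallback exists
): Promise<Served<T>> {
  try {
    return { value: await live(), source: 'live' }
  } catch (error) {
    const circuitOpen = error instanceof Error && error.name === 'CircuitOpenError'

    // Only an open breaker triggers the fallback. Other errors propagate so
    // the caller does not mask real failures with stale data.
    if (circuitOpen) {
      const value = fallback()
      if (value !== undefined) return { value, source: 'fallback' }
    }

    throw error
  }
}
```

Whether other errors, such as timeouts, should also use the fallback is the product decision this section describes. The sketch deliberately picks the narrowest option.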

If the fallback turns synchronous work into asynchronous work, the article Background Jobs in Production covers the next reliability boundary: replay-safe handlers, retries, queue age, and dead-letter triage.

Circuit Breakers, Retries, Timeouts, And Backpressure

Circuit breakers work only when the surrounding controls agree with them.

| Control | Job | Failure if misused |
| --- | --- | --- |
| Timeout | Bound how long one call may wait | Too long ties up callers; too short creates false failures |
| Retry | Spend extra attempts on failures likely to recover | Can amplify dependency overload |
| Retry budget | Limit how much extra retry traffic callers may create | Too loose lets retries dominate fresh traffic |
| Circuit breaker | Stop calling a dependency that looks unhealthy | Too broad blocks healthy paths |
| Rate limit | Keep callers, tenants, or routes inside an admission budget | Too high fails to protect capacity |
| Backpressure | Communicate downstream saturation before it spreads upstream | Too late becomes ordinary failure |
| Bulkhead/concurrency cap | Keep one dependency from consuming every shared resource | Too small rejects useful work; too large isolates less |

Retries should usually call through the breaker, and retry logic should stop when the breaker is open. Otherwise every caller can keep retrying a dependency the breaker has already declared unhealthy. That is the retry-amplification failure covered in Adding Retries Can Make Outages Worse and Retry Budgets in Microservices.
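The ordering described above can be sketched as a retry loop wrapped around a breaker-guarded call. The names `retryThroughBreaker` and `isCircuitOpen` are hypothetical, and the local `CircuitOpenError` is a stand-in for the breaker's error so the sketch stays self-contained. The key behavior is that an open-circuit rejection ends the loop immediately.

```typescript
// Stand-in for the breaker's error; matched by name.
class CircuitOpenError extends Error {
  constructor(dependency: string) {
    super(`Circuit open for ${dependency}`)
    this.name = 'CircuitOpenError'
  }
}

function isCircuitOpen(error: unknown): boolean {
  return error instanceof Error && error.name === 'CircuitOpenError'
}

async function retryThroughBreaker<T>(
  guarded: () => Promise<T>, // each attempt must go through the breaker
  maxAttempts: number,
  backoffMs: (attempt: number) => number
): Promise<T> {
  const attempts = Math.max(1, maxAttempts)
  let lastError: unknown

  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await guarded()
    } catch (error) {
      // The breaker has declared the dependency unhealthy: stop retrying.
      if (isCircuitOpen(error)) throw error

      lastError = error
      if (attempt < attempts) {
        await new Promise<void>((resolve) => setTimeout(resolve, backoffMs(attempt)))
      }
    }
  }

  throw lastError
}
```

Because every attempt passes through the breaker, failed retries also count toward its thresholds. This sketch does not implement a retry budget; it only shows where the breaker sits relative to the loop.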

Rate limiting and backpressure answer a different question: how much work should enter the system or downstream path at all? A circuit breaker opens after recent dependency behavior says "this path is unhealthy." Admission controls should often activate earlier, before the dependency is already failing. See Rate Limiting and Backpressure in Microservices for that side of the overload-control cluster.

Common Misconfigurations

Opening Only On Hard Errors

If the breaker ignores latency, it opens after the caller has already paid the timeout cost.

Track slow-call ratio as well as error ratio. A service that returns 200 responses after 8 seconds can still break the caller if the caller's useful latency budget is 700 ms.

Sharing One Breaker Across Too Many Operations

A global breaker is easy to manage and often too blunt.

If pricing.lookupPrice is unhealthy, that does not mean pricing.getCurrencyRules is unhealthy. If shard 17 is timing out, that does not mean shard 2 should be blocked. Key breakers where failure correlation is real.

Half-Open Floods

Half-open should allow a small number of probes. It should not release the entire backlog.

If a breaker opens for 30 seconds and then every instance sends normal traffic at the same time, the dependency can fail again before the first probe result is even useful.
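One possible mitigation is to add jitter to each instance's open wait, so instances that opened together do not all start probing in the same instant. It is offered here as an assumption, not something the pattern requires. `jitteredOpenForMs` and its parameters are hypothetical names, and the random source is injectable so the behavior can be checked deterministically.

```typescript
// Spread the open wait across [baseMs - spread, baseMs + spread),
// where spread = baseMs * jitterFraction.
function jitteredOpenForMs(
  baseMs: number,
  jitterFraction: number, // e.g. 0.2 for plus or minus 20%
  random: () => number = Math.random // must return a value in [0, 1)
): number {
  const spread = baseMs * jitterFraction
  const offset = (random() * 2 - 1) * spread
  return Math.round(baseMs + offset)
}

// With a 30-second base and 20% jitter, waits fall between 24 and 36 seconds.
const shortestWait = jitteredOpenForMs(30_000, 0.2, () => 0)
const middleWait = jitteredOpenForMs(30_000, 0.2, () => 0.5)
```

Jitter only staggers the start of probing. It does not replace the per-instance probe limit, and a large fleet may still need a fleet-wide cap on probes.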

No Minimum Call Count

Low sample sizes make breakers jumpy.

If one call fails on a route that has seen only two calls, a 50% failure rate says little. Use minimum sample counts so low-traffic paths do not flap because of tiny windows.

No Fallback Ownership

Engineers often configure the breaker and leave product behavior vague.

When the breaker opens, does the API return 503, cached data, partial data, a queued command, or a conservative response? That decision should be explicit before the first incident.

Breaker Metrics Hidden Inside Logs

State changes should be metrics and events, not only log lines.

During an incident, operators need to see which dependency breaker opened, why it opened, how much traffic it rejected, whether fallback is working, and whether half-open probes are succeeding.

Metrics To Watch

Track breaker behavior as first-class production telemetry.

| Metric | Why it matters |
| --- | --- |
| Current state by dependency and operation | Shows which dependency path is unhealthy |
| State transitions | Reveals flapping and recovery patterns |
| Open duration | Shows whether recovery is quick or prolonged |
| Calls rejected because circuit is open | Measures protected traffic and user-visible impact |
| Fallback count and fallback success rate | Shows whether degradation is working |
| Failure rate and slow-call rate by window | Explains why the breaker opened |
| Half-open probe attempts and outcomes | Shows whether recovery is real |
| Retry rate while breaker is open | Catches clients fighting the breaker |
| Caller saturation while breaker is open | Confirms whether resources are being preserved |

Breaker metrics should sit beside dependency latency, timeout rate, retry rate, and queue depth. Looking at them alone can be misleading.

For example, an open breaker with low caller latency may look healthy on an API latency dashboard. Users may still be receiving fallback responses or hard failures. Keep useful success, fallback success, and rejected work separate.
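Keeping those categories separate can be as simple as counting them under different names. The sketch below is an in-memory illustration only: `BreakerTelemetry`, `Outcome`, and `usefulSuccessRatio` are hypothetical names, and a real service would export these values through its own metrics client instead of holding them in an object.

```typescript
type BreakerState = 'closed' | 'open' | 'half_open'
type Outcome = 'live_success' | 'fallback_success' | 'rejected' | 'failed'
type Transition = { from: BreakerState; to: BreakerState; reason: string; at: number }

class BreakerTelemetry {
  private readonly counts: Record<Outcome, number> = {
    live_success: 0,
    fallback_success: 0,
    rejected: 0,
    failed: 0,
  }
  private readonly transitions: Transition[] = []

  // State changes are recorded as events with a reason, not only log lines.
  recordTransition(from: BreakerState, to: BreakerState, reason: string): void {
    this.transitions.push({ from, to, reason, at: Date.now() })
  }

  recordOutcome(outcome: Outcome): void {
    this.counts[outcome]++
  }

  snapshot() {
    const { live_success, fallback_success, rejected, failed } = this.counts
    const total = live_success + fallback_success + rejected + failed
    return {
      counts: { ...this.counts },
      transitionCount: this.transitions.length,
      // Share of requests answered with live dependency data.
      usefulSuccessRatio: total === 0 ? 1 : live_success / total,
    }
  }
}
```

With six live successes, three fallback successes, and one rejection, a latency dashboard could look healthy while `usefulSuccessRatio` reports 0.6.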

Rollout Checklist

Use this checklist before enabling a circuit breaker on a production dependency:

  1. Name the protected dependency operation precisely.
  2. Define which failures count as dependency-health failures.
  3. Exclude business outcomes such as validation errors and expected 404 responses.
  4. Set a timeout budget before the breaker records a call as slow or failed.
  5. Choose a rolling window and minimum call count that match the traffic shape.
  6. Add a slow-call threshold, not only a hard-error threshold.
  7. Limit half-open probes per instance and across the fleet if needed.
  8. Decide the fallback behavior for each operation.
  9. Make retry logic stop when the breaker is open.
  10. Emit state transitions, rejection counts, fallback counts, and probe results.
  11. Add a manual force-open and reset path for incident response.
  12. Test the breaker with dependency latency, dependency errors, and partial regional failure.
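Item 11 can be sketched as a thin gate in front of the breaker's automatic decision. This is a hypothetical shape, not a library feature: `ManualOverride` and `automaticAllows` are illustrative names, and the automatic decision is passed in as a function so the sketch stays self-contained.

```typescript
class ManualOverride {
  private forcedOpen = false
  private readonly automaticAllows: () => boolean

  // automaticAllows reports whether the breaker itself would admit a call.
  constructor(automaticAllows: () => boolean) {
    this.automaticAllows = automaticAllows
  }

  // Operator action during an incident: reject all calls regardless of state.
  forceOpen(): void {
    this.forcedOpen = true
  }

  // Operator action after the incident: return control to the breaker.
  reset(): void {
    this.forcedOpen = false
  }

  allows(): boolean {
    return !this.forcedOpen && this.automaticAllows()
  }
}
```

The override only adds a way to block calls. It never forces traffic through a breaker that has opened on its own, which keeps the manual path from defeating the automatic protection.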

Do not measure success by whether the breaker never opens. A breaker that never opens may be unnecessary, misconfigured, or protecting the wrong boundary.

Measure success by whether the caller preserves useful capacity while the dependency is unhealthy, and whether recovery happens without a second traffic surge.

The Short Version

Circuit breakers are dependency-health isolation.

They stop callers from continuing to spend resources on remote operations that are likely to fail. They work best with short timeouts, bounded retries, retry budgets, per-dependency concurrency limits, explicit fallback behavior, and visible state transitions.

They do not replace rate limiting, backpressure, bulkheads, retries, or capacity fixes.

The production question is not "do we have a circuit breaker?" It is "which dependency operation can fail, how quickly will callers stop sending harmful work, and what will the system do instead?"