Circuit Breaker Pattern in Microservices

The circuit breaker pattern in microservices protects callers from dependencies that are already unlikely to respond successfully. It does not make the dependency healthier. It stops the caller from spending threads, connections, timeouts, and retry attempts on work that is probably going to fail.

That distinction matters. A circuit breaker is not a generic "resilience switch." It is a stateful decision at a dependency boundary: should this caller attempt the remote operation right now, or should it fail fast, use a fallback, or defer the work until recovery is more likely?

For the broader reliability cluster around timeouts, retries, admission control, queues, and outbox recovery, see the Backend Reliability hub.

The Failure Circuit Breakers Are Meant To Stop

Consider a checkout API that calls a pricing service on every request.

The pricing service starts responding slowly after a deployment. It still returns some successful responses, so the caller does not see a clean outage. Instead, p95 latency rises from 120 ms to 2.5 seconds, timeouts begin, and checkout workers spend more time waiting for pricing than doing useful work.

Without a circuit breaker, the failure can spread like this:

Minute	What happens	Why it spreads
0	Pricing latency starts rising	Checkout keeps sending all traffic
2	Checkout workers wait longer on pricing calls	Threads and connection pools stay occupied
4	Some checkout calls time out	Clients and upstream services retry
5	Retry traffic increases pricing load	The recovering dependency receives more work
7	Checkout queue depth rises	Healthy parts of checkout now wait behind bad calls
9	Other endpoints sharing the same pool slow down	The failure escapes the original request path
12	Operators see many symptoms but no single hard error	The system is degraded, not simply down

A circuit breaker changes the failure mode. Once the pricing dependency crosses the configured failure or slow-call threshold, checkout stops calling pricing for a short period. Requests fail fast, return cached pricing where acceptable, or route into a controlled fallback path.

This is the same family of overload behavior covered in When Timeouts Didn't Prevent Cascading Failures. Timeouts bound waiting. Circuit breakers decide whether a dependency call should happen at all.

Martin Fowler's original write-up describes the pattern as wrapping a protected remote call, monitoring failures, and returning immediately once a threshold has tripped instead of continuing to make calls that are likely to fail. See Circuit Breaker.

Closed, Open, And Half-Open States

A circuit breaker usually has three operational states.

State	What happens	Purpose
Closed	Calls pass through to the dependency	Normal behavior while dependency looks good
Open	Calls are rejected immediately or routed to fallback	Protect caller and dependency during fault
Half-open	A small number of probe calls are allowed after a recovery wait	Test whether dependency has recovered

Microsoft's Circuit Breaker pattern guidance describes the same model and separates it from retries: retries try an operation again because success is plausible; a circuit breaker blocks an operation that is likely to fail until recovery looks more likely. See Microsoft Learn: Circuit Breaker pattern.

The half-open state is the part teams often get wrong. It is not a gradual return to full traffic by itself. It is a probe. If every caller releases normal traffic the moment the wait period expires, a dependency that was recovering can be pushed back into failure.

What A Breaker Should Count

A useful breaker does not count every exception as a dependency failure.

For example:

Outcome	Should it trip the breaker?	Reason
Connection timeout to dependency	Yes	The dependency path is unavailable or too slow
`503 Service Unavailable` from dependency	Usually yes	The dependency is explicitly overloaded or unavailable
Slow response above dependency latency budget	Usually yes	Slow calls can exhaust caller resources before hard errors
`429 Too Many Requests` from dependency	Sometimes	It may indicate caller quota, dependency overload, or both
`404 Not Found` for a product ID	No	This is a business outcome, not dependency health
Validation error from caller input	No	Retrying or opening a breaker does not help
Authorization failure	No	The dependency is reachable and behaving correctly

The important rule is simple: the breaker should represent dependency health, not business success.

If a user enters an invalid coupon code, the coupon service is not unhealthy. If pricing returns a normal "product not found" response, the pricing service is not unhealthy. Counting those outcomes as failures will open the circuit during perfectly valid traffic.

The same caution applies to 429. If the dependency returns 429 because this caller exceeded a contract limit, the caller should slow down or fail that request. If the dependency returns 429 as an overload signal for everyone, the breaker may need to open. The response contract should make that distinction visible.
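
The table above can be sketched as a small classifier that runs before an outcome is recorded. This is a minimal sketch under assumptions: the `DependencyOutcome` shape, the `countsAgainstBreaker` name, and the `overloadSignal` flag are hypothetical, and the flag stands in for whatever the response contract uses to separate a per-caller quota from a global overload signal.

```typescript
// Hypothetical outcome shape; a real client would map its own errors into it.
type DependencyOutcome =
  | { kind: 'timeout' }
  | { kind: 'response'; status: number; durationMs: number; overloadSignal?: boolean }

// Returns true only when the outcome says something about dependency health.
function countsAgainstBreaker(outcome: DependencyOutcome, slowCallMs: number): boolean {
  // The dependency path was unavailable or too slow to answer at all.
  if (outcome.kind === 'timeout') return true

  // A slow response counts even when the status code is a success.
  if (outcome.durationMs >= slowCallMs) return true

  // 429 counts only when the contract marks it as overload for everyone.
  if (outcome.status === 429) return outcome.overloadSignal === true

  // 503 counts. Assumption: other 5xx responses are treated the same way.
  // 4xx business outcomes (404, validation, authorization) never count.
  return outcome.status >= 500
}
```

With a 700 ms slow-call budget, a `404` that returns in 50 ms is ignored, while a `200` that takes 8 seconds counts against the breaker.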

Use Rolling Windows, Not A Single Failure Counter

A fixed "five failures opens the circuit" rule is easy to explain and easy to misconfigure.

Five failures mean different things on different routes:

Traffic shape	Five failures means...
10 calls per hour	Serious signal, but slow to detect
10,000 calls per minute	Probably noise unless failures cluster
A route that is normally idle	Not enough data to infer dependency health
A dependency already timing out	Too late if every failure waits for timeout
A partial regional or shard issue	Too broad if all shards share one breaker

Most production breakers need a rolling window, minimum sample size, failure-rate threshold, and slow-call threshold.

Resilience4j's circuit breaker documentation uses this kind of model: it stores outcomes in count-based or time-based sliding windows, supports failure-rate and slow-call-rate thresholds, and requires a minimum number of calls before calculating those rates. See Resilience4j CircuitBreaker.

The minimum sample size matters. If a low-traffic dependency receives only three calls and all three fail, that might be enough to alert a human, but it may not be enough to open a shared breaker for every caller. If a high-traffic dependency sees 50% slow calls over the last 30 seconds, waiting for hard failures may be too late.
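
To make the difference concrete, the sketch below compares a fixed failure counter with a rate rule that requires a minimum sample. The type names, function names, and threshold values are illustrative assumptions, not taken from any library.

```typescript
type WindowStats = { calls: number; failures: number; slowCalls: number }

type RatePolicy = {
  minimumCalls: number
  failureRateToOpen: number // fraction between 0 and 1
  slowCallRateToOpen: number // fraction between 0 and 1
}

// Naive rule: trips on an absolute failure count, regardless of traffic.
function fixedCounterTrips(stats: WindowStats, maxFailures: number): boolean {
  return stats.failures >= maxFailures
}

// Rate rule: refuses to judge until the window holds enough calls.
function rateRuleTrips(stats: WindowStats, policy: RatePolicy): boolean {
  if (stats.calls < policy.minimumCalls) return false
  return (
    stats.failures / stats.calls >= policy.failureRateToOpen ||
    stats.slowCalls / stats.calls >= policy.slowCallRateToOpen
  )
}

// Illustrative thresholds only.
const ratePolicy: RatePolicy = { minimumCalls: 20, failureRateToOpen: 0.5, slowCallRateToOpen: 0.5 }

// 10,000 calls with 5 failures is a 0.05% failure rate: noise.
const busyRoute: WindowStats = { calls: 10000, failures: 5, slowCalls: 0 }
// 3 calls, all failed: alarming, but too small a sample to open a shared breaker.
const quietRoute: WindowStats = { calls: 3, failures: 3, slowCalls: 0 }
// Half of the calls are slow and none have failed yet.
const slowRoute: WindowStats = { calls: 400, failures: 0, slowCalls: 200 }
```

The fixed counter trips on `busyRoute` and stays silent on `slowRoute`. The rate rule does the opposite, and it also holds back on `quietRoute` until more calls arrive.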

A Concrete TypeScript Circuit Breaker

This example is intentionally small. It is not a replacement for a proven library, but it shows the production shape: rolling outcomes, slow-call classification, open wait time, and limited half-open probes.

type BreakerState = 'closed' | 'open' | 'half_open'

type BreakerPolicy = {
  windowMs: number
  minimumCalls: number
  failureRateToOpen: number
  slowCallRateToOpen: number
  slowCallMs: number
  openForMs: number
  halfOpenMaxCalls: number
}

type CallOutcome = {
  at: number
  failed: boolean
  slow: boolean
}

class CircuitOpenError extends Error {
  constructor(readonly dependency: string) {
    super(`Circuit open for ${dependency}`)
  }
}

class CircuitBreaker {
  private state: BreakerState = 'closed'
  private openedAt = 0
  private halfOpenInFlight = 0
  private outcomes: CallOutcome[] = []

  constructor(
    private readonly dependency: string,
    private readonly policy: BreakerPolicy
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    this.transitionIfReadyForProbe()

    if (this.state === 'open') {
      throw new CircuitOpenError(this.dependency)
    }

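    // Remember whether this call was admitted as a half-open probe, so a call
    // that started while the breaker was closed never touches the probe counter.
    const isProbe = this.state === 'half_open'

    if (isProbe) {
      if (this.halfOpenInFlight >= this.policy.halfOpenMaxCalls) {
        throw new CircuitOpenError(this.dependency)
      }

      this.halfOpenInFlight++
    }

    const startedAt = Date.now()

    try {
      const result = await operation()
      this.record({ failed: false, durationMs: Date.now() - startedAt })
      return result
    } catch (error) {
      this.record({ failed: true, durationMs: Date.now() - startedAt })
      throw error
    } finally {
      // open() and close() already reset the counter, so release the probe
      // slot only if the breaker is still half-open.
      if (isProbe && this.state === 'half_open') {
        this.halfOpenInFlight = Math.max(0, this.halfOpenInFlight - 1)
      }
    }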
  }

  private transitionIfReadyForProbe() {
    if (this.state !== 'open') return

    if (Date.now() - this.openedAt >= this.policy.openForMs) {
      this.state = 'half_open'
      this.halfOpenInFlight = 0
      this.outcomes = []
    }
  }

  private record(result: { failed: boolean; durationMs: number }) {
    const now = Date.now()
    const outcome: CallOutcome = {
      at: now,
      failed: result.failed,
      slow: result.durationMs >= this.policy.slowCallMs,
    }

    this.outcomes.push(outcome)
    this.outcomes = this.outcomes.filter((item) => now - item.at <= this.policy.windowMs)

    if (this.state === 'half_open') {
      if (outcome.failed || outcome.slow) {
        this.open()
        return
      }

      if (this.outcomes.length >= this.policy.halfOpenMaxCalls) {
        this.close()
      }

      return
    }

    if (this.outcomes.length < this.policy.minimumCalls) return

    const failures = this.outcomes.filter((item) => item.failed).length
    const slowCalls = this.outcomes.filter((item) => item.slow).length
    const failureRate = failures / this.outcomes.length
    const slowCallRate = slowCalls / this.outcomes.length

    if (
      failureRate >= this.policy.failureRateToOpen ||
      slowCallRate >= this.policy.slowCallRateToOpen
    ) {
      this.open()
    }
  }

  private open() {
    this.state = 'open'
    this.openedAt = Date.now()
    this.halfOpenInFlight = 0
  }

  private close() {
    this.state = 'closed'
    this.outcomes = []
    this.halfOpenInFlight = 0
  }
}

One detail in the code is deliberate: slow calls can open the breaker before errors dominate.

Many dependency incidents begin as latency incidents. If the breaker waits for explicit errors, the caller may already have filled its connection pool or request workers with slow calls. Slow-call thresholds let the caller protect itself before timeout failures become the main signal.
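
The class takes a `BreakerPolicy`, so a concrete policy helps show how the fields relate. The values below are illustrative assumptions for a checkout-to-pricing call, not recommendations, and `validatePolicy` is a hypothetical helper. It exists because the breaker compares rates as fractions: a threshold written as `50` instead of `0.5` would never trip.

```typescript
type BreakerPolicy = {
  windowMs: number
  minimumCalls: number
  failureRateToOpen: number
  slowCallRateToOpen: number
  slowCallMs: number
  openForMs: number
  halfOpenMaxCalls: number
}

// Illustrative values only; tune them against real traffic and latency budgets.
const pricingPolicy: BreakerPolicy = {
  windowMs: 30000, // judge the last 30 seconds
  minimumCalls: 20, // do not judge a window with fewer calls
  failureRateToOpen: 0.5, // fraction, not percent
  slowCallRateToOpen: 0.5, // fraction, not percent
  slowCallMs: 700, // caller's useful latency budget
  openForMs: 30000, // wait before the first probe
  halfOpenMaxCalls: 3, // concurrent probes allowed while half-open
}

// Returns a list of problems; an empty list means the policy is usable.
function validatePolicy(policy: BreakerPolicy): string[] {
  const problems: string[] = []
  const isRate = (value: number) => value > 0 && value <= 1

  if (policy.windowMs <= 0) problems.push('windowMs must be positive')
  if (policy.minimumCalls < 1) problems.push('minimumCalls must be at least 1')
  if (!isRate(policy.failureRateToOpen)) problems.push('failureRateToOpen must be in (0, 1]')
  if (!isRate(policy.slowCallRateToOpen)) problems.push('slowCallRateToOpen must be in (0, 1]')
  if (policy.slowCallMs <= 0) problems.push('slowCallMs must be positive')
  if (policy.openForMs <= 0) problems.push('openForMs must be positive')
  if (policy.halfOpenMaxCalls < 1) problems.push('halfOpenMaxCalls must be at least 1')

  return problems
}
```

Running the check once at startup is cheaper than discovering during an incident that the breaker could never open.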

Where To Put The Breaker

A circuit breaker belongs around a specific dependency operation, not around the whole service by default.

Bad shape:

const checkoutBreaker = new CircuitBreaker('checkout-service', policy)

Better shape:

const priceLookupBreaker = new CircuitBreaker('pricing.lookupPrice', pricingPolicy)
const inventoryReserveBreaker = new CircuitBreaker('inventory.reserveStock', inventoryPolicy)
const paymentRiskBreaker = new CircuitBreaker('risk.scorePayment', riskPolicy)

Different operations fail differently.

Inventory reservation may be critical and write-heavy. Price lookup may tolerate cached data. Risk scoring may degrade with a stricter limit. A single "checkout dependency breaker" collapses those differences into one state and can block healthy paths because one operation is unhealthy.

The same issue appears with shards, regions, and tenants. If only one shard is unhealthy, a global breaker may block calls to healthy shards. If only one region has elevated latency, a global breaker may hide routing information the system needs. Key the breaker at the boundary where failure is actually correlated.
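
One way to key breakers by failure domain is a small registry that creates a breaker per operation and scope on first use. This is a sketch under assumptions: `BreakerRegistry`, `forKey`, and `StubBreaker` are hypothetical names, and the stub stands in for a real breaker instance.

```typescript
// Creates and caches one breaker per correlated failure domain.
class BreakerRegistry<B> {
  private readonly breakers = new Map<string, B>()

  constructor(private readonly create: (key: string) => B) {}

  // scope is optional: a shard, region, or tenant identifier.
  forKey(operation: string, scope?: string): B {
    const key = scope ? `${operation}#${scope}` : operation
    let breaker = this.breakers.get(key)
    if (breaker === undefined) {
      breaker = this.create(key)
      this.breakers.set(key, breaker)
    }
    return breaker
  }
}

// Stand-in for a real breaker so the sketch stays self-contained.
type StubBreaker = { key: string; state: 'closed' | 'open' }

const registry = new BreakerRegistry<StubBreaker>((key) => ({ key, state: 'closed' }))

const shard17 = registry.forKey('pricing.lookupPrice', 'shard-17')
const shard2 = registry.forKey('pricing.lookupPrice', 'shard-2')

// Only the unhealthy shard is blocked; shard 2 keeps serving.
shard17.state = 'open'
```

The scope should be low-cardinality. Keying by shard or region is reasonable; keying by user ID would create an unbounded number of breakers, each with too little traffic to judge.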

Fallback Is A Product Decision

Opening a breaker makes only one decision: do not call this dependency right now.

It does not decide what the user, caller, or workflow should receive instead.

Operation	Possible fallback	Risk
Product recommendation	return cached or empty recommendations	lower personalization, usually acceptable
Shipping ETA lookup	show "ETA unavailable"	user sees degraded experience
Price calculation	use cached price only if business allows	stale price can create financial or trust issues
Payment authorization	usually fail or queue very carefully	unsafe fallback can create duplicate or incorrect payment
Receipt email scheduling	enqueue later if the queue is healthy	async path must be replay-safe
Fraud or risk scoring	use conservative decision or manual review	may increase false positives or block legitimate users

Fallbacks are not free. A stale cached value may be acceptable for recommendations and unacceptable for pricing. A queued payment step may create correctness problems unless the operation is idempotent. A default response may satisfy availability metrics while hiding a broken user experience.
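
A sketch of making that decision explicit in code: the wrapper below returns a result that says whether the value is fresh, a fallback, or a rejection, so callers and dashboards cannot mistake one for another. The `Served` type and `callWithFallback` are hypothetical names, and `CircuitOpenError` here is a stand-in for the error the breaker throws.

```typescript
// Stand-in for the error a breaker throws when it rejects a call.
class CircuitOpenError extends Error {}

type Served<T> =
  | { kind: 'fresh'; value: T }
  | { kind: 'fallback'; value: T }
  | { kind: 'rejected'; reason: string }

// fallback is null for operations where no fallback is safe (for example payment).
async function callWithFallback<T>(
  protectedCall: () => Promise<T>,
  fallback: (() => T | undefined) | null
): Promise<Served<T>> {
  try {
    return { kind: 'fresh', value: await protectedCall() }
  } catch (error) {
    // Only an open circuit triggers the fallback; other errors keep propagating.
    if (!(error instanceof CircuitOpenError)) throw error

    const value = fallback ? fallback() : undefined
    if (value === undefined) {
      return { kind: 'rejected', reason: 'circuit open, no fallback available' }
    }
    return { kind: 'fallback', value }
  }
}
```

A price lookup might pass a cache read as the fallback, while payment authorization passes `null` and lets the caller turn the rejection into an explicit error response.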

If the fallback turns synchronous work into asynchronous work, the article Background Jobs in Production covers the next reliability boundary: replay-safe handlers, retries, queue age, and dead-letter triage.

Circuit Breakers, Retries, Timeouts, And Backpressure

Circuit breakers work only when the surrounding controls agree with them.

Control	Job	Failure if misused
Timeout	Bound how long one call may wait	Too long ties up callers; too short creates false failures
Retry	Spend extra attempts on failures likely to recover	Can amplify dependency overload
Retry budget	Limit how much extra retry traffic callers may create	Too loose lets retries dominate fresh traffic
Circuit breaker	Stop calling a dependency that looks unhealthy	Too broad blocks healthy paths
Rate limit	Keep callers, tenants, or routes inside an admission budget	Too high fails to protect capacity
Backpressure	Communicate downstream saturation before it spreads upstream	Too late becomes ordinary failure
Bulkhead/concurrency cap	Keep one dependency from consuming every shared resource	Too small rejects useful work; too large isolates less

Retries should usually call through the breaker, and retry logic should stop when the breaker is open. Otherwise every caller can keep retrying a dependency the breaker has already declared unhealthy. That is the retry-amplification failure covered in Adding Retries Can Make Outages Worse and Retry Budgets in Microservices.
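
A minimal sketch of that rule follows. Every attempt goes through the breaker, and an open circuit ends the retry loop immediately. The `ProtectedExecutor` type, `retryThroughBreaker`, and the backoff values are assumptions for illustration; `CircuitOpenError` is a stand-in for the error the breaker throws.

```typescript
// Stand-in for the error a breaker throws when it rejects a call.
class CircuitOpenError extends Error {}

// Hypothetical shape: anything that exposes execute() like a breaker.
type ProtectedExecutor = {
  execute<T>(operation: () => Promise<T>): Promise<T>
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

async function retryThroughBreaker<T>(
  breaker: ProtectedExecutor,
  operation: () => Promise<T>,
  maxAttempts: number,
  baseDelayMs: number
): Promise<T> {
  if (maxAttempts < 1) throw new RangeError('maxAttempts must be at least 1')

  let lastError: unknown
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // Each attempt passes through the breaker so it records every outcome.
      return await breaker.execute(operation)
    } catch (error) {
      // The breaker has declared the dependency unhealthy: stop retrying.
      if (error instanceof CircuitOpenError) throw error

      lastError = error
      // Exponential backoff between attempts, skipped after the final one.
      if (attempt < maxAttempts) await sleep(baseDelayMs * Math.pow(2, attempt - 1))
    }
  }
  throw lastError
}
```

This sketch omits a retry budget and jitter. Both would normally sit alongside it so that retries from many callers do not line up.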

Rate limiting and backpressure answer a different question: how much work should enter the system or downstream path at all? A circuit breaker opens after recent dependency behavior says "this path is unhealthy." Admission controls should often activate earlier, before the dependency is already failing. See Rate Limiting and Backpressure in Microservices for that side of the overload-control cluster.

Common Misconfigurations

Opening Only On Hard Errors

If the breaker ignores latency, it opens after the caller has already paid the timeout cost.

Track slow-call ratio as well as error ratio. A service that returns 200 responses after 8 seconds can still break the caller if the caller's useful latency budget is 700 ms.

One Global Breaker

A global breaker is easy to manage and often too blunt.

If pricing.lookupPrice is unhealthy, that does not mean pricing.getCurrencyRules is unhealthy. If shard 17 is timing out, that does not mean shard 2 should be blocked. Key breakers where failure correlation is real.

Half-Open Floods

Half-open should allow a small number of probes. It should not release the entire backlog.

If a breaker opens for 30 seconds and then every instance sends normal traffic at the same time, the dependency can fail again before the first probe result is even useful.

No Minimum Call Count

Low sample sizes make breakers jumpy.

If one call fails on a route that has seen only two calls, a 50% failure rate says little. Use minimum sample counts so low-traffic paths do not flap because of tiny windows.

No Fallback Ownership

Engineers often configure the breaker and leave product behavior vague.

When the breaker opens, does the API return 503, cached data, partial data, a queued command, or a conservative response? That decision should be explicit before the first incident.

Breaker Metrics Hidden Inside Logs

State changes should be metrics and events, not only log lines.

During an incident, operators need to see which dependency breaker opened, why it opened, how much traffic it rejected, whether fallback is working, and whether half-open probes are succeeding.
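
One way to get there is to emit a structured event on every state change and let a metrics client subscribe to it. The sketch below assumes hypothetical names (`BreakerTransition`, `TransitionNotifier`) and uses a plain `Map` as a stand-in for a real metrics client.

```typescript
type BreakerState = 'closed' | 'open' | 'half_open'

type BreakerTransition = {
  dependency: string
  from: BreakerState
  to: BreakerState
  reason: string
  at: number
}

class TransitionNotifier {
  private current: BreakerState = 'closed'
  private readonly listeners: Array<(event: BreakerTransition) => void> = []

  constructor(private readonly dependency: string) {}

  onTransition(listener: (event: BreakerTransition) => void): void {
    this.listeners.push(listener)
  }

  // Call wherever the breaker changes state; same-state calls emit nothing.
  moveTo(next: BreakerState, reason: string, at: number = Date.now()): void {
    if (next === this.current) return
    const event: BreakerTransition = {
      dependency: this.dependency,
      from: this.current,
      to: next,
      reason,
      at,
    }
    this.current = next
    this.listeners.forEach((listener) => listener(event))
  }
}

// Stand-in for a metrics client: count transitions by dependency and direction.
const transitionCounts = new Map<string, number>()
const notifier = new TransitionNotifier('pricing.lookupPrice')
notifier.onTransition((event) => {
  const label = `${event.dependency}:${event.from}->${event.to}`
  transitionCounts.set(label, (transitionCounts.get(label) ?? 0) + 1)
})

notifier.moveTo('open', 'slow-call rate above threshold')
notifier.moveTo('open', 'duplicate, ignored')
notifier.moveTo('half_open', 'open wait elapsed')
notifier.moveTo('open', 'probe failed')
```

Carrying the reason on the event answers "why did it open" without a log search, and a repeated `half_open->open` count makes flapping visible.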

Metrics To Watch

Track breaker behavior as first-class production telemetry.

Metric	Why it matters
current state by dependency and operation	shows which dependency path is unhealthy
state transitions	reveals flapping and recovery patterns
open duration	shows whether recovery is quick or prolonged
calls rejected because circuit is open	measures protected traffic and user-visible impact
fallback count and fallback success rate	shows whether degradation is working
failure rate and slow-call rate by window	explains why the breaker opened
half-open probe attempts and outcomes	shows whether recovery is real
retry rate while breaker is open	catches clients fighting the breaker
caller saturation while breaker is open	confirms whether resources are being preserved

Breaker metrics should sit beside dependency latency, timeout rate, retry rate, and queue depth. Looking at them alone can be misleading.

For example, an open breaker with low caller latency may look healthy on an API latency dashboard. Users may still be receiving fallback responses or hard failures. Keep useful success, fallback success, and rejected work separate.
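
The sketch below shows why the split matters, using made-up counts. `ServedKind` and `OutcomeCounters` are hypothetical names, and a real system would export these as labeled counters rather than compute ratios in process.

```typescript
type ServedKind = 'useful_success' | 'fallback_success' | 'rejected'

class OutcomeCounters {
  private readonly counts = new Map<ServedKind, number>()

  add(kind: ServedKind, amount: number = 1): void {
    this.counts.set(kind, this.get(kind) + amount)
  }

  get(kind: ServedKind): number {
    return this.counts.get(kind) ?? 0
  }

  total(): number {
    let sum = 0
    this.counts.forEach((value) => {
      sum += value
    })
    return sum
  }

  // Share of requests answered by the real dependency.
  usefulRatio(): number {
    const total = this.total()
    return total === 0 ? 0 : this.get('useful_success') / total
  }

  // Share of requests that did not end in an error.
  // This is the flattering number while a breaker is open.
  nonErrorRatio(): number {
    const total = this.total()
    return total === 0 ? 0 : (this.get('useful_success') + this.get('fallback_success')) / total
  }
}

// Made-up counts for a window in which the breaker was mostly open.
const counters = new OutcomeCounters()
counters.add('useful_success', 25)
counters.add('fallback_success', 50)
counters.add('rejected', 25)
```

With these counts, `nonErrorRatio()` is 0.75 while `usefulRatio()` is 0.25. A dashboard that shows only the first number would hide that three quarters of users did not get a real answer.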

Rollout Checklist

Use this checklist before enabling a circuit breaker on a production dependency:

Name the protected dependency operation precisely.
Define which failures count as dependency-health failures.
Exclude business outcomes such as validation errors and expected 404 responses.
Set a timeout budget first, so the breaker has a clear basis for recording a call as slow or failed.
Choose a rolling window and minimum call count that match the traffic shape.
Add a slow-call threshold, not only a hard-error threshold.
Limit half-open probes per instance and across the fleet if needed.
Decide the fallback behavior for each operation.
Make retry logic stop when the breaker is open.
Emit state transitions, rejection counts, fallback counts, and probe results.
Add a manual force-open and reset path for incident response.
Test the breaker with dependency latency, dependency errors, and partial regional failure.
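
The manual force-open and reset item can be sketched as a thin override layered on top of the automatic state. `ManualOverride` and its method names are hypothetical, and the `onReset` callback is an assumed hook where a real breaker would clear its recorded outcomes.

```typescript
type BreakerState = 'closed' | 'open' | 'half_open'

class ManualOverride {
  private forcedOpenReason: string | null = null

  // onReset is where the breaker would drop its window and return to closed.
  constructor(private readonly onReset: () => void) {}

  // Operator action: block calls regardless of what the window says.
  forceOpen(reason: string): void {
    this.forcedOpenReason = reason
  }

  // Operator action: clear the override and reset the automatic state.
  reset(): void {
    this.forcedOpenReason = null
    this.onReset()
  }

  // The override wins over the automatic state while it is set.
  effectiveState(automatic: BreakerState): BreakerState {
    return this.forcedOpenReason === null ? automatic : 'open'
  }

  reason(): string | null {
    return this.forcedOpenReason
  }
}

let resets = 0
const pricingOverride = new ManualOverride(() => {
  resets++
})
pricingOverride.forceOpen('pricing rollback in progress')
```

Recording the reason with the override lets the next responder see why the breaker is open even though the dependency metrics look healthy.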

Do not measure success by whether the breaker never opens. A breaker that never opens may be unnecessary, misconfigured, or protecting the wrong boundary.

Measure success by whether the caller preserves useful capacity while the dependency is unhealthy, and whether recovery happens without a second traffic surge.

The Short Version

Circuit breakers isolate callers from unhealthy dependencies.

They stop callers from continuing to spend resources on remote operations that are likely to fail. They work best with short timeouts, bounded retries, retry budgets, per-dependency concurrency limits, explicit fallback behavior, and visible state transitions.

They do not replace rate limiting, backpressure, bulkheads, retries, or capacity fixes.

The production question is not "do we have a circuit breaker?" It is "which dependency operation can fail, how quickly will callers stop sending harmful work, and what will the system do instead?"