
Circuit Breaker Pattern in Microservices
The circuit breaker pattern in microservices protects callers from dependencies that are already unlikely to respond successfully. It does not make the dependency healthier. It stops the caller from spending threads, connections, timeouts, and retry attempts on work that is probably going to fail.
That distinction matters. A circuit breaker is not a generic "resilience switch." It is a stateful decision at a dependency boundary: should this caller attempt the remote operation right now, or should it fail fast, use a fallback, or defer the work until recovery is more likely?
For the broader reliability cluster around timeouts, retries, admission control, queues, and outbox recovery, see the Backend Reliability hub.
The Failure Circuit Breakers Are Meant To Stop
Consider a checkout API that calls a pricing service on every request.
The pricing service starts responding slowly after a deployment. It still returns some successful responses, so the caller does not see a clean outage. Instead, p95 latency rises from 120 ms to 2.5 seconds, timeouts begin, and checkout workers spend more time waiting for pricing than doing useful work.
Without a circuit breaker, the failure can spread like this:
| Minute | What happens | Why it spreads |
|---|---|---|
| 0 | Pricing latency starts rising | Checkout keeps sending all traffic |
| 2 | Checkout workers wait longer on pricing calls | Threads and connection pools stay occupied |
| 4 | Some checkout calls time out | Clients and upstream services retry |
| 5 | Retry traffic increases pricing load | The recovering dependency receives more work |
| 7 | Checkout queue depth rises | Healthy parts of checkout now wait behind bad calls |
| 9 | Other endpoints sharing the same pool slow down | The failure escapes the original request path |
| 12 | Operators see many symptoms but no single hard error | The system is degraded, not simply down |
A circuit breaker changes the failure mode. Once the pricing dependency crosses the configured failure or slow-call threshold, checkout stops calling pricing for a short period. Requests fail fast, return cached pricing where acceptable, or route into a controlled fallback path.
This is the same family of overload behavior covered in When Timeouts Didn't Prevent Cascading Failures. Timeouts bound waiting. Circuit breakers decide whether a dependency call should happen at all.
Martin Fowler's original write-up describes the pattern as wrapping a protected remote call, monitoring failures, and returning immediately once a threshold has tripped instead of continuing to make calls that are likely to fail. See Circuit Breaker.
Closed, Open, And Half-Open States
A circuit breaker usually has three operational states.
| State | What happens | Purpose |
|---|---|---|
| Closed | Calls pass through to the dependency | Normal behavior while dependency looks good |
| Open | Calls are rejected immediately or routed to fallback | Protect caller and dependency during fault |
| Half-open | A small number of probe calls are allowed after a recovery wait | Test whether dependency has recovered |
Microsoft's Circuit Breaker pattern guidance describes the same model and separates it from retries: retries try an operation again because success is plausible; a circuit breaker blocks an operation that is likely to fail until recovery looks more likely. See Microsoft Learn: Circuit Breaker pattern.
The half-open state is the part teams often get wrong. It is not a gradual return to full traffic by itself. It is a probe. If every caller releases normal traffic the moment the wait period expires, a dependency that was recovering can be pushed back into failure.
What A Breaker Should Count
A useful breaker does not count every exception as a dependency failure.
For example:
| Outcome | Should it trip the breaker? | Reason |
|---|---|---|
| Connection timeout to dependency | Yes | The dependency path is unavailable or too slow |
| 503 Service Unavailable from dependency | Usually yes | The dependency is explicitly overloaded or unavailable |
| Slow response above dependency latency budget | Usually yes | Slow calls can exhaust caller resources before hard errors |
| 429 Too Many Requests from dependency | Sometimes | It may indicate caller quota, dependency overload, or both |
| 404 Not Found for a product ID | No | This is a business outcome, not dependency health |
| Validation error from caller input | No | Retrying or opening a breaker does not help |
| Authorization failure | No | The dependency is reachable and behaving correctly |
The important rule is simple: the breaker should represent dependency health, not business success.
If a user enters an invalid coupon code, the coupon service is not unhealthy. If pricing returns a normal "product not found" response, the pricing service is not unhealthy. Counting those outcomes as failures will open the circuit during perfectly valid traffic.
The same caution applies to 429. If the dependency returns 429 because this caller exceeded a contract limit, the caller should slow down or fail that request. If the dependency returns 429 as an overload signal for everyone, the breaker may need to open. The response contract should make that distinction visible.
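The classification in the table can be sketched as a small predicate. This is a minimal sketch, not a library API: the `ObservedOutcome` shape, the `throttleScope` field, and the choice to treat every 5xx response as a health signal are assumptions made for illustration. A real dependency would have to expose the 429 distinction in its own response contract.

```typescript
// Hypothetical shape for one observed dependency call; field names are illustrative.
type ObservedOutcome =
  | { kind: 'timeout' }
  | { kind: 'slow'; durationMs: number }
  | { kind: 'http'; status: number; throttleScope?: 'caller' | 'dependency' }

// Returns true only when the outcome says something about dependency health.
function countsAgainstDependency(outcome: ObservedOutcome): boolean {
  // Timeouts and calls above the latency budget describe the dependency path.
  if (outcome.kind === 'timeout' || outcome.kind === 'slow') return true

  const { status, throttleScope } = outcome
  // 429 counts only when the contract marks it as overload for everyone (assumed field).
  if (status === 429) return throttleScope === 'dependency'
  // Assumption: any 5xx, including 503, reflects the dependency rather than the request.
  if (status >= 500) return true
  // Validation errors, authorization failures, and expected 404s are business outcomes.
  return false
}
```

With this shape, an unmarked 429 is treated as a caller-quota problem and does not push the breaker toward open.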
Use Rolling Windows, Not A Single Failure Counter
A fixed "five failures opens the circuit" rule is easy to explain and easy to misconfigure.
Five failures mean different things on different routes:
| Traffic shape | Five failures means... |
|---|---|
| 10 calls per hour | Serious signal, but slow to detect |
| 10,000 calls per minute | Probably noise unless failures cluster |
| A route that is normally idle | Not enough data to infer dependency health |
| A dependency already timing out | Too late if every failure waits for timeout |
| A partial regional or shard issue | Too broad if all shards share one breaker |
Most production breakers need a rolling window, minimum sample size, failure-rate threshold, and slow-call threshold.
Resilience4j's circuit breaker documentation uses this kind of model: it stores outcomes in count-based or time-based sliding windows, supports failure-rate and slow-call-rate thresholds, and requires a minimum number of calls before calculating those rates. See Resilience4j CircuitBreaker.
The minimum sample size matters. If a low-traffic dependency receives only three calls and all three fail, that might be enough to alert a human, but it may not be enough to open a shared breaker for every caller. If a high-traffic dependency sees 50% slow calls over the last 30 seconds, waiting for hard failures may be too late.
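The two cases in the previous paragraph can be worked through with a small window evaluator. The names `evaluateWindow`, `WindowPolicy`, and the threshold values (20 calls, 50%) are assumptions chosen for illustration, not values taken from Resilience4j or any other library.

```typescript
type WindowSample = { failed: boolean; slow: boolean }

type WindowPolicy = {
  minimumCalls: number
  failureRateToOpen: number // fraction between 0 and 1
  slowCallRateToOpen: number // fraction between 0 and 1
}

type WindowDecision = 'not_enough_data' | 'stay_closed' | 'open'

function evaluateWindow(samples: WindowSample[], policy: WindowPolicy): WindowDecision {
  // Below the minimum sample size, rates are not calculated at all.
  if (samples.length < policy.minimumCalls) return 'not_enough_data'

  const failureRate = samples.filter((s) => s.failed).length / samples.length
  const slowCallRate = samples.filter((s) => s.slow).length / samples.length

  return failureRate >= policy.failureRateToOpen || slowCallRate >= policy.slowCallRateToOpen
    ? 'open'
    : 'stay_closed'
}

// Illustrative thresholds only.
const examplePolicy: WindowPolicy = {
  minimumCalls: 20,
  failureRateToOpen: 0.5,
  slowCallRateToOpen: 0.5,
}

// Three calls, all failed: a 100% failure rate, but too few samples to act on.
const lowTraffic: WindowSample[] = Array.from({ length: 3 }, () => ({ failed: true, slow: false }))

// 200 calls, none failed, every second one slow: opens on the slow-call rate alone.
const highTraffic: WindowSample[] = Array.from({ length: 200 }, (_, i) => ({
  failed: false,
  slow: i % 2 === 0,
}))

console.log(evaluateWindow(lowTraffic, examplePolicy)) // not_enough_data
console.log(evaluateWindow(highTraffic, examplePolicy)) // open
```

The low-traffic route is a candidate for an alert rather than a breaker transition, while the high-traffic route opens without a single hard error.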
A Concrete TypeScript Circuit Breaker
This example is intentionally small. It is not a replacement for a proven library, but it shows the production shape: rolling outcomes, slow-call classification, open wait time, and limited half-open probes.
type BreakerState = 'closed' | 'open' | 'half_open'

type BreakerPolicy = {
  windowMs: number // length of the rolling window
  minimumCalls: number // samples required before rates are evaluated
  failureRateToOpen: number // fraction between 0 and 1
  slowCallRateToOpen: number // fraction between 0 and 1
  slowCallMs: number // calls at or above this duration count as slow
  openForMs: number // wait before half-open probes are allowed
  halfOpenMaxCalls: number // concurrent probes allowed, and healthy probes needed to close
}

type CallOutcome = {
  at: number
  failed: boolean
  slow: boolean
}

class CircuitOpenError extends Error {
  constructor(readonly dependency: string) {
    super(`Circuit open for ${dependency}`)
  }
}

class CircuitBreaker {
  private state: BreakerState = 'closed'
  private openedAt = 0
  private halfOpenInFlight = 0
  private outcomes: CallOutcome[] = []

  constructor(
    private readonly dependency: string,
    private readonly policy: BreakerPolicy
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    this.transitionIfReadyForProbe()

    if (this.state === 'open') {
      throw new CircuitOpenError(this.dependency)
    }

    // Remember whether this call is a probe; the state may change while it is in flight.
    const isProbe = this.state === 'half_open'
    if (isProbe) {
      if (this.halfOpenInFlight >= this.policy.halfOpenMaxCalls) {
        throw new CircuitOpenError(this.dependency)
      }
      this.halfOpenInFlight++
    }

    const startedAt = Date.now()
    try {
      const result = await operation()
      this.record({ failed: false, durationMs: Date.now() - startedAt }, isProbe)
      return result
    } catch (error) {
      // Simplification: every thrown error counts as a dependency failure here.
      // A production breaker should first exclude business outcomes such as expected 404s.
      this.record({ failed: true, durationMs: Date.now() - startedAt }, isProbe)
      throw error
    } finally {
      // open() and close() already reset the counter, so decrement only while still probing.
      if (isProbe && this.state === 'half_open') {
        this.halfOpenInFlight = Math.max(0, this.halfOpenInFlight - 1)
      }
    }
  }

  private transitionIfReadyForProbe() {
    if (this.state !== 'open') return
    if (Date.now() - this.openedAt >= this.policy.openForMs) {
      this.state = 'half_open'
      this.halfOpenInFlight = 0
      this.outcomes = []
    }
  }

  private record(result: { failed: boolean; durationMs: number }, isProbe: boolean) {
    // Ignore late results from calls that started before the breaker opened:
    // they must not extend the open period or be mistaken for probe results.
    if (this.state === 'open') return
    if (this.state === 'half_open' && !isProbe) return

    const now = Date.now()
    const outcome: CallOutcome = {
      at: now,
      failed: result.failed,
      slow: result.durationMs >= this.policy.slowCallMs,
    }
    this.outcomes.push(outcome)
    // Keep only outcomes inside the rolling window.
    this.outcomes = this.outcomes.filter((item) => now - item.at <= this.policy.windowMs)

    if (this.state === 'half_open') {
      // Any failed or slow probe sends the breaker straight back to open.
      if (outcome.failed || outcome.slow) {
        this.open()
        return
      }
      // Enough healthy probes: resume normal traffic.
      if (this.outcomes.length >= this.policy.halfOpenMaxCalls) {
        this.close()
      }
      return
    }

    // Closed state: do not evaluate rates until the window holds enough samples.
    if (this.outcomes.length < this.policy.minimumCalls) return

    const failures = this.outcomes.filter((item) => item.failed).length
    const slowCalls = this.outcomes.filter((item) => item.slow).length
    const failureRate = failures / this.outcomes.length
    const slowCallRate = slowCalls / this.outcomes.length

    if (
      failureRate >= this.policy.failureRateToOpen ||
      slowCallRate >= this.policy.slowCallRateToOpen
    ) {
      this.open()
    }
  }

  private open() {
    this.state = 'open'
    this.openedAt = Date.now()
    this.halfOpenInFlight = 0
  }

  private close() {
    this.state = 'closed'
    this.outcomes = []
    this.halfOpenInFlight = 0
  }
}
One detail in the code is deliberate: slow calls can open the breaker before errors dominate.
Many dependency incidents begin as latency incidents. If the breaker waits for explicit errors, the caller may already have filled its connection pool or request workers with slow calls. Slow-call thresholds let the caller protect itself before timeout failures become the main signal.
Where To Put The Breaker
A circuit breaker belongs around a specific dependency operation, not around the whole service by default.
Bad shape:
const checkoutBreaker = new CircuitBreaker('checkout-service', policy)
Better shape:
const priceLookupBreaker = new CircuitBreaker('pricing.lookupPrice', pricingPolicy)
const inventoryReserveBreaker = new CircuitBreaker('inventory.reserveStock', inventoryPolicy)
const paymentRiskBreaker = new CircuitBreaker('risk.scorePayment', riskPolicy)
Different operations fail differently.
Inventory reservation may be critical and write-heavy. Price lookup may tolerate cached data. Risk scoring may degrade with a stricter limit. A single "checkout dependency breaker" collapses those differences into one state and can block healthy paths because one operation is unhealthy.
The same issue appears with shards, regions, and tenants. If only one shard is unhealthy, a global breaker may block calls to healthy shards. If only one region has elevated latency, a global breaker may hide routing information the system needs. Key the breaker at the boundary where failure is actually correlated.
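One way to key breakers at the correlation boundary is a small registry that creates one breaker per operation and shard on first use. This is a sketch under assumptions: `BreakerRegistry`, `BreakerKey`, and the `operation#shard` naming scheme are hypothetical, and the factory is left generic so any breaker implementation can be plugged in.

```typescript
// Hypothetical key: which operation is called, and which shard (or region) it targets.
type BreakerKey = { operation: string; shard?: string }

class BreakerRegistry<B> {
  private readonly breakers = new Map<string, B>()
  private readonly createBreaker: (name: string) => B

  // createBreaker is supplied by the caller and builds one breaker per distinct key.
  constructor(createBreaker: (name: string) => B) {
    this.createBreaker = createBreaker
  }

  get(key: BreakerKey): B {
    const name = key.shard !== undefined ? `${key.operation}#${key.shard}` : key.operation
    let breaker = this.breakers.get(name)
    if (breaker === undefined) {
      breaker = this.createBreaker(name)
      this.breakers.set(name, breaker)
    }
    return breaker
  }
}

// Stand-in breaker objects, used only to show that state is isolated per key.
const registry = new BreakerRegistry((name) => ({ name, state: 'closed' as 'closed' | 'open' }))

const shard17 = registry.get({ operation: 'pricing.lookupPrice', shard: '17' })
shard17.state = 'open'

const shard2 = registry.get({ operation: 'pricing.lookupPrice', shard: '2' })
console.log(shard2.state) // closed
```

The trade-off is cardinality: a key that is too fine leaves each breaker with too few samples to reach its minimum call count, so the key should match how failures actually cluster.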
Fallback Is A Product Decision
Opening a breaker makes only one decision: do not call this dependency right now.
It does not decide what the user, caller, or workflow should receive instead.
| Operation | Possible fallback | Risk |
|---|---|---|
| Product recommendation | return cached or empty recommendations | lower personalization, usually acceptable |
| Shipping ETA lookup | show "ETA unavailable" | user sees degraded experience |
| Price calculation | use cached price only if business allows | stale price can create financial or trust issues |
| Payment authorization | usually fail or queue very carefully | unsafe fallback can create duplicate or incorrect payment |
| Receipt email scheduling | enqueue later if the queue is healthy | async path must be replay-safe |
| Fraud or risk scoring | use conservative decision or manual review | may increase false positives or block legitimate users |
Fallbacks are not free. A stale cached value may be acceptable for recommendations and unacceptable for pricing. A queued payment step may create correctness problems unless the operation is idempotent. A default response may satisfy availability metrics while hiding a broken user experience.
If the fallback turns synchronous work into asynchronous work, the article Background Jobs in Production covers the next reliability boundary: replay-safe handlers, retries, queue age, and dead-letter triage.
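Making the fallback explicit per operation can be as simple as a typed policy that the caller consults when the breaker rejects a call. The sketch below is illustrative only: `FallbackPolicy`, `resolveFallback`, and `maxAgeMs` are hypothetical names, and the three variants cover only some of the options discussed above.

```typescript
// Hypothetical per-operation fallback policy, decided before the first incident.
type FallbackPolicy<T> =
  | { kind: 'fail' } // for example, payment authorization
  | { kind: 'cached'; maxAgeMs: number } // for example, price lookup, if the business allows it
  | { kind: 'default'; value: T } // for example, empty recommendations

type CachedValue<T> = { value: T; storedAt: number }

type FallbackResult<T> =
  | { served: 'fallback'; value: T }
  | { served: 'rejected'; reason: string }

// Called only after the breaker has refused the dependency call.
function resolveFallback<T>(
  policy: FallbackPolicy<T>,
  cached: CachedValue<T> | undefined,
  now: number
): FallbackResult<T> {
  switch (policy.kind) {
    case 'fail':
      return { served: 'rejected', reason: 'no safe fallback for this operation' }
    case 'default':
      return { served: 'fallback', value: policy.value }
    case 'cached':
      // Stale data is served only inside the age the business has accepted.
      if (cached !== undefined && now - cached.storedAt <= policy.maxAgeMs) {
        return { served: 'fallback', value: cached.value }
      }
      return { served: 'rejected', reason: 'cached value missing or too stale' }
  }
}
```

Returning a tagged result instead of a bare value keeps fallback responses distinguishable from real successes further up the stack.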
Circuit Breakers, Retries, Timeouts, And Backpressure
Circuit breakers work only when the surrounding controls agree with them.
| Control | Job | Failure if misused |
|---|---|---|
| Timeout | Bound how long one call may wait | Too long ties up callers; too short creates false failures |
| Retry | Spend extra attempts on failures likely to recover | Can amplify dependency overload |
| Retry budget | Limit how much extra retry traffic callers may create | Too loose lets retries dominate fresh traffic |
| Circuit breaker | Stop calling a dependency that looks unhealthy | Too broad blocks healthy paths |
| Rate limit | Keep callers, tenants, or routes inside an admission budget | Too high fails to protect capacity |
| Backpressure | Communicate downstream saturation before it spreads upstream | Too late becomes ordinary failure |
| Bulkhead/concurrency cap | Keep one dependency from consuming every shared resource | Too small rejects useful work; too large isolates less |
Retries should usually call through the breaker, and retry logic should stop when the breaker is open. Otherwise every caller can keep retrying a dependency the breaker has already declared unhealthy. That is the retry-amplification failure covered in Adding Retries Can Make Outages Worse and Retry Budgets in Microservices.
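A minimal sketch of a retry loop that respects the breaker is shown below. It assumes the breaker signals rejection by throwing a `CircuitOpenError`, as the example class earlier in this article does; `retryThroughBreaker`, `shouldRetry`, and `maxAttempts` are hypothetical names, and backoff and retry budgets are omitted for brevity.

```typescript
// Stand-in for the error an open breaker throws.
class CircuitOpenError extends Error {}

// Decide whether another attempt is allowed after a failure.
function shouldRetry(error: unknown, attempt: number, maxAttempts: number): boolean {
  // The breaker has already declared the dependency unhealthy: stop immediately.
  if (error instanceof CircuitOpenError) return false
  return attempt < maxAttempts
}

// callThroughBreaker is expected to wrap the remote call in breaker.execute(...).
async function retryThroughBreaker<T>(
  callThroughBreaker: () => Promise<T>,
  maxAttempts: number
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await callThroughBreaker()
    } catch (error) {
      if (!shouldRetry(error, attempt, maxAttempts)) throw error
    }
  }
}
```

Because every attempt goes through the breaker, retried calls also feed its rolling window, and the loop ends on the first rejection instead of spending the remaining attempts.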
Rate limiting and backpressure answer a different question: how much work should enter the system or downstream path at all? A circuit breaker opens after recent dependency behavior says "this path is unhealthy." Admission controls should often activate earlier, before the dependency is already failing. See Rate Limiting and Backpressure in Microservices for that side of the overload-control cluster.
Common Misconfigurations
Opening Only On Hard Errors
If the breaker ignores latency, it opens after the caller has already paid the timeout cost.
Track slow-call ratio as well as error ratio. A service that returns 200 responses after 8 seconds can still break the caller if the caller's useful latency budget is 700 ms.
Sharing One Breaker Across Too Many Operations
A global breaker is easy to manage and often too blunt.
If pricing.lookupPrice is unhealthy, that does not mean pricing.getCurrencyRules is unhealthy. If shard 17 is timing out, that does not mean shard 2 should be blocked. Key breakers where failure correlation is real.
Half-Open Floods
Half-open should allow a small number of probes. It should not release the entire backlog.
If a breaker opens for 30 seconds and then every instance sends normal traffic at the same time, the dependency can fail again before the first probe result is even useful.
No Minimum Call Count
Low sample sizes make breakers jumpy.
If one call fails on a route that has seen only two calls, a 50% failure rate says little. Use minimum sample counts so low-traffic paths do not flap because of tiny windows.
No Fallback Ownership
Engineers often configure the breaker and leave product behavior vague.
When the breaker opens, does the API return 503, cached data, partial data, a queued command, or a conservative response? That decision should be explicit before the first incident.
Breaker Metrics Hidden Inside Logs
State changes should be metrics and events, not only log lines.
During an incident, operators need to see which dependency breaker opened, why it opened, how much traffic it rejected, whether fallback is working, and whether half-open probes are succeeding.
Metrics To Watch
Track breaker behavior as first-class production telemetry.
| Metric | Why it matters |
|---|---|
| current state by dependency and operation | shows which dependency path is unhealthy |
| state transitions | reveals flapping and recovery patterns |
| open duration | shows whether recovery is quick or prolonged |
| calls rejected because circuit is open | measures protected traffic and user-visible impact |
| fallback count and fallback success rate | shows whether degradation is working |
| failure rate and slow-call rate by window | explains why the breaker opened |
| half-open probe attempts and outcomes | shows whether recovery is real |
| retry rate while breaker is open | catches clients fighting the breaker |
| caller saturation while breaker is open | confirms whether resources are being preserved |
Breaker metrics should sit beside dependency latency, timeout rate, retry rate, and queue depth. Looking at them alone can be misleading.
For example, an open breaker with low caller latency may look healthy on an API latency dashboard. Users may still be receiving fallback responses or hard failures. Keep useful success, fallback success, and rejected work separate.
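Keeping those categories separate can be sketched with a small in-memory tally. The outcome labels and the `BreakerTally` class are hypothetical; a real service would emit the same dimensions through its existing metrics client rather than a map.

```typescript
// Hypothetical outcome labels; they are never merged into a single "success" count.
type RequestOutcome = 'useful_success' | 'fallback_success' | 'fallback_failure' | 'rejected'

class BreakerTally {
  private readonly counts = new Map<string, number>()

  // Count one request outcome for a specific dependency operation.
  record(operation: string, outcome: RequestOutcome): void {
    const key = `${operation}|${outcome}`
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1)
  }

  count(operation: string, outcome: RequestOutcome): number {
    return this.counts.get(`${operation}|${outcome}`) ?? 0
  }
}

const tally = new BreakerTally()
tally.record('pricing.lookupPrice', 'rejected')
tally.record('pricing.lookupPrice', 'fallback_success')
tally.record('pricing.lookupPrice', 'fallback_success')

console.log(tally.count('pricing.lookupPrice', 'useful_success')) // 0
console.log(tally.count('pricing.lookupPrice', 'fallback_success')) // 2
```

In this example the caller answered three requests quickly, yet none of them was a useful success, which is exactly what a latency-only dashboard would hide.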
Rollout Checklist
Use this checklist before enabling a circuit breaker on a production dependency:
- Name the protected dependency operation precisely.
- Define which failures count as dependency-health failures.
- Exclude business outcomes such as validation errors and expected 404 responses.
- Set a timeout budget before the breaker records a call as slow or failed.
- Choose a rolling window and minimum call count that match the traffic shape.
- Add a slow-call threshold, not only a hard-error threshold.
- Limit half-open probes per instance and across the fleet if needed.
- Decide the fallback behavior for each operation.
- Make retry logic stop when the breaker is open.
- Emit state transitions, rejection counts, fallback counts, and probe results.
- Add a manual force-open and reset path for incident response.
- Test the breaker with dependency latency, dependency errors, and partial regional failure.
Do not measure success by whether the breaker never opens. A breaker that never opens may be unnecessary, misconfigured, or protecting the wrong boundary.
Measure success by whether the caller preserves useful capacity while the dependency is unhealthy, and whether recovery happens without a second traffic surge.
The Short Version
Circuit breakers are dependency-health isolation.
They stop callers from continuing to spend resources on remote operations that are likely to fail. They work best with short timeouts, bounded retries, retry budgets, per-dependency concurrency limits, explicit fallback behavior, and visible state transitions.
They do not replace rate limiting, backpressure, bulkheads, retries, or capacity fixes.
The production question is not "do we have a circuit breaker?" It is "which dependency operation can fail, how quickly will callers stop sending harmful work, and what will the system do instead?"