January 12, 2026 · 4 min read · resilience, distributed-systems, reliability

Circuit Breakers, Done Right

Most circuit breakers protect nothing. Here's how to place them, tune them, and avoid the worst failure mode — circuit-breaker-induced outages.

Circuit breakers are one of those patterns engineers repeat without understanding. Libraries like Hystrix popularised them, then everyone slapped @CircuitBreaker on random methods and called it resilient.

At Bolt I rolled out circuit breakers across a 30-service estate. Done right, they bought us roughly 35% better p99 and far fewer 3am pages. Here's what I learned.

What a breaker actually does

A circuit breaker sits in front of a dependency and tracks its failure rate. Three states:

  • Closed — requests flow through. Normal.
  • Open — requests are rejected immediately without calling the dependency. The breaker "trips" when failure rate exceeds a threshold.
  • Half-open — after a cool-down, a handful of probe requests are allowed. If they succeed, close the breaker. If they fail, open again.

The point isn't to hide failures. It's to fail fast. A timing-out dependency is worse than a clearly-broken one because every caller waits 30s and piles up.
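The three states can be sketched as a small state machine. This is a minimal illustration, not any particular library's API: it uses fixed counters rather than a true rolling window, and the first successful half-open probe closes the breaker. All class and parameter names are mine.

```python
import time

class CircuitBreaker:
    """Minimal three-state breaker sketch (illustrative, not production-ready)."""

    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

    def __init__(self, failure_threshold=0.5, min_requests=20,
                 open_duration=10.0, probe_count=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.min_requests = min_requests
        self.open_duration = open_duration
        self.probe_count = probe_count
        self.clock = clock
        self.state = self.CLOSED
        self.successes = self.failures = 0
        self.opened_at = 0.0
        self.probes_left = 0

    def allow(self):
        """Return True if a request may proceed to the dependency."""
        if self.state == self.OPEN:
            if self.clock() - self.opened_at >= self.open_duration:
                self.state = self.HALF_OPEN        # cool-down elapsed: start probing
                self.probes_left = self.probe_count
            else:
                return False                       # fail fast, no call made
        if self.state == self.HALF_OPEN:
            if self.probes_left <= 0:
                return False
            self.probes_left -= 1
        return True

    def record(self, ok):
        """Report the outcome of a call that allow() let through."""
        if self.state == self.HALF_OPEN:
            self._close() if ok else self._open()
            return
        self.successes += ok
        self.failures += not ok
        total = self.successes + self.failures
        if total >= self.min_requests and self.failures / total > self.failure_threshold:
            self._open()                           # trip

    def _open(self):
        self.state, self.opened_at = self.OPEN, self.clock()

    def _close(self):
        self.state = self.CLOSED
        self.successes = self.failures = 0
```

Callers check `allow()` before the remote call and report the outcome with `record()`; rejected requests return in microseconds instead of piling up behind a 30s timeout.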

The rule most people break

One breaker per dependency, not per service.

If your orders service talks to payments, search, and a vendor API, you need three breakers. A single breaker wrapping the whole service will trip because of one downstream — and suddenly healthy paths (like reading order history) start rejecting.

Where to place them

Right at the client of each remote dependency. If you're using a typed HTTP client, that's the wrapper. If you're using an SDK, wrap the SDK.

Don't wrap database calls with the same breaker library. Your ORM already has connection pooling, timeouts, and retry semantics. Adding a breaker on top usually causes double-counting or lock contention.
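Here's roughly what "breaker at the client" looks like: each remote dependency gets its own client wrapper and its own breaker instance. The `GuardedClient` class and the service names are hypothetical; the breaker can be any object exposing `allow()` and `record(ok)`:

```python
import urllib.error
import urllib.request

class BreakerOpenError(Exception):
    """Raised immediately when the breaker is rejecting requests."""

class GuardedClient:
    """A typed HTTP client for one remote dependency, with its own breaker."""

    def __init__(self, name, base_url, breaker, timeout=2.0):
        self.name = name
        self.base_url = base_url
        self.breaker = breaker          # anything with allow() / record(ok)
        self.timeout = timeout

    def get(self, path):
        if not self.breaker.allow():
            raise BreakerOpenError(f"{self.name} breaker is open")
        try:
            with urllib.request.urlopen(self.base_url + path,
                                        timeout=self.timeout) as resp:
                body = resp.read()
        except (urllib.error.URLError, TimeoutError):
            self.breaker.record(ok=False)
            raise
        self.breaker.record(ok=True)
        return body

# One client, one breaker, per downstream -- never one breaker for the service:
# payments = GuardedClient("payments", "http://payments.internal", CircuitBreaker())
# search   = GuardedClient("search",   "http://search.internal",   CircuitBreaker())
```

Because the breakers are independent, a tripped payments breaker leaves search and order-history traffic untouched.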

Tuning (the part nobody writes about)

Default settings are almost always wrong. You need to think about:

| Parameter | Typical value | Why |
| --- | --- | --- |
| Failure threshold | 50% over 20 req | Too low and you trip on noise; too high and you never protect |
| Minimum requests | 20 per window | Don't trip on 2 failures out of 3 requests |
| Open duration | 5–30s | Short for fast-recovering deps, long for flaky ones |
| Half-open probes | 3–5 | Enough signal, not a thundering herd |
| Timeout | p99 × 2 | Must be shorter than the caller's timeout |

That last one matters. If your caller times out at 1s and your breaker times out at 5s, the breaker never sees the failure. It only protects the dependency, not the caller's latency budget.
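Those values can live in one small config object per dependency, with the caller-timeout rule enforced explicitly rather than remembered. A sketch using the table's starting points; the names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class BreakerConfig:
    """Starting points from the table above; tune per dependency."""
    failure_threshold: float = 0.5   # trip above 50% failures in the window
    min_requests: int = 20           # never trip on a tiny sample
    open_duration_s: float = 10.0    # 5-30s depending on how fast the dep recovers
    half_open_probes: int = 3        # enough signal, not a thundering herd
    call_timeout_s: float = 0.4      # roughly the dependency's p99 x 2

def validate(cfg: BreakerConfig, caller_timeout_s: float) -> None:
    # If the breaker's timeout exceeds the caller's, the caller gives up
    # first and the breaker never observes the failure.
    if cfg.call_timeout_s >= caller_timeout_s:
        raise ValueError("breaker timeout must be shorter than the caller's timeout")
```

Running `validate()` at startup turns the 1s-caller / 5s-breaker mistake into a deploy-time error instead of a silent gap in protection.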

Graceful degradation > failure

An open breaker should do more than reject. It should return a fallback whenever one is safe:

  • Search service down → return cached popular results.
  • Maps tiles down → return the last known tiles.
  • Recommendation service down → return a hand-curated default.
  • Payments down → return a clear error to the user. Never return a fake success.

The rule: fallbacks are fine for non-authoritative reads. For writes and for anything involving money, fail loudly.
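One way to encode that rule is to make the fallback opt-in: safe reads pass one, writes and payments don't, so money paths can only fail loudly. A hedged sketch with hypothetical names:

```python
class DependencyUnavailable(Exception):
    """Raised when the breaker is open and no safe fallback exists."""

def call_with_fallback(breaker, call, fallback=None):
    """Run `call` through a breaker.

    Pass `fallback` only for non-authoritative reads (cached search
    results, stale map tiles). For writes and payments, leave
    fallback=None so the failure surfaces to the caller.
    """
    if not breaker.allow():
        if fallback is not None:
            return fallback()
        raise DependencyUnavailable("breaker open, no safe fallback")
    try:
        result = call()
    except Exception:
        breaker.record(ok=False)
        if fallback is not None:
            return fallback()
        raise
    breaker.record(ok=True)
    return result
```

A search call becomes `call_with_fallback(search_breaker, live_search, cached_popular)`; a charge stays `call_with_fallback(payments_breaker, charge_card)` with no third argument, and there is no code path that fakes a success.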

The worst failure mode

A cluster-wide retry storm triggered by simultaneous breaker recovery. All 20 instances trip their breakers, wait the same 10s, then all 20 send probe requests at the same moment. The dependency, just coming back to life, gets hammered flat.

Fix: jitter the half-open window per instance. Add ±20% random skew. Trivial, essential.
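The jitter itself is a one-liner: each instance computes its own open duration, so the fleet's half-open probes spread out instead of landing in the same instant. Illustrative sketch:

```python
import random

def jittered_open_duration(base_s: float, skew: float = 0.2) -> float:
    """Open duration with +/-20% per-instance random skew, so 20 instances
    that trip together don't all probe the recovering dependency at once."""
    return base_s * (1.0 + random.uniform(-skew, skew))
```

With `base_s=10.0`, each instance waits somewhere between 8 and 12 seconds before its first probe.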

Observability or it didn't happen

Every breaker should emit:

  • State (closed / open / half-open)
  • Failure rate (rolling window)
  • Reject count (per state)
  • Fallback count

And every breaker state change should be a log line with the dependency name. At Bolt we had a Grafana dashboard showing breaker state across all services. You could see incidents propagate in real time.
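A thin wrapper can emit the state-change log line and a reject counter without touching the breaker logic. This is a hypothetical sketch, not a library API; it assumes an inner breaker exposing `state`, `allow()`, and `record(ok)`:

```python
import logging

log = logging.getLogger("breaker")

class InstrumentedBreaker:
    """Wraps any breaker-like object and logs every state transition
    with the dependency name, plus a running reject count."""

    def __init__(self, name, inner):
        self.name = name
        self.inner = inner
        self.rejects = 0
        self._last_state = inner.state

    def allow(self):
        allowed = self.inner.allow()
        if not allowed:
            self.rejects += 1      # rejected without calling the dependency
        self._emit()
        return allowed

    def record(self, ok):
        self.inner.record(ok)
        self._emit()

    def _emit(self):
        if self.inner.state != self._last_state:
            log.warning("breaker %s: %s -> %s",
                        self.name, self._last_state, self.inner.state)
            self._last_state = self.inner.state
```

Point your metrics pipeline at the same transition hook and the dashboard comes almost for free.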

What I'd tell a team starting today

  1. One breaker per dependency.
  2. Timeouts shorter than caller timeouts.
  3. Fallbacks only for safe reads.
  4. Jitter the recovery window.
  5. Dashboards from day one.

Circuit breakers are simple in theory and subtle in practice. Place them carefully, tune them honestly, and they'll save you. Deploy them on autopilot, and they'll lie to you.


Rolling out resilience patterns across your services? Let's talk.
