Circuit Breakers, Done Right
Most circuit breakers protect nothing. Here's how to place them, tune them, and avoid the worst failure mode — circuit-breaker-induced outages.
Circuit breakers are one of those patterns engineers repeat without understanding. Libraries like Hystrix popularised them, then everyone slapped @CircuitBreaker on random methods and called it resilient.
At Bolt I rolled out circuit breakers across a 30-service estate. Done right, they bought us roughly 35% better p99 and far fewer 3am pages. Here's what I learned.
What a breaker actually does
A circuit breaker sits in front of a dependency and tracks its failure rate. Three states:
- Closed — requests flow through. Normal.
- Open — requests are rejected immediately without calling the dependency. The breaker "trips" when failure rate exceeds a threshold.
- Half-open — after a cool-down, a handful of probe requests are allowed. If they succeed, close the breaker. If they fail, open again.
The point isn't to hide failures. It's to fail fast. A timing-out dependency is worse than a clearly-broken one, because every caller waits out the full 30s timeout and requests pile up behind it.
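Here's a minimal sketch of that state machine in Java. It's illustrative only: the names are made up, it isn't thread-safe, and the failure window never slides the way a real rolling window would.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Minimal sketch: closed -> open -> half-open -> closed.
// Not thread-safe, and the window never slides (a real implementation
// uses a rolling window); it only exists to show the three states.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final double failureThreshold;  // e.g. 0.5 = trip at 50% failures
    private final int minimumCalls;         // never trip on a tiny sample
    private final Duration openDuration;    // cool-down before probing
    private final int halfOpenProbes;       // probes allowed while half-open

    private State state = State.CLOSED;
    private int calls, failures, probes, probeFailures;
    private Instant openedAt;

    SimpleCircuitBreaker(double failureThreshold, int minimumCalls,
                         Duration openDuration, int halfOpenProbes) {
        this.failureThreshold = failureThreshold;
        this.minimumCalls = minimumCalls;
        this.openDuration = openDuration;
        this.halfOpenProbes = halfOpenProbes;
    }

    <T> T call(Supplier<T> dependency, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (Instant.now().isAfter(openedAt.plus(openDuration))) {
                state = State.HALF_OPEN;   // cool-down over: let a few probes through
                probes = 0;
                probeFailures = 0;
            } else {
                return fallback.get();     // fail fast without touching the dependency
            }
        }
        try {
            T result = dependency.get();
            record(true);
            return result;
        } catch (RuntimeException e) {
            record(false);
            return fallback.get();
        }
    }

    private void record(boolean success) {
        if (state == State.HALF_OPEN) {
            probes++;
            if (!success) probeFailures++;
            if (probes >= halfOpenProbes) {          // enough probes: decide
                state = probeFailures == 0 ? State.CLOSED : State.OPEN;
                if (state == State.OPEN) openedAt = Instant.now();
                calls = 0;
                failures = 0;
            }
            return;
        }
        calls++;
        if (!success) failures++;
        if (calls >= minimumCalls && (double) failures / calls >= failureThreshold) {
            state = State.OPEN;                      // trip
            openedAt = Instant.now();
        }
    }
}
```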
The rule most people break
One breaker per dependency, not per service.
If your orders service talks to payments, search, and a vendor API, you need three breakers. A single breaker wrapping the whole service will trip because of one downstream — and suddenly healthy paths (like reading order history) start rejecting.
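Concretely, that means one breaker instance per dependency name. A hypothetical registry built on the sketch above:

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// One breaker per *dependency*, looked up by name.
// Builds on the SimpleCircuitBreaker sketch above; names are illustrative.
class BreakerRegistry {
    private final Map<String, SimpleCircuitBreaker> breakers = new ConcurrentHashMap<>();

    SimpleCircuitBreaker forDependency(String name) {
        return breakers.computeIfAbsent(name, n ->
            new SimpleCircuitBreaker(0.5, 20, Duration.ofSeconds(10), 3));
    }
}
```

In the orders service that gives you `forDependency("payments")`, `forDependency("search")` and `forDependency("vendor-api")` as three independent breakers, so payments tripping never blocks order-history reads.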
Where to place them
Right at the client of each remote dependency. If you're using a typed HTTP client, that's the wrapper. If you're using an SDK, wrap the SDK.
Don't wrap database calls with the same breaker library. Your ORM already has connection pooling, timeouts, and retry semantics. Adding a breaker on top usually causes double-counting or lock contention.
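A sketch of what that wrapper can look like for a payments client, with illustrative type names (`PaymentsClient`, `PaymentsUnavailableException`). The breaker lives in the one place that talks to the remote dependency, so every caller gets the same protection:

```java
import java.time.Duration;

// Illustrative types: the real client would be your generated/typed HTTP client.
interface PaymentsClient { String charge(String orderId, long amountCents); }
class PaymentsUnavailableException extends RuntimeException {
    PaymentsUnavailableException(String msg) { super(msg); }
}

// The breaker wraps the single gateway to the remote dependency.
class PaymentsGateway {
    private final PaymentsClient client;
    private final SimpleCircuitBreaker breaker =
        new SimpleCircuitBreaker(0.5, 20, Duration.ofSeconds(10), 3);

    PaymentsGateway(PaymentsClient client) { this.client = client; }

    String charge(String orderId, long amountCents) {
        // Payments is a write: no fake fallback. When the breaker is open, fail loudly.
        return breaker.call(
            () -> client.charge(orderId, amountCents),
            () -> { throw new PaymentsUnavailableException("payments breaker open"); });
    }
}
```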
Tuning (the part nobody writes about)
Default settings are almost always wrong. You need to think about:
| Parameter | Typical value | Why |
|---|---|---|
| Failure threshold | 50% over 20 req | Too low and you trip on noise; too high and you never protect |
| Minimum requests | 20 per window | Don't trip on 2 failures out of 3 requests |
| Open duration | 5–30s | Short for fast-recovering deps, long for flaky ones |
| Half-open probes | 3–5 | Enough signal, not a thundering herd |
| Timeout | p99 × 2 | Must be shorter than the caller's timeout |
That last one matters. If your caller times out at 1s and your breaker times out at 5s, the breaker never sees the failure. It only protects the dependency, not the caller's latency budget.
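If you happen to be on resilience4j, the table maps roughly onto a config like this. Method names can shift between versions, and the per-call timeout belongs to its separate TimeLimiter, not to the breaker itself:

```java
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.timelimiter.TimeLimiterConfig;
import java.time.Duration;

class PaymentsBreakerConfig {
    static CircuitBreakerConfig breakerConfig() {
        return CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                          // trip at 50% failures...
            .slidingWindowSize(20)                             // ...measured over 20 calls
            .minimumNumberOfCalls(20)                          // never trip on a tiny sample
            .waitDurationInOpenState(Duration.ofSeconds(10))   // cool-down before half-open
            .permittedNumberOfCallsInHalfOpenState(3)          // probe requests
            .build();
    }

    // The per-call timeout is separate from the breaker. Keep it below the caller's
    // own timeout: here we assume a dependency p99 of ~400ms and a caller budget of 1s.
    static TimeLimiterConfig timeoutConfig() {
        return TimeLimiterConfig.custom()
            .timeoutDuration(Duration.ofMillis(800))
            .build();
    }
}
```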
Graceful degradation > failure
An open breaker should do more than reject. It should return a fallback whenever one is safe:
- Search service down → return cached popular results.
- Map tiles down → return the last known tiles.
- Recommendation service down → return a hand-curated default.
- Payments down → return a clear error to the user. Never return a fake success.
The rule: fallbacks are fine for non-authoritative reads. For writes and for anything involving money, fail loudly.
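Here's the read-side version as a sketch, again with made-up names (`SearchClient`): an open breaker or a failed call degrades to cached popular results instead of an error.

```java
import java.time.Duration;
import java.util.List;

// Illustrative names; the pattern is what matters: a fallback for a
// non-authoritative read, never for a write.
interface SearchClient { List<String> search(String query); }

class SearchGateway {
    private final SearchClient client;
    private final List<String> cachedPopularResults;   // refreshed out of band
    private final SimpleCircuitBreaker breaker =
        new SimpleCircuitBreaker(0.5, 20, Duration.ofSeconds(10), 3);

    SearchGateway(SearchClient client, List<String> cachedPopularResults) {
        this.client = client;
        this.cachedPopularResults = cachedPopularResults;
    }

    List<String> search(String query) {
        // Search is non-authoritative: degrade to cached popular results
        // when the breaker is open or the call fails.
        return breaker.call(
            () -> client.search(query),
            () -> cachedPopularResults);
    }
}
```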
The worst failure mode
A cluster-wide retry storm triggered by simultaneous breaker recovery. All 20 instances trip their breakers, wait the same 10s, then all 20 send probe requests at the same moment. The dependency, just coming back to life, gets hammered flat.
Fix: jitter the half-open window per instance. Add ±20% random skew. Trivial, essential.
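A sketch of that jitter, applied to the base open duration each time a breaker trips:

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Jitter the cool-down so 20 instances don't all probe the recovering
// dependency at the same instant: ±20% random skew around the base duration.
class Jitter {
    static Duration jittered(Duration base) {
        double factor = ThreadLocalRandom.current().nextDouble(0.8, 1.2);
        return Duration.ofMillis((long) (base.toMillis() * factor));
    }
}

// e.g. Jitter.jittered(Duration.ofSeconds(10)) -> anywhere between 8s and 12s
```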
Observability or it didn't happen
Every breaker should emit:
- State (closed / open / half-open)
- Failure rate (rolling window)
- Reject count (per state)
- Fallback count
And every breaker state change should be a log line with the dependency name. At Bolt we had a Grafana dashboard showing breaker state across all services. You could see incidents propagate in real time.
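For the log-line part, assuming resilience4j, a state-transition listener per dependency is enough (SLF4J here; the metric side would come from its Micrometer module):

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Every state change becomes a log line tagged with the dependency name.
class BreakerObservability {
    private static final Logger log = LoggerFactory.getLogger(BreakerObservability.class);

    static void logStateChanges(CircuitBreakerRegistry registry, String dependency) {
        CircuitBreaker breaker = registry.circuitBreaker(dependency);
        breaker.getEventPublisher().onStateTransition(event ->
            log.warn("breaker dependency={} transition={}",
                dependency, event.getStateTransition()));
    }
}
```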
What I'd tell a team starting today
- One breaker per dependency.
- Timeouts shorter than caller timeouts.
- Fallbacks only for safe reads.
- Jitter the recovery window.
- Dashboards from day one.
Circuit breakers are simple in theory and subtle in practice. Place them carefully, tune them honestly, and they'll save you. Deploy them on autopilot, and they'll lie to you.
Rolling out resilience patterns across your services? Let's talk.
Have a system that needs to scale — or stop breaking?
I work with a small number of teams each month on architecture reviews, scaling, and hands-on backend engineering. If that sounds like you, let's talk.