Circuit Breaker Pattern: Prevent Cascading Failures
The circuit breaker pattern stops your application from repeatedly calling a failing service, preventing a cascade where one service's failure brings down your entire system. Named after the electrical circuit breaker, it has three states: closed (normal operation), open (failing; requests are rejected immediately), and half-open (testing whether the service has recovered). When your API calls a flaky payment processor that's timing out, the circuit breaker opens after a few failures and rejects new requests with a fast error instead of waiting 30 seconds for each timeout. This protects your API's responsiveness and gives the failing service time to recover without being hammered by retries.
I managed an e-commerce platform where one slow database replica would occasionally become unresponsive. Client-side retries would cascade: each browser retried 5 times, users refreshed and triggered another round of retries, and soon our entire frontend felt frozen. Adding a circuit breaker made the service return errors in milliseconds instead of waiting 30 seconds for timeouts. Users got a "service temporarily unavailable" message and refreshed; the quick failure unblocked them instead of making them wait.
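To put rough numbers on the difference between a 30-second timeout and a fast failure, here is a back-of-the-envelope sketch. The pool size (50 workers) and arrival rate (20 requests per second) are made-up figures for illustration, not measurements from the platform described above; only the 30-second timeout comes from the text.

```python
def seconds_until_saturated(workers: int, arrival_rate: float) -> float:
    """Time for hanging requests to occupy every worker.

    Assumes each request holds its worker for longer than this duration,
    which is the case when every call waits out a long timeout.
    """
    return workers / arrival_rate


def max_throughput(workers: int, seconds_per_request: float) -> float:
    """Requests per second a fixed worker pool can sustain."""
    return workers / seconds_per_request


WORKERS = 50        # hypothetical pool size
ARRIVAL_RATE = 20   # hypothetical requests per second

# Every worker is stuck after 50 / 20 = 2.5 seconds of a downstream outage
time_to_freeze = seconds_until_saturated(WORKERS, ARRIVAL_RATE)

# Waiting out 30-second timeouts: roughly 1.67 requests/second
slow_failure_rate = max_throughput(WORKERS, seconds_per_request=30)

# Failing in about a millisecond: roughly 50,000 requests/second
fast_failure_rate = max_throughput(WORKERS, seconds_per_request=0.001)
```

Under these assumptions the pool freezes within seconds of the outage starting, which is why the fast rejection matters more than the error itself.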
Circuit Breaker States and Transitions
A circuit breaker has three states:
Closed (Normal): Requests flow through. Failures are counted. After N consecutive failures or M failures per time window, the breaker opens.
Open (Failing): Requests fail immediately without attempting to call the downstream service. No timeouts, no retries—instant rejection. After a timeout period (e.g., 30 seconds), the breaker transitions to half-open to test recovery.
Half-Open (Testing): A single request is allowed through. If it succeeds, the breaker closes and normal operation resumes. If it fails, the breaker opens again.
The transitions prevent two problems:
- Overload: Once a service fails, you stop retrying and give it breathing room.
- Indefinite failure: Half-open state ensures you attempt recovery; you don't stay open forever.
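The states and transitions described above can be written down as a small lookup table. This is only an illustrative sketch: the `State`, `Event`, and `next_state` names are hypothetical and separate from the implementation that follows.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class Event(Enum):
    THRESHOLD_REACHED = "threshold_reached"  # enough failures while closed
    TIMEOUT_ELAPSED = "timeout_elapsed"      # recovery timeout expired while open
    PROBE_SUCCEEDED = "probe_succeeded"      # trial request worked
    PROBE_FAILED = "probe_failed"            # trial request failed


# Every legal transition; any other (state, event) pair is ignored
TRANSITIONS = {
    (State.CLOSED, Event.THRESHOLD_REACHED): State.OPEN,
    (State.OPEN, Event.TIMEOUT_ELAPSED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.PROBE_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.PROBE_FAILED): State.OPEN,
}


def next_state(state: State, event: Event) -> State:
    """Return the state after `event`; unlisted pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Note that there is no direct path from open to closed: the breaker always has to pass a probe in the half-open state first.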
Simple Circuit Breaker Implementation
Here's a minimal circuit breaker suitable for single-threaded applications; concurrent use needs a lock around the state updates:
from enum import Enum
from datetime import datetime


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing recovery


class CircuitBreaker:
    def __init__(self,
                 failure_threshold: int = 5,
                 recovery_timeout: int = 60,
                 expected_exception: type[Exception] = Exception):
        """
        Initialize the circuit breaker.

        Args:
            failure_threshold: Open after N consecutive failures
            recovery_timeout: Seconds before attempting half-open
            expected_exception: Exception type that triggers the breaker
        """
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        self.failure_count = 0
        self.state = CircuitState.CLOSED
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        """Execute func with circuit breaker protection."""
        # If open, check if we should try half-open
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                # Recovery timeout elapsed: let this call through as a probe
                self.state = CircuitState.HALF_OPEN
                self.failure_count = 0
            else:
                # Still in recovery timeout
                raise RuntimeError("Circuit breaker is open; service unavailable")
        try:
            result = func(*args, **kwargs)
        except self.expected_exception:
            self.failure_count += 1
            # A failed probe reopens immediately; otherwise open at the threshold
            if (self.state == CircuitState.HALF_OPEN
                    or self.failure_count >= self.failure_threshold):
                self.state = CircuitState.OPEN
                self.opened_at = datetime.utcnow()
            raise
        # Success: a passing probe closes the breaker and any success
        # resets the consecutive-failure counter
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        return result

    def _should_attempt_reset(self) -> bool:
        """Check if recovery timeout has elapsed."""
        if self.opened_at is None:
            return False
        elapsed = (datetime.utcnow() - self.opened_at).total_seconds()
        return elapsed >= self.recovery_timeout
import random

# Usage
breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=30)


def call_payment_service(amount):
    # Simulate a flaky service
    if random.random() < 0.7:  # 70% failure rate
        raise Exception("Payment service timeout")
    return f"Payment of ${amount} processed"


# Client code
for i in range(10):
    try:
        result = breaker.call(call_payment_service, 100)
        print(f"Success: {result}")
    except Exception as e:
        print(f"Failed: {e}")
After 3 consecutive failures, the circuit opens and subsequent calls fail immediately, without attempting to call the service, until the 30-second recovery timeout elapses.
Thread-Safe Circuit Breaker for Multi-Threaded APIs
In a real Flask/FastAPI application, multiple requests execute concurrently. We need thread-safe state updates:
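import threading


class ThreadSafeCircuitBreaker(CircuitBreaker):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.lock = threading.RLock()

    def call(self, func, *args, **kwargs):
        # Check and update state under the lock, but release it before
        # calling func so a slow downstream call doesn't block other threads
        with self.lock:
            if self.state == CircuitState.OPEN:
                if self._should_attempt_reset():
                    self.state = CircuitState.HALF_OPEN
                    self.failure_count = 0
                else:
                    raise RuntimeError("Circuit breaker is open")
        # Note: threads that arrive while the breaker is half-open also
        # pass through; this version does not limit probes to one request
        try:
            result = func(*args, **kwargs)
        except self.expected_exception:
            with self.lock:
                self.failure_count += 1
                # A failed probe reopens immediately; otherwise open at the threshold
                if (self.state == CircuitState.HALF_OPEN
                        or self.failure_count >= self.failure_threshold):
                    self.state = CircuitState.OPEN
                    self.opened_at = datetime.utcnow()
            raise
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.CLOSED
            if self.state == CircuitState.CLOSED:
                self.failure_count = 0  # Any success resets the consecutive count
        return result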
Circuit Breaker in FastAPI
Here's how to integrate a circuit breaker into a FastAPI application with a fallback response:
from fastapi import FastAPI, HTTPException
from fastapi.responses import JSONResponse
import httpx

app = FastAPI()

# One breaker per external service
payment_breaker = ThreadSafeCircuitBreaker(
    failure_threshold=3,
    recovery_timeout=30,
    expected_exception=httpx.TimeoutException,
)


def call_payment_service(amount: float) -> dict:
    """Call external payment service."""
    # Synchronous client: breaker.call() invokes the function directly
    # and cannot await a coroutine
    with httpx.Client(timeout=5.0) as client:
        response = client.post(
            'https://payment-api.example.com/charge',
            json={'amount': amount},
        )
        # Surface HTTP error statuses as exceptions. Only timeouts count
        # toward the breaker unless expected_exception is widened
        # (for example to httpx.HTTPError)
        response.raise_for_status()
        return response.json()


# Declared with plain `def` so FastAPI runs the handler in a worker thread
# instead of blocking the event loop
@app.post('/checkout')
def checkout(amount: float):
    try:
        # Circuit breaker protects the call
        result = payment_breaker.call(call_payment_service, amount)
        return {'status': 'success', 'result': result}
    except Exception as e:
        if 'Circuit breaker is open' in str(e):
            # Service is down; return degraded response with a real 503 status
            return JSONResponse(
                status_code=503,
                content={
                    'status': 'unavailable',
                    'message': 'Payment service temporarily unavailable. Try again in 30 seconds.',
                    'retry_after': 30,
                },
                headers={'Retry-After': '30'},
            )
        raise HTTPException(status_code=500, detail=str(e)) from e
Now when the payment service times out 3 times, the circuit opens and subsequent requests fail immediately with a 503 instead of each waiting 5+ seconds for a timeout.
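If the downstream call is a coroutine (for example one built on `httpx.AsyncClient`), the breaker has to await it, because a synchronous `call()` would only receive an un-awaited coroutine object and never see the failure. The following is a minimal sketch of an async variant, not the classes shown earlier: `AsyncCircuitBreaker`, `flaky`, and `demo` are hypothetical names, and the sketch assumes all calls run on a single event loop, so the state updates need no lock.

```python
import asyncio
import time


class AsyncCircuitBreaker:
    """Minimal async breaker sketch for coroutine functions."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 expected_exception=Exception):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        self.failure_count = 0
        self.opened_at = None  # time.monotonic() value when the breaker opened

    @property
    def is_open(self) -> bool:
        """True while the breaker is open and the recovery timeout has not elapsed."""
        if self.opened_at is None:
            return False
        return time.monotonic() - self.opened_at < self.recovery_timeout

    async def call(self, func, *args, **kwargs):
        if self.is_open:
            raise RuntimeError("Circuit breaker is open")
        # opened_at is still set but the timeout has elapsed: this call is a probe
        probing = self.opened_at is not None
        try:
            result = await func(*args, **kwargs)
        except self.expected_exception:
            self.failure_count += 1
            if probing or self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open, or reopen after a failed probe
            raise
        # Success closes the breaker and resets the consecutive-failure count
        self.failure_count = 0
        self.opened_at = None
        return result


async def flaky():
    """Stand-in for a downstream call that always times out."""
    raise TimeoutError("simulated timeout")


async def demo():
    breaker = AsyncCircuitBreaker(failure_threshold=2, recovery_timeout=60.0,
                                  expected_exception=TimeoutError)
    outcomes = []
    for _ in range(4):
        try:
            await breaker.call(flaky)
        except TimeoutError:
            outcomes.append("timeout")
        except RuntimeError:
            outcomes.append("rejected")
    return outcomes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo the first two calls reach the failing function and the next two are rejected without calling it. The sketch does not coordinate concurrent in-flight calls, so several probes can run at once after the timeout elapses.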
Circuit Breaker with Fallback Logic
Often you want to serve cached or degraded data instead of just failing:
import json
from typing import Optional

import redis


class CircuitBreakerWithCache:
    def __init__(self, redis_client, cache_ttl: int = 3600):
        self.breaker = ThreadSafeCircuitBreaker(failure_threshold=3, recovery_timeout=30)
        self.redis = redis_client
        self.cache_ttl = cache_ttl

    def call_with_fallback(self, service_name: str, func, *args,
                           cache_key: Optional[str] = None, **kwargs):
        """Call service with circuit breaker and fallback to cache."""
        cache_key = cache_key or f"{service_name}:args:{json.dumps([args, kwargs])}"
        try:
            result = self.breaker.call(func, *args, **kwargs)
        except Exception as exc:
            # Breaker open or call failed: try to return the cached version
            cached = self.redis.get(cache_key)
            if cached is not None:
                print(f"Service {service_name} failed; serving cached version")
                return json.loads(cached)
            # No cache; raise error
            raise HTTPException(status_code=503, detail=f"{service_name} unavailable") from exc
        # Cache successful result
        self.redis.setex(cache_key, self.cache_ttl, json.dumps(result))
        return result


# Usage
redis_client = redis.Redis()  # Assumes a local Redis on the default port
breaker_cache = CircuitBreakerWithCache(redis_client)


# fetch_user_from_service is assumed to be a blocking function defined elsewhere.
# The handler uses plain `def` because the breaker and Redis calls are blocking.
@app.get('/user/{user_id}')
def get_user(user_id: int):
    data = breaker_cache.call_with_fallback(
        'user-service',
        fetch_user_from_service,
        user_id,
        cache_key=f'user:{user_id}'
    )
    return data
When the user service fails, you serve the last cached version instead of a blank 503 error. Your API remains partially functional.
Key Takeaways
- Circuit breaker prevents cascading failures by stopping retry loops to a failing service after a threshold.
- Three states (closed, open, half-open) allow automatic recovery without staying stuck in a failed state.
- In open state, fail fast (microseconds) instead of waiting for timeouts (seconds)—this is critical for user experience.
- Combine with caching or fallback logic to serve degraded but functional responses when a dependency fails.
- Monitor state transitions; frequent opens indicate an unstable dependency.
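One way to act on the monitoring takeaway is to record every state change and count how often the breaker opens. This is a minimal sketch under stated assumptions: `TransitionMonitor` is a hypothetical helper, it is fed the breaker's state as a plain string after each request, and a real deployment would more likely export these counts to a metrics system than keep them in memory.

```python
import logging
from collections import Counter

logger = logging.getLogger("circuit_breaker")


class TransitionMonitor:
    """Counts and logs breaker state changes (hypothetical helper)."""

    def __init__(self, name: str, initial_state: str = "closed"):
        self.name = name
        self.counts = Counter()  # (old_state, new_state) -> number of transitions
        self._last_state = initial_state

    def observe(self, state: str) -> None:
        """Call after each request with the breaker's current state."""
        if state == self._last_state:
            return  # no transition happened
        self.counts[(self._last_state, state)] += 1
        logger.warning("breaker %s: %s -> %s", self.name, self._last_state, state)
        self._last_state = state

    def opens(self) -> int:
        """Total transitions into the open state, from any other state."""
        return sum(n for (_, new), n in self.counts.items() if new == "open")


monitor = TransitionMonitor("payments")
for observed in ["closed", "open", "open", "half_open", "open", "half_open", "closed"]:
    monitor.observe(observed)
```

With the breaker classes from this article, the value passed to `observe` could be `breaker.state.value`. A rising `opens()` count for one dependency is the signal that it is unstable.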
Frequently Asked Questions
What is the difference between circuit breaker and bulkhead?
Circuit breaker stops requests to a failing service. Bulkhead isolates resources (e.g., separate thread pools for different services) so one service's slowness doesn't consume all threads. Use both: circuit breaker for resilience, bulkhead for resource isolation.
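For comparison, a bulkhead can be sketched with a semaphore that caps how many calls to one dependency run at the same time. The `Bulkhead` class below is a hypothetical helper, not part of any library mentioned here; it rejects immediately when all slots are busy instead of queueing.

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one dependency (hypothetical helper)."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: reject right away when every slot is busy
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("Bulkhead full; request rejected")
        try:
            return func(*args, **kwargs)
        finally:
            # Always free the slot, even if func raised
            self._slots.release()
```

The two patterns compose: wrapping the same downstream call in both a bulkhead and a circuit breaker limits how many threads a slow service can hold and stops calling it once it is clearly failing.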
Should I open the circuit after N failures or after a percentage of failures?
Both are valid. N consecutive failures is simpler and works well. Percentage-based (e.g., "open if 50% of requests in the last minute fail") is more nuanced but requires tracking rates. Start with consecutive failures.
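The percentage-based option needs a record of recent outcomes. A rough sketch of that bookkeeping follows; `FailureRateWindow` and its parameters are hypothetical, and the `min_calls` guard is an added assumption so that one failure out of one request does not count as a 100% failure rate.

```python
import time
from collections import deque


class FailureRateWindow:
    """Tracks outcomes over a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds=60.0, min_calls=10,
                 max_failure_rate=0.5, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.min_calls = min_calls
        self.max_failure_rate = max_failure_rate
        self._clock = clock          # injectable so tests can control time
        self._outcomes = deque()     # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded: bool) -> None:
        """Store the outcome of one call."""
        self._outcomes.append((self._clock(), succeeded))
        self._evict()

    def _evict(self) -> None:
        """Drop outcomes older than the window."""
        cutoff = self._clock() - self.window_seconds
        while self._outcomes and self._outcomes[0][0] < cutoff:
            self._outcomes.popleft()

    def should_open(self) -> bool:
        """True when enough recent calls exist and too many of them failed."""
        self._evict()
        total = len(self._outcomes)
        if total < self.min_calls:
            return False
        failures = sum(1 for _, ok in self._outcomes if not ok)
        return failures / total >= self.max_failure_rate
```

A breaker built on this would call `record()` after every request and open when `should_open()` returns true, in place of the consecutive-failure counter.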
How long should the recovery timeout be?
60 seconds is a reasonable default for most services. If a database deadlock causes a cascade, 60 seconds is usually enough for the deadlock to resolve. For very flaky services, use 30 seconds; for slow services, 120 seconds. Monitor and adjust based on actual recovery times.
Can I have per-user circuit breakers?
Not typically. A circuit breaker is per-downstream-service or per-operation. Opening a breaker blocks all users equally, which is fair. If you need fine-grained control, use rate limiting instead.
What happens during half-open state?
One trial request is allowed through to the real service. If it succeeds, the breaker closes. If it fails, the breaker opens again and the recovery timeout restarts. This gives the service a chance to prove it has recovered without getting hammered.
Further Reading
- Release It! Design and Deploy Production-Ready Software, by Michael T. Nygard. The canonical reference on circuit breakers and stability patterns.
- AWS Well-Architected Framework: Fault Isolation. Circuit breakers in cloud systems.
- PyBreaker Library. Production circuit breaker implementation for Python.
- Resilience4j (Java). Reference implementation; patterns apply to Python.