Health Checks and Service Readiness
Health checks allow orchestration platforms (Kubernetes, Docker Swarm) and load balancers to determine whether a service is alive (liveness) and ready to serve traffic (readiness). gRPC defines a standard health check service (grpc.health.v1.Health) that responds to probe requests. Without health checks, an orchestrator can only see whether the process is running, so it may keep routing traffic to an instance that has hung or whose dependencies are down, and the resulting errors can cascade to callers. With them, graceful shutdowns, zero-downtime deployments, and automatic recovery become possible. This guide covers implementing both liveness and readiness probes, custom health logic, and integration with Kubernetes.
Standard gRPC Health Check Service
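gRPC defines a standard health checking protocol, grpc.health.v1.Health. For Python, the service definition and a ready-made servicer ship in the grpcio-health-checking package (grpc-health-probe is a separate command-line client, covered later). Install it: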
pip install grpcio-health-checking
This provides a standard Health service and Python implementation:
import asyncio

import grpc
from grpc_health.v1 import health, health_pb2, health_pb2_grpc

import order_pb2_grpc  # Generated from your own .proto file


class OrderServicer(order_pb2_grpc.OrderServiceServicer):
    async def CreateOrder(self, request, context):
        # ... normal RPC logic
        pass


async def serve():
    server = grpc.aio.server()
    # Add your service
    order_pb2_grpc.add_OrderServiceServicer_to_server(OrderServicer(), server)
    # Add health check service
    health_servicer = health.HealthServicer()
    health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)
    # Set initial status: serving
    health_servicer.set(
        "ecommerce.orders.OrderService",
        health_pb2.HealthCheckResponse.SERVING,
    )
    server.add_insecure_port("[::]:50051")
    await server.start()
    print("Server started with health checks enabled")
    await server.wait_for_termination()


if __name__ == "__main__":
    asyncio.run(serve())
The health check service responds to probe requests:
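import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc

# Client-side check (the same RPC that a probe tool such as grpc_health_probe issues)
with grpc.insecure_channel("localhost:50051") as channel:
    health_stub = health_pb2_grpc.HealthStub(channel)
    try:
        response = health_stub.Check(
            health_pb2.HealthCheckRequest(service="ecommerce.orders.OrderService"),
            timeout=3.0,
        )
        # response.status is an enum integer; convert it to its name for display
        status_name = health_pb2.HealthCheckResponse.ServingStatus.Name(response.status)
        print(f"Service status: {status_name}")
        # Possible values from Check: UNKNOWN, SERVING, NOT_SERVING
        # (SERVICE_UNKNOWN is only sent by Watch)
    except grpc.RpcError as e:
        # A service name that was never registered yields StatusCode.NOT_FOUND
        print(f"Health check failed: {e.code()}: {e.details()}")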
Liveness vs. Readiness Probes
Kubernetes uses two types of health checks:
Liveness probe: "Is the service still running?" (e.g., can we connect?). If it fails repeatedly, Kubernetes kills and restarts the pod.
Readiness probe: "Is the service ready to accept traffic?" (e.g., are dependencies online?). If it fails, Kubernetes removes the pod from load balancer rotation.
Implement both by reporting different statuses. The empty service name ("") represents overall server health and suits liveness; the named service reflects dependency state and suits readiness:
import asyncio

import grpc
from grpc_health.v1 import health, health_pb2, health_pb2_grpc

import order_pb2_grpc  # Generated from your own .proto file


class OrderServicer(order_pb2_grpc.OrderServiceServicer):
    def __init__(self, health_servicer):
        self.health = health_servicer
        self.db_connected = False
        self.cache_connected = False

    async def check_dependencies(self):
        """Periodic task to check if critical dependencies are online."""
        while True:
            # Check database
            try:
                self.db_connected = await check_db_health()
            except Exception as e:
                print(f"Database health check failed: {e}")
                self.db_connected = False
            # Check cache
            try:
                self.cache_connected = await check_cache_health()
            except Exception as e:
                print(f"Cache health check failed: {e}")
                self.cache_connected = False
            # Update health service status
            if self.db_connected and self.cache_connected:
                # All dependencies online; service is ready
                self.health.set(
                    "ecommerce.orders.OrderService",
                    health_pb2.HealthCheckResponse.SERVING,
                )
            else:
                # Some dependencies offline; service can't serve requests
                self.health.set(
                    "ecommerce.orders.OrderService",
                    health_pb2.HealthCheckResponse.NOT_SERVING,
                )
            await asyncio.sleep(5.0)  # Check every 5 seconds

    async def CreateOrder(self, request, context):
        # Handler should also check dependencies
        if not self.db_connected:
            # In grpc.aio, abort() is a coroutine and must be awaited
            await context.abort(
                grpc.StatusCode.UNAVAILABLE,
                "Database is offline; service temporarily unavailable",
            )
        # ... process order
        pass


async def check_db_health():
    """Pseudo-code: verify database connection (db is your own client object)."""
    return await db.ping()


async def check_cache_health():
    """Pseudo-code: verify cache connection (cache is your own client object)."""
    return await cache.ping()


async def serve():
    server = grpc.aio.server()
    # Create health servicer; it registers the empty service name "" as SERVING,
    # which is what a probe without a service name (the liveness probe) checks
    health_servicer = health.HealthServicer()
    # Create application servicer with reference to health
    app_servicer = OrderServicer(health_servicer)
    # Register services
    order_pb2_grpc.add_OrderServiceServicer_to_server(app_servicer, server)
    health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)
    # Initial status: NOT_SERVING (until dependencies come online)
    health_servicer.set(
        "ecommerce.orders.OrderService",
        health_pb2.HealthCheckResponse.NOT_SERVING,
    )
    server.add_insecure_port("[::]:50051")
    await server.start()
    # Start periodic health checks; keep a reference so the task is not garbage-collected
    checker_task = asyncio.create_task(app_servicer.check_dependencies())
    print("Server started; health status initially NOT_SERVING until dependencies online")
    try:
        await server.wait_for_termination()
    finally:
        checker_task.cancel()
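The status decision inside check_dependencies can be pulled out into a pure function, which makes it testable without a running server. The following is a minimal sketch under stated assumptions: the helper name readiness_status is hypothetical, and it returns status names as plain strings instead of health_pb2 enum values so that the example has no gRPC dependency.

```python
from typing import Dict, List, Tuple


def readiness_status(dependencies: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Map dependency states to a status name plus the list of failing dependencies.

    Hypothetical helper: mirrors the if/else in check_dependencies, but works
    on a dict such as {"db": True, "cache": False} and returns strings.
    """
    failing = sorted(name for name, ok in dependencies.items() if not ok)
    status = "SERVING" if not failing else "NOT_SERVING"
    return status, failing


if __name__ == "__main__":
    print(readiness_status({"db": True, "cache": True}))
    print(readiness_status({"db": True, "cache": False}))
```

Returning the failing names alongside the status gives the periodic task something concrete to log when readiness flips.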
Kubernetes Integration
Define liveness and readiness probes in a Kubernetes Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: order-service:1.0
          ports:
            - containerPort: 50051
              name: grpc
          # Liveness probe: restart if unresponsive
          livenessProbe:
            exec:
              command: ["/bin/grpc_health_probe", "-addr=:50051"]
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          # Readiness probe: remove from load balancer if not ready
          readinessProbe:
            exec:
              command: [
                "/bin/grpc_health_probe",
                "-addr=:50051",
                "-service=ecommerce.orders.OrderService"
              ]
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 1
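The grpc_health_probe binary comes from the community grpc-ecosystem/grpc-health-probe project rather than from gRPC itself. Recent Kubernetes versions also offer a built-in grpc probe type that needs no extra binary.
# Install in your Docker image (requires a Go toolchain in the build stage).
# go install names the binary after the module (grpc-health-probe) and puts it in
# Go's bin directory, so copy or rename it to /bin/grpc_health_probe to match the
# probe commands above.
RUN go install github.com/grpc-ecosystem/grpc-health-probe@latest
Or call grpcurl from a shell script instead. grpcurl needs server reflection enabled on the service (or the health .proto supplied via -proto):
# Simple health check using grpcurl
grpcurl -plaintext -d '{"service": "ecommerce.orders.OrderService"}' localhost:50051 grpc.health.v1.Health/Check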
Custom Health Check Logic
For advanced scenarios, create a custom health check endpoint:
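import time

import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc


class CustomHealthServicer(health_pb2_grpc.HealthServicer):
    """
    Extend the standard health check with custom logic.
    Written for a synchronous grpc.server(); Watch blocks a worker thread.
    """

    def __init__(self, db=None, cache=None):
        self.db = db
        self.cache = cache

    def _current_status(self, service_name):
        """Return the serving status for a service, or None if it is not registered."""
        if service_name == "":
            # Empty name means overall server health (used by the liveness probe)
            return health_pb2.HealthCheckResponse.SERVING
        if service_name == "ecommerce.orders.OrderService":
            # Perform detailed checks
            if self._check_db() and self._check_cache():
                return health_pb2.HealthCheckResponse.SERVING
            return health_pb2.HealthCheckResponse.NOT_SERVING
        return None

    def Check(self, request, context):
        """
        Handle health check requests.
        Args:
            request: HealthCheckRequest (has a 'service' field)
        Returns:
            HealthCheckResponse with status enum
        """
        status = self._current_status(request.service)
        if status is None:
            # The health protocol answers NOT_FOUND for unregistered services
            context.abort(grpc.StatusCode.NOT_FOUND, "unknown service")
        return health_pb2.HealthCheckResponse(status=status)

    def Watch(self, request, context):
        """
        Stream health status changes (called by persistent watchers).
        """
        last_sent = None
        while context.is_active():
            status = self._current_status(request.service)
            if status is None:
                # Watch reports SERVICE_UNKNOWN instead of failing the stream
                status = health_pb2.HealthCheckResponse.SERVICE_UNKNOWN
            if status != last_sent:
                # Send a message only when the status has changed
                yield health_pb2.HealthCheckResponse(status=status)
                last_sent = status
            time.sleep(5.0)  # Poll interval; occupies one worker thread per watcher

    def _check_db(self):
        try:
            return bool(self.db.ping())
        except Exception:
            return False

    def _check_cache(self):
        try:
            return bool(self.cache.ping())
        except Exception:
            return False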
Graceful Shutdown with Health Checks
On shutdown, mark the service as NOT_SERVING before closing:
import asyncio
import signal

import grpc
from grpc_health.v1 import health, health_pb2, health_pb2_grpc


class GracefulShutdown:
    def __init__(self, server, health_servicer):
        self.server = server
        self.health = health_servicer
        self._task = None

    def handle_signal(self):
        """Called when receiving SIGTERM (Kubernetes sends this)."""
        if self._task is None:  # Ignore repeated signals
            self._task = asyncio.create_task(self._shutdown())

    async def _shutdown(self):
        print("Received SIGTERM; starting graceful shutdown")
        # Mark service as NOT_SERVING immediately
        self.health.set(
            "ecommerce.orders.OrderService",
            health_pb2.HealthCheckResponse.NOT_SERVING,
        )
        # Give Kubernetes 5 seconds to remove this pod from load balancer.
        # asyncio.sleep (unlike time.sleep) keeps the event loop free, so
        # in-flight RPCs and health probes are still answered meanwhile.
        await asyncio.sleep(5.0)
        # Close the server (in-flight requests get 10 seconds to complete)
        await self.server.stop(grace=10.0)


async def serve():
    server = grpc.aio.server()
    health_servicer = health.HealthServicer()
    # ... register services
    server.add_insecure_port("[::]:50051")
    await server.start()
    # Set up graceful shutdown. add_signal_handler (Unix only) runs the callback
    # inside the event loop, where creating tasks is safe.
    shutdown = GracefulShutdown(server, health_servicer)
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, shutdown.handle_signal)
    print("Server started; waiting for termination")
    await server.wait_for_termination()


asyncio.run(serve())
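The ordering of the shutdown steps (mark NOT_SERVING, wait for the drain, then stop) can be checked in isolation. The sketch below is an illustration only: drain_then_stop and demo are hypothetical names, and the callables passed in are stand-ins for health_servicer.set(...) and server.stop(grace=...), so no gRPC server is involved.

```python
import asyncio
from typing import Awaitable, Callable, List


async def drain_then_stop(
    mark_not_serving: Callable[[], None],
    stop_server: Callable[[float], Awaitable[None]],
    drain_seconds: float,
    grace_seconds: float,
) -> None:
    """Run the shutdown steps in order without blocking the event loop."""
    mark_not_serving()                  # 1. Readiness probes start failing
    await asyncio.sleep(drain_seconds)  # 2. Let the load balancer drain
    await stop_server(grace_seconds)    # 3. Stop; in-flight RPCs get the grace period


def demo() -> List[str]:
    """Exercise the sequence with stand-in callables and record the order."""
    events: List[str] = []

    async def fake_stop(grace: float) -> None:
        events.append(f"stop(grace={grace})")

    asyncio.run(
        drain_then_stop(
            mark_not_serving=lambda: events.append("NOT_SERVING"),
            stop_server=fake_stop,
            drain_seconds=0.01,  # Shortened for the demo; the text uses 5 seconds
            grace_seconds=10.0,
        )
    )
    return events


if __name__ == "__main__":
    print(demo())
```

Passing the steps in as callables keeps the sequence testable, and using asyncio.sleep for the drain means health probes can still be answered while the pod waits to be removed from rotation.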
Key Takeaways
- gRPC's standard Health service allows orchestrators to probe liveness (is it alive?) and readiness (can it serve?).
- Implement both probes: liveness to detect crashes, readiness to manage traffic during dependency outages.
- Use health_servicer.set(service_name, status) to update health status based on dependency checks.
- Integrate with Kubernetes via livenessProbe and readinessProbe; use grpc_health_probe for the health check command.
- Graceful shutdown: mark service NOT_SERVING, wait for load balancer to drain, then close connections.
Frequently Asked Questions
What health statuses are available?
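SERVING: Ready to handle requests.
NOT_SERVING: Online but not accepting traffic (dependencies offline, draining, etc.).
UNKNOWN: Status unclear; this is the zero value of the enum.
SERVICE_UNKNOWN: Sent only by Watch when the requested service name is not registered. Check returns a NOT_FOUND error in that case instead.
TRANSIENT_FAILURE and UNIMPLEMENTED are not health statuses. The former is a channel connectivity state; the latter is the gRPC status code a client receives when the server does not expose the Health service at all.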
Can I check the health of a specific service method?
The Health service checks entire services, not methods. If you need per-method health, implement custom checks in your handler:
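def CreateOrder(self, request, context):
    # self.service_ready is a flag your servicer maintains itself
    if not self.service_ready:
        # In a grpc.aio handler, use: await context.abort(...)
        context.abort(grpc.StatusCode.UNAVAILABLE, "Service not ready")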
How often should I probe health?
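Kubernetes defaults to periodSeconds: 10 for every probe type. The Deployment above keeps 10s for liveness and tightens readiness to 5s. Adjust periodSeconds together with failureThreshold based on your tolerance for downtime: shorter periods detect problems sooner but add probe load and make transient blips more likely to trigger action.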
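As a rough way to reason about these settings, the time between a service going bad and Kubernetes acting on it can be estimated from periodSeconds, failureThreshold, and timeoutSeconds. The sketch below is a simplified model rather than an exact description of kubelet scheduling: it assumes probes fire exactly every period and that a failing probe takes at most the timeout to fail. The function name detection_window is hypothetical.

```python
from typing import Tuple


def detection_window(
    period_s: float, failure_threshold: int, timeout_s: float
) -> Tuple[float, float]:
    """Estimate (best case, worst case) seconds from failure to action.

    Best case: the failure happens just before a probe and each probe fails
    instantly, so only the gaps between the consecutive failures count.
    Worst case: the failure happens just after a probe, and the final probe
    takes the full timeout to fail.
    """
    best = (failure_threshold - 1) * period_s
    worst = failure_threshold * period_s + timeout_s
    return best, worst


if __name__ == "__main__":
    # Values taken from the Deployment shown earlier
    print("liveness:", detection_window(10, 3, 5))   # (20, 35)
    print("readiness:", detection_window(5, 1, 3))   # (0, 8)
```

Under this model, the liveness settings take roughly 20 to 35 seconds to trigger a restart, while the readiness settings pull a pod out of rotation within about 8 seconds.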
Can I use HTTP/1.1 health checks with gRPC?
No. gRPC services only respond to gRPC probes. For HTTP health checks, run a separate HTTP server or use a sidecar that probes gRPC and exposes HTTP.
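The "separate HTTP server" option can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: make_health_http_server, probe_once, the /healthz path, and the is_ready callable are all hypothetical, and in a real bridge is_ready might issue a grpc.health.v1.Health/Check call and return True on SERVING.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable


def make_health_http_server(
    is_ready: Callable[[], bool], host: str = "127.0.0.1", port: int = 8080
) -> HTTPServer:
    """Build an HTTP server whose /healthz endpoint mirrors a readiness callable."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path != "/healthz":
                self.send_error(404)
                return
            ready = is_ready()
            body = b"SERVING\n" if ready else b"NOT_SERVING\n"
            # 200 when ready, 503 otherwise, so HTTP probes can act on the code
            self.send_response(200 if ready else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format: str, *args) -> None:
            pass  # Keep probe traffic out of the logs

    return HTTPServer((host, port), Handler)


def probe_once(ready: bool) -> int:
    """Start the server on a free port, issue one GET, and return the HTTP status."""
    server = make_health_http_server(lambda: ready, port=0)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=2)
        conn.request("GET", "/healthz")
        status = conn.getresponse().status
        conn.close()
        return status
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(probe_once(True), probe_once(False))
```

Inside a container the server would typically bind to 0.0.0.0 rather than 127.0.0.1 so that the probe can reach it from outside the process.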
What's the difference between Check and Watch?
Check is a unary RPC: one request, one response. Watch is a server-streaming RPC that yields status updates whenever health changes. Use Check for periodic probes; Watch for continuous monitoring.
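The "whenever health changes" behaviour of Watch can be modelled without gRPC: given a sequence of sampled statuses, only the first one and each subsequent change are emitted. The sketch below is a plain-Python illustration, not the library's implementation; the name status_changes is hypothetical and statuses are represented as strings.

```python
from typing import Iterable, Iterator, Optional


def status_changes(samples: Iterable[str]) -> Iterator[str]:
    """Yield a status only when it differs from the previously yielded one."""
    last: Optional[str] = None
    for status in samples:
        if status != last:
            yield status
            last = status


if __name__ == "__main__":
    polled = ["SERVING", "SERVING", "NOT_SERVING", "NOT_SERVING", "SERVING"]
    print(list(status_changes(polled)))
```

A Check-based prober would see all five samples, one per call, whereas a Watch subscriber would receive only the three transitions.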