
Health Checks and Service Readiness

Health checks allow orchestration platforms (Kubernetes, Docker Swarm) and load balancers to determine whether a service is alive (liveness) and ready to serve traffic (readiness). gRPC defines a standard health check service (grpc.health.v1.Health) that responds to probe requests. Without health checks, an orchestrator can only tell whether the process is running, so it may keep routing traffic to an instance that is hung or still waiting on its dependencies, and failures cascade. With them, you get graceful shutdowns, zero-downtime deployments, and automatic recovery. This guide covers implementing both liveness and readiness probes, custom health logic, and integration with Kubernetes.

Standard gRPC Health Check Service

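For Python, gRPC's standard health check service is distributed in the grpcio-health-checking package (grpc-health-probe is a separate command-line client, covered in the Kubernetes section). Install it: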

pip install grpcio-health-checking

This provides a standard Health service and Python implementation:

import asyncio

import grpc
from grpc_health.v1 import health, health_pb2, health_pb2_grpc

import order_pb2_grpc  # generated from your own .proto file


class OrderServicer(order_pb2_grpc.OrderServiceServicer):
    def CreateOrder(self, request, context):
        # ... normal RPC logic
        pass


async def serve():
    server = grpc.aio.server()

    # Add your service
    order_pb2_grpc.add_OrderServiceServicer_to_server(OrderServicer(), server)

    # Add health check service
    health_servicer = health.HealthServicer()
    health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)

    # Set initial status: serving
    health_servicer.set("ecommerce.orders.OrderService", health_pb2.HealthCheckResponse.SERVING)

    server.add_insecure_port("[::]:50051")
    await server.start()
    print("Server started with health checks enabled")
    await server.wait_for_termination()


if __name__ == "__main__":
    asyncio.run(serve())

The health check service responds to probe requests:

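# Client-side check (a Kubernetes probe tool makes the equivalent call)
import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc

with grpc.insecure_channel("localhost:50051") as channel:
    health_stub = health_pb2_grpc.HealthStub(channel)

    try:
        response = health_stub.Check(
            health_pb2.HealthCheckRequest(service="ecommerce.orders.OrderService"),
            timeout=3.0,
        )
        # response.status is an enum integer; convert it to its name for display
        status_name = health_pb2.HealthCheckResponse.ServingStatus.Name(response.status)
        print(f"Service status: {status_name}")
        # Prints UNKNOWN, SERVING, or NOT_SERVING
    except grpc.RpcError as e:
        # A service name that was never registered fails with StatusCode.NOT_FOUND
        print(f"Health check failed: {e.code()}: {e.details()}")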

Liveness vs. Readiness Probes

Kubernetes uses two types of health checks:

Liveness probe: "Is the service still running?" (e.g., can we connect?). If it fails repeatedly, Kubernetes kills and restarts the pod.

Readiness probe: "Is the service ready to accept traffic?" (e.g., are dependencies online?). If it fails, Kubernetes removes the pod from load balancer rotation.

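Implement both by reporting status under different service names. The empty service name ("") represents the server as a whole, and the Python HealthServicer reports it as SERVING by default, which suits a liveness probe. The named service ("ecommerce.orders.OrderService") tracks dependency state and backs the readiness probe: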

import asyncio

import grpc
from grpc_health.v1 import health, health_pb2, health_pb2_grpc

import order_pb2_grpc  # generated from your own .proto file


class OrderServicer(order_pb2_grpc.OrderServiceServicer):
    def __init__(self, health_servicer):
        self.health = health_servicer
        self.db_connected = False
        self.cache_connected = False

    async def check_dependencies(self):
        """Periodic task to check if critical dependencies are online."""
        while True:
            # Check database
            try:
                self.db_connected = await check_db_health()
            except Exception as e:
                print(f"Database health check failed: {e}")
                self.db_connected = False

            # Check cache
            try:
                self.cache_connected = await check_cache_health()
            except Exception as e:
                print(f"Cache health check failed: {e}")
                self.cache_connected = False

            # Update health service status
            if self.db_connected and self.cache_connected:
                # All dependencies online; service is ready
                self.health.set(
                    "ecommerce.orders.OrderService",
                    health_pb2.HealthCheckResponse.SERVING
                )
            else:
                # Some dependencies offline; service can't serve requests
                self.health.set(
                    "ecommerce.orders.OrderService",
                    health_pb2.HealthCheckResponse.NOT_SERVING
                )

            await asyncio.sleep(5.0)  # Check every 5 seconds

    def CreateOrder(self, request, context):
        # Handler should also check dependencies
        if not self.db_connected:
            context.abort(
                grpc.StatusCode.UNAVAILABLE,
                "Database is offline; service temporarily unavailable"
            )
        # ... process order
        pass


async def check_db_health():
    """Pseudo-code: verify database connection (db is your own client object)."""
    return await db.ping()


async def check_cache_health():
    """Pseudo-code: verify cache connection (cache is your own client object)."""
    return await cache.ping()


async def serve():
    server = grpc.aio.server()

    # Create health servicer
    health_servicer = health.HealthServicer()

    # Create application servicer with reference to health
    app_servicer = OrderServicer(health_servicer)

    # Initial status: NOT_SERVING (until dependencies come online)
    health_servicer.set("ecommerce.orders.OrderService", health_pb2.HealthCheckResponse.NOT_SERVING)

    # Start background task: periodic health checks.
    # Keep a reference so the task is not garbage-collected while running.
    health_task = asyncio.create_task(app_servicer.check_dependencies())

    # Register services
    order_pb2_grpc.add_OrderServiceServicer_to_server(app_servicer, server)
    health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)

    server.add_insecure_port("[::]:50051")
    await server.start()
    print("Server started; health status initially NOT_SERVING until dependencies online")
    try:
        await server.wait_for_termination()
    finally:
        health_task.cancel()
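The loop above checks dependencies one after another, so a hung database ping would delay the cache check and the status update. One possible refinement, sketched here with only the standard library, runs the checks concurrently and treats a timeout as a failure. The names run_check, aggregate_status, fake_db_ping, and fake_slow_cache_ping are hypothetical, and the string statuses stand in for the health_pb2 enum values.

```python
import asyncio


async def run_check(name, check, timeout=2.0):
    """Run one dependency check; exceptions and timeouts count as failures."""
    try:
        ok = await asyncio.wait_for(check(), timeout=timeout)
        return name, bool(ok)
    except Exception:
        return name, False


async def aggregate_status(checks, timeout=2.0):
    """Run all checks concurrently and return (status, per-dependency results)."""
    pairs = await asyncio.gather(
        *(run_check(name, check, timeout) for name, check in checks.items())
    )
    results = dict(pairs)
    status = "SERVING" if all(results.values()) else "NOT_SERVING"
    return status, results


# Hypothetical stand-ins for real database and cache pings.
async def fake_db_ping():
    return True


async def fake_slow_cache_ping():
    await asyncio.sleep(10)  # simulates a hung dependency
    return True


if __name__ == "__main__":
    status, results = asyncio.run(
        aggregate_status({"db": fake_db_ping, "cache": fake_slow_cache_ping}, timeout=0.1)
    )
    print(status, results)  # NOT_SERVING {'db': True, 'cache': False}
```

In a real servicer, the returned status would be mapped to health_pb2.HealthCheckResponse.SERVING or NOT_SERVING before calling health_servicer.set.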

Kubernetes Integration

Define liveness and readiness probes in a Kubernetes Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: order-service:1.0
          ports:
            - containerPort: 50051
              name: grpc

          # Liveness probe: restart if unresponsive
          livenessProbe:
            exec:
              command: ["/bin/grpc_health_probe", "-addr=:50051"]
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3

          # Readiness probe: remove from load balancer if not ready
          readinessProbe:
            exec:
              command: [
                "/bin/grpc_health_probe",
                "-addr=:50051",
                "-service=ecommerce.orders.OrderService"
              ]
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 1
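The probe settings above determine how long a failure can go unnoticed. As a rough estimate (an approximation, not an official Kubernetes formula), the worst case is about failureThreshold probe periods plus one probe timeout. The ProbeConfig class below is a hypothetical helper for that arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeConfig:
    """Hypothetical mirror of the probe fields used in the Deployment."""
    period_seconds: int
    timeout_seconds: int
    failure_threshold: int

    def approx_detection_seconds(self) -> int:
        """Rough upper bound on the time from a failure to Kubernetes acting on it.

        Assumes the failure starts right after a successful probe, so
        failure_threshold full periods elapse, and the last probe takes
        its full timeout before being counted as failed.
        """
        return self.failure_threshold * self.period_seconds + self.timeout_seconds


if __name__ == "__main__":
    liveness = ProbeConfig(period_seconds=10, timeout_seconds=5, failure_threshold=3)
    readiness = ProbeConfig(period_seconds=5, timeout_seconds=3, failure_threshold=1)
    print(liveness.approx_detection_seconds())   # 35
    print(readiness.approx_detection_seconds())  # 8
```

With these values, an unready pod leaves rotation within roughly 8 seconds, while a hung pod is restarted after roughly 35 seconds.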

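The grpc_health_probe binary comes from the grpc-ecosystem/grpc-health-probe project; it is not bundled with gRPC itself:

# Build in your Docker image (needs a Go toolchain in that build stage).
# go install names the binary grpc-health-probe and places it in Go's bin directory,
# so copy it to the path the probes use (/bin/grpc_health_probe above).
RUN go install github.com/grpc-ecosystem/grpc-health-probe@latest

Or call the Health service from a shell script with grpcurl:

# Simple health check using grpcurl.
# grpcurl needs the service schema: enable server reflection or pass the health proto via -proto.
# The RPC succeeds even when the status is NOT_SERVING, so the script must inspect the output.
grpcurl -plaintext -d '{"service": "ecommerce.orders.OrderService"}' localhost:50051 grpc.health.v1.Health/Check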

Custom Health Check Logic

For advanced scenarios, create a custom health check endpoint:

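import time

import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc


class CustomHealthServicer(health_pb2_grpc.HealthServicer):
    """
    Implement the standard Health service with custom logic.

    Written for a synchronous grpc.server(): Watch blocks one worker
    thread per watcher.
    """

    def __init__(self, db=None, cache=None):
        self.db = db
        self.cache = cache

    def _current_status(self, service_name):
        """Return the serving status for a known service, or None if unknown."""
        if service_name == "":
            # Empty name means the server as a whole (used by the liveness probe)
            return health_pb2.HealthCheckResponse.SERVING

        if service_name == "ecommerce.orders.OrderService":
            # Perform detailed checks
            if self._check_db() and self._check_cache():
                return health_pb2.HealthCheckResponse.SERVING
            return health_pb2.HealthCheckResponse.NOT_SERVING

        return None

    def Check(self, request, context):
        """
        Handle health check requests.

        Args:
            request: HealthCheckRequest (has a 'service' field)

        Returns:
            HealthCheckResponse with status enum
        """
        status = self._current_status(request.service)
        if status is None:
            # The health protocol reports unregistered services as NOT_FOUND
            context.abort(grpc.StatusCode.NOT_FOUND, f"Unknown service: {request.service!r}")
        return health_pb2.HealthCheckResponse(status=status)

    def Watch(self, request, context):
        """
        Stream health status changes (called by persistent watchers).
        """
        last_sent = None
        # Stop polling once the client disconnects or cancels
        while context.is_active():
            status = self._current_status(request.service)
            if status is None:
                # On Watch, unknown services are reported in-band
                status = health_pb2.HealthCheckResponse.SERVICE_UNKNOWN
            # Send the first status, then only changes
            if status != last_sent:
                yield health_pb2.HealthCheckResponse(status=status)
                last_sent = status
            time.sleep(5.0)

    def _check_db(self):
        try:
            return self.db.ping()
        except Exception:
            return False

    def _check_cache(self):
        try:
            return self.cache.ping()
        except Exception:
            return False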

Graceful Shutdown with Health Checks

On shutdown, mark the service as NOT_SERVING before closing:

import asyncio
import signal

import grpc
from grpc_health.v1 import health, health_pb2, health_pb2_grpc


class GracefulShutdown:
    def __init__(self, server, health_servicer):
        self.server = server
        self.health = health_servicer
        self._task = None

    def handle_signal(self):
        """Called on SIGTERM (Kubernetes sends this before killing the pod)."""
        # Runs inside the event loop, so it must not block; schedule the work instead
        if self._task is None:
            self._task = asyncio.get_running_loop().create_task(self._shutdown())

    async def _shutdown(self):
        print("Received SIGTERM; starting graceful shutdown")

        # Mark service as NOT_SERVING immediately
        self.health.set(
            "ecommerce.orders.OrderService",
            health_pb2.HealthCheckResponse.NOT_SERVING
        )

        # Give Kubernetes 5 seconds to remove this pod from load balancer
        await asyncio.sleep(5.0)

        # Close the server (in-flight requests get 10 seconds to complete)
        await self.server.stop(grace=10.0)


async def serve():
    server = grpc.aio.server()
    health_servicer = health.HealthServicer()

    # ... register services

    server.add_insecure_port("[::]:50051")
    await server.start()

    # Set up graceful shutdown (add_signal_handler is available on Unix only)
    shutdown = GracefulShutdown(server, health_servicer)
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, shutdown.handle_signal)

    print("Server started; waiting for termination")
    await server.wait_for_termination()


asyncio.run(serve())
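The ordering matters here: the status flips to NOT_SERVING first, the drain delay follows, and only then does the server stop. The sketch below replays that sequence against hypothetical recording stand-ins (RecordingHealth, RecordingServer) rather than real gRPC objects, and adds a small check that the drain delay plus the grace period fits inside the pod's terminationGracePeriodSeconds, which Kubernetes defaults to 30 seconds.

```python
import asyncio


class RecordingHealth:
    """Hypothetical stand-in for HealthServicer that records status changes."""

    def __init__(self, events):
        self.events = events

    def set(self, service, status):
        self.events.append(f"health:{status}")


class RecordingServer:
    """Hypothetical stand-in for grpc.aio.Server that records stop() calls."""

    def __init__(self, events):
        self.events = events

    async def stop(self, grace):
        self.events.append(f"stop:grace={grace}")


async def graceful_shutdown(server, health, service, drain_seconds, grace_seconds):
    """Flip readiness off, wait for traffic to drain, then stop the server."""
    health.set(service, "NOT_SERVING")
    await asyncio.sleep(drain_seconds)
    await server.stop(grace_seconds)


def fits_termination_budget(drain_seconds, grace_seconds, termination_grace_period=30.0):
    """True if shutdown finishes before Kubernetes would send SIGKILL."""
    return drain_seconds + grace_seconds < termination_grace_period


if __name__ == "__main__":
    events = []
    asyncio.run(
        graceful_shutdown(
            RecordingServer(events), RecordingHealth(events),
            "ecommerce.orders.OrderService", drain_seconds=0.01, grace_seconds=10.0,
        )
    )
    print(events)  # ['health:NOT_SERVING', 'stop:grace=10.0']
    print(fits_termination_budget(5.0, 10.0))  # True
```

If the budget check fails for your chosen delays, raising terminationGracePeriodSeconds in the pod spec is one way to avoid a forced kill mid-drain.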

Key Takeaways

  • gRPC's standard Health service allows orchestrators to probe liveness (is it alive?) and readiness (can it serve?).
  • Implement both probes: liveness to detect crashes, readiness to manage traffic during dependency outages.
  • Use health_servicer.set(service_name, status) to update health status based on dependency checks.
  • Integrate with Kubernetes via livenessProbe and readinessProbe; use grpc_health_probe for the health check command.
  • Graceful shutdown: mark service NOT_SERVING, wait for load balancer to drain, then close connections.

Frequently Asked Questions

What health statuses are available?

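  • UNKNOWN: Status has not been set or cannot be determined (the enum's zero value).
  • SERVING: Ready to handle requests.
  • NOT_SERVING: Online but not accepting traffic (dependencies offline, draining, etc.).
  • SERVICE_UNKNOWN: The requested service name is not registered; sent only on Watch streams.

TRANSIENT_FAILURE and UNIMPLEMENTED are not health statuses. TRANSIENT_FAILURE is a channel connectivity state, and UNIMPLEMENTED is the RPC status code a client receives when the server has not registered the Health service. A Check call for an unregistered service name fails with the NOT_FOUND status code.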
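As an illustration of how a probe might interpret a returned status, the sketch below mirrors the numeric values of the ServingStatus enum from the grpc.health.v1 proto in a local IntEnum and treats only SERVING as a pass. The probe_passes helper is hypothetical; real code would read the values from health_pb2.HealthCheckResponse instead of redefining them.

```python
from enum import IntEnum


class ServingStatus(IntEnum):
    """Local mirror of grpc.health.v1.HealthCheckResponse.ServingStatus."""
    UNKNOWN = 0
    SERVING = 1
    NOT_SERVING = 2
    SERVICE_UNKNOWN = 3  # used only by the Watch method


def probe_passes(status: int) -> bool:
    """Hypothetical helper: a probe should succeed only when the status is SERVING."""
    return ServingStatus(status) is ServingStatus.SERVING


if __name__ == "__main__":
    for value in range(4):
        print(ServingStatus(value).name, probe_passes(value))
```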

Can I check the health of a specific service method?

The Health service checks entire services, not methods. If you need per-method health, implement custom checks in your handler:

def CreateOrder(self, request, context):
    if not self.service_ready:
        context.abort(grpc.StatusCode.UNAVAILABLE, "Service not ready")

How often should I probe health?

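Kubernetes defaults periodSeconds to 10 seconds for every probe type. The Deployment in this guide probes liveness every 10s and readiness every 5s; adjust periodSeconds and failureThreshold based on how quickly failures must be detected and how much probe traffic you can tolerate.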

Can I use HTTP/1.1 health checks with gRPC?

No. gRPC services only respond to gRPC probes. For HTTP health checks, run a separate HTTP server or use a sidecar that probes gRPC and exposes HTTP.
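A separate HTTP endpoint of that kind can be quite small. The sketch below uses only the standard library and assumes a hypothetical is_healthy callable and a /healthz path; in a real bridge, is_healthy would call the gRPC Health stub's Check method and return True only for SERVING.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(is_healthy):
    """Build a handler class that answers GET /healthz from is_healthy()."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            healthy = is_healthy()
            body = b"ok" if healthy else b"unavailable"
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep probe traffic out of the logs

    return HealthHandler


def start_http_health_server(is_healthy, host="127.0.0.1", port=0):
    """Serve the bridge on a daemon thread; port=0 picks a free port."""
    server = HTTPServer((host, port), make_handler(is_healthy))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def http_status(port, path="/healthz", host="127.0.0.1"):
    """Issue one GET request and return the HTTP status code."""
    conn = http.client.HTTPConnection(host, port, timeout=2)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()
        return response.status
    finally:
        conn.close()


if __name__ == "__main__":
    state = {"ready": True}  # stand-in for the result of a gRPC Check call
    server = start_http_health_server(lambda: state["ready"])
    port = server.server_address[1]
    print(http_status(port))  # 200
    state["ready"] = False
    print(http_status(port))  # 503
    server.shutdown()
    server.server_close()
```

Inside a pod, the bridge would need to bind an address the prober can reach (for example 0.0.0.0) rather than the loopback address used here.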

What's the difference between Check and Watch?

Check is a unary RPC: one request, one response. Watch is a server-streaming RPC that yields status updates whenever health changes. Use Check for periodic probes; Watch for continuous monitoring.
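The difference can be sketched without gRPC at all: a Check-style caller sees every poll result, while a Watch-style stream sends the current status once and then only changes. The watch_statuses generator and the sample values below are hypothetical and use strings in place of the enum.

```python
def watch_statuses(poll, max_polls):
    """Yield the first polled status, then only statuses that differ from the last one sent."""
    last_sent = None
    for _ in range(max_polls):
        status = poll()
        if status != last_sent:
            yield status
            last_sent = status


if __name__ == "__main__":
    # Five consecutive poll results, as a periodic Check caller would see them.
    samples = iter(["SERVING", "SERVING", "NOT_SERVING", "NOT_SERVING", "SERVING"])
    updates = list(watch_statuses(lambda: next(samples), max_polls=5))
    print(updates)  # ['SERVING', 'NOT_SERVING', 'SERVING']
```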

Further Reading