Load Balancing gRPC: Production Deployment

Load balancing distributes traffic across multiple service instances, eliminating single points of failure and enabling horizontal scaling. gRPC's persistent HTTP/2 connections pose a particular challenge: a connection-level (L4) load balancer picks a backend once, when the client opens its connection, and every later request is multiplexed over that same connection, so one client's entire load lands on one backend. Proper gRPC load balancing therefore requires client-side awareness, multiple connections, or a proxy that balances per request (L7). This guide covers client-side load balancing (built into gRPC), server-side proxies (Envoy, nginx), connection pooling, sticky sessions, and production patterns for canary and zero-downtime deployments.
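
To make the problem concrete, here is a small pure-Python model (not gRPC code) that compares balancing once per connection with balancing per request. The backend names and request counts are made up for illustration.

```python
from collections import Counter
from itertools import cycle

BACKENDS = ["backend-a", "backend-b", "backend-c"]  # hypothetical instance names


def connection_level(clients: int, rpcs_per_client: int) -> Counter:
    """The balancer picks a backend once per connection; every RPC follows it."""
    picker = cycle(BACKENDS)
    load = Counter({b: 0 for b in BACKENDS})
    for _ in range(clients):
        backend = next(picker)            # chosen when the connection opens
        load[backend] += rpcs_per_client  # all later RPCs reuse that connection
    return load


def request_level(clients: int, rpcs_per_client: int) -> Counter:
    """Each client rotates across all backends for every RPC."""
    load = Counter({b: 0 for b in BACKENDS})
    for _ in range(clients):
        picker = cycle(BACKENDS)
        for _ in range(rpcs_per_client):
            load[next(picker)] += 1
    return load


print("per connection:", dict(connection_level(clients=1, rpcs_per_client=300)))
print("per request:   ", dict(request_level(clients=1, rpcs_per_client=300)))
```

With a single busy client, the per-connection model sends every RPC to one backend, while the per-request model spreads them evenly.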

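Client-Side Load Balancing: Built In

gRPC clients can load-balance across multiple endpoints without a proxy. The client resolves the service name to multiple IP addresses (via DNS or service discovery) and distributes RPCs across them. The default policy, pick_first, uses only the first address that connects, so spreading load requires selecting round_robin explicitly, as the example does. In Kubernetes, DNS returns every pod IP only for a headless Service (clusterIP: None); a regular ClusterIP Service resolves to a single virtual IP.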

import grpc
import order_pb2
import order_pb2_grpc
from concurrent.futures import ThreadPoolExecutor

def load_test_client_side():
    """Demonstrate client-side load balancing."""

    # Connect to the service; DNS must resolve to multiple endpoints
    with grpc.insecure_channel(
        "dns:///order-service.default.svc.cluster.local:50051",
        options=[
            # Round-robin across all resolved endpoints (the default is pick_first)
            ("grpc.service_config", '{"loadBalancingPolicy":"round_robin"}'),
            # Send a keepalive ping every 30s to detect dead connections
            ("grpc.keepalive_time_ms", 30000),
        ],
    ) as channel:
        stub = order_pb2_grpc.OrderServiceStub(channel)

        def create_order(order_id):
            try:
                response = stub.CreateOrder(order_pb2.Order(
                    order_id=order_id,
                    customer_id="CUST-123"
                ))
                print(f"Order {order_id}: {response.status}")
            except grpc.RpcError as e:
                print(f"Order {order_id}: Error {e.code()}")

        # Send 100 RPCs from 10 worker threads sharing the one channel
        with ThreadPoolExecutor(max_workers=10) as executor:
            for i in range(100):
                executor.submit(create_order, f"ORD-{i:04d}")

# Run
load_test_client_side()
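
The service config in the example above is a hand-written JSON string using the older loadBalancingPolicy field. As a sketch, the same setting can be generated with the newer loadBalancingConfig field, which takes a list of single-key objects; the helper names below are hypothetical, and only the JSON shape comes from the gRPC service config format.

```python
import json


def build_service_config(policy: str = "round_robin") -> str:
    """Return a service config JSON string that selects one LB policy."""
    config = {"loadBalancingConfig": [{policy: {}}]}
    return json.dumps(config)


def channel_options(policy: str = "round_robin") -> list:
    """Channel options to pass to grpc.insecure_channel(..., options=...)."""
    return [
        ("grpc.service_config", build_service_config(policy)),
        ("grpc.keepalive_time_ms", 30000),
    ]


print(build_service_config())
```

Generating the string avoids quoting mistakes once the config grows to include per-method settings.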

Under the hood:

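  • With round_robin, the channel opens one connection (subchannel) to each resolved address.
  • Each RPC goes to the next connected endpoint in rotation.
  • If an instance fails, its connection leaves the rotation and new RPCs go to the remaining endpoints while gRPC reconnects in the background. RPCs already in flight on the failed connection fail (typically with UNAVAILABLE) unless a retry policy is configured.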
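
The rotation described above can be modeled in a few lines. This is a toy picker written for illustration, not gRPC's implementation; the class name and addresses are hypothetical.

```python
class RoundRobinPicker:
    """Toy model of round-robin picking over connected endpoints."""

    def __init__(self, addresses):
        self.addresses = list(addresses)
        self.ready = set(self.addresses)  # assume every connection starts READY
        self._next = 0

    def mark_down(self, address):
        """Take an endpoint out of rotation (its connection failed)."""
        self.ready.discard(address)

    def mark_up(self, address):
        """Return an endpoint to rotation (reconnect succeeded)."""
        self.ready.add(address)

    def pick(self):
        """Return the next READY endpoint, skipping any that are down."""
        for _ in range(len(self.addresses)):
            candidate = self.addresses[self._next]
            self._next = (self._next + 1) % len(self.addresses)
            if candidate in self.ready:
                return candidate
        raise RuntimeError("no READY endpoint; the RPC would fail")


picker = RoundRobinPicker(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print([picker.pick() for _ in range(4)])
picker.mark_down("10.0.0.2")
print([picker.pick() for _ in range(3)])
```

After the second endpoint is marked down, the picker keeps rotating over the two that remain.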

Advantages:

  • No proxy overhead; clients connect directly to backends.
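  • Built into gRPC itself: pick_first and round_robin are available in every language implementation, while other policies such as least_request depend on the language, the library version, or an xDS control plane.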

Disadvantages:

  • Every client must do its own balancing, which rules out clients that are not gRPC-aware.
  • Clients need a direct network path to every backend, so it does not work across boundaries such as firewalls or NAT.

Server-Side Load Balancing: Envoy Proxy

For complex deployments (cross-boundary, mixed protocols, advanced routing), use a load balancer proxy:

# Envoy proxy configuration for gRPC load balancing
static_resources:
  listeners:
  - name: grpc_listener
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 50051
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          route_config:
            name: grpc_routes
            virtual_hosts:
            - name: default
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: order_service
                  timeout: 30s

  clusters:
  - name: order_service
    connect_timeout: 1s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    http2_protocol_options: {}  # Speak HTTP/2 to the backends (required for gRPC)
    health_checks:
    - timeout: 1s
      interval: 10s
      unhealthy_threshold: 3  # Required field; value is illustrative
      healthy_threshold: 2    # Required field; value is illustrative
      grpc_health_check:
        service_name: "ecommerce.orders.OrderService"
    load_assignment:
      cluster_name: order_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: order-service-1
                port_value: 50051
        - endpoint:
            address:
              socket_address:
                address: order-service-2
                port_value: 50051
        - endpoint:
            address:
              socket_address:
                address: order-service-3
                port_value: 50051

Kubernetes Ingress with gRPC:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: order-service-ingress
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
spec:
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /ecommerce.orders
        pathType: Prefix
        backend:
          service:
            name: order-service
            port:
              number: 50051

Connection Pooling and Reuse

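A single gRPC channel already multiplexes many concurrent RPCs over one HTTP/2 connection per backend, so create channels once and reuse them; opening a channel per request pays the TCP and HTTP/2 setup cost every time. A pool of channels helps only when one connection becomes the bottleneck, for example when concurrent RPCs exceed the server's per-connection stream limit: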

import threading

import grpc
import order_pb2
import order_pb2_grpc

class OrderServiceClient:
    """
    Thread-safe gRPC client with connection pooling.
    """

    def __init__(self, target: str, pool_size: int = 10):
        self.target = target
        self.channels = []
        self.stubs = []
        self.lock = threading.Lock()
        self.current_idx = 0

        # Create a pool of channels for concurrent requests
        for _ in range(pool_size):
            channel = grpc.insecure_channel(
                target,
                options=[
                    # Give each channel its own connections; channels with
                    # identical options may otherwise share them
                    ("grpc.use_local_subchannel_pool", 1),
                    ("grpc.keepalive_time_ms", 30000),
                ]
            )
            self.channels.append(channel)
            self.stubs.append(order_pb2_grpc.OrderServiceStub(channel))

    def create_order(self, order: order_pb2.Order) -> order_pb2.OrderResponse:
        """Get next stub from pool and send request."""
        # Hold the lock only while choosing a stub, not during the RPC
        with self.lock:
            stub = self.stubs[self.current_idx]
            self.current_idx = (self.current_idx + 1) % len(self.stubs)

        return stub.CreateOrder(order)

    def close(self):
        """Close all channels."""
        for channel in self.channels:
            channel.close()

# Usage
client = OrderServiceClient("dns:///order-service.default:50051", pool_size=10)
try:
    response = client.create_order(order_pb2.Order(...))
finally:
    client.close()

Sticky Sessions (Session Affinity)

Some workloads require routing related requests to the same backend (e.g., maintaining local cache). Use sticky sessions with a hash of client ID or request header:

# Envoy route: hash a request header to choose the backend
# (takes effect only when the cluster uses lb_policy: RING_HASH or MAGLEV)
route:
  cluster: order_service
  hash_policy:
  - header:
      header_name: "x-client-id"  # Route based on this header

Client sends the header:

def create_order_with_affinity(client_id: str, order: order_pb2.Order):
    with grpc.insecure_channel("order-service.default:50051") as channel:
        stub = order_pb2_grpc.OrderServiceStub(channel)

        # Call with metadata (headers) to ensure affinity
        response = stub.CreateOrder(
            order,
            metadata=[("x-client-id", client_id)]
        )
        return response

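Envoy (or another proxy) hashes the x-client-id value and routes requests carrying the same value to the same backend, as long as the set of healthy backends does not change. In Envoy this requires a hash-based lb_policy (RING_HASH or MAGLEV) on the cluster; with ROUND_ROBIN the hash_policy is ignored.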
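
The idea behind hash-based affinity can be sketched with a small consistent-hash ring. This is an illustration of the technique only; it is not Envoy's algorithm or hash function, and the class name, backend names, and virtual-node count are assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring mapping a client ID to a backend."""

    def __init__(self, backends, vnodes=100):
        # Place several virtual nodes per backend on the ring to even out load
        self._ring = sorted(
            (self._hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.sha256(value.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def pick(self, client_id: str) -> str:
        """Return the backend owning the first ring position at or after the key."""
        idx = bisect.bisect(self._keys, self._hash(client_id)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["order-service-1", "order-service-2", "order-service-3"])
print(ring.pick("client-42"), ring.pick("client-42"))  # same backend both times
```

A ring is preferred over a plain `hash(key) % n` because removing one backend only remaps the keys that backend owned; every other client keeps its affinity.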

Traffic Splitting and Canary Deployments

Gradually roll out new versions by splitting traffic:

# Istio VirtualService: 90% to v1, 10% to v2
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
  - order-service
  http:
  - match:
    - uri:
        prefix: "/"
    route:
    - destination:
        host: order-service
        subset: v1
      weight: 90
    - destination:
        host: order-service
        subset: v2
      weight: 10
---
# DestinationRule: define subsets (versions)
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-service
spec:
  host: order-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 10000
      http:
        http1MaxPendingRequests: 2048
        maxRequestsPerConnection: 2
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2

As long as v2's error rate stays acceptable, shift weight from v1 to v2 in steps (v1/v2: 90/10 → 50/50 → 0/100). If errors rise, set v2's weight back to 0.
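
The rollout logic can be sketched as a promotion gate plus a weighted split. The stage list, the 1% error threshold, and both function names are assumptions made for illustration; a real rollout would read error rates from monitoring and update the VirtualService weights.

```python
import random

STAGES = [10, 50, 100]  # hypothetical v2 weights (percent) for each stage


def next_v2_weight(current: int, v2_error_rate: float,
                   max_error_rate: float = 0.01) -> int:
    """Promote v2 to the next stage if it is healthy, otherwise roll back to 0."""
    if v2_error_rate > max_error_rate:
        return 0
    later = [weight for weight in STAGES if weight > current]
    return later[0] if later else current


def split_traffic(requests: int, v2_weight: int, seed: int = 0) -> dict:
    """Simulate weighted routing of `requests` RPCs between v1 and v2."""
    rng = random.Random(seed)
    counts = {"v1": 0, "v2": 0}
    for _ in range(requests):
        subset = rng.choices(["v1", "v2"],
                             weights=[100 - v2_weight, v2_weight])[0]
        counts[subset] += 1
    return counts


print(split_traffic(10_000, v2_weight=10))
print(next_v2_weight(10, v2_error_rate=0.002))
```

Because routing is weighted per request, the observed split only approximates the configured weights over a finite sample.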

Zero-Downtime Deployment

During deployments, mark instances as NOT_SERVING before draining:

import asyncio
import signal

import grpc
from grpc_health.v1 import health, health_pb2, health_pb2_grpc

SERVICE_NAME = "ecommerce.orders.OrderService"

class DeploymentSafeServer:
    def __init__(self):
        self.server = None
        self.health = None
        self._shutdown_task = None

    def handle_shutdown(self):
        """SIGTERM: start graceful shutdown once, without blocking the event loop."""
        if self._shutdown_task is None:
            self._shutdown_task = asyncio.create_task(self.shutdown())

    async def shutdown(self):
        print("Deployment shutdown signal received")

        # Step 1: Mark service as NOT_SERVING (health checks start failing)
        await self.health.set(
            SERVICE_NAME, health_pb2.HealthCheckResponse.NOT_SERVING
        )
        print("Marked as NOT_SERVING; waiting for load balancer to drain...")

        # Step 2: Wait for the load balancer to remove this pod
        await asyncio.sleep(10.0)
        print("Gracefully closing server (allowing 30s for in-flight requests)")

        # Step 3: Close server with grace period
        await self.server.stop(grace=30.0)

    async def serve(self):
        self.server = grpc.aio.server()
        self.health = health.aio.HealthServicer()
        health_pb2_grpc.add_HealthServicer_to_server(self.health, self.server)
        await self.health.set(SERVICE_NAME, health_pb2.HealthCheckResponse.SERVING)

        # ... register services

        # Run the handler inside the event loop rather than via signal.signal
        asyncio.get_running_loop().add_signal_handler(
            signal.SIGTERM, self.handle_shutdown
        )

        self.server.add_insecure_port("[::]:50051")
        await self.server.start()
        await self.server.wait_for_termination()

asyncio.run(DeploymentSafeServer().serve())

Kubernetes orchestrates this:

spec:
  template:
    spec:
      # Must cover preStop (5s) + drain wait (10s) + gRPC grace period (30s)
      terminationGracePeriodSeconds: 50
      containers:
      - name: order-service
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sleep", "5"]  # Delay SIGTERM while endpoint removal propagates
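
Kubernetes counts the preStop hook against terminationGracePeriodSeconds and sends SIGKILL when the period ends, so the grace period has to cover every shutdown phase. The following sketch checks that budget; the function name is hypothetical, and the sample numbers are the 5 s preStop sleep plus the 10 s drain wait and 30 s gRPC grace period used in this guide's server code.

```python
def shutdown_budget(pre_stop_s: float, drain_wait_s: float,
                    grpc_grace_s: float, termination_grace_s: float) -> dict:
    """Compare the time shutdown needs with what Kubernetes allows."""
    needed = pre_stop_s + drain_wait_s + grpc_grace_s
    return {
        "needed_s": needed,
        "slack_s": termination_grace_s - needed,
        "fits": needed <= termination_grace_s,
    }


for grace in (40, 50):
    print(grace, shutdown_budget(5, 10, 30, termination_grace_s=grace))
```

The phases add up to 45 s, so a 40 s grace period would cut off in-flight requests, while 50 s leaves a small margin.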

Key Takeaways

  • Client-side load balancing is built into gRPC: with the round_robin policy enabled, clients spread RPCs across endpoints discovered via DNS or service discovery. The default policy, pick_first, uses a single endpoint.
  • Server-side proxies (Envoy, nginx) are needed for cross-boundary traffic, advanced routing, and mixed protocols.
  • Connection pooling reuses HTTP/2 connections to avoid the overhead of creating new connections per request.
  • Sticky sessions route related requests to the same backend via headers or IP hashing.
  • Zero-downtime deployments require graceful shutdown: mark NOT_SERVING, wait for drain, close with grace period.

Frequently Asked Questions

Should I use client-side or server-side load balancing?

Client-side balancing is simpler and more efficient because there is no proxy hop. Use it when all clients are gRPC-aware and can reach every backend directly. For heterogeneous clients, legacy systems, or advanced routing, use server-side proxies.

What's the difference between round_robin and least_request?

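round_robin rotates through endpoints evenly, regardless of how busy each one is. least_request routes to the endpoint with the fewest outstanding requests, so it adapts when backend latencies differ. Prefer least_request for high-variance workloads; Envoy offers it as lb_policy: LEAST_REQUEST, while support in gRPC clients varies by language.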
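
The difference shows up clearly in a toy simulation. The model below is an assumption-heavy sketch: one request arrives per tick, each backend serves requests one at a time, the service times are invented, and least_request scans every backend (real implementations may sample instead).

```python
def simulate(policy: str, service_times: dict, requests: int):
    """Return (worst latency, requests per backend) for a toy workload."""
    names = list(service_times)
    completions = {name: [] for name in names}  # finish time of each request
    latencies = []
    for now in range(requests):  # one request arrives per tick
        if policy == "round_robin":
            chosen = names[now % len(names)]
        else:  # least_request: fewest requests still unfinished at `now`
            chosen = min(
                names,
                key=lambda n: sum(1 for t in completions[n] if t > now),
            )
        # A backend starts this request once its previous one has finished
        start = max([now] + completions[chosen][-1:])
        done = start + service_times[chosen]
        completions[chosen].append(done)
        latencies.append(done - now)
    return max(latencies), {n: len(completions[n]) for n in names}


times = {"a": 2, "b": 2, "c": 8}  # hypothetical: backend c is four times slower
print("round_robin  ", simulate("round_robin", times, 30))
print("least_request", simulate("least_request", times, 30))
```

Round robin keeps feeding the slow backend, so its queue and worst-case latency grow, whereas least_request steers new requests to whichever backend has nothing outstanding.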

How do I handle connection limits?

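Streams are cheap: a single HTTP/2 connection carries hundreds of concurrent streams, so the limit that matters is usually connections per backend, not streams. Server options can close idle connections and cap connection lifetime; capping lifetime also makes long-lived clients reconnect and re-resolve, which lets them pick up new backends:

server = grpc.aio.server(options=[
    ("grpc.max_connection_idle_ms", 300000),  # Close connections idle for 5 minutes
    ("grpc.max_connection_age_ms", 600000),   # Recycle connections after 10 minutes
])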
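
As a rough sizing aid, the number of connections a client needs follows from its peak concurrency and the per-connection stream limit. The helper below is a hypothetical back-of-the-envelope sketch; the default of 100 streams per connection is an assumed value, so check what your servers actually advertise.

```python
import math


def channels_needed(peak_concurrent_rpcs: int,
                    max_streams_per_connection: int = 100) -> int:
    """Smallest number of connections that fits the peak concurrency."""
    if max_streams_per_connection <= 0:
        raise ValueError("max_streams_per_connection must be positive")
    return max(1, math.ceil(peak_concurrent_rpcs / max_streams_per_connection))


print(channels_needed(250))
```

If the result is 1, a single shared channel is enough and a pool adds nothing.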

Can I load-balance based on request headers?

Yes, via a proxy (Envoy, nginx) or service mesh (Istio). gRPC's built-in client-side policies (pick_first, round_robin) do not route on headers, so do header-based routing at the proxy, for example with the hash policy shown under Sticky Sessions.

How do I detect backend failures?

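Health checks (configured in the proxy or service discovery) detect failures and take unhealthy backends out of rotation. Clients retry application-visible failures such as UNAVAILABLE only when a retryPolicy is configured in the service config; its exponential backoff with jitter avoids a thundering herd:

from grpc import aio
channel = aio.insecure_channel(
    "order-service.default:50051",
    # retryPolicy belongs inside a methodConfig entry; "name": [{}] matches all methods
    options=[("grpc.service_config", '{"methodConfig":[{"name":[{}],"retryPolicy":{...}}]}')]
)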
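
For reference, a complete retry policy might look like the following sketch. The field names come from the gRPC service config format; the specific values (4 attempts, 0.1 s initial backoff, 2 s cap) and the helper names are assumptions chosen for illustration.

```python
import json
import random


def retry_service_config(service: str) -> str:
    """Service config JSON enabling retries on UNAVAILABLE for one service."""
    return json.dumps({
        "methodConfig": [{
            "name": [{"service": service}],
            "retryPolicy": {
                "maxAttempts": 4,  # the original call plus up to 3 retries
                "initialBackoff": "0.1s",
                "maxBackoff": "2s",
                "backoffMultiplier": 2,
                "retryableStatusCodes": ["UNAVAILABLE"],
            },
        }]
    })


def backoff_ceilings(initial_s: float, multiplier: float,
                     max_s: float, max_attempts: int) -> list:
    """Upper bound of the delay before each retry (one entry per retry)."""
    return [min(initial_s * multiplier ** n, max_s)
            for n in range(max_attempts - 1)]


def jittered_delay(ceiling_s: float, rng: random.Random) -> float:
    """Pick a random delay up to the ceiling so clients do not retry in lockstep."""
    return rng.uniform(0, ceiling_s)


print(retry_service_config("ecommerce.orders.OrderService"))
print(backoff_ceilings(0.1, 2, 2.0, max_attempts=4))
```

The returned string is what would be passed as the "grpc.service_config" channel option. Only retry RPCs that are safe to repeat; a non-idempotent method such as order creation needs deduplication on the server before retries are enabled.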

Further Reading