Production Deployment: Containerize and Scale Real-Time Apps
Deploying a WebSocket application to production differs significantly from HTTP services. Connections are stateful and long-lived, making traditional load balancing trickier: you can't just randomly route requests across servers. This article guides you through containerization, horizontal scaling with sticky sessions, Redis state sharing, health checks, and monitoring for a production-ready real-time application.
Containerization with Docker
Start with a Dockerfile for the chat server:
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY main.py .
COPY chat_manager.py .
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD python -c "import socket; socket.create_connection(('localhost', 8000), timeout=2)" || exit 1
# Run server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
The --workers 1 flag is critical: FastAPI/Uvicorn in production typically runs multiple worker processes, but WebSocket connections are per-worker. Running one worker per container and scaling horizontally (many containers) is the recommended pattern.
requirements.txt:
fastapi==0.104.0
uvicorn[standard]==0.24.0
websockets==12.0
redis[asyncio]==5.0.0
pydantic==2.5.0
python-jose[cryptography]==3.3.0
Build and run locally:
docker build -t chat-server:1.0 .
docker run -p 8000:8000 -e REDIS_URL=redis://localhost:6379 chat-server:1.0
Environment-Based Configuration
Externalize configuration for different environments (dev, staging, prod):
from pydantic_settings import BaseSettings
from typing import Optional
class Settings(BaseSettings):
redis_url: str = "redis://localhost:6379"
allowed_origins: str = "http://localhost:3000,https://example.com"
jwt_secret: str = "dev-secret-change-in-production"
log_level: str = "info"
workers: int = 1
class Config:
env_file = ".env"
env_file_encoding = "utf-8"
settings = Settings()
ALLOWED_ORIGINS = settings.allowed_origins.split(",")
# Use in FastAPI
app = FastAPI()
@app.on_event("startup")
async def startup():
logging.basicConfig(level=settings.log_level.upper())
await redis_manager.init()
Create .env for local development:
REDIS_URL=redis://localhost:6379
ALLOWED_ORIGINS=http://localhost:3000
JWT_SECRET=dev-secret-12345
LOG_LEVEL=debug
For production, set environment variables in the deployment (Kubernetes, Docker Compose, ECS, etc.) without committing .env.
Load Balancing and Sticky Sessions
A load balancer (AWS ALB, Nginx, HAProxy) distributes incoming connections across server instances. For HTTP, it can route requests round-robin. For WebSocket, once a connection is established, all messages from that client must route to the same server (sticky sessions). Configure the load balancer for WebSocket affinity:
Nginx:
upstream websocket_backend {
# IP hash ensures same client always routes to same backend
ip_hash;
server server1:8000;
server server2:8000;
server server3:8000;
}
server {
listen 80;
server_name example.com;
location / {
proxy_pass http://websocket_backend;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Timeouts for long-lived connections
proxy_read_timeout 3600s;
proxy_send_timeout 3600s;
}
}
AWS ALB: Enable stickiness in the target group settings:
- Stickiness type: Application-controlled (uses cookies if available).
- Duration: 1 day.
- Alternatively, use source IP-based stickiness if cookies aren't feasible.
Kubernetes Ingress: Use a session affinity annotation:
apiVersion: v1
kind: Service
metadata:
name: chat-service
spec:
type: LoadBalancer
sessionAffinity: ClientIP
sessionAffinityConfig:
clientIPConfig:
timeoutSeconds: 10800
selector:
app: chat-server
ports:
- port: 80
targetPort: 8000
protocol: TCP
Kubernetes Deployment with Scaling
Deploy the chat server to Kubernetes with horizontal pod autoscaling:
apiVersion: apps/v1
kind: Deployment
metadata:
name: chat-server
spec:
replicas: 3
selector:
matchLabels:
app: chat-server
template:
metadata:
labels:
app: chat-server
spec:
containers:
- name: chat-server
image: chat-server:1.0
ports:
- containerPort: 8000
env:
- name: REDIS_URL
value: "redis://redis-service:6379"
- name: JWT_SECRET
valueFrom:
secretKeyRef:
name: app-secrets
key: jwt-secret
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 10
periodSeconds: 30
readinessProbe:
httpGet:
path: /ready
port: 8000
initialDelaySeconds: 5
periodSeconds: 10
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: chat-server-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: chat-server
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
Implement the health check endpoints:
@app.get("/health")
async def health():
"""Liveness probe: is the server running?"""
return {"status": "alive"}
@app.get("/ready")
async def ready():
"""Readiness probe: is the server ready to accept traffic?"""
try:
await redis_manager.redis_client.ping()
return {"status": "ready"}
except Exception:
raise HTTPException(status_code=503, detail="Redis unavailable")
Kubernetes runs the liveness probe every 30 seconds; if it fails 3 times, the pod is killed and restarted. The readiness probe runs every 10 seconds; if it fails, the pod is removed from the load balancer but not killed, allowing it to recover gracefully.
Monitoring and Alerting
Expose Prometheus metrics:
from prometheus_client import Counter, Gauge, Histogram, generate_latest, CollectorRegistry
from prometheus_client import CONTENT_TYPE_LATEST
registry = CollectorRegistry()
websocket_connections = Gauge("websocket_connections_active", "Active WebSocket connections", registry=registry)
websocket_messages = Counter("websocket_messages_total", "Total messages sent", registry=registry)
broadcast_latency = Histogram("websocket_broadcast_latency_seconds", "Broadcast latency", registry=registry)
@app.get("/metrics")
async def metrics():
return Response(generate_latest(registry), media_type=CONTENT_TYPE_LATEST)
# In your WebSocket handler
@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket, ...):
await manager.connect(...)
websocket_connections.inc()
try:
while True:
data = await websocket.receive_text()
websocket_messages.inc()
start = time.time()
await manager.broadcast(...)
broadcast_latency.observe(time.time() - start)
except Exception:
websocket_connections.dec()
Configure Prometheus to scrape the /metrics endpoint every 15 seconds, then set up alerts in Grafana:
- Alert if
websocket_connections_active > 50000(resource warning). - Alert if
websocket_broadcast_latency_seconds > 0.5(latency degradation). - Alert if
up{job="chat-server"} == 0for more than 2 minutes (pod crash).
Logging and Distributed Tracing
Use structured logging for easier searching:
import json
import logging
class JSONFormatter(logging.Formatter):
def format(self, record):
log_data = {
"timestamp": datetime.utcnow().isoformat(),
"level": record.levelname,
"message": record.getMessage(),
"logger": record.name
}
return json.dumps(log_data)
logger = logging.getLogger(__name__)
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket, ...):
client_ip = websocket.client.host
username = payload.get("sub")
logger.info(json.dumps({
"event": "websocket_connect",
"username": username,
"client_ip": client_ip
}))
try:
while True:
data = await websocket.receive_text()
logger.info(json.dumps({
"event": "websocket_message",
"username": username,
"message_length": len(data)
}))
except Exception:
logger.error(json.dumps({
"event": "websocket_disconnect",
"username": username,
"error": str(Exception)
}))
Ship logs to a centralized service (ELK, Datadog, CloudWatch):
# Docker: log to stdout (Kubernetes captures and ships logs)
# or use a log shipper sidecar container
Graceful Shutdown and Connection Draining
When a pod is terminated (deployment update, node drain), give existing connections time to finish:
import signal
app.shutdown = False
@app.on_event("shutdown")
async def shutdown_event():
"""Called when the server is shutting down."""
app.shutdown = True
# Wait for active connections to close gracefully
await asyncio.sleep(30)
logger.info("Server shutdown complete")
@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket, ...):
await manager.connect(...)
try:
while True:
if app.shutdown:
# Server is shutting down; close the connection gracefully
await websocket.close(code=1001, reason="Server shutting down")
return
data = await websocket.receive_text()
# ...
except Exception:
pass
Configure Kubernetes to allow a 30-second termination grace period:
spec:
terminationGracePeriodSeconds: 30
containers:
- name: chat-server
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 15"] # Wait for LB to remove pod
When a pod is terminated:
- Kubernetes sends SIGTERM.
- The pod's preStop hook sleeps 15 seconds (allowing load balancers to remove it).
- FastAPI's shutdown event fires; the app stops accepting new connections.
- Existing connections have up to 30 seconds to complete; after that, the pod is killed forcefully.
Database and State Persistence
Store chat history and user data in a database:
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
DATABASE_URL = "postgresql://user:password@localhost/chatdb"
engine = create_engine(DATABASE_URL)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket, ...):
db = SessionLocal()
try:
while True:
data = await websocket.receive_text()
# Save to database
message = Message(room_id=room, username=username, text=data)
db.add(message)
db.commit()
# Broadcast
await manager.broadcast(...)
except Exception:
db.close()
Use connection pooling to avoid exhausting database connections:
engine = create_engine(
DATABASE_URL,
pool_size=20,
max_overflow=0,
pool_pre_ping=True # Verify connections before use
)
Key Takeaways
- Containerize with Docker; run one worker per container, scaling horizontally with orchestrators.
- Use sticky sessions (IP hash, cookies) in load balancers to route clients to the same server.
- Implement health checks (liveness, readiness) for reliable Kubernetes deployments.
- Monitor WebSocket metrics (connections, latency, errors) with Prometheus and Grafana.
- Use structured JSON logging for searchability and centralized collection.
- Implement graceful shutdown with connection draining to minimize message loss during updates.
- Persist chat history to a database; use connection pooling.
Frequently Asked Questions
Can I deploy WebSocket apps to serverless (Lambda, Cloud Functions)?
Generally no. Serverless platforms are request-response oriented; they don't support long-lived persistent connections. If you need serverless real-time, use managed services like Firebase Realtime Database, AWS AppSync, or Pusher, which provide WebSocket-like APIs.
How do I handle database connection limits?
Set max_overflow=0 in SQLAlchemy to prevent connection starvation. Monitor pool usage: engine.pool.checkedout(). If you hit limits, scale the database or increase connection pool size. For very high concurrency, use a connection pooler like PgBouncer.
What's the maximum WebSocket connections per Kubernetes pod?
Practically, 5,000–20,000 depending on message frequency and CPU. Monitor CPU and memory; when CPU hits 70% (your HPA threshold), new pods are created. Each pod should comfortably handle its share, ensuring latency stays low.
How do I drain WebSocket connections without losing messages?
Implement a shutdown flag that stops accepting new connections but allows existing ones to close. A 30-second termination grace period gives clients time to gracefully close and resend buffered messages. Some messages may be lost if the client crashes, but this is expected; the application must tolerate temporary message loss.