Scale Python ML Inference with Kubernetes
Kubernetes (K8s) is the standard platform for deploying containerized ML services at scale. It automates container orchestration, handles failing pods, distributes traffic, scales based on load, and enables seamless rolling updates. This article teaches you how to deploy a containerized ML API to Kubernetes, configure autoscaling, and implement safe rollout strategies.
Kubernetes is complex, but the core pattern is simple: define your service as YAML, push the container image, and Kubernetes handles the rest (scheduling, networking, health checks, auto-repair). For production ML, Kubernetes provides the reliability and observability that notebooks and single VMs cannot.
Kubernetes Concepts for ML Deployment
Pod: The smallest deployable unit; wraps one or more containers (the main container plus optional sidecars). A pod runs your FastAPI inference service.
Deployment: Manages replica pods. You define "I want 3 copies of my inference service running." Kubernetes creates 3 pods and replaces any that crash.
Service: Exposes pods via DNS and load balancing. External traffic reaches your deployment via a stable IP/DNS name, which distributes requests to all pods.
ConfigMap: Stores configuration (model version, log level) as key-value data; pods read it at startup. Decouples config from container images.
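Secret: Stores credentials (API keys, database passwords) separately from images and ConfigMaps. Secret values are only base64-encoded by default, so enable encryption at rest and restrict access with RBAC.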
HorizontalPodAutoscaler (HPA): Watches CPU/memory/custom metrics and automatically scales replicas. If average CPU > 80%, spin up more pods.
Deploying an ML Service to Kubernetes
Step 1: Containerize Your Service
(From Article 6: write a Dockerfile and build an image)
docker build -t myrepo/ml-inference:v1.2.0 .
docker push myrepo/ml-inference:v1.2.0
Step 2: Create a Kubernetes Deployment
Write a YAML file (deployment.yaml):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
  labels:
    app: ml-inference
spec:
  replicas: 3  # Run 3 pods
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod during rollout
      maxUnavailable: 0  # Never drop below the desired replica count
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      containers:
        - name: inference
          image: myrepo/ml-inference:v1.2.0
          ports:
            - containerPort: 8000
          # Resource requests help the Kubernetes scheduler place the pod
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"  # 0.25 CPU cores
            limits:
              memory: "512Mi"
              cpu: "500m"
          # Health checks
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 2
          env:
            - name: MODEL_VERSION
              valueFrom:
                configMapKeyRef:
                  name: ml-config
                  key: model-version
            - name: LOG_LEVEL
              value: "INFO"
Key fields:
- replicas: 3: Start with 3 copies.
- strategy.rollingUpdate: During updates, create new pods before terminating old ones (zero downtime).
- livenessProbe: Kubernetes restarts containers that fail this check; used to detect hung processes.
- readinessProbe: Kubernetes removes pods from Service traffic while they fail this check; used during startup.
- resources.requests: The CPU/memory Kubernetes reserves for the container; the scheduler uses it to place pods on nodes.
- resources.limits: Hard caps that contain runaway processes. A container that exceeds its memory limit is killed (OOMKilled); one that exceeds its CPU limit is throttled rather than killed.
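The probes only work if the service actually answers on /health and /ready. The following is a minimal sketch of that split using only the Python standard library; in the FastAPI service the same logic would live in two route handlers. ProbeState, probe_response, and ProbeHandler are hypothetical names, and the response bodies are an assumption, since Kubernetes only looks at the status code.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler


class ProbeState:
    """Tracks whether the model has finished loading (hypothetical helper)."""

    def __init__(self) -> None:
        self._ready = threading.Event()

    def mark_ready(self) -> None:
        self._ready.set()

    @property
    def ready(self) -> bool:
        return self._ready.is_set()


def probe_response(path: str, state: ProbeState) -> tuple[int, dict]:
    """Return (HTTP status, JSON body) for a probe path."""
    if path == "/health":
        # Liveness: the process is up and able to answer requests.
        return 200, {"status": "alive"}
    if path == "/ready":
        # Readiness: accept traffic only once the model is loaded.
        if state.ready:
            return 200, {"status": "ready"}
        return 503, {"status": "loading"}
    return 404, {"error": "not found"}


STATE = ProbeState()


class ProbeHandler(BaseHTTPRequestHandler):
    """Serves the probe endpoints; any 2xx/3xx status counts as success."""

    def do_GET(self) -> None:
        status, body = probe_response(self.path, STATE)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it locally, call STATE.mark_ready() once the model has loaded and serve the handler with http.server.HTTPServer(("0.0.0.0", 8000), ProbeHandler).serve_forever(). Keeping liveness independent of model loading avoids restart loops when a large model takes longer to load than the liveness delay allows.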
Step 3: Create a Service
Write service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: ml-inference-service
spec:
  selector:
    app: ml-inference
  ports:
    - protocol: TCP
      port: 80          # External port
      targetPort: 8000  # Port inside container
  type: LoadBalancer    # Or ClusterIP for internal-only
External traffic to port 80 is routed to port 8000 in the pods; load balanced across replicas.
Step 4: ConfigMap for Model Versioning
configmap.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: ml-config
data:
  model-version: "1.2.0"
  prediction-timeout-ms: "5000"
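Pods read MODEL_VERSION from this ConfigMap when their containers start. To change the active model version, update the ConfigMap; Kubernetes does not restart running pods or refresh their environment variables, so only newly created pods pick up the new value. To apply it everywhere, restart the pods with kubectl rollout restart deployment/ml-inference.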
For automatic rollout, use a custom controller or a GitOps tool (ArgoCD, Flux) that updates the Deployment image when you push a new version.
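On the Python side, the service can read the injected values once at startup and fail fast if they are missing. This is a minimal sketch under stated assumptions: Settings and load_settings are hypothetical names, and PREDICTION_TIMEOUT_MS is an assumed variable name that the Deployment shown earlier does not yet inject (it would need its own configMapKeyRef entry).

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    model_version: str
    log_level: str
    prediction_timeout_ms: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables, failing fast on bad input."""
    model_version = env.get("MODEL_VERSION")
    if not model_version:
        raise RuntimeError("MODEL_VERSION is not set; check the ml-config ConfigMap")
    return Settings(
        model_version=model_version,
        log_level=env.get("LOG_LEVEL", "INFO"),
        # Assumed variable name; falls back to the ConfigMap's example value.
        prediction_timeout_ms=int(env.get("PREDICTION_TIMEOUT_MS", "5000")),
    )
```

Raising at startup makes a misconfigured pod crash immediately, so the failure shows up in kubectl describe pod instead of surfacing later as a failed prediction.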
Step 5: Deploy
kubectl apply -f configmap.yaml
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
# Verify
kubectl get pods -l app=ml-inference
kubectl get svc ml-inference-service
kubectl logs -l app=ml-inference -f # Stream logs
kubectl describe pod <pod-name> # Debug a pod
Autoscaling with HorizontalPodAutoscaler
Automatically scale replicas based on metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # Scale up if avg CPU > 70% of the requested CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80  # Percentage of the requested memory
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 50          # Scale down by at most 50% at a time
          periodSeconds: 60  # Required field; 60 s is an example value
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100         # Double the replicas if needed
          periodSeconds: 15  # Required field; 15 s is an example value
When traffic spikes, CPU rises; HPA creates more pods. When traffic drops, HPA removes pods (after stabilization period).
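The scaling decision follows the formula given in the Kubernetes documentation: desired = ceil(current replicas x current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default), and with the largest result winning when several metrics are configured. The sketch below reproduces that arithmetic in Python; the function names are hypothetical, and it ignores the behavior policies and stabilization windows that further limit how fast replicas change.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=3, max_replicas=20, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count unchanged.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured in the HPA spec.
    return max(min_replicas, min(max_replicas, desired))


def desired_for_metrics(current_replicas, metrics, **bounds):
    """With several metrics, the HPA uses the largest proposed replica count.

    metrics is a list of (current_utilization, target_utilization) pairs.
    """
    return max(
        desired_replicas(current_replicas, current, target, **bounds)
        for current, target in metrics
    )
```

For example, 3 pods averaging 140% of requested CPU against a 70% target gives ceil(3 x 2.0) = 6 pods, while 4 pods at 73% stay at 4 because the ratio is within the tolerance.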
Rolling Updates with Zero Downtime
Change the image in the Deployment to trigger a rolling update:
kubectl set image deployment/ml-inference inference=myrepo/ml-inference:v1.3.0
# Watch the rollout
kubectl rollout status deployment/ml-inference
kubectl rollout history deployment/ml-inference
# Rollback if needed
kubectl rollout undo deployment/ml-inference
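Kubernetes creates new pods with v1.3.0, waits for them to pass readiness checks, then terminates old v1.2.0 pods. Clients should not see downtime because the Service only routes to pods that report ready, provided the readiness probe reflects whether the model is actually loaded and the old pods finish in-flight requests before they exit.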
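To see how maxSurge and maxUnavailable shape a rollout, the sequence of pod counts can be simulated. This is a simplified model, not the Deployment controller's actual algorithm: it assumes every new pod becomes ready immediately and that the controller always uses its full budget. rolling_update_steps is a hypothetical name.

```python
def rolling_update_steps(replicas, max_surge=1, max_unavailable=0):
    """Return the (old_pods, new_pods) counts after each step of a rollout."""
    if max_surge == 0 and max_unavailable == 0:
        # Kubernetes rejects this combination too: no progress would be possible.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = replicas, 0
    steps = [(old, new)]
    while old > 0 or new < replicas:
        # Add new pods without exceeding replicas + max_surge in total.
        can_add = min(replicas - new, replicas + max_surge - (old + new))
        if can_add > 0:
            new += can_add
            steps.append((old, new))
        # Remove old pods without dropping below replicas - max_unavailable.
        can_remove = min(old, (old + new) - (replicas - max_unavailable))
        if can_remove > 0:
            old -= can_remove
            steps.append((old, new))
    return steps


for old_pods, new_pods in rolling_update_steps(3, max_surge=1, max_unavailable=0):
    print(f"old={old_pods} new={new_pods} total={old_pods + new_pods}")
```

With the settings from the Deployment above (3 replicas, maxSurge 1, maxUnavailable 0), the total never falls below 3 and never exceeds 4, which is why the rollout needs spare capacity for one extra pod.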
Monitoring in Kubernetes
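Kubernetes components expose metrics in the Prometheus text format. Prometheus typically scrapes the kubelet and kube-state-metrics endpoints; the configuration below additionally discovers and scrapes pods annotated with prometheus.io/scrape: "true":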
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    scrape_configs:
      - job_name: "kubernetes"
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: "true"
Grafana dashboards can then visualize pod CPU, memory, network, and custom metrics from your /metrics endpoint.
Cost Optimization
# Request small amounts; Kubernetes schedules pods densely
resources:
  requests:
    cpu: "100m"      # 0.1 cores; share cores with other pods
    memory: "128Mi"  # 128 MiB
# Scale down to 1 replica during off-hours
# Note: an HPA targeting this Deployment enforces its own minReplicas,
# so lower that value as well or the HPA will scale back up.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-evening
spec:
  schedule: "0 18 * * *"  # 6 PM every day
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler  # Needs RBAC permission to scale the Deployment
          containers:
            - name: scaler
              image: bitnami/kubectl:latest
              command: ["kubectl", "scale", "deployment/ml-inference", "--replicas=1"]
          restartPolicy: OnFailure
Comparison Table: Kubernetes Hosting Options
| Option | Managed | Setup | Cost | Support |
|---|---|---|---|---|
| Self-hosted K8s | No | Hard | Low | Community |
| Amazon EKS | Yes | Moderate | Per-node | AWS |
| Google GKE | Yes | Moderate | Per-node | Google |
| Azure AKS | Yes | Moderate | Per-node | Microsoft |
| Heroku | Yes | Very Easy | Per-dyno | Heroku |
| Modal, Replicate | Minimal | Very Easy | Per-inference | Specialized |
Key Takeaways
- Kubernetes automates container orchestration: scheduling, health checks, load balancing, and scaling.
- Define Deployments (replicas), Services (DNS/routing), and ConfigMaps (config) in YAML.
- Use liveness and readiness probes for health checks; Kubernetes repairs unhealthy pods.
- HPA automatically scales replicas based on CPU, memory, or custom metrics.
- Rolling updates create new pods, verify readiness, then retire old pods—zero downtime.
- Monitor Kubernetes metrics (pod CPU, memory) via Prometheus + Grafana.
Frequently Asked Questions
Do I need Kubernetes for small-scale ML services?
No. Start with a managed platform (Heroku, Railway, Render) for simplicity. Use Kubernetes when you need multi-region redundancy, fine-grained resource control, or open-source flexibility.
What is the minimum Kubernetes cluster size?
3 nodes for high availability; 1 node for development/testing. Each node should have >= 2 cores and 4 GB RAM to avoid resource contention.
How do I update a model without restarting the API?
Update the ConfigMap and implement a reload mechanism in your API (watch ConfigMap for changes and reload the model). Or use a sidecar that fetches the model from cloud storage on a schedule.
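One way to implement the reload is to mount the ConfigMap as a volume and poll the mounted file, since mounted ConfigMap files are refreshed after an update while environment variables are not. The sketch below assumes that setup; ModelReloader, the load_model callback, and any mount path such as /etc/ml-config/model-version are hypothetical.

```python
from pathlib import Path
from typing import Any, Callable, Optional


class ModelReloader:
    """Reloads the model when the version stored in a file changes."""

    def __init__(self, version_file: Path, load_model: Callable[[str], Any]) -> None:
        self._version_file = version_file
        self._load_model = load_model
        self.version: Optional[str] = None
        self.model: Any = None

    def check(self) -> bool:
        """Reload if the requested version changed; return True on reload."""
        try:
            wanted = self._version_file.read_text(encoding="utf-8").strip()
        except FileNotFoundError:
            return False
        if not wanted or wanted == self.version:
            return False
        # Load the new model fully before swapping, so requests keep using
        # the old model until the new one is ready.
        new_model = self._load_model(wanted)
        self.model, self.version = new_model, wanted
        return True
```

Calling check() from a background thread or timer every few seconds is usually enough, because the kubelet propagates ConfigMap changes to mounted files with some delay anyway.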
Can I run GPU inference on Kubernetes?
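Yes. Node pools with GPU support (NVIDIA GPUs) are available on major cloud providers. With the NVIDIA device plugin installed on the cluster, specify nvidia.com/gpu: 1 under the container's resource limits to schedule the pod on a GPU node; GPUs cannot be shared between containers or overcommitted this way.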
Further Reading
- Kubernetes Official Documentation — comprehensive K8s reference.
- KServe: Kubernetes Native Model Serving — abstraction for ML serving on K8s.
- Kubeflow — ML workflows on Kubernetes.
- Helm Charts for ML — reusable Kubernetes templates.