Scale Python ML Inference with Kubernetes

Kubernetes (K8s) is the standard platform for deploying containerized ML services at scale. It automates container orchestration, handles failing pods, distributes traffic, scales based on load, and enables seamless rolling updates. This article teaches you how to deploy a containerized ML API to Kubernetes, configure autoscaling, and implement safe rollout strategies.

Kubernetes is complex, but the core pattern is simple: define your service as YAML, push the container image, and Kubernetes handles the rest (scheduling, networking, health checks, auto-repair). For production ML, Kubernetes provides the reliability and observability that notebooks and single VMs cannot.

Kubernetes Concepts for ML Deployment

Pod: The smallest deployable unit; wraps one or more containers (a main container plus optional sidecars). A pod runs your FastAPI inference service.

Deployment: Manages replica pods. You define "I want 3 copies of my inference service running." Kubernetes creates 3 pods and replaces any that crash.

Service: Exposes pods via DNS and load balancing. Traffic reaches your deployment through a stable IP/DNS name, and the Service distributes requests across all ready pods.

ConfigMap: Stores configuration (model version, log level) as key-value data; pods read it at startup. Decouples config from container images.

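Secret: Stores credentials (API keys, database passwords) separately from images and ConfigMaps. Secret values are only base64-encoded by default, so enable encryption at rest and restrict access with RBAC.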

HorizontalPodAutoscaler (HPA): Watches CPU/memory/custom metrics and automatically scales replicas. If average CPU > 80%, spin up more pods.

Deploying an ML Service to Kubernetes

Step 1: Containerize Your Service

(From Article 6: write a Dockerfile and build an image)

docker build -t myrepo/ml-inference:v1.2.0 .
docker push myrepo/ml-inference:v1.2.0

Step 2: Create a Kubernetes Deployment

Write a YAML file (deployment.yaml):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
  labels:
    app: ml-inference
spec:
  replicas: 3  # Run 3 pods
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod during rollout
      maxUnavailable: 0  # Never drop below the desired replica count
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      containers:
      - name: inference
        image: myrepo/ml-inference:v1.2.0
        ports:
        - containerPort: 8000
        # Resource requests help the Kubernetes scheduler place pods
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"  # 0.25 CPU cores
          limits:
            memory: "512Mi"
            cpu: "500m"
        # Health checks
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 10
          periodSeconds: 30
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 10
          failureThreshold: 2
        env:
        - name: MODEL_VERSION
          valueFrom:
            configMapKeyRef:
              name: ml-config
              key: model-version
        - name: LOG_LEVEL
          value: "INFO"

Key fields:

replicas: 3: Start with 3 copies.
strategy.rollingUpdate: During updates, create new pods before killing old ones (zero downtime).
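livenessProbe: Kubernetes restarts the container when this check fails repeatedly; used to detect hung processes.
readinessProbe: Kubernetes stops routing Service traffic to a pod while this check fails; used during startup (for example, while the model loads) and whenever the pod is temporarily unable to serve.
resources.requests: Kubernetes reserves this much CPU/memory for the pod; the scheduler uses it to pick a node.
resources.limits: Hard caps that contain runaway processes. A container that exceeds its memory limit is killed (OOMKilled); one that exceeds its CPU limit is throttled, not killed.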
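
The Deployment above probes /health and /ready, so the service has to answer both. The following is a minimal sketch of that split, written with the standard library only so it runs without dependencies; in the FastAPI service the same logic would live in two route handlers. The names probe_response, model_ready, ProbeHandler, and serve are hypothetical and not part of the original service.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set once the (hypothetical) model has finished loading.
model_ready = threading.Event()


def probe_response(path):
    """Map a probe path to an (HTTP status, JSON body) pair."""
    if path == "/health":
        # Liveness: the process is up and able to answer requests.
        return 200, {"status": "alive"}
    if path == "/ready":
        # Readiness: accept traffic only once the model is loaded.
        if model_ready.is_set():
            return 200, {"status": "ready"}
        return 503, {"status": "loading"}
    return 404, {"error": "not found"}


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = probe_response(self.path)
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # Keep probe traffic out of the logs.


def serve(port=8000):
    """Run the probe server; call model_ready.set() after loading the model."""
    ThreadingHTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Keeping /health independent of model state matters: if liveness also failed while the model loads, Kubernetes could restart a pod that is merely slow to start.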

Step 3: Create a Service

Write service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: ml-inference-service
spec:
  selector:
    app: ml-inference
  ports:
  - protocol: TCP
    port: 80        # External port
    targetPort: 8000  # Port inside container
  type: LoadBalancer  # Or ClusterIP for internal-only

External traffic to port 80 is routed to port 8000 in the pods and load-balanced across replicas.

Step 4: ConfigMap for Model Versioning

configmap.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: ml-config
data:
  model-version: "1.2.0"
  prediction-timeout-ms: "5000"

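Pods read MODEL_VERSION from this ConfigMap as an environment variable, which is set once when the container starts. To change the active model version, update the ConfigMap; Kubernetes does not restart pods automatically, so running pods keep the old value and only new pods pick up the new version. To apply the change everywhere, trigger a rolling restart with kubectl rollout restart deployment/ml-inference.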
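
On the Python side, the service can read these values once at startup. This is a minimal sketch, assuming the variable names from the Deployment above (MODEL_VERSION, LOG_LEVEL); the Settings class and load_settings function are hypothetical names used for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    model_version: str
    log_level: str


def load_settings(env=None):
    """Read configuration from environment variables, once, at startup."""
    env = os.environ if env is None else env
    version = env.get("MODEL_VERSION")
    if not version:
        # Fail fast instead of silently serving an unknown model version.
        raise RuntimeError("MODEL_VERSION is not set")
    return Settings(model_version=version, log_level=env.get("LOG_LEVEL", "INFO"))
```

Passing a plain dict as env makes the function easy to test without touching the real process environment.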

For automatic rollout, use a custom controller or a GitOps tool (ArgoCD, Flux) that updates the Deployment image when you push a new version.

Step 5: Deploy

kubectl apply -f configmap.yaml
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml

# Verify
kubectl get pods -l app=ml-inference
kubectl get svc ml-inference-service
kubectl logs -l app=ml-inference -f  # Stream logs
kubectl describe pod <pod-name>  # Debug a pod

Autoscaling with HorizontalPodAutoscaler

Automatically scale replicas based on metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Scale up if avg CPU > 70% of requested CPU
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 50  # Scale down by at most 50% at a time
        periodSeconds: 60  # Required: the window the policy applies to
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100  # Double the replicas if needed
        periodSeconds: 60

When traffic spikes, CPU rises; HPA creates more pods. When traffic drops, HPA removes pods (after stabilization period).
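
The scaling decision itself follows a simple proportional rule: desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies that rule with the min/max bounds from the HPA above and the default 10% tolerance; it is a simplification that ignores the behavior policies and stabilization windows, and the function name desired_replicas is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=3, max_replicas=20, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_utilization / target_utilization
    # Skip scaling when the metric is within tolerance of the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(3, 140, 70))   # 6: CPU at twice the target doubles the pods
print(desired_replicas(10, 20, 70))   # 3: low load shrinks toward minReplicas
print(desired_replicas(15, 140, 70))  # 20: capped at maxReplicas
```

When several metrics are configured, as with CPU and memory above, the HPA computes a value per metric and uses the largest.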

Rolling Updates with Zero Downtime

Change the image in the Deployment to trigger a rolling update:

kubectl set image deployment/ml-inference inference=myrepo/ml-inference:v1.3.0

# Watch the rollout
kubectl rollout status deployment/ml-inference
kubectl rollout history deployment/ml-inference

# Rollback if needed
kubectl rollout undo deployment/ml-inference

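Kubernetes creates new pods with v1.3.0, waits for them to pass readiness checks, then terminates old v1.2.0 pods. Clients should not see downtime because the Service routes only to pods that pass readiness checks, provided the service also finishes in-flight requests when it receives a termination signal.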
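
To see how maxSurge and maxUnavailable bound the pod count during a rollout, here is a small simulation. It is a simplified model, not the actual controller logic: it assumes every new pod becomes ready immediately, and the function name simulate_rollout is hypothetical.

```python
def simulate_rollout(replicas, max_surge, max_unavailable):
    """Return the sequence of (old_pods, new_pods) counts during a rolling update."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    max_total = replicas + max_surge            # ceiling on old + new pods
    min_available = replicas - max_unavailable  # floor on running pods
    old, new = replicas, 0
    states = [(old, new)]
    while old > 0 or new < replicas:
        # Surge: start as many new pods as the ceiling allows.
        created = min(replicas - new, max_total - (old + new))
        if created > 0:
            new += created
            states.append((old, new))
        # Retire old pods without dropping below the availability floor.
        removed = min(old, (old + new) - min_available)
        if removed > 0:
            old -= removed
            states.append((old, new))
    return states


for old, new in simulate_rollout(replicas=3, max_surge=1, max_unavailable=0):
    print(f"old={old} new={new} total={old + new}")
```

With the settings from the Deployment above (3 replicas, maxSurge 1, maxUnavailable 0), the total alternates between 3 and 4 pods and never drops below 3.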

Monitoring in Kubernetes

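Kubernetes components and add-ons such as the kubelet and kube-state-metrics expose metrics in Prometheus format. The scrape config below also discovers your own pods and keeps only those annotated with prometheus.io/scrape: "true":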

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    scrape_configs:
    - job_name: "kubernetes"
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"

Grafana dashboards can then visualize pod CPU, memory, network, and custom metrics from your /metrics endpoint.
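
For the custom metrics, the /metrics endpoint has to return the Prometheus text exposition format. The sketch below renders counters by hand to show what that format looks like; in practice a client library such as prometheus_client would normally generate it. The function name render_counters and the metric name predictions_total are hypothetical.

```python
def render_counters(counters):
    """Render a dict of counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


print(render_counters({"predictions_total": 42}), end="")
# # TYPE predictions_total counter
# predictions_total 42
```

For the scrape config above to pick the pod up, the pod template also needs the annotation prometheus.io/scrape: "true".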

Cost Optimization

# Request small amounts; Kubernetes schedules pods densely
resources:
  requests:
    cpu: "100m"      # 0.1 cores; share cores with other pods
    memory: "128Mi"   # 128 MiB

# Scale down to 1 replica during off-hours
# (the "scaler" ServiceAccount needs RBAC permission to scale Deployments;
# an HPA on the same Deployment will scale back up to its minReplicas)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-evening
spec:
  schedule: "0 18 * * *"  # 6 PM every day
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler
          containers:
          - name: scaler
            image: bitnami/kubectl:latest
            command: ["kubectl", "scale", "deployment/ml-inference", "--replicas=1"]
          restartPolicy: OnFailure
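
The effect of request size on packing density is simple arithmetic: a node fits as many pods as its scarcer resource allows. The sketch below is a rough estimate that ignores resources reserved for the system and the per-node pod limit; the function name pods_per_node and the 2-core, 4 GiB node size are assumptions for illustration.

```python
def pods_per_node(node_cpu_m, node_mem_mi, req_cpu_m, req_mem_mi):
    """Estimate how many pods fit on a node, given per-pod resource requests."""
    return min(node_cpu_m // req_cpu_m, node_mem_mi // req_mem_mi)


# A node with 2 cores (2000m) and 4 GiB (4096Mi):
print(pods_per_node(2000, 4096, 100, 128))  # 20 pods with the small requests
print(pods_per_node(2000, 4096, 250, 256))  # 8 pods with the Deployment's requests
```

Smaller requests pack more pods per node, but requests set below real usage lead to CPU contention and memory pressure, so measure actual consumption before lowering them.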

Comparison Table: Kubernetes Hosting Options

Option	Managed	Setup	Cost	Support
Self-hosted K8s	No	Hard	Low	Community
Amazon EKS	Yes	Moderate	Per-node	AWS
Google GKE	Yes	Moderate	Per-node	Google
Azure AKS	Yes	Moderate	Per-node	Microsoft
Heroku	Yes	Very Easy	Per-dyno	Heroku
Modal, Replicate	Minimal	Very Easy	Per-inference	Specialized

Key Takeaways

Kubernetes automates container orchestration: scheduling, health checks, load balancing, and scaling.
Define Deployments (replicas), Services (DNS/routing), and ConfigMaps (config) in YAML.
Use liveness and readiness probes for health checks; Kubernetes repairs unhealthy pods.
HPA automatically scales replicas based on CPU, memory, or custom metrics.
Rolling updates create new pods, verify readiness, then retire old pods, which avoids downtime.
Monitor Kubernetes metrics (pod CPU, memory) via Prometheus + Grafana.

Frequently Asked Questions

Do I need Kubernetes for small-scale ML services?

No. Start with a managed platform (Heroku, Railway, Render) for simplicity. Use Kubernetes when you need multi-region redundancy, fine-grained resource control, or open-source flexibility.

What is the minimum Kubernetes cluster size?

3 nodes for high availability; 1 node for development/testing. Each node should have >= 2 cores and 4 GB RAM to avoid resource contention.

How do I update a model without restarting the API?

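Update the ConfigMap and implement a reload mechanism in your API. Environment variables do not change in a running container, so mount the ConfigMap as a volume, watch the mounted file for changes, and reload the model when it changes. Or use a sidecar that fetches the model from cloud storage on a schedule.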
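
A minimal sketch of such a reload mechanism, assuming the ConfigMap is mounted as a volume so that the model-version key appears as a file (for example at a hypothetical path like /etc/ml-config/model-version). The class name ModelVersionWatcher and the load_model callback are also hypothetical.

```python
from pathlib import Path


class ModelVersionWatcher:
    """Reload the model when the mounted version file changes."""

    def __init__(self, path, load_model):
        self._path = Path(path)
        self._load_model = load_model  # callable: version string -> model object
        self.version = None
        self.model = None

    def poll(self):
        """Check the file once; return True if a new model was loaded."""
        latest = self._path.read_text().strip()
        if latest == self.version:
            return False
        # Load first, then swap, so a failed load keeps the old model in place.
        self.model = self._load_model(latest)
        self.version = latest
        return True
```

Call poll() periodically from a background thread or task. Mounted ConfigMap files are refreshed by the kubelet with some delay, and files mounted via subPath are not refreshed at all, so expect a lag between the ConfigMap update and the reload.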

Can I run GPU inference on Kubernetes?

Yes. Node pools with GPU support (NVIDIA GPUs) are available on major cloud providers. With the NVIDIA device plugin installed on the cluster, specify nvidia.com/gpu: 1 under resources.limits to schedule the pod on a GPU node.
