Containerize Python ML Models with Docker

Docker containers bundle your model, Python runtime, dependencies, and code into a single immutable artifact that runs identically everywhere: your laptop, a VM, or Kubernetes. This eliminates "works on my machine" problems and enables reliable production deployment at scale.

A containerized ML service is portable (move between cloud providers without rewriting), reproducible (exact dependency versions locked), secure (minimal attack surface), and efficient (layers cached, quick startup). This article teaches you how to write a production-grade Dockerfile for ML inference, optimize image size, and test locally before deploying.

The Minimal Dockerfile for ML Inference

Here is a complete Dockerfile for a FastAPI inference service:

# Use an official Python runtime as a base image
FROM python:3.11-slim

# Set working directory in container
WORKDIR /app

# Copy requirements (pinned versions)
COPY requirements.txt .

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and application code
COPY model.joblib .
COPY main.py .

# Expose the port FastAPI runs on
EXPOSE 8000

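# Health check (standard library only; urlopen raises on connection errors and HTTP 4xx/5xx)
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=3)"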

# Run FastAPI via Uvicorn
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Build and run it:

# Build the image
docker build -t ml-inference:v1 .

# Run a container
docker run -p 8000:8000 ml-inference:v1

# Test
curl http://localhost:8000/docs
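
The HEALTHCHECK instruction above runs an inline Python probe. As a hedged alternative, the same check could live in a small script copied into the image. The sketch below assumes the service exposes GET /health on port 8000, as the Dockerfile implies; the file name healthcheck.py and the function name probe are illustrative and not part of the original project.

```python
"""healthcheck.py: report whether the service answers /health with a 2xx status."""
import urllib.error
import urllib.request

# Assumed endpoint; matches the port and path used in the Dockerfile above.
DEFAULT_URL = "http://localhost:8000/health"


def probe(url: str = DEFAULT_URL, timeout: float = 3.0) -> int:
    """Return 0 when the endpoint responds with a 2xx status, 1 otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if 200 <= response.status < 300 else 1
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and HTTP 4xx/5xx responses
        # (urlopen raises HTTPError, a URLError subclass, for those).
        return 1
```

If this file were copied into /app, the health check could call it with python -c "import sys, healthcheck; sys.exit(healthcheck.probe())", which keeps the probe free of third-party dependencies.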

Multi-Stage Builds: Reduce Image Size

Multi-stage builds compile or pre-process in a "builder" stage, then copy only the final artifacts to a "runtime" stage, dramatically reducing image size:

# Stage 1: Builder (compile dependencies, prepare data)
FROM python:3.11-slim AS builder

WORKDIR /build
COPY requirements.txt .

# Install dependencies to a local directory
RUN pip install --user --no-cache-dir -r requirements.txt

# Stage 2: Runtime (only artifacts needed to run)
FROM python:3.11-slim

WORKDIR /app

# Copy only the installed packages from builder
COPY --from=builder /root/.local /root/.local

# Copy model and code
COPY model.joblib .
COPY main.py .

# Add local pip directory to PATH
ENV PATH=/root/.local/bin:$PATH

EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=3)"

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

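The savings come from leaving build-only tooling (compilers, headers, intermediate files) behind in the builder stage. When dependencies must be compiled, this can reduce image size by 40–60% compared to a single-stage build; when every dependency installs from a prebuilt wheel, as in this example, the difference is much smaller.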

Production-Grade Dockerfile with Best Practices

Here is a hardened Dockerfile following security and reliability best practices:

# Use an official Python image (pinned version)
FROM python:3.11.8-slim-bookworm AS builder

WORKDIR /build
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

FROM python:3.11.8-slim-bookworm

# Create a non-root user with a home directory for the user-level packages
RUN groupadd -r mluser && useradd -r -m -g mluser mluser

WORKDIR /app

# Copy dependencies from builder, owned by the non-root user
COPY --from=builder --chown=mluser:mluser /root/.local /home/mluser/.local
COPY --chown=mluser:mluser model.joblib .
COPY --chown=mluser:mluser main.py .

ENV PATH=/home/mluser/.local/bin:$PATH \
    PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1

# Switch to non-root user
USER mluser

EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=3)" || exit 1

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]

Key hardening techniques:

  • Non-root user: Run as mluser instead of root to limit damage if the container is compromised.
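  • Pinned base image version: python:3.11.8-slim-bookworm instead of python:3.11 improves reproducibility. Tags can still be re-pushed, so pin by digest when the base must be fully immutable.
  • Slim variant: the slim-bookworm image is on the order of 150 MB versus roughly 1 GB for the full image, cutting deployment time and attack surface.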
  • Environment variables: PYTHONUNBUFFERED ensures logs stream immediately; PYTHONDONTWRITEBYTECODE prevents .pyc bloat.
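  • Health check: Docker and Docker Swarm use HEALTHCHECK to detect hung containers. Kubernetes ignores this instruction and relies on liveness and readiness probes defined in the pod spec, which can target the same /health endpoint.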
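  • Multi-worker Uvicorn: --workers 2 handles concurrent requests better than a single worker. Each worker is a separate process that loads its own copy of the model, so budget memory accordingly.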

requirements.txt: Pin Versions

Always pin dependency versions in requirements.txt for reproducibility:

fastapi==0.109.0
uvicorn[standard]==0.27.0
scikit-learn==1.4.2
numpy==1.24.3
joblib==1.3.2
pydantic==2.6.1

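Pin exact versions, not ranges like fastapi>=0.100, so that every build installs the same direct dependencies. Pinning only top-level packages still lets transitive dependencies drift; to lock those as well, generate the file with pip freeze or pip-compile (from pip-tools). Keep the scikit-learn pin in step with the version used to train model.joblib, because loading a pickled model under a different version is not guaranteed to work.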
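
One way to enforce this rule is a small check run in CI before the image is built. The sketch below is only an assumption about how such a check might look: the function name find_unpinned is hypothetical, and the regular expression covers simple name==version lines rather than the full requirements grammar (URLs, editable installs, and hashes are out of scope and are reported as unpinned).

```python
import re

# Accepts "name==version" or "name[extra]==version", optionally followed by an
# environment marker (";") or a comment ("#"). Wildcards such as 1.4.* are rejected.
PINNED = re.compile(
    r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,\s-]+\])?\s*==\s*[^\s;#*]+\s*(?:[;#].*)?$"
)


def find_unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that are not pinned to one exact version."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and pip options such as "-r other.txt".
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned
```

A CI step could read requirements.txt, call find_unpinned on its contents, and fail the build if the returned list is not empty.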

Optimizing for Production Deployment

Layer Caching

Docker caches layers; put code that changes frequently at the end:

FROM python:3.11-slim
WORKDIR /app

# Rarely changes → cached across builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Model changes occasionally
COPY model.joblib .

# Code changes frequently → last to avoid invalidating earlier layers
COPY main.py .

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0"]
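
To see why ordering matters, it can help to model the cache as a chain in which each layer's key depends on the layer before it. The following is a toy model of that idea under simplifying assumptions, not Docker's actual cache implementation; the function names and the placeholder content strings are invented for illustration.

```python
import hashlib


def layer_keys(steps: list[tuple[str, str]]) -> list[str]:
    """Toy cache keys: each key hashes the parent key, the instruction, and its input."""
    keys = []
    parent = ""
    for instruction, content in steps:
        parent = hashlib.sha256(f"{parent}|{instruction}|{content}".encode()).hexdigest()
        keys.append(parent)
    return keys


def reused_layers(old: list[tuple[str, str]], new: list[tuple[str, str]]) -> int:
    """Count the leading layers whose keys are unchanged between two builds."""
    count = 0
    for before, after in zip(layer_keys(old), layer_keys(new)):
        if before != after:
            break
        count += 1
    return count


# Ordered as in the Dockerfile above: a code change only rebuilds the last layer.
ordered_v1 = [
    ("COPY requirements.txt", "reqs-v1"),
    ("RUN pip install", ""),
    ("COPY model.joblib", "model-v1"),
    ("COPY main.py", "code-v1"),
]
ordered_v2 = ordered_v1[:3] + [("COPY main.py", "code-v2")]

# Copying everything first: the same code change invalidates every layer.
naive_v1 = [("COPY . .", "reqs-v1+model-v1+code-v1"), ("RUN pip install", "")]
naive_v2 = [("COPY . .", "reqs-v1+model-v1+code-v2"), ("RUN pip install", "")]

print(reused_layers(ordered_v1, ordered_v2))  # 3
print(reused_layers(naive_v1, naive_v2))      # 0
```

In the ordered layout the expensive pip install layer is reused after a code change, while in the naive layout it runs again on every build.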

Minimal Context

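The build context (the files sent to the Docker daemon) affects build speed, and anything in it can end up in the image. Use a .dockerignore file to exclude what the image does not need. Patterns are matched relative to the context root, so prefix a pattern with **/ to match at any depth:

.git
**/__pycache__
**/*.pyc
.pytest_cache
.env
venv
notebooks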
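
To get a rough sense of how much a .dockerignore file saves, the context size can be estimated before building. The sketch below approximates the matching rules: it handles root-anchored patterns and the simple **/name form, and it skips ! exception lines entirely. All function names are hypothetical, and Docker's own matcher remains the authority on what is actually sent.

```python
import fnmatch
import os


def read_patterns(text: str) -> list[str]:
    """Parse .dockerignore text, dropping blanks, comments, and '!' exceptions."""
    patterns = []
    for raw in text.splitlines():
        line = raw.strip().rstrip("/")
        if line and not line.startswith(("#", "!")):
            patterns.append(line)
    return patterns


def is_ignored(rel_path: str, patterns: list[str]) -> bool:
    """Approximate .dockerignore matching for a '/'-separated relative path."""
    parts = rel_path.split("/")
    for pattern in patterns:
        if pattern.startswith("**/"):
            # Any-depth form: match the remainder against every path component.
            tail = pattern[3:]
            if any(fnmatch.fnmatchcase(part, tail) for part in parts):
                return True
        else:
            # Root-anchored form: match component by component from the top, so
            # ignoring a directory also ignores everything beneath it.
            wanted = pattern.split("/")
            if len(wanted) <= len(parts) and all(
                fnmatch.fnmatchcase(part, want) for part, want in zip(parts, wanted)
            ):
                return True
    return False


def context_bytes(root: str, patterns: list[str]) -> tuple[int, int]:
    """Return (total_bytes, bytes_sent) for the files under root."""
    total = sent = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            size = os.path.getsize(full)
            total += size
            if not is_ignored(rel, patterns):
                sent += size
    return total, sent
```

Calling context_bytes(".", read_patterns(open(".dockerignore").read())) from the project root gives two byte counts whose difference is the approximate saving.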

Multi-Architecture Builds

To support both x86-64 and ARM64 (Apple Silicon, Raspberry Pi), use buildx:

# Enable buildx (requires Docker 19.03+)
docker buildx create --use

# Build for multiple platforms
docker buildx build --platform linux/amd64,linux/arm64 \
  -t myrepo/ml-inference:v1 --push .

Testing and Validating the Container

Before pushing to production:

# Build the image
docker build -t ml-inference:v1 .

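# Run in the background, with a name so later commands can refer to it
docker run -d --name ml-test -p 8000:8000 ml-inference:v1

# Give it a moment to start
sleep 2

# Test the health endpoint (-f makes curl exit non-zero on HTTP errors)
curl -f http://localhost:8000/health

# Test a prediction
curl -f -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'

# Check logs
docker logs ml-test

# Check resource usage (one snapshot instead of a live stream)
docker stats --no-stream ml-test

# Clean up
docker rm -f ml-test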
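
A fixed pause before the first request can be too short when the model is slow to load. One hedged alternative is to poll the health endpoint until it answers or a deadline passes. The sketch below uses only the standard library; the function name wait_until_healthy and its default timings are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(url: str, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll url until it returns a 2xx status or timeout seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval + 1.0) as response:
                if 200 <= response.status < 300:
                    return True
        except (urllib.error.URLError, OSError):
            # Not up yet (connection refused, reset, or an HTTP error); retry.
            pass
        time.sleep(interval)
    return False
```

A smoke-test script could call wait_until_healthy("http://localhost:8000/health") right after starting the container and only send the prediction request once it returns True.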

Comparison Table: Container Image Strategies

Strategy          | Image Size | Build Time | Runtime Speed | Best For
Single-stage slim | 400–600 MB | Fast       | Fast          | Small models, simple APIs
Multi-stage slim  | 250–350 MB | Moderate   | Fast          | Standard ML services
Alpine base       | 100–200 MB | Moderate   | Slower        | Edge/resource-constrained
Distroless        | 150–250 MB | Moderate   | Very fast     | Security-critical services
GPU base          | 2–5 GB     | Slow       | Very fast     | Deep learning inference

Key Takeaways

  • Use multi-stage builds to separate compilation from runtime; when dependencies need build tooling, this can reduce image size by 40–60%.
  • Pin base image versions and all dependencies for reproducibility.
  • Run as a non-root user to limit security blast radius.
  • Include a HEALTHCHECK instruction and a /health endpoint so that Docker can monitor container health; in Kubernetes, point liveness and readiness probes at the same endpoint.
  • Use .dockerignore and layer ordering to speed up builds.
  • Test locally before pushing to a registry; validate health checks and API endpoints.

Frequently Asked Questions

How big should my ML image be?

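Typical sizes are 300–600 MB for scikit-learn/FastAPI and 1–3 GB for PyTorch/TensorFlow. Images larger than 1 GB noticeably slow down pulls in production; reduce them with multi-stage builds and smaller bases such as Distroless. Alpine images are smaller still, but they use musl libc, so some ML packages may lack prebuilt wheels and have to be compiled during the build.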

Should I include the model in the image or mount it at runtime?

Include it in the image for immutability (a versioned image always has the same model). For very large models (> 1 GB), mount at runtime from cloud storage (S3, GCS) to keep images small.
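
Either choice can be supported by the same startup code if the model location is configurable. The sketch below is a hypothetical pattern, not something the Dockerfiles above define: MODEL_PATH and the expected-digest argument are illustrative names, and the default path assumes the model was copied to /app as in the earlier examples. Verifying a checksum is one way to keep the immutability guarantee when the file is mounted at runtime.

```python
import hashlib
import os
from typing import Mapping, Optional

# Assumed location of the model baked into the image (WORKDIR /app above).
DEFAULT_MODEL_PATH = "/app/model.joblib"


def resolve_model_path(env: Optional[Mapping[str, str]] = None) -> str:
    """Use MODEL_PATH (hypothetical variable) if set, else the baked-in model."""
    env = os.environ if env is None else env
    return env.get("MODEL_PATH", DEFAULT_MODEL_PATH)


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large models are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path: str, expected_sha256: Optional[str]) -> None:
    """Raise ValueError if an expected digest is given and does not match."""
    if expected_sha256 is None:
        return
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"model checksum mismatch for {path}: got {actual}")
```

In main.py, the service could call verify_model(resolve_model_path(), expected) before passing the path to joblib.load, so a wrong or corrupted mounted file fails fast at startup.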

How do I add GPU support?

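Use a GPU base image such as nvidia/cuda:12.0.0-runtime-ubuntu22.04 instead of python:3.11-slim, then install Python and a GPU build of your ML framework (PyTorch, TensorFlow). The host needs the NVIDIA driver and the NVIDIA Container Toolkit, and the container must be started with GPU access, for example with docker run --gpus all.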

Can I run multiple models in one container?

Yes, but it complicates orchestration. Typically, deploy one model per container for clarity and independent scaling. Use an API gateway to route requests to the right container.
