Deploy Machine Learning Models: Python Intro

Deploying a machine learning model means taking a trained Python model and putting it into a production system where users or other applications can make predictions on new data. This involves serializing the model to disk, wrapping it in an API, containerizing it for portability, and monitoring its performance continuously.

Most developers train models locally in Jupyter notebooks, but the real work begins when you move that model to production. Your model must handle diverse input formats, respond in milliseconds, scale to thousands of concurrent requests, survive crashes, and degrade gracefully when resources are tight. This article sets up the mental model and workflow you'll use throughout the series.

What Does Model Deployment Actually Mean?

Model deployment is not a single step but a pipeline with distinct phases: (1) serialization (save the trained model to a file), (2) API wrapping (expose the model via HTTP), (3) containerization (package code and dependencies), (4) versioning (track which model version is live), (5) monitoring (track prediction accuracy and latency), and (6) scaling (route requests across multiple replicas). Each phase has its own tools and best practices.

In production, you will rarely call a model from an ad hoc Python script or notebook; instead, you'll run it behind an API service (often built with FastAPI or Flask) inside a container (Docker) orchestrated by a scheduler (Kubernetes or a cloud platform). This separation is deliberate: it decouples model updates from application code, lets one inference server host models built with different frameworks and runtimes side by side, and provides clear observability boundaries.

The Complete Deployment Workflow

Here is the end-to-end workflow you will implement across this series:

  1. Train and export your model using scikit-learn, PyTorch, TensorFlow, or similar.
  2. Choose a serialization format (pickle, joblib, ONNX, SavedModel) based on your framework and cross-platform needs.
  3. Write a Python API using FastAPI to load the model and respond to prediction requests.
  4. Test locally with curl or the interactive docs at http://localhost:8000/docs.
  5. Write a Dockerfile that bundles the API, model, and dependencies into a container.
  6. Define a version for the model (e.g., v1.2.3) and tag the image accordingly.
  7. Deploy to production (Kubernetes, cloud function, or managed inference service).
  8. Set up monitoring to track latency, throughput, and prediction distribution.
  9. Implement gradual rollouts (canary deployments, A/B testing) to validate new versions.
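
Step 6 is easier to reason about when the version string is tied to the exact bytes of the serialized model. The following is a minimal sketch of that idea using only the standard library. The helper names (sha256_of_file, write_manifest, verify_manifest), the manifest fields, and the .manifest.json file name are assumptions made for illustration, not part of any tool named in this article.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 64 KiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_path_for(model_path: Path) -> Path:
    """Hypothetical convention: iris_model.pkl -> iris_model.manifest.json."""
    return model_path.parent / (model_path.stem + ".manifest.json")


def write_manifest(model_path: Path, version: str) -> dict:
    """Record the version and content hash next to the model file."""
    manifest = {
        "version": version,
        "artifact": model_path.name,
        "sha256": sha256_of_file(model_path),
    }
    manifest_path_for(model_path).write_text(json.dumps(manifest, indent=2))
    return manifest


def verify_manifest(model_path: Path) -> bool:
    """Check that the model file still matches the hash in its manifest."""
    manifest = json.loads(manifest_path_for(model_path).read_text())
    return manifest["sha256"] == sha256_of_file(model_path)
```

A service could call verify_manifest at startup and refuse to serve if the file on disk is not the one that was versioned. The same version string can then be reused as the container image tag.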

By 2026, this workflow is well established. Tools like KServe, BentoML, Hugging Face Spaces, and AWS SageMaker automate many of these steps, but understanding the foundation is essential for customization and troubleshooting.

Key Deployment Patterns

Model as a Service (MaaS)

A dedicated service exposes your model as a REST or gRPC endpoint. Clients send a request payload (typically JSON for REST), and the service returns predictions. This is the most common pattern for real-time inference and the one we build in this series.

Batch Prediction

For non-real-time workloads (nightly reporting, bulk classification), you submit large batches of data, the system processes them offline, and results are written to a database or data warehouse. This is more efficient than request-response for high-volume scenarios.
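
As a rough sketch of the batch pattern, the code below scores records in fixed-size chunks and writes the results to a table. It rests on several assumptions: threshold_model is a hypothetical stand-in for a real model's batch predict call, the predictions table and its columns are made up for the example, and an SQLite database stands in for the warehouse.

```python
import sqlite3
from itertools import islice
from typing import Callable, Iterable, Iterator, Sequence

Row = tuple[int, Sequence[float]]  # (record id, feature vector)


def chunked(rows: Iterable[Row], size: int) -> Iterator[list[Row]]:
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def run_batch_job(
    rows: Iterable[Row],
    predict_batch: Callable[[list[Sequence[float]]], list[int]],
    conn: sqlite3.Connection,
    chunk_size: int = 1000,
) -> int:
    """Score all rows chunk by chunk, store the labels, return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions "
        "(id INTEGER PRIMARY KEY, label INTEGER)"
    )
    written = 0
    for chunk in chunked(rows, chunk_size):
        ids = [row_id for row_id, _ in chunk]
        labels = predict_batch([features for _, features in chunk])
        conn.executemany(
            "INSERT OR REPLACE INTO predictions (id, label) VALUES (?, ?)",
            zip(ids, labels),
        )
        conn.commit()  # commit per chunk so a crash loses at most one chunk
        written += len(chunk)
    return written


def threshold_model(batch: list[Sequence[float]]) -> list[int]:
    """Hypothetical stand-in model: label 1 if the first feature exceeds 5.0."""
    return [int(features[0] > 5.0) for features in batch]
```

Calling the model once per chunk rather than once per record is where the efficiency gain over request-response comes from, since most frameworks predict on a whole array in one vectorized call.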

Embedded Models

For low-latency or offline scenarios (mobile apps, browser inference), you convert the model to a lightweight format (ONNX or TensorFlow Lite) or compile it to WebAssembly, and bundle it directly with the application. This eliminates network round-trips.

A Real Example: Iris Classifier

Let's ground this with a tiny working example. Suppose you trained a scikit-learn model to classify iris flowers:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pickle

# Train the model
iris = load_iris()
X, y = iris.data, iris.target
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Serialize it
with open("iris_model.pkl", "wb") as f:
    pickle.dump(model, f)

In production, you'd load that pickle file once when the FastAPI app starts and serve predictions from an endpoint:

from fastapi import FastAPI
import pickle
import numpy as np

app = FastAPI()

# Only unpickle files you trust: pickle can execute arbitrary code on load.
with open("iris_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.post("/predict")
def predict(features: list[float]):
    # A plain `def` lets FastAPI run this CPU-bound call in its thread pool
    # instead of blocking the event loop.
    X = np.array(features).reshape(1, -1)
    pred_class = model.predict(X)[0]
    pred_proba = model.predict_proba(X)[0].tolist()
    return {
        "prediction": int(pred_class),
        "probability": pred_proba,
    }

Save this as main.py, run it with uvicorn main:app --reload, and you have a local inference service at http://localhost:8000/predict that accepts a JSON array of four feature values. The next articles will expand on request validation, batching, ONNX export, Docker, and monitoring.
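
The endpoint above accepts a list of any length, so a malformed request only fails once it reaches scikit-learn. Until the series covers validation properly, a small check like the one below can reject bad input early. This is a sketch under assumptions: validate_features and N_FEATURES are hypothetical names, and the only facts relied on are that the iris dataset has four numeric features.

```python
import math
from typing import Sequence

N_FEATURES = 4  # iris: sepal length, sepal width, petal length, petal width


def validate_features(features: Sequence[float]) -> list[float]:
    """Return the features as floats, or raise ValueError with a clear message."""
    if len(features) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} features, got {len(features)}")
    cleaned = []
    for i, value in enumerate(features):
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"feature {i} is not a number: {value!r}")
        if not math.isfinite(value):
            raise ValueError(f"feature {i} is not finite: {value!r}")
        cleaned.append(float(value))
    return cleaned
```

Inside the FastAPI handler, one option is to call validate_features first and translate a ValueError into an HTTPException with status code 422, so clients get a readable error instead of a stack trace.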

Why Deployment Matters

Deployment transforms a model from an artifact into a product. A 95% accurate model that takes 30 seconds per prediction and crashes under load is worthless in production. A 90% accurate model that responds in 50 ms, handles 10,000 requests per second, and is retrained automatically every week delivers business value.

Production deployment forces you to think about:

  • Latency: Can you respond fast enough to meet SLAs?
  • Availability: Does your service recover from failures automatically?
  • Correctness: Are you serving the right model version to the right fraction of users?
  • Observability: Can you see what the model predicts and why it's drifting?
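
To make the latency question concrete, the sketch below times repeated calls to a prediction function and reports percentiles, which is how SLAs are usually phrased. The function name measure_latency_ms is an assumption, and the predict argument is a placeholder for whatever callable wraps your model. It measures in-process time only, not network or serialization overhead.

```python
import statistics
import time
from typing import Any, Callable


def measure_latency_ms(
    predict: Callable[[Any], Any], sample: Any, n_calls: int = 200
) -> dict[str, float]:
    """Call predict(sample) n_calls times and summarize latency in milliseconds."""
    if n_calls < 2:
        raise ValueError("n_calls must be at least 2 to compute percentiles")
    durations = []
    for _ in range(n_calls):
        start = time.perf_counter()
        predict(sample)
        durations.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = statistics.quantiles(durations, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(durations),
    }
```

Tail percentiles such as p95 and p99 matter more than the mean here, because a service with a fast average can still miss its SLA for a meaningful share of requests.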

The subsequent articles in this series address each of these concerns systematically.

Key Takeaways

  • Model deployment is a multi-phase pipeline: serialize → API → container → version → monitor → scale.
  • A production model lives behind an HTTP service inside a container, not in a notebook.
  • The three primary deployment patterns are request-response (MaaS), batch, and embedded.
  • Serialization format (pickle, joblib, ONNX, SavedModel) is your first decision and depends on your framework and cross-platform needs.
  • Production demands discipline around versioning, monitoring, and gradual rollouts; these are not optional add-ons.

Frequently Asked Questions

What is the difference between training and deployment?

Training optimizes model parameters on historical data offline; deployment runs the trained model online to make predictions on live data. Deployment focuses on speed, reliability, and scalability; training optimizes accuracy.

Do I need Kubernetes to deploy a model?

No. You can deploy to a single VM, a managed platform (Heroku, Railway, AWS Lambda), or a Kubernetes cluster. Kubernetes is useful when you need auto-scaling, rolling updates, and multi-zone redundancy; for small services, a VM or serverless function is sufficient.

Can I serve multiple models from one API?

Yes. You can load multiple model versions in memory or route different request types to different model endpoints. The pattern depends on your framework and latency budget.
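
One possible shape for this, sketched with plain Python rather than any particular serving framework, is a small in-memory registry keyed by version. The ModelRegistry class and its methods are hypothetical, and each model is represented as a callable that maps a feature list to a label.

```python
from typing import Callable, Optional, Sequence

Predictor = Callable[[Sequence[float]], int]


class ModelRegistry:
    """Hold several loaded models in memory and route requests by version."""

    def __init__(self) -> None:
        self._models: dict[str, Predictor] = {}
        self._default: Optional[str] = None

    def register(self, version: str, predictor: Predictor,
                 default: bool = False) -> None:
        """Add a model; the first one registered becomes the default."""
        self._models[version] = predictor
        if default or self._default is None:
            self._default = version

    def predict(self, features: Sequence[float],
                version: Optional[str] = None) -> dict:
        """Predict with the requested version, or the default if none is given."""
        chosen = version or self._default
        if chosen is None or chosen not in self._models:
            raise KeyError(f"unknown model version: {chosen!r}")
        return {
            "model_version": chosen,
            "prediction": self._models[chosen](features),
        }
```

Returning the version alongside the prediction makes it possible to tell, from logs alone, which model produced a given answer. The trade-off is memory: every registered model stays loaded for the lifetime of the process.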

How do I update a model without downtime?

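Use a blue-green deployment or a canary rollout. In a blue-green deployment, the old version (blue) and the new version (green) run side by side, and traffic is switched to green once it passes its checks; blue is kept briefly for rollback, then retired. In a canary rollout, a small share of traffic goes to the new version first, and that share is increased gradually while you watch the metrics. This is covered in Article 9 (Kubernetes) and Article 10 (A/B testing).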
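
The traffic-splitting part of a canary rollout can be illustrated in a few lines. In practice a load balancer or service mesh usually does this, so treat the following as a sketch of the idea only; bucket_for, choose_version, and the version labels "v1" and "v2" are assumptions made for the example.

```python
import hashlib


def bucket_for(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in the range [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def choose_version(user_id: str, canary_percent: int,
                   stable: str = "v1", canary: str = "v2") -> str:
    """Send roughly canary_percent percent of users to the canary version."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return canary if bucket_for(user_id) < canary_percent else stable
```

Hashing the user ID instead of picking at random keeps each user on the same version across requests. It also means that raising canary_percent only moves users from stable to canary, never back, so nobody flips between versions mid-rollout.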

Further Reading