Build FastAPI ML Endpoints for Python Models
Building a REST API that serves machine learning predictions is a core part of deployment. FastAPI is a widely used choice for this in 2026 because it combines high performance (ASGI and async I/O, typically served by Uvicorn), automatic API documentation (OpenAPI/Swagger), type-checked request validation (Pydantic), and minimal boilerplate. This article shows you how to build an inference endpoint from scratch, with request validation, error handling, and performance tuning.
FastAPI automatically validates incoming JSON against your Pydantic models, generates interactive docs at /docs, and handles async requests efficiently. A single server process can keep many I/O-bound requests in flight at once. CPU-bound model inference is different: it occupies a core while it runs, so it needs the extra care described in the sections on async endpoints and performance optimization.
Anatomy of an ML Inference Endpoint
An ML inference API has three jobs: (1) validate and parse the incoming request, (2) load and run the model, (3) format and return the prediction. Here's the minimal example:
from fastapi import FastAPI
import joblib
import numpy as np
from pydantic import BaseModel

app = FastAPI()

# Load model once at startup
model = joblib.load("model.joblib")


# Define request schema
class PredictionRequest(BaseModel):
    features: list[float]


# Define response schema
class PredictionResponse(BaseModel):
    prediction: int
    confidence: float


@app.post("/predict")
async def predict(request: PredictionRequest) -> PredictionResponse:
    # Convert request to NumPy array
    X = np.array(request.features).reshape(1, -1)
    # Run inference
    pred_class = model.predict(X)[0]
    pred_proba = model.predict_proba(X)[0]
    confidence = float(pred_proba.max())
    return PredictionResponse(
        prediction=int(pred_class),
        confidence=confidence
    )
Run it with uvicorn main:app --reload, then visit http://localhost:8000/docs to see interactive API docs. POST a JSON payload like {"features": [5.1, 3.5, 1.4, 0.2]} and get a prediction.
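Outside the browser, the same call can be made from any HTTP client. The sketch below uses only the standard library and assumes the server from the example above is listening on http://localhost:8000. The helper names `build_predict_request` and `call_predict` are illustrative and not part of FastAPI.

```python
import json
import urllib.request


def build_predict_request(features, base_url="http://localhost:8000"):
    """Build a POST request for the /predict endpoint (illustrative helper)."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_predict(features, base_url="http://localhost:8000", timeout=5.0):
    """Send the request and decode the JSON response.

    Assumes the server is running. urllib raises urllib.error.HTTPError
    for non-2xx responses such as a 422 validation failure.
    """
    request = build_predict_request(features, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `call_predict([5.1, 3.5, 1.4, 0.2])` should return a dict with `prediction` and `confidence` keys, matching `PredictionResponse`.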
Request and Response Validation with Pydantic
Pydantic automatically validates incoming JSON against your schema and rejects malformed requests with a 422 Unprocessable Entity response. This prevents crashes from unexpected input:
import numpy as np
# field_validator is the Pydantic v2 API; on Pydantic v1 use `validator` instead
from pydantic import BaseModel, Field, field_validator


class IrisFeatures(BaseModel):
    sepal_length: float = Field(..., gt=0, le=10, description="Sepal length in cm")
    sepal_width: float = Field(..., gt=0, le=10, description="Sepal width in cm")
    petal_length: float = Field(..., ge=0, le=7, description="Petal length in cm")
    petal_width: float = Field(..., ge=0, le=3, description="Petal width in cm")

    @field_validator("sepal_length", "sepal_width", "petal_length", "petal_width")
    @classmethod
    def features_not_nan(cls, v: float) -> float:
        # Defensive check: reject NaN explicitly
        if np.isnan(v):
            raise ValueError("Features cannot be NaN")
        return v


# Named differently from PredictionResponse so both schemas can live in one module
class IrisPredictionResponse(BaseModel):
    species_id: int
    species_name: str
    probabilities: dict[str, float]


SPECIES = {0: "setosa", 1: "versicolor", 2: "virginica"}


@app.post("/predict-iris")
async def predict_iris(features: IrisFeatures) -> IrisPredictionResponse:
    X = np.array([
        features.sepal_length,
        features.sepal_width,
        features.petal_length,
        features.petal_width
    ]).reshape(1, -1)
    # Convert the NumPy integer to a plain int before using it as a key or field
    pred_class = int(model.predict(X)[0])
    pred_proba = model.predict_proba(X)[0]
    return IrisPredictionResponse(
        species_id=pred_class,
        species_name=SPECIES[pred_class],
        probabilities={
            SPECIES[i]: float(pred_proba[i])
            for i in range(len(SPECIES))
        }
    )
Now the API is self-documenting and safe: callers see exactly what fields are required, what ranges are valid, and what they'll get back.
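The same checks run when you construct the model directly, which makes the schema easy to exercise without starting a server. The following is a minimal sketch that restates the field constraints from the Iris schema. `invalid_fields` is a hypothetical helper written for this illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class IrisFeatures(BaseModel):
    # Same range constraints as the Iris schema, without the custom validator
    sepal_length: float = Field(..., gt=0, le=10)
    sepal_width: float = Field(..., gt=0, le=10)
    petal_length: float = Field(..., ge=0, le=7)
    petal_width: float = Field(..., ge=0, le=3)


def invalid_fields(payload: dict) -> list[str]:
    """Return the names of fields that fail validation (empty if valid)."""
    try:
        IrisFeatures(**payload)
    except ValidationError as exc:
        # Each error carries a "loc" tuple naming the offending field
        return sorted(str(err["loc"][0]) for err in exc.errors())
    return []


good = {"sepal_length": 5.1, "sepal_width": 3.5,
        "petal_length": 1.4, "petal_width": 0.2}
bad = {"sepal_length": -1, "sepal_width": 3.5, "petal_length": 1.4}

print(invalid_fields(good))
print(invalid_fields(bad))
```

In the `bad` payload, `sepal_length` violates `gt=0` and `petal_width` is missing, so both names are reported. Inside FastAPI, the same failures would surface as a 422 response.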
Async Endpoints and Background Tasks
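Endpoints declared with async def run on the event loop, so they can await I/O (database lookups, calls to external services) without blocking other requests. Model inference is different: it is synchronous, CPU-bound work, and calling it directly inside an async def endpoint blocks the event loop until it finishes. Endpoints declared with plain def are instead run by FastAPI in a thread pool. The examples here show an inline prediction with timing and a background task that logs after the response is sent: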
import time

from fastapi import BackgroundTasks


# Time an inline prediction
@app.post("/predict-slow")
async def predict_slow(request: PredictionRequest):
    # model.predict is synchronous, so it blocks the event loop while it runs.
    # For heavy CPU-bound work, consider using ThreadPoolExecutor or ProcessPoolExecutor.
    start = time.perf_counter()
    X = np.array(request.features).reshape(1, -1)
    pred = model.predict(X)[0]
    elapsed = time.perf_counter() - start
    return {"prediction": int(pred), "latency_ms": elapsed * 1000}


async def log_prediction_to_db(prediction: int, features: list[float]):
    # async_db_write is a placeholder for your database client's async write call
    await async_db_write({"pred": prediction, "features": features})


# Log prediction to database (background task)
@app.post("/predict-with-logging")
async def predict_with_logging(
    request: PredictionRequest,
    background_tasks: BackgroundTasks
):
    X = np.array(request.features).reshape(1, -1)
    # Convert to a plain int so the value is JSON- and database-friendly
    pred = int(model.predict(X)[0])
    # The task runs after the response is sent, so logging adds no latency for the caller
    background_tasks.add_task(log_prediction_to_db, pred, request.features)
    return {"prediction": pred}
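To see why a blocking call inside a coroutine matters, the effect can be reproduced with the standard library alone. This is a sketch under stated assumptions: `fake_predict` stands in for a synchronous `model.predict` call and uses `time.sleep`, which releases the GIL. Pure-Python CPU work would not speed up in threads, although offloading it still keeps the event loop free to accept other requests.

```python
import asyncio
import time


def fake_predict(delay: float = 0.1) -> int:
    """Stand-in for a synchronous model.predict call (blocks its thread)."""
    time.sleep(delay)
    return 1


async def handle_blocking() -> int:
    # Calls the blocking function directly: the event loop stalls meanwhile
    return fake_predict()


async def handle_offloaded() -> int:
    # Runs the blocking function in a worker thread: the loop stays free
    return await asyncio.to_thread(fake_predict)


async def time_concurrent(handler, n: int = 4) -> float:
    """Run n handlers concurrently and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(handler() for _ in range(n)))
    return time.perf_counter() - start


blocking_s = asyncio.run(time_concurrent(handle_blocking))
offloaded_s = asyncio.run(time_concurrent(handle_offloaded))
print(f"blocking: {blocking_s:.2f}s, offloaded: {offloaded_s:.2f}s")
```

The four blocking handlers run one after another, so their total time is roughly four times the single-call delay, while the offloaded handlers overlap. `asyncio.to_thread` requires Python 3.9 or later.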
Error Handling and Health Checks
Production APIs need graceful error handling and health checks. Use HTTPException to return proper HTTP status codes:
from fastapi import HTTPException, status


@app.post("/predict-safe")
async def predict_safe(request: PredictionRequest) -> PredictionResponse:
    try:
        if len(request.features) != 4:
            raise HTTPException(
                status_code=status.HTTP_400_BAD_REQUEST,
                detail=f"Expected 4 features, got {len(request.features)}"
            )
        X = np.array(request.features).reshape(1, -1)
        # Check for NaN or infinity
        if not np.isfinite(X).all():
            raise HTTPException(
                status_code=status.HTTP_400_BAD_REQUEST,
                detail="Features contain NaN or infinity"
            )
        pred_class = model.predict(X)[0]
        pred_proba = model.predict_proba(X)[0]
        return PredictionResponse(
            prediction=int(pred_class),
            confidence=float(pred_proba.max())
        )
    except HTTPException:
        # Re-raise deliberate 4xx errors; otherwise the handler below turns them into 500s
        raise
    except Exception as e:
        # Log the error for debugging
        print(f"Prediction error: {e}")
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail="Model inference failed"
        )


# Health check endpoint (liveness)
@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": model is not None}


# Readiness check: verify the model can actually run an inference
@app.get("/ready")
async def readiness():
    try:
        test_input = np.array([[5.0, 3.0, 1.5, 0.3]])
        model.predict(test_input)
        return {"ready": True}
    except Exception as e:
        # Return 503 so probes that only inspect the status code see the failure
        raise HTTPException(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            detail=f"Model not ready: {e}"
        )
Performance Optimization: Model Pooling and Caching
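For compute-intensive models, it can help to warm the model up at startup (for example, by running one dummy inference so GPU memory is allocated before real traffic arrives) and to run inference in a worker pool so the event loop stays responsive. For lightweight models, simple in-memory caching of repeated requests is often sufficient: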
import asyncio
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# For CPU-bound inference, use a thread pool
executor = ThreadPoolExecutor(max_workers=4)


@app.post("/predict-threaded")
async def predict_threaded(request: PredictionRequest):
    loop = asyncio.get_running_loop()
    X = np.array(request.features).reshape(1, -1)
    # Run prediction in a thread to avoid blocking the event loop
    pred = await loop.run_in_executor(executor, model.predict, X)
    return {"prediction": int(pred[0])}


# Cache predictions for identical requests (demo only)
@lru_cache(maxsize=1000)
def cached_predict(features_tuple):
    X = np.array(features_tuple).reshape(1, -1)
    return int(model.predict(X)[0])


@app.post("/predict-cached")
async def predict_cached(request: PredictionRequest):
    # Convert list to tuple for hashability
    features_tuple = tuple(request.features)
    hits_before = cached_predict.cache_info().hits
    pred = cached_predict(features_tuple)
    # True only when this call was answered from the cache
    cached = cached_predict.cache_info().hits > hits_before
    return {"prediction": pred, "cached": cached}
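Floating-point inputs rarely repeat exactly, so an exact-match cache may see few hits. One option, if the model tolerates it, is to round features before building the cache key and to inspect `cache_info()` to check whether the cache earns its keep. The sketch below uses a stand-in `fake_predict` rather than a real model, and the rounding precision is an assumption you would tune.

```python
from functools import lru_cache

calls = {"model": 0}  # counts how often the stand-in model actually runs


def fake_predict(features: tuple[float, ...]) -> int:
    """Stand-in for model.predict: classifies by the sum of the features."""
    calls["model"] += 1
    return int(sum(features) > 10)


@lru_cache(maxsize=1000)
def cached_predict(key: tuple[float, ...]) -> int:
    return fake_predict(key)


def predict_rounded(features: list[float], ndigits: int = 2) -> int:
    # Round so that near-identical inputs map to the same cache key
    key = tuple(round(x, ndigits) for x in features)
    return cached_predict(key)


predict_rounded([5.1, 3.5, 1.4, 0.2])
predict_rounded([5.1001, 3.5, 1.4, 0.2])  # rounds to the same key: cache hit
info = cached_predict.cache_info()
print(f"hits={info.hits} misses={info.misses} model_calls={calls['model']}")
```

Rounding changes the input the model sees, so predictions near a decision boundary may differ from the uncached result. Whether that trade-off is acceptable depends on the model.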
Comparison Table: API Frameworks
| Framework | Performance | Ease | Async | Type Safety | Docs | Best For |
|---|---|---|---|---|---|---|
| FastAPI | Very Fast | Easy | Built-in | Pydantic | Auto OpenAPI | Production ML APIs |
| Flask | Slow | Very Easy | Manual | None | Manual | Simple prototypes |
| Django | Slow | Moderate | Manual | None | Manual | Large web apps |
| Starlette | Fast | Moderate | Built-in | None | Manual | Low-level async |
| aiohttp | Fast | Moderate | Built-in | None | Manual | Async HTTP clients and servers |
Key Takeaways
- Use Pydantic models to define request and response schemas; FastAPI validates automatically.
- Load the model once at startup, not on every request.
- Use /health and /ready endpoints for Kubernetes probes.
- Implement proper error handling with HTTPException and appropriate HTTP status codes.
- For CPU-bound inference, consider using a thread pool or ProcessPoolExecutor to avoid blocking.
- Cache predictions if identical requests are common; profile first to verify benefit.
Frequently Asked Questions
Should I use async def or def for the endpoint?
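Use async def when the endpoint awaits I/O, such as an async database client or an outbound HTTP call. If the endpoint only runs a synchronous, CPU-bound model call (most ML inference), a plain def is usually the safer choice: FastAPI runs def endpoints in a thread pool, whereas a blocking call inside async def stalls the event loop for every other request. If you keep async def, offload the model call to an executor.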
How do I load a large model to avoid startup latency?
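Load the model once when the process starts, either at module import time or in FastAPI's lifespan handler, so that no request pays the loading cost. For very large models (for example, GPU-based ones), use a separate model-serving service, or lazy-load on the first request and cache the result, accepting that the first request will be slow.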
Can I serve multiple models from one API?
Yes. Load multiple models in a dict keyed by model ID or version. Route requests to the appropriate model based on a query parameter or header.
What is the difference between health and readiness checks?
Health checks (liveness probes) verify the service is running. Readiness checks verify it can actually serve requests (model loaded, database connected). Kubernetes uses both to decide whether to restart or route traffic.
Further Reading
- FastAPI Official Documentation — comprehensive guide and examples.
- Pydantic Docs — request/response validation reference.
- Uvicorn: ASGI Web Server — FastAPI's default server; tuning guide.
- MLflow Model Serving — production-grade model serving framework.