Version Control for ML Models: Managing Changes
In production, a model is not a static artifact: it evolves. You will train new versions on more or fresher data, fix bugs, and A/B test variants. Each version must be tracked, discoverable, and reproducible. A model versioning system tells you which version is live, lets you roll back quickly if a new version regresses, and lets you compare performance across versions.
This article covers semantic versioning for models, metadata tracking, model registries, and deployment strategies that enable safe, controlled model updates.
Why Model Versioning Matters
Without versioning, model deployment becomes chaotic:
- No rollback: A bad model goes live and you cannot quickly revert to the previous working version.
- No reproducibility: Six months later, you cannot recreate the model that performed best, because you did not track hyperparameters or training data.
- No audit trail: You cannot answer "which model served this prediction?" when investigating customer complaints.
- Accidental overwrites: Two engineers train models simultaneously and overwrite each other's work.
A versioning system solves all of these. It enables safe experimentation, fast rollbacks, and full auditability.
Semantic Versioning for Models
Adapt semantic versioning (MAJOR.MINOR.PATCH) to ML:
- MAJOR: Breaking change (incompatible input or output schema, such as added or removed input features). Increment when the model's expected input or output format changes, so that callers must change too.
- MINOR: Improvement (better accuracy, new capability). Backward-compatible; old calling code still works. Increment when you retrain with more data or improve the model without changing its input or output schema.
- PATCH: Bug fix or hygiene (fix data preprocessing, adjust hyperparameters, no accuracy change expected). Increment for small fixes that are not meant to change the model's interface or its measured performance.
Examples:
- model-v1.0.0: Initial production release.
- model-v1.1.0: Retrained on 3 months of new data; accuracy +2%.
- model-v1.1.1: Fixed a preprocessing bug; no accuracy change.
- model-v2.0.0: Changed input schema (added new features); incompatible with v1.
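The bump rules above can be sketched in a few lines of Python. This is a minimal illustration rather than a standard API: the `ModelVersion` class and its methods are hypothetical names, and the optional `model-v` prefix simply follows the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ModelVersion:
    """Hypothetical helper: a MAJOR.MINOR.PATCH model version."""
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "ModelVersion":
        # Accept "1.2.0" as well as the "model-v1.2.0" form used above.
        core = text.removeprefix("model-v")
        major, minor, patch = (int(part) for part in core.split("."))
        return cls(major, minor, patch)

    def bump(self, level: str) -> "ModelVersion":
        # Bumping a level resets every lower level to zero.
        if level == "major":
            return ModelVersion(self.major + 1, 0, 0)
        if level == "minor":
            return ModelVersion(self.major, self.minor + 1, 0)
        if level == "patch":
            return ModelVersion(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown level: {level!r}")

    def is_compatible_with(self, other: "ModelVersion") -> bool:
        # Under the scheme above, the same MAJOR means the same schema.
        return self.major == other.major

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


current = ModelVersion.parse("model-v1.1.1")
print(current.bump("minor"))  # 1.2.0
```

Comparing versions as integer tuples, rather than as strings, keeps 1.10.0 ordered after 1.2.0.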
Metadata Tracking: Model Cards
Store metadata alongside your model. A model card is a document that explains:
- Training data (size, distribution, collection date)
- Hyperparameters (learning rate, regularization, tree depth)
- Performance metrics (accuracy, precision, recall, latency)
- Known limitations (biases, failure modes)
- Update date and author
Implement this as a JSON file bundled with the model:
{
  "name": "iris-classifier",
  "version": "1.2.0",
  "description": "Iris flower classification (setosa, versicolor, virginica)",
  "created_at": "2026-06-02T15:30:00Z",
  "updated_at": "2026-06-02T15:30:00Z",
  "author": "Dr. Alex Turner",
  "framework": "scikit-learn",
  "algorithm": "RandomForestClassifier",
  "hyperparameters": {
    "n_estimators": 100,
    "max_depth": 10,
    "random_state": 42
  },
  "training_data": {
    "source": "UCI Machine Learning Repository",
    "samples": 150,
    "features": 4,
    "date_collected": "1936"
  },
  "performance": {
    "accuracy": 0.973,
    "precision": 0.975,
    "recall": 0.973,
    "f1_score": 0.973,
    "latency_ms": 2.4
  },
  "input_schema": {
    "type": "array",
    "items": {
      "type": "number",
      "description": "[sepal_length, sepal_width, petal_length, petal_width]"
    }
  },
  "output_schema": {
    "type": "object",
    "properties": {
      "prediction": {"type": "integer", "enum": [0, 1, 2]},
      "probabilities": {"type": "array", "items": {"type": "number"}}
    }
  },
  "limitations": [
    "Trained only on UCI Iris dataset; may not generalize to other iris species.",
    "Assumes numerical input; does not handle missing values."
  ]
}
Then in your Python code, load and check this metadata:
import json

import joblib

# Load metadata
with open("iris-v1.2.0.json") as f:
    metadata = json.load(f)

# Verify we are loading the expected version
print(f"Loading model: {metadata['name']} v{metadata['version']}")
print(f"Accuracy: {metadata['performance']['accuracy']:.3f}")

# Load the model
model = joblib.load(f"iris-v{metadata['version']}.joblib")
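Before trusting a metadata file, it can help to check that it is complete. The sketch below is one possible check, not part of any library: `validate_model_card` and `REQUIRED_FIELDS` are hypothetical names, the field names come from the JSON example above, and the choice of which fields are mandatory is an assumption.

```python
import re

# Top-level keys taken from the example card above; treating exactly
# these as mandatory is an assumption made for this sketch.
REQUIRED_FIELDS = ("name", "version", "hyperparameters", "training_data", "performance")
VERSION_PATTERN = re.compile(r"^\d+\.\d+\.\d+$")


def validate_model_card(card: dict) -> list[str]:
    """Return a list of problems; an empty list means the card looks usable."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in card]

    # The version must follow MAJOR.MINOR.PATCH.
    if "version" in card and not VERSION_PATTERN.match(str(card["version"])):
        problems.append(f"version is not MAJOR.MINOR.PATCH: {card['version']!r}")

    # Accuracy, when present, must be a fraction between 0 and 1.
    accuracy = card.get("performance", {}).get("accuracy")
    if accuracy is not None and not 0.0 <= accuracy <= 1.0:
        problems.append(f"accuracy out of range: {accuracy}")

    return problems
```

A deployment script could call this right after `json.load` and refuse to load the model if the returned list is not empty.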
Model Registry: Central Storage and Discovery
A model registry is a centralized system where all versions are stored, discoverable, and trackable. Popular options:
Option 1: MLflow Model Registry
MLflow is an open-source platform for tracking experiments and managing model lifecycles. It stores models, metrics, parameters, and metadata:
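import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split

# Example data so the snippet runs end to end
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"n_estimators": 100, "max_depth": 10, "random_state": 42}

# The context manager ends the run even if training raises
with mlflow.start_run() as run:
    # Log hyperparameters
    mlflow.log_params(params)

    # Train model with exactly the parameters that were logged
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Log metrics measured on held-out data
    preds = model.predict(X_test)
    mlflow.log_metrics({
        "accuracy": accuracy_score(y_test, preds),
        "precision": precision_score(y_test, preds, average="macro"),
    })

    # Log the model; registered_model_name also adds a version to the registry
    # (this needs a tracking backend that supports the Model Registry)
    mlflow.sklearn.log_model(model, "iris-model", registered_model_name="iris-classifier")

# Later, load from MLflow (in another process, substitute your own run ID)
logged_model = f"runs:/{run.info.run_id}/iris-model"
loaded_model = mlflow.pyfunc.load_model(logged_model)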
MLflow provides a web UI to browse all versions, compare metrics, and promote models.
Option 2: Hugging Face Model Hub
For transformer models, Hugging Face Hub is convenient:
from transformers import AutoModelForSequenceClassification
# Train a model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
# ... training ...
# Push to Hub
model.push_to_hub("my-username/my-model", private=True)
# Later, load from Hub
model = AutoModelForSequenceClassification.from_pretrained("my-username/my-model")
Option 3: Self-Hosted Registry
For models you want to keep internal, build a simple registry:
import json
import os
from datetime import datetime, timezone

import joblib


class ModelRegistry:
    def __init__(self, base_dir="/models"):
        self.base_dir = base_dir
        os.makedirs(base_dir, exist_ok=True)

    def save_model(self, name: str, version: str, model, metadata: dict):
        """Save model and metadata."""
        version_dir = os.path.join(self.base_dir, name, version)
        # Raises FileExistsError instead of overwriting an existing version
        os.makedirs(version_dir, exist_ok=False)
        # Save model
        model_path = os.path.join(version_dir, "model.joblib")
        joblib.dump(model, model_path)
        # Save metadata with a UTC timestamp
        metadata["saved_at"] = datetime.now(timezone.utc).isoformat()
        metadata_path = os.path.join(version_dir, "metadata.json")
        with open(metadata_path, "w") as f:
            json.dump(metadata, f, indent=2)
        print(f"Saved {name} v{version} to {version_dir}")

    def load_model(self, name: str, version: str):
        """Load model and metadata."""
        version_dir = os.path.join(self.base_dir, name, version)
        model_path = os.path.join(version_dir, "model.joblib")
        model = joblib.load(model_path)
        metadata_path = os.path.join(version_dir, "metadata.json")
        with open(metadata_path) as f:
            metadata = json.load(f)
        return model, metadata

    def list_versions(self, name: str):
        """List all versions of a model, oldest first."""
        model_dir = os.path.join(self.base_dir, name)
        if not os.path.exists(model_dir):
            return []
        # Sort numerically so that 1.10.0 comes after 1.2.0
        return sorted(
            os.listdir(model_dir),
            key=lambda v: tuple(int(part) for part in v.split(".")),
        )


# Usage (model and metadata come from the earlier examples)
registry = ModelRegistry()

# Save a new version
registry.save_model("iris", "1.2.0", model, metadata)

# List all versions
versions = registry.list_versions("iris")
print(f"Available versions: {versions}")

# Load a specific version
model, metadata = registry.load_model("iris", "1.2.0")
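A registry of this kind stores versions but does not record which one is live. One way to add that, sketched here under assumptions, is a small pointer file next to the models. `LivePointer`, its methods, and the JSON layout are all hypothetical; the sketch uses only the standard library.

```python
import json
import os
import tempfile


class LivePointer:
    """Hypothetical helper: records which version of a model is live."""

    def __init__(self, path: str):
        self.path = path

    def _read(self) -> dict:
        if not os.path.exists(self.path):
            return {"live": None, "previous": None}
        with open(self.path) as f:
            return json.load(f)

    def _write(self, state: dict) -> None:
        # Write to a temp file, then rename, so readers never see a partial file.
        directory = os.path.dirname(self.path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, self.path)

    def live(self):
        """Return the live version string, or None if nothing was promoted yet."""
        return self._read()["live"]

    def promote(self, version: str) -> None:
        """Make `version` live and remember the version it replaces."""
        state = self._read()
        self._write({"live": version, "previous": state["live"]})

    def rollback(self) -> str:
        """Swap back to the previously live version and return it."""
        state = self._read()
        if state["previous"] is None:
            raise RuntimeError("no previous version to roll back to")
        self._write({"live": state["previous"], "previous": state["live"]})
        return state["previous"]
```

Because only one previous version is remembered, this supports a single-step rollback; a longer history would need a list instead of one field.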
Deployment Strategies with Versioning
Strategy 1: Blue-Green Deployment
Run two versions simultaneously: blue (current) and green (new). Route all traffic to blue. When ready, flip traffic to green. If green fails, flip back to blue instantly.
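# In your API gateway or load balancer
# (app, registry, and PredictionRequest are defined elsewhere)
BLUE_VERSION = "1.1.0"   # Blue (current)
GREEN_VERSION = "1.2.0"  # Green (new)
ACTIVE_VERSION = BLUE_VERSION  # All traffic goes to this version

# Load both versions once at startup so the flip does not wait on disk
MODELS = {
    version: registry.load_model("iris", version)[0]
    for version in (BLUE_VERSION, GREEN_VERSION)
}

@app.post("/predict")
async def predict(request: PredictionRequest):
    model = MODELS[ACTIVE_VERSION]
    # ... run inference ...
To flip, point ACTIVE_VERSION at "1.2.0"; to roll back, point it at "1.1.0" again. In a running service, read this value from configuration rather than a hard-coded constant, so the switch does not require a redeploy.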
Strategy 2: Canary Deployment
Gradually shift traffic to the new version (5% → 25% → 50% → 100%) while monitoring metrics. If the error rate spikes, halt and roll back.
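import random
import time

from fastapi import BackgroundTasks

# app, registry, PredictionRequest, and log_prediction are defined elsewhere
ACTIVE_VERSION = "1.1.0"
CANARY_VERSION = "1.2.0"
CANARY_TRAFFIC_RATIO = 0.25  # 25% traffic to canary

# Load both versions once instead of on every request
MODELS = {
    version: registry.load_model("iris", version)[0]
    for version in (ACTIVE_VERSION, CANARY_VERSION)
}

@app.post("/predict")
async def predict(request: PredictionRequest, background_tasks: BackgroundTasks):
    # Decide which version to use
    if random.random() < CANARY_TRAFFIC_RATIO:
        version = CANARY_VERSION
    else:
        version = ACTIVE_VERSION
    model = MODELS[version]
    pred = model.predict(...)  # pass the request's features here
    # Log prediction for monitoring
    background_tasks.add_task(
        log_prediction,
        version=version,
        prediction=pred,
        timestamp=time.time()
    )
    # NumPy arrays are not JSON-serializable, so convert before returning
    return {"prediction": pred.tolist(), "model_version": version}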
Monitor error rate and latency for both versions. If the canary's error rate exceeds the active version's by more than an agreed threshold (for example, 5 percentage points), set CANARY_TRAFFIC_RATIO back to 0.
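The rollback rule can be written as a small function. This sketch reads the 5% threshold as an absolute gap of 5 percentage points between error rates, which is an assumption; `should_rollback` and the `min_requests` guard are hypothetical additions for illustration.

```python
def should_rollback(active_errors: int, active_total: int,
                    canary_errors: int, canary_total: int,
                    max_gap: float = 0.05, min_requests: int = 100) -> bool:
    """Return True when the canary's error rate exceeds the active version's
    by more than max_gap (an absolute difference in error rate)."""
    # Too little traffic gives a noisy estimate; wait for more data.
    if canary_total < min_requests or active_total == 0:
        return False
    active_rate = active_errors / active_total
    canary_rate = canary_errors / canary_total
    return canary_rate - active_rate > max_gap
```

The counts would come from whatever store `log_prediction` writes to; a monitoring job could call this periodically and set the canary ratio to 0 when it returns True.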
Comparison Table: Versioning Approaches
| Approach | Setup | Scalability | Monitoring | Rollback Speed | Cost |
|---|---|---|---|---|---|
| Manual file naming | Easy | Low | Manual | Slow | Free |
| MLflow | Moderate | Medium | Good | Medium | Free (self-hosted) |
| Hugging Face Hub | Easy | High | Good | Medium | Free tier + paid |
| Self-hosted registry | Complex | High | Custom | Fast | Infrastructure only |
| Cloud ML (SageMaker, Vertex) | Moderate | High | Excellent | Fast | Pay-per-use |
Key Takeaways
- Use semantic versioning (MAJOR.MINOR.PATCH) to signal compatibility and scope of changes.
- Store metadata (hyperparameters, metrics, training data) alongside the model for reproducibility.
- Use a model registry (MLflow, Hugging Face, or custom) to centralize version management and discovery.
- Implement blue-green or canary deployments to safely roll out new versions with instant rollback.
- Monitor error rate and latency during canary; halt if metrics diverge significantly.
Frequently Asked Questions
Should I version the model or the code (or both)?
Both. Version the model code (training script) separately from the model artifact. A model version 1.2.0 might be trained by code version 1.2.0, but they can diverge (same code, retrained → new model version).
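One way to keep the two version numbers linked is to store a small lineage record with each model artifact. The sketch below is an illustration under assumptions: `build_lineage` and `file_sha256` are hypothetical names, the code version is passed in by the caller (for example a git tag or commit hash), and hashing the training data file is an added suggestion rather than something the scheme above requires.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_lineage(model_version: str, code_version: str, data_path: str) -> dict:
    """Link a model artifact version to the code and data that produced it."""
    return {
        "model_version": model_version,
        "code_version": code_version,  # e.g. a git tag or commit hash
        "training_data_sha256": file_sha256(data_path),
    }
```

With this record, a retrain that uses the same code on new data shows up as a new model version and a new data hash next to an unchanged code version.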
How long should I keep old model versions?
Indefinitely, if you can afford it: cloud object storage costs roughly $0.02 per GB-month. If you do need to prune, keep the last 10 versions and delete older ones once they are more than 1 year old. Exception: keep any version that was live in production, for compliance and audit.
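That retention rule can be expressed as a filter over version records. This is a sketch with assumed inputs: `versions_to_delete` is a hypothetical name, and each record is assumed to carry a `saved_at` datetime and a `was_live` flag, which a registry would need to track.

```python
from datetime import datetime, timedelta


def versions_to_delete(versions: list[dict], now: datetime,
                       keep_latest: int = 10, max_age_days: int = 365) -> list[str]:
    """Each entry is assumed to look like
    {"version": "1.2.0", "saved_at": datetime, "was_live": bool}."""
    newest_first = sorted(versions, key=lambda v: v["saved_at"], reverse=True)
    cutoff = now - timedelta(days=max_age_days)
    doomed = []
    for index, entry in enumerate(newest_first):
        if index < keep_latest:
            continue  # always keep the most recent versions
        if entry["was_live"]:
            continue  # keep anything that served production traffic
        if entry["saved_at"] < cutoff:
            doomed.append(entry["version"])
    return doomed
```

A version is only deleted when all three conditions hold: it is outside the newest ten, it was never live, and it is older than the age limit.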
Can I version PyTorch checkpoints, TensorFlow SavedModel, or ONNX models the same way?
Yes. The versioning system is model-agnostic. Just change the file extension and adjust the serialization and deserialization code.
How do I handle retraining on a schedule?
Use a cron job or a cloud scheduler (Amazon EventBridge, GCP Cloud Scheduler) to trigger training weekly or monthly. On success, bump the MINOR version and push the new model to the registry. On failure, alert the team and keep serving the current version.
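The job that the scheduler triggers can be kept small. In the sketch below, `scheduled_retrain` and `bump_minor` are hypothetical names, and `train`, `publish`, and `alert` are placeholder callables standing in for your own training script, registry upload, and alerting hook.

```python
from typing import Callable


def bump_minor(version: str) -> str:
    """Turn "1.2.3" into "1.3.0"."""
    major, minor, _patch = (int(part) for part in version.split("."))
    return f"{major}.{minor + 1}.0"


def scheduled_retrain(current_version: str,
                      train: Callable[[], object],
                      publish: Callable[[str, object], None],
                      alert: Callable[[str], None]) -> str:
    """Run one retraining cycle; return the version that is current afterwards."""
    try:
        model = train()
    except Exception as exc:
        # Training failed: notify the team and leave the current version alone.
        alert(f"Retraining failed, still on {current_version}: {exc}")
        return current_version
    new_version = bump_minor(current_version)
    publish(new_version, model)
    return new_version
```

In practice the job would likely also evaluate the new model against the current one before publishing; that gate is left out here to keep the sketch short.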
Further Reading
- MLflow Documentation — experiment tracking and model registry.
- Hugging Face Model Hub — 2 million pre-trained models and hosting.
- Semantic Versioning — standard versioning scheme.
- Model Cards for Model Reporting — academic paper on model metadata.