Version Control for ML Models: Managing Changes

In production, a model is not a static artifact; it evolves. You'll train new versions on more recent data, fix bugs, and A/B test variants. Each version must be tracked, indexed, and reproducible. A model versioning system ensures you know which version is live, can roll back instantly if a new version regresses, and can compare performance across versions.

This article covers semantic versioning for models, metadata tracking, model registries, and deployment strategies that enable safe, controlled model updates.

Why Model Versioning Matters

Without versioning, model deployment becomes chaotic:

No rollback: A bad model goes live and you cannot quickly revert to the previous working version.
No reproducibility: Six months later, you cannot recreate the model that performed best, because you did not track hyperparameters or training data.
No audit trail: You cannot answer "which model served this prediction?" when investigating customer complaints.
Accidental overwrites: Two engineers train models simultaneously and overwrite each other's work.

A versioning system solves all of these. It enables safe experimentation, fast rollbacks, and full auditability.

Semantic Versioning for Models

Adapt semantic versioning (MAJOR.MINOR.PATCH) to ML:

MAJOR: Breaking change (incompatible input or output schema). Increment when the model's expected input or output format changes, for example when callers must supply new input features.
MINOR: Improvement (better accuracy, new capability). Backward-compatible; old code still works. Increment when you retrain with more data or otherwise improve the model without changing its input or output schema.
PATCH: Bug fix or hygiene (fix data preprocessing, adjust hyperparameters, no accuracy change expected). Increment for fixes that do not warrant full retraining.

Examples:

model-v1.0.0: Initial production release.
model-v1.1.0: Retrained on 3 months of new data; accuracy +2%.
model-v1.1.1: Fixed a preprocessing bug; no accuracy change.
model-v2.0.0: Changed input schema (added new features); incompatible with v1.
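
The scheme above can be sketched in a few lines of Python. This is a minimal illustration, not a standard library: the ModelVersion class and its parse and bump methods are hypothetical names introduced here. Because the version is stored as a tuple of integers, comparisons are numeric rather than alphabetical.

```python
from typing import NamedTuple


class ModelVersion(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "ModelVersion":
        """Parse a string such as '1.2.0' or 'model-v1.2.0'."""
        numbers = text.rsplit("v", 1)[-1]  # drop an optional 'model-v' prefix
        major, minor, patch = (int(part) for part in numbers.split("."))
        return cls(major, minor, patch)

    def bump(self, level: str) -> "ModelVersion":
        """Return the next version for a 'major', 'minor', or 'patch' change."""
        if level == "major":
            return ModelVersion(self.major + 1, 0, 0)
        if level == "minor":
            return ModelVersion(self.major, self.minor + 1, 0)
        if level == "patch":
            return ModelVersion(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown change level: {level!r}")

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


current = ModelVersion.parse("model-v1.1.1")
print(current.bump("minor"))  # schema unchanged, retrained on new data
print(current.bump("major"))  # input schema changed
```

Resetting the lower components to zero on a MAJOR or MINOR bump follows the usual semantic versioning convention.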

Metadata Tracking: Model Cards

Store metadata alongside your model. A model card is a document that explains:

Training data (size, distribution, collection date)
Hyperparameters (learning rate, regularization, tree depth)
Performance metrics (accuracy, precision, recall, latency)
Known limitations (biases, failure modes)
Update date and author

Implement this as a JSON file bundled with the model:

{
  "name": "iris-classifier",
  "version": "1.2.0",
  "description": "Iris flower classification (setosa, versicolor, virginica)",
  "created_at": "2026-06-02T15:30:00Z",
  "updated_at": "2026-06-02T15:30:00Z",
  "author": "Dr. Alex Turner",
  "framework": "scikit-learn",
  "algorithm": "RandomForestClassifier",
  "hyperparameters": {
    "n_estimators": 100,
    "max_depth": 10,
    "random_state": 42
  },
  "training_data": {
    "source": "UCI Machine Learning Repository",
    "samples": 150,
    "features": 4,
    "date_collected": "1936"
  },
  "performance": {
    "accuracy": 0.973,
    "precision": 0.975,
    "recall": 0.973,
    "f1_score": 0.973,
    "latency_ms": 2.4
  },
  "input_schema": {
    "type": "array",
    "items": {
      "type": "number",
      "description": "[sepal_length, sepal_width, petal_length, petal_width]"
    }
  },
  "output_schema": {
    "type": "object",
    "properties": {
      "prediction": {"type": "integer", "enum": [0, 1, 2]},
      "probabilities": {"type": "array", "items": {"type": "number"}}
    }
  },
  "limitations": [
    "Trained only on UCI Iris dataset; may not generalize to other iris species.",
    "Assumes numerical input; does not handle missing values."
  ]
}

Then in your Python code, load and check this metadata:

import json
import joblib

# Load metadata
with open("iris-v1.2.0.json") as f:
    metadata = json.load(f)

# Report which version and recorded accuracy we are about to load
print(f"Loading model: {metadata['name']} v{metadata['version']}")
print(f"Accuracy: {metadata['performance']['accuracy']:.3f}")

# Load the model
model = joblib.load(f"iris-v{metadata['version']}.joblib")
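
Printing the metadata only reports it. If you want the load to fail when the card does not match what the service expects, a check along these lines may help. It is a sketch under assumptions: check_model_card, REQUIRED_KEYS, and the min_accuracy floor are hypothetical names and choices, not part of any library.

```python
REQUIRED_KEYS = ("name", "version", "hyperparameters", "training_data", "performance")


def check_model_card(metadata: dict, expected_name: str, expected_version: str,
                     min_accuracy: float = 0.0) -> None:
    """Raise ValueError if the card is incomplete or describes a different model."""
    missing = [key for key in REQUIRED_KEYS if key not in metadata]
    if missing:
        raise ValueError(f"model card is missing keys: {missing}")
    if metadata["name"] != expected_name:
        raise ValueError(f"expected model {expected_name!r}, found {metadata['name']!r}")
    if metadata["version"] != expected_version:
        raise ValueError(f"expected version {expected_version}, found {metadata['version']}")
    accuracy = metadata["performance"].get("accuracy")
    if accuracy is None or accuracy < min_accuracy:
        raise ValueError(f"recorded accuracy {accuracy} is below the floor {min_accuracy}")


# Usage with a trimmed-down card shaped like the JSON above
card = {
    "name": "iris-classifier",
    "version": "1.2.0",
    "hyperparameters": {"n_estimators": 100},
    "training_data": {"samples": 150},
    "performance": {"accuracy": 0.973},
}
check_model_card(card, "iris-classifier", "1.2.0", min_accuracy=0.95)
```

Calling the check before joblib.load means a mismatched or incomplete card stops the service at startup instead of surfacing later as a bad prediction.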

Model Registry: Central Storage and Discovery

A model registry is a centralized system where all versions are stored, discoverable, and trackable. Popular options:

Option 1: MLflow Model Registry

MLflow is an open-source platform for tracking experiments and managing model lifecycles. It stores models, metrics, parameters, and metadata:

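import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split

# Prepare data (Iris, matching the model card above)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"n_estimators": 100, "max_depth": 10, "random_state": 42}

# The context manager ends the run even if training raises
with mlflow.start_run() as run:
    # Log hyperparameters
    mlflow.log_params(params)

    # Train the model with the same parameters that were logged
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Log metrics measured on the held-out split
    y_pred = model.predict(X_test)
    mlflow.log_metrics({
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="macro"),
    })

    # Log the model and register it as a new version of "iris-classifier"
    # (registration needs a tracking backend that supports the Model Registry)
    mlflow.sklearn.log_model(model, "iris-model", registered_model_name="iris-classifier")

# Later, load from MLflow using the run ID
logged_model = f"runs:/{run.info.run_id}/iris-model"
loaded_model = mlflow.pyfunc.load_model(logged_model)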

MLflow provides a web UI to browse all versions, compare metrics, and promote models.

Option 2: Hugging Face Model Hub

For transformer models, Hugging Face Hub is convenient:

from transformers import AutoModelForSequenceClassification

# Train a model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
# ... training ...

# Push to Hub
model.push_to_hub("my-username/my-model", private=True)

# Later, load from Hub
model = AutoModelForSequenceClassification.from_pretrained("my-username/my-model")

Option 3: Self-Hosted Registry

For models you want to keep internal, build a simple registry:

import json
import os
from datetime import datetime

import joblib

class ModelRegistry:
    def __init__(self, base_dir="/models"):
        self.base_dir = base_dir
        os.makedirs(base_dir, exist_ok=True)
    
    def save_model(self, name: str, version: str, model, metadata: dict):
        """Save model and metadata."""
        version_dir = os.path.join(self.base_dir, name, version)
        os.makedirs(version_dir, exist_ok=True)
        
        # Save model
        model_path = os.path.join(version_dir, "model.joblib")
        joblib.dump(model, model_path)
        
        # Save metadata
        metadata["saved_at"] = datetime.now().isoformat()
        metadata_path = os.path.join(version_dir, "metadata.json")
        with open(metadata_path, "w") as f:
            json.dump(metadata, f, indent=2)
        
        print(f"Saved {name} v{version} to {version_dir}")
    
    def load_model(self, name: str, version: str):
        """Load model and metadata."""
        version_dir = os.path.join(self.base_dir, name, version)
        
        model_path = os.path.join(version_dir, "model.joblib")
        model = joblib.load(model_path)
        
        metadata_path = os.path.join(version_dir, "metadata.json")
        with open(metadata_path) as f:
            metadata = json.load(f)
        
        return model, metadata
    
    def list_versions(self, name: str):
        """List all versions of a model."""
        model_dir = os.path.join(self.base_dir, name)
        if not os.path.exists(model_dir):
            return []
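        # Sort numerically so that 1.10.0 comes after 1.2.0
        return sorted(
            os.listdir(model_dir),
            key=lambda v: tuple(int(part) for part in v.split(".")),
        )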

# Usage
registry = ModelRegistry()

# Save a new version
registry.save_model("iris", "1.2.0", model, metadata)

# List all versions
versions = registry.list_versions("iris")
print(f"Available versions: {versions}")

# Load a specific version
model, metadata = registry.load_model("iris", "1.2.0")
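
One gap worth noting: save_model above creates the version directory with exist_ok=True, so saving the same version twice silently replaces the earlier artifact, which is the accidental-overwrite problem listed earlier. A possible guard, sketched here with a hypothetical helper named claim_version_dir, is to create the version directory with os.mkdir, which fails if the directory already exists.

```python
import os
import tempfile


def claim_version_dir(base_dir: str, name: str, version: str) -> str:
    """Create the directory for a new version, failing if it already exists."""
    os.makedirs(os.path.join(base_dir, name), exist_ok=True)
    version_dir = os.path.join(base_dir, name, version)
    os.mkdir(version_dir)  # raises FileExistsError if the version is taken
    return version_dir


# Usage: the second attempt to publish the same version is rejected
with tempfile.TemporaryDirectory() as base_dir:
    claim_version_dir(base_dir, "iris", "1.2.0")
    try:
        claim_version_dir(base_dir, "iris", "1.2.0")
    except FileExistsError:
        print("iris v1.2.0 already exists; choose a new version number")
```

Treating published versions as immutable also keeps the audit trail honest: a version number always refers to the same artifact.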

Deployment Strategies with Versioning

Strategy 1: Blue-Green Deployment

Run two versions simultaneously: blue (current) and green (new). Route all traffic to blue. When ready, flip traffic to green. If green fails, flip back to blue instantly.

# In your API gateway or load balancer
# (app, registry, and PredictionRequest are defined elsewhere)
BLUE_VERSION = "1.1.0"   # Blue (current)
GREEN_VERSION = "1.2.0"  # Green (new)
ACTIVE_VERSION = BLUE_VERSION  # every request goes to this version

@app.post("/predict")
async def predict(request: PredictionRequest):
    # Blue-green sends 100% of traffic to one version at a time;
    # there is no percentage split as in a canary rollout
    model, _ = registry.load_model("iris", ACTIVE_VERSION)
    
    # ... run inference ...

To flip: set ACTIVE_VERSION = GREEN_VERSION. To roll back: set it to BLUE_VERSION again.
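
The handler above reads the model from the registry on every request, which keeps the example short but adds latency. One way to make the flip close to instant is to keep both models loaded in memory and switch a pointer between them. The sketch below assumes that setup; BlueGreenSwitch and its methods are hypothetical names, and plain strings stand in for real model objects.

```python
import threading


class BlueGreenSwitch:
    """Hold two preloaded models and point traffic at exactly one of them."""

    def __init__(self, blue, green=None):
        self._models = {"blue": blue, "green": green}
        self._active = "blue"
        self._lock = threading.Lock()

    def _idle(self) -> str:
        return "green" if self._active == "blue" else "blue"

    def stage(self, model) -> None:
        """Load a new model into whichever slot is currently idle."""
        with self._lock:
            self._models[self._idle()] = model

    def flip(self) -> str:
        """Send all traffic to the idle slot; calling it again rolls back."""
        with self._lock:
            idle = self._idle()
            if self._models[idle] is None:
                raise RuntimeError("no model staged in the idle slot")
            self._active = idle
            return self._active

    def active_model(self):
        with self._lock:
            return self._models[self._active]


switch = BlueGreenSwitch(blue="iris-1.1.0")
switch.stage("iris-1.2.0")      # green is loaded but receives no traffic yet
switch.flip()                   # all traffic now goes to green
print(switch.active_model())
switch.flip()                   # rollback: all traffic returns to blue
print(switch.active_model())
```

Because the previous model stays loaded in the idle slot, rolling back is the same operation as rolling forward.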

Strategy 2: Canary Deployment

Gradually shift traffic to the new version (5% → 25% → 50% → 100%) while monitoring metrics. If the error rate spikes, halt and roll back.

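import random
import time

from fastapi import BackgroundTasks

# app, registry, PredictionRequest, and log_prediction are defined elsewhere
ACTIVE_VERSION = "1.1.0"
CANARY_VERSION = "1.2.0"
CANARY_TRAFFIC_RATIO = 0.25  # 25% traffic to canary

@app.post("/predict")
async def predict(request: PredictionRequest, background_tasks: BackgroundTasks):
    # Decide which version to use
    if random.random() < CANARY_TRAFFIC_RATIO:
        version = CANARY_VERSION
    else:
        version = ACTIVE_VERSION
    
    model, metadata = registry.load_model("iris", version)
    # Pass the request's features to predict(); tolist() converts the
    # NumPy result into plain Python values that can be serialized as JSON
    pred = model.predict(...).tolist()
    
    # Log prediction for monitoring, tagged with the version that served it
    background_tasks.add_task(
        log_prediction,
        version=version,
        prediction=pred,
        timestamp=time.time()
    )
    
    return {"prediction": pred, "model_version": version}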

Monitor error rate and latency for both versions. If the canary's error rate exceeds the active version's by more than 5%, set CANARY_TRAFFIC_RATIO back to 0.
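
The rollback rule can be written as a small function over per-version counters. This sketch makes two assumptions that the text does not settle: it reads the 5% threshold as an absolute gap in error rate (0.05), and it waits for a minimum number of canary requests before judging. VersionStats, should_roll_back, max_gap, and min_requests are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(active: VersionStats, canary: VersionStats,
                     max_gap: float = 0.05, min_requests: int = 200) -> bool:
    """True when the canary's error rate exceeds the active rate by more than max_gap."""
    if canary.requests < min_requests:
        return False  # too little canary traffic to judge either way
    return canary.error_rate - active.error_rate > max_gap


active = VersionStats(requests=1000, errors=20)   # 2% errors
canary = VersionStats(requests=400, errors=40)    # 10% errors
if should_roll_back(active, canary):
    print("Canary is unhealthy: set CANARY_TRAFFIC_RATIO to 0")
```

The minimum-request guard avoids rolling back on a handful of unlucky requests early in the rollout; the right value depends on your traffic volume.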

Comparison Table: Versioning Approaches

Approach	Setup	Scalability	Monitoring	Rollback Speed	Cost
Manual file naming	Easy	Low	Manual	Slow	Free
MLflow	Moderate	Medium	Good	Medium	Free (self-hosted)
Hugging Face Hub	Easy	High	Good	Medium	Free tier + paid
Self-hosted registry	Complex	High	Custom	Fast	Self-hosted
Cloud ML (SageMaker, Vertex)	Moderate	High	Excellent	Fast	Pay-per-use

Key Takeaways

Use semantic versioning (MAJOR.MINOR.PATCH) to signal compatibility and scope of changes.
Store metadata (hyperparameters, metrics, training data) alongside the model for reproducibility.
Use a model registry (MLflow, Hugging Face, or custom) to centralize version management and discovery.
Implement blue-green or canary deployments to safely roll out new versions with instant rollback.
Monitor error rate and latency during canary; halt if metrics diverge significantly.

Frequently Asked Questions

Should I version the model or the code (or both)?

Both. Version the model code (training script) separately from the model artifact. A model version 1.2.0 might be trained by code version 1.2.0, but they can diverge (same code, retrained → new model version).

How long should I keep old model versions?

Indefinitely, if storage is cheap (cloud object storage costs roughly $0.02 per GB-month). If you do need to prune, keep the last 10 versions and delete older ones after 1 year. Exception: keep any version that was live in production, for compliance and audit.
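
A retention rule like this is easy to encode as a pure function that only decides what to delete. The sketch below reads the policy as: a version is deletable only if it is outside the 10 most recent, older than one year, and was never live. The function name and the dictionary keys (version, saved_at, was_live) are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def versions_to_delete(versions, now=None, keep_latest=10, max_age_days=365):
    """Pick versions that are safe to delete under the retention policy.

    `versions` is a list of dicts with hypothetical keys:
    'version' (str), 'saved_at' (timezone-aware datetime), 'was_live' (bool).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    newest_first = sorted(versions, key=lambda v: v["saved_at"], reverse=True)
    protected = {v["version"] for v in newest_first[:keep_latest]}
    return [
        v["version"]
        for v in newest_first
        if v["version"] not in protected
        and not v["was_live"]
        and v["saved_at"] < cutoff
    ]


# Usage: twelve versions saved 60 days apart; the oldest one served production
now = datetime(2026, 1, 1, tzinfo=timezone.utc)
history = [
    {"version": f"1.{i}.0", "saved_at": now - timedelta(days=60 * (12 - i)), "was_live": i == 0}
    for i in range(12)
]
print(versions_to_delete(history, now=now))
```

Keeping the decision separate from the deletion makes it simple to log or review the list before anything is removed.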

Can I version PyTorch, TensorFlow SavedModel, or ONNX models the same way?

Yes. The versioning system is model-agnostic. Just change the file extension and adjust the serialization and deserialization code.

How do I handle retraining on a schedule?

Use a cron job or a cloud scheduler (Amazon EventBridge Scheduler, GCP Cloud Scheduler) to trigger training weekly or monthly. On success, bump the MINOR version and push to the registry. On failure, alert the team.
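
The job that the scheduler triggers might look like the sketch below. It assumes the caller supplies three callables, here named train, publish, and alert (all hypothetical), so the control flow can be shown without tying it to a specific training pipeline, registry, or paging tool.

```python
def run_scheduled_retrain(current_version, train, publish, alert):
    """Train once; publish under the next MINOR version, or alert on failure.

    `train`, `publish`, and `alert` are callables supplied by the caller.
    """
    major, minor, _patch = (int(part) for part in current_version.split("."))
    next_version = f"{major}.{minor + 1}.0"
    try:
        model, metadata = train()
    except Exception as exc:  # any training failure should reach the team
        alert(f"scheduled retraining failed: {exc}")
        return None
    publish(next_version, model, metadata)
    return next_version


# Usage with stand-in callables
published = []
new_version = run_scheduled_retrain(
    "1.2.0",
    train=lambda: ("model-object", {"accuracy": 0.97}),
    publish=lambda version, model, metadata: published.append(version),
    alert=print,
)
print(new_version, published)
```

In a real pipeline you would likely also compare the new model's metrics against the live version before publishing, so that a successful training run that produces a worse model is not promoted automatically.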
