
Monitoring Model Drift in Production: Detect Decay

Model drift occurs when a model's performance degrades in production because the data it sees no longer matches the training data distribution or because the real-world relationship it learned has changed. Without monitoring, you might serve an inaccurate model to thousands of users for weeks before noticing. In this article, you will learn to detect three types of drift—data drift, concept drift, and performance drift—and set up automated alerts in Python.

Understanding the Three Types of Drift

Data drift (also called covariate shift) happens when the input feature distribution changes but the relationship between features and target remains the same. Example: you train a credit-scoring model on customers aged 25-55; then, your customer base shifts to 35-65. The age distribution changed, but age still predicts credit risk the same way.

Concept drift happens when the relationship between features and target changes. It is not the same as label drift, which is a change in the distribution of the target itself. Example: you train a loan default prediction model in a stable economy; then interest rates double and a recession hits. The same features (income, debt) no longer predict default the same way.

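Performance drift is observed when actual model accuracy drops on new data. You measure this by getting ground truth labels (actual outcomes) and comparing predictions to reality. Unlike data drift, which can be detected from the inputs alone, performance drift requires labels. Concept drift usually needs labels to confirm as well, because it concerns the relationship between features and target rather than the features themselves.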

Detecting Data Drift with Kolmogorov-Smirnov Test

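The two-sample Kolmogorov-Smirnov (KS) test compares the distributions of two samples. Its statistic is the largest gap between their empirical cumulative distribution functions, ranging from 0 (identical) to 1. The test also returns a p-value; when the p-value falls below your significance level, the distributions differ significantly, which indicates data drift.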

import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

# Simulated training data (baseline)
np.random.seed(42)
train_data = np.random.normal(loc=50, scale=10, size=1000)

# Simulated production data (after data drift)
prod_data = np.random.normal(loc=60, scale=10, size=500)

# KS test
statistic, p_value = ks_2samp(train_data, prod_data)

print(f"KS Statistic: {statistic:.4f}")
print(f"P-value: {p_value:.6f}")

if p_value < 0.05:  # Typical significance level
    print("WARNING: Data drift detected!")
else:
    print("No significant data drift.")
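To make the statistic less of a black box, here is a minimal sketch that computes it by hand from the two empirical CDFs and compares the result with SciPy. The helper name ks_statistic_manual and the simulated samples are illustrative assumptions, not part of any library.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_statistic_manual(a, b):
    """Largest vertical gap between the empirical CDFs of a and b."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # The gap can only change at observed values, so evaluate both ECDFs there.
    points = np.concatenate([a, b])
    ecdf_a = np.searchsorted(a, points, side="right") / len(a)
    ecdf_b = np.searchsorted(b, points, side="right") / len(b)
    return float(np.max(np.abs(ecdf_a - ecdf_b)))


rng = np.random.default_rng(0)
reference = rng.normal(loc=50, scale=10, size=1000)  # simulated baseline
current = rng.normal(loc=60, scale=10, size=500)     # simulated shifted sample

manual = ks_statistic_manual(reference, current)
scipy_stat = ks_2samp(reference, current).statistic
print(f"manual={manual:.4f}, scipy={scipy_stat:.4f}")
```

The two numbers should agree, which confirms that the statistic is a distance between cumulative distributions and not a probability.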

For multiple features, compute the KS statistic per feature:

import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

# Training data (baseline)
train_df = pd.DataFrame({
    "age": np.random.normal(45, 12, 1000),
    "income": np.random.normal(60000, 20000, 1000),
    "credit_score": np.random.normal(700, 100, 1000),
})

# Production data (new)
prod_df = pd.DataFrame({
    "age": np.random.normal(50, 12, 500),
    "income": np.random.normal(70000, 20000, 500),
    "credit_score": np.random.normal(680, 100, 500),
})

# Check drift per feature
drift_threshold = 0.05  # Significance level for the p-value

for col in train_df.columns:
    statistic, p_value = ks_2samp(train_df[col], prod_df[col])
    print(f"{col}: KS={statistic:.4f}, p={p_value:.6f}")

    if p_value < drift_threshold:
        print(f" -> DRIFT DETECTED in {col}")
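One caveat with per-feature testing: each test at a 0.05 level has its own chance of a false alarm, so the more features you check, the more likely at least one flags drift by chance. A common and simple remedy is a Bonferroni correction, which divides the significance level by the number of tests. The sketch below assumes 20 hypothetical features with no real drift; the data and column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
n_features = 20
columns = [f"f{i}" for i in range(n_features)]  # hypothetical feature names

# Reference and current data come from the same distribution: no real drift.
reference_df = pd.DataFrame(rng.normal(size=(1000, n_features)), columns=columns)
current_df = pd.DataFrame(rng.normal(size=(500, n_features)), columns=columns)

alpha = 0.05
adjusted_alpha = alpha / n_features  # Bonferroni-corrected level per feature

p_values = {
    col: ks_2samp(reference_df[col], current_df[col]).pvalue for col in columns
}
flagged_raw = [col for col, p in p_values.items() if p < alpha]
flagged_adjusted = [col for col, p in p_values.items() if p < adjusted_alpha]

print(f"Flagged at alpha={alpha}: {flagged_raw}")
print(f"Flagged at alpha={adjusted_alpha}: {flagged_adjusted}")
```

Bonferroni is conservative, so it trades some sensitivity for fewer false alarms; whether that trade is right depends on how costly a missed drift is for you.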

Detecting Drift with Wasserstein Distance

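The Wasserstein distance (also called Earth Mover's Distance) measures how much "work" is needed to transform one distribution into another. Unlike the KS statistic, which is bounded between 0 and 1 and records only the largest gap between the two cumulative distributions, it is expressed in the units of the feature and grows with the size of the shift. It is a distance rather than a statistical test, so it returns no p-value and you must choose the threshold yourself.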

import numpy as np
from scipy.stats import wasserstein_distance

train_data = np.random.normal(loc=50, scale=10, size=1000)
prod_data = np.random.normal(loc=55, scale=12, size=500)

wd = wasserstein_distance(train_data, prod_data)
print(f"Wasserstein Distance: {wd:.4f}")

# Higher Wasserstein = more drift
if wd > 5:  # Threshold depends on data scale (here, half a training standard deviation)
    print("WARNING: Data drift detected!")
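Because the distance is in the feature's own units, a threshold of 5 means something very different for age than for income. One way to get a threshold you can share across features is to divide by the reference standard deviation. The sketch below assumes that normalization; the helper normalized_wasserstein is hypothetical, not a SciPy function.

```python
import numpy as np
from scipy.stats import wasserstein_distance


def normalized_wasserstein(reference, current):
    """Wasserstein distance in units of the reference standard deviation."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    scale = reference.std()
    if scale == 0:
        raise ValueError("reference has zero variance; cannot normalize")
    return wasserstein_distance(reference, current) / scale


rng = np.random.default_rng(0)
age_ref = rng.normal(45, 12, 2000)
age_cur = rng.normal(51, 12, 2000)            # mean shifted by half a std
income_ref = rng.normal(60000, 20000, 2000)
income_cur = rng.normal(70000, 20000, 2000)   # mean shifted by half a std

raw_age = wasserstein_distance(age_ref, age_cur)
raw_income = wasserstein_distance(income_ref, income_cur)
norm_age = normalized_wasserstein(age_ref, age_cur)
norm_income = normalized_wasserstein(income_ref, income_cur)

print(f"raw:        age={raw_age:.2f}, income={raw_income:.2f}")
print(f"normalized: age={norm_age:.2f}, income={norm_income:.2f}")
```

Both features shift by the same relative amount, so the normalized values come out close to each other while the raw values differ by orders of magnitude.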

Monitoring Performance Drift with Ground Truth

The most reliable drift signal is observed performance drop, but it requires ground truth labels (actual outcomes). For a credit-scoring model, ground truth arrives slowly: you know if a customer defaulted only months later.

Set up a monitoring database to store predictions and actual outcomes:

import sqlite3
from datetime import datetime, timezone
from typing import Optional

import pandas as pd
from sklearn.metrics import accuracy_score

# Connect to monitoring database
conn = sqlite3.connect("monitoring.db")

# Create table if not exists
conn.execute("""
    CREATE TABLE IF NOT EXISTS predictions (
        id INTEGER PRIMARY KEY,
        timestamp TEXT,
        prediction REAL,
        actual INTEGER
    )
""")

def utc_now() -> str:
    # Same UTC 'YYYY-MM-DD HH:MM:SS' format that SQLite's datetime('now') returns,
    # so the string comparison in compute_window_accuracy is valid
    return datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# Log a prediction and return its row id so the label can be attached later
def log_prediction(prediction: float, actual: Optional[int] = None) -> int:
    cursor = conn.execute(
        "INSERT INTO predictions (timestamp, prediction, actual) VALUES (?, ?, ?)",
        (utc_now(), prediction, actual)
    )
    conn.commit()
    return cursor.lastrowid

# Later, when ground truth arrives, update the actual value
def update_ground_truth(pred_id: int, actual: int):
    conn.execute(
        "UPDATE predictions SET actual = ? WHERE id = ?",
        (actual, pred_id)
    )
    conn.commit()

# Periodically compute accuracy over a window
def compute_window_accuracy(days: int = 7) -> Optional[float]:
    # Pass the window as a bound parameter instead of formatting it into the SQL
    query = """
        SELECT prediction, actual
        FROM predictions
        WHERE actual IS NOT NULL
          AND timestamp > datetime('now', ?)
    """
    df = pd.read_sql_query(query, conn, params=(f"-{int(days)} days",))

    if len(df) > 0:
        # Convert predictions (probabilities) to binary
        y_pred = (df["prediction"] > 0.5).astype(int)
        y_true = df["actual"]
        return accuracy_score(y_true, y_pred)
    return None  # No labeled predictions in the window yet

# Check accuracy over last 7 days
acc = compute_window_accuracy(days=7)

if acc is None:
    print("No labeled predictions in the last 7 days.")
else:
    print(f"Accuracy (last 7 days): {acc:.3f}")
    if acc < 0.85:  # Threshold
        print("WARNING: Performance drift detected!")
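A single 7-day number hides when the drop started. As a complement, here is a minimal sketch that computes daily accuracy and a 7-day rolling average with pandas. It uses a simulated in-memory log in place of the database; the accuracy levels (92% in week one, 75% in week two) and the 0.85 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical labeled log: 14 days, 200 labeled predictions per day.
day = pd.date_range("2024-01-01", periods=14, freq="D").repeat(200)
actual = rng.integers(0, 2, size=len(day))

# Simulate a model that is right 92% of the time in week 1 and 75% in week 2.
p_correct = np.where(np.arange(len(day)) < 7 * 200, 0.92, 0.75)
is_correct = rng.random(len(day)) < p_correct
predicted = np.where(is_correct, actual, 1 - actual)

log = pd.DataFrame({"day": day, "actual": actual, "predicted": predicted})
log["correct"] = (log["actual"] == log["predicted"]).astype(float)

daily_accuracy = log.groupby("day")["correct"].mean()
# Require a full 7 days before reporting, so early values are NaN.
rolling_accuracy = daily_accuracy.rolling(window=7, min_periods=7).mean()

alert_days = rolling_accuracy[rolling_accuracy < 0.85].index
print(rolling_accuracy.round(3))
print(f"First alert day: {alert_days[0].date() if len(alert_days) else None}")
```

Averaging daily accuracies equals the window accuracy only when every day has the same number of labeled rows, as it does here; with uneven volumes, sum the correct predictions and the row counts over the window and divide.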

Using Evidently for Comprehensive Drift Monitoring

Evidently is a Python library designed for monitoring model and data quality in production. It provides dashboards and reports for drift detection.

Install it:

pip install evidently

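Use it to generate a drift report. The imports below follow the Report API of older Evidently releases; later releases reorganized the package, so check the documentation for the version you install: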

from evidently.report import Report
from evidently.metric_preset import DataQualityPreset
from evidently.metrics import DataDriftTable
import pandas as pd

# Training data (reference)
reference_data = pd.read_csv("training_data.csv")

# Production data (current)
current_data = pd.read_csv("production_data.csv")

# Create a report
report = Report(metrics=[
    DataQualityPreset(),
    DataDriftTable(),
])

report.run(reference_data=reference_data, current_data=current_data)

# Save and display
report.save_html("drift_report.html")
print(report.as_dict()) # Get structured results

The report shows per-feature drift indicators, missing values, and statistical tests. You can programmatically check if drift was detected:

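results = report.as_dict()

# Initialize before the loop so the final check works even if the metric is missing
drift_detected = False

for metric_result in results["metrics"]:
    # In the Report API used above, each entry stores the metric's class name under "metric"
    if metric_result.get("metric") == "DataDriftTable":
        drift_by_columns = metric_result.get("result", {}).get("drift_by_columns", {})
        for column, column_result in drift_by_columns.items():
            if column_result.get("drift_detected"):
                drift_detected = True
                print(f"Drift in {column}")

if not drift_detected:
    print("No drift detected")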

Setting Up Automated Alerts

Monitor drift continuously and alert when thresholds are crossed. Use a simple scheduled job:

import logging
import time

import pandas as pd
import schedule
from scipy.stats import ks_2samp

# Configure logging (or replace with email/Slack alerts)
logging.basicConfig(level=logging.WARNING)

def check_drift():
    """Check for drift and alert if found."""

    # Load recent production data
    prod_data = pd.read_csv("production_recent.csv")
    train_data = pd.read_csv("training_baseline.csv")

    # Check data drift per numeric feature (the KS test needs numeric values)
    drift_detected = False
    for col in train_data.select_dtypes(include="number").columns:
        if col in prod_data.columns:
            statistic, p_value = ks_2samp(train_data[col], prod_data[col])
            if p_value < 0.05:
                logging.warning(f"Data drift in {col}: KS={statistic:.4f}, p={p_value:.6f}")
                drift_detected = True

    # Check performance drift (compute_window_accuracy is defined in the monitoring section above)
    recent_accuracy = compute_window_accuracy(days=7)
    if recent_accuracy is not None and recent_accuracy < 0.85:
        logging.warning(f"Performance drift: accuracy dropped to {recent_accuracy:.3f}")
        drift_detected = True

    if drift_detected:
        # Trigger retraining (see article 7)
        logging.warning("Drift detected. Consider retraining the model.")

# Schedule the check to run every 6 hours
schedule.every(6).hours.do(check_drift)

# Run in a background loop (or replace with a cron job)
while True:
    schedule.run_pending()
    time.sleep(60)  # Check every minute if a scheduled task is due

For production, use a real job scheduler (cron, Airflow, Kubernetes CronJob, etc.) instead of a Python loop.
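External schedulers usually judge a job by its process exit code, so a one-shot script is easier to wire up than a loop. Here is a minimal sketch of such an entry point. It assumes the check function returns True when drift is found; run_once and example_check are hypothetical names, and the exit code convention (0, 1, 2) is an illustrative choice.

```python
import sys
from typing import Callable


def run_once(check: Callable[[], bool]) -> int:
    """Run a single drift check and map the outcome to an exit code.

    Returns 0 when no drift is found, 1 when drift is found,
    and 2 when the check itself failed.
    """
    try:
        drift_found = check()
    except Exception as exc:
        # Surface the failure to the scheduler instead of hiding it.
        print(f"drift check failed: {exc}", file=sys.stderr)
        return 2
    return 1 if drift_found else 0


def example_check() -> bool:
    """Hypothetical stand-in for a real check; reports no drift."""
    return False


exit_code = run_once(example_check)
print(f"exit code: {exit_code}")
# In a standalone script, finish with: sys.exit(exit_code)
```

A separate code for a failed check matters because a crashed monitor that silently reports "no drift" is worse than no monitor at all.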

Building a Dashboard

Visualize drift metrics over time using a simple plot:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

# Simulated monitoring data (in production, fetch from your DB)
monitoring_data = []
n_ref, n_prod = 1000, 500

for day in range(30):
    # Simulate production data drifting over time
    prod_data = np.random.normal(loc=50 + day*0.5, scale=10, size=n_prod)
    ref_data = np.random.normal(loc=50, scale=10, size=n_ref)

    statistic, _ = ks_2samp(ref_data, prod_data)
    monitoring_data.append({"day": day, "ks_statistic": statistic})

df_monitor = pd.DataFrame(monitoring_data)

# Critical KS statistic at alpha = 0.05 for these sample sizes (large-sample approximation).
# Note that 0.05 is a p-value cutoff, not a cutoff for the statistic itself.
alpha = 0.05
critical_value = np.sqrt(-0.5 * np.log(alpha / 2)) * np.sqrt((n_ref + n_prod) / (n_ref * n_prod))

# Plot
plt.figure(figsize=(10, 5))
plt.plot(df_monitor["day"], df_monitor["ks_statistic"], marker="o")
plt.axhline(y=critical_value, color="r", linestyle="--", label="Drift Threshold (alpha=0.05)")
plt.xlabel("Days")
plt.ylabel("KS Statistic")
plt.title("Data Drift Over Time")
plt.legend()
plt.grid()
plt.tight_layout()
plt.savefig("drift_dashboard.png")
plt.show()

Key Takeaways

  • Data drift, concept drift, and performance drift are distinct and require different detection methods.
  • The Kolmogorov-Smirnov test and Wasserstein distance detect data drift without labels.
  • Performance drift requires ground truth labels, which arrive with delay. Monitor accuracy over rolling windows.
  • Evidently is a production-ready library for comprehensive drift monitoring and dashboarding.
  • Set up automated alerts on a schedule; trigger retraining when drift exceeds thresholds.

Frequently Asked Questions

Which drift detection method should I use?

Start with the KS test (simple, fast). If you also want to know how large the shift is, in the feature's own units, add the Wasserstein distance. Combine with Evidently for a full picture. Performance drift measured against ground truth is always the most reliable signal once labels are available.

How do I set the drift threshold?

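Thresholds are domain-dependent. Start with a common default, such as a KS p-value below 0.05, or with the default of your ML platform, then tune based on false positives. The direction depends on what you threshold. For a p-value cutoff, alert less often by lowering it (for example, to 0.01). For a distance such as the KS statistic or the Wasserstein distance, alert less often by raising it. Do the opposite if you miss real drift. With large samples, even tiny differences produce small p-values, so a threshold on the statistic itself is often easier to work with.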
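One way to ground a threshold in your own data is to measure how large the KS statistic gets when there is no drift at all. The sketch below repeatedly splits the reference data at random, computes the KS statistic between the two parts, and takes a high quantile as the alert threshold. The function calibrate_ks_threshold and its parameters are hypothetical, and the approach assumes the reference data is itself stable.

```python
import numpy as np
from scipy.stats import ks_2samp


def calibrate_ks_threshold(reference, window_size, n_rounds=100, quantile=0.95, seed=0):
    """Estimate the KS statistic typically seen between drift-free samples.

    Each round shuffles the reference data, treats the first `window_size`
    values as a fake production window, and compares it with the rest.
    Returns the requested quantile of the resulting KS statistics.
    """
    rng = np.random.default_rng(seed)
    reference = np.asarray(reference, dtype=float)
    stats = []
    for _ in range(n_rounds):
        shuffled = rng.permutation(reference)
        window, rest = shuffled[:window_size], shuffled[window_size:]
        stats.append(ks_2samp(rest, window).statistic)
    return float(np.quantile(stats, quantile))


rng = np.random.default_rng(42)
reference = rng.normal(50, 10, 2000)   # simulated baseline feature
threshold = calibrate_ks_threshold(reference, window_size=500)

drifted = rng.normal(55, 10, 500)      # simulated window, mean shifted by half a std
drifted_stat = ks_2samp(reference, drifted).statistic

print(f"calibrated threshold: {threshold:.4f}")
print(f"statistic on drifted window: {drifted_stat:.4f}")
```

With the 0.95 quantile, roughly one drift-free window in twenty would still trigger an alert; raise the quantile if that is too noisy for your team.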

Can I detect drift in categorical features?

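Yes. For categorical data, use a chi-squared test instead of the KS test: build a table of category counts for the reference and production data and pass it to scipy.stats.chi2_contingency(). Avoid applying KS to arbitrarily encoded category values, because the test assumes the values have a meaningful order; that shortcut is only defensible for ordinal categories.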
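Here is a minimal sketch of the chi-squared approach. The feature values and their proportions are made up for illustration; only chi2_contingency is a real SciPy function.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
categories = ["salaried", "self_employed", "student"]  # hypothetical feature values

# Simulated reference and production samples with different category mixes.
reference = pd.Series(rng.choice(categories, size=2000, p=[0.6, 0.3, 0.1]))
current = pd.Series(rng.choice(categories, size=1000, p=[0.4, 0.4, 0.2]))

# Rows: reference vs. current. Columns: one per category. Cells: observed counts.
counts = pd.DataFrame({
    "reference": reference.value_counts(),
    "current": current.value_counts(),
}).fillna(0).T

chi2, p_value, dof, expected = chi2_contingency(counts.to_numpy())
print(f"chi2={chi2:.2f}, p={p_value:.6f}, dof={dof}")

if p_value < 0.05:
    print("WARNING: Drift detected in categorical feature")
```

The fillna(0) handles a category that appears in only one of the two samples. The chi-squared approximation is unreliable when expected counts are very small, so consider merging rare categories before testing.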

What if performance drops but I do not detect data drift?

You may have concept drift (the underlying relationship changed) without data drift. This is harder to detect without labels. Monitor carefully and consider retraining more frequently when concept drift is suspected.
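The following sketch simulates that situation. The "model" is a fixed rule standing in for a trained classifier, and the data is synthetic; everything here is an illustrative assumption. The inputs keep the same distribution, but the true decision boundary moves, so a test on the inputs sees nothing while accuracy falls.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)


def model_predict(x):
    """A fixed 'trained' rule: predict the positive class when x > 0."""
    return (x > 0).astype(int)


# Reference period: labels follow the same rule the model learned.
x_ref = rng.normal(0, 1, 5000)
y_ref = (x_ref > 0).astype(int)

# Production period: inputs come from the same distribution, but the true
# decision boundary has moved from 0 to 1 (concept drift).
x_prod = rng.normal(0, 1, 5000)
y_prod = (x_prod > 1).astype(int)

ks_stat = ks_2samp(x_ref, x_prod).statistic
acc_ref = float(np.mean(model_predict(x_ref) == y_ref))
acc_prod = float(np.mean(model_predict(x_prod) == y_prod))

print(f"KS statistic on inputs: {ks_stat:.4f}")
print(f"accuracy before: {acc_ref:.3f}, after: {acc_prod:.3f}")
```

The KS statistic on the inputs stays small while accuracy drops sharply, which is why input monitoring alone cannot replace performance monitoring with labels.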

How often should I run drift checks?

Depends on your domain. For high-frequency predictions (thousands per hour), check daily. For lower-frequency (tens per day), check weekly. More frequent checks catch drift earlier but are costlier.

Further Reading