Avoiding Data Leakage in ML Pipelines
Data leakage is a silent killer of machine learning projects. A model reaches 95% accuracy in validation, then drops to 60% in production. The usual cause is that information unavailable at prediction time reached the model during training or evaluation. A preprocessor fit on the combined dataset, a feature computed from the target value, or a temporal split done wrong all hand the model information it will not have in production. The model then exploits patterns specific to your evaluation data rather than patterns that generalize. One survey of real-world deployments reported that 40% of failing ML models suffered from data leakage (Harvard Business Review / Gartner survey, 2025).
Types of Leakage
Data leakage falls into two broad categories: explicit (using future information or test data directly) and subtle (using statistics computed on the full dataset). The five types below are concrete instances of one or both.
Leakage Type 1: Target Leakage
Using the target variable or information derived from it as a feature.
import pandas as pd
# Example: predicting credit default
df = pd.DataFrame({
'age': [25, 35, 45],
'income': [50000, 75000, 100000],
'has_default': [0, 1, 0],
'default_probability_estimate': [0.1, 0.8, 0.05] # LEAKAGE: computed from target
})
# WRONG: using default_probability_estimate as a feature
X_bad = df[['age', 'income', 'default_probability_estimate']]
y = df['has_default']
# RIGHT: use only features that exist before observing default
X_good = df[['age', 'income']]
# The bad model can read the answer off a target-derived column; the good model must rely on information available at prediction time.
Target leakage is catastrophic and hard to detect if you don't understand your data's origin. A variable named innocently like risk_score might already encode the target.
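One rough way to screen for such columns is to score each feature on its own against the target and look for anything that is suspiciously predictive. The sketch below is an illustration on synthetic data, not a definitive detector: the column names, the helper `single_feature_auc`, and the 0.95 cutoff are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)

# Hypothetical columns: 'income' is unrelated to the target here,
# 'risk_score' is the target plus a little noise (i.e. leaked).
df = pd.DataFrame({
    'income': rng.normal(60000, 15000, size=n),
    'risk_score': y + rng.normal(0, 0.1, size=n),
})

def single_feature_auc(frame, target):
    """AUC of each column used alone as a score, folded so 0.5 is the floor."""
    aucs = {}
    for col in frame.columns:
        auc = roc_auc_score(target, frame[col])
        aucs[col] = max(auc, 1 - auc)
    return pd.Series(aucs).sort_values(ascending=False)

aucs = single_feature_auc(df, y)
suspicious = aucs[aucs > 0.95].index.tolist()  # 0.95 is an arbitrary cutoff
print(aucs)
print("Review these columns:", suspicious)
```

A flagged column is only a prompt to check where the data came from; a legitimately strong feature can also score high.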
Leakage Type 2: Test Set Leakage
Using information from the test set during training.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = pd.DataFrame({'age': [25, 35, 45, 55], 'income': [50000, 75000, 100000, 125000]})
y = pd.Series([0, 1, 0, 1])
# WRONG: fit scaler on full dataset, then split
scaler_bad = StandardScaler()
X_scaled_bad = scaler_bad.fit_transform(X) # Fit includes test data!
X_train, X_test, y_train, y_test = train_test_split(X_scaled_bad, y, test_size=0.2)
# RIGHT: split first, fit on training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler_good = StandardScaler()
X_train_scaled = scaler_good.fit_transform(X_train) # Fit only training data
X_test_scaled = scaler_good.transform(X_test) # Transform test with training stats
The bad approach leaks test-set statistics (the mean and standard deviation that StandardScaler learns) into the training features. The model indirectly learns from test data, which inflates evaluation metrics.
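To make the effect concrete, the following sketch compares the mean a scaler learns when fit on all rows with the mean it learns when fit on training rows only. The data is synthetic, and the extreme value is placed in the test set on purpose to exaggerate the difference.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single feature with one extreme value
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# shuffle=False keeps the last row (the extreme value) in the test set
X_train, X_test = train_test_split(X, test_size=0.2, shuffle=False)

leaky_mean = StandardScaler().fit(X).mean_[0]        # fit on train + test
clean_mean = StandardScaler().fit(X_train).mean_[0]  # fit on train only

print(f"mean seen by leaky scaler: {leaky_mean:.1f}")
print(f"mean seen by clean scaler: {clean_mean:.1f}")
```

The leaky scaler's mean is pulled toward a value the training set never contained, so every scaled training row already reflects the test set.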
Leakage Type 3: Feature Engineering on Full Dataset
Computing new features (mean encoding, polynomial features) on the combined dataset before splitting.
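import pandas as pd
from sklearn.model_selection import train_test_split
# Toy data for illustration only
df = pd.DataFrame({'city': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'B', 'C', 'A'],
                   'price': [100, 110, 200, 210, 300, 310, 120, 220, 320, 130]})
# WRONG: compute target encoding on full dataset, then split
df_leaky = df.copy()
target_mean = df_leaky.groupby('city')['price'].mean() # Uses all data including test!
df_leaky['city_encoded'] = df_leaky['city'].map(target_mean)
X_train, X_test, y_train, y_test = train_test_split(df_leaky.drop('price', axis=1), df_leaky['price'], test_size=0.2)
# RIGHT: split first, compute encoding on training data only
X_train, X_test, y_train, y_test = train_test_split(df.drop('price', axis=1), df['price'], test_size=0.2, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy() # Work on copies, not slices of df
target_mean = X_train.join(y_train).groupby('city')['price'].mean() # Training data only!
X_train['city_encoded'] = X_train['city'].map(target_mean)
# Cities absent from training map to NaN; fall back to the overall training mean
X_test['city_encoded'] = X_test['city'].map(target_mean).fillna(y_train.mean())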
The leaky version computes each city's mean using test rows, so the encoded feature in both the training and test sets carries information about test targets and inflates validation metrics.
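Even when the encoding uses training rows only, each training row's encoded value still includes that row's own target. A common refinement, sketched here as one option rather than the required method, is to compute the encoding out-of-fold within the training set. The toy data and the fallback to the overall training mean are assumptions for the example.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical training data (already split from the test set)
train = pd.DataFrame({
    'city':  ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'price': [100, 110, 120, 200, 210, 220, 300, 310],
})

global_mean = train['price'].mean()  # fallback when a city is missing from the other folds
train['city_encoded'] = float('nan')

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(train):
    # Means come from the other folds only, so a row never sees its own target
    fold_means = train.iloc[fit_idx].groupby('city')['price'].mean()
    encoded = train.iloc[enc_idx]['city'].map(fold_means).fillna(global_mean)
    train.loc[train.index[enc_idx], 'city_encoded'] = encoded

print(train)
```

Test rows would still be encoded with means computed from the full training set, as in the example above.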
Leakage Type 4: Temporal Leakage
Using future information for time-series predictions.
import pandas as pd
# Example: predict stock price tomorrow
dates = pd.date_range('2026-01-01', periods=100)
prices = [100 + i + (i % 10) for i in range(100)]
df = pd.DataFrame({'date': dates, 'price': prices})
df['target'] = df['price'].shift(-1) # Tomorrow's price
df = df.dropna(subset=['target']) # The last row has no "tomorrow"; 99 rows remain
train_size = 80
# WRONG: shuffle, then split
shuffled = df.sample(frac=1, random_state=42) # Shuffle (BREAKS TIME ORDER)
X_train = shuffled.iloc[:train_size].drop('target', axis=1)
y_train = shuffled.iloc[:train_size]['target']
X_test = shuffled.iloc[train_size:].drop('target', axis=1)
y_test = shuffled.iloc[train_size:]['target']
# Train and test dates are interleaved: the model trains on days that come after test days (future info!)
# RIGHT: split by time, no shuffling
ordered = df.sort_values('date')
X_train = ordered.iloc[:train_size].drop('target', axis=1)
y_train = ordered.iloc[:train_size]['target']
X_test = ordered.iloc[train_size:].drop('target', axis=1)
y_test = ordered.iloc[train_size:]['target']
# Training: Jan 1 - Mar 21, 2026; Test: Mar 22 - Apr 9, 2026 (proper time order)
Shuffling time-series data before splitting destroys temporal structure. The model trains on rows drawn from the entire timeline, so some training rows come after test rows and can reveal their targets (here, a training row's price is the previous day's target). Never shuffle time-series data; always split by time.
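For cross-validation on ordered data, scikit-learn's TimeSeriesSplit produces folds in which every training index precedes every test index. A minimal sketch, assuming the rows are already sorted by time (the array here is a stand-in for real features):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # placeholder features, rows sorted by time
tscv = TimeSeriesSplit(n_splits=4)

folds = list(tscv.split(X))
for train_idx, test_idx in folds:
    # Each fold trains on an expanding window and tests on the block that follows it
    print(f"train 0..{train_idx.max()}  test {test_idx.min()}..{test_idx.max()}")
```

The splitter can be passed as the cv argument to cross_val_score in place of an integer.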
Leakage Type 5: Imputation on Full Dataset
Fitting an imputer on the full dataset before splitting.
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
# WRONG: fit imputer on full data
imputer_bad = SimpleImputer(strategy='mean')
X_imputed_bad = imputer_bad.fit_transform(X) # Mean includes test data!
X_train, X_test, y_train, y_test = train_test_split(X_imputed_bad, y, test_size=0.2)
# RIGHT: split first, fit imputer on training only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
imputer_good = SimpleImputer(strategy='mean')
X_train_imputed = imputer_good.fit_transform(X_train) # Mean from training only
X_test_imputed = imputer_good.transform(X_test) # Transform with training params
Leakage through imputation is common. A mean computed on the combined dataset includes test-set values, so the values filled into the training set carry information about the test distribution.
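The snippet above reuses a feature matrix without missing values, so the difference is not visible there. The sketch below uses a small synthetic column with gaps to show how the fill value changes depending on which rows the imputer sees.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Synthetic column with missing values; the split is fixed by hand for clarity
X_train = np.array([[10.0], [np.nan], [30.0]])
X_test = np.array([[np.nan], [500.0]])
X_full = np.vstack([X_train, X_test])

leaky_fill = SimpleImputer(strategy='mean').fit(X_full).statistics_[0]
clean_fill = SimpleImputer(strategy='mean').fit(X_train).statistics_[0]

print(f"fill value when fit on all rows:      {leaky_fill}")
print(f"fill value when fit on training rows: {clean_fill}")
```

With the leaky imputer, the missing training value is filled with a number driven largely by a test row.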
Safe Pipeline Pattern
Use scikit-learn's Pipeline and ColumnTransformer for leakage-free preprocessing.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
# Define pipeline: preprocessing + model
pipeline = Pipeline(steps=[
('scaler', StandardScaler()),
('model', LogisticRegression(max_iter=1000))
])
# Split data first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit pipeline on training data only
pipeline.fit(X_train, y_train)
# Evaluate on test data
score = pipeline.score(X_test, y_test)
# Cross-validation: pipeline is fit and evaluated per fold independently
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV Score: {cv_scores.mean():.4f}")
The pipeline prevents leakage for every step it contains: transformers (here, StandardScaler) are fit on training data, then applied to test data. Cross-validation refits the pipeline on each fold independently. Preprocessing done outside the pipeline gets no such protection.
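A quick sanity check is to inspect the fitted transformer and confirm that its learned statistics match the training rows rather than the full dataset. This sketch uses a synthetic dataset from make_classification as a stand-in for real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# The scaler inside the pipeline should have learned training-set means only
fitted_mean = pipeline.named_steps['scaler'].mean_
matches_train = np.allclose(fitted_mean, X_train.mean(axis=0))
matches_full = np.allclose(fitted_mean, X.mean(axis=0))
print("matches training mean:", matches_train)
print("matches full-data mean:", matches_full)
```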
Leakage-Free Encoding Pipeline
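from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Assumes X is a DataFrame containing the columns listed below and y holds class labels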
# Identify categorical and numeric columns
categorical_features = ['city', 'category']
numeric_features = ['age', 'income']
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), numeric_features),
('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])
# Full pipeline
full_pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('model', LogisticRegression(max_iter=1000))
])
# Split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit and evaluate (leakage-free)
full_pipeline.fit(X_train, y_train)
score = full_pipeline.score(X_test, y_test)
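ColumnTransformer applies a different transformer to each group of columns. Because it sits inside the pipeline, it is fit on training data only, and handle_unknown='ignore' encodes categories unseen in training as all zeros instead of raising an error.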
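The behavior for unseen categories can be checked directly. In this sketch the tiny train and test frames are made up for illustration; the test row contains a city that never appears in training.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: 'Oslo' appears only at test time
train = pd.DataFrame({'city': ['Paris', 'Rome', 'Paris'], 'age': [30, 40, 50]})
test = pd.DataFrame({'city': ['Oslo'], 'age': [40]})

pre = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])
pre.fit(train)  # categories and scaling statistics come from training rows only

row = pre.transform(test)
row = row.toarray() if hasattr(row, 'toarray') else row  # densify if sparse
print(row)  # first column is scaled age, the rest are one-hot city columns
```

The one-hot columns for the test row are all zero because the encoder only knows the categories it saw during fit.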
Leakage Checklist
- Split train/test BEFORE any preprocessing, imputation, or feature engineering.
- Fit all transformers (scalers, encoders, imputers) on training data only.
- For time-series data, split by time; never shuffle.
- Fit feature-engineering steps (target encoding, polynomial features) on training data; apply them to test data via the fitted transformers.
- Use pipelines to automate leakage prevention.
- Never use the target variable to compute features (except for target encoding, which is done properly on training data).
- Stratify train/test splits for classification to preserve class distribution.
- In cross-validation, the pipeline refits from scratch each fold; never pre-compute features on the full dataset.
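The stratification item in the checklist can be sketched as follows. The imbalanced labels are synthetic, and the feature column simply holds row ids so that overlap between the two sets is easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)  # row ids stand in for features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both splits keep the 10% positive rate
print("train positive rate:", y_train.mean())
print("test positive rate: ", y_test.mean())

# No row should appear on both sides of the split
overlap = set(X_train.ravel()) & set(X_test.ravel())
print("rows in both sets:", len(overlap))
```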
Common Leakage Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Fit scaler on full data, split later | High | Split first, fit on train only |
| Shuffle time-series before splitting | Critical | Split by time, no shuffling |
| Target encoding on full dataset | High | Split first, encode training only |
| Impute before splitting | Medium | Split first, impute training only |
| Use target-derived features | Critical | Don't; compute features from input only |
| Score models on rows that overlap with their training folds | Medium | Use K-fold cross-validation with disjoint test folds |
Key Takeaways
- Data leakage inflates validation metrics and leads to failures in production. Avoid it at all costs.
- Always split train/test BEFORE preprocessing. Fit transformers on training data; transform test with training parameters.
- For time-series, split by time (no shuffling). Test data must be temporally after training.
- Use scikit-learn's Pipeline and ColumnTransformer to automate leakage prevention.
- Never compute feature statistics (mean, target encoding, polynomial expansion) on the combined dataset.
- In cross-validation, the pipeline refits independently each fold; never pre-fit on the full data.
Frequently Asked Questions
Is it okay to use the test set to evaluate my preprocessing choices?
No. Evaluating preprocessing choices (scaler type, feature selection threshold) on the test set leaks test information into your decisions. Make those choices on validation data from cross-validation of the training set.
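One way to do this, sketched here under the assumption that the choices can be expressed as pipeline parameters, is to let GridSearchCV compare preprocessing options using cross-validation on the training set and to score the test set once at the end. The dataset and the candidate values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

# The scaler step itself is a tunable choice, alongside the model's C
param_grid = {
    'scaler': [StandardScaler(), MinMaxScaler()],
    'model__C': [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)  # the test set is not touched here

print(search.best_params_)
final_score = search.score(X_test, y_test)  # single, final look at the test set
print(f"test accuracy: {final_score:.3f}")
```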
Should I do feature selection before or after train/test split?
After. Split first, then select features on training data using cross-validation. Apply the same feature set to test data. Never fit the feature selector on the combined dataset.
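The effect is easy to reproduce on pure noise, where no feature has any real relationship with the labels. In the sketch below, selecting features before cross-validation typically yields an accuracy well above chance, while selecting inside a pipeline stays near 50%. The data sizes and k=20 are arbitrary choices for the demonstration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))  # pure noise features
y = np.array([0, 1] * 30)        # labels unrelated to X

# Leaky: the selector sees every label before cross-validation starts
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_selected, y, cv=5).mean()

# Clean: the selector is refit inside each training fold
pipe = Pipeline(steps=[
    ('select', SelectKBest(f_classif, k=20)),
    ('model', LogisticRegression(max_iter=1000)),
])
clean = cross_val_score(pipe, X, y, cv=5).mean()

print(f"selection before CV: {leaky:.2f}")
print(f"selection inside CV: {clean:.2f}")
```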
What if I have a small dataset and can't afford to lose test data to leakage prevention?
Use cross-validation instead of a single train/test split. In K-fold cross-validation every row is used for both training and testing, but never in the same fold: each fold's test set is held out while the model and all preprocessing are fit on the complementary training set.
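A minimal sketch of this on a small synthetic dataset follows. Repeating the K-fold procedure with different shuffles is an optional addition, assumed here because single K-fold estimates can be noisy when data is scarce.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset standing in for scarce real data
X, y = make_classification(n_samples=80, n_features=6, random_state=0)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

# 5 folds repeated 3 times; the scaler is refit on every training fold
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)

print(f"{scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```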
How do I prevent leakage in a custom preprocessing function?
Separate your preprocessing into two parts: fit() (learn parameters on training data) and transform() (apply to any data). Never fit on the combined dataset. Follow scikit-learn's transformer interface.
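As an illustration of that split, here is a small custom transformer that learns clipping bounds in fit() and applies them in transform(). The class name QuantileClipper and its parameters are hypothetical, invented for this example; the base classes are scikit-learn's standard BaseEstimator and TransformerMixin.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical example: clip each column to quantiles learned in fit()."""

    def __init__(self, lower=0.05, upper=0.95):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Learn the bounds from whatever data is passed in (training data only)
        X = np.asarray(X, dtype=float)
        self.low_ = np.quantile(X, self.lower, axis=0)
        self.high_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        # Apply the stored bounds; nothing is re-learned here
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.low_, self.high_)

X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = np.array([[-100.0], [3.0], [100.0]])

clipper = QuantileClipper(lower=0.0, upper=1.0).fit(X_train)
X_test_clipped = clipper.transform(X_test)
print(X_test_clipped.ravel())  # test values limited to the training range [1, 5]
```

Because the class follows the fit/transform interface, it can be placed as a step inside a Pipeline and will be refit on each cross-validation fold like any built-in transformer.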