Cross-Validation for Robust Model Evaluation
Cross-validation (CV) estimates model performance by training on multiple data subsets and averaging results. Instead of a single train-test split that depends on which samples land in which set, k-fold cross-validation splits data into k folds, trains k models (each on k-1 folds), and evaluates on the held-out fold. This reuses data efficiently, reduces variance in performance estimates, and prevents overfitting to a single validation set. For datasets under 10,000 samples, cross-validation is more reliable than a single split.
Why Cross-Validation Beats Single Splits
A single train-test split gives you one performance estimate, which may be optimistic or pessimistic depending on which samples land in the test set. On small datasets, changing only the random seed can move test accuracy by several percentage points, making it hard to judge whether a model change truly improved performance. Cross-validation averages across k different train-test configurations, giving a more stable estimate:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
import numpy as np
iris = load_iris()
X, y = iris.data, iris.target
# Single train-test split: high variance
single_scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=200, random_state=42)
    model.fit(X_train, y_train)
    single_scores.append(model.score(X_test, y_test))
# NumPy arrays do not accept the ":.3f" format spec, so round before printing
print(f"Single split scores (10 runs): {np.round(single_scores, 3)}")
print(f"Mean: {np.mean(single_scores):.3f}, Std: {np.std(single_scores):.3f}")
# 5-fold cross-validation: lower variance
model = LogisticRegression(max_iter=200, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"\n5-fold CV scores: {np.round(cv_scores, 3)}")
print(f"Mean: {cv_scores.mean():.3f}, Std: {cv_scores.std():.3f}")
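The individual fold scores still vary, but the number you report is their mean, which changes far less from run to run than the score of any single split. Note that the standard deviation across folds and the standard deviation across seeds measure different things, so compare them with care.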
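To compare like with like, one option is to repeat both procedures over several seeds and look at how much the reported number moves. The sketch below is an illustration under assumptions not in the original example: 20 repeats, shuffled StratifiedKFold, and max_iter=1000.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
n_repeats = 20  # assumed number of repeats, chosen only for illustration

single_split_scores = []
cv_mean_scores = []
for seed in range(n_repeats):
    # One 80/20 split per seed: the reported number is a single test score
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    single_split_scores.append(model.score(X_test, y_test))

    # One shuffled 5-fold CV per seed: the reported number is the mean over folds
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    fold_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
    cv_mean_scores.append(fold_scores.mean())

print(f"Single split: std across seeds = {np.std(single_split_scores):.3f}")
print(f"5-fold CV mean: std across seeds = {np.std(cv_mean_scores):.3f}")
```

Both standard deviations here describe the same thing: how much the reported score changes when only the seed changes.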
K-Fold Cross-Validation: The Standard
K-fold CV splits data into k folds of (nearly) equal size, trains k models, and averages performance:
- Split data into k folds (e.g., 5 folds of 100 samples each for 500 total)
- For each fold i: Train on folds 1..i-1, i+1..k (k-1 folds), test on fold i
- Average the k test scores
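The splitting step above can be sketched without any library. This is a minimal illustration, not scikit-learn's implementation; the helper name k_fold_indices is hypothetical, and the folds are contiguous and unshuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) lists for k contiguous, near-equal folds.

    Hypothetical helper for illustration only.
    """
    # The first (n_samples % k) folds receive one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        # Everything outside the current fold is training data
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


splits = list(k_fold_indices(n_samples=10, k=3))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
```

Each sample lands in exactly one test fold, so every observation is used for evaluation once and for training k-1 times.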
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier
# Standard 5-fold CV without shuffling (folds follow the data order)
cv_scores = cross_val_score(
    DecisionTreeClassifier(random_state=42),
    X, y,
    cv=5  # Integer (StratifiedKFold for classifiers, KFold otherwise) or a CV object
)
print(f"5-fold CV scores: {cv_scores}")
print(f"Mean accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
# Explicit KFold with shuffling (not stratified, so scores can differ from cv=5)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    DecisionTreeClassifier(random_state=42),
    X, y,
    cv=kfold
)
print(f"Explicit KFold CV scores: {cv_scores}")
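The cv parameter accepts an integer or a CV splitter object. An integer selects StratifiedKFold when the estimator is a classifier and the target is binary or multiclass, and KFold otherwise; neither shuffles by default.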
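How an integer cv gets resolved can be checked with check_cv from sklearn.model_selection. The sketch below is a small verification, not part of the original example; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold, check_cv, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=42)

# Resolve the integer 5 for a classifier and for a non-classifier
cv_for_classifier = check_cv(5, y, classifier=True)
cv_for_regressor = check_cv(5, y, classifier=False)
print(type(cv_for_classifier).__name__)
print(type(cv_for_regressor).__name__)

# For a classifier, cv=5 and an unshuffled StratifiedKFold give the same folds
scores_int = cross_val_score(clf, X, y, cv=5)
scores_explicit = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=5))
same_scores = np.allclose(scores_int, scores_explicit)
print(f"Same scores: {same_scores}")
```

Passing an explicit splitter is mainly useful when you want shuffling, a fixed random_state, or a non-default strategy.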
Stratified K-Fold for Classification
For imbalanced classification, use stratified k-fold to preserve class distribution in each fold:
from sklearn.model_selection import StratifiedKFold
# Stratified: each fold has the same class distribution as the full dataset
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    LogisticRegression(max_iter=200, random_state=42),
    X, y,
    cv=stratified_kfold
)
print(f"Stratified 5-fold scores: {cv_scores}")
# For imbalanced data (e.g., 95% class 0, 5% class 1)
# Stratified ensures each fold has ~95% class 0 and ~5% class 1
# Non-stratified folds might contain very few class 1 samples, or none at all, by chance
Always use StratifiedKFold for classification, especially with imbalanced targets.
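The effect is easiest to see by counting minority-class samples per test fold. The sketch below uses synthetic labels (95 negatives, 5 positives) that are an assumption for illustration; positives_per_fold is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic imbalanced labels: 95 samples of class 0, 5 of class 1
y_imb = np.array([0] * 95 + [1] * 5)
X_imb = np.zeros((len(y_imb), 1))  # features play no role in how folds are built


def positives_per_fold(splitter):
    """Count class 1 samples in each test fold (hypothetical helper)."""
    return [int(y_imb[test_idx].sum()) for _, test_idx in splitter.split(X_imb, y_imb)]


stratified_counts = positives_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)
plain_counts = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))

print(f"StratifiedKFold positives per fold: {stratified_counts}")
print(f"KFold positives per fold:           {plain_counts}")
```

With stratification every test fold gets exactly one positive; with plain KFold the five positives are distributed by chance, so some folds may get none.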
Time-Series Cross-Validation
For time-series data, use TimeSeriesSplit to respect temporal order (no look-ahead bias):
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
import numpy as np
# Simulate time-series data (100 timesteps)
n_samples = 100
X_ts = np.arange(n_samples).reshape(-1, 1)
rng = np.random.default_rng(42)  # Seeded so the example is reproducible
y_ts = 2 * X_ts.ravel() + rng.normal(0, 5, n_samples)
# TimeSeriesSplit trains on past, tests on future
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X_ts):
    # Every training index precedes every test index (chronological order)
    X_train, X_test = X_ts[train_idx], X_ts[test_idx]
    y_train, y_test = y_ts[train_idx], y_ts[test_idx]
    print(f"Train: {train_idx[0]}-{train_idx[-1]}, Test: {test_idx[0]}-{test_idx[-1]}")
# Use TimeSeriesSplit with cross_val_score (the default score for LinearRegression is R^2)
cv_scores = cross_val_score(LinearRegression(), X_ts, y_ts, cv=tscv)
print(f"Time-series CV scores: {cv_scores}")
TimeSeriesSplit ensures training always precedes testing, avoiding the look-ahead bias that would occur with random shuffling.
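A tiny example makes the expanding-window behaviour concrete. The 12-sample array and n_splits=3 below are assumptions chosen so the indices are easy to read.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve ordered timesteps, split into three successive train/test windows
X_small = np.arange(12).reshape(-1, 1)
ts_splits = list(TimeSeriesSplit(n_splits=3).split(X_small))

for fold, (train_idx, test_idx) in enumerate(ts_splits):
    print(f"Fold {fold}: train={train_idx.tolist()}, test={test_idx.tolist()}")
# Fold 0: train=[0, 1, 2], test=[3, 4, 5]
# Fold 1: train=[0, 1, 2, 3, 4, 5], test=[6, 7, 8]
# Fold 2: train=[0, 1, 2, 3, 4, 5, 6, 7, 8], test=[9, 10, 11]
```

The training window grows with each fold while the test window always lies strictly after it, which is why the earliest folds train on fewer samples.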
Cross-Validation for Hyperparameter Tuning
GridSearchCV and RandomizedSearchCV internally use cross-validation to tune hyperparameters without overfitting to a single validation set:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# Define hyperparameter grid
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}
# Hold out a test set first so it plays no part in tuning
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# GridSearchCV: evaluates each configuration with CV on the training data, reports mean CV score
grid_search = GridSearchCV(
    SVC(),
    param_grid,
    cv=5,  # 5-fold CV for each configuration
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
# Evaluate the refit best model on the held-out test set
test_score = grid_search.score(X_test, y_test)
print(f"Test score with best hyperparameters: {test_score:.3f}")
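For each parameter combination, GridSearchCV trains on k-1 folds of the training data, validates on the held-out fold, and averages the k scores. By default (refit=True) it then refits the best configuration on all of the training data. Because no single validation split drives the choice, the risk of overfitting to validation data is reduced, although the best CV score is still an optimistic estimate; the held-out test set provides the final check.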
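The per-configuration scores behind best_score_ are available in the cv_results_ attribute. The sketch below uses a smaller grid than the example above (an assumption, to keep the output short) and prints the mean and spread for each candidate.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Reduced grid for illustration: 3 values of C x 2 kernels = 6 candidates
small_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(), small_grid, cv=5, scoring='accuracy')
search.fit(X, y)

results = search.cv_results_
for params, mean, std in zip(
    results['params'], results['mean_test_score'], results['std_test_score']
):
    print(f"{params}: {mean:.3f} +/- {std:.3f}")

print(f"Selected: {search.best_params_} (mean CV score {search.best_score_:.3f})")
```

Looking at the whole table rather than only the winner shows whether the best configuration is clearly ahead or within the fold-to-fold noise of its neighbours.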
Cross-Validation with Multiple Metrics
Use cross_validate() to compute multiple metrics in a single cross-validation run:
from sklearn.model_selection import cross_validate
# Compute multiple metrics without re-running CV
# (scorers are referenced by name, so no sklearn.metrics import is needed)
scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision_macro',
    'recall': 'recall_macro',
    'f1': 'f1_macro'
}
results = cross_validate(
    LogisticRegression(max_iter=200, random_state=42),
    X, y,
    cv=5,
    scoring=scoring,
    return_train_score=True
)
for metric, scores in results.items():
    if metric.startswith('test_'):
        print(f"{metric}: {scores.mean():.3f} +/- {scores.std():.3f}")
cross_validate() returns all fold scores, letting you inspect variance and compare multiple metrics.
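The returned dictionary also holds timing information and, with return_train_score=True, the training scores. The sketch below lists the keys and uses the train/test gap as a rough overfitting signal; the two-metric scoring dict is an assumption to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)

cv_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X, y,
    cv=5,
    scoring={'accuracy': 'accuracy', 'f1': 'f1_macro'},
    return_train_score=True,
)

# Keys follow the pattern test_<name> / train_<name>, plus fit_time and score_time
result_keys = sorted(cv_results.keys())
print(result_keys)

# A large positive gap between train and test scores suggests overfitting
accuracy_gap = cv_results['train_accuracy'].mean() - cv_results['test_accuracy'].mean()
print(f"Train minus test accuracy: {accuracy_gap:.3f}")
```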
Choosing the Right CV Strategy
| Strategy | When to Use | Pros | Cons |
|---|---|---|---|
| 5-Fold KFold | Default for most datasets | Good variance reduction, fast | Less stable on very small datasets |
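| 10-Fold KFold | When a less biased estimate matters more than speed | Each model trains on 90% of the data | About twice the training cost of 5-fold |
| Leave-One-Out | Very small datasets (n < 50) | Trains on all but one sample | Very slow (n fits), high-variance estimate |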
| StratifiedKFold | Classification with any class distribution | Preserves class balance | Slightly slower |
| TimeSeriesSplit | Time-series or sequential data | Prevents look-ahead bias | Cannot shuffle, fewer train samples |
| ShuffleSplit | Need exact control over split size | Flexible proportions | Slower if n_splits is high; test sets can overlap |
For most cases, 5-fold stratified CV is the safe default.
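The cost column in the table follows from how many models each strategy trains. The sketch below is an illustration on the iris data (150 samples); the specific splitter settings are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit, StratifiedKFold

X, y = load_iris(return_X_y=True)

splitters = {
    '5-fold KFold': KFold(n_splits=5, shuffle=True, random_state=0),
    '10-fold KFold': KFold(n_splits=10, shuffle=True, random_state=0),
    'StratifiedKFold': StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    'Leave-One-Out': LeaveOneOut(),
    'ShuffleSplit': ShuffleSplit(n_splits=10, test_size=0.25, random_state=0),
}

n_fits = {}
for name, splitter in splitters.items():
    # get_n_splits reports how many train/test pairs (and so model fits) there are
    n_fits[name] = splitter.get_n_splits(X, y)
    train_idx, test_idx = next(iter(splitter.split(X, y)))
    print(f"{name}: {n_fits[name]} fits, first split {len(train_idx)} train / {len(test_idx)} test")
```

Leave-one-out needs one fit per sample, which is why it is only practical for very small datasets or very cheap models.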
Key Takeaways
- Cross-validation reduces variance in performance estimates by averaging across multiple train-test splits.
- 5-fold cross-validation is the usual default; use 10-fold if you can afford the extra fits and want each model trained on more data, or 3-fold when each fit is expensive.
- Always use StratifiedKFold for classification to preserve class distribution in each fold.
- Use TimeSeriesSplit for sequential data to respect temporal order (prevent look-ahead bias).
- GridSearchCV internally uses CV to tune hyperparameters, preventing overfitting to validation data.
Frequently Asked Questions
How many folds should I use?
For datasets with 100-10,000 samples, 5-fold is standard. Use 10-fold when you can afford roughly twice the training runs and want each model trained on 90% of the data. For very small datasets (under 50 samples), consider leave-one-out cross-validation (LOO), but it is computationally expensive: LeaveOneOut() trains n models for n samples.
Does cv=5 mean 5 training runs?
Yes. With 5-fold CV, the model is trained 5 times, each on a different 80% of the data. Because each fit sees the same amount of data as a typical 80/20 split, total training time is roughly 5x that of a single train-test split.
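You can confirm the number of fits and the training-set size directly. The sketch below assumes the iris data (150 samples) and asks cross_validate to keep the fitted models via return_estimator=True.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5)

# return_estimator=True keeps the model fitted on each training split
cv_out = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=cv, return_estimator=True
)
n_models = len(cv_out['estimator'])

# Each training split holds 4 of the 5 folds, i.e. 80% of the samples
train_sizes = [len(train_idx) for train_idx, _ in cv.split(X, y)]
print(f"Models trained: {n_models}")
print(f"Training samples per fit: {train_sizes}")
```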
Can I use CV with hyperparameter tuning?
Yes, and you should. GridSearchCV and RandomizedSearchCV use CV internally. For each parameter combination, they train on k-1 folds and test on the held-out fold, returning the mean CV score for comparison.
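RandomizedSearchCV follows the same pattern but samples a fixed number of configurations instead of trying every combination. The sketch below is a minimal example; the search ranges, n_iter=8, and the use of scipy.stats.loguniform for C are assumptions for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Lists are sampled uniformly; distributions are sampled via their rvs() method
param_distributions = {
    'C': loguniform(1e-2, 1e2),
    'kernel': ['linear', 'rbf'],
}

random_search = RandomizedSearchCV(
    SVC(),
    param_distributions,
    n_iter=8,         # number of sampled configurations, each scored with 5-fold CV
    cv=5,
    random_state=0,   # fixes which configurations are sampled
)
random_search.fit(X, y)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best mean CV score: {random_search.best_score_:.3f}")
```

The total cost is n_iter times k fits plus one refit, so it stays fixed however many hyperparameters you add.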
What if I have a time-series dataset?
Use TimeSeriesSplit, which trains on earlier timesteps and tests on later ones. Never shuffle time-series data; order matters. TimeSeriesSplit handles this correctly.
How do I interpret CV results with high standard deviation?
High std means performance varies across folds, suggesting the model is sensitive to the training set. Possible causes: small dataset (use fewer folds), imbalanced data (use StratifiedKFold), or unstable model (tune hyperparameters or use ensemble methods).
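One way to separate noise from genuine instability is to repeat the cross-validation with different shuffles. The sketch below uses RepeatedStratifiedKFold; the choice of 10 repeats and the iris data are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5 folds x 10 repeats = 50 scores, each repeat using a different shuffle
repeated_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
all_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=repeated_cv)

# Scores arrive repeat by repeat, so each row below is one full 5-fold run
repeat_means = all_scores.reshape(10, 5).mean(axis=1)

print(f"Std across all 50 fold scores: {all_scores.std():.3f}")
print(f"Std across the 10 repeat means: {repeat_means.std():.3f}")
```

If the repeat means are stable while individual folds vary a lot, the spread mostly reflects small test folds rather than an unreliable model.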