
Cross-Validation for Robust Model Evaluation

Cross-validation (CV) estimates model performance by training and evaluating on multiple splits of the data and averaging the results. Instead of relying on a single train-test split, whose score depends on which samples land in which set, k-fold cross-validation splits the data into k folds, trains k models (each on k-1 folds), and evaluates each on its held-out fold. This reuses data efficiently, reduces variance in the performance estimate, and lowers the risk of overfitting decisions to a single validation set. As a rule of thumb, for datasets under 10,000 samples, cross-validation is more reliable than a single split.

Why Cross-Validation Beats Single Splits

A single train-test split gives you one performance estimate, which may be optimistic or pessimistic depending on which samples end up in the test set. On small datasets, different random seeds can shift test accuracy by as much as 5-10%, making it hard to judge whether a model change truly improved performance. Cross-validation averages across k different train-test configurations, giving a more stable estimate:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

# Single train-test split: high variance
single_scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=200, random_state=42)
    model.fit(X_train, y_train)
    single_scores.append(model.score(X_test, y_test))

# np.round is used because a format spec such as :.3f cannot be applied to an array
print(f"Single split scores (10 runs): {np.round(single_scores, 3)}")
print(f"Mean: {np.mean(single_scores):.3f}, Std: {np.std(single_scores):.3f}")

# 5-fold cross-validation: lower variance
model = LogisticRegression(max_iter=200, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"\n5-fold CV scores: {np.round(cv_scores, 3)}")
print(f"Mean: {cv_scores.mean():.3f}, Std: {cv_scores.std():.3f}")

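The single-split scores vary from seed to seed, so any one of them can mislead. The cross-validated mean averages over five held-out folds that together cover every sample, which makes it a more reliable estimate than any individual split.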

K-Fold Cross-Validation: The Standard

K-fold CV splits data into k roughly equal-sized folds, trains k models, and averages their performance:

  1. Split data into k folds (e.g., 5 folds of 100 samples each for 500 total)
  2. For each fold i: Train on folds 1..i-1, i+1..k (k-1 folds), test on fold i
  3. Average the k test scores

from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier

# 5-fold CV with an integer cv. For a classifier, scikit-learn uses
# StratifiedKFold without shuffling here, not plain KFold.
cv_scores = cross_val_score(
    DecisionTreeClassifier(random_state=42),
    X, y,
    cv=5  # Can be an integer or a CV splitter object
)
print(f"5-fold CV scores: {cv_scores}")
print(f"Mean accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Plain KFold, passed explicitly. This is not equivalent to cv=5 above:
# the folds are not stratified. shuffle=True matters because iris is sorted by class.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    DecisionTreeClassifier(random_state=42),
    X, y,
    cv=kfold
)
print(f"Explicit KFold CV scores: {cv_scores}")

The cv parameter accepts an integer or a CV splitter object for customization. With an integer, scikit-learn uses StratifiedKFold for classifiers and KFold for other estimators.
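
The three steps listed at the start of this section can also be written as an explicit loop, which may help show what cross_val_score does for you. This is a minimal sketch on the iris data, not a replacement for the library helper; the variable name fold_scores is chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 1: split the sample indices into k folds (shuffled, since iris is sorted by class)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_idx, test_idx in kfold.split(X):
    # Step 2: train a fresh model on the k-1 remaining folds, score it on the held-out fold
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# Step 3: average the k held-out scores
fold_scores = np.array(fold_scores)
print(f"Fold scores: {np.round(fold_scores, 3)}")
print(f"Mean accuracy: {fold_scores.mean():.3f}")
```

A fresh model is created inside the loop so that no fitted state carries over from one fold to the next.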

Stratified K-Fold for Classification

For imbalanced classification, use stratified k-fold to preserve class distribution in each fold:

from sklearn.model_selection import StratifiedKFold

# Stratified: each fold has the same class distribution as the full dataset
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    LogisticRegression(max_iter=200, random_state=42),
    X, y,
    cv=stratified_kfold
)
print(f"Stratified 5-fold scores: {cv_scores}")

# For imbalanced data (e.g., 95% class 0, 5% class 1):
# stratified folds each keep ~95% class 0 and ~5% class 1, while
# non-stratified folds might contain very few or even no class 1 samples by chance

Always use StratifiedKFold for classification, especially with imbalanced targets.
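
To make the difference concrete, the sketch below counts minority-class samples per test fold for both splitters. The 95/5 label array is synthetic and purely illustrative, and minority_counts is a helper defined here, not a scikit-learn function.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 95 samples of class 0, 5 of class 1
y_imb = np.array([0] * 95 + [1] * 5)
X_imb = np.zeros((len(y_imb), 1))  # feature values do not affect how folds are formed

def minority_counts(splitter):
    """Return the number of class-1 samples in each test fold."""
    return [
        int(y_imb[test_idx].sum())
        for _, test_idx in splitter.split(X_imb, y_imb)
    ]

plain = minority_counts(KFold(n_splits=5, shuffle=True, random_state=0))
strat = minority_counts(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print(f"KFold minority samples per test fold:           {plain}")
print(f"StratifiedKFold minority samples per test fold: {strat}")
```

With five minority samples and five folds, StratifiedKFold places exactly one in each test fold, while plain KFold distributes them however the shuffle happens to fall.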

Time-Series Cross-Validation

For time-series data, use TimeSeriesSplit to respect temporal order (no look-ahead bias):

from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
import numpy as np

# Simulate time-series data (100 timesteps)
n_samples = 100
X_ts = np.arange(n_samples).reshape(-1, 1)
y_ts = 2 * X_ts.ravel() + np.random.randn(n_samples) * 5

# TimeSeriesSplit trains on past, tests on future
tscv = TimeSeriesSplit(n_splits=5)

for train_idx, test_idx in tscv.split(X_ts):
    # train_idx is always earlier than test_idx (chronological order)
    X_train, X_test = X_ts[train_idx], X_ts[test_idx]
    y_train, y_test = y_ts[train_idx], y_ts[test_idx]
    print(f"Train: {train_idx[0]}-{train_idx[-1]}, Test: {test_idx[0]}-{test_idx[-1]}")

# Use TimeSeriesSplit with cross_val_score
cv_scores = cross_val_score(LinearRegression(), X_ts, y_ts, cv=tscv)
print(f"Time-series CV scores: {cv_scores}")

TimeSeriesSplit ensures training always precedes testing, avoiding the look-ahead bias that would occur with random shuffling.
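
One way to check this ordering guarantee is to count the folds in which a training index is later than the earliest test index. The sketch below does that for TimeSeriesSplit and for a shuffled KFold; leaks_future is a helper written for this illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

timesteps = np.arange(100).reshape(-1, 1)  # 100 ordered timesteps

def leaks_future(splitter):
    """Count folds where some training index is later than the earliest test index."""
    return sum(
        int(train_idx.max() > test_idx.min())
        for train_idx, test_idx in splitter.split(timesteps)
    )

ts_leaks = leaks_future(TimeSeriesSplit(n_splits=5))
shuffled_leaks = leaks_future(KFold(n_splits=5, shuffle=True, random_state=0))

print(f"TimeSeriesSplit folds with future data in training: {ts_leaks}")
print(f"Shuffled KFold folds with future data in training:  {shuffled_leaks}")
```

TimeSeriesSplit reports zero such folds, whereas shuffled KFold trains on later timesteps in nearly every fold.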

Cross-Validation for Hyperparameter Tuning

GridSearchCV and RandomizedSearchCV internally use cross-validation to tune hyperparameters without overfitting to a single validation set:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define hyperparameter grid
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

# Hold out a test set first so that tuning never sees it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# GridSearchCV: evaluates each configuration with CV on the training data only
grid_search = GridSearchCV(
    SVC(),
    param_grid,
    cv=5,  # 5-fold CV for each configuration
    scoring='accuracy'
)

grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Evaluate the refit best model on the held-out test data
test_score = grid_search.score(X_test, y_test)
print(f"Test score with best hyperparameters: {test_score:.3f}")

GridSearchCV trains on k-1 folds of the training data, validates on the held-out fold, and repeats this for each parameter combination. By default it then refits the best configuration on all of the training data. This reduces the risk of overfitting to a single validation set while using all training data efficiently.
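
RandomizedSearchCV follows the same pattern but samples a fixed number of configurations instead of trying every combination. The sketch below is one possible setup; the log-uniform range for C and n_iter=10 are assumptions chosen for illustration, not recommended values.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Distributions to sample from (illustrative ranges, not tuned recommendations)
param_distributions = {
    'C': loguniform(1e-2, 1e2),   # continuous, sampled on a log scale
    'kernel': ['linear', 'rbf'],  # lists are sampled uniformly
}

random_search = RandomizedSearchCV(
    SVC(),
    param_distributions,
    n_iter=10,          # number of sampled configurations
    cv=5,               # 5-fold CV for each sampled configuration
    scoring='accuracy',
    random_state=42,    # makes the sampled configurations reproducible
)
random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.3f}")
print(f"Held-out test score: {random_search.score(X_test, y_test):.3f}")
```

The cost is set by n_iter rather than by the size of the grid, which helps when the search space is large or includes continuous parameters.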

Cross-Validation with Multiple Metrics

Use cross_validate() to compute multiple metrics in a single cross-validation run:

from sklearn.model_selection import cross_validate

# Compute multiple metrics without re-running CV
scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision_macro',
    'recall': 'recall_macro',
    'f1': 'f1_macro'
}

results = cross_validate(
    LogisticRegression(max_iter=200, random_state=42),
    X, y,
    cv=5,
    scoring=scoring,
    return_train_score=True
)

# Result keys are prefixed: test_<metric> and train_<metric>, plus fit_time and score_time
for metric, scores in results.items():
    if metric.startswith('test_'):
        print(f"{metric}: {scores.mean():.3f} +/- {scores.std():.3f}")

cross_validate() returns all fold scores, letting you inspect variance and compare multiple metrics.
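
Because return_train_score=True also stores the training scores, the same results dictionary can be used to compare training and held-out performance. The following is a small sketch of that comparison; the gaps name and the two-metric scoring dictionary are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)

results = cross_validate(
    LogisticRegression(max_iter=200, random_state=42),
    X, y,
    cv=5,
    scoring={'accuracy': 'accuracy', 'f1': 'f1_macro'},
    return_train_score=True,
)

# For each metric, compare the mean training score with the mean held-out score
gaps = {}
for name in ('accuracy', 'f1'):
    train_mean = results[f'train_{name}'].mean()
    test_mean = results[f'test_{name}'].mean()
    gaps[name] = train_mean - test_mean
    print(f"{name}: train {train_mean:.3f}, test {test_mean:.3f}, gap {gaps[name]:.3f}")
```

A large positive gap is a common sign of overfitting, while a gap near zero suggests the model generalizes about as well as it fits.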

Choosing the Right CV Strategy

Strategy | When to Use | Pros | Cons
5-Fold KFold | Default for most datasets | Good variance reduction, fast | Less stable on very small datasets
10-Fold KFold | When accuracy is critical | Lower variance | Slower training
Leave-One-Out | Very small datasets (n < 50) | Uses all data | Very slow, high variance
StratifiedKFold | Classification with any class distribution | Preserves class balance | Slightly slower
TimeSeriesSplit | Time-series or sequential data | Prevents look-ahead bias | Cannot shuffle, fewer train samples
ShuffleSplit | Need exact control over split size | Flexible proportions | Slower if high n_splits

For most cases, 5-fold stratified CV is the safe default.
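
As a rough companion to the table, the sketch below constructs each splitter and reports how many train-test splits it produces on the iris data. The specific settings (for example test_size=0.25 and 10 splits for ShuffleSplit) are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    ShuffleSplit,
    StratifiedKFold,
    TimeSeriesSplit,
)

X, y = load_iris(return_X_y=True)  # 150 samples

splitters = {
    '5-Fold KFold': KFold(n_splits=5, shuffle=True, random_state=42),
    '10-Fold KFold': KFold(n_splits=10, shuffle=True, random_state=42),
    'Leave-One-Out': LeaveOneOut(),
    'StratifiedKFold': StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    'TimeSeriesSplit': TimeSeriesSplit(n_splits=5),
    'ShuffleSplit': ShuffleSplit(n_splits=10, test_size=0.25, random_state=42),
}

# Each split means one model fit, so this is a proxy for the training cost
split_counts = {
    name: splitter.get_n_splits(X, y) for name, splitter in splitters.items()
}
for name, count in split_counts.items():
    print(f"{name}: {count} splits")
```

Any of these objects can be passed as the cv argument of cross_val_score, cross_validate, or GridSearchCV.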

Key Takeaways

  • Cross-validation reduces variance in performance estimates by averaging across multiple train-test splits.
  • 5-fold cross-validation is the default; use 10-fold if you need tighter estimates, or leave-one-out on very small datasets.
  • Always use StratifiedKFold for classification to preserve class distribution in each fold.
  • Use TimeSeriesSplit for sequential data to respect temporal order (prevent look-ahead bias).
  • GridSearchCV internally uses CV to tune hyperparameters, preventing overfitting to validation data.

Frequently Asked Questions

How many folds should I use?

For datasets with 100-10,000 samples, 5-fold is standard. Use 10-fold when you need a tighter estimate and can afford roughly twice the training time. For very small datasets (under 50 samples), consider leave-one-out cross-validation (LOO), keeping in mind that it is computationally expensive: LeaveOneOut() trains n models for n samples.

Does cv=5 mean 5 training runs?

Yes. With 5-fold CV, the model is trained 5 times, each on a different 80% of the data. Because each run trains on about as much data as a single 80/20 split, total training time is roughly 5x that of a single train-test split.

Can I use CV with hyperparameter tuning?

Yes, and you should. GridSearchCV and RandomizedSearchCV use CV internally. For each parameter combination, they train on k-1 folds and test on the held-out fold, returning the mean CV score for comparison.
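
The cost of a grid search follows directly from that description: the number of parameter combinations times the number of folds, plus one final refit when refit=True (the GridSearchCV default). A small sketch of the arithmetic, reusing the SVC grid from the tuning section:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto'],
}
n_folds = 5

# ParameterGrid enumerates every combination: 3 * 2 * 2 = 12
n_candidates = len(ParameterGrid(param_grid))

cv_fits = n_candidates * n_folds  # one fit per combination per fold
total_fits = cv_fits + 1          # plus one refit of the best configuration

print(f"Candidates: {n_candidates}, CV fits: {cv_fits}, total fits: {total_fits}")
```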

What if I have a time-series dataset?

Use TimeSeriesSplit, which trains on earlier timesteps and tests on later ones. Never shuffle time-series data; order matters. TimeSeriesSplit handles this correctly.

How do I interpret CV results with high standard deviation?

High std means performance varies across folds, suggesting the model is sensitive to which samples it trains and tests on. Possible causes: a small dataset (use fewer folds so each test fold is larger), imbalanced data (use StratifiedKFold), or an unstable model (tune hyperparameters or use ensemble methods).
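
One further option, not covered in the sections above, is to repeat the k-fold procedure with different shuffles and average over all repetitions. The sketch below uses scikit-learn's RepeatedStratifiedKFold; the choice of 3 repeats is an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold stratified CV repeated 3 times with different shuffles: 15 scores in total
repeated_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
repeated_scores = cross_val_score(
    LogisticRegression(max_iter=200, random_state=42),
    X, y,
    cv=repeated_cv,
)

print(f"Number of scores: {len(repeated_scores)}")
print(f"Mean: {repeated_scores.mean():.3f}, Std: {repeated_scores.std():.3f}")
```

Averaging over more splits usually steadies the mean, at the cost of proportionally more training runs.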

Further Reading