Feature Scaling: Normalize and Standardize Data
Feature scaling transforms numeric features to a common range or distribution. Distance-based models such as k-nearest neighbors and support vector machines, and regularized linear models such as Ridge, Lasso, and scikit-learn's default LogisticRegression, are scale-sensitive: a feature with a larger numeric range can dominate the distance or the penalty even when it is no more informative. StandardScaler and MinMaxScaler address this by putting all features on a comparable scale. Scaling also speeds up convergence in gradient-based optimization. Skipping it is a common ML engineering mistake, and for scale-sensitive models this one preprocessing step can improve accuracy noticeably, though the size of the gain depends on the dataset and model.
Why Feature Scaling Matters
Distance-based and gradient-based models have no notion of units: they work with raw magnitudes. If one feature ranges from 1 to 1000 and another from 0 to 1, differences in the first can be up to 1000 times larger and dominate any distance computation. Consider a house price model with features like square footage (500-5000) and number of bathrooms (1-5): a k-nearest neighbors model trained on the unscaled data will be driven almost entirely by square footage, purely because its numeric scale is larger.
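Scaling lets every feature contribute on comparable terms. It also speeds up training: gradient-based optimizers converge faster on scaled data because the loss surface is better conditioned (its contours are closer to circular than to long, narrow valleys). Regularized models like Ridge and Lasso penalize coefficient size, and coefficient size depends on feature scale: a feature with small numeric values needs a large coefficient and is shrunk heavily, while a feature with large numeric values needs only a small coefficient and largely escapes the penalty.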
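To make the magnitude problem concrete, here is a minimal sketch using three hypothetical houses (the values are invented for illustration). It compares Euclidean distances before and after a plain z-score computed with NumPy.

```python
import numpy as np

# Hypothetical houses: [square footage, number of bathrooms]
houses = np.array([
    [1500.0, 1.0],   # house A
    [1500.0, 5.0],   # house B: same size as A, four more bathrooms
    [1520.0, 1.0],   # house C: 20 sq ft larger than A, same bathrooms
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances: a trivial 20 sq ft gap outweighs a four-bathroom gap
raw_ab = euclidean(houses[0], houses[1])   # 4.0
raw_ac = euclidean(houses[0], houses[2])   # 20.0

# Standardize each column (z-score) and recompute the same distances
scaled = (houses - houses.mean(axis=0)) / houses.std(axis=0)
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print(f"Raw:    A-B = {raw_ab:.2f}, A-C = {raw_ac:.2f}")
print(f"Scaled: A-B = {scaled_ab:.2f}, A-C = {scaled_ac:.2f}")
```

On the raw data, house C looks five times farther from A than house B does. After standardization each gap is measured relative to its own feature's spread, so in this toy example the two distances come out equal.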
StandardScaler: Standardization (Z-Score Normalization)
StandardScaler transforms features to have mean 0 and standard deviation 1. This is the most common scaling method and works well for normally distributed data and linear models.
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import numpy as np
# Load data
iris = load_iris()
X = iris.data
print("Original feature statistics:")
print(f" Feature 0 - Mean: {X[:, 0].mean():.2f}, Std: {X[:, 0].std():.2f}")
print(f" Feature 1 - Mean: {X[:, 1].mean():.2f}, Std: {X[:, 1].std():.2f}")
# Fit the scaler (on the full dataset here for illustration; in a real workflow, fit on training data only)
scaler = StandardScaler()
scaler.fit(X)
# Transform features
X_scaled = scaler.transform(X)
print("\nAfter StandardScaler:")
print(f" Feature 0 - Mean: {X_scaled[:, 0].mean():.2e}, Std: {X_scaled[:, 0].std():.2f}")
print(f" Feature 1 - Mean: {X_scaled[:, 1].mean():.2e}, Std: {X_scaled[:, 1].std():.2f}")
# StandardScaler stores learned statistics
print(f"\nLearned mean: {scaler.mean_}")
print(f"Learned std: {scaler.scale_}")
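StandardScaler uses the formula z = (x - mean) / std, where std is the population standard deviation (ddof=0). The scaler learns the mean and standard deviation during fit(), then transform() applies those same statistics to any data you pass it. Fitting on the training set and transforming both train and test with that fitted scaler means test data is scaled using training statistics, which prevents data leakage.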
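The formula can be checked by hand. The sketch below, which assumes an 80/20 split of the iris features, fits a StandardScaler on the training rows and compares its output on the test rows against a manual z-score that uses training statistics only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Learn mean and std from the training rows only
scaler = StandardScaler().fit(X_train)

# Manual z-score with the same training statistics (population std, ddof=0)
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)
X_test_manual = (X_test - train_mean) / train_std

X_test_scaled = scaler.transform(X_test)
matches = np.allclose(X_test_scaled, X_test_manual)
print(f"Manual z-score matches StandardScaler: {matches}")
print(f"Test-set mean after scaling: {X_test_scaled.mean(axis=0)}")
```

The scaled test set will generally not have a mean of exactly 0 or a standard deviation of exactly 1. That is expected, because the statistics come from the training rows.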
MinMaxScaler: Normalization (Rescaling to [0, 1])
MinMaxScaler rescales features to a fixed range, typically [0, 1]. Use it when a feature has a known, bounded range or when a downstream step expects bounded inputs. Tree-based models are scale-invariant and gain nothing from it themselves, but other steps in the same workflow may.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X)
X_normalized = scaler.transform(X)
print("MinMaxScaler output (all features in [0, 1]):")
print(f" Min per feature: {X_normalized.min(axis=0)}")
print(f" Max per feature: {X_normalized.max(axis=0)}")
# For custom range like [-1, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
scaler.fit(X)
X_custom = scaler.transform(X)
print(f"\nCustom range [-1, 1] - Min: {X_custom.min():.2f}, Max: {X_custom.max():.2f}")
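MinMaxScaler computes x_std = (x - min) / (max - min), then x_scaled = x_std * (range_max - range_min) + range_min, where (range_min, range_max) is the feature_range. It is sensitive to outliers: a single extreme value sets the min or max and compresses all other values into a narrow band.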
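As a rough illustration of both the formula and the outlier problem, the following sketch uses a hypothetical single feature whose last value is extreme. It reproduces MinMaxScaler by hand for a [-1, 1] target range, then measures how much of that range the ordinary values occupy.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature; the last value is an outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
lo, hi = -1.0, 1.0

scaler = MinMaxScaler(feature_range=(lo, hi)).fit(x)
x_sklearn = scaler.transform(x)

# Manual version: rescale to [0, 1], then stretch and shift to [lo, hi]
x_unit = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
x_manual = x_unit * (hi - lo) + lo

matches = np.allclose(x_sklearn, x_manual)

# Portion of the 2.0-wide target range covered by the four ordinary values
inlier_spread = float(x_sklearn[3, 0] - x_sklearn[0, 0])

print(f"Manual formula matches MinMaxScaler: {matches}")
print(f"Spread of the four non-outlier values: {inlier_spread:.3f} of 2.0")
```

The four ordinary values end up squeezed into a small fraction of the target range because the outlier alone defines the maximum.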
When to Scale: Model-Dependent Guidance
| Model Type | Requires Scaling? | Why |
|---|---|---|
| Linear Regression (unregularized) | Recommended | Predictions do not depend on feature scale, but scaling helps gradient-based solvers and makes coefficients comparable |
| Logistic Regression | Yes | scikit-learn applies L2 regularization by default, and its iterative solvers converge faster on scaled data |
| SVM, KNN | Yes | Distance-sensitive; scale needed for fair distance metrics |
| Neural Networks | Yes | Gradient descent converges faster on scaled data |
| Decision Trees, Random Forests | No | Tree splits are scale-invariant |
| Naive Bayes | No | Models each feature's class-conditional distribution independently |
| Ridge, Lasso, Elastic Net | Yes | The penalty acts on coefficient size, which depends on feature scale |
As a safe default: always scale before fitting any linear or distance-based model. Tree-based models don't require scaling, but it does not hurt (except minor computational overhead).
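One way to see the effect on a distance-based model is to cross-validate the same classifier with and without a scaler. The sketch below assumes scikit-learn's bundled wine dataset and a 5-neighbor KNN, both chosen only for illustration. Exact scores depend on the dataset and library version.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Wine features span very different numeric ranges
X, y = load_wine(return_X_y=True)

# Same classifier, with and without a scaler in front of it
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 5-fold cross-validation; the pipeline re-fits the scaler inside each fold
acc_raw = cross_val_score(knn_raw, X, y, cv=5).mean()
acc_scaled = cross_val_score(knn_scaled, X, y, cv=5).mean()

print(f"KNN without scaling: {acc_raw:.3f}")
print(f"KNN with scaling:    {acc_scaled:.3f}")
```

On this dataset the scaled pipeline scores higher, because the largest-valued features no longer dominate the distance computation.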
Scaling in Pipelines: The Right Way
Use scikit-learn pipelines to attach scalers to models, ensuring the same scaler is applied to train and test data automatically:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Load features and labels
X, y = load_iris(return_X_y=True)
# Split data first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Pipeline: scaler → model
# Scaler is fit only on X_train inside the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])
# fit() scales training data and trains the classifier
pipeline.fit(X_train, y_train)
# score() scales test data using training statistics, predicts, and reports accuracy
test_accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
# Access the fitted scaler
fitted_scaler = pipeline.named_steps['scaler']
print(f"Learned mean from training data: {fitted_scaler.mean_}")
Pipelines prevent the most common data leakage mistake: fitting a scaler on combined train+test data.
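A quick way to confirm that a pipeline's scaler saw only training rows is to compare its learned mean against the training-set mean. This is a small sketch assuming the iris data and an 80/20 split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# The scaler inside the pipeline should hold training statistics only
fitted_mean = pipeline.named_steps["scaler"].mean_
uses_train_stats = np.allclose(fitted_mean, X_train.mean(axis=0))

print(f"Scaler mean equals training mean: {uses_train_stats}")
print(f"Training mean:     {X_train.mean(axis=0)}")
print(f"Full-dataset mean: {X.mean(axis=0)}")
```

The same property holds inside cross-validation: passing the whole pipeline to a function such as cross_val_score re-fits the scaler on each fold's training portion.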
Handling Outliers: RobustScaler
StandardScaler and MinMaxScaler are sensitive to outliers. If your data has extreme values, use RobustScaler, which uses median and interquartile range instead of mean and std:
from sklearn.preprocessing import RobustScaler
# RobustScaler uses median and IQR, resistant to outliers
scaler = RobustScaler()
scaler.fit(X)
X_robust = scaler.transform(X)
print("RobustScaler statistics:")
print(f" Center (median): {scaler.center_}")
print(f" Scale (IQR): {scaler.scale_}")
RobustScaler uses: x_scaled = (x - median) / IQR. It is ideal for datasets with outliers that you cannot remove.
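The sketch below checks this formula by hand on a hypothetical feature containing one extreme value, assuming RobustScaler's default 25th to 75th percentile range. It then compares how much spread the ordinary values keep under RobustScaler and under StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical feature with one extreme value
x = np.array([[10.0], [11.0], [12.0], [13.0], [14.0], [500.0]])

robust = RobustScaler().fit(x)
x_robust = robust.transform(x)

# Manual version: subtract the median, divide by the interquartile range
median = np.median(x, axis=0)
q25, q75 = np.percentile(x, [25, 75], axis=0)
x_manual = (x - median) / (q75 - q25)
matches = np.allclose(x_robust, x_manual)

# Spread of the five ordinary values under each scaler
x_standard = StandardScaler().fit_transform(x)
robust_spread = float(x_robust[:5].max() - x_robust[:5].min())
standard_spread = float(x_standard[:5].max() - x_standard[:5].min())

print(f"Manual formula matches RobustScaler: {matches}")
print(f"Inlier spread, RobustScaler:   {robust_spread:.3f}")
print(f"Inlier spread, StandardScaler: {standard_spread:.3f}")
```

Because the median and IQR barely move when the outlier is added, the ordinary values stay well separated. Under StandardScaler the outlier inflates the standard deviation and the same values collapse toward each other.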
Inverse Transform: Converting Back to Original Scale
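After scaling, you can reverse the transformation to recover values in the original scale. Predictions only need inverting if you scaled the target; scaling the features alone leaves predictions in the target's original units:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
X = load_iris().data
scaler = StandardScaler().fit(X)
# Scale data
X_scaled = scaler.transform(X)
# Inverse transform recovers the original feature values
# Note: inverse_transform works on any fitted scaler, including one fitted on a target
X_original = scaler.inverse_transform(X_scaled)
print(f"Recovered original data (shape): {X_original.shape}")
print(f"Matches original: {np.allclose(X, X_original)}")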
For regression models, scale the target y separately if needed, using a second scaler fitted on training targets only.
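scikit-learn's TransformedTargetRegressor is one way to manage that second scaler: it fits the transformer on the training targets, trains the regressor on the scaled targets, and inverts the scaling when predicting. The sketch below assumes the bundled diabetes dataset and a Ridge regressor, both chosen only for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = TransformedTargetRegressor(
    # Feature scaler lives inside the regressor pipeline
    regressor=make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    # Separate scaler for the target, fitted on y_train only
    transformer=StandardScaler(),
)
model.fit(X_train, y_train)

# predict() applies inverse_transform, so results are in original target units
predictions = model.predict(X_test)

print(f"Target scaler mean: {model.transformer_.mean_}")
print(f"First predictions (original units): {predictions[:3]}")
```

Keeping the target scaler inside the estimator avoids calling inverse_transform by hand and keeps test targets out of the fitting step.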
Key Takeaways
- StandardScaler (Z-score normalization) is the default; use it for most linear and distance-based models.
- MinMaxScaler rescales to [0, 1]; useful for bounded outputs or when feature ranges are semantically meaningful.
- Fit scalers on training data only to prevent data leakage; apply the learned transformation to test data.
- Use pipelines to ensure scalers and models are applied consistently.
- RobustScaler handles outliers better than StandardScaler or MinMaxScaler.
Frequently Asked Questions
Should I scale categorical features?
Generally no. Categorical features are encoded as integers or one-hot vectors during preprocessing, and the encoded values are labels or indicators, not magnitudes. Scale only the numeric (continuous) features; one-hot columns are already 0/1 and gain little from scaling.
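ColumnTransformer is one way to scale only the numeric columns while encoding the categorical ones. The sketch below uses a small hypothetical housing array with invented values and selects columns by index. With a pandas DataFrame you would pass column names instead.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows: [square footage, bathrooms, neighborhood]
data = np.array([
    [850.0, 1.0, "north"],
    [1200.0, 2.0, "south"],
    [2400.0, 3.0, "north"],
    [3100.0, 4.0, "east"],
], dtype=object)

preprocess = ColumnTransformer(
    [
        ("numeric", StandardScaler(), [0, 1]),                         # scale these
        ("categorical", OneHotEncoder(handle_unknown="ignore"), [2]),  # encode, do not scale
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(data)

print(f"Output shape: {features.shape}")
print(features)
```

The first two output columns are standardized and the remaining columns are 0/1 indicators, one per neighborhood.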
Can I scale before train-test splitting?
No. Always split first, then scale using only training data statistics. If you scale before splitting, you leak test information into the scaler, biasing all downstream evaluation.
What if features have different units (e.g., meters and dollars)?
Scaling makes all features unit-less and comparable. StandardScaler is ideal here: it centers on 0 regardless of original units. After scaling, all features are in the same "standardized units."
Does order of operations matter: split then scale, or scale then split?
Always split first, then scale. Splitting first ensures your test set represents unseen data. Scaling train and test with the same training statistics is key.
How do I handle missing values during scaling?
Handle missing values before scaling. Use SimpleImputer to fill them, then scale the completed data. In a pipeline, impute first, then scale: [('impute', SimpleImputer()), ('scale', StandardScaler())].
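A minimal sketch of that impute-then-scale pipeline follows, using small hypothetical arrays with missing entries. The median strategy is an assumption made for illustration; SimpleImputer defaults to the mean.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with missing entries
X_train = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 400.0], [np.nan, 600.0]])
X_test = np.array([[np.nan, 300.0], [4.0, np.nan]])

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill gaps with training medians
    ("scale", StandardScaler()),                   # then standardize the completed data
])

# Medians, means, and stds are all learned from the training rows only
X_train_ready = preprocess.fit_transform(X_train)
X_test_ready = preprocess.transform(X_test)

print(f"Learned medians: {preprocess.named_steps['impute'].statistics_}")
print(f"Any NaN left in test data: {np.isnan(X_test_ready).any()}")
```

Both steps reuse training statistics on the test rows, so the imputer benefits from the same leakage protection as the scaler.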