Feature Scaling and Normalization in Python
Feature scaling transforms numerical features to a common range or distribution. When your dataset contains age (0–100), income (0–1,000,000), and credit score (300–850), the income values dominate distance-based and gradient-based algorithms: distances are driven almost entirely by income, and the loss landscape is stretched along the income axis. Algorithms like K-means, SVM, and neural networks are sensitive to feature magnitude and typically converge faster and more reliably, often to better solutions, when features are scaled.
Why Scaling Matters
Distance-based and gradient-based algorithms are sensitive to feature scale: a feature with a larger numeric range contributes more, whether or not it is more informative. K-means computes Euclidean distance: if income ranges from 0 to 1 million and age from 0 to 100, an income difference of 1,000 swamps any possible age difference. Linear regression, logistic regression, and neural networks are often trained with gradient descent; without scaling, features with larger magnitudes dominate gradient updates, slowing convergence and requiring smaller learning rates. The size of the benefit depends on the dataset and model; the figures cited here are a 10-50% reduction in training time and, often, a 3-7% improvement in final accuracy (scikit-learn benchmarks, 2025).
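To make the distance argument concrete, here is a small sketch. The three customers and the assumed feature ranges (age 0 to 100, income 0 to 1,000,000) are hypothetical values chosen for illustration, and dividing by the range is just one simple way to put both columns on a comparable scale.

```python
import numpy as np

# Hypothetical customers: columns are [age, income]
a = np.array([25.0, 50_000.0])
b = np.array([65.0, 50_500.0])  # 40 years older, income differs by 500
c = np.array([26.0, 52_000.0])  # 1 year older, income differs by 2,000


def euclidean(u, v):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


# On raw features, income decides which customer is "closer" to a
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Divide each column by its assumed range so both features are comparable
ranges = np.array([100.0, 1_000_000.0])
scaled_ab = euclidean(a / ranges, b / ranges)
scaled_ac = euclidean(a / ranges, c / ranges)

print(f"Raw:    d(a, b) = {raw_ab:.1f}, d(a, c) = {raw_ac:.1f}")
print(f"Scaled: d(a, b) = {scaled_ab:.4f}, d(a, c) = {scaled_ac:.4f}")
```

On raw features, customer b looks closer to a than c does, despite the 40-year age gap. After rescaling, the ordering flips and the age difference is no longer drowned out.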
Standardization (Z-score Normalization)
Standardization rescales features to have mean zero and standard deviation one: (x - mean) / std. This is the most common method and works well for normally distributed features.
import numpy as np
from sklearn.preprocessing import StandardScaler
# Example dataset
age = np.array([25, 35, 45, 55])
income = np.array([50000, 75000, 100000, 125000])
# Manual calculation
age_mean = age.mean()
age_std = age.std()
age_standardized = (age - age_mean) / age_std
print(f"Original age: {age}")
print(f"Standardized age: {age_standardized}")
print(f"Mean: {age_standardized.mean():.6f}, Std: {age_standardized.std():.6f}")
# Using scikit-learn
scaler = StandardScaler()
features = np.column_stack([age, income])
features_scaled = scaler.fit_transform(features)
print(f"\nScaled features:\n{features_scaled}")
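Standardization is a good default for linear models (linear regression, logistic regression), SVMs, and neural networks. It does not bound the output: for roughly normal data most values fall within about [-3, 3], but outliers can land well outside that interval. Keeping features on a comparable scale makes gradient computation more stable. Tree-based models (Random Forest, XGBoost) are largely insensitive to feature scale and don't require scaling, so apply it only if your pipeline includes non-tree algorithms.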
Min-Max Normalization
Min-Max normalization rescales features to a fixed range, typically [0, 1]: (x - min) / (max - min). Use this when you want bounded outputs or when the data has known, fixed bounds.
from sklearn.preprocessing import MinMaxScaler
# Manual calculation
age_min = age.min()
age_max = age.max()
age_normalized = (age - age_min) / (age_max - age_min)
print(f"Min-Max normalized age: {age_normalized}")
# Output (rounded to three decimals): [0, 0.333, 0.667, 1]
# Using scikit-learn
scaler_minmax = MinMaxScaler(feature_range=(0, 1))
features_minmax = scaler_minmax.fit_transform(features)
print(f"Min-Max scaled features:\n{features_minmax}")
Min-Max normalization is preferred for neural networks with sigmoid activation (output range [0, 1]) and for algorithms where you need bounded, interpretable outputs. The downside: new data outside the training range (e.g., income of 150,000 when max training income was 125,000) gets normalized beyond [0, 1], which can destabilize predictions.
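The out-of-range behavior is easy to reproduce. The sketch below reuses the income values from the earlier example as training data; the two new incomes (150,000 and 40,000) are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Training incomes: min is 50,000 and max is 125,000
train_income = np.array([[50_000.0], [75_000.0], [100_000.0], [125_000.0]])

# Hypothetical new data: one value above the training max, one below the min
new_income = np.array([[150_000.0], [40_000.0]])

scaler = MinMaxScaler().fit(train_income)
new_scaled = scaler.transform(new_income).ravel()

# (150000 - 50000) / 75000 is about 1.333; (40000 - 50000) / 75000 is about -0.133
print(f"Scaled new incomes: {new_scaled}")
```

Both results fall outside [0, 1] because the scaler keeps using the minimum and maximum it learned from the training data.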
Robust Scaling
Robust scaling uses median and interquartile range instead of mean and standard deviation: (x - median) / IQR. It's resistant to outliers and is safer when your data contains extreme values.
from sklearn.preprocessing import RobustScaler
# Manual calculation
age_median = np.median(age)
age_q1 = np.percentile(age, 25)
age_q3 = np.percentile(age, 75)
age_iqr = age_q3 - age_q1
age_robust = (age - age_median) / age_iqr
print(f"Robust scaled age: {age_robust}")
# Using scikit-learn
scaler_robust = RobustScaler()
features_robust = scaler_robust.fit_transform(features)
print(f"Robust scaled features:\n{features_robust}")
Robust scaling shines when outliers are present. If your dataset has a few users with million-dollar incomes in an otherwise 50k-100k distribution, standardization's standard deviation bloats; robust scaling ignores the outliers' effect. Use this before outlier detection or when you want to preserve outliers without letting them distort the scale.
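As a rough illustration of the outlier effect, the sketch below compares the two scalers on a made-up income column with six typical values and one extreme value. The numbers are assumptions chosen to make the contrast visible.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Six typical incomes plus one hypothetical extreme value (last entry)
income = np.array(
    [50_000, 60_000, 70_000, 80_000, 90_000, 100_000, 2_000_000],
    dtype=float,
).reshape(-1, 1)

standard = StandardScaler().fit_transform(income).ravel()
robust = RobustScaler().fit_transform(income).ravel()

# How much of the scaled axis the six typical incomes occupy
typical_spread_standard = standard[:6].max() - standard[:6].min()
typical_spread_robust = robust[:6].max() - robust[:6].min()

print(f"StandardScaler spread of typical incomes: {typical_spread_standard:.3f}")
print(f"RobustScaler spread of typical incomes:   {typical_spread_robust:.3f}")
print(f"Outlier after robust scaling:             {robust[-1]:.1f}")
```

With standardization, the single outlier inflates the standard deviation so much that the typical incomes are squeezed into a narrow band. Robust scaling keeps them well separated and leaves the outlier as a large but intact value.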
Log Transformation
For highly skewed features (like income, user engagement, page views), apply a logarithmic transformation. Logs compress large values and spread small ones, making the distribution more normal.
import numpy as np
# Highly skewed data (e.g., user page views)
page_views = np.array([1, 5, 10, 100, 1000, 10000])
# Log transformation (add 1 to avoid log(0))
page_views_log = np.log1p(page_views) # log(1 + x)
print(f"Original: {page_views}")
print(f"Log-transformed: {page_views_log}")
# Apply StandardScaler after log transform
from sklearn.preprocessing import StandardScaler
page_views_log_scaled = StandardScaler().fit_transform(page_views_log.reshape(-1, 1))
print(f"Log + Standardized: {page_views_log_scaled.ravel()}")
Log transformation is especially useful for features following a power-law distribution. Many real-world features (wealth, word frequency, network node degree) follow power laws; log transformation linearizes their relationship with targets in linear models.
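The linearization claim can be checked on synthetic data. This sketch assumes an exact power law, y = 3 * x^2, which real data will only approximate. It uses a plain log rather than log1p because every x here is strictly positive.

```python
import numpy as np

# Synthetic power-law relationship (assumed for illustration): y = 3 * x^2
x = np.array([1.0, 10.0, 100.0, 1_000.0, 10_000.0])
y = 3.0 * x**2

# Taking logs of both sides gives log(y) = log(3) + 2 * log(x),
# which is a straight line in log-log space.
slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)

print(f"Fitted slope (the exponent): {slope:.3f}")
print(f"Fitted coefficient exp(intercept): {np.exp(intercept):.3f}")
```

The fitted slope recovers the exponent and the intercept recovers the coefficient, which is why a linear model can fit a power-law relationship once both sides are log-transformed.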
Handling Train/Test Scaling
The critical rule: fit the scaler on training data only, then apply it to test data. Fitting on the combined dataset leaks test information into the scaling parameters.
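from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Synthetic placeholder data (an assumption for illustration); substitute your own X and y
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
# Split data FIRST
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)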
# Fit scaler on training data ONLY
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Apply same scaler to test data
X_test_scaled = scaler.transform(X_test)
print(f"Train mean: {X_train_scaled.mean(axis=0)}")
print(f"Test mean: {X_test_scaled.mean(axis=0)}")
# Test mean will be close to zero but not exact (expected)
Never fit the scaler on X_test or the combined X_train + X_test. That's data leakage and produces overly optimistic metrics.
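The same rule applies inside cross-validation: the scaler should be refit on each training fold. One common way to get this is to wrap the scaler and model in a scikit-learn Pipeline. The sketch below uses a synthetic dataset and logistic regression purely as placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# The pipeline fits the scaler on the training portion of each fold only,
# then applies that fitted scaler to the held-out portion.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5)

print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.3f}")
```

Scaling the full dataset before calling cross_val_score would leak held-out statistics into every fold; passing the pipeline avoids that without extra bookkeeping.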
When NOT to Scale
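Tree-based models (Decision Trees, Random Forest, Gradient Boosting) are insensitive to feature scale. They split each feature at threshold values, so a monotonic rescaling such as standardization or min-max normalization moves the thresholds without changing which samples fall on each side. Scaling is therefore unnecessary. It doesn't hurt; it just adds no benefit.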
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
# Both work equally well; scaling is optional
model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train on raw features
model.fit(X_train, y_train)
score_raw = model.score(X_test, y_test)
# Train on scaled features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model.fit(X_train_scaled, y_train)
score_scaled = model.score(X_test_scaled, y_test)
print(f"Raw: {score_raw:.4f}, Scaled: {score_scaled:.4f}")
# Scores are typically identical for tree models
Scaling Methods Comparison
| Method | Formula | Range | Best For | Outlier Sensitive |
|---|---|---|---|---|
| Standardization | (x - mean) / std | (-∞, ∞) | Linear models, neural nets | Yes |
| Min-Max | (x - min) / (max - min) | [0, 1] on training data | Neural nets, bounded output | Yes |
| Robust | (x - median) / IQR | (-∞, ∞) | Data with outliers | No |
| Log | log(1 + x) | [0, ∞) for x ≥ 0 | Skewed distributions | No |
Key Takeaways
- Standardization (Z-score) is the default for linear models and neural networks; it centers features at zero with unit variance.
- Min-Max normalization bounds features to [0, 1], useful for sigmoid activations and interpretability.
- Robust scaling using IQR is safer when outliers are present and would distort mean and standard deviation.
- Log transformation linearizes skewed features and is especially useful for power-law distributed data.
- Always fit scalers on training data only; never fit on test data or the combined dataset.
- Tree models don't require scaling; it is harmless but has essentially no effect on their results.
- Store fitted scaler objects and apply the same scaler in production to maintain consistency.
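As a minimal sketch of the last point, a fitted scaler can be serialized and restored with Python's built-in pickle module. The training array reuses the age and income values from earlier, and the new row is hypothetical. The example serializes to bytes in memory; in practice you would write those bytes to a file, and joblib is a common alternative.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit once on training data (age and income columns from the earlier example)
X_train = np.array(
    [[25, 50_000], [35, 75_000], [45, 100_000], [55, 125_000]], dtype=float
)
scaler = StandardScaler().fit(X_train)

# Serialize the fitted scaler, then restore it as a serving process would
blob = pickle.dumps(scaler)
restored = pickle.loads(blob)

# Hypothetical incoming row at prediction time
new_row = np.array([[40.0, 90_000.0]])
original_out = scaler.transform(new_row)
restored_out = restored.transform(new_row)

print(f"Original scaler: {original_out}")
print(f"Restored scaler: {restored_out}")
```

The restored object carries the learned mean and scale, so production inputs are transformed exactly as the training data was. Only unpickle files you trust, and keep the scikit-learn version consistent between training and serving.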
Frequently Asked Questions
Should I standardize before or after encoding categorical variables?
Scale numeric features only. Encode categorical variables first (one-hot, ordinal, or target encoding), then scale numeric columns. Scaling binary one-hot columns (0/1) is unnecessary but harmless.
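One way to scale only the numeric columns is scikit-learn's ColumnTransformer, sketched below. The tiny dataset and the "plan" category are invented for illustration, and the column indices are assumptions about this particular layout.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows: [age, income, plan]
X = np.array(
    [
        [25, 50_000, "basic"],
        [35, 75_000, "pro"],
        [45, 100_000, "basic"],
        [55, 125_000, "enterprise"],
    ],
    dtype=object,
)

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), [0, 1]),  # scale numeric columns only
        ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),  # encode, don't scale
    ],
    sparse_threshold=0,  # always return a dense array for easy inspection
)
transformed = preprocess.fit_transform(X)

print(transformed.shape)  # 2 scaled columns + 3 one-hot columns
print(transformed)
```

The numeric columns come out standardized while the one-hot columns stay as 0/1 indicators, so the scaling step never touches the categorical data.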
What if my test set has values outside the training range after scaling?
This is normal and acceptable. For a test sample with income higher than any training sample, Min-Max normalization will produce a value greater than 1. Robust and standard scaling can produce values beyond [-3, 3]. Your model should handle this; if it doesn't, consider clipping or re-evaluating the scaling method.
Can I use StandardScaler on categorical features?
No, StandardScaler requires numeric input. Encode categories first, then scale if needed.
Why does min-max normalization sometimes fail on test data?
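If test data contains values outside the training range, min-max normalization produces values outside [0, 1]. Standardization and robust scaling have no fixed output range, so there is no bound to violate, although they can still produce unusually large values for extreme inputs. If your model needs bounded inputs, clip test values to the training range.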
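For the clipping option, recent scikit-learn versions (0.24 and later) expose a clip parameter on MinMaxScaler. The sketch below uses made-up income values; whether clipping is appropriate depends on how your model should treat extreme inputs.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Training range is 50,000 to 125,000
train_income = np.array([[50_000.0], [125_000.0]])

# Hypothetical test values: above the max, below the min, and in range
test_income = np.array([[150_000.0], [40_000.0], [100_000.0]])

# clip=True clamps transformed values to feature_range, here the default (0, 1)
clipping_scaler = MinMaxScaler(clip=True).fit(train_income)
clipped = clipping_scaler.transform(test_income).ravel()

print(f"Clipped test incomes: {clipped}")
```

Out-of-range values are pinned to 0 or 1, while in-range values are scaled as usual. The trade-off is that the model can no longer tell a slightly out-of-range value from an extreme one.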