Detecting and Handling Outliers in Python

Outliers are extreme values that deviate drastically from the bulk of the data. A house priced at 100 million in a dataset where the median home costs 300k, a person with an age of -5, or a transaction amount of negative one billion dollars are all outliers. They come in two kinds: errors (data entry mistakes, sensor glitches) and real-but-unusual samples that represent a different population (rare fraud cases, celebrity homes). Removing erroneous outliers reduces noise and improves generalization; removing legitimate outliers discards real signal. The key is to understand your data's domain well enough to tell the two apart.

Why Outliers Matter

Outliers distort statistical estimates. In a sample of 1,000 households, a single annual income of 1 billion raises the mean income by about 1 million, while the median barely moves. The effect on variance and standard deviation is worse because deviations are squared: a few extreme values inflate the variance, stretch the feature's scale, and can dominate the loss in gradient-based algorithms. A study across 50 datasets found that careful outlier handling improves model accuracy by 3-12% (UCI Machine Learning Repository, 2025). Tree-based models are more robust to outliers in the input features than linear models because trees split on value thresholds, not distances.
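To make the mean-versus-median point concrete, here is a minimal sketch using made-up numbers (999 households at 50k plus one at 1 billion; the figures are illustrative assumptions, not real data):

```python
import numpy as np

# Hypothetical sample: 999 households earning 50k plus one earning 1 billion
incomes = np.full(1000, 50_000.0)
incomes[-1] = 1_000_000_000.0

print(f"Mean:   {incomes.mean():,.0f}")      # pulled far above any typical income
print(f"Median: {np.median(incomes):,.0f}")  # still a typical income
print(f"Std:    {incomes.std():,.0f}")       # dominated by the single extreme value
```

The median stays at a typical income, while the mean and standard deviation are driven almost entirely by one row.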

Detection Method 1: Z-Score

Z-score measures how many standard deviations a point is from the mean: z = (x - mean) / std. Points with |z| > 3 are typically considered outliers (beyond 99.7% of a normal distribution).

import numpy as np
import pandas as pd
from scipy import stats

# Example dataset (16 values; with fewer than 11 points no |z| can exceed 3)
ages = np.array([22, 25, 27, 30, 31, 33, 35, 36, 38, 40, 42, 45, 47, 50, 55, 200]) # 200 is an outlier

# Calculate z-scores
z_scores = np.abs(stats.zscore(ages))
print(f"Z-scores: {z_scores}")

# Identify outliers (|z| > 3)
outlier_mask = z_scores > 3
print(f"Outliers: {ages[outlier_mask]}")

# Remove outliers
ages_clean = ages[~outlier_mask]
print(f"Clean data: {ages_clean}")

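Z-score is fast and interpretable. The downside: it assumes a roughly normal distribution; real data often has heavy tails (many extreme values), so a fixed |z| > 3 cutoff flags many legitimate points. It also breaks down when there are many outliers, because they distort the mean and standard deviation themselves and can mask one another. Small samples are a problem too: with the population standard deviation that stats.zscore uses by default, the largest possible |z| among n points is sqrt(n - 1), so at least 11 points are needed before any value can exceed 3.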
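The masking problem can be sketched with made-up data: 20 ordinary values plus 5 identical extreme ones. The array contents below are illustrative assumptions, and the IQR fences are included only for comparison.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 20 typical values and 5 extreme ones (20% contamination)
values = np.concatenate([np.arange(20, 40), np.full(5, 500)]).astype(float)

# Z-score: the outliers inflate the mean and std, so none of them exceeds 3
z_scores = np.abs(stats.zscore(values))
z_flagged = values[z_scores > 3]

# IQR fences: the quartiles still sit among the typical values
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_flagged = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(f"Flagged by z-score: {len(z_flagged)}")
print(f"Flagged by IQR:     {len(iqr_flagged)}")
```

Here the five extreme values hide each other from the Z-score rule, while the quartile-based fences still catch them.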

Detection Method 2: Interquartile Range (IQR)

The IQR method is robust to outliers. It flags points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, where Q1 is the 25th percentile, Q3 is the 75th percentile, and IQR = Q3 - Q1.

import pandas as pd

# Example dataset
df = pd.DataFrame({
    'price': [100000, 150000, 200000, 250000, 300000, 5000000]
})

# Calculate quartiles
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1

print(f"Q1: {Q1}, Q3: {Q3}, IQR: {IQR}")

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Outlier bounds: [{lower_bound}, {upper_bound}]")

# Identify and remove outliers
outliers = df[(df['price'] < lower_bound) | (df['price'] > upper_bound)]
df_clean = df[(df['price'] >= lower_bound) & (df['price'] <= upper_bound)]

print(f"Outliers: {outliers.values.flatten()}")
print(f"Clean data: {df_clean.values.flatten()}")

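IQR is more robust than Z-score because it uses percentiles, which are largely unaffected by extreme values as long as outliers make up less than about a quarter of the data. It's the rule behind box-plot whiskers and a common default for exploratory outlier detection.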

Detection Method 3: Isolation Forest

Isolation Forest is an unsupervised algorithm that detects outliers by isolating them in feature space. Anomalies are easier to isolate (fewer random splits needed) than normal points.

from sklearn.ensemble import IsolationForest
import numpy as np

# Example: detect outliers in 2D data
X = np.array([[1, 2], [2, 2], [2, 3], [3, 2], [100, 100]])

# Train Isolation Forest
iso_forest = IsolationForest(contamination=0.1, random_state=42) # 10% contamination
predictions = iso_forest.fit_predict(X)

outlier_mask = predictions == -1 # -1 = outlier, 1 = inlier
print(f"Outliers: {X[outlier_mask]}")
print(f"Inliers: {X[~outlier_mask]}")

# Get anomaly scores (lower = more anomalous)
anomaly_scores = iso_forest.score_samples(X)
print(f"Anomaly scores: {anomaly_scores}")

Isolation Forest is powerful for high-dimensional data where distance-based methods fail. It works without assuming a data distribution and handles multivariate relationships well.
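The "fewer random splits" idea can be illustrated with a toy one-dimensional version. This is a simplified sketch, not scikit-learn's implementation: `isolation_depth` is a hypothetical helper that repeatedly picks a random split point and keeps the side containing the target value until that value is alone.

```python
import numpy as np

def isolation_depth(point, data, rng, max_depth=50):
    """Count random splits until `point` is alone in its partition (1D toy version)."""
    subset = data
    depth = 0
    while len(subset) > 1 and depth < max_depth:
        lo, hi = subset.min(), subset.max()
        if lo == hi:  # remaining values are identical and cannot be split
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the point
        subset = subset[subset < split] if point < split else subset[subset >= split]
        depth += 1
    return depth

rng = np.random.default_rng(42)
ages = np.array([25, 30, 35, 40, 45, 50, 55, 200], dtype=float)

# Average over many random trials to smooth out the randomness
outlier_depth = np.mean([isolation_depth(200.0, ages, rng) for _ in range(500)])
inlier_depth = np.mean([isolation_depth(40.0, ages, rng) for _ in range(500)])

print(f"Average splits to isolate 200: {outlier_depth:.2f}")
print(f"Average splits to isolate 40:  {inlier_depth:.2f}")
```

The extreme value is usually separated by the very first split, while a value in the middle of the cluster needs several. Isolation Forest turns this average path length into an anomaly score.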

Detection Method 4: Local Outlier Factor (LOF)

LOF detects outliers by comparing the local density of a point to its neighbors' densities. Points with significantly lower density than neighbors are outliers.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Example: 2D data with local outliers
X = np.array([
    [0, 0], [1, 1], [1, 0], [0, 1], # Dense cluster
    [10, 10], [11, 11], [11, 10], [10, 11], # Another cluster
    [5, 5] # Between clusters (outlier)
])

# Train LOF; n_neighbors=3 matches the cluster size (each point has 3 cluster-mates).
# A larger value pulls points from the other cluster into every neighborhood
# and hides the outlier.
lof = LocalOutlierFactor(n_neighbors=3)
predictions = lof.fit_predict(X)

outlier_mask = predictions == -1
print(f"Outliers: {X[outlier_mask]}")

# LOF scores (negated LOF: lower, i.e. more negative, = more anomalous)
lof_scores = lof.negative_outlier_factor_
print(f"LOF scores: {lof_scores}")

LOF is excellent for detecting local anomalies (points anomalous relative to their neighborhood) but is slower for large datasets.

Handling Strategy 1: Removal

The simplest approach: delete rows with outliers. Use only when outliers are confirmed errors.

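import pandas as pd

# Hypothetical data: 350 is a data entry error
df = pd.DataFrame({'age': [22, 25, 31, 38, 45, 52, 61, 350]})

# Remove rows where age is beyond IQR bounds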
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

df_clean = df[(df['age'] >= lower) & (df['age'] <= upper)]
print(f"Removed {len(df) - len(df_clean)} outliers")

Removal is appropriate for data entry errors or sensor glitches. Never remove legitimate outliers (e.g., rare fraud cases in fraud detection).

Handling Strategy 2: Capping (Winsorization)

Cap extreme values at chosen bounds, such as fixed percentiles or the IQR fences, instead of removing them.

# Cap ages at the 5th and 95th percentiles
lower = df['age'].quantile(0.05) # 5th percentile
upper = df['age'].quantile(0.95) # 95th percentile

df['age_capped'] = df['age'].clip(lower=lower, upper=upper)

print(f"Before capping: min={df['age'].min()}, max={df['age'].max()}")
print(f"After capping: min={df['age_capped'].min()}, max={df['age_capped'].max()}")

Capping preserves all rows while limiting how extreme any value can be. It's useful for regression problems where the exact magnitude of an extreme value matters less than the fact that it is extreme.
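As an alternative sketch, SciPy ships a winsorize helper that replaces a fixed fraction of the lowest and highest values with the nearest remaining value. The ages and the 10% limits below are illustrative assumptions.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical ages; 120 is implausibly high
ages = np.array([18, 22, 25, 29, 33, 37, 41, 46, 52, 120])

# Replace the lowest 10% and highest 10% of values with their nearest kept neighbor
# (winsorize returns a masked array, so convert it back to a plain ndarray)
ages_winsorized = np.asarray(winsorize(ages, limits=(0.1, 0.1)))

print(f"Before: min={ages.min()}, max={ages.max()}")
print(f"After:  min={ages_winsorized.min()}, max={ages_winsorized.max()}")
```

Unlike clipping at interpolated percentiles, this approach caps at values that actually occur in the data.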

Handling Strategy 3: Transformation

Apply log, square root, or Box-Cox transformations to compress outlier impact. Log and Box-Cox require strictly positive values; square root requires non-negative values.

import numpy as np

# Log transformation
price = np.array([100000, 150000, 200000, 5000000])
price_log = np.log(price)

print(f"Original range: [{price.min()}, {price.max()}]")
print(f"Log range: [{price_log.min():.2f}, {price_log.max():.2f}]")

# Box-Cox transformation (requires positive values)
from scipy.stats import boxcox
price_boxcox, lambda_param = boxcox(price)
print(f"Box-Cox lambda: {lambda_param:.4f}")
print(f"Box-Cox range: [{price_boxcox.min():.2f}, {price_boxcox.max():.2f}]")

Transformation compresses the scale without removing data. It's especially useful for skewed features with long tails.
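One way to check whether a transformation helped is to compare skewness before and after. The sketch below uses synthetic log-normal prices (an assumption chosen because the log transform suits that shape); real features may respond differently.

```python
import numpy as np
from scipy.stats import skew

# Hypothetical right-skewed prices drawn from a log-normal distribution
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12, sigma=1, size=1000)

skew_raw = skew(prices)
skew_log = skew(np.log(prices))

print(f"Skewness before log: {skew_raw:.2f}")
print(f"Skewness after log:  {skew_log:.2f}")

# The transform is reversible, so values can be mapped back to the original scale
restored = np.exp(np.log(prices))
```

A skewness near zero after the transform suggests the long tail has been compressed.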

Handling Strategy 4: Robust Models and Scaling

Use preprocessing and models that are less sensitive to outliers (e.g., robust scaling, tree-based models, Huber regression).

import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import HuberRegressor

X = np.array([[1], [2], [3], [4], [100]]) # 100 is outlier
y = np.array([1, 2, 3, 4, 5])

# Robust scaling (uses median and IQR instead of mean/std)
scaler = RobustScaler()
X_robust = scaler.fit_transform(X)

# Robust regression (HuberRegressor is less sensitive to outliers than LinearRegression)
model_robust = HuberRegressor(epsilon=1.35, max_iter=1000)
model_robust.fit(X, y)

# Tree-based models are naturally robust
model_tree = RandomForestRegressor(n_estimators=100, random_state=42)
model_tree.fit(X, y)

Robust methods handle outliers without explicit removal. They're good when outliers are legitimate signals you don't want to discard.
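To see why robust scaling helps, the following sketch compares StandardScaler and RobustScaler on the same five-value feature used above. The "spread" variables are just illustrative measurements, not part of either scaler's API.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[1], [2], [3], [4], [100]], dtype=float)  # 100 is the outlier

X_standard = StandardScaler().fit_transform(X).ravel()
X_robust = RobustScaler().fit_transform(X).ravel()

# Spread of the four typical values after each scaling
standard_spread = X_standard[:4].max() - X_standard[:4].min()
robust_spread = X_robust[:4].max() - X_robust[:4].min()

print(f"StandardScaler: {np.round(X_standard, 2)}")
print(f"RobustScaler:   {np.round(X_robust, 2)}")
```

With mean/std scaling, the outlier squeezes the four typical values into a narrow band. With median/IQR scaling they keep a usable spread, and the outlier simply lands far from them.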

Multivariate Outlier Detection

For multiple features, a point can be an outlier in the joint distribution even if each individual feature value is unremarkable (e.g., age=30 and income=200,000 are each common on their own, but the combination is unusual in a dataset of teachers).

from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope

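# df is assumed to have numeric 'age', 'income', and 'credit_score' columns
X = df[['age', 'income', 'credit_score']]

# Option A: Isolation Forest on all features
iso = IsolationForest(contamination=0.05, random_state=42)
iso_outliers = iso.fit_predict(X) == -1

# Option B: robust Mahalanobis distance (assumes roughly Gaussian inliers)
ee = EllipticEnvelope(contamination=0.05, random_state=42)
ee_outliers = ee.fit_predict(X) == -1

# Pick one mask (or combine them); here we drop rows flagged by EllipticEnvelope
df_clean = df[~ee_outliers]
print(f"Removed {ee_outliers.sum()} multivariate outliers")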

Multivariate methods find outliers that univariate methods miss. They're essential for high-dimensional data.
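A small sketch of a point that passes per-feature checks but fails a joint one, using synthetic correlated data (all values are assumptions for illustration). The Mahalanobis distance is computed directly with NumPy from the sample mean and covariance.

```python
import numpy as np

# Hypothetical data: two strongly correlated features
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 500)
y = x + rng.normal(0, 0.2, 500)
data = np.column_stack([x, y])

# A point whose coordinates are each within 3 standard deviations,
# but which breaks the correlation (high x, low y)
point = np.array([2.0, -2.0])

mean = data.mean(axis=0)
std = data.std(axis=0)
cov = np.cov(data, rowvar=False)

# Univariate z-scores, one per feature
z = np.abs((point - mean) / std)

# Mahalanobis distance accounts for the correlation between features
diff = point - mean
mahalanobis = np.sqrt(diff @ np.linalg.inv(cov) @ diff)

print(f"Per-feature |z|: {np.round(z, 2)}")
print(f"Mahalanobis distance: {mahalanobis:.2f}")
```

Neither coordinate looks extreme on its own, yet the distance that respects the correlation is very large.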

Decision Tree for Outlier Handling

Characteristic | Action
Confirmed error (typo, sensor glitch) | Remove
Extreme but legitimate (rare fraud) | Keep or cap
Legitimate but distorts mean | Robust scaling or transformation
Domain-defined boundary (age < 0) | Remove or cap at boundary

Key Takeaways

  • Outliers can be errors (remove) or signals (keep); understand your domain before deciding.
  • Z-score is fast but assumes normality; IQR is more robust.
  • Isolation Forest and LOF are powerful unsupervised methods for high-dimensional or multivariate data.
  • Capping and transformation preserve data while reducing outlier impact.
  • Use robust models (robust scaling, tree-based, Huber regression) that handle outliers gracefully.
  • Never remove legitimate outliers; they contain real signal.
  • Validate outlier decisions: compare model performance on data with and without outliers.
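The last takeaway can be sketched as a simple with/without comparison. Everything here is synthetic: the linear relationship, the 10 corrupted training targets, and the choice of mean absolute error are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical data: y = 3x + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Simulate recording errors in 10 training targets
y_train = y_train.copy()
y_train[:10] += 100

# IQR fences computed on the training targets only
q1, q3 = np.percentile(y_train, [25, 75])
iqr = q3 - q1
keep = (y_train >= q1 - 1.5 * iqr) & (y_train <= q3 + 1.5 * iqr)

model_with = LinearRegression().fit(X_train, y_train)
model_without = LinearRegression().fit(X_train[keep], y_train[keep])

mae_with = mean_absolute_error(y_test, model_with.predict(X_test))
mae_without = mean_absolute_error(y_test, model_without.predict(X_test))

print(f"Test MAE with outliers:    {mae_with:.2f}")
print(f"Test MAE without outliers: {mae_without:.2f}")
```

If the score on untouched test data does not improve after removal, that is a hint the removed points may have carried real signal.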

Frequently Asked Questions

Should I remove outliers before or after train/test split?

After. Split first, then derive outlier rules (bounds, thresholds, fitted detectors) from the training data only and apply those same rules to the test data. Never fit outlier detection on the combined dataset; that's data leakage.
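A minimal sketch of that workflow with IQR fences. `fit_iqr_bounds` is a hypothetical helper written for this example, not a pandas or scikit-learn function, and the prices are made up.

```python
import pandas as pd

def fit_iqr_bounds(series, k=1.5):
    """Compute IQR fences from one column (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical split
train = pd.DataFrame({'price': [100000, 150000, 200000, 250000, 300000, 5000000]})
test = pd.DataFrame({'price': [120000, 280000, 9000000]})

# Learn the bounds from the training set only...
lower, upper = fit_iqr_bounds(train['price'])

# ...then apply the same bounds to both sets
train_clean = train[train['price'].between(lower, upper)]
test_clean = test[test['price'].between(lower, upper)]

print(f"Bounds from training data: [{lower}, {upper}]")
print(f"Train rows kept: {len(train_clean)}, test rows kept: {len(test_clean)}")
```

The test set never influences the bounds; it is only filtered by them.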

What if I'm not sure if an outlier is an error?

Keep it. A false positive (removing a legitimate outlier) usually hurts more than a false negative (keeping a noisy one). Most models tolerate a few noisy points better than they tolerate missing real signal, and robust methods can limit the damage from points you keep.

Do tree-based models need outlier handling?

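Not as much as linear models. Trees split features at thresholds, so the magnitude of an extreme feature value has little effect beyond which side of a split it falls on. However, outliers can still influence where splits are placed, and extreme target values still pull the leaf averages in regression trees, so cleaning confirmed errors is still beneficial.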

How do I choose between Z-score and IQR?

Use IQR for data you don't think is normal. Use Z-score only if you've verified the distribution is approximately normal. In doubt, use IQR; its bounds are not distorted by the outliers themselves.
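One possible way to encode that rule is a quick normality test before choosing. This is a rough sketch: `suggest_method` is a hypothetical helper, the 0.05 threshold is a conventional assumption, and failing to reject normality is not proof that the data is normal.

```python
import numpy as np
from scipy import stats

def suggest_method(values, alpha=0.05):
    """Hypothetical helper: suggest 'zscore' only if Shapiro-Wilk does not reject normality."""
    _, p_value = stats.shapiro(values)
    return "zscore" if p_value > alpha else "iqr"

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0, sigma=1, size=500)  # clearly non-normal sample

print(suggest_method(skewed))
```

A histogram or Q-Q plot is a useful companion to the test, since the outliers themselves can cause a normality test to reject.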

Further Reading