Feature Engineering: Handle Missing Values Guide

Missing values are the first obstacle in data preparation. Missing data—represented as NaN in pandas or None in raw Python—corrupts analysis and forces algorithms to either crash or produce biased predictions. You have three core strategies: delete rows or columns with missing data, replace missing values with statistically estimated substitutes (imputation), or use algorithms that tolerate missing values natively.

Why Missing Values Harm Your Models

Missing values introduce bias and reduce the effective sample size. When you delete every row that has a gap, you lose information; when you leave the gaps in place, most scikit-learn estimators raise an error on NaN input, and neural-network frameworks such as TensorFlow typically propagate NaN through the loss instead of failing loudly. One reported figure is that datasets with 5% missing values lose approximately 20% prediction accuracy if handled naively, while careful imputation recovers most of that performance (Pandas Development Team, 2025); treat the numbers as indicative, since the real cost depends on the dataset and the model. The key is matching your strategy to the missing-data mechanism: Missing Completely At Random (MCAR), where missingness is unrelated to any data; Missing At Random (MAR), where it depends only on observed values; or Missing Not At Random (MNAR), where it depends on the unobserved value itself.
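To make the three mechanisms concrete, here is a minimal simulation sketch. The columns (age, income), the missingness probabilities, and the cut-offs are all invented for illustration; only NumPy is assumed. It hides income values under each mechanism and compares the mean of what remains with the true mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
age = rng.normal(40, 10, n)                        # always observed
income = 1_000 * age + rng.normal(0, 5_000, n)     # will be partially hidden

# MCAR: every income has the same 30% chance of being missing
mcar_mask = rng.random(n) < 0.3
# MAR: missingness depends only on an observed column (age)
mar_mask = rng.random(n) < np.where(age > 40, 0.5, 0.1)
# MNAR: missingness depends on the hidden value itself (higher incomes)
mnar_mask = rng.random(n) < np.where(income > 40_000, 0.5, 0.1)

true_mean = income.mean()
biases = {}
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    # Bias of the naive estimate: mean of observed incomes minus the true mean
    biases[name] = income[~mask].mean() - true_mean
    print(f"{name}: observed mean is off by {biases[name]:,.0f}")
```

In this setup the MCAR estimate stays close to the truth, while the MAR and MNAR estimates are pulled downward. Under MAR the gap can in principle be corrected by conditioning on age; under MNAR the observed data alone cannot reveal it.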

Identifying and Visualizing Missing Data

Before you impute, understand the pattern. A few missing values scattered randomly across rows are easy to handle; thousands of missing values in a single column signal that the column itself is unreliable.

import pandas as pd
import numpy as np

# Load your dataset (example)
df = pd.read_csv('housing.csv')

# Check missing values per column
missing_summary = df.isnull().sum()
print(missing_summary)
print(f"\nPercentage missing per column:\n{(missing_summary / len(df) * 100).round(2)}")

# Identify columns with high missingness (>50%)
high_missing = missing_summary[missing_summary / len(df) > 0.5].index.tolist()
print(f"Columns with >50% missing: {high_missing}")

This code reveals which columns are worth recovering. If a column is 90% missing, deletion is often the pragmatic choice. If 5% is missing, imputation is appropriate.
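Column totals do not show whether gaps cluster in the same rows. The sketch below, on a small made-up DataFrame (the toy columns and values are assumptions, not part of the housing data), looks at missingness per row, at the combinations of missing columns, and at how strongly missingness in one column goes along with missingness in another.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "age":    [25, np.nan, 40, 31, np.nan, 58],
    "income": [50_000, np.nan, 72_000, np.nan, np.nan, 91_000],
    "city":   ["Oslo", "Bergen", None, "Oslo", "Bergen", "Oslo"],
})

is_missing = toy.isna()

# Number of missing values in each row
missing_per_row = is_missing.sum(axis=1)
print(missing_per_row.tolist())

# How often each combination of missing columns occurs
pattern_counts = is_missing.value_counts()
print(pattern_counts)

# Correlation between missingness indicators (1 = missing, 0 = present)
co_missing = is_missing.astype(int).corr()
print(co_missing.round(2))
```

A high correlation between two indicators suggests the values tend to go missing together, which is a hint that missingness is not completely at random.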

Strategy 1: Delete Rows (Listwise Deletion)

The simplest approach is to remove any row containing at least one missing value. This works when missingness is rare (less than 5%) and scattered.

# Remove rows with ANY missing value
df_clean = df.dropna()

# Remove rows where ANY of the listed columns is missing
df_clean = df.dropna(subset=['age', 'income'])

# Remove rows only when ALL of the listed columns are missing
df_clean = df.dropna(subset=['age', 'income'], how='all')

print(f"Original shape: {df.shape}")
print(f"After dropna: {df_clean.shape}")

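Trade-off: Listwise deletion is simple and, when data are MCAR, leaves the distribution of the remaining rows unbiased, but every dropped row also discards the values that were observed. Use it only when the dataset is large and rows with missing values are rare.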
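Row loss compounds quickly when several columns each have a few gaps. As a back-of-the-envelope sketch, assume every column is missing independently with the same probability (a simplification that real data rarely satisfies exactly); the helper name below is hypothetical.

```python
def expected_rows_kept(p_missing: float, n_columns: int) -> float:
    """Expected share of rows with no missing value, assuming each of
    n_columns is missing independently with probability p_missing."""
    return (1.0 - p_missing) ** n_columns

for k in (1, 5, 10, 20):
    share = expected_rows_kept(0.05, k)
    print(f"{k:>2} columns at 5% missing each: {share:.1%} of rows survive dropna()")
```

With 5% missing in each of 10 columns, only about 60% of rows are expected to survive, which is why a per-column rate that looks harmless can still make listwise deletion expensive.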

Strategy 2: Delete Columns

If a column is more missing than present (greater than 50% missingness), deletion is often the right call.

# Drop columns with more than 50% missing
threshold = 0.5
cols_to_drop = [col for col in df.columns if df[col].isnull().sum() / len(df) > threshold]
df = df.drop(columns=cols_to_drop)
print(f"Dropped columns: {cols_to_drop}")

This prevents imputation from fabricating too much information where there is none.
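The same rule can be expressed with the thresh argument of dropna, which states the minimum number of non-missing values a column needs in order to be kept. The toy DataFrame and its column names are assumptions for illustration.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy data: 25%, 50%, and 75% missing respectively
toy = pd.DataFrame({
    "mostly_present": [1.0, 2.0, np.nan, 4.0],
    "half_missing":   [1.0, np.nan, np.nan, 4.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0],
})

max_missing_share = 0.5
# A column survives if it has at least this many non-missing values
min_non_null = math.ceil(len(toy) * (1 - max_missing_share))

kept = toy.dropna(axis=1, thresh=min_non_null)
print(list(kept.columns))
```

Here the column that is exactly 50% missing is kept, matching the "more than 50%" rule, and only the 75%-missing column is dropped.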

Strategy 3: Imputation with Numeric Methods

Imputation replaces missing values with educated guesses. For numeric columns, common strategies are mean, median, or forward-fill (for time series).

from sklearn.impute import SimpleImputer

# Create an imputer for mean replacement
imputer_mean = SimpleImputer(strategy='mean')
df[['age', 'income']] = imputer_mean.fit_transform(df[['age', 'income']])

# Alternative: median (robust to outliers)
imputer_median = SimpleImputer(strategy='median')
df[['price']] = imputer_median.fit_transform(df[['price']])

# Forward-fill for time series (carry the previous value forward; sort by time first)
# fillna(method='ffill') is deprecated in recent pandas; use .ffill() instead
df['metric'] = df['metric'].ffill()

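Mean imputation is fast but assumes missing values are Missing Completely At Random. Median imputation is more robust to outliers. Both preserve sample size, but they shrink the column's variance and weaken its correlations with other features, because every gap receives the same value.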
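The effect of single-value imputation on spread is easy to check on synthetic data. The sketch below assumes a made-up normally distributed column with roughly 30% of values hidden at random; it also uses SimpleImputer's add_indicator option, which appends a 0/1 column marking which values were filled in so a downstream model can still see where the gaps were.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=(1_000, 1))
x_missing = x.copy()
x_missing[rng.random(1_000) < 0.3, 0] = np.nan   # hide about 30% completely at random

imputer = SimpleImputer(strategy="mean", add_indicator=True)
out = imputer.fit_transform(x_missing)            # column 0: values, column 1: missing flag
imputed, was_missing = out[:, 0], out[:, 1]

std_observed = np.nanstd(x_missing)               # spread of the values we actually saw
std_imputed = imputed.std()                       # spread after filling gaps with the mean
print(f"std of observed values: {std_observed:.2f}")
print(f"std after mean imputation: {std_imputed:.2f}")
print(f"values flagged as imputed: {int(was_missing.sum())}")
```

The standard deviation after imputation is smaller than that of the observed values, since every filled-in point sits exactly at the mean.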

Strategy 4: K-Nearest Neighbors Imputation

KNN imputation estimates missing values by averaging the K nearest neighbors in feature space. It captures local structure better than global mean/median.

from sklearn.impute import KNNImputer

# KNN imputation (n_neighbors=5 is also the default)
knn_cols = ['age', 'income', 'credit_score']
imputer_knn = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
    imputer_knn.fit_transform(df[knn_cols]),
    columns=knn_cols,
    index=df.index,  # keep the original row labels
)

print(f"Missing values after KNN: {df_imputed.isnull().sum().sum()}")

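KNN respects multivariate relationships: if age and income are correlated, KNN leverages that. It performs well on datasets with moderate missingness (5-20%). Because neighbors are found by distance, scale the features first so that a wide-ranged column such as income does not dominate the neighbor search.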
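One way to combine scaling with KNN imputation is sketched below on a made-up six-row table (the values are assumptions). It relies on StandardScaler ignoring NaN when fitting and passing it through when transforming, then maps the imputed values back to the original units.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "age":          [25, 32, 47, 51, 38, 29],
    "income":       [40_000, 52_000, np.nan, 98_000, 61_000, np.nan],
    "credit_score": [640, np.nan, 720, 760, 690, 655],
}, dtype=float)

scaler = StandardScaler()               # NaN is skipped in fit and preserved in transform
scaled = scaler.fit_transform(toy)

imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)

# Undo the scaling so the result is back in years, currency, and score points
imputed = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    columns=toy.columns,
    index=toy.index,
)
print(imputed.round(0))
```

Observed values come back unchanged, and each filled-in value is an average of the corresponding values of the two nearest rows.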

Strategy 5: Iterative Imputation (MICE)

Multivariate Imputation by Chained Equations (MICE) treats imputation as a prediction problem. It iteratively models each column with missing values as a function of the others, refining estimates over multiple passes.

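from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come before the next import)
from sklearn.impute import IterativeImputer

# IterativeImputer needs numeric input, so restrict it to numeric columns
num_cols = df.select_dtypes(include='number').columns

# Iterative imputation with up to 10 rounds
imputer_iter = IterativeImputer(max_iter=10, random_state=42)
df_imputed = pd.DataFrame(
    imputer_iter.fit_transform(df[num_cols]),
    columns=num_cols,
    index=df.index,  # keep the original row labels
)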

print("Imputation complete via MICE")

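MICE is more computationally expensive but produces higher-quality imputations when multiple columns have missing data. Use it when time-to-train is not a constraint. Note that scikit-learn's IterativeImputer is still marked experimental, which is why the enable_iterative_imputer import is required, and by default it returns a single imputed dataset rather than the multiple imputations of classical MICE.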
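To get closer to the multiple-imputation idea, one option is to run IterativeImputer several times with sample_posterior=True and different seeds, then look at how much a statistic varies across the completed datasets. This is a rough sketch on synthetic data (the two correlated columns and the 25% missing rate are assumptions), not a full MICE workflow with pooled estimates.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = 2 * x1 + rng.normal(scale=0.5, size=n)      # x2 is predictable from x1
X_missing = np.column_stack([x1, x2])
X_missing[rng.random(n) < 0.25, 1] = np.nan      # hide about 25% of x2

completed = []
for seed in range(5):
    # sample_posterior=True draws each imputation from the predictive
    # distribution instead of always using the point prediction
    imp = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    completed.append(imp.fit_transform(X_missing))

means = np.array([c[:, 1].mean() for c in completed])
print("mean of x2 in each completed dataset:", means.round(3))
print("spread across imputations:", round(float(means.std()), 4))
```

The spread across runs gives a sense of how uncertain the imputed values are, which a single imputation hides.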

Strategy 6: Categorical Imputation

For categorical columns, you typically use the mode (most frequent value) or a dedicated "Missing" category.

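# Option A: mode imputation for a categorical column
# (assign the result back; fillna(..., inplace=True) on a column selection is unreliable in recent pandas)
df['city'] = df['city'].fillna(df['city'].mode()[0])

# Option B (use instead of A): create an explicit "Missing" category
df['city'] = df['city'].fillna('Missing')

# Option C: SimpleImputer with strategy='most_frequent' (same result as A)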
from sklearn.impute import SimpleImputer
imputer_cat = SimpleImputer(strategy='most_frequent')
df[['city']] = imputer_cat.fit_transform(df[['city']])

The "Missing" approach is powerful in tree-based models (Random Forest, XGBoost) because the model can learn that "Missing" is predictive in itself.
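A compact way to feed the "Missing" category to a model is to chain a constant-fill imputer with a one-hot encoder. The sketch below uses a made-up city column; the pipeline layout is one reasonable arrangement, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column for illustration only
cities = pd.DataFrame({"city": ["Oslo", np.nan, "Bergen", "Oslo", np.nan]}, dtype=object)

encode_city = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="Missing"),  # gaps become their own label
    OneHotEncoder(handle_unknown="ignore"),                    # one 0/1 column per label
)

encoded = encode_city.fit_transform(cities).toarray()
categories = list(encode_city[-1].categories_[0])
print(categories)
print(encoded)
```

The resulting matrix contains a dedicated indicator column for "Missing", so a tree-based model can split on it like on any other category.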

Choosing the Right Strategy

| Missing % | Data Type | Recommended Strategy | Trade-off |
|---|---|---|---|
| 0–1% | Any | Delete rows | Minimal data loss, simplest |
| 1–5% | Numeric | Mean/median/KNN | Fast, preserves distribution |
| 5–20% | Numeric | KNN or MICE | Better multivariate capture |
| 20–50% | Numeric | MICE or flag + mean | Caution: high uncertainty |
| 50%+ | Any | Delete column | Safe, removes unreliable feature |
| Any % | Categorical | Mode or "Missing" | Leverages tree-model learning |
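The table can be turned into a small lookup helper. The function name and the tie-breaking order are assumptions: where rows overlap (for example a categorical column with more than 50% missing), this sketch applies the "delete column" and "delete rows" rules first. Treat it as a starting heuristic rather than a rule.

```python
def recommend_strategy(missing_share: float, is_categorical: bool) -> str:
    """Map a column's missing share (0.0 to 1.0) to the table's suggestion.

    Hypothetical helper; the precedence between overlapping rows is an assumption.
    """
    if not 0.0 <= missing_share <= 1.0:
        raise ValueError("missing_share must be between 0 and 1")
    if missing_share > 0.5:
        return "delete column"
    if missing_share <= 0.01:
        return "delete rows"
    if is_categorical:
        return 'mode or "Missing" category'
    if missing_share <= 0.05:
        return "mean, median, or KNN"
    if missing_share <= 0.20:
        return "KNN or MICE"
    return "MICE, or missing flag plus mean"


for share, categorical in [(0.005, False), (0.03, False), (0.10, True), (0.30, False), (0.80, True)]:
    print(f"{share:.1%} missing, categorical={categorical}: {recommend_strategy(share, categorical)}")
```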

Key Takeaways

  • Missing values cause algorithm errors and bias—identify and handle them before model training.
  • Use df.isnull().sum() and visualization to see where data are missing, then reason about the mechanism (MCAR vs. MAR vs. MNAR) from how the data were collected.
  • Delete rows only for rare, scattered missingness (less than 5%); delete columns if more than 50% is missing.
  • Impute numeric data with mean/median for simplicity or KNN/MICE for multivariate relationships.
  • For categorical data, use mode imputation or create a "Missing" category that tree models can learn from.
  • Always validate imputation: compare model performance on imputed data against deletion; hold-out test sets catch bad imputations.
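The last takeaway can be sketched as a cross-validated comparison. Everything here is synthetic and assumed for illustration: a linear target, three numeric features, and 10% of cells hidden completely at random. Placing the imputer inside a pipeline means it is refit on each training fold, so no fold's held-out rows influence the fill values.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)
X_missing = X.copy()
X_missing[rng.random((n, 3)) < 0.1] = np.nan      # 10% of cells missing, MCAR

scores = {}

# Deletion baseline: keep only rows with no missing value
complete = ~np.isnan(X_missing).any(axis=1)
scores["delete rows"] = cross_val_score(
    LinearRegression(), X_missing[complete], y[complete], cv=5
).mean()

# Imputation candidates: imputer + model in one pipeline
for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                      ("knn", KNNImputer(n_neighbors=5))]:
    model = make_pipeline(imputer, LinearRegression())
    scores[name] = cross_val_score(model, X_missing, y, cv=5).mean()

for name, r2 in scores.items():
    print(f"{name}: mean R^2 = {r2:.3f}")
```

One caveat on reading the numbers: the deletion score is measured only on complete rows, while the imputation scores also cover rows with gaps, so the comparison is indicative rather than exact. A deletion-trained model also cannot score new rows that contain missing values.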

Frequently Asked Questions

Should I impute before or after splitting into train/test sets?

Fit the imputer on training data only, then transform both train and test. Fitting on the combined dataset leaks information from the test set into imputation parameters (e.g., mean age). Use fit_transform() on train and .transform() on test to prevent leakage.
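A minimal sketch of that order of operations, on a made-up age column with about 20% of values hidden:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
age = rng.normal(40, 10, size=(200, 1))
age[rng.random(200) < 0.2, 0] = np.nan            # hide about 20% of values

# Split first, so the test rows never influence the imputer
age_train, age_test = train_test_split(age, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="mean")
age_train_filled = imputer.fit_transform(age_train)   # learns the mean from train only
age_test_filled = imputer.transform(age_test)         # reuses the train mean on test

train_mean = np.nanmean(age_train)
print(f"mean learned by the imputer: {imputer.statistics_[0]:.2f}")
print(f"mean of observed train ages: {train_mean:.2f}")
```

The value stored in statistics_ matches the training mean, and that same value fills the gaps in the test set.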

What if a column is 80% missing?

Delete it. Imputation across 80% of a column invents too much information. The risk that your imputed values mislead the model outweighs any benefit of including the feature.

Does KNN imputation work on categorical data?

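Not directly. KNNImputer works on numeric arrays, so you must encode categorical columns first (e.g., one-hot or ordinal encoding), impute, and then decode. Because an imputed value is an average over neighbors, it can fall between category codes, so round it to a valid code before decoding. If that feels fragile, mode imputation or a "Missing" category is usually the simpler choice.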

Can I use the mean of the test set for imputation?

Never. Calculate imputation parameters (mean, median, KNN neighbors) using only the training set. Applying test-set statistics to the test set itself is data leakage and produces overly optimistic metrics.

Further Reading