End-to-End Feature Engineering Workflow

Feature engineering is not a one-off task; it's an iterative workflow. You explore data, preprocess, create features, select the best ones, validate, and refine. This article walks through a complete, production-ready pipeline that integrates lessons from all previous articles: handling missing values, encoding categoricals, scaling, creating features, selecting features, avoiding leakage, and validating properly.

The Feature Engineering Workflow

1. Explore & Understand

2. Handle Missing Values

3. Encode Categorical Variables

4. Create New Features

5. Handle Outliers

6. Scale Features

7. Select Features

8. Train & Validate

9. Iterate & Refine

Step 1: Explore and Understand

Begin by understanding your data: shape, data types, missing values, distributions, and correlations.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv('housing.csv')

# Basic exploration
print(f"Shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nStatistics:\n{df.describe()}")

# Identify numeric and categorical columns
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

print(f"\nNumeric columns: {numeric_cols}")
print(f"Categorical columns: {categorical_cols}")

# Visualize distributions
# squeeze=False keeps axes 2-D, so the loop also works with a single numeric column
fig, axes = plt.subplots(len(numeric_cols), 1, figsize=(10, 4 * len(numeric_cols)), squeeze=False)
for col, ax in zip(numeric_cols, axes.ravel()):
    df[col].hist(ax=ax, bins=30)
    ax.set_title(col)
plt.tight_layout()
plt.show()

# Correlation heatmap (numeric features only)
plt.figure(figsize=(10, 8))
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm', center=0)
plt.show()

Exploration reveals data quality issues, distributions, and relationships. Use this knowledge to guide preprocessing decisions.
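One way to turn exploration into preprocessing decisions is a per-column summary of missing fractions and cardinality. The sketch below is illustrative only: the `summarize_columns` helper and the tiny `sample` frame are hypothetical and not part of the housing dataset.

```python
import numpy as np
import pandas as pd

def summarize_columns(frame):
    """Return dtype, fraction of missing values, and distinct-value count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing_frac": frame.isnull().mean(),
        "n_unique": frame.nunique(),  # nunique ignores missing values
    })

# Hypothetical toy data standing in for a real dataset
sample = pd.DataFrame({
    "sqft": [800.0, np.nan, 1500.0, 2000.0],
    "neighborhood": ["north", "south", None, "north"],
})

summary = summarize_columns(sample)
print(summary)
```

Columns with a high missing fraction are candidates for imputation or removal, and a high `n_unique` on a categorical column suggests that one-hot encoding may produce too many columns.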

Step 2: Split Data Early

Split train/test (or use stratified K-fold) BEFORE any preprocessing to prevent leakage.

from sklearn.model_selection import train_test_split, StratifiedKFold

# Separate features and target
X = df.drop('price', axis=1)
y = df['price']

# price is continuous, so a plain random split is used here;
# for a classification target, pass stratify=y to preserve class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Train shape: {X_train.shape}, Test shape: {X_test.shape}")

Splitting early ensures that every preprocessing step fits only on training data.
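To make the leakage concrete, the following sketch compares the statistics a scaler learns from all rows with those it learns from training rows only. The data is synthetic and the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 1))
y_demo = rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Leaky: statistics computed on all rows, including the held-out test rows
leaky_scaler = StandardScaler().fit(X_demo)

# Correct: statistics computed on training rows only
clean_scaler = StandardScaler().fit(X_tr)

print(f"Mean learned from all rows:   {leaky_scaler.mean_[0]:.3f}")
print(f"Mean learned from train rows: {clean_scaler.mean_[0]:.3f}")
```

The two means differ because the leaky scaler has seen the test rows. The difference is small here, but the same mechanism can inflate validation scores noticeably with target-dependent steps such as feature selection.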

Step 3: Handle Missing Values

Use the patterns identified in exploration to choose an imputation strategy.

from sklearn.impute import SimpleImputer, KNNImputer

# Identify numeric and categorical columns
numeric_cols = X_train.select_dtypes(include=['number']).columns.tolist()
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()

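# Work on explicit copies so the column assignments below do not trigger
# pandas' SettingWithCopyWarning on the frames returned by train_test_split
X_train, X_test = X_train.copy(), X_test.copy()

# Impute numeric features (median is robust to outliers)
imputer_numeric = SimpleImputer(strategy='median')
X_train[numeric_cols] = imputer_numeric.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = imputer_numeric.transform(X_test[numeric_cols])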

# Impute categorical features (mode)
imputer_categorical = SimpleImputer(strategy='most_frequent')
X_train[categorical_cols] = imputer_categorical.fit_transform(X_train[categorical_cols])
X_test[categorical_cols] = imputer_categorical.transform(X_test[categorical_cols])

print("Missing values after imputation:")
print(X_train.isnull().sum())

Use separate imputers for numeric and categorical features. Fit on training data, transform on test data.
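As a small illustration of why the median is preferred here, the sketch below fits an imputer on a toy training column that contains one extreme value, then fills a missing test value with the training median. The `sqft` numbers are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data: one missing value and one extreme value (9000) in training
train_demo = pd.DataFrame({"sqft": [800.0, 1200.0, np.nan, 1500.0, 9000.0]})
test_demo = pd.DataFrame({"sqft": [np.nan, 1000.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train_demo)                      # learns the median from training rows only
test_filled = imputer.transform(test_demo)   # reuses that training median

train_mean = train_demo["sqft"].mean()       # pulled upward by the extreme value

print(imputer.statistics_)  # [1350.]
print(train_mean)           # 3125.0
print(test_filled)          # missing test value becomes 1350.0
```

The median (1350) stays close to typical values, while the mean (3125) is dragged toward the outlier.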

Step 4: Encode Categorical Variables

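Choose encoding based on cardinality and algorithm type. One-hot encoding suits low-cardinality features; for high-cardinality features it can produce a very wide matrix, so ordinal or target encoding is often a better fit there.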

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One-hot encode low-cardinality categorical features
low_cardinality = categorical_cols # Assume all are low-cardinality for simplicity
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_train_encoded = encoder.fit_transform(X_train[low_cardinality])
X_test_encoded = encoder.transform(X_test[low_cardinality])

# Get feature names
encoded_cols = encoder.get_feature_names_out(low_cardinality)

# Create DataFrames with encoded features
X_train_encoded = pd.DataFrame(X_train_encoded, columns=encoded_cols, index=X_train.index)
X_test_encoded = pd.DataFrame(X_test_encoded, columns=encoded_cols, index=X_test.index)

# Combine with numeric features
X_train = pd.concat([X_train[numeric_cols], X_train_encoded], axis=1)
X_test = pd.concat([X_test[numeric_cols], X_test_encoded], axis=1)

print(f"Shape after encoding: {X_train.shape}")

Encoding must fit on training data. Use handle_unknown='ignore' to handle unseen categories in test data gracefully.

Step 5: Handle Outliers

Use robust methods (IQR, isolation forest) to detect and handle outliers. For this example, we'll use IQR.

# Detect and handle outliers in numeric features
for col in numeric_cols:
    Q1 = X_train[col].quantile(0.25)
    Q3 = X_train[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Cap outliers instead of removing (preserves data)
    X_train[col] = X_train[col].clip(lower=lower_bound, upper=upper_bound)
    X_test[col] = X_test[col].clip(lower=lower_bound, upper=upper_bound)

print("Outliers capped")

Capping is gentler than removal; it preserves all rows while reducing outlier extremeness.
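The capping rule can be checked by hand on a small series. The sketch below wraps the bounds in a hypothetical `iqr_bounds` helper and uses made-up values; the bounds come from the training series and are then applied to the test series.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) capping bounds based on the interquartile range."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

train_values = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 200.0])
test_values = pd.Series([5.0, 15.0, 500.0])

# Q1 = 14, Q3 = 22, IQR = 8, so the bounds are 14 - 12 = 2 and 22 + 12 = 34
lower, upper = iqr_bounds(train_values)

capped_train = train_values.clip(lower=lower, upper=upper)
capped_test = test_values.clip(lower=lower, upper=upper)  # training bounds reused

print(lower, upper)          # 2.0 34.0
print(capped_test.tolist())  # [5.0, 15.0, 34.0]
```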

Step 6: Scale Features

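Scaling matters for distance-based algorithms (k-nearest neighbors, SVMs) and gradient-based or regularized ones (ridge regression, neural networks). Tree-based models such as random forests are largely insensitive to it.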

from sklearn.preprocessing import StandardScaler

# Scale numeric features only
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

print("Features scaled")

Fit the scaler on training data; apply to test data with training statistics.
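"Training statistics" means the test values are shifted and divided by the training mean and standard deviation. A minimal check on toy numbers, comparing StandardScaler with the manual formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_vals = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
test_vals = np.array([[3.0], [10.0]])

sc = StandardScaler().fit(train_vals)   # learns mean and std from training data
scaled_test = sc.transform(test_vals)

# Same computation by hand: z = (x - train_mean) / train_std (population std)
manual = (test_vals - train_vals.mean(axis=0)) / train_vals.std(axis=0)

print(sc.mean_, sc.scale_)
print(scaled_test.ravel())
```

The scaled test set will generally not have mean 0 and variance 1. That is expected, since the statistics describe the training distribution.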

Step 7: Create Domain-Driven Features

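Engineer features based on domain knowledge. Ratios and products are only interpretable on raw values, so in a real workflow compute them before scaling and then scale the new columns along with the rest. In this walkthrough the numeric columns were already standardized in Step 6, so treat the snippet as a template for the syntax rather than for the ordering.

# Example: create ratio and interaction features (if applicable)
# For housing data. These formulas assume raw, non-negative counts and areas;
# on standardized columns the "+ 1" no longer guards against division by zero.
if 'bedrooms' in X_train.columns and 'bathrooms' in X_train.columns:
    X_train['bed_bath_ratio'] = X_train['bedrooms'] / (X_train['bathrooms'] + 1)
    X_test['bed_bath_ratio'] = X_test['bedrooms'] / (X_test['bathrooms'] + 1)

if 'sqft' in X_train.columns and 'bedrooms' in X_train.columns:
    X_train['sqft_per_bed'] = X_train['sqft'] / (X_train['bedrooms'] + 1)
    X_test['sqft_per_bed'] = X_test['sqft'] / (X_test['bedrooms'] + 1)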

print(f"Shape after feature engineering: {X_train.shape}")

Feature engineering is domain-specific. Use your understanding of the problem to invent meaningful features.
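A minimal sketch of ratio features computed on raw values and scaled afterwards, using a made-up three-row frame. In a real workflow the scaler would be fit on the training rows only; a single frame is used here to keep the example short.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw (unscaled) housing rows
homes = pd.DataFrame({
    "sqft": [1000.0, 1500.0, 2400.0],
    "bedrooms": [1.0, 2.0, 3.0],
    "bathrooms": [1.0, 1.0, 2.0],
})

# Ratios on raw values; "+ 1" avoids division by zero when a count is 0
homes["bed_bath_ratio"] = homes["bedrooms"] / (homes["bathrooms"] + 1)
homes["sqft_per_bed"] = homes["sqft"] / (homes["bedrooms"] + 1)

# Scale only after the engineered columns exist
scaled = pd.DataFrame(
    StandardScaler().fit_transform(homes),
    columns=homes.columns,
    index=homes.index,
)

print(homes[["bed_bath_ratio", "sqft_per_bed"]])
print(scaled.round(3))
```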

Step 8: Select Features

Use univariate and model-based feature selection to reduce dimensionality.

from sklearn.feature_selection import mutual_info_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Mutual information for numeric targets
mi_scores = mutual_info_regression(X_train, y_train, random_state=42)
mi_df = pd.DataFrame({'feature': X_train.columns, 'mi_score': mi_scores}).sort_values('mi_score', ascending=False)

print("Top 20 features by mutual information:")
print(mi_df.head(20))

# Select top K features
top_k = 20
top_features = mi_df.head(top_k)['feature'].tolist()
X_train = X_train[top_features]
X_test = X_test[top_features]

print(f"\nShape after feature selection: {X_train.shape}")

Feature selection reduces overfitting, training time, and dimensionality. Start with univariate scores; refine with model-based methods.
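For the model-based refinement, one option is recursive feature elimination (RFE), which repeatedly fits a model and drops the least important feature. The sketch below runs it on synthetic data from make_regression; the feature names `f0`..`f9` and the choice of three features are arbitrary assumptions.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic data: 10 features, of which 3 carry signal
X_syn, y_syn = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=42
)
X_syn = pd.DataFrame(X_syn, columns=[f"f{i}" for i in range(10)])

# Drop one feature per round until 3 remain
rfe = RFE(
    estimator=RandomForestRegressor(n_estimators=50, random_state=42),
    n_features_to_select=3,
    step=1,
)
rfe.fit(X_syn, y_syn)

selected = X_syn.columns[rfe.support_].tolist()
print(f"Selected features: {selected}")
print(f"Ranking (1 = kept): {rfe.ranking_.tolist()}")
```

RFE is slower than univariate scoring because it refits the model in every round, so it is usually applied after a cheaper filter has narrowed the candidates.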

Step 9: Build and Validate

Use a pipeline to ensure leakage-free training and evaluation.

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

# Preprocessing was already applied above, so the models are evaluated directly here.
# (In a real workflow, you'd include preprocessing and feature selection in a Pipeline
# so that they are refit inside every cross-validation fold.)

# Test multiple models
models = {
    'Ridge': Ridge(alpha=1.0),
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42),
    'GradientBoosting': GradientBoostingRegressor(n_estimators=100, random_state=42)
}

for name, model in models.items():
    # Cross-validation on training data
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
    print(f"{name}: CV R² = {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

    # Train on full training set and evaluate on test set
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    test_r2 = r2_score(y_test, y_pred)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"  Test R² = {test_r2:.4f}, RMSE = {test_rmse:.4f}\n")

Cross-validation on training data guides model selection. Pick the model from the CV scores, not the test scores: the test score is an honest estimate of generalization only for a model chosen without looking at it.
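A sketch of what "use a pipeline" looks like during cross-validation: scaling and feature selection sit inside the estimator, so each fold refits them on its own training portion. The data is synthetic, and f_regression is swapped in for mutual information here only because it is fast and deterministic.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: 30 features, 5 informative
X_syn, y_syn = make_regression(
    n_samples=300, n_features=30, n_informative=5, noise=10.0, random_state=0
)

cv_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),                    # refit per fold
    ("selector", SelectKBest(f_regression, k=10)),   # refit per fold, sees y of that fold only
    ("model", Ridge(alpha=1.0)),
])

cv_scores = cross_val_score(cv_pipeline, X_syn, y_syn, cv=5, scoring="r2")
print(f"CV R² = {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
```

Selecting features once on the whole training set and then cross-validating, as in the step-by-step code above, lets every fold benefit from labels it is supposed to be validated on, so the CV estimate tends to be optimistic.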

Step 10: Hyperparameter Tuning

Tune the best model's hyperparameters using grid search.

from sklearn.model_selection import GridSearchCV

# Tune the best model (assume RandomForest is best)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV R²: {grid_search.best_score_:.4f}")

# Evaluate best model on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print(f"Test R²: {r2_score(y_test, y_pred):.4f}")

Grid search finds the best combination within the grid you specify, scored by cross-validation; it cannot find values you did not list. Always evaluate the final model on the held-out test set.
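Grid search also works on a whole pipeline: parameters of a step are addressed as `<step name>__<parameter>`. The sketch below uses synthetic data and a deliberately small grid; the step names "scaler" and "model" are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_syn, y_syn = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

tuned_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(random_state=42)),
])

# "model__max_depth" targets the max_depth parameter of the "model" step
pipeline_grid = {
    "model__n_estimators": [25, 50],
    "model__max_depth": [5, None],
}

search = GridSearchCV(tuned_pipeline, pipeline_grid, cv=3, scoring="r2")
search.fit(X_tr, y_tr)

print(f"Best parameters: {search.best_params_}")
print(f"Best CV R²: {search.best_score_:.4f}")
print(f"Test R²: {search.score(X_te, y_te):.4f}")
```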

Step 11: Iterate and Refine

If test performance is poor, revisit earlier steps.

Does the model overfit? → More feature selection, less complex model
Does the model underfit? → More features, more complex model
Are predictions biased? → Check for leakage, re-examine features
Do predictions explode on new data? → Validate assumptions, check outliers

Iteration is normal. Rarely is a pipeline perfect on the first try.
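One way to answer the overfit/underfit questions above is to compare training and validation scores from the same cross-validation run. The sketch below uses noisy synthetic data and an unconstrained random forest, so a visible gap is expected; the numbers themselves carry no meaning beyond illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Small, noisy synthetic dataset where a flexible model can memorize noise
X_syn, y_syn = make_regression(
    n_samples=150, n_features=20, n_informative=5, noise=30.0, random_state=0
)

cv_result = cross_validate(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X_syn, y_syn,
    cv=5,
    scoring="r2",
    return_train_score=True,  # also score each fold's training portion
)

train_r2 = cv_result["train_score"].mean()
val_r2 = cv_result["test_score"].mean()
gap = train_r2 - val_r2

print(f"Train R² = {train_r2:.4f}, validation R² = {val_r2:.4f}, gap = {gap:.4f}")
```

A high training score with a much lower validation score points to overfitting. Two low scores that sit close together point to underfitting.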

Complete Production Pipeline (with Preprocessing)

Here's a cleaner version using scikit-learn pipelines for production:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

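# Define preprocessing
# Take column lists from X, not df: df still contains the target 'price',
# which is not a column of X and would make the ColumnTransformer fail
numeric_features = X.select_dtypes(include=['number']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()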

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# sparse_output=False keeps the combined matrix dense; with sparse input,
# mutual_info_regression would treat every feature as discrete by default
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selector', SelectKBest(mutual_info_regression, k=20)),
    ('model', RandomForestRegressor(n_estimators=100, random_state=42))
])

# Train and evaluate (no leakage!)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"Test R²: {score:.4f}")

This pipeline is production-ready: preprocessing and feature selection fit on training data; the entire pipeline is reproducible and leakage-free.

Key Takeaways

  • Follow the workflow: explore, split, preprocess, engineer, select, validate, iterate.
  • Always split train/test BEFORE preprocessing to prevent leakage.
  • Fit all transformers (imputers, scalers, encoders) on training data only.
  • Use scikit-learn pipelines to automate preprocessing and prevent leakage.
  • Cross-validate on training data to guide model selection; evaluate on held-out test data.
  • Feature engineering is iterative; refine based on validation performance.
  • Save the fitted pipeline for production; apply it to new data using .predict(), which runs every preprocessing step before the model.

Frequently Asked Questions

How often should I revisit feature engineering?

Every time model performance plateaus or decreases. Add new features, remove weak ones, try different encodings, adjust scaling. Iteration is part of the process.

What if I don't have domain knowledge about the problem?

Start with exploratory data analysis (EDA): correlations, distributions, patterns. Use univariate feature selection to identify strong predictive signals. As you understand the data better, add domain-informed features.

Should I save the pipeline or individual transformers?

Save the entire pipeline using joblib.dump(pipeline, 'model.pkl'). When you encounter new data, load the pipeline and call .predict(). The pipeline applies preprocessing automatically.
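A minimal round trip, assuming a small stand-in pipeline and made-up rows rather than the full housing pipeline. The column names, the Ridge model, and the temporary directory are choices made for this sketch.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training rows
train_df = pd.DataFrame({
    "sqft": [800.0, 1200.0, np.nan, 1500.0, 2000.0, 950.0],
    "neighborhood": ["north", "south", "north", "east", "south", "east"],
})
train_price = pd.Series([100.0, 150.0, 130.0, 180.0, 240.0, 115.0])

numeric_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
pre = ColumnTransformer(transformers=[
    ("num", numeric_steps, ["sqft"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["neighborhood"]),
])
small_pipeline = Pipeline(steps=[("preprocessor", pre), ("model", Ridge(alpha=1.0))])
small_pipeline.fit(train_df, train_price)

# Save and reload the whole fitted pipeline as one object
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(small_pipeline, model_path)
    loaded = joblib.load(model_path)

# New raw rows: a missing value and an unseen category are handled by the pipeline
new_rows = pd.DataFrame({"sqft": [1100.0, np.nan], "neighborhood": ["south", "west"]})
original_pred = small_pipeline.predict(new_rows)
loaded_pred = loaded.predict(new_rows)
print(loaded_pred)
```

The reloaded pipeline returns the same predictions as the original, because the imputer medians, scaler statistics, and encoder categories are stored along with the model.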

How do I handle new categories at prediction time?

Use handle_unknown='ignore' in OneHotEncoder. Unseen categories become all zeros. Alternatively, pre-bin rare categories into an "Other" category during training.
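The "Other" approach can be sketched with two small helpers: one learns which categories are frequent in the training data, the other maps everything else to a shared label. Both helper names and the `min_count` threshold are hypothetical.

```python
import pandas as pd

def fit_frequent_categories(series, min_count=2):
    """Learn the set of categories seen at least min_count times in training."""
    counts = series.value_counts()
    return set(counts[counts >= min_count].index)

def apply_rare_binning(series, frequent, other_label="Other"):
    """Keep frequent categories; map rare or unseen ones to other_label."""
    return series.where(series.isin(frequent), other_label)

train_city = pd.Series(["A", "A", "B", "B", "B", "C"])
test_city = pd.Series(["A", "C", "D"])  # "D" never appears in training

frequent = fit_frequent_categories(train_city, min_count=2)
binned_train = apply_rare_binning(train_city, frequent)
binned_test = apply_rare_binning(test_city, frequent)

print(sorted(frequent))       # ['A', 'B']
print(binned_train.tolist())  # ['A', 'A', 'B', 'B', 'B', 'Other']
print(binned_test.tolist())   # ['A', 'Other', 'Other']
```

Because "Other" already exists as a training category, unseen values at prediction time land in a bucket the model has learned something about, rather than an all-zero row.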

Further Reading