Feature Engineering: Creating New Features in Python
Raw features often don't capture the true signal. An age of 35 matters differently in a credit-approval model than in a product-recommendation system. A restaurant's latitude and longitude mean little in isolation but become powerful when combined into a distance from downtown. Feature creation, deriving new features from existing ones, is where domain expertise meets machine learning, and it is why many practitioners report spending more of their time engineering features than selecting models.
Domain-Driven Feature Engineering
The most impactful features come from a deep understanding of the problem. If you're predicting house prices, the raw bedroom and bathroom counts matter, but an experienced realtor knows that the ratio of bedrooms to bathrooms, whether the kitchen was recently updated, and proximity to transit can be more predictive.
import pandas as pd
import numpy as np
# Real estate example
df = pd.DataFrame({
'bedrooms': [3, 4, 2, 5],
'bathrooms': [1.5, 2.5, 1, 3],
'sqft': [1500, 2000, 1200, 3000],
'year_built': [1990, 2010, 1985, 2020]
})
# Create domain-informed features
df['bed_bath_ratio'] = df['bedrooms'] / df['bathrooms'] # Design insight
df['sqft_per_bed'] = df['sqft'] / df['bedrooms'] # Density
df['age'] = 2026 - df['year_built'] # Property age
df['modern'] = (df['age'] < 10).astype(int) # Binary: recently built?
print(df[['bed_bath_ratio', 'sqft_per_bed', 'age', 'modern']])
The key: these features encode knowledge about the domain. A real estate expert would immediately recognize that bed_bath_ratio and sqft_per_bed capture meaningful variance. This is harder to automate and requires domain experience.
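The distance-from-downtown idea from the introduction can be sketched in the same spirit. This is a minimal illustration rather than part of the original example: the listing coordinates, the downtown reference point, and the `haversine_km` helper are hypothetical names and values chosen for demonstration.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))

# Hypothetical listings; coordinates are made up for illustration
listings = pd.DataFrame({
    'lat': [40.7128, 40.7580, 40.6782],
    'lon': [-74.0060, -73.9855, -73.9442],
})
downtown_lat, downtown_lon = 40.7128, -74.0060  # assumed reference point

# Two raw columns collapse into one feature a model can use directly
listings['km_from_downtown'] = haversine_km(
    listings['lat'], listings['lon'], downtown_lat, downtown_lon
)
print(listings)
```

Neither coordinate alone says much about desirability, but the single derived distance column expresses the relationship a domain expert actually cares about.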
Polynomial Features
Polynomial features capture non-linear relationships. If the relationship between income and spending is quadratic (spending accelerates as income rises), a linear model on raw income underfits, but the same linear model given an income² column can fit the curve.
from sklearn.preprocessing import PolynomialFeatures
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
# Create polynomial features up to degree 2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(f"Original:\n{X.ravel()}")
print(f"Polynomial (degree 2):\n{X_poly}")
# Columns: [x, x^2]
# [[1. 1.]
# [2. 4.]
# [3. 9.]
# [4. 16.]
# [5. 25.]]
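Polynomial features are powerful but risky. A degree-3 expansion of 10 original features produces 286 columns (including the bias column), because it adds every product of up to three features. Use polynomial expansion only for small feature sets, and use cross-validation to check that the extra columns are not overfitting.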
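The growth in column count follows a simple combinatorial formula: the number of monomials of total degree at most d in n variables is C(n + d, d). The sketch below uses only the standard library; the helper name `n_polynomial_features` is hypothetical and is not a scikit-learn function.

```python
from math import comb

def n_polynomial_features(n_features, degree, include_bias=True):
    """Count the monomials of total degree <= `degree` in `n_features` variables."""
    total = comb(n_features + degree, degree)
    # The constant (bias) term is one of those monomials; drop it if not wanted
    return total if include_bias else total - 1

# How quickly the expansion grows for 10 input features
for d in (1, 2, 3, 4):
    print(f"degree={d}: {n_polynomial_features(10, d)} columns")
```

Running a check like this before calling `fit_transform` is a cheap way to see whether an expansion is about to outgrow the number of rows in the dataset.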
# Full example with cross-validation
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
np.random.seed(42)  # fixed seed so the scores are reproducible
X = np.random.randn(100, 2)
y = 3 * X[:, 0]**2 + 2 * X[:, 1] + np.random.randn(100) * 0.1
# Model 1: Linear (no polynomial features)
model_linear = LinearRegression()
cv_linear = cross_val_score(model_linear, X, y, cv=5, scoring='r2')
# Model 2: With polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)  # LinearRegression already fits an intercept
X_poly = poly.fit_transform(X)
model_poly = LinearRegression()
cv_poly = cross_val_score(model_poly, X_poly, y, cv=5, scoring='r2')
print(f"Linear R²: {cv_linear.mean():.4f}")
print(f"Polynomial R²: {cv_poly.mean():.4f}")
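One way to structure the same comparison is to place the expansion inside a scikit-learn pipeline, so that each cross-validation fold transforms its own training split. `PolynomialFeatures` learns nothing from the data, so the scores should match the manual approach, but the pipeline habit matters as soon as a step such as scaling is added. The sketch below assumes a fixed seed and the same synthetic data-generating formula as above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)  # fixed seed (an assumption, for reproducibility)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

linear_pipe = make_pipeline(LinearRegression())
poly_pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # intercept comes from LinearRegression
    LinearRegression(),
)

# Each fold fits the whole pipeline on its training split only
r2_linear = cross_val_score(linear_pipe, X, y, cv=5, scoring='r2').mean()
r2_poly = cross_val_score(poly_pipe, X, y, cv=5, scoring='r2').mean()
print(f"Linear R²: {r2_linear:.4f}")
print(f"Polynomial R²: {r2_poly:.4f}")
```

Because the target depends on the square of the first feature, the polynomial pipeline should score far higher than the plain linear model.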
Interaction Features
Interaction features multiply or combine features to capture joint effects. If you're predicting ad click-through rate, the interaction of user age and ad type might be far more predictive than either alone.
# Example: predicting ad clicks
df = pd.DataFrame({
    'age': [20, 30, 40, 50],
    'income': [30000, 60000, 90000, 120000],
    'gender': ['F', 'M', 'F', 'M'],                    # illustrative values
    'product': ['shoes', 'shoes', 'books', 'books'],   # illustrative values
    'clicks': [5, 8, 12, 10]
})
# Create numeric interaction features
df['age_income'] = df['age'] * df['income']
df['income_per_age'] = df['income'] / df['age']
# For categorical interactions (e.g., gender × product_category)
# Use one-hot encoding + manual multiplication
gender_encoded = pd.get_dummies(df['gender'], prefix='gender', dtype=int)
product_encoded = pd.get_dummies(df['product'], prefix='product', dtype=int)
# Create all pairwise interactions: 1 only when both indicators are 1
for gender_col in gender_encoded.columns:
    for product_col in product_encoded.columns:
        interaction_col = f"{gender_col}_X_{product_col}"
        df[interaction_col] = gender_encoded[gender_col] * product_encoded[product_col]
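Explicit interaction features help linear models most, because a linear model cannot represent a joint effect unless it is given the product term. Tree-based models can approximate interactions through successive splits, and neural networks learn them implicitly through hidden layers, so explicit interaction features tend to help those models less.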
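An alternative to multiplying one-hot columns pair by pair is to cross the two categories into a single combined category and encode that. The sketch below is one possible approach using hypothetical ad-impression data; the `ads` frame and its values are illustrative only.

```python
import pandas as pd

# Hypothetical ad-impression data; values are illustrative only
ads = pd.DataFrame({
    'gender': ['F', 'M', 'F', 'M'],
    'product': ['shoes', 'shoes', 'books', 'books'],
})

# Cross the two categories into a single combined category
ads['gender_x_product'] = ads['gender'] + '_' + ads['product']

# One-hot encode the crossed column: one indicator per observed combination
crossed = pd.get_dummies(ads['gender_x_product'], prefix='cross', dtype=int)
ads = pd.concat([ads, crossed], axis=1)
print(ads)
```

This version only creates columns for combinations that actually occur in the data, which keeps the feature count down when most pairs are never observed.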
Binning Continuous Variables
Binning converts continuous variables into categorical bins, useful when the relationship is step-wise (spending jumps at income thresholds) or when you want to reduce the impact of outliers.
# Equal-width binning
age = np.array([5, 15, 25, 35, 45, 55, 65, 75])
age_binned = pd.cut(age, bins=4) # 4 equal-width bins
print(age_binned)
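# Interior edges are evenly spaced at 22.5, 40.0, and 57.5; pandas nudges the
# lowest edge just below 5 so the minimum value falls inside the first bin
# [(4.93, 22.5], (4.93, 22.5], (22.5, 40.0], ...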
# Equal-frequency binning (quartiles)
income = np.array([30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000])
income_quantiles = pd.qcut(income, q=4) # Quartiles (equal frequency)
print(income_quantiles)
# Custom binning (domain-driven)
df = pd.DataFrame({'credit_score': [300, 400, 500, 600, 700, 800]})
df['credit_tier'] = pd.cut(df['credit_score'],
bins=[0, 400, 600, 750, 850],
labels=['poor', 'fair', 'good', 'excellent'])
print(df)
Binning trades granularity for robustness. A continuous variable like age has many distinct values; binning groups them into "young", "middle-aged", "senior". This reduces the number of distinct values the model has to account for and can improve generalization if the true relationship is step-wise, at the cost of discarding the variation within each bin.
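When bins are learned from data, as with quantile binning, the same edges should be reused for new data rather than recomputed. The following is a minimal sketch of that pattern; the `new_income` values are hypothetical, and opening the outer edges to infinity is one possible choice for handling values outside the training range.

```python
import numpy as np
import pandas as pd

train_income = np.array([30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000])
new_income = np.array([20000, 50000, 70000, 150000])  # hypothetical unseen values

# Learn quartile edges from the training data only
_, learned_edges = pd.qcut(train_income, q=4, retbins=True)

# Open the outer edges so values outside the training range still get a bin
edges = np.concatenate(([-np.inf], learned_edges[1:-1], [np.inf]))

# labels=False returns integer bin codes instead of interval labels
train_bin = pd.cut(train_income, bins=edges, labels=False)
new_bin = pd.cut(new_income, bins=edges, labels=False)
print(edges)
print(new_bin)
```

Without the open outer edges, `pd.cut` would return NaN for the 20000 and 150000 values because they fall outside the range seen during training.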
Time-Based Features
For temporal data, extract features from dates: day of week, month, is_holiday, days_since_event.
import pandas as pd
df = pd.DataFrame({
'date': pd.date_range('2026-01-01', periods=10),
'revenue': np.random.rand(10) * 1000
})
df['day_of_week'] = df['date'].dt.dayofweek # 0=Monday, 6=Sunday
df['month'] = df['date'].dt.month
df['quarter'] = df['date'].dt.quarter
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)
df['days_since_start'] = (df['date'] - df['date'].min()).dt.days
print(df)
Time-based features are crucial for time-series models. The day of week often predicts demand (weekends differ from weekdays), and month and quarter capture recurring seasonal patterns. An elapsed-time feature such as days_since_start lets the model pick up a long-term trend.
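The holiday flag and days-since-event features mentioned at the start of this section can be sketched along the same lines. Everything specific here is an assumption for illustration: the `holidays` list is a hand-written placeholder rather than an authoritative calendar, and `promo_start` is a hypothetical event date.

```python
import pandas as pd

df_time = pd.DataFrame({'date': pd.date_range('2026-01-01', periods=10)})

# Hypothetical holiday calendar; a real project would load an authoritative list
holidays = pd.to_datetime(['2026-01-01', '2026-01-19'])
df_time['is_holiday'] = df_time['date'].isin(holidays).astype(int)

# Days elapsed since a hypothetical event (a promotion launch), floored at zero
# so that dates before the event do not produce negative values
promo_start = pd.Timestamp('2026-01-03')
df_time['days_since_promo'] = (df_time['date'] - promo_start).dt.days.clip(lower=0)

print(df_time)
```

Flooring at zero is a design choice; keeping the negative values, or adding a separate before/after indicator, may suit a model better if the run-up to the event matters.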
Text Length and Count Features
If you're working with text data, simple count and length features often yield quick wins: character count, word count, unique word count, and average word length.
df = pd.DataFrame({
'review': [
'This product is excellent',
'Not good at all',
'Amazing quality and fast shipping',
'Disappointed'
]
})
df['char_count'] = df['review'].str.len()
df['word_count'] = df['review'].str.split().str.len()
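total_letters = df['review'].str.replace(' ', '', regex=False).str.len()  # characters excluding spaces
df['avg_word_length'] = total_letters / df['word_count']
df['unique_word_count'] = df['review'].str.split().apply(lambda x: len(set(x)))  # case-sensitive
print(df[['review', 'char_count', 'word_count', 'avg_word_length', 'unique_word_count']])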
These features are weak predictors individually but combine well with other signals. Longer, more detailed reviews often correlate with strong opinions (positive or negative).
Log and Root Transformations
For skewed features like income or engagement metrics, apply log or root transformations to reduce the skew and bring the distribution closer to symmetric.
# Log transformation
revenue = np.array([100, 500, 1000, 10000, 100000])
log_revenue = np.log1p(revenue) # log(1 + x) to avoid log(0)
# Square root transformation (milder than log)
users = np.array([1, 4, 9, 16, 25])
sqrt_users = np.sqrt(users)
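# Create as features (reusing the arrays above as illustrative columns)
df = pd.DataFrame({'revenue': revenue, 'engagement': users})
df['log_revenue'] = np.log1p(df['revenue'])
df['sqrt_engagement'] = np.sqrt(df['engagement'])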
Log transformations suit non-negative features with a long right tail, where values span several orders of magnitude. They compress large values, turn multiplicative relationships into additive ones, and stabilize variance.
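A quick way to check whether a transformation is doing its job is to compare skewness before and after. The sketch below uses the same revenue values as the example above with pandas' `Series.skew`, and also shows `np.expm1`, the inverse of `np.log1p`, which is needed to convert values back to the original units.

```python
import numpy as np
import pandas as pd

revenue = pd.Series([100, 500, 1000, 10000, 100000], dtype=float)
log_revenue = np.log1p(revenue)

# Skewness near zero indicates a roughly symmetric distribution
skew_before = revenue.skew()
skew_after = log_revenue.skew()
print(f"Skew before: {skew_before:.2f}")
print(f"Skew after:  {skew_after:.2f}")

# expm1 inverts log1p, recovering the original scale
recovered = np.expm1(log_revenue)
```

If a model is trained on a log-transformed target, the same inverse step applies to its predictions before they are reported.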
Aggregation Features
For grouped data, aggregate to create features: sum, mean, max, min, std per group.
# Example: customer transactions
df = pd.DataFrame({
'customer_id': [1, 1, 1, 2, 2, 3],
'amount': [100, 200, 50, 300, 250, 150]
})
# Aggregate per customer (std is NaN for customers with a single transaction, such as customer 3)
agg_stats = df.groupby('customer_id')['amount'].agg(['sum', 'mean', 'max', 'std']).reset_index()
agg_stats.columns = ['customer_id', 'total_spent', 'avg_transaction', 'max_transaction', 'std_transaction']
# Merge back
df = df.merge(agg_stats, on='customer_id')
print(df)
Aggregation features summarize group-level behavior as a handful of columns that can be attached to each row. They're common in customer analytics and fraud detection.
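When the goal is to attach group statistics to every row, `groupby(...).transform` is a possible alternative to aggregating and merging back, since it returns a result aligned with the original rows. The sketch below uses the same transaction values; the `tx` frame and the `amount_vs_avg` column are illustrative names.

```python
import pandas as pd

tx = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3],
    'amount': [100, 200, 50, 300, 250, 150],
})

grouped = tx.groupby('customer_id')['amount']

# transform broadcasts each group's statistic back to that group's rows
tx['total_spent'] = grouped.transform('sum')
tx['avg_transaction'] = grouped.transform('mean')
tx['n_transactions'] = grouped.transform('count')

# Row-level feature: how this transaction compares with the customer's average
tx['amount_vs_avg'] = tx['amount'] / tx['avg_transaction']
print(tx)
```

In a predictive setting, group statistics like these are best computed from training rows only, otherwise information from the evaluation rows leaks into the features.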
Key Takeaways
- Domain knowledge trumps automation: invest time understanding your problem before engineering features.
- Polynomial features capture non-linear relationships but risk overfitting; confirm any gain with cross-validation.
- Interaction features multiply or combine inputs and are especially powerful for linear models.
- Binning converts continuous variables to categorical, trading granularity for robustness to outliers and a better fit to step-wise effects.
- Time-based features (day of week, seasonality) are essential for temporal data.
- Log and root transformations reduce skew and can linearize relationships.
- Always validate new features: add them to your model and check whether performance improves under cross-validation or on a held-out validation set, keeping the final test set untouched.
Frequently Asked Questions
How do I know if a new feature is worth keeping?
Add the feature to your model and measure performance on a validation set. If cross-validated accuracy, F1, or your chosen metric improves, keep it. If it stays the same or worsens, drop it. Feature selection algorithms (covered in the next article) automate this.
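The keep-or-drop check described above can be sketched as a small helper that scores a model with and without a candidate column. This is an illustration under stated assumptions: the `cv_gain` function is hypothetical, the data is synthetic, and a linear model with R² stands in for whatever model and metric the project actually uses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_gain(X_base, candidate, y, cv=5):
    """Mean cross-validated R² with the candidate column minus the score without it."""
    base = cross_val_score(LinearRegression(), X_base, y, cv=cv, scoring='r2').mean()
    X_plus = np.column_stack([X_base, candidate])
    plus = cross_val_score(LinearRegression(), X_plus, y, cv=cv, scoring='r2').mean()
    return plus - base

rng = np.random.default_rng(0)  # fixed seed (an assumption, for reproducibility)
X_base = rng.normal(size=(200, 2))
y = 2 * X_base[:, 0] + 3 * X_base[:, 0] * X_base[:, 1] + rng.normal(scale=0.1, size=200)

useful = X_base[:, 0] * X_base[:, 1]  # interaction that actually drives y
noise = rng.normal(size=200)          # column unrelated to y

gain_useful = cv_gain(X_base, useful, y)
gain_noise = cv_gain(X_base, noise, y)
print(f"Gain from interaction: {gain_useful:+.3f}")
print(f"Gain from noise:       {gain_noise:+.3f}")
```

A clearly positive gain argues for keeping the feature; a gain near zero or below suggests dropping it.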
Should I create polynomial features for all models?
No. Polynomial features help linear models capture non-linearity but are usually unnecessary for tree-based models, which already learn non-linear splits; there the extra columns mostly add training cost and overfitting risk. Neural networks also learn non-linearity implicitly through hidden layers.
Can I create too many features?
Yes. Too many features increase training time, memory usage, and overfitting risk. The curse of dimensionality depends on the ratio of features to samples rather than on a fixed count: a hundred features can already be too many for a dataset with only a few hundred rows. Use feature selection to reduce dimensionality after engineering.
Should I bin age or income, or keep them continuous?
Keep them continuous unless domain logic dictates otherwise. Continuous features preserve information. Bin only if the relationship is truly step-wise (e.g., credit score thresholds) or to reduce outlier impact.