
Feature Engineering: Encode Categorical Variables

Most machine learning algorithms require numeric input, but raw datasets often contain categorical features like city names, product categories, or color codes. Categorical encoding converts these text or nominal values into numbers that algorithms can process. The wrong encoding choice can introduce artificial orderings (making "red" less than "blue") or explode your feature count into thousands of sparse columns.

Why Encoding Matters

Passing raw string columns to most scikit-learn estimators raises ValueError: could not convert string to float. Even when a library accepts strings, "New York" and "new york" count as different categories, which fragments the data, so normalize spelling and casing first. A typical categorical column with 100 unique values forces a choice: one-hot encoding creates 100 sparse binary columns, ordinal encoding creates a single numeric column (but implies an order that may not exist), and target encoding keeps the information in one column at the cost of overfitting risk. The right choice depends on cardinality (unique value count) and algorithm type. In Kaggle competitions, target encoding has been reported to improve XGBoost performance by roughly 2-5% over one-hot on high-cardinality features (2025), though the gain varies by dataset.
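As a small illustration of the fragmentation problem, the sketch below normalizes casing and whitespace before any encoding. The sample values are hypothetical, and real data may need further cleanup (abbreviations, typos) that this does not cover.

```python
import pandas as pd

# Hypothetical raw column with inconsistent casing and whitespace
cities = pd.Series(['New York', 'new york', ' New York ', 'Boston'])

# Normalize before encoding so spelling variants collapse into one category
cities_clean = cities.str.strip().str.lower()

print(cities.nunique())        # 4 distinct raw strings
print(cities_clean.nunique())  # 2 categories after normalization
```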

One-Hot Encoding

One-hot encoding is the gold standard for low-cardinality categorical features (up to ~50 unique values). It creates a binary column for each category; a row has a 1 in the column matching its category and 0s everywhere else.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Example dataset
df = pd.DataFrame({
'color': ['red', 'blue', 'red', 'green'],
'size': ['small', 'large', 'medium', 'large']
})

# Method 1: pandas.get_dummies() (quick and readable)
# dtype=int gives 0/1 columns; pandas 2.x returns True/False by default
df_encoded = pd.get_dummies(df, columns=['color', 'size'], dtype=int)
print(df_encoded)
# Output:
#    color_blue  color_green  color_red  size_large  size_medium  size_small
# 0           0            0          1           0            0           1
# 1           1            0          0           1            0           0
# 2           0            0          1           0            1           0
# 3           0            1          0           1            0           0

# Method 2: scikit-learn OneHotEncoder (production-ready, handles new categories)
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded = encoder.fit_transform(df[['color', 'size']])
print(encoder.get_feature_names_out(['color', 'size']))

One-hot encoding preserves all information (you can recover the original category from the binary columns) and works seamlessly with linear models and neural networks. The trade-off: if a categorical variable has 10,000 unique values (user IDs), one-hot creates 10,000 new columns, which explodes memory and training time.
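The memory cost can be softened by keeping the result sparse. The sketch below assumes a hypothetical column of 1,000 distinct user IDs and relies on OneHotEncoder returning a SciPy sparse matrix by default, so only the non-zero entries are stored.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality column: 1,000 distinct user IDs
users = pd.DataFrame({'user_id': [f'user_{i:04d}' for i in range(1000)]})

# OneHotEncoder returns a SciPy sparse matrix unless told otherwise
sparse_encoder = OneHotEncoder(handle_unknown='ignore')
sparse_matrix = sparse_encoder.fit_transform(users)

print(sparse_matrix.shape)  # (1000, 1000)
print(sparse_matrix.nnz)    # 1000 stored values, one per row
```

A dense array of the same shape would hold a million values. Sparse output helps with memory, but the model still has 1,000 features to fit.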

Ordinal Encoding

Ordinal encoding maps categories to integers in order: low = 0, medium = 1, high = 2. Use this only when categories have a natural order (size, education level, priority rank).

from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
'priority': ['low', 'high', 'medium', 'low'],
'satisfaction': ['bad', 'good', 'excellent', 'bad']
})

# Manual ordinal mapping (explicit control)
priority_map = {'low': 0, 'medium': 1, 'high': 2}
df['priority_encoded'] = df['priority'].map(priority_map)

# Scikit-learn OrdinalEncoder (handles multiple columns)
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high'],
['bad', 'good', 'excellent']])
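# Select only the two raw columns; df now also contains 'priority_encoded'
encoded = encoder.fit_transform(df[['priority', 'satisfaction']])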
print(encoded)
# [[0. 0.]
# [2. 1.]
# [1. 2.]
# [0. 0.]]

Ordinal encoding is compact (one column per variable) and decision trees interpret it well. However, linear models treat the numeric codes as continuous, which is wrong if you use it on nominal categories. Apply ordinal only to truly ordered variables.
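An ordered column can still receive values that were absent at fit time. As a hedged sketch, OrdinalEncoder can map them to a sentinel code through handle_unknown='use_encoded_value'. The 'urgent' value and the choice of -1 as the sentinel are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({'priority': ['low', 'high', 'medium', 'low']})
new = pd.DataFrame({'priority': ['medium', 'urgent']})  # 'urgent' is unseen

ord_encoder = OrdinalEncoder(
    categories=[['low', 'medium', 'high']],
    handle_unknown='use_encoded_value',
    unknown_value=-1,  # assumed sentinel; must not collide with a real code
)
ord_encoder.fit(train)
codes = ord_encoder.transform(new)
print(codes)
# [[ 1.]
#  [-1.]]
```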

Label Encoding (Classification Target)

Label encoding is distinct from ordinal encoding: it maps class labels to the integers 0 to n_classes - 1, assigned in sorted order of the labels, with no order intended. Use it for target variables in classification (binary or multi-class), not for input features.

from sklearn.preprocessing import LabelEncoder

# Target variable (what we are predicting)
y = pd.Series(['spam', 'ham', 'spam', 'ham', 'spam'])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
print(y_encoded) # [1 0 1 0 1]
print(encoder.classes_) # ['ham' 'spam']

Label encoding is used internally by scikit-learn classifiers on target variables. Never use it on input features unless you want to introduce false ordering.
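Since predictions come back as integer codes when the target was label-encoded, it helps to keep the fitted encoder around. A minimal sketch, where predicted_codes is a hypothetical stand-in for real model output:

```python
from sklearn.preprocessing import LabelEncoder

labels = ['spam', 'ham', 'spam', 'ham', 'spam']
label_encoder = LabelEncoder()
y_codes = label_encoder.fit_transform(labels)

# Hypothetical model predictions expressed as integer codes
predicted_codes = [0, 1, 1]

# inverse_transform maps codes back to the original class names
predicted_labels = label_encoder.inverse_transform(predicted_codes)
print(predicted_labels)  # ['ham' 'spam' 'spam']
```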

Target Encoding

Target encoding (also called mean encoding) replaces each category with the mean of the target variable for that category. For predicting house prices, the city category is replaced by the average house price in that city.

import pandas as pd
import numpy as np

# Example: predict house price by city
df = pd.DataFrame({
'city': ['Boston', 'Boston', 'Austin', 'Austin', 'Boston'],
'price': [500000, 480000, 350000, 360000, 520000]
})

# Calculate mean target per category
target_mean = df.groupby('city')['price'].mean()
print(target_mean)
# city
# Austin    355000.0
# Boston    500000.0
# Name: price, dtype: float64

# Map back to dataframe
df['city_encoded'] = df['city'].map(target_mean)
print(df)

Target encoding is powerful: it reduces dimensionality (one column per feature), captures signal directly, and often outperforms one-hot on tree models. The risk is overfitting: categories with few samples get unreliable mean estimates, and encoding a row with a mean that includes its own target leaks the label into the feature. Use out-of-fold (cross-validated) encoding or smoothing to mitigate.

# Smoothed target encoding (handle rare categories)
smoothing = 1 # smoothing strength
global_mean = df['price'].mean()
city_counts = df['city'].value_counts()

smoothed_encoding = {}
for city in df['city'].unique():
    cat_mean = df[df['city'] == city]['price'].mean()
    count = city_counts[city]
    # Blend rare categories toward global mean
    smoothed_encoding[city] = (count * cat_mean + smoothing * global_mean) / (count + smoothing)

df['city_encoded_smooth'] = df['city'].map(smoothed_encoding)
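The cross-validation option can be sketched as out-of-fold encoding: each row is encoded with category means computed from the other folds only, so its own target never contributes. This is one possible hand-rolled version, not a reference implementation. The extra sixth row, the fold count, and the fallback to the fold's global mean are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

houses = pd.DataFrame({
    'city': ['Boston', 'Boston', 'Austin', 'Austin', 'Boston', 'Austin'],
    'price': [500000, 480000, 350000, 360000, 520000, 340000],
})

oof_encoded = pd.Series(np.nan, index=houses.index, dtype=float)
kfold = KFold(n_splits=3, shuffle=True, random_state=0)

for fit_idx, encode_idx in kfold.split(houses):
    fit_part = houses.iloc[fit_idx]
    # Statistics come only from rows outside the fold being encoded
    fold_means = fit_part.groupby('city')['price'].mean()
    fold_global = fit_part['price'].mean()
    # Categories missing from the fitting rows fall back to the fold's global mean
    encoded = houses['city'].iloc[encode_idx].map(fold_means).fillna(fold_global)
    oof_encoded.iloc[encode_idx] = encoded.to_numpy()

houses['city_oof'] = oof_encoded
print(houses)
```

Recent scikit-learn releases also ship sklearn.preprocessing.TargetEncoder, which applies a similar cross-fitting scheme internally. Check the version you have installed before relying on it.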

Binary Encoding

For very high-cardinality features (thousands of categories), binary encoding writes each category's integer index in binary, using about log2(N) columns (rounded up) instead of N. Rarely needed in practice but useful for extreme cases.

def binary_encode(series):
    """Convert a categorical Series to binary-digit columns."""
    # Map each category to an integer code (0, 1, 2, ...)
    codes = pd.factorize(series)[0]
    # Number of bits needed for the largest code (at least 1)
    n_bits = max(1, int(codes.max()).bit_length())
    # Write each code as a zero-padded binary string, then split it into digits
    rows = [[int(bit) for bit in format(int(code), f'0{n_bits}b')] for code in codes]
    # One column per bit, aligned to the input index
    columns = [f'bin_{i}' for i in range(n_bits)]
    return pd.DataFrame(rows, columns=columns, index=series.index)

df = pd.DataFrame({'user_id': ['user_001', 'user_002', 'user_003', 'user_123']})
binary_encoded = binary_encode(df['user_id'])
print(binary_encoded)

Binary encoding is compact and works for tree models but is rarely used in modern ML pipelines. Target encoding or embedding methods are preferred.
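To make the column savings concrete, the helper below counts the bits needed to represent codes 0 through N - 1. The name binary_columns is hypothetical and not part of any library.

```python
def binary_columns(n_categories: int) -> int:
    """Bits needed to represent the codes 0 .. n_categories - 1."""
    return max(1, (n_categories - 1).bit_length())

for n in (4, 100, 10_000):
    print(n, 'categories ->', binary_columns(n), 'binary columns')
# 4 categories -> 2 binary columns
# 100 categories -> 7 binary columns
# 10000 categories -> 14 binary columns
```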

Handling Unknown Categories

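When your model encounters a new category in production (a city it never saw during training), scikit-learn's OneHotEncoder raises a ValueError by default (handle_unknown='error'). Set handle_unknown='ignore' to encode unseen categories as all zeros instead.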

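import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Training data containing only the cities seen at fit time
df = pd.DataFrame({'city': ['Boston', 'Boston', 'Austin', 'Austin', 'Boston']})

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoder.fit(df[['city']])

# New data with unseen city 'Miami'
new_data = pd.DataFrame({'city': ['Boston', 'Miami', 'Austin']})
encoded = encoder.transform(new_data)
print(encoded)
# Columns are [city_Austin, city_Boston]; 'Miami' becomes all zeros (unknown)
# [[0. 1.]
#  [0. 0.]
#  [1. 0.]]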

Alternatively, create an "Other" bin before encoding.

min_freq = 10  # Treat as 'Other' if count < 10
counts = df['city'].value_counts()
top_categories = counts[counts >= min_freq].index
df['city_grouped'] = df['city'].where(df['city'].isin(top_categories), 'Other')

Comparison Table

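| Method  | Cardinality | Linear Models | Tree Models | Pros                          | Cons                                 |
|---------|-------------|---------------|-------------|-------------------------------|--------------------------------------|
| One-Hot | Low (<50)   | Excellent     | Good        | Interpretable, no false order | High-dim, sparse for many categories |
| Ordinal | Any         | Poor          | Good        | Compact, fast                 | Assumes order (only for ranked data) |
| Target  | Any         | Excellent     | Excellent   | Reduces dim, captures signal  | Overfitting risk, needs smoothing    |
| Binary  | Very High   | Poor          | Fair        | Compact (log2 N columns)      | Hard to interpret, rarely needed     |
| Hashing | Very High   | Fair          | Fair        | Fixed dim, handles unknowns   | No interpretability, collisions      |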
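The table lists hashing, which none of the sections above demonstrates. As a rough sketch, scikit-learn's FeatureHasher maps each category string to one of a fixed number of columns. The choice of 8 columns is arbitrary, and with real high-cardinality data different categories will sometimes collide in the same column.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

cities = pd.Series(['Boston', 'Austin', 'Boston', 'Miami'])

# input_type='string' expects each sample to be an iterable of strings
hasher = FeatureHasher(n_features=8, input_type='string')
hashed = hasher.transform([[city] for city in cities]).toarray()

print(hashed.shape)  # (4, 8)
print(hashed)
```

No fitting step is needed, so a category never seen before is hashed the same way as any other. Entries may be +1 or -1 because FeatureHasher alternates signs by default.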

Key Takeaways

  • One-hot encoding is the default for low-cardinality nominal features; it works with all algorithms.
  • Use ordinal encoding only for truly ordered categories (size, rank, priority).
  • Target encoding is powerful for high-cardinality features on tree models but requires smoothing to prevent overfitting.
  • Label encoding is for target variables in classification, not input features.
  • Always use handle_unknown='ignore' to gracefully handle unseen categories in production.
  • For very high-cardinality features (10,000+ unique values), consider dropping the feature or using target encoding with smoothing.

Frequently Asked Questions

Should I one-hot encode before or after train/test split?

Fit the encoder on training data, then transform both train and test. Never fit on the combined dataset; that causes leakage. Unseen categories in the test set are handled via handle_unknown='ignore'.
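One way to make the fit-on-train rule hard to break is to put the encoder inside a Pipeline, so fit only ever sees training rows. The toy data, column names, and choice of LogisticRegression below are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data
data = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue', 'green', 'red', 'blue'],
    'clicked': [1, 0, 0, 1, 0, 1, 1, 0],
})
X_train, X_test, y_train, y_test = train_test_split(
    data[['color']], data['clicked'],
    test_size=0.25, random_state=0, stratify=data['clicked'],
)

preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['color'])]
)
model = Pipeline([('preprocess', preprocess), ('clf', LogisticRegression())])

# The encoder learns its categories from X_train only
model.fit(X_train, y_train)

# X_test is transformed with the already-fitted encoder, never refit
predictions = model.predict(X_test)
print(predictions)
```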

What if a categorical variable has 1,000 unique values?

One-hot encoding creates 1,000 columns, which is unwieldy. Instead, use target encoding (requires a numeric target), or group rare categories into an "Other" bin before encoding, or drop the feature if it has no predictive power.

Does one-hot encoding work with decision trees?

Yes, but it's often unnecessary. Decision trees natively handle categorical variables in some libraries (e.g., LightGBM, CatBoost). For scikit-learn, one-hot encoding works fine; it just creates more branches to split on.

Can I use target encoding on the test set?

No. Calculate target statistics (means per category) on training data only, then apply those statistics to the test set. Using test-set targets is data leakage.
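A minimal sketch of that workflow follows, reusing the house-price example. Falling back to the training global mean for a city absent from training ('Miami' here) is an assumption; other fallbacks are possible.

```python
import pandas as pd

train = pd.DataFrame({
    'city': ['Boston', 'Boston', 'Austin', 'Austin', 'Boston'],
    'price': [500000, 480000, 350000, 360000, 520000],
})
test = pd.DataFrame({'city': ['Austin', 'Miami']})  # test targets are never used

# Statistics computed on training data only
city_means = train.groupby('city')['price'].mean()
global_mean = train['price'].mean()

# Apply the training statistics; unseen cities get the training global mean
test['city_encoded'] = test['city'].map(city_means).fillna(global_mean)
print(test)
#      city  city_encoded
# 0  Austin      355000.0
# 1   Miami      442000.0
```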

Further Reading