← View series: machine learning
~/blog
Target Guided Ordinal Encoding
Target Guided Ordinal Encoding for High-Cardinality Features
If you have a column with 50 unique ZIP codes, one-hot encoding creates 50 columns — most of them sparse. Label encoding creates 1 column but implies arbitrary ordering. Neither is ideal.
Target guided ordinal encoding (also called mean encoding) takes a different approach: replace each category with the mean of the target variable for that category. This creates a single numeric column that captures the predictive relationship between the category and what you're trying to predict.
How It Works
The idea is straightforward: for each category, calculate the average target value, then use that average as the encoded value.
import pandas as pd
df = pd.DataFrame({
'city': ['New York', 'London', 'Paris', 'Tokyo',
'New York', 'Paris'],
'price': [200, 150, 300, 250, 180, 320]
})
mean_price = df.groupby('city')['price'].mean().to_dict()This gives you a mapping: London → 150, New York → 190, Paris → 310, Tokyo → 250.
df['city_encoded'] = df['city'].map(mean_price)New York (average 190) and Paris (average 310) get different encoded values because their price distributions differ. The encoding captures the relationship between city and price in a single column.
The Target Leakage Problem
Here's where target encoding gets subtle: if you compute the mean over the entire dataset before splitting into training and test sets, you've leaked information. The mean for a category incorporates target values from the test set, which your model shouldn't see.
This artificially inflates performance metrics. The fix is to compute the encoding within each cross-validation fold.
from sklearn.model_selection import KFold
df['city_encoded'] = 0
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
train_fold = df.iloc[train_idx]
means = train_fold.groupby('city')['price'].mean()
df.loc[val_idx, 'city_encoded'] = df.loc[val_idx, 'city'].map(means)Each fold's encoding uses only data available during training on that fold. This keeps your evaluation honest.
Handling Rare Categories
When a category appears only a few times, its mean is unreliable. A city with one $1,000,000 sale gets an encoded value of 1,000,000 — extreme and noisy.
The standard solution is smoothing: blend the category mean with the global mean, weighted by sample size.
def smoothed_mean_encoding(series, target, min_samples=5, global_weight=None):
global_mean = target.mean()
grouped = target.groupby(series)
agg = grouped.agg(['count', 'mean'])
if global_weight is None:
global_weight = min_samples
smoothed = (agg['count'] * agg['mean'] + global_weight * global_mean) / \
(agg['count'] + global_weight)
return series.map(smoothed.to_dict())Categories with few samples get pulled toward the global mean. Categories with many samples keep their computed mean. The global_weight parameter controls how much smoothing to apply.
When Target Encoding Works Best
Target encoding is most useful for high-cardinality features where there's a stable relationship between the category and the target. Common use cases:
- Geographic data — ZIP codes, cities, regions often correlate with income or purchasing behavior
- Product codes — certain products are associated with higher conversion rates
- User IDs — user-level past behavior predicts future actions
It's less suitable when:
- Categories have very few samples (even with smoothing)
- The category-target relationship is noisy or temporary
- You have a small dataset where cross-validation splitting reduces training size significantly
Comparison With Other High-Cardinality Methods
| Method | Information Used | Dimensionality | Risk |
|---|---|---|---|
| One-hot | Category identity | Very high | Sparse, overfitting |
| Label encoding | None (arbitrary) | 1 column | Implies false order |
| Frequency encoding | Count only | 1 column | Ignores target relationship |
| Target encoding | Target mean | 1 column | Target leakage without CV |
| Binary encoding | Category bits | log2(n) columns | Less interpretable |
Target encoding gives you the best information-to-dimensionality ratio for high-cardinality features, but only if you handle the leakage correctly. The cross-validation approach above is non-negotiable — skipping it means your validation metrics will be optimistic and your production performance will disappoint.