Feature Engineering: Categorical Encoding
Machine learning models operate on numbers. Categorical features — strings, labels, discrete categories — need to be converted into numeric form before any algorithm can use them. The conversion method matters enormously: the wrong encoding introduces false ordinal relationships, multicollinearity, or target leakage.
Same anchor dataset throughout:
import pandas as pd
import numpy as np
data = {
    "age": [25, None, 42, 38, 55, 29, None, 61],
    "tenure_months": [3, 12, 24, 6, 36, 1, 48, 60],
    "monthly_charge": [55, 70, 95, 65, 120, 45, 80, 350],
    "contract_type": ["M2M","M2M","1yr","M2M","2yr","M2M","1yr","2yr"],
    "internet_service":["DSL","Fiber","Fiber","DSL","Fiber","None","DSL","Fiber"],
    "churned": [1, 1, 0, 1, 0, 1, 0, 0]
}
df = pd.DataFrame(data)
The two categorical features: contract_type (3 categories: M2M, 1yr, 2yr) and internet_service (3 categories: DSL, Fiber, None).
Label Encoding
Assign each category an arbitrary integer. sklearn's LabelEncoder sorts the categories first, so here it produces 1yr=0, 2yr=1, M2M=2.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["contract_label"] = le.fit_transform(df["contract_type"])
print(dict(zip(le.classes_, le.transform(le.classes_))))
print(df[["contract_type","contract_label"]].to_string())
{'1yr': 0, '2yr': 1, 'M2M': 2}
contract_type contract_label
0 M2M 2
1 M2M 2
2 1yr 0
3 M2M 2
4 2yr 1
5 M2M 2
6 1yr 0
7 2yr 1
The assignment is alphabetical (sklearn's default), giving 1yr=0, 2yr=1, M2M=2. This implies a numeric ordering: M2M (2) > 2yr (1) > 1yr (0). But alphabetical order is not a meaningful order for contract types: a linear model will treat the step from "1yr" to "2yr" as having the same size and direction of effect as the step from "2yr" to "M2M."
from sklearn.linear_model import LinearRegression
X_label = df[["contract_label"]].values
y_c = df["churned"].values
model = LinearRegression().fit(X_label, y_c)
print(f"Coefficient for contract_label: {model.coef_[0]:.4f}")
print("Interpretation: each unit increase in label corresponds to "
      f"{model.coef_[0]:.4f} change in churn probability")
Coefficient for contract_label: 0.5455
Interpretation: each unit increase in label corresponds to 0.5455 change in churn probability
The coefficient is positive, but its sign is an artifact of the encoding. The 1yr and 2yr contracts have zero churn (0 of 2 each); M2M has 100% churn (4 of 4). M2M happens to receive the highest code (2), so higher codes go with more churn and the slope comes out positive. If you re-encoded M2M as 0 and 1yr as 2, the coefficient would flip sign even though nothing about the data changed. Label encoding should only be used when the categorical variable is genuinely ordinal, e.g., "low" < "medium" < "high".
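The sign flip is easy to check by hand. The sketch below is a plain-Python least-squares slope (the helper name ols_slope and the second coding are illustrative, not part of the original code), fitted once with the alphabetical codes and once with the codes reversed:

```python
def ols_slope(x, y):
    """Slope of the least-squares line y = a + b*x for a single feature."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var = sum((xi - mean_x) ** 2 for xi in x)
    return cov / var

contract_type = ["M2M", "M2M", "1yr", "M2M", "2yr", "M2M", "1yr", "2yr"]
churned = [1, 1, 0, 1, 0, 1, 0, 0]

alphabetical = {"1yr": 0, "2yr": 1, "M2M": 2}    # what LabelEncoder assigns
reversed_codes = {"M2M": 0, "2yr": 1, "1yr": 2}  # same data, arbitrary relabeling

slope_alpha = ols_slope([alphabetical[c] for c in contract_type], churned)
slope_rev = ols_slope([reversed_codes[c] for c in contract_type], churned)

# Same magnitude, opposite sign: the direction comes from the codes, not the data.
print(f"alphabetical codes: {slope_alpha:+.4f}")
print(f"reversed codes:     {slope_rev:+.4f}")
```

Neither sign says anything about contracts; both describe the arbitrary integers.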
One-Hot Encoding
Create a binary indicator column for each category. For contract_type with 3 categories: generate 3 columns, then drop one to avoid the dummy variable trap.
df_ohe = pd.get_dummies(df[["contract_type"]]).drop(columns="contract_type_2yr")  # drop 2yr explicitly
print(df_ohe.to_string())
contract_type_1yr contract_type_M2M
0 False True
1 False True
2 True False
3 False True
4 False False
5 False True
6 True False
7 False False
Why drop a column? If all 3 dummies (1yr, 2yr, M2M) are kept, they sum to 1 for every row: 1yr + 2yr + M2M = 1. Together with the model's intercept this creates perfect multicollinearity: you can derive any one column from the other two. The design matrix becomes rank-deficient (so X^T X is singular), and unregularized OLS/logistic regression has no unique solution.
The dropped category (2yr) becomes the reference category. Coefficients for 1yr and M2M are interpreted relative to 2yr. The shortcut pd.get_dummies(..., drop_first=True) also removes one column, but it removes the first category in sorted order (1yr here), which would make 1yr the reference instead.
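The linear dependence behind the dummy variable trap can be verified directly. A minimal sketch in plain Python (no pandas, so the arithmetic stays visible) that builds all three indicator columns and recovers one from the other two:

```python
contract_type = ["M2M", "M2M", "1yr", "M2M", "2yr", "M2M", "1yr", "2yr"]
levels = sorted(set(contract_type))  # ['1yr', '2yr', 'M2M']

# Full one-hot matrix: one indicator column per level, nothing dropped.
full = [[int(value == level) for level in levels] for value in contract_type]

# Every row sums to 1, which is exactly the intercept column.
row_sums = [sum(row) for row in full]

# So any one dummy is a linear function of the others: M2M = 1 - 1yr - 2yr.
recovered_m2m = [1 - row[0] - row[1] for row in full]
actual_m2m = [row[2] for row in full]

print(row_sums)
print(recovered_m2m == actual_m2m)
```

Because the third column carries no information the first two lack, dropping it loses nothing and removes the redundancy.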
from sklearn.linear_model import LogisticRegression
dummies = pd.get_dummies(df["contract_type"], prefix="contract_type")
df_combined = pd.concat([
    df[["tenure_months"]],
    dummies.drop(columns="contract_type_2yr"),  # 2yr is the reference category
    df["churned"]
], axis=1).dropna()
X_ohe = df_combined.drop("churned", axis=1).values.astype(float)
y_ohe = df_combined["churned"].values
model_ohe = LogisticRegression(max_iter=500, solver='lbfgs').fit(X_ohe, y_ohe)
print("OHE coefficient names:", list(df_combined.drop("churned", axis=1).columns))
print("OHE coefficients: ", model_ohe.coef_[0].round(3).tolist())
OHE coefficient names: ['tenure_months', 'contract_type_1yr', 'contract_type_M2M']
OHE coefficients: [-0.622, -2.514, 1.887]
The model learned that M2M contracts are strongly positively associated with churn (+1.887 relative to 2yr), which matches the data: M2M customers have 100% churn, and 1yr/2yr customers have 0%. The negative 1yr coefficient (−2.514 relative to 2yr) deserves less weight. 1yr and 2yr customers churn at the same rate here, so with tenure_months also in the model and only 8 rows, the churn rates themselves do not support a difference between them. LogisticRegression also applies L2 regularization by default (C=1.0), which shrinks all of these values.
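Logistic regression coefficients live on the log-odds scale, so exponentiating one gives an odds ratio relative to the reference category. The sketch below plugs in the two contract coefficients as printed; the helper shift_probability and the 20% baseline churn probability are made up for illustration.

```python
import math

# Contract coefficients as printed above (log-odds, relative to the 2yr reference).
coef_m2m = 1.887
coef_1yr = -2.514

# exp(coefficient) is the multiplicative change in the odds of churn.
odds_ratio_m2m = math.exp(coef_m2m)
odds_ratio_1yr = math.exp(coef_1yr)

def shift_probability(p_reference, coef):
    """Apply a log-odds coefficient to a reference-category probability."""
    odds = p_reference / (1 - p_reference) * math.exp(coef)
    return odds / (1 + odds)

# Hypothetical baseline: a 2yr customer with a 20% churn probability.
p_m2m = shift_probability(0.20, coef_m2m)

print(f"odds ratio M2M vs 2yr: {odds_ratio_m2m:.2f}")
print(f"odds ratio 1yr vs 2yr: {odds_ratio_1yr:.3f}")
print(f"20% baseline becomes {p_m2m:.1%} for M2M")
```

The odds ratio is constant, but the change in probability depends on the baseline, which is why a coefficient alone cannot be read as "x points more likely."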
Curse of dimensionality: A postal_code feature with 1,000 unique codes produces 999 OHE columns. Most will be 0 for most rows (sparse). High-dimensional sparse matrices slow down training and require far more data to estimate each coefficient. For high-cardinality categoricals, use frequency encoding, target encoding, or the hashing trick instead.
Ordinal Encoding
When the categorical variable has a meaningful order, represent it numerically with that order preserved explicitly.
from sklearn.preprocessing import OrdinalEncoder
contract_order = [["M2M", "1yr", "2yr"]] # month-to-month < 1yr < 2yr commitment
oe = OrdinalEncoder(categories=contract_order)
df["contract_ordinal"] = oe.fit_transform(df[["contract_type"]])
print(df[["contract_type","contract_ordinal"]].to_string())
contract_type contract_ordinal
0 M2M 0.0
1 M2M 0.0
2 1yr 1.0
3 M2M 0.0
4 2yr 2.0
5 M2M 0.0
6 1yr 1.0
7 2yr 2.0
The order M2M < 1yr < 2yr reflects commitment level. A model using this feature will correctly infer that 2yr contracts indicate higher commitment than 1yr. Crucially, the categories parameter must specify the order explicitly — never rely on alphabetical default ordering, which happens to be wrong here (1yr < 2yr < M2M alphabetically, not by commitment).
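The same idea without sklearn, as a minimal sketch: declare the order once, derive the codes from it, and fail loudly on any category that is missing from the declared order. The names COMMITMENT_ORDER and encode_ordinal are illustrative.

```python
COMMITMENT_ORDER = ["M2M", "1yr", "2yr"]  # shortest to longest commitment
rank = {level: i for i, level in enumerate(COMMITMENT_ORDER)}

def encode_ordinal(values, rank):
    """Map each value to its position in the declared order."""
    unknown = sorted(set(values) - set(rank))
    if unknown:
        # Guessing a position for an unseen level would silently corrupt the order.
        raise ValueError(f"categories missing from the declared order: {unknown}")
    return [rank[v] for v in values]

contract_type = ["M2M", "M2M", "1yr", "M2M", "2yr", "M2M", "1yr", "2yr"]
encoded = encode_ordinal(contract_type, rank)
print(encoded)
```

Raising on unknown levels is a design choice; it forces whoever adds a new contract type to decide where it belongs in the order.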
Frequency Encoding
Replace each category with its count (or proportion) in the training data:
freq_map = df["contract_type"].value_counts().to_dict()
df["contract_freq"] = df["contract_type"].map(freq_map)
print(freq_map)
print(df[["contract_type","contract_freq"]].to_string())
{'M2M': 4, '1yr': 2, '2yr': 2}
contract_type contract_freq
0 M2M 4
1 M2M 4
2 1yr 2
3 M2M 4
4 2yr 2
5 M2M 4
6 1yr 2
7 2yr 2
Frequency encoding captures how common each category is — M2M is the most common contract type. No dimensionality explosion, so it scales to high-cardinality features.
The main problem: two categories with the same frequency get the same encoded value. Here, 1yr and 2yr both encode as 2, so a model given only the frequency-encoded column cannot distinguish a 1yr-contract customer from a 2yr-contract customer. The two happen to have the same churn behavior in this dataset, which hides the problem, but that is coincidental, not guaranteed.
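In practice the frequency map is learned on the training split and then applied to new data. The sketch below uses proportions rather than raw counts, encodes unseen categories as 0.0 (one possible convention, assumed here), and lists the categories that tie. All three function names are illustrative.

```python
from collections import Counter, defaultdict

def fit_frequency(train_values):
    """Learn category -> proportion from the training column only."""
    counts = Counter(train_values)
    total = len(train_values)
    return {category: n / total for category, n in counts.items()}

def apply_frequency(values, freq_map, unseen=0.0):
    """Encode values; categories never seen in training get `unseen`."""
    return [freq_map.get(v, unseen) for v in values]

def tied_categories(freq_map):
    """Group categories that share an encoded value and so look identical to a model."""
    groups = defaultdict(list)
    for category, value in freq_map.items():
        groups[value].append(category)
    return [sorted(group) for group in groups.values() if len(group) > 1]

train = ["M2M", "M2M", "1yr", "M2M", "2yr", "M2M", "1yr", "2yr"]
freq_map = fit_frequency(train)
new_rows = apply_frequency(["M2M", "3yr", "2yr"], freq_map)
ties = tied_categories(freq_map)

print(freq_map)
print(new_rows)
print(ties)
```

Checking for ties before training is cheap and tells you up front which categories the model will be unable to separate.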
Target-Guided Encoding (Mean Target Encoding)
Replace each category with the mean of the target variable for rows in that category:
| contract_type | Rows | churned values | Mean churn |
|---|---|---|---|
| M2M | 0,1,3,5 | [1,1,1,1] | 1.000 |
| 1yr | 2,6 | [0,0] | 0.000 |
| 2yr | 4,7 | [0,0] | 0.000 |
mean_target = df.groupby("contract_type")["churned"].mean()
print(mean_target)
df["contract_target"] = df["contract_type"].map(mean_target)
print(df[["contract_type","contract_target"]].to_string())
contract_type
1yr 0.0
2yr 0.0
M2M 1.0
Name: churned, dtype: float64
contract_type contract_target
0 M2M 1.0
1 M2M 1.0
2 1yr 0.0
3 M2M 1.0
4 2yr 0.0
5 M2M 1.0
6 1yr 0.0
7 2yr 0.0
Target encoding captures the relationship between the category and the target directly. This is powerful — but introduces target leakage when computed naively on the full training set.
The Leakage Problem
For row 0 (M2M, churned=1): the M2M mean of 1.000 includes row 0 itself, so the feature for row 0 partly encodes row 0's own label. The model sees contract_target=1.0 and churned=1 on the same row and learns to predict churned=1 whenever the encoding is 1.0. On training data, accuracy is perfect. Test rows receive the category means computed on the training set, which do not contain their own labels, so the encoding is less predictive there than it looked during training, and the model underperforms.
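The leak is most visible in the extreme case where every category has a single row. The sketch below applies naive target encoding to a hypothetical customer_id column (not part of the anchor dataset): the "feature" comes out as an exact copy of the label.

```python
from collections import defaultdict

def naive_target_encode(categories, target):
    """Replace each category with the mean target over ALL rows, the row itself included."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, y in zip(categories, target):
        sums[category] += y
        counts[category] += 1
    return [sums[c] / counts[c] for c in categories]

# Hypothetical ID column: one distinct value per row.
customer_id = [f"c{i}" for i in range(8)]
churned = [1, 1, 0, 1, 0, 1, 0, 0]

encoded = naive_target_encode(customer_id, churned)
print(encoded)
print(encoded == churned)
```

A model trained on this column scores perfectly on the training set and learns nothing usable, because a new customer's ID has no training mean at all.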
Cross-Fold Fix
Compute the encoding using out-of-fold samples — for each row, only include data from folds that do not contain that row:
from sklearn.model_selection import KFold
df["contract_target_oof"] = np.nan
kf = KFold(n_splits=3, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    train_fold = df.iloc[train_idx]
    fold_means = train_fold.groupby("contract_type")["churned"].mean()
    df.loc[df.index[val_idx], "contract_target_oof"] = \
        df.iloc[val_idx]["contract_type"].map(fold_means)
print(df[["contract_type","contract_target","contract_target_oof"]].to_string())
contract_type contract_target contract_target_oof
0 M2M 1.0 1.0
1 M2M 1.0 1.0
2 1yr 0.0 0.0
3 M2M 1.0 1.0
4 2yr 0.0 0.0
5 M2M 1.0 1.0
6 1yr 0.0 0.0
7 2yr 0.0 0.0
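On this 8-sample dataset with perfect category-label alignment, the out-of-fold encoding is identical to the naive encoding. The two diverge on real data. In a category with thousands of rows, a row's own label contributes only 1/n to the category mean, which is negligible; in a small category, the row's own label dominates the mean, and naive encoding can inflate training accuracy by several percentage points. Small categories also expose a practical gap in the loop above: if a training fold contains no rows of some category, map() returns NaN for that category, and a fallback such as the global training mean is needed.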
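Here is a plain-Python sketch of out-of-fold encoding with a fallback for categories that are missing from a training fold. It assumes deterministic interleaved folds (row i goes to fold i % n_splits) instead of a shuffled KFold, purely so the result can be traced by hand; the function name is illustrative.

```python
def oof_target_encode(categories, target, n_splits=4):
    """Out-of-fold target encoding; unseen categories fall back to the fold's prior."""
    n = len(categories)
    encoded = [None] * n
    for fold in range(n_splits):
        val_idx = [i for i in range(n) if i % n_splits == fold]
        train_idx = [i for i in range(n) if i % n_splits != fold]

        # Statistics come from the training part of this fold only.
        prior = sum(target[i] for i in train_idx) / len(train_idx)
        sums, counts = {}, {}
        for i in train_idx:
            c = categories[i]
            sums[c] = sums.get(c, 0.0) + target[i]
            counts[c] = counts.get(c, 0) + 1

        for i in val_idx:
            c = categories[i]
            encoded[i] = sums[c] / counts[c] if c in counts else prior
    return encoded

contract_type = ["M2M", "M2M", "1yr", "M2M", "2yr", "M2M", "1yr", "2yr"]
churned = [1, 1, 0, 1, 0, 1, 0, 0]

encoded = oof_target_encode(contract_type, churned)
print([round(v, 3) for v in encoded])
```

With these folds, both 1yr rows (2 and 6) land in the same validation fold, so no training row can supply a 1yr mean and they receive that fold's global mean instead. No row's own label ever enters its encoding.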
Hashing Trick
For very high-cardinality features (postal codes, user IDs, product SKUs with millions of distinct values), OHE is infeasible. The hashing trick maps each category to an integer via hash(category) % n_buckets:
from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher(n_features=4, input_type='string')
hashed = hasher.transform([[c] for c in df["contract_type"]])
print(pd.DataFrame(hashed.toarray(), columns=[f"h{i}" for i in range(4)]))
h0 h1 h2 h3
0 -1.0 0.0 0.0 0.0
1 -1.0 0.0 0.0 0.0
2 0.0 0.0 0.0 1.0
3 -1.0 0.0 0.0 0.0
4 0.0 1.0 0.0 0.0
5 -1.0 0.0 0.0 0.0
6 0.0 0.0 0.0 1.0
7 0.0 1.0 0.0 0.0
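With only 4 buckets for 3 categories, no collisions happen to occur here. The −1.0 entries come from FeatureHasher's signed hashing (alternate_sign=True by default), which gives each feature a sign of +1 or −1 so that colliding values tend to cancel rather than pile up. In a real application with 10,000 categories hashed into 512 buckets, roughly 20 categories map to each bucket; some information is lost, but the feature dimensionality stays fixed at 512 regardless of cardinality. The tradeoff: memory for accuracy.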
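The bucket arithmetic can be sketched without sklearn. This version uses SHA-256 from the standard library as a stand-in hash (FeatureHasher itself uses a signed MurmurHash3, and Python's built-in hash() is salted per process for strings, so neither is reproduced here). The SKU names are synthetic.

```python
import hashlib
from collections import Counter

def hash_bucket(category, n_buckets):
    """Stable bucket index: first 8 bytes of SHA-256, reduced modulo n_buckets."""
    digest = hashlib.sha256(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def hash_encode(category, n_buckets):
    """Fixed-width indicator vector with a 1 in the category's bucket."""
    vector = [0] * n_buckets
    vector[hash_bucket(category, n_buckets)] = 1
    return vector

# Synthetic stand-in for a high-cardinality column.
categories = [f"sku_{i}" for i in range(10_000)]
n_buckets = 512

load = Counter(hash_bucket(c, n_buckets) for c in categories)
mean_load = len(categories) / n_buckets

print(f"vector width: {len(hash_encode('sku_42', n_buckets))}")
print(f"mean categories per bucket: {mean_load:.1f}")
print(f"fullest bucket holds: {max(load.values())}")
```

The width stays at n_buckets however many categories arrive, and no vocabulary has to be stored; the price is that categories sharing a bucket become indistinguishable.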
Encoding Selection Guide
Strategy Comparison
| Technique | Handles | Key Risk | Use When |
|---|---|---|---|
| Label encoding | Any categorical | Implies false ordinal relationship | Feature is genuinely ordinal |
| One-hot encoding | Nominal, low-cardinality | Dummy variable trap; curse of dimensionality | ≤50 categories, no strong label relationship |
| Ordinal encoding | Ordinal features | Wrong order destroys signal | Ordered categories (cold/warm/hot) |
| Frequency encoding | High-cardinality nominal | Collision: different categories → same code | Cardinality too high for OHE |
| Target encoding | Nominal with target relationship | Target leakage if computed naively | When category-target correlation is meaningful |
| Hashing trick | Extremely high-cardinality | Hash collisions reduce accuracy | Millions of unique categories |
Hyperparameter Sensitivity: Number of OHE Features
For higher-cardinality categoricals, the number of resulting OHE columns grows linearly. Here is the effect on a logistic regression model trained on contract_type encoded different ways:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
df_clean = df.dropna()
encodings = {
"label": df_clean["contract_type"].map({"M2M":2,"1yr":1,"2yr":0}).values.reshape(-1,1),
"ordinal": df_clean["contract_type"].map({"M2M":0,"1yr":1,"2yr":2}).values.reshape(-1,1),
"frequency": df_clean["contract_type"].map(df["contract_type"].value_counts()).values.reshape(-1,1),
"target": df_clean["contract_type"].map({"M2M":1.0,"1yr":0.0,"2yr":0.0}).values.reshape(-1,1),
"ohe": pd.get_dummies(df_clean["contract_type"], drop_first=True).values,
}
y_clean = df_clean["churned"].values
for name, X_enc in encodings.items():
    model = LogisticRegression(max_iter=500, solver='lbfgs').fit(X_enc, y_clean)
    acc = accuracy_score(y_clean, model.predict(X_enc))
    print(f"{name:12s}: cols={X_enc.shape[1]} train_acc={acc:.2f}")
label : cols=1 train_acc=0.83
ordinal : cols=1 train_acc=0.83
frequency : cols=1 train_acc=0.83
target : cols=1 train_acc=1.00
ohe : cols=2 train_acc=1.00
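Target encoding and OHE both reach 1.00 training accuracy in this run, while the scalar encodings (label, ordinal, frequency) are reported at 0.83, which on the six rows that survive dropna() means one misclassified row. That gap is not caused by a non-churning M2M customer: every M2M customer in this dataset churns, and all five encodings make M2M linearly separable from 1yr and 2yr. On data this small, the scores reflect where the default L2-regularized fit places its boundary rather than a limit on what each encoding can express, so treat them as illustrative.
The label and ordinal columns here are exact reverses of each other (2, 1, 0 versus 0, 1, 2), so a linear model fits them identically with the coefficient sign flipped. Their matching accuracy is therefore expected and says nothing about which implied ordering is right.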
Related Concepts and Honest Limitations
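Target encoding is often the most powerful technique for categorical variables correlated with the target, but it is also the most dangerous. The cross-fold fix reduces leakage but does not eliminate it: the encoding still reveals information about the training distribution. For very small categories (1–3 samples), the out-of-fold (or leave-one-out) mean is unstable. Add-k smoothing can stabilize it; one common form is $\hat{\mu}_c = \frac{\sum_{i \in c} y_i + k\,\bar{y}}{n_c + k}$, where $n_c$ is the number of training rows in category $c$, $\bar{y}$ is the global target mean, and $k$ controls how strongly small categories are pulled toward $\bar{y}$.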
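A minimal sketch of smoothed target encoding, assuming one common add-k form: (sum of targets in the category + k × global mean) / (category count + k). The function name and the choice k=2 are illustrative; in practice k would be tuned, and the means would be computed out-of-fold.

```python
def smoothed_target_means(categories, target, k=2.0):
    """Category -> (sum_y + k * prior) / (n + k), where prior is the global target mean."""
    prior = sum(target) / len(target)
    sums, counts = {}, {}
    for category, y in zip(categories, target):
        sums[category] = sums.get(category, 0.0) + y
        counts[category] = counts.get(category, 0) + 1
    return {c: (sums[c] + k * prior) / (counts[c] + k) for c in counts}

contract_type = ["M2M", "M2M", "1yr", "M2M", "2yr", "M2M", "1yr", "2yr"]
churned = [1, 1, 0, 1, 0, 1, 0, 0]

raw = smoothed_target_means(contract_type, churned, k=0.0)       # plain category means
smoothed = smoothed_target_means(contract_type, churned, k=2.0)  # pulled toward 0.5

print(raw)
print({c: round(v, 3) for c, v in smoothed.items()})
```

The two-row categories move further toward the global mean than the four-row category does, which is the intended behavior: less evidence, more shrinkage.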
OHE interacts with regularization: with L1 regularization (Lasso logistic regression), many of the dummy coefficients will be driven to zero, effectively grouping rare categories together. This is a practical alternative to explicit frequency/hashing encoding for medium-cardinality features.
Feature encoding is not a preprocessing decision made once — it is part of the model architecture and should be treated as a hyperparameter. The encoding that works best depends on the algorithm (tree models handle label encoding fine because they split on thresholds, not linear relationships; linear models don't), the number of categories, and the dataset size.
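To make the tree-versus-linear point concrete, the sketch below searches for the best single threshold split on a label-encoded column, which is roughly what one node of a decision tree does. It compares the alphabetical codes with a hypothetical coding that puts M2M in the middle; the helper names are illustrative.

```python
def split_errors(codes, labels, threshold):
    """Errors when each side of `code <= threshold` predicts its majority class."""
    left = [y for x, y in zip(codes, labels) if x <= threshold]
    right = [y for x, y in zip(codes, labels) if x > threshold]
    errors = 0
    for side in (left, right):
        ones = sum(side)
        errors += min(ones, len(side) - ones)
    return errors

def best_single_split(codes, labels):
    """Try midpoints between consecutive distinct codes; return (errors, threshold)."""
    levels = sorted(set(codes))
    midpoints = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
    return min((split_errors(codes, labels, t), t) for t in midpoints)

contract_type = ["M2M", "M2M", "1yr", "M2M", "2yr", "M2M", "1yr", "2yr"]
churned = [1, 1, 0, 1, 0, 1, 0, 0]

codings = {
    "alphabetical": {"1yr": 0, "2yr": 1, "M2M": 2},  # M2M at one end
    "scrambled": {"1yr": 0, "M2M": 1, "2yr": 2},     # M2M in the middle (hypothetical)
}

results = {}
for name, coding in codings.items():
    codes = [coding[c] for c in contract_type]
    results[name] = best_single_split(codes, churned)
    errors, threshold = results[name]
    print(f"{name}: best threshold {threshold}, errors {errors}")
```

With M2M at one end, one split is enough. With M2M in the middle, a single split leaves two errors, but a tree can add a second split and isolate the middle code, which a single linear coefficient cannot do.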
Test Your Understanding
- A feature city has 200 unique values. OHE produces 199 columns. A decision tree uses this feature. How does the tree's treatment of OHE vs label encoding differ from a logistic regression model's treatment?
- Target encoding for contract_type gives M2M an encoding of 1.0 on the training set. If a new customer has an M2M contract in the test set, what encoded value should they receive: 1.0 (the training mean), or something else? Why?
- The hashing trick with 4 buckets worked cleanly here (no collisions). If you added a fourth category "3yr" to contract_type, which bucket would it hash to? Could it collide with an existing category?
- Dropping the 2yr dummy makes the 2yr contract the reference category. What does a logistic regression coefficient of −2.5 for contract_type_M2M mean? In terms of probability, how much more or less likely is an M2M customer to churn compared to a 2yr customer (all else equal)?
- Frequency encoding assigns 1yr and 2yr the same value (2) because they appear the same number of times. If you added a third 1yr customer to the dataset (who churned), how would frequency encoding change? Would it now correctly distinguish 1yr from 2yr contracts for prediction? Explain why or why not.