Feature Engineering: Categorical Encoding

Jun 23, 2026 · 11 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

Machine learning models operate on numbers. Categorical features — strings, labels, discrete categories — need to be converted into numeric form before any algorithm can use them. The conversion method matters enormously: the wrong encoding introduces false ordinal relationships, multicollinearity, or target leakage.

The same anchor dataset is used throughout:

python
import pandas as pd
import numpy as np

data = {
    "age":             [25, None, 42, 38, 55, 29, None, 61],
    "tenure_months":   [3,  12,   24, 6,  36, 1,  48,   60],
    "monthly_charge":  [55, 70,   95, 65, 120, 45, 80,  350],
    "contract_type":   ["M2M","M2M","1yr","M2M","2yr","M2M","1yr","2yr"],
    "internet_service":["DSL","Fiber","Fiber","DSL","Fiber","None","DSL","Fiber"],
    "churned":         [1, 1, 0, 1, 0, 1, 0, 0]
}
df = pd.DataFrame(data)

The two categorical features: contract_type (3 categories: M2M, 1yr, 2yr) and internet_service (3 categories: DSL, Fiber, None). Here "None" is the literal string for customers without internet service, not a missing value.

Label Encoding

Assign each category an integer, for example M2M=0, 1yr=1, 2yr=2.

python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["contract_label"] = le.fit_transform(df["contract_type"])
print(dict(zip(le.classes_, le.transform(le.classes_))))
print(df[["contract_type","contract_label"]].to_string())

Output:
{'1yr': 0, '2yr': 1, 'M2M': 2}
   contract_type  contract_label
0            M2M               2
1            M2M               2
2            1yr               0
3            M2M               2
4            2yr               1
5            M2M               2
6            1yr               0
7            2yr               1

The assignment is alphabetical (LabelEncoder sorts the classes), giving 1yr=0, 2yr=1, M2M=2. This implies a numeric ordering: M2M (2) > 2yr (1) > 1yr (0). That ordering is an accident of spelling and carries no meaning, yet a linear model will treat the step from "1yr" to "2yr" as having the same size and direction of effect as the step from "2yr" to "M2M."

python
from sklearn.linear_model import LinearRegression
X_label = df[["contract_label"]].values
y_c = df["churned"].values
model = LinearRegression().fit(X_label, y_c)
print(f"Coefficient for contract_label: {model.coef_[0]:.4f}")
print("Interpretation: each unit increase in label corresponds to "
      f"{model.coef_[0]:.4f} change in churn probability")

Output:
Coefficient for contract_label: 0.5455
Interpretation: each unit increase in label corresponds to 0.5455 change in churn probability

The coefficient is positive, but its sign is an artifact of the encoding. No 1yr or 2yr customer churned, while all four M2M customers did. Because M2M happens to receive the highest code (2), churn rises with the code and the slope comes out positive. If you re-encoded M2M as 0 and 1yr as 2, the coefficient would flip sign. Label encoding should only be used when the integer order matches a genuine order in the categories, e.g., "low" < "medium" < "high".
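One way to check that the sign is an artifact is to fit the same one-feature line under two different integer assignments. The sketch below uses np.polyfit for the least-squares slope, which matches a single-feature linear regression; the second mapping is an arbitrary alternative chosen for illustration and is not part of the original walkthrough.

```python
import numpy as np

contract = ["M2M", "M2M", "1yr", "M2M", "2yr", "M2M", "1yr", "2yr"]
churned = np.array([1, 1, 0, 1, 0, 1, 0, 0])

# Two arbitrary integer assignments for the same three categories
alphabetical = {"1yr": 0, "2yr": 1, "M2M": 2}   # sorted order, as LabelEncoder assigns
reversed_map = {"M2M": 0, "2yr": 1, "1yr": 2}   # hypothetical alternative

def slope(mapping):
    x = np.array([mapping[c] for c in contract], dtype=float)
    # polyfit returns [slope, intercept] for a degree-1 fit
    return np.polyfit(x, churned, 1)[0]

slope_alpha = slope(alphabetical)
slope_rev = slope(reversed_map)
print(f"alphabetical codes: slope = {slope_alpha:+.4f}")
print(f"reversed codes:     slope = {slope_rev:+.4f}")
```

The two slopes have equal magnitude and opposite sign, even though the underlying data are identical. Only the arbitrary choice of integers changed.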

One-Hot Encoding

Create a binary indicator column for each category. For contract_type with 3 categories: generate 3 columns, then drop one to avoid the dummy variable trap.

python
# Build all three dummies, then drop 2yr explicitly so it becomes the reference.
# drop_first=True would instead drop the first level in sorted order (1yr).
df_ohe = pd.get_dummies(df[["contract_type"]]).drop(columns="contract_type_2yr")
print(df_ohe.to_string())

Output:
   contract_type_1yr  contract_type_M2M
0              False               True
1              False               True
2               True              False
3              False               True
4              False              False
5              False               True
6               True              False
7              False              False

Why drop a column? If all 3 dummies (1yr, 2yr, M2M) are kept, they sum to 1 for every row: 1yr + 2yr + M2M = 1. Together with the intercept column this creates perfect multicollinearity: any one column can be derived from the others. The design matrix becomes rank-deficient, XᵀX is singular (non-invertible), and unregularized OLS/logistic regression has no unique solution.

The dropped category (2yr) becomes the reference category. Coefficients for 1yr and M2M are interpreted relative to 2yr. The shortcut pd.get_dummies(..., drop_first=True) also removes one column, but it always removes the first level in sorted order, which here would be 1yr rather than 2yr.
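The rank deficiency can be checked numerically. The following is a minimal sketch, assuming a design matrix with an explicit intercept column of ones; it compares the matrix rank with all three dummies against the rank with one dummy removed.

```python
import numpy as np
import pandas as pd

contract = pd.Series(["M2M", "M2M", "1yr", "M2M", "2yr", "M2M", "1yr", "2yr"],
                     name="contract_type")

# Intercept column plus ALL three dummies: 4 columns
dummies_all = pd.get_dummies(contract).astype(float)
X_full = np.column_stack([np.ones(len(contract)), dummies_all.to_numpy()])

# Intercept column plus two dummies (2yr removed as the reference): 3 columns
X_dropped = np.column_stack([
    np.ones(len(contract)),
    dummies_all.drop(columns="2yr").to_numpy(),
])

rank_full = np.linalg.matrix_rank(X_full)
rank_dropped = np.linalg.matrix_rank(X_dropped)
print(f"all dummies: {X_full.shape[1]} columns, rank {rank_full}")
print(f"one dropped: {X_dropped.shape[1]} columns, rank {rank_dropped}")
```

With all dummies the matrix has 4 columns but rank 3, so one column is redundant. After removing a dummy, the 3 remaining columns are linearly independent.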

python
from sklearn.linear_model import LogisticRegression

# One-hot encode with a column prefix, then drop 2yr explicitly as the reference level
contract_dummies = pd.get_dummies(df["contract_type"], prefix="contract_type")
contract_dummies = contract_dummies.drop(columns="contract_type_2yr")

df_combined = pd.concat([
    df[["tenure_months"]],
    contract_dummies,
    df["churned"]
], axis=1).dropna()

X_ohe = df_combined.drop("churned", axis=1).values.astype(float)
y_ohe = df_combined["churned"].values

model_ohe = LogisticRegression(max_iter=500, solver='lbfgs').fit(X_ohe, y_ohe)
print("OHE coefficient names:", list(df_combined.drop("churned", axis=1).columns))
print("OHE coefficients:     ", model_ohe.coef_[0].round(3).tolist())

Output:
OHE coefficient names: ['tenure_months', 'contract_type_1yr', 'contract_type_M2M']
OHE coefficients:      [-0.622, -2.514, 1.887]

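The model learned that M2M contracts are strongly positively associated with churn (+1.887 relative to 2yr). This matches the data: every M2M customer churned, and no 1yr or 2yr customer did. The 1yr coefficient (−2.514) is also measured relative to 2yr, but both groups have 0% churn, so it should not be read as a real difference between the two contracts. With 8 rows and tenure_months also in the model, it mostly reflects how the fit divides the signal between features.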
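Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios relative to the reference category. The sketch below simply plugs in the two contract coefficients reported in the output above; it does not refit the model.

```python
import math

# Coefficients as reported in the output above (log-odds relative to 2yr)
coef_m2m = 1.887
coef_1yr = -2.514

# exp(coefficient) is the multiplicative change in the odds of churn
odds_ratio_m2m = math.exp(coef_m2m)
odds_ratio_1yr = math.exp(coef_1yr)
print(f"M2M vs 2yr odds ratio: {odds_ratio_m2m:.2f}")
print(f"1yr vs 2yr odds ratio: {odds_ratio_1yr:.3f}")
```

An odds ratio above 1 means higher odds of churn than the reference at the same tenure, and below 1 means lower odds. Converting odds to a probability difference requires a baseline probability, which the coefficient alone does not provide.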

Curse of dimensionality: A postal_code feature with 1,000 unique codes produces 999 OHE columns. Most will be 0 for most rows (sparse). High-dimensional sparse matrices slow down training and require far more data to estimate each coefficient. For high-cardinality categoricals, use frequency encoding, target encoding, or the hashing trick instead.
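To make the column count concrete, here is a small sketch on synthetic data. The postal codes, the row count, and the random seed are all made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: 1,000 distinct codes, each appearing at least once
codes = [f"PC{i:04d}" for i in range(1000)]
postal = pd.Series(
    codes + list(rng.choice(codes, size=4000)),
    name="postal_code",
)

ohe = pd.get_dummies(postal, drop_first=True)
density = ohe.to_numpy().mean()   # fraction of cells that are 1
print(f"rows={ohe.shape[0]}, columns={ohe.shape[1]}, density={density:.4f}")
```

Each row has at most one non-zero entry out of 999 columns, which is why sparse storage or a lower-dimensional encoding is usually preferred at this cardinality.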

Ordinal Encoding

When the categorical variable has a meaningful order, represent it numerically with that order preserved explicitly.

python
from sklearn.preprocessing import OrdinalEncoder

contract_order = [["M2M", "1yr", "2yr"]]  # month-to-month < 1yr < 2yr commitment
oe = OrdinalEncoder(categories=contract_order)
df["contract_ordinal"] = oe.fit_transform(df[["contract_type"]])

print(df[["contract_type","contract_ordinal"]].to_string())

Output:
   contract_type  contract_ordinal
0            M2M               0.0
1            M2M               0.0
2            1yr               1.0
3            M2M               0.0
4            2yr               2.0
5            M2M               0.0
6            1yr               1.0
7            2yr               2.0

The order M2M < 1yr < 2yr reflects commitment level. A model using this feature will correctly infer that 2yr contracts indicate higher commitment than 1yr. Crucially, the categories parameter must specify the order explicitly — never rely on alphabetical default ordering, which happens to be wrong here (1yr < 2yr < M2M alphabetically, not by commitment).
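The same explicit ordering can be expressed in pandas alone. This is a sketch of an alternative to OrdinalEncoder, not a replacement for it; the "3yr" value is a hypothetical unseen category added to show what happens outside the declared list.

```python
import pandas as pd

contract = pd.Series(["M2M", "M2M", "1yr", "M2M", "2yr", "M2M", "1yr", "2yr"])

# Declare the order explicitly; codes follow the position in `categories`
ordered = pd.Categorical(contract, categories=["M2M", "1yr", "2yr"], ordered=True)
codes = pd.Series(ordered.codes, index=contract.index)
print(codes.tolist())

# A value outside the declared categories becomes code -1 rather than an error
unseen = pd.Categorical(["3yr"], categories=["M2M", "1yr", "2yr"], ordered=True)
print(unseen.codes.tolist())
```

The -1 code is easy to miss downstream, so it is worth checking for it before passing the column to a model.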

Frequency Encoding

Replace each category with its count (or proportion) in the training data:

python
freq_map = df["contract_type"].value_counts().to_dict()
df["contract_freq"] = df["contract_type"].map(freq_map)
print(freq_map)
print(df[["contract_type","contract_freq"]].to_string())

Output:
{'M2M': 4, '1yr': 2, '2yr': 2}
   contract_type  contract_freq
0            M2M              4
1            M2M              4
2            1yr              2
3            M2M              4
4            2yr              2
5            M2M              4
6            1yr              2
7            2yr              2

Frequency encoding captures how common each category is — M2M is the most common contract type. No dimensionality explosion, so it scales to high-cardinality features.

The main problem: two categories with the same frequency get the same encoded value. Here, 1yr and 2yr both encode as 2, so a model given only the frequency-encoded column cannot distinguish a 1yr-contract customer from a 2yr-contract customer. The two groups happen to have the same churn behavior in this dataset, but that is coincidental, not guaranteed.
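Because the frequencies come from the training data, new data should be encoded with the stored training map rather than recounted. The sketch below uses proportions instead of raw counts and assumes 0.0 as the fallback for an unseen category; both the "3yr" value and that fallback are illustrative choices.

```python
import pandas as pd

train = pd.Series(["M2M", "M2M", "1yr", "M2M", "2yr", "M2M", "1yr", "2yr"])
new = pd.Series(["M2M", "2yr", "3yr"])   # "3yr" is a hypothetical unseen category

# Learn proportions on the training data only
freq = train.value_counts(normalize=True)

train_enc = train.map(freq)
# Unseen categories have no entry in `freq`; 0.0 is one possible fallback
new_enc = new.map(freq).fillna(0.0)

print(freq.to_dict())
print(new_enc.tolist())
```

Proportions keep the feature on the same scale when the training set grows, which raw counts do not.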

Target-Guided Encoding (Mean Target Encoding)

Replace each category with the mean of the target variable for rows in that category:

| contract_type | Rows | churned values | Mean churn |
|---|---|---|---|
| M2M | 0, 1, 3, 5 | [1, 1, 1, 1] | 1.000 |
| 1yr | 2, 6 | [0, 0] | 0.000 |
| 2yr | 4, 7 | [0, 0] | 0.000 |

python
mean_target = df.groupby("contract_type")["churned"].mean()
print(mean_target)
df["contract_target"] = df["contract_type"].map(mean_target)
print(df[["contract_type","contract_target"]].to_string())

Output:
contract_type
1yr    0.0
2yr    0.0
M2M    1.0
Name: churned, dtype: float64
   contract_type  contract_target
0            M2M              1.0
1            M2M              1.0
2            1yr              0.0
3            M2M              1.0
4            2yr              0.0
5            M2M              1.0
6            1yr              0.0
7            2yr              0.0

Target encoding captures the relationship between the category and the target directly. This is powerful — but introduces target leakage when computed naively on the full training set.

The Leakage Problem

For row 0 (M2M, churned=1): the M2M mean of 1.000 includes row 0 itself, so the feature for row 0 partly encodes row 0's own label. The model sees contract_target=1.0 and churned=1 for the same row and learns to predict churned=1 whenever the encoding is 1.0. On training data, accuracy is perfect. Test rows are encoded with training statistics that never included their own labels, so the encoding is a weaker predictor there than it appeared during training, and the model underperforms its training score.

Cross-Fold Fix

Compute the encoding using out-of-fold samples — for each row, only include data from folds that do not contain that row:

python
from sklearn.model_selection import KFold

df["contract_target_oof"] = np.nan
kf = KFold(n_splits=3, shuffle=True, random_state=42)

for train_idx, val_idx in kf.split(df):
    train_fold = df.iloc[train_idx]
    fold_means = train_fold.groupby("contract_type")["churned"].mean()
    df.loc[df.index[val_idx], "contract_target_oof"] = \
        df.iloc[val_idx]["contract_type"].map(fold_means)

print(df[["contract_type","contract_target","contract_target_oof"]].to_string())

Output:
   contract_type  contract_target  contract_target_oof
0            M2M              1.0                  1.0
1            M2M              1.0                  1.0
2            1yr              0.0                  0.0
3            M2M              1.0                  1.0
4            2yr              0.0                  0.0
5            M2M              1.0                  1.0
6            1yr              0.0                  0.0
7            2yr              0.0                  0.0

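On this 8-sample dataset with perfect category-label alignment, the out-of-fold encoding is identical to the naive encoding. (If the training portion of a fold contained no rows of some category, map would return NaN for it, so a fallback such as the global training mean is needed.) The difference shows up on real data. In a category with thousands of rows, each row's own label contributes only 1/n to the category mean, so the leak is negligible. In a small category, a row's own label dominates the mean, and naive encoding can inflate training accuracy by several percentage points.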
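The toy dataset hides the leak, so here is a sketch on synthetic noise where it is visible. Everything in it is hypothetical: 100 categories of 2 rows each, with coin-flip labels, so the category carries no real signal. It uses leave-one-out encoding, which is the limiting case of the cross-fold idea with one row per fold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical noise data: 100 categories with 2 rows each, labels are coin flips,
# so the category carries no real information about the target.
demo = pd.DataFrame({
    "cat": np.repeat(np.arange(100), 2),
    "y": rng.integers(0, 2, size=200),
})

grp = demo.groupby("cat")["y"]
cat_sum = grp.transform("sum")
cat_cnt = grp.transform("count")

# Naive encoding: category mean including the row's own label
demo["enc_naive"] = cat_sum / cat_cnt
# Leave-one-out encoding: category mean excluding the row's own label
demo["enc_loo"] = (cat_sum - demo["y"]) / (cat_cnt - 1)

corr_naive = demo["enc_naive"].corr(demo["y"])
corr_loo = demo["enc_loo"].corr(demo["y"])
print(f"naive: {corr_naive:.2f}   leave-one-out: {corr_loo:.2f}")
```

The naive encoding correlates strongly with the label even though the category is pure noise, because half of each category mean is the row's own label. The leave-one-out version stays close to zero.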

Hashing Trick

For very high-cardinality features (postal codes, user IDs, product SKUs with millions of distinct values), OHE is infeasible. The hashing trick maps each category to an integer via hash(category) % n_buckets:

python
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=4, input_type='string')
hashed = hasher.transform([[c] for c in df["contract_type"]])
print(pd.DataFrame(hashed.toarray(), columns=[f"h{i}" for i in range(4)]))

Output:
    h0   h1   h2   h3
0 -1.0  0.0  0.0  0.0
1 -1.0  0.0  0.0  0.0
2  0.0  0.0  0.0  1.0
3 -1.0  0.0  0.0  0.0
4  0.0  1.0  0.0  0.0
5 -1.0  0.0  0.0  0.0
6  0.0  0.0  0.0  1.0
7  0.0  1.0  0.0  0.0

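With only 4 buckets for 3 categories, no collisions occur here. The −1.0 entries come from FeatureHasher's alternate_sign option (on by default), which uses the hash to assign a sign of ±1 so that colliding features tend to cancel rather than pile up. In a real application with 10,000 categories hashed into 512 buckets, roughly 20 categories map to each bucket on average. Some information is lost, but the feature dimensionality stays fixed at 512 regardless of cardinality. The tradeoff: a fixed memory footprint in exchange for some accuracy.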
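The hash(category) % n_buckets idea can be sketched with the standard library. MD5 is used here only as a stable stand-in hash (Python's built-in hash() for strings is randomized per process, and FeatureHasher itself uses MurmurHash3); the SKU strings are hypothetical.

```python
import hashlib
from collections import Counter

def bucket(category: str, n_buckets: int) -> int:
    """Map a category to a bucket index with a stable hash (MD5 here)."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Hypothetical high-cardinality feature: 10,000 distinct SKU strings
skus = [f"SKU-{i}" for i in range(10_000)]
counts = Counter(bucket(s, 512) for s in skus)

print("buckets used:", len(counts))
print("mean categories per used bucket:", round(len(skus) / len(counts), 1))
print("largest bucket:", max(counts.values()))
```

The output width is set by n_buckets alone, so adding new categories later never changes the feature dimensionality. It only changes how crowded the buckets are.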

Encoding Selection Guide

  1. Is the feature ordinal (does it have a natural order)? Yes → Ordinal Encoding (define the order explicitly). No → go to step 2.

  2. High cardinality (> 50 categories)? Yes → Frequency / Hashing (no dimensionality explosion). No → go to step 3.

  3. Target available and the category-target relationship matters? Yes → Target Encoding (use out-of-fold to avoid leakage). No → One-Hot Encoding (drop one column to avoid the dummy variable trap).
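The guide's three questions can be written as a small function. This is a sketch only: the function name, argument names, and return strings are made up for illustration, and the threshold of 50 is the guide's rule of thumb rather than a hard limit.

```python
def choose_encoding(is_ordinal: bool, n_categories: int,
                    target_related: bool, high_card_threshold: int = 50) -> str:
    """Walk the guide's three questions in order and return the suggested encoding."""
    if is_ordinal:
        return "ordinal"
    if n_categories > high_card_threshold:
        return "frequency or hashing"
    if target_related:
        return "target (out-of-fold)"
    return "one-hot"

print(choose_encoding(False, 3, False))      # contract_type treated as nominal
print(choose_encoding(False, 1000, True))    # e.g. a postal code feature
```

The order of the checks matters: cardinality is tested before the target relationship, so a high-cardinality feature is routed to frequency or hashing even when it correlates with the target.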

Strategy Comparison

| Technique | Handles | Key Risk | Use When |
|---|---|---|---|
| Label encoding | Any categorical | Implies false ordinal relationship | Feature is genuinely ordinal |
| One-hot encoding | Nominal, low-cardinality | Dummy variable trap; curse of dimensionality | ≤50 categories, no strong label relationship |
| Ordinal encoding | Ordinal features | Wrong order destroys signal | Ordered categories (cold/warm/hot) |
| Frequency encoding | High-cardinality nominal | Collision: different categories → same code | Cardinality too high for OHE |
| Target encoding | Nominal with target relationship | Target leakage if computed naively | When category-target correlation is meaningful |
| Hashing trick | Extremely high-cardinality | Hash collisions reduce accuracy | Millions of unique categories |

Hyperparameter Sensitivity: Number of OHE Features

For higher-cardinality categoricals, the number of resulting OHE columns grows linearly. Here is the effect on a logistic regression model trained on contract_type encoded different ways:

python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df_clean = df.dropna()

encodings = {
    "label":     df_clean["contract_type"].map({"M2M":2,"1yr":1,"2yr":0}).values.reshape(-1,1),
    "ordinal":   df_clean["contract_type"].map({"M2M":0,"1yr":1,"2yr":2}).values.reshape(-1,1),
    "frequency": df_clean["contract_type"].map(df["contract_type"].value_counts()).values.reshape(-1,1),
    "target":    df_clean["contract_type"].map({"M2M":1.0,"1yr":0.0,"2yr":0.0}).values.reshape(-1,1),
    "ohe":       pd.get_dummies(df_clean["contract_type"], drop_first=True).values,
}

y_clean = df_clean["churned"].values
for name, X_enc in encodings.items():
    model = LogisticRegression(max_iter=500, solver='lbfgs').fit(X_enc, y_clean)
    acc = accuracy_score(y_clean, model.predict(X_enc))
    print(f"{name:12s}: cols={X_enc.shape[1]}  train_acc={acc:.2f}")

Output:
label       : cols=1  train_acc=1.00
ordinal     : cols=1  train_acc=1.00
frequency   : cols=1  train_acc=1.00
target      : cols=1  train_acc=1.00
ohe         : cols=2  train_acc=1.00

All five encodings reach 1.00 training accuracy. After dropna() removes the two rows with a missing age, 6 rows remain: three M2M customers who all churned and three 1yr/2yr customers who did not. Every encoding here places M2M at one extreme of its scale (highest for label, frequency, and target; lowest for ordinal), so a single threshold on the encoded column separates the classes.

This toy dataset therefore cannot expose the false-ordinal problem on its own. A scalar encoding fails a linear model when the category with the middle code behaves differently from both of its neighbours. If M2M were coded as 1, between 1yr=0 and 2yr=2, no single threshold could isolate it, while target encoding and OHE would still separate it cleanly.
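A single encoded column gives a linear classifier only one threshold to work with. The sketch below, using only NumPy, searches every threshold on the six complete rows under two integer codings; the coding that places M2M in the middle is hypothetical and chosen to show the failure case. A threshold search stands in for logistic regression here because a one-feature logistic model also predicts by thresholding that feature.

```python
import numpy as np

# The six complete rows: three M2M churners, three 1yr/2yr non-churners
contract = ["M2M", "1yr", "M2M", "2yr", "M2M", "2yr"]
churned = np.array([1, 0, 1, 0, 1, 0])

def best_threshold_accuracy(x, y):
    """Best accuracy of any rule 'predict 1 if x > t' or 'predict 1 if x < t'."""
    x = np.asarray(x, dtype=float)
    cuts = np.concatenate([[x.min() - 1], np.unique(x) + 0.5])
    best = 0.0
    for t in cuts:
        for pred in (x > t, x < t):
            best = max(best, float(np.mean(pred.astype(int) == y)))
    return best

m2m_at_end = [{"1yr": 0, "2yr": 1, "M2M": 2}[c] for c in contract]
m2m_in_middle = [{"1yr": 0, "M2M": 1, "2yr": 2}[c] for c in contract]  # hypothetical coding

acc_end = best_threshold_accuracy(m2m_at_end, churned)
acc_middle = best_threshold_accuracy(m2m_in_middle, churned)
print(f"M2M at the end:    {acc_end:.2f}")
print(f"M2M in the middle: {acc_middle:.2f}")
```

With M2M at the end of the scale, one cut separates the classes perfectly. With M2M in the middle, the best any single cut can do is 5 of 6 rows, because the churners sit between two groups of non-churners.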

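Target encoding is often the most powerful technique for categorical variables correlated with the target, but it is also the most dangerous. The cross-fold fix reduces leakage but does not eliminate it: the encoding still reveals information about the training distribution. For very small categories (1–3 samples), the LOO (leave-one-out) mean is unstable. Add-k smoothing can stabilize it by blending the category mean with the global mean; one common form is $\text{enc}(c) = \frac{\sum_{i \in c} y_i + k\,\bar{y}}{n_c + k}$, where $n_c$ is the number of training rows in category $c$, $\bar{y}$ is the global target mean, and $k$ is the smoothing strength.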
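Add-k smoothing can be sketched in a few lines of pandas. This assumes one common form of the idea, in which each category mean is blended with the global mean as if k extra rows at the global mean had been observed; the function name and the value k=2 are illustrative choices.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "contract_type": ["M2M", "M2M", "1yr", "M2M", "2yr", "M2M", "1yr", "2yr"],
    "churned":       [1, 1, 0, 1, 0, 1, 0, 0],
})

def smoothed_target_encoding(frame, col, target, k=2.0):
    """Blend each category mean with the global mean using k pseudo-observations."""
    global_mean = frame[target].mean()
    stats = frame.groupby(col)[target].agg(["sum", "count"])
    return (stats["sum"] + k * global_mean) / (stats["count"] + k)

enc = smoothed_target_encoding(df_demo, "contract_type", "churned", k=2.0)
print(enc.round(3).to_dict())
```

Small categories are pulled harder toward the global mean than large ones: with k=2, the two-row categories move halfway to 0.5, while the four-row M2M category moves only a third of the way.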

OHE interacts with regularization: with L1 regularization (Lasso logistic regression), many of the dummy coefficients will be driven to zero, effectively grouping rare categories together. This is a practical alternative to explicit frequency/hashing encoding for medium-cardinality features.

Feature encoding is not a preprocessing decision made once — it is part of the model architecture and should be treated as a hyperparameter. The encoding that works best depends on the algorithm (tree models handle label encoding fine because they split on thresholds, not linear relationships; linear models don't), the number of categories, and the dataset size.

Test Your Understanding

  1. A feature city has 200 unique values. OHE produces 199 columns. A decision tree uses this feature. How does the tree's treatment of OHE vs label encoding differ from a logistic regression model's treatment?

  2. Target encoding for contract_type gives M2M an encoding of 1.0 on the training set. If a new customer has an M2M contract in the test set, what encoded value should they receive: 1.0 (the training mean), or something else? Why?

  3. The hashing trick with 4 buckets worked cleanly here (no collisions). If you added a fourth category "3yr" to contract_type, which bucket would it hash to? Could it collide with an existing category?

  4. Suppose OHE is built with the 2yr contract as the dropped reference category. What does a logistic regression coefficient of −2.5 for contract_type_M2M mean? In terms of probability, how much more or less likely is an M2M customer to churn compared to a 2yr customer (all else equal)?

  5. Frequency encoding assigns 1yr and 2yr the same value (2) because they appear the same number of times. If you added a third 1yr customer to the dataset (who churned), how would frequency encoding change? Would it now correctly distinguish 1yr from 2yr contracts for prediction? Explain why or why not.
