Feature Engineering: Categorical Encoding

Jun 23, 2026 • 11 min read • By Mohammed Vasim

Machine Learning, AI, Data Science

Machine learning models operate on numbers. Categorical features — strings, labels, discrete categories — need to be converted into numeric form before any algorithm can use them. The conversion method matters enormously: the wrong encoding introduces false ordinal relationships, multicollinearity, or target leakage.

The same anchor dataset is used throughout:

python

import pandas as pd
import numpy as np

data = {
    "age":             [25, None, 42, 38, 55, 29, None, 61],
    "tenure_months":   [3,  12,   24, 6,  36, 1,  48,   60],
    "monthly_charge":  [55, 70,   95, 65, 120, 45, 80,  350],
    "contract_type":   ["M2M","M2M","1yr","M2M","2yr","M2M","1yr","2yr"],
    "internet_service":["DSL","Fiber","Fiber","DSL","Fiber","None","DSL","Fiber"],
    "churned":         [1, 1, 0, 1, 0, 1, 0, 0]
}
df = pd.DataFrame(data)

The two categorical features: contract_type (3 categories: M2M, 1yr, 2yr) and internet_service (3 categories: DSL, Fiber, None).

Label Encoding

Assign each category an integer. scikit-learn's LabelEncoder sorts the categories and numbers them in that order, so here 1yr=0, 2yr=1, M2M=2.

python

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["contract_label"] = le.fit_transform(df["contract_type"])
print(dict(zip(le.classes_, le.transform(le.classes_))))
print(df[["contract_type","contract_label"]].to_string())

{'1yr': 0, '2yr': 1, 'M2M': 2}
  contract_type  contract_label
0           M2M               2
1           M2M               2
2           1yr               0
3           M2M               2
4           2yr               1
5           M2M               2
6           1yr               0
7           2yr               1

The assignment is alphabetical (LabelEncoder sorts the class labels), giving 1yr=0, 2yr=1, M2M=2. This implies a numeric ordering: M2M (2) > 2yr (1) > 1yr (0). That ordering is arbitrary: it reflects spelling, not commitment length or anything else about the contracts. A linear model will nonetheless treat the step from "1yr" to "2yr" as having the same size and direction of effect as the step from "2yr" to "M2M".

python

from sklearn.linear_model import LinearRegression
X_label = df[["contract_label"]].values
y_c = df["churned"].values
model = LinearRegression().fit(X_label, y_c)
print(f"Coefficient for contract_label: {model.coef_[0]:.4f}")
print("Interpretation: each unit increase in label corresponds to "
      f"{model.coef_[0]:.4f} change in churn probability")

Coefficient for contract_label: 0.5455
Interpretation: each unit increase in label corresponds to 0.5455 change in churn probability

The coefficient is positive, but its sign is an artifact of the encoding. In this dataset all four M2M customers churned and none of the 1yr or 2yr customers did. Because M2M happens to receive the largest code (2), higher codes go with more churn and the slope comes out positive. Re-encode M2M as 0 and 1yr as 2 and the sign flips, even though the data are unchanged. Label encoding should only be used when the categorical variable is genuinely ordinal, e.g., "low" < "medium" < "high".
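The sign flip can be checked directly. The sketch below rebuilds only the two relevant columns and fits a least-squares slope under two integer assignments; the helper name slope and the choice of np.polyfit are illustrative, not part of the original walkthrough.

```python
import numpy as np

contract = ["M2M", "M2M", "1yr", "M2M", "2yr", "M2M", "1yr", "2yr"]
churned = np.array([1, 1, 0, 1, 0, 1, 0, 0])

# Two equally arbitrary integer assignments for the same three categories.
alphabetical = {"1yr": 0, "2yr": 1, "M2M": 2}
reversed_map = {"M2M": 0, "2yr": 1, "1yr": 2}

def slope(mapping):
    """Least-squares slope of churn on the integer-encoded contract type."""
    x = np.array([mapping[c] for c in contract], dtype=float)
    return np.polyfit(x, churned, deg=1)[0]

slope_alpha = slope(alphabetical)
slope_rev = slope(reversed_map)
print(f"alphabetical codes: {slope_alpha:+.4f}")
print(f"reversed codes:     {slope_rev:+.4f}")
```

The two slopes have equal magnitude and opposite sign, because the second mapping is simply the first one reversed. Nothing about the customers changed, only the labels' integer codes.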

One-Hot Encoding

Create a binary indicator column for each category. For contract_type with 3 categories: generate 3 columns, then drop one to avoid the dummy variable trap.

python

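# drop_first removes the first category, so list "2yr" first to make it the reference.
contract_cat = df["contract_type"].astype(pd.CategoricalDtype(["2yr", "1yr", "M2M"]))
df_ohe = pd.get_dummies(contract_cat.to_frame(), drop_first=True)
print(df_ohe.to_string())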

   contract_type_1yr  contract_type_M2M
0              False               True
1              False               True
2               True              False
3              False               True
4              False              False
5              False               True
6               True              False
7              False              False

Why drop_first=True? If all 3 dummies (1yr, 2yr, M2M) are kept, they sum to 1 in every row: 1yr + 2yr + M2M = 1. That sum is identical to the intercept column, so the columns are perfectly collinear: any one of them can be derived from the others. The matrix $X^{T} X$ becomes singular (non-invertible), and unregularized OLS or logistic regression has no unique solution.

With drop_first=True, the dropped category (2yr) becomes the reference category. Coefficients for 1yr and M2M are interpreted relative to 2yr.
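A quick numerical check of the collinearity argument, as a minimal sketch. It rebuilds the contract_type column on its own, adds an explicit intercept column (the helper with_intercept is illustrative, not a library function), and compares the number of columns with the matrix rank.

```python
import numpy as np
import pandas as pd

contract = pd.Series(
    ["M2M", "M2M", "1yr", "M2M", "2yr", "M2M", "1yr", "2yr"],
    name="contract_type",
)

def with_intercept(dummies):
    """Prepend a column of ones, as a design matrix with an intercept would have."""
    values = dummies.to_numpy(dtype=float)
    return np.column_stack([np.ones(len(values)), values])

X_full = with_intercept(pd.get_dummies(contract))                      # intercept + 3 dummies
X_dropped = with_intercept(pd.get_dummies(contract, drop_first=True))  # intercept + 2 dummies

for name, X in [("all dummies", X_full), ("one dropped", X_dropped)]:
    rank = np.linalg.matrix_rank(X)
    print(f"{name}: {X.shape[1]} columns, rank {rank}")
```

With all three dummies the matrix has four columns but rank three, so one column is redundant. After dropping one dummy, the column count and the rank agree. The argument does not depend on which category is dropped.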

python

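from sklearn.linear_model import LogisticRegression

# List "2yr" first so that drop_first makes it the reference category.
contract_cat = df["contract_type"].astype(pd.CategoricalDtype(["2yr", "1yr", "M2M"]))

df_combined = pd.concat([
    df[["tenure_months"]],
    pd.get_dummies(contract_cat, prefix="contract_type", drop_first=True),
    df["churned"]
], axis=1).dropna()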

X_ohe = df_combined.drop("churned", axis=1).values.astype(float)
y_ohe = df_combined["churned"].values

model_ohe = LogisticRegression(max_iter=500, solver='lbfgs').fit(X_ohe, y_ohe)
print("OHE coefficient names:", list(df_combined.drop("churned", axis=1).columns))
print("OHE coefficients:     ", model_ohe.coef_[0].round(3).tolist())

OHE coefficient names: ['tenure_months', 'contract_type_1yr', 'contract_type_M2M']
OHE coefficients:      [-0.622,  -2.514,   1.887]

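The model learned that M2M contracts are strongly positively associated with churn (+1.887 relative to 2yr), which matches the data: every M2M customer churned and no 1yr or 2yr customer did. The 1yr coefficient (−2.514 relative to 2yr) should not be over-read. Both groups have 0% churn, so that gap is not supported by the churn rates; with eight perfectly separable rows and a tenure feature that is correlated with contract type, individual coefficient magnitudes are determined largely by regularization.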

Curse of dimensionality: A postal_code feature with 1,000 unique codes produces 999 OHE columns. Most will be 0 for most rows (sparse). High-dimensional sparse matrices slow down training and require far more data to estimate each coefficient. For high-cardinality categoricals, use frequency encoding, target encoding, or the hashing trick instead.
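To make the sparsity concrete, here is a small sketch using scikit-learn's OneHotEncoder, which returns a SciPy sparse matrix by default. The postal codes are made up for illustration (1,000 synthetic codes repeated over 5,000 rows), not data from this article.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature: 5,000 rows cycling through 1,000 made-up postal codes.
postal_code = np.array([f"PC{i % 1000:04d}" for i in range(5000)]).reshape(-1, 1)

encoder = OneHotEncoder(drop="first")   # sparse output is the default
X_sparse = encoder.fit_transform(postal_code)

n_rows, n_cols = X_sparse.shape
density = X_sparse.nnz / (n_rows * n_cols)
print(f"shape={X_sparse.shape}, non-zero fraction={density:.5f}")
```

Each row holds at most a single 1 among 999 columns, so roughly one entry in a thousand is non-zero. Sparse storage keeps memory manageable, but the model still has 999 coefficients to estimate.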

Ordinal Encoding

When the categorical variable has a meaningful order, represent it numerically with that order preserved explicitly.

python

from sklearn.preprocessing import OrdinalEncoder

contract_order = [["M2M", "1yr", "2yr"]]  # month-to-month < 1yr < 2yr commitment
oe = OrdinalEncoder(categories=contract_order)
df["contract_ordinal"] = oe.fit_transform(df[["contract_type"]])

print(df[["contract_type","contract_ordinal"]].to_string())

  contract_type  contract_ordinal
0           M2M               0.0
1           M2M               0.0
2           1yr               1.0
3           M2M               0.0
4           2yr               2.0
5           M2M               0.0
6           1yr               1.0
7           2yr               2.0

The order M2M < 1yr < 2yr reflects commitment level. A model using this feature will correctly infer that 2yr contracts indicate higher commitment than 1yr. Crucially, the categories parameter must specify the order explicitly — never rely on alphabetical default ordering, which happens to be wrong here (1yr < 2yr < M2M alphabetically, not by commitment).
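One practical wrinkle with an explicit category list is that new data may contain a value that was never listed. A minimal sketch, assuming scikit-learn 0.24 or later where OrdinalEncoder accepts handle_unknown="use_encoded_value"; the "3yr" category and the -1 sentinel are illustrative choices, not part of the article's dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"contract_type": ["M2M", "1yr", "2yr", "M2M"]})
new = pd.DataFrame({"contract_type": ["2yr", "3yr"]})   # "3yr" is a made-up unseen category

oe = OrdinalEncoder(
    categories=[["M2M", "1yr", "2yr"]],    # explicit commitment order
    handle_unknown="use_encoded_value",
    unknown_value=-1,                      # sentinel for categories not listed above
)
oe.fit(train)
codes = oe.transform(new).ravel()
print(codes)
```

Without handle_unknown, transform raises an error on "3yr". Note that the sentinel -1 sorts below M2M, which has no ordinal meaning; whether that is acceptable depends on the downstream model.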

Frequency Encoding

Replace each category with its count (or proportion) in the training data:

python

freq_map = df["contract_type"].value_counts().to_dict()
df["contract_freq"] = df["contract_type"].map(freq_map)
print(freq_map)
print(df[["contract_type","contract_freq"]].to_string())

{'M2M': 4, '1yr': 2, '2yr': 2}
  contract_type  contract_freq
0           M2M              4
1           M2M              4
2           1yr              2
3           M2M              4
4           2yr              2
5           M2M              4
6           1yr              2
7           2yr              2

Frequency encoding captures how common each category is — M2M is the most common contract type. No dimensionality explosion, so it scales to high-cardinality features.

The main problem: categories with the same frequency receive the same encoded value. Here, 1yr and 2yr both encode as 2, so a model that sees only the frequency-encoded column cannot tell a 1yr customer from a 2yr customer. That happens to be harmless here because both groups have the same churn behavior, but that is a coincidence of this dataset, not a guarantee.
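The frequency map should be learned from the training split and then applied unchanged to new rows. The sketch below uses proportions rather than counts and assigns 0 to unseen categories; the test rows and the "3yr" value are hypothetical.

```python
import pandas as pd

train = pd.Series(["M2M", "M2M", "1yr", "M2M", "2yr", "M2M", "1yr", "2yr"])
test = pd.Series(["M2M", "2yr", "3yr"])   # "3yr" is a hypothetical unseen category

# Learn proportions from the training split only.
freq = train.value_counts(normalize=True)

# Unseen categories have no entry in `freq`; treat them as frequency 0.
test_encoded = test.map(freq).fillna(0.0)
print(test_encoded.tolist())   # [0.5, 0.25, 0.0]
```

Using proportions keeps the encoded values comparable when the training set grows, which raw counts do not.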

Target-Guided Encoding (Mean Target Encoding)

Replace each category with the mean of the target variable for rows in that category:

contract_type	Rows	churned values	Mean churn
M2M	0,1,3,5	[1,1,1,1]	1.000
1yr	2,6	[0,0]	0.000
2yr	4,7	[0,0]	0.000

python

mean_target = df.groupby("contract_type")["churned"].mean()
print(mean_target)
df["contract_target"] = df["contract_type"].map(mean_target)
print(df[["contract_type","contract_target"]].to_string())

contract_type
1yr    0.0
2yr    0.0
M2M    1.0
Name: churned, dtype: float64

  contract_type  contract_target
0           M2M              1.0
1           M2M              1.0
2           1yr              0.0
3           M2M              1.0
4           2yr              0.0
5           M2M              1.0
6           1yr              0.0
7           2yr              0.0

Target encoding captures the relationship between the category and the target directly. This is powerful — but introduces target leakage when computed naively on the full training set.

The Leakage Problem

For row 0 (M2M, churned=1): the M2M mean of 1.000 includes row 0 itself, so the feature for row 0 partly encodes row 0's own label. The model sees contract_target=1.0 and churned=1 on the same row and learns to predict churned=1 whenever the encoding is 1.0. On training data, accuracy looks perfect. A test row receives the same training-set mean, but its own label never contributed to that mean, so the encoding is less predictive for it than it appeared during training, and the model underperforms.
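Leakage is easiest to see on data with no signal at all. The sketch below uses synthetic random categories and random labels (not the article's dataset), applies the naive encoding, and compares how strongly the encoded feature correlates with the label on training rows versus fresh holdout rows. All names and sizes are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_categories = 2000, 1000      # about 2 rows per category: small groups

def make_split():
    """Random categories and random labels: there is no real signal."""
    return pd.DataFrame({
        "cat": rng.integers(0, n_categories, size=n_rows),
        "y": rng.integers(0, 2, size=n_rows),
    })

train, holdout = make_split(), make_split()

# Naive encoding: each training row's category mean includes its own label.
cat_means = train.groupby("cat")["y"].mean()
train_enc = train["cat"].map(cat_means)

# Holdout rows get training means; unseen categories fall back to the global mean.
holdout_enc = holdout["cat"].map(cat_means).fillna(train["y"].mean())

corr_train = np.corrcoef(train_enc, train["y"])[0, 1]
corr_holdout = np.corrcoef(holdout_enc, holdout["y"])[0, 1]
print(f"train corr:   {corr_train:.2f}")
print(f"holdout corr: {corr_holdout:.2f}")
```

With this setup the training correlation should come out clearly positive while the holdout correlation stays near zero. The apparent predictive power on the training rows comes entirely from each row's own label leaking into its feature.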

Cross-Fold Fix

Compute the encoding using out-of-fold samples — for each row, only include data from folds that do not contain that row:

python

from sklearn.model_selection import KFold

df["contract_target_oof"] = np.nan
kf = KFold(n_splits=3, shuffle=True, random_state=42)

for train_idx, val_idx in kf.split(df):
    train_fold = df.iloc[train_idx]
    fold_means = train_fold.groupby("contract_type")["churned"].mean()
    df.loc[df.index[val_idx], "contract_target_oof"] = \
        df.iloc[val_idx]["contract_type"].map(fold_means)

print(df[["contract_type","contract_target","contract_target_oof"]].to_string())

  contract_type  contract_target  contract_target_oof
0           M2M              1.0                  1.0
1           M2M              1.0                  1.0
2           1yr              0.0                  0.0
3           M2M              1.0                  1.0
4           2yr              0.0                  0.0
5           M2M              1.0                  1.0
6           1yr              0.0                  0.0
7           2yr              0.0                  0.0

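On this 8-sample dataset with perfect category-label alignment, the out-of-fold encoding is identical to the naive encoding. In general the two differ, because the naive mean includes each row's own label, which contributes $1/n_{\text{category}}$ of the mean. For a category with thousands of rows that share is negligible; for a category with only a handful of rows it is large enough to inflate training accuracy by several percentage points. Note also that if every row of a category lands in the same validation fold, the training folds contain no mean for it and map returns NaN; a common fallback is the global mean of the training folds.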
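The out-of-fold loop can be wrapped in a reusable function. This is a sketch under stated assumptions: the function name oof_target_encode and the global-mean fallback for categories missing from the training folds are choices made here, not something the original code does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(categories, target, n_splits=3, seed=42):
    """Out-of-fold mean target encoding with a global-mean fallback."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=categories.index)

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(categories):
        fold_target = target.iloc[train_idx]
        fold_means = fold_target.groupby(categories.iloc[train_idx]).mean()
        # A category absent from the training folds maps to NaN;
        # fall back to the training folds' overall mean.
        encoded.iloc[val_idx] = (
            categories.iloc[val_idx].map(fold_means).fillna(fold_target.mean()).to_numpy()
        )
    return encoded

contract = ["M2M", "M2M", "1yr", "M2M", "2yr", "M2M", "1yr", "2yr"]
churned = [1, 1, 0, 1, 0, 1, 0, 0]
oof = oof_target_encode(contract, churned)
print(oof.round(3).tolist())
```

On very small data the fallback matters: with only two rows per category, a single validation fold can easily contain both of them. Recent scikit-learn versions also ship a TargetEncoder that performs cross-fitting internally, which may be preferable to a hand-rolled loop.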

Hashing Trick

For very high-cardinality features (postal codes, user IDs, product SKUs with millions of distinct values), OHE is infeasible. The hashing trick maps each category to an integer via hash(category) % n_buckets:

python

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=4, input_type='string')
hashed = hasher.transform([[c] for c in df["contract_type"]])
print(pd.DataFrame(hashed.toarray(), columns=[f"h{i}" for i in range(4)]))

    h0   h1   h2   h3
0 -1.0  0.0  0.0  0.0
1 -1.0  0.0  0.0  0.0
2  0.0  0.0  0.0  1.0
3 -1.0  0.0  0.0  0.0
4  0.0  1.0  0.0  0.0
5 -1.0  0.0  0.0  0.0
6  0.0  0.0  0.0  1.0
7  0.0  1.0  0.0  0.0

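With only 4 buckets for 3 categories, no collisions occur here. The ±1 values come from FeatureHasher's default alternate_sign=True, which gives each hashed feature a sign determined by its hash so that colliding features tend to cancel rather than add up. In a real application with 10,000 categories hashed into 512 buckets, roughly 20 categories map to each bucket: some information is lost, but the feature dimensionality stays fixed at 512 regardless of cardinality. The tradeoff: memory for accuracy.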
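The 10,000-into-512 scenario can be simulated with the standard library alone. This sketch uses MD5 as a stand-in for a stable hash (FeatureHasher itself uses MurmurHash3, and Python's built-in hash() for strings is salted per process, so neither is reproduced here). The sku_ names are hypothetical.

```python
import hashlib
from collections import Counter

def bucket_of(category: str, n_buckets: int) -> int:
    """Map a category to a bucket with a stable hash (MD5 is a stand-in here)."""
    digest = hashlib.md5(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

n_buckets = 512
categories = [f"sku_{i}" for i in range(10_000)]   # hypothetical category names

load = Counter(bucket_of(c, n_buckets) for c in categories)
sizes = [load.get(b, 0) for b in range(n_buckets)]
print(f"mean per bucket: {sum(sizes) / n_buckets:.1f}")
print(f"min / max per bucket: {min(sizes)} / {max(sizes)}")
```

The mean is 10,000 / 512 ≈ 19.5 by construction; the min and max show how unevenly the categories spread, which is the part that depends on the hash function.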

Encoding Selection Guide

Strategy Comparison

Technique	Handles	Key Risk	Use When
Label encoding	Any categorical	Implies false ordinal relationship	Feature is genuinely ordinal
One-hot encoding	Nominal, low-cardinality	Dummy variable trap; curse of dimensionality	≤50 categories, no strong label relationship
Ordinal encoding	Ordinal features	Wrong order destroys signal	Ordered categories (cold/warm/hot)
Frequency encoding	High-cardinality nominal	Collision: different categories → same code	Cardinality too high for OHE
Target encoding	Nominal with target relationship	Target leakage if computed naively	When category-target correlation is meaningful
Hashing trick	Extremely high-cardinality	Hash collisions reduce accuracy	Millions of unique categories

Hyperparameter Sensitivity: Number of OHE Features

For higher-cardinality categoricals, the number of resulting OHE columns grows linearly. Here is the effect on a logistic regression model trained on contract_type encoded different ways:

python

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df_clean = df.dropna()

encodings = {
    "label":     df_clean["contract_type"].map({"M2M":2,"1yr":1,"2yr":0}).values.reshape(-1,1),
    "ordinal":   df_clean["contract_type"].map({"M2M":0,"1yr":1,"2yr":2}).values.reshape(-1,1),
    "frequency": df_clean["contract_type"].map(df["contract_type"].value_counts()).values.reshape(-1,1),
    "target":    df_clean["contract_type"].map({"M2M":1.0,"1yr":0.0,"2yr":0.0}).values.reshape(-1,1),
    "ohe":       pd.get_dummies(df_clean["contract_type"], drop_first=True).values,
}

y_clean = df_clean["churned"].values
for name, X_enc in encodings.items():
    model = LogisticRegression(max_iter=500, solver='lbfgs').fit(X_enc, y_clean)
    acc = accuracy_score(y_clean, model.predict(X_enc))
    print(f"{name:12s}: cols={X_enc.shape[1]}  train_acc={acc:.2f}")

label       : cols=1  train_acc=1.00
ordinal     : cols=1  train_acc=1.00
frequency   : cols=1  train_acc=1.00
target      : cols=1  train_acc=1.00
ohe         : cols=2  train_acc=1.00

Every encoding reaches 1.00 training accuracy on this data. After dropna(), six rows remain (three M2M churners, one 1yr and two 2yr non-churners), and in each encoding M2M sits at one extreme of the scale, so a single threshold separates churners from non-churners. The label and ordinal mappings used here are mirror images of each other, which is why they behave identically.

Training accuracy on a tiny, perfectly separable dataset therefore cannot rank the encodings. The false-ordinal problem shows up when the category with distinct behavior falls in the middle of the integer scale: if 1yr (code 1) were the only churning group, no single threshold on a label- or ordinal-encoded column could isolate it, while OHE could.

Related Concepts and Honest Limitations

Target encoding is often the most powerful technique for categorical variables correlated with the target, but it is also the most dangerous. The cross-fold fix reduces leakage but does not eliminate it: each encoded value is still derived from training labels. For very small categories (1 to 3 samples), the out-of-fold mean is unstable. Additive smoothing with strength $k$ can stabilize it by shrinking the category mean toward the global mean: $\text{encoded} = \dfrac{n \cdot \bar{y}_{\text{category}} + k \cdot \bar{y}_{\text{global}}}{n + k}$, where $n$ is the number of training rows in the category.
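The smoothing formula translates into a few lines of pandas. This is a sketch, with the function name smoothed_target_encoding and the value k=2 chosen here purely for illustration.

```python
import pandas as pd

def smoothed_target_encoding(categories, target, k=5.0):
    """Shrink each category mean toward the global mean with strength k."""
    df = pd.DataFrame({"cat": categories, "y": target})
    global_mean = df["y"].mean()
    stats = df.groupby("cat")["y"].agg(["mean", "count"])
    return (stats["mean"] * stats["count"] + k * global_mean) / (stats["count"] + k)

contract = ["M2M", "M2M", "1yr", "M2M", "2yr", "M2M", "1yr", "2yr"]
churned = [1, 1, 0, 1, 0, 1, 0, 0]

encoding = smoothed_target_encoding(contract, churned, k=2.0)
print(encoding.round(3).to_dict())   # {'1yr': 0.25, '2yr': 0.25, 'M2M': 0.833}
```

With a global mean of 0.5 and k = 2, M2M (n = 4) shrinks from 1.0 to 5/6, while 1yr and 2yr (n = 2 each) move from 0.0 to 0.25. Smaller categories are pulled harder toward the global mean, which is the intended behavior.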

OHE interacts with regularization: with L1 regularization (Lasso-style logistic regression), many of the dummy coefficients are driven to exactly zero, which effectively merges those categories into the reference level. This is a practical alternative to explicit frequency or hashing encoding for medium-cardinality features.

Feature encoding is not a preprocessing decision made once — it is part of the model architecture and should be treated as a hyperparameter. The encoding that works best depends on the algorithm (tree models handle label encoding fine because they split on thresholds, not linear relationships; linear models don't), the number of categories, and the dataset size.
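Treating the encoder as a hyperparameter can be done literally with a scikit-learn Pipeline, since GridSearchCV can swap out a whole pipeline step. The sketch below uses synthetic stand-in data (not the article's eight rows, which are too few for cross-validation); the churn probabilities and step names are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Synthetic stand-in data: churn probability depends on contract type.
rng = np.random.default_rng(0)
contract = rng.choice(["M2M", "1yr", "2yr"], size=300)
churn_prob = pd.Series(contract).map({"M2M": 0.8, "1yr": 0.3, "2yr": 0.1}).to_numpy()
y = (rng.random(300) < churn_prob).astype(int)
X = pd.DataFrame({"contract_type": contract})

pipeline = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("model", LogisticRegression(max_iter=500)),
])

# The "encode" step itself is a searchable parameter.
param_grid = {
    "encode": [
        OneHotEncoder(handle_unknown="ignore"),
        OrdinalEncoder(categories=[["M2M", "1yr", "2yr"]]),
    ],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

for params, score in zip(search.cv_results_["params"], search.cv_results_["mean_test_score"]):
    print(f"{type(params['encode']).__name__:15s} mean CV accuracy = {score:.3f}")
```

Because the encoder sits inside the pipeline, it is refit on each training fold, so encoders that learn statistics from the data are evaluated without leaking validation rows.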

Test Your Understanding

A feature city has 200 unique values. OHE produces 199 columns. A decision tree uses this feature. How does the tree's treatment of OHE vs label encoding differ from a logistic regression model's treatment?
Target encoding for contract_type gives M2M an encoding of 1.0 on the training set. If a new customer has a M2M contract in the test set, what encoded value should they receive — 1.0 (training mean), or something else? Why?
The hashing trick with 4 buckets worked cleanly here (no collisions). If you added a fourth category "3yr" to contract_type, which bucket would it hash to? Could it collide with an existing category?
OHE with drop_first=True makes the 2yr contract the reference category. What does a logistic regression coefficient of −2.5 for contract_type_M2M mean? In terms of probability, how much more or less likely is a M2M customer to churn compared to a 2yr customer (all else equal)?
Frequency encoding assigns 1yr and 2yr the same value (2) because they appear the same number of times. If you added a third 1yr customer to the dataset (who churned), how would frequency encoding change? Would it now correctly distinguish 1yr from 2yr contracts for prediction? Explain why or why not.
