Categorical Encoding

Jun 1, 2026 • 3 min read • By Mohammed Vasim

Machine Learning • AI • Data Science

Categorical Encoding: One-Hot, Label, Ordinal, and Alternatives

Most machine learning models expect numeric input. If your data has columns like "color", "country", or "education level", you need to convert categories into numbers. The question is how — because the encoding choice silently affects how your model interprets the feature.

One-Hot Encoding (Nominal Encoding)

When categories have no inherent order — colors, countries, product categories — you want an encoding that doesn't imply ranking. One-hot encoding creates a binary column for each category.

python

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue']
})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['color']]).toarray()

encoder_df = pd.DataFrame(
    encoded, columns=encoder.get_feature_names_out()
)

pd.concat([df, encoder_df], axis=1)

Each color becomes its own column with 0 or 1. No column is "greater" than another — red isn't less than blue. They're independent.

The tradeoff is dimensionality. If you have a column with 50 countries, you get 50 new columns. For tree-based models, each binary column isolates only one category, so a tree needs many splits to make use of the feature. For linear models, it adds one coefficient per category.
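To make the dimensionality cost concrete, here is a minimal sketch using 50 made-up country labels (the data and the names `countries`, `encoded`, and `density` are hypothetical). It relies on the fact that OneHotEncoder returns a SciPy sparse matrix by default.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 50 made-up country labels, each appearing 4 times
countries = pd.DataFrame(
    {'country': [f'country_{i:02d}' for i in range(50)] * 4}
)

# OneHotEncoder returns a SciPy sparse matrix by default
encoded = OneHotEncoder().fit_transform(countries[['country']])

n_rows, n_cols = encoded.shape
# Fraction of cells that are non-zero: exactly one per row
density = encoded.nnz / (n_rows * n_cols)
print(n_rows, n_cols, density)
```

One column per country, and only one cell in each row is non-zero. Keeping the result sparse avoids storing all those zeros.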

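One parameter worth knowing: drop='first' drops the first category's column. The full set of one-hot columns always sums to 1, which makes them perfectly collinear with the intercept in a linear model. Dropping one column removes that redundancy.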

python

encoder = OneHotEncoder(drop='first')

Label Encoding

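Label encoding assigns a unique integer to each category. scikit-learn's LabelEncoder sorts the categories first, so for this data blue → 0, green → 1, red → 2. Note that scikit-learn documents LabelEncoder as a transformer for target labels (y); for input features, OrdinalEncoder produces the same kind of integer codes.

python

from sklearn.preprocessing import LabelEncoder

# LabelEncoder expects 1-D input, so pass the column as a Series (single brackets)
lbl_encoder = LabelEncoder()
lbl_encoder.fit_transform(df['color'])

The problem: the model now sees 0 < 1 < 2. For nominal categories, this is an arbitrary ordering that may mislead the model. A linear model will treat the difference between blue and green (1) as half the difference between blue and red (2), which has no basis in reality.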

When it works: tree-based models split on thresholds, so they depend on the order of the encoded values but not on the distances between them. With enough splits, a tree can isolate any individual category. So label encoding can work with random forests and gradient boosting, though ordinal encoding is still better when an order exists.
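The contrast between linear and tree-based models can be sketched with a toy example. Everything here is made up for illustration: the integer codes stand in for a label-encoded nominal feature, and the target is deliberately chosen so that it is not monotonic in the codes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Hypothetical integer codes (0, 1, 2) for a nominal feature
codes = np.array([0, 1, 2] * 20).reshape(-1, 1)

# Hypothetical target: categories 0 and 2 behave alike, category 1 differs
target_by_code = np.array([10.0, 0.0, 10.0])
y = target_by_code[codes.ravel()]

linear = LinearRegression().fit(codes, y)
tree = DecisionTreeRegressor(random_state=0).fit(codes, y)

# R^2 on the training data: the line cannot bend, the tree can split twice
linear_r2 = linear.score(codes, y)
tree_r2 = tree.score(codes, y)
print(round(linear_r2, 3), round(tree_r2, 3))
```

In this constructed case the linear model explains none of the variance, while the tree separates the three categories with two thresholds. Real data is rarely this extreme, but the mechanism is the same.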

Ordinal Encoding

When categories do have a natural order, ordinal encoding preserves that structure.

python

from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})

encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoder.fit_transform(df[['size']])

Small → 0, medium → 1, large → 2. The numerical values reflect the actual relationship.

The important detail is specifying the category order explicitly. If you let the encoder infer it alphabetically (large → 0, medium → 1, small → 2), the ordering reverses the real-world relationship. This is a common bug that silently degrades model performance.
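The difference between inferred and explicit ordering is easy to check directly. This sketch reuses the size data from above and encodes it both ways (the names `inferred` and `explicit` are illustrative).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame(
    {'size': ['small', 'medium', 'large', 'medium', 'small', 'large']}
)

# Without `categories`, the encoder sorts the values alphabetically
inferred = OrdinalEncoder().fit_transform(sizes[['size']]).ravel()

# With an explicit list, the codes follow the real-world order
explicit = OrdinalEncoder(
    categories=[['small', 'medium', 'large']]
).fit_transform(sizes[['size']]).ravel()

print(inferred.tolist())
print(explicit.tolist())
```

Both calls run without warnings, which is why the reversed ordering is easy to miss.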

When One-Hot Encoding Becomes Impractical

High-cardinality features (ZIP codes, product IDs, user IDs) break one-hot encoding. Five hundred ZIP codes produce 500 sparse columns. Alternatives for this case:

Frequency encoding replaces each category with its count in the dataset.

python

# Assumes df has a 'city' column (not defined in the earlier examples)
freq_encoding = df['city'].value_counts()
df['city_encoded'] = df['city'].map(freq_encoding)

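This captures the intuition that rare categories may behave differently from common ones, and it is often a reasonable default for high-cardinality features. One caveat: categories that occur equally often receive the same value and become indistinguishable to the model.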
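Here is a self-contained sketch of frequency encoding with made-up city data. The `train` and `new` frames are hypothetical, and filling unseen categories with 0 is one possible choice, not the only one.

```python
import pandas as pd

# Hypothetical training data and new data
train = pd.DataFrame(
    {'city': ['Pune', 'Delhi', 'Pune', 'Mumbai', 'Pune', 'Delhi']}
)
new = pd.DataFrame({'city': ['Delhi', 'Chennai']})  # 'Chennai' is unseen

# Learn the counts from the training data only
counts = train['city'].value_counts()

train['city_encoded'] = train['city'].map(counts)

# Unseen categories map to NaN; here they are treated as count 0
new['city_encoded'] = new['city'].map(counts).fillna(0).astype(int)

print(train)
print(new)
```

Computing the counts on the training split and reusing them for new data keeps the encoding consistent between training and prediction.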

Binary encoding sits between ordinal and one-hot encoding. Each category is first assigned an integer, the integer is written in binary, and each binary digit becomes its own column. For 500 categories, OHE creates 500 columns; binary encoding creates 9.

python

# BinaryEncoder comes from the third-party category_encoders package
# Assumes df has a 'city' column; yields about ceil(log2(n_categories)) binary columns
import category_encoders as ce
df_binary = ce.BinaryEncoder(cols=['city']).fit_transform(df)
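
To show the mechanics without the extra dependency, here is a hand-rolled sketch using only pandas and NumPy. The function `binary_encode` is hypothetical and is not how category_encoders implements it (the library may number categories differently and handles missing values). It assumes the column has no missing values.

```python
import math

import numpy as np
import pandas as pd


def binary_encode(series: pd.Series) -> pd.DataFrame:
    """Encode each category as the binary digits of an integer code."""
    # Integer codes 0..n-1, assigned in order of first appearance
    codes, uniques = pd.factorize(series)
    # Number of binary digits needed to represent n distinct codes
    n_bits = max(1, math.ceil(math.log2(len(uniques))))
    # Shift each code right by every bit position, most significant first
    shifts = np.arange(n_bits - 1, -1, -1)
    bits = (codes[:, None] >> shifts) & 1
    columns = [f'{series.name}_bit{i}' for i in range(n_bits)]
    return pd.DataFrame(bits, columns=columns, index=series.index)


demo = binary_encode(pd.Series(['a', 'b', 'c', 'a'], name='letter'))
print(demo)

# 500 made-up ZIP codes fit into 9 binary columns
zips = pd.Series([f'{i:05d}' for i in range(500)], name='zip')
zip_encoded = binary_encode(zips)
print(zip_encoded.shape)
```

The columns are compact, but individual bits carry no meaning on their own, so the model has to combine several of them to single out one category.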

Choosing the Right Encoding

Encoding	When to Use	Model Fit
One-hot	Nominal data, low cardinality	All models
Ordinal	Clear order exists	All models (best with trees)
Label encoding	Nominal data with tree models	Trees only
Frequency	High cardinality, no natural order	All models
Binary	High cardinality, need compact representation	Linear models, trees
Target guided	High cardinality, target relationship	Linear models (see next post)

The most common mistake is using label encoding on nominal data with a linear model. The model interprets the integers as having meaningful magnitude. Your code runs fine, your accuracy looks reasonable, but you're leaving predictive signal on the table. If your categories have no order, use one-hot or frequency encoding. If they have order, use ordinal encoding with explicit category ordering.
