Categorical Encoding: One-Hot, Label, Ordinal, and Alternatives
Most machine learning models expect numeric input. If your data has columns like "color", "country", or "education level", you need to convert categories into numbers. The question is how — because the encoding choice silently affects how your model interprets the feature.
One-Hot Encoding (Nominal Encoding)
When categories have no inherent order — colors, countries, product categories — you want an encoding that doesn't imply ranking. One-hot encoding creates a binary column for each category.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue']
})
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['color']]).toarray()
encoder_df = pd.DataFrame(
    encoded, columns=encoder.get_feature_names_out()
)
pd.concat([df, encoder_df], axis=1)
Each color becomes its own column with 0 or 1. No column is "greater" than another: red isn't less than blue. They're independent.
The tradeoff is dimensionality. If you have a column with 50 countries, you get 50 new columns. For tree-based models, each of those columns isolates a single category, so the resulting splits are sparse and hard to learn from. For linear models, every new column adds a parameter.
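To make the column growth concrete, here is a minimal sketch using made-up country codes (the names `country_00` through `country_49` are placeholders, not real data). It also shows that `OneHotEncoder` returns a sparse matrix by default, which is what keeps wide encodings affordable in memory.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 50 made-up country codes, each repeated 20 times.
countries = [f"country_{i:02d}" for i in range(50)]
df_countries = pd.DataFrame({"country": np.tile(countries, 20)})

# By default OneHotEncoder returns a SciPy sparse matrix.
onehot = OneHotEncoder()
encoded_sparse = onehot.fit_transform(df_countries[["country"]])

n_rows, n_cols = encoded_sparse.shape
# Fraction of cells that are non-zero: each row holds exactly one 1.
nonzero_fraction = encoded_sparse.nnz / (n_rows * n_cols)

print(n_rows, n_cols)
print(nonzero_fraction)
```

With one non-zero cell per row, the non-zero fraction is 1 divided by the number of categories, so the matrix gets emptier as cardinality grows.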
One parameter worth knowing: drop='first' drops the first category to avoid multicollinearity in linear models.
encoder = OneHotEncoder(drop='first')
Label Encoding
Label encoding assigns a unique integer to each category. scikit-learn's LabelEncoder numbers the categories in sorted order, so blue → 0, green → 1, red → 2.
from sklearn.preprocessing import LabelEncoder
lbl_encoder = LabelEncoder()
# LabelEncoder expects a 1-D input (it is designed for target labels), so pass the Series
lbl_encoder.fit_transform(df['color'])
The problem: the model now sees 0 < 1 < 2. For nominal categories, this is an arbitrary ordering that may mislead the model. A linear model will treat the difference between blue and green (1) as half the difference between blue and red (2), which has no basis in reality.
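If you want to see which integer each category received, the fitted encoder exposes it. The sketch below rebuilds the color column as a standalone Series (the variable names are mine) and reads the mapping from `classes_`, where the position of each category is its code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(['red', 'blue', 'green', 'green', 'red', 'blue'])

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(colors)

# classes_ stores the categories in sorted order; the index is the assigned code.
mapping = {str(category): code for code, category in enumerate(label_encoder.classes_)}

print(mapping)
print(codes.tolist())
```

The codes follow alphabetical order, not the order in which the categories appear in the data, which is one more reason the resulting numbers carry no meaning for nominal features.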
When it works: tree-based models split on thresholds, so they depend only on the order of the label-encoded values, not on their magnitude. With enough splits a tree can isolate any individual category, so label encoding can work with random forests and gradient boosting, though ordinal encoding is still better when an order exists.
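A small experiment illustrates the difference. The toy data below is invented for this sketch: the target is high for codes 0 and 2 and low for code 1, so no straight line through the integer codes can fit it. The comparison is between a linear model on the codes, a decision tree on the same codes, and a linear model on one-hot columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Toy data (made up): label codes 0=blue, 1=green, 2=red.
# The target is 10 for blue and red, 0 for green.
codes = np.array([[0], [1], [2]] * 20)
target = np.where(codes.ravel() == 1, 0.0, 10.0)

def max_abs_error(model, features):
    """Largest absolute gap between predictions and the target."""
    return float(np.max(np.abs(model.predict(features) - target)))

# Linear model on raw integer codes: forced to draw one line through 0, 1, 2.
linear_on_codes = LinearRegression().fit(codes, target)
err_linear_codes = max_abs_error(linear_on_codes, codes)

# Decision tree on the same codes: thresholds can isolate each category.
tree_on_codes = DecisionTreeRegressor(random_state=0).fit(codes, target)
err_tree_codes = max_abs_error(tree_on_codes, codes)

# Linear model on one-hot columns: one coefficient per category.
onehot = OneHotEncoder().fit_transform(codes).toarray()
linear_on_onehot = LinearRegression().fit(onehot, target)
err_linear_onehot = max_abs_error(linear_on_onehot, onehot)

print(err_linear_codes, err_tree_codes, err_linear_onehot)
```

On this data the linear model on integer codes can do no better than predicting the overall mean, while the tree and the one-hot linear model both recover the per-category values.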
Ordinal Encoding
When categories do have a natural order, ordinal encoding preserves that structure.
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoder.fit_transform(df[['size']])
Small → 0, medium → 1, large → 2. The numerical values reflect the actual relationship.
The important detail is specifying the category order explicitly. If you let the encoder infer it alphabetically (large → 0, medium → 1, small → 2), the ordering reverses the real-world relationship. This is a common bug that silently degrades model performance.
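The reversal is easy to reproduce. This sketch fits one `OrdinalEncoder` with default settings and one with an explicit order on the same three sizes, then compares the codes each assigns.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'size': ['small', 'medium', 'large']})

# Default behaviour: categories are inferred and sorted alphabetically.
inferred = OrdinalEncoder().fit(sizes)
# Explicit behaviour: the list defines the order, smallest first.
explicit = OrdinalEncoder(categories=[['small', 'medium', 'large']]).fit(sizes)

inferred_order = inferred.categories_[0].tolist()
inferred_codes = inferred.transform(sizes).ravel().tolist()
explicit_codes = explicit.transform(sizes).ravel().tolist()

print(inferred_order)
print(inferred_codes)
print(explicit_codes)
```

Checking `categories_` after fitting is a cheap way to catch this bug before it reaches the model.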
When One-Hot Encoding Becomes Impractical
High-cardinality features (ZIP codes, product IDs, user IDs) break one-hot encoding. Five hundred ZIP codes produce 500 sparse columns. Alternatives for this case:
Frequency encoding replaces each category with its count in the dataset.
# Assumes df has a 'city' column
freq_encoding = df['city'].value_counts()
df['city_encoded'] = df['city'].map(freq_encoding)
This captures the intuition that rare categories may behave differently from common ones. It works surprisingly well as a general-purpose encoding.
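Here is a self-contained sketch of the same idea, using invented city data. It makes two choices that are assumptions on my part rather than part of the method above: it uses relative frequencies instead of raw counts, and it maps categories that never appeared in the training data to 0.

```python
import pandas as pd

# Hypothetical training data and new rows to encode later.
train = pd.DataFrame({'city': ['Lagos', 'Lima', 'Lagos', 'Oslo', 'Lagos', 'Lima']})
new_rows = pd.DataFrame({'city': ['Lima', 'Quito']})  # 'Quito' is not in train

# Learn the frequencies from the training data only.
city_freq = train['city'].value_counts(normalize=True)

train['city_encoded'] = train['city'].map(city_freq)

# Unseen categories map to NaN; treat them as frequency 0 (an assumption).
new_rows['city_encoded'] = new_rows['city'].map(city_freq).fillna(0.0)

print(train)
print(new_rows)
```

One limitation to keep in mind: two categories with the same frequency receive the same value, so the model cannot tell them apart.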
Binary encoding is a compact middle ground between ordinal and one-hot encoding. Each category is first assigned an integer, the integer is written in binary, and each binary digit becomes its own column. For 500 categories, OHE creates 500 columns; binary encoding creates 9, because 2^9 = 512 is the smallest power of two that covers 500.
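To show where the 9 comes from, here is a hand-rolled sketch of the idea using only pandas and NumPy. The helper `binary_encode` and the ZIP-like strings are hypothetical, written for this illustration; a library encoder may differ in details such as column naming, reserved codes, and handling of missing values (this sketch assumes there are none).

```python
import math

import numpy as np
import pandas as pd

def binary_encode(series: pd.Series) -> pd.DataFrame:
    """Encode a categorical column as the binary digits of each category's integer code."""
    # Integer code 0..n-1 per category, assigned in sorted order.
    codes, _ = pd.factorize(series, sort=True)
    n_categories = int(codes.max()) + 1
    # Number of binary digits needed to represent codes 0..n-1.
    n_bits = max(1, math.ceil(math.log2(n_categories)))
    # Extract each bit, most significant first.
    shifts = np.arange(n_bits - 1, -1, -1)
    bits = (codes[:, None] >> shifts) & 1
    columns = [f"{series.name}_bit{k}" for k in range(n_bits)]
    return pd.DataFrame(bits, columns=columns, index=series.index)

# 500 made-up ZIP-like strings.
zips = pd.Series([f"{z:05d}" for z in range(10000, 10500)], name="zip")
encoded = binary_encode(zips)

print(encoded.shape)
print(encoded.head())
```

Every category still gets a distinct row pattern, but unrelated categories now share columns, which is the price of the compact representation.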
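import category_encoders as ce  # third-party package: pip install category_encoders
# Assumes df has a 'city' column
binary_df = ce.BinaryEncoder(cols=['city']).fit_transform(df)
# Each category becomes roughly ceil(log2(n_categories)) binary columns
Choosing the Right Encoding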
| Encoding | When to Use | Model Fit |
|---|---|---|
| One-hot | Nominal data, low cardinality | All models |
| Ordinal | Clear order exists | All models (best with trees) |
| Label encoding | Nominal data with tree models | Trees only |
| Frequency | High cardinality, no natural order | All models |
| Binary | High cardinality, need compact representation | Linear models, trees |
| Target guided | High cardinality, target relationship | Linear models (see next post) |
The most common mistake is using label encoding on nominal data with a linear model. The model interprets the integers as having meaningful magnitude. Your code runs fine, your accuracy looks reasonable, but you're leaving predictive signal on the table. If your categories have no order, use one-hot or frequency encoding. If they have order, use ordinal encoding with explicit category ordering.