Handling Imbalanced Datasets: Upsampling and Downsampling
A model that achieves 99% accuracy but never predicts the minority class isn't accurate; it's broken. This is the core problem with imbalanced datasets: most training objectives minimize average error over all samples, so when one class dominates, the cheapest way to score well is to predict the majority class almost every time.
Fraud detection, rare disease diagnosis, churn prediction — any problem where the important outcome is uncommon needs a different approach.
What Imbalance Does to a Model
To see the problem clearly, create a simple 9:1 imbalanced dataset:
import numpy as np
import pandas as pd
from sklearn.utils import resample
np.random.seed(123)
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0
class_0 = pd.DataFrame({
'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
'target': [0] * n_class_0
})
class_1 = pd.DataFrame({
'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
'target': [1] * n_class_1
})
df = pd.concat([class_0, class_1]).reset_index(drop=True)
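df['target'].value_counts()
This gives 900 samples in class 0 and 100 in class 1. A logistic regression trained on this data tends to shift its decision boundary in favor of the majority class, because misclassifying a handful of minority samples adds little to the total loss.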
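To make the gap between accuracy and usefulness concrete, here is a minimal sketch that scores an always-majority baseline next to a logistic regression on a dataset shaped like the one above. The 70/30 stratified split, the use of `DummyClassifier`, and the variable names are assumptions for illustration, not part of the original setup.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(123)

# Self-contained copy of the 9:1 dataset: two Gaussian blobs, 900 vs 100 rows.
X = np.vstack([
    rng.normal(loc=0, scale=1, size=(900, 2)),
    rng.normal(loc=2, scale=1, size=(100, 2)),
])
y = np.array([0] * 900 + [1] * 100)

# Stratify so the test set keeps the same 9:1 ratio as the full data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)
baseline_accuracy = accuracy_score(y_test, baseline_pred)
baseline_recall = recall_score(y_test, baseline_pred)

# Plain logistic regression with no resampling or class weighting.
model = LogisticRegression().fit(X_train, y_train)
model_pred = model.predict(X_test)
model_accuracy = accuracy_score(y_test, model_pred)
model_recall = recall_score(y_test, model_pred)

print(f"baseline: accuracy={baseline_accuracy:.2f}, minority recall={baseline_recall:.2f}")
print(f"logistic: accuracy={model_accuracy:.2f}, minority recall={model_recall:.2f}")
```

The baseline reaches about 90% accuracy with zero minority recall by construction, which is why accuracy on its own says so little here. Compare the logistic regression's recall, not its accuracy, against that baseline.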
Approach 1: Upsampling
Upsampling grows the minority class by sampling its existing rows with replacement, so the same rows appear multiple times.
df_minority = df[df['target'] == 1]
df_majority = df[df['target'] == 0]
df_minority_upsampled = resample(
df_minority,
replace=True,
n_samples=len(df_majority),
random_state=42
)
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
df_upsampled['target'].value_counts()
Both classes now have 900 samples, so the model gets equal exposure to each.
The catch is overfitting. Drawing 900 rows from the same 100 samples means each one appears about 9 times on average, so the model can memorize those specific points rather than learn general patterns. The effect is more pronounced with small minority classes and high-capacity models like random forests.
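One way to see how much repetition upsampling introduces is to count how often each original row was drawn. The sketch below uses a stand-in minority frame (random values, assumed for illustration) and relies on `resample` keeping the original index labels of a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Stand-in for df_minority: 100 rows with the same columns as in the post.
rng = np.random.default_rng(0)
df_minority = pd.DataFrame({
    "feature_1": rng.normal(2, 1, 100),
    "feature_2": rng.normal(2, 1, 100),
    "target": 1,
})

# Same call as in the upsampling step, with the target size written out.
upsampled = resample(df_minority, replace=True, n_samples=900, random_state=42)

# Index labels survive resampling, so counting them shows copies per original row.
copies_per_row = upsampled.index.value_counts()
n_unique = copies_per_row.size

print(f"rows after upsampling: {len(upsampled)}")
print(f"distinct original rows: {n_unique}")
print(f"copies per row: mean={copies_per_row.mean():.1f}, "
      f"min={copies_per_row.min()}, max={copies_per_row.max()}")
```

However large the upsampled frame gets, it never contains more than the original 100 distinct minority rows. The extra 800 rows add weight, not information.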
Approach 2: Downsampling
Downsampling reduces the majority class to match the minority count.
df_majority_downsampled = resample(
df_majority,
replace=False,
n_samples=len(df_minority),
random_state=42
)
df_downsampled = pd.concat([df_minority, df_majority_downsampled])
df_downsampled['target'].value_counts()
Both classes now have 100 samples. The class counts no longer favor the majority, but you discard 800 potentially useful data points. For small datasets, this loss of information can hurt more than the imbalance itself.
Choosing Between Them
The right choice depends on your data:
- Upsampling works better when the minority class is diverse enough that duplicated rows won't cause overfitting, and when you can't afford to lose data.
- Downsampling works better when the majority class has plenty of redundancy. Training is faster, and you avoid the risk of overfitting on duplicates.
- Synthetic oversampling methods like SMOTE (covered in the next post) create new minority samples by interpolating between existing ones instead of duplicating them, which can reduce the overfitting risk of plain upsampling.
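If you want to try both resampling strategies on the same data, it can help to put them behind one function. The `rebalance` helper below is hypothetical (its name, parameters, and the binary-target assumption are mine), and it only wraps the two `resample` calls shown earlier.

```python
import pandas as pd
from sklearn.utils import resample


def rebalance(df, target="target", minority_label=1, strategy="upsample", random_state=42):
    """Return a class-balanced copy of a binary-target frame (hypothetical helper)."""
    minority = df[df[target] == minority_label]
    majority = df[df[target] != minority_label]

    if strategy == "upsample":
        # Draw minority rows with replacement until they match the majority count.
        minority = resample(minority, replace=True,
                            n_samples=len(majority), random_state=random_state)
    elif strategy == "downsample":
        # Draw majority rows without replacement down to the minority count.
        majority = resample(majority, replace=False,
                            n_samples=len(minority), random_state=random_state)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    return pd.concat([majority, minority]).reset_index(drop=True)


# Tiny 8:2 example to check the resulting class counts.
toy = pd.DataFrame({"feature_1": range(10), "target": [0] * 8 + [1] * 2})
up = rebalance(toy, strategy="upsample")
down = rebalance(toy, strategy="downsample")

print(up["target"].value_counts().to_dict())
print(down["target"].value_counts().to_dict())
```

Keeping the strategy as a parameter makes it easy to train the same model on both versions and compare them on held-out data that was never resampled.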
Beyond Accuracy: The Right Metric
When classes are imbalanced, accuracy is a misleading metric. A 99% accurate model on a 99:1 dataset could be doing nothing useful. Alternatives:
- Precision — of all positive predictions, how many are correct?
- Recall — of all actual positives, how many did you catch?
- F1 Score — harmonic mean of precision and recall
- AUC-ROC — model's ability to distinguish between classes across thresholds
These metrics tell you whether your resampling strategy actually helped. Accuracy by itself cannot.
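The four metrics are all available in `sklearn.metrics`. Here is a small worked example on ten hand-written labels and scores (made up for illustration), with a 0.5 threshold assumed for turning scores into predictions.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Eight negatives followed by two positives (an 8:2 imbalance).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Hypothetical model scores: one negative scores high, one positive scores low.
y_score = np.array([0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.7, 0.8, 0.45])

# Threshold at 0.5: one true positive, one false positive, one false negative.
y_pred = (y_score >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)    # 8 of 10 correct -> 0.8
precision = precision_score(y_true, y_pred)  # 1 TP / (1 TP + 1 FP) -> 0.5
recall = recall_score(y_true, y_pred)        # 1 TP / (1 TP + 1 FN) -> 0.5
f1 = f1_score(y_true, y_pred)                # harmonic mean of 0.5 and 0.5 -> 0.5
auc = roc_auc_score(y_true, y_score)         # 15 of 16 pos/neg pairs ranked correctly -> 0.9375

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f} auc={auc:.4f}")
```

Note that predicting 0 for every row would also score 0.8 accuracy on these labels, while its recall would drop to 0. That is the difference accuracy hides.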
Model-Specific Considerations
Tree-based models (random forests, gradient boosting) often cope with imbalance better than linear models because they can learn split rules specific to the small regions where the minority class lives. They can still favor the majority class, though, especially once the ratio goes well beyond roughly 10:1.
In scikit-learn, support vector machines accept class_weight='balanced', which scales the misclassification penalty for each class inversely to its frequency, so errors on minority samples cost more. This is often simpler than resampling:
from sklearn.svm import SVC
model = SVC(class_weight='balanced')
Whichever approach you choose, evaluate it with cross-validation, and resample only the training portion of each fold. If duplicated rows leak into the evaluation data, the scores reflect memorization, not generalization.
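A minimal sketch of cross-validating a resampling strategy might look like the following. It upsamples the minority class inside each training fold and scores on the untouched test fold. The choice of `StratifiedKFold` with 5 splits, logistic regression as the model, and F1 as the metric are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Self-contained copy of the 9:1 dataset from the start of the post.
rng = np.random.default_rng(123)
df = pd.DataFrame({
    "feature_1": np.concatenate([rng.normal(0, 1, 900), rng.normal(2, 1, 100)]),
    "feature_2": np.concatenate([rng.normal(0, 1, 900), rng.normal(2, 1, 100)]),
    "target": [0] * 900 + [1] * 100,
})
features = ["feature_1", "feature_2"]

# Stratified folds keep the 9:1 ratio in every train/test split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_f1 = []

for train_idx, test_idx in cv.split(df[features], df["target"]):
    train, test = df.iloc[train_idx], df.iloc[test_idx]

    # Upsample the minority class in the training fold only.
    minority = train[train["target"] == 1]
    majority = train[train["target"] == 0]
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=42)
    train_balanced = pd.concat([majority, minority_up])

    model = LogisticRegression().fit(train_balanced[features], train_balanced["target"])

    # The test fold keeps its original imbalance and shares no rows with training.
    fold_f1.append(f1_score(test["target"], model.predict(test[features])))

print(f"F1 per fold: {np.round(fold_f1, 3)}")
print(f"mean F1: {np.mean(fold_f1):.3f}")
```

Resampling the whole dataset first and splitting afterwards would put copies of the same minority row on both sides of the split, which inflates the score without improving the model.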