Handling Imbalanced Datasets

Jun 1, 2026 · 3 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

Handling Imbalanced Datasets: Upsampling and Downsampling

A model that achieves 99% accuracy but never predicts the minority class isn't accurate; it's broken. This is the core problem with imbalanced datasets: most training objectives minimize average error over all samples, so the majority class dominates and the model can score well by predicting it almost every time.

Fraud detection, rare disease diagnosis, churn prediction — any problem where the important outcome is uncommon needs a different approach.

What Imbalance Does to a Model

To see the problem clearly, create a simple 9:1 imbalanced dataset:

python
import numpy as np
import pandas as pd
from sklearn.utils import resample

np.random.seed(123)

n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

df = pd.concat([class_0, class_1]).reset_index(drop=True)
df['target'].value_counts()

This gives 900 samples in class 0 and 100 in class 1. A logistic regression trained on this data tends to place its decision boundary closer to the minority class, so borderline points get labeled as class 0.
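
To make the accuracy trap concrete, here is a minimal sketch that scores a classifier that always predicts the majority class. It rebuilds a 900/100 dataset like the one above (using `np.random.default_rng`, so the exact values differ from the post's) and uses scikit-learn's `DummyClassifier` as a stand-in for a model that has learned nothing; the variable names are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Rebuild a 900/100 dataset shaped like the one in the post (assumed values).
rng = np.random.default_rng(123)
X = np.vstack([
    rng.normal(loc=0, scale=1, size=(900, 2)),  # class 0
    rng.normal(loc=2, scale=1, size=(100, 2)),  # class 1
])
y = np.array([0] * 900 + [1] * 100)

# A "model" that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)  # 900 of 1000 correct
baseline_recall = recall_score(y, y_pred)      # no minority sample is caught

print(f"accuracy: {baseline_accuracy:.2f}")  # accuracy: 0.90
print(f"recall:   {baseline_recall:.2f}")    # recall:   0.00
```

Any real model should be compared against this baseline: 90% accuracy here means nothing has been learned about class 1.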

Approach 1: Upsampling

Upsampling grows the minority class by sampling its existing rows with replacement until it matches the majority count.

python
df_minority = df[df['target'] == 1]
df_majority = df[df['target'] == 0]

df_minority_upsampled = resample(
    df_minority,
    replace=True,
    n_samples=len(df_majority),
    random_state=42
)

df_upsampled = pd.concat([df_majority, df_minority_upsampled])
df_upsampled['target'].value_counts()

Both classes now have 900 samples. The model gets equal exposure to both.

The catch is overfitting. Drawing 900 rows from the same 100 samples means each one appears about 9 times on average, so the model can memorize those specific points rather than learn general patterns. The risk is more pronounced with small minority classes and flexible models like random forests.
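
A quick way to see the duplication is to count how often each original minority row shows up after resampling. The sketch below is self-contained and assumes a 100-row minority frame shaped like `df_minority`; the generated values and the helper names (`copies`, `n_unique`, `mean_copies`) are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Stand-in for the 100-row minority class (assumed values).
rng = np.random.default_rng(123)
df_minority = pd.DataFrame({
    "feature_1": rng.normal(loc=2, scale=1, size=100),
    "feature_2": rng.normal(loc=2, scale=1, size=100),
    "target": 1,
})

# Upsample to 900 rows with replacement, as in the post.
df_minority_upsampled = resample(
    df_minority, replace=True, n_samples=900, random_state=42
)

# resample keeps the original index labels, so repeated labels are duplicates.
copies = df_minority_upsampled.index.value_counts()
n_unique = copies.size
mean_copies = len(df_minority_upsampled) / n_unique

print(f"rows after upsampling: {len(df_minority_upsampled)}")
print(f"distinct original rows: {n_unique}")
print(f"average copies per distinct row: {mean_copies:.1f}")
print(f"most repeated row appears {copies.max()} times")
```

The upsampled frame has 900 rows but never more than 100 distinct ones, which is why the extra rows add weight but no new information.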

Approach 2: Downsampling

Downsampling reduces the majority class to match the minority count.

python
df_majority_downsampled = resample(
    df_majority,
    replace=False,
    n_samples=len(df_minority),
    random_state=42
)

df_downsampled = pd.concat([df_minority, df_majority_downsampled])
df_downsampled['target'].value_counts()

Both classes now have 100 samples. The class counts are balanced, but you discard 800 potentially useful data points. For small datasets, this loss of information can hurt more than the imbalance itself.

Choosing Between Them

The right choice depends on your data:

  • Upsampling works better when the minority class has enough diversity that duplicating won't cause overfitting. It is also the better option when you can't afford to lose data.
  • Downsampling works better when the majority class has plenty of redundancy. Training is faster, and you avoid the risk of overfitting on duplicates.
  • Synthetic oversampling methods like SMOTE (covered in the next post) generate new minority samples instead of duplicates, which can reduce the overfitting risk of plain upsampling without discarding data.

Beyond Accuracy: The Right Metric

When classes are imbalanced, accuracy is a misleading metric. A 99% accurate model on a 99:1 dataset could be doing nothing useful. Alternatives:

  • Precision — of all positive predictions, how many are correct?
  • Recall — of all actual positives, how many did you catch?
  • F1 Score — harmonic mean of precision and recall
  • AUC-ROC — model's ability to distinguish between classes across thresholds

These metrics tell you whether your resampling strategy actually helped. Accuracy by itself cannot.
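
As a small worked example, the sketch below computes these metrics with scikit-learn on ten hand-made labels and scores. The data is hypothetical and chosen only so the arithmetic is easy to check by hand: at a 0.5 threshold there is 1 true positive, 1 false positive, 1 false negative, and 7 true negatives.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical ground truth: 8 negatives followed by 2 positives.
y_true = [0] * 8 + [1] * 2
# Hypothetical predicted probabilities for the positive class.
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.7, 0.8, 0.35]
# Hard predictions at a 0.5 threshold.
y_pred = [int(s >= 0.5) for s in y_score]

accuracy = accuracy_score(y_true, y_pred)    # (7 + 1) / 10 = 0.8
precision = precision_score(y_true, y_pred)  # 1 / (1 + 1) = 0.5
recall = recall_score(y_true, y_pred)        # 1 / (1 + 1) = 0.5
f1 = f1_score(y_true, y_pred)                # harmonic mean of 0.5 and 0.5 = 0.5
auc = roc_auc_score(y_true, y_score)         # 14 of 16 pos/neg pairs ranked correctly = 0.875

print(accuracy, precision, recall, f1, auc)
```

Accuracy looks respectable at 0.8, while precision and recall show that the model finds only half of the positives and is wrong about half of its positive calls.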

Model-Specific Considerations

Tree-based models (random forests, gradient boosting) often handle imbalance better than linear models because their splits can isolate regions specific to the minority class. They are not immune, though: at severe ratios such as 10:1 or higher, they can still favor the majority class.

Support vector machines accept class_weight='balanced', which scales the misclassification penalty for each class inversely to its frequency. This is often simpler than resampling:

python
from sklearn.svm import SVC

model = SVC(class_weight='balanced')
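
To see what 'balanced' means in numbers, the sketch below uses scikit-learn's `compute_class_weight` helper on a 900/100 label vector like the one in this post and compares it with the formula n_samples / (n_classes * count_per_class). The label array is an assumption standing in for `df['target']`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels with the same 900/100 split as the example dataset.
y = np.array([0] * 900 + [1] * 100)
classes = np.array([0, 1])

# Weights scikit-learn derives for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same values computed by hand: n_samples / (n_classes * count_per_class).
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
# class 0 gets about 0.556, class 1 gets 5.0
```

With these weights, an error on a minority sample costs nine times as much as an error on a majority sample, which mirrors the 9:1 ratio without touching the data.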

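For any approach, evaluate with cross-validation, and resample only the training portion of each fold. If you resample before splitting, duplicated minority rows can land in both the training and validation sets, and the scores will reward memorization rather than generalization.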
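
One way to keep the evaluation honest is to resample inside each training fold and score on validation data that was never resampled. The sketch below is a minimal version of that idea, assuming a 900/100 dataset like the one above, a logistic regression model, and F1 as the metric; all of these choices are illustrative rather than prescribed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Rebuild a 900/100 dataset shaped like the one in the post (assumed values).
rng = np.random.default_rng(123)
X = np.vstack([
    rng.normal(loc=0, scale=1, size=(900, 2)),
    rng.normal(loc=2, scale=1, size=(100, 2)),
])
y = np.array([0] * 900 + [1] * 100)

# Stratified folds keep the 9:1 ratio in every validation split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_f1 = []

for train_idx, val_idx in skf.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    is_minority = y_train == 1

    # Upsample the minority class using training rows only.
    X_min_up, y_min_up = resample(
        X_train[is_minority],
        y_train[is_minority],
        replace=True,
        n_samples=int((~is_minority).sum()),
        random_state=42,
    )
    X_balanced = np.vstack([X_train[~is_minority], X_min_up])
    y_balanced = np.concatenate([y_train[~is_minority], y_min_up])

    model = LogisticRegression().fit(X_balanced, y_balanced)

    # Score on the untouched validation fold, which keeps its original imbalance.
    fold_f1.append(f1_score(y[val_idx], model.predict(X[val_idx])))

print(f"mean F1 across folds: {np.mean(fold_f1):.3f}")
```

Because the validation folds keep the original class ratio, the reported F1 reflects how the model would behave on data that looks like production traffic.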
