Imbalanced Datasets

Jun 14, 2026 • 17 min read • By mohammed.vasim

Machine Learning, AI, Data Science

The 3 pillars (always start here in an interview)

Data-level: fix the imbalance in the data itself before training.
Algorithm-level: make the model aware of the imbalance during training.
Evaluation: measure the right thing, not accuracy.

Oversampling techniques (tabular)

Random oversampling: duplicate minority samples at random. Simple, but the model can overfit to the repeated points.

SMOTE (Synthetic Minority Oversampling Technique): generates synthetic samples by interpolating between a minority point and one of its k nearest minority-class neighbors. The workhorse of oversampling. Interview key: it interpolates in feature space, not pixel/text space, so it only makes sense for numerical tabular features (SMOTE-NC is the variant for mixed numerical and categorical columns).
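
The interpolation step can be sketched in a few lines of plain Python. This is a minimal illustration, not the imbalanced-learn implementation; the function name `smote_sketch` and the toy points are assumptions made up for the example.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class points with two numerical features
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=5)
```

Every synthetic point lies on a line segment between two real minority points, which is why the method needs features where "halfway between" is meaningful.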

ADASYN (Adaptive Synthetic Sampling) — like SMOTE, but generates more synthetic samples in harder-to-learn regions (near the decision boundary). More adaptive than vanilla SMOTE.

Borderline-SMOTE — only oversamples minority points that are near the decision boundary (borderline examples), ignoring easy ones deep inside the minority cluster.

SVMSMOTE — uses SVM support vectors to guide where synthetic samples are placed.

Undersampling techniques

Random undersampling — drop majority class samples randomly. Risk: loses potentially useful information.

Tomek links: a Tomek link is a pair of samples from opposite classes that are each other's nearest neighbor. Removing the majority sample from each link cleans the boundary rather than heavily reducing the data.
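
A small sketch of how such pairs can be found, assuming Euclidean distance and a brute-force neighbor search (fine for a toy example, too slow for real data). The helper name `tomek_links` and the data are illustrative only.

```python
import math

def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))
    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [0.55, 0.5], [0.6, 0.5]]
y = [0, 0, 0, 0, 0, 1]  # class 1 is the minority
links = tomek_links(X, y)

# Drop the majority-class member (label 0) of each link
to_drop = {i if y[i] == 0 else j for i, j in links}
X_clean = [x for idx, x in enumerate(X) if idx not in to_drop]
```

Here only the majority point sitting right next to the single minority point is removed; the rest of the majority class is untouched.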

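NearMiss: keeps the majority samples that lie closest to the minority class. Version 1 keeps those with the smallest average distance to their k nearest minority samples, version 2 uses the k farthest minority samples instead, and version 3 first keeps the nearest majority neighbors of each minority sample and then selects among those. More principled than random.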

Cluster centroids: cluster the majority class (typically with k-means) and replace each cluster with its centroid. Shrinks the majority class while preserving its overall structure.

Edited Nearest Neighbors (ENN) — removes majority samples misclassified by their k-nearest neighbors. Good for noise removal.

Algorithm-level

class_weight parameter: built into scikit-learn's LogisticRegression, SVC, RandomForestClassifier, and others. Setting class_weight='balanced' weights each class inversely to its frequency, using n_samples / (n_classes * class_count). One of the easiest first things to try. Interview gold.
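
To make the 'balanced' heuristic concrete, here is the same formula computed by hand in plain Python. The function name `balanced_class_weights` is made up for this sketch; in practice scikit-learn computes the weights for you.

```python
from collections import Counter

def balanced_class_weights(y):
    """n_samples / (n_classes * count) for each class, as in class_weight='balanced'."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

y = [0] * 90 + [1] * 10          # hypothetical 90/10 split
weights = balanced_class_weights(y)
# weights[1] is 5.0 and weights[0] is about 0.556
```

With a 90/10 split, a mistake on a minority sample costs nine times as much as a mistake on a majority sample.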

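Focal Loss: introduced by Facebook AI Research (Lin et al., 2017) for dense object detection with RetinaNet. Down-weights the loss on easy (well-classified) samples and focuses training on hard ones. Key formula: FL(p_t) = -(1 - p_t)^γ * log(p_t), where p_t is the predicted probability of the true class. γ controls the focus: γ = 0 recovers plain cross-entropy, and the paper uses γ = 2 as its default. A standard choice for class imbalance in computer vision.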
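
The effect of γ is easiest to see on single samples. The sketch below evaluates the formula for one binary prediction at a time and leaves out the optional α class-balancing term; the function name and the example probabilities are assumptions for illustration.

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss for one sample; p is the predicted probability of class 1."""
    p_t = p if y == 1 else 1.0 - p          # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy = focal_loss(0.95, 1)                  # confident and correct
hard = focal_loss(0.10, 1)                  # confident and wrong
ce_easy = focal_loss(0.95, 1, gamma=0.0)    # gamma = 0 is plain cross-entropy
```

For the easy sample the modulating factor is (1 - 0.95)^2 = 0.0025, so its loss all but vanishes, while the hard sample keeps most of its cross-entropy loss.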

XGBoost scale_pos_weight — set to negative_count / positive_count. Tells the boosting algorithm to penalize missing minority class more.

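Threshold tuning: after training, move the decision threshold away from the default 0.5 (often lower, e.g. toward 0.3) to trade some precision for minority-class recall. Pick the value on a validation set, for example from the precision-recall curve, not on the test set.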
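
One simple way to pick the threshold is to scan candidate values on validation data and keep the one with the best F1. The sketch below assumes F1 is the target metric (swap in whatever your use case needs); the scores and labels are invented.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 of the positive class when predicting 1 for scores >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(y_true, scores):
    """Try every observed score as a threshold and keep the best one."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda th: f1_at_threshold(y_true, scores, th))

# Hypothetical validation labels and predicted probabilities
y_val = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.45, 0.40, 0.80]
threshold = best_threshold(y_val, scores)
```

On this toy data the default 0.5 catches only one of the three positives, while the tuned threshold of 0.35 catches all three at the cost of one false positive.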

Ensemble methods for imbalance

BalancedBaggingClassifier — bagging where each bootstrap sample is balanced before training each base estimator.

EasyEnsemble — creates multiple balanced datasets by random undersampling and trains a model on each; combines by averaging.

RUSBoost — combines Random Undersampling with AdaBoost. At each boosting round, the majority class is undersampled.

BalancedRandomForest — like RandomForest but each tree is trained on a balanced bootstrap sample.
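
What these ensembles share is the balanced resampling step in front of each base learner. A rough sketch of that step alone, with the model training left out; `balanced_bootstrap` is a hypothetical helper, and the library implementations differ in details such as whether they sample with replacement.

```python
import random

def balanced_bootstrap(y, rng):
    """Indices of a bootstrap sample with equal counts from each class."""
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    n_per_class = min(len(ids) for ids in by_class.values())
    sample = []
    for ids in by_class.values():
        sample.extend(rng.choices(ids, k=n_per_class))  # with replacement
    return sample

y = [0] * 95 + [1] * 5            # hypothetical 95/5 split
rng = random.Random(42)
# One balanced bag per base estimator; each would train its own model
bags = [balanced_bootstrap(y, rng) for _ in range(10)]
```

Each bag sees a different slice of the majority class, so the ensemble as a whole discards far less information than a single undersampled dataset would.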

By data type

Tabular

SMOTE variants are the go-to. Also: CTGAN (Conditional GAN for tabular data) to generate realistic minority samples. Feature engineering can sometimes help expose a clearer signal for the minority class.

Text / NLP

EDA (Easy Data Augmentation) — synonym replacement, random insertion, random swap, random deletion on minority class sentences.
Back-translation: translate the text into a pivot language (for example French) and back to English; a cheap way to get paraphrases.
LLM paraphrasing — use GPT/Claude to generate paraphrases of minority class examples. Increasingly common and very effective.
Few-shot fine-tuning — fine-tune a pretrained model (BERT etc.) which already has rich representations; needs far less minority data.
Oversampling at the embedding level — SMOTE in embedding space rather than raw text.
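
Two of the four EDA operations, random swap and random deletion, need no external resources and can be sketched directly. Synonym replacement and random insertion need a thesaurus and are left out. The function names, the deletion probability, and the sentence are assumptions for illustration.

```python
import random

def random_swap(words, rng):
    """Swap two randomly chosen positions."""
    words = list(words)
    if len(words) >= 2:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, rng, p=0.2):
    """Drop each word with probability p, always keeping at least one."""
    kept = [w for w in words if rng.random() >= p]
    return kept or [rng.choice(words)]

rng = random.Random(0)
# Hypothetical minority-class sentence (e.g. a fraud complaint)
sentence = "the card was charged twice without my consent".split()
augmented = [" ".join(random_swap(sentence, rng)),
             " ".join(random_deletion(sentence, rng))]
```

Apply these only to minority-class sentences, and keep the perturbation light so the label still holds.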

Computer vision

Geometric augmentation: flips, crops, rotations, color jitter on minority class images. Usually the cheapest first step, and often an effective one.
Mixup / CutMix — blend two images and their labels; forces the model to learn smoother decision boundaries.
GAN / Diffusion synthesis — generate realistic minority class images using a GAN (DCGAN, StyleGAN) or diffusion model. Expensive but powerful for rare-class detection.
Transfer learning: start from a backbone pretrained on ImageNet and fine-tune on your imbalanced dataset. With pretrained features, far fewer minority samples are needed to generalize.
Focal loss — especially important for object detection where background heavily outnumbers objects.
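
Mixup from the list above is only a few lines once images are arrays. The sketch below uses flattened toy "images" as plain lists and one-hot labels, and draws the mixing weight from a Beta(α, α) distribution; α = 0.2 and all names here are assumptions for illustration.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two samples and their one-hot labels with a Beta(alpha, alpha) weight."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

rng = random.Random(0)
minority_img = [0.9, 0.8, 0.7, 0.6]   # toy 2x2 image, flattened
majority_img = [0.1, 0.2, 0.3, 0.4]
x_mix, y_mix, lam = mixup(minority_img, [0.0, 1.0],
                          majority_img, [1.0, 0.0], rng=rng)
```

The blended label stays a valid probability vector, with the minority class receiving exactly the weight `lam`.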

Time series

SMOTE-TS — SMOTE adapted for temporal structure, interpolates between temporal windows.
Window sliding — create more training windows from minority-class events (shorter stride).
Anomaly detection framing — reframe as one-class classification (learn what "normal" looks like, flag everything else). Useful when minority is extremely rare (fraud, fault detection).
Time-aware augmentation — warp, scale, or add noise to temporal minority sequences.
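
The window-sliding idea can be sketched as two passes over the same series: a coarse stride for normal windows and a stride of 1 around events. The rule that a window is positive if it contains any positive time step is an assumption for this example, as are the function names and sizes.

```python
def windows(labels, size, stride):
    """(start, label) pairs; a window is positive if it contains any positive step."""
    return [(start, int(any(labels[start:start + size])))
            for start in range(0, len(labels) - size + 1, stride)]

def oversample_by_stride(labels, size, majority_stride, minority_stride):
    """Coarse stride for negative windows, fine stride for positive ones."""
    negatives = [w for w in windows(labels, size, majority_stride) if w[1] == 0]
    positives = [w for w in windows(labels, size, minority_stride) if w[1] == 1]
    return negatives + positives

labels = [0] * 40
labels[20] = labels[21] = 1       # one short rare event
coarse = windows(labels, size=8, stride=8)
dense = oversample_by_stride(labels, size=8, majority_stride=8, minority_stride=1)
```

With a single stride of 8 the event yields one positive window out of five; with the shorter minority stride it yields nine. The extra windows overlap heavily, so split train and validation by time before windowing to avoid leakage.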

Evaluation — the trap in interviews

Never use accuracy with imbalanced data. If 99% of data is class 0, predicting class 0 always gives 99% accuracy while being useless.

Precision: of all predicted positives, the fraction that are truly positive.
Recall: of all actual positives, the fraction we caught. Often the more important of the two in fraud and medical settings.
F1: harmonic mean of precision and recall.
PR-AUC (Area Under the Precision-Recall Curve): more informative than ROC-AUC on highly imbalanced data, because the false positive rate in ROC is diluted by the large number of true negatives, so ROC-AUC can look misleadingly high.
MCC (Matthews Correlation Coefficient): a single metric that uses all four confusion-matrix cells and stays informative under extreme imbalance.
G-mean: √(Sensitivity × Specificity).
Stratified k-fold: essential; keeps the class ratio the same in every fold.
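
The accuracy trap is easy to reproduce from a confusion matrix. The sketch below computes the threshold-based metrics by hand for an "always predict class 0" model on a hypothetical 1,000-sample dataset with 10 positives; undefined ratios (division by zero) are reported as 0.0, which is a convention chosen for this example.

```python
import math

def imbalance_metrics(tp, fp, fn, tn):
    """Accuracy plus the imbalance-aware metrics, from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
        "g_mean": math.sqrt(recall * specificity),
    }

# Model that always predicts the majority class
always_negative = imbalance_metrics(tp=0, fp=0, fn=10, tn=990)
# Model that catches 8 of 10 positives at the cost of 20 false alarms
useful = imbalance_metrics(tp=8, fp=20, fn=2, tn=970)
```

The useless model scores 99% accuracy with recall, F1, MCC, and G-mean all at zero, while the useful model has slightly lower accuracy (97.8%) and 80% recall.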

Interview question themes to expect

"Why is accuracy a bad metric for imbalanced data?" → confusion matrix, precision/recall
"Explain SMOTE. What are its limitations?" → overfitting on noisy boundaries, doesn't work on image/text raw data
"When would you oversample vs undersample?" → data size, interpretability, risk of overfitting
"How does focal loss work?" → γ parameter, use case in RetinaNet
"What do you do when even SMOTE doesn't help?" → ensemble methods, threshold tuning, reframing as anomaly detection
"How do you handle imbalance in NLP?" → augmentation techniques, LLMs, embedding-space SMOTE