Imbalanced Datasets

Jun 14, 2026 · 17 min read · By mohammed.vasim
Machine Learning · AI · Data Science
[Figure: Imbalanced Dataset Techniques Map. A taxonomy of techniques for handling imbalanced datasets, organized by approach (data-level, algorithm-level, evaluation strategy) and by data type (tabular, text/NLP, computer vision, time series).]

The 3 pillars (always start here in an interview)

  • Data-level — fix the imbalance in the data itself before training.
  • Algorithm-level — make the model aware of the imbalance during training.
  • Evaluation — measure the right thing, not accuracy.


Oversampling techniques (tabular)

Random oversampling — duplicate minority samples randomly. Simple, but causes overfitting on the same points.

SMOTE (Synthetic Minority Over-sampling Technique) — generates synthetic samples by interpolating between a minority point and one of its k nearest minority-class neighbors. The workhorse of oversampling. Interview key: it creates samples in feature space, not pixel/text space, so it applies to numerical feature vectors (tabular data or learned embeddings), not to raw images or text.
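To make the interpolation step concrete, here is a minimal NumPy sketch of the SMOTE idea. It is not the imbalanced-learn implementation: the function name `smote_sketch` and its parameters are made up for illustration, and it assumes a small, purely numerical minority matrix with at least two rows.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows on segments between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other points

    # Pairwise Euclidean distances between minority points only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # random minority point
    pick = rng.integers(0, k, size=n_new)   # one of its k neighbors
    nbr = neighbors[base, pick]
    gap = rng.random((n_new, 1))            # position along the segment
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_sketch(X_minority, n_new=6, k=2)
print(synthetic.shape)  # (6, 2)
```

Every synthetic row lies on a line segment between two real minority points, which is why the method needs numerical features where "halfway between two samples" is meaningful.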

ADASYN (Adaptive Synthetic Sampling) — like SMOTE, but generates more synthetic samples in harder-to-learn regions (near the decision boundary). More adaptive than vanilla SMOTE.

Borderline-SMOTE — only oversamples minority points that are near the decision boundary (borderline examples), ignoring easy ones deep inside the minority cluster.

SVMSMOTE — uses SVM support vectors to guide where synthetic samples are placed.


Undersampling techniques

Random undersampling — drop majority class samples randomly. Risk: loses potentially useful information.
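A rough NumPy sketch of random undersampling follows. The helper name `random_undersample` is hypothetical, and it assumes the goal is to shrink every class to the size of the smallest one.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    keep = np.sort(np.concatenate(keep))  # preserve original row order
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_undersample(X, y)
print(np.bincount(y_bal))  # [2 2]
```

Six of the eight majority rows are discarded here, which is exactly the information loss the technique risks.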

Tomek links — a Tomek link is a pair of opposite-class samples that are each other's nearest neighbors. Removing the majority sample from each link cleans the boundary rather than heavily reducing the data.
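As an illustration, Tomek links can be found with a brute-force nearest-neighbor search. This is a small sketch that assumes Euclidean distance and a dataset small enough for a full distance matrix; the function name `tomek_link_majority_indices` is invented for this example.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label):
    """Return indices of majority samples that take part in a Tomek link."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)  # nearest neighbor of every sample

    to_drop = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i  # i and j are each other's nearest neighbor
        if mutual and y[i] != y[j] and y[i] == majority_label:
            to_drop.append(i)
    return np.array(to_drop, dtype=int)

X = np.array([[0.0], [0.1], [1.0], [1.05], [3.0]])
y = np.array([0, 0, 0, 1, 1])
drop = tomek_link_majority_indices(X, y, majority_label=0)
print(drop)  # [2]
```

Only the majority point at 1.0 is removed, because it and the minority point at 1.05 are mutual nearest neighbors from different classes.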

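NearMiss — keeps the majority samples that lie closest to the minority class instead of a random subset. NearMiss-1 keeps those whose average distance to their nearest minority samples is smallest; versions 2 and 3 use different distance rules. More principled than random.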

Cluster centroids — replace a cluster of majority samples with their centroid. Reduces majority while preserving structure.

Edited Nearest Neighbors (ENN) — removes majority samples misclassified by their k-nearest neighbors. Good for noise removal.


Algorithm-level

class_weight parameter — built into sklearn's LogisticRegression, SVC, RandomForestClassifier, etc. Setting class_weight='balanced' weights each class inversely to its frequency, using n_samples / (n_classes * count of that class). One of the easiest first things to try. Interview gold.
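To see what the 'balanced' setting computes, the sketch below reproduces the weighting formula from the scikit-learn documentation, n_samples / (n_classes * count per class), in plain NumPy. The helper name `balanced_class_weights` is hypothetical.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

y = np.array([0] * 90 + [1] * 10)
weights = balanced_class_weights(y)
print(weights)  # class 1 gets weight 5.0, class 0 roughly 0.56
```

With a 90/10 split, a mistake on a minority sample costs about nine times as much as a mistake on a majority sample. In practice you would normally just pass class_weight='balanced' to the estimator rather than compute this by hand.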

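Focal Loss — introduced by Facebook AI Research for dense object detection (RetinaNet). Down-weights the loss on easy (well-classified) samples and focuses training on hard ones. Key formula: FL(p_t) = -(1 - p_t)^γ * log(p_t), where p_t is the predicted probability of the true class. γ controls the focus: γ = 0 recovers standard cross-entropy, and larger γ suppresses easy samples more strongly. Standard in computer vision imbalance.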
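The formula is easy to check numerically. Below is a minimal NumPy sketch of the binary focal loss exactly as written above; it leaves out the optional α class-balancing factor, and the function name `binary_focal_loss` is an assumption for this example.

```python
import numpy as np

def binary_focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Per-sample focal loss: -(1 - p_t)^gamma * log(p_t)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1 - p)  # probability assigned to the true class
    return -((1 - p_t) ** gamma) * np.log(p_t)

p = np.array([0.9, 0.6, 0.1])   # predicted probability of class 1
y = np.array([1, 1, 1])         # all three samples are positives

ce = binary_focal_loss(p, y, gamma=0.0)  # gamma = 0 is plain cross-entropy
fl = binary_focal_loss(p, y, gamma=2.0)
print(fl / ce)  # scaling factors (1 - p_t)^2: about 0.01, 0.16, 0.81
```

The easy sample (p_t = 0.9) keeps only about 1% of its cross-entropy loss, while the hard sample (p_t = 0.1) keeps about 81%, which is the "focus on hard examples" effect.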

XGBoost scale_pos_weight — set to negative_count / positive_count. Tells the boosting algorithm to penalize errors on the positive (minority) class more heavily.

Threshold tuning — after training, move the classification threshold away from the default 0.5 (for example, toward 0.3 or lower) to favor recall of the minority class. Choose the value on a validation set rather than by guesswork.
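One way to pick the threshold is to scan candidate values on held-out predictions and keep the one with the best F1. The sketch below assumes binary labels, scores in [0, 1], and F1 as the target metric (recall at a fixed precision would work the same way); the name `best_f1_threshold` and the toy data are made up.

```python
import numpy as np

def best_f1_threshold(y_true, scores):
    """Scan every observed score as a threshold and return the best F1."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

# Hypothetical validation labels and model scores.
y_val = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
s_val = np.array([0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.8])
threshold, f1 = best_f1_threshold(y_val, s_val)
print(threshold, round(f1, 3))  # 0.35 0.857
```

On this toy data the default threshold of 0.5 catches only one of the three positives (F1 = 0.5), while 0.35 catches all three at the cost of one false positive.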


Ensemble methods for imbalance

BalancedBaggingClassifier — bagging where each bootstrap sample is balanced before training each base estimator.

EasyEnsemble — creates multiple balanced datasets by random undersampling and trains a model on each; combines by averaging.

RUSBoost — combines Random Undersampling with AdaBoost. At each boosting round, the majority class is undersampled.

BalancedRandomForest — like RandomForest but each tree is trained on a balanced bootstrap sample.
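What these ensemble methods share is that every base learner sees a balanced resample. Here is a small sketch of that sampling step only; no model is trained. It assumes sampling with replacement and shrinking every class to the minority size, and the helper name `balanced_bootstrap_indices` is hypothetical.

```python
import numpy as np

def balanced_bootstrap_indices(y, n_estimators=3, seed=0):
    """Return one balanced index array per base estimator."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()
    bags = []
    for _ in range(n_estimators):
        parts = [
            rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=True)
            for c in classes
        ]
        bags.append(np.concatenate(parts))
    return bags

y = np.array([0] * 95 + [1] * 5)
bags = balanced_bootstrap_indices(y, n_estimators=4)
for idx in bags:
    print(np.bincount(y[idx]))  # [5 5] for every bag
```

Each base estimator would then be fit on `X[idx], y[idx]`. Because every bag draws a different majority subset, the ensemble as a whole still sees most of the majority data, which softens the information loss of plain undersampling.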


By data type

Tabular

SMOTE variants are the go-to. Also: CTGAN (Conditional GAN for tabular data) to generate realistic minority samples. Feature engineering can sometimes help expose a clearer signal for the minority class.

Text / NLP

  • EDA (Easy Data Augmentation) — synonym replacement, random insertion, random swap, random deletion on minority class sentences.
  • Back-translation — translate text to a pivot language (for example, French) and back to English; paraphrases at low cost.
  • LLM paraphrasing — use GPT/Claude to generate paraphrases of minority class examples. Increasingly common and often effective.
  • Few-shot fine-tuning — fine-tune a pretrained model (BERT etc.) which already has rich representations; needs far less minority data.
  • Oversampling at the embedding level — SMOTE in embedding space rather than raw text.
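Two of the EDA operations from the list above, random swap and random deletion, fit in a few lines of standard-library Python. This is a simplified sketch, assuming whitespace tokenization; synonym replacement and random insertion need a thesaurus and are left out. The function names are illustrative.

```python
import random

def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen token positions, n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p, rng):
    """Drop each token with probability p, but never return an empty sentence."""
    tokens = list(tokens)
    if not tokens:
        return tokens
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

rng = random.Random(0)
sentence = "the transaction was flagged as suspicious".split()
swapped = random_swap(sentence, n_swaps=1, rng=rng)
shortened = random_deletion(sentence, p=0.2, rng=rng)
print(" ".join(swapped))
print(" ".join(shortened))
```

Apply these only to minority-class sentences, and keep the perturbation small so the label still holds.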

Computer vision

  • Geometric augmentation — flips, crops, rotations, color jitter on minority class images. Usually the cheapest first step to try.
  • Mixup / CutMix — blend two images and their labels; forces the model to learn smoother decision boundaries.
  • GAN / Diffusion synthesis — generate realistic minority class images using a GAN (DCGAN, StyleGAN) or diffusion model. Expensive but powerful for rare-class detection.
  • Transfer learning — start from a backbone pretrained on ImageNet and fine-tune it on your imbalanced dataset. The pretrained backbone needs far fewer minority samples to generalize.
  • Focal loss — especially important for object detection where background heavily outnumbers objects.
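Mixup from the list above is simple enough to sketch in NumPy. The version below mixes a single pair of images with one-hot labels, drawing the mixing weight from a Beta(α, α) distribution; α = 0.2 and the toy 4×4 "images" are assumptions for illustration.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two samples and their one-hot labels with weight lam ~ Beta(alpha, alpha)."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y, lam

rng = np.random.default_rng(0)
img_a, label_a = np.zeros((4, 4, 3)), np.array([1.0, 0.0])  # majority-class image
img_b, label_b = np.ones((4, 4, 3)), np.array([0.0, 1.0])   # minority-class image
mixed_img, mixed_label, lam = mixup(img_a, label_a, img_b, label_b, rng=rng)
print(mixed_label)  # soft label [lam, 1 - lam]
```

The mixed label is soft rather than 0/1, so the training loss has to accept probability targets.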

Time series

  • SMOTE-TS — SMOTE adapted for temporal structure; it interpolates between temporal windows.
  • Window sliding — create more training windows from minority-class events (shorter stride).
  • Anomaly detection framing — reframe as one-class classification (learn what "normal" looks like, flag everything else). Useful when minority is extremely rare (fraud, fault detection).
  • Time-aware augmentation — warp, scale, or add noise to temporal minority sequences.
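The window sliding idea from the list above can be sketched as a windowing loop that takes small steps while a minority event is in view and large steps otherwise. This sketch assumes per-timestep labels and labels a window positive if any timestep in it is a minority event; the function name and parameters are hypothetical.

```python
import numpy as np

def sliding_windows(series, labels, width, majority_stride, minority_stride,
                    minority_label=1):
    """Cut windows, stepping by a shorter stride around minority events."""
    series, labels = np.asarray(series), np.asarray(labels)
    windows, window_labels = [], []
    start = 0
    while start + width <= len(series):
        has_event = bool(np.any(labels[start:start + width] == minority_label))
        windows.append(series[start:start + width])
        window_labels.append(int(has_event))
        start += minority_stride if has_event else majority_stride
    return np.array(windows), np.array(window_labels)

series = np.arange(20)
labels = np.zeros(20, dtype=int)
labels[10] = 1  # a single rare event at t = 10

w_dense, l_dense = sliding_windows(series, labels, 4, majority_stride=4, minority_stride=1)
w_plain, l_plain = sliding_windows(series, labels, 4, majority_stride=4, minority_stride=4)
print(l_dense.sum(), l_plain.sum())  # 3 1
```

The same single event yields three positive windows instead of one. The extra windows overlap heavily, so split train and validation by time before windowing to avoid leakage.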

Evaluation — the trap in interviews

Never rely on accuracy alone with imbalanced data. If 99% of the data is class 0, always predicting class 0 gives 99% accuracy while catching no positives at all.

  • Precision = of all predicted positives, how many are truly positive.
  • Recall = of all actual positives, how many did we catch. (Often more important in fraud/medical.)
  • F1 = harmonic mean of precision and recall.
  • PR-AUC (Area Under Precision-Recall Curve) — better than ROC-AUC for highly imbalanced data, because ROC-AUC can be misleadingly high when the many true negatives keep the false positive rate low.
  • MCC (Matthews Correlation Coefficient) — robust single metric even with extreme imbalance.
  • G-mean = √(Sensitivity × Specificity).
  • Stratified k-fold — essential; ensures each fold maintains the class distribution ratio.
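The accuracy trap is easy to demonstrate from a confusion matrix. The sketch below computes the metrics above in plain Python for a model that always predicts the majority class on a 99/1 split. Returning 0.0 when a metric is undefined (zero denominator) is a convention chosen for this example.

```python
import math

def imbalance_metrics(tp, fp, fn, tn):
    """Compute imbalance-aware metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
        "g_mean": math.sqrt(recall * specificity),
    }

# 1000 samples, 10 positives, and a model that always predicts class 0.
always_negative = imbalance_metrics(tp=0, fp=0, fn=10, tn=990)
print(always_negative)  # accuracy is 0.99; recall, F1, MCC, and G-mean are all 0.0
```

Every metric except accuracy exposes the model as useless, which is the short answer to "why not accuracy?".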


Interview question themes to expect

  1. "Why is accuracy a bad metric for imbalanced data?" → confusion matrix, precision/recall
  2. "Explain SMOTE. What are its limitations?" → overfitting on noisy boundaries, doesn't work on raw image/text data
  3. "When would you oversample vs undersample?" → data size, interpretability, risk of overfitting
  4. "How does focal loss work?" → γ parameter, use case in RetinaNet
  5. "What do you do when even SMOTE doesn't help?" → ensemble methods, threshold tuning, reframing as anomaly detection
  6. "How do you handle imbalance in NLP?" → augmentation techniques, LLMs, embedding-space SMOTE
