Imbalanced Datasets
The 3 pillars (always start here in an interview)
- Data-level — fix the imbalance in the data itself before training.
- Algorithm-level — make the model aware of the imbalance during training.
- Evaluation — measure the right thing, not accuracy.
Oversampling techniques (tabular)
Random oversampling — duplicate minority samples at random. Simple, but the model can overfit to the repeated points.
SMOTE (Synthetic Minority Oversampling Technique) — generates synthetic samples by interpolating between a minority point and one of its k nearest minority-class neighbors. The workhorse of oversampling. Interview key: it creates samples in feature space, not pixel/text space, so it needs numerical feature vectors and does not apply to raw images or raw text.
ADASYN (Adaptive Synthetic Sampling) — like SMOTE, but generates more synthetic samples around minority points that have many majority-class neighbors, i.e. the harder-to-learn regions near the decision boundary.
Borderline-SMOTE — only oversamples minority points that are near the decision boundary (borderline examples), ignoring easy ones deep inside the minority cluster.
SVMSMOTE — uses SVM support vectors to guide where synthetic samples are placed.
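The SMOTE interpolation step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the reference algorithm: the function name `smote_sketch` and the toy points are made up for the example, and the neighbor search is brute force. In practice a maintained implementation such as the `SMOTE` class in imbalanced-learn would be the usual choice.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    point and one of its k nearest minority neighbors (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Brute-force k nearest minority neighbors, excluding the base point itself
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Toy minority class with two numerical features (made-up values)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two real minority points, it always lies between them, which is also why noisy or mislabeled minority points get amplified.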
Undersampling techniques
Random undersampling — drop majority class samples randomly. Risk: loses potentially useful information.
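Tomek links — a Tomek link is a pair of opposite-class samples that are each other's nearest neighbor. Removing the majority member of each pair cleans the boundary rather than heavily reducing the data.
NearMiss — keeps only the majority samples that sit closest to the minority class. NearMiss-1 keeps those with the smallest average distance to their nearest minority samples, NearMiss-2 uses the farthest minority samples instead, and NearMiss-3 is a two-step variant. More principled than random.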
Cluster centroids — replace a cluster of majority samples with their centroid. Reduces majority while preserving structure.
Edited Nearest Neighbors (ENN) — removes samples (typically from the majority class) whose label disagrees with the majority of their k nearest neighbors. Good for noise removal.
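As one concrete case from the list above, here is a rough sketch of Tomek-link cleaning. It assumes a small dataset where a brute-force nearest-neighbor search is acceptable; the function name `tomek_links` and the toy points are illustrative only.

```python
import math

def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbors
    and carry different labels (hypothetical helper)."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))
    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

# Toy data: label 0 is the majority class, label 1 the minority class
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [5.0, 5.0]]
y = [0, 1, 0, 0, 1]

links = tomek_links(X, y)
# Drop only the majority member of each link
majority_to_drop = {i if y[i] == 0 else j for i, j in links}
X_clean = [x for idx, x in enumerate(X) if idx not in majority_to_drop]
y_clean = [label for idx, label in enumerate(y) if idx not in majority_to_drop]
```

Only the first two points form a link here: they are each other's nearest neighbor and have different labels, so the majority point at index 0 is removed and everything else stays.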
Algorithm-level
class_weight parameter — built into many sklearn classifiers (LogisticRegression, SVC, RandomForestClassifier, etc.). Setting class_weight='balanced' weights each class inversely to its frequency. One of the easiest first things to try. Interview gold.
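To make the 'balanced' heuristic concrete: scikit-learn documents it as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that arithmetic in plain Python; the helper name `balanced_class_weights` and the 90/10 split are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * class_count) (hypothetical helper)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Made-up labels: 90 majority samples, 10 minority samples
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
```

With this split the minority class gets weight 5.0 and the majority class about 0.56, so one minority error costs roughly as much as nine majority errors.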
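Focal Loss — introduced by Facebook AI Research for dense object detection (RetinaNet). Down-weights the loss on easy (well-classified) samples and focuses training on hard ones. Key formula: FL(p_t) = -(1-p_t)^γ * log(p_t), where p_t is the predicted probability of the true class. γ controls the focus: γ = 0 recovers ordinary cross-entropy, and the RetinaNet paper uses γ = 2. Standard in computer vision imbalance.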
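A small numeric sketch of the formula may help show how strongly easy examples are down-weighted. It assumes the binary case, omits the optional α class-balancing factor, and clamps p_t with a small `eps` (an implementation detail added here) to avoid log(0).

```python
import math

def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Binary focal loss for a single example.
    p: predicted probability of the positive class; y: true label (0 or 1)."""
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, eps), 1.0)  # guard against log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy = focal_loss(0.95, 1)                # well-classified positive
hard = focal_loss(0.10, 1)                # badly misclassified positive
ce_easy = focal_loss(0.95, 1, gamma=0.0)  # gamma = 0 is plain cross-entropy
```

For the easy example the modulating factor is (1 - 0.95)^2 = 0.0025, so its loss shrinks to a quarter of a percent of the cross-entropy value, while the hard example keeps most of its loss (factor 0.81).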
XGBoost scale_pos_weight — set to negative_count / positive_count. Tells the boosting algorithm to penalize missing minority class more.
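Threshold tuning — after training, shift the classification threshold below the default 0.5 to favor recall of the minority class. Pick the value on a validation set (for example from the precision-recall curve) rather than fixing something like 0.3 in advance.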
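One way to choose the threshold is to scan candidate values on held-out predictions and keep the one that maximizes a target metric. The sketch below uses F1 as that metric, which is an assumption; a recall floor or a business cost function would follow the same pattern. The validation scores and labels are made up.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 of the positive class when predicting 1 for score >= threshold."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, labels):
    """Try every observed score as a threshold and keep the best F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))

# Hypothetical validation-set probabilities and true labels
val_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.80, 0.90]
val_labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]

threshold = best_threshold(val_scores, val_labels)
f1_tuned = f1_at_threshold(val_scores, val_labels, threshold)
f1_default = f1_at_threshold(val_scores, val_labels, 0.5)
```

On this toy data the default 0.5 misses two of the four positives, while the tuned threshold recovers all of them at the cost of two false positives.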
Ensemble methods for imbalance
BalancedBaggingClassifier — bagging where each bootstrap sample is balanced before training each base estimator.
EasyEnsemble — creates multiple balanced datasets by random undersampling and trains a model on each; combines by averaging.
RUSBoost — combines Random Undersampling with AdaBoost. At each boosting round, the majority class is undersampled.
BalancedRandomForest — like RandomForest but each tree is trained on a balanced bootstrap sample.
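All four methods share one resampling idea: each base learner sees a balanced subset. The sketch below shows only that step plus score averaging, in the spirit of EasyEnsemble. The base learners themselves are left out, and the helper names `balanced_subsets` and `average_scores` are hypothetical.

```python
import random

def balanced_subsets(y, n_subsets, minority_label=1, seed=0):
    """Index lists that keep every minority sample and an equally sized
    random draw (without replacement) from the majority class."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    majority_idx = [i for i, label in enumerate(y) if label != minority_label]
    subsets = []
    for _ in range(n_subsets):
        sampled_majority = rng.sample(majority_idx, len(minority_idx))
        subsets.append(sorted(minority_idx + sampled_majority))
    return subsets

def average_scores(per_model_scores):
    """Combine per-model probability lists by averaging them example-wise."""
    return [sum(col) / len(col) for col in zip(*per_model_scores)]

# Made-up labels: 20 majority samples, 4 minority samples
y = [0] * 20 + [1] * 4
subsets = balanced_subsets(y, n_subsets=3)

# Stand-in outputs of two fitted models on the same two test examples
combined = average_scores([[0.2, 0.8], [0.4, 0.6]])
```

Each subset discards most of the majority class, but different subsets discard different samples, so the ensemble as a whole still sees most of the data.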
By data type
Tabular
SMOTE variants are the go-to. Also: CTGAN (Conditional GAN for tabular data) to generate realistic minority samples. Feature engineering can sometimes help expose a clearer signal for the minority class.
Text / NLP
- EDA (Easy Data Augmentation) — synonym replacement, random insertion, random swap, random deletion on minority class sentences.
- Back-translation — translate text into a pivot language (e.g. French) and back to English to get a paraphrase at low cost.
- LLM paraphrasing — use GPT/Claude to generate paraphrases of minority class examples. Increasingly common and often effective.
- Few-shot fine-tuning — fine-tune a pretrained model (BERT etc.) which already has rich representations; needs far less minority data.
- Oversampling at the embedding level — SMOTE in embedding space rather than raw text.
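Two of the EDA operations from the list above, random swap and random deletion, are simple enough to sketch directly. Synonym replacement and random insertion need a thesaurus and are left out. The function names and the example sentence are illustrative assumptions.

```python
import random

def random_swap(words, n_swaps, rng):
    """Swap the positions of two randomly chosen words, n_swaps times."""
    words = list(words)
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    """Drop each word independently with probability p."""
    kept = [w for w in words if rng.random() > p]
    # Never return an empty sentence
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)
sentence = "the transaction was flagged as fraud".split()
swapped = random_swap(sentence, n_swaps=1, rng=rng)
shortened = random_deletion(sentence, p=0.2, rng=rng)
```

Applying these only to minority-class sentences yields extra training examples that keep the original label. Aggressive settings can change the meaning, so small values of `n_swaps` and `p` are the safer starting point.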
Computer vision
- Geometric and color augmentation — flips, crops, rotations, color jitter on minority class images. Usually the cheapest first step.
- Mixup / CutMix — blend two images and their labels; forces the model to learn smoother decision boundaries.
- GAN / Diffusion synthesis — generate realistic minority class images using a GAN (DCGAN, StyleGAN) or diffusion model. Expensive but powerful for rare-class detection.
- Transfer learning — start from a backbone pretrained on ImageNet and fine-tune it on your imbalanced dataset. A pretrained backbone needs far fewer minority samples to generalize.
- Focal loss — especially important for object detection where background heavily outnumbers objects.
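As an illustration of the Mixup item above, the sketch below blends two examples and their one-hot labels with a weight drawn from a Beta(α, α) distribution. Images are represented as flat lists of floats to keep the example dependency-free; a real pipeline would do the same arithmetic on tensors. The toy values and the default `alpha=0.2` are assumptions.

```python
import random

def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels with the same weight lam."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam

rng = random.Random(0)
# Toy "images" as flat pixel lists with one-hot labels for two classes
image_a, label_a = [0.0, 0.5, 1.0, 0.2], [1.0, 0.0]
image_b, label_b = [1.0, 0.5, 0.0, 0.8], [0.0, 1.0]

mixed_x, mixed_y, lam = mixup(image_a, label_a, image_b, label_b, rng=rng)
```

The mixed label is a soft target that still sums to 1, so the training loss has to accept probability vectors rather than hard class indices.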
Time series
- SMOTE-TS — SMOTE adapted for temporal structure, interpolates between temporal windows.
- Window sliding — create more training windows from minority-class events (shorter stride).
- Anomaly detection framing — reframe as one-class classification (learn what "normal" looks like, flag everything else). Useful when minority is extremely rare (fraud, fault detection).
- Time-aware augmentation — warp, scale, or add noise to temporal minority sequences.
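The window-sliding idea from the list above can be sketched as follows: cut windows from minority-class events with a short stride and from majority-class stretches with a long one. The helper name `make_windows` and the integer stand-in series are assumptions for illustration.

```python
def make_windows(series, window, stride):
    """Cut fixed-length windows from a sequence, moving `stride` steps each time."""
    return [series[i:i + window]
            for i in range(0, len(series) - window + 1, stride)]

normal_segment = list(range(20))       # stand-in for a long majority-class stretch
fault_segment = list(range(100, 110))  # stand-in for a short minority-class event

# Non-overlapping windows for the majority class, heavily overlapping for the minority
normal_windows = make_windows(normal_segment, window=4, stride=4)
fault_windows = make_windows(fault_segment, window=4, stride=1)
```

Here the 10-step fault event yields 7 windows while the 20-step normal stretch yields 5. Overlapping windows from one event are strongly correlated, so splitting into train and validation sets by event before windowing avoids leakage.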
Evaluation — the trap in interviews
Never use accuracy with imbalanced data. If 99% of the data is class 0, always predicting class 0 gives 99% accuracy while being useless.
- Precision = of all predicted positives, how many are truly positive.
- Recall = of all actual positives, how many did we catch. (Often more important in fraud/medical.)
- F1 = harmonic mean of precision and recall.
- PR-AUC (Area Under Precision-Recall Curve) — better than ROC-AUC for highly imbalanced data: ROC-AUC can look misleadingly high because the false positive rate stays small when true negatives dominate.
- MCC (Matthews Correlation Coefficient) — single metric that uses all four confusion-matrix cells and stays informative even with extreme imbalance.
- G-mean = √(Sensitivity × Specificity).
- Stratified k-fold — essential; ensures each fold maintains the class distribution ratio.
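The accuracy trap is easy to reproduce from the four confusion-matrix counts. The sketch below computes the threshold-based metrics for two made-up models on 1000 samples with 10 positives; the helper name `imbalance_metrics` and the counts are illustrative, and undefined ratios are reported as 0.0 by convention here.

```python
import math

def imbalance_metrics(tp, fp, fn, tn):
    """Metrics derived from confusion-matrix counts (hypothetical helper)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
        "g_mean": math.sqrt(recall * specificity),
    }

# Model A: predicts class 0 for everything
always_negative = imbalance_metrics(tp=0, fp=0, fn=10, tn=990)
# Model B: catches 8 of 10 positives at the cost of 20 false alarms
useful = imbalance_metrics(tp=8, fp=20, fn=2, tn=970)
```

Model A scores 99% accuracy with recall, F1, MCC, and G-mean all at zero. Model B has lower accuracy (97.8%) but is better on every other metric, which is the whole argument against accuracy in one comparison.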
Interview question themes to expect
- "Why is accuracy a bad metric for imbalanced data?" → confusion matrix, precision/recall
- "Explain SMOTE. What are its limitations?" → amplifies noise near class boundaries, doesn't work on raw image/text data
- "When would you oversample vs undersample?" → data size, interpretability, risk of overfitting
- "How does focal loss work?" → γ parameter, use case in RetinaNet
- "What do you do when even SMOTE doesn't help?" → ensemble methods, threshold tuning, reframing as anomaly detection
- "How do you handle imbalance in NLP?" → augmentation techniques, LLMs, embedding-space SMOTE