
Imbalanced Datasets

Real-world classification problems are rarely balanced. Fraud detection, disease diagnosis, and churn prediction share one trait: the class you care most about makes up a tiny fraction of the data. This series covers the full toolkit: when each technique applies, how it changes model behaviour, and where it breaks down.

Posts in this series

Imbalanced Datasets — What makes a dataset imbalanced and why standard accuracy metrics mislead you
SMOTE — Synthetic oversampling by interpolation between minority samples

Prerequisites

Binary classification basics
Familiarity with scikit-learn

Tutorial

Jun 14, 2026 · 9 min read

Imbalanced Datasets

Imbalanced dataset handling: Data-level, Algorithm-level, Evaluation strategy; Oversampling, Undersampling; Random oversampling, SMOTE, ADASYN, Borderline-SMOTE, Random u…

Tutorial

Jun 14, 2026 · 7 min read

SMOTE

SMOTE picks a minority point, finds its k nearest minority neighbors, and interpolates between them to create a synthetic sample — a new point that didn't exist…
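
The three steps in that teaser (pick a point, find neighbors, interpolate) can be sketched in plain Python. This is a minimal illustration, not the imbalanced-learn implementation: the function name `smote_sample`, the toy `minority` points, and the choice of Euclidean distance are assumptions made for the example.

```python
import math
import random


def smote_sample(minority, k=3, rng=None):
    """Create one synthetic point from a list of minority-class points.

    Hypothetical helper for illustration only; assumes numeric features
    and Euclidean distance.
    """
    rng = rng or random.Random()

    # 1. Pick a minority point at random.
    base = rng.choice(minority)

    # 2. Find its k nearest minority neighbors, excluding the point itself.
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]

    # 3. Choose one neighbor and step a random fraction of the way toward it.
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # uniform in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))


# Toy minority class: five 2-D points (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7)]
rng = random.Random(0)
synthetic = [smote_sample(minority, k=3, rng=rng) for _ in range(4)]
```

Every synthetic point lies on the line segment between two existing minority points, so the new samples stay inside the region the minority class already occupies. In practice you would likely reach for a library implementation rather than a hand-rolled one.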

Tutorial

Jun 14, 2026 · 4 min read

Borderline-SMOTE

Vanilla SMOTE has a blind spot. It treats every minority point the same — whether it sits comfortably deep inside the minority cluster or dangerously close to t…

Tutorial

Jun 14, 2026 · 4 min read

ADASYN

SMOTE generates the same number of synthetic samples for every minority point it visits. That's a reasonable default but a poor strategy. A fraud transaction su…

Tutorial

Jun 14, 2026 · 4 min read

SVMSMOTE

The weakness shared by SMOTE, Borderline-SMOTE, and ADASYN is that they estimate the decision boundary using k-nearest neighbors. That's a local measure — it te…