SMOTE
Let's build an interactive SMOTE visualizer — pick any minority point and watch synthetic samples get generated in real time.
Now let's break down exactly what's happening.
What SMOTE actually does — step by step
The core problem SMOTE (Synthetic Minority Over-sampling Technique) solves: if you just duplicate minority samples at random, your model tends to overfit to those exact points. SMOTE instead creates new, plausible minority samples by interpolating between existing ones.
The algorithm, step by step
Step 1 — Pick a minority sample X.
Say you have a fraud detection dataset with features [transaction_amount, time_of_day]. You pick one fraud transaction: X = (120, 280).
Step 2 — Find its k nearest neighbors in the minority class. With k=3, you find the 3 closest fraud transactions in feature space:
- Neighbor 1: (160, 230)
- Neighbor 2: (190, 300)
- Neighbor 3: (140, 190)
Importantly, neighbors are only searched among other minority points — not majority points.
Step 3 — Randomly pick one neighbor. Say we pick Neighbor 1: (160, 230).
Step 4 — Interpolate along the line between X and that neighbor. Pick a random λ between 0 and 1, then:
synthetic = X + λ × (neighbor − X)
If λ = 0.4:
synthetic = (120, 280) + 0.4 × ((160,230) − (120,280))
= (120, 280) + 0.4 × (40, −50)
= (120 + 16, 280 − 20)
= (136, 260)
This new point (136, 260) is the synthetic sample — it didn't exist in the original data, but it lies on the line between two real fraud transactions, so it's a plausible fraud transaction.
Step 5 — Repeat for as many synthetic samples as you need to balance the dataset.
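The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names (`k_nearest`, `interpolate`, `smote`) are made up for this post, distance is assumed to be Euclidean, and the starting point is chosen at random on each iteration.

```python
import math
import random

def k_nearest(x, minority, k):
    """Return the k minority points closest to x (Euclidean), excluding x itself."""
    others = [p for p in minority if p != x]
    return sorted(others, key=lambda p: math.dist(x, p))[:k]

def interpolate(x, neighbor, lam):
    """Return x + lam * (neighbor - x), computed feature by feature."""
    return tuple(xi + lam * (ni - xi) for xi, ni in zip(x, neighbor))

def smote(minority, n_synthetic, k=3, seed=0):
    """Generate n_synthetic points by repeating steps 1 to 4."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)                           # step 1: pick a minority sample
        neighbor = rng.choice(k_nearest(x, minority, k))   # steps 2-3: pick one of its k neighbors
        lam = rng.random()                                 # step 4: random lambda in [0, 1)
        synthetic.append(interpolate(x, neighbor, lam))
    return synthetic

# The fraud example from the text: X plus its three neighbors.
fraud = [(120, 280), (160, 230), (190, 300), (140, 190)]
x = fraud[0]
neighbors = k_nearest(x, fraud, k=3)       # sorted nearest first
worked = interpolate(x, (160, 230), 0.4)   # the lambda = 0.4 case worked out above
new_points = smote(fraud, n_synthetic=5)   # step 5: repeat as often as needed
print(neighbors, worked, new_points)
```

Neighbors are searched only inside the `minority` list, which mirrors the rule that majority points never take part in the neighbor search.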
A concrete tabular example
Imagine a loan default dataset:
| Sample | Income | Loan amount | Age | Default? |
|---|---|---|---|---|
| A (minority) | 30,000 | 15,000 | 25 | Yes |
| B (minority) | 35,000 | 20,000 | 30 | Yes |
SMOTE with λ=0.6 between A and B:
Income = 30000 + 0.6 × (35000 - 30000) = 33,000
Loan amount = 15000 + 0.6 × (20000 - 15000) = 18,000
Age = 25 + 0.6 × (30 - 25) = 28
New synthetic defaulter: (33,000 income, 18,000 loan, age 28). Completely plausible!
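The same arithmetic as a small Python sketch, with each row stored as a dictionary. The function name `interpolate_row` and the column keys are illustrative choices, not part of any library.

```python
def interpolate_row(a, b, lam):
    """Interpolate every numeric feature of sample a toward sample b."""
    return {name: a[name] + lam * (b[name] - a[name]) for name in a}

sample_a = {"income": 30_000, "loan_amount": 15_000, "age": 25}
sample_b = {"income": 35_000, "loan_amount": 20_000, "age": 30}

synthetic = interpolate_row(sample_a, sample_b, lam=0.6)

# Round for display; floating-point arithmetic can leave tiny residues.
print({name: round(value) for name, value in synthetic.items()})
```

Every feature is moved by the same λ, so the synthetic row sits on the straight line between A and B rather than mixing features independently.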
Why it works (and why it fails)
It works because it forces the model to learn a broader decision region around the minority class rather than memorizing a few exact points. The synthetic points fill in the "neighborhood" of minority samples.
It fails in a few important situations:
Noisy minority samples: if a minority point is actually mislabeled or is an outlier sitting near majority territory, SMOTE generates synthetic points in that wrong region, amplifying the noise.
Overlapping classes: if minority and majority classes are already heavily mixed, interpolating creates more samples right where the model is already confused.
High-dimensional data: in very high dimensions, "nearest neighbor" loses much of its meaning (the curse of dimensionality), because the distances between points become nearly equal.
Non-tabular data: you can't linearly interpolate between two sentences or two images in raw feature space and expect a meaningful result. That's why SMOTE is mostly used on tabular data in practice.
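The high-dimensional failure mode is easy to see numerically. The sketch below assumes uniformly random points (a toy setup, not real data) and measures how much farther the farthest point is than the nearest one, relative to the nearest. The function name `relative_contrast` is made up for this illustration.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query point to the rest."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query, rest = points[0], points[1:]
    distances = [math.dist(query, p) for p in rest]
    return (max(distances) - min(distances)) / min(distances)

low = relative_contrast(dim=2)
high = relative_contrast(dim=1000)
print(f"2 features:    {low:.2f}")
print(f"1000 features: {high:.2f}")
```

With 2 features the nearest neighbor is many times closer than the farthest point. With 1000 features the gap shrinks to a small fraction, so the "k nearest" neighbors SMOTE relies on are barely nearer than any other point.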
1. What "interpolate" means
Word meaning (literally): From Latin interpolare ("to refurbish or alter"), built from inter (between) and a root related to polire (to polish). In modern usage it means "to insert something between two existing things."
In everyday language — if Monday's temperature was 20°C and Wednesday's was 30°C, and someone asks "what was Tuesday probably like?", you'd say "around 25°C." That estimation between two known values is interpolation.
Contextual meaning in SMOTE:
You have two real data points in feature space — think of them as two dots on a graph. SMOTE draws an imaginary straight line between them and places a new point somewhere along that line, controlled by λ:
synthetic = X + λ × (neighbor − X)
- λ = 0 → you're at X itself
- λ = 1 → you're at the neighbor itself
- λ = 0.4 → you're 40% of the way from X toward the neighbor
So the synthetic point is never outside the two real points — it's always between them. That's what makes it "plausible": it lives in the same neighborhood as real minority samples, not somewhere random.
The key intuition: SMOTE doesn't invent new extremes. It fills in the gaps between what already exists.
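A tiny sketch of the same idea in one dimension, reusing the temperature example. The helper name `lerp` is a common shorthand for linear interpolation and is an assumption here, not something SMOTE libraries expose.

```python
def lerp(start, end, lam):
    """Linear interpolation: start + lam * (end - start)."""
    return start + lam * (end - start)

# Temperature example: halfway between Monday (20) and Wednesday (30).
tuesday = lerp(20.0, 30.0, 0.5)

# Sweep lambda from 0 to 1 in steps of 0.1.
samples = [lerp(20.0, 30.0, step / 10) for step in range(11)]

# For every lambda in [0, 1] the result stays between the two endpoints.
all_between = all(20.0 <= value <= 30.0 for value in samples)
print(tuesday, samples[0], samples[-1], all_between)
```

The sweep starts exactly at the first endpoint (λ = 0) and ends exactly at the second (λ = 1), which is the one-dimensional version of "never outside the two real points".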
2. Techniques to mitigate SMOTE's challenges
Here's the mental model for all of them:
The core SMOTE problem → solution mapping
Problem 1: Noise amplification (minority point sitting near majority territory gets oversampled) → Solution: Borderline-SMOTE — only oversample points that are genuinely at the boundary, not the noisy outliers buried in the wrong region.
Problem 2: Uniform treatment of all minority points (hard and easy regions get equal samples) → Solution: ADASYN — calculate how "difficult" each minority point is and give more synthetic samples to the harder ones. The model naturally struggles more near the boundary, so that's where you need more data.
Problem 3: SMOTE doesn't know where the decision boundary actually is → Solution: SVMSMOTE — let an SVM explicitly find the boundary first, then use its support vectors as the generation anchors. More geometrically principled.
Problem 4: SMOTE creates samples but leaves noisy boundary pairs → Solution: SMOTE + Tomek Links or SMOTE + ENN — first oversample, then clean up noisy overlapping pairs at the boundary. Two-pass pipeline: generate, then filter.
Problem 5: Linear interpolation doesn't respect feature constraints (e.g., interpolating between age=25 and age=30 gives age=27.4, which is fine; but interpolating a categorical "product type" is meaningless) → Solution: CTGAN / VAE — generative models that learn the actual distribution of the minority class, including correlations and mixed data types, and sample from that learned distribution.
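To make the first two mappings concrete, here is a rough sketch of the bookkeeping behind Borderline-SMOTE and ADASYN. Both start by counting how many of a minority point's m nearest neighbors (searched in the whole dataset) belong to the majority class. The toy coordinates, the function names, and the simple rounding are assumptions made for this illustration. Library implementations differ in details.

```python
import math

def majority_neighbor_counts(minority, majority, m):
    """For each minority point, count majority points among its m nearest neighbors."""
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    counts = []
    for i, x in enumerate(minority):
        # Minority points come first in `labelled`, so index i is x itself.
        others = [item for j, item in enumerate(labelled) if j != i]
        others.sort(key=lambda item: math.dist(x, item[0]))
        counts.append(sum(is_major for _, is_major in others[:m]))
    return counts

def borderline_category(count, m):
    """Borderline-SMOTE style labels: only 'danger' points seed new samples."""
    if count == m:
        return "noise"    # surrounded by majority points only
    if count >= m / 2:
        return "danger"   # near the boundary
    return "safe"

def adasyn_allocation(counts, m, total):
    """ADASYN-style split of `total` samples, proportional to the majority share."""
    ratios = [c / m for c in counts]
    norm = sum(ratios)
    if norm == 0:
        return [0] * len(counts)
    return [round(total * r / norm) for r in ratios]

# Toy data (hypothetical): a safe cluster, one boundary point, one isolated point.
minority = [(0, 0), (1, 0), (0, 1), (3.2, 0), (8, 0)]
majority = [(5, 0), (5, 1), (6, 0), (7, 0), (7, 1), (8, 1), (9, 0), (9, 1)]

counts = majority_neighbor_counts(minority, majority, m=3)
categories = [borderline_category(c, m=3) for c in counts]
allocation = adasyn_allocation(counts, m=3, total=10)
print(counts, categories, allocation)
```

In this toy data the isolated point at (8, 0) is labelled noise and skipped by the Borderline-SMOTE rule, yet it receives the largest ADASYN share. That is worth keeping in mind: weighting by difficulty also gives extra weight to outliers.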
Quick decision tree for which to use
Is your data purely numerical?
├── Yes → What is the main issue with plain SMOTE?
│   ├── Noisy samples generated near the boundary → Borderline-SMOTE
│   ├── Hard regions need more samples than easy ones → ADASYN
│   ├── Want an SVM-guided boundary → SVMSMOTE
│   └── Want a cleaner boundary after oversampling → SMOTE + Tomek / SMOTE + ENN
│
└── No (categorical columns or complex distributions) → CTGAN or VAE
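The hybrid approaches (SMOTE + Tomek, SMOTE + ENN) are particularly popular in practice because they handle both sides of the problem, generating enough minority samples and cleaning up the messy boundary, in a single resampling step. Note that these resamplers come from imbalanced-learn (SMOTETomek and SMOTEENN in imblearn.combine), not from scikit-learn itself.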
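The cleaning half of SMOTE + Tomek rests on a simple definition: a Tomek link is a pair of points from different classes that are each other's nearest neighbor. Below is a minimal pure-Python sketch of that detection step on made-up one-dimensional-looking data. The function names are illustrative, and dropping only the majority member is one cleaning choice among several.

```python
import math

def nearest_index(i, points):
    """Index of the point closest to points[i], excluding itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))

def tomek_links(points, labels):
    """Pairs (i, j) with different labels that are each other's nearest neighbor."""
    links = []
    for i in range(len(points)):
        j = nearest_index(i, points)
        if i < j and labels[i] != labels[j] and nearest_index(j, points) == i:
            links.append((i, j))
    return links

# Toy data (hypothetical): one minority/majority pair sits right on the boundary.
points = [(0, 0), (1, 0), (3, 0), (3.4, 0), (6, 0), (7, 0)]
labels = ["minority", "minority", "minority", "majority", "majority", "majority"]

links = tomek_links(points, labels)

# One possible cleaning rule: drop the majority member of each link.
to_drop = {i if labels[i] == "majority" else j for i, j in links}
cleaned = [(p, y) for idx, (p, y) in enumerate(zip(points, labels))
           if idx not in to_drop]
print(links, len(cleaned))
```

In a hybrid pipeline this filter would run after oversampling, so links created by freshly generated synthetic points get cleaned up as well.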
Interview-ready summary
SMOTE = pick minority point → find k nearest minority neighbors → pick one neighbor → interpolate at random λ ∈ [0,1] → that's your new sample. Repeat until balanced.
The key formula: synthetic = X + λ(neighbor − X)
The key limitation: noise amplification near the decision boundary. That is why variants such as Borderline-SMOTE (skip noisy points and focus on the boundary) and ADASYN (generate more samples where the minority class is harder to learn) add logic about where synthetic samples get generated.