SMOTE + Tomek Links
SMOTE-style oversampling has a dirty secret: generating synthetic minority samples doesn't just add data. Some of that data lands right at the class boundary, which is exactly where your dataset is already most ambiguous. You're patching one problem while potentially creating another.
SMOTE + Tomek Links approaches this honestly. It's a two-step pipeline: first oversample aggressively, then go back and surgically remove the most ambiguous pairs that result. Oversample, then clean.
What a Tomek Link Is
A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbor. Formally, samples A and B form a Tomek link if:
- A and B belong to different classes (one minority, one majority)
- no other sample C, of either class, has dist(A, C) < dist(A, B)
- no other sample C, of either class, has dist(B, C) < dist(A, B)
These are the pairs that sit uncomfortably close to each other across class lines. They're the training examples that make the classifier's job hardest — two points right next to each other with different labels.
Majority: ○ ○ ○ ○ ● ○ ○ ○ ○
Minority: ★ ★ ★ ★
↑
Tomek link pair here
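The majority point ● and the minority point ★ nearest to it form a Tomek link, provided each is the other's nearest neighbor. Either one of the two is noise (a mislabeled sample or an outlier), or both are legitimate points sitting right on the class boundary. In both cases they live in the classifier's danger zone.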
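The definition is easy to check by brute force. Here is a minimal sketch, assuming Euclidean distance and a dataset small enough for an O(n²) distance matrix; `find_tomek_links` is a hypothetical helper written for this post, not part of imblearn.

```python
import numpy as np

def find_tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different labels. Brute force, O(n^2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    nn = dist.argmin(axis=1)        # nearest neighbour over ALL samples
    links = []
    for i, j in enumerate(nn):
        if i < j and nn[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# Toy 1-D data: majority (0) on the left, minority (1) on the right.
X = np.array([[0.0], [1.0], [2.0], [2.9], [3.1], [4.0], [5.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1])
print(find_tomek_links(X, y))  # → [(3, 4)]
```

Only the points at 2.9 and 3.1 qualify: they are each other's nearest neighbor and have different labels. The pair at 0.0 and 1.0 is also mutually nearest, but both are majority, so it is not a Tomek link.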
The Two-Step Pipeline
Step 1: SMOTE. Generate synthetic minority samples to balance the class distribution. The boundary region gets denser with synthetic minority points.
Step 2: Remove Tomek links. Identify all Tomek link pairs in the now-augmented dataset, then remove one or both samples of each pair. imblearn's SMOTETomek defaults to TomekLinks(sampling_strategy='all'), which removes both samples; the standalone TomekLinks default removes only the majority side, which avoids losing minority signal.
The result is a dataset that's both larger in the minority class and cleaner at the boundary.
from imblearn.combine import SMOTETomek
sampler = SMOTETomek(
    smote=None,  # uses default SMOTE settings
    tomek=None,  # uses TomekLinks(sampling_strategy='all')
    sampling_strategy='auto',
    random_state=42
)
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)
To customize the SMOTE and Tomek components:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek
smote = SMOTE(k_neighbors=5, random_state=42)
tomek = TomekLinks(sampling_strategy='majority')
sampler = SMOTETomek(smote=smote, tomek=tomek, random_state=42)
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)
Why This Order Matters
Running Tomek Links before SMOTE instead of after is a legitimate but different strategy. Pre-cleaning removes ambiguous majority points first, giving SMOTE a cleaner dataset to interpolate on. Post-cleaning (the SMOTETomek default) removes the ambiguity that SMOTE's own synthetic generation introduced.
The post-cleaning order is generally preferred because SMOTE's synthetic points can create new Tomek links — synthetic minority samples that land very close to existing majority points. Cleaning after catches both the original ambiguities and the ones SMOTE introduced.
What Tomek Cleaning Actually Does to the Boundary
Before SMOTETomek, the boundary region looks like scattered points from both classes mixed together. After:
- SMOTE has added synthetic minority points throughout the minority distribution including near the boundary
- Tomek link removal has peeled away the majority points that were intruding closest into minority territory (and, when both sides of a link are removed, the minority points paired with them)
The net effect is a sharper boundary with more minority representation near it. The classifier gets a cleaner signal about where the classes actually separate.
A Medical Diagnosis Example
Rare disease classification. 95% negative, 5% positive. The positive cases closest to negative territory are patients with atypical presentations — early-stage symptoms that overlap heavily with common conditions.
With SMOTE alone, synthetic positive cases are generated across all positive patients including the atypical ones. The classifier learns a blurry boundary.
With SMOTETomek: the atypical positive patients still contribute synthetic samples (SMOTE), but the most confusingly similar negative patients — the ones forming Tomek links with positives — are removed from the training set. The classifier learns the boundary from cleaner examples on both sides.
Where It Breaks
Tomek link removal is conservative. Unlike Edited Nearest Neighbors (which removes any misclassified sample under k-NN), Tomek links only target the single closest cross-class pair per point. If your boundary has dense overlapping clusters rather than isolated noisy pairs, Tomek cleaning barely dents the problem.
Losing majority samples can matter on small datasets. Removing majority points is undersampling. If your majority class isn't large to begin with, you might degrade overall classifier performance just to get a slightly cleaner boundary.
Seed sensitivity. The dataset you clean in Step 2 depends entirely on the random interpolation done in Step 1. Different random seeds in SMOTE produce different Tomek links in the cleanup pass. On small datasets, this variance can be significant. Always evaluate with multiple seeds.
Doesn't address the root cause. If your class overlap is severe and fundamental to the data distribution (not noise), SMOTE + Tomek is moving deck chairs. You'll see marginal gains, but the core problem is an insufficient signal to separate the classes, and no resampling technique fixes that.
When to Use It
SMOTETomek is a strong general-purpose baseline when:
- You're moving beyond vanilla SMOTE and want a sensible upgrade without the compute cost of SVMSMOTE
- Your dataset has moderate boundary noise — some mislabeled samples or outliers near the decision boundary
- You want a single imblearn call that handles both augmentation and cleaning
- Class imbalance is moderate (1:5 to 1:20); extreme imbalance needs more aggressive approaches
Use SMOTEENN instead of SMOTETomek when:
- Your boundary overlap is dense, not sparse (ENN cleans more aggressively)
- You can afford to lose more training samples from both classes for a cleaner boundary
The intuition behind SMOTE + Tomek is genuinely sound: make the minority class bigger, then make the boundary cleaner. Two operations with complementary effects. The limitation is that "cleaner" in Tomek's sense means "fewer close cross-class pairs" — which is necessary but not always sufficient for a dramatically better classifier.