SMOTE + Tomek Links

Jun 14, 2026 · 4 min read · By mohammed.vasim
Machine Learning · AI · Data Science

SMOTE-style oversampling has a dirty secret: generating synthetic minority samples doesn't just add data. Because it interpolates between existing minority points, some of that data lands right at the boundary, which is exactly where your dataset is already most ambiguous. You're patching one problem while potentially creating another.

SMOTE + Tomek Links approaches this honestly. It's a two-step pipeline: first oversample aggressively, then go back and surgically remove the most ambiguous pairs that result. Oversample, then clean.

A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbor. Formally, samples A and B form a Tomek link if:

  • A and B belong to different classes (one minority, one majority)
  • dist(A, B) < dist(A, C) for every other sample C, regardless of class
  • dist(A, B) < dist(B, C) for every other sample C, regardless of class

These are the pairs that sit uncomfortably close to each other across class lines. They're the training examples that make the classifier's job hardest — two points right next to each other with different labels.
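The definition above translates almost directly into code. The following is a minimal brute-force sketch in plain Python, not how imblearn implements it; the helper names `nearest_neighbor` and `tomek_links` and the toy data are made up for illustration.

```python
import math

def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )

def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest_neighbor(i, points)
        # A link needs opposite labels AND a mutual nearest-neighbor relation.
        if i < j and labels[i] != labels[j] and nearest_neighbor(j, points) == i:
            links.append((i, j))
    return links

# Toy data (hypothetical): 0 = majority, 1 = minority.
points = [(0.0, 0.0), (1.1, 0.0), (2.0, 0.0), (2.4, 0.0), (4.0, 0.0), (5.0, 0.0)]
labels = [0, 0, 0, 1, 1, 1]

print(tomek_links(points, labels))  # the pair at x=2.0 and x=2.4
```

The brute-force search is quadratic in the number of samples, which is fine for a toy example but not for real datasets.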

Majority:  ○ ○ ○ ○ ● ○ ○ ○ ○
Minority:  ★ ★ ★ ★
           ↑ Tomek link pair here

The majority point ● and the minority point ★ nearest to it form a Tomek link. Either one of them is noise (mislabeled or an outlier), or both are legitimate borderline points. In both cases they live in the classifier's danger zone.

The Two-Step Pipeline

Step 1 — SMOTE. Generate synthetic minority samples to balance the class distribution. The boundary region gets denser with synthetic minority points.
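For intuition, the interpolation at the heart of SMOTE can be sketched in a few lines. This is a simplified illustration in plain Python, not imblearn's implementation; the function name `smote_sample`, the choice of k=2, and the toy points are assumptions made for the example.

```python
import math
import random

def smote_sample(x, minority, k=2, rng=random):
    """Create one synthetic point on the segment between x and one of its k nearest minority neighbors."""
    neighbors = sorted(
        (p for p in minority if p != x),
        key=lambda p: math.dist(x, p),
    )[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(xi + gap * (ni - xi) for xi, ni in zip(x, neighbor))

rng = random.Random(42)
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (3.0, 3.0)]  # hypothetical minority samples
synthetic = [smote_sample(rng.choice(minority), minority, k=2, rng=rng) for _ in range(5)]

for point in synthetic:
    print(point)
```

Every synthetic point lies on a line segment between two real minority samples, which is why the new points can end up close to majority samples wherever the classes overlap.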

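Step 2 — Remove Tomek links. Identify all Tomek link pairs in the now-augmented dataset, then remove either the majority sample from each pair or both samples. In imblearn, a standalone TomekLinks cleans only the non-minority side by default, but SMOTETomek builds its default cleaner with sampling_strategy='all', which removes both members of each pair. Pass your own TomekLinks instance if you want to avoid losing minority signal.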

The result is a dataset that's both larger in the minority class and cleaner at the boundary.

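```python
from imblearn.combine import SMOTETomek

sampler = SMOTETomek(
    smote=None,          # None -> default SMOTE, inheriting sampling_strategy and random_state
    tomek=None,          # None -> TomekLinks(sampling_strategy='all'), removes both link members
    sampling_strategy='auto',
    random_state=42
)

# X_train, y_train: the training split only; never resample validation or test data
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)
```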

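To customize either component, pass configured instances:

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek

smote = SMOTE(k_neighbors=5, random_state=42)
# Clean only the majority side so no minority sample is dropped.
# After SMOTE the classes are balanced, so 'majority' has no strict winner;
# an explicit class list such as [0] avoids relying on how that tie is resolved.
tomek = TomekLinks(sampling_strategy='majority')

sampler = SMOTETomek(smote=smote, tomek=tomek, random_state=42)
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)
```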

Why This Order Matters

Running Tomek Links before SMOTE instead of after is a legitimate but different strategy. Pre-cleaning removes ambiguous majority points first, giving SMOTE a cleaner dataset to interpolate on. Post-cleaning (the order SMOTETomek always uses) removes the ambiguity that SMOTE's own synthetic generation introduced.

The post-cleaning order is generally preferred because SMOTE's synthetic points can create new Tomek links — synthetic minority samples that land very close to existing majority points. Cleaning after catches both the original ambiguities and the ones SMOTE introduced.
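A hand-built toy case shows how a single synthetic point can create a Tomek link that did not exist before. The data, the fixed interpolation factor of 0.6, and the helper `tomek_links` are all made up for illustration. Real SMOTE draws the factor at random.

```python
import math

def tomek_links(points, labels):
    """Mutual nearest-neighbor pairs with different labels, as (i, j) with i < j."""
    def nn(i):
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
    return [
        (i, nn(i)) for i in range(len(points))
        if i < nn(i) and labels[i] != labels[nn(i)] and nn(nn(i)) == i
    ]

# 1-D toy data: two minority points (label 1) flanking two majority points (label 0).
points = [(0.0,), (4.0,), (2.0,), (2.5,)]
labels = [1, 1, 0, 0]
links_before = tomek_links(points, labels)

# One SMOTE-style interpolation between the two minority points.
gap = 0.6
synthetic = tuple(a + gap * (b - a) for a, b in zip(points[0], points[1]))  # lands near x = 2.4
links_after = tomek_links(points + [synthetic], labels + [1])

print("links before SMOTE step:", links_before)
print("links after SMOTE step: ", links_after)
```

Before the interpolation, the two majority points are each other's nearest neighbors, so no cross-class link exists. The synthetic point lands between them and the minority sample at x = 4.0, and becomes the mutual nearest neighbor of the majority point at x = 2.5.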

What Tomek Cleaning Actually Does to the Boundary

Before SMOTETomek, the boundary region looks like scattered points from both classes mixed together. After:

  • SMOTE has added synthetic minority points throughout the minority distribution including near the boundary
  • Tomek link removal has peeled away the majority points that were intruding closest into minority territory

The net effect is a sharper boundary with more minority representation near it. The classifier gets a cleaner signal about where the classes actually separate.

A Medical Diagnosis Example

Rare disease classification. 95% negative, 5% positive. The positive cases closest to negative territory are patients with atypical presentations — early-stage symptoms that overlap heavily with common conditions.

With SMOTE alone, synthetic positive cases are generated across all positive patients including the atypical ones. The classifier learns a blurry boundary.

With SMOTETomek set to clean only the majority side, the atypical positive patients still contribute synthetic samples (SMOTE), but the most confusingly similar negative patients, the ones forming Tomek links with positives, are removed from the training set. The classifier learns the boundary from cleaner examples on both sides.

Where It Breaks

Tomek link removal is conservative. Unlike Edited Nearest Neighbors (which removes any misclassified sample under k-NN), Tomek links only target the single closest cross-class pair per point. If your boundary has dense overlapping clusters rather than isolated noisy pairs, Tomek cleaning barely dents the problem.

Losing majority samples can matter on small datasets. Removing majority points is undersampling. If your majority class isn't large to begin with, you might degrade overall classifier performance just to get a slightly cleaner boundary.

Seed sensitivity. The dataset you clean in Step 2 depends entirely on the random interpolation done in Step 1. Different random seeds in SMOTE produce different Tomek links in the cleanup pass. On small datasets, this variance can be significant. Always evaluate with multiple seeds.

Doesn't address the root cause. If your class overlap is severe and fundamental to the data distribution (not noise), SMOTE + Tomek is rearranging deck chairs. You may see marginal gains, but the core problem is insufficient signal to separate the classes, and no resampling technique fixes that.

When to Use It

SMOTETomek is a strong general-purpose baseline when:

  • You're moving beyond vanilla SMOTE and want a sensible upgrade without the compute cost of SVMSMOTE
  • Your dataset has moderate boundary noise — some mislabeled samples or outliers near the decision boundary
  • You want a single imblearn call that handles both augmentation and cleaning
  • Class imbalance is moderate (1:5 to 1:20) — extreme imbalance needs more aggressive approaches

Use SMOTEENN instead of SMOTETomek when:

  • Your boundary overlap is dense, not sparse (ENN cleans more aggressively)
  • You can afford to lose more training samples from both classes for a cleaner boundary

The intuition behind SMOTE + Tomek is genuinely sound: make the minority class bigger, then make the boundary cleaner. Two operations with complementary effects. The limitation is that "cleaner" in Tomek's sense means "fewer close cross-class pairs" — which is necessary but not always sufficient for a dramatically better classifier.
