SMOTE + ENN

Jun 14, 2026 · 5 min read · By mohammed.vasim
Machine Learning · AI · Data Science

If SMOTE + Tomek Links is a scalpel, SMOTE + ENN is a cleaver. Both are two-step pipelines that oversample the minority class and then clean the boundary. The difference is in what "cleaning" means.

Tomek Links works one pair at a time: it only touches points that form a tight cross-class nearest-neighbor pair. Classically it drops the majority member of the pair; imblearn's SMOTETomek drops both members by default. Edited Nearest Neighbors removes any sample from either class that a k-NN vote over its neighbors gets wrong. That's a much wider scope of operation, and on datasets with dense, messy boundary overlap it usually leaves a noticeably smaller, cleaner training set.

What ENN Does

Edited Nearest Neighbors (ENN) runs a simple local validation on every sample in the dataset:

  1. For each sample, find its k nearest neighbors (across both classes)
  2. Predict that sample's class using majority vote of its neighbors
  3. If the prediction disagrees with the sample's actual label — remove it

  Sample X (minority) → k=3 neighbors: [majority, majority, minority]
  Majority vote: majority
  Actual label: minority
  → X is removed (ambiguous sample)

This removes samples that don't fit their local neighborhood. In boundary regions where classes are mixed, a significant number of samples from both classes get removed. In clean interior regions, most samples survive because their neighborhood is dominated by the same class.
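The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the majority-vote rule, not imblearn's implementation: the function name `enn_keep_mask`, the brute-force distance computation, and the toy data are all assumptions made for the example.

```python
import numpy as np

def enn_keep_mask(X, y, k=3):
    """Return a boolean mask of samples that survive a majority-vote ENN pass."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances (brute force; fine for small data).
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diffs, diffs)
    np.fill_diagonal(dist, np.inf)  # a sample is never its own neighbor
    # Indices of the k nearest neighbors of every sample.
    nn_idx = np.argsort(dist, axis=1)[:, :k]
    keep = np.empty(len(y), dtype=bool)
    for i, neighbors in enumerate(nn_idx):
        labels, counts = np.unique(y[neighbors], return_counts=True)
        predicted = labels[np.argmax(counts)]  # ties go to the smallest label
        keep[i] = predicted == y[i]
    return keep

# Toy data: a majority cluster (label 0), a minority cluster (label 1),
# and one minority point stranded inside the majority cluster.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # majority
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],  # minority
    [0.02, 0.03],                                    # stranded minority
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

keep = enn_keep_mask(X, y, k=3)
X_clean, y_clean = X[keep], y[keep]
print(keep)
```

In this toy set only the stranded minority point is voted out. The majority points next to it survive because two of their three neighbors still share their label.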

The key difference from Tomek links: ENN can remove samples from both classes, and it removes them based on local classification accuracy — not just cross-class proximity. A minority point surrounded by three majority points and one minority point gets removed. A majority point surrounded by one minority point and three majority points survives.
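For contrast, the Tomek-link criterion can be sketched the same way: a pair is a link only if the two points are each other's single nearest neighbor and carry different labels. The helper name `tomek_links` and the toy data are assumptions for illustration; this is not imblearn's code.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbors with different labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diffs, diffs)
    np.fill_diagonal(dist, np.inf)  # ignore self-distance
    nearest = np.argmin(dist, axis=1)  # single nearest neighbor of each sample
    links = []
    for i, j in enumerate(nearest):
        # i < j reports each pair once; nearest[j] == i makes it mutual.
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# Majority cluster (label 0), minority cluster (label 1),
# and one minority point sitting next to the first majority point.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # majority
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],  # minority
    [0.02, 0.03],                                    # stray minority
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

links = tomek_links(X, y)
print(links)
```

Only one pair qualifies here, even though the stray point is also the nearest neighbor of other majority points. That mutual-pair requirement is what keeps Tomek cleaning narrow.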

The SMOTEENN Pipeline

python
from imblearn.combine import SMOTEENN

sampler = SMOTEENN(
    smote=None,    # default SMOTE settings
    enn=None,      # default ENN settings (k=3)
    sampling_strategy='auto',
    random_state=42
)

X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)

To customize both components:

python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

smote = SMOTE(k_neighbors=5, random_state=42)
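enn = EditedNearestNeighbours(
    sampling_strategy='all',  # clean both classes; a standalone ENN defaults to 'auto', which skips the minority class
    n_neighbors=3,
    kind_sel='all'   # keep a sample only if ALL neighbors share its label; 'mode' uses a majority vote instead
)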

sampler = SMOTEENN(smote=smote, enn=enn, random_state=42)
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)

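kind_sel='all' is the default and the stricter of the two: a sample survives only if every one of its neighbors shares its label, so a single disagreeing neighbor is enough to remove it. kind_sel='mode' removes a sample only when the majority vote of its neighbors disagrees with its label, which makes it the more conservative option. Note that SMOTEENN's built-in ENN cleans both classes, while a standalone EditedNearestNeighbours leaves the minority class alone unless you pass sampling_strategy='all'.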

How Much Data Does ENN Actually Remove?

This varies significantly by dataset, but ENN typically removes more samples than Tomek links by a wide margin. On a heavily overlapping dataset, the final dataset after SMOTEENN can be notably smaller than after SMOTETomek — even if you started with the same SMOTE augmentation.

A quick diagnostic to understand what you're working with:

python
from collections import Counter
from imblearn.combine import SMOTEENN, SMOTETomek

# Fix the seed so the comparison is reproducible
for name, sampler in [("SMOTETomek", SMOTETomek(random_state=42)), ("SMOTEENN", SMOTEENN(random_state=42))]:
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(f"{name}: {len(X_res)} samples → {Counter(y_res)}")

If SMOTEENN gives you significantly fewer total samples than SMOTETomek, it's removing a lot from the boundary — which means your original data had dense overlap. That's valuable information: the dataset is inherently hard and no resampling method is a silver bullet.

A Medical Imaging Feature Dataset Example

Extracted features from chest X-rays. Rare pathology (3%) vs. normal (97%). The pathology class includes mild cases that look almost identical to normal cases — genuine clinical ambiguity, not label noise.

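With SMOTETomek: SMOTE generates synthetic pathology samples, then Tomek links removes only the points caught in cross-class nearest-neighbor pairs, say the 20 normal cases sitting closest to pathology samples. The boundary is slightly cleaner.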

With SMOTEENN: SMOTE generates synthetic pathology samples, ENN removes any sample from either class that its 3 nearest neighbors would misclassify. This catches not just the closest cross-class pairs but entire clusters of genuinely ambiguous samples. The resulting dataset is smaller but the classifier trained on it faces a cleaner decision problem. Recall on mild pathology cases typically improves more with SMOTEENN than SMOTETomek on datasets with this structure.

The Risk: Removing Too Much

ENN's aggressiveness is both its strength and its danger.

On small datasets, ENN can remove a meaningful percentage of your training data. If you started with 5,000 samples and ENN removes 800 of them, you've potentially degraded the model's ability to generalize even if the remaining samples are cleaner.

On legitimate boundary overlap, ENN removes real signal. In some domains, the ambiguous cases are real — they're not noise, they're the actual distribution of the phenomenon. Removing them doesn't clarify the problem; it hides it from the training process, and the model will still encounter those cases at inference time.

Class imbalance can worsen. Because ENN removes from both classes based on local misclassification, and minority points near the boundary are statistically more likely to be surrounded by majority points, ENN can remove proportionally more minority samples than majority ones. Always check the class distribution after SMOTEENN.

python
from collections import Counter
from imblearn.combine import SMOTEENN

X_res, y_res = SMOTEENN(random_state=42).fit_resample(X_train, y_train)
original_ratio = Counter(y_train)
new_ratio = Counter(y_res)
# Assumes the minority class is labeled 1
print("Original minority %:", original_ratio[1] / sum(original_ratio.values()))
print("After SMOTEENN minority %:", new_ratio[1] / sum(new_ratio.values()))

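With default settings SMOTE balances the classes, so the minority share should land at roughly 50% after SMOTEENN. If it ends up well below that, ENN removed far more minority samples than majority ones and has overcleaned. To reduce aggressiveness, switch to kind_sel='mode', or lower n_neighbors if you stay with kind_sel='all'.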

SMOTEENN vs SMOTETomek: When to Choose Which

Both are valid. The choice comes down to boundary structure.

Use SMOTETomek when:

  • Boundary noise is sparse — a few outliers and mislabeled points, not dense overlap
  • Dataset size is limited and you can't afford to lose many samples
  • You want a predictable, conservative cleaning step

Use SMOTEENN when:

  • Your 2D projection (PCA, UMAP) shows genuine dense overlap between classes in the boundary region
  • You've tried SMOTETomek and the improvement was marginal — the boundary still looks messy
  • Dataset is large enough that removing 10-30% of boundary samples doesn't hurt generalization
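  • Precision matters more than recall: ENN's aggressive cleaning tends to produce sharper boundaries, which can reduce false positives. Confirm this on a held-out set, since the effect depends on which class loses more boundary samples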

Neither will fix a fundamentally inseparable dataset. But on datasets where the overlap comes from noise or sampling artifacts rather than from the phenomenon itself, SMOTEENN tends to produce the cleaner classifier of the two. The tradeoff is that you're trusting k-NN local voting as the arbiter of what gets to stay, and that's a heuristic, not ground truth.

The right mental model for SMOTEENN: it's not just resampling. It's resampling plus a lightweight cleaning pass that throws out every training point your k-NN already knew it couldn't handle. Whatever survives is the part of your data your model will actually learn something clean from.
