Borderline-SMOTE

Jun 14, 2026 · 4 min read · By mohammed.vasim
Machine Learning · AI · Data Science

Vanilla SMOTE has a blind spot. It treats every minority point the same — whether it sits comfortably deep inside the minority cluster or dangerously close to the majority class boundary. The result is synthetic samples generated in regions the model already handles well, while the actual problem area — where the two classes blur together — gets no special attention.

Borderline-SMOTE fixes this by asking a question SMOTE never bothers with: where exactly does this minority point live in the feature space?

The Zone Classification Idea

Before generating a single synthetic sample, Borderline-SMOTE classifies every minority point into one of three zones based on its m nearest neighbors across both classes (m_neighbors in imbalanced-learn).

Safe: Fewer than half of the m neighbors are majority class. This point is well inside the minority cluster. The model doesn't struggle here, so generating more samples around it adds little value.

Borderline: At least half of the m neighbors, but not all of them, are majority class. This point sits right at the edge where the two classes overlap. This is where misclassification happens. This is where you want more data.

Noise: All m neighbors are majority class. This is a minority point that's completely surrounded by the opposite class, most likely a mislabeled sample or an extreme outlier. Generating synthetic samples around it would actively mislead your model.

Only the borderline points get oversampled. Safe points are already well-learned. Noise points are a trap.
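The zone check is easy to sketch with plain NumPy. The snippet below is a minimal illustration, not imbalanced-learn's implementation: the function name `classify_zones`, the toy one-feature dataset, and the choice of m = 3 are all assumptions made for the example. It uses the thresholds from the original Borderline-SMOTE formulation: noise when all m neighbors are majority, borderline when at least half (but not all) are, safe otherwise.

```python
import numpy as np

def classify_zones(X, y, minority_label=1, m=5):
    """Label each minority point as 'safe', 'borderline', or 'noise'."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    zones = {}
    for i in np.flatnonzero(y == minority_label):
        # Euclidean distance from point i to every point, both classes.
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf  # a point is not its own neighbor
        neighbors = np.argsort(dist)[:m]
        n_majority = int(np.sum(y[neighbors] != minority_label))
        if n_majority == m:
            zones[int(i)] = "noise"
        elif n_majority >= m / 2:
            zones[int(i)] = "borderline"
        else:
            zones[int(i)] = "safe"
    return zones

# Toy data on a line: majority on the left, minority cluster on the right,
# plus one minority point (2.4) stranded among majority points.
majority = [0, 1, 2, 3, 4, 5, 6]
minority = [2.4, 6.6, 7.5, 8.5, 9.5, 10.5]
X = np.array(majority + minority, dtype=float).reshape(-1, 1)
y = np.array([0] * len(majority) + [1] * len(minority))

zones = classify_zones(X, y, minority_label=1, m=3)
for idx, zone in zones.items():
    print(X[idx, 0], zone)
```

Only the point at 6.6 would be handed to the oversampling step here; the stranded point at 2.4 is skipped as noise.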

How It Actually Works

Given a minority point X classified as borderline:

  1. Find its k nearest neighbors, this time only among minority class points. This k is a separate setting from the neighborhood size used for zone classification.
  2. Pick one of those neighbors at random for each synthetic sample you need, and interpolate:

       synthetic = X + λ × (neighbor − X)

where λ ∈ [0, 1] is chosen randomly. Same formula as SMOTE, different anchor points.
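As a rough sketch of that interpolation step, assuming the borderline point has already been identified: the helper name `interpolate_from` and its parameters are hypothetical, and the neighbor search is a brute-force NumPy one rather than whatever a library would use.

```python
import numpy as np

def interpolate_from(x, minority_points, k=2, n_new=3, rng=None):
    """Generate n_new synthetic points between x and its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    P = np.asarray(minority_points, dtype=float)
    dist = np.linalg.norm(P - x, axis=1)
    order = np.argsort(dist)
    # Drop x itself (distance 0) if it is in the list, then keep the k nearest.
    nearest = order[dist[order] > 0][:k]
    # One randomly chosen neighbor and one random lambda in [0, 1) per new sample.
    picks = rng.choice(nearest, size=n_new)
    lam = rng.random((n_new, 1))
    return x + lam * (P[picks] - x)

x = [0.0, 0.0]
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
synthetic = interpolate_from(x, minority_points, k=2, n_new=3, rng=0)
print(synthetic)
```

Every synthetic point lands on the segment between the borderline point and one of its two nearest minority neighbors; the far-away point at (5, 5) is never used.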

There are two variants in practice. Borderline-SMOTE1 interpolates only toward minority neighbors. Borderline-SMOTE2 also interpolates toward majority neighbors, with λ restricted to [0, 0.5] so the synthetic point stays closer to the minority point than to the majority one. The idea is to generate samples even closer to the decision boundary, which can push the classifier to be more aggressive about the minority region.
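The effect of capping λ at 0.5 can be shown in a few lines. This is a simplified sketch of only the majority-neighbor step: the function name `interpolate_toward_majority` is hypothetical, and it always uses the single nearest majority point, which is a simplification for illustration.

```python
import numpy as np

def interpolate_toward_majority(x, majority_points, rng=None):
    """Step from a borderline minority point toward its nearest majority point."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    M = np.asarray(majority_points, dtype=float)
    nearest = M[np.argmin(np.linalg.norm(M - x, axis=1))]
    # lambda in [0, 0.5): never travel more than halfway to the majority point.
    lam = rng.uniform(0.0, 0.5)
    return x + lam * (nearest - x), nearest

x = np.array([0.0, 0.0])
synthetic, nearest = interpolate_toward_majority(x, [[4.0, 0.0], [10.0, 10.0]], rng=0)

d_minority = np.linalg.norm(synthetic - x)
d_majority = np.linalg.norm(synthetic - nearest)
print(synthetic, d_minority, d_majority)
```

Because the step is at most half the distance, the synthetic point is always at least as close to its minority anchor as to the majority neighbor.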

In Code

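```python
from imblearn.over_sampling import BorderlineSMOTE

sampler = BorderlineSMOTE(
    k_neighbors=5,        # minority neighbors used for interpolation
    m_neighbors=10,       # neighbors checked for zone classification
    kind='borderline-1',  # or 'borderline-2'
    random_state=42
)

# X_train, y_train: your training split (resample training data only)
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)
```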

The m_neighbors parameter controls how many neighbors are checked to classify each minority point into safe/borderline/noise. The k_neighbors parameter controls how many minority neighbors are used during interpolation.

A useful diagnostic before and after:

```python
from collections import Counter
print("Before:", Counter(y_train))
print("After:", Counter(y_resampled))
```

A Concrete Example

Fraud detection. You have 10,000 transactions: 9,800 legitimate, 200 fraudulent.

With vanilla SMOTE, synthetic fraud samples are generated around every fraud point, including the few that sit entirely among legitimate transactions (wrong merchant category, unusual amount for a fraud pattern). These are your noise points. SMOTE generates more misleading data right where your classifier is most confused.

With Borderline-SMOTE, those noise points are identified and skipped. Only the fraud transactions that genuinely sit at the boundary, with a mix of fraudulent and legitimate neighbors, get augmented. The model gets more training signal exactly where it needs it.

Where It Breaks

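The zone classification is sensitive to the neighborhood size (m_neighbors). A small value makes points flip easily between zones based on local fluctuations. A large value smooths this out, but the neighborhood stops being local: on heavily imbalanced data it pulls in more majority points, so interior minority points can be flagged as borderline and isolated ones can escape the all-majority noise test.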
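One way to see this sensitivity is to sweep the neighborhood size and count the zones. The sketch below uses a small made-up dataset (300 majority and 30 minority points from two Gaussians) and a brute-force NumPy neighbor search; the sizes and the list of values tried are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_maj, n_min = 300, 30
X_maj = rng.normal(0.0, 1.0, size=(n_maj, 2))
X_min = rng.normal(2.0, 1.0, size=(n_min, 2))
X = np.vstack([X_maj, X_min])
is_majority = np.array([True] * n_maj + [False] * n_min)

# Distance from every minority point to every point in the dataset.
dist = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=2)
# A point is not its own neighbor.
dist[np.arange(n_min), n_maj + np.arange(n_min)] = np.inf
order = np.argsort(dist, axis=1)

zone_counts = {}
for m in (3, 5, 10, 20, 40):
    # Number of majority points among the m nearest neighbors of each minority point.
    n_majority = is_majority[order[:, :m]].sum(axis=1)
    zone_counts[m] = {
        "safe": int(np.sum(n_majority < m / 2)),
        "borderline": int(np.sum((n_majority >= m / 2) & (n_majority < m))),
        "noise": int(np.sum(n_majority == m)),
    }
    print(m, zone_counts[m])
```

The noise count can only shrink as the neighborhood grows, since a point whose larger neighborhood is all majority also has an all-majority smaller one. How the safe and borderline counts move depends on the data, which is why it is worth running a sweep like this on your own dataset before settling on a value.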

It also assumes the boundary is the hard part, which is true most of the time — but not always. In some problems, the rare events are genuinely scattered across feature space with no clean boundary at all. In those cases, the zone classification gives you noisy labels and Borderline-SMOTE performs no better than vanilla SMOTE.

And like all SMOTE variants, it's still doing linear interpolation. If the true minority class distribution is non-linear or has strong feature correlations, the synthetic points will still be unrealistic.

When to Use It

Borderline-SMOTE is the right upgrade from vanilla SMOTE when your exploratory analysis shows:

  • Clear class overlap in 2D projections (PCA, UMAP) — both classes visually bleed into each other
  • Your classifier's precision is decent but recall on the minority class is poor — it's missing borderline cases
  • You're seeing good performance on interior minority samples but failure on edge cases

If SMOTE is your current baseline and you're debugging why it's still underperforming, Borderline-SMOTE is usually the first thing to try.

The zone classification step adds compute — it's running a k-NN pass over the entire dataset before any generation happens. On large datasets, this adds up. But for datasets where vanilla SMOTE is already feasible, the overhead is rarely a blocker.

The model doesn't decide where the decision boundary is. The data does. Borderline-SMOTE just stops pretending otherwise.
