Jun 1, 2026 | 3 min read | By Mohammed Vasim
Tags: Machine Learning, AI, Data Science

SMOTE: Synthetic Minority Oversampling for Imbalanced Data

Basic upsampling has a known flaw: it duplicates minority samples. If you have 100 fraud cases and upsample to 900, the model sees the same 100 examples repeatedly. It memorizes them rather than learning what fraud actually looks like.

SMOTE (Synthetic Minority Oversampling Technique) solves this differently. Instead of duplicating, it creates new synthetic samples by interpolating between existing minority class instances.

How SMOTE Works Under the Hood

The algorithm takes a minority sample, finds its k nearest neighbors within the minority class, picks one of those neighbors at random, and creates a new sample at a random point on the line segment between the two. The result is a plausible minority-class example that isn't an exact copy of any real data point.

This matters because it increases the minority class's representation while staying close to its observed shape. The synthetic points fill the gaps in the minority region rather than piling on top of existing points.
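The interpolation step can be written as x_new = x_i + gap * (x_neighbor - x_i), with gap drawn uniformly from [0, 1). The sketch below implements just that idea with NumPy and scikit-learn's NearestNeighbors. The function name smote_sketch and its parameters are hypothetical, and this is a simplified illustration, not imbalanced-learn's actual implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each point's closest neighbor is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbor_idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    base = rng.integers(0, len(X_min), size=n_new)  # pick a minority sample
    pick = rng.integers(0, k, size=n_new)           # pick one of its k neighbors
    neighbor = neighbor_idx[base, pick]
    gap = rng.random((n_new, 1))                    # position along the segment
    return X_min[base] + gap * (X_min[neighbor] - X_min[base])


# Hypothetical minority class: 20 points in 2 dimensions.
X_min = np.random.default_rng(1).normal(size=(20, 2))
synthetic = smote_sketch(X_min, n_new=50)
print(synthetic.shape)  # (50, 2)
```

Because every synthetic row is a blend of two real minority rows, it always lands between them and never outside the range the minority class already covers.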

Applying SMOTE

Let's create an imbalanced dataset and apply SMOTE to see the effect.

python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
import pandas as pd
import matplotlib.pyplot as plt

X, y = make_classification(
    n_samples=1000, n_redundant=0, n_features=2,
    n_clusters_per_class=1, weights=[0.90], random_state=12
)

df1 = pd.DataFrame(X, columns=['f1', 'f2'])
df2 = pd.DataFrame(y, columns=['target'])
final_df = pd.concat([df1, df2], axis=1)

final_df['target'].value_counts()

900 samples in class 0, 100 in class 1. Let's see the imbalance visually:

python
plt.scatter(final_df['f1'], final_df['f2'], c=final_df['target'])
plt.show()  # needed to display the figure when running as a script

You'll see two clusters — the minority class is sparse. Now apply SMOTE:

python
oversample = SMOTE(random_state=42)  # fixed seed so the synthetic points are reproducible
X_resampled, y_resampled = oversample.fit_resample(
    final_df[['f1', 'f2']], final_df['target']
)

len(y_resampled[y_resampled == 0])  # 900
len(y_resampled[y_resampled == 1])  # 900

Both classes now have 900 samples. If you plot the resampled data, you'll notice the minority cluster is denser but follows the same general shape — the interpolation fills gaps without distorting the original pattern.

When SMOTE Falls Short

SMOTE isn't universally better than simple upsampling. A few cases where it causes problems:

High-dimensional data. Interpolation in hundreds of dimensions creates synthetic points that may not correspond to realistic examples. The notion of "between two points" becomes less meaningful as dimensions increase.

Overlapping classes. If your minority and majority clusters already overlap, SMOTE can generate synthetic points deep inside majority territory. This adds confusion rather than signal. A variant called Borderline-SMOTE targets this by generating samples only from minority points near the class boundary and skipping minority points that are surrounded entirely by majority neighbors, which it treats as noise.

Very small minority clusters. With too few minority samples (say, fewer than 10), the interpolated points don't add meaningful diversity. The k-nearest neighbors of any minority point are the same few examples, so the synthetic samples are near-duplicates — no better than basic upsampling.

Combining SMOTE with Cleaning

A common and effective pattern is to apply SMOTE followed by a cleaning (undersampling) step that removes ambiguous points near the class boundary. Tomek Links and Edited Nearest Neighbors are two cleaning methods for this:

python
from imblearn.combine import SMOTETomek

smote_tomek = SMOTETomek(random_state=42)
X_resampled, y_resampled = smote_tomek.fit_resample(X, y)

SMOTETomek applies SMOTE first, then removes Tomek links: pairs of opposite-class points that are each other's nearest neighbors. By default this clears out ambiguous points, real or synthetic, that sit too close to the class boundary, including synthetic points that SMOTE placed near the majority class.

Comparing Approaches

| Method | Strengths | Weaknesses |
|---|---|---|
| Upsampling | Simple, preserves all data | Overfits on duplicates |
| Downsampling | Fast, no synthetic points | Loses data |
| SMOTE | Creates realistic synthetic samples | Struggles with high dimensions |
| SMOTE + Tomek | Cleaner decision boundary | More complex, more parameters |

In practice, SMOTE with a cleaning step tends to work well for structured (tabular) data problems. For very high-dimensional data (text, images), basic upsampling or class weights in the loss function are usually more reliable.
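For reference, class weighting needs no resampling at all. Here is a minimal sketch using scikit-learn's LogisticRegression with class_weight="balanced"; the choice of model, the 75/25 split, and recall as the metric are all assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_redundant=0, n_features=2,
    n_clusters_per_class=1, weights=[0.90], random_state=12
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" weights each class inversely to its frequency,
# so mistakes on the minority class cost more during training.
clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)
minority_recall = recall_score(y_test, clf.predict(X_test))
print(round(minority_recall, 3))
```

The training data is left untouched here; only the loss is reweighted, which is why this approach scales to high-dimensional inputs where interpolation is questionable.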
