ADASYN
SMOTE generates the same number of synthetic samples for every minority point it visits. That's a reasonable default but a poor strategy. A fraud transaction surrounded by other fraud transactions doesn't need more neighbors — your model already understands that region. A fraud transaction surrounded by legitimate ones is where your model fails, and that's where you want the training signal concentrated.
ADASYN (Adaptive Synthetic Sampling) makes this distinction mathematically. Instead of asking "how many samples do we need?" it asks "which minority points are hardest to classify, and how much should each of them contribute to the synthetic set?"
The Core Idea: Difficulty as a Sampling Weight
For each minority point, ADASYN computes a difficulty ratio — a number between 0 and 1 that measures how surrounded by majority-class points it is.
rᵢ = (number of majority neighbors in k-NN of point i) / k
A minority point where all k neighbors are also minority gets rᵢ = 0 — it's in a safe region. A minority point where all k neighbors are majority gets rᵢ = 1 — it's completely isolated in enemy territory.
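To make the ratio concrete, here is a minimal pure-Python sketch of the computation. The function name `difficulty_ratios` and the toy 1-D data are made up for illustration, and the brute-force search stands in for whatever k-NN index a real library would use. Note that neighbors are searched across the whole dataset, minority and majority alike.

```python
import math

def difficulty_ratios(X, y, minority_label, k):
    """For each minority point, the fraction of its k nearest neighbors that are majority."""
    ratios = []
    for i, (x_i, y_i) in enumerate(zip(X, y)):
        if y_i != minority_label:
            continue
        # Distance from x_i to every other point, paired with that point's label.
        others = [
            (math.dist(x_i, x_j), y_j)
            for j, (x_j, y_j) in enumerate(zip(X, y))
            if j != i
        ]
        nearest = sorted(others, key=lambda pair: pair[0])[:k]
        majority_neighbors = sum(label != minority_label for _, label in nearest)
        ratios.append(majority_neighbors / k)
    return ratios

# Toy data (hypothetical): a minority cluster at 0..2 with one majority point
# next to it, plus a lone minority point at 50 sitting among majority points.
X = [(0,), (1,), (2,), (3,), (48,), (49,), (50,), (51,), (52,)]
y = [1, 1, 1, 0, 0, 0, 1, 0, 0]

ratios = difficulty_ratios(X, y, minority_label=1, k=2)
print(ratios)  # [0.0, 0.0, 0.5, 1.0]
```

The two interior cluster points score 0, the point on the cluster's edge scores 0.5, and the isolated point at 50 scores 1.0.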
These ratios are then normalized so they sum to 1, turning them into a proper probability distribution:
r̂ᵢ = rᵢ / Σrᵢ
Now, given a total budget of G synthetic samples to generate (for full balancing, the majority count minus the minority count), each minority point i contributes:
Gᵢ = G × r̂ᵢ
Harder points get more synthetic samples. Easier points get fewer — sometimes zero.
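The normalization and allocation steps fit in a few lines. Treat this as a sketch under assumptions: `allocate_synthetics` is a hypothetical helper, counts are rounded to the nearest integer (so they may not sum to exactly G), and the all-zero case raises an error because the normalization is undefined there.

```python
def allocate_synthetics(ratios, G):
    """Split a budget of G synthetic samples across minority points by difficulty."""
    total = sum(ratios)
    if total == 0:
        # Every minority point is in a safe region, so r / sum(r) is undefined.
        raise ValueError("all difficulty ratios are zero; nothing to weight by")
    # G_i = G * r_i / sum(r), rounded to a whole number of samples.
    return [round(G * r / total) for r in ratios]

counts = allocate_synthetics([0.0, 0.2, 0.8, 1.0], G=20)
print(counts)  # [0, 2, 8, 10]
```

The point with rᵢ = 0 contributes nothing, while the point with rᵢ = 1 alone supplies half of the budget.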
Step by Step
from imblearn.over_sampling import ADASYN

sampler = ADASYN(
    n_neighbors=5,
    sampling_strategy='auto',
    random_state=42
)
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)

sampling_strategy='auto' tells ADASYN to balance the minority class up to the majority count. You can pass a float (target ratio) or a dict (per-class target counts) for finer control.
n_neighbors controls the k-NN used for both computing difficulty ratios and generating synthetic samples.
What Changes Compared to SMOTE
With SMOTE on a fraud dataset, if you have 200 fraud transactions and want 1800 synthetic ones, each fraud point generates roughly 9 synthetics. No point gets more because it's harder.
With ADASYN on the same dataset:
- 50 fraud transactions in safe regions → rᵢ ≈ 0 → almost no synthetic samples
- 80 at moderate risk → rᵢ ≈ 0.4 → moderate contribution
- 70 surrounded by legitimate transactions → rᵢ ≈ 0.9 → most of the 1800 synthetics come from here
The classifier gets trained heavily on the genuinely ambiguous cases. This tends to produce better recall on edge cases at the cost of slightly more false positives — a tradeoff that's often exactly right in high-recall domains.
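Plugging the three groups above into the allocation formula shows where the budget goes. This sketch treats the approximate ratios (0, 0.4, 0.9) as exact for every point in a group, which is a simplification for the sake of the arithmetic.

```python
G = 1800  # synthetic samples to generate
# group name -> (number of fraud points, difficulty ratio assumed for each)
groups = {"safe": (50, 0.0), "moderate": (80, 0.4), "hard": (70, 0.9)}

n_points = sum(n for n, _ in groups.values())        # 200 fraud points
total_r = sum(n * r for n, r in groups.values())     # sum of all difficulty ratios

smote_per_point = G / n_points                       # uniform SMOTE baseline
group_totals = {}
for name, (n, r) in groups.items():
    per_point = G * r / total_r                      # G_i for one point in this group
    group_totals[name] = n * per_point
    print(f"{name:9s} per point: {per_point:5.2f}   group total: {group_totals[name]:7.1f}")

print(f"SMOTE baseline per point: {smote_per_point:.2f}")
```

Under these assumptions the 70 hard points receive about 1194 of the 1800 synthetics (roughly 17 each, versus SMOTE's flat 9), the moderate group about 606, and the safe group none.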
A Practical Comparison
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.model_selection import train_test_split
# X, y: your feature matrix and binary labels, loaded beforehand
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
for name, sampler in [("SMOTE", SMOTE(random_state=42)), ("ADASYN", ADASYN(random_state=42))]:
    # Resample the training split only; the test split keeps its real class balance
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_res, y_res)
    print(f"\n{name}:")
    print(classification_report(y_test, clf.predict(X_test)))

On highly imbalanced datasets with overlapping classes, ADASYN typically shows better recall on the minority class and slightly lower precision. Whether that tradeoff works depends entirely on your domain.
Where It Breaks
Noisy minority points become expensive. If a minority sample is mislabeled (a legitimate transaction marked as fraud), it almost certainly sits among majority points, so ADASYN gives it a difficulty ratio near 1 and therefore the highest weight. You end up generating a lot of synthetic samples around the most unreliable anchor in your dataset. Noise in the minority class punishes ADASYN more than it punishes SMOTE.
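A quick back-of-the-envelope calculation shows how much a single mislabeled point can cost. The numbers are invented for illustration: 99 genuine fraud points with a difficulty ratio of 0.2 each, plus one mislabeled legitimate transaction with a ratio of 1.0.

```python
# Hypothetical minority class: 99 genuine points plus one mislabeled point.
ratios = [0.2] * 99 + [1.0]

# ADASYN: share of the synthetic budget is r_i / sum(r).
noisy_share_adasyn = ratios[-1] / sum(ratios)
# SMOTE: every minority point gets an equal share.
noisy_share_smote = 1 / len(ratios)

print(f"ADASYN share for the mislabeled point: {noisy_share_adasyn:.1%}")
print(f"SMOTE share for the mislabeled point:  {noisy_share_smote:.1%}")
```

In this setup the mislabeled point anchors close to 5% of ADASYN's synthetic samples, against 1% under SMOTE's uniform allocation.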
Over-aggressiveness near fuzzy boundaries. In domains where the boundary between classes is genuinely fuzzy (not because of data quality, but because the phenomenon itself is ambiguous), ADASYN can push the classifier too hard into ambiguous territory and hurt overall calibration.
Not deterministic across runs. The difficulty ratio computation is stable, but the synthetic sample generation involves random interpolation. Results can vary meaningfully across random seeds, especially on small datasets. Fix random_state in experiments.
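The randomness lives in the interpolation step: a synthetic point is placed at a random position on the segment between a minority point and one of its minority neighbors. A minimal sketch of that step, with a hypothetical `interpolate` helper and made-up coordinates, shows why the seed matters.

```python
import random

def interpolate(x_i, x_neighbor, rng):
    """Return a random point on the segment between x_i and x_neighbor."""
    lam = rng.random()  # random position along the segment, in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x_i, x_neighbor)]

x_i = [1.0, 2.0]         # a minority point (made-up values)
x_neighbor = [3.0, 6.0]  # one of its minority neighbors

same_seed_a = interpolate(x_i, x_neighbor, random.Random(42))
same_seed_b = interpolate(x_i, x_neighbor, random.Random(42))
other_seed = interpolate(x_i, x_neighbor, random.Random(7))

print(same_seed_a == same_seed_b)  # True: same seed, same synthetic point
print(same_seed_a == other_seed)   # False: a different seed moves the point
```

Multiply that by hundreds or thousands of synthetic points and two unseeded runs can train on noticeably different datasets.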
Can't handle categorical features. Like plain SMOTE, ADASYN interpolates linearly in feature space, which is meaningless for categorical columns. Mixed-type datasets need preprocessing (target encoding, etc.) or a different approach entirely, such as a categorical-aware variant like imblearn's SMOTENC.
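Here is a small illustration of what goes wrong when a categorical column is integer-encoded and then interpolated. The merchant categories and their codes are hypothetical, and the interpolation weight is fixed at 0.3 to stand in for one random draw.

```python
# Hypothetical integer encoding of a categorical "merchant type" column.
MERCHANT_CODES = {0: "grocery", 1: "travel", 2: "electronics"}

code_a = 0  # fraud row A: grocery
code_b = 2  # fraud row B: electronics
lam = 0.3   # stands in for one random interpolation weight

# Linear interpolation, exactly as it would be applied to a numeric feature.
synthetic_code = code_a + lam * (code_b - code_a)
is_valid = synthetic_code in MERCHANT_CODES

print(synthetic_code, is_valid)  # 0.6 False
```

The synthetic row lands on a code that maps to no category at all, and rounding it would silently pick "travel", a category neither parent row had.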
When to Use It
ADASYN makes sense when:
- Your evaluation shows the minority class precision is acceptable but recall is poor on edge cases specifically
- You have a relatively clean dataset — not many mislabeled minority samples
- You're in a high-recall domain (medical diagnosis, fraud detection, safety systems) where missing a positive is more expensive than a false alarm
- SMOTE gave you marginal improvement and you want a more aggressive boundary-focused strategy
Skip ADASYN when:
- Your minority class has label noise — you'll amplify the wrong points
- Precision is as important as recall — the over-aggressiveness near boundaries can hurt it
- Your dataset is small — the difficulty ratios become unreliable with few samples
ADASYN is SMOTE with a feedback signal. It's not smarter about how it generates samples — the interpolation is identical — but it's smarter about where it concentrates effort. That distinction alone is often enough to push a borderline model into production territory.