Handling Outliers: Detection and Treatment with IQR
A single extreme value can quietly distort your entire analysis. The mean of a column shifts, variance inflates, and models that minimize squared error or rely on distances (linear regression, SVMs, k-means) get pulled toward the outlier.
Not all outliers are errors. A $50,000 purchase in a B2C retail dataset might be a data entry mistake. In a B2B dataset, it may be a perfectly legitimate transaction. The challenge is detecting extreme values systematically and deciding what they represent.
The Five-Number Summary
Before handling outliers, you need a way to detect them. The five-number summary provides the foundation: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
import numpy as np
import seaborn as sns
marks = [45, 32, 56, 75, 89, 54, 32, 89, 90, 87,
         67, 54, 45, 98, 99, 67, 74]
minimum, Q1, median, Q3, maximum = np.quantile(
    marks, [0, 0.25, 0.50, 0.75, 1.0]
)

This gives you the spread of the central 50% of data, but it doesn't tell you what counts as an outlier.
The IQR Method
The Interquartile Range (Q3 − Q1) captures the middle 50% of your data. The standard rule defines outliers as points below Q1 − 1.5×IQR or above Q3 + 1.5×IQR.
IQR = Q3 - Q1
lower_fence = Q1 - 1.5 * IQR
higher_fence = Q3 + 1.5 * IQR

The 1.5 multiplier is a convention that works well for roughly normal distributions: on a normal distribution the fences sit about 2.7 standard deviations from the mean, so only around 0.7% of points are flagged. For skewed distributions, you might widen it to 3.0 to avoid flagging natural skewness as outliers.
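As a quick check of the arithmetic, the sketch below wraps the fence calculation in a small helper and applies it to the `marks` list from above. The name `iqr_fences` is illustrative, not from any library, and the numbers assume NumPy's default linear interpolation for quantiles.

```python
import numpy as np


def iqr_fences(values, multiplier=1.5):
    """Return (lower_fence, upper_fence) from the quartiles of `values`."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


marks = [45, 32, 56, 75, 89, 54, 32, 89, 90, 87,
         67, 54, 45, 98, 99, 67, 74]

# Q1 = 54, Q3 = 89, IQR = 35, so the fences land at 1.5 and 141.5
lower_fence, upper_fence = iqr_fences(marks)
print(lower_fence, upper_fence)
```

Every mark lies between 32 and 99, well inside these fences, so nothing in this list is flagged.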
sns.boxplot(marks)

A clean set with no outliers. Now add some extreme values:
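marks_with_outliers = [-100, -200, 45, 32, 56, 75, 89, 54, 32, 89,
                       90, 87, 67, 54, 45, 98, 99, 67, 74, 150, 170, 180]
sns.boxplot(marks_with_outliers)

The box plot now shows points far outside the whiskers: -200 and -100 on the low end, 170 and 180 on the high end. The value 150 sits just inside the upper fence (153.5), so it marks the end of the upper whisker rather than appearing as an outlier.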
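To list the flagged values instead of reading them off the plot, one option is a boolean mask built from the fences. The helper `iqr_outliers` below is a hypothetical name used for illustration. The second call tries the wider 3.0 multiplier.

```python
import numpy as np


def iqr_outliers(values, multiplier=1.5):
    """Return the values that fall outside the IQR fences, in input order."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(arr, [0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    mask = (arr < lower) | (arr > upper)
    return arr[mask]


marks_with_outliers = [-100, -200, 45, 32, 56, 75, 89, 54, 32, 89,
                       90, 87, 67, 54, 45, 98, 99, 67, 74, 150, 170, 180]

flagged = iqr_outliers(marks_with_outliers)            # fences at -16.5 and 153.5
flagged_wide = iqr_outliers(marks_with_outliers, 3.0)  # fences at -80.25 and 217.25

print(flagged.tolist())       # [-100.0, -200.0, 170.0, 180.0]
print(flagged_wide.tolist())  # [-100.0, -200.0]
```

With the 3.0 multiplier only the two negative values remain flagged, which shows how much the choice of multiplier matters on a sample this small.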
Detection Methods Compared
The IQR method isn't the only option. Each has tradeoffs:
| Method | Best For | Limitation |
|---|---|---|
| IQR (Tukey's fences) | Quick univariate detection | Assumes symmetric distribution |
| Z-score | Normally distributed data | Extreme values inflate the mean and standard deviation, hiding other outliers (masking) |
| Modified Z-score | Data with some skewness | Depends on the MAD, which is 0 (and the score undefined) when more than half the values are identical |
| DBSCAN clustering | Multivariate outliers | Sensitive to epsilon parameter |
| Isolation Forest | High-dimensional data | Less interpretable |
The Z-score flags points more than 3 standard deviations from the mean:
from scipy import stats
z_scores = np.abs(stats.zscore(marks_with_outliers))
# np.where returns a tuple; [0] extracts the array of flagged indices
outliers = np.where(z_scores > 3)[0]

This works well when your data is roughly normal. For non-normal distributions, the modified Z-score using median absolute deviation is more robust.
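A minimal sketch of the modified Z-score follows, assuming the commonly used Iglewicz and Hoaglin form, 0.6745 × (x − median) / MAD, with a cutoff of 3.5. The function name and the cutoff are choices made here for illustration. The classic Z-score is computed alongside in plain NumPy for comparison.

```python
import numpy as np


def modified_z_scores(values):
    """Median/MAD-based scores; raises if the MAD is zero."""
    arr = np.asarray(values, dtype=float)
    median = np.median(arr)
    mad = np.median(np.abs(arr - median))
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return 0.6745 * (arr - median) / mad


marks_with_outliers = [-100, -200, 45, 32, 56, 75, 89, 54, 32, 89,
                       90, 87, 67, 54, 45, 98, 99, 67, 74, 150, 170, 180]
arr = np.asarray(marks_with_outliers, dtype=float)

# Classic Z-score with the population standard deviation (ddof=0)
classic_z = (arr - arr.mean()) / arr.std()
modified_z = modified_z_scores(arr)

classic_flagged = arr[np.abs(classic_z) > 3]
modified_flagged = arr[np.abs(modified_z) > 3.5]

print(classic_flagged.tolist())   # [-200.0]
print(modified_flagged.tolist())  # [-100.0, -200.0]
```

On this sample the classic Z-score flags only -200, because the extreme values themselves inflate the standard deviation to roughly 79. The median-based version also catches -100, although 150, 170, and 180 still fall under the 3.5 cutoff.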
What to Do With Detected Outliers
Once identified, you have three main options:
Trimming (Removal): drop the outlier rows entirely. Best when outliers are clearly data entry errors or you have enough data to spare.
Capping (Winsorization): replace outliers with the nearest fence value. This preserves the data point while limiting its influence.
def cap_outliers(series, multiplier=1.5):
    """Clip a pandas Series to its IQR fences."""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - multiplier * IQR
    upper = Q3 + multiplier * IQR
    return series.clip(lower, upper)

Imputation: treat outliers like missing values and impute them using the median. This keeps every row, but it pulls each extreme value all the way to the center, so it reduces variance more than capping does.
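The three treatments can be compared side by side on the earlier sample. This is a sketch using plain NumPy arrays rather than a pandas Series. The variable names are illustrative, and imputing with the median of the non-flagged values is an assumption, one of several reasonable choices.

```python
import numpy as np

values = np.array([-100, -200, 45, 32, 56, 75, 89, 54, 32, 89,
                   90, 87, 67, 54, 45, 98, 99, 67, 74, 150, 170, 180],
                  dtype=float)

# IQR fences with the standard 1.5 multiplier
q1, q3 = np.quantile(values, [0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Trimming: drop flagged values, so the sample shrinks
trimmed = values[~is_outlier]

# Capping: pull flagged values back to the nearest fence
capped = np.clip(values, lower, upper)

# Imputation (assumption: use the median of the non-flagged values)
imputed = np.where(is_outlier, np.median(values[~is_outlier]), values)

for name, result in [("trimmed", trimmed), ("capped", capped), ("imputed", imputed)]:
    print(name, len(result), result.min(), result.max(), round(float(result.std()), 1))
```

Trimming leaves 18 of the 22 values, while capping and imputation keep all 22. Comparing the printed standard deviations shows how strongly each treatment compresses the spread.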
How Different Models React to Outliers
The impact of outliers depends heavily on your model:
- Linear regression, logistic regression: highly sensitive. Extreme points, especially high-leverage ones in the input space, pull the coefficient estimates.
- Tree-based models (random forest, XGBoost): largely robust to extreme feature values. Splits depend on the ordering of values rather than their magnitude, so a single extreme value usually ends up isolated on one side of a split. Outliers in the target can still affect regression trees.
- k-NN, SVM (with RBF kernel): very sensitive. Distance-based algorithms get distorted by extreme feature values.
- Neural networks: moderately sensitive. Gradient-based optimization can be affected, but normalization helps.
A useful heuristic: if you're using a tree-based model and outliers are legitimate extreme values, you can often leave them alone. For linear models, you almost always need to handle them.
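To make the sensitivity of linear models concrete, here is a small sketch on synthetic data (invented for illustration, not taken from a real dataset). Ten points on the line y = 2x + 1 are fitted by ordinary least squares with `np.polyfit`, before and after a single target value is corrupted.

```python
import numpy as np

# Synthetic, noise-free data on the line y = 2x + 1
x = np.arange(10, dtype=float)
y_clean = 2.0 * x + 1.0

# Corrupt only the last observation by adding 100 to its target
y_outlier = y_clean.copy()
y_outlier[-1] += 100.0

# Degree-1 least-squares fits; polyfit returns (slope, intercept)
slope_clean, intercept_clean = np.polyfit(x, y_clean, deg=1)
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, deg=1)

print(round(float(slope_clean), 2), round(float(slope_outlier), 2))  # roughly 2.0 and 7.45
```

One corrupted point at the edge of the x range more than triples the fitted slope, which is why linear models usually need outliers handled before fitting.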
Context Is Everything
The same outlier can be signal or noise depending on context. A sudden spike in website traffic during a marketing campaign is expected — capping it would remove the signal. The same spike on a random Tuesday might be a bot attack — trimming makes sense.
The safest approach is to flag outliers rather than automatically removing them. Understand what generated each extreme value before deciding. If you must automate, capping is more conservative than trimming — it preserves sample size while limiting influence.