Handling Outliers

Jun 1, 2026 · 4 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

Handling Outliers: Detection and Treatment with IQR

A single extreme value can quietly distort your entire analysis. The mean of a column shifts, variance inflates, and models that minimize squared error or rely on distances (linear regression, SVMs, k-means) get pulled toward the outlier.

Not all outliers are errors. A $50,000 purchase in a B2C retail dataset might be a data entry mistake. In a B2B dataset, it's a legitimate transaction. The challenge is detecting extreme values systematically and deciding what they represent.

The Five-Number Summary

Before handling outliers, you need a way to detect them. The five-number summary provides the foundation: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.

python
import numpy as np
import seaborn as sns

marks = [45, 32, 56, 75, 89, 54, 32, 89, 90, 87,
         67, 54, 45, 98, 99, 67, 74]

minimum, Q1, median, Q3, maximum = np.quantile(
    marks, [0, 0.25, 0.50, 0.75, 1.0]
)

Q1 and Q3 bound the central 50% of the data (for these marks the summary is 32, 54, 67, 89, 99), but the summary alone doesn't tell you what counts as an outlier.

The IQR Method

The Interquartile Range (Q3 − Q1) captures the middle 50% of your data. The standard rule defines outliers as points below Q1 − 1.5×IQR or above Q3 + 1.5×IQR.

python
IQR = Q3 - Q1
lower_fence = Q1 - 1.5 * IQR
higher_fence = Q3 + 1.5 * IQR

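The 1.5 multiplier is a convention that works well for roughly normal distributions: on normal data the fences sit about 2.7 standard deviations from the mean and flag roughly 0.7% of points. For skewed distributions, you might raise it to 3.0 to avoid flagging natural skewness as outliers.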

python
sns.boxplot(marks)

The box plot shows a clean set with no outliers: every mark lies between the fences (1.5 and 141.5). Now add some extreme values:

python
marks_with_outliers = [-100, -200, 45, 32, 56, 75, 89, 54, 32, 89,
                       90, 87, 67, 54, 45, 98, 99, 67, 74, 150, 170, 180]

sns.boxplot(marks_with_outliers)

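The box plot now shows points far outside the whiskers: -200 and -100 on the low end, 170 and 180 on the high end. The value 150 falls just inside the upper fence (153.5), so it marks the end of the upper whisker rather than appearing as an outlier.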
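To see exactly which points fall outside the fences, and how the multiplier changes the answer, here is a minimal sketch. The helper name `iqr_outliers` is hypothetical (it is not part of any library), and it assumes NumPy's default linear interpolation for quantiles.

```python
import numpy as np

def iqr_outliers(values, multiplier=1.5):
    """Return (lower_fence, upper_fence, outliers) for Tukey-style fences.

    Hypothetical helper for illustration; quantiles use NumPy's default
    linear interpolation.
    """
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(arr, [0.25, 0.75])
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    # Keep only the points strictly outside the fences
    mask = (arr < lower) | (arr > upper)
    return lower, upper, arr[mask]

marks_with_outliers = [-100, -200, 45, 32, 56, 75, 89, 54, 32, 89,
                       90, 87, 67, 54, 45, 98, 99, 67, 74, 150, 170, 180]

# Compare the standard multiplier with the wider one suggested for skewed data
for k in (1.5, 3.0):
    lower, upper, flagged = iqr_outliers(marks_with_outliers, multiplier=k)
    print(f"multiplier={k}: fences=({lower}, {upper}), "
          f"flagged={sorted(flagged.tolist())}")
```

With the multiplier at 1.5 the upper fence for this sample is 153.5, so 150 sits just inside it. Raising the multiplier to 3.0 widens the fences enough that only -200 and -100 are still flagged.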

Detection Methods Compared

The IQR method isn't the only option. Each has tradeoffs:

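| Method | Best For | Limitation |
|---|---|---|
| IQR (Tukey's fences) | Quick univariate detection | Symmetric fences over-flag the long tail of skewed data |
| Z-score | Normally distributed data | Mean and standard deviation are inflated by the outliers themselves (masking) |
| Modified Z-score | Data with some skewness | Needs a robust scale estimate (MAD), which is 0 when more than half the values are identical |
| DBSCAN clustering | Multivariate outliers | Sensitive to epsilon parameter |
| Isolation Forest | High-dimensional data | Less interpretable |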

The Z-score flags points more than 3 standard deviations from the mean:

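python
from scipy import stats

# Absolute Z-score of every point (population standard deviation, ddof=0)
z_scores = np.abs(stats.zscore(marks_with_outliers))
# Positions of the points more than 3 standard deviations from the mean
outliers = np.where(z_scores > 3)[0]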

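This works well when your data is roughly normal and outliers are few. The mean and standard deviation are themselves inflated by extreme values, so large outliers can hide smaller ones: on the sample above, only -200 has a Z-score above 3. For non-normal or heavily contaminated data, the modified Z-score, built on the median and the median absolute deviation (MAD), is more robust.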
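The modified Z-score can be sketched in a few lines of NumPy. The constant 0.6745 and the 3.5 cutoff follow the commonly cited Iglewicz and Hoaglin convention; both are assumptions here, not values taken from this article, and `modified_z_scores` is a hypothetical helper name.

```python
import numpy as np

def modified_z_scores(values):
    """Modified Z-score: 0.6745 * (x - median) / MAD (hypothetical helper)."""
    arr = np.asarray(values, dtype=float)
    median = np.median(arr)
    mad = np.median(np.abs(arr - median))
    if mad == 0:
        # Happens when more than half the values are identical
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return 0.6745 * (arr - median) / mad

data = np.array([-100, -200, 45, 32, 56, 75, 89, 54, 32, 89,
                 90, 87, 67, 54, 45, 98, 99, 67, 74, 150, 170, 180],
                dtype=float)

# Plain Z-score with the population standard deviation (ddof=0)
z = (data - data.mean()) / data.std()
z_flagged = data[np.abs(z) > 3]

# Modified Z-score with the assumed 3.5 cutoff
mz_flagged = data[np.abs(modified_z_scores(data)) > 3.5]

print("Z-score flags:         ", sorted(z_flagged.tolist()))
print("Modified Z-score flags:", sorted(mz_flagged.tolist()))
```

On this sample the plain Z-score flags only -200, while the modified score also catches -100, because the median and MAD are barely moved by the extreme values.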

What to Do With Detected Outliers

Once identified, you have three main options:

Trimming (Removal): drop the outlier rows entirely. Best when outliers are clearly data entry errors, or when the dataset is large enough that losing a few rows costs little.
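A minimal sketch of trimming with pandas, assuming the data lives in a DataFrame and the IQR fences define what counts as an outlier. The function name `trim_outliers` and the column name `marks` are hypothetical.

```python
import pandas as pd

def trim_outliers(df, column, multiplier=1.5):
    """Return only the rows whose `column` value lies within the IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    # between() is inclusive on both ends and False for NaN,
    # so rows with a missing value in `column` are dropped as well
    return df[df[column].between(lower, upper)]

df = pd.DataFrame({"marks": [-100, -200, 45, 32, 56, 75, 89, 54, 32, 89,
                             90, 87, 67, 54, 45, 98, 99, 67, 74, 150, 170, 180]})

trimmed = trim_outliers(df, "marks")
print(len(df), "rows before,", len(trimmed), "rows after")
```

The original DataFrame is left untouched, which makes it easy to compare results with and without the trimmed rows.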

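Capping (Winsorization): replace values beyond the fences with the nearest fence value. This preserves the data point while limiting its influence. Classic winsorization caps at fixed percentiles; capping at the IQR fences is a common variant.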

python
def cap_outliers(series, multiplier=1.5):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - multiplier * IQR
    upper = Q3 + multiplier * IQR
    return series.clip(lower, upper)

Imputation: treat outliers like missing values and impute them using the median. This keeps every row but discards the extreme value entirely, so reserve it for values you believe are wrong rather than merely unusual.
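Median imputation can be sketched along the same lines as `cap_outliers`. This version makes one assumption worth calling out: the fill value is the median of the non-outlying values, so the extremes do not influence it. The function name `impute_outliers_with_median` is hypothetical.

```python
import pandas as pd

def impute_outliers_with_median(series, multiplier=1.5):
    """Replace values outside the IQR fences with the median of the rest."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    is_outlier = (series < lower) | (series > upper)
    # Assumption: compute the median from the values that were NOT flagged
    fill_value = series[~is_outlier].median()
    # mask() replaces entries where the condition is True
    return series.mask(is_outlier, fill_value)

marks = pd.Series([-100, -200, 45, 32, 56, 75, 89, 54, 32, 89,
                   90, 87, 67, 54, 45, 98, 99, 67, 74, 150, 170, 180],
                  dtype=float)

imputed = impute_outliers_with_median(marks)
print(imputed.describe())
```

Unlike capping, the replaced values land in the middle of the distribution, which shrinks the variance more aggressively.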

How Different Models React to Outliers

The impact of outliers depends heavily on your model:

  • Linear regression, logistic regression: highly sensitive. Extreme points in the input space have high leverage and pull the coefficient estimates toward themselves.
  • Tree-based models (random forest, XGBoost): naturally robust to extreme feature values. Splits are thresholds that depend only on the ordering of a feature, so a single extreme value changes little. Extreme target values can still affect regression trees.
  • k-NN, SVM (with RBF kernel): very sensitive. Distance-based algorithms get distorted by extreme feature values.
  • Neural networks: moderately sensitive. Gradient-based optimization can be affected, but normalization helps.

A useful heuristic: if you're using a tree-based model and outliers are legitimate extreme values, you can often leave them alone. For linear models, you almost always need to handle them.
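The sensitivity of linear models is easy to demonstrate. The sketch below uses synthetic data (an exact line, not data from this article) and corrupts a single target value at the edge of the input range, where leverage is highest, then refits with `np.polyfit`.

```python
import numpy as np

# Synthetic data on an exact line: slope 2, intercept 1
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Corrupt one target value at x = 9 (the true value there is 19)
y_out = y.copy()
y_out[-1] = 100.0

# Ordinary least-squares fit of a degree-1 polynomial
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)
slope_out, intercept_out = np.polyfit(x, y_out, deg=1)

print(f"clean fit:        slope={slope_clean:.3f}, intercept={intercept_clean:.3f}")
print(f"with one outlier: slope={slope_out:.3f}, intercept={intercept_out:.3f}")
```

One bad point out of ten moves the fitted slope from 2 to roughly 6.4, which is why linear models usually need outliers handled before fitting.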

Context Is Everything

The same outlier can be signal or noise depending on context. A sudden spike in website traffic during a marketing campaign is expected — capping it would remove the signal. The same spike on a random Tuesday might be a bot attack — trimming makes sense.

The safest approach is to flag outliers rather than automatically removing them. Understand what generated each extreme value before deciding. If you must automate, capping is more conservative than trimming — it preserves sample size while limiting influence.
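Flagging instead of removing can be as simple as adding a boolean column and reviewing the flagged rows before deciding. A minimal sketch, assuming IQR fences as the detection rule; `flag_outliers`, `is_outlier`, and the `marks` column are hypothetical names.

```python
import pandas as pd

def flag_outliers(df, column, multiplier=1.5):
    """Return a copy of `df` with a boolean `is_outlier` column added."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    out = df.copy()
    # Missing values compare as False on both sides, so they are not flagged
    out["is_outlier"] = (out[column] < lower) | (out[column] > upper)
    return out

df = pd.DataFrame({"marks": [-100, -200, 45, 32, 56, 75, 89, 54, 32, 89,
                             90, 87, 67, 54, 45, 98, 99, 67, 74, 150, 170, 180]})

flagged = flag_outliers(df, "marks")
# Inspect the flagged rows before choosing to trim, cap, or keep them
print(flagged[flagged["is_outlier"]])
```

Keeping the flag as a column also lets you fit a model with and without those rows and compare the results.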
