Handling Outliers

Jun 1, 2026 • 4 min read • By Mohammed Vasim

Machine Learning, AI, Data Science

Handling Outliers: Detection and Treatment with IQR

A single extreme value can quietly distort your entire analysis. The mean of a column shifts, variance inflates, and models that minimize squared error (linear regression) or rely on distances (SVMs, k-means) get pulled toward the outlier.

Not all outliers are errors. A $50,000 purchase in a B2C retail dataset might be a data entry mistake. In a B2B dataset, it's a legitimate transaction. The challenge is detecting extreme values systematically and deciding what they represent.

The Five-Number Summary

Before handling outliers, you need a way to detect them. The five-number summary provides the foundation: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.

python

import numpy as np
import seaborn as sns

marks = [45, 32, 56, 75, 89, 54, 32, 89, 90, 87,
         67, 54, 45, 98, 99, 67, 74]

minimum, Q1, median, Q3, maximum = np.quantile(
    marks, [0, 0.25, 0.50, 0.75, 1.0]
)

The quartiles describe the spread of the central 50% of the data, but the summary alone doesn't tell you what counts as an outlier.

The IQR Method

The Interquartile Range (Q3 − Q1) captures the middle 50% of your data. The standard rule defines outliers as points below Q1 − 1.5×IQR or above Q3 + 1.5×IQR.

python

IQR = Q3 - Q1
lower_fence = Q1 - 1.5 * IQR
higher_fence = Q3 + 1.5 * IQR

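The 1.5 multiplier is a convention that works well for roughly normal distributions, where the fences land about 2.7 standard deviations from the mean and flag roughly 0.7% of points. For skewed or heavy-tailed distributions, you might widen it to 3.0 to avoid flagging natural skewness as outliers.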

python

sns.boxplot(x=marks)  # pass the data by keyword so the orientation is explicit

A clean set with no outliers. Now add some extreme values:

python

marks_with_outliers = [-100, -200, 45, 32, 56, 75, 89, 54, 32, 89,
                       90, 87, 67, 54, 45, 98, 99, 67, 74, 150, 170, 180]

sns.boxplot(x=marks_with_outliers)  # pass the data by keyword so the orientation is explicit

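The box plot now shows points far outside the whiskers: -200 and -100 on the low end, 170 and 180 on the high end. The value 150 sits just inside the upper fence (153.5 for this data), so the upper whisker extends to it rather than marking it as an outlier.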
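To check the plot against the numbers, here is a minimal sketch that applies the fence rule to marks_with_outliers. The helper name iqr_outliers is illustrative, not from any library, and the quartiles use NumPy's default linear interpolation, which other tools may compute slightly differently.

```python
import numpy as np

def iqr_outliers(values, multiplier=1.5):
    """Return the points outside the IQR fences, plus the fences themselves.

    Hypothetical helper for illustration; not part of NumPy or seaborn.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - multiplier * iqr
    higher_fence = q3 + multiplier * iqr
    # A point is flagged only if it falls strictly outside a fence
    mask = (values < lower_fence) | (values > higher_fence)
    return values[mask], (lower_fence, higher_fence)

marks_with_outliers = [-100, -200, 45, 32, 56, 75, 89, 54, 32, 89,
                       90, 87, 67, 54, 45, 98, 99, 67, 74, 150, 170, 180]

flagged, fences = iqr_outliers(marks_with_outliers)
wide_flagged, wide_fences = iqr_outliers(marks_with_outliers, multiplier=3.0)
print(flagged, fences)
print(wide_flagged, wide_fences)
```

With the default multiplier this flags -200, -100, 170, and 180, with fences at -16.5 and 153.5. Widening the multiplier to 3.0 leaves only -200 and -100.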

Detection Methods Compared

The IQR method isn't the only option. Each has tradeoffs:

Method	Best For	Limitation
IQR (Tukey's fences)	Quick univariate detection	Assumes symmetric distribution
Z-score	Normally distributed data	Multiple outliers inflate the mean and standard deviation and hide each other (masking)
Modified Z-score	Data with some skewness	Requires robust scale estimate (MAD)
DBSCAN clustering	Multivariate outliers	Sensitive to epsilon parameter
Isolation Forest	High-dimensional data	Less interpretable

The Z-score flags points more than 3 standard deviations from the mean:

python

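from scipy import stats

# Absolute z-score of each point (stats.zscore uses the population std, ddof=0)
z_scores = np.abs(stats.zscore(marks_with_outliers))
# np.where returns a tuple of index arrays; take the first one for 1-D data
outlier_indices = np.where(z_scores > 3)[0]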

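This works well when your data is roughly normal and outliers are few. On marks_with_outliers, only -200 scores above 3: the extreme values inflate the standard deviation itself (to about 78.8), which hides -100, 170, and 180. For non-normal distributions or several outliers at once, the modified Z-score using median absolute deviation is more robust.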
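A minimal sketch of the modified Z-score, assuming the common formulation 0.6745 × (x − median) / MAD with a cutoff of 3.5. The function name modified_z_scores is illustrative, and both the constant and the cutoff are conventions rather than fixed rules.

```python
import numpy as np

def modified_z_scores(values):
    """Modified Z-scores based on the median and the median absolute deviation.

    Hypothetical helper for illustration. The 0.6745 factor rescales MAD so
    the score is comparable to an ordinary Z-score on normal data.
    """
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # More than half the points equal the median; the score is undefined
        raise ValueError("MAD is zero; modified Z-score is undefined")
    return 0.6745 * (values - median) / mad

marks_with_outliers = np.array([-100, -200, 45, 32, 56, 75, 89, 54, 32, 89,
                                90, 87, 67, 54, 45, 98, 99, 67, 74, 150, 170, 180])

scores = modified_z_scores(marks_with_outliers)
threshold = 3.5  # assumed cutoff; tune for your data
flagged_mz = marks_with_outliers[np.abs(scores) > threshold]
print(flagged_mz)
```

On this data the median is 70.5 and the MAD is 22.5, so both -200 and -100 are flagged, while the high-end values stay below the 3.5 cutoff.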

What to Do With Detected Outliers

Once identified, you have three main options:

Trimming (Removal) — drop the outlier rows entirely. Best when outliers are clearly data entry errors or you have enough data to spare.

Capping (Winsorization) — replace outliers with the nearest fence value. This preserves the data point while limiting its influence.

python

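def cap_outliers(series, multiplier=1.5):
    # Expects a pandas Series; returns a new Series and leaves the input unchanged
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - multiplier * IQR
    upper = Q3 + multiplier * IQR
    # Values below the lower fence become lower; values above the upper fence become upper
    return series.clip(lower, upper)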

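Imputation: treat outliers like missing values and impute them using the median. This keeps the row but discards the extreme value altogether, so it changes the data more than capping does. It suits cases where the value is clearly wrong but the rest of the row is worth keeping.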
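Trimming and median imputation can be sketched in the same spirit as cap_outliers. The version below uses NumPy arrays rather than a pandas Series, the helper names are illustrative, and imputing with the median of the non-outlier points is one reasonable choice among several.

```python
import numpy as np

def iqr_fences(values, multiplier=1.5):
    """Lower and upper IQR fences (hypothetical helper for illustration)."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

def trim_outliers(values, multiplier=1.5):
    """Drop points outside the fences; the result is shorter than the input."""
    values = np.asarray(values, dtype=float)
    lower, upper = iqr_fences(values, multiplier)
    return values[(values >= lower) & (values <= upper)]

def impute_outliers(values, multiplier=1.5):
    """Replace points outside the fences with the median of the remaining points."""
    values = np.asarray(values, dtype=float)
    lower, upper = iqr_fences(values, multiplier)
    inlier = (values >= lower) & (values <= upper)
    return np.where(inlier, values, np.median(values[inlier]))

marks_with_outliers = [-100, -200, 45, 32, 56, 75, 89, 54, 32, 89,
                       90, 87, 67, 54, 45, 98, 99, 67, 74, 150, 170, 180]

trimmed = trim_outliers(marks_with_outliers)
imputed = impute_outliers(marks_with_outliers)
print(len(marks_with_outliers), len(trimmed), len(imputed))
```

Trimming shrinks the sample, while imputation keeps every row and moves the flagged values to the center of the distribution.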

How Different Models React to Outliers

The impact of outliers depends heavily on your model:

Linear regression, logistic regression — highly sensitive. Outliers in the input space pull the coefficient estimates.
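Tree-based models (random forest, XGBoost): naturally robust to extreme feature values. Splits depend only on the ordering of a feature's values, not their magnitude, so a single extreme value barely changes where a split lands. Extreme target values can still affect regression trees.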
k-NN, SVM (with RBF kernel) — very sensitive. Distance-based algorithms get distorted by extreme feature values.
Neural networks — moderately sensitive. Gradient-based optimization can be affected, but normalization helps.

A useful heuristic: if you're using a tree-based model and outliers are legitimate extreme values, you can often leave them alone. For linear models, you almost always need to handle them.
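To see the sensitivity of a linear fit concretely, the sketch below fits a least-squares line to ten synthetic points with and without one corrupted target value. The data is made up for illustration, and np.polyfit stands in for a full regression library.

```python
import numpy as np

# Synthetic data on an exact line: y = 2x + 1
x = np.arange(10.0)
y_clean = 2.0 * x + 1.0

# Corrupt a single target value at the edge of the input range
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0  # the clean value here would be 19.0

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line
slope_clean, intercept_clean = np.polyfit(x, y_clean, 1)
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, 1)

print(f"clean fit:        slope={slope_clean:.2f}, intercept={intercept_clean:.2f}")
print(f"with one outlier: slope={slope_outlier:.2f}, intercept={intercept_outlier:.2f}")
```

One bad point out of ten is enough to more than double the fitted slope here, because squared error penalizes large residuals heavily and the point sits at the edge of the input range, where it has the most leverage.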

Context Is Everything

The same outlier can be signal or noise depending on context. A sudden spike in website traffic during a marketing campaign is expected — capping it would remove the signal. The same spike on a random Tuesday might be a bot attack — trimming makes sense.

The safest approach is to flag outliers rather than automatically removing them. Understand what generated each extreme value before deciding. If you must automate, capping is more conservative than trimming — it preserves sample size while limiting influence.
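Flagging instead of removing can be as simple as adding a boolean column and reviewing the flagged rows by hand. The sketch below assumes pandas and uses made-up daily traffic numbers. The function name flag_outliers and the column name is_outlier are illustrative.

```python
import pandas as pd

def flag_outliers(df, column, multiplier=1.5):
    """Return a copy of df with a boolean is_outlier column; no rows are changed or dropped.

    Hypothetical helper for illustration.
    """
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    flagged = df.copy()
    # between() is inclusive on both ends, so points exactly on a fence are not flagged
    flagged["is_outlier"] = ~flagged[column].between(lower, upper)
    return flagged

# Made-up daily visit counts with one spike
traffic = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun", "Mon2"],
    "visits": [120, 135, 128, 140, 131, 900, 126, 133],
})

reviewed = flag_outliers(traffic, "visits")
print(reviewed[reviewed["is_outlier"]])
```

The spike stays in the data with its original value, so whoever reviews it can decide whether it was a campaign or a bot before anything is capped or trimmed.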
