Feature Engineering: Missing Values and Outliers

Jun 23, 2026 • 13 min read • By Mohammed Vasim

Machine Learning, AI, Data Science

Raw data arrives messy. Two of the most common problems — missing values and outliers — can silently corrupt a model if left unaddressed. Missing values make algorithms fail or produce biased estimates; outliers distort means, standard deviations, and decision boundaries. This post covers how to detect both, understand why they occur, and choose a treatment that doesn't introduce new problems.

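Anchor dataset: a churn prediction problem with 8 customers. It has two missing ages, one extreme charge value, and a mix of numeric and categorical features. The "None" entry in internet_service is a literal category string (no internet service), not a missing value, so isnull() does not count it: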

python

import pandas as pd
import numpy as np

data = {
    "age":             [25, None, 42, 38, 55, 29, None, 61],
    "tenure_months":   [3,  12,   24, 6,  36, 1,  48,   60],
    "monthly_charge":  [55, 70,   95, 65, 120, 45, 80,  350],
    "contract_type":   ["M2M","M2M","1yr","M2M","2yr","M2M","1yr","2yr"],
    "internet_service":["DSL","Fiber","Fiber","DSL","Fiber","None","DSL","Fiber"],
    "churned":         [1, 1, 0, 1, 0, 1, 0, 0]
}
df = pd.DataFrame(data)
print(df.isnull().sum())

age                  2
tenure_months        0
monthly_charge       0
contract_type        0
internet_service     0
churned              0
dtype: int64

Why Values Go Missing

Understanding the mechanism of missingness matters because it determines whether imputation is safe.

MCAR (Missing Completely At Random): the probability that a value is missing is the same for every row, regardless of any observed or unobserved variable. Example: a database glitch drops a random 10% of ages. MCAR is the safest setting: mean or median imputation introduces no systematic bias in the column's mean, although it still shrinks the variance.

MAR (Missing At Random): missingness depends on other observed features, not on the missing value itself. Example: age was an optional field on one signup channel, so customers who joined through that channel are more likely to lack an age, whatever their actual age is. Because the driver of missingness is observed (or tracked by observed columns such as tenure_months), conditioning on those columns lets you impute without systematic bias. Conditional imputation (KNN, regression) is better than an unconditional mean.

MNAR (Missing Not At Random): missingness depends on the unobserved value itself. Example: high-income customers deliberately omit income. If you impute income with the mean of the observed values, the dataset systematically underestimates income for exactly those customers, a bias that carries into every downstream model. No imputation fully fixes MNAR; you need to either model the missingness explicitly or include a missingness indicator as a feature.
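A minimal sketch of the missingness-indicator idea, using the anchor's age column purely for illustration. The column name age_missing is an assumption, not part of the original dataset. The indicator is created before imputation so the model can still see which rows were originally empty:

```python
import numpy as np
import pandas as pd

# Anchor ages; np.nan marks the two missing entries.
ages = pd.Series([25, np.nan, 42, 38, 55, 29, np.nan, 61], name="age")

features = pd.DataFrame({
    # 1 where the value was missing, 0 otherwise (hypothetical column name).
    "age_missing": ages.isna().astype(int),
    # Fill afterwards; the median is only a placeholder strategy here.
    "age": ages.fillna(ages.median()),
})
print(features)
```

scikit-learn's SimpleImputer offers the same idea through its add_indicator=True option, which appends indicator columns to the imputed output.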

The anchor's two missing ages (rows 1 and 6) are treated here as MAR. Eight rows are far too few to test the mechanism, but the working assumption is plausible: both customers (tenure 12 and 48 months) may have registered via mobile, where age was optional. Under that assumption, KNN imputation using tenure and charge as predictors is more appropriate than simple mean substitution.
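One rough way to probe a missingness assumption is to compare the observed columns for rows with and without an age. This is a sketch, not a formal test, and with eight rows it can only hint at a pattern. The name age_missing is illustrative:

```python
import numpy as np
import pandas as pd

df_check = pd.DataFrame({
    "age":            [25, np.nan, 42, 38, 55, 29, np.nan, 61],
    "tenure_months":  [3, 12, 24, 6, 36, 1, 48, 60],
    "monthly_charge": [55, 70, 95, 65, 120, 45, 80, 350],
})

# Group rows by whether age is missing and compare the other columns.
is_missing = df_check["age"].isna().rename("age_missing")
summary = df_check.groupby(is_missing)[["tenure_months", "monthly_charge"]].mean()
print(summary.round(2))
```

Large differences between the two groups would argue against MCAR. They cannot separate MAR from MNAR, because that distinction depends on the ages that were never observed.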

Deletion Strategies

The bluntest fix: remove rows or columns with missing values.

python

# row deletion
df_dropped = df.dropna()
print(f"Rows after dropna(): {len(df_dropped)} / {len(df)}  "
      f"({(1 - len(df_dropped)/len(df))*100:.1f}% lost)")

# column deletion (use only when >50% missing)
threshold = 0.5
col_missing_rate = df.isnull().mean()
cols_to_drop = col_missing_rate[col_missing_rate > threshold].index
print(f"Columns to drop (>50% missing): {list(cols_to_drop)}")

Rows after dropna(): 6 / 8  (25.0% lost)
Columns to drop (>50% missing): []

Losing 25% of the dataset is significant. Row deletion is only appropriate when the missing data is MCAR (random loss doesn't bias the remaining rows) and the missing percentage is small (<5%). At 25%, you risk introducing selection bias.

Imputation Strategies

Mean/Median Imputation

Compute the statistic on non-null values and fill. Non-null ages: [25, 42, 38, 55, 29, 61]:

$\text{mean} = \frac{25 + 42 + 38 + 55 + 29 + 61}{6} = \frac{250}{6} \approx 41.67$

Sorted for median: [25, 29, 38, 42, 55, 61] → median $= (38 + 42) /2 = 40.0$

python

age_mean = df["age"].mean()
age_median = df["age"].median()
print(f"Mean age: {age_mean:.2f}   Median age: {age_median:.2f}")

df_mean_imp = df.copy()
df_mean_imp["age"] = df["age"].fillna(age_mean)
# round for display only; the stored values keep full precision
print(df_mean_imp["age"].round(2).tolist())

Mean age: 41.67   Median age: 40.00
[25.0, 41.67, 42.0, 38.0, 55.0, 29.0, 41.67, 61.0]

Use mean for symmetric distributions; use median when the column has skew or outliers (the median is robust to extremes). Mean imputation shrinks variance: both missing rows get 41.67, reducing the standard deviation of the age column.
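For comparison, a sketch of the median variant using scikit-learn's SimpleImputer, which stores the fitted statistic so the same value can be reused on new data. The variable names are illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Anchor ages as a single-column 2D array, as scikit-learn expects.
ages = np.array([25, np.nan, 42, 38, 55, 29, np.nan, 61]).reshape(-1, 1)

median_imputer = SimpleImputer(strategy="median")
ages_filled = median_imputer.fit_transform(ages)

print("Learned median:", median_imputer.statistics_[0])
print("Filled ages:", ages_filled.ravel().tolist())
```

Both missing rows receive 40.0, the median computed by hand above.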

Mode Imputation

For categorical features, fill with the most frequent category. The anchor has no missing categoricals, but illustrating with a hypothetical:

python

# hypothetical: contract_type with a missing value
df_cat = df.copy()
df_cat.loc[2, "contract_type"] = None
mode_contract = df_cat["contract_type"].mode()[0]
df_cat["contract_type"] = df_cat["contract_type"].fillna(mode_contract)
print(f"Mode of contract_type: {mode_contract}")
print(df_cat["contract_type"].tolist())

Mode of contract_type: M2M
['M2M', 'M2M', 'M2M', 'M2M', '2yr', 'M2M', '1yr', '2yr']

Row 2 was a 1yr contract but got imputed as M2M — the most frequent category. Mode imputation can introduce systematic error for minority categories.

Forward/Backward Fill

For time-ordered data, propagate the last known value forward (ffill) or the next known value backward (bfill). If customer records are sorted by signup date, a missing age could be filled from the previous customer:

python

df_time = df.copy()
df_time["age_ffill"] = df["age"].ffill()
df_time["age_bfill"] = df["age"].bfill()
print(df_time[["age", "age_ffill", "age_bfill"]].to_string())

    age  age_ffill  age_bfill
0  25.0       25.0       25.0
1   NaN       25.0       42.0
2  42.0       42.0       42.0
3  38.0       38.0       38.0
4  55.0       55.0       55.0
5  29.0       29.0       29.0
6   NaN       29.0       61.0
7  61.0       61.0       61.0

Row 1's missing age fills as 25 (ffill from row 0) or 42 (bfill from row 2). Only valid if temporal order implies a relationship — completely wrong for cross-sectional data where row order is arbitrary.
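Where forward fill does make sense is within one entity's own history. The sketch below uses a small hypothetical table (customer_id, month, and the charge values are invented for illustration) and fills each customer's gaps only from that customer's earlier records:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly records for two customers; np.nan marks a missing charge.
ts = pd.DataFrame({
    "customer_id":    ["A", "A", "A", "B", "B", "B"],
    "month":          [1, 2, 3, 1, 2, 3],
    "monthly_charge": [55.0, np.nan, 60.0, np.nan, 80.0, np.nan],
})

# Sort by time within each customer, then forward-fill inside each group
# so values never leak from one customer to the next.
ts = ts.sort_values(["customer_id", "month"])
ts["charge_ffill"] = ts.groupby("customer_id")["monthly_charge"].ffill()
print(ts)
```

Customer B's first month stays missing because there is no earlier record for B to copy from; a plain ffill() on the whole column would have filled it with customer A's last charge.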

KNN Imputation

Find the $k$ nearest rows (by other features), and impute the missing value as the mean of their values for that column.

For row 1 (age=None), using tenure_months and monthly_charge as distance features:

Row 3 (tenure=6, charge=65): distance $= \sqrt{(12 - 6)^{2} + (70 - 65)^{2}} = \sqrt{61} \approx 7.81$
Row 0 (tenure=3, charge=55): distance $= \sqrt{(12 - 3)^{2} + (70 - 55)^{2}} = \sqrt{306} \approx 17.49$
Row 5 (tenure=1, charge=45): distance $= \sqrt{(12 - 1)^{2} + (70 - 45)^{2}} = \sqrt{746} \approx 27.31$

Three nearest non-missing neighbors: rows 3, 0, 5 with ages 38, 25, 29 → imputed age $= (38 + 25 + 29) /3 = 30.67$ .

For row 6 (age=None, tenure=48, charge=80), three nearest neighbors are rows 2, 4, 3 with ages 42, 55, 38 → imputed age $= (42 + 55 + 38) /3 = 45.0$ .

python

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=3)
cols_for_knn = ["age", "tenure_months", "monthly_charge"]
df_knn = df[cols_for_knn].copy()
df_knn_imp = imputer.fit_transform(df_knn)
print("KNN-imputed ages:", df_knn_imp[:, 0].round(2).tolist())

KNN-imputed ages: [25.0, 30.67, 42.0, 38.0, 55.0, 29.0, 45.0, 61.0]

Mean imputation assigns 41.67 to both missing rows. KNN assigns 30.67 to row 1 (a younger neighborhood) and 45.0 to row 6 (an older neighborhood) — more plausible values.
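The manual neighbour search can be reproduced in a few lines of NumPy. This is a sketch that mirrors the hand calculation (plain Euclidean distance on unscaled tenure and charge), not scikit-learn's internal implementation; the helper name knn_impute_age is hypothetical:

```python
import numpy as np

tenure = np.array([3, 12, 24, 6, 36, 1, 48, 60], dtype=float)
charge = np.array([55, 70, 95, 65, 120, 45, 80, 350], dtype=float)
age = np.array([25, np.nan, 42, 38, 55, 29, np.nan, 61])

def knn_impute_age(target, k=3):
    """Mean age of the k nearest rows (by tenure and charge) that have an age."""
    donors = np.flatnonzero(~np.isnan(age))  # rows with an observed age
    dist = np.sqrt((tenure[donors] - tenure[target]) ** 2
                   + (charge[donors] - charge[target]) ** 2)
    nearest = donors[np.argsort(dist)[:k]]   # k closest donor rows
    return nearest.tolist(), float(age[nearest].mean())

for row in (1, 6):
    neighbours, value = knn_impute_age(row)
    print(f"Row {row}: neighbours={neighbours}  imputed age={value:.2f}")
```

Because the features are unscaled, monthly_charge (range 45 to 350) contributes far more to the distance than tenure_months (range 1 to 60). Standardizing the predictors first is a common choice when that imbalance is not intended.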

Post-Imputation Validation

python

fig_data = {
    "original": [25, 42, 38, 55, 29, 61],
    "mean_imp":  [25, 41.67, 42, 38, 55, 29, 41.67, 61],
    "knn_imp":   [25, 30.67, 42, 38, 55, 29, 45.0, 61],
}
for method, ages in fig_data.items():
    arr = np.array(ages)
    print(f"{method:12s}: mean={arr.mean():.2f}  std={arr.std():.2f}  "
          f"min={arr.min()}  max={arr.max()}")

original    : mean=41.67  std=12.93  min=25  max=61
mean_imp    : mean=41.67  std=11.20  min=25.0  max=61.0
knn_imp     : mean=40.71  std=11.87  min=25.0  max=61.0

Mean imputation preserves the mean (up to the rounding of 41.67) but shrinks the population standard deviation from 12.93 to 11.20, artificially reducing variance. KNN imputation shifts the mean slightly (40.71 vs 41.67) but keeps the standard deviation closer to the original (11.87).
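The shrinkage under mean imputation has a simple mechanical cause, sketched below: rows filled with the mean deviate from it by zero, so the sum of squared deviations stays the same while the row count grows. The variable names are illustrative:

```python
import numpy as np

observed = np.array([25, 42, 38, 55, 29, 61], dtype=float)
mean_age = observed.mean()

# Two imputed rows sit exactly on the mean.
filled = np.append(observed, [mean_age, mean_age])

ss_observed = ((observed - observed.mean()) ** 2).sum()
ss_filled = ((filled - filled.mean()) ** 2).sum()

print(f"Sum of squares: observed={ss_observed:.2f}  filled={ss_filled:.2f}")
print(f"Population std: observed={observed.std():.2f}  filled={filled.std():.2f}")
```

With the numerator unchanged and the denominator going from 6 to 8, the standard deviation scales by a factor of sqrt(6/8).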

Outlier Detection

monthly_charge = [55, 70, 95, 65, 120, 45, 80, 350]. The value 350 stands out.

IQR Method

Sorted: [45, 55, 65, 70, 80, 95, 120, 350]

$Q_{1} = \frac{55 + 65}{2} = 60.0, \quad Q_{3} = \frac{95 + 120}{2} = 107.5, \quad \text{IQR} = Q_{3} - Q_{1} = 47.5$

$\text{Lower fence} = 60.0 - 1.5 (47.5) = -11.25, \quad \text{Upper fence} = 107.5 + 1.5 (47.5) = 178.75$

Value 350 > 178.75 → outlier detected.

python

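# interpolation="midpoint" averages the two neighbouring sorted values, matching the
# hand calculation; pandas' default ("linear") gives Q1=62.5 and Q3=101.25 here
Q1 = df["monthly_charge"].quantile(0.25, interpolation="midpoint")
Q3 = df["monthly_charge"].quantile(0.75, interpolation="midpoint")
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

print(f"Q1={Q1}  Q3={Q3}  IQR={IQR}")
print(f"Fences: [{lower:.2f}, {upper:.2f}]")
# flag anything outside the fences
mask = (df["monthly_charge"] < lower) | (df["monthly_charge"] > upper)
outliers = df.loc[mask, ["monthly_charge"]]
print(f"Outliers:\n{outliers}")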

Q1=60.0  Q3=107.5  IQR=47.5
Fences: [-11.25, 178.75]
Outliers:
   monthly_charge
7             350
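Quartile conventions differ between tools, and on eight rows the difference is visible. The hand calculation averages the two neighbouring sorted values, which corresponds to pandas' interpolation="midpoint"; the pandas default is "linear". A short sketch comparing the two on the anchor column:

```python
import pandas as pd

charges = pd.Series([55, 70, 95, 65, 120, 45, 80, 350])

results = {}
for method in ("linear", "midpoint"):
    q1 = charges.quantile(0.25, interpolation=method)
    q3 = charges.quantile(0.75, interpolation=method)
    upper = q3 + 1.5 * (q3 - q1)          # upper Tukey fence
    results[method] = (q1, q3, upper)
    flagged = charges[charges > upper].tolist()
    print(f"{method:9s} Q1={q1}  Q3={q3}  upper fence={upper}  flagged={flagged}")
```

Both conventions flag 350 here, but the fence moves (159.375 under "linear" versus 178.75 under "midpoint"), so a borderline value could be flagged under one convention and not the other.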

Z-Score Method

Mean of the column including 350: $μ = 880/8 = 110$ . Standard deviation: $σ = 93.34$ .

$z_{350} = (350 - 110) /93.34 = 2.57$ — below the conventional threshold of 3.

With the outlier inflating both mean and std, z-score is less sensitive than IQR here. A better approach: compute the modified z-score using median and MAD (median absolute deviation), which is robust to outliers.

python

from scipy import stats
z_scores = np.abs(stats.zscore(df["monthly_charge"]))
print(f"Z-scores: {z_scores.round(2).tolist()}")
print(f"Flagged (|z|>3): {df.loc[z_scores > 3, 'monthly_charge'].tolist()}")

# modified z-score (robust)
med = df["monthly_charge"].median()
mad = np.median(np.abs(df["monthly_charge"] - med))
mod_z = 0.6745 * (df["monthly_charge"] - med) / mad
print(f"\nModified z-scores: {mod_z.round(2).tolist()}")
print(f"Flagged (|mod z|>3.5): {df.loc[np.abs(mod_z)>3.5, 'monthly_charge'].tolist()}")

Z-scores: [0.59, 0.43, 0.16, 0.48, 0.11, 0.7, 0.32, 2.57]
Flagged (|z|>3): []

Modified z-scores: [-0.67, -0.17, 0.67, -0.34, 1.52, -1.01, 0.17, 9.27]
Flagged (|mod z|>3.5): [350]

The standard z-score misses the outlier (350 has z=2.57). The modified z-score, computed from a median of 75 and a MAD of 20, flags it at 9.27, well above the threshold of 3.5.
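Worked by hand for the anchor column, using the same definitions as the code (median, MAD, and the 0.6745 scaling constant), the modified z-score of 350 is:

```latex
\tilde{x} = \operatorname{median}(55, 70, 95, 65, 120, 45, 80, 350) = \frac{70 + 80}{2} = 75

\text{MAD} = \operatorname{median}\left(|x_i - \tilde{x}|\right)
           = \operatorname{median}(5, 5, 10, 20, 20, 30, 45, 275) = 20

M_{350} = \frac{0.6745\,(350 - 75)}{\text{MAD}} = \frac{0.6745 \times 275}{20} \approx 9.27
```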

Isolation Forest

Isolation Forest builds random binary trees and measures how quickly each sample can be isolated. Outliers require fewer splits to isolate than inliers, so their score_samples value is lower (closer to −1).

python

from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.125, random_state=42)
preds = iso.fit_predict(df[["monthly_charge"]])
scores = iso.score_samples(df[["monthly_charge"]])

for i, (pred, score) in enumerate(zip(preds, scores)):
    label = "OUTLIER" if pred == -1 else "inlier"
    print(f"Row {i}  charge={df['monthly_charge'][i]:5d}  score={score:.4f}  → {label}")

Row 0  charge=   55  score=-0.4103  → inlier
Row 1  charge=   70  score=-0.3928  → inlier
Row 2  charge=   95  score=-0.3942  → inlier
Row 3  charge=   65  score=-0.4048  → inlier
Row 4  charge=  120  score=-0.4106  → inlier
Row 5  charge=   45  score=-0.4173  → inlier
Row 6  charge=   80  score=-0.3864  → inlier
Row 7  charge=  350  score=-0.5412  → OUTLIER
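Isolation Forest is usually applied to several features at once. As a sketch of that use, the same model can be fit on tenure_months and monthly_charge together; the settings are carried over from the example above, and the names iso_2d and report are illustrative:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

X = pd.DataFrame({
    "tenure_months":  [3, 12, 24, 6, 36, 1, 48, 60],
    "monthly_charge": [55, 70, 95, 65, 120, 45, 80, 350],
})

# contamination=0.125 asks for roughly 1 of the 8 rows to be flagged.
iso_2d = IsolationForest(contamination=0.125, random_state=42)
labels = iso_2d.fit_predict(X)      # -1 = outlier, 1 = inlier
scores = iso_2d.score_samples(X)    # lower = more anomalous

report = X.assign(score=scores.round(3), outlier=(labels == -1))
print(report.sort_values("score"))
```

Sorting by score puts the most easily isolated rows first, which is often more informative than the binary label because the cut-off is set entirely by the contamination parameter.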

Outlier Treatment Strategies

1. Remove

Justified when 350 is a data entry error (e.g., a decimal point was missed, and the true charge should be $35.0).

python

df_removed = df[df["monthly_charge"] <= 178.75].copy()
print(f"Rows remaining: {len(df_removed)}")
print(f"Max charge: {df_removed['monthly_charge'].max()}")

Rows remaining: 7
Max charge: 120

Don't remove unless you have strong evidence the value is erroneous. Removing valid extremes introduces selection bias.

2. Cap / Winsorize

Replace values beyond the 5th–95th percentile range with those percentile values. Better than removal because the row is kept.

python

p5 = df["monthly_charge"].quantile(0.05)
p95 = df["monthly_charge"].quantile(0.95)
print(f"p5={p5:.2f}  p95={p95:.2f}")

df_capped = df.copy()
df_capped["monthly_charge"] = df["monthly_charge"].clip(lower=p5, upper=p95)
print(f"Capped monthly_charge: {df_capped['monthly_charge'].tolist()}")

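p5=48.50  p95=269.50
Capped monthly_charge: [55.0, 70.0, 95.0, 65.0, 120.0, 48.5, 80.0, 269.5]

350 is replaced with 269.5 (the 95th percentile under pandas' default linear interpolation), and at the low end 45 is raised to 48.5 (the 5th percentile). Every row remains; the extreme values are moderated. With only eight rows, each percentile is an interpolation between the two most extreme values on its side, so the upper cap is itself still pulled upward by 350.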
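An alternative cap, sketched here as an assumption rather than the method used above, reuses the IQR fences from the detection step instead of percentiles. The quartiles use the midpoint convention so they match the hand-calculated fences of -11.25 and 178.75:

```python
import pandas as pd

charges = pd.Series([55, 70, 95, 65, 120, 45, 80, 350], name="monthly_charge")

# Quartiles with the midpoint convention used in the hand calculation.
q1 = charges.quantile(0.25, interpolation="midpoint")
q3 = charges.quantile(0.75, interpolation="midpoint")
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are pulled back to the nearest fence.
capped = charges.clip(lower=lower_fence, upper=upper_fence)
print(capped.tolist())
```

Only 350 changes (to 178.75); the low end is untouched because no charge falls below the lower fence.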

3. Log Transform

Compresses the right tail without removing data. Useful for right-skewed features where the distribution spans multiple orders of magnitude.

python

df_log = df.copy()
df_log["log_charge"] = np.log(df["monthly_charge"])
print(df_log[["monthly_charge", "log_charge"]].to_string())

orig_std = df["monthly_charge"].std()
log_std = df_log["log_charge"].std()
print(f"\nStd before log: {orig_std:.2f}")
print(f"Std after log:  {log_std:.2f}")

   monthly_charge  log_charge
0              55       4.007
1              70       4.248
2              95       4.554
3              65       4.174
4             120       4.787
5              45       3.807
6              80       4.382
7             350       5.858

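Std before log: 99.79
Std after log:  0.64

After the log transform, 350 becomes 5.858, about 1.07 log units above the next-largest value (4.787 for 120) rather than 230 dollars above it. The standard deviation drops from 99.79 to 0.64, but those two numbers are in different units (dollars versus log-dollars), so the drop mainly reflects the change of scale. The useful effect is the compression of the right tail.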
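np.log is undefined at zero and for negative values. A common workaround for non-negative columns that may contain zeros is log1p, sketched below on hypothetical charges (the zero entry is invented for illustration and does not come from the anchor):

```python
import numpy as np

# Hypothetical charges including a zero (e.g. a free-tier account).
charges = np.array([0, 45, 55, 120, 350], dtype=float)

log_charges = np.log1p(charges)   # log(1 + x); defined at x = 0
restored = np.expm1(log_charges)  # inverse transform: exp(y) - 1

print(log_charges.round(3))
print(restored.round(2))
```

Keeping the inverse (expm1) at hand matters when predictions or summaries need to be reported back in the original units.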

4. Keep

If 350 is a genuine high-value customer (e.g., a business account with multiple lines), removing or capping it loses real signal. High-value customers might churn differently. To evaluate, compare model performance with and without the outlier row:

python

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

df_full = df.dropna()  # drop missing for simplicity here
df_excl = df_full[df_full["monthly_charge"] <= 178.75]

for label, subset in [("with outlier", df_full), ("without outlier", df_excl)]:
    X = StandardScaler().fit_transform(subset[["tenure_months","monthly_charge"]])
    y_c = subset["churned"]
    model = LogisticRegression(max_iter=500).fit(X, y_c)
    print(f"{label}:  n={len(subset)}  "
          f"accuracy={model.score(X, y_c):.2f}  "
          f"coef_charge={model.coef_[0][1]:.3f}")

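With five or six rows, scored on the same rows used for fitting, the printed accuracy says little; read the output for the sign and size of coef_charge. In the anchor data the three churners with a known age (rows 0, 3, 5) pay 65 or less and the non-churners pay 95 or more, so higher charge goes with non-churn whether or not row 7 is included, and one would expect coef_charge to be negative in both fits. What row 7 changes is the scaling: StandardScaler is refit on each subset, and with 350 present the column's standard deviation is inflated, which squeezes the other charges together and shifts the coefficient's magnitude. If dropping an extreme row flips a coefficient's sign on a real dataset, that row is carrying information the model depends on, and removing it changes the model's understanding of the feature.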

Strategy Comparison

Technique	Handles	Key Risk	Use When
Row deletion	Any missing values	Loses data, selection bias	MCAR, <5% missing
Column deletion	Columns >50% missing	Loses all signal in column	Column has negligible information
Mean imputation	Numeric NaN	Shrinks variance, biased for MNAR	MCAR, symmetric distribution
Median imputation	Numeric NaN	Same, but outlier-robust	Skewed distributions
KNN imputation	Numeric NaN	Slow on large datasets	MAR, features correlate with missing col
IQR method	Univariate outliers	Sensitive to skewed distributions	Exploratory detection
Modified z-score	Univariate outliers	Assumes unimodal distribution	Symmetric but not normal data
Isolation Forest	Multivariate outliers	Requires tuning contamination param	High-dimensional feature spaces
Winsorize	Extreme values	Distorts distribution tails	Need to limit the influence of extremes while keeping all rows
Log transform	Right-skewed distributions	Not applicable for zero/negative values	Feature spans multiple orders of magnitude

Test Your Understanding

MNAR data cannot be fixed by imputation alone. Suggest a feature engineering strategy that partially addresses MNAR missingness — one that lets the model use the fact that the value is missing as a signal.
Mean imputation always produces a standard deviation lower than the original. Prove this algebraically for a dataset with one missing value. (Hint: show that the filled-in mean contributes zero to variance.)
The modified z-score (using median and MAD) is more robust than the standard z-score. Why does dividing by MAD instead of standard deviation help when there's an outlier?
For the anchor, winsorizing at the 95th percentile (pandas' default linear interpolation) replaces 350 with 269.5. If you instead winsorized at the 90th percentile, what value would 350 be replaced with? Show the calculation.
KNN imputation with k=3 gave row 1 (age=None) an imputed value of 30.67. If you increased k to include the next nearest neighbor (row 2, age=42), the imputed value would be higher. Is this necessarily more accurate? What does k control in this context?
