Feature Engineering: Missing Values and Outliers
Raw data arrives messy. Two of the most common problems — missing values and outliers — can silently corrupt a model if left unaddressed. Missing values make algorithms fail or produce biased estimates; outliers distort means, standard deviations, and decision boundaries. This post covers how to detect both, understand why they occur, and choose a treatment that doesn't introduce new problems.
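Anchor dataset: a churn prediction problem with 8 customers. It has two missing ages, one extreme charge value, and a mix of numeric and categorical features. The string "None" in internet_service is a real category (no internet service), not a missing value, so isnull() does not count it: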
import pandas as pd
import numpy as np
data = {
"age": [25, None, 42, 38, 55, 29, None, 61],
"tenure_months": [3, 12, 24, 6, 36, 1, 48, 60],
"monthly_charge": [55, 70, 95, 65, 120, 45, 80, 350],
"contract_type": ["M2M","M2M","1yr","M2M","2yr","M2M","1yr","2yr"],
"internet_service":["DSL","Fiber","Fiber","DSL","Fiber","None","DSL","Fiber"],
"churned": [1, 1, 0, 1, 0, 1, 0, 0]
}
df = pd.DataFrame(data)
print(df.isnull().sum())
age 2
tenure_months 0
monthly_charge 0
contract_type 0
internet_service 0
churned 0
dtype: int64
Why Values Go Missing
Understanding the mechanism of missingness matters because it determines whether imputation is safe.
MCAR (Missing Completely At Random): the probability that a value is missing is the same for all rows, regardless of any observed or unobserved variable. Example: a random 10% of ages are lost to a database glitch. MCAR is the safest setting: mean or median imputation does not systematically bias the column's center, although it still shrinks its variance.
MAR (Missing At Random): missingness depends on other observed features, not on the missing value itself. Example: customers with short tenure skipped the age field because a newer signup form made it optional. Because tenure_months is observed, you can condition on it when modelling the missing ages. Conditional imputation (KNN, regression) is better than the unconditional mean.
MNAR (Missing Not At Random): missingness depends on the unobserved value itself. Example: high-income customers deliberately omit income. If you impute income with the population mean, your dataset will systematically underestimate income for the rich, a bias baked into every downstream model. No imputation fully fixes MNAR; you need to either model the missingness explicitly or include a missingness indicator as a feature.
We will treat the anchor's two missing ages (rows 1 and 6, with tenures of 12 and 48 months) as MAR, for example because both customers registered via mobile, where age was optional. With only two missing values this is an assumption, not something the data can confirm. Under that assumption, KNN imputation using tenure and charge as predictors is more appropriate than simple mean substitution.
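The missingness indicator mentioned for MNAR can be sketched in a few lines. This is a minimal illustration, assuming a median fill; the column names `age_missing` and `age_filled` are hypothetical and do not appear in the anchor dataset.

```python
import pandas as pd

age = pd.Series([25, None, 42, 38, 55, 29, None, 61], name="age")

features = pd.DataFrame({
    # 1 where the original value was absent, 0 otherwise
    "age_missing": age.isna().astype(int),
    # any fill strategy works here; the flag preserves the "was missing" signal
    "age_filled": age.fillna(age.median()),
})
print(features)
```

The model then sees both the filled value and the flag, so it can learn a separate effect for rows whose age was never recorded.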
Deletion Strategies
The bluntest fix: remove rows or columns with missing values.
# row deletion
df_dropped = df.dropna()
print(f"Rows after dropna(): {len(df_dropped)} / {len(df)} "
f"({(1 - len(df_dropped)/len(df))*100:.1f}% lost)")
# column deletion (use only when >50% missing)
threshold = 0.5
col_missing_rate = df.isnull().mean()
cols_to_drop = col_missing_rate[col_missing_rate > threshold].index
print(f"Columns to drop (>50% missing): {list(cols_to_drop)}")
Rows after dropna(): 6 / 8 (25.0% lost)
Columns to drop (>50% missing): []
Losing 25% of the dataset is significant. Row deletion is only appropriate when the missing data is MCAR (random loss doesn't bias the remaining rows) and the missing percentage is small (<5%). At 25%, you risk introducing selection bias.
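A rough way to see whether row deletion changes the sample is to compare the observed features of the rows that would be dropped with those that would be kept. The sketch below does that for the anchor; the `row_status` labels are hypothetical names chosen for illustration, and with only two dropped rows the comparison is a sanity check, not a test of MCAR.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 42, 38, 55, 29, None, 61],
    "tenure_months": [3, 12, 24, 6, 36, 1, 48, 60],
    "monthly_charge": [55, 70, 95, 65, 120, 45, 80, 350],
    "churned": [1, 1, 0, 1, 0, 1, 0, 0],
})

# Label each row by whether dropna() would remove it (age is the only gap).
row_status = (
    df["age"].isna().map({True: "dropped", False: "kept"}).rename("row_status")
)

# Compare the observed features of the two groups.
summary = df.groupby(row_status)[["tenure_months", "monthly_charge", "churned"]].mean()
print(summary.round(2))
```

Large differences between the two groups on observed features would suggest the missingness is not completely random, in which case deletion is riskier.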
Imputation Strategies
Mean/Median Imputation
Compute the statistic on non-null values and fill. Non-null ages: [25, 42, 38, 55, 29, 61]:
Mean: $(25 + 42 + 38 + 55 + 29 + 61)/6 = 250/6 \approx 41.67$
Sorted for median: [25, 29, 38, 42, 55, 61] → median $= (38 + 42)/2 = 40$
age_mean = df["age"].mean()
age_median = df["age"].median()
print(f"Mean age: {age_mean:.2f} Median age: {age_median:.2f}")
df_mean_imp = df.copy()
df_mean_imp["age"] = df["age"].fillna(age_mean)
print(df_mean_imp["age"].round(2).tolist())
Mean age: 41.67 Median age: 40.00
[25.0, 41.67, 42.0, 38.0, 55.0, 29.0, 41.67, 61.0]
Use mean for symmetric distributions; use median when the column has skew or outliers (the median is robust to extremes). Mean imputation shrinks variance: both missing rows get 41.67, reducing the standard deviation of the age column.
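To see why the median is the safer fill when a column contains an extreme value, this sketch compares how far each statistic moves when the 350 charge is included. It uses monthly_charge purely as an illustration; the anchor has no missing charges.

```python
from statistics import mean, median

charges = [55, 70, 95, 65, 120, 45, 80, 350]
without_extreme = [c for c in charges if c != 350]

# How much does one extreme value move each candidate fill value?
mean_shift = mean(charges) - mean(without_extreme)
median_shift = median(charges) - median(without_extreme)

print(f"mean shifts by {mean_shift:.2f}, median shifts by {median_shift:.2f}")
```

A single value moves the mean by roughly 34 units but the median by only 5, so a median fill is much less sensitive to that one row.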
Mode Imputation
For categorical features, fill with the most frequent category. The anchor has no missing categoricals, so the example below blanks out one value as a hypothetical:
# hypothetical: contract_type with a missing value
df_cat = df.copy()
df_cat.loc[2, "contract_type"] = None
mode_contract = df_cat["contract_type"].mode()[0]
df_cat["contract_type"] = df_cat["contract_type"].fillna(mode_contract)
print(f"Mode of contract_type: {mode_contract}")
print(df_cat["contract_type"].tolist())
Mode of contract_type: M2M
['M2M', 'M2M', 'M2M', 'M2M', '2yr', 'M2M', '1yr', '2yr']
Row 2 was a 1yr contract but got imputed as M2M — the most frequent category. Mode imputation can introduce systematic error for minority categories.
Forward/Backward Fill
For time-ordered data, propagate the last known value forward (ffill) or the next known value backward (bfill). If customer records are sorted by signup date, a missing age could be filled from the previous customer:
df_time = df.copy()
df_time["age_ffill"] = df["age"].ffill()
df_time["age_bfill"] = df["age"].bfill()
print(df_time[["age", "age_ffill", "age_bfill"]].to_string())
age age_ffill age_bfill
0 25.0 25.0 25.0
1 NaN 25.0 42.0
2 42.0 42.0 42.0
3 38.0 38.0 38.0
4 55.0 55.0 55.0
5 29.0 29.0 29.0
6 NaN 29.0 61.0
7 61.0 61.0 61.0
Row 1's missing age fills as 25 (ffill from row 0) or 42 (bfill from row 2). Only valid if temporal order implies a relationship — completely wrong for cross-sectional data where row order is arbitrary.
KNN Imputation
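Find the $k$ nearest rows (by other features), and impute the missing value as the mean of their values for that column.
For row 1 (age=None, tenure=12, charge=70), using tenure_months and monthly_charge as distance features (Euclidean, unscaled):
- Row 3 (tenure=6, charge=65): distance $\sqrt{(12-6)^2 + (70-65)^2} = \sqrt{61} \approx 7.81$
- Row 0 (tenure=3, charge=55): distance $\sqrt{(12-3)^2 + (70-55)^2} = \sqrt{306} \approx 17.49$
- Row 5 (tenure=1, charge=45): distance $\sqrt{(12-1)^2 + (70-45)^2} = \sqrt{746} \approx 27.31$
Three nearest non-missing neighbors: rows 3, 0, 5 with ages 38, 25, 29 → imputed age $= (38 + 25 + 29)/3 \approx 30.67$.
For row 6 (age=None, tenure=48, charge=80), three nearest neighbors are rows 2, 4, 3 with ages 42, 55, 38 → imputed age $= (42 + 55 + 38)/3 = 45.0$.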
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=3)
cols_for_knn = ["age", "tenure_months", "monthly_charge"]
df_knn = df[cols_for_knn].copy()
df_knn_imp = imputer.fit_transform(df_knn)
print("KNN-imputed ages:", df_knn_imp[:, 0].round(2).tolist())
KNN-imputed ages: [25.0, 30.67, 42.0, 38.0, 55.0, 29.0, 45.0, 61.0]
Mean imputation assigns 41.67 to both missing rows. KNN assigns 30.67 to row 1 (a younger neighborhood) and 45.0 to row 6 (an older neighborhood) — more plausible values.
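The hand calculation can be reproduced without scikit-learn. The following is a minimal sketch, assuming plain Euclidean distance on the unscaled tenure and charge columns; `knn_impute_age` is a hypothetical helper written for this post, not a library function.

```python
import math

# (age, tenure_months, monthly_charge) for the 8 anchor rows
rows = [
    (25, 3, 55), (None, 12, 70), (42, 24, 95), (38, 6, 65),
    (55, 36, 120), (29, 1, 45), (None, 48, 80), (61, 60, 350),
]

def knn_impute_age(target_idx, k=3):
    """Return (neighbor row indices, mean neighbor age) for one row."""
    _, t_tenure, t_charge = rows[target_idx]
    candidates = []
    for idx, (age, tenure, charge) in enumerate(rows):
        if age is None:  # rows without an age cannot donate one
            continue
        dist = math.dist((t_tenure, t_charge), (tenure, charge))
        candidates.append((dist, idx, age))
    nearest = sorted(candidates)[:k]  # k smallest distances
    neighbor_ids = [idx for _, idx, _ in nearest]
    imputed = sum(age for _, _, age in nearest) / k
    return neighbor_ids, imputed

for target in (1, 6):
    ids, value = knn_impute_age(target)
    print(f"row {target}: neighbors={ids} imputed_age={value:.2f}")
```

Because the features are unscaled, differences in monthly_charge weigh more heavily than differences in tenure_months; standardizing both columns first could select different neighbors.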
Post-Imputation Validation
fig_data = {
"original": [25, 42, 38, 55, 29, 61],
"mean_imp": [25, 41.67, 42, 38, 55, 29, 41.67, 61],
"knn_imp": [25, 30.67, 42, 38, 55, 29, 45.0, 61],
}
for method, ages in fig_data.items():
    arr = np.array(ages)
    print(f"{method:12s}: mean={arr.mean():.2f} std={arr.std():.2f} "
          f"min={arr.min()} max={arr.max()}")
original    : mean=41.67 std=12.93 min=25 max=61
mean_imp    : mean=41.67 std=11.20 min=25.0 max=61.0
knn_imp     : mean=40.71 std=11.87 min=25.0 max=61.0
Mean imputation preserves the mean exactly (by definition) but shrinks the standard deviation from 12.93 to 11.20 (NumPy's population standard deviation, ddof=0), artificially reducing variance. KNN imputation produces a slight mean shift (40.71 vs 41.67) but keeps more of the original spread (11.87).
Outlier Detection
monthly_charge = [55, 70, 95, 65, 120, 45, 80, 350]. The value 350 stands out.
IQR Method
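Sorted: [45, 55, 65, 70, 80, 95, 120, 350]
$Q_1$ = median of the lower half [45, 55, 65, 70] $= (55 + 65)/2 = 60$
$Q_3$ = median of the upper half [80, 95, 120, 350] $= (95 + 120)/2 = 107.5$
$\text{IQR} = Q_3 - Q_1 = 107.5 - 60 = 47.5$
Lower fence: $Q_1 - 1.5 \times \text{IQR} = 60 - 71.25 = -11.25$
Upper fence: $Q_3 + 1.5 \times \text{IQR} = 107.5 + 71.25 = 178.75$
Value 350 > 178.75 → outlier detected.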
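# interpolation="midpoint" reproduces the median-of-halves quartiles for this
# column; pandas' default ("linear") would give Q1=62.5 and Q3=101.25 instead
Q1 = df["monthly_charge"].quantile(0.25, interpolation="midpoint")
Q3 = df["monthly_charge"].quantile(0.75, interpolation="midpoint")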
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
print(f"Q1={Q1} Q3={Q3} IQR={IQR}")
print(f"Fences: [{lower:.2f}, {upper:.2f}]")
outliers = df[df["monthly_charge"] < lower][["monthly_charge"]]
outliers = pd.concat([outliers, df[df["monthly_charge"] > upper][["monthly_charge"]]])
print(f"Outliers:\n{outliers}")
Q1=60.0 Q3=107.5 IQR=47.5
Fences: [-11.25, 178.75]
Outliers:
monthly_charge
7 350
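Quartile conventions differ between tools, so it is worth checking that the flagged set does not depend on the convention. The sketch below compares three of them using only the standard library; the "inclusive" method should correspond to linear interpolation between order statistics (the pandas default), and `tukey_fences` is a hypothetical helper name.

```python
import statistics

charges = sorted([55, 70, 95, 65, 120, 45, 80, 350])

def tukey_fences(q1, q3):
    """Return the (lower, upper) 1.5 * IQR fences."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

half = len(charges) // 2  # assumes an even number of values
conventions = {
    # median of each half, as in a hand calculation
    "median of halves": (statistics.median(charges[:half]),
                         statistics.median(charges[half:])),
    # linear interpolation, endpoints treated as the 0th/100th percentiles
    "inclusive": tuple(statistics.quantiles(charges, n=4, method="inclusive")[::2]),
    # linear interpolation, sample treated as drawn from a wider population
    "exclusive": tuple(statistics.quantiles(charges, n=4, method="exclusive")[::2]),
}

flagged_by = {}
for name, (q1, q3) in conventions.items():
    lower, upper = tukey_fences(q1, q3)
    flagged_by[name] = [c for c in charges if c < lower or c > upper]
    print(f"{name:17s} Q1={q1} Q3={q3} fences=[{lower}, {upper}] "
          f"flagged={flagged_by[name]}")
```

The quartiles and fences move with the convention, but for this column every variant flags only 350. On data with values near a fence, the choice can matter.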
Z-Score Method
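Mean of the column including 350: $\mu = 880/8 = 110$. Standard deviation (population): $\sigma = \sqrt{69700/8} \approx 93.34$.
$z = (350 - 110)/93.34 \approx 2.57$, below the conventional threshold of 3.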
With the outlier inflating both mean and std, z-score is less sensitive than IQR here. A better approach: compute the modified z-score using median and MAD (median absolute deviation), which is robust to outliers.
from scipy import stats
z_scores = np.abs(stats.zscore(df["monthly_charge"]))
print(f"Z-scores: {z_scores.round(2).tolist()}")
print(f"Flagged (|z|>3): {df.loc[z_scores > 3, 'monthly_charge'].tolist()}")
# modified z-score (robust)
med = df["monthly_charge"].median()
mad = np.median(np.abs(df["monthly_charge"] - med))
mod_z = 0.6745 * (df["monthly_charge"] - med) / mad
print(f"\nModified z-scores: {mod_z.round(2).tolist()}")
print(f"Flagged (|mod z|>3.5): {df.loc[np.abs(mod_z)>3.5, 'monthly_charge'].tolist()}")
Z-scores: [0.59, 0.43, 0.16, 0.48, 0.11, 0.7, 0.32, 2.57]
Flagged (|z|>3): []

Modified z-scores: [-0.67, -0.17, 0.67, -0.34, 1.52, -1.01, 0.17, 9.27]
Flagged (|mod z|>3.5): [350]
The standard z-score misses the outlier (350 has z=2.57). The modified z-score, computed from a median of 75 and a MAD of 20, flags it at 9.27, well above the threshold of 3.5.
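The constant 0.6745 in the modified z-score is the 0.75 quantile of the standard normal distribution; it rescales MAD so that, for normally distributed data, the score is comparable to an ordinary z-score. A small standard-library sketch, with a guard for the case MAD = 0 (which occurs when more than half the values are identical); `modified_z_scores` is a hypothetical helper name.

```python
from statistics import NormalDist, median

def modified_z_scores(values):
    """Modified z-score: c * (x - median) / MAD, with c = Phi^-1(0.75)."""
    c = NormalDist().inv_cdf(0.75)  # approximately 0.6745
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return [c * (v - med) / mad for v in values]

charges = [55, 70, 95, 65, 120, 45, 80, 350]
scores = modified_z_scores(charges)
flagged = [c for c, s in zip(charges, scores) if abs(s) > 3.5]
print([round(s, 2) for s in scores], flagged)
```

Because the median and MAD are barely affected by the single extreme value, the score for 350 is not diluted the way the ordinary z-score is.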
Isolation Forest
Isolation Forest builds random binary trees and measures how quickly each sample can be isolated. Outliers require fewer splits to isolate than inliers, so scikit-learn's score_samples returns lower values (closer to −1) for them.
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.125, random_state=42)
preds = iso.fit_predict(df[["monthly_charge"]])
scores = iso.score_samples(df[["monthly_charge"]])
for i, (pred, score) in enumerate(zip(preds, scores)):
    label = "OUTLIER" if pred == -1 else "inlier"
    print(f"Row {i} charge={df['monthly_charge'][i]:5d} score={score:.4f} → {label}")
Row 0 charge= 55 score=-0.4103 → inlier
Row 1 charge= 70 score=-0.3928 → inlier
Row 2 charge= 95 score=-0.3942 → inlier
Row 3 charge= 65 score=-0.4048 → inlier
Row 4 charge= 120 score=-0.4106 → inlier
Row 5 charge= 45 score=-0.4173 → inlier
Row 6 charge= 80 score=-0.3864 → inlier
Row 7 charge= 350 score=-0.5412 → OUTLIER
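The isolation idea can be illustrated with a toy one-dimensional simulation. This is a sketch of the intuition only, not scikit-learn's implementation: it repeatedly picks a random split between the current minimum and maximum and counts how many splits pass before the target value is alone. The helper name `isolation_depth` and the trial count are arbitrary choices.

```python
import random

charges = [55, 70, 95, 65, 120, 45, 80, 350]

def isolation_depth(target, values, rng):
    """Count random splits needed until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1:
        split = rng.uniform(min(current), max(current))
        # keep only the side of the split that contains the target
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

rng = random.Random(0)
trials = 500
avg_depth = {
    c: sum(isolation_depth(c, charges, rng) for _ in range(trials)) / trials
    for c in charges
}
for charge, depth in sorted(avg_depth.items()):
    print(f"charge={charge:4d} average splits to isolate={depth:.2f}")
```

The value 350 sits far from the rest, so most random splits separate it immediately and its average depth is the smallest, which is the property Isolation Forest converts into an anomaly score.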
Outlier Treatment Strategies
1. Remove
Justified when 350 is a data entry error (e.g., a decimal point was missed, and the true charge should be $35.0).
df_removed = df[df["monthly_charge"] <= 178.75].copy()
print(f"Rows remaining: {len(df_removed)}")
print(f"Max charge: {df_removed['monthly_charge'].max()}")
Rows remaining: 7
Max charge: 120
Don't remove unless you have strong evidence the value is erroneous. Removing valid extremes introduces selection bias.
2. Cap / Winsorize
Replace values beyond the 5th–95th percentile range with those percentile values. Better than removal because the row is kept.
p5 = df["monthly_charge"].quantile(0.05)
p95 = df["monthly_charge"].quantile(0.95)
print(f"p5={p5:.2f} p95={p95:.2f}")
df_capped = df.copy()
df_capped["monthly_charge"] = df["monthly_charge"].clip(lower=p5, upper=p95)
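print(f"Capped monthly_charge: {df_capped['monthly_charge'].tolist()}")
p5=48.50 p95=269.50
Capped monthly_charge: [55.0, 70.0, 95.0, 65.0, 120.0, 48.5, 80.0, 269.5]
350 is replaced with 269.5 (the 95th percentile under pandas' default linear interpolation), and the lowest value, 45, is raised to 48.5 (the 5th percentile). Both rows remain; the extreme values are moderated.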
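With only eight values, the percentile depends heavily on how it is interpolated. The sketch below spells out linear interpolation between order statistics, which should mirror the default used by pandas' quantile; `linear_percentile` is a hypothetical helper written for illustration.

```python
def linear_percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)        # fractional index into the sorted data
    lo = int(pos)                       # index of the value just below
    hi = min(lo + 1, len(ordered) - 1)  # index of the value just above
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])

charges = [55, 70, 95, 65, 120, 45, 80, 350]
p5 = linear_percentile(charges, 0.05)
p95 = linear_percentile(charges, 0.95)

# Winsorize: clamp every value into [p5, p95]
capped = [min(max(c, p5), p95) for c in charges]
print(f"p5={p5:.2f} p95={p95:.2f}")
print(capped)
```

For the 95th percentile the fractional index is 0.95 × 7 = 6.65, which lands 65% of the way from 120 to 350. On a dataset this small the cap is therefore driven largely by the outlier itself.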
3. Log Transform
Compresses the right tail without removing data. Useful for right-skewed features where the distribution spans multiple orders of magnitude.
df_log = df.copy()
df_log["log_charge"] = np.log(df["monthly_charge"])
print(df_log[["monthly_charge", "log_charge"]].to_string())
orig_std = df["monthly_charge"].std()
log_std = df_log["log_charge"].std()
print(f"\nStd before log: {orig_std:.2f}")
print(f"Std after log: {log_std:.2f}")
monthly_charge log_charge
0 55 4.007
1 70 4.248
2 95 4.554
3 65 4.174
4 120 4.787
5 45 3.807
6 80 4.382
7 350 5.858
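Std before log: 99.79
Std after log: 0.64
After the log transform, 350 becomes 5.858, and its gap to the next-largest value shrinks from 230 (350 vs. 120) to about 1.07 log units (5.858 vs. 4.787). The standard deviation drops from 99.79 to 0.64, although the two figures are on different scales and are not directly comparable.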
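The plain logarithm is undefined at zero, which matters if a feature can legitimately be 0. One common workaround is log(1 + x). The sketch below uses hypothetical values that include a zero; the anchor's charges are all positive, so this is an assumption about other data, not about the anchor.

```python
import math

values = [0, 45, 120, 350]  # hypothetical charges, including a zero

# log(1 + v) is defined at 0 and behaves like log(v) for large v
log1p_values = [math.log1p(v) for v in values]

# expm1 inverts the transform, e.g. to report predictions in original units
recovered = [math.expm1(t) for t in log1p_values]

print([round(t, 3) for t in log1p_values])
```

Neither variant handles negative values; those need a shift or a different transform.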
4. Keep
If 350 is a genuine high-value customer (e.g., a business account with multiple lines), removing or capping it loses real signal. High-value customers might churn differently. To evaluate, compare model performance with and without the outlier row:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
df_full = df.dropna() # drop missing for simplicity here
df_excl = df_full[df_full["monthly_charge"] <= 178.75]
for label, subset in [("with outlier", df_full), ("without outlier", df_excl)]:
    X = StandardScaler().fit_transform(subset[["tenure_months","monthly_charge"]])
    y_c = subset["churned"]
    model = LogisticRegression(max_iter=500).fit(X, y_c)
    print(f"{label}: n={len(subset)} "
          f"accuracy={model.score(X, y_c):.2f} "
          f"coef_charge={model.coef_[0][1]:.3f}")
with outlier: n=6 accuracy=0.83 coef_charge=-0.428
without outlier: n=5 accuracy=0.80 coef_charge= 0.213
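In the run shown, the coefficient for monthly_charge flips sign: with the outlier, high charge is associated with non-churn (the $350 customer is row 7, churned=0), and without it the sign reverses. Treat this as illustrative rather than conclusive. Each model is fit on only five or six rows and scored on its own training data, so the coefficients are unstable, and even without row 7 the three churners still have the three lowest charges (45, 55, 65). The broader point stands: a single extreme row can change the model's understanding of a feature, so compare fits before removing it.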
Strategy Comparison
| Technique | Handles | Key Risk | Use When |
|---|---|---|---|
| Row deletion | Any missing values | Loses data, selection bias | MCAR, <5% missing |
| Column deletion | Columns >50% missing | Loses all signal in column | Column has negligible information |
| Mean imputation | Numeric NaN | Shrinks variance, biased for MNAR | MCAR, symmetric distribution |
| Median imputation | Numeric NaN | Same, but outlier-robust | Skewed distributions |
| KNN imputation | Numeric NaN | Slow on large datasets | MAR, features correlate with missing col |
| IQR method | Univariate outliers | Sensitive to skewed distributions | Exploratory detection |
| Modified z-score | Univariate outliers | Assumes unimodal distribution | Symmetric but not normal data |
| Isolation Forest | Multivariate outliers | Requires tuning contamination param | High-dimensional feature spaces |
| Winsorize | Extreme values | Distorts distribution tails | Extremes should be damped while keeping all rows |
| Log transform | Right-skewed distributions | Not applicable for zero/negative values | Feature spans multiple orders of magnitude |
Test Your Understanding
- MNAR data cannot be fixed by imputation alone. Suggest a feature engineering strategy that partially addresses MNAR missingness: one that lets the model use the fact that the value is missing as a signal.
- Mean imputation always produces a standard deviation lower than the original. Prove this algebraically for a dataset with one missing value. (Hint: show that the filled-in mean contributes zero to variance.)
- The modified z-score (using median and MAD) is more robust than the standard z-score. Why does dividing by MAD instead of standard deviation help when there's an outlier?
- For the anchor, winsorizing at the 95th percentile (with pandas' default linear interpolation) replaces 350 with 269.5. If you instead winsorized at the 90th percentile, what value would 350 be replaced with? Show the calculation.
- KNN imputation with k=3 gave row 1 (age=None) an imputed value of 30.67. If you increased k to include the next nearest neighbor (row 2, age=42), the imputed value would be higher. Is this necessarily more accurate? What does k control in this context?