Feature Engineering: Missing Values and Outliers

Jun 23, 2026 · 13 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

Raw data arrives messy. Two of the most common problems — missing values and outliers — can silently corrupt a model if left unaddressed. Missing values make algorithms fail or produce biased estimates; outliers distort means, standard deviations, and decision boundaries. This post covers how to detect both, understand why they occur, and choose a treatment that doesn't introduce new problems.

Anchor dataset: a churn prediction problem with 8 customers. It has two missing ages, one extreme charge value, and a mix of numeric and categorical features:

python
import pandas as pd
import numpy as np

data = {
    "age":             [25, None, 42, 38, 55, 29, None, 61],
    "tenure_months":   [3,  12,   24, 6,  36, 1,  48,   60],
    "monthly_charge":  [55, 70,   95, 65, 120, 45, 80,  350],
    "contract_type":   ["M2M","M2M","1yr","M2M","2yr","M2M","1yr","2yr"],
    "internet_service":["DSL","Fiber","Fiber","DSL","Fiber","None","DSL","Fiber"],
    "churned":         [1, 1, 0, 1, 0, 1, 0, 0]
}
df = pd.DataFrame(data)
print(df.isnull().sum())
age                 2
tenure_months       0
monthly_charge      0
contract_type       0
internet_service    0
churned             0
dtype: int64

Why Values Go Missing

Understanding the mechanism of missingness matters because it determines whether imputation is safe.

MCAR (Missing Completely At Random): the probability of being missing is the same for every row, regardless of any observed or unobserved variable. Example: a random 10% of ages are lost to a database glitch. MCAR is the safest setting: mean or median imputation leaves the column's mean or median unbiased, although it still shrinks the variance.

MAR (Missing At Random): missingness depends on other observed features, not on the missing value itself. Example: customers who signed up through a newer form, where age was optional, skip the field more often, and that group is identifiable from an observed column such as tenure_months. Because the driver of missingness is observed, you can condition on it. Conditional imputation (KNN, regression) is better than an unconditional mean.

MNAR (Missing Not At Random): missingness depends on the unobserved value itself. Example: high-income customers deliberately omit income. If you impute income with the population mean, you systematically underestimate income for exactly those customers, a bias baked into every downstream model. No imputation fully fixes MNAR; you need to either model the missingness explicitly or include a missingness indicator as a feature.
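A missingness indicator can be sketched in a couple of lines. This is a minimal illustration on the anchor's age column; the column names `age_missing` and `age_filled` are made up for the example, and the median fill is only a placeholder choice.

```python
import pandas as pd

# Same ages as the anchor dataset; None marks a missing value.
df = pd.DataFrame({"age": [25, None, 42, 38, 55, 29, None, 61]})

# Record the missingness BEFORE imputing, so the signal is not erased.
df["age_missing"] = df["age"].isna().astype(int)

# Any imputation can follow; the median is used here only as a placeholder.
df["age_filled"] = df["age"].fillna(df["age"].median())

print(df)
```

The model then sees both the filled value and a flag saying the value was never observed, which lets it learn a separate effect for the "did not answer" group.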

The anchor's two missing ages (rows 1 and 6) are most likely MAR — both are middle-tenure customers (12 and 48 months) who may have registered via mobile where age was optional. This means KNN imputation using tenure and charge as predictors is more appropriate than simple mean substitution.
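One rough way to probe whether missingness tracks observed features is to compare those features between rows with and without an age. The sketch below does this for the anchor; with only two missing rows it is an illustration of the check, not evidence for either mechanism.

```python
import pandas as pd

df = pd.DataFrame({
    "age":            [25, None, 42, 38, 55, 29, None, 61],
    "tenure_months":  [3, 12, 24, 6, 36, 1, 48, 60],
    "monthly_charge": [55, 70, 95, 65, 120, 45, 80, 350],
})

# Label each row by whether age is missing, then compare observed features.
group = df["age"].isna().map({True: "missing", False: "observed"})
summary = df.groupby(group)[["tenure_months", "monthly_charge"]].mean()

print(summary.round(2))
```

Large, stable differences between the two groups on a real dataset would point toward MAR (or MNAR); similar profiles are consistent with MCAR but do not prove it.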

Deletion Strategies

The bluntest fix: remove rows or columns with missing values.

python
# row deletion
df_dropped = df.dropna()
print(f"Rows after dropna(): {len(df_dropped)} / {len(df)}  "
      f"({(1 - len(df_dropped)/len(df))*100:.1f}% lost)")

# column deletion (use only when >50% missing)
threshold = 0.5
col_missing_rate = df.isnull().mean()
cols_to_drop = col_missing_rate[col_missing_rate > threshold].index
print(f"Columns to drop (>50% missing): {list(cols_to_drop)}")
Rows after dropna(): 6 / 8  (25.0% lost)
Columns to drop (>50% missing): []

Losing 25% of the dataset is significant. Row deletion is only appropriate when the missing data is MCAR (random loss doesn't bias the remaining rows) and the missing percentage is small (<5%). At 25%, you risk introducing selection bias.
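When deletion is still the chosen route, pandas lets you narrow what counts as "missing enough to drop". A small sketch using the `subset` and `thresh` parameters of `dropna` on three of the anchor columns:

```python
import pandas as pd

df = pd.DataFrame({
    "age":            [25, None, 42, 38, 55, 29, None, 61],
    "tenure_months":  [3, 12, 24, 6, 36, 1, 48, 60],
    "monthly_charge": [55, 70, 95, 65, 120, 45, 80, 350],
})

# Drop a row only if one of the listed columns is missing.
df_subset = df.dropna(subset=["age"])

# Keep rows that have at least 2 non-null values across the 3 columns.
df_thresh = df.dropna(thresh=2)

print(len(df_subset), len(df_thresh))
```

Here `subset=["age"]` still loses the two rows without an age, while `thresh=2` keeps all eight because every row has at least two observed values.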

[Figure: bar chart of row counts, Original (n=8) vs. After dropna (n=6); 2 rows lost (25%).]

Imputation Strategies

Mean/Median Imputation

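Compute the statistic on non-null values and fill. Non-null ages: [25, 42, 38, 55, 29, 61]:

$$\text{mean} = \frac{25 + 42 + 38 + 55 + 29 + 61}{6} = \frac{250}{6} \approx 41.67$$

Sorted for median: [25, 29, 38, 42, 55, 61] → median $= \frac{38 + 42}{2} = 40$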

python
age_mean = df["age"].mean()
age_median = df["age"].median()
print(f"Mean age: {age_mean:.2f}   Median age: {age_median:.2f}")

df_mean_imp = df.copy()
df_mean_imp["age"] = df["age"].fillna(age_mean)
print(df_mean_imp["age"].round(2).tolist())
Mean age: 41.67   Median age: 40.00
[25.0, 41.67, 42.0, 38.0, 55.0, 29.0, 41.67, 61.0]

Use mean for symmetric distributions; use median when the column has skew or outliers (the median is robust to extremes). Mean imputation shrinks variance: both missing rows get 41.67, reducing the standard deviation of the age column.
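In a modeling workflow the fill value is usually learned on training rows only and then reused on held-out rows. The sketch below shows that pattern with scikit-learn's `SimpleImputer`; the split (first six anchor rows as training data, last two as test data) is a hypothetical chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Anchor ages as a single-column array; np.nan marks missing values.
ages = np.array([25, np.nan, 42, 38, 55, 29, np.nan, 61]).reshape(-1, 1)

# Hypothetical split: first six rows are training data, last two are test data.
train, test = ages[:6], ages[6:]

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # median learned from train only
test_filled = imputer.transform(test)        # same median reused on test

print("Learned median:", imputer.statistics_[0])
print("Test ages after imputation:", test_filled.ravel().tolist())
```

The training median here is 38 (from 25, 29, 38, 42, 55), so the missing test age is filled with 38 rather than with a statistic that has already seen the test rows.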

Mode Imputation

For categorical features, fill with the most frequent category. The anchor has no missing categoricals, so the example below uses a hypothetical:

python
# hypothetical: contract_type with a missing value
df_cat = df.copy()
df_cat.loc[2, "contract_type"] = None
mode_contract = df_cat["contract_type"].mode()[0]
df_cat["contract_type"] = df_cat["contract_type"].fillna(mode_contract)
print(f"Mode of contract_type: {mode_contract}")
print(df_cat["contract_type"].tolist())
Mode of contract_type: M2M
['M2M', 'M2M', 'M2M', 'M2M', '2yr', 'M2M', '1yr', '2yr']

Row 2 was a 1yr contract but got imputed as M2M — the most frequent category. Mode imputation can introduce systematic error for minority categories.

Forward/Backward Fill

For time-ordered data, propagate the last known value forward (ffill) or the next known value backward (bfill). If customer records are sorted by signup date, a missing age could be filled from the previous customer:

python
df_time = df.copy()
df_time["age_ffill"] = df["age"].ffill()
df_time["age_bfill"] = df["age"].bfill()
print(df_time[["age", "age_ffill", "age_bfill"]].to_string())
    age  age_ffill  age_bfill
0  25.0       25.0       25.0
1   NaN       25.0       42.0
2  42.0       42.0       42.0
3  38.0       38.0       38.0
4  55.0       55.0       55.0
5  29.0       29.0       29.0
6   NaN       29.0       61.0
7  61.0       61.0       61.0

Row 1's missing age fills as 25 (ffill from row 0) or 42 (bfill from row 2). Only valid if temporal order implies a relationship — completely wrong for cross-sectional data where row order is arbitrary.
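Where forward fill does make sense is repeated measurements of the same entity over time. The sketch below uses a small made-up billing history (the `usage` frame, its customers "A" and "B", and its values are all hypothetical) and fills within each customer so values never leak from one customer to the next.

```python
import pandas as pd

# Hypothetical monthly billing history for two customers.
usage = pd.DataFrame({
    "customer_id":    ["A", "A", "A", "B", "B", "B"],
    "month":          [1, 2, 3, 1, 2, 3],
    "monthly_charge": [55.0, None, 60.0, None, 70.0, None],
})

# Sort by time within each customer, then forward fill per customer.
usage = usage.sort_values(["customer_id", "month"])
usage["charge_ffill"] = usage.groupby("customer_id")["monthly_charge"].ffill()

print(usage)
```

Customer B's first month stays missing because there is no earlier value for that customer to carry forward, which is the behavior you want: a plain `ffill()` would have copied customer A's last charge into it.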

KNN Imputation

Find the k nearest rows (by other features), and impute the missing value as the mean of their values for that column.

For row 1 (age=None, tenure=12, charge=70), using tenure_months and monthly_charge as distance features:

  • Row 3 (tenure=6, charge=65): distance $\sqrt{(12-6)^2 + (70-65)^2} = \sqrt{61} \approx 7.81$
  • Row 0 (tenure=3, charge=55): distance $\sqrt{(12-3)^2 + (70-55)^2} = \sqrt{306} \approx 17.49$
  • Row 5 (tenure=1, charge=45): distance $\sqrt{(12-1)^2 + (70-45)^2} = \sqrt{746} \approx 27.31$

Three nearest non-missing neighbors: rows 3, 0, 5 with ages 38, 25, 29 → imputed age $= (38 + 25 + 29)/3 \approx 30.67$.

For row 6 (age=None, tenure=48, charge=80), three nearest neighbors are rows 2, 4, 3 with ages 42, 55, 38 → imputed age $= (42 + 55 + 38)/3 = 45.0$.
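The hand calculation can be checked with a few lines of NumPy. This is a sketch of plain Euclidean KNN on unscaled features, not scikit-learn's implementation; the helper name `knn_age` is made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":            [25, None, 42, 38, 55, 29, None, 61],
    "tenure_months":  [3, 12, 24, 6, 36, 1, 48, 60],
    "monthly_charge": [55, 70, 95, 65, 120, 45, 80, 350],
})
features = ["tenure_months", "monthly_charge"]

def knn_age(row_idx, k=3):
    """Return (neighbor row indices, mean neighbor age) for one row."""
    donors = df[df["age"].notna()]  # only rows with a known age can donate
    target = df.loc[row_idx, features].to_numpy(dtype=float)
    feats = donors[features].to_numpy(dtype=float)
    dist = np.sqrt(((feats - target) ** 2).sum(axis=1))  # Euclidean distance
    nearest = donors.assign(dist=dist).nsmallest(k, "dist")
    return nearest.index.tolist(), round(float(nearest["age"].mean()), 2)

print(knn_age(1))  # neighbors and imputed age for row 1
print(knn_age(6))  # neighbors and imputed age for row 6
```

Because the features are not scaled, monthly_charge (range 45 to 350) dominates tenure_months in the distance. On real data it is common to standardize features before computing neighbors.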

python
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=3)
cols_for_knn = ["age", "tenure_months", "monthly_charge"]
df_knn = df[cols_for_knn].copy()
df_knn_imp = imputer.fit_transform(df_knn)
print("KNN-imputed ages:", df_knn_imp[:, 0].round(2).tolist())
KNN-imputed ages: [25.0, 30.67, 42.0, 38.0, 55.0, 29.0, 45.0, 61.0]

[Figure: age per row under three treatments (Original, Mean Imputation, KNN Imputation). Rows 1 and 6 are NaN in the original, 41.67 under mean imputation, and 30.67 and 45.0 under KNN imputation.]

Mean imputation assigns 41.67 to both missing rows. KNN assigns 30.67 to row 1 (a younger neighborhood) and 45.0 to row 6 (an older neighborhood) — more plausible values.

Post-Imputation Validation

python
fig_data = {
    "original": [25, 42, 38, 55, 29, 61],
    "mean_imp":  [25, 41.67, 42, 38, 55, 29, 41.67, 61],
    "knn_imp":   [25, 30.67, 42, 38, 55, 29, 45.0, 61],
}
for method, ages in fig_data.items():
    arr = np.array(ages)
    print(f"{method:12s}: mean={arr.mean():.2f}  std={arr.std():.2f}  "
          f"min={arr.min()}  max={arr.max()}")
original    : mean=41.67  std=12.93  min=25  max=61
mean_imp    : mean=41.67  std=11.20  min=25.0  max=61.0
knn_imp     : mean=40.71  std=11.87  min=25.0  max=61.0

Mean imputation preserves the mean exactly (by definition) but shrinks the standard deviation from 12.93 to 11.20, artificially reducing variance. KNN imputation produces a slight mean shift (40.71 vs 41.67) but keeps more of the original spread (std 11.87). Note that np.std computes the population standard deviation (ddof=0).

Outlier Detection

monthly_charge = [55, 70, 95, 65, 120, 45, 80, 350]. The value 350 stands out.

IQR Method

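Sorted: [45, 55, 65, 70, 80, 95, 120, 350]

Taking each quartile as the midpoint of its two neighboring values:

$$Q_1 = \frac{55 + 65}{2} = 60, \quad Q_3 = \frac{95 + 120}{2} = 107.5, \quad \text{IQR} = Q_3 - Q_1 = 47.5$$

$$\text{lower fence} = Q_1 - 1.5 \times \text{IQR} = -11.25, \quad \text{upper fence} = Q_3 + 1.5 \times \text{IQR} = 178.75$$

Value 350 > 178.75 → outlier detected.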

python
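# interpolation="midpoint" reproduces the hand calculation (Q1=60, Q3=107.5).
# pandas' default linear interpolation gives Q1=62.5 and Q3=101.25 instead;
# 350 is flagged under either convention.
Q1 = df["monthly_charge"].quantile(0.25, interpolation="midpoint")
Q3 = df["monthly_charge"].quantile(0.75, interpolation="midpoint")
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

print(f"Q1={Q1}  Q3={Q3}  IQR={IQR}")
print(f"Fences: [{lower:.2f}, {upper:.2f}]")
# keep rows that fall outside either fence
is_outlier = (df["monthly_charge"] < lower) | (df["monthly_charge"] > upper)
outliers = df.loc[is_outlier, ["monthly_charge"]]
print(f"Outliers:\n{outliers}")
Q1=60.0  Q3=107.5  IQR=47.5
Fences: [-11.25, 178.75]
Outliers:
   monthly_charge
7             350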

Z-Score Method

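Mean of the column including 350: $\mu = 880 / 8 = 110$. Standard deviation (population): $\sigma = \sqrt{69700 / 8} \approx 93.34$.

$z_{350} = (350 - 110) / 93.34 \approx 2.57$, below the conventional threshold of 3.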

With the outlier inflating both mean and std, z-score is less sensitive than IQR here. A better approach: compute the modified z-score using median and MAD (median absolute deviation), which is robust to outliers.

python
from scipy import stats
z_scores = np.abs(stats.zscore(df["monthly_charge"]))
print(f"Z-scores: {z_scores.round(2).tolist()}")
print(f"Flagged (|z|>3): {df.loc[z_scores > 3, 'monthly_charge'].tolist()}")

# modified z-score (robust)
med = df["monthly_charge"].median()
mad = np.median(np.abs(df["monthly_charge"] - med))
mod_z = 0.6745 * (df["monthly_charge"] - med) / mad
print(f"\nModified z-scores: {mod_z.round(2).tolist()}")
print(f"Flagged (|mod z|>3.5): {df.loc[np.abs(mod_z)>3.5, 'monthly_charge'].tolist()}")
Z-scores: [0.59, 0.43, 0.16, 0.48, 0.11, 0.7, 0.32, 2.57]
Flagged (|z|>3): []

Modified z-scores: [-0.67, -0.17, 0.67, -0.34, 1.52, -1.01, 0.17, 9.27]
Flagged (|mod z|>3.5): [350]

The standard z-score misses the outlier (350 has z=2.57). With median 75 and MAD 20, the modified z-score for 350 is 0.6745 × (350 − 75) / 20 ≈ 9.27, well above the threshold of 3.5.

Isolation Forest

Isolation Forest builds random binary trees and measures how quickly each sample can be isolated. Outliers need fewer splits to isolate than inliers, so scikit-learn's score_samples returns a lower value for them (closer to −1), and fit_predict labels them −1.

python
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.125, random_state=42)
preds = iso.fit_predict(df[["monthly_charge"]])
scores = iso.score_samples(df[["monthly_charge"]])

for i, (pred, score) in enumerate(zip(preds, scores)):
    label = "OUTLIER" if pred == -1 else "inlier"
    print(f"Row {i}  charge={df['monthly_charge'][i]:5d}  score={score:.4f}  → {label}")
Row 0  charge=   55  score=-0.4103  → inlier
Row 1  charge=   70  score=-0.3928  → inlier
Row 2  charge=   95  score=-0.3942  → inlier
Row 3  charge=   65  score=-0.4048  → inlier
Row 4  charge=  120  score=-0.4106  → inlier
Row 5  charge=   45  score=-0.4173  → inlier
Row 6  charge=   80  score=-0.3864  → inlier
Row 7  charge=  350  score=-0.5412  → OUTLIER

[Figure: box plot of monthly_charge ($), axis 0 to 400, showing Q1=60, median=75, Q3=107.5, upper fence 178.75, and 350 marked as an outlier.]
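Isolation Forest is more commonly applied to several features at once, which is where it differs from the univariate rules above. The sketch below fits it on tenure_months and monthly_charge together; the feature pair is a choice made for illustration, and no output is shown because the scores depend on the library version and random seed.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.DataFrame({
    "tenure_months":  [3, 12, 24, 6, 36, 1, 48, 60],
    "monthly_charge": [55, 70, 95, 65, 120, 45, 80, 350],
})

X = df[["tenure_months", "monthly_charge"]]

# contamination=0.125 asks for roughly 1 in 8 rows to be flagged.
iso = IsolationForest(contamination=0.125, random_state=42)
preds = iso.fit_predict(X)       # -1 = outlier, 1 = inlier
scores = iso.score_samples(X)    # lower = more anomalous

# Rank rows from most to least anomalous for inspection.
ranked = df.assign(score=scores, flagged=(preds == -1)).sort_values("score")
print(ranked)
```

Sorting by score rather than relying only on the −1/1 label is useful in practice, since the label depends entirely on the contamination value you picked.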

Outlier Treatment Strategies

1. Remove

Justified when 350 is a data entry error (e.g., a decimal point was missed, and the true charge should be $35.0).

python
df_removed = df[df["monthly_charge"] <= 178.75].copy()
print(f"Rows remaining: {len(df_removed)}")
print(f"Max charge: {df_removed['monthly_charge'].max()}")
Rows remaining: 7
Max charge: 120

Don't remove unless you have strong evidence the value is erroneous. Removing valid extremes introduces selection bias.

2. Cap / Winsorize

Replace values beyond the 5th–95th percentile range with those percentile values. Better than removal because the row is kept.

python
p5 = df["monthly_charge"].quantile(0.05)
p95 = df["monthly_charge"].quantile(0.95)
print(f"p5={p5:.2f}  p95={p95:.2f}")

df_capped = df.copy()
df_capped["monthly_charge"] = df["monthly_charge"].clip(lower=p5, upper=p95)
print(f"Capped monthly_charge: {df_capped['monthly_charge'].round(2).tolist()}")
p5=48.50  p95=269.50
Capped monthly_charge: [55.0, 70.0, 95.0, 65.0, 120.0, 48.5, 80.0, 269.5]

350 is replaced with 269.5 (the 95th percentile), and 45 is raised to 48.5 (the 5th percentile). Both rows remain; the extreme values are moderated.
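Percentile caps are one option; another is to cap at the IQR fences from the detection step, so that only values already flagged as outliers are changed. A minimal sketch, assuming the same midpoint quartile convention used in the IQR hand calculation:

```python
import pandas as pd

charge = pd.Series([55, 70, 95, 65, 120, 45, 80, 350], name="monthly_charge")

# Quartiles with the midpoint convention (Q1=60, Q3=107.5).
q1 = charge.quantile(0.25, interpolation="midpoint")
q3 = charge.quantile(0.75, interpolation="midpoint")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values inside the fences are untouched; values outside are pulled to the fence.
capped = charge.clip(lower=lower, upper=upper)

print(f"Fences: [{lower}, {upper}]")
print(capped.tolist())
```

With this rule only row 7 changes (350 becomes 178.75), whereas the 5th/95th percentile cap always modifies both tails, whether or not they contain anything unusual.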

3. Log Transform

Compresses the right tail without removing data. Useful for right-skewed features where the distribution spans multiple orders of magnitude.

python
df_log = df.copy()
df_log["log_charge"] = np.log(df["monthly_charge"])
print(df_log[["monthly_charge", "log_charge"]].round(3).to_string())

orig_std = df["monthly_charge"].std()
log_std = df_log["log_charge"].std()
print(f"\nStd before log: {orig_std:.2f}")
print(f"Std after log:  {log_std:.2f}")
   monthly_charge  log_charge
0              55       4.007
1              70       4.248
2              95       4.554
3              65       4.174
4             120       4.787
5              45       3.807
6              80       4.382
7             350       5.858

Std before log: 99.79
Std after log:  0.64

After log transform, 350 becomes 5.858, far less extreme relative to the other values. Standard deviation (sample, as computed by pandas) drops from 99.79 to 0.64, making the distribution much more compact.
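A plain log is undefined at zero, which matters if some customers have a charge of 0. One common workaround is `np.log1p`, which computes log(1 + x), with `np.expm1` as its inverse. The sketch below adds a hypothetical zero charge to a few anchor values to show the round trip.

```python
import numpy as np

# Three anchor charges plus a hypothetical 0 (e.g. a free-tier customer).
charges = np.array([0, 55, 70, 350], dtype=float)

log_charges = np.log1p(charges)    # log(1 + x), defined at x = 0
restored = np.expm1(log_charges)   # exp(y) - 1, inverts log1p

print(log_charges.round(3).tolist())
print(restored.round(2).tolist())
```

This still does not handle negative values; for those a different transform is needed.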

4. Keep

If 350 is a genuine high-value customer (e.g., a business account with multiple lines), removing or capping it loses real signal. High-value customers might churn differently. To evaluate, compare model performance with and without the outlier row:

python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

df_full = df.dropna()  # drop missing for simplicity here
df_excl = df_full[df_full["monthly_charge"] <= 178.75]

for label, subset in [("with outlier", df_full), ("without outlier", df_excl)]:
    X = StandardScaler().fit_transform(subset[["tenure_months","monthly_charge"]])
    y_c = subset["churned"]
    model = LogisticRegression(max_iter=500).fit(X, y_c)
    print(f"{label}:  n={len(subset)}  "
          f"accuracy={model.score(X, y_c):.2f}  "
          f"coef_charge={model.coef_[0][1]:.3f}")
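Running this prints n=6 for the full subset and n=5 once row 7 is excluded, along with the training accuracy and the monthly_charge coefficient of each fit.

With so few rows, treat the comparison as a template rather than evidence. In this subset churners pay 45 to 65 and non-churners pay 95 to 350, so higher charge goes with non-churn both with and without row 7 (churned=0), and the coefficient for monthly_charge should be negative in both fits. What the outlier changes is the scaling: StandardScaler's mean and standard deviation are pulled toward 350, which compresses the other customers' standardized charges and shifts the coefficient's magnitude. On real data, if removing a handful of extreme rows changes a coefficient's sign or size materially, those rows carry information, and dropping them changes the model's understanding of the feature.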

Strategy Comparison

| Technique | Handles | Key Risk | Use When |
|---|---|---|---|
| Row deletion | Any missing values | Loses data, selection bias | MCAR, <5% missing |
| Column deletion | Columns >50% missing | Loses all signal in column | Column has negligible information |
| Mean imputation | Numeric NaN | Shrinks variance, biased for MNAR | MCAR, symmetric distribution |
| Median imputation | Numeric NaN | Same, but outlier-robust | Skewed distributions |
| KNN imputation | Numeric NaN | Slow on large datasets | MAR, features correlate with missing col |
| IQR method | Univariate outliers | Sensitive to skewed distributions | Exploratory detection |
| Modified z-score | Univariate outliers | Assumes unimodal distribution | Symmetric but not normal data |
| Isolation Forest | Multivariate outliers | Requires tuning contamination param | High-dimensional feature spaces |
| Winsorize | Extreme values | Distorts distribution tails | Robust to extremes, keep all rows |
| Log transform | Right-skewed distributions | Not applicable for zero/negative values | Feature spans multiple orders of magnitude |

Test Your Understanding

  1. MNAR data cannot be fixed by imputation alone. Suggest a feature engineering strategy that partially addresses MNAR missingness — one that lets the model use the fact that the value is missing as a signal.

  2. Mean imputation always produces a standard deviation lower than the original. Prove this algebraically for a dataset with one missing value. (Hint: show that the filled-in mean contributes zero to variance.)

  3. The modified z-score (using median and MAD) is more robust than the standard z-score. Why does dividing by MAD instead of standard deviation help when there's an outlier?

  4. For the anchor, winsorizing at the 95th percentile replaces 350 with the 95th-percentile value of monthly_charge. If you instead winsorized at the 90th percentile, what value would 350 be replaced with? Show the calculation.

  5. KNN imputation with k=3 gave row 1 (age=None) an imputed value of 30.67. If you increased k to include the next nearest neighbor (row 2, age=42), the imputed value would be higher. Is this necessarily more accurate? What does k control in this context?
