Handling Missing Values: MCAR, MAR, MNAR and Imputation Techniques
Missing values are one of the most common reasons models underperform — not because the algorithm can't handle them, but because how you handle them changes what your model learns.
The key insight that separates a good approach from a naive one is this: the mechanism that caused the data to be missing matters as much as how you fill it. Treating all missing values the same way is a sure way to quietly bias your analysis.
The Three Mechanisms of Missing Data
MCAR — Missing Completely at Random
The probability of a value being missing has no relationship to any other variable — observed or missing. It's pure randomness.
A practical example: in a survey, some respondents accidentally skip a question because of a UI glitch. The people who skipped aren't systematically different from those who answered. The missingness is just noise.
In practice, MCAR is rare. Most real-world missing data has some structure.
MAR — Missing at Random
The probability of missingness depends on observed data, but not on the missing value itself.
Consider an income survey where men report their income more often than women. If you know the respondent's gender, you can predict who is more likely to have missing income data. But within each gender, the missing incomes aren't systematically different from the reported ones; men simply respond more often.
This is the most common assumption behind real-world imputation. Most off-the-shelf methods assume MAR holds.
MNAR — Missing Not at Random
The probability of missingness depends on the missing value itself.
People with very high incomes are less likely to report their income. The missing values aren't random — they're the high ones. No amount of looking at other variables fully corrects for this because the quantity you're trying to measure determines whether it's observed.
MNAR is the hardest to handle. It typically requires domain knowledge or sensitivity analysis, not a simple imputation function.
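The difference between the three mechanisms is easier to see in a small simulation. The sketch below is an illustration on made-up data, not part of any real survey: the `people` frame, the income levels, and the missingness probabilities are all assumptions chosen to mirror the income examples above. It hides incomes under each mechanism and compares a naive mean of the observed values with a mean reweighted by gender.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population: income depends on an always-observed gender column.
gender = rng.choice(["male", "female"], size=n)
income = rng.normal(np.where(gender == "male", 60_000, 50_000), 10_000)
people = pd.DataFrame({"gender": gender, "income": income})
true_mean = people["income"].mean()


def summarize(missing_prob):
    """Hide incomes with the given per-row probability and return two estimates."""
    observed = people[rng.random(n) >= missing_prob]
    naive = observed["income"].mean()
    # Reweight the within-gender means by each gender's share of the full sample.
    shares = people["gender"].value_counts(normalize=True)
    adjusted = (observed.groupby("gender")["income"].mean() * shares).sum()
    return naive, adjusted


mechanisms = {
    # MCAR: every row has the same 30% chance of being missing.
    "MCAR": np.full(n, 0.3),
    # MAR: missingness depends only on the observed gender column.
    "MAR": np.where(gender == "male", 0.1, 0.5),
    # MNAR: the higher the income, the more likely it is to be missing.
    "MNAR": 1 / (1 + np.exp(-(income - 55_000) / 10_000)),
}

results = {name: summarize(prob) for name, prob in mechanisms.items()}
for name, (naive, adjusted) in results.items():
    print(f"{name}: naive={naive:,.0f} adjusted={adjusted:,.0f} true={true_mean:,.0f}")
```

Under these assumptions the naive mean should stay close to the truth for MCAR, drift for MAR until gender is accounted for, and stay biased for MNAR even after the adjustment, because the adjustment only uses observed columns.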
Seeing It on Real Data
The Titanic dataset is a good test case because it has missing values in several columns with different mechanisms.
import seaborn as sns
import pandas as pd
df = sns.load_dataset('titanic')
df.isnull().sum()
You'll see that age has 177 missing values and deck has 688. The age values are likely MAR (certain passenger classes had poorer records), while deck might be closer to MNAR (higher-status passengers may be more documented).
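One quick way to probe a guess like "age is missing more often in some passenger classes" is to compare missingness rates across an observed column. The helper below is a minimal sketch; `missing_rate_by` is a hypothetical name, and the `toy` frame is made-up data standing in for the real dataset so the example runs without a download.

```python
import numpy as np
import pandas as pd


def missing_rate_by(frame, column, by):
    """Fraction of missing values in `column` within each level of `by`."""
    return frame[column].isna().groupby(frame[by]).mean()


# Tiny made-up frame, not the Titanic data.
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "age": [38.0, 35.0, np.nan, 54.0, np.nan, np.nan, 27.0, np.nan],
})

rates = missing_rate_by(toy, "age", "pclass")
print(rates)
```

On the Titanic frame the same call would be `missing_rate_by(df, 'age', 'pclass')`. Clearly different rates across groups argue against MCAR, but this kind of check cannot separate MAR from MNAR, since that depends on the values you never observed.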
The most straightforward approach is deletion. Row-wise deletion drops any row with a missing value:
df.dropna().shape  # (182, 15): loses about 80% of the rows
Going from 891 rows to 182 is rarely acceptable. Column-wise deletion removes every column that contains at least one missing value:
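df.dropna(axis=1)
This drops deck entirely, which may be the right call given that 77% of its values are missing. But it also drops age, embarked, and embark_town, which are missing far fewer values, so in practice you want to drop only the columns above some missingness threshold. The question is where to draw that threshold, and there's no universal answer.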
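A thresholded version of column-wise deletion can be sketched as follows. The function name `drop_sparse_columns` and the 50% default are assumptions for illustration, not a recommended cutoff, and the `toy` frame is made-up data.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, max_missing_frac=0.5):
    """Drop columns whose share of missing values exceeds `max_missing_frac`."""
    missing_frac = frame.isna().mean()  # per-column fraction of missing values
    return frame.loc[:, missing_frac <= max_missing_frac]


# Made-up frame: age is 25% missing, deck is 75% missing, fare is complete.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

kept = drop_sparse_columns(toy, max_missing_frac=0.5)
print(list(kept.columns))
```

With this toy frame only deck crosses the threshold, so age and fare survive and age can still be imputed afterwards.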
Imputation Strategies
When deletion is too aggressive, you fill in the missing values with estimates.
Mean Imputation
Replace missing values with the column mean. Works well when data is roughly normally distributed and missingness is MCAR.
df['age_mean'] = df['age'].fillna(df['age'].mean())
The problem: this shrinks the variance, because every imputed value sits exactly at the mean. If you have 20% missing values, the imputed column will have less spread than the real distribution, which can shrink effect sizes in downstream models.
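The size of that shrinkage can be checked directly. If a fraction p of the values is filled with the mean, the variance of the filled column is roughly (1 - p) times the observed variance, so with p = 0.2 the standard deviation drops to about sqrt(0.8), or roughly 0.89 of its observed value. The sketch below uses synthetic, normally distributed values (an assumption, not the Titanic ages) with about 20% hidden completely at random.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic column with roughly 20% of values hidden completely at random.
values = pd.Series(rng.normal(30, 14, size=10_000))
with_gaps = values.mask(rng.random(len(values)) < 0.2)

# Mean imputation: every gap gets the same value.
filled = with_gaps.fillna(with_gaps.mean())

std_observed = with_gaps.std()  # spread of the values we actually saw
std_filled = filled.std()       # spread after mean imputation
print(f"observed std: {std_observed:.2f}, after mean imputation: {std_filled:.2f}")
print(f"ratio: {std_filled / std_observed:.3f}")
```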
Median Imputation
More robust when your data has outliers. The median is less sensitive to extreme values than the mean.
df['age_median'] = df['age'].fillna(df['age'].median())
For age, the mean and median are close because the distribution is roughly symmetric. But for skewed features like fare, the mean is dragged toward the long tail, so the median is usually the more representative fill value.
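To see why skew matters, compare the two fill values on a long-tailed sample. This is a sketch on synthetic lognormal data, chosen as a stand-in for a fare-like column; the distribution parameters are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic right-skewed column (lognormal), standing in for something like fare.
skewed = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5_000))

mean_fill = skewed.mean()
median_fill = skewed.median()

# Share of real values that lie below each candidate fill value.
share_below_mean = (skewed < mean_fill).mean()
share_below_median = (skewed < median_fill).mean()

print(f"mean fill: {mean_fill:.1f} (above {share_below_mean:.0%} of values)")
print(f"median fill: {median_fill:.1f} (above {share_below_median:.0%} of values)")
```

On data like this the mean lands well above the typical value, so filling gaps with it places imputed rows in the upper part of the distribution, while the median sits in the middle by construction.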
Mode Imputation
For categorical columns, the most frequent category is a reasonable default.
mode_value = df['embarked'].mode()[0]  # mode() ignores NaN by default
df['embarked_mode'] = df['embarked'].fillna(mode_value)
This assumes the missing values follow the same distribution as the observed ones, which is only true under MCAR.
When Simple Imputation Fails
All three methods (mean, median, mode) share a fundamental limitation: they assume missing values follow the same distribution as observed ones. If your data is MAR, and especially if it is MNAR, that assumption is wrong and your estimates will be biased.
A practical safeguard: add a boolean column that tracks which values were imputed.
df['age_missing'] = df['age'].isnull().astype(int)
This lets your model learn if the missingness pattern itself is predictive, which it often is.
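If you are already working with scikit-learn, the fill and the indicator column can come from one transformer: `SimpleImputer` accepts `add_indicator=True` and appends one 0/1 column per feature that had missing values during `fit`. The sketch below uses a made-up `toy` frame, and the `_missing` suffix for the new column names is a naming choice of this example, not something the library produces.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frame: age has gaps, fare is complete.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Median fill plus one indicator column per feature that had missing values.
imputer = SimpleImputer(strategy="median", add_indicator=True)
filled = imputer.fit_transform(toy)

# indicator_.features_ holds the positions of the columns that got an indicator.
flagged = toy.columns[imputer.indicator_.features_]
columns = list(toy.columns) + [f"{name}_missing" for name in flagged]
result = pd.DataFrame(filled, columns=columns, index=toy.index)
print(result)
```

Fitting the imputer on training data and reusing it on new data keeps the fill value and the set of indicator columns consistent between the two.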
Beyond Simple Imputation
Two more robust methods worth knowing about:
KNN Imputation: finds the k nearest rows, measured on the features that are present, and averages their values for the missing feature. Works well for MCAR and MAR but is computationally expensive for large datasets.
MICE (Multiple Imputation by Chained Equations): models each feature with missing values as a function of the other features, cycling through them iteratively. Often more accurate than KNN, but more complex to set up.
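For the chained-equations idea, scikit-learn offers `IterativeImputer`, which is still marked experimental and has to be enabled with an explicit import. The sketch below is an illustration on synthetic data, not the Titanic frame: the linear relationship between `age` and `fare` is an assumption made so that there is something for the imputer to learn. Note that a single `fit_transform` returns one completed dataset. Multiple imputation in the strict sense would repeat this with `sample_posterior=True` and different seeds, then pool the results.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 500

# Synthetic data where age is (by assumption) linearly related to fare.
fare = rng.uniform(5, 100, size=n)
age = 20 + 0.3 * fare + rng.normal(0, 3, size=n)
toy = pd.DataFrame({"age": age, "fare": fare})

# Keep the true ages, then hide about 20% of them completely at random.
true_age = toy["age"].copy()
hidden = rng.random(n) < 0.2
toy.loc[hidden, "age"] = np.nan

# Each feature with gaps is regressed on the others, repeated up to max_iter rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

# Compare against plain mean imputation on the hidden rows only.
mice_error = (filled.loc[hidden, "age"] - true_age[hidden]).abs().mean()
mean_error = (toy["age"].mean() - true_age[hidden]).abs().mean()
print(f"iterative imputer MAE: {mice_error:.2f}, mean imputation MAE: {mean_error:.2f}")
```

The gain over mean imputation here comes entirely from the built-in relationship between the two columns. On features that carry little information about each other, the difference would be much smaller.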
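# KNN imputation on the numeric columns
from sklearn.impute import KNNImputer
# Distances are computed on the raw values, so large-scale columns such as fare dominate unless you scale first
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
    imputer.fit_transform(df[['age', 'fare', 'pclass']]),
    columns=['age', 'fare', 'pclass']
)
Summary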
The mechanism behind missing data determines what you should do about it. MCAR lets you use almost any method safely. MAR is manageable with good imputation. MNAR requires careful thought and domain knowledge. Starting with simple methods and progressively trying more robust ones is the practical way to find out how sensitive your results are to the choice.