Handling Missing Values

Jun 1, 2026 · 4 min read · By Mohammed Vasim
Machine Learning · AI · Data Science

Handling Missing Values: MCAR, MAR, MNAR and Imputation Techniques

Missing values are one of the most common reasons models underperform — not because the algorithm can't handle them, but because how you handle them changes what your model learns.

The key insight that separates a good approach from a naive one is this: the mechanism that caused the data to be missing matters as much as how you fill it. Treating all missing values the same way is a sure way to quietly bias your analysis.

The Three Mechanisms of Missing Data

MCAR — Missing Completely at Random

The probability of a value being missing has no relationship to any other variable — observed or missing. It's pure randomness.

A practical example: in a survey, some respondents accidentally skip a question because of a UI glitch. The people who skipped aren't systematically different from those who answered. The missingness is just noise.

In practice, MCAR is rare. Most real-world missing data has some structure.

MAR — Missing at Random

The probability of missingness depends on observed data, but not on the missing value itself.

Consider an income survey where men report their income more often than women. Because gender is recorded for everyone, you can predict who is likely to have missing income. Within each gender, though, the missing incomes aren't systematically different from the reported ones; women are simply less likely to respond.

This is the most common assumption behind real-world imputation. Most off-the-shelf methods assume MAR holds.

MNAR — Missing Not at Random

The probability of missingness depends on the missing value itself.

People with very high incomes are less likely to report their income. The missing values aren't random — they're the high ones. No amount of looking at other variables fully corrects for this because the quantity you're trying to measure determines whether it's observed.

MNAR is the hardest to handle. It typically requires domain knowledge or sensitivity analysis, not a simple imputation function.
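To make the three mechanisms concrete, here is a minimal simulation sketch on synthetic data. Everything in it is an assumption made for illustration: the group flag stands in for an always-observed variable such as gender, and the missingness probabilities and the 65 cutoff are arbitrary. It compares the mean of the observed values with the true mean under each mechanism, with and without adjusting for the observed group.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic population: income depends on a fully observed group flag.
group = rng.integers(0, 2, size=n)          # 0 or 1, always observed
income = rng.normal(50 + 20 * group, 10)    # true mean is close to 60

# MCAR: every value has the same 30% chance of being missing.
miss_mcar = rng.random(n) < 0.3

# MAR: missingness depends only on the observed group flag.
miss_mar = rng.random(n) < np.where(group == 1, 0.5, 0.1)

# MNAR: missingness depends on the income value itself.
miss_mnar = rng.random(n) < np.where(income > 65, 0.6, 0.1)


def observed_mean(values, missing):
    """Mean of the values that were actually observed."""
    return values[~missing].mean()


def group_adjusted_mean(values, missing, group):
    """Average the per-group observed means, weighted by group size."""
    total = 0.0
    for g in np.unique(group):
        in_group = group == g
        total += values[in_group & ~missing].mean() * in_group.mean()
    return total


true_mean = income.mean()
print(f"true           {true_mean:.2f}")
print(f"MCAR observed  {observed_mean(income, miss_mcar):.2f}")
print(f"MAR observed   {observed_mean(income, miss_mar):.2f}")
print(f"MAR adjusted   {group_adjusted_mean(income, miss_mar, group):.2f}")
print(f"MNAR observed  {observed_mean(income, miss_mnar):.2f}")
print(f"MNAR adjusted  {group_adjusted_mean(income, miss_mnar, group):.2f}")
```

In this setup the MCAR estimate should land close to the truth, the raw MAR estimate should be off but recoverable once you condition on the group, and the MNAR estimate should stay biased even after the adjustment, which is the practical difference between the three.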

Seeing It on Real Data

The Titanic dataset is a good test case because it has missing values in several columns with different mechanisms.

python
import seaborn as sns
import pandas as pd

df = sns.load_dataset('titanic')
df.isnull().sum()

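You'll see that age has 177 missing values and deck has 688 (embarked and embark_town are each missing 2). The age values are plausibly MAR: records were poorer for certain passenger classes, and class is observed for everyone. deck is harder to classify. If its gaps are fully explained by passenger class, that is still MAR; if cabin records survived mainly for passengers on particular decks, it is closer to MNAR.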
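One rough way to probe a claim like "age is missing more often in some classes" is to compare the missing rate across groups of an observed column. The sketch below uses a hypothetical helper, missing_rate_by_group, and a tiny hand-made frame whose values are illustrative only, not taken from the Titanic data.

```python
import numpy as np
import pandas as pd


def missing_rate_by_group(frame, column, by):
    """Share of rows where `column` is missing, for each value of `by`."""
    return frame[column].isna().groupby(frame[by]).mean()


# Tiny hand-made frame standing in for real data (illustrative values only).
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "age": [38.0, 35.0, np.nan, 54.0, np.nan, np.nan, 22.0, np.nan],
})

rates = missing_rate_by_group(toy, "age", "pclass")
print(rates)
```

On the real frame the equivalent call would be missing_rate_by_group(df, 'age', 'pclass'). Clearly different rates across groups argue against MCAR, but no check on observed data can rule out MNAR, because that depends on the values you never saw.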

The most straightforward approach is deletion. Row-wise deletion drops any row with a missing value:

python
df.dropna().shape  # (182, 15) — loses 80% of data

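Going from 891 rows to 182 is rarely acceptable. Column-wise deletion removes entire columns instead. Called with no other arguments, dropna(axis=1) drops every column that has any missing value, so here it would remove age, embarked, deck and embark_town together. To drop only columns with too many missing values, pass thresh, the minimum number of non-missing values a column needs in order to be kept:

python
# Keep columns that are at least 50% populated (the 50% cutoff is an arbitrary choice)
df.dropna(axis=1, thresh=int(0.5 * len(df)))

This drops deck, which may be the right call given that 77% of its values are missing, and keeps age. The question is where to draw the threshold; there's no universal answer.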

Imputation Strategies

When deletion is too aggressive, you fill in the missing values with estimates.

Mean Imputation

Replace missing values with the column mean. This works best when the data is roughly symmetric and missingness is MCAR.

python
df['age_mean'] = df['age'].fillna(df['age'].mean())

The problem: this shrinks the variance, because every imputed value sits exactly at the mean. With 20% of values missing, as with age here, the imputed column has visibly less spread than the real distribution, which can weaken correlations and distort effect sizes in downstream models.
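The shrinkage is easy to see on synthetic data. This is a small sketch under stated assumptions: the ages are drawn from a normal distribution with made-up parameters, and 20% of them are removed completely at random before being filled with the mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic ages; 20% are then removed completely at random (MCAR).
full = pd.Series(rng.normal(30, 14, size=5_000))
with_gaps = full.mask(rng.random(len(full)) < 0.2)

# Fill every gap with the mean of the observed values.
imputed = with_gaps.fillna(with_gaps.mean())

print(f"std before removal:   {full.std():.2f}")
print(f"std of observed only: {with_gaps.std():.2f}")
print(f"std after mean fill:  {imputed.std():.2f}")
```

The mean is preserved by construction, but the standard deviation after filling should come out noticeably lower than that of the observed values, even though the missingness here is the most benign kind.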

Median Imputation

More robust when your data has outliers. The median is less sensitive to extreme values than the mean.

python
df['age_median'] = df['age'].fillna(df['age'].median())

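For age, the mean and median are close because the distribution is roughly symmetric. For skewed features such as fare, the mean is pulled toward the long tail, so the median is usually the more representative fill value. (fare has no missing values in this dataset, so this matters only for data where a skewed column does have gaps.)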
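To see why skew matters, the sketch below compares the two candidate fill values on right-skewed synthetic data. The lognormal distribution and its parameters are assumptions chosen only to mimic a fare-like shape.

```python
import numpy as np

rng = np.random.default_rng(1)

# Right-skewed synthetic values standing in for something like fares.
values = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)

mean_fill = values.mean()
median_fill = np.median(values)

# Share of the data that lies below each candidate fill value.
below_mean = (values < mean_fill).mean()
below_median = (values < median_fill).mean()

print(f"mean fill:   {mean_fill:.1f} (above {below_mean:.0%} of values)")
print(f"median fill: {median_fill:.1f} (above {below_median:.0%} of values)")
```

With a long right tail the mean sits well above the typical value, so filling gaps with it places the imputed rows higher than most of the real ones.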

Mode Imputation

For categorical columns, the most frequent category is a reasonable default.

python
mode_value = df['embarked'].mode()[0]  # mode() ignores NaN by default
df['embarked_mode'] = df['embarked'].fillna(mode_value)

This assumes the missing values follow the same distribution as the observed ones — which is only true under MCAR.

When Simple Imputation Fails

All three methods (mean, median, mode) share a fundamental limitation: they assume missing values follow the same distribution as observed ones. That holds under MCAR. Under MAR it fails unless you impute within groups defined by the observed variables, and under MNAR it fails outright, so your estimates will be biased.

A practical safeguard: add an indicator column that tracks which values were imputed. Create it before you overwrite the original column, or the information is lost.

python
df['age_missing'] = df['age'].isnull().astype(int)

This lets your model learn whether the missingness pattern itself is predictive, which it often is.
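If you impute with scikit-learn, SimpleImputer can produce the fill and the indicator in one step through its add_indicator option. The sketch below runs it on a small made-up array; the numbers are illustrative and the column meanings are assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two numeric features; only the first has gaps (values are illustrative).
X = np.array([
    [22.0, 7.25],
    [np.nan, 71.28],
    [26.0, 7.92],
    [np.nan, 8.05],
    [35.0, 53.10],
])

imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)

# Output columns: imputed feature 0, feature 1, then one 0/1 flag for each
# feature that had missing values during fit (here only feature 0).
print(X_out)
```

Fitting the imputer on training data and reusing it on test data keeps the fill value and the set of indicator columns consistent between the two.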

Beyond Simple Imputation

Two more robust methods worth knowing about:

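KNN Imputation: finds the k nearest neighbors of each incomplete row, measuring distance on the columns that are present, and averages the neighbors' values for the missing column. It works well for MCAR and MAR but is computationally expensive for large datasets, and because it is distance-based, features should be on comparable scales.

MICE (Multiple Imputation by Chained Equations): models each feature with missing values as a function of the other features, cycling through them iteratively. It often captures relationships between features better than KNN but is more complex to set up and tune.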

python
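from sklearn.impute import KNNImputer

# KNN is distance-based, so unscaled features with large ranges (here fare) dominate the neighbor search
cols = ['age', 'fare', 'pclass']
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
    imputer.fit_transform(df[cols]),
    columns=cols,
    index=df.index,  # keep row alignment with df
)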
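For the chained-equations approach, scikit-learn offers IterativeImputer, which is inspired by MICE. Note that it returns a single imputed dataset by default rather than multiple ones; getting several imputations means running it repeatedly with sample_posterior=True and different seeds. The following is a minimal sketch on synthetic data, where the linear relationship between x and y and the 30% missing rate are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
# IterativeImputer is flagged experimental and needs this explicit opt-in import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic stand-in for numeric features: y is roughly linear in x.
x = rng.normal(0, 1, size=500)
y = 2 * x + rng.normal(0, 0.5, size=500)
frame = pd.DataFrame({"x": x, "y": y})

# Remove 30% of y completely at random.
missing = rng.random(len(frame)) < 0.3
frame_missing = frame.copy()
frame_missing.loc[missing, "y"] = np.nan

imputer = IterativeImputer(max_iter=10, random_state=0)
filled = pd.DataFrame(
    imputer.fit_transform(frame_missing),
    columns=frame_missing.columns,
    index=frame_missing.index,
)

# Compare against plain mean imputation on the rows that were removed.
mean_filled = frame_missing["y"].fillna(frame_missing["y"].mean())
err_iter = (filled.loc[missing, "y"] - frame.loc[missing, "y"]).abs().mean()
err_mean = (mean_filled[missing] - frame.loc[missing, "y"]).abs().mean()

print(f"mean abs error, iterative: {err_iter:.2f}")
print(f"mean abs error, mean fill: {err_mean:.2f}")
```

Because y is strongly related to x here, the model-based fill should recover the removed values far better than the column mean. On real data the gain depends on how much the other features actually explain the column being imputed.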

Summary

The mechanism behind missing data determines what you should do about it. MCAR lets you use almost any method safely. MAR is manageable with good imputation. MNAR requires careful thought and domain knowledge. Starting with simple methods and progressively trying more robust ones is the practical way to find out how sensitive your results are to the choice.
