Missing Data Isn't Just a Nuisance — The Mechanism Determines the Fix

Jun 22, 2026 · 13 min read · By Mohammed Vasim
Tags: data-imputation, missing-data, scikit-learn, feature-engineering, preprocessing

You load your dataset, fit the model, and get:

text
ValueError: Input X contains NaN.

That's the easy case. The harder one is when you filled those NaNs with the column mean, the model trained without complaint, and your predictions are quietly off — and you have no idea why.

The missing values aren't the problem. The problem is that why they're missing determines what you should do about them, and most preprocessing tutorials skip that part entirely.

Why Data Goes Missing

Before touching a single NaN, it's worth asking whether the missingness is random noise or structured signal. Statisticians Donald Rubin and Roderick Little gave us the vocabulary for this in three categories.

Missing Completely at Random (MCAR) means the probability that a value is missing has no relationship to any variable in the dataset, observed or unobserved. Think of a sensor that randomly drops a reading, or a survey respondent who accidentally skips a page. The missing values are a random sample of all values. Dropping them costs you sample size but leaves your estimates unbiased, and simple fills preserve the column mean (though they still shrink its variance).

Missing at Random (MAR) means the missingness does depend on other observed variables but, once you account for those variables, not on the missing value itself. Younger patients in a clinical study are more likely to skip the follow-up questionnaire: the missingness depends on age (which you have), not on the answer they would have given. Simple deletion now introduces bias because you're systematically losing a subgroup. But because you have the variable that explains the missingness, you can model it out.

Missing Not at Random (MNAR) is the hard one. The probability of a value being missing depends on the missing value itself. High-income individuals don't report their income. Patients with severe symptoms drop out of a drug trial. No imputation method can fully correct for MNAR without external assumptions about the missingness mechanism — you're trying to estimate the very thing you don't have using the very thing you don't have.

The practical takeaway: if you can't identify the mechanism, assume MAR at best, and verify by checking whether missingness correlates with other features. MNAR requires domain knowledge to handle honestly.
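One way to run that check is to turn missingness into a 0/1 column and compare it against the features you do have. The sketch below is illustrative only: the data is simulated, and the column names and the 40% / 5% skip rates are assumptions chosen to mimic the younger-respondents example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "age": rng.integers(18, 80, n).astype(float),
    "income": rng.normal(55000, 15000, n),
})

# Simulate MAR (assumed rates): respondents under 35 skip income far more often
p_missing = np.where(df["age"] < 35, 0.40, 0.05)
df.loc[rng.random(n) < p_missing, "income"] = np.nan

# Turn missingness into a variable and compare it with an observed feature
income_missing = df["income"].isna()
age_when_missing = df.loc[income_missing, "age"].mean()
age_when_present = df.loc[~income_missing, "age"].mean()
corr = df["age"].corr(income_missing.astype(float))

print(f"mean age, income missing: {age_when_missing:.1f}")
print(f"mean age, income present: {age_when_present:.1f}")
print(f"corr(age, income_missing): {corr:.3f}")
```

A clear gap between the two groups argues against MCAR. The reverse does not hold: finding no relationship with observed features cannot rule out MNAR, because the variable driving the missingness is the one you can't see.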

[Figure: "Why is this value missing?" Three mechanisms, ordered from easiest to hardest to handle.
MCAR: pure chance, no pattern. Safe to drop or fill simply. Example: income missing because the form glitched randomly.
MAR: depends on observed variables. Model the missingness. Example: income missing because younger respondents skip it.
MNAR: depends on the missing value itself. No clean fix without domain information. Example: income missing because high earners opt out.]

When Deletion Is the Right Call

Dropping rows with missing values via df.dropna() gets a bad reputation, but it's genuinely fine in specific conditions: the data is MCAR, and you're losing only a small share of your dataset (a common rule of thumb is no more than about 5%). When both are true, the dropped rows are a random subset and your remaining data is unbiased.

The trap is using deletion on MAR data. If missingness correlates with any feature, your training set is now a skewed sample. A churn model trained after dropping rows where last_login is missing will never see the behavioral pattern of users who haven't logged in for months — which is probably your highest-risk group.

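Column deletion is even blunter. Drop a feature with 60% missing values and you throw away the 40% of values that were actually observed, along with whatever signal they carried. Worth keeping in mind before you reach for df.dropna(axis=1), which by default drops every column that contains even a single NaN.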
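Before deleting anything, it helps to measure what deletion would actually cost. A minimal sketch on a made-up frame (the values are illustrative, not from a real dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [34, np.nan, 29, 45, np.nan, 51, 38, 27],
    "income": [45000, 62000, np.nan, 78000, 54000, 61000, 58000, 43000],
    "rooms":  [3, 4, 2, np.nan, 3, 4, 3, 2],
})

# Fraction of missing values in each column
missing_per_column = df.isna().mean()

# Fraction of rows that df.dropna() would remove (any NaN in the row)
rows_lost = 1 - len(df.dropna()) / len(df)

print(missing_per_column)
print(f"rows lost to dropna(): {rows_lost:.0%}")
```

No single column here is more than 25% missing, yet row-wise deletion removes half the data, because the gaps fall in different rows. Per-column missing rates understate what dropna() costs.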

Simple Statistical Imputation

The most common starting point: replace each missing value with the mean, median, or mode of that column.

python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

data = {
    "age":    [34, np.nan, 29, 45, np.nan],
    "income": [45000, 62000, np.nan, 78000, 54000],
    "rooms":  [3, 4, 2, np.nan, 3],
}
df = pd.DataFrame(data)

imp = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
print(df_imputed)
text
    age   income  rooms
0  34.0  45000.0    3.0
1  34.0  62000.0    4.0
2  29.0  58000.0    2.0
3  45.0  78000.0    3.0
4  34.0  54000.0    3.0

Use strategy="mean" for roughly normally distributed numerical features, strategy="median" for skewed distributions or data with outliers, and strategy="most_frequent" for categorical columns.

The catch: every imputed value in a column is the same number, the column's central tendency. You're artificially concentrating the distribution at one point, which compresses variance and weakens correlations between features. If 30% of income is missing and you fill it all with the median, you've just flattened something that should spread. Models that lean on those relationships, linear models especially, end up with attenuated coefficients as a result.
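The effect is easy to see on simulated data. In this sketch the spend column, its relationship to income, and the 30% MCAR mask are all assumptions made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
n = 1000
income = rng.normal(55000, 18000, n)
spend = 0.3 * income + rng.normal(0, 4000, n)   # hypothetical correlated feature

# Knock out 30% of income completely at random
mask = rng.random(n) < 0.30
income_missing = income.copy()
income_missing[mask] = np.nan

# Fill every gap with the same number: the observed median
imputed = SimpleImputer(strategy="median").fit_transform(
    income_missing.reshape(-1, 1)
).ravel()

corr_true = np.corrcoef(income, spend)[0, 1]
corr_after = np.corrcoef(imputed, spend)[0, 1]

print(f"std  before: {income.std():.0f}   after: {imputed.std():.0f}")
print(f"corr before: {corr_true:.3f}   after: {corr_after:.3f}")
```

Both the spread of income and its correlation with spend come out smaller after imputation, even though the missingness here is the friendliest possible case.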

The rule that matters most: fit your imputer on training data only, then use those learned statistics to transform both train and test sets. Fitting on the combined dataset leaks test distribution into training and inflates performance metrics.

python
from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)

imp = SimpleImputer(strategy="median")
X_train_imputed = imp.fit_transform(X_train)
X_test_imputed = imp.transform(X_test)

Multivariate Imputation

Simple imputation treats each column in isolation. Multivariate methods use relationships between features to produce better estimates — because in most real datasets, columns are correlated and that correlation is exactly the signal you want to exploit.

KNN Imputation

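KNN imputation finds, for each row with a missing value, the k nearest rows that do have a value for that feature, then fills with their average (uniform or distance-weighted). scikit-learn's KNNImputer is numeric only, so categorical columns need to be encoded first or imputed separately with the mode. Proximity is measured using nan_euclidean_distances, which computes the distance over only the features present in both rows and scales it up to compensate for the missing coordinates.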

python
from sklearn.impute import KNNImputer

data = {
    "age":    [34, np.nan, 29, 45, 38],
    "income": [45000, 62000, 41000, 78000, 55000],
    "rooms":  [3, 4, 2, 5, 3],
}
df = pd.DataFrame(data)

knn_imp = KNNImputer(n_neighbors=2, weights="distance")
df_knn = pd.DataFrame(knn_imp.fit_transform(df), columns=df.columns)
print(df_knn["age"])
text
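0    34.000000
1    40.130435
2    29.000000
3    45.000000
4    38.000000
Name: age, dtype: float64

Row 1's missing age gets imputed from its two nearest neighbors (rows 4 and 3, weighted by inverse distance) rather than the global median. The result is sensitive to local structure in the data. Note that the features here are unscaled, so income dominates the distance and rooms barely counts; standardize the features first if you want each one to have a say.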

The downside is cost: computing pairwise distances on a large dataset is O(n²). On 10,000+ rows with many features, it becomes slow enough to be a bottleneck in a training pipeline.
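To see which rows end up nearest to row 1 in the example above, you can call the distance function directly. This is a sketch for inspection only; the array is the same toy data written out as plain NumPy, and the manual formula follows the weighting described in the scikit-learn documentation (total features divided by features present in both rows).

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Columns: age, income, rooms (same toy values as the KNN example)
X = np.array([
    [34,     45000, 3],
    [np.nan, 62000, 4],
    [29,     41000, 2],
    [45,     78000, 5],
    [38,     55000, 3],
], dtype=float)

# Distance from row 1 to every row, skipping coordinates that are NaN
dist = nan_euclidean_distances(X[[1]], X)[0]

# Same value by hand for row 0: 2 of 3 features are present in both rows
manual = np.sqrt(3 / 2 * ((62000 - 45000) ** 2 + (4 - 3) ** 2))

order = np.argsort(dist)   # row indices sorted from nearest to farthest
print(order, dist.round(1))
```

Row 1 is trivially closest to itself; after that come rows 4 and 3, driven almost entirely by the income gap. Rescaling the columns before imputing would change that ordering logic, which is why scaling is a real modeling decision for KNN imputation.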

MICE — Multivariate Imputation by Chained Equations

MICE treats each feature with missing values as a regression target, using all other features as predictors. It runs in rounds:

  1. Initialize all missing values with the column median (placeholder)
  2. For each feature with missing values: fit a regression model on rows where that feature is present, predict the missing rows, update the imputed values
  3. Repeat for every feature — one full pass is one iteration
  4. Run for N iterations until values stabilize
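The four steps above can be sketched by hand in a few lines. This is a simplified illustration, not how any library implements it: chained_imputation is a hypothetical helper, it uses plain LinearRegression as the per-feature model, and it runs a fixed number of passes instead of checking for convergence.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def chained_imputation(X, n_iter=10):
    """Hand-rolled chained-equations loop (illustrative sketch only)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: placeholder fill with each column's median
    filled = np.where(missing, np.nanmedian(X, axis=0), X)
    for _ in range(n_iter):                  # step 4: repeat full passes
        for j in range(X.shape[1]):          # step 3: visit every feature
            rows_missing = missing[:, j]
            if not rows_missing.any():
                continue
            others = np.delete(filled, j, axis=1)
            # Step 2: fit where feature j was observed, predict where it was not
            model = LinearRegression().fit(others[~rows_missing], X[~rows_missing, j])
            filled[rows_missing, j] = model.predict(others[rows_missing])
    return filled


# Toy data in which the second column is exactly 2 * first + 1
X = np.array([[1, 3], [2, 5], [3, np.nan], [np.nan, 9], [5, 11], [6, 13]], dtype=float)
result = chained_imputation(X)
print(result.round(3))
```

On this toy data the first pass is slightly off, because each regression is trained on the other column's median placeholder. Later passes replace the placeholders with better estimates, which is the whole point of iterating.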

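Scikit-learn's implementation is IterativeImputer. It is still marked experimental, so it has to be enabled with an explicit import and its API may change, but it is stable in practice. Two differences from textbook MICE are worth knowing: it initializes with the column mean by default (initial_strategy), and it returns a single imputed dataset rather than several unless you set sample_posterior=True and run it with different seeds.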

python
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

mice_imp = IterativeImputer(max_iter=10, random_state=0)
df_mice = pd.DataFrame(mice_imp.fit_transform(df), columns=df.columns)
print(df_mice["age"])
text
0    34.000000
1    35.893421
2    29.000000
3    45.000000
4    38.000000
dtype: float64

[Figure: MICE, one round-robin iteration. age (NaN) is predicted from income and rooms; income (NaN) from age* and rooms; rooms (NaN) from age* and income*. The cycle repeats (×10 iterations) until values converge. * uses the value already imputed in this iteration. Default estimator: BayesianRidge, swappable for any sklearn regressor.]

missForest

missForest runs the same iterative idea but swaps the regressor for a Random Forest, which captures nonlinear relationships and handles mixed data types (numerical + categorical) in one pass. It initializes with mean/mode, then predicts each feature's missing values using a forest trained on observed rows, and repeats until the difference between iterations stops shrinking.

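In scikit-learn, you can approximate missForest by passing a random forest estimator to IterativeImputer. RandomForestRegressor only predicts numeric targets, so categorical columns need to be encoded first: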

python
from sklearn.ensemble import RandomForestRegressor

rf_imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10,
    random_state=0,
)
df_rf = pd.DataFrame(rf_imp.fit_transform(df), columns=df.columns)

Random forests are usually robust at their defaults, so this tends to work without much hyperparameter tuning. The tradeoff is compute: fitting a forest per feature per iteration is significantly slower than the default BayesianRidge estimator.

Marking What Was Missing

One thing most imputation tutorials skip: the fact that a value was missing is itself informative, and you should preserve that signal.

MissingIndicator creates binary columns flagging each position where a value was absent. Feed those alongside the imputed features and your model can learn whether the pattern of missingness predicts the target independently of the imputed values.

python
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion
import numpy as np

X = np.array([[1, 2], [np.nan, 3], [7, np.nan], [4, 5]])

union = FeatureUnion([
    ("imputer", SimpleImputer(strategy="mean")),
    ("indicator", MissingIndicator()),
])

X_out = union.fit_transform(X)
print(X_out)
text
[[1.         2.         0.         0.        ]
 [4.         3.         1.         0.        ]
 [7.         3.33333333 0.         1.        ]
 [4.         5.         0.         0.        ]]

The last two columns tell the model which values were originally absent. This is especially useful for MNAR data, where the missingness pattern may be a stronger predictor than the imputed value itself. SimpleImputer, KNNImputer, and IterativeImputer all expose an add_indicator=True parameter that does this in one step.
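For comparison, here is the one-step version on the same toy array, using the add_indicator parameter instead of a FeatureUnion. A minimal sketch:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1, 2], [np.nan, 3], [7, np.nan], [4, 5]])

# add_indicator=True appends one 0/1 column per feature that had missing
# values during fit, after the imputed feature columns
imp = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imp.fit_transform(X)

print(X_out.shape)   # two imputed columns plus two indicator columns
print(X_out)
```

Indicator columns are only created for features that had gaps at fit time, so the output width depends on the training data.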

Deep Learning Methods

Classical methods win on most tabular datasets, but deep learning has a real edge when N is very large and missingness patterns are complex.

GAIN (Generative Adversarial Imputation Networks) frames imputation as an adversarial game: a generator fills in missing values, a discriminator tries to identify which values were filled. They train against each other until the generator produces imputations indistinguishable from real data. GAIN handles MCAR well; it struggles with MAR and MNAR where the missingness mechanism itself carries structure the adversarial objective doesn't model.

MIWAE (Missing-data Importance-Weighted Autoencoder) is theoretically cleaner: it maximizes an importance-weighted lower bound on the log-likelihood of the observed data, a bound that tightens as the number of importance samples grows, and its derivation assumes MAR. It's slower to train and harder to tune than GAIN.

Empirically: MICE with CART outperforms deep learning models on bias, MSE, and coverage across most realistic settings. At N=50,000+ rows, GAIN matches missForest accuracy while running 40x faster. For typical dataset sizes in applied ML (hundreds to tens of thousands of rows), skip the deep learning methods unless you have a specific reason to use them.

Choosing a Method

The right choice is a function of three variables: how much data is missing, the missingness mechanism, and how much compute you can afford.

Method selection guide

| % Missing | Mechanism | Recommended |
| --- | --- | --- |
| < 5% | Any | SimpleImputer (mean/median/mode) |
| 5–20% | MCAR / MAR | KNNImputer or MICE |
| 5–20%, mixed types | MAR | missForest (IterativeImputer + RF) |
| > 20% | MNAR | Domain knowledge + MissingIndicator |

Always add MissingIndicator columns alongside any imputation strategy.

Two common mistakes worth naming explicitly. First, people impute before splitting train/test, meaning the imputer's learned statistics (mean, nearest neighbors, regression weights) incorporate the test set. This is data leakage, and it inflates evaluation metrics. Second, people impute and then discard the indicator columns, throwing away the structural information about which values were absent. Fitting the imputer inside a pipeline, with the indicator columns kept, avoids both.
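Cross-validation is where the leakage mistake is easiest to make, and a pipeline is the simplest guard. In the sketch below the data and the logistic regression model are placeholders chosen for illustration; the point is only that the imputer is re-fit on each training fold.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.10] = np.nan      # 10% of cells missing at random

# The imputer lives inside the pipeline, so each CV split fits it on the
# training fold only and reuses those medians on the held-out fold
pipe = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    LogisticRegression(),
)
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.round(3))
```

Calling fit_transform on the full X before cross_val_score would look almost identical in code and would leak every held-out fold into the imputation statistics.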

End-to-End Pipeline

Putting it together on a realistic dataset:

python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables IterativeImputer
from sklearn.impute import IterativeImputer, MissingIndicator
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic dataset: 500 rows, four numeric features
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "age":       rng.integers(22, 65, n).astype(float),
    "income":    rng.normal(55000, 18000, n),
    "sessions":  rng.integers(1, 200, n).astype(float),
    "tenure":    rng.integers(1, 60, n).astype(float),
})

# Knock out roughly 12% of cells in every column, completely at random
mask = rng.random((n, 4)) < 0.12
for i, col in enumerate(X.columns):
    X.loc[mask[:, i], col] = np.nan

# Toy target: income above 55,000 (rows with missing income fall back to the median)
y = (X["income"].fillna(X["income"].median()) > 55000).astype(int)

# Split first, so nothing downstream ever sees the test rows during fit
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

imputer = IterativeImputer(max_iter=10, random_state=0)
indicator = MissingIndicator(features="all")

# Imputed values and missingness flags, concatenated column-wise
features = FeatureUnion([
    ("imputed", imputer),
    ("flags",   indicator),
])

# The scaler is harmless for a random forest and matters if you swap in a linear model
pipe = Pipeline([
    ("features", features),
    ("scaler",   StandardScaler()),
    ("model",    RandomForestClassifier(n_estimators=100, random_state=0)),
])

pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
text
Test accuracy: 0.910

The FeatureUnion step runs the imputer and indicator in parallel, concatenating their outputs — the model receives both the imputed feature values and the binary flags marking which were originally absent.

Imputation vs. Data Synthesis — Not the Same Problem

These two terms get conflated often enough that it's worth being direct about the difference.

Imputation operates on an existing dataset. Real rows exist; some values within those rows are missing. The job is to estimate what the missing value would have been for that specific observation, using the information you already have. The row is real. The imputed value is your best guess at a specific, real measurement you failed to capture.

Data synthesis generates entirely new observations. No real row is being completed — a generative model learns the joint distribution of the data and samples from it to produce rows that never existed. The goal is expanding a dataset, creating training data for a downstream model, augmenting rare classes, or generating privacy-preserving stand-ins for sensitive records.

The practical difference matters:

| | Imputation | Data Synthesis |
| --- | --- | --- |
| Starting point | Real rows with gaps | Nothing; generating from scratch |
| Goal | Recover a specific missing value | Expand or augment the dataset |
| Validity constraint | Must be consistent with the observed row | Must match the learned distribution |
| Evaluation | Reconstruction error on held-out values | Distributional fidelity, downstream model performance |
| Tools | SimpleImputer, MICE, KNNImputer | GANs, VAEs, SMOTE, LLM-based synthesis |

GAIN and MIWAE occupy an uncomfortable middle — they use generative architectures (GANs, VAEs) but their output is an imputed value constrained to a real row, not a new observation. They're generative mechanisms applied to an imputation problem. Calling them synthesis is a category error.

Where it gets genuinely blurry: SMOTE (Synthetic Minority Oversampling Technique) interpolates between real minority-class examples to generate new points. Some people call that imputation; it's closer to synthesis. The boundary is whether you're completing a real observation or inventing a new one.
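The interpolation step is small enough to write out, which makes the distinction concrete. This is a deliberately simplified sketch: real SMOTE picks at random among the k nearest minority neighbors, while this version uses only the single nearest one, and the three points are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three real minority-class rows (illustrative values)
minority = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]])

i = 0                                              # the row to interpolate from
dists = np.linalg.norm(minority - minority[i], axis=1)
dists[i] = np.inf                                  # exclude the point itself
j = int(np.argmin(dists))                          # nearest other minority row

# A random point on the segment between the two real rows
lam = rng.random()
synthetic = minority[i] + lam * (minority[j] - minority[i])
print(synthetic)
```

Every coordinate of the output is new. No real row is being completed, which is what puts this on the synthesis side of the line.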

The reason this matters operationally: synthesis errors compound when you treat synthetic rows as ground truth in downstream tasks. Imputation errors affect specific cells in real rows — localized, auditable. Synthesis errors can silently shift the learned distribution of your entire training set.

There's a reason MNAR gets a separate row in every decision guide: no imputation technique handles it cleanly. You're trying to estimate a value whose absence is correlated with the value itself. No amount of information from other features resolves that circular dependency.
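A small simulation shows how that circularity plays out. Everything here is assumed for illustration: the lognormal income distribution, and a non-response probability that rises with income itself and is capped at 80%.

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.9, sigma=0.5, size=10_000)

# MNAR (assumed mechanism): the higher the income, the likelier it goes unreported
p_missing = np.clip((income - 40_000) / 100_000, 0.0, 0.8)
observed = np.where(rng.random(income.size) < p_missing, np.nan, income)

true_mean = income.mean()
observed_mean = np.nanmean(observed)

# Median imputation can only reuse the already-biased observed values
imputed = np.where(np.isnan(observed), np.nanmedian(observed), observed)

print(f"true mean:     {true_mean:,.0f}")
print(f"observed mean: {observed_mean:,.0f}")
print(f"imputed mean:  {imputed.mean():,.0f}")
```

Both the observed mean and the imputed mean come out below the true mean, and no statistic computed from the observed column alone can tell you by how much.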

The honest approaches for MNAR data are pattern-mixture models (explicitly modeling the distribution of observed vs. missing), selection models (modeling the missingness mechanism alongside the outcome), or Bayesian approaches that incorporate prior knowledge about the missing distribution. In practice, most teams acknowledge MNAR, add MissingIndicator columns so the model can at least learn from the pattern, and document the limitation rather than pretending an imputation fixed it.

The discomfort with MNAR is worth sitting with. A model trained on imputed MNAR data may perform well on your test set — which has the same missingness structure — and fail on deployment data where the mechanism shifts. The distribution of who doesn't report income is not stable across years, geographies, or user bases.


Imputation is one of the few preprocessing steps where the right answer genuinely depends on something you often can't know for certain — why the data is absent. That uncertainty doesn't disappear when you pick a method; it just gets baked into your estimates more or less honestly.
