Missing Data Isn't Just a Nuisance — The Mechanism Determines the Fix

Jun 22, 2026 • 13 min read • By Mohammed Vasim

Tags: data-imputation, missing-data, scikit-learn, feature-engineering, preprocessing

You load your dataset, fit the model, and get:

text

ValueError: Input X contains NaN.

That's the easy case. The harder one is when you filled those NaNs with the column mean, the model trained without complaint, and your predictions are quietly off — and you have no idea why.

The missing values aren't the problem. The problem is that why they're missing determines what you should do about them, and most preprocessing tutorials skip that part entirely.

Why Data Goes Missing

Before touching a single NaN, it's worth asking whether the missingness is random noise or structured signal. Statisticians Donald Rubin and Roderick Little gave us the vocabulary for this in three categories.

Missing Completely at Random (MCAR) means the probability that a value is missing has no relationship to any variable in the dataset — observed or unobserved. A sensor that randomly drops a reading, a survey respondent who accidentally skips a page. The missing values are a random sample of all values. You can drop them or fill them with simple statistics and your estimates stay unbiased.

Missing at Random (MAR) means the missingness does depend on other observed variables, but not on the missing value itself. Younger patients in a clinical study are more likely to skip the follow-up questionnaire — the missingness depends on age (which you have), not on the answer they would have given. Simple deletion now introduces bias because you're systematically losing a subgroup. But because you have the variable that explains the missingness, you can model it out.

Missing Not at Random (MNAR) is the hard one. The probability of a value being missing depends on the missing value itself. High-income individuals don't report their income. Patients with severe symptoms drop out of a drug trial. No imputation method can fully correct for MNAR without external assumptions about the missingness mechanism — you're trying to estimate the very thing you don't have using the very thing you don't have.

The practical takeaway: if you can't identify the mechanism, assume MAR at best, and verify by checking whether missingness correlates with other features. MNAR requires domain knowledge to handle honestly.

Example: income is missing because the form glitched randomly (MCAR, easiest to handle), because younger respondents skip it (MAR), or because high earners opt out (MNAR, hardest to handle).
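One rough way to check whether missingness correlates with other features is to turn the missingness of a column into a 0/1 flag and see whether that flag tracks any observed feature. The sketch below uses simulated data; the 40% and 5% skip rates are made-up numbers chosen only to make the effect visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
age = rng.integers(20, 70, n).astype(float)
income = rng.normal(50_000, 15_000, n)

# Hypothetical MAR mechanism: respondents under 35 skip the income question more often.
p_skip = np.where(age < 35, 0.40, 0.05)
income_observed = np.where(rng.random(n) < p_skip, np.nan, income)
df = pd.DataFrame({"age": age, "income": income_observed})

# Missingness flag for the column under suspicion.
income_missing = df["income"].isna()

# Compare an observed feature across the "missing" and "present" groups.
mean_age_missing = df.loc[income_missing, "age"].mean()
mean_age_present = df.loc[~income_missing, "age"].mean()

# Correlation between the flag and the observed feature.
flag_age_corr = df["age"].corr(income_missing.astype(float))

print(f"mean age, income missing: {mean_age_missing:.1f}")
print(f"mean age, income present: {mean_age_present:.1f}")
print(f"corr(age, missing flag):  {flag_age_corr:.2f}")
```

A clear gap between the two groups argues against MCAR and is consistent with MAR. It cannot rule out MNAR, because that depends on values you never observed.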

When Deletion Is the Right Call

Dropping rows with missing values, df.dropna(), gets a bad reputation, but it's genuinely fine under specific conditions: the data is MCAR, and you're losing only a small share of rows (5% is a common rule of thumb). When both are true, the dropped rows are a random subset, so the remaining data is unbiased and you give up little statistical power.

The trap is using deletion on MAR data. If missingness correlates with any feature, your training set is now a skewed sample. A churn model trained after dropping rows where last_login is missing will never see the behavioral pattern of users who haven't logged in for months — which is probably your highest-risk group.

Column deletion is even blunter. Drop a feature with 60% missing values and you also throw away the 40% of values that were actually observed. Worth keeping in mind before you reach for df.dropna(axis=1).
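To see the difference between deleting under MCAR and deleting under MAR, here is a small simulation. It is a sketch under stated assumptions: the data is synthetic and the drop rates (20% flat, or 50% versus 5% by age group) are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5000
df = pd.DataFrame({
    "age": rng.integers(20, 70, n).astype(float),
    "income": rng.normal(50_000, 15_000, n),
})
true_mean_age = df["age"].mean()

# MCAR: every income value has the same 20% chance of being lost.
mcar = df.copy()
mcar.loc[rng.random(n) < 0.20, "income"] = np.nan

# MAR (hypothetical rates): under-35s lose income 50% of the time, everyone else 5%.
mar = df.copy()
p_drop = np.where(mar["age"] < 35, 0.50, 0.05)
mar.loc[rng.random(n) < p_drop, "income"] = np.nan

# Listwise deletion, then measure how far the surviving sample drifted.
mcar_shift = mcar.dropna()["age"].mean() - true_mean_age
mar_shift = mar.dropna()["age"].mean() - true_mean_age

print(f"shift in mean age after dropna (MCAR): {mcar_shift:+.2f}")
print(f"shift in mean age after dropna (MAR):  {mar_shift:+.2f}")
```

Under MCAR the surviving rows still look like the full sample. Under MAR the surviving sample skews older, even though age itself was never missing.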

Simple Statistical Imputation

The most common starting point: replace each missing value with the mean, median, or mode of that column.

python

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

data = {
    "age":    [34, np.nan, 29, 45, np.nan],
    "income": [45000, 62000, np.nan, 78000, 54000],
    "rooms":  [3, 4, 2, np.nan, 3],
}
df = pd.DataFrame(data)

imp = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
print(df_imputed)

text

    age   income  rooms
0  34.0  45000.0    3.0
1  34.0  62000.0    4.0
2  29.0  58000.0    2.0
3  45.0  78000.0    3.0
4  34.0  54000.0    3.0

Use strategy="mean" for roughly symmetric numerical features, strategy="median" for skewed distributions or data with outliers, and strategy="most_frequent" for categorical columns.

The catch: every imputed value in a column is the same number, the column's central tendency. You're artificially concentrating the distribution at one point, which shrinks variance and weakens correlations between features. If 30% of income is missing and you fill it all with the median, you've just flattened something that should spread. Downstream models inherit the distortion: linear models, for example, tend to see attenuated relationships between the imputed feature and everything else.
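The shrinkage is easy to measure on synthetic data. The sketch below assumes two correlated columns (the coefficients are arbitrary), removes about 30% of one of them completely at random, and compares standard deviation and correlation before and after median imputation.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(2)
n = 2000
age = rng.normal(40, 10, n)
income = 1_000 * age + rng.normal(0, 5_000, n)   # income correlated with age

X = np.column_stack([age, income])
X_missing = X.copy()
X_missing[rng.random(n) < 0.30, 1] = np.nan      # knock out ~30% of income (MCAR)

X_filled = SimpleImputer(strategy="median").fit_transform(X_missing)

std_before = X[:, 1].std()
std_after = X_filled[:, 1].std()
corr_before = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
corr_after = np.corrcoef(X_filled[:, 0], X_filled[:, 1])[0, 1]

print(f"income std:        {std_before:,.0f} -> {std_after:,.0f}")
print(f"corr(age, income): {corr_before:.2f} -> {corr_after:.2f}")
```

Both numbers drop after imputation, even though the missingness here is the most benign kind.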

The rule that matters most: fit your imputer on training data only, then use those learned statistics to transform both train and test sets. Fitting on the combined dataset leaks test distribution into training and inflates performance metrics.

python

from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)

imp = SimpleImputer(strategy="median")
X_train_imputed = imp.fit_transform(X_train)
X_test_imputed = imp.transform(X_test)
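In practice, one convenient way to honor that rule is to put the imputer inside a Pipeline, so that cross-validation refits it on each training fold and never sees the held-out fold. A minimal sketch on synthetic data (the dataset and the LogisticRegression model are placeholders, not part of the example above):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # label built before any values go missing
X[rng.random(X.shape) < 0.10] = np.nan    # ~10% of cells missing (MCAR)

# The imputer is a pipeline step, so each CV fold fits it on that fold's training rows only.
pipe = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)

print(scores.round(3), scores.mean().round(3))
```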

Multivariate Imputation

Simple imputation treats each column in isolation. Multivariate methods use relationships between features to produce better estimates — because in most real datasets, columns are correlated and that correlation is exactly the signal you want to exploit.

KNN Imputation

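KNN imputation finds, for each row with a missing value, the k nearest rows that do have a value for that feature, then fills the gap with their average (optionally weighted by distance). scikit-learn's KNNImputer works on numerical data only, so categorical columns need to be encoded first. Proximity is measured with nan_euclidean_distances, which computes the distance over the features present in both rows and scales it up to compensate for the ones it had to skip.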

python

from sklearn.impute import KNNImputer

data = {
    "age":    [34, np.nan, 29, 45, 38],
    "income": [45000, 62000, 41000, 78000, 55000],
    "rooms":  [3, 4, 2, 5, 3],
}
df = pd.DataFrame(data)

knn_imp = KNNImputer(n_neighbors=2, weights="distance")
df_knn = pd.DataFrame(knn_imp.fit_transform(df), columns=df.columns)
print(df_knn["age"])

text

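0    34.000000
1    40.130435
2    29.000000
3    45.000000
4    38.000000
Name: age, dtype: float64

Row 1's missing age gets imputed from its two nearest neighbors (rows 4 and 3, weighted by inverse distance) rather than the global median. The result is sensitive to local structure in the data. Note that the features here are unscaled, so income dominates the distance; in practice, scale features before KNN imputation.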
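To make the distance metric concrete, here is a tiny worked example of nan_euclidean_distances on two made-up rows. Only the coordinates present in both rows contribute, and the squared sum is rescaled by (total features / features present in both).

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([
    [3.0, 4.0, np.nan],   # third feature is missing
    [0.0, 0.0, 0.0],
])

# Shared coordinates: the first two. Squared differences: 3**2 + 4**2 = 25.
# Rescale by 3 total features / 2 shared features: 25 * 1.5 = 37.5.
D = nan_euclidean_distances(X)
print(D[0, 1], np.sqrt(37.5))   # the two numbers agree
```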

The downside is cost: computing pairwise distances on a large dataset is O(n²). On 10,000+ rows with many features, it becomes slow enough to be a bottleneck in a training pipeline.
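A related practical point: KNNImputer does no scaling of its own, so a column with large units (income, in the example above) can dominate the distance. A common pattern, sketched here with the same five rows, is to standardize first and map back afterwards. StandardScaler ignores NaNs when fitting and passes them through when transforming, which is what makes this ordering possible.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([
    [34.0,   45_000.0, 3.0],
    [np.nan, 62_000.0, 4.0],
    [29.0,   41_000.0, 2.0],
    [45.0,   78_000.0, 5.0],
    [38.0,   55_000.0, 3.0],
])

# Scale, then impute in the scaled space so every feature counts equally.
scale_then_impute = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=2))
X_scaled_filled = scale_then_impute.fit_transform(X)

# Map back to the original units for readability.
scaler = scale_then_impute.named_steps["standardscaler"]
X_filled = scaler.inverse_transform(X_scaled_filled)
print(X_filled.round(1))
```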

MICE — Multivariate Imputation by Chained Equations

MICE treats each feature with missing values as a regression target, using all other features as predictors. It runs in rounds:

1. Initialize all missing values with the column median (placeholder).
2. For each feature with missing values: fit a regression model on rows where that feature is present, predict the missing rows, and update the imputed values.
3. Repeat for every feature. One full pass is one iteration.
4. Run for up to N iterations, stopping once the values stabilize.
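The steps above can be sketched from scratch in a few lines. This is a toy illustration, not how any library implements it: chained_imputation is a hypothetical helper, it uses plain LinearRegression as the per-feature model, and it runs a fixed number of passes instead of checking for convergence.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def chained_imputation(X, n_iter=10):
    """Toy chained-equations imputer (illustrative helper, not a library API)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Step 1: placeholder fill with each column's observed median.
    medians = np.nanmedian(X, axis=0)
    rows, cols = np.where(missing)
    filled[rows, cols] = medians[cols]

    for _ in range(n_iter):                  # Step 4: repeat full passes
        for j in range(X.shape[1]):          # Step 3: visit every feature
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            predictors = np.delete(filled, j, axis=1)
            # Step 2: fit where feature j was observed, predict where it was not.
            model = LinearRegression().fit(predictors[~miss_j], filled[~miss_j, j])
            filled[miss_j, j] = model.predict(predictors[miss_j])
    return filled

X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
    [5.0, 10.0],
])
X_filled = chained_imputation(X)
print(X_filled[2, 1])   # close to 6.0, since column 1 is exactly 2 * column 0
```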

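Scikit-learn's implementation of this idea is IterativeImputer. It's still marked experimental, which is why the enable_iterative_imputer import below is required and why its API may change between releases, but it works reliably in practice. By default it returns a single imputed dataset rather than the multiple draws of classical MICE.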

python

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

mice_imp = IterativeImputer(max_iter=10, random_state=0)
df_mice = pd.DataFrame(mice_imp.fit_transform(df), columns=df.columns)
print(df_mice["age"])

text

0    34.000000
1    35.893421
2    29.000000
3    45.000000
4    38.000000
dtype: float64

Within an iteration, each regression uses the values already imputed earlier in that same iteration. The default estimator is BayesianRidge; swap in any scikit-learn regressor via the estimator parameter.

missForest

missForest runs the same iterative idea but swaps the regressor for a Random Forest, which captures nonlinear relationships and handles mixed data types (numerical + categorical) in one pass. It initializes with mean/mode, then predicts each feature's missing values using a forest trained on observed rows, and repeats until the difference between iterations stops shrinking.

In scikit-learn, you can approximate missForest by passing a random forest estimator to IterativeImputer. This covers numerical features only; categorical columns have to be encoded before imputation:

python

from sklearn.ensemble import RandomForestRegressor

rf_imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10,
    random_state=0,
)
df_rf = pd.DataFrame(rf_imp.fit_transform(df), columns=df.columns)

Little hyperparameter tuning is usually needed, since random forests tend to behave reasonably at their defaults. The tradeoff is compute: fitting a forest per feature per iteration is significantly slower than the default BayesianRidge estimator.

Marking What Was Missing

One thing most imputation tutorials skip: the fact that a value was missing is itself informative, and you should preserve that signal.

MissingIndicator creates binary columns flagging each position where a value was absent. Feed those alongside the imputed features and your model can learn whether the pattern of missingness predicts the target independently of the imputed values.

python

from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion
import numpy as np

X = np.array([[1, 2], [np.nan, 3], [7, np.nan], [4, 5]])

union = FeatureUnion([
    ("imputer", SimpleImputer(strategy="mean")),
    ("indicator", MissingIndicator()),
])

X_out = union.fit_transform(X)
print(X_out)

text

[[1.         2.         0.         0.        ]
 [4.         3.         1.         0.        ]
 [7.         3.33333333 0.         1.        ]
 [4.         5.         0.         0.        ]]

The last two columns tell the model which values were originally absent. This is especially useful for MNAR data, where the missingness pattern may be a stronger predictor than the imputed value itself. SimpleImputer, KNNImputer, and IterativeImputer all expose an add_indicator=True parameter that does this in one step.
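For comparison, here is the add_indicator=True shortcut on the same four-row array. It should yield the same layout as the FeatureUnion version: imputed columns first, then one flag column per feature that had missing values when the imputer was fitted.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1, 2], [np.nan, 3], [7, np.nan], [4, 5]])

# One estimator does both jobs: mean imputation plus appended missingness flags.
imp = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imp.fit_transform(X)

print(X_out.shape)   # two imputed columns + two indicator columns
print(X_out)
```

One design note: with this shortcut, flag columns are only created for features that contained missing values during fit, so a feature that is complete in training but has gaps at prediction time gets no flag.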

Deep Learning Methods

Classical methods win on most tabular datasets, but deep learning has a real edge when N is very large and missingness patterns are complex.

GAIN (Generative Adversarial Imputation Networks) frames imputation as an adversarial game: a generator fills in missing values, a discriminator tries to identify which values were filled. They train against each other until the generator produces imputations indistinguishable from real data. GAIN handles MCAR well; it struggles with MAR and MNAR where the missingness mechanism itself carries structure the adversarial objective doesn't model.

MIWAE (Missing-data Importance-Weighted Autoencoder) is theoretically cleaner — it maximizes a tight lower bound on the log-likelihood of observed data, making it explicitly designed for MAR scenarios. It's slower to train and harder to tune than GAIN.

Empirically: MICE with CART outperforms deep learning models on bias, MSE, and coverage across most realistic settings. At N=50,000+ rows, GAIN matches missForest accuracy while running 40x faster. For typical dataset sizes in applied ML (hundreds to tens of thousands of rows), skip the deep learning methods unless you have a specific reason to use them.

Choosing a Method

The right choice is a function of three variables: how much data is missing, the missingness mechanism, and how much compute you can afford.

% Missing	Mechanism	Recommended
< 5%	Any	SimpleImputer (mean/median/mode)
5–20%	MCAR / MAR	KNNImputer or MICE
5–20%, mixed types	MAR	missForest (IterativeImputer + RF)
> 20%	MNAR	Domain knowledge + MissingIndicator

Always add MissingIndicator columns alongside any imputation strategy.

Two common mistakes are worth naming explicitly. First, people impute before splitting train/test, so the imputer's learned statistics (mean, nearest neighbors, regression weights) incorporate the test set. This is data leakage, and it tends to inflate evaluation metrics. Second, people impute and then discard the indicator columns, throwing away the structural information about which values were absent. Putting the imputer and the indicator inside one pipeline that is fitted on the training split avoids both.
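Of the three variables, the share of missing data is the only one you can read straight off the dataframe. A small sketch, using a made-up ten-row frame and the percentage bands from the table above (the mechanism still has to come from your own analysis):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [34, np.nan, 29, 45, 38, 51, 27, 40, 33, 60],
    "income": [45_000, 62_000, np.nan, np.nan, 55_000,
               np.nan, 48_000, 71_000, np.nan, 52_000],
    "rooms":  [3, 4, 2, 5, 3, 4, 2, 3, 3, 4],
})

# Fraction of missing cells per column.
frac_missing = df.isna().mean()

# Bucket each column into the bands used by the decision table.
bands = pd.cut(
    frac_missing,
    bins=[-np.inf, 0.05, 0.20, np.inf],
    labels=["< 5%", "5-20%", "> 20%"],
)

print(pd.DataFrame({"frac_missing": frac_missing, "band": bands}))
```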

End-to-End Pipeline

Putting it together on a realistic dataset:

python

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer, MissingIndicator
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "age":       rng.integers(22, 65, n).astype(float),
    "income":    rng.normal(55000, 18000, n),
    "sessions":  rng.integers(1, 200, n).astype(float),
    "tenure":    rng.integers(1, 60, n).astype(float),
})
mask = rng.random((n, 4)) < 0.12
for i, col in enumerate(X.columns):
    X.loc[mask[:, i], col] = np.nan

y = (X["income"].fillna(X["income"].median()) > 55000).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

imputer = IterativeImputer(max_iter=10, random_state=0)
indicator = MissingIndicator(features="all")

features = FeatureUnion([
    ("imputed", imputer),
    ("flags",   indicator),
])

pipe = Pipeline([
    ("features", features),
    ("scaler",   StandardScaler()),
    ("model",    RandomForestClassifier(n_estimators=100, random_state=0)),
])

pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")

text

Test accuracy: 0.910

The FeatureUnion step runs the imputer and indicator in parallel, concatenating their outputs — the model receives both the imputed feature values and the binary flags marking which were originally absent.

Imputation vs. Data Synthesis — Not the Same Problem

These two terms get conflated often enough that it's worth being direct about the difference.

Imputation operates on an existing dataset. Real rows exist; some values within those rows are missing. The job is to estimate what the missing value would have been for that specific observation, using the information you already have. The row is real. The imputed value is your best guess at a specific, real measurement you failed to capture.

Data synthesis generates entirely new observations. No real row is being completed — a generative model learns the joint distribution of the data and samples from it to produce rows that never existed. The goal is expanding a dataset, creating training data for a downstream model, augmenting rare classes, or generating privacy-preserving stand-ins for sensitive records.

The practical difference matters:

	Imputation	Data Synthesis
Starting point	Real rows with gaps	Nothing — generating from scratch
Goal	Recover a specific missing value	Expand or augment the dataset
Validity constraint	Must be consistent with the observed row	Must match the learned distribution
Evaluation	Reconstruction error on held-out values	Distributional fidelity, downstream model performance
Tools	SimpleImputer, MICE, KNNImputer	GANs, VAEs, SMOTE, LLM-based synthesis
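The "reconstruction error on held-out values" row can be made concrete: hide some values you actually know, impute them, and score the imputations against the truth. The sketch below assumes synthetic, strongly correlated columns and hides values completely at random, so it says nothing about how either imputer behaves under MAR or MNAR.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
X_true = np.column_stack([
    x1,
    2.0 * x1 + rng.normal(scale=0.1, size=n),
    -x1 + rng.normal(scale=0.1, size=n),
])

# Hide ~10% of the second column; the true values stay available for scoring.
held_out = rng.random(n) < 0.10
X_masked = X_true.copy()
X_masked[held_out, 1] = np.nan

def rmse_on_held_out(imputer):
    """RMSE between imputed and true values, on the hidden cells only."""
    X_hat = imputer.fit_transform(X_masked)
    err = X_hat[held_out, 1] - X_true[held_out, 1]
    return float(np.sqrt(np.mean(err ** 2)))

rmse_median = rmse_on_held_out(SimpleImputer(strategy="median"))
rmse_knn = rmse_on_held_out(KNNImputer(n_neighbors=5))

print(f"median RMSE: {rmse_median:.3f}")
print(f"KNN RMSE:    {rmse_knn:.3f}")
```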

GAIN and MIWAE occupy an uncomfortable middle — they use generative architectures (GANs, VAEs) but their output is an imputed value constrained to a real row, not a new observation. They're generative mechanisms applied to an imputation problem. Calling them synthesis is a category error.

Where it gets genuinely blurry: SMOTE (Synthetic Minority Oversampling Technique) interpolates between real minority-class examples to generate new points. Some people call that imputation; it's closer to synthesis. The boundary is whether you're completing a real observation or inventing a new one.
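The interpolation step at the heart of SMOTE is short enough to write out. This is a simplified sketch with two made-up minority-class rows; full implementations, such as the one in imbalanced-learn, also handle picking the neighbor from the k nearest minority-class points.

```python
import numpy as np

rng = np.random.default_rng(5)

# Two real minority-class rows (made-up numbers).
x_i = np.array([1.0, 10.0])
x_neighbor = np.array([3.0, 14.0])

# SMOTE-style step: pick a random point on the segment between them.
lam = rng.random()                           # uniform in [0, 1)
x_synthetic = x_i + lam * (x_neighbor - x_i)

print(x_synthetic)
```

No existing row is being completed here: x_synthetic is a new observation, which is why this sits on the synthesis side of the boundary.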

The reason this matters operationally: synthesis errors compound when you treat synthetic rows as ground truth in downstream tasks. Imputation errors affect specific cells in real rows — localized, auditable. Synthesis errors can silently shift the learned distribution of your entire training set.

There's a reason MNAR gets a separate row in every decision guide: no imputation technique handles it cleanly. You're trying to estimate a value whose absence is correlated with the value itself. No amount of information from other features resolves that circular dependency.
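A small simulation shows the circularity. The mechanism below is hypothetical: the chance of withholding income rises linearly with a person's income rank, from 5% for the lowest earner to 65% for the highest. Whatever gets filled in is computed from the values people were willing to report.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(6)
n = 5000
income = rng.lognormal(mean=10.8, sigma=0.5, size=n)

# Hypothetical MNAR mechanism: higher earners withhold income more often.
rank = income.argsort().argsort() / (n - 1)   # 0 = lowest earner, 1 = highest
p_withhold = 0.05 + 0.60 * rank
reported = np.where(rng.random(n) < p_withhold, np.nan, income)

# Median imputation can only reuse what was reported.
filled = SimpleImputer(strategy="median").fit_transform(reported.reshape(-1, 1)).ravel()

true_mean = income.mean()
observed_mean = np.nanmean(reported)
filled_mean = filled.mean()

print(f"true mean:     {true_mean:,.0f}")
print(f"observed mean: {observed_mean:,.0f}")
print(f"imputed mean:  {filled_mean:,.0f}")
```

Both the observed and the imputed means sit below the true mean, and nothing inside the dataset tells you by how much.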

The honest approaches for MNAR data are pattern-mixture models (explicitly modeling the distribution of observed vs. missing), selection models (modeling the missingness mechanism alongside the outcome), or Bayesian approaches that incorporate prior knowledge about the missing distribution. In practice, most teams acknowledge MNAR, add MissingIndicator columns so the model can at least learn from the pattern, and document the limitation rather than pretending an imputation fixed it.

The discomfort with MNAR is worth sitting with. A model trained on imputed MNAR data may perform well on your test set — which has the same missingness structure — and fail on deployment data where the mechanism shifts. The distribution of who doesn't report income is not stable across years, geographies, or user bases.

Imputation is one of the few preprocessing steps where the right answer genuinely depends on something you often can't know for certain — why the data is absent. That uncertainty doesn't disappear when you pick a method; it just gets baked into your estimates more or less honestly.
