← View series: statistics
~/blog
Exploratory Data Analysis
Statistical tests answer specific questions. EDA finds the questions worth asking.
John Tukey put it plainly: "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question." EDA is the process of forming hypotheses before you test them. Skip it and you risk fitting a model on data with hidden class imbalance, undetected outliers, near-constant features, or extreme scale mismatches — the model will train, but the answers will be wrong.
EDA has three phases:
- Understand the shape of the data — structure, types, missingness
- Summarize each variable — univariate distributions
- Investigate relationships — bivariate and multivariate structure
Anchors used throughout:
import pandas as pd
data = pd.DataFrame({
'fold': [1, 2, 3, 4, 5, 6],
'accuracy': [0.82, 0.79, 0.91, 0.85, 0.78, 0.88],
'precision': [0.80, 0.76, 0.89, 0.83, 0.75, 0.86],
'train_size': [4200, 4200, 4200, 4200, 4200, 4200]
})
latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]The tabular data DataFrame represents six cross-validation evaluation runs. latency_ms shows a right-skewed continuous variable for distribution examples.
Phase 1 — Understand the Shape of the Data
These five operations are the first thing to run on any new dataset — in this order:
print(data.shape)
print(data.dtypes)
print(data.isnull().sum())
print(data.duplicated().sum())
print(data.describe())(6, 4)
fold int64
accuracy float64
precision float64
train_size int64
dtype: object
fold 0
accuracy 0
precision 0
train_size 0
dtype: int64
0
fold accuracy precision train_size
count 6.000000 6.000000 6.000000 6.0
mean 3.500000 0.838333 0.815000 4200.0
std 1.870829 0.051299 0.054122 0.0
min 1.000000 0.780000 0.750000 4200.0
25% 2.250000 0.797500 0.770000 4200.0
50% 3.500000 0.835000 0.815000 4200.0
75% 4.750000 0.872500 0.852500 4200.0
max 6.000000 0.910000 0.890000 4200.0
What to look for in df.describe():
| Signal | What It Tells You |
|---|---|
count < n_rows | Missing values — cross-check with isnull().sum() |
min or max far from mean | Potential outliers |
std ≈ 0 | Near-constant column (low information, can cause numerical issues in linear models) |
mean far from 50% | Skewed distribution — use IQR instead of SD for summaries |
For the anchor: train_size has std=0 — all folds used the same dataset size. This column adds no information and would cause a zero-variance error in many ML preprocessing steps. Flag it for removal.
Missing Value Patterns
Not all missingness is the same. The type determines the correct response:
| Type | Description | Detection | Response |
|---|---|---|---|
| MCAR (Missing Completely At Random) | No pattern — missing randomly | df.isnull().corr() shows no correlations | Drop rows or impute mean/median |
| MAR (Missing At Random) | Missingness depends on another observed variable | Missingness in col A correlates with col B | Conditional imputation (e.g., KNN or model-based) |
| MNAR (Missing Not At Random) | Missingness depends on the missing value itself | Cannot detect from data alone | Domain knowledge required — dropping rows creates systematic bias |
Example of MNAR: users with very low income are less likely to report income. Dropping those rows underestimates poverty rates.
# Detect which columns have missingness correlated with other columns
# (MCAR vs MAR diagnostic)
import numpy as np
missing_indicators = data.isnull().astype(int)
if missing_indicators.sum().sum() > 0:
print(missing_indicators.corr())
else:
print("No missing values in dataset")No missing values in dataset
Phase 2 — Univariate Distributions
Inspect each variable individually before looking at relationships. Shape, skewness, and outliers are univariate properties.
Continuous Variables
import numpy as np
from scipy import stats
for col in data.select_dtypes(include='number').columns:
vals = data[col].dropna()
skew = stats.skew(vals)
kurt = stats.kurtosis(vals)
print(f"{col:>12}: mean={vals.mean():.3f}, median={vals.median():.3f}, "
f"std={vals.std():.3f}, skew={skew:.3f}, kurt={kurt:.3f}") fold: mean=3.500, median=3.500, std=1.871, skew=0.000, kurt=-1.200
accuracy: mean=0.838, median=0.835, std=0.051, skew=0.280, kurt=-1.483
precision: mean=0.815, median=0.815, std=0.054, skew=0.152, kurt=-1.482
train_size: mean=4200.000, median=4200.000, std=0.000, skew=0.000, kurt=0.000
accuracy skewness = 0.28 — approximately symmetric (|γ₁| < 0.5). train_size std = 0 — constant, uninformative.
For comparison, latency_ms has strong right skew:
latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]
print(f"latency_ms: mean={np.mean(latency_ms):.1f}, median={np.median(latency_ms):.1f}, "
f"skew={stats.skew(latency_ms):.3f}")latency_ms: mean=261.4, median=151.5, skew=2.560
latency_ms skewness = 2.56 — highly right-skewed. Mean (261ms) is pulled far right of median (151ms) by the 890ms spike. Consider log transform before modeling.
The SVG below shows histogram (shape) and boxplot (outlier detection) side-by-side for both distributions:
Categorical Variables
Use bar charts, not pie charts — proportions are harder to compare in a pie. For the anchor, fold is ordered discrete (not truly categorical here), but the pattern applies:
# For a genuinely categorical column:
# print(df['category_col'].value_counts())
# Flag if any category has < 5% of total
n = len(data)
for col in data.select_dtypes(include='object').columns:
counts = data[col].value_counts()
low_freq = counts[counts / n < 0.05]
if len(low_freq) > 0:
print(f"Low-frequency categories in '{col}': {low_freq.index.tolist()}")Phase 3 — Bivariate and Multivariate Relationships
Correlation Matrix
# Compute correlation matrix (Pearson r by default)
corr = data[['accuracy', 'precision', 'fold']].corr()
print(corr.round(3)) accuracy precision fold
accuracy 1.000 0.998 0.219
precision 0.998 1.000 0.169
fold 0.219 0.169 1.000
What to flag:
|r| > 0.9→ multicollinearity concern for linear models: accuracy and precision have r = 0.998. Including both as features in a linear regression introduces near-singular behavior — one should be dropped or PCA applied.|r| < 0.05→ potentially uninformative feature for a linear model (though nonlinear models may still find it useful).
SVG below shows the 3×3 correlation heatmap:
Scatter Plot
# Scatter: accuracy vs precision
import matplotlib.pyplot as plt
plt.scatter(data['accuracy'], data['precision'])
plt.xlabel('accuracy'); plt.ylabel('precision')
plt.title('accuracy vs precision (r=0.998)')The scatter would show near-perfect alignment — every point lies almost exactly on a line. In a feature matrix, this degree of collinearity means one of these columns adds almost no additional information to a linear model.
Pair Plot
import seaborn as sns
g = sns.pairplot(data[['accuracy', 'precision', 'fold']], diag_kind='kde')The pair plot diagonal shows the marginal distribution of each variable (KDE curve). The off-diagonal cells show all pairwise scatter plots. If you color by a categorical target (hue='target'), class separation becomes visible without running a single model — useful for spotting whether the classes are linearly separable.
Group-wise Comparisons
When a grouping variable exists, groupby shows whether group means differ — setting up a future t-test or ANOVA:
# Hypothetical: did early folds (1-3) vs late folds (4-6) differ in accuracy?
data['phase'] = data['fold'].apply(lambda x: 'early' if x <= 3 else 'late')
print(data.groupby('phase')[['accuracy', 'precision']].agg(['mean', 'std', 'count'])) accuracy precision
mean std count mean std count
phase
early 0.840000 0.062450 3 0.816667 0.065064 3
late 0.836667 0.045092 3 0.813333 0.045735 3
The means are nearly identical — no obvious phase effect. This tells you the hypothesis "later folds have higher accuracy due to better data ordering" is not supported here. You would still run a t-test to confirm, but EDA just saved you from chasing a non-existent effect.
Outlier Investigation
EDA detects outliers. It does not automatically remove them. Three detection strategies:
Strategy 1 — Z-score method:
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
mean_acc = sum(accuracy) / len(accuracy)
std_acc = (sum((x - mean_acc)**2 for x in accuracy) / (len(accuracy)-1)) ** 0.5
z_scores = [(x - mean_acc) / std_acc for x in accuracy]
for fold, (acc, z) in enumerate(zip(accuracy, z_scores), 1):
flag = " ← outlier" if abs(z) > 3 else ""
print(f"Fold {fold}: acc={acc:.2f}, z={z:+.3f}{flag}")Fold 1: acc=0.82, z=-0.358
Fold 2: acc=0.79, z=-0.944
Fold 3: acc=0.91, z=+1.401
Fold 4: acc=0.85, z=+0.228
Fold 5: acc=0.78, z=-1.140
Fold 6: acc=0.88, z=+0.815
No fold exceeds |z| = 3. All values are within 1.5 SDs of the mean.
Strategy 2 — IQR method (robust):
import numpy as np
latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]
q1 = np.percentile(latency_ms, 25)
q3 = np.percentile(latency_ms, 75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
print(f"Q1={q1:.1f}, Q3={q3:.1f}, IQR={iqr:.1f}")
print(f"Fences: [{lower:.1f}, {upper:.1f}]")
outliers = [x for x in latency_ms if x < lower or x > upper]
print(f"Outliers: {outliers}")Q1=143.5, Q3=196.5, IQR=53.0
Fences: [64.0, 276.0]
Outliers: [312, 890]
Both 312ms and 890ms are flagged by IQR. The z-score method on the same data would not flag them as strongly because the outliers themselves inflate the standard deviation, shrinking their z-scores.
The investigation rule — never drop without understanding:
| Finding | Action |
|---|---|
| Data error (mislabeled sample, sensor malfunction, wrong units) | Correct the error or remove the row |
| Rare genuine event (infrastructure failure, unusual request) | Keep it, document it, consider robust methods |
| Influential point (changes model significantly when included/excluded) | Analyze with and without; report both results |
For latency_ms = 890: Is this a real request that took 890ms (perhaps a cold-start or cache miss) or a logging error? Without that answer, removing it is unjustified.
EDA Checklist
Before modeling or hypothesis testing, confirm every item below:
- Shape: n_rows, n_cols, dtypes verified — constant columns (std=0) identified
- Missing values: counts checked, pattern type assessed (MCAR / MAR / MNAR)
- Duplicates: checked and resolved
- Univariate — numeric: histogram and boxplot inspected, skewness computed, log transform flagged if |γ₁| > 1
- Univariate — categorical: value counts checked, low-frequency categories (< 5%) flagged
- Bivariate: correlation matrix computed, |r| > 0.9 (collinearity) and |r| < 0.05 (uninformative) both flagged
- Outliers: detected via both IQR and z-score, each outlier investigated before any decision to remove
- Scale: whether features need standardization before modeling noted (especially for distance-based algorithms)
Related Concepts
EDA draws on every descriptive statistics technique covered so far: mean and SD (Phase 1 summaries), skewness and kurtosis (Phase 2 shape checks), percentiles and IQR (boxplots and outlier fences), histograms (univariate distributions), correlation (Phase 3 relationships). Once EDA confirms the data is clean and you understand its structure, the next step is inferential statistics — forming specific hypotheses and testing them with t-tests, chi-square tests, ANOVA, or regression.
When This Breaks Down
EDA is not a substitute for domain knowledge. df.describe() can tell you the mean is far from the median but not whether that is a genuine data property or a labeling error. Correlation matrices find linear associations but miss everything else — always plot the scatter. Pair plots scale poorly beyond 10 variables; for wide datasets, start with a variance threshold to drop near-constant columns, then use hierarchical clustering of the correlation matrix to find groups of correlated features. Finally: EDA is inherently subjective. Two analysts may draw different hypotheses from the same plots. Document what you found, what you assumed, and what you left for the statistical tests to decide.
Test Your Understanding
-
You load a dataset and find that
df.describe()showsmean=35.0and50%=18.0for a numeric column. What does this suggest about the distribution's shape? What would you plot first to confirm, and what transform would you apply before modeling? -
You detect 3 rows with
accuracy = 0.00in a model evaluation dataset. List the three possible explanations (data error, genuine result, influential point) and describe how you would determine which one applies. -
A correlation matrix shows r = 0.97 between
feature_Aandfeature_B. You plan to train a logistic regression. What specific problem does this create, and what two remedies would you consider? -
You compute the IQR-based outlier fences for inference latency and flag 8 out of 200 requests. Your colleague argues all 8 should be removed because they "contaminate the distribution." What would you ask about each flagged request before deciding?
-
Why does the z-score method underperform compared to IQR when detecting outliers in a heavily right-skewed distribution? Give a concrete example where z-score misses an outlier that IQR catches.