Back to blog
← View series: statistics

~/blog

Exploratory Data Analysis

Jun 21, 202611 min readBy Mohammed Vasim
StatisticsMathData Science

Statistical tests answer specific questions. EDA finds the questions worth asking.

John Tukey put it plainly: "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question." EDA is the process of forming hypotheses before you test them. Skip it and you risk fitting a model on data with hidden class imbalance, undetected outliers, near-constant features, or extreme scale mismatches — the model will train, but the answers will be wrong.

EDA has three phases:

  1. Understand the shape of the data — structure, types, missingness
  2. Summarize each variable — univariate distributions
  3. Investigate relationships — bivariate and multivariate structure

Anchors used throughout:

python
import pandas as pd

data = pd.DataFrame({
    'fold':       [1, 2, 3, 4, 5, 6],
    'accuracy':   [0.82, 0.79, 0.91, 0.85, 0.78, 0.88],
    'precision':  [0.80, 0.76, 0.89, 0.83, 0.75, 0.86],
    'train_size': [4200, 4200, 4200, 4200, 4200, 4200]
})

latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]

The tabular data DataFrame represents six cross-validation evaluation runs. latency_ms shows a right-skewed continuous variable for distribution examples.


Phase 1 — Understand the Shape of the Data

These five operations are the first thing to run on any new dataset — in this order:

python
print(data.shape)
print(data.dtypes)
print(data.isnull().sum())
print(data.duplicated().sum())
print(data.describe())
(6, 4) fold int64 accuracy float64 precision float64 train_size int64 dtype: object fold 0 accuracy 0 precision 0 train_size 0 dtype: int64 0 fold accuracy precision train_size count 6.000000 6.000000 6.000000 6.0 mean 3.500000 0.838333 0.815000 4200.0 std 1.870829 0.051299 0.054122 0.0 min 1.000000 0.780000 0.750000 4200.0 25% 2.250000 0.797500 0.770000 4200.0 50% 3.500000 0.835000 0.815000 4200.0 75% 4.750000 0.872500 0.852500 4200.0 max 6.000000 0.910000 0.890000 4200.0

What to look for in df.describe():

SignalWhat It Tells You
count < n_rowsMissing values — cross-check with isnull().sum()
min or max far from meanPotential outliers
std ≈ 0Near-constant column (low information, can cause numerical issues in linear models)
mean far from 50%Skewed distribution — use IQR instead of SD for summaries

For the anchor: train_size has std=0 — all folds used the same dataset size. This column adds no information and would cause a zero-variance error in many ML preprocessing steps. Flag it for removal.

Missing Value Patterns

Not all missingness is the same. The type determines the correct response:

TypeDescriptionDetectionResponse
MCAR (Missing Completely At Random)No pattern — missing randomlydf.isnull().corr() shows no correlationsDrop rows or impute mean/median
MAR (Missing At Random)Missingness depends on another observed variableMissingness in col A correlates with col BConditional imputation (e.g., KNN or model-based)
MNAR (Missing Not At Random)Missingness depends on the missing value itselfCannot detect from data aloneDomain knowledge required — dropping rows creates systematic bias

Example of MNAR: users with very low income are less likely to report income. Dropping those rows underestimates poverty rates.

python
# Detect which columns have missingness correlated with other columns
# (MCAR vs MAR diagnostic)
import numpy as np
missing_indicators = data.isnull().astype(int)
if missing_indicators.sum().sum() > 0:
    print(missing_indicators.corr())
else:
    print("No missing values in dataset")
No missing values in dataset

Phase 2 — Univariate Distributions

Inspect each variable individually before looking at relationships. Shape, skewness, and outliers are univariate properties.

Continuous Variables

python
import numpy as np
from scipy import stats

for col in data.select_dtypes(include='number').columns:
    vals = data[col].dropna()
    skew = stats.skew(vals)
    kurt = stats.kurtosis(vals)
    print(f"{col:>12}: mean={vals.mean():.3f}, median={vals.median():.3f}, "
          f"std={vals.std():.3f}, skew={skew:.3f}, kurt={kurt:.3f}")
fold: mean=3.500, median=3.500, std=1.871, skew=0.000, kurt=-1.200 accuracy: mean=0.838, median=0.835, std=0.051, skew=0.280, kurt=-1.483 precision: mean=0.815, median=0.815, std=0.054, skew=0.152, kurt=-1.482 train_size: mean=4200.000, median=4200.000, std=0.000, skew=0.000, kurt=0.000

accuracy skewness = 0.28 — approximately symmetric (|γ₁| < 0.5). train_size std = 0 — constant, uninformative.

For comparison, latency_ms has strong right skew:

python
latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]
print(f"latency_ms: mean={np.mean(latency_ms):.1f}, median={np.median(latency_ms):.1f}, "
      f"skew={stats.skew(latency_ms):.3f}")
latency_ms: mean=261.4, median=151.5, skew=2.560

latency_ms skewness = 2.56 — highly right-skewed. Mean (261ms) is pulled far right of median (151ms) by the 890ms spike. Consider log transform before modeling.

The SVG below shows histogram (shape) and boxplot (outlier detection) side-by-side for both distributions:

accuracy (symmetric)

2 1 2 1 0.78 0.85 0.91 No outliers (within Tukey fences)

latency_ms (right-skewed)

6 1 1 120–200 300–400 890ms 890 Outlier: 890ms spike

Categorical Variables

Use bar charts, not pie charts — proportions are harder to compare in a pie. For the anchor, fold is ordered discrete (not truly categorical here), but the pattern applies:

python
# For a genuinely categorical column:
# print(df['category_col'].value_counts())
# Flag if any category has < 5% of total
n = len(data)
for col in data.select_dtypes(include='object').columns:
    counts = data[col].value_counts()
    low_freq = counts[counts / n < 0.05]
    if len(low_freq) > 0:
        print(f"Low-frequency categories in '{col}': {low_freq.index.tolist()}")

Phase 3 — Bivariate and Multivariate Relationships

Correlation Matrix

python
# Compute correlation matrix (Pearson r by default)
corr = data[['accuracy', 'precision', 'fold']].corr()
print(corr.round(3))
accuracy precision fold accuracy 1.000 0.998 0.219 precision 0.998 1.000 0.169 fold 0.219 0.169 1.000

What to flag:

  • |r| > 0.9 → multicollinearity concern for linear models: accuracy and precision have r = 0.998. Including both as features in a linear regression introduces near-singular behavior — one should be dropped or PCA applied.
  • |r| < 0.05 → potentially uninformative feature for a linear model (though nonlinear models may still find it useful).

SVG below shows the 3×3 correlation heatmap:

Correlation Matrix

accuracy precision fold

accuracy precision fold

1.000 0.998 0.219 0.998 1.000 0.169 0.219 0.169 1.000

accuracy vs precision: r=0.998 → multicollinearity risk

Scatter Plot

python
# Scatter: accuracy vs precision
import matplotlib.pyplot as plt
plt.scatter(data['accuracy'], data['precision'])
plt.xlabel('accuracy'); plt.ylabel('precision')
plt.title('accuracy vs precision (r=0.998)')

The scatter would show near-perfect alignment — every point lies almost exactly on a line. In a feature matrix, this degree of collinearity means one of these columns adds almost no additional information to a linear model.

Pair Plot

python
import seaborn as sns
g = sns.pairplot(data[['accuracy', 'precision', 'fold']], diag_kind='kde')

The pair plot diagonal shows the marginal distribution of each variable (KDE curve). The off-diagonal cells show all pairwise scatter plots. If you color by a categorical target (hue='target'), class separation becomes visible without running a single model — useful for spotting whether the classes are linearly separable.

Group-wise Comparisons

When a grouping variable exists, groupby shows whether group means differ — setting up a future t-test or ANOVA:

python
# Hypothetical: did early folds (1-3) vs late folds (4-6) differ in accuracy?
data['phase'] = data['fold'].apply(lambda x: 'early' if x <= 3 else 'late')
print(data.groupby('phase')[['accuracy', 'precision']].agg(['mean', 'std', 'count']))
accuracy precision mean std count mean std count phase early 0.840000 0.062450 3 0.816667 0.065064 3 late 0.836667 0.045092 3 0.813333 0.045735 3

The means are nearly identical — no obvious phase effect. This tells you the hypothesis "later folds have higher accuracy due to better data ordering" is not supported here. You would still run a t-test to confirm, but EDA just saved you from chasing a non-existent effect.


Outlier Investigation

EDA detects outliers. It does not automatically remove them. Three detection strategies:

Strategy 1 — Z-score method:

python
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
mean_acc = sum(accuracy) / len(accuracy)
std_acc  = (sum((x - mean_acc)**2 for x in accuracy) / (len(accuracy)-1)) ** 0.5
z_scores = [(x - mean_acc) / std_acc for x in accuracy]

for fold, (acc, z) in enumerate(zip(accuracy, z_scores), 1):
    flag = " ← outlier" if abs(z) > 3 else ""
    print(f"Fold {fold}: acc={acc:.2f}, z={z:+.3f}{flag}")
Fold 1: acc=0.82, z=-0.358 Fold 2: acc=0.79, z=-0.944 Fold 3: acc=0.91, z=+1.401 Fold 4: acc=0.85, z=+0.228 Fold 5: acc=0.78, z=-1.140 Fold 6: acc=0.88, z=+0.815

No fold exceeds |z| = 3. All values are within 1.5 SDs of the mean.

Strategy 2 — IQR method (robust):

python
import numpy as np

latency_ms = [120, 135, 142, 148, 155, 189, 312, 890]
q1 = np.percentile(latency_ms, 25)
q3 = np.percentile(latency_ms, 75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

print(f"Q1={q1:.1f}, Q3={q3:.1f}, IQR={iqr:.1f}")
print(f"Fences: [{lower:.1f}, {upper:.1f}]")
outliers = [x for x in latency_ms if x < lower or x > upper]
print(f"Outliers: {outliers}")
Q1=143.5, Q3=196.5, IQR=53.0 Fences: [64.0, 276.0] Outliers: [312, 890]

Both 312ms and 890ms are flagged by IQR. The z-score method on the same data would not flag them as strongly because the outliers themselves inflate the standard deviation, shrinking their z-scores.

The investigation rule — never drop without understanding:

FindingAction
Data error (mislabeled sample, sensor malfunction, wrong units)Correct the error or remove the row
Rare genuine event (infrastructure failure, unusual request)Keep it, document it, consider robust methods
Influential point (changes model significantly when included/excluded)Analyze with and without; report both results

For latency_ms = 890: Is this a real request that took 890ms (perhaps a cold-start or cache miss) or a logging error? Without that answer, removing it is unjustified.


EDA Checklist

Before modeling or hypothesis testing, confirm every item below:

  • Shape: n_rows, n_cols, dtypes verified — constant columns (std=0) identified
  • Missing values: counts checked, pattern type assessed (MCAR / MAR / MNAR)
  • Duplicates: checked and resolved
  • Univariate — numeric: histogram and boxplot inspected, skewness computed, log transform flagged if |γ₁| > 1
  • Univariate — categorical: value counts checked, low-frequency categories (< 5%) flagged
  • Bivariate: correlation matrix computed, |r| > 0.9 (collinearity) and |r| < 0.05 (uninformative) both flagged
  • Outliers: detected via both IQR and z-score, each outlier investigated before any decision to remove
  • Scale: whether features need standardization before modeling noted (especially for distance-based algorithms)

EDA draws on every descriptive statistics technique covered so far: mean and SD (Phase 1 summaries), skewness and kurtosis (Phase 2 shape checks), percentiles and IQR (boxplots and outlier fences), histograms (univariate distributions), correlation (Phase 3 relationships). Once EDA confirms the data is clean and you understand its structure, the next step is inferential statistics — forming specific hypotheses and testing them with t-tests, chi-square tests, ANOVA, or regression.

When This Breaks Down

EDA is not a substitute for domain knowledge. df.describe() can tell you the mean is far from the median but not whether that is a genuine data property or a labeling error. Correlation matrices find linear associations but miss everything else — always plot the scatter. Pair plots scale poorly beyond 10 variables; for wide datasets, start with a variance threshold to drop near-constant columns, then use hierarchical clustering of the correlation matrix to find groups of correlated features. Finally: EDA is inherently subjective. Two analysts may draw different hypotheses from the same plots. Document what you found, what you assumed, and what you left for the statistical tests to decide.

Test Your Understanding

  1. You load a dataset and find that df.describe() shows mean=35.0 and 50%=18.0 for a numeric column. What does this suggest about the distribution's shape? What would you plot first to confirm, and what transform would you apply before modeling?

  2. You detect 3 rows with accuracy = 0.00 in a model evaluation dataset. List the three possible explanations (data error, genuine result, influential point) and describe how you would determine which one applies.

  3. A correlation matrix shows r = 0.97 between feature_A and feature_B. You plan to train a logistic regression. What specific problem does this create, and what two remedies would you consider?

  4. You compute the IQR-based outlier fences for inference latency and flag 8 out of 200 requests. Your colleague argues all 8 should be removed because they "contaminate the distribution." What would you ask about each flagged request before deciding?

  5. Why does the z-score method underperform compared to IQR when detecting outliers in a heavily right-skewed distribution? Give a concrete example where z-score misses an outlier that IQR catches.

Comments (0)

No comments yet. Be the first to comment!

Leave a comment