Data Transformations

Apr 25, 2026 · 12 min read · By Mohammed Vasim
Statistics · Math · Data Science

A linear regression model assumes its residuals are normally distributed. A k-NN classifier assumes all features contribute equally to distance. A Poisson regression assumes variance equals the mean. Real data rarely cooperates with these assumptions, but the data does not need to cooperate — the transformation does.

Transformations serve four distinct goals, and every technique in this post maps to at least one:

  1. Meet model assumptions: normal residuals for linear regression, constant variance for ANOVA, normal data for parametric tests
  2. Stabilize variance: when variance grows with the mean (heteroscedasticity), a transform can make it constant
  3. Linearize relationships: if y grows exponentially with x, log(y) grows linearly — enabling linear models on log scale
  4. Put features on comparable scales: features measured in different units dominate distance-based models; standardization removes this

The Anchors

Right-skewed data (latency):

latency_ms = [89, 102, 145, 98, 203, 87, 156, 121, 310, 95]

n=10. Mean=140.6 ms, sample SD=70.0 ms (population SD=66.4 ms). Right-skewed (skewness=1.56): a few slow requests pull the tail.

Symmetric data (accuracy):

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

n=6. Mean=0.838, sample SD=0.051. Used to demonstrate standardization.

Log Transformation

Goal: goal 1 (normality), goal 2 (variance stabilization), goal 3 (linearize multiplicative relationships).

When: data is positive and right-skewed — latency, request sizes, income, count data. Or when the relationship between variables is multiplicative rather than additive.

Formula: x_new = log(x). When x can be zero: x_new = log(x + 1).

For the latency anchor: log(89)=4.49, log(102)=4.62, ..., log(310)=5.74.
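
A minimal sketch of the two formulas above, assuming NumPy. The `counts_with_zero` array is made up purely to show the zero-safe variant; it is not part of the anchors.

```python
import numpy as np

latency_ms = np.array([89, 102, 145, 98, 203, 87, 156, 121, 310, 95])

# Natural log: valid because every latency is strictly positive.
log_latency = np.log(latency_ms)

# Hypothetical data containing a zero, to illustrate log(x + 1).
counts_with_zero = np.array([0, 3, 12, 250])
log1p_counts = np.log1p(counts_with_zero)  # numerically stable log(1 + x)

print(np.round(log_latency, 2))
print(np.round(log1p_counts, 2))
```

`np.log1p` is used instead of `np.log(x + 1)` because it stays accurate when x is very close to zero.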

Figure: Raw latency vs log(latency). Left panel: histogram of raw latency (ms), where the right tail pulls the mean to about 141 ms. Right panel: histogram of log(latency), where the right tail is compressed and the skew is reduced.

What log preserves:

  • Ordering: log is monotone, so x₁ < x₂ ↔ log(x₁) < log(x₂)
  • Multiplicative relationships become additive: log(a×b) = log(a) + log(b). If latency doubles when request size doubles, log-log regression captures this as a linear coefficient of 1.
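
The log-log claim can be checked with a small sketch. The request sizes and the constant 7.5 are hypothetical values chosen so that latency is exactly proportional to size; real data would add noise around the fitted line.

```python
import numpy as np

# Hypothetical request sizes (KB) and a latency exactly proportional to size:
# doubling the size doubles the latency.
request_kb = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
latency_ms = 7.5 * request_kb

# Fit log(latency) = intercept + slope * log(size) by least squares.
slope, intercept = np.polyfit(np.log(request_kb), np.log(latency_ms), deg=1)

print(f"slope={slope:.3f}, exp(intercept)={np.exp(intercept):.3f}")
```

The fitted slope is the exponent of the power law (1 here), and exp(intercept) recovers the multiplicative constant.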

What log changes:

  • Scale (now in log-ms, not ms)
  • The distribution shape (compresses the right tail)
  • Interpretation: a +1 unit change in log(x) is a multiplicative (×e) change in x

When log fails:

  • Negative values: use signed log: sign(x) × log(|x|+1)
  • Zero values: use log(x+1) to avoid log(0)=-∞
  • Bimodal data: log compresses the tail but does not merge two peaks
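
A minimal sketch of the signed-log workaround from the list above. The helper name `signed_log1p` and the input values are illustrative assumptions, not a library function.

```python
import numpy as np

def signed_log1p(x):
    """sign(x) * log(|x| + 1): defined for negative, zero, and positive x."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))

values = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])  # hypothetical inputs
transformed = signed_log1p(values)
print(np.round(transformed, 3))
```

The transform maps 0 to 0, is symmetric around zero, and keeps the ordering of the inputs.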

Square Root Transformation

Goal: goal 2 (variance stabilization for count data).

Formula: x_new = √x. Handles zeros, less aggressive than log.

Poisson variance stabilization: if X ~ Poisson(λ), then E[X] = λ and Var(X) = λ. The variance grows linearly with the mean — a textbook heteroscedasticity problem. After applying √X: Var(√X) ≈ 1/4, approximately constant regardless of λ. This is the delta-method result for the Poisson variance-stabilizing transform.

When variance grows with the mean, OLS estimates are unbiased but inefficient. Tests on the untransformed data have incorrect standard errors. The sqrt fix restores the constant-variance assumption without requiring weighted least squares.
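
The variance-stabilizing result can be checked by simulation. This is a sketch under assumed settings: the three Poisson means, the sample size, and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

results = {}
for lam in (10, 100, 1000):  # hypothetical Poisson means
    x = rng.poisson(lam, size=200_000)
    raw_var, sqrt_var = x.var(), np.sqrt(x).var()
    results[lam] = (raw_var, sqrt_var)
    print(f"lambda={lam:5d}  Var(X)={raw_var:8.2f}  Var(sqrt(X))={sqrt_var:.3f}")
```

The raw variance should track λ, while the variance of √X should stay close to 1/4. The approximation is a large-λ result, so it is loosest for the smallest mean.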

Box-Cox Transformation

Goal: goal 1 (normality). Find the optimal power transform automatically.

Formula family:

  • λ ≠ 0: x_new = (x^λ − 1) / λ
  • λ = 0: x_new = log(x) (limiting case)

Special cases:

λ      Transform
1      Identity (no transform)
0      Log transform
0.5    Square root
−1     Reciprocal (1/x)
2      Square
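
A short sketch of the formula family at fixed λ, cross-checked against SciPy. The helper name `box_cox` is an illustrative assumption. Note that the special cases hold up to a shift and rescale: λ = 1 gives x − 1 and λ = 0.5 gives 2(√x − 1).

```python
import numpy as np
from scipy import stats

def box_cox(x, lam):
    """Box-Cox transform for strictly positive x at a fixed lambda."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)  # limiting case as lambda -> 0
    return (x**lam - 1.0) / lam

latency = np.array([89, 102, 145, 98, 203, 87, 156, 121, 310, 95], dtype=float)

shifted = box_cox(latency, 1.0)     # equals x - 1
root_like = box_cox(latency, 0.5)   # equals 2 * (sqrt(x) - 1)
near_log = box_cox(latency, 1e-6)   # approaches log(x) as lambda -> 0

# Cross-check against SciPy at a fixed lambda (no optimization involved).
matches_scipy = np.allclose(root_like, stats.boxcox(latency, lmbda=0.5))
print(matches_scipy, np.max(np.abs(near_log - np.log(latency))))
```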

Finding optimal λ: maximize the log-likelihood of the transformed data under a Normal model. No trial-and-error — scipy optimizes this directly.
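
To make the optimization concrete, the sketch below evaluates the Box-Cox log-likelihood on a grid with `scipy.stats.boxcox_llf` and compares the grid maximum with the λ that `scipy.stats.boxcox` returns. The grid range and step are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

latency = np.array([89, 102, 145, 98, 203, 87, 156, 121, 310, 95], dtype=float)

# Profile log-likelihood on a grid (range and step chosen for illustration).
lambdas = np.linspace(-5.0, 3.0, 801)
llf = np.array([stats.boxcox_llf(lam, latency) for lam in lambdas])
grid_lambda = lambdas[np.argmax(llf)]

# SciPy's own maximum-likelihood estimate for comparison.
_, mle_lambda = stats.boxcox(latency)

print(f"grid argmax: {grid_lambda:.2f}, scipy MLE: {mle_lambda:.4f}")
```

Plotting `llf` against `lambdas` gives the log-likelihood curve, and the two estimates should agree to within the grid step.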

Figure: Box-Cox log-likelihood vs λ for the latency anchor, with the peak marking the optimal λ. Reference points on the λ axis: 0 (log), 0.5 (√), 1 (identity).

Limitation: Box-Cox requires x > 0. For data with zeros: use (x + c)^λ where c > 0.

Yeo-Johnson: extends Box-Cox to handle zeros and negative values using a two-piece formula (one branch for x ≥ 0, another for x < 0). Same log-likelihood optimization, broader applicability.
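
A minimal sketch of Yeo-Johnson on data that Box-Cox cannot take. The `delta_ms` values are hypothetical (say, latency change against a baseline) and include a zero and negatives.

```python
import numpy as np
from scipy import stats

# Hypothetical signed data, sorted ascending, with a zero and negative values.
delta_ms = np.array([-40.0, -12.0, 0.0, 3.0, 8.0, 15.0, 60.0, 240.0])

# Box-Cox refuses non-positive input.
try:
    stats.boxcox(delta_ms)
    boxcox_failed = False
except ValueError as err:
    boxcox_failed = True
    print("Box-Cox rejected the data:", err)

# Yeo-Johnson accepts it; lambda is chosen by maximum likelihood.
transformed, lam = stats.yeojohnson(delta_ms)
print(f"lambda={lam:.3f}")
print(np.round(transformed, 3))
```

Yeo-Johnson is monotone and maps 0 to 0, so the sign and ordering of the inputs survive the transform.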

Standardization (Z-Score)

Goal: goal 4 (comparable feature scales).

Formula: z = (x − x̄) / s

Applied to accuracy anchor: z = (xᵢ − 0.838) / 0.0512 for each fold score.

Fold  Accuracy  Z-score
1     0.82      −0.36
2     0.79      −0.94
3     0.91      +1.40
4     0.85      +0.23
5     0.78      −1.14
6     0.88      +0.81

After standardization: mean = 0, SD = 1. The distribution shape is unchanged.

When standardization is required:

  • Distance-based algorithms (k-NN, k-means, SVM with RBF): Euclidean distance is dominated by the feature with the largest scale. If accuracy is in [0,1] and training_size is in [4000, 5000], the range of training_size is 1000× the range of accuracy, so training_size dominates every distance computation.
  • Regularized models (Lasso, Ridge): L1/L2 penalties apply the same penalty strength to every coefficient. A feature measured on a larger scale needs a proportionally smaller coefficient to have the same effect, so it is penalized less relative to its actual influence, and the penalty no longer treats features symmetrically.
  • Gradient descent: features on different scales create elongated loss surfaces (narrow valleys), causing gradient descent to oscillate and converge slowly. Standardized features make the loss surface more spherical.
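
The distance-domination point can be made concrete with a sketch. The data is synthetic (uniform draws over the two ranges mentioned above), and `size_share` and `zscore` are illustrative helpers, not library functions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical features on very different scales.
accuracy = rng.uniform(0.0, 1.0, size=n)
training_size = rng.uniform(4000.0, 5000.0, size=n)

def size_share(a, t):
    """Average fraction of squared Euclidean distance due to the second feature."""
    i, j = np.triu_indices(len(a), k=1)  # all distinct pairs
    da2, dt2 = (a[i] - a[j]) ** 2, (t[i] - t[j]) ** 2
    return np.mean(dt2 / (da2 + dt2))

def zscore(x):
    return (x - x.mean()) / x.std(ddof=1)

raw_share = size_share(accuracy, training_size)
scaled_share = size_share(zscore(accuracy), zscore(training_size))
print(f"raw: {raw_share:.4f}, standardized: {scaled_share:.4f}")
```

Before scaling, nearly all of each pairwise distance comes from training_size. After standardization the two features contribute roughly equally.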

Critical distinction: standardization does not make data normal. Z-scoring a right-skewed distribution gives a standardized right-skewed distribution: the shape is unchanged, and only the location (mean) and scale (SD) change.
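
A quick check of this on the latency anchor, as a sketch: skewness is invariant to shifting and positive rescaling, so the z-scores keep the skew of the raw data.

```python
import numpy as np
from scipy import stats

latency = np.array([89, 102, 145, 98, 203, 87, 156, 121, 310, 95], dtype=float)
z = (latency - latency.mean()) / latency.std(ddof=1)

# Location and scale change; the shape statistic does not.
print(f"skew raw={stats.skew(latency):.3f}, skew z={stats.skew(z):.3f}")
print(f"mean z={z.mean():.2e}, SD z={z.std(ddof=1):.3f}")
```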

Min-Max Normalization

Goal: goal 4, with bounded output.

Formula: x_new = (x − x_min) / (x_max − x_min)

Result: all values in [0, 1].

Applied to latency anchor: x_min = 87, x_max = 310.

(89 − 87) / (310 − 87) = 2 / 223 = 0.009
(310 − 87) / (310 − 87) = 223 / 223 = 1.000

When: neural network input layers, similarity scores, image pixels (255 → 1.0), anywhere you need a bounded range.

Limitation: a single outlier distorts the normalization for all other values. If x_max = 310 is an anomaly, the other nine values (all ≤ 203) are compressed into [0, 0.52] even though they are the typical range. Standardization is somewhat less sensitive because it uses the mean and SD rather than the two extreme values, although both of those are also pulled by outliers.
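
The outlier effect can be seen directly on the latency anchor. This sketch treats 310 as the anomaly purely for illustration; `min_max` is a hypothetical helper.

```python
import numpy as np

def min_max(x):
    """Rescale x linearly so that min -> 0 and max -> 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

latency = np.array([89, 102, 145, 98, 203, 87, 156, 121, 310, 95])

with_outlier = min_max(latency)
without_outlier = min_max(latency[latency != 310])  # drop the assumed anomaly

# Largest normalized value among the other nine points, with and without 310.
second_largest = np.sort(with_outlier)[-2]
print(f"{second_largest:.3f} vs {without_outlier.max():.3f}")
```

With 310 present, the second-slowest request (203 ms) lands at 116/223. Without it, the same request defines the top of the range.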

Property                Standardization                              Min-Max
Output range            Unbounded (typically −3 to +3)               [0, 1]
Outlier sensitivity     Moderate (outliers still shift mean and SD)  High (depends on extremes)
Preserves distribution  Yes (shape unchanged)                        Yes (shape unchanged)
Makes data Normal       No                                           No

Rank Transformation

Goal: goal 1 (convert to a form that is distribution-free).

Formula: replace each value with its rank (1 = smallest, n = largest). Ties: average rank.

Applied to accuracy anchor [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]:

  • Sorted: 0.78, 0.79, 0.82, 0.85, 0.88, 0.91
  • Ranks: 0.78→1, 0.79→2, 0.82→3, 0.85→4, 0.88→5, 0.91→6
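
A minimal sketch with `scipy.stats.rankdata`, whose default tie rule is the average rank. The `tied` array is a made-up example added to show that rule.

```python
import numpy as np
from scipy import stats

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
ranks = stats.rankdata(accuracy)  # default method="average"

# Hypothetical scores with a tie: the two 0.85 values share ranks 2 and 3.
tied = np.array([0.80, 0.85, 0.85, 0.90])
tied_ranks = stats.rankdata(tied)

print(ranks, tied_ranks)
```

Ranks are returned in the original order of the data, so the first fold (0.82) gets rank 3.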

Connection to non-parametric tests: Spearman correlation is exactly Pearson correlation computed on ranks. The Mann-Whitney U test operates on ranks. The Wilcoxon signed-rank test ranks the absolute differences. When you apply these tests, you are implicitly applying a rank transformation. The test statistic then does not depend on the shape of the original distribution because, absent ties, the ranks are always a permutation of {1,...,n}.
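
One consequence worth seeing in code: because ranks ignore magnitudes, Spearman correlation is unchanged by any increasing transform of either variable, while Pearson is not. In this sketch `y` is a hypothetical variable constructed to be a strictly increasing, nonlinear function of latency.

```python
import numpy as np
from scipy import stats

x = np.array([89, 102, 145, 98, 203, 87, 156, 121, 310, 95], dtype=float)
y = np.exp(x / 50.0)  # hypothetical: strictly increasing, nonlinear in x

pearson_r = stats.pearsonr(x, y)[0]
spearman_r = stats.spearmanr(x, y)[0]
spearman_after_log = stats.spearmanr(np.log(x), y)[0]  # log keeps the ordering

print(f"Pearson={pearson_r:.3f}, Spearman={spearman_r:.3f}, "
      f"Spearman after log(x)={spearman_after_log:.3f}")
```

Spearman reports a perfect monotone relationship before and after the log, while Pearson falls short of 1 because the relationship is not linear.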

Decision Flowchart

Choose the right transformation. What is the goal?

  • Normality assumption → Log or Box-Cox (positive data)
  • Variance stabilization → √ transform (count/Poisson)
  • Comparable feature scales → Z-score (unbounded) or Min-Max ([0,1] bounded)
  • Distribution-free approach → Rank transform (Spearman/Mann-Whitney)

When unsure about normality: Box-Cox (finds optimal λ automatically).

Inverse Transformation (Back-Transforming)

If you log-transform y and fit a regression to log(y), predictions are in log scale. To get predictions in original units:

ŷ = exp(log(ŷ_pred))

Bias correction: exp(E[log(y)]) ≠ E[y]. When residuals in log space are approximately normal with variance σ²_ε, the corrected back-transform is:

ŷ_corrected = exp(log(ŷ_pred) + σ²_ε / 2)

Without this correction, back-transformed predictions systematically underestimate the true mean. The correction matters when residual variance is large. For σ²_ε ≤ 0.1, the correction factor e^{σ²_ε/2} is at most about 1.05, a bias of roughly 5% that is often acceptable. For σ²_ε = 0.5 (common with noisy log-transformed data), the factor is e^{0.25} ≈ 1.28, so the uncorrected estimate is about 22% too low.
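
A simulation sketch of the back-transform bias, assuming normally distributed residuals on the log scale. The prediction 4.8, the variance 0.5, the sample size, and the seed are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

log_pred = 4.8   # hypothetical model prediction on the log scale
sigma2 = 0.5     # assumed residual variance on the log scale

# Simulate outcomes whose log is Normal(log_pred, sigma2).
y = np.exp(log_pred + rng.normal(0.0, np.sqrt(sigma2), size=1_000_000))

naive = np.exp(log_pred)                   # recovers the median of y, not the mean
corrected = np.exp(log_pred + sigma2 / 2)  # mean of a lognormal variable

print(f"simulated mean={y.mean():.1f}, naive={naive:.1f}, corrected={corrected:.1f}")
```

The naive back-transform lands on the median of the simulated outcomes, while the corrected value matches their mean. If the log-scale residuals are far from normal, this particular correction is only approximate.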

Code and Output

```python
import numpy as np
from scipy import stats

latency = np.array([89, 102, 145, 98, 203, 87, 156, 121, 310, 95])
accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])

print("=== Log Transformation ===")
log_latency = np.log(latency)
print(f"Raw: mean={latency.mean():.1f}, SD={latency.std():.1f}, skew={stats.skew(latency):.3f}")
print(f"Log: mean={log_latency.mean():.3f}, SD={log_latency.std():.3f}, skew={stats.skew(log_latency):.3f}")

_, p_raw = stats.shapiro(latency)
_, p_log = stats.shapiro(log_latency)
print(f"Shapiro p-value: raw={p_raw:.4f}, log-transformed={p_log:.4f}")

print("\n=== Box-Cox Transformation ===")
bc_transformed, optimal_lambda = stats.boxcox(latency)
print(f"Optimal lambda: {optimal_lambda:.4f}")
_, p_bc = stats.shapiro(bc_transformed)
print(f"Shapiro p-value after Box-Cox: {p_bc:.4f}")

# Yeo-Johnson works on zero/negative values
yj_transformed, yj_lambda = stats.yeojohnson(latency)
print(f"Yeo-Johnson lambda: {yj_lambda:.4f}")

print("\n=== Standardization (Z-score) ===")
x_bar, s = accuracy.mean(), accuracy.std(ddof=1)
z_accuracy = (accuracy - x_bar) / s
print(f"Original: mean={accuracy.mean():.4f}, SD={accuracy.std(ddof=1):.4f}")
print(f"Z-scores: {np.round(z_accuracy, 3)}")
print(f"After z-score: mean={z_accuracy.mean():.6f}, SD={z_accuracy.std(ddof=1):.6f}")

print("\n=== Min-Max Normalization ===")
lat_minmax = (latency - latency.min()) / (latency.max() - latency.min())
print(f"Min-max range: [{lat_minmax.min():.3f}, {lat_minmax.max():.3f}]")
print(f"Sample values: {np.round(lat_minmax[:5], 3)}")

print("\n=== Rank Transformation ===")
ranks = stats.rankdata(accuracy)
print(f"Accuracy: {accuracy}")
print(f"Ranks:    {ranks}")

# Spearman via ranks equals scipy.stats.spearmanr
latency_small = latency[:6]
spearman_r, _ = stats.spearmanr(accuracy, latency_small)
rank_lat = stats.rankdata(latency_small)
pearson_on_ranks, _ = stats.pearsonr(ranks, rank_lat)
print(f"\nSpearman r (scipy): {spearman_r:.4f}")
print(f"Pearson on ranks:   {pearson_on_ranks:.4f}")

print("\n=== Back-Transform Bias Correction ===")
# Simulate: fit model in log space, residual variance=0.08
log_pred = np.log(latency.mean())
sigma2_resid = 0.08
naive = np.exp(log_pred)
corrected = np.exp(log_pred + sigma2_resid / 2)
print(f"Log-space prediction: {log_pred:.4f}")
print(f"Naive back-transform:     {naive:.2f} ms")
print(f"Bias-corrected:           {corrected:.2f} ms  (+{(corrected/naive-1)*100:.1f}%)")
```

Output:

=== Log Transformation ===
Raw: mean=140.6, SD=66.4, skew=1.561
Log: mean=4.859, SD=0.393, skew=0.995
Shapiro p-value: raw=0.0176, log-transformed=0.6352

=== Box-Cox Transformation ===
Optimal lambda: 0.1867
Shapiro p-value after Box-Cox: 0.8014
Yeo-Johnson lambda: 0.2089

=== Standardization (Z-score) ===
Original: mean=0.8383, SD=0.0512
Z-scores: [-0.358 -0.945  1.401  0.228 -1.14   0.815]
After z-score: mean=0.000000, SD=1.000000

=== Min-Max Normalization ===
Min-max range: [0.000, 1.000]
Sample values: [0.009 0.067 0.26  0.049 0.52 ]

=== Rank Transformation ===
Accuracy: [0.82 0.79 0.91 0.85 0.78 0.88]
Ranks:    [3. 2. 6. 4. 1. 5.]

Spearman r (scipy): -0.3714
Pearson on ranks:   -0.3714

=== Back-Transform Bias Correction ===
Log-space prediction: 4.9459
Naive back-transform:     140.60 ms
Bias-corrected:           146.34 ms  (+4.1%)

Limitations

Log transform requires positive data. Box-Cox requires strict positivity (x > 0). Min-max collapses when outliers dominate the extremes. Standardization does not correct non-normality. Rank transform discards information about the magnitudes of differences between values — two datasets with different numerical spread can have identical ranks.

No transform fixes structural problems in the data: a bimodal distribution reflects two distinct sub-populations; transformation does not merge them. Missing values, measurement error, and incorrect labels require data cleaning, not transformation.

Test Your Understanding

  1. Your model latency data has skewness = 1.39. After applying log transform, skewness becomes 0.47. A colleague says "the Shapiro-Wilk test now shows p = 0.63, so our data is Normal." Explain what the Shapiro-Wilk result means and what inference limitation remains even after the transform succeeds.

  2. You have request count data (integers ≥ 0) with counts ranging from 0 to 450. A Poisson(λ) model is a reasonable fit. Why is sqrt better than log for this case? What happens to log(0) and how does each transform handle the variance-mean relationship differently?

  3. A dataset has two features: accuracy (range [0,1]) and training_examples (range [4000,5000]). You apply min-max normalization. A single outlier with training_examples = 15000 gets added. Describe specifically what happens to the normalized values of all existing training_examples values. What alternative would be more robust?

  4. Standardization preserves the distribution shape, and rank transformation discards magnitude information. For Spearman correlation, explain why discarding magnitude information is a feature rather than a limitation.

  5. You fit a linear regression to log(latency) and get predictions in log scale. Your residuals have mean 0 and variance σ² = 0.25. You back-transform a prediction of log(ŷ) = 4.8 using ŷ = exp(4.8) = 121.5ms. By what percentage does this underestimate E[latency] for a request with those features? Apply the bias correction formula and interpret the result.
