Data Transformations
A linear regression model assumes its residuals are normally distributed. A k-NN classifier assumes all features contribute equally to distance. A Poisson regression assumes variance equals the mean. Real data rarely cooperates with these assumptions, but the data does not need to cooperate — the transformation does.
Transformations serve four distinct goals, and every technique in this post maps to at least one:
- Meet model assumptions: normal residuals for linear regression, constant variance for ANOVA, normal data for parametric tests
- Stabilize variance: when variance grows with the mean (heteroscedasticity), a transform can make it constant
- Linearize relationships: if y grows exponentially with x, log(y) grows linearly — enabling linear models on log scale
- Put features on comparable scales: features measured in different units dominate distance-based models; standardization removes this
The Anchors
Right-skewed data (latency):
latency_ms = [89, 102, 145, 98, 203, 87, 156, 121, 310, 95]
n=10. Mean=140.6ms, sample SD=70.0ms. Right-skewed (skewness=1.56): a few slow requests pull the tail.
Symmetric data (accuracy):
accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]
n=6. Mean=0.838, sample SD=0.051. Used to demonstrate standardization.
Log Transformation
Goal: goal 1 (normality), goal 2 (variance stabilization), goal 3 (linearize multiplicative relationships).
When: data is positive and right-skewed — latency, request sizes, income, count data. Or when the relationship between variables is multiplicative rather than additive.
Formula: x_new = log(x). When x can be zero: x_new = log(x + 1).
For the latency anchor: log(89)=4.49, log(102)=4.62, ..., log(310)=5.74.
What log preserves:
- Ordering: log is monotone, so x₁ < x₂ ↔ log(x₁) < log(x₂)
- Multiplicative relationships become additive: log(a×b) = log(a) + log(b). If latency doubles when request size doubles, log-log regression captures this as a linear coefficient of 1.
What log changes:
- Scale (now in log-ms, not ms)
- The distribution shape (compresses the right tail)
- Interpretation: a +1 unit change in log(x) is a multiplicative (×e) change in x
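The multiplicative-to-additive property can be checked numerically. The sketch below is illustrative only: the request sizes, the 0.5 ms/KB constant, and the noise level are made-up assumptions, chosen so that latency is proportional to request size. Under that assumption, the fitted slope on the log-log scale should come out close to 1.

```python
import numpy as np

# Hypothetical data: latency proportional to request size (exponent 1)
# with multiplicative noise. All constants here are illustrative assumptions.
rng = np.random.default_rng(0)
request_kb = np.linspace(10, 1000, 200)
noise = np.exp(rng.normal(0.0, 0.05, size=request_kb.size))
latency_ms = 0.5 * request_kb * noise

# Ordinary least squares on the log-log scale:
# log(latency) = slope * log(size) + intercept
slope, intercept = np.polyfit(np.log(request_kb), np.log(latency_ms), deg=1)
print(f"slope={slope:.3f}, exp(intercept)={np.exp(intercept):.3f}")
```

The slope estimates the exponent of the power law, and exp(intercept) estimates the multiplicative constant.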
When log fails:
- Negative values: use signed log: sign(x) × log(|x|+1)
- Zero values: use log(x+1) to avoid log(0)=-∞
- Bimodal data: log compresses the tail but does not merge two peaks
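The two workarounds for zeros and negatives are short enough to write out. This is a minimal sketch: `signed_log1p` is a helper name invented here, and both input arrays are hypothetical.

```python
import numpy as np

def signed_log1p(x):
    """Signed log: sign(x) * log(|x| + 1), defined for zero and negative inputs."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))

counts = np.array([0, 1, 9, 99])                   # hypothetical counts with a zero
deltas = np.array([-99.0, -1.0, 0.0, 1.0, 99.0])   # hypothetical signed values

log_counts = np.log1p(counts)      # log(x + 1): finite at x = 0
signed = signed_log1p(deltas)      # odd-symmetric and order-preserving

print(log_counts)
print(signed)
```

`np.log1p` computes log(x + 1) directly, which is also more accurate than `np.log(x + 1)` for very small x.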
Square Root Transformation
Goal: goal 2 (variance stabilization for count data).
Formula: x_new = √x. Handles zeros, less aggressive than log.
Poisson variance stabilization: if X ~ Poisson(λ), then E[X] = λ and Var(X) = λ. The variance grows linearly with the mean, a textbook heteroscedasticity problem. After applying √X: Var(√X) ≈ 1/4, approximately constant regardless of λ as long as λ is not too small (the approximation improves as λ grows). This is the delta-method result for the Poisson variance-stabilizing transform.
When variance grows with the mean, OLS estimates are unbiased but inefficient. Tests on the untransformed data have incorrect standard errors. The sqrt fix restores the constant-variance assumption without requiring weighted least squares.
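A quick simulation makes the 1/4 result concrete. This is a sketch under assumed settings: the three λ values, the sample size, and the seed are arbitrary choices, not part of the anchor data.

```python
import numpy as np

rng = np.random.default_rng(42)
results = {}
for lam in (10, 50, 200):
    x = rng.poisson(lam, size=200_000)
    # Raw variance tracks lambda; variance after sqrt should sit near 0.25.
    results[lam] = (x.var(), np.sqrt(x).var())
    print(f"lambda={lam:>3}: Var(X)={results[lam][0]:7.2f}, "
          f"Var(sqrt(X))={results[lam][1]:.3f}")
```

The raw variance should scale with λ while the transformed variance stays in a narrow band around 1/4, slightly above it for the smallest λ.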
Box-Cox Transformation
Goal: goal 1 (normality). Find the optimal power transform automatically.
Formula family:
- λ ≠ 0: x_new = (x^λ − 1) / λ
- λ = 0: x_new = log(x) (limiting case)
Special cases:
| λ | Transform |
|---|---|
| 1 | Identity (no transform) |
| 0 | Log transform |
| 0.5 | Square root |
| −1 | Reciprocal (1/x) |
| 2 | Square |
Finding optimal λ: maximize the log-likelihood of the transformed data under a Normal model. No trial-and-error — scipy optimizes this directly.
Limitation: Box-Cox requires x > 0. For data with zeros, shift first: apply the transform to x + c for some constant c > 0.
Yeo-Johnson: extends Box-Cox to handle zeros and negative values using a two-piece formula (one branch for x ≥ 0, another for x < 0). Same log-likelihood optimization, broader applicability.
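To connect the formula family to the library call, the sketch below writes the Box-Cox formula by hand and compares it with `scipy.stats.boxcox` at fixed λ values. `boxcox_manual` is a helper name made up for this example, and centering the latency anchor is just a convenient way to produce negative values for Yeo-Johnson.

```python
import numpy as np
from scipy import stats

def boxcox_manual(x, lam):
    """Box-Cox formula from the text: (x^lam - 1) / lam, or log(x) when lam == 0."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x**lam - 1.0) / lam

latency = np.array([89, 102, 145, 98, 203, 87, 156, 121, 310, 95], dtype=float)

# Compare against scipy at the lambdas listed in the special-cases table.
max_diff = {
    lam: float(np.max(np.abs(boxcox_manual(latency, lam)
                             - stats.boxcox(latency, lmbda=lam))))
    for lam in (2.0, 1.0, 0.5, 0.0, -1.0)
}
print(max_diff)

# Yeo-Johnson on data Box-Cox cannot take: centering creates negative values.
centered = latency - latency.mean()
yj_values, yj_lambda = stats.yeojohnson(centered)
print(f"Yeo-Johnson lambda on centered data: {yj_lambda:.4f}")
```

Passing `lmbda` explicitly skips the likelihood search and returns only the transformed array. Omitting it returns the transformed array together with the fitted λ.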
Standardization (Z-Score)
Goal: goal 4 (comparable feature scales).
Formula: z = (x − x̄) / s
Applied to accuracy anchor: z = (xᵢ − 0.838) / 0.0512 for each fold score.
| Fold | Accuracy | Z-score |
|---|---|---|
| 1 | 0.82 | −0.36 |
| 2 | 0.79 | −0.94 |
| 3 | 0.91 | +1.40 |
| 4 | 0.85 | +0.23 |
| 5 | 0.78 | −1.14 |
| 6 | 0.88 | +0.81 |
After standardization: mean = 0, SD = 1. The distribution shape is unchanged.
When standardization is required:
- Distance-based algorithms (k-NN, k-means, SVM with RBF): Euclidean distance is dominated by the feature with the largest scale. If accuracy is in [0,1] and training_size is in [4000, 5000], the range of training_size (1000) is 1000× the range of accuracy (1), so training_size dominates every distance computation.
- Regularized models (Lasso, Ridge): L1/L2 penalties apply the same penalty strength to every coefficient. A feature measured on a larger scale needs a proportionally smaller coefficient to have the same effect, so it is penalized less relative to its actual influence, breaking the symmetry.
- Gradient descent: features on different scales create elongated loss surfaces (narrow valleys), causing gradient descent to oscillate and converge slowly. Standardized features make the loss surface more spherical.
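The distance-domination point is easy to see with three points. The feature matrix below is hypothetical, built to match the accuracy and training_size ranges mentioned above. Point 1 differs from point 0 mainly in accuracy, and point 2 differs from point 0 only in training_size.

```python
import numpy as np

# Hypothetical rows: [accuracy in [0, 1], training_size in [4000, 5000]]
X = np.array([
    [0.80, 4000.0],
    [0.95, 4010.0],   # very different accuracy, nearly the same training_size
    [0.80, 4500.0],   # same accuracy, different training_size
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

raw_01, raw_02 = dist(X[0], X[1]), dist(X[0], X[2])

# Column-wise z-scores, then the same two distances.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
z_01, z_02 = dist(Z[0], Z[1]), dist(Z[0], Z[2])

print(f"raw:          d(0,1)={raw_01:.2f}, d(0,2)={raw_02:.2f}")
print(f"standardized: d(0,1)={z_01:.2f}, d(0,2)={z_02:.2f}")
```

On the raw scale the accuracy gap is invisible next to training_size. After standardization both gaps contribute on a comparable footing.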
Critical distinction: standardization does not make data normal. Z-scoring a right-skewed distribution gives a standardized right-skewed distribution: the shape is unchanged, only the location (mean) and scale (SD) change.
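One way to verify that z-scoring leaves the shape alone is to compare skewness before and after. The sketch below uses the latency anchor and contrasts z-scoring with the log transform, which does change the shape.

```python
import numpy as np
from scipy import stats

latency = np.array([89, 102, 145, 98, 203, 87, 156, 121, 310, 95], dtype=float)

z = (latency - latency.mean()) / latency.std(ddof=1)

skew_raw = stats.skew(latency)
skew_z = stats.skew(z)                 # unchanged by a shift and a positive rescale
skew_log = stats.skew(np.log(latency)) # a nonlinear transform does change it

print(f"raw={skew_raw:.3f}, z-scored={skew_z:.3f}, log={skew_log:.3f}")
```

Skewness is invariant under any shift plus positive rescaling, so the first two values agree to floating-point precision.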
Min-Max Normalization
Goal: goal 4, with bounded output.
Formula: x_new = (x − x_min) / (x_max − x_min)
Result: all values in [0, 1].
Applied to latency anchor: x_min = 87, x_max = 310.
(89 − 87) / (310 − 87) = 2 / 223 = 0.009
(310 − 87) / (310 − 87) = 223 / 223 = 1.000
When: neural network input layers, similarity scores, image pixels (255 → 1.0), anywhere you need a bounded range.
Limitation: a single outlier distorts the normalization for all other values. If x_max = 310 is an anomaly, all values ≤ 200 get compressed into roughly [0, 0.51] even though they are the typical range. Standardization is less sensitive because it uses the mean and SD rather than only the two extreme values, although outliers still shift both.
| | Standardization | Min-Max |
|---|---|---|
| Output range | Unbounded (typically −3 to +3) | [0, 1] |
| Outlier sensitivity | Moderate (outliers still shift the mean and SD) | High (depends on extremes) |
| Preserves distribution | Yes (shape unchanged) | Yes (shape unchanged) |
| Makes data Normal | No | No |
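The outlier sensitivity of min-max can be shown on the latency anchor by normalizing once with the 310 ms request and once without it. Treating 310 as an anomaly is an assumption made purely for illustration, and `min_max` is a helper defined here.

```python
import numpy as np

def min_max(x):
    """Rescale to [0, 1] using the observed minimum and maximum."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

latency = np.array([89, 102, 145, 98, 203, 87, 156, 121, 310, 95], dtype=float)
typical = latency[latency < 310]   # drop the suspected anomaly

with_outlier = min_max(latency)
without_outlier = min_max(typical)

# Track one ordinary request (156 ms, index 6 in both arrays).
print(f"156 ms with 310 present: {with_outlier[6]:.3f}")
print(f"156 ms with 310 removed: {without_outlier[6]:.3f}")
```

The same 156 ms request lands at very different positions in [0, 1] depending on whether the single extreme value is present.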
Rank Transformation
Goal: goal 1 (convert to a form that is distribution-free).
Formula: replace each value with its rank (1 = smallest, n = largest). Ties: average rank.
Applied to accuracy anchor [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]:
- Sorted: 0.78, 0.79, 0.82, 0.85, 0.88, 0.91
- Ranks: 0.78→1, 0.79→2, 0.82→3, 0.85→4, 0.88→5, 0.91→6
Connection to non-parametric tests: Spearman correlation is exactly Pearson correlation computed on ranks. The Mann-Whitney U test operates on ranks. The Wilcoxon signed-rank test ranks the absolute differences and then reattaches their signs. When you apply these tests, you are implicitly applying a rank transformation. The test statistic then depends only on the ordering of the data: without ties, the ranks are always a permutation of {1,...,n}, whatever distribution the original values came from.
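Two properties of ranks are worth seeing in code: ties receive the average rank, and any strictly increasing transform (such as log) leaves ranks, and therefore Spearman correlation, untouched. In this sketch the `scores` array is a made-up example with a tie, and the other arrays are the anchors.

```python
import numpy as np
from scipy import stats

scores = np.array([10.0, 20.0, 20.0, 30.0])   # hypothetical values with one tie
tied_ranks = stats.rankdata(scores)           # default method is "average"
print(tied_ranks)

accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
latency6 = np.array([89, 102, 145, 98, 203, 87], dtype=float)

# Ranks ignore magnitudes, so a monotone transform changes nothing.
same_ranks = np.array_equal(stats.rankdata(latency6),
                            stats.rankdata(np.log(latency6)))
rho_raw, _ = stats.spearmanr(accuracy, latency6)
rho_log, _ = stats.spearmanr(accuracy, np.log(latency6))
print(same_ranks, round(rho_raw, 4), round(rho_log, 4))
```

Pearson correlation on the same pairs would generally change after the log, because it depends on magnitudes rather than order.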
Decision Flowchart
[Figure: decision flowchart]
Inverse Transformation (Back-Transforming)
If you log-transform y and fit a regression to log(y), predictions are in log scale. To get predictions in original units:
ŷ = exp(log(ŷ_pred))
Bias correction: exp(E[log(y)]) ≠ E[y]. When the residuals in log space are approximately normal with variance σ²_ε, the corrected back-transform is:
ŷ_corrected = exp(log(ŷ_pred) + σ²_ε / 2)
Without this correction, back-transformed predictions systematically underestimate the true mean. The correction matters when residual variance is large. For σ²_ε < 0.1, the bias is about 5% or less, which is often acceptable. For σ²_ε = 0.5 (common with noisy log-transformed data), the corrected estimate is e^{0.25} ≈ 1.28 times the naive one, so the naive back-transform is about 22% too low.
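The size of the back-transform bias can be checked by simulation. The sketch assumes exactly normal log-scale residuals, and the log-scale prediction of 5.0 and the seed are arbitrary. σ² = 0.5 matches the noisy case discussed above.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma2 = 5.0, 0.5   # assumed log-scale prediction and residual variance

# Simulated outcomes whose log is Normal(mu, sigma2).
y = np.exp(mu + rng.normal(0.0, np.sqrt(sigma2), size=500_000))

naive = np.exp(mu)                    # recovers the median of y
corrected = np.exp(mu + sigma2 / 2)   # recovers the mean of y
empirical_mean = y.mean()
empirical_median = np.median(y)

print(f"naive={naive:.1f}, corrected={corrected:.1f}")
print(f"empirical mean={empirical_mean:.1f}, median={empirical_median:.1f}")
```

Under these assumptions the naive back-transform is not wrong so much as aimed at a different target: it estimates the median of y, while the corrected version estimates the mean.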
Code and Output
import numpy as np
from scipy import stats
latency = np.array([89, 102, 145, 98, 203, 87, 156, 121, 310, 95])
accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])
print("=== Log Transformation ===")
log_latency = np.log(latency)
# Note: .std() defaults to the population SD (ddof=0); pass ddof=1 for the sample SD
print(f"Raw: mean={latency.mean():.1f}, SD={latency.std():.1f}, skew={stats.skew(latency):.3f}")
print(f"Log: mean={log_latency.mean():.3f}, SD={log_latency.std():.3f}, skew={stats.skew(log_latency):.3f}")
_, p_raw = stats.shapiro(latency)
_, p_log = stats.shapiro(log_latency)
print(f"Shapiro p-value: raw={p_raw:.4f}, log-transformed={p_log:.4f}")
print("\n=== Box-Cox Transformation ===")
bc_transformed, optimal_lambda = stats.boxcox(latency)
print(f"Optimal lambda: {optimal_lambda:.4f}")
_, p_bc = stats.shapiro(bc_transformed)
print(f"Shapiro p-value after Box-Cox: {p_bc:.4f}")
# Yeo-Johnson works on zero/negative values
yj_transformed, yj_lambda = stats.yeojohnson(latency)
print(f"Yeo-Johnson lambda: {yj_lambda:.4f}")
print("\n=== Standardization (Z-score) ===")
x_bar, s = accuracy.mean(), accuracy.std(ddof=1)
z_accuracy = (accuracy - x_bar) / s
print(f"Original: mean={accuracy.mean():.4f}, SD={accuracy.std(ddof=1):.4f}")
print(f"Z-scores: {np.round(z_accuracy, 3)}")
print(f"After z-score: mean={z_accuracy.mean():.6f}, SD={z_accuracy.std(ddof=1):.6f}")
print("\n=== Min-Max Normalization ===")
lat_minmax = (latency - latency.min()) / (latency.max() - latency.min())
print(f"Min-max range: [{lat_minmax.min():.3f}, {lat_minmax.max():.3f}]")
print(f"Sample values: {np.round(lat_minmax[:5], 3)}")
print("\n=== Rank Transformation ===")
ranks = stats.rankdata(accuracy)
print(f"Accuracy: {accuracy}")
print(f"Ranks: {ranks}")
# Spearman via ranks equals scipy.stats.spearmanr
latency_small = latency[:6]
spearman_r, _ = stats.spearmanr(accuracy, latency_small)
rank_lat = stats.rankdata(latency_small)
pearson_on_ranks, _ = stats.pearsonr(ranks, rank_lat)
print(f"\nSpearman r (scipy): {spearman_r:.4f}")
print(f"Pearson on ranks: {pearson_on_ranks:.4f}")
print("\n=== Back-Transform Bias Correction ===")
# Simulate: fit model in log space, residual variance=0.08
log_pred = np.log(latency.mean())
sigma2_resid = 0.08
naive = np.exp(log_pred)
corrected = np.exp(log_pred + sigma2_resid / 2)
print(f"Log-space prediction: {log_pred:.4f}")
print(f"Naive back-transform: {naive:.2f} ms")
print(f"Bias-corrected: {corrected:.2f} ms (+{(corrected/naive-1)*100:.1f}%)")
=== Log Transformation ===
Raw: mean=140.6, SD=66.4, skew=1.561
Log: mean=4.859, SD=0.393, skew=0.995
Shapiro p-value: raw=0.0176, log-transformed=0.6352
=== Box-Cox Transformation ===
Optimal lambda: 0.1867
Shapiro p-value after Box-Cox: 0.8014
Yeo-Johnson lambda: 0.2089
=== Standardization (Z-score) ===
Original: mean=0.8383, SD=0.0512
Z-scores: [-0.358 -0.945  1.401  0.228 -1.14   0.815]
After z-score: mean=0.000000, SD=1.000000
=== Min-Max Normalization ===
Min-max range: [0.000, 1.000]
Sample values: [0.009 0.067 0.26  0.049 0.52 ]
=== Rank Transformation ===
Accuracy: [0.82 0.79 0.91 0.85 0.78 0.88]
Ranks: [3. 2. 6. 4. 1. 5.]
Spearman r (scipy): -0.3714
Pearson on ranks: -0.3714
=== Back-Transform Bias Correction ===
Log-space prediction: 4.9459
Naive back-transform: 140.60 ms
Bias-corrected: 146.34 ms (+4.1%)
Limitations
Log transform requires positive data. Box-Cox requires strict positivity (x > 0). Min-max collapses when outliers dominate the extremes. Standardization does not correct non-normality. Rank transform discards information about the magnitudes of differences between values — two datasets with different numerical spread can have identical ranks.
No transform fixes structural problems in the data: a bimodal distribution reflects two distinct sub-populations; transformation does not merge them. Missing values, measurement error, and incorrect labels require data cleaning, not transformation.
Test Your Understanding
- Your model latency data has skewness = 1.39. After applying log transform, skewness becomes 0.47. A colleague says "the Shapiro-Wilk test now shows p = 0.63, so our data is Normal." Explain what the Shapiro-Wilk result means and what inference limitation remains even after the transform succeeds.
- You have request count data (integers ≥ 0) with counts ranging from 0 to 450. A Poisson(λ) model is a reasonable fit. Why is sqrt better than log for this case? What happens to log(0) and how does each transform handle the variance-mean relationship differently?
- A dataset has two features: accuracy (range [0,1]) and training_examples (range [4000,5000]). You apply min-max normalization. A single outlier with training_examples = 15000 gets added. Describe specifically what happens to the normalized values of all existing training_examples values. What alternative would be more robust?
- Standardization preserves the distribution shape, and rank transformation discards magnitude information. For Spearman correlation, explain why discarding magnitude information is a feature rather than a limitation.
- You fit a linear regression to log(latency) and get predictions in log scale. Your residuals have mean 0 and variance σ² = 0.25. You back-transform a prediction of log(ŷ) = 4.8 using ŷ = exp(4.8) = 121.5ms. By what percentage does this underestimate E[latency] for a request with those features? Apply the bias correction formula and interpret the result.