Data Transformations

Apr 25, 2026 • 12 min read • By Mohammed Vasim

Statistics, Math, Data Science

A linear regression model assumes its residuals are normally distributed. A k-NN classifier assumes all features contribute equally to distance. A Poisson regression assumes variance equals the mean. Real data rarely cooperates with these assumptions, but the data does not need to cooperate — the transformation does.

Transformations serve four distinct goals, and every technique in this post maps to at least one:

Meet model assumptions: normal residuals for linear regression, constant variance for ANOVA, normal data for parametric tests
Stabilize variance: when variance grows with the mean (heteroscedasticity), a transform can make it constant
Linearize relationships: if y grows exponentially with x, log(y) grows linearly — enabling linear models on log scale
Put features on comparable scales: features measured in different units dominate distance-based models; standardization removes this

The Anchors

Right-skewed data (latency):

latency_ms = [89, 102, 145, 98, 203, 87, 156, 121, 310, 95]

n=10. Mean=140.6ms, sample SD=70.0ms. Right-skewed (skewness=1.56): a few slow requests pull the tail.

Symmetric data (accuracy):

accuracy = [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]

n=6. Mean=0.838, sample SD=0.051. Used to demonstrate standardization.

Log Transformation

Goal: goal 1 (normality), goal 2 (variance stabilization), goal 3 (linearize multiplicative relationships).

When: data is positive and right-skewed — latency, request sizes, income, count data. Or when the relationship between variables is multiplicative rather than additive.

Formula: x_new = log(x). When x can be zero: x_new = log(x + 1).

For the latency anchor: log(89)=4.49, log(102)=4.62, ..., log(310)=5.74.

What log preserves:

Ordering: log is monotone, so x₁ < x₂ ↔ log(x₁) < log(x₂)
Multiplicative relationships become additive: log(a×b) = log(a) + log(b). If latency doubles when request size doubles, log-log regression captures this as a linear coefficient of 1.
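To make the "coefficient of 1" claim concrete, here is a minimal sketch on made-up data in which latency is exactly proportional to request size. The arrays and the constant 0.5 are illustrative assumptions, not measurements from this post.

```python
import numpy as np

# Hypothetical data: latency exactly proportional to request size (latency = 0.5 * size).
request_size_kb = np.array([10.0, 20.0, 40.0, 80.0, 160.0])
latency_ms = 0.5 * request_size_kb

# Fit log(latency) = slope * log(size) + intercept by ordinary least squares.
slope, intercept = np.polyfit(np.log(request_size_kb), np.log(latency_ms), deg=1)

print(f"log-log slope:  {slope:.3f}")              # doubling size doubles latency
print(f"exp(intercept): {np.exp(intercept):.3f}")  # the multiplicative constant
```

A slope of 2 on the log-log scale would instead mean latency quadruples when request size doubles.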

What log changes:

Scale (now in log-ms, not ms)
The distribution shape (compresses the right tail)
Interpretation: a +1 unit change in log(x) is a multiplicative (×e) change in x

When log fails:

Negative values: use signed log: sign(x) × log(|x|+1)
Zero values: use log(x+1) to avoid log(0)=-∞
Bimodal data: log compresses the tail but does not merge two peaks
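The zero and negative-value workarounds above can be sketched in a few lines. The helper name `signed_log` and the sample values are illustrative assumptions; `np.log1p(x)` computes log(1 + x).

```python
import numpy as np

def signed_log(x):
    """sign(x) * log(|x| + 1): defined for negative, zero, and positive inputs."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))

values = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])  # hypothetical values
transformed = signed_log(values)
print(np.round(transformed, 3))
```

The transform is odd-symmetric and monotone, so ordering is preserved and zero maps to zero.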

Square Root Transformation

Goal: goal 2 (variance stabilization for count data).

Formula: x_new = √x. Handles zeros, less aggressive than log.

Poisson variance stabilization: if X ~ Poisson(λ), then E[X] = λ and Var(X) = λ. The variance grows linearly with the mean, a textbook heteroscedasticity problem. After applying √X: Var(√X) ≈ 1/4, approximately constant regardless of λ (the approximation tightens as λ grows and is loose for small λ). This is the delta-method result for the Poisson variance-stabilizing transform.
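For reference, the 1/4 comes from a first-order Taylor expansion of g(x) = √x around the mean λ, which is the standard delta-method step:

```latex
\operatorname{Var}\bigl(g(X)\bigr)
  \approx \bigl[g'(\lambda)\bigr]^{2}\,\operatorname{Var}(X)
  = \left(\frac{1}{2\sqrt{\lambda}}\right)^{2} \lambda
  = \frac{1}{4}
```

The λ from the Poisson variance cancels the 1/λ from the squared derivative, which is why the result no longer depends on the mean.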

When variance grows with the mean, OLS estimates are unbiased but inefficient. Tests on the untransformed data have incorrect standard errors. The sqrt fix restores the constant-variance assumption without requiring weighted least squares.
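A quick simulation can check the variance-stabilizing claim empirically. This is a sketch under assumed settings: the rates, sample size, and seed are arbitrary choices, not values from the post.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

variances = {}
for lam in [5, 20, 100, 500]:  # hypothetical Poisson rates
    counts = rng.poisson(lam, size=200_000)
    # Store (variance of raw counts, variance of square-rooted counts).
    variances[lam] = (counts.var(), np.sqrt(counts).var())
    raw_var, sqrt_var = variances[lam]
    print(f"lambda={lam:>3}: Var(X)={raw_var:8.2f}  Var(sqrt(X))={sqrt_var:.3f}")
```

The raw variance should track λ, while the variance of the square root should sit near 0.25 for the larger rates and drift slightly above it for small λ.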

Box-Cox Transformation

Goal: goal 1 (normality). Find the optimal power transform automatically.

Formula family:

λ ≠ 0: x_new = (x^λ − 1) / λ
λ = 0: x_new = log(x) (limiting case)

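Special cases (each matches the named transform up to a linear shift and rescale; for example, λ = 1 gives x − 1):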

λ	Transform
1	Identity (no transform)
0	Log transform
0.5	Square root
−1	Reciprocal (1/x)
2	Square
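The formula family is short enough to write by hand and compare against scipy at fixed λ. `boxcox_manual` is a hypothetical helper for illustration; `stats.boxcox` returns only the transformed array when `lmbda` is supplied.

```python
import numpy as np
from scipy import stats

def boxcox_manual(x, lam):
    """Box-Cox power transform for strictly positive x."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)            # limiting case as lambda -> 0
    return (x**lam - 1.0) / lam

latency = np.array([89, 102, 145, 98, 203, 87, 156, 121, 310, 95], dtype=float)

for lam in [1, 0, 0.5, -1, 2]:
    ours = boxcox_manual(latency, lam)
    from_scipy = stats.boxcox(latency, lmbda=lam)  # fixed lambda: no optimization
    print(f"lambda={lam:>4}: matches scipy = {np.allclose(ours, from_scipy)}")
```

Note that λ = 1 yields x − 1 rather than x itself; the shift does not affect the shape of the distribution.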

Finding optimal λ: maximize the log-likelihood of the transformed data under a Normal model. No trial-and-error — scipy optimizes this directly.

Limitation: Box-Cox requires x > 0. For data with zeros, shift first: apply the transform to x + c for some constant c > 0.

Yeo-Johnson: extends Box-Cox to handle zeros and negative values using a two-piece formula (one branch for x ≥ 0, another for x < 0). Same log-likelihood optimization, broader applicability.
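The two-piece formula can be sketched as follows and checked against `scipy.stats.yeojohnson` at fixed λ. The helper name `yeojohnson_manual` and the test values are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def yeojohnson_manual(x, lam):
    """Two-piece Yeo-Johnson transform; accepts negative, zero, and positive x."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # Branch for x >= 0: Box-Cox applied to (x + 1).
    if lam == 0:
        out[pos] = np.log1p(x[pos])
    else:
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    # Branch for x < 0: Box-Cox with exponent (2 - lam) applied to (1 - x), sign flipped.
    if lam == 2:
        out[~pos] = -np.log1p(-x[~pos])
    else:
        out[~pos] = -((1.0 - x[~pos]) ** (2.0 - lam) - 1.0) / (2.0 - lam)
    return out

values = np.array([-5.0, -0.5, 0.0, 0.5, 5.0, 50.0])  # hypothetical mixed-sign data
for lam in [0, 0.5, 1, 2]:
    same = np.allclose(yeojohnson_manual(values, lam), stats.yeojohnson(values, lmbda=lam))
    print(f"lambda={lam}: matches scipy = {same}")
```

At λ = 1 both branches reduce to the identity, which is a convenient sanity check.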

Standardization (Z-Score)

Goal: goal 4 (comparable feature scales).

Formula: z = (x − x̄) / s

Applied to accuracy anchor: z = (xᵢ − 0.838) / 0.0512 for each fold score, where 0.0512 is the sample SD.

Fold	Accuracy	Z-score
1	0.82	−0.36
2	0.79	−0.94
3	0.91	+1.40
4	0.85	+0.23
5	0.78	−1.14
6	0.88	+0.81

After standardization: mean = 0, SD = 1. The distribution shape is unchanged.

When standardization is required:

Distance-based algorithms (k-NN, k-means, SVM with RBF): Euclidean distance is dominated by the feature with the largest scale. If accuracy is in [0,1] and training_size is in [4000, 5000], the range of training_size is 1000× the range of accuracy, so training_size dominates every distance computation.
Regularized models (Lasso, Ridge): L1/L2 penalties apply the same penalty strength to every coefficient. A feature measured on a larger scale needs a proportionally smaller coefficient to have the same effect, so it is penalized less relative to its actual influence and the penalty no longer treats features symmetrically.
Gradient descent: features on different scales create elongated loss surfaces (narrow valleys), causing gradient descent to oscillate and converge slowly. Standardized features make the loss surface more spherical.
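The distance-domination point can be seen with a handful of made-up rows. Everything in this sketch (the five models, the helper `nearest_to_first`) is a hypothetical illustration, not data from the post.

```python
import numpy as np

# Hypothetical models described by (accuracy, training_size).
X = np.array([
    [0.80, 4000.0],   # row 0: query model
    [0.81, 4300.0],   # row 1: similar accuracy, somewhat different training_size
    [0.95, 4010.0],   # row 2: very different accuracy, nearly identical training_size
    [0.70, 5000.0],
    [0.90, 4600.0],
])

def nearest_to_first(data):
    """Row index of the nearest neighbour of row 0 under Euclidean distance."""
    dists = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(dists)) + 1

X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # column-wise z-scores

print("nearest on raw features:         ", nearest_to_first(X))      # row 2
print("nearest on standardized features:", nearest_to_first(X_std))  # row 1
```

On the raw features the accuracy gap of 0.15 is invisible next to a training_size gap of 10, so row 2 wins. After standardization both features contribute on the same footing and row 1 becomes the nearest neighbour.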

Critical distinction: standardization does not make data normal. Z-scoring a right-skewed distribution gives a standardized right-skewed distribution: the shape is unchanged, and only the location (mean) and scale (SD) change.
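A short check on the latency anchor makes the point: skewness is computed from standardized deviations, so z-scoring cannot change it. This is a sketch using `scipy.stats.skew` with its default settings.

```python
import numpy as np
from scipy import stats

latency = np.array([89, 102, 145, 98, 203, 87, 156, 121, 310, 95], dtype=float)
z = (latency - latency.mean()) / latency.std(ddof=1)

skew_raw = stats.skew(latency)
skew_z = stats.skew(z)
print(f"skewness before z-scoring: {skew_raw:.3f}")
print(f"skewness after z-scoring:  {skew_z:.3f}")
```

The two printed values agree up to floating-point error, while the mean and SD of `z` are 0 and 1.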

Min-Max Normalization

Goal: goal 4, with bounded output.

Formula: x_new = (x − x_min) / (x_max − x_min)

Result: all values in [0, 1].

Applied to latency anchor: x_min = 87, x_max = 310.

(89 − 87) / (310 − 87) = 2 / 223 = 0.009
(310 − 87) / (310 − 87) = 223 / 223 = 1.000

When: neural network input layers, similarity scores, image pixels (255 → 1.0), anywhere you need a bounded range.

Limitation: a single outlier distorts the normalization for all other values. If x_max = 310 is an anomaly, the other nine values (all ≤ 203) are compressed into [0, 0.52] even though they are the typical range. Standardization is less sensitive because the mean and SD depend on every value rather than on the two extremes alone, although an outlier still shifts both.
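The compression is easy to see by scaling the latency anchor with and without the 310 ms request. The median/IQR scaling at the end is an alternative that this post does not otherwise cover; it is included only as an assumption-labelled sketch of a more outlier-resistant option.

```python
import numpy as np

latency = np.array([89, 102, 145, 98, 203, 87, 156, 121, 310, 95], dtype=float)

def min_max(x):
    """Rescale x linearly so that min -> 0 and max -> 1."""
    return (x - x.min()) / (x.max() - x.min())

is_typical = latency < 310                       # the nine values other than 310
with_outlier = min_max(latency)[is_typical]      # scaled alongside the 310 ms request
without_outlier = min_max(latency[is_typical])   # scaled on their own

print(f"largest typical value, 310 present: {with_outlier.max():.3f}")
print(f"largest typical value, 310 removed: {without_outlier.max():.3f}")

# Alternative (not from the text above): centre on the median, scale by the IQR.
q1, median, q3 = np.percentile(latency, [25, 50, 75])
robust = (latency - median) / (q3 - q1)
```

With the outlier present, the nine typical requests occupy only about half of the [0, 1] range; without it they use the whole range.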

	Standardization	Min-Max
Output range	Unbounded (typically −3 to +3)	[0, 1]
Outlier sensitivity	Moderate (outliers shift mean and SD)	High (depends on extremes)
Preserves distribution	Yes (shape unchanged)	Yes (shape unchanged)
Makes data Normal	No	No

Rank Transformation

Goal: goal 1 (convert to a form that is distribution-free).

Formula: replace each value with its rank (1 = smallest, n = largest). Ties: average rank.

Applied to accuracy anchor [0.82, 0.79, 0.91, 0.85, 0.78, 0.88]:

Sorted: 0.78, 0.79, 0.82, 0.85, 0.88, 0.91
Ranks: 0.78→1, 0.79→2, 0.82→3, 0.85→4, 0.88→5, 0.91→6

Connection to non-parametric tests: Spearman correlation is exactly Pearson correlation computed on ranks. The Mann-Whitney U test operates on ranks of the pooled samples. The Wilcoxon signed-rank test ranks the absolute values of paired differences. When you apply these tests, you are implicitly applying a rank transformation. The tests need no assumption about the shape of the underlying distribution because, without ties, the ranks are always a permutation of {1,...,n}, whatever the original values were.
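One consequence worth seeing directly: ranks depend only on ordering, so any monotone transform (such as log) leaves them untouched. The sketch below uses `scipy.stats.rankdata` with its default average-rank tie handling; the tied example values are made up.

```python
import numpy as np
from scipy import stats

latency = np.array([89, 102, 145, 98, 203, 87, 156, 121, 310, 95], dtype=float)

ranks_raw = stats.rankdata(latency)
ranks_log = stats.rankdata(np.log(latency))    # monotone transform: order unchanged
ranks_tied = stats.rankdata([10, 20, 20, 30])  # tied values share the average rank

print("ranks (raw):     ", ranks_raw)
print("ranks (log):     ", ranks_log)
print("ranks with a tie:", ranks_tied)
```

This is why Spearman correlation gives the same answer on latency and on log(latency), whereas Pearson correlation does not.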

Inverse Transformation (Back-Transforming)

If you log-transform y and fit a regression to log(y), predictions are in log scale. To get predictions in original units:

ŷ = exp(log(ŷ_pred))

Bias correction: exp(E[log(y)]) ≠ E[y]. When residuals in log space have variance σ²_ε, the corrected back-transform is:

ŷ_corrected = exp(log(ŷ_pred) + σ²_ε / 2)

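Without this correction, back-transformed predictions systematically underestimate the true mean, because exp of the log-scale prediction estimates the median rather than the mean. The correction assumes the log-space residuals are approximately Normal, and it matters when residual variance is large. For σ²_ε = 0.1, the correction factor is e^{0.05} ≈ 1.05, about 5%, which is often acceptable to ignore. For σ²_ε = 0.5 (common with noisy log-transformed data), the corrected mean is e^{0.25} ≈ 1.28 times the naive estimate, a 28% upward correction.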
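A simulation illustrates the size of the bias. This is a sketch under stated assumptions: the log-scale prediction of 4.8 and residual variance of 0.5 are hypothetical, and the noise is drawn as exactly Normal in log space, which is the case the correction is derived for.

```python
import numpy as np

rng = np.random.default_rng(42)

mu = 4.8       # hypothetical log-scale prediction
sigma2 = 0.5   # hypothetical residual variance in log space

# Simulate outcomes whose log equals the prediction plus Normal noise.
y = np.exp(mu + rng.normal(0.0, np.sqrt(sigma2), size=1_000_000))

naive = np.exp(mu)                   # back-transform without correction
corrected = np.exp(mu + sigma2 / 2)  # mean of a lognormal distribution
empirical = y.mean()

print(f"empirical mean: {empirical:.1f}")
print(f"naive:          {naive:.1f}  ({naive / empirical - 1:+.1%})")
print(f"corrected:      {corrected:.1f}  ({corrected / empirical - 1:+.1%})")
```

The naive value lands close to the median of the simulated outcomes, while the corrected value tracks their mean. If the log-space residuals are far from Normal, this correction is only approximate.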

Code and Output

```python
import numpy as np
from scipy import stats

latency = np.array([89, 102, 145, 98, 203, 87, 156, 121, 310, 95])
accuracy = np.array([0.82, 0.79, 0.91, 0.85, 0.78, 0.88])

print("=== Log Transformation ===")
log_latency = np.log(latency)
# Note: .std() defaults to the population SD (ddof=0), and stats.skew
# defaults to the biased (population) skewness estimator.
print(f"Raw: mean={latency.mean():.1f}, SD={latency.std():.1f}, skew={stats.skew(latency):.3f}")
print(f"Log: mean={log_latency.mean():.3f}, SD={log_latency.std():.3f}, skew={stats.skew(log_latency):.3f}")

# Shapiro-Wilk normality test before and after the transform
_, p_raw = stats.shapiro(latency)
_, p_log = stats.shapiro(log_latency)
print(f"Shapiro p-value: raw={p_raw:.4f}, log-transformed={p_log:.4f}")

print("\n=== Box-Cox Transformation ===")
# With no lmbda argument, boxcox returns (transformed data, fitted lambda)
bc_transformed, optimal_lambda = stats.boxcox(latency)
print(f"Optimal lambda: {optimal_lambda:.4f}")
_, p_bc = stats.shapiro(bc_transformed)
print(f"Shapiro p-value after Box-Cox: {p_bc:.4f}")

# Yeo-Johnson also accepts zero/negative values (latency has none; shown for comparison)
yj_transformed, yj_lambda = stats.yeojohnson(latency)
print(f"Yeo-Johnson lambda: {yj_lambda:.4f}")

print("\n=== Standardization (Z-score) ===")
# Sample SD (ddof=1) is used for the z-scores
x_bar, s = accuracy.mean(), accuracy.std(ddof=1)
z_accuracy = (accuracy - x_bar) / s
print(f"Original: mean={accuracy.mean():.4f}, SD={accuracy.std(ddof=1):.4f}")
print(f"Z-scores: {np.round(z_accuracy, 3)}")
print(f"After z-score: mean={z_accuracy.mean():.6f}, SD={z_accuracy.std(ddof=1):.6f}")

print("\n=== Min-Max Normalization ===")
lat_minmax = (latency - latency.min()) / (latency.max() - latency.min())
print(f"Min-max range: [{lat_minmax.min():.3f}, {lat_minmax.max():.3f}]")
print(f"Sample values: {np.round(lat_minmax[:5], 3)}")

print("\n=== Rank Transformation ===")
ranks = stats.rankdata(accuracy)
print(f"Accuracy: {accuracy}")
print(f"Ranks:    {ranks}")

# Spearman via ranks equals scipy.stats.spearmanr
# (the first six latency values are paired with the six accuracy scores for illustration)
latency_small = latency[:6]
spearman_r, _ = stats.spearmanr(accuracy, latency_small)
rank_lat = stats.rankdata(latency_small)
pearson_on_ranks, _ = stats.pearsonr(ranks, rank_lat)
print(f"\nSpearman r (scipy): {spearman_r:.4f}")
print(f"Pearson on ranks:   {pearson_on_ranks:.4f}")

print("\n=== Back-Transform Bias Correction ===")
# Illustration only: log(mean latency) stands in for a log-space model prediction,
# with an assumed residual variance of 0.08
log_pred = np.log(latency.mean())
sigma2_resid = 0.08
naive = np.exp(log_pred)
corrected = np.exp(log_pred + sigma2_resid / 2)
print(f"Log-space prediction: {log_pred:.4f}")
print(f"Naive back-transform:     {naive:.2f} ms")
print(f"Bias-corrected:           {corrected:.2f} ms  (+{(corrected/naive-1)*100:.1f}%)")
```

Output:

=== Log Transformation ===
Raw: mean=140.6, SD=66.4, skew=1.561
Log: mean=4.859, SD=0.393, skew=0.995
Shapiro p-value: raw=0.0176, log-transformed=0.6352

=== Box-Cox Transformation ===
Optimal lambda: 0.1867
Shapiro p-value after Box-Cox: 0.8014

Yeo-Johnson lambda: 0.2089

=== Standardization (Z-score) ===
Original: mean=0.8383, SD=0.0512
Z-scores: [-0.358 -0.945  1.401  0.228 -1.14   0.815]
After z-score: mean=0.000000, SD=1.000000

=== Min-Max Normalization ===
Min-max range: [0.000, 1.000]
Sample values: [0.009 0.067 0.26  0.049 0.52 ]

=== Rank Transformation ===
Accuracy: [0.82 0.79 0.91 0.85 0.78 0.88]
Ranks:    [3. 2. 6. 4. 1. 5.]

Spearman r (scipy): -0.3714
Pearson on ranks:   -0.3714

=== Back-Transform Bias Correction ===
Log-space prediction: 4.9459
Naive back-transform:     140.60 ms
Bias-corrected:           146.34 ms  (+4.1%)

Limitations

Log transform requires positive data. Box-Cox requires strict positivity (x > 0). Min-max collapses when outliers dominate the extremes. Standardization does not correct non-normality. Rank transform discards information about the magnitudes of differences between values — two datasets with different numerical spread can have identical ranks.

No transform fixes structural problems in the data: a bimodal distribution reflects two distinct sub-populations; transformation does not merge them. Missing values, measurement error, and incorrect labels require data cleaning, not transformation.

Test Your Understanding

Your model latency data has skewness = 1.56. After applying log transform, skewness becomes 0.99. A colleague says "the Shapiro-Wilk test now shows p = 0.63, so our data is Normal." Explain what the Shapiro-Wilk result means and what inference limitation remains even after the transform succeeds.
You have request count data (integers ≥ 0) with counts ranging from 0 to 450. A Poisson(λ) model is a reasonable fit. Why is sqrt better than log for this case? What happens to log(0) and how does each transform handle the variance-mean relationship differently?
A dataset has two features: accuracy (range [0,1]) and training_examples (range [4000,5000]). You apply min-max normalization. A single outlier with training_examples = 15000 gets added. Describe specifically what happens to the normalized values of all existing training_examples values. What alternative would be more robust?
Standardization preserves the distribution shape, and rank transformation discards magnitude information. For Spearman correlation, explain why discarding magnitude information is a feature rather than a limitation.
You fit a linear regression to log(latency) and get predictions in log scale. Your residuals have mean 0 and variance σ² = 0.25. You back-transform a prediction of log(ŷ) = 4.8 using ŷ = exp(4.8) = 121.5ms. By what percentage does this underestimate E[latency] for a request with those features? Apply the bias correction formula and interpret the result.