Layer Normalization does two things: it subtracts the mean (centering) and divides by the standard deviation (scaling). RMSNorm asks: do you need both? The hypothesis from Zhang & Sennrich (2019) is that re-scaling is what matters for training stability — the re-centering adds little and costs compute. So RMSNorm drops the mean subtraction entirely. It only divides by the root-mean-square of the input features.
LLaMA (all versions), Mistral, and Gemma all use RMSNorm. At 7B+ parameters trained for trillions of tokens, saving 7–10% of normalization compute per forward and backward pass is a meaningful reduction in training cost.
Anchor: x = [2.0, 0.5, −1.0, 1.5], the same vector used in the LayerNorm post so the two can be compared directly. γ = 1.0 per feature, no β.
LayerNorm vs RMSNorm
LayerNorm:
- Compute mean μ = (1/d) Σ xᵢ
- Compute variance σ² = (1/d) Σ (xᵢ − μ)²
- Normalize: x̂ᵢ = (xᵢ − μ) / √(σ² + ε)
- Scale and shift: yᵢ = γᵢ · x̂ᵢ + βᵢ
RMSNorm:
- Compute RMS = √((1/d) Σ xᵢ²)
- Normalize: x̂ᵢ = xᵢ / RMS(x)
- Scale: yᵢ = γᵢ · x̂ᵢ ← no β, no mean subtraction
The mean subtraction (step 1 in LN) and the β shift parameter are both gone. Two fewer operations in the forward pass, simpler backward pass.
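One way to see what dropping the mean subtraction changes: the squared RMS decomposes as RMS² = σ² + μ², so the two normalizers agree exactly when the input mean is zero. The following is a minimal NumPy sketch on the anchor vector, with ε omitted for clarity; the variable names are illustrative only.

```python
import numpy as np

x = np.array([2.0, 0.5, -1.0, 1.5])

mu = x.mean()                 # mean of the features
var = x.var()                 # population variance, (1/d) * sum((x - mu)^2)
rms_sq = (x ** 2).mean()      # mean of squares, i.e. RMS^2

# RMS^2 = variance + mean^2
identity_holds = np.isclose(rms_sq, var + mu ** 2)

# On a zero-mean input, dividing by RMS matches LayerNorm's normalization
x_centered = x - mu
rms_out = x_centered / np.sqrt((x_centered ** 2).mean())
ln_out = (x_centered - x_centered.mean()) / np.sqrt(x_centered.var())
same_on_centered = np.allclose(rms_out, ln_out)

print(identity_holds, same_on_centered)
```

The larger the mean is relative to the spread of the input, the more the RMSNorm output departs from the LayerNorm output.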
Computing on the Anchor
x = [2.0, 0.5, −1.0, 1.5]
Step 1 — Square each value: xᵢ² = [2.0², 0.5², (−1.0)², 1.5²] = [4.0, 0.25, 1.0, 2.25]
Step 2 — Mean of squares: (4.0 + 0.25 + 1.0 + 2.25) / 4 = 7.5 / 4 = 1.875
Step 3 — RMS: RMS = √1.875 = 1.3693
Step 4 — Normalize:
- x̂₁ = 2.0 / 1.3693 = 1.4606
- x̂₂ = 0.5 / 1.3693 = 0.3651
- x̂₃ = −1.0 / 1.3693 = −0.7303
- x̂₄ = 1.5 / 1.3693 = 1.0954
Step 5 — Scale (γ = 1): y = x̂
| Step | Formula | Result |
|---|---|---|
| xᵢ² | [2², 0.5², (−1)², 1.5²] | [4.0, 0.25, 1.0, 2.25] |
| mean(xᵢ²) | (4.0+0.25+1.0+2.25)/4 | 1.875 |
| RMS | √1.875 | 1.3693 |
| x̂ | xᵢ/1.3693 | [1.461, 0.365, −0.730, 1.095] |
| y (γ=1) | 1.0·x̂ | [1.461, 0.365, −0.730, 1.095] |
Compare with LayerNorm on the same x:
| | Feature 0 | Feature 1 | Feature 2 | Feature 3 |
|---|---|---|---|---|
| RMSNorm | 1.461 | 0.365 | −0.730 | 1.095 |
| LayerNorm | 1.091 | −0.218 | −1.528 | 0.655 |
The outputs differ because LayerNorm subtracts μ = 0.75 first — the anchor mean is positive, so LN shifts all values left before normalizing. RMSNorm does not subtract anything, so the positive mean carries through into the normalized values.
Gradient Through RMSNorm
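∂L/∂xᵢ = (1/RMS) · [γᵢ · ∂L/∂yᵢ − x̂ᵢ · (1/d) Σⱼ γⱼ · x̂ⱼ · ∂L/∂yⱼ]
When every γ is equal, as with γ = 1 here, γ factors out of the bracket and the sum reduces to (1/d) Σⱼ x̂ⱼ · ∂L/∂yⱼ.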
With ∂L/∂y = [0.1, −0.2, 0.3, −0.1], γ = 1, RMS = 1.3693, x̂ = [1.461, 0.365, −0.730, 1.095]:
Shared term: (1/4) Σⱼ x̂ⱼ · ∂L/∂yⱼ = (1/4) · [1.461×0.1 + 0.365×(−0.2) + (−0.730)×0.3 + 1.095×(−0.1)] = (1/4) · [0.1461 − 0.073 − 0.219 − 0.1095] = (1/4) · (−0.2554) = −0.0639
Then ∂L/∂x₁ = (1/1.3693) · [0.1 − 1.461×(−0.0639)] = (0.730) · [0.1 + 0.0934] = 0.730 × 0.1934 = 0.141
Compared to LN's gradient, which has two mean-subtraction correction terms, RMSNorm has only one: the x̂ᵢ · shared_term piece. This is the simpler backward pass.
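The worked gradient can be checked numerically. The sketch below writes the backward pass with g = γ · ∂L/∂y so that per-feature γ values are handled, then compares it against central finite differences on the scalar L = Σ yᵢ · ∂L/∂yᵢ. The function names `rms_norm_forward` and `rms_norm_backward` and the step size `h` are assumptions made for this illustration, not part of any library.

```python
import numpy as np

def rms_norm_forward(x, gamma, eps=1e-8):
    rms = np.sqrt((x ** 2).mean() + eps)
    x_hat = x / rms
    return gamma * x_hat, rms, x_hat

def rms_norm_backward(dy, gamma, rms, x_hat):
    g = gamma * dy                    # gradient with respect to x_hat
    shared = (x_hat * g).mean()       # (1/d) * sum_j x_hat_j * g_j
    return (g - x_hat * shared) / rms

x = np.array([2.0, 0.5, -1.0, 1.5])
gamma = np.ones(4)
dy = np.array([0.1, -0.2, 0.3, -0.1])

_, rms, x_hat = rms_norm_forward(x, gamma)
dx = rms_norm_backward(dy, gamma, rms, x_hat)

# Central finite differences on L = sum(dy * y)
h = 1e-6
dx_numeric = np.zeros_like(x)
for i in range(x.size):
    step = np.zeros_like(x)
    step[i] = h
    y_plus, _, _ = rms_norm_forward(x + step, gamma)
    y_minus, _, _ = rms_norm_forward(x - step, gamma)
    dx_numeric[i] = ((dy * y_plus).sum() - (dy * y_minus).sum()) / (2 * h)

print(f"dL/dx_1 (analytic): {dx[0]:.4f}")
print("matches finite differences:", np.allclose(dx, dx_numeric, atol=1e-6))
```

The first component should agree with the hand calculation above to three decimals; the remaining components are left for the exercises.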
Code
```python
import numpy as np

def rms_norm(x, gamma, eps=1e-8):
    # Divide by the root-mean-square of the features; no mean subtraction, no beta
    rms = np.sqrt((x**2).mean() + eps)
    x_hat = x / rms
    return gamma * x_hat, rms, x_hat

def layer_norm(x, gamma, beta, eps=1e-5):
    # Center by the mean, scale by the standard deviation, then apply gamma and beta
    mu = x.mean()
    std = np.sqrt(x.var() + eps)
    x_hat = (x - mu) / std
    return gamma * x_hat + beta, mu, std, x_hat

x = np.array([2.0, 0.5, -1.0, 1.5])
gamma = np.ones(4)

y_rms, rms, x_hat_rms = rms_norm(x, gamma)
y_ln, mu, std, x_hat_ln = layer_norm(x, gamma, np.zeros(4))

print(f"x: {x}")
print(f"\nRMSNorm:")
print(f" RMS: {rms:.4f}")
print(f" x_hat: {x_hat_rms.round(4)}")
print(f" output y: {y_rms.round(4)}")
print(f"\nLayerNorm (same x):")
print(f" mean: {mu:.4f}")
print(f" std: {std:.4f}")
print(f" x_hat: {x_hat_ln.round(4)}")
print(f" output y: {y_ln.round(4)}")
```

```
x: [ 2.   0.5 -1.   1.5]

RMSNorm:
 RMS: 1.3693
 x_hat: [ 1.4606  0.3651 -0.7303  1.0954]
 output y: [ 1.4606  0.3651 -0.7303  1.0954]

LayerNorm (same x):
 mean: 0.7500
 std: 1.1456
 x_hat: [ 1.0911 -0.2182 -1.5275  0.6547]
 output y: [ 1.0911 -0.2182 -1.5275  0.6547]
```

Related Concepts
RMSNorm is a direct simplification of Layer Normalization (05-layer-normalization.md) — read that post first to understand the base design. The hypothesis that re-centering is unnecessary was empirically validated across NLP tasks; in the LLaMA architecture, RMSNorm is applied before every self-attention block and every FFN block (Pre-LN placement). The SwiGLU FFN (03-activations/11-swiglu.md) sits between two RMSNorm calls in LLaMA, making the two posts closely connected.
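The Pre-LN placement described above can be sketched as follows. This is a schematic under stated assumptions, not LLaMA's implementation: `attn_fn` and `ffn_fn` are hypothetical stand-ins (a plain linear map and a ReLU layer rather than real attention or SwiGLU), and `rms_norm` here normalizes over the last axis so that each token in a (sequence, features) array is handled independently.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-8):
    # Normalize over the last (feature) axis, one RMS per token
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

def pre_norm_block(x, gamma_attn, gamma_ffn, attn_fn, ffn_fn):
    # Pre-LN: normalize the sublayer input, then add the result to the residual stream
    x = x + attn_fn(rms_norm(x, gamma_attn))
    x = x + ffn_fn(rms_norm(x, gamma_ffn))
    return x

# Hypothetical stand-ins for the real sublayers, only to make the sketch runnable
rng = np.random.default_rng(0)
d = 4
W_attn = rng.normal(size=(d, d)) * 0.1
W_ffn = rng.normal(size=(d, d)) * 0.1
attn_fn = lambda h: h @ W_attn
ffn_fn = lambda h: np.maximum(h @ W_ffn, 0.0)

tokens = rng.normal(size=(3, d))   # (sequence length, features)
out = pre_norm_block(tokens, np.ones(d), np.ones(d), attn_fn, ffn_fn)
print(out.shape)
```

Each sublayer gets its own γ vector, which is why the block carries two sets of RMSNorm parameters.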
Honest Limitations
Without mean subtraction, the RMSNorm output mean is generally not zero: with γ = 1 it equals the input's mean-to-RMS ratio, mean(x) / RMS(x), which is about 0.548 for the anchor. If the input activation consistently has a non-zero mean (which can happen with one-sided activations), the output will be systematically shifted. LayerNorm removes this offset through mean subtraction and can reintroduce a learned one through β; RMSNorm relies on downstream layers (or the loss) to handle it.
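The size of that shift can be checked on the anchor. The sketch below assumes γ = 1, β = 0, and omits ε; it compares the mean of the RMSNorm output with mean(x) / RMS(x) and with the mean of the LayerNorm output.

```python
import numpy as np

x = np.array([2.0, 0.5, -1.0, 1.5])

rms = np.sqrt((x ** 2).mean())
y_rms = x / rms                              # RMSNorm, gamma = 1
y_ln = (x - x.mean()) / np.sqrt(x.var())     # LayerNorm, gamma = 1, beta = 0

print(f"RMSNorm output mean:     {y_rms.mean():.4f}")
print(f"mean(x) / RMS(x):        {x.mean() / rms:.4f}")
print(f"|LayerNorm output mean|: {abs(y_ln.mean()):.4f}")
```

The first two values should coincide, while the LayerNorm mean is zero up to floating-point error.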
With only γ and no β, RMSNorm cannot shift the output distribution — it can only scale it. For tasks where the downstream activation expects zero-centered inputs (like sigmoid for gating), this can be a subtle problem that LayerNorm's β would have corrected for free.
The empirical gains of RMSNorm over LN are modest in models below 1B parameters — the speed difference is measurable but unlikely to affect final model quality. For small-scale experiments, the choice between LN and RMSNorm is mostly a matter of matching the architecture you are reproducing rather than a principled decision.
Test Your Understanding
- Compute RMS for x = [−2.0, 1.0, 0.0, 3.0]. Now compute RMSNorm(x) with γ = 1. Compare to what LayerNorm would output on the same x.
- RMSNorm has no β parameter. If a downstream activation function requires zero-centered inputs, how would you compensate for the missing shift without adding β back?
- The LLaMA-2 7B model has d=4096 hidden dimensions and 32 layers. How many trainable parameters are in all RMSNorm layers combined (pre-attention and pre-FFN, one per sublayer)?
- Compute ∂L/∂x₂ for the anchor x = [2.0, 0.5, −1.0, 1.5] using the gradient formula and ∂L/∂y = [0.1, −0.2, 0.3, −0.1]. Show all substituted values.
- An engineer proposes adding β back to RMSNorm to recover the centering ability, arguing it costs only d extra parameters per layer. Under what specific conditions would this actually improve model performance, and when would it make no difference?