
RMSNorm

Jul 3, 2026 · 6 min read · By Mohammed Vasim
deep-learning, neural-networks, machine-learning, representation-learning

Layer Normalization does two things: it subtracts the mean (centering) and divides by the standard deviation (scaling). RMSNorm asks: do you need both? The hypothesis from Zhang & Sennrich (2019) is that re-scaling is what matters for training stability — the re-centering adds little and costs compute. So RMSNorm drops the mean subtraction entirely. It only divides by the root-mean-square of the input features.

LLaMA (all versions), Mistral, and Gemma all use RMSNorm. At 7B+ parameters trained for trillions of tokens, saving 7–10% of normalization compute per forward and backward pass is a meaningful reduction in training cost.

Anchor: x = [2.0, 0.5, −1.0, 1.5], the same input used in the LayerNorm post, to enable direct comparison. γ = 1.0 per feature, no β.


LayerNorm vs RMSNorm

LayerNorm:

  1. Compute mean μ = (1/d) Σ xᵢ
  2. Compute variance σ² = (1/d) Σ (xᵢ − μ)²
  3. Normalize: x̂ᵢ = (xᵢ − μ) / √(σ² + ε)
  4. Scale and shift: yᵢ = γᵢ · x̂ᵢ + βᵢ

RMSNorm:

  1. Compute RMS = √((1/d) Σ xᵢ²)
  2. Normalize: x̂ᵢ = xᵢ / RMS(x)
  3. Scale: yᵢ = γᵢ · x̂ᵢ ← no β, no mean subtraction

The mean subtraction (computing μ in step 1 of LN and subtracting it in step 3) and the β shift parameter are both gone. That is two fewer operations in the forward pass and a simpler backward pass.

LayerNorm: x (input) → subtract μ → divide by σ → γ · x̂ + β (4 ops, 2 params: γ, β)
RMSNorm: x (input) → divide by RMS → γ · x̂ (2 ops, 1 param: γ only)
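One way to see what dropping the mean subtraction changes is to rescale and shift the input and compare the outputs. The sketch below assumes γ = 1 and β = 0; the helper names and the constants 3.0 and 5.0 are chosen here for illustration and are not part of the original post.

```python
import numpy as np

def rms_norm(x, eps=1e-8):
    # Divide by the root-mean-square over the last axis (gamma = 1 assumed).
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def layer_norm(x, eps=1e-8):
    # Subtract the mean, then divide by the standard deviation (gamma = 1, beta = 0 assumed).
    mu = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.array([2.0, 0.5, -1.0, 1.5])

# Rescaling the input by a positive constant: both outputs stay the same.
rms_scale_invariant = np.allclose(rms_norm(3.0 * x), rms_norm(x))
ln_scale_invariant = np.allclose(layer_norm(3.0 * x), layer_norm(x))

# Shifting the input by a constant: only LayerNorm removes the offset.
rms_shift_invariant = np.allclose(rms_norm(x + 5.0), rms_norm(x))
ln_shift_invariant = np.allclose(layer_norm(x + 5.0), layer_norm(x))

print("scale invariant:", rms_scale_invariant, ln_scale_invariant)
print("shift invariant:", rms_shift_invariant, ln_shift_invariant)
```

Both normalizations keep the re-scaling behaviour; only the re-centering is given up.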

Computing on the Anchor

x = [2.0, 0.5, −1.0, 1.5]

Step 1 — Square each value: xᵢ² = [2.0², 0.5², (−1.0)², 1.5²] = [4.0, 0.25, 1.0, 2.25]

Step 2 — Mean of squares: (4.0 + 0.25 + 1.0 + 2.25) / 4 = 7.5 / 4 = 1.875

Step 3 — RMS: RMS = √1.875 = 1.3693

Step 4 — Normalize:

  • x̂₁ = 2.0 / 1.3693 = 1.4606
  • x̂₂ = 0.5 / 1.3693 = 0.3651
  • x̂₃ = −1.0 / 1.3693 = −0.7303
  • x̂₄ = 1.5 / 1.3693 = 1.0954

Step 5 — Scale (γ = 1): y = x̂

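| Step | Formula | Result |
| --- | --- | --- |
| xᵢ² | [2², 0.5², (−1)², 1.5²] | [4.0, 0.25, 1.0, 2.25] |
| mean(xᵢ²) | (4.0 + 0.25 + 1.0 + 2.25) / 4 | 1.875 |
| RMS | √1.875 | 1.3693 |
| x̂ | xᵢ / 1.3693 | [1.461, 0.365, −0.730, 1.095] |
| y (γ = 1) | 1.0 · x̂ | [1.461, 0.365, −0.730, 1.095] |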
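The hand calculation can be replayed step by step. This is a minimal sketch that omits ε, as the arithmetic above does; the variable names are picked here for readability. It also checks one property that follows from the definition: the normalized vector has an RMS of 1.

```python
import numpy as np

x = np.array([2.0, 0.5, -1.0, 1.5])

squares = x**2            # Step 1: square each value
mean_sq = squares.mean()  # Step 2: mean of squares
rms = np.sqrt(mean_sq)    # Step 3: root-mean-square (epsilon omitted here)
x_hat = x / rms           # Step 4: normalize
y = 1.0 * x_hat           # Step 5: scale with gamma = 1

# Sanity check: the RMS of the normalized vector should be 1.
rms_of_output = np.sqrt(np.mean(x_hat**2))

print("mean of squares:", mean_sq)
print("RMS:            ", round(float(rms), 4))
print("x_hat:          ", x_hat.round(4))
print("RMS of x_hat:   ", rms_of_output)
```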

Compare with LayerNorm on the same x:

| | Feature 0 | Feature 1 | Feature 2 | Feature 3 |
| --- | --- | --- | --- | --- |
| RMSNorm | 1.461 | 0.365 | −0.730 | 1.095 |
| LayerNorm | 1.091 | −0.218 | −1.528 | 0.655 |

The outputs differ because LayerNorm subtracts μ = 0.75 first — the anchor mean is positive, so LN shifts all values left before normalizing. RMSNorm does not subtract anything, so the positive mean carries through into the normalized values.
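Put differently, LayerNorm (with γ = 1, β = 0) behaves like RMSNorm applied to the centered input, because the RMS of x − μ is the standard deviation of x. The sketch below checks this on the anchor; ε is set to 0 so the identity holds exactly, which is an assumption made for this illustration only.

```python
import numpy as np

def rms_norm(x, eps=0.0):
    # RMSNorm with gamma = 1; eps = 0 is assumed so the comparison is exact.
    return x / np.sqrt(np.mean(x**2) + eps)

def layer_norm(x, eps=0.0):
    # LayerNorm with gamma = 1, beta = 0.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

x = np.array([2.0, 0.5, -1.0, 1.5])
centered = x - x.mean()  # subtract mu = 0.75 by hand

ln_out = layer_norm(x)
rms_on_centered = rms_norm(centered)
matches = np.allclose(ln_out, rms_on_centered)

print("LayerNorm(x):      ", ln_out.round(4))
print("RMSNorm(x - mean): ", rms_on_centered.round(4))
print("identical:", matches)
```

When the input already has zero mean, the two normalizations therefore give the same result.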


Gradient Through RMSNorm

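∂L/∂xᵢ = (1/RMS) · [γᵢ · ∂L/∂yᵢ − x̂ᵢ · (1/d) Σⱼ γⱼ · x̂ⱼ · ∂L/∂yⱼ]

When every γⱼ has the same value γ, this reduces to (γ/RMS) · [∂L/∂yᵢ − x̂ᵢ · (1/d) Σⱼ x̂ⱼ · ∂L/∂yⱼ], which is the form used in the worked example that follows.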

With ∂L/∂y = [0.1, −0.2, 0.3, −0.1], γ = 1, RMS = 1.3693, x̂ = [1.461, 0.365, −0.730, 1.095]:

Shared term: (1/4) Σⱼ x̂ⱼ · ∂L/∂yⱼ = (1/4) · [1.461×0.1 + 0.365×(−0.2) + (−0.730)×0.3 + 1.095×(−0.1)] = (1/4) · [0.1461 − 0.073 − 0.219 − 0.1095] = (1/4) · (−0.2554) = −0.0639

Then ∂L/∂x₁ = (1/1.3693) · [0.1 − 1.461×(−0.0639)] = (0.730) · [0.1 + 0.0934] = 0.730 × 0.1934 = 0.141

LN's gradient has two mean-subtraction correction terms; RMSNorm's has only one, the x̂ᵢ · shared_term piece. This is the simpler backward pass.
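A finite-difference check is a cheap way to confirm a hand-derived gradient. The sketch below implements the backward pass for a single vector and compares it against central differences on a surrogate loss L = Σ (∂L/∂yᵢ) · yᵢ. The function names and the step size h are assumptions made for this example, and ε is omitted to match the hand calculation.

```python
import numpy as np

def rms_norm(x, gamma):
    # Forward pass for one vector, epsilon omitted.
    return gamma * x / np.sqrt(np.mean(x**2))

def rms_norm_backward(x, gamma, grad_y):
    # Analytic gradient of the loss with respect to x.
    d = x.shape[-1]
    rms = np.sqrt(np.mean(x**2))
    x_hat = x / rms
    # gamma is kept inside the sum so the result also holds for non-uniform gamma.
    shared = np.sum(gamma * x_hat * grad_y) / d
    return (gamma * grad_y - x_hat * shared) / rms

x = np.array([2.0, 0.5, -1.0, 1.5])
gamma = np.ones(4)
grad_y = np.array([0.1, -0.2, 0.3, -0.1])

grad_x = rms_norm_backward(x, gamma, grad_y)

# Central differences on the surrogate loss L = sum(grad_y * y).
h = 1e-6
numeric = np.zeros_like(x)
for i in range(x.size):
    step = np.zeros_like(x)
    step[i] = h
    loss_plus = np.sum(grad_y * rms_norm(x + step, gamma))
    loss_minus = np.sum(grad_y * rms_norm(x - step, gamma))
    numeric[i] = (loss_plus - loss_minus) / (2 * h)

print("dL/dx_1:", round(float(grad_x[0]), 3))  # 0.141, as in the hand calculation
print("matches finite differences:", np.allclose(grad_x, numeric, atol=1e-6))
```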


Code

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-8):
    # Root-mean-square over the features of a single vector; eps guards against division by zero
    rms = np.sqrt((x**2).mean() + eps)
    x_hat = x / rms                      # no mean subtraction
    return gamma * x_hat, rms, x_hat     # scale only, no beta

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean()
    std = np.sqrt(x.var() + eps)         # population variance (divides by d)
    x_hat = (x - mu) / std               # center, then scale
    return gamma * x_hat + beta, mu, std, x_hat

# Anchor input, gamma = 1 per feature
x = np.array([2.0, 0.5, -1.0, 1.5])
gamma = np.ones(4)

y_rms, rms, x_hat_rms = rms_norm(x, gamma)
y_ln, mu, std, x_hat_ln = layer_norm(x, gamma, np.zeros(4))  # beta = 0

print(f"x:              {x}")
print(f"\nRMSNorm:")
print(f"  RMS:          {rms:.4f}")
print(f"  x_hat:        {x_hat_rms.round(4)}")
print(f"  output y:     {y_rms.round(4)}")
print(f"\nLayerNorm (same x):")
print(f"  mean:         {mu:.4f}")
print(f"  std:          {std:.4f}")
print(f"  x_hat:        {x_hat_ln.round(4)}")
print(f"  output y:     {y_ln.round(4)}")
```

```text
x:              [ 2.   0.5 -1.   1.5]

RMSNorm:
  RMS:          1.3693
  x_hat:        [ 1.4606  0.3651 -0.7303  1.0954]
  output y:     [ 1.4606  0.3651 -0.7303  1.0954]

LayerNorm (same x):
  mean:         0.7500
  std:          1.1456
  x_hat:        [ 1.0911 -0.2182 -1.5275  0.6547]
  output y:     [ 1.0911 -0.2182 -1.5275  0.6547]
```
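The functions above reduce over the whole array, which is fine for a single vector. For a batch of token vectors the mean of squares would normally be taken over the feature axis only. The following is a sketch under that assumption; `rms_norm_lastdim` and the (2, 3, 4) batch shape are made up for this example.

```python
import numpy as np

def rms_norm_lastdim(x, gamma, eps=1e-8):
    # Mean of squares over the feature (last) axis only; keepdims lets it broadcast back.
    mean_sq = np.mean(x**2, axis=-1, keepdims=True)
    return gamma * x / np.sqrt(mean_sq + eps)

# Hypothetical batch: 2 sequences, 3 tokens each, 4 features per token.
rng = np.random.default_rng(0)
batch = rng.normal(size=(2, 3, 4))
batch[0, 0] = [2.0, 0.5, -1.0, 1.5]  # place the anchor in the first slot

out = rms_norm_lastdim(batch, np.ones(4))

print(out.shape)
print(out[0, 0].round(4))  # same values as the single-vector anchor result
```

Each token is normalized independently, so the anchor row gives the same values as before regardless of what else is in the batch.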

RMSNorm is a direct simplification of Layer Normalization (05-layer-normalization.md) — read that post first to understand the base design. The hypothesis that re-centering is unnecessary was empirically validated across NLP tasks; in the LLaMA architecture, RMSNorm is applied before every self-attention block and every FFN block (Pre-LN placement). The SwiGLU FFN (03-activations/11-swiglu.md) sits between two RMSNorm calls in LLaMA, making the two posts closely connected.
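The Pre-LN placement can be sketched as a residual block in which each sublayer sees a normalized copy of its input while the residual path stays un-normalized. This is an illustration only: `pre_norm_block` is a hypothetical helper, and the attention and FFN sublayers are replaced by small random linear maps rather than real attention or SwiGLU.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-8):
    # RMSNorm over the feature (last) axis.
    return gamma * x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def pre_norm_block(x, attn, ffn, gamma_attn, gamma_ffn):
    # Normalize, run the sublayer, then add the result to the un-normalized input.
    h = x + attn(rms_norm(x, gamma_attn))
    return h + ffn(rms_norm(h, gamma_ffn))

# Stand-in sublayers (hypothetical): fixed linear maps, not real attention or SwiGLU.
rng = np.random.default_rng(0)
d = 4
w_attn = rng.normal(scale=0.1, size=(d, d))
w_ffn = rng.normal(scale=0.1, size=(d, d))

def attn(z):
    return z @ w_attn

def ffn(z):
    return z @ w_ffn

x = np.array([[2.0, 0.5, -1.0, 1.5]])  # one token with d = 4
out = pre_norm_block(x, attn, ffn, np.ones(d), np.ones(d))
print(out.shape)
```

Each of the two norms carries its own γ vector, which is why a block of this shape has 2 · d normalization parameters.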

Honest Limitations

Without mean subtraction, the RMSNorm output mean is not zero. With γ = 1 it equals the input's mean-to-RMS ratio, which is at most 1 in magnitude; for the anchor that is 0.75 / 1.3693 ≈ 0.548. If the input activation consistently has a non-zero mean (which can happen with one-sided activations), the output will be systematically shifted. LayerNorm removes this offset through mean subtraction and can relearn any useful shift through β; RMSNorm relies on downstream layers (or the loss) to handle it.
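The size of that shift is easy to measure on the anchor. A small sketch, assuming γ = 1 and no ε, in which the output mean comes out as mean(x) / RMS(x):

```python
import numpy as np

x = np.array([2.0, 0.5, -1.0, 1.5])
rms = np.sqrt(np.mean(x**2))
x_hat = x / rms  # RMSNorm output with gamma = 1

output_mean = x_hat.mean()
ratio = x.mean() / rms  # mean-to-RMS ratio of the input

print(round(float(output_mean), 4))  # 0.5477
print(np.isclose(output_mean, ratio))
```

The ratio cannot exceed 1 in magnitude, since the absolute value of the mean of a vector is never larger than its RMS.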

With only γ and no β, RMSNorm cannot shift the output distribution — it can only scale it. For tasks where the downstream activation expects zero-centered inputs (like sigmoid for gating), this can be a subtle problem that LayerNorm's β would have corrected for free.

The empirical gains of RMSNorm over LN are modest in models below 1B parameters — the speed difference is measurable but unlikely to affect final model quality. For small-scale experiments, the choice between LN and RMSNorm is mostly a matter of matching the architecture you are reproducing rather than a principled decision.


Test Your Understanding

  1. Compute RMS for x = [−2.0, 1.0, 0.0, 3.0]. Now compute RMSNorm(x) with γ = 1. Compare to what LayerNorm would output on the same x.

  2. RMSNorm has no β parameter. If a downstream activation function requires zero-centered inputs, how would you compensate for the missing shift without adding β back?

  3. The LLaMA-2 7B model has d=4096 hidden dimensions and 32 layers. How many trainable parameters are in all RMSNorm layers combined (pre-attention and pre-FFN, one per sublayer)?

  4. Compute ∂L/∂x₂ for the anchor x = [2.0, 0.5, −1.0, 1.5] using the gradient formula and ∂L/∂y = [0.1, −0.2, 0.3, −0.1]. Show all substituted values.

  5. An engineer proposes adding β back to RMSNorm to recover the centering ability, arguing it costs only d extra parameters per layer. Under what specific conditions would this actually improve model performance, and when would it make no difference?
