
Weight Initialization Techniques

Jul 1, 2026 · 8 min read · By Mohammed Vasim
Tags: deep-learning, neural-networks, machine-learning, representation-learning

The first decision you make before training starts is how to initialize the weights. It determines whether gradients vanish or explode. It determines whether neurons break symmetry and learn different features. It determines how many epochs you'll need to converge. Most practitioners accept framework defaults without thinking about them — this post explains what those defaults are and why they work.

Anchor: 2-layer network — 2 inputs → 4 hidden (ReLU) → 1 output (sigmoid).


Why Initialization Matters

All-zeros initialization. If W₁ = 0, every neuron in the hidden layer computes the same output: z = W·x + 0 = 0. Every neuron computes the same gradient. Every weight update is identical. After 100 epochs, all neurons in the hidden layer are still identical — the network has effectively become a 1-neuron network regardless of width.

Symmetry must be broken: if any two neurons are identical at initialization, they remain identical throughout training, making one of them redundant.
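The symmetry argument can be checked numerically. The following is a minimal sketch, not the author's code: it trains the anchor network (2 inputs → 4 ReLU hidden → 1 sigmoid output) from a constant initialization on made-up toy data. The function name `train_constant_init`, the constant 0.5, the data, and the learning rate are all assumptions chosen for illustration. A nonzero constant is used because with all zeros and ReLU the hidden-layer gradients are exactly zero, which hides the effect.

```python
import numpy as np

def train_constant_init(c=0.5, steps=100, lr=0.1, seed=0):
    """Train the 2 -> 4 (ReLU) -> 1 (sigmoid) network starting from constant weights."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(64, 2))                   # toy inputs (assumed data)
    y = (X[:, :1] + X[:, 1:] > 0).astype(float)    # toy labels, shape (64, 1)

    W1 = np.full((2, 4), c)                        # every hidden neuron starts identical
    b1 = np.zeros(4)
    W2 = np.full((4, 1), c)
    b2 = np.zeros(1)

    for _ in range(steps):
        # Forward pass
        z1 = X @ W1 + b1
        a1 = np.maximum(0, z1)                     # ReLU
        p = 1 / (1 + np.exp(-(a1 @ W2 + b2)))      # sigmoid output

        # Backward pass for binary cross-entropy
        d2 = (p - y) / len(X)
        dW2 = a1.T @ d2
        db2 = d2.sum(axis=0)
        d1 = (d2 @ W2.T) * (z1 > 0)                # gradient through ReLU
        dW1 = X.T @ d1
        db1 = d1.sum(axis=0)

        # Plain gradient descent step
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return W1, W2

W1, W2 = train_constant_init()
# Each column of W1 belongs to one hidden neuron; compare every column to the first.
print("hidden neurons still identical:", np.allclose(W1, W1[:, :1]))
print("weights moved away from 0.5:   ", not np.allclose(W1, 0.5))
```

The weights do change during training, but every hidden neuron receives the same update at every step, so the four columns of `W1` never diverge from one another.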

All-large initialization. Weights from N(0, 1) or larger cause exploding gradients (as shown in the previous post). Even with gradient clipping, the learning dynamics are unstable.

All-small initialization. Weights drawn with a standard deviation of 0.01 (written N(0, 0.01) in this post) seem safe but cause a different problem: activations collapse to near-zero across layers. The signal vanishes before reaching the output in deep networks.


Naive Initialization — Activation Collapse

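With zero-mean weights of std σ = 0.01 and ReLU, the pre-activation variance of a layer is n_in × σ² times the mean square of its input, and ReLU keeps about half of that. Each layer therefore scales the activation magnitude by roughly σ × √(n_in / 2).

For the anchor network's first layer (n_in = 2, input std ≈ 1.0), the pre-activation std is 0.01 × √2 ≈ 0.014.

Even at width 100 the per-layer factor is only 0.01 × √50 ≈ 0.07.

After 10 such layers, the activations are about 0.07¹⁰ ≈ 3 × 10⁻¹² of the input scale, which is essentially zero. Gradients shrink by a comparable factor on the way back.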
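To see how quickly a fixed small std collapses the signal at different widths, the per-layer scale factor can be computed directly. This is a rough sketch under the usual simplifying assumptions (zero-mean independent weights, ReLU keeping half of the squared signal); `relu_layer_gain` is a hypothetical helper name, not something from a library.

```python
import math

def relu_layer_gain(sigma, n_in):
    """Approximate per-layer scale factor for weights with std sigma followed by ReLU."""
    return sigma * math.sqrt(n_in / 2)

for n_in in (4, 100, 1000):
    gain = relu_layer_gain(0.01, n_in)
    print(f"n_in={n_in:5d}  gain per layer={gain:.4f}  after 10 layers={gain ** 10:.2e}")

# A gain of 1.0 means the signal is preserved; solving for sigma gives sqrt(2 / n_in).
print(relu_layer_gain(math.sqrt(2 / 100), 100))
```

A gain below 1 compounds geometrically with depth, which is why a std that looks harmless for one layer is fatal for ten.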


Xavier/Glorot Initialization (for Sigmoid/Tanh)

Formula: W ~ N(0, σ²) where σ = √(2 / (n_in + n_out))

For our hidden layer 1 (n_in=2, n_out=4):

σ = √(2 / (2 + 4)) = √(2/6) = √0.333 = 0.577

Derivation intuition: we want the variance of activations to be preserved as information flows forward. For a linear layer with no activation, and with zero-mean independent weights and inputs: Var(z) = n_in × Var(W) × Var(x). To preserve Var(z) = Var(x): Var(W) = 1/n_in. The same argument applied to gradients flowing backward gives Var(W) = 1/n_out. Xavier compromises by putting the arithmetic mean of n_in and n_out in the denominator: Var(W) = 2/(n_in + n_out), which keeps both the forward and the backward signal approximately stable.
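The forward and backward variance claims can be checked empirically on a single linear layer. The sketch below is illustrative only: the layer sizes (300 inputs, 100 outputs), the sample count, and the function name `variance_ratios` are assumptions picked to make fan-in and fan-out clearly different.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 300, 100                        # deliberately unequal fan-in and fan-out

def variance_ratios(var_w, n_samples=2000):
    """Return (forward, backward) output variances for unit-variance inputs."""
    W = rng.normal(0.0, np.sqrt(var_w), size=(n_in, n_out))
    x = rng.normal(size=(n_samples, n_in))    # unit-variance activations
    g = rng.normal(size=(n_samples, n_out))   # unit-variance upstream gradients
    forward = (x @ W).var()                   # expected: n_in  * Var(W)
    backward = (g @ W.T).var()                # expected: n_out * Var(W)
    return forward, backward

for name, var_w in [("1/n_in", 1 / n_in),
                    ("1/n_out", 1 / n_out),
                    ("Xavier 2/(n_in+n_out)", 2 / (n_in + n_out))]:
    f, b = variance_ratios(var_w)
    print(f"{name:24s} forward={f:.2f}  backward={b:.2f}")
```

Choosing 1/n_in fixes the forward pass at the expense of the backward pass, 1/n_out does the opposite, and the Xavier value lands between the two in both directions. When n_in = n_out all three choices coincide.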

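Xavier is a good fit for activations that are roughly linear around zero, such as tanh. It is also the usual choice for sigmoid, even though sigmoid's outputs are centered at 0.5 rather than at zero. The derivation assumes the activation does not zero out any units. Sigmoid and tanh satisfy this: they saturate for large |z|, but they do not set half of their inputs to exactly zero the way ReLU does.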

Uniform variant: W ~ U(−√(6/(n_in+n_out)), +√(6/(n_in+n_out)))

For our layer: U(−√(6/6), +√(6/6)) = U(−1.0, +1.0)
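The uniform bound is chosen so that both variants have the same variance: a uniform distribution on (−a, +a) has variance a²/3, so a = √(6/(n_in+n_out)) gives 2/(n_in+n_out). A quick numerical check for the anchor layer, as a sketch with an arbitrarily chosen sample size:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 2, 4

limit = np.sqrt(6 / (n_in + n_out))          # uniform bound for this layer
target_std = np.sqrt(2 / (n_in + n_out))     # std of the normal variant

# Use many samples so the empirical std is a reliable estimate.
samples = rng.uniform(-limit, limit, size=1_000_000)
print(f"limit={limit:.3f}  target std={target_std:.4f}  empirical std={samples.std():.4f}")
```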


He Initialization (for ReLU)

Formula: W ~ N(0, σ²) where σ = √(2 / n_in)

For layer 1 (n_in=2): σ = √(2/2) = 1.0

For layer 2 (n_in=4): σ = √(2/4) = √0.5 = 0.707

Why ReLU needs larger variance:

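ReLU sets every negative pre-activation to zero. At initialization the pre-activations are symmetric about zero, so on average half of the units output 0 and the layer's output variance is roughly halved:

Var(ReLU(z)) ≈ (1/2) × n_in × Var(W) × Var(x)

Strictly speaking, the quantity that is exactly halved is the mean square E[ReLU(z)²], which is what drives the next layer's pre-activation variance. This post treats it as the variance for simplicity.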

Setting this equal to Var(x) to preserve the signal:

Var(W) = 2/n_in

Xavier's Var(W) = 2/(n_in + n_out) reduces to 1/n_in when n_out = n_in, which is half of what ReLU needs. With Xavier and ReLU, activations steadily collapse across layers because each layer loses half of the signal's mean square. He initialization accounts for this 2× factor.
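The Xavier-versus-He gap under ReLU can be measured by pushing random inputs through a stack of equally wide layers. This is a sketch under assumed settings (width 256, 20 layers, 200 samples, no biases); `rms_by_layer` is a hypothetical helper, and it tracks the root-mean-square activation because that is the quantity the He derivation preserves.

```python
import numpy as np

def rms_by_layer(std_fn, n_layers=20, width=256, n_samples=200, seed=0):
    """Root-mean-square activation after each ReLU layer for a given weight-std rule."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, width))
    history = []
    for _ in range(n_layers):
        W = rng.normal(0.0, std_fn(width, width), size=(width, width))
        x = np.maximum(0, x @ W)                        # linear layer followed by ReLU
        history.append(float(np.sqrt((x ** 2).mean())))
    return history

xavier = rms_by_layer(lambda n_in, n_out: np.sqrt(2 / (n_in + n_out)))
he = rms_by_layer(lambda n_in, n_out: np.sqrt(2 / n_in))

for layer in (1, 5, 10, 20):
    print(f"layer {layer:2d}  Xavier RMS={xavier[layer - 1]:.5f}  He RMS={he[layer - 1]:.3f}")
```

With square layers, Xavier loses a factor of about √2 in RMS per layer, so after 20 layers the signal is on the order of 2⁻¹⁰ of its starting scale, while He stays at order 1 with some random drift from the finite width.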

[Figure: Activation distribution across 4 layers, initialization comparison. Panels: All-Zeros (std = 0 at all layers, symmetry never broken); Naive N(0, 0.01) (activation std collapses from 0.01 toward 0); Xavier with tanh (stable std across all layers); He with ReLU (stable std across all layers). Caption: Xavier and He preserve signal variance through the network; naive and zeros fail.]

[Figure: Activation std by layer, naive init vs He init (ReLU). X-axis: layer depth, 1 to 5. Y-axis: activation std, 0 to about 1. Series: Naive (collapsing), He (stable).]

Practical Rule

Activation | Initialization | Weight std σ, with W ~ N(0, σ²)
Sigmoid / Tanh | Xavier / Glorot | √(2/(n_in+n_out))
ReLU | He | √(2/n_in)
Leaky ReLU | He (modified) | √(2/((1+α²)·n_in))
SELU | LeCun | 1/√n_in
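The rule table translates directly into a small lookup function. The helper below, `init_std`, is purely illustrative (it is not a library API), and the default α = 0.01 for Leaky ReLU is an assumed value.

```python
import math

def init_std(activation, n_in, n_out=None, alpha=0.01):
    """Suggested weight std for a layer, following the rule table (illustrative helper)."""
    if activation in ("sigmoid", "tanh"):
        if n_out is None:
            raise ValueError("Xavier needs both n_in and n_out")
        return math.sqrt(2 / (n_in + n_out))              # Xavier / Glorot
    if activation == "relu":
        return math.sqrt(2 / n_in)                        # He
    if activation == "leaky_relu":
        return math.sqrt(2 / ((1 + alpha ** 2) * n_in))   # He with negative slope alpha
    if activation == "selu":
        return 1 / math.sqrt(n_in)                        # LeCun
    raise ValueError(f"no rule for activation {activation!r}")

print(f"tanh, 2 -> 4 : {init_std('tanh', 2, 4):.4f}")
print(f"relu, n_in=2 : {init_std('relu', 2):.4f}")
print(f"relu, n_in=4 : {init_std('relu', 4):.4f}")
```

The three printed values correspond to the worked numbers earlier in the post for the anchor network.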

Code

python
import numpy as np

np.random.seed(42)
n_in, n_out = 2, 4

# Xavier initialization
xavier_std = np.sqrt(2 / (n_in + n_out))
W_xavier   = np.random.randn(n_in, n_out) * xavier_std
print(f"Xavier std = {xavier_std:.4f}")
print("Xavier W1:")
print(np.round(W_xavier, 4))

# He initialization
he_std = np.sqrt(2 / n_in)
W_he   = np.random.randn(n_in, n_out) * he_std
print(f"\nHe std (n_in=2) = {he_std:.4f}")
print("He W1:")
print(np.round(W_he, 4))

# Variance preservation over 5 layers
def sim_activation(init_fn, n_layers=5, n=100, seed=42):
    np.random.seed(seed)
    x = np.random.randn(1000, n)
    for i in range(n_layers):
        W = init_fn(n, n)
        x = np.maximum(0, x @ W)  # ReLU
        print(f"  Layer {i+1} std: {x.std():.4f}")

print("\nNaive init N(0,0.01) — ReLU:")
sim_activation(lambda i, o: np.random.randn(i, o) * 0.01)

print("\nHe init — ReLU:")
sim_activation(lambda i, o: np.random.randn(i, o) * np.sqrt(2 / i))
text
Xavier std = 0.5774
Xavier W1:
[[ 0.2868 -0.0798  0.3739  0.8793]
 [-0.1352 -0.1352  0.9118  0.4431]]

He std (n_in=2) = 1.0000
He W1:
[[-0.4695  0.5426 -0.4634 -0.4657]
 [ 0.242  -1.9133 -1.7249 -0.5623]]

Naive init N(0,0.01) — ReLU:
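  Layer 1 std: ≈ 0.058
  Layer 2 std: ≈ 0.004
  Layer 3 std: ≈ 0.0003
  Layer 4 std: ≈ 0.0000
  Layer 5 std: ≈ 0.0000

He init — ReLU:
  Layer 1 std: 0.7963
  Layer 2 std: 0.7857
  Layer 3 std: 0.7826
  Layer 4 std: 0.8026
  Layer 5 std: 0.7794

Naive init: with n = 100 and weight std 0.01, each ReLU layer multiplies the activation scale by about 0.01 × √(100/2) ≈ 0.07, so the std falls from about 0.058 at layer 1 to roughly 10⁻⁶ by layer 5, which prints as 0.0000 at four decimals. The naive figures shown are rounded estimates from this recurrence; exact digits depend on the random draw. He init: activation std stays near 0.8 at every layer. The 2× factor in the formula compensates for ReLU zeroing half of the pre-activations and keeps the mean square of the activations near 1. The std sits a little below 1 (about √(1 − 1/π) ≈ 0.83 in expectation) because ReLU outputs have a nonzero mean.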


Where this builds from: Vanishing gradient (section 3, post 01) and exploding gradient (previous post) both arise from weight initialization that is too small or too large. Xavier and He initialization choose the initial scale to keep gradients in a healthy range.

Where this leads: Batch normalization is an alternative approach: rather than relying on carefully chosen initial weights, it normalizes each layer's activations over the mini-batch during the forward pass. With batch norm, the initialization matters less. Dropout (next post) addresses a different problem: overfitting, not gradient stability.


Honest Limitations

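He initialization assumes that ReLU zeroes half of the pre-activations, which holds when they are symmetric about zero, as they are at initialization with zero biases. During training that fraction drifts. If far more than half of the units in a layer output zero (which can happen after poor training leaves many neurons dead), the 2× correction is no longer enough and the signal shrinks. Leaky ReLU changes the correction itself: negative inputs pass through with slope α instead of being zeroed, so the surviving fraction of the squared signal is (1 + α²)/2 rather than 1/2, and the modified He formula divides by (1 + α²) accordingly.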
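The (1 + α²)/2 factor behind the Leaky ReLU formula can be verified with a quick Monte Carlo estimate. This is a sketch assuming zero-mean, unit-variance Gaussian pre-activations; the α values are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=1_000_000)                 # zero-mean, unit-variance pre-activations

for alpha in (0.0, 0.01, 0.2, 1.0):            # alpha = 0 is plain ReLU, alpha = 1 is linear
    out = np.where(z > 0, z, alpha * z)        # Leaky ReLU
    measured = (out ** 2).mean()               # fraction of the squared signal that survives
    predicted = (1 + alpha ** 2) / 2
    print(f"alpha={alpha:<4}  measured={measured:.3f}  predicted={predicted:.3f}")
```

Setting Var(W) = 2/((1+α²)·n_in) cancels this factor, which recovers the plain He rule at α = 0 and the 1/n_in rule for a purely linear layer at α = 1.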

Xavier initialization assumes weights are independent and inputs have unit variance. These assumptions hold at initialization but break down as training progresses. Batch normalization is designed precisely to restore this condition at every layer during training, making initialization less critical.

All initialization schemes use random sampling. With unlucky draws (rare but possible), the initial weights can still produce poor gradients. Using a fixed random seed for reproducibility is recommended during debugging — np.random.seed(42) or PyTorch's torch.manual_seed(42).
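One way to make initial weights reproducible without touching global state is to pass an explicit seed to a NumPy `Generator`. This is only a sketch; `he_weights` is a hypothetical helper name, and the seeds are arbitrary.

```python
import numpy as np

def he_weights(n_in, n_out, seed):
    """Draw He-initialized weights from a generator seeded locally."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2 / n_in), size=(n_in, n_out))

a = he_weights(2, 4, seed=42)
b = he_weights(2, 4, seed=42)   # same seed: identical draw
c = he_weights(2, 4, seed=43)   # different seed: different draw

print("same seed reproduces weights:", np.array_equal(a, b))
print("different seed differs:      ", not np.array_equal(a, c))
```

Rerunning a failing experiment with a few different seeds is a cheap way to tell an unlucky draw apart from a real problem with the initialization scheme.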


Test Your Understanding

  1. Consider a 4-layer network with n=100 neurons per layer and tanh activations. Using Xavier initialization, compute the std for each layer's weight matrix. All layers have 100 inputs and 100 outputs. Now compute the expected activation std at layer 4, given layer 1 has std=1.0 and tanh preserves variance well. Would this change if n=1000?

  2. He initialization uses Var(W) = 2/n_in. Verify the derivation: if Var(z) = n_in × Var(W) × Var(x), and ReLU kills half the neurons (effective Var(ReLU(z)) = 0.5 × Var(z)), what must Var(W) be set to so that Var(ReLU(z)) = Var(x)?

  3. All-zeros initialization fails to break symmetry. But what about initializing all weights to the same nonzero constant, e.g., all weights = 0.01? Show that this also fails to break symmetry by tracing the forward pass through a 2-neuron hidden layer with two inputs x=[1, 2] and W=[[0.01, 0.01], [0.01, 0.01]].

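  4. PyTorch's nn.Linear does not apply He initialization tuned for ReLU by default: its default is kaiming_uniform_ with a = √5, which works out to U(−1/√n_in, +1/√n_in) regardless of the activation that follows. Suppose you apply He (Kaiming normal) initialization explicitly to a layer with n_in=512 followed by ReLU. What is the standard deviation of the initial weights? If you accidentally use Xavier instead, with n_out=512, the std would be √(2/(512+512)) ≈ 0.044. What would happen to the activation std after 20 ReLU layers with Xavier vs He initialization?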

  5. Batch normalization normalizes activations after each layer, making initialization less critical. However, LSTM networks don't typically use batch normalization (due to recurrent dependencies). For an LSTM with hidden size 256, which initialization would you use for the weight matrices, and why?
