
SwiGLU — Gated Linear Units

Jul 3, 2026 · 7 min read · By Mohammed Vasim
deep-learning, neural-networks, machine-learning, representation-learning

The standard Transformer FFN block applies one linear projection, one activation, then a second linear projection. SwiGLU changes the structure: it runs two parallel linear projections, applies Swish to one of them, and multiplies the result elementwise with the other. The Swish branch is the gate: it decides how much of the other branch (the value) to let through. LLaMA 2, LLaMA 3, Mistral, and PaLM all use SwiGLU instead of the original ReLU or GELU FFN; Gemma uses the closely related GeGLU, which swaps the Swish gate for a GELU gate. The reason is empirical: in published comparisons of language modeling perplexity, SwiGLU has outperformed both ReLU and GELU FFNs at the same parameter budget.

Anchor: input vector x = [0.5, 0.8, −0.3] (3-dimensional), projecting into 4-dimensional hidden space.


The GLU Family

Gated Linear Units (Dauphin et al., 2017) started with this idea:

GLU(x) = σ(xW + b) ⊙ (xV + c)

Two separate linear projections of x. The first goes through sigmoid — this is the gate, bounded between 0 and 1. The second is the value — a raw linear transformation. Their elementwise product ⊙ lets the gate control how much of each value dimension passes through.

At each output dimension j: gate_j ∈ (0, 1) scales value_j. If gate_j ≈ 0, that dimension is suppressed. If gate_j ≈ 1, it passes almost unchanged.
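A minimal NumPy sketch of the GLU formula above, under stated assumptions: the weights are random placeholders, the seed is arbitrary, and the 3→4 shapes are chosen only to match the anchor. `sigmoid` and `glu` are illustrative helper names, not a library API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, W, V, b, c):
    gate = sigmoid(x @ W + b)   # every entry lies strictly between 0 and 1
    value = x @ V + c           # raw linear transformation, no activation
    return gate * value, gate, value

rng = np.random.default_rng(0)  # arbitrary seed, not taken from the article
x = np.array([0.5, 0.8, -0.3])
W = rng.normal(size=(3, 4))     # placeholder gate weights
V = rng.normal(size=(3, 4))     # placeholder value weights
b = np.zeros(4)
c = np.zeros(4)

out, gate, value = glu(x, W, V, b, c)
print("gate: ", gate.round(4))
print("value:", value.round(4))
print("out:  ", out.round(4))
```

Because each gate entry is below 1, every output entry is smaller in magnitude than the matching value entry: a sigmoid gate can only attenuate, never amplify.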

The sigmoid gate has a clean interpretation: each entry is a soft on/off switch between 0 and 1. But sigmoid saturates, so its gradient vanishes as the output approaches 0 or 1. Swish(z) = z · σ(z) does not saturate for positive inputs (its slope approaches 1 as z grows), and it keeps a usable gradient for moderately negative inputs. SwiGLU replaces the sigmoid with Swish:

SwiGLU(x) = Swish(xW + b) ⊙ (xV + c)
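The saturation argument can be checked numerically. The sketch below compares the analytic derivatives of sigmoid and Swish at a few sample points; the points are picked only for illustration, and the helper names are not from any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def swish_grad(z):
    # d/dz [z * sigmoid(z)] = sigmoid(z) + z * sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s + z * s * (1.0 - s)

zs = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])  # illustrative sample points
for z, gs, gw in zip(zs, sigmoid_grad(zs), swish_grad(zs)):
    print(f"z = {z:+.1f}   sigmoid' = {gs:.4f}   swish' = {gw:+.4f}")
```

At z = 6 the sigmoid slope is close to zero while the Swish slope is close to 1. For very negative inputs both slopes shrink toward zero, so the advantage is on the positive side and in the moderately negative range.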


Standard FFN vs SwiGLU FFN

The standard Transformer FFN:

FFN(x) = ReLU(xW₁ + b₁) · W₂ + b₂

One input projection (d → d_ff), one activation, one output projection (d_ff → d).

The SwiGLU FFN:

FFN(x) = (Swish(xW + b) ⊙ (xV + c)) · W₂

Two input projections (both d → d_ff), an elementwise gate, then one output projection (d_ff → d).
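To make the structural difference concrete, here is a hedged side-by-side sketch of both blocks on a small batch. The sizes (d = 8, d_ff = 32, 5 tokens), the seed, and the random weights are assumptions for illustration only.

```python
import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))

def standard_ffn(x, W1, b1, W2, b2):
    hidden = np.maximum(x @ W1 + b1, 0.0)    # ReLU on a single projection
    return hidden @ W2 + b2

def swiglu_ffn(x, W, V, W2, b, c):
    hidden = swish(x @ W + b) * (x @ V + c)  # gate branch times value branch
    return hidden @ W2

d, d_ff, n_tokens = 8, 32, 5                 # toy sizes, not from the article
rng = np.random.default_rng(0)
x = rng.normal(size=(n_tokens, d))

# Standard FFN: two weight matrices.
W1 = rng.normal(size=(d, d_ff)) * 0.1
W2_std = rng.normal(size=(d_ff, d)) * 0.1
y_std = standard_ffn(x, W1, np.zeros(d_ff), W2_std, np.zeros(d))

# SwiGLU FFN: three weight matrices.
W = rng.normal(size=(d, d_ff)) * 0.1
V = rng.normal(size=(d, d_ff)) * 0.1
W2_glu = rng.normal(size=(d_ff, d)) * 0.1
y_glu = swiglu_ffn(x, W, V, W2_glu, np.zeros(d_ff), np.zeros(d_ff))

print(y_std.shape, y_glu.shape)
```

Both blocks map a (tokens, d) input back to (tokens, d); the only change is what happens inside the hidden layer.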

[Figure: side-by-side block diagrams. Standard FFN: x (input) → xW₁ + b₁ → ReLU / GELU → · W₂ + b₂ → output. SwiGLU FFN: x (input) → gate branch Swish(xW + b) and value branch xV + c → ⊙ (gate × value) → · W₂ → output.]

Trace Through Anchor

x = [0.5, 0.8, −0.3]

With random weights (seed=42, scale 0.1):

Step 1 — Gate branch (xW + b): The 3→4 projection produces a 4-dimensional pre-activation vector. After applying Swish element-by-element, this becomes the gate.

Step 2 — Swish applied to gate branch: Each dimension of xW+b passes through Swish(z) = z · σ(z).

Step 3 — Value branch (xV + c): A separate 3→4 projection, no activation applied.

Step 4 — Elementwise product: gate ⊙ value. Each gate dimension independently scales the corresponding value dimension.

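| Step | Operation           | Formula       | Result (4-dim) |
|------|---------------------|---------------|----------------|
| 1    | Gate pre-activation | xW + b        | computed below |
| 2    | Swish gate          | Swish(xW + b) | gate vector    |
| 3    | Value branch        | xV + c        | value vector   |
| 4    | Elementwise product | gate ⊙ value  | hidden         |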
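The four steps can also be traced with weights simple enough to check by hand. This sketch does not use the seed-42 weights: `W` and `V` are hypothetical hand-picked matrices and the biases are zero, so each intermediate vector is easy to verify.

```python
import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))

x = np.array([0.5, 0.8, -0.3])

# Hypothetical hand-picked weights (NOT the article's seed-42 weights).
W = np.array([[1.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
V = 2.0 * W
b = np.zeros(4)
c = np.zeros(4)

pre_gate = x @ W + b      # Step 1: copies x, then appends sum(x) -> [0.5, 0.8, -0.3, 1.0]
gate = swish(pre_gate)    # Step 2: z * sigmoid(z), elementwise
value = x @ V + c         # Step 3: twice the pre-activation -> [1.0, 1.6, -0.6, 2.0]
hidden = gate * value     # Step 4: elementwise product

for name, vec in [("pre_gate", pre_gate), ("gate", gate),
                  ("value", value), ("hidden", hidden)]:
    print(f"{name:9s}", vec.round(4))
```

The third dimension is worth a look: the gate and the value are both negative there, so their product comes out positive.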

Parameter Count

Standard FFN with hidden dimension d_ff = 4d:

  • W₁: d × 4d parameters
  • W₂: 4d × d parameters
  • Total: 8d²

SwiGLU with the same d_ff = 4d:

  • W: d × 4d, V: d × 4d, W₂: 4d × d
  • Total: 12d² — 50% more

LLaMA solves this by shrinking d_ff from 4d to 8d/3 ≈ 2.67d:

  • W: d × (8d/3), V: d × (8d/3), W₂: (8d/3) × d
  • Total: 3 × d × (8d/3) = 8d²

Same parameter budget, three matrices instead of two, better empirical performance. LLaMA-2 7B uses d = 4096 and d_ff = 11008: 4096 × 8/3 ≈ 10923, rounded up to the next multiple of 256.
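The arithmetic above can be reproduced in a few lines. The function names are illustrative, and the round-up-to-a-multiple-of-256 step is an assumption drawn from the public LLaMA reference code rather than from this article's formulas.

```python
def standard_ffn_params(d, d_ff):
    # W1 (d x d_ff) + W2 (d_ff x d), biases ignored
    return 2 * d * d_ff

def swiglu_ffn_params(d, d_ff):
    # W (d x d_ff) + V (d x d_ff) + W2 (d_ff x d), biases ignored
    return 3 * d * d_ff

def llama_style_hidden_dim(d, multiple_of=256):
    # Assumed rounding rule: take 2/3 of 4d, then round up to a multiple of `multiple_of`.
    hidden = int(2 * (4 * d) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

d = 4096
print("standard, d_ff = 4d:", standard_ffn_params(d, 4 * d))
print("SwiGLU,   d_ff = 4d:", swiglu_ffn_params(d, 4 * d))

d_ff = llama_style_hidden_dim(d)
print("LLaMA-style d_ff:   ", d_ff)
print("SwiGLU at that d_ff:", swiglu_ffn_params(d, d_ff))
```

The rounded-up d_ff makes the SwiGLU block marginally larger than 8d², so "same budget" holds approximately rather than exactly.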


Code

```python
import numpy as np

def swish(z):
    # Swish / SiLU: z * sigmoid(z), written as z / (1 + exp(-z))
    return z / (1 + np.exp(-z))

def swiglu_ffn(x, W, V, W2, b, c):
    gate = swish(x @ W + b)   # gate branch: projection followed by Swish
    value = x @ V + c         # value branch: projection, no activation
    hidden = gate * value     # elementwise gating
    return hidden @ W2        # output projection back to the input dimension

np.random.seed(42)
x = np.array([0.5, 0.8, -0.3])
d_in, d_hidden = 3, 4
W = np.random.randn(d_in, d_hidden) * 0.1    # gate weights
V = np.random.randn(d_in, d_hidden) * 0.1    # value weights
W2 = np.random.randn(d_hidden, d_in) * 0.1   # output weights
b = np.zeros(d_hidden)
c = np.zeros(d_hidden)

gate = swish(x @ W + b)
value = x @ V + c
print("Gate (Swish branch):", gate.round(4))
print("Value branch:       ", value.round(4))
print("Elementwise product:", (gate * value).round(4))
print("FFN output:         ", swiglu_ffn(x, W, V, W2, b, c).round(4))
```
```text
Gate (Swish branch): [ 0.0102 -0.0205  0.0937  0.0815]
Value branch:        [-0.1129 -0.0638 -0.1609 -0.0984]
Elementwise product: [-0.0012  0.0013 -0.0151 -0.008 ]
FFN output:          [ 0.0019 -0.0035  0.0011]
```

At initialization the outputs are near zero (small random weights × small input). After training, the gate values will be driven toward useful patterns — some dimensions fully open, others nearly closed, the rest modulating the value signal in between.


SwiGLU is built on SiLU / Swish (10-swish-silu.md): the gate branch is exactly a Swish activation applied to a linear projection. The gating structure itself comes from GLU (Dauphin et al., 2017), which used sigmoid gates. Understanding SwiGLU is the entry point to reading LLaMA architecture papers, where every FFN block in all 32 layers (7B) or 80 layers (70B) uses this exact structure. The attention sublayer alongside it uses RoPE embeddings and, in the 70B and LLaMA 3 models, grouped-query attention. Neither of those involves SwiGLU, but the FFN sublayer does, every time.

Honest Limitations

SwiGLU uses three weight matrices where a standard FFN uses two. Before the 2/3 scaling trick is applied, that is 50% more parameters — and the scaling reduces hidden dimension proportionally, which may hurt performance on tasks requiring very wide representations. The gains from SwiGLU are most consistent at >1B parameter scale; on small models (< 100M parameters) GELU FFN can match or outperform it.

The Swish gate output is not bounded to (0, 1) as sigmoid is. It can exceed 1 for positive inputs (Swish(3) = 3 · σ(3) ≈ 2.86) and dips slightly below zero for negative inputs. This makes the gate harder to interpret: a gate value of 2 does not mean "fully open," it means "amplify the value by 2." If interpretability of which features are gated is important, GLU with sigmoid gates is cleaner.
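A quick numerical check of the gate's range, as a sketch: the grid bounds and resolution are arbitrary choices for illustration.

```python
import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))

print(f"Swish(3) = {swish(3.0):.4f}")

# Scan a dense grid to locate the most negative gate value.
z = np.linspace(-10.0, 10.0, 200001)
s = swish(z)
i = np.argmin(s)
print(f"min Swish on grid = {s[i]:.4f} at z = {z[i]:.4f}")
```

The minimum sits at roughly −0.28 near z ≈ −1.28, so a Swish gate can flip the sign of a value, but only with a small magnitude.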

There is no universally agreed explanation for why SwiGLU outperforms GELU FFN. Shazeer (2020), the paper that introduced SwiGLU, reports the empirical result without offering a mechanism, and the explanation is still debated. Attributing the gain to any single property of Swish vs GELU is overconfident.


Test Your Understanding

  1. In GLU, the gate is σ(xW+b) — a sigmoid output bounded to (0,1). In SwiGLU, the gate is Swish(xW+b), which can be negative. What happens to the value branch when the gate is negative? Is this useful or harmful?

  2. A LLaMA-2 7B model has d=4096. Compute 8d/3 rounded to the nearest integer, and compare it with the d_ff = 11008 the model actually uses. How many total parameters are in one SwiGLU FFN block (W + V + W₂, ignoring biases) at d_ff = 11008?

  3. You initialize W and V with the same weights. What does the elementwise product gate ⊙ value reduce to? Is this ever intentional?

  4. Standard FFN: FFN(x) = GELU(xW₁) · W₂. SwiGLU FFN: FFN(x) = (Swish(xW) ⊙ xV) · W₂. Both have the same parameter budget after scaling. During backprop, through how many matrix multiplications does the gradient flow in each case for a given output dimension? Which has a more complex gradient path?

  5. A researcher replaces SwiGLU with GeGLU (same structure but GELU gate instead of Swish gate) and reports identical perplexity. What does this suggest about where the performance gain of gated FFNs comes from?
