
Swish / SiLU Activation Function

Jul 3, 2026 · 8 min read · By Mohammed Vasim
deep-learning · neural-networks · machine-learning · representation-learning

Swish was found by accident — or rather, by a neural architecture search algorithm. Google Brain researchers asked an automated search to find the best activation function for deep networks, and the winner was z · σ(z): the input multiplied by its own sigmoid. That self-referential structure is the key insight. The input gates itself. A large positive z opens the gate wide (σ(z) → 1). A large negative z closes it (σ(z) → 0). A small negative z — say z = −0.5 — leaves the gate partially open (σ(−0.5) ≈ 0.38), producing output −0.19 instead of ReLU's hard zero.

Anchor: z ∈ {−3, −1, 0, 0.5, 1, 3} from the churn network hidden layer.


The Self-Gating Mechanism

The general Swish formula is:

Swish(z) = z · σ(βz)

where σ is the sigmoid function and β controls the gate's sharpness. When β = 1, this is called SiLU (Sigmoid Linear Unit):

SiLU(z) = z · σ(z)

This is the version used in LLaMA (inside SwiGLU) and EfficientNet; MobileNetV3 uses hard-swish, a piecewise-linear approximation of it. The self-gating idea: instead of an external threshold deciding which values pass (as ReLU does with its 0-cutoff), the value decides for itself through its own sigmoid. Strongly positive values pass nearly unchanged. Strongly negative values are suppressed. Moderately negative values pass through damped rather than zeroed.

The critical difference from ReLU shows at z ≈ −1.28, where SiLU reaches its minimum of approximately −0.278. Below that point the output rises back toward zero as z → −∞ (because σ(z) → 0 faster than |z| grows). ReLU is flat at 0 for all z < 0. GELU has a similar dip, but it is shallower (about −0.17 near z ≈ −0.75) and returns to zero faster. SiLU dips to about −0.278 and then returns: a smooth curve with a single minimum.
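The location of the minimum can be checked numerically by finding where the derivative changes sign. The following is a minimal sketch using only the standard library; the helper name `silu_argmin`, the bracket [−3, 0], and the tolerance are illustrative choices, not part of the original post.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def silu(z):
    return z * sigmoid(z)

def silu_grad(z):
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))

def silu_argmin(lo=-3.0, hi=0.0, tol=1e-10):
    """Bisection on SiLU'(z), which is negative at lo and positive at hi."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if silu_grad(mid) < 0.0:
            lo = mid   # still left of the minimum
        else:
            hi = mid   # at or right of the minimum
    return 0.5 * (lo + hi)

z_min = silu_argmin()
print(f"argmin ≈ {z_min:.4f}, SiLU(argmin) ≈ {silu(z_min):.4f}")
```

Setting SiLU'(z) = 0 gives z · σ(z) = z + 1 at the minimum, so the minimum value is just the argmin plus one, which is a quick sanity check on the printed pair.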

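| z | σ(z) | SiLU(z) = z·σ(z) | SiLU'(z) |
|---|---|---|---|
| −3 | 0.0474 | −0.1423 | −0.0881 |
| −1 | 0.2689 | −0.2689 | 0.0723 |
| 0 | 0.5000 | 0.0000 | 0.5000 |
| 0.5 | 0.6225 | 0.3112 | 0.7400 |
| 1 | 0.7311 | 0.7311 | 0.9277 |
| 3 | 0.9526 | 2.8577 | 1.0881 |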
[Figure: SiLU(z) = z · σ(z) at the anchor points, with the minimum ≈ −0.278 at z ≈ −1.28 marked.]

Learnable β vs Fixed β

When β = 1 (SiLU), every neuron uses the same gate sharpness. With learnable β, each neuron tunes how sharply it gates its input during training. A high β makes the function more like ReLU (sharper cutoff): as β → ∞, σ(βz) approaches a step function and Swish approaches ReLU. A low β makes it more linear (gate barely activates): as β → 0, σ(βz) → 0.5 and Swish approaches z/2.
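The two limiting cases can be checked numerically. This is a small sketch using only the standard library; the specific values β = 0.01 and β = 50 and the helper `relu` are illustrative assumptions, not taken from the original post.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(z, beta=1.0):
    return z * sigmoid(beta * z)

def relu(z):
    return max(0.0, z)

z_vals = [-3.0, -1.0, 0.0, 0.5, 1.0, 3.0]

# Largest gap between Swish and each limiting function over the anchor points
gap_linear = max(abs(swish(z, 0.01) - z / 2) for z in z_vals)
gap_relu = max(abs(swish(z, 50.0) - relu(z)) for z in z_vals)

print(f"beta=0.01: max |Swish(z) - z/2|     = {gap_linear:.4f}")
print(f"beta=50:   max |Swish(z) - ReLU(z)| = {gap_relu:.2e}")
```

Both gaps shrink as β moves further toward its limit, which is why a learnable β lets a neuron interpolate between a roughly linear unit and a ReLU-like one.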

[Figure: Swish(z) = z · σ(βz) for β ∈ {0.5, 1, 2} over z ∈ [−3, 3]. High β gives a sharper gate (closer to ReLU), low β gives a softer gate (more linear), and β = 1 is SiLU, the most common choice.]

Derivative

SiLU'(z) = σ(z) · (1 + z · (1 − σ(z)))

At z = −1: σ(−1) = 0.2689, so SiLU'(−1) = 0.2689 · (1 + (−1)(1 − 0.2689)) = 0.2689 · (1 − 0.7311) = 0.2689 · 0.2689 ≈ 0.0723.

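ReLU'(−1) = 0. GELU'(−1) ≈ −0.083. SiLU'(−1) ≈ 0.072. GELU and SiLU both have non-zero gradients at this point, while ReLU's gradient is exactly zero everywhere in the negative region. This is the dead-neuron difference in numbers. (GELU's slope is already negative at z = −1 because its own minimum sits near z ≈ −0.75.)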

The derivative turns negative for z < −1.28 (about −0.088 at z = −3), which means the SiLU function is not monotonic. This is a known property of the function rather than a defect: the non-monotonicity is usually credited with giving the model additional expressive power.
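A quick way to gain confidence in the closed-form derivative is to compare it against a central finite difference. The sketch below uses only the standard library; `central_diff` and the step size `h` are illustrative choices, not part of the original post.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def silu(z):
    return z * sigmoid(z)

def silu_grad(z):
    # Closed form: sigma(z) * (1 + z * (1 - sigma(z)))
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))

def central_diff(f, z, h=1e-5):
    # Symmetric difference quotient, accurate to O(h^2)
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in (-3.0, -1.28, -1.0, 0.0, 1.0):
    analytic = silu_grad(z)
    numeric = central_diff(silu, z)
    print(f"z={z:6.2f}  analytic={analytic:8.4f}  numeric={numeric:8.4f}")
```

The two columns should agree to the printed precision, and the sign change between z = −3 and z = −1 marks the non-monotonic region.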


Comparison: ReLU vs GELU vs SiLU

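| z | ReLU | GELU | SiLU |
|---|---|---|---|
| −3 | 0 | −0.0040 | −0.1423 |
| −1 | 0 | −0.1587 | −0.2689 |
| 0 | 0 | 0.0000 | 0.0000 |
| 0.5 | 0.5 | 0.3457 | 0.3112 |
| 1 | 1 | 0.8413 | 0.7311 |
| 3 | 3 | 2.9960 | 2.8577 |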

SiLU and GELU are both near-zero for large negatives but diverge in the intermediate zone: at z = −1, SiLU outputs −0.269 while GELU outputs −0.159. SiLU is more permissive for moderately negative inputs.
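The comparison can be reproduced with a few lines of standard-library Python. This sketch assumes the exact GELU definition z · Φ(z), with Φ the standard normal CDF computed via `math.erf`; the widely used tanh approximation of GELU would give slightly different digits.

```python
import math

def relu(z):
    return max(0.0, z)

def gelu(z):
    # Exact GELU: z * Phi(z), where Phi is the standard normal CDF
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def silu(z):
    # z * sigmoid(z), written as a single fraction
    return z / (1.0 + math.exp(-z))

print(f"{'z':>5} | {'ReLU':>7} | {'GELU':>7} | {'SiLU':>7}")
for z in (-3.0, -1.0, 0.0, 0.5, 1.0, 3.0):
    print(f"{z:5.1f} | {relu(z):7.4f} | {gelu(z):7.4f} | {silu(z):7.4f}")
```

At every negative anchor point the ordering is SiLU < GELU < ReLU = 0, which is what "more permissive" means here.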

[Figure: ReLU vs GELU vs SiLU over z ∈ [−3, 3], with the negative region z < 0 highlighted. At z = −3: ReLU = 0, GELU ≈ −0.004, SiLU ≈ −0.142.]

Code

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied elementwise."""
    return 1 / (1 + np.exp(-z))

def swish(z, beta=1.0):
    """Swish: z * sigmoid(beta * z). beta=1 gives SiLU."""
    return z * sigmoid(beta * z)

def swish_grad(z, beta=1.0):
    """Derivative of z * sigmoid(beta * z) with respect to z."""
    s = sigmoid(beta * z)
    return s + beta * z * s * (1 - s)

z_vals = np.array([-3., -1., 0., 0.5, 1., 3.])
print(f"{'z':>5} | {'σ(z)':>7} | {'SiLU(z)':>8} | {'SiLU_grad':>9}")
for z in z_vals:
    s = sigmoid(z)
    print(f"{z:5.1f} | {s:7.4f} | {swish(z):8.4f} | {swish_grad(z):9.4f}")

# .tolist() converts to plain Python floats so the lists print the same across NumPy versions
for beta in [0.5, 1.0, 2.0]:
    print(f"\nbeta={beta}: {np.round(swish(z_vals, beta), 4).tolist()}")
```

```text
    z |    σ(z) |  SiLU(z) | SiLU_grad
 -3.0 |  0.0474 |  -0.1423 |   -0.0881
 -1.0 |  0.2689 |  -0.2689 |    0.0723
  0.0 |  0.5000 |   0.0000 |    0.5000
  0.5 |  0.6225 |   0.3112 |    0.7400
  1.0 |  0.7311 |   0.7311 |    0.9277
  3.0 |  0.9526 |   2.8577 |    1.0881

beta=0.5: [-0.5473, -0.3775, 0.0, 0.2811, 0.6225, 2.4527]

beta=1.0: [-0.1423, -0.2689, 0.0, 0.3112, 0.7311, 2.8577]

beta=2.0: [-0.0074, -0.1192, 0.0, 0.3655, 0.8808, 2.9926]
```

Notice that the gradient at z = −3 is negative (−0.0881): that is the non-monotonic region below the minimum at z ≈ −1.28. For a neuron that consistently receives z < −1.28, the local slope has the opposite sign from the rest of the curve, so an update that raises z lowers the activation, which can push weights in a counterintuitive direction. Normalization layers keep most pre-activations clustered around 0, so only a minority of inputs land in this region.
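To put a rough number on how often an input lands in the negative-slope region, one can assume the pre-activations are standard normal. That is an idealization introduced here for illustration (real post-normalization distributions are shifted and scaled by learned parameters), and `normal_cdf` and `Z_MIN` are names made up for this sketch.

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

Z_MIN = -1.28  # approximate location of the SiLU minimum

# Fraction of N(0, 1) samples that fall left of the minimum
frac = normal_cdf(Z_MIN)
print(f"P(z < {Z_MIN}) under N(0, 1): {frac:.3f}")
```

Under this assumption the fraction is a minority of inputs but not a vanishing one, so it is worth checking the actual pre-activation histogram of a trained model rather than relying on the idealized figure.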


SiLU is a direct descendant of sigmoid (02-sigmoid.md) — it is the sigmoid used as a multiplicative gate on its own input rather than as a standalone activation. It differs from GELU (09-gelu.md) in using a sigmoid gate instead of a Gaussian CDF gate; both produce smooth near-zero behavior for negative inputs but GELU decays faster. SiLU is the building block of SwiGLU (11-swiglu.md), where a second learned projection replaces the self-gate: instead of z gating itself, one linear branch gates another.
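The step from self-gating to SwiGLU-style gating can be sketched in a few lines. This is a minimal illustration with plain Python lists, assuming two separate projections named `W_gate` and `W_up` (hypothetical names and toy weights); it shows only the gating step and omits the output projection that usually follows in a full feed-forward block.

```python
import math

def silu(z):
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu(x, W_gate, W_up):
    """SiLU of the gate branch, multiplied elementwise by the up branch."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    return [g * u for g, u in zip(gate, up)]

# Hypothetical 2-input, 3-hidden-unit weights, chosen only for illustration
x = [1.0, -2.0]
W_gate = [[0.5, 0.1], [-0.3, 0.8], [1.0, 0.0]]
W_up = [[0.2, -0.4], [0.7, 0.3], [-0.5, 0.5]]

out = swiglu(x, W_gate, W_up)
print(out)
```

Passing the same matrix for `W_gate` and `W_up` does not recover plain SiLU exactly (it yields z · SiLU(z)), which underlines that the gate and the gated value are separate learned quantities here.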

Honest Limitations

The SiLU activation itself costs roughly 2–3× more than ReLU to evaluate (a rough figure that varies with hardware and kernel implementation): it requires a sigmoid evaluation (which involves exp()) followed by a multiplication, where ReLU needs a single comparison. On hardware where memory bandwidth is the bottleneck this is negligible, but for edge inference with strict latency budgets, profile before defaulting to SiLU over ReLU.

With β = 1 fixed, every neuron uses the same gate sharpness regardless of what the data requires. If β is made learnable, each of the d_ff neurons in a Transformer FFN block learns its own β. For a model with BERT-base dimensions (d_ff = 3072, 12 layers) that is 3072 extra parameters per layer, or 36,864 in total. Small overhead at scale, but it complicates weight sharing and distillation.

The slight negative output region (minimum ≈ −0.278 at z ≈ −1.28) can matter when SiLU replaces ReLU in an existing architecture. Batch normalization does not itself require non-negative inputs, but if the downstream layers or hyperparameters were tuned on ReLU's non-negative activations, the activation statistics they see will shift. In that case, re-tune them or add layer normalization between the two layers.


Test Your Understanding

  1. Compute SiLU(−1.28) using the formula z · σ(z). What is σ(−1.28)? Why is this the approximate minimum of the SiLU curve?

  2. At β = 2, how does Swish(−1) compare to SiLU(−1)? Use σ(−2) ≈ 0.119. Does higher β make the function more or less permissive for negative inputs?

  3. A model replaces ReLU with SiLU in 50 layers. Training loss decreases faster for the first 10 epochs but then plateaus earlier. Propose one explanation involving the non-monotonic region of SiLU.

  4. SiLU'(z) = σ(z)(1 + z(1 − σ(z))). At z = 0, compute the derivative. Now compare it with ReLU (gradient undefined at 0) and GELU (gradient also 0.5 at 0), and explain why SiLU and GELU agree at this point.

  5. LLaMA 2 uses SwiGLU, which internally uses Swish. If you distill a LLaMA 2 model into a smaller student network that uses ReLU, what specific gradient-flow property of SiLU does the student lose, and in what layer type does this matter most?
