Vanishing Gradient Problem
You train a 5-layer sigmoid network on a classification task and watch the loss barely move for the first hundred epochs. The model isn't broken — gradient desc…
A neuron computes a weighted sum z = w·x + b. That number can be anything: −1000, 0, 47.3. But for a binary classification output — "will this loan default?" —…
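To make the mapping concrete, here is a minimal sketch of one neuron followed by a sigmoid, evaluated at the three z values mentioned above. The function names and the example weights are hypothetical, chosen only for illustration; the two-branch form of the sigmoid is one common way to avoid overflow at large negative z.

```python
import math

def neuron_pre_activation(w, x, b):
    """Weighted sum z = w·x + b for a single neuron."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sigmoid(z):
    """Logistic function, written in two branches so exp() never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z < 0, so exp(z) is at most 1
    return ez / (1.0 + ez)

# Hypothetical weights, inputs, and bias.
z_example = neuron_pre_activation([0.5, -1.0], [2.0, 3.0], 0.25)
print("z =", z_example, "-> sigmoid(z) =", sigmoid(z_example))

# The raw values from the text: any real z is squashed toward the range 0..1.
for z in (-1000.0, 0.0, 47.3):
    print(z, sigmoid(z))
```

Mathematically the output lies strictly between 0 and 1. In floating point the two extreme inputs round to 0.0 and 1.0, which is already a hint of the saturation discussed later.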
Sigmoid solves one problem — mapping z to a probability — but introduces another: every output is positive, which forces all upstream weight gradients to update…
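The sign constraint can be checked numerically. The sketch below assumes a single neuron that receives the previous layer's activations a_i, so that its weight gradients are dL/dw_i = delta · a_i, where delta = dL/dz is that neuron's error signal. The pre-activation values and the two delta values are made up for the demonstration, and tanh is included only as a zero-centered contrast.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weight_gradients(activations, delta):
    """dL/dw_i = delta * a_i for a neuron with z = sum(w_i * a_i) + b."""
    return [delta * a for a in activations]

# Hypothetical pre-activations of the previous layer.
prev_z = [-2.0, 0.3, 1.5]
sig_acts = [sigmoid(z) for z in prev_z]     # all strictly positive
tanh_acts = [math.tanh(z) for z in prev_z]  # mixed signs

for delta in (0.7, -0.7):
    print("delta =", delta)
    print("  sigmoid inputs:", weight_gradients(sig_acts, delta))
    print("  tanh inputs:   ", weight_gradients(tanh_acts, delta))
```

With sigmoid inputs, every gradient component carries the sign of delta, so within one step all of that neuron's weights move up together or down together. With tanh inputs the signs can differ from weight to weight.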
Sigmoid and tanh both saturate — for large |z|, their derivatives collapse toward zero and gradients die. ReLU sidesteps this entirely for positive values: the…
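A short sketch of the three derivatives shows the saturation directly. It also multiplies the activation derivative across a stack of 5 layers, matching the depth used in the opening example. This is a simplification: real backpropagation also multiplies by the weights at every layer, which the product below deliberately leaves out. The helper names are hypothetical.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # peaks at 0.25 when z = 0

def d_tanh(z):
    t = math.tanh(z)
    return 1.0 - t * t     # peaks at 1.0 when z = 0

def d_relu(z):
    return 1.0 if z > 0 else 0.0

# Derivatives shrink quickly for sigmoid and tanh as |z| grows; ReLU stays at 1.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid'={d_sigmoid(z):.6f}  tanh'={d_tanh(z):.6f}  relu'={d_relu(z):.1f}")

# Activation-derivative factor accumulated over 5 layers (weights ignored).
depth = 5
best_case_sigmoid = d_sigmoid(0.0) ** depth  # 0.25 ** 5 = 0.0009765625
relu_positive = d_relu(3.0) ** depth         # 1.0 ** 5 = 1.0
print(best_case_sigmoid, relu_positive)
```

Even in the sigmoid's best case, where every layer sits exactly at z = 0, the factor reaching the first layer is below one thousandth. Any saturated layer makes it smaller still.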
ReLU kills neurons. When z ≤ 0 for every input a neuron encounters, its output is zero and its gradient is zero — the weight never moves again. The fix is simpl…
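The dead-neuron condition can be reproduced with a single one-input neuron. In this sketch the weight, bias, and inputs are hypothetical and picked so that z ≤ 0 for every input, and the upstream gradient is assumed to be 1 for each example. A leaky variant with a hypothetical slope `alpha = 0.01` is included purely as a point of comparison; it is one common option, assumed here rather than taken from the text.

```python
def d_relu(z):
    return 1.0 if z > 0 else 0.0

def d_leaky_relu(z, alpha=0.01):
    # alpha is an assumed slope for the negative side, not a value from the text.
    return 1.0 if z > 0 else alpha

def weight_gradient(w, b, inputs, act_grad):
    """Sum over inputs of dL/dw = upstream * act'(z) * x, with upstream fixed at 1."""
    return sum(act_grad(w * x + b) * x for x in inputs)

# Hypothetical neuron: non-negative inputs, negative weight and bias,
# so z = w*x + b is negative for every input it sees.
w, b = -1.0, -0.5
inputs = [0.0, 0.5, 1.0, 2.0]

pre_activations = [w * x + b for x in inputs]
relu_grad = weight_gradient(w, b, inputs, d_relu)
leaky_grad = weight_gradient(w, b, inputs, d_leaky_relu)

print("z values:          ", pre_activations)
print("ReLU dL/dw:        ", relu_grad)
print("leaky ReLU dL/dw:  ", leaky_grad)
```

With plain ReLU the weight gradient is exactly zero on every input, so no update can move the neuron back into its active region. With a small negative-side slope the gradient is tiny but nonzero, which leaves the weight free to recover.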