Exploding Gradient Problem
The vanishing gradient problem (section 3, post 01) described what happens when weights are too small: gradients shrink layer by layer until the first layers re…
~/blog/tutorials/deep-learning
The vanishing gradient problem (section 3, post 01) described what happens when weights are too small: gradients shrink layer by layer until the first layers re…
The first decision you make before training starts is how to initialize the weights. It determines whether gradients vanish or explode. It determines whether ne…
A neural network that achieves 99% accuracy on training data and 70% on test data is not a good model — it has memorized the training set. Dropout is the most w…
Training a deep network without normalization means that every weight update in layer 3 shifts the distribution of inputs to layer 4, which shifts what layer 4…
Batch Normalization normalizes across samples in a batch — it averages over the N-dimension. That design breaks the moment you process a single sample: at infer…