Gradient Descent
Training a neural network is an optimization problem. You have a cost function J(w, b) — a surface defined over all model parameters. You want to find the lowes…
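To make the optimization picture concrete, here is a minimal sketch of plain gradient descent on a toy cost surface. The quadratic `cost`, its minimum at (3, -1), the learning rate, and the step count are all illustrative assumptions, not values from a real network.

```python
def cost(w, b):
    # Toy convex cost J(w, b) with its minimum at (w, b) = (3, -1); purely illustrative.
    return (w - 3.0) ** 2 + (b + 1.0) ** 2


def grad(w, b):
    # Analytic partial derivatives dJ/dw and dJ/db of the toy cost.
    return 2.0 * (w - 3.0), 2.0 * (b + 1.0)


def gradient_descent(w, b, lr=0.1, steps=100):
    # Repeatedly step against the gradient: w <- w - lr * dJ/dw, b <- b - lr * dJ/db.
    for _ in range(steps):
        dw, db = grad(w, b)
        w -= lr * dw
        b -= lr * db
    return w, b


w_final, b_final = gradient_descent(0.0, 0.0)
print(w_final, b_final, cost(w_final, b_final))
```

Each step moves the parameters a little way downhill, so the printed point should sit very close to the minimum of this toy surface.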
Batch gradient descent computes the exact gradient, but it must process all n training samples before making a single weight update. With 1 million samp…
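The cost of an exact gradient can be sketched on a one-parameter model y = w * x with squared error. The synthetic data, `make_data`, and the choice n = 1000 are assumptions made for illustration. `batch_grad` has to visit every sample, while `sample_grad` looks at just one and gives a cheaper, noisier estimate of the same quantity.

```python
import random


def make_data(n, true_w=2.0, noise=0.1, seed=0):
    # Hypothetical synthetic dataset: y = true_w * x plus Gaussian noise.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_w * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def sample_grad(w, x, y):
    # Gradient of the single-sample loss (w * x - y) ** 2 with respect to w.
    return 2.0 * (w * x - y) * x


def batch_grad(w, xs, ys):
    # Exact gradient of the mean squared error: an average over all n samples.
    return sum(sample_grad(w, x, y) for x, y in zip(xs, ys)) / len(xs)


xs, ys = make_data(1000)
w = 0.0
exact = batch_grad(w, xs, ys)  # touches all 1000 samples for one number
noisy = [sample_grad(w, x, y) for x, y in zip(xs[:5], ys[:5])]  # one sample each
print("exact gradient:", exact)
print("single-sample estimates:", noisy)
```

The single-sample values scatter around the exact one; averaging all of them recovers it, which is exactly the work batch gradient descent does before every update.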
Batch GD: one update per epoch, exact gradient, slow. SGD: n updates per epoch, noisy gradient, fast but volatile. Mini-batch SGD: k updates per epoch, approxim…
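The three variants differ only in how many samples feed each update, so one loop with a `batch_size` argument can sketch all of them. The toy model y = w * x, the noiseless data, and the batch size of 32 are assumptions. With batch size B an epoch makes ceil(n / B) updates, which is presumably what k denotes above.

```python
import random


def train_one_epoch(w, xs, ys, batch_size, lr=0.1, seed=0):
    """Run one epoch on the toy model y = w * x; return (w, number of updates)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)  # visit samples in a random order
    updates = 0
    for start in range(0, len(idx), batch_size):
        batch = idx[start:start + batch_size]
        # Average squared-error gradient over the current batch only.
        g = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
        w -= lr * g
        updates += 1
    return w, updates


n = 1000
rng = random.Random(1)
xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
ys = [2.0 * x for x in xs]  # noiseless toy targets with true w = 2

for name, bs in [("batch GD", n), ("SGD", 1), ("mini-batch SGD", 32)]:
    w, updates = train_one_epoch(0.0, xs, ys, bs)
    print(f"{name:15s} batch_size={bs:4d} updates={updates:4d} w={w:.3f}")
```

Setting `batch_size` to n gives a single update per epoch, setting it to 1 gives n updates, and anything in between trades gradient accuracy against update frequency.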
Mini-batch SGD has a problem in narrow loss valleys. The gradient across the narrow dimension is large and oscillates in sign — the optimizer zigzags side-to-si…
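The zigzag can be reproduced on a toy narrow valley, and compared with a classical momentum update. Everything here is an assumption for illustration: the quadratic J(a, b) = 0.5 * (a^2 + 100 * b^2), the starting point, the learning rate 0.018, and beta = 0.9. The momentum rule shown (v = beta * v + g, then p = p - lr * v) is one common formulation; other write-ups scale the terms differently.

```python
def grad(p):
    # Toy narrow valley: shallow along a, steep along b.
    a, b = p
    return (a, 100.0 * b)


def cost(p):
    a, b = p
    return 0.5 * (a * a + 100.0 * b * b)


def run(p, lr=0.018, beta=0.0, steps=100):
    """Gradient descent with classical momentum; beta=0.0 is plain gradient descent."""
    v = (0.0, 0.0)
    sign_flips = 0  # how often the steep coordinate b jumps across the valley floor
    for _ in range(steps):
        g = grad(p)
        v = tuple(beta * vi + gi for vi, gi in zip(v, g))
        new_p = tuple(pi - lr * vi for pi, vi in zip(p, v))
        if new_p[1] * p[1] < 0:
            sign_flips += 1
        p = new_p
    return p, sign_flips


start = (10.0, 1.0)
plain, plain_flips = run(start, beta=0.0)
mom, mom_flips = run(start, beta=0.9)
print("plain GD: cost", cost(plain), "sign flips in b:", plain_flips)
print("momentum: cost", cost(mom), "sign flips in b:", mom_flips)
```

In this setup plain gradient descent overshoots the valley floor on every step while creeping along the shallow direction. The velocity term lets consistent gradients along the valley accumulate, so the momentum run finishes at a lower cost for the same number of steps.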
All previous optimizers — batch GD, SGD, mini-batch SGD, momentum — use the same learning rate η for every parameter. This is a bad assumption when different pa…
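One well-known way to give each parameter its own effective step size is an AdaGrad-style update, which divides each gradient by the root of that parameter's accumulated squared gradients. Treat this as a hedged sketch: the text above does not say which adaptive method it has in mind, and `adagrad_step`, the toy gradient, and the hyperparameters are illustrative assumptions.

```python
import math


def adagrad_step(params, grads, accum, lr=0.5, eps=1e-8):
    """One AdaGrad-style update; accum is the per-parameter sum of squared gradients."""
    new_params, new_accum = [], []
    for p, g, s in zip(params, grads, accum):
        s = s + g * g  # accumulate squared gradient for this parameter only
        new_params.append(p - lr * g / (math.sqrt(s) + eps))
        new_accum.append(s)
    return new_params, new_accum


def grad(p):
    # Toy gradient whose two components live on very different scales.
    return [p[0], 100.0 * p[1]]


params = [10.0, 1.0]
accum = [0.0, 0.0]
params, accum = adagrad_step(params, grad(params), accum)

# Effective per-parameter learning rates after the first step.
effective_lr = [0.5 / (math.sqrt(s) + 1e-8) for s in accum]
print("params:", params)
print("effective learning rates:", effective_lr)
```

On the first step the raw gradients are 10 and 100, yet both parameters move by roughly `lr`, because each one is normalized by its own gradient history. A single shared η cannot do this.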