
Optimizers

Jul 1, 2026 · 8 min read

Gradient Descent

Training a neural network is an optimization problem. You have a cost function J(w, b) — a surface defined over all model parameters. You want to find the lowes…

Tutorial

Jul 1, 2026 · 9 min read

Stochastic Gradient Descent (SGD)

Batch gradient descent computes the exact gradient — but it requires processing all n training samples before taking a single weight update. With 1 million samp…

Tutorial

Jul 1, 2026 · 8 min read

Mini-Batch SGD

Batch GD: one update per epoch, exact gradient, slow. SGD: n updates per epoch, noisy gradient, fast but volatile. Mini-batch SGD: k updates per epoch, approxim…

Tutorial

Jul 1, 2026 · 8 min read

SGD with Momentum

Mini-batch SGD has a problem in narrow loss valleys. The gradient across the narrow dimension is large and oscillates in sign — the optimizer zigzags side-to-si…

Tutorial

Jul 1, 2026 · 8 min read

Adagrad

All previous optimizers — batch GD, SGD, mini-batch SGD, momentum — use the same learning rate η for every parameter. This is a bad assumption when different pa…