Interactive Gradient Descent

Himmelblau’s function has multiple minima. See how different start points, learning rates, or switching to Adam can lead you to a different minimum (or get "semi-stuck" near a saddle). If Himmelblau's feels too simple (it has only four minima, all global, plus a nice saddle point, but no purely local minima), I have also created a more complex n-hump-camel loss surface with multiple local minima, where you can see how these gradient descent methods traverse a rougher landscape.
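
If you want to poke at the same surface outside the widget, here is a minimal NumPy sketch of Himmelblau's function and its gradient (the helper names himmelblau and himmelblau_grad are mine for illustration, not part of the demo's code):

```python
import numpy as np

def himmelblau(p):
    """Himmelblau's function: f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2."""
    x, y = p
    return (x**2 + y - 11)**2 + (x + y**2 - 7)**2

def himmelblau_grad(p):
    """Analytic gradient of Himmelblau's function."""
    x, y = p
    df_dx = 4 * x * (x**2 + y - 11) + 2 * (x + y**2 - 7)
    df_dy = 2 * (x**2 + y - 11) + 4 * y * (x + y**2 - 7)
    return np.array([df_dx, df_dy])
```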


[Interactive demo: adjustable start point (3.00, 2.00) and learning rate (3.2e-2), an Epoch/Loss readout, and 3D Surface, 2D Contour, and Loss vs. Epoch views.]

Adam vs. Plain Gradient Descent

In plain gradient descent, each update subtracts the gradient scaled by a single learning rate \(\alpha\): $$ \theta_{t+1} \;=\; \theta_t \;-\; \alpha \,\nabla_\theta J(\theta_t) $$
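
As a minimal sketch of that update rule (reusing the himmelblau_grad helper from above; the start point, learning rate, and step count are illustrative, not the demo's settings):

```python
# Reuses np and himmelblau_grad from the sketch above; all values are illustrative.
theta = np.array([0.0, 0.0])   # start point
alpha = 0.01                   # single global learning rate

for epoch in range(500):
    grad = himmelblau_grad(theta)    # gradient of the loss at the current parameters
    theta = theta - alpha * grad     # theta_{t+1} = theta_t - alpha * grad

print(theta, himmelblau(theta))      # typically ends near one of the four minima
```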

What does Momentum add?

With Momentum, unlike in plain gradient descent, we introduce a “velocity” \(v_t\) that accumulates past gradients: $$ v_t \;=\; \mu \,v_{t-1} \;+\; \alpha \,\nabla_{\theta}J(\theta_t), \quad \theta_{t+1} \;=\; \theta_t \;-\; v_t $$

This way, if gradients keep pointing in the same direction, \(v_t\) grows, speeding us along. If they fluctuate, \(v_t\) smooths out the noise by averaging recent steps instead of reacting only to the current gradient.
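
The same loop with momentum, as a sketch (again reusing himmelblau_grad; \(\mu = 0.9\) and \(\alpha = 0.01\) are illustrative choices):

```python
# Reuses np and himmelblau_grad from the sketch above; mu and alpha are illustrative.
theta = np.array([0.0, 0.0])
v = np.zeros_like(theta)       # velocity: running combination of past gradients
mu, alpha = 0.9, 0.01

for epoch in range(500):
    grad = himmelblau_grad(theta)
    v = mu * v + alpha * grad      # v_t = mu * v_{t-1} + alpha * grad
    theta = theta - v              # theta_{t+1} = theta_t - v_t
```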

What about RMSProp?

It adapts the learning rate by tracking how big or small gradients typically are for each parameter. If gradients are large, RMSProp shrinks the step size; if they are small, it enlarges the step. In simpler terms: $$ \text{Adaptive Step Size} \;\approx\; \frac{1}{\sqrt{\text{rolling average of }(\nabla_{\theta} J(\theta))^2}} $$

That means parameters with consistently big gradients slow down, while those with small gradients speed up.
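
A sketch of that idea in code (the decay rate \(\beta = 0.9\) and \(\epsilon\) are common choices, used here for illustration):

```python
# Reuses np and himmelblau_grad from the sketch above; values are common defaults.
theta = np.array([0.0, 0.0])
s = np.zeros_like(theta)       # rolling average of squared gradients, per parameter
alpha, beta, eps = 0.01, 0.9, 1e-8

for epoch in range(500):
    grad = himmelblau_grad(theta)
    s = beta * s + (1 - beta) * grad**2                  # update the rolling average
    theta = theta - alpha * grad / (np.sqrt(s) + eps)    # big gradients -> smaller steps
```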

Adam (Adaptive Moment Estimation) extends this by tracking two moving averages of the gradients:

1. A “first moment” (the numerator), which acts like momentum: an exponential average of gradients that helps smooth out noise.
2. A “second moment” (the denominator), taken from RMSProp, which tracks the average of squared gradients and adjusts the effective learning rate so steep directions get smaller updates.

The update roughly looks like: $$ \theta_{t+1} \;\approx\; \theta_t \;-\; \frac{\alpha \,\text{(avg gradient)}}{\sqrt{\text{(avg of gradient}^2)} + \epsilon} $$

The numerator tells you which direction to move, the denominator tells you how much, and the learning rate sets the pace. Finally, \(\epsilon\) is a small constant (often around \(10^{-8}\)) to avoid dividing by zero and keep updates stable. Adam automatically tunes the step size for each parameter dimension, which in practice (especially in high-dimensional spaces) often converges faster and is more tolerant of tricky or noisy gradients.
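
Putting the two moments together, a minimal Adam sketch looks like this (\(\beta_1 = 0.9\), \(\beta_2 = 0.999\), and \(\epsilon = 10^{-8}\) are the commonly used defaults; the bias-correction terms m_hat and v_hat compensate for the zero initialization of the moving averages and are omitted from the simplified formula above):

```python
# Reuses np and himmelblau_grad from the sketch above; hyperparameters are common defaults.
theta = np.array([0.0, 0.0])
m = np.zeros_like(theta)       # first moment: exponential average of gradients
v = np.zeros_like(theta)       # second moment: exponential average of squared gradients
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 501):
    grad = himmelblau_grad(theta)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)       # bias correction (averages start at zero)
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
```

Note how the division happens elementwise: each parameter gets its own effective step size, which is exactly the per-dimension tuning described above.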