Interactive Gradient Descent

Explore how gradient descent navigates loss surfaces. Try the pre-set scenarios below to see different behaviors: divergence, getting trapped in local minima, escaping saddle points, and smooth convergence. Toggle "Compare: Adam" to see how adaptive optimizers differ.


[Interactive demo: sliders (defaults 3.00, 2.00, 3.2e-2), an epoch/loss readout, and three views: 3D Surface, 2D Contour, and Loss vs. Epoch.]

Adam vs. Plain Gradient Descent

In plain gradient descent, each update subtracts the gradient scaled by a single learning rate \(\alpha\): $$ \theta_{t+1} \;=\; \theta_t \;-\; \alpha \,\nabla_\theta J(\theta_t) $$
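As a concrete reference (a minimal sketch, not the demo's source code), here is plain gradient descent in NumPy on an assumed quadratic bowl \(J(\theta) = \theta_1^2 + \theta_2^2\); the starting point and learning rate are illustrative.

```python
import numpy as np

def grad_J(theta):
    # Gradient of the assumed loss J(theta) = theta_1^2 + theta_2^2.
    return 2.0 * theta

theta = np.array([3.0, 2.0])   # illustrative starting point
alpha = 0.03                   # illustrative learning rate

for epoch in range(100):
    theta = theta - alpha * grad_J(theta)   # theta_{t+1} = theta_t - alpha * grad

print(theta)  # approaches the minimum at (0, 0)
```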

What does Momentum add?

Unlike plain gradient descent, Momentum introduces a “velocity” \(v_t\) that accumulates past gradients: $$ v_t \;=\; \mu \,v_{t-1} \;+\; \alpha \,\nabla_{\theta}J(\theta_t), \quad \theta_{t+1} \;=\; \theta_t \;-\; v_t $$

This way, if gradients keep pointing in the same direction, \(v_t\) grows, speeding us along. If they fluctuate, \(v_t\) smooths out the noise by averaging recent steps instead of reacting only to the current gradient.
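A minimal sketch of that rule, assuming the same quadratic loss as above and illustrative hyperparameters (\(\mu = 0.9\) is a common default):

```python
import numpy as np

def grad_J(theta):
    return 2.0 * theta  # gradient of the assumed loss ||theta||^2

theta = np.array([3.0, 2.0])
v = np.zeros_like(theta)     # velocity, starts at zero
alpha, mu = 0.03, 0.9        # illustrative learning rate and momentum coefficient

for epoch in range(100):
    v = mu * v + alpha * grad_J(theta)  # v_t = mu * v_{t-1} + alpha * grad
    theta = theta - v                   # theta_{t+1} = theta_t - v_t
```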

What about RMSProp?

It adapts the learning rate by tracking how big or small gradients typically are for each parameter. If gradients are large, RMSProp shrinks the step size; if they are small, it enlarges the step. In simpler terms: $$ \text{Adaptive Step Size} \;\approx\; \frac{1}{\sqrt{\text{rolling average of }(\nabla_{\theta} J(\theta))^2}} $$

That means parameters with consistently big gradients slow down, while those with small gradients speed up.
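In code, the rolling average is typically an exponential moving average of squared gradients. A minimal sketch under the same assumed loss, with an illustrative decay rate of 0.9:

```python
import numpy as np

def grad_J(theta):
    return 2.0 * theta  # gradient of the assumed loss ||theta||^2

theta = np.array([3.0, 2.0])
s = np.zeros_like(theta)            # rolling average of squared gradients
alpha, beta, eps = 0.01, 0.9, 1e-8  # illustrative hyperparameters

for epoch in range(200):
    g = grad_J(theta)
    s = beta * s + (1 - beta) * g**2                 # per-parameter gradient scale
    theta = theta - alpha * g / (np.sqrt(s) + eps)   # big gradients -> smaller steps
```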

Adam (Adaptive Moment Estimation) combines both ideas by tracking two moving averages of the gradients:

1. A “first moment” (the numerator), which acts like momentum: an exponential average of gradients that smooths out noise.
2. A “second moment” (the denominator), borrowed from RMSProp: an average of squared gradients that shrinks the effective learning rate in steep directions.

The update roughly looks like: $$ \theta_{t+1} \;\approx\; \theta_t \;-\; \frac{\alpha \,\text{(avg gradient)}}{\sqrt{\text{(avg of gradient}^2)} + \epsilon} $$

The numerator tells you which direction to move, the denominator tells you how much, and the learning rate \(\alpha\) sets the pace. The constant \(\epsilon\) (often around \(10^{-8}\)) prevents division by zero and keeps updates stable. Because Adam tunes the step size for each parameter dimension automatically, it often converges faster in high-dimensional spaces and is more tolerant of tricky or noisy gradients.
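Putting the two moments together (including the bias correction Adam applies because both averages start at zero), here is a minimal sketch with the commonly cited defaults \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), again on the assumed quadratic loss:

```python
import numpy as np

def grad_J(theta):
    return 2.0 * theta  # gradient of the assumed loss ||theta||^2

theta = np.array([3.0, 2.0])
m = np.zeros_like(theta)   # first moment: momentum-like average of gradients
v = np.zeros_like(theta)   # second moment: RMSProp-like average of squared gradients
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)   # bias correction: the averages start at zero
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
```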