Explore how gradient descent navigates loss surfaces. Try the pre-set scenarios below to see different behaviors: divergence, getting trapped in local minima, escaping saddle points, and smooth convergence. Toggle "Compare: Adam" to see how adaptive optimizers differ.
What does Momentum add?
Unlike plain gradient descent, Momentum introduces a “velocity” \(v_t\) that accumulates past gradients:
$$
v_t \;=\; \mu \,v_{t-1} \;+\; \alpha \,\nabla_{\theta}J(\theta_t),
\quad
\theta_{t+1} \;=\; \theta_t \;-\; v_t
$$
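As a concrete sketch, here is what that update looks like in NumPy on a toy quadratic loss (the learning rate \(\alpha = 0.01\), the momentum coefficient \(\mu = 0.9\), and the loss itself are illustrative choices, not values from the demo above):

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.01, mu=0.9):
    """One Momentum update: v <- mu*v + alpha*grad, then theta <- theta - v."""
    v = mu * v + alpha * grad
    theta = theta - v
    return theta, v

# Toy example: minimize J(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([3.0, -2.0])
v = np.zeros_like(theta)
for _ in range(100):
    grad = theta                      # gradient of the toy quadratic loss
    theta, v = momentum_step(theta, v, grad)
print(theta)                          # spirals in toward the minimum at [0, 0]
```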
What about RMSProp?
It adapts the learning rate by tracking how big or small gradients typically are for each parameter. If gradients are large, RMSProp shrinks the step size; if they are small, it enlarges the step. In simpler terms:
$$
\text{Adaptive Step Size} \;\approx\; \frac{1}{\sqrt{\text{rolling average of }(\nabla_{\theta} J(\theta))^2}}
$$
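Here is a minimal NumPy sketch of an RMSProp-style update on a badly scaled toy loss; the decay rate \(\rho = 0.9\), the learning rate, and the loss are assumptions made just for illustration:

```python
import numpy as np

def rmsprop_step(theta, s, grad, alpha=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update: s is a rolling average of squared gradients,
    and the step is divided by sqrt(s) so steep directions take smaller steps."""
    s = rho * s + (1 - rho) * grad**2
    theta = theta - alpha * grad / (np.sqrt(s) + eps)
    return theta, s

# Toy example: J(theta) = 0.5 * (100*x^2 + y^2), a poorly conditioned quadratic.
theta = np.array([1.0, 1.0])
s = np.zeros_like(theta)
for _ in range(200):
    grad = np.array([100.0, 1.0]) * theta   # gradient of the toy loss
    theta, s = rmsprop_step(theta, s, grad)
print(theta)   # both coordinates shrink at a similar rate despite the 100x scale gap
```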
Adam (Adaptive Moment Estimation) extends this by tracking two moving averages of the gradients:

1. A “first moment” (the numerator), which acts like momentum: an exponential average of gradients that helps smooth out noise.
2. A “second moment” (the denominator), borrowed from RMSProp, which tracks the average of squared gradients and adjusts the effective learning rate so steep directions get smaller updates.

The update roughly looks like:
$$
\theta_{t+1} \;\approx\; \theta_t \;-\; \frac{\alpha \,\text{(avg gradient)}}{\sqrt{\text{(avg squared gradient)}} + \epsilon}
$$
The numerator tells you which direction to move, the denominator tells you how much, and the learning rate sets the pace. Finally, \(\epsilon\) is a small constant (often around \(10^{-8}\)) that avoids division by zero and keeps updates stable. Adam automatically tunes step sizes for each parameter dimension; in practice (especially in high-dimensional spaces), this often converges faster and is more tolerant of tricky or noisy gradients.
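To tie the two pieces together, here is a minimal Adam-style step in NumPy with the usual bias correction for early iterations; the hyperparameters (\(\alpha = 0.01\), \(\beta_1 = 0.9\), \(\beta_2 = 0.999\)) and the toy loss are illustrative assumptions:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: m is the moving average of gradients (first moment),
    v is the moving average of squared gradients (second moment)."""
    m = beta1 * m + (1 - beta1) * grad            # momentum-like numerator
    v = beta2 * v + (1 - beta2) * grad**2         # RMSProp-like denominator
    m_hat = m / (1 - beta1**t)                    # bias correction for early steps
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy example: minimize J(theta) = 0.5 * ||theta||^2 again.
theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    grad = theta                                  # gradient of the toy loss
    theta, m, v = adam_step(theta, m, v, grad, t)
print(theta)                                      # driven close to the minimum at [0, 0]
```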