Interactive Gradient Descent

Himmelblau’s function has multiple minima. See how different start points, learning rates, or switching to Adam can lead you to a different minimum (or get "semi-stuck" near a saddle). If Himmelblau's feels too simple (it has only four minima, all global, plus a nice saddle point, but no purely local minima), I have also created a more complex n-hump-camel loss surface with multiple local minima, where you can see how these gradient descent methods traverse a rougher landscape.
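
If you want to poke at the same surface outside the widget, here is a minimal NumPy sketch of Himmelblau's function and its gradient (the helper names himmelblau and himmelblau_grad are mine for illustration, not part of the demo's code):

```python
import numpy as np

def himmelblau(p):
    """Himmelblau's function: f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2."""
    x, y = p
    return (x**2 + y - 11)**2 + (x + y**2 - 7)**2

def himmelblau_grad(p):
    """Analytic gradient of Himmelblau's function."""
    x, y = p
    df_dx = 4 * x * (x**2 + y - 11) + 2 * (x + y**2 - 7)
    df_dy = 2 * (x**2 + y - 11) + 4 * y * (x + y**2 - 7)
    return np.array([df_dx, df_dy])
```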


[Interactive demo: adjustable start point (3.00, 2.00) and learning rate (3.2e-2), an Epoch/Loss readout, and 3D Surface, 2D Contour, and Loss vs. Epoch views.]

Adam vs. Plain Gradient Descent

In plain gradient descent, each update subtracts the gradient scaled by a single learning rate \(\alpha\): $$ \theta_{t+1} \;=\; \theta_t \;-\; \alpha \,\nabla_\theta J(\theta_t) $$
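
As a minimal sketch of that update rule (reusing the himmelblau_grad helper from above; the start point, learning rate, and step count are illustrative, not the demo's settings):

```python
# Reuses np and himmelblau_grad from the sketch above; all values are illustrative.
theta = np.array([0.0, 0.0])   # start point
alpha = 0.01                   # single global learning rate

for epoch in range(500):
    grad = himmelblau_grad(theta)    # gradient of the loss at the current parameters
    theta = theta - alpha * grad     # theta_{t+1} = theta_t - alpha * grad

print(theta, himmelblau(theta))      # typically ends near one of the four minima
```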

What does Momentum add?

With Momentum, unlike in plain gradient descent, we introduce a “velocity” \(v_t\) that accumulates past gradients: $$ v_t \;=\; \mu \,v_{t-1} \;+\; \alpha \,\nabla_{\theta}J(\theta_t), \quad \theta_{t+1} \;=\; \theta_t \;-\; v_t $$

This way, if gradients keep pointing in the same direction, \(v_t\) grows, speeding us along. If they fluctuate, \(v_t\) smooths out the noise by averaging recent steps instead of reacting only to the current gradient.
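
The same loop with momentum, as a sketch (again reusing himmelblau_grad; \(\mu = 0.9\) and \(\alpha = 0.01\) are illustrative choices):

```python
# Reuses np and himmelblau_grad from the sketch above; mu and alpha are illustrative.
theta = np.array([0.0, 0.0])
v = np.zeros_like(theta)       # velocity: running combination of past gradients
mu, alpha = 0.9, 0.01

for epoch in range(500):
    grad = himmelblau_grad(theta)
    v = mu * v + alpha * grad      # v_t = mu * v_{t-1} + alpha * grad
    theta = theta - v              # theta_{t+1} = theta_t - v_t
```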

What about RMSProp?

It adapts the learning rate by tracking how big or small gradients typically are for each parameter. If gradients are large, RMSProp shrinks the step size; if they are small, it enlarges the step. In simpler terms: $$ \text{Adaptive Step Size} \;\approx\; \frac{1}{\sqrt{\text{rolling average of }(\nabla_{\theta} J(\theta))^2}} $$

That means parameters with consistently big gradients slow down, while those with small gradients speed up.
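
A sketch of that idea in code (the decay rate \(\beta = 0.9\) and \(\epsilon\) are common choices, used here for illustration):

```python
# Reuses np and himmelblau_grad from the sketch above; values are common defaults.
theta = np.array([0.0, 0.0])
s = np.zeros_like(theta)       # rolling average of squared gradients, per parameter
alpha, beta, eps = 0.01, 0.9, 1e-8

for epoch in range(500):
    grad = himmelblau_grad(theta)
    s = beta * s + (1 - beta) * grad**2                  # update the rolling average
    theta = theta - alpha * grad / (np.sqrt(s) + eps)    # big gradients -> smaller steps
```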

Adam (Adaptive Moment Estimation) extends this by tracking two moving averages of the gradients:

1. A “first moment” (the numerator), which acts like momentum: an exponential average of gradients that helps smooth out noise.
2. A “second moment” (the denominator), taken from RMSProp, which tracks the average of squared gradients and adjusts the effective learning rate so steep directions get smaller updates.

The update roughly looks like: $$ \theta_{t+1} \;\approx\; \theta_t \;-\; \frac{\alpha \,\text{(avg gradient)}}{\sqrt{\text{(avg of gradient}^2)} + \epsilon} $$

The numerator tells you which direction to move, the denominator tells you how much, and the learning rate sets the pace. Finally, \(\epsilon\) is a small constant (often around \(10^{-8}\)) to avoid dividing by zero and keep updates stable. Adam automatically tunes the step size for each parameter dimension, which in practice (especially in high-dimensional spaces) often converges faster and is more tolerant of tricky or noisy gradients.
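
Putting the two moments together, a minimal Adam sketch looks like this (\(\beta_1 = 0.9\), \(\beta_2 = 0.999\), and \(\epsilon = 10^{-8}\) are the commonly used defaults; the bias-correction terms m_hat and v_hat compensate for the zero initialization of the moving averages and are omitted from the simplified formula above):

```python
# Reuses np and himmelblau_grad from the sketch above; hyperparameters are common defaults.
theta = np.array([0.0, 0.0])
m = np.zeros_like(theta)       # first moment: exponential average of gradients
v = np.zeros_like(theta)       # second moment: exponential average of squared gradients
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 501):
    grad = himmelblau_grad(theta)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)       # bias correction (averages start at zero)
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
```

Note how the division happens elementwise: each parameter gets its own effective step size, which is exactly the per-dimension tuning described above.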