Himmelblau’s function has multiple minima: see how different start points, learning rates, or optimizers such as Adam can lead you to different minima (or get "semi-stuck" near a saddle point). If Himmelblau's feels too easy (it has only 4 minima, all global, with no strictly local ones, but a nice saddle point), I have also created a more complex (n-hump camel) loss surface with multiple local minima, where you can see how these gradient descent variants traverse a harder landscape.
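As a rough sketch of what the interactive demo illustrates (this code is illustrative, not the demo itself), here is plain gradient descent on Himmelblau’s function from a few start points; the learning rate, step count, and start points are arbitrary choices.

```python
# Illustrative sketch (not the interactive demo): plain gradient descent on
# Himmelblau's function, started from a few different points. Each run can
# settle into a different one of the four global minima.
import numpy as np

def himmelblau(p):
    x, y = p
    return (x**2 + y - 11)**2 + (x + y**2 - 7)**2

def himmelblau_grad(p):
    x, y = p
    dx = 4 * x * (x**2 + y - 11) + 2 * (x + y**2 - 7)
    dy = 2 * (x**2 + y - 11) + 4 * y * (x + y**2 - 7)
    return np.array([dx, dy])

def gradient_descent(start, lr=0.01, steps=500):
    theta = np.array(start, dtype=float)
    for _ in range(steps):
        theta -= lr * himmelblau_grad(theta)
    return theta

# Arbitrary start points; different basins of attraction lead to different minima.
for start in [(0.0, 0.0), (-1.0, 3.0), (-4.0, -4.0), (4.0, -3.0)]:
    end = gradient_descent(start)
    print(f"start {start} -> ends near {np.round(end, 3)}, loss {himmelblau(end):.2e}")
```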
What does Momentum add?
Unlike plain gradient descent, Momentum introduces a “velocity” \(v_t\) that accumulates past gradients:
$$
v_t \;=\; \mu \,v_{t-1} \;+\; \alpha \,\nabla_{\theta}J(\theta_t),
\quad
\theta_{t+1} \;=\; \theta_t \;-\; v_t
$$
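As a concrete sketch of this update (not taken from the original text), here it is applied to a toy, badly scaled quadratic loss; the loss function and the values of \(\mu\) and \(\alpha\) are illustrative assumptions, not recommendations.

```python
# Minimal sketch of the Momentum update above, applied to a toy, badly scaled
# quadratic loss J(theta) = 0.5 * (100 * theta_0**2 + theta_1**2).
# The loss and the values of mu and alpha are illustrative choices.
import numpy as np

def grad(theta):
    # Gradient of J: steep in theta_0 (curvature 100), shallow in theta_1 (curvature 1).
    return np.array([100.0 * theta[0], theta[1]])

def momentum_descent(start, mu=0.9, alpha=0.001, steps=1000):
    theta = np.array(start, dtype=float)
    v = np.zeros_like(theta)              # velocity, v_0 = 0
    for _ in range(steps):
        v = mu * v + alpha * grad(theta)  # v_t = mu * v_{t-1} + alpha * grad
        theta = theta - v                 # theta_{t+1} = theta_t - v_t
    return theta

print(momentum_descent((1.0, 1.0)))  # both coordinates end up close to 0
```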
What about RMSProp?
It adapts the learning rate by tracking how large the gradients typically are for each parameter. If a parameter’s gradients are consistently large, RMSProp shrinks its step size; if they are small, it enlarges the step. In simpler terms:
$$
\text{Adaptive Step Size} \;\approx\; \frac{1}{\sqrt{\text{rolling average of }(\nabla_{\theta} J(\theta))^2}}
$$
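A minimal sketch of that idea, using the same toy quadratic as above; the decay rate `rho`, step size, and epsilon are illustrative values.

```python
# Minimal sketch of the RMSProp idea on the same toy quadratic.
import numpy as np

def grad(theta):
    # Gradient of J(theta) = 0.5 * (100 * theta_0**2 + theta_1**2).
    return np.array([100.0 * theta[0], theta[1]])

def rmsprop(start, alpha=0.01, rho=0.9, eps=1e-8, steps=500):
    theta = np.array(start, dtype=float)
    s = np.zeros_like(theta)                      # rolling average of squared gradients
    for _ in range(steps):
        g = grad(theta)
        s = rho * s + (1 - rho) * g**2            # track typical gradient magnitude per parameter
        theta -= alpha * g / (np.sqrt(s) + eps)   # large gradients -> smaller effective step
    return theta

# Despite the 100x curvature gap, both parameters take similar-sized steps.
print(rmsprop((1.0, 1.0)))
```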
Adam (Adaptive Moment Estimation) extends RMSProp by tracking two moving averages of the gradients:
1. A “first moment” (the numerator), which acts like momentum: an exponential average of gradients that helps smooth out noise.
2. A “second moment” (the denominator), as in RMSProp, which tracks the average of squared gradients, adjusting the effective learning rate so steep directions get smaller updates.

The update roughly looks like:
$$
\theta_{t+1} \;\approx\; \theta_t \;-\; \frac{\alpha \,\text{(avg gradient)}}{\sqrt{\text{(avg of gradient}^2)} + \epsilon}
$$
The numerator tells you which direction to move, the denominator tells you how much, and the learning rate sets the pace. Finally, \(\epsilon\) is a small constant (often around \(10^{-8}\)) to avoid dividing by zero and ensure stable updates. Adam automatically tunes the step size for each parameter dimension. In practice (in high-dimensional spaces), this often converges faster and is more tolerant of tricky or noisy gradients.
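A minimal sketch of the Adam update on the same toy loss; note that the bias-correction terms (`m_hat`, `v_hat`) are part of standard Adam even though the rough formula above omits them, and the hyperparameter values below are common defaults, not tuned choices.

```python
# Minimal sketch of the Adam update on the same toy quadratic.
import numpy as np

def grad(theta):
    # Gradient of J(theta) = 0.5 * (100 * theta_0**2 + theta_1**2).
    return np.array([100.0 * theta[0], theta[1]])

def adam(start, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    theta = np.array(start, dtype=float)
    m = np.zeros_like(theta)   # first moment: exponential average of gradients (numerator)
    v = np.zeros_like(theta)   # second moment: exponential average of squared gradients (denominator)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)       # correct the bias from zero initialization
        v_hat = v / (1 - beta2**t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

print(adam((1.0, 1.0)))  # both coordinates are driven toward 0
```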