DL Notes: Advanced Gradient Descent

In my previous article about gradient descent, I explained the basic concepts behind it and summarized the main challenges of this kind of optimization.
However, I only covered Stochastic Gradient Descent (SGD) and the "batch" and "mini-batch" implementations of gradient descent.
Other algorithms offer advantages in terms of convergence speed, robustness to "landscape" features such as plateaus and saddle points (regions where the gradient vanishes), and less dependence on the choice of learning rate to achieve good performance.
So today I'll write about more advanced optimization algorithms, implementing them from scratch in Python and comparing them through animated visualizations.
I've also listed the resources I used for learning about these algorithms. They are great for diving deeper into formal concepts.
Comparing the algorithms using a simple objective function

Throughout this article, I'll show how I implemented the different algorithms in Python.
I've created a Jupyter Notebook that you can access on GitHub or directly on Google Colab to see all the code used to create the figures shown here.
To generate the animations, I used the approach shown in my previous post about creating an animated gradient descent figure in Python.
The later code snippets assume that the following definitions have already been included, as they use NumPy classes and methods and call both the function f and its gradient, grad_f.
import numpy as np

# Create a function to compute the surface (a saddle: f(x, y) = x^2 - y^2)
def f(theta):
    x = theta[0]
    y = theta[1]
    return x**2 - y**2

# Define a function to compute the gradient of f
def grad_f(theta):
    x = theta[0]
    y = theta[1]
    return np.array([2 * x, -2 * y])
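
As a quick sanity check, we can evaluate both functions at a sample point (this snippet is just an illustration of mine, not part of the original notebook):

# Evaluate the surface and its gradient at theta = (1, 2)
theta = np.array([1.0, 2.0])
print(f(theta))       # -3.0, since 1^2 - 2^2 = -3
print(grad_f(theta))  # [ 2. -4.], since (2x, -2y) = (2, -4)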
Momentum

We can compare the optimization process to a ball rolling downhill.
If the "ball" had momentum, as it does in real life, it would be less likely to remain stuck in a local minimum after accelerating down the hill at full speed.
That's the intuition researchers drew on when dealing with the problem of gradient descent getting stuck in local minima.
From high school physics, we know that translational momentum is defined as the product of the mass of an object and its velocity:

$$p = mv$$
We also know that the gravitational potential energy of an object of mass $m$ is proportional to the height $h$ at which it is placed:

$$U = mgh$$
Moreover, there is a direct relationship between the potential energy of an object and the force exerted on it: the force is the negative gradient of the potential energy.

$$F = -\nabla U$$
The relationship between $p$ and $U$ can be derived from Newton's second law of motion:

"The change of motion of an object is proportional to the force impressed; and is made in the direction of the straight line in which the force is impressed."

In modern notation, the second law says the net force equals the rate of change of momentum, so combining it with the previous equation gives:

$$F = \frac{dp}{dt} = -\nabla U$$

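To connect the physics back to optimization, here is a minimal sketch of gradient descent with the classic "heavy ball" momentum update, reusing the f and grad_f defined above. This is the standard textbook formulation, not necessarily the exact code from the notebook, and the learning rate eta, momentum coefficient gamma, and starting point are arbitrary values chosen for illustration.

# Gradient descent with momentum (heavy-ball update), illustrative hyperparameters
def gradient_descent_momentum(theta0, eta=0.1, gamma=0.9, n_steps=50):
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)  # velocity: an exponentially decaying sum of past gradients
    path = [theta.copy()]
    for _ in range(n_steps):
        v = gamma * v + eta * grad_f(theta)  # accumulate momentum
        theta = theta - v                    # step along the accumulated velocity
        path.append(theta.copy())
    return np.array(path)

# Example: start near the saddle point of f(x, y) = x^2 - y^2
path = gradient_descent_momentum([1.0, 0.001])
print(path[-1])  # x is pulled toward 0, while y escapes along the descending direction

On the saddle surface defined earlier, the velocity term keeps the iterate moving through the flat region around the origin instead of stalling there, which is exactly the behavior the rolling-ball analogy describes.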