DL Notes: Gradient Descent
Artificial Neural Networks (ANNs) are universal function approximators: they can approximate almost any complex function, given enough data, a proper architecture, and enough training time.
But what does "training" a network even mean?
In a previous post about the feedforward process, I mentioned that training a network means adjusting the values of its weights to obtain a better fit of the function we are trying to approximate.
In this post, I'm going to describe gradient descent, the algorithm used to adjust the weights of an ANN.
Let's start with the basic concepts.
Descending from a mountain
Imagine we are at the top of a mountain and need to get to the lowest point of a valley next to it.
We don't have a map, it is foggy and getting dark, we have lost the trail, and we need to get to the bottom quickly. Not a nice scenario, but it defines the "boundaries" of the problem.
For our safety, let's suppose the mountain has no cliffs or sharp ridges, so its surface resembles a differentiable function.
When it gets dark, we can't see in which direction we are moving. The only way we can descend is by taking small steps and checking whether we end up at a lower height or not.
If we notice we moved up, we go in the opposite direction. If we moved down, we continue that way. We repeat the process until, eventually, we reach the bottom.
As you can see, this is not necessarily the best approach. We might end up in a small valley, not at the bottom of the mountain, or we could spend a huge amount of time on a plateau.
This illustrates the basic working principle of gradient descent and also its main challenges. We'll come back to this example, but let's see a more formal explanation first.
What is a gradient?
A gradient is a representation of the rate of change of a function. It points in the direction of the function's greatest increase (and its opposite points in the direction of greatest decrease). Intuitively, that means the gradient is zero at a local maximum or a local minimum.
For a function that depends on several variables (or coordinate axes), the gradient is a vector whose components are the partial derivatives of the function, evaluated at a given point. It is denoted with the symbol ∇ (nabla), which represents the vector differential operator.
Let's see this in math notation. Suppose we have a function f of n variables:

f : ℝⁿ → ℝ

The gradient of this function at a point p (which is determined by n coordinates) is the vector of the partial derivatives of f evaluated at that point:

∇f(p) = ( ∂f/∂x₁ (p), ∂f/∂x₂ (p), …, ∂f/∂xₙ (p) )
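To make this concrete, here is a minimal Python sketch (the function f(x, y) = x² + 3y² and all the helper names are my own, purely illustrative) that evaluates the gradient of a two-variable function at a point, both analytically and with a finite-difference check:

import numpy as np

# Example function of two variables: f(x, y) = x^2 + 3y^2
def f(p):
    x, y = p
    return x**2 + 3 * y**2

# Its analytical gradient: ∇f = (∂f/∂x, ∂f/∂y) = (2x, 6y)
def grad_f(p):
    x, y = p
    return np.array([2 * x, 6 * y])

# Numerical check: central finite differences along each axis
def numerical_gradient(func, p, h=1e-5):
    grad = np.zeros_like(p)
    for i in range(len(p)):
        step = np.zeros_like(p)
        step[i] = h
        grad[i] = (func(p + step) - func(p - step)) / (2 * h)
    return grad

p = np.array([1.0, 2.0])
print(grad_f(p))                 # [ 2. 12.]
print(numerical_gradient(f, p))  # ≈ [ 2. 12.]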
Coming back to the example of the mountain, there are areas where the terrain is steep, like the mountain slopes, and other zones where the terrain is almost flat, like a valley or a plateau. Valleys and plateaus are zones where the gradient approaches zero; where it is exactly zero, we have critical points of the function, such as local minima.
The gradient descent method
For many optimization problems, we aim to minimize a loss function to achieve the most accurate result.
In Deep Learning and ANNs, the loss functions we use are differentiable: they have no discontinuities and are smooth across their whole domain.
This allows us to use the derivative of the loss function with respect to the independent variables as an indication of whether we are moving towards a solution (a global minimum).
How large are the steps we take in proportion to the derivative? This is determined by a step-size parameter, η (called the learning rate when we are talking about Deep Learning), which multiplies the gradient to scale each step.
This way, steeper gradients produce larger steps. As we approach a local minimum, the slope (gradient) tends to zero, so the steps become smaller and smaller.
For example, let's look at the following figure, which illustrates how this works when optimizing a 1D function:
As you can see, we initialize our "search" for the minimum at an arbitrary point (I've depicted two examples, A and B). We then gradually take steps toward the closest minimum, changing θ in proportion to the slope.
The illustration represents what the following algorithm does (pseudo-code) [1]:
θ(0) = θ_init                     # Initialization of optimization variable
η = 0.02                          # Arbitrary step-size parameter
ϵ = 0.01                          # Optimization accuracy
k = 0                             # Iteration counter

repeat:
    θ(k+1) = θ(k) - η ∇f(θ(k))    # Step against the gradient
    k = k + 1
until |f(θ(k)) - f(θ(k-1))| < ϵ   # Stop when f barely changes
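To complement the pseudo-code, here is a hedged, runnable version in Python (the quadratic example function and all names are my own, not taken from [1]):

def gradient_descent(f, grad_f, theta_init, eta=0.02, eps=0.01, max_iters=10_000):
    # Minimize f starting from theta_init, stepping against the gradient.
    # Stops when the change in f between iterations falls below eps.
    theta = theta_init
    f_prev = f(theta)
    for _ in range(max_iters):
        theta = theta - eta * grad_f(theta)   # θ(k+1) = θ(k) - η ∇f(θ(k))
        f_curr = f(theta)
        if abs(f_curr - f_prev) < eps:        # |f(θ(k)) - f(θ(k-1))| < ϵ
            break
        f_prev = f_curr
    return theta

# Example: minimize the 1D convex function f(θ) = (θ - 3)²
f = lambda t: (t - 3) ** 2
grad_f = lambda t: 2 * (t - 3)
print(gradient_descent(f, grad_f, theta_init=0.0, eta=0.1, eps=1e-8))  # ≈ 3.0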
The loss function here is like our example of the mountain when it's dark and we don't have a map: we don't know what it looks like. We want to know the value of θ for which the loss J (the function f in the pseudo-code) is minimized, but the optimization algorithm doesn't know the value of J for all possible inputs θ.
This is why we initialize the optimization algorithm with an arbitrary value of θ. For instance, points A and B in the figure represent two different initialization values.
Potential problems with Gradient Descent
The Gradient Descent algorithm is effective because it can help us obtain an approximate solution for any convex function:
If the function we are trying to optimize is convex, for any desired accuracy ϵ there is some step size η such that gradient descent will converge to within ϵ of the true optimal θ*. [1]
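As a quick sanity check of this statement (my own toy example, not from [1]): take the convex function f(θ) = θ², whose gradient is ∇f(θ) = 2θ. The update rule gives θ(k+1) = θ(k) - 2η θ(k) = (1 - 2η) θ(k), so for any step size 0 < η < 1 the iterates shrink geometrically toward the optimum θ* = 0, reaching any desired accuracy ϵ after enough steps.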
However, as you might have guessed, this is not perfect: the loss functions we encounter in Deep Learning are generally not convex. The algorithm might converge, but that doesn't guarantee that we'll find a global minimum.
The main challenges of gradient descent are the following:
Arbitrary initialization has an impact on the results
Using different initialization values, we might converge to a local minimum instead of the global one.
For example, starting at point B instead of point A in the figure above.
Or, in a less evident case, we could get stuck on a plateau (the vanishing gradient problem), as shown by the blue line in the figure below.
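To illustrate the initialization problem in code, here is a small sketch (the quartic function and the starting points are my own, purely illustrative): the same update rule, started from two different points, lands in two different minima.

# Gradient descent on the non-convex function f(θ) = θ⁴ - 3θ² + θ,
# which has a local minimum near θ ≈ 1.13 and a global one near θ ≈ -1.30.
f = lambda t: t**4 - 3 * t**2 + t
grad_f = lambda t: 4 * t**3 - 6 * t + 1

def descend(theta, eta=0.01, steps=2000):
    for _ in range(steps):
        theta = theta - eta * grad_f(theta)
    return theta

for theta_init in (2.0, -2.0):   # same algorithm, two initializations
    theta = descend(theta_init)
    print(f"start {theta_init:+.1f} -> theta = {theta:+.3f}, f = {f(theta):+.3f}")
# start +2.0 -> theta = +1.131, f = -1.070   (local minimum)
# start -2.0 -> theta = -1.301, f = -3.514   (global minimum)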