Gradient Descent Algorithm 101
Beginner-friendly guide

Imagine you are a drop of water on top of a mountain, and your goal is to get to the lake right at the base of the mountain. That tall mountain has different slopes and obstacles, so going down in a straight line might not be the best solution. How would you approach this problem? The best solution would arguably be to take little steps, one at a time, always heading in the direction that brings you closer to your end goal.
Gradient Descent (GD) is the algorithm that does just that, and it is essential for any data scientist to understand. It's basic and rather simple, but crucial, and anyone looking to enter the field should be able to explain what it is.
In this post, my goal is to provide a complete, beginner-friendly guide that helps everyone understand what GD is, what it's used for, and how it works, and to mention some of its different variations.
As always, you'll find the resources section at the end of the post.
But first things first.
Introduction
Using Wikipedia's definition[1], Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. Even though it's surely not the most effective method, it's commonly used in Machine Learning and Deep Learning, especially in Neural Networks.
It's basically used to minimize the value of a function by updating a set of parameters on each iteration. Mathematically speaking, it uses the derivative (gradient) to gradually decrease (descent) its value.
But there's a catch: not all functions are optimizable. We require a function, either univariate or multivariate, that is both differentiable, meaning derivatives exist at each point in its domain, and convex (U-shaped or similar).
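To make those two requirements concrete, here's a quick check with sympy on a toy function of my own choosing, the simple parabola f(x) = x², which is differentiable everywhere and convex:

```python
import sympy as sp

x = sp.symbols("x")
f = x**2  # toy example: a simple parabola

# Differentiable: the first derivative exists at every point of the domain
f_prime = sp.diff(f, x)       # 2*x

# Convex: the second derivative is non-negative everywhere
f_second = sp.diff(f, x, 2)   # 2

print(f"f'(x) = {f_prime}, f''(x) = {f_second}")
```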
Now, after this simple introduction, we can start digging a little bit deeper into the math behind it.
Practical Case
Because everything gets clearer when we go beyond the theory, let's use real numbers and values to understand what it does.
Let's use a common Data Science case in which we want to develop a regression model.
Disclaimer: I made this example up, and there's no logical reasoning behind these particular functions; they came up randomly. The goal is to show the process itself.
The cost function or loss function in any data science problem is the function we want to optimize. As we're using regression, we're going to use this one:

The goal is to find the minimum of f(x, y). Let me plot what it looks like:

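A minimal sketch of how such a surface could be plotted with numpy and matplotlib, assuming a cost function of the form f(x, y) = (x² − 1)² + y² (a hypothetical stand-in on my part, chosen only because it's consistent with the minima listed below), might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical cost function, chosen so its minima sit at (x, y) = (±1, 0)
def f(x, y):
    return (x**2 - 1)**2 + y**2

# Evaluate the surface on a grid of (x, y) values
x = np.linspace(-2, 2, 200)
y = np.linspace(-2, 2, 200)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

# 3D surface plot of the cost function
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(X, Y, Z, cmap="viridis")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("f(x, y)")
plt.show()
```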
Now our goal is to find the values of x and y that give us the optimal value of this cost function. We can already see it graphically:
- y=0
- x being either -1 or 1
Now, onto GD itself, because we want our machine to learn to do the same.
The Algorithm
As said, gradient descent is an iterative process in which we compute the gradient and move in the opposite direction. The reasoning behind this is that the gradient of a function points in the direction of steepest ascent; since we want to move down, not up, we move the opposite way.
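Before spelling out the exact update rule (it's just below), here's a minimal sketch of that loop in Python, reusing the hypothetical cost function from the plot above; the learning rate, starting point, and number of iterations are arbitrary choices of mine:

```python
# Partial derivatives of the hypothetical cost function f(x, y) = (x**2 - 1)**2 + y**2
def grad_f(x, y):
    df_dx = 4 * x * (x**2 - 1)
    df_dy = 2 * y
    return df_dx, df_dy

learning_rate = 0.1  # step size (the lambda mentioned below)
x, y = 1.5, 1.0      # arbitrary starting point

for _ in range(100):
    df_dx, df_dy = grad_f(x, y)
    # Move against the gradient, i.e. downhill
    x -= learning_rate * df_dx
    y -= learning_rate * df_dy

print(x, y)  # ends up close to one of the minima, e.g. (1, 0)
```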
It's a simple process in which we update x and y on each iteration, following this approach:

x_{k+1} = x_k − λ · ∂f/∂x(x_k, y_k)
y_{k+1} = y_k − λ · ∂f/∂y(x_k, y_k)

Explained in words, at iteration k:
- Compute the gradient using the values of x and y at that iteration.
- For each of those variables – x and y – multiply its gradient times lambda (