Hands-On Optimization with Expected Improvement and Gaussian Process Regression, in Python


Disclaimer: Gauss was born in 1777 and he was way more brilliant than me. A lot of people have written about this before me and have done an amazing job doing so. A good reference for this article is this one, very well written by @okanyenigun.

My wife is a business major, and if you ask her: "What is Artificial Intelligence?"

She says:

"Something that is taught a bunch of information to develop thoughts and freaking (she didn't say )…I don't know…ideas."

And I would say that this is pretty much in line with how people describe AI. With a few adjustments, it is not that far from the truth either.

Now, if you talk to a mathematician or a physicist, the answer would be way more technical and probably boring (I'm a physicist so I'm allowed to say that). If I were to define AI I would say:

"Artificial Intelligence is a collection of algorithms that use data to minimize a loss function"

…I told you it was more boring. What do I mean by that?

0. Machine Learning as a minimization problem

Let's say we are using Artificial Intelligence to predict the price of a house. Before you do any Artificial Intelligence, there is a human being who actually evaluates the price of a house based on multiple parameters like location, size, year of construction, and so on. As we all know, Artificial Intelligence needs data, so this evaluation has to be done a large number of times to build our training set. Something like this:

Image by author

This collection of houses and the corresponding prices are used to optimize the Machine Learning model. Let's say that, in the very simple imaginary world we live in, the price of the house depends only on its size: if the size is x and the corresponding price is y, we have that:

y = 50 · x

So a house of size x = 2 has price 50 × 2 = 100 (arbitrary units).

Now let's say that we are using a Machine Learning model like:

ŷ = w · x

where w is the parameter we want to learn.

For a given new house with size x = 1, for example, the plot of Error vs w looks like the following:

Image made by author
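
To make this concrete, here is a minimal sketch of how a curve like this could be generated, assuming the linear model above (prediction = w · x) and a squared-error loss; the exact loss used for the original plot is an assumption on my part.

```python
import numpy as np
import matplotlib.pyplot as plt

# "True" relationship from the toy example above: price = 50 * size
def true_price(x):
    return 50 * x

# Machine Learning model: predicted price = w * size, with w the parameter to learn
def predicted_price(w, x):
    return w * x

x = 1.0                             # size of the new house
w_grid = np.linspace(0, 100, 500)   # candidate values of the parameter w

# Squared error between prediction and true price, for every candidate w
error = (predicted_price(w_grid, x) - true_price(x)) ** 2

plt.plot(w_grid, error)
plt.xlabel("w")
plt.ylabel("Error")
plt.title("Error vs w for a house of size x = 1")
plt.show()
```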

Now, of course, we would like our error to be 0. We want to have a Machine Learning model that predicts the exact price of the house.

The thing is that, in real life, we don't really have a plot like the one you are seeing in this picture. What happens in real life is that we pick a random w (or a random vector w) and we optimize to find the minimum error.

If you are a Machine Learning enthusiast this story is not new to you. For example, you might have heard of something called Gradient Descent.

1. Gradient Descent and its Limitations

Gradient descent is a technique that iteratively minimizes a function by following the direction opposite to its gradient. Kind of like this:

Image made by author

I don't want to bore you with any equation, but the logic of the algorithm is the following:

  1. Start from a random parameter vector. In the case above there is only one parameter (a 1D vector), but the dimension of a model is usually way larger (a vector with millions or billions of dimensions)
  2. Estimate the gradient of the loss function with respect to the current parameter vector
  3. Move a step in the direction opposite to the gradient (like you see in the figure above)
  4. If convergence is reached, stop; otherwise repeat steps 2 and 3.

As we see from the picture, in a few words this means: the gradient tells us where a function is increasing, so "minus the gradient" tells us where the function is decreasing. If we follow the decreasing path of the loss function (which is the function to minimize), we have won the game.
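
Here is a minimal sketch of this recipe, run on the 1D squared-error loss from before; the learning rate and the stopping tolerance are arbitrary choices of mine, not prescribed values.

```python
import numpy as np

# 1D squared-error loss from the toy example: Error(w) = (w*x - 50*x)**2
def loss(w, x=1.0):
    return (w * x - 50 * x) ** 2

def grad(w, x=1.0):
    # Analytical gradient of the loss with respect to w
    return 2 * x * (w * x - 50 * x)

w = np.random.uniform(0, 100)    # 1. start from a random parameter
learning_rate = 0.1
tolerance = 1e-8

for step in range(1000):
    g = grad(w)                  # 2. estimate the gradient at the current w
    w = w - learning_rate * g    # 3. move a step AGAINST the gradient
    if abs(g) < tolerance:       # 4. stop once the gradient is (almost) zero
        break

print(f"Converged to w = {w:.4f} after {step + 1} steps")
```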

But have we, though? Not quite, or, at least, not always. The trick is in the "if convergence is reached" statement of step 4. What do we mean by "convergence"? We mean "local convergence", or, if you prefer, "convergence to a local minimum". Very confusing, I know. I'll try to talk human for a second.

The function that you see above has many simplifications. One simplification is that it is only 1D (we already talked about it). The other BIG (more HUGE than BIG) simplification is that the function has only ONE minimum. Many (I'd say all) cases in Machine Learning have loss functions with multiple minima, and when we run the gradient descent algorithm we get "trapped in a local minimum", as we can see from the non-convex function here:

Image made by author

So in the red case, the blue cross (the end point of the algorithm) is indeed a minimum, but only a local one. In other words, if we are trying to minimize the error, we are only partially doing so: we are not able to find the global minimum of the problem.
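
Here is a small sketch of that behavior, using a made-up non-convex 1D function with two minima (the function is mine, not the one in the figure): depending on the starting point, plain gradient descent ends up either in the global minimum or trapped in the local one.

```python
import numpy as np

# A toy non-convex loss with two minima: a local one near w ≈ 1.35
# and the global one near w ≈ -1.47 (illustrative example only).
def loss(w):
    return w**4 - 4 * w**2 + w

def grad(w):
    return 4 * w**3 - 8 * w + 1

def gradient_descent(w_start, learning_rate=0.01, steps=5000):
    w = w_start
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

# The end point depends entirely on where we start:
w_right = gradient_descent(w_start=2.0)    # gets trapped in the LOCAL minimum
w_left = gradient_descent(w_start=-2.0)    # finds the GLOBAL minimum

print(f"start at +2.0 -> w = {w_right:.3f}, loss = {loss(w_right):.3f}")
print(f"start at -2.0 -> w = {w_left:.3f}, loss = {loss(w_left):.3f}")
```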

This is a very well-known problem in the Machine Learning community, but it is actually a more general problem:

"How do we find the minimum of a non convex function when we have no analytical way to describe it?"

In this article, we do it with the Expected Improvement method. Welcome aboard!
