This post is part of a series:
In the previous post, we left off with the observation that for the following function, we can’t apply an analytical approach in order to determine the minimum of the function.
See slide 1
So, the question then was: How do we now find the minimum? And that’s what we are going to answer now.
Numerical Approach for finding Extrema – Brute-force
Namely, instead of using an analytical approach, we are going to use a numerical approach. So, instead of manipulating the equations themselves, we basically just test out different values for our variable and see what the function outputs.
See slide 2
And the most basic numerical approach would probably be a brute-force approach where we simply calculate the y for a huge number of different w (which is basically how I created the graph).
See slide 3
And then, we check which of those values results in the lowest y.
See slide 4
This way, however, we can only determine an approximation for the minimum of the function. And that’s because if we would have used, for example, 3 digits after the decimal point for our w, then we could have found a w that results in an even lower y.
See slide 5
And if we would have used 4 digits, we would get an even lower y, and so on.
So, a general characteristic of numerical approaches is that the solution will always be just an approximation and not an exact value like we got with the analytical approach.
See slide 6
Okay, so having said that, you can probably imagine that this brute-force approach is not very practical for a more complex function like our mean squared error (MSE) which has way more variables.
See slide 7
Because if you would for example just use 100 different values for each of the 14 weights in our neural net, then the number of calculations that you would have to do is 10014. So, it would simply take too long to find that one combination of weights that will result in the lowest MSE. And that’s despite the fact that this is even just a very small neural net. So, for larger ones with many more weights, this brute-force approach is even less practical.
Numerical Approach for finding Extrema – Gradient Descent
So, clearly, we need a different numerical approach for finding the minimum of the MSE. So, what could such an approach look like?
Well, what we could do is, we could simply pick a random value for our w, for example 1.9.
See slide 8
And then, we somehow adjust this w. So, we either increase or decrease it in order to reduce the function somewhat. And for that adjustment of the w, we can take advantage of the fact that the derivative represents the slope of a function and tells us how our y changes if we increase the w by a tiny amount.
See slide 9
So, at this point here, if we increase w by a little amount, then y will increase by 25.66 times that little amount. And this is exactly the opposite of what we want to do. So, if we instead decrease our w, then y should decrease, and we should get closer to the minimum.
But, by how much should we decrease the w?
Well, let’s just take the value of the derivative at this point (so the derivative of y evaluated at w).
See slide 10
So, let’s calculate the new w.
See slide 11
And here, as you can see in this example, if we use the update rule as it is and reduce our w by 25.66, then we actually overshoot the minimum (which is roughly at w=0.75 as you can see in the graph). So, we are somewhere way to the left of the diagram. And our y accordingly would be extremely high.
See slide 12
So, we are actually further away from the minimum than before. And even worse, at this new w, the magnitude of the new slope is even bigger. And the update of w would look like this:
See slide 13
So here, we would be moving in the right direction towards the minimum because we subtract the negative derivative from our w. So, we would increase our w. But the update is way too big and we overshoot the minimum even more. And this just keeps escalating and we would be getting further and further away from the minimum.
So, after all, we can’t just take the slope or derivative as it is, but we have to scale it down somewhat.
See slide 14
So, in our update equation, we simply multiply the derivative with some factor which we denote as alpha (α). And this factor is called the learning rate. And typical values for the learning rate are for example 0.1, 0.01, 0.001 and so on.
And finding the right learning rate is part of the training process which we will cover in a later post of this series in more detail. If it is too big, as we have already seen, we get further and further away from the minimum. If it is too small, then we just take tiny steps towards the minimum which only get smaller and smaller as we get closer to the minimum. So, reaching the minimum would just take too many iterations and too much time. Thus, one has to find the right balance. The learning rate shouldn’t be too big, so that we never reach the minimum. But it also shouldn’t be too small, so that it takes too long to reach the minimum.
So, in our case, let’s set it to 0.01.
See slide 15
So, the new w will be 1.64. And now, if we execute the update rule,then we move in the graph from the point where w is equal to 1.9, to the point where w is equal to 1.64.
See slide 16
And here, the slope is 12.5. And then, we simply repeat the update rule with the new w-value.
See slide 17
So, the updated w will then be 1.52. And if we execute the update rule, then we move from the point in the graph where w is equal to 1.64, to the point where w is equal to 1.52.
See slide 18
And then, after many more iterations of this update rule, we step by step get closer to the minimum and then we stop at some point.
See slide 19
So, using the derivative to update our w has the advantage that if the slope is steep, so when we are far away from the minimum, then we are going to change w by some bigger amount. And if the slope is not so steep, so when we are somewhere close to the minimum, then we are going to change w only by some small amount.
So, this how we can find the minimum of this function. And again, the solution that we get for w is only an approximation of the minimum.
And this process of starting at a random point (because we could have started at any point in the graph) and then using the derivative to step-by-step get closer to the minimum is called gradient descent.
See slide 20
Namely, what we do is, we use the gradient (which is just another word for slope) to descent the function towards a minimum.
And this gradient descent algorithm, we also have to apply to our more complex function.
See slide 21
And that’s because, here too, we can’t determine the minimum of the function analytically. So, gradient descent is the second element that we need to determine the right parameters for the algorithm.
See slide 22
The only difference here is, that we don’t just have one weight (i.e. variable), but we have many weights. So, we have to update each individual weight based on the derivative of the MSE with respect to (w.r.t.) that particular weight.
See slide 23
And therefor, we have to make us of something called “partial derivatives”, instead of just using the regular derivative. And partial derivatives are not denoted with a normal “d”, but instead they are denoted with a stylized version of a d: “∂”.
See slide 24
So now, let’s look again at a simpler function than our MSE to understand the concept of partial derivatives.
See slide 25
So here, y is a function of w1 and w2. And because there are these two independent variables the graph now is a 3D graph.
See slide 26
And because of those two variables, we also have to take 2 partial derivatives.
See slide 27
The partial derivative of y w.r.t. w1 which tells us how y changes if we slightly increase w1. And the partial derivative of y w.r.t. w2 which tells us how y changes if we slightly increase w2. And now, let’s see how we actually determine the equations for those partial derivatives. And actually, this is pretty easy to do because for the partial derivative of y w.r.t. w1 you simply treat w2 as a constant. And for the partial derivative of y w.r.t. w2, you simply treat w1 as a constant. So, the equations look like this:
See slide 28
Okay so now, to understand what those functions tell us, let’s have a look at an example point. So, let’s say w1 is -8 and w2 is 7.
See slide 29
If we plug in those numbers into the equations, then we get the following values:
See slide 30
And, similar to the regular derivative, those partial derivatives now tell us something about the slope of the function at that particular point. To understand exactly what, let’s look at the interactive version of the plot.
See interactive plot 1
In contrast to the regular derivative, you can’t really ask: What is the slope of the surface at this point? And that’s because, now you can move in 360°. So, in one direction, the function might increase by a certain amount. But in another direction, it might increase by another amount.
So, you can only ask: What is the slope at this point in a certain direction? And this is what the partial derivatives tell us.
The partial derivative of y w.r.t. w1 tells us the slope in the direction w1. So, it tells us how our y changes if we move along the w1-axis and slightly increase w1. And as you can see by rotating the plot, the slope in that direction is negative and pretty steep which matches the value of -32.
And the partial derivative of y w.r.t. w2 tells us the slope in the direction of w2. So, it tells us how our y changes if we move along the w2-axis and slightly increase w2. And as you can see by rotating the plot, the slope in that direction is positive and not so steep as was the case when we moved in the direction of w1. And this matches the value of 14.
Okay, so now that we know what partial derivatives are and how we can interpret them, let’s use the gradient descent algorithm to find the minimum of our function.
See slide 31
Here, we could actually very easily determine the minimum of the function analytically by setting the equations of the partial derivatives equal to zero and then solving them.
See slide 32
But we are now going to use gradient descent to determine the minimum.
See slide 33
So, in contrast to before, we are not just updating one variable with each gradient step, but we are going to update all of our variables simultaneously with each step. And if we do that a couple of times, then the steps, that we take to get to the minimum, look like this:
See interactive plot 2
Side note: In the legend, you can click on the individual elements to activate or deactivate the different paths. So, for now, just activate the path “Gradient Descent”.
And here too, as you can see, in the beginning we take big steps and then, as we get closer and closer to the minimum, the steps get smaller and smaller. And the final point is again just an approximation of the minimum. So, w1 and w2 are not exactly zero, as we saw with the analytical approach. They are only very close to zero.
And the path, that you can see in the plot, is actually the fastest path to the minimum. You could imagine for example that if you hold a ball at our original point and then drop it, then it would approximately roll down this path. And the curve in the path happens because, again, the surface is much steeper in the w1-direction than it is in the w2-direction.
And now, to get an intuition why this is actually might be the fastest path, let’s see what happens to our initial point when we don’t update both our variables simultaneously but just one at a time.
See interactive plot 2
Side note: Now, also activate the last two paths in the interactive plot.
If you look at the red path, then you can see what the new position of the point would be if we only update w1 meaning we are only moving along w1-axis. And if you look at the purple path, then you can see what the new position of the point would be if we only update w2 meaning we are only moving along w2-axis.
And, as you can see, both points are actually at a higher point on the surface than the point where we updated both variables simultaneously (first gradient descent step). And intuitively this makes sense, because if we can decrease y somewhat if we move along the w1-axis and if we can decrease y somewhat if we move along w2-axis, then if we do both of those movements, y should decrease even more. And that’s what we are doing with our first gradient descent step. We are just adding the two vectors together.
See interactive plot 2
Side note: Now, also activate the “one Gradient Descent Step” path and deactivate the last two paths again in the interactive plot.
So, the direction of the first gradient descent step is the fastest way to reduce y given that we are at our initial starting point.
Okay, so that’s how gradient descent works with partial derivatives. So, let’s now go back to our overview graphic:
See slide 34
So, what we are now going to have to do is, we have to randomly assign values to our weights. And then, to execute one gradient step, we simultaneously update all those weights. And if we do many of those steps, then we can determine the minimum of the MSE function which was our goal to begin with.
And one thing that will make all this easier to do, is the fact that all our weights are actually stored in matrices. So, we don’t have to update each weight individually (and therefore determine all the 14 respective partial derivatives). But instead, we only have to update our weight matrices.
See slide 35
So, we only have to determine the partial derivative of MSE w.r.t W1 and w.r.t. W2.
See slide 36
And the good thing about that is that no matter how big those matrices are, we still only have to determine those two partial derivatives. So, let’s say for example that there would be 4 nodes in the hidden layer instead of just 2.
See slide 37
In this case, there would be 14 additional weights. 8 going from the inputs to the 2 additional nodes in the hidden layer. And 6 going from those nodes to the nodes in the output layer. So, in total, we now would have to determine 28 partial derivatives. But since we have stored everything in matrices, we still actually only need to determine 2 partial derivatives. So again, linear algebra makes our lives much, much easier.
Okay, so now let’s apply the gradient descent algorithm to our cost function.
See slide 38