Basics of Deep Learning p.7 - Backpropagation explained: Gradient Descent and Partial Derivatives

1/15/2020

This post is part of a series:

Part 1: Introduction
Part 2: Feedforward Algorithm explained
Part 3: Implementing the Feedforward Algorithm in pure Python
Part 4: Implementing the Feedforward Algorithm in pure Python cont'd
Part 5: Implementing the Feedforward Algorithm with NumPy
Part 6: Backpropagation explained - Cost Function and Derivatives
Part 7: Backpropagation explained - Gradient Descent and Partial Derivatives
Part 8: Backpropagation explained - Chain Rule and Activation Function
Part 9: Backpropagation explained Step by Step
Part 10: Backpropagation explained Step by Step cont'd
Part 11: Backpropagation explained Step by Step cont'd
Part 12: Backpropagation explained Step by Step cont'd
Part 13: Implementing the Backpropagation Algorithm with NumPy
Part 14: How to train a Neural Net

Here are the corresponding interactive plots for this post:

Here are the corresponding slides for this post:

basics_of_dl_7.pdf
File Size:	439 kb
File Type:	pdf

Download File

In the previous post, we left off with the observation that for the following function, we can’t apply an analytical approach in order to determine the minimum of the function.

See slide 1

So, the question then was: How do we now find the minimum? And that’s what we are going to answer now.

Numerical Approach for finding Extrema – Brute-force

Namely, instead of using an analytical approach, we are going to use a numerical approach. So, instead of manipulating the equations themselves, we basically just test out different values for our variable and see what the function outputs.

               See slide 2

And the most basic numerical approach would probably be a brute-force approach where we simply calculate the y for a huge number of different w (which is basically how I created the graph).

               See slide 3

And then, we check which of those values results in the lowest y.

               See slide 4

This way, however, we can only determine an approximation for the minimum of the function. And that’s because if we would have used, for example, 3 digits after the decimal point for our w, then we could have found a w that results in an even lower y.

               See slide 5

And if we would have used 4 digits, we would get an even lower y, and so on.

So, a general characteristic of numerical approaches is that the solution will always be just an approximation and not an exact value like we got with the analytical approach.

               See slide 6

Okay, so having said that, you can probably imagine that this brute-force approach is not very practical for a more complex function like our mean squared error (MSE) which has way more variables.

               See slide 7

Because if you would for example just use 100 different values for each of the 14 weights in our neural net, then the number of calculations that you would have to do is 100¹⁴. So, it would simply take too long to find that one combination of weights that will result in the lowest MSE. And that’s despite the fact that this is even just a very small neural net. So, for larger ones with many more weights, this brute-force approach is even less practical.

Numerical Approach for finding Extrema – Gradient Descent

So, clearly, we need a different numerical approach for finding the minimum of the MSE. So, what could such an approach look like?

Well, what we could do is, we could simply pick a random value for our w, for example 1.9.

               See slide 8

And then, we somehow adjust this w. So, we either increase or decrease it in order to reduce the function somewhat. And for that adjustment of the w, we can take advantage of the fact that the derivative represents the slope of a function and tells us how our y changes if we increase the w by a tiny amount.

               See slide 9

So, at this point here, if we increase w by a little amount, then y will increase by 25.66 times that little amount. And this is exactly the opposite of what we want to do. So, if we instead decrease our w, then y should decrease, and we should get closer to the minimum.

But, by how much should we decrease the w?

Well, let’s just take the value of the derivative at this point (so the derivative of y evaluated at w).

               See slide 10

So, let’s calculate the new w.

               See slide 11

And here, as you can see in this example, if we use the update rule as it is and reduce our w by 25.66, then we actually overshoot the minimum (which is roughly at w=0.75 as you can see in the graph). So, we are somewhere way to the left of the diagram. And our y accordingly would be extremely high.

               See slide 12

So, we are actually further away from the minimum than before. And even worse, at this new w, the magnitude of the new slope is even bigger. And the update of w would look like this:

               See slide 13

So here, we would be moving in the right direction towards the minimum because we subtract the negative derivative from our w. So, we would increase our w. But the update is way too big and we overshoot the minimum even more. And this just keeps escalating and we would be getting further and further away from the minimum.

So, after all, we can’t just take the slope or derivative as it is, but we have to scale it down somewhat.

               See slide 14

So, in our update equation, we simply multiply the derivative with some factor which we denote as alpha (α). And this factor is called the learning rate. And typical values for the learning rate are for example 0.1, 0.01, 0.001 and so on.

And finding the right learning rate is part of the training process which we will cover in a later post of this series in more detail. If it is too big, as we have already seen, we get further and further away from the minimum. If it is too small, then we just take tiny steps towards the minimum which only get smaller and smaller as we get closer to the minimum. So, reaching the minimum would just take too many iterations and too much time. Thus, one has to find the right balance. The learning rate shouldn’t be too big, so that we never reach the minimum. But it also shouldn’t be too small, so that it takes too long to reach the minimum.

So, in our case, let’s set it to 0.01.

               See slide 15

So, the new w will be 1.64. And now, if we execute the update rule,then we move in the graph from the point where w is equal to 1.9, to the point where w is equal to 1.64.

               See slide 16

And here, the slope is 12.5. And then, we simply repeat the update rule with the new w-value.

               See slide 17

So, the updated w will then be 1.52. And if we execute the update rule, then we move from the point in the graph where w is equal to 1.64, to the point where w is equal to 1.52.

               See slide 18

And then, after many more iterations of this update rule, we step by step get closer to the minimum and then we stop at some point.

               See slide 19

So, using the derivative to update our w has the advantage that if the slope is steep, so when we are far away from the minimum, then we are going to change w by some bigger amount. And if the slope is not so steep, so when we are somewhere close to the minimum, then we are going to change w only by some small amount.

So, this how we can find the minimum of this function. And again, the solution that we get for w is only an approximation of the minimum.

And this process of starting at a random point (because we could have started at any point in the graph) and then using the derivative to step-by-step get closer to the minimum is called gradient descent.

               See slide 20

Namely, what we do is, we use the gradient (which is just another word for slope) to descent the function towards a minimum.

Partial Derivatives

And this gradient descent algorithm, we also have to apply to our more complex function.

               See slide 21

And that’s because, here too, we can’t determine the minimum of the function analytically. So, gradient descent is the second element that we need to determine the right parameters for the algorithm.

               See slide 22

The only difference here is, that we don’t just have one weight (i.e. variable), but we have many weights. So, we have to update each individual weight based on the derivative of the MSE with respect to (w.r.t.) that particular weight.

               See slide 23

And therefor, we have to make us of something called “partial derivatives”, instead of just using the regular derivative. And partial derivatives are not denoted with a normal “d”, but instead they are denoted with a stylized version of a d: “∂”.

               See slide 24

So now, let’s look again at a simpler function than our MSE to understand the concept of partial derivatives.

               See slide 25

So here, y is a function of w₁ and w₂. And because there are these two independent variables the graph now is a 3D graph.

See slide 26

And because of those two variables, we also have to take 2 partial derivatives.

See slide 27

The partial derivative of y w.r.t. w₁ which tells us how y changes if we slightly increase w₁. And the partial derivative of y w.r.t. w₂ which tells us how y changes if we slightly increase w₂. And now, let’s see how we actually determine the equations for those partial derivatives. And actually, this is pretty easy to do because for the partial derivative of y w.r.t. w₁ you simply treat w₂ as a constant. And for the partial derivative of y w.r.t. w₂, you simply treat w₁ as a constant. So, the equations look like this:

See slide 28

Okay so now, to understand what those functions tell us, let’s have a look at an example point. So, let’s say w₁ is -8 and w₂ is 7.

See slide 29

If we plug in those numbers into the equations, then we get the following values:

See slide 30

And, similar to the regular derivative, those partial derivatives now tell us something about the slope of the function at that particular point. To understand exactly what, let’s look at the interactive version of the plot.

See interactive plot 1

In contrast to the regular derivative, you can’t really ask: What is the slope of the surface at this point? And that’s because, now you can move in 360°. So, in one direction, the function might increase by a certain amount. But in another direction, it might increase by another amount.

So, you can only ask: What is the slope at this point in a certain direction? And this is what the partial derivatives tell us.

The partial derivative of y w.r.t. w₁ tells us the slope in the direction w₁. So, it tells us how our y changes if we move along the w₁-axis and slightly increase w₁. And as you can see by rotating the plot, the slope in that direction is negative and pretty steep which matches the value of -32.

And the partial derivative of y w.r.t. w₂ tells us the slope in the direction of w₂. So, it tells us how our y changes if we move along the w₂-axis and slightly increase w₂. And as you can see by rotating the plot, the slope in that direction is positive and not so steep as was the case when we moved in the direction of w1. And this matches the value of 14.

Okay, so now that we know what partial derivatives are and how we can interpret them, let’s use the gradient descent algorithm to find the minimum of our function.

               See slide 31

Here, we could actually very easily determine the minimum of the function analytically by setting the equations of the partial derivatives equal to zero and then solving them.

               See slide 32

But we are now going to use gradient descent to determine the minimum.

               See slide 33

So, in contrast to before, we are not just updating one variable with each gradient step, but we are going to update all of our variables simultaneously with each step. And if we do that a couple of times, then the steps, that we take to get to the minimum, look like this:

               See interactive plot 2

Side note: In the legend, you can click on the individual elements to activate or deactivate the different paths. So, for now, just activate the path “Gradient Descent”.

And here too, as you can see, in the beginning we take big steps and then, as we get closer and closer to the minimum, the steps get smaller and smaller. And the final point is again just an approximation of the minimum. So, w1 and w2 are not exactly zero, as we saw with the analytical approach. They are only very close to zero.

And the path, that you can see in the plot, is actually the fastest path to the minimum. You could imagine for example that if you hold a ball at our original point and then drop it, then it would approximately roll down this path. And the curve in the path happens because, again, the surface is much steeper in the w1-direction than it is in the w2-direction.

And now, to get an intuition why this is actually might be the fastest path, let’s see what happens to our initial point when we don’t update both our variables simultaneously but just one at a time.

See interactive plot 2

Side note: Now, also activate the last two paths in the interactive plot.

If you look at the red path, then you can see what the new position of the point would be if we only update w₁ meaning we are only moving along w₁-axis. And if you look at the purple path, then you can see what the new position of the point would be if we only update w₂ meaning we are only moving along w₂-axis.

And, as you can see, both points are actually at a higher point on the surface than the point where we updated both variables simultaneously (first gradient descent step). And intuitively this makes sense, because if we can decrease y somewhat if we move along the w₁-axis and if we can decrease y somewhat if we move along w₂-axis, then if we do both of those movements, y should decrease even more. And that’s what we are doing with our first gradient descent step. We are just adding the two vectors together.

See interactive plot 2

Side note: Now, also activate the “one Gradient Descent Step” path and deactivate the last two paths again in the interactive plot.

So, the direction of the first gradient descent step is the fastest way to reduce y given that we are at our initial starting point.

Okay, so that’s how gradient descent works with partial derivatives. So, let’s now go back to our overview graphic:

See slide 34

So, what we are now going to have to do is, we have to randomly assign values to our weights. And then, to execute one gradient step, we simultaneously update all those weights. And if we do many of those steps, then we can determine the minimum of the MSE function which was our goal to begin with.

And one thing that will make all this easier to do, is the fact that all our weights are actually stored in matrices. So, we don’t have to update each weight individually (and therefore determine all the 14 respective partial derivatives). But instead, we only have to update our weight matrices.

See slide 35

So, we only have to determine the partial derivative of MSE w.r.t W₁ and w.r.t. W₂.

See slide 36

And the good thing about that is that no matter how big those matrices are, we still only have to determine those two partial derivatives. So, let’s say for example that there would be 4 nodes in the hidden layer instead of just 2.

See slide 37

In this case, there would be 14 additional weights. 8 going from the inputs to the 2 additional nodes in the hidden layer. And 6 going from those nodes to the nodes in the output layer. So, in total, we now would have to determine 28 partial derivatives. But since we have stored everything in matrices, we still actually only need to determine 2 partial derivatives. So again, linear algebra makes our lives much, much easier.

Okay, so now let’s apply the gradient descent algorithm to our cost function.

See slide 38

Therefor, we have to determine the formulas for the partial derivative of MSE w.r.t. W₁ and w.r.t. W₂. And how to do that, is the topic of the next post.

0 Comments