Basics of Deep Learning p.8 - Backpropagation explained: Chain Rule and Activation Function

1/15/2020

This post is part of a series:

Part 1: Introduction
Part 2: Feedforward Algorithm explained
Part 3: Implementing the Feedforward Algorithm in pure Python
Part 4: Implementing the Feedforward Algorithm in pure Python cont'd
Part 5: Implementing the Feedforward Algorithm with NumPy
Part 6: Backpropagation explained - Cost Function and Derivatives
Part 7: Backpropagation explained - Gradient Descent and Partial Derivatives
Part 8: Backpropagation explained - Chain Rule and Activation Function
Part 9: Backpropagation explained Step by Step
Part 10: Backpropagation explained Step by Step cont'd
Part 11: Backpropagation explained Step by Step cont'd
Part 12: Backpropagation explained Step by Step cont'd
Part 13: Implementing the Backpropagation Algorithm with NumPy
Part 14: How to train a Neural Net

Here are the corresponding slides for this post:

basics_of_dl_8.pdf

In the previous post, we left off with the observation that, if we want to apply the gradient descent algorithm to our cost function, we have to determine the partial derivatives of the MSE with respect to (w.r.t.) W₁ and w.r.t. W₂.

See slide 1

And this is what we are going to do now.

Chain Rule

But, as you can see, our formula for the MSE is quite complicated. So, determining the formulas for the two partial derivatives in just one step would be quite complicated as well.

But, fortunately, we don’t have to do that in one step. And that’s because, as we have seen in an earlier post, our cost function is composed of many intermediate functions where we put the result of one step into the next step. And this allows us to use the chain rule to determine the partial derivatives.

So, once again, let’s look at a simpler function to understand that concept.

               See slide 2

So, z is a function of y. And y, in turn, is a function of x. And now, we want to determine the derivative of z w.r.t. x.

               See slide 3

To do that, according to the chain rule, we have to multiply the derivative of z w.r.t. y by the derivative of y w.r.t. x.

               See slide 4

So basically, what you do with the chain rule is to simply multiply together the derivatives of the two functions.

And intuitively, this makes sense because if you slightly increase x, then this will have an effect on y. And this effect on y, in turn, will have an effect on z. So, if you want to know how z changes if you slightly increase x, then it makes sense to multiply these two effects together. The effect that a slight increase in y has on z is, so to say, weighted with the effect that a slight increase in x has on y.

Side note: Here, we use the regular derivative again because each of the functions depends on only one variable.

So that’s how the chain rule generally works. And it doesn’t really matter how many functions there are to begin with. You can extend this formula to any number of functions. So, let’s say, for example, that x is additionally a function of v.

               See slide 5

Then, you would determine the derivative of z w.r.t. v as follows:

               See slide 6
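Written out as formulas, the two cases described above are the standard chain rule for two nested functions and its extension to three. This is only a restatement of the text, using the same variable names z, y, x and v:

```latex
\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}
\qquad\text{and}\qquad
\frac{dz}{dv} = \frac{dz}{dy} \cdot \frac{dy}{dx} \cdot \frac{dx}{dv}
```

Each additional nested function simply adds one more factor to the product.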

Okay, so now that we know what the chain rule is and how it works, let’s look at a concrete example. But for simplicity, let’s use just the two functions again.

               See slide 7

First, let’s write down the equation for the derivatives of these two functions.

               See slide 8

And now, let’s say x is equal to 2.

               See slide 9

In this case, y would be 4 and z would be 16.

               See slide 10

And those two steps, so to say, resemble the feedforward through the neural net. And now, we want to determine the derivative of z w.r.t. x when x is equal to 2.

               See slide 11

To do that, we, so to say, move backwards through the equations. So, those steps then resemble the backpropagation through the neural net.

So first, we need to determine the derivative of z w.r.t. y evaluated at the point where y is equal to 4 (because if x is 2, then y is going to be 4).

               See slide 12

This is 8.

               See slide 13

Then, we need to determine the derivative of y w.r.t. x evaluated at x being equal to 2.

               See slide 14

And here, it wouldn’t actually matter what value x has because the derivative is always 3.

               See slide 15

And now, we simply multiply 8 by 3, which gives us a derivative of 24.

               See slide 16

So, if we increase x by a tiny amount, then z will increase by 24 times that amount.
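The forward and backward steps above can be sketched in a few lines of Python. The functions y = 3x - 2 and z = y² are an assumption here: they are chosen because they reproduce the numbers in the text (y = 4, z = 16, derivatives 8 and 3), not copied from the slides. The function names are made up for illustration.

```python
# Assumed functions (chosen to match the numbers in the text):
#   y = 3x - 2   and   z = y**2

def f_y(x):
    return 3 * x - 2

def f_z(y):
    return y ** 2

def dz_dy(y):
    # derivative of y**2 w.r.t. y
    return 2 * y

def dy_dx(x):
    # derivative of 3x - 2 w.r.t. x (constant, independent of x)
    return 3

# Forward pass (resembles the feedforward through the neural net)
x = 2
y = f_y(x)
z = f_z(y)

# Backward pass (resembles the backpropagation through the neural net):
# multiply the local derivatives, starting at the output
grad = dz_dy(y) * dy_dx(x)

# Sanity check: approximate dz/dx numerically with a central difference
eps = 1e-6
numeric = (f_z(f_y(x + eps)) - f_z(f_y(x - eps))) / (2 * eps)

print(y, z, grad)         # 4 16 24
print(round(numeric, 4))  # approximately 24
```

The numerical check doesn’t use the chain rule at all. It just nudges x a little and measures how much z changes, so it is an independent confirmation of the value 24.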

And as a side note, just to show that the chain rule really works, if you substitute the formula for y into z, then you get this equation:

               See slide 17

And if you multiply it out, you get this equation:

               See slide 18

The equation for the derivative of this function looks like this:

               See slide 19

And if you evaluate that at the point where x is equal to 2, then you again get the value of 24.

               See slide 20

So, as you can see, the chain rule really works.
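As a sketch of that check in formulas, assume again that the two functions are y = 3x - 2 and z = y². That is an assumption that fits the values 4, 16, 8 and 3 used above; the slides may write the functions differently.

```latex
z = (3x - 2)^2 = 9x^2 - 12x + 4
\qquad\Rightarrow\qquad
\frac{dz}{dx} = 18x - 12
\qquad\Rightarrow\qquad
\left.\frac{dz}{dx}\right|_{x=2} = 36 - 12 = 24
```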

And now, let’s go back to our overview graphic:

               See slide 21

Here, we can now also use the chain rule to easily determine the formulas for our two partial derivatives. So, we actually don’t need the complicated formula anymore.

               See slide 22

So, let’s start with the partial derivative of MSE w.r.t. W₂. To get it, we first have to determine the partial derivative of MSE w.r.t. O_out.

See slide 23

Then, we multiply that with the partial derivative of O_out w.r.t. O_in.

See slide 24

And then finally, we multiply that with the partial derivative of O_in w.r.t. W₂.

See slide 25

And just to be clear, here we use partial derivatives because we are determining the derivative w.r.t. a matrix which contains many variables and not just one.

And in the same way, we can determine the partial derivative of MSE w.r.t. W₁.

See slide 26

And here, the first two expressions are actually the same as in the partial derivative of MSE w.r.t. W₂. But then, to get to W₁, we have to keep going through the equations. And with that, we now know how we can determine those two partial derivatives.
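Written out schematically, the two chains look as follows. The chain for W₂ follows the three steps described above. The intermediate steps for W₁ (going through H_out and H_in) are an assumption based on the usual structure of a net with one hidden layer. The products are also only schematic: the order of the factors and the transposes needed to make the matrix shapes fit are ignored here.

```latex
\frac{\partial \mathrm{MSE}}{\partial W_2}
  = \frac{\partial \mathrm{MSE}}{\partial O_{out}}
    \cdot \frac{\partial O_{out}}{\partial O_{in}}
    \cdot \frac{\partial O_{in}}{\partial W_2}

\frac{\partial \mathrm{MSE}}{\partial W_1}
  = \frac{\partial \mathrm{MSE}}{\partial O_{out}}
    \cdot \frac{\partial O_{out}}{\partial O_{in}}
    \cdot \frac{\partial O_{in}}{\partial H_{out}}
    \cdot \frac{\partial H_{out}}{\partial H_{in}}
    \cdot \frac{\partial H_{in}}{\partial W_1}
```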

So, the third and final element we need to answer the question of how we can find the right parameters for the algorithm is the chain rule.

See slide 27

And those three elements taken together are what make up the backpropagation algorithm.

See slide 28

So now, we could basically start to see what the actual equations for the partial derivatives look like. But before we do that, we have to make one more adjustment.

Activation Function

Namely, we have to modify our activation function, which is a step function.

See slide 29

Due to the step, the function is actually not differentiable at x=2, which is where the step occurs. And what this means is that we can’t calculate the derivative of this function at that point.

So, in our formulas for updating the weight matrices, we can’t calculate the partial derivative of O_out w.r.t. O_in and the partial derivative of H_out w.r.t. H_in.

See slide 30

And therefore, we can’t calculate how we should update our weight matrices.

But even if the step function were differentiable despite the step, we still couldn’t use it. And that’s because, as you can see in the graph, its derivative or slope is zero everywhere apart from the step. So, in our formulas for updating the weight matrices, at least one expression would be equal to zero. And since we multiply all the different partial derivatives together, our updates for the weight matrices would also always be zero. So, we wouldn’t actually update them.
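Here is a minimal sketch of that problem in pure Python. The threshold of 2 comes from the text. Whether the step function returns 1 exactly at the threshold is an assumption, and the two "other factors" are hypothetical numbers standing in for the remaining partial derivatives in the chain.

```python
def step(x, threshold=2.0):
    """Step activation: 0 below the threshold, 1 at or above it (assumed)."""
    return 1.0 if x >= threshold else 0.0

def numeric_slope(f, x, eps=1e-6):
    """Approximate the slope of f at x with a central difference."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Slope of the step function at a few points away from the step itself
slopes = [numeric_slope(step, x) for x in (-3.0, 0.0, 1.5, 2.5, 5.0)]
print(slopes)  # [0.0, 0.0, 0.0, 0.0, 0.0]

# Hypothetical values for the other partial derivatives in the chain
other_factor_a = 0.7
other_factor_b = 1.3

# One zero factor makes the whole product zero, so the weights never change
update = other_factor_a * slopes[2] * other_factor_b
print(update)  # 0.0
```

No matter how large the other factors are, the product stays zero as long as the slope of the activation function is zero.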

So, to solve this problem we now need a different activation function. And that new function should look similar, but it shouldn’t have a step (so that it is actually differentiable). And its slope should be non-zero.

And this function is going to be the sigmoid function.

               See slide 31

And for now, the actual formula of this function is not really of interest to us, but we are more interested in the shape of this function.

And as you can see, it looks very similar to the step function. It is only shifted slightly to the left. But that’s only because we arbitrarily set the threshold level to 2. If we set it to 0, then the functions look basically the same.

               See slide 32

For large negative numbers the sigmoid function also approaches 0. For large positive numbers it approaches 1. And in between that, there is a smooth transition which means that it is actually differentiable. And, moving from left to right, the slope of this function goes from being almost 0 to 0.25 when x is equal to zero (which is its maximum slope) and then it goes back to being almost 0.
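To make these numbers concrete, here is a small sketch using the standard formula of the sigmoid function, 1 / (1 + e^(-x)), and the well-known identity that its slope is sigmoid(x) · (1 - sigmoid(x)). The post itself doesn’t introduce the formula yet, so treat this as a preview. The function names are chosen just for this example.

```python
import math

def sigmoid(x):
    """Standard sigmoid (logistic) function."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_slope(x):
    """Derivative of the sigmoid: s * (1 - s), where s = sigmoid(x)."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Value and slope of the sigmoid, moving from left to right
for x in (-10, -2, 0, 2, 10):
    print(x, round(sigmoid(x), 4), round(sigmoid_slope(x), 4))

print(sigmoid(0), sigmoid_slope(0))  # 0.5 0.25
```

At x = 0 the function value is 0.5 and the slope is 0.25. At x = -10 and x = 10 the slope is tiny but still not exactly zero, which is the property the step function was missing.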

So, let’s go back to our overview graphic:

               See slide 33

And now, we can replace the step function with the sigmoid function.

               See slide 34

So, the neural net is no longer going to output either a 0 or a 1. Instead, it is now going to output a value between 0 and 1 (0 and 1 excluded).

               See slide 35

And with that, we are finally ready to check what the actual equations for the partial derivatives of MSE w.r.t. W₁ and w.r.t. W₂ look like. And this will be the topic of the next post.
