Basics of Deep Learning p.8 | Backpropagation explained: Chain Rule and Activation Function
1/15/2020
This post is part of a series.
In the previous post, we left off with the observation that if we want to apply the gradient descent algorithm to our cost function, then we have to determine the partial derivatives of the MSE with respect to (w.r.t.) W_{1} and w.r.t. W_{2}.
See slide 1
And this is what we are going to do now.

Chain Rule
But, as you can see, our formula for the MSE is quite complicated. So, determining the formulas for the two partial derivatives is consequently also going to be quite complicated to do in just one step.
But fortunately, we don’t have to do it in one step. That’s because, as we saw in an earlier post, our cost function is composed of many intermediate functions, where we feed the result of one step into the next. And this allows us to use the chain rule to determine the partial derivatives. So, once again, let’s look at a simpler function to understand the concept.

See slide 2

So, z is a function of y. And y, in turn, is a function of x. Now, we want to determine the derivative of z w.r.t. x.

See slide 3

To do that, according to the chain rule, we multiply the derivative of z w.r.t. y with the derivative of y w.r.t. x.

See slide 4

So basically, what you do with the chain rule is simply multiply together the derivatives of the two functions. Intuitively, this makes sense: if you slightly increase x, this has an effect on y. And this effect on y, in turn, has an effect on z. So, if you want to know how z changes when you slightly increase x, it makes sense to multiply these two effects together. The effect that a slight increase in y has on z is, so to say, weighted by the effect that a slight increase in x has on y.

Side note: Here, we use the regular derivative again because each function depends on only one variable.

So that’s how the chain rule generally works. And it doesn’t really matter how many functions there are to begin with; you can expand this formula arbitrarily. Let’s say, for example, that x is additionally a function of v.

See slide 5

Then, you would determine the derivative of z w.r.t. v as follows:

See slide 6

Okay, so now that we know what the chain rule is and how it works, let’s look at a concrete example. For simplicity, let’s use just the two functions again.

See slide 7

First, let’s write down the equations for the derivatives of these two functions.

See slide 8

And now, let’s say x is equal to 2.
See slide 9

In this case, y would be 4 and z would be 16.

See slide 10

And those two steps, so to say, resemble the feedforward through the neural net. Now, we want to determine the derivative of z w.r.t. x when x is equal to 2.

See slide 11

To do that, we, so to say, move backwards through the equations. So, those steps resemble the backpropagation through the neural net. First, we need to determine the derivative of z w.r.t. y, evaluated at the point where y is equal to 4 (because if x is 2, then y is going to be 4).

See slide 12

This is 8.

See slide 13

Then, we need to determine the derivative of y w.r.t. x, evaluated at x being equal to 2.

See slide 14

And here, it wouldn’t actually matter what value x has because the derivative is always 3.

See slide 15

And now, we simply multiply 8 by 3, which gives us a derivative of 24.

See slide 16

So, if we increase x by a tiny amount, then z will increase by 24 times that amount. As a side note, just to show that the chain rule really works: if you put the formula for y into z, then you get this equation:

See slide 17

And if you multiply it out, you get this equation:

See slide 18

The equation for the derivative of this function looks like this:

See slide 19

And if you evaluate that at the point where x is equal to 2, then you again get the value of 24.

See slide 20

So, as you can see, the chain rule really works. And now, let’s go back to our overview graphic.

See slide 21

Here, we can now also use the chain rule to easily determine the formulas for our two partial derivatives. So, we actually don’t need the complicated formula anymore.

See slide 22
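The walkthrough above can be checked numerically. The slides themselves are not shown here, so this sketch assumes the two functions are y = 3x - 2 and z = y^2, which reproduce every number in the text (y = 4 and z = 16 at x = 2, dz/dy = 8, dy/dx = 3, and dz/dx = 24):

```python
# Assumed functions matching the slide example: y = 3x - 2 and z = y^2
# (chosen so that x=2 gives y=4 and z=16, with dy/dx=3 and dz/dy=8).

def f(x):          # inner function: y = 3x - 2
    return 3 * x - 2

def g(y):          # outer function: z = y^2
    return y ** 2

x = 2.0
y = f(x)           # "feedforward" step 1: y = 4
z = g(y)           # "feedforward" step 2: z = 16

# "Backpropagation": evaluate each local derivative, then multiply.
dz_dy = 2 * y      # derivative of z = y^2, evaluated at y = 4 -> 8
dy_dx = 3          # derivative of y = 3x - 2 -> always 3
dz_dx = dz_dy * dy_dx  # chain rule: 8 * 3 = 24

# Sanity check: a numerical derivative of the composed function g(f(x))
h = 1e-6
numeric = (g(f(x + h)) - g(f(x - h))) / (2 * h)
print(z, dz_dx, round(numeric, 3))  # 16.0 24.0 24.0
```

Multiplying out the composition gives z = 9x^2 - 12x + 4 with derivative 18x - 12, which at x = 2 is again 24, matching both the chain-rule product and the numerical check.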
So, let’s start with the partial derivative of MSE w.r.t. W_{2}. To do that, we first have to determine the partial derivative of MSE w.r.t. O_{out}.
See slide 23
Then, we multiply that with the partial derivative of O_{out} w.r.t. O_{in}.
See slide 24
And then finally, we multiply that with the partial derivative of O_{in} w.r.t. W_{2}.
See slide 25
And just to be clear, here we use partial derivatives because we are determining the derivative w.r.t. a matrix which contains many variables, not just one.
And in the same way, we can determine the partial derivative of MSE w.r.t. W_{1}.
See slide 26
And here, the first two expressions are actually the same as in the partial derivative of MSE w.r.t. W_{2}. But then, to get to W_{1}, we have to keep going through the equations. And with that, we now know how we can determine those two partial derivatives.
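The factor structure of those two chain-rule products can be illustrated with a minimal sketch. Everything here is a simplifying assumption: the weights are scalars instead of matrices, there is one input, one hidden unit, and one output, and a sigmoid stands in for the (differentiable) activation, since the actual equations are only derived in the next post:

```python
import math

# Hypothetical scalar version of the network, so the chain of functions is:
#   H_in = w1 * x,  H_out = act(H_in),  O_in = w2 * H_out,  O_out = act(O_in)
#   MSE = (y - O_out)^2
# 'act' is any differentiable activation; a sigmoid is used as a stand-in.

def act(z):
    return 1.0 / (1.0 + math.exp(-z))

def act_prime(z):
    s = act(z)
    return s * (1.0 - s)

x, y_true = 1.5, 1.0
w1, w2 = 0.4, -0.6

# Feedforward
H_in = w1 * x
H_out = act(H_in)
O_in = w2 * H_out
O_out = act(O_in)

# Backpropagation: multiply the local derivatives along the chain.
dMSE_dOout = 2.0 * (O_out - y_true)   # d/dO_out of (y - O_out)^2
dOout_dOin = act_prime(O_in)          # derivative of the activation
dOin_dw2 = H_out                      # d/dw2 of w2 * H_out

dMSE_dw2 = dMSE_dOout * dOout_dOin * dOin_dw2

# For w1, the first two factors are reused; then we keep going down the chain.
dOin_dHout = w2
dHout_dHin = act_prime(H_in)
dHin_dw1 = x

dMSE_dw1 = dMSE_dOout * dOout_dOin * dOin_dHout * dHout_dHin * dHin_dw1

# Check both gradients against numerical finite differences.
def mse(w1_, w2_):
    o = act(w2_ * act(w1_ * x))
    return (y_true - o) ** 2

h = 1e-6
num_w2 = (mse(w1, w2 + h) - mse(w1, w2 - h)) / (2 * h)
num_w1 = (mse(w1 + h, w2) - mse(w1 - h, w2)) / (2 * h)
```

Note how `dMSE_dw1` reuses the first two factors of `dMSE_dw2` and then continues multiplying further down the chain, exactly as described above.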
So, the third and final element we need to answer the question of how we can find the right parameters for the algorithm is the chain rule.
See slide 27

And those three elements taken together are what make up the backpropagation algorithm.

See slide 28

So now, we could basically start to see what the actual equations for the partial derivatives look like. But before we do that, we have to make one more adjustment.

Activation Function
Namely, we have to modify our activation function, which is a step function.
See slide 29

Due to the step, the function is actually not differentiable at x=2 (our threshold). And this means that we can’t calculate the derivative of this function there.
So, in our formulas for updating the weight matrices, we can’t calculate the partial derivative of O_{out} w.r.t. O_{in} and the partial derivative of H_{out} w.r.t. H_{in}.
See slide 30
And therefore, we can’t calculate how we should update our weight matrices. But even if the step function were differentiable despite the step, we still couldn’t use it. That’s because, as you can see in the graph, its derivative or slope is always zero. So, in our formulas for updating the weight matrices, at least one expression would be equal to zero. And since we multiply all the different partial derivatives together, our updates for the weight matrices would also always be zero. So, we wouldn’t actually update them.

To solve this problem, we need a different activation function. The new function should look similar, but it shouldn’t have a step (so that it is actually differentiable), and its slope should be nonzero. This function is going to be the sigmoid function.

See slide 31

For now, the actual formula of this function is not really of interest to us; we are more interested in its shape. And as you can see, it looks very similar to the step function. It is only shifted slightly to the left. But that’s only because we arbitrarily set the threshold level to 2. If we set it to 0, then the two functions look basically the same.

See slide 32

For large negative numbers, the sigmoid function also approaches 0. For large positive numbers, it approaches 1. And in between, there is a smooth transition, which means that it is actually differentiable. Moving from left to right, the slope of this function goes from being almost 0, up to 0.25 at x equal to zero (which is its maximum slope), and then back to being almost 0.

So, let’s go back to our overview graphic:

See slide 33

And now, we can replace the step function with the sigmoid function.

See slide 34

So, the neural net is not going to output a 0 or 1; instead, it is now going to output any value between 0 and 1 (0 and 1 excluded).

See slide 35
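The properties of the sigmoid described above are easy to verify directly. This small sketch evaluates the sigmoid and its slope at a few points, confirming the limits near 0 and 1 and the maximum slope of 0.25 at x = 0:

```python
import math

# The sigmoid function and its slope, illustrating the properties above:
# outputs strictly between 0 and 1, and a maximum slope of 0.25 at x = 0.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_slope(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # the derivative of the sigmoid

print(sigmoid(-10))       # ~0.0000454 (approaches 0 for large negative x)
print(sigmoid(0))         # 0.5
print(sigmoid(10))        # ~0.9999546 (approaches 1 for large positive x)
print(sigmoid_slope(0))   # 0.25, the maximum slope
print(sigmoid_slope(10))  # ~0.0000454, almost zero again
```

Note that the slope is nonzero everywhere, which is exactly what the weight updates need, even though it becomes very small far from zero.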
And with that, we are finally ready to check what the actual equations for the partial derivatives of MSE w.r.t. W_{1} and w.r.t. W_{2} look like. And this will be the topic of the next post.
