This post is part of a series:
In the previous post, we left off with the observation that if we want to apply the gradient descent algorithm to our cost function, then we have to determine the partial derivatives of the MSE with respect to (w.r.t.) W1 and w.r.t. W2.
See slide 1
And this is what we are going to do now.
But, as you can see, our formula for the MSE is quite complicated. So, determining the formulas for the two partial derivatives in a single step would consequently also be quite complicated.
But, fortunately, we don’t have to do that in one step. And that’s because, as we have seen in an earlier post, our cost function is composed of many intermediate functions where we put the result of one step into the next step. And this allows us to use the chain rule to determine the partial derivatives.
So, once again, let’s look at a simpler function to understand that concept.
See slide 2
So, z is a function of y. And y, in turn, is a function of x. And now, we want to determine the derivative of z w.r.t. x.
See slide 3
To do that, according to the chain rule, we have to multiply the derivative of z w.r.t. y with the derivative of y w.r.t. x.
See slide 4
So basically, what you do with the chain rule is to simply multiply together the derivatives of the two functions.
And intuitively, this makes sense because if you slightly increase x, then this will have an effect on y. And this effect on y, in turn, will have an effect on z. So, if you want to know how z changes if you slightly increase x, then it makes sense to multiply these two effects together. The effect that a slight increase in y has on z is, so to say, weighted with the effect that a slight increase in x has on y.
Side note: Here, we use the regular derivative again because each function depends on only one variable.
So that’s how the chain rule generally works. And here, it doesn’t really matter how many functions there are to begin with; you can expand this formula arbitrarily. So, let’s say, for example, that x is additionally a function of v.
See slide 5
Then, you would determine the derivative of z w.r.t. v as follows:
See slide 6
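To make the extended chain rule concrete, here is a small sketch in Python. The three functions below are illustrative choices of mine (they are not the ones on the slides); the point is only that multiplying the individual derivatives together matches a direct numerical estimate of dz/dv.

```python
# Three illustrative nested functions: z(y), y(x), x(v).
def x_of_v(v): return v + 1.0
def y_of_x(x): return 2.0 * x
def z_of_y(y): return y ** 2

# Hand-written derivatives of each intermediate function.
def dx_dv(v): return 1.0
def dy_dx(x): return 2.0
def dz_dy(y): return 2.0 * y

def dz_dv(v):
    """Chain rule: dz/dv = dz/dy * dy/dx * dx/dv."""
    x = x_of_v(v)
    y = y_of_x(x)
    return dz_dy(y) * dy_dx(x) * dx_dv(v)

# Numerical sanity check with a central finite difference.
v = 3.0
h = 1e-6
numeric = (z_of_y(y_of_x(x_of_v(v + h))) - z_of_y(y_of_x(x_of_v(v - h)))) / (2 * h)
print(dz_dv(v), numeric)  # the two values should agree closely
```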
Okay, so now that we know what the chain rule is and how it works, let’s look at a concrete example. But for simplicity, let’s use just the two functions again.
See slide 7
First, let’s write down the equation for the derivatives of these two functions.
See slide 8
And now, let’s say x is equal to 2.
See slide 9
In this case, y would be 4 and z would be 16.
See slide 10
And those two steps, so to say, resemble the feedforward through the neural net. And now, we want to determine the derivative of z w.r.t. x when x is equal to 2.
See slide 11
To do that, we, so to say, move backwards through the equations. So, those steps then resemble the backpropagation through the neural net.
So first, we need to determine the derivative of z w.r.t. y evaluated at the point when y is equal to 4 (because if x is 2, then y is going to be 4).
See slide 12
This is 8.
See slide 13
Then, we need to determine the derivative of y w.r.t. x evaluated at x being equal to 2.
See slide 14
And here, it wouldn’t actually matter what value x is because the derivative is always 3.
See slide 15
And now, we simply multiply 8 with 3 which gives us a derivative of 24.
See slide 16
So, if we increase x by a tiny amount, then z will increase by 24 times that amount.
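The worked example can also be checked in code. The exact formulas are on the slides, but the numbers in the text (y = 4 at x = 2, dy/dx always 3, z = 16 at y = 4, dz/dy = 8 at y = 4) are consistent with y = 3x − 2 and z = y², so treat the definitions below as a reconstruction under that assumption.

```python
# Assumed reconstruction of the slide formulas: y = 3x - 2 and z = y^2.
def y_of_x(x): return 3.0 * x - 2.0   # y(2) = 4
def z_of_y(y): return y ** 2          # z(4) = 16

def dz_dy(y): return 2.0 * y          # 8 at y = 4
def dy_dx(x): return 3.0              # always 3, regardless of x

# "Feedforward": compute the intermediate values at x = 2.
x = 2.0
y = y_of_x(x)

# "Backpropagation": move backwards, multiplying the derivatives.
dz_dx = dz_dy(y) * dy_dx(x)           # 8 * 3 = 24
print(y, z_of_y(y), dz_dx)
```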
And as a side note, just to show that the chain rule really works: if you substitute the formula for y into z, then you get this equation:
See slide 17
And if you multiply it out, you get this equation:
See slide 18
The equation for the derivative of this function looks like this:
See slide 19
And if you evaluate that at the point where x is equal to 2, then you again get the value of 24.
See slide 20
So, as you can see, the chain rule really works.
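The substitution check can be sketched in code as well. Assuming, consistent with the worked numbers, y = 3x − 2 and z = y², the composed function expands to z = (3x − 2)² = 9x² − 12x + 4, whose derivative is 18x − 12.

```python
# Expanded form of z(y(x)) under the assumed formulas y = 3x - 2, z = y^2:
# z = (3x - 2)^2 = 9x^2 - 12x + 4, so dz/dx = 18x - 12.
def z_of_x(x): return 9.0 * x**2 - 12.0 * x + 4.0
def dz_dx(x): return 18.0 * x - 12.0

# At x = 2 the expanded derivative gives 24 again,
# matching the chain-rule result 8 * 3.
print(z_of_x(2.0), dz_dx(2.0))
```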
And now, let’s go back to our overview graphic:
See slide 21
Here, we can now also use the chain rule to easily determine the formulas for our two partial derivatives. So, we actually don’t need the complicated formula anymore.
See slide 22
So, let’s start with the partial derivative of the MSE w.r.t. W2. To determine it, we first have to determine the partial derivative of the MSE w.r.t. Oout.
See slide 23
Then, we multiply that with the partial derivative of Oout w.r.t. Oin.
See slide 24
And then finally, we multiply that with the partial derivative of Oin w.r.t. W2.
See slide 25
And just to be clear, here we use partial derivatives because we are determining the derivative w.r.t. a matrix which contains many variables and not just one.
And in the same way, we can determine the partial derivative of MSE w.r.t. W1.
See slide 26
And here, the first two expressions are actually the same as in the partial derivative of MSE w.r.t. W2. But then, to get to W1, we have to keep going through the equations. And with that, we now know how we can determine those two partial derivatives.
So, the third and final element we need to answer the question of how we can find the right parameters for the algorithm is the chain rule.
See slide 27
And those three elements, taken together, are what make up the backpropagation algorithm.
See slide 28
So now, we could basically start to see what the actual equations for the partial derivatives look like. But before we do that, we have to make one more adjustment.
Namely, we have to modify our activation function, which is a step function.
See slide 29
Due to the step, the function is actually not differentiable at x=2. And what this means is that we can’t calculate the derivative of the function at that point.
So, in our formulas for updating the weight matrices, we can’t calculate the partial derivative of Oout w.r.t. Oin and the partial derivative of Hout w.r.t. Hin.
See slide 30
And therefore, we can’t calculate how we should update our weight matrices.
But even if the step function were differentiable despite the step, we still couldn’t use it. And that’s because, as you can see in the graph, its slope is zero everywhere else. So, in our formulas for updating the weight matrices, at least one expression would always be equal to zero. And since we multiply all the different partial derivatives together, our updates for the weight matrices would also always be zero. So, we wouldn’t actually update them.
So, to solve this problem, we now need a different activation function. That new function should look similar, but it shouldn’t have a step (so that it is actually differentiable). And its slope should be non-zero.
And this function is going to be the sigmoid function.
See slide 31
And for now, the actual formula of this function is not really of interest to us, but we are more interested in the shape of this function.
And as you can see, it looks very similar to the step function. It is only shifted slightly to the left. But that’s only because we arbitrarily set the threshold level of the step function to 2. If we set it to 0, then the two functions look basically the same.
See slide 32
For large negative numbers the sigmoid function also approaches 0. For large positive numbers it approaches 1. And in between, there is a smooth transition, which means that it is actually differentiable. And, moving from left to right, the slope of this function goes from being almost 0, up to 0.25 at x = 0 (which is its maximum slope), and then back to being almost 0.
So, let’s go back to our overview graphic:
See slide 33
And now, we can replace the step function with the sigmoid function.
See slide 34
So, the neural net is not going to output a 0 or 1, but instead it is now going to output any value between 0 and 1 (0 and 1 excluded).
See slide 35
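To tie the pieces together, here is a deliberately tiny, hypothetical net: one input, one hidden neuron, one output neuron, so W1 and W2 are single numbers rather than matrices. The variable names mirror the post (Hin, Hout, Oin, Oout), and the cost for a single example is taken as (Oout − target)², which is one common convention. The chain-rule gradients are then checked against finite differences.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Feedforward through the hypothetical one-neuron-per-layer net.
def forward(x, W1, W2):
    Hin = W1 * x
    Hout = sigmoid(Hin)
    Oin = W2 * Hout
    Oout = sigmoid(Oin)
    return Hin, Hout, Oin, Oout

x, target = 1.0, 1.0
W1, W2 = 0.5, -0.3
Hin, Hout, Oin, Oout = forward(x, W1, W2)

# Backpropagation = chain rule, factor by factor.
dMSE_dOout = 2.0 * (Oout - target)   # derivative of (Oout - target)^2
dOout_dOin = Oout * (1.0 - Oout)     # sigmoid slope at Oin
dOin_dW2   = Hout
dMSE_dW2 = dMSE_dOout * dOout_dOin * dOin_dW2

# The gradient w.r.t. W1 reuses the first two factors and keeps going
# backwards through the equations.
dOin_dHout = W2
dHout_dHin = Hout * (1.0 - Hout)
dHin_dW1   = x
dMSE_dW1 = dMSE_dOout * dOout_dOin * dOin_dHout * dHout_dHin * dHin_dW1

# Finite-difference check that the chain rule got it right.
def cost(W1, W2):
    return (forward(x, W1, W2)[3] - target) ** 2

h = 1e-6
numeric_W2 = (cost(W1, W2 + h) - cost(W1, W2 - h)) / (2 * h)
numeric_W1 = (cost(W1 + h, W2) - cost(W1 - h, W2)) / (2 * h)
print(dMSE_dW2, numeric_W2)
print(dMSE_dW1, numeric_W1)
```

Note that because the output sigmoid's slope Oout·(1 − Oout) is never zero, the weight updates are never forced to zero the way they would be with the step function.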