In the previous post, we left off at the point where we wanted to determine the equation for the partial derivative of MSE with respect to (w.r.t.) W_{1}.
See slide 1
And that’s what we are going to do now.
Derivative of MSE w.r.t. Weight Matrix 1
To do that, let’s look again at the formula that we wrote down in a previous post.
See slide 2
And actually, there is not much new going on in this formula compared to the formula for the partial derivative of MSE w.r.t. W_{2}.
The first two expressions are, in fact, exactly the same. So, we can simply reuse the O_{delta} that we have already determined and keep going from this point on through the chain rule.
See slide 3
And the last two expressions are also not really new because they are technically the same as the last two expressions in the formula for the partial derivative of MSE w.r.t. W_{2}.
In the formula for W_{2}, we have the partial derivative of O_{out} w.r.t. O_{in}. This, so to say, represents the movement from the output layer outputs to the output layer inputs.
See slides 4-5
And in the formula for W_{1}, we have the partial derivative of H_{out} w.r.t. H_{in}. This, so to say, represents the movement from the hidden layer outputs to the hidden layer inputs.
See slides 6-7
So, those expressions, generally speaking, represent the movement backward through the nodes. And if we look again at the formulas of the feedforward algorithm, then we can see that those expressions are simply the derivative of the sigmoid function.
See slide 8
And we already know how to determine that. So, the equation for H_{delta} is going to look like this:
See slide 9
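Written out as a formula, and assuming that ⊙ denotes element-wise multiplication (the symbol is an assumption on my part) and that H_{error} is the error arriving at the hidden layer outputs, the equation reads roughly like this:

```latex
H_{delta} = H_{error} \odot H_{out} \odot (1 - H_{out})
```

The factor H_{out} ⊙ (1 − H_{out}) is the derivative of the sigmoid function, expressed in terms of its output.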
And now, let’s look at the last expressions of both formulas, respectively.
In the formula for W_{2}, we have the partial derivative of O_{in} w.r.t. W_{2}. This, so to say, represents the movement from the output layer inputs to the weights of weight matrix 2.
See slides 10-11
In the formula for W_{1}, we have the partial derivative of H_{in} w.r.t. W_{1}. This, so to say, represents the movement from the hidden layer inputs to the weights of weight matrix 1.
See slides 12-13
So, those expressions, generally speaking, represent the movement from the layer inputs to the weights. And if we look again at the formulas of the feedforward algorithm, then we can see that those expressions are simply the derivative of a matrix multiplication.
See slide 14
And we also already know how to determine that. Namely, it is simply the “opposite, corresponding element”. And since the opposite, corresponding element of W_{1} is X, the equation for W_{1-update} looks like this:
See slide 15
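In formula form, with N being the number of examples and X^{T} the transpose of X (needed so that the shapes of the matrices line up), this can be sketched as:

```latex
W_{1-update} = \frac{1}{N} \, X^{T} \cdot H_{delta}
```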
So, the only new expression, really, is the one in the middle, namely the partial derivative of O_{in} w.r.t. H_{out}. This represents the movement from the output layer inputs to the hidden layer outputs.
See slides 16-17
So, it represents the movement from one layer to the other. That is something that we haven’t seen yet. But even this expression is not entirely new because, looking at the equations for the feedforward algorithm, we can see that it is again just the derivative of a matrix multiplication.
And here, the opposite, corresponding element of H_{out} is W_{2}. So, in order to determine H_{error}, we somehow have to multiply O_{delta} with W_{2}.
See slide 18
The only question now is: How do we multiply these two matrices together? So, in which order should we multiply them? And should we use transposes of any of those matrices?
And to answer those questions, let’s look again at the scalar-notation of the equations for the feedforward algorithm.
See slide 19
Scalar-Notation revisited
So far, we have only looked at the upper part of the neural net because we were only concerned with updating the weights of W_{2}. But now, we have to consider the whole neural net because we want to determine the partial derivative of MSE w.r.t. W_{1}.
See slide 20
So, let’s now rewrite the remaining two equations of the feedforward algorithm in scalar-notation. For that, we need a little bit more space.
See slide 21
Okay so now, let’s start with determining the input value of the first node in the hidden layer for example 1.
See slide 22
To do that, we simply need to calculate the dot product of the x-values of example 1 and the respective weights that lead to this node.
See slide 23
In a similar way we can determine h_{in} _{2}^{(1)}.
See slide 24
And then, we do the same for the second example.
See slide 25
After that, we put all those h_{in} into the sigmoid function to get the respective h_{out}.
See slide 26
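To make the scalar-notation a bit more tangible, here is a minimal NumPy sketch that computes the h_{in} and h_{out} values one by one and compares them with the matrix version. The concrete numbers, the variable names and the absence of bias terms are assumptions for illustration. Only the sizes (4 input nodes, 2 hidden nodes, 2 examples) follow the neural net used in this post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data: 2 examples with 4 x-values each.
X = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8]])

# Hypothetical weights: W1[i, j] holds w_{1-(i+1)(j+1)},
# i.e. the weight from input node i+1 to hidden node j+1.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 2))

# Scalar-notation: each h_in is the dot product of the x-values of one
# example and the weights that lead to one hidden node.
H_in_scalar = np.zeros((2, 2))
for e in range(2):          # examples
    for j in range(2):      # hidden nodes
        H_in_scalar[e, j] = sum(X[e, i] * W1[i, j] for i in range(4))
H_out_scalar = sigmoid(H_in_scalar)

# Matrix-notation: the same calculation in one step.
H_in = X @ W1
H_out = sigmoid(H_in)

print(np.allclose(H_in_scalar, H_in))    # True
print(np.allclose(H_out_scalar, H_out))  # True
```

Both versions give the same numbers. The matrix version simply does all the dot products at once.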
So, nothing special going on really, just a lot of variables.
And now, if we want to know how to determine the partial derivative of MSE w.r.t. W_{1}, let’s see again how we can determine the partial derivative of MSE w.r.t. a particular weight in W_{1}. So, let’s take w_{1-11}.
See slide 27
In order to determine this partial derivative, we need to again use the multivariable chain rule.
See slide 28
So, let’s again visualize the dependencies of the scalar-notation functions with a tree.
See slide 29
And, as you can see, the tree is so big that I actually couldn’t depict all branches. So, I just drew all the branches for the first example. For the second example, however, they look basically the same, just with a superscript of 2 instead of 1. So, it is not really a problem.
Okay so now, if we want to determine the partial derivative of MSE w.r.t. w_{1-11}, then we need to consider these 3 paths:
See slide 30
And obviously, there are three additional paths for the second example, but let’s just consider the first example, for now. So, let’s write down the derivatives that we need to determine for the first path (the one on the left).
See slide 31
Again, nothing really new is going on. So, let’s also write down the expressions for the other two paths.
See slide 32
And again, obviously we would also have to add the other 3 paths from the second example.
See slide 33
But I don’t have any more space, so we are just going to consider example 1.
So now, we can write down the actual equation for the partial derivative of MSE w.r.t. w_{1-11}. And to do that, let’s colorize the individual expressions like we did in an earlier post when we transferred the scalar-notation back to matrix-notation in order to see how to determine the partial derivative of MSE w.r.t. W_{2}.
See slide 34
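As a rough sketch of what this equation contains for example 1, the three paths can be written as one sum. The index k over the three output nodes is just shorthand for the three paths and is not part of the notation used so far:

```latex
\sum_{k=1}^{3}
\frac{\partial MSE}{\partial o_{out\,k}^{(1)}} \cdot
\frac{\partial o_{out\,k}^{(1)}}{\partial o_{in\,k}^{(1)}} \cdot
\frac{\partial o_{in\,k}^{(1)}}{\partial h_{out\,1}^{(1)}} \cdot
\frac{\partial h_{out\,1}^{(1)}}{\partial h_{in\,1}^{(1)}} \cdot
\frac{\partial h_{in\,1}^{(1)}}{\partial w_{1-11}}
```

Each summand is one path through the tree, and the contribution of the second example has the same form with a superscript of 2.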
And there is also nothing really new going on. We basically already know how to determine all those expressions. The blue ones are the derivative of the cost function, the orange ones are the derivative of the sigmoid function and the green ones are the derivative of the dot product. So, let’s quickly go through the expressions of the first path.
The first two expressions, we have already written out before and they look like this:
See slide 35
Then, we need to multiply this with the partial derivative of o_{in} _{1}^{(1)} w.r.t. h_{out} _{1}^{(1)}. So, therefore we need to know what the opposite, corresponding element of h_{out} _{1}^{(1)} is.
See slide 36
And, as you can see, it is w_{2-11}.
See slide 37
So then, we need to multiply that with the derivative of h_{out} _{1}^{(1)} which is simply the derivative of the sigmoid function.
See slide 38
And finally, we need to multiply that with the opposite, corresponding element of w_{1-11}.
See slide 39
So, those are the equations for the first path. And in a similar way, we can determine the equations for the other two paths of the first example.
See slide 40
And this is now what the equations for the first example look like. And for the second example, they look basically the same, just with a superscript of 2. So, let’s just keep working with these equations.
See slide 41
And now, let’s rewrite them again in a more general way by using sigma notation. To do that, let’s first factor out the 1/N.
See slide 42
And then, we can rewrite the function with sigma notation where the sigma runs over all examples e.
See slide 43
And again, we can do that like this because the equations for the second example look exactly the same except that they have a superscript of 2.
So, this is now the final equation for how we determine the partial derivative of MSE w.r.t. w_{1-11}. But remember, our only goal is to understand how to determine H_{error}.
See slide 44
For that, we don’t need to consider the whole equation, only a specific part of it. So, let’s rewrite it a little bit so that it is then easier to see how we can translate this scalar-notation equation back to dealing with matrices.
First off, we can see that the last couple of expressions in each summand are the same.
See slide 45
So, we can factor them out.
See slide 46
In terms of the matrix-notation, these expressions refer to everything that comes after H_{error}. And we already know how to determine that. So, we are not interested in this part of the scalar-notation equation.
And then, let’s look at the first couple of expressions in each summand.
See slide 47
In terms of the matrix-notation, these expressions represent the first two steps, i.e. they represent O_{delta}. So, we can summarize them accordingly since our goal is to understand how we should multiply O_{delta} with W_{2}.
See slide 48
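Written out in the notation of this post (as a sketch, with the sigma running over all N examples e), the summarized equation should look something like this:

```latex
\frac{\partial MSE}{\partial w_{1-11}} =
\frac{1}{N} \sum_{e}
\left(
  o_{delta\,1}^{(e)} \cdot w_{2-11} +
  o_{delta\,2}^{(e)} \cdot w_{2-12} +
  o_{delta\,3}^{(e)} \cdot w_{2-13}
\right)
\cdot h_{out\,1}^{(e)} \left(1 - h_{out\,1}^{(e)}\right)
\cdot x_{1}^{(e)}
```

The part in the brackets is where the o_{delta} values meet the weights of W_{2}. Everything after the brackets is the part that was factored out.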
So, this is now our equation for determining the partial derivative of MSE w.r.t. w_{1-11}. And the only part that is really of interest to us is what is going on in the brackets. And that’s because we want to understand how we should multiply O_{delta} with W_{2}.
But to really understand that, having the equation of the partial derivative of MSE w.r.t. only one weight is not enough. So, let’s write down this equation for another weight. But which one should we choose?
Well, we are not going to pick one of the weights of W_{1} that also “goes” to node 1 in the hidden layer, i.e. w_{1-21}, w_{1-31} or w_{1-41}. And that’s because if we chose one of those, then the only thing that would change in the equation is the subscript of the x (since the path in the tree would be exactly the same except for the last step). And, as already said before, we are not interested in this part of the equation.
What we are interested in, is how to multiply o_{delta} with a weight of W_{2}. So, in the tree, we are interested in the fork below o_{in}. So, that’s where we need to take a different path. Therefore, we are going to pick one of the weights of W_{1} that “goes” to node 2 in the hidden layer, i.e. w_{1-12}, w_{1-22}, w_{1-32} or w_{1-42}. And we choose w_{1-12}.
See slide 49
And the equation then looks like this:
See slide 50
So, the o_{delta} are going to be the same. And that’s because the paths in the tree up until the respective o_{in} are the same for the partial derivatives of MSE w.r.t. w_{1-11} and w.r.t. w_{1-12}.
But after that, we take a different path than before. Namely, we go to the right instead of to the left. So, we need to determine the partial derivative of the particular o_{in} w.r.t. h_{out} _{2} (and not h_{out} _{1}). This is a derivative of the dot product. So, let’s look again at the feedforward equations to see what the opposite, corresponding element of h_{out} _{2} is.
See slide 51
So, as you can see, for o_{in} _{1} the opposite, corresponding element of h_{out} _{2} is w_{2-21}. So, accordingly, we multiply o_{delta} _{1} with w_{2-21}. And the opposite, corresponding elements of o_{delta} _{2} and o_{delta} _{3} are w_{2-22} and w_{2-23}, respectively.
And then, for the sake of completeness, let’s also look at the rest of the equation (even though we don’t need it in order to understand how to determine H_{error}). So, for the next step in the tree, we need to determine the derivative of the sigmoid function. Only this time, we use h_{out} _{2} and not h_{out} _{1}. And then, finally we multiply that with the opposite, corresponding element of w_{1-12} which is again x_{1}.
See slide 52
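In the same notation as before, and again only as a sketch, the equation for w_{1-12} should then read:

```latex
\frac{\partial MSE}{\partial w_{1-12}} =
\frac{1}{N} \sum_{e}
\left(
  o_{delta\,1}^{(e)} \cdot w_{2-21} +
  o_{delta\,2}^{(e)} \cdot w_{2-22} +
  o_{delta\,3}^{(e)} \cdot w_{2-23}
\right)
\cdot h_{out\,2}^{(e)} \left(1 - h_{out\,2}^{(e)}\right)
\cdot x_{1}^{(e)}
```

Compared to the equation for w_{1-11}, the o_{delta} values and x_{1} stay the same. Only the weights of W_{2} and the h_{out} change from index 1 to index 2.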
Transfer to Matrices
Okay so, those are the two functions that are now going to help us understand how we should multiply O_{delta} with W_{2} in order to calculate H_{error}.
See slide 53
So, let’s go back to our overview graphic and let’s see how we can transfer the scalar-notation equations back to dealing with matrices.
See slide 54
So, let’s start with the equation for the partial derivative of MSE w.r.t. w_{1-11}.
Here, in the parentheses, we first want to multiply o_{delta} _{1} of a specific example e with w_{2-11}. So, when we are looking at example 1, then we want to multiply together these two values of O_{delta} and W_{2}, respectively:
See slide 55
And looking at the depiction of the neural net, what we want to multiply together are those two parts of the neural net:
See slide 56
To that, we add o_{delta} _{2}^{(1)} times w_{2-12}.
See slide 57
And to that, we add o_{delta} _{3}^{(1)} times w_{2-13}.
See slide 58
So, for example 1, we basically just want to calculate the dot product of the first row of O_{delta} and the first row of W_{2}.
See slide 59
And looking at the neural net, we can see that the weights, which we use in this calculation, are leading to node 1 in the hidden layer. So, the result of the dot product is going to be h_{error} _{1} of example 1.
See slide 60
Side note: Intuitively it makes sense that we would do the calculations in this way. Namely, when we determine the partial derivative of MSE w.r.t. w_{1-11}, then we want to know how the MSE changes if we slightly increase w_{1-11} (so the weight going from node 1 in the input layer to node 1 in the hidden layer). And since a slight increase of w_{1-11} would influence the MSE over the three highlighted paths, it makes sense that we need to add them together. And if that reminds you somewhat of the multivariable chain rule, then that’s no coincidence because that’s what it basically is.
And since we want to do these calculations for each of our examples e, we simply loop over the rows of O_{delta} to calculate the other h_{error} _{1}.
See slides 61-62
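In code, this loop over the rows of O_{delta} could look like the following minimal sketch. The numbers in O_delta and W2 are made up for illustration. Only the shapes (2 examples, 3 output nodes, 2 hidden nodes) follow the neural net from this post.

```python
import numpy as np

# Hypothetical values: one row per example, one column per output node.
O_delta = np.array([[0.10, -0.20, 0.30],
                    [0.05, 0.15, -0.25]])

# Hypothetical values: W2[i, j] holds w_{2-(i+1)(j+1)},
# so the first row contains w_{2-11}, w_{2-12} and w_{2-13}.
W2 = np.array([[0.5, -1.0, 2.0],
               [1.5, 0.5, -0.5]])

# h_error_1 for each example e: dot product of row e of O_delta
# with the first row of W2.
h_error_1 = []
for e in range(O_delta.shape[0]):
    value = (O_delta[e, 0] * W2[0, 0]
             + O_delta[e, 1] * W2[0, 1]
             + O_delta[e, 2] * W2[0, 2])
    h_error_1.append(value)

print(h_error_1)  # roughly 0.85 for example 1 and -0.625 for example 2
```

The three products per example correspond to the three highlighted paths, and np.dot(O_delta[e], W2[0]) would give the same result in one call.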
And with that, we can now rewrite the whole expression in the brackets as h_{error} _{1}^{(e)}.
See slide 63
So now, let’s look at the equation for the partial derivative of MSE w.r.t. w_{1-12}. And here, we basically do the same thing. The only difference is that we use the second row of W_{2} (and not the first).
See slides 64-67
So, to recap, in order to calculate H_{error}, we want to calculate every possible dot product using the rows of O_{delta} and W_{2}. So again, it sounds a lot like a matrix multiplication. And in fact, we can do these calculations with a matrix multiplication (so that we can then take advantage of the speed of NumPy in our code). To do that, we have to take the transpose of W_{2}.
See slide 68
And now, we can multiply O_{delta} with the transpose of W_{2} to get H_{error}.
See slide 69
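As a quick sanity check, here is a small sketch that calculates every row-by-row dot product with loops and compares the result with the matrix multiplication. The values are again made up, and the variable names are only for illustration.

```python
import numpy as np

# Hypothetical values: 2 examples, 3 output nodes, 2 hidden nodes.
O_delta = np.array([[0.10, -0.20, 0.30],
                    [0.05, 0.15, -0.25]])
W2 = np.array([[0.5, -1.0, 2.0],
               [1.5, 0.5, -0.5]])

# Every possible dot product of a row of O_delta with a row of W2.
H_error_loops = np.zeros((O_delta.shape[0], W2.shape[0]))
for e in range(O_delta.shape[0]):   # examples
    for j in range(W2.shape[0]):    # hidden nodes
        H_error_loops[e, j] = np.dot(O_delta[e], W2[j])

# The same calculation as one matrix multiplication.
H_error = O_delta @ W2.T

print(H_error.shape)                        # (2, 2)
print(np.allclose(H_error, H_error_loops))  # True
```

The shapes also show why the transpose is needed: O_delta is 2x3 and W2 is 2x3, so only O_delta (2x3) times the transpose of W2 (3x2) gives one value per example and hidden node.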
And now, we have finally all the functions that we need to calculate W_{1-update}. So, let’s do that.
See slide 70
So, first to calculate H_{error}, we multiply this O_{delta} with the transpose of W_{2}.
See slide 71
Then, we need to element-wise multiply H_{error} with H_{out} and then with “1 minus H_{out}” to get H_{delta}.
See slide 72
And then in the last step, we multiply the transpose of X with H_{delta} and then divide the resulting matrix by N to get W_{1-update}.
See slide 73
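The three steps can be put together in a short NumPy sketch and checked against a numerical gradient. Note the assumptions: the MSE is taken to be 1/(2N) times the sum of squared errors and O_delta is taken to be (O_out − Y) ⊙ O_out ⊙ (1 − O_out). The definitions from the earlier posts may use a different constant. The function names, the random data and the absence of bias terms are also just for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, W2):
    H_out = sigmoid(X @ W1)
    O_out = sigmoid(H_out @ W2)
    return H_out, O_out

def mse(X, Y, W1, W2):
    # Assumed cost function: 1/(2N) * sum of squared errors.
    _, O_out = forward(X, W1, W2)
    return np.sum((O_out - Y) ** 2) / (2 * X.shape[0])

def w1_update(X, Y, W1, W2):
    N = X.shape[0]
    H_out, O_out = forward(X, W1, W2)
    O_delta = (O_out - Y) * O_out * (1 - O_out)  # assumed form of O_delta
    H_error = O_delta @ W2.T                     # step 1
    H_delta = H_error * H_out * (1 - H_out)      # step 2 (element-wise)
    return X.T @ H_delta / N                     # step 3

# Hypothetical data: 2 examples, 4 inputs, 2 hidden nodes, 3 outputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))
Y = rng.uniform(size=(2, 3))
W1 = rng.normal(size=(4, 2))
W2 = rng.normal(size=(2, 3))

analytic = w1_update(X, Y, W1, W2)

# Numerical gradient via central finite differences, one weight at a time.
numeric = np.zeros_like(W1)
eps = 1e-6
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W1_plus, W1_minus = W1.copy(), W1.copy()
        W1_plus[i, j] += eps
        W1_minus[i, j] -= eps
        numeric[i, j] = (mse(X, Y, W1_plus, W2)
                         - mse(X, Y, W1_minus, W2)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-7))  # True
```

The finite-difference loop is only there as a check. It nudges each weight of W1 a little and measures how the MSE changes, which is exactly what the partial derivatives describe.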
And then, to execute the gradient step, we would multiply W_{1-update} with our learning rate and then subtract it from W_{1} to get our new updated W_{1}.
See slide 74
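As a tiny sketch of that gradient step (the learning rate and all numbers are hypothetical values, not taken from the slides):

```python
import numpy as np

learning_rate = 0.1  # hypothetical value

# Hypothetical current weights and update, both of shape (4, 2).
W1 = np.array([[0.2, -0.4],
               [0.1, 0.3],
               [-0.5, 0.6],
               [0.7, -0.1]])
W1_update = np.array([[0.02, -0.01],
                      [0.00, 0.03],
                      [-0.04, 0.05],
                      [0.01, 0.00]])

# Gradient step: move W1 a small step against the gradient.
W1_new = W1 - learning_rate * W1_update

print(W1_new.shape)  # (4, 2)
```

Weights with a positive entry in W1_update get slightly smaller, and weights with a negative entry get slightly larger.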
And if we do that for both of our weight matrices, then we have executed one gradient step.
And with that, we have now executed one iteration of the feedforward and backpropagation algorithm. And since we have already implemented the feedforward algorithm in code, let’s now implement the backpropagation algorithm in code. And this will be the topic of the next post.