This post is part of a series:
In the previous 3 posts, we have answered the question of how we determine the right parameters for the deep learning algorithm.
See slide 1
Namely, we need to run the backpropagation algorithm. And therefore, we need to determine the partial derivative of MSE with respect to (w.r.t.) W1 and W2. And that is what we are going to start doing in this post.
Visual Interpretation of the Equations
Before we do that, however, let’s first see how all the equations relate to our depiction of the neural net.
Namely, whereas the equations for the feedforward algorithm represent the movement forward through the net, the formulas for the backpropagation algorithm, so to say, represent the movement backward through the net.
So, for example, the equation “Oin = HoutW2” represents the movement from the hidden layer to the output layer (more precisely: from the hidden layer outputs to the output layer inputs). And the equation “Oout = sigmoid(Oin)” represents the movement through the nodes in the output layer.
See slide 2
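As a quick illustration, here is what these two equations could look like in NumPy. The numbers and the layer sizes (2 hidden nodes, 3 output nodes, 2 examples) are made up purely for demonstration:

```python
import numpy as np

def sigmoid(x):
    # element-wise logistic function
    return 1.0 / (1.0 + np.exp(-x))

# made-up hidden layer outputs: one row per example (2 examples, 2 hidden nodes)
H_out = np.array([[0.5, 0.8],
                  [0.1, 0.9]])

# made-up weights from the hidden layer to the output layer (2 x 3)
W2 = np.array([[0.2, 0.4, 0.6],
               [0.3, 0.5, 0.7]])

O_in = H_out @ W2        # movement from the hidden layer to the output layer
O_out = sigmoid(O_in)    # movement through the nodes in the output layer
```

So each row of O_out contains the three output values of the net for one example.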
On the other hand, the partial derivative of Oout w.r.t Oin represents the movement back through the nodes in the output layer. And the partial derivative of Oin w.r.t. Hout represents the movement back from the output layer to the hidden layer (more precisely: from the output layer inputs to the hidden layer outputs).
See slide 3
And that’s why I like using such a diagram of the neural net as a reference point. Namely, by looking at the neural net, you can basically see all the calculations that you need to do for the feedforward and the backpropagation algorithms (which might seem complicated and hard to remember if you just look at the formulas alone).
Okay so now, let’s see what the equations for the partial derivatives of MSE w.r.t. W1 and W2 actually look like. And with that knowledge, we can then implement the backpropagation algorithm in code.
See slide 4
So, as you can see, all the individual expressions in the formulas use capital letters, indicating that we are dealing with matrices. Accordingly, in each of those expressions, we have to determine the partial derivative w.r.t. a matrix, not just w.r.t. a regular scalar variable (like we have done in the previous posts where we introduced derivatives and partial derivatives).
And this means in order to determine the partial derivatives of MSE w.r.t. W1 and w.r.t. W2, we have to make use of matrix calculus. But this is something that you normally don’t learn in school and it can also get quite complicated, because you have to handle many components all at once.
And for that reason, we are now going to take a step back and represent all of the equations that we have so far using scalar variables, i.e. variables that just represent a single number. And then, we are not going to determine, for example, the partial derivative of MSE w.r.t. W2, but instead the partial derivatives of MSE w.r.t. all the individual weights in W2.
That way, we can understand what’s going on in the formulas even without knowing about matrix calculus. And then, once we are done, we can transfer what we have learned, back to dealing with the actual matrices because that way we can take advantage of the speed of NumPy in our code.
Okay, so let’s now rewrite the equations for the feedforward algorithm using scalar variables.
See slide 5
And since we are going to depict the neural net again in this upright representation, let’s flip around the equations so that they match the operations in the neural net.
See slide 6
And now, let’s first just look at the upper part of the neural net in more detail.
See slide 7
And therefore, let’s space out the respective functions so that we have more room.
See slide 8
So, those are now the functions that we need to rewrite using scalar variables.
So, to calculate the incoming value for the node on the far left in the output layer (oin 1), we need to multiply the output value of the node on the left in the hidden layer (hout 1) with the value of the weight that connects those two nodes (w2-11).
Side note: The “2” in w2-11 is just my way of indicating that this weight belongs to W2. So, it’s the weight of W2 that goes from node 1 to node 1 (and not the weight of W1 that goes from node 1 to node 1 – which we will be dealing with later).
See slide 9
To that, we then add hout 2 multiplied with w2-21.
See slide 10
So, that’s how we calculate the incoming value for output node 1. And in a similar fashion, we can calculate the incoming values for the other two output nodes.
See slide 11
Then, in the next step, to calculate the output of those nodes, we simply put the respective oin into our sigmoid function.
See slide 12
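Written out as plain Python with scalar variables, these calculations could look like this. All the values are made up just to make the structure visible:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# made-up values for the hidden layer outputs and the weights of W2
h_out_1, h_out_2 = 0.5, 0.8
w2_11, w2_21 = 0.2, 0.3   # weights going into output node 1
w2_12, w2_22 = 0.4, 0.5   # weights going into output node 2
w2_13, w2_23 = 0.6, 0.7   # weights going into output node 3

# incoming values of the three output nodes
o_in_1 = h_out_1 * w2_11 + h_out_2 * w2_21
o_in_2 = h_out_1 * w2_12 + h_out_2 * w2_22
o_in_3 = h_out_1 * w2_13 + h_out_2 * w2_23

# outputs of the three output nodes
o_out_1 = sigmoid(o_in_1)
o_out_2 = sigmoid(o_in_2)
o_out_3 = sigmoid(o_in_3)
```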
And those equations so far only consider one example. So, if we want to consider for instance two examples, then we need to do all of these calculations again for the second example.
See slide 13
And to indicate that these are actually different examples, we are going to use a superscript.
See slide 14
So, for example, to calculate oin 3 for the second example, we need to use hout 1 and hout 2 of the second example. The weights, however, are the same. So, you could imagine for example that we run our neural net twice to get the outputs for both examples.
And now, to see that the matrix and scalar notations are actually equivalent, let’s quickly take a look again at the matrix multiplication where we multiply Hout with W2 to get Oin. For the case where we have two examples, the matrices would look like this:
See slide 15
And the respective dot products that we calculate with this matrix multiplication look like this:
See slides 16-21
And, as you can see, the calculations are exactly the same as the ones that we have written down using the scalar-notation. So, the different notations are actually equivalent.
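You can also convince yourself of this equivalence numerically. The following sketch (with made-up numbers) computes Oin once as a matrix multiplication and once by writing out the individual dot products as in the scalar notation:

```python
import numpy as np

# made-up hidden outputs for two examples (one row per example)
H_out = np.array([[0.5, 0.8],
                  [0.1, 0.9]])
W2 = np.array([[0.2, 0.4, 0.6],
               [0.3, 0.5, 0.7]])

# matrix notation: one multiplication
O_in = H_out @ W2

# scalar notation: the same dot products written out by hand
O_in_scalar = np.empty((2, 3))
for e in range(2):        # examples
    for n in range(3):    # output nodes
        O_in_scalar[e, n] = (H_out[e, 0] * W2[0, n]
                             + H_out[e, 1] * W2[1, n])

print(np.allclose(O_in, O_in_scalar))  # True
```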
Okay so now, to finally calculate the MSE, we put all the oout and their respective y into the formula.
See slide 22
And if we write out the sigma that takes the sum over all the nodes n, then the formula looks like this:
See slide 23
And if we then also write out the sigma that takes the sum over all the examples e, then the formula looks like this:
See slide 24
Side note: Here, I just want to point out that if you put in the values for these variables and actually compute the MSE, then the MSE is just a regular number (as we have seen in the post where we introduced it). So, even though it’s denoted with capital letters, that is not an indication that it is a matrix (which we also denote with capital letters). It is just an abbreviation, and it looked kind of strange when I used lowercase letters. That’s why I am using capital letters. One easy way to remember the difference is that the matrices consist of only one letter, whereas the abbreviation consists of several.
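In code, the written-out double sum and the compact matrix version give the same single number. Here is a sketch with made-up outputs and labels, assuming the MSE is the mean of the squared errors over all examples and nodes (check your own definition for the exact constant):

```python
import numpy as np

# made-up outputs and labels for 2 examples and 3 output nodes
O_out = np.array([[0.6, 0.4, 0.7],
                  [0.3, 0.9, 0.5]])
Y = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

n_examples, n_nodes = O_out.shape

# written-out double sum over the examples e and the nodes n
total = 0.0
for e in range(n_examples):
    for n in range(n_nodes):
        total += (Y[e, n] - O_out[e, n]) ** 2
mse_loops = total / (n_examples * n_nodes)

# the same result as one vectorized expression
mse_vec = np.mean((Y - O_out) ** 2)
```

Both variants produce a plain scalar, which is exactly the point of the side note above.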
Multivariable Chain Rule
So, this is now the function for which we want to determine the partial derivative w.r.t. w2-11, w2-12, w2-21 and so on.
See slide 25
And just as a reminder, we want to do that because those are all the weights of W2.
See slide 26
So, if we know how to determine those partial derivatives, then we should be able to transfer that knowledge over to working with matrices again. And therefore, we should be able to determine the partial derivative of MSE w.r.t. W2 which was our initial goal.
See slide 27
Okay, so let’s start with the partial of MSE w.r.t. w2-11 which tells us how the MSE will change if we slightly increase this weight.
See slide 28
And just like before when we were dealing with matrices, we can use the chain rule to determine this partial derivative because we have intermediate steps where we put the result of one function into the next function.
But this time, because there are multiple variables in the MSE function and not just one, we have to use a special case of the chain rule, namely the multivariable chain rule.
See slide 29
So, suppose you have a function z that is dependent on two variables, x and y. And those variables, in turn, depend on other variables, u and v in this case. Then, you can calculate the partial derivative of z w.r.t. u and w.r.t. v like this:
See slide 30
This seems to be complicated at first sight, but there is an easy way to remember and work with this rule. Namely, you should visualize the dependencies of the functions with a tree diagram.
See slide 31
So, z is dependent on x and y (i.e. it is a function of x and y). And x and y, in turn, are both dependent on u and v.
And then, if we want to determine, for example the partial derivative of z w.r.t. u, we simply have to consider the two paths that lead from z to u.
See slide 32
So, we first multiply the partial derivative of z w.r.t. x with the partial derivative of x w.r.t. u. And then, we add the other path to that. So, we add the partial derivative of z w.r.t. y multiplied with the partial derivative of y w.r.t. u.
And even intuitively, this rule makes sense because if we want to take the partial derivative of z w.r.t. u, then what we want to know is: how does z change when we slightly increase u? And since u affects z over those two paths, it makes sense that we have to add them up.
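You can check the rule numerically. In this sketch, x(u, v), y(u, v) and z(x, y) are made-up example functions (not from our neural net); we compare the two-path chain rule result against a finite-difference approximation:

```python
# hypothetical example functions for the multivariable chain rule
def x(u, v): return u + 2 * v
def y(u, v): return u * v
def z(x_val, y_val): return x_val ** 2 + 3 * y_val

u0, v0 = 1.5, -0.7

# the two paths from z down to u
dz_dx = 2 * x(u0, v0)   # partial of z w.r.t. x
dx_du = 1               # partial of x w.r.t. u
dz_dy = 3               # partial of z w.r.t. y
dy_du = v0              # partial of y w.r.t. u

dz_du = dz_dx * dx_du + dz_dy * dy_du

# finite-difference approximation of the same derivative
h = 1e-6
dz_du_num = (z(x(u0 + h, v0), y(u0 + h, v0))
             - z(x(u0, v0), y(u0, v0))) / h

print(abs(dz_du - dz_du_num) < 1e-4)  # True
```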
So, that’s how the multivariable chain rule works and now let’s apply it to our partial derivative of MSE w.r.t. w2-11. Therefore, let’s also create such a tree diagram for our scalar-notation functions.
So first, as you can see, the MSE is dependent on oout 1, oout 2 and oout 3 of the first example and of the second example.
See slide 33
And that’s because the labels y are given. So, we can’t influence them by changing our weights.
Then, in the next step, those oout are dependent on their respective oin.
See slide 34
And then, oin 1 for both examples is dependent on w2-11 and w2-21.
See slide 35
And that’s because, right now, we are only looking at the upper part of the neural net.
See slide 36
So, we consider the hout as given since we don’t make any adjustments to the weights from W1 for now. Okay, so having said that, we can then see that oin 2 is dependent on w2-12 and w2-22.
See slide 37
And oin 3 is dependent on w2-13 and w2-23.
See slide 38
So, if we now want to determine the partial derivative of MSE w.r.t. w2-11, we have to consider those two paths:
See slide 39
So first, we have to take the partial derivative of MSE w.r.t. oout 1.
See slide 40
Then, we multiply it by the derivative of oout 1 w.r.t. oin 1.
See slide 41
And here, we use the normal derivative because oout 1 depends only on one variable. And then finally, we multiply that with the partial derivative of oin 1 w.r.t. w2-11.
See slide 42
And to that whole expression, we add the second path.
See slide 43
So, this is now the final formula for calculating the partial derivative of MSE w.r.t. w2-11. And now, let’s see what the equations for those individual expressions actually look like. And this will be the topic of the next post.
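Even before we derive those individual expressions, you can already approximate this partial derivative numerically: nudge w2-11 a tiny bit and see how the MSE changes. The following sketch does exactly that, with made-up values and a hypothetical helper `mse_for` (it treats the hout as given, since we are only looking at the upper part of the net):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mse_for(W2, H_out, Y):
    # hypothetical helper: MSE of the upper part of the net,
    # with the hidden outputs H_out treated as given
    O_out = sigmoid(H_out @ W2)
    return np.mean((Y - O_out) ** 2)

# made-up values: 2 examples, 2 hidden nodes, 3 output nodes
H_out = np.array([[0.5, 0.8],
                  [0.1, 0.9]])
W2 = np.array([[0.2, 0.4, 0.6],
               [0.3, 0.5, 0.7]])
Y = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

# central difference: nudge w2-11 (entry [0, 0] of W2) up and down
h = 1e-6
W2_plus, W2_minus = W2.copy(), W2.copy()
W2_plus[0, 0] += h
W2_minus[0, 0] -= h
grad_w2_11 = (mse_for(W2_plus, H_out, Y) - mse_for(W2_minus, H_out, Y)) / (2 * h)
```

Once we have derived the analytic formula, a check like this is a handy way to verify that the implementation of the backpropagation algorithm is correct.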