This post is part of a series:
In the previous three posts, we answered the question of how we can determine the right parameters for the deep learning algorithm.
See slide 1
Namely, we need to run the backpropagation algorithm. And therefore, we need to determine the partial derivative of MSE with respect to (w.r.t.) W_{1} and W_{2}. And that is what we are going to start doing in this post.
Visual Interpretation of the Equations
Before we do that, however, let’s first see how all the equations relate to our depiction of the neural net.
Namely, whereas the equations for the feedforward algorithm represent the movement forward through the net, the formulas for the backpropagation algorithm, so to speak, represent the movement backward through the net.
So, for example, the equation “O_{in} = H_{out}W_{2}” represents the movement from the hidden layer to the output layer (more precisely: from the hidden layer outputs to the output layer inputs). And the equation “O_{out} = sigmoid(O_{in})” represents the movement through the nodes in the output layer.
See slide 2
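To make this concrete, here is a minimal NumPy sketch of those two feedforward steps. All the numbers, and the shapes (2 examples, 2 hidden nodes, 3 output nodes), are made up for illustration:

```python
import numpy as np

def sigmoid(x):
    # element-wise logistic function
    return 1.0 / (1.0 + np.exp(-x))

# made-up values: 2 examples (rows), 2 hidden nodes, 3 output nodes
H_out = np.array([[0.5, 0.8],
                  [0.3, 0.9]])        # hidden layer outputs
W_2 = np.array([[0.1, 0.4, 0.7],
                [0.2, 0.5, 0.8]])     # weights from hidden to output layer

O_in = H_out @ W_2       # movement from the hidden layer to the output layer
O_out = sigmoid(O_in)    # movement through the nodes in the output layer
```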
On the other hand, the partial derivative of O_{out} w.r.t. O_{in} represents the movement back through the nodes in the output layer. And the partial derivative of O_{in} w.r.t. H_{out} represents the movement back from the output layer to the hidden layer (more precisely: from the output layer inputs to the hidden layer outputs).
See slide 3
And that’s why I like using such a diagram of the neural net as a reference point. Namely, by looking at the neural net, you can basically see all the calculations that you need to do for the feedforward and the backpropagation algorithms (which might seem complicated and hard to remember if you just look at the formulas alone).
Matrix Calculus
Okay so now, let’s see what the equations for the partial derivatives of MSE w.r.t. W_{1} and W_{2} actually look like. And with that knowledge, we can then implement the backpropagation algorithm in code.
See slide 4
So, as you can see, all the individual expressions in the formulas are written with capital letters, indicating that we are dealing with matrices. Accordingly, in each of those expressions, we have to determine the partial derivative w.r.t. a matrix, not just w.r.t. a regular scalar variable (like we did in the previous posts where we introduced derivatives and partial derivatives).
And this means that, in order to determine the partial derivatives of MSE w.r.t. W_{1} and w.r.t. W_{2}, we have to make use of matrix calculus. But this is something that you normally don’t learn in school, and it can also get quite complicated because you have to handle many components all at once.
For that reason, we are now going to take a step back and represent all of the equations that we have so far using scalar variables. So, variables that each just represent a single number. And then, instead of determining, for example, the partial derivative of MSE w.r.t. W_{2}, we are going to determine the partial derivatives of MSE w.r.t. all the individual weights in W_{2}.
That way, we can understand what’s going on in the formulas even without knowing matrix calculus. And then, once we are done, we can transfer what we have learned back to dealing with the actual matrices, because that way we can take advantage of the speed of NumPy in our code.
Scalar Variables
Okay, so let’s now rewrite the equations for the feedforward algorithm using scalar variables.
See slide 5
And since we are going to depict the neural net again in this upright representation, let’s flip around the equations so that they match the operations in the neural net.
See slide 6
And now, let’s first just look at the upper part of the neural net in more detail.
See slide 7
And therefore, let’s space out the respective functions so that we have more room.
See slide 8
So, those are now the functions that we need to rewrite using scalar variables.
So, to calculate the incoming value for the node on the far left in the output layer (o_{in} _{1}), we need to multiply the output value of the node on the left in the hidden layer (h_{out} _{1}) with the value of the weight that connects those two nodes (w_{211}).
Side note: The “2” in w_{211} is just my way of indicating that this weight belongs to W_{2}. So, it’s the weight of W_{2} that goes from node 1 to node 1 (and not the weight of W_{1} that goes from node 1 to node 1 – which we will be dealing with later).
See slide 9
To that, we then add h_{out} _{2} multiplied with w_{221}.
See slide 10
So, that’s how we calculate the incoming value for output node 1. And in a similar fashion, we can calculate the incoming values for the other two output nodes.
See slide 11
Then, in the next step, to calculate the output of those nodes, we simply put the respective o_{in} into our sigmoid function.
See slide 12
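Written as code, these scalar calculations might look as follows. All the values for the h_out and the weights are made up for illustration:

```python
import math

def sigmoid(x):
    # logistic function for a single number
    return 1.0 / (1.0 + math.exp(-x))

# made-up values for the hidden layer outputs and the weights of W_2
h_out_1, h_out_2 = 0.5, 0.8
w_211, w_212, w_213 = 0.1, 0.4, 0.7   # weights leaving hidden node 1
w_221, w_222, w_223 = 0.2, 0.5, 0.8   # weights leaving hidden node 2

# incoming values of the three output nodes
o_in_1 = h_out_1 * w_211 + h_out_2 * w_221
o_in_2 = h_out_1 * w_212 + h_out_2 * w_222
o_in_3 = h_out_1 * w_213 + h_out_2 * w_223

# outputs of the three output nodes
o_out_1 = sigmoid(o_in_1)
o_out_2 = sigmoid(o_in_2)
o_out_3 = sigmoid(o_in_3)
```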
And those equations so far only consider one example. So, if we want to consider for instance two examples, then we need to do all of these calculations again for the second example.
See slide 13
And to indicate that these are actually different examples, we are going to use a superscript.
See slide 14
So, for example, to calculate o_{in} _{3} for the second example, we need to use h_{out} _{1} and h_{out} _{2} of the second example. The weights, however, are the same. So, you could imagine for example that we run our neural net twice to get the outputs for both examples.
And now, to see that the matrix and scalar notation are actually equivalent, let’s quickly take a look again at the matrix multiplication where we multiply H_{out} with W_{2} to get O_{in}. For the case where we have two examples, the matrices would look like this:
See slide 15
And the respective dot products that we calculate with this matrix multiplication look like this:
See slides 16-21
And, as you can see, the calculations are exactly the same as the ones that we have written down using the scalar notation. So, the different notations are actually equivalent.
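We can also check this equivalence numerically. Here is a quick sketch with made-up values, computing each entry of O_{in} once via the matrix multiplication and once via the written-out dot product:

```python
import numpy as np

# made-up hidden layer outputs for two examples (one example per row)
H_out = np.array([[0.5, 0.8],    # example 1: h_out_1, h_out_2
                  [0.3, 0.9]])   # example 2
W_2 = np.array([[0.1, 0.4, 0.7],   # weights leaving hidden node 1
                [0.2, 0.5, 0.8]])  # weights leaving hidden node 2

# matrix notation: one multiplication covers all examples and nodes
O_in = H_out @ W_2

# scalar notation: the same dot products, written out entry by entry
for e in range(2):          # examples
    for n in range(3):      # output nodes
        o_in_en = H_out[e, 0] * W_2[0, n] + H_out[e, 1] * W_2[1, n]
        assert np.isclose(O_in[e, n], o_in_en)   # both notations agree
```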
Okay so now, to finally calculate the MSE, we put all the o_{out} and their respective y into the formula.
See slide 22
And if we write out the sigma that takes the sum over all the nodes n, then the formula looks like this:
See slide 23
And if we then also write out the sigma that takes the sum over all the examples e, then the formula looks like this:
See slide 24
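As a small sketch, here is the written-out double sum next to its vectorized counterpart. The outputs and labels are made up, and I’m assuming here that the MSE is averaged over both the examples and the nodes; adjust the normalization if your definition from the earlier post differs:

```python
import numpy as np

# made-up network outputs and labels: 2 examples (rows), 3 output nodes
O_out = np.array([[0.6, 0.4, 0.9],
                  [0.2, 0.7, 0.5]])
Y = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

E, N = O_out.shape   # number of examples e, number of nodes n

# the written-out double sum over all examples and all nodes
mse_loops = 0.0
for e in range(E):
    for n in range(N):
        mse_loops += (Y[e, n] - O_out[e, n]) ** 2
mse_loops /= E * N

# the same computation in one vectorized expression
mse_vec = np.mean((Y - O_out) ** 2)
```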
Side note: Here, I just want to point out that if you put in the values for these variables and actually compute the MSE, then the MSE is just a regular number (as we have seen in the post where we introduced it). So, even though it’s written with capital letters, that is not an indication that it is a matrix (which we also denote with capital letters). “MSE” is just an abbreviation, and it looked kind of strange when I used lowercase letters. That’s why I am using capital letters. So, just keep that in mind. One easy way to remember the difference is that the matrices consist of only one letter, whereas the abbreviation consists of several.
Multivariable Chain Rule
So, this is now the function for which we want to determine the partial derivative w.r.t. w_{211}, w_{212}, w_{221} and so on.
See slide 25
And just as a reminder, we want to do that because those are all the weights of W_{2}.
See slide 26
So, if we know how to determine those partial derivatives, then we should be able to transfer that knowledge over to working with matrices again. And therefore, we should be able to determine the partial derivative of MSE w.r.t. W_{2} which was our initial goal.
See slide 27
Okay, so let’s start with the partial derivative of MSE w.r.t. w_{211} which tells us how the MSE will change if we slightly increase this weight.
See slide 28
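Before we derive this analytically, we can already approximate it numerically: slightly increase w_{211}, recompute the MSE, and look at how it changed. Here is a small sketch with made-up values for one example, where the h_out are treated as given:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# made-up hidden outputs, labels, and weights of W_2 (one example)
h_out = [0.5, 0.8]
y = [1.0, 0.0, 1.0]
W2 = [[0.1, 0.4, 0.7],   # w_211, w_212, w_213
      [0.2, 0.5, 0.8]]   # w_221, w_222, w_223

def mse(W2):
    # feedforward through the upper part of the net, then the mean squared error
    o_out = [sigmoid(h_out[0] * W2[0][n] + h_out[1] * W2[1][n])
             for n in range(3)]
    return sum((y[n] - o_out[n]) ** 2 for n in range(3)) / 3

# slightly increase/decrease w_211 and observe how the MSE changes
eps = 1e-6
W2[0][0] += eps
mse_plus = mse(W2)
W2[0][0] -= 2 * eps
mse_minus = mse(W2)
W2[0][0] += eps  # restore the original weight

partial_w211 = (mse_plus - mse_minus) / (2 * eps)  # approximates dMSE/dw_211
```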
And just like before when we were dealing with matrices, we can use the chain rule to determine this partial derivative because we have the intermediate steps where we put the result of one function into the next function.
But this time, because there are multiple variables in the MSE function and not just one, we have to use a special case of the chain rule, namely the multivariable chain rule.
See slide 29
So, if you have a function z that is dependent on two variables, x and y, and those variables, in turn, depend on other variables (u and v in this case), then you can calculate the partial derivative of z w.r.t. u and w.r.t. v like this:
See slide 30
This might seem complicated at first sight, but there is an easy way to remember and work with this rule. Namely, you can visualize the dependencies of the functions with a tree diagram.
See slide 31
So, z is dependent on x and y (i.e. it is a function of x and y). And x and y, in turn, are both dependent on u and v.
And then, if we want to determine, for example the partial derivative of z w.r.t. u, we simply have to consider the two paths that lead from z to u.
See slide 32
So, we first multiply the partial derivative of z w.r.t. x with the partial derivative of x w.r.t. u. And then, we add the other path to that. So, we add the partial derivative of z w.r.t. y multiplied with the partial derivative of y w.r.t. u.
And even intuitively, this rule makes sense because if we want to take the partial derivative of z w.r.t. u, then what we want to know is: How does z change when we slightly increase u? And since u affects z over those two paths, it makes sense that we have to add them up.
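To see the rule in action, here is a small numerical sketch with hypothetical functions chosen just for illustration: z = x·y, with x = u + v and y = u·v. We compute dz/du once with the multivariable chain rule and once with a finite-difference approximation:

```python
# hypothetical functions for illustration: z = x * y, with x = u + v, y = u * v
def z_of(u, v):
    x, y = u + v, u * v
    return x * y

u, v = 1.5, 2.0
x, y = u + v, u * v

# multivariable chain rule: dz/du = (dz/dx)(dx/du) + (dz/dy)(dy/du)
dz_du_chain = y * 1.0 + x * v   # dz/dx = y, dx/du = 1;  dz/dy = x, dy/du = v

# finite-difference approximation of the same partial derivative
eps = 1e-6
dz_du_numeric = (z_of(u + eps, v) - z_of(u - eps, v)) / (2 * eps)
```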
So, that’s how the multivariable chain rule works. And now, let’s apply it to our partial derivative of MSE w.r.t. w_{211}. Therefore, let’s also create such a tree diagram for our scalar notation functions.
So first, as you can see, the MSE is dependent on o_{out} _{1}, o_{out} _{2} and o_{out} _{3} of the first example and of the second example.
See slide 33
And that’s because the labels y are given. So, we can’t influence them by changing our weights.
Then, in the next step, those o_{out} are dependent on their respective o_{in}.
See slide 34
And then, o_{in} _{1} for both examples is dependent on w_{211} and w_{221}.
See slide 35
And that’s because, right now, we are only looking at the upper part of the neural net.
See slide 36
So, we consider the h_{out} as given since we don’t make any adjustments to the weights from W_{1} for now. Okay, so having said that, we can then see that o_{in} _{2} is dependent on w_{212} and w_{222}.
See slide 37
And o_{in} _{3} is dependent on w_{213} and w_{223}.
See slide 38
So, if we now want to determine the partial derivative of MSE w.r.t. w_{211}, we have to consider those two paths:
See slide 39
So first, we have to take the partial derivative of MSE w.r.t. o_{out} _{1}.
See slide 40
Then, we multiply it with the derivative of o_{out} _{1} w.r.t. o_{in} _{1}.
See slide 41
And here, we use the normal derivative because o_{out} _{1} depends only on one variable. And then finally, we multiply that with the partial derivative of o_{in} _{1} w.r.t. w_{211}.
See slide 42
And to that whole expression, we add the second path.
See slide 43
So, this is now the final formula for calculating the partial derivative of MSE w.r.t. w_{211}. And now, let’s see what the equations for those individual expressions actually look like. And this will be the topic of the next post.
