Basics of Deep Learning p.12 - Backpropagation explained Step by Step cont'd

1/21/2020

This post is part of a series:

Part 1: Introduction
Part 2: Feedforward Algorithm explained
Part 3: Implementing the Feedforward Algorithm in pure Python
Part 4: Implementing the Feedforward Algorithm in pure Python cont'd
Part 5: Implementing the Feedforward Algorithm with NumPy
Part 6: Backpropagation explained - Cost Function and Derivatives
Part 7: Backpropagation explained - Gradient Descent and Partial Derivatives
Part 8: Backpropagation explained - Chain Rule and Activation Function
Part 9: Backpropagation explained Step by Step
Part 10: Backpropagation explained Step by Step cont'd
Part 11: Backpropagation explained Step by Step cont'd
Part 12: Backpropagation explained Step by Step cont'd
Part 13: Implementing the Backpropagation Algorithm with NumPy
Part 14: How to train a Neural Net

Here are the corresponding slides for this post:

basics_of_dl_12.pdf

In the previous post, we left off at the point where we wanted to determine the equation for the partial derivative of MSE with respect to (w.r.t.) W₁.

See slide 1

And that’s what we are going to do now.

Derivative of MSE w.r.t. Weight Matrix 1

To do that, let’s look again at the formula that we wrote down in a previous post.

See slide 2

And actually, there is not much new going on in this formula compared to the formula for the partial derivative of MSE w.r.t. W₂.

The first two expressions are, in fact, exactly the same. So, we can simply use the O_delta, that we have already determined, and keep going from this point on through the chain rule.

See slide 3

And then, the last two expressions are also not really new because they are technically the same as the last two expressions in the formula for the partial derivative of MSE w.r.t. W₂.

In the formula for W₂, we have the partial derivative of O_out w.r.t. O_in. This, so to say, represents the movement from the output layer outputs to the output layer inputs.

See slides 4-5

And in the formula for W₁, we have the partial derivative of H_out w.r.t. H_in. This, so to say, represents the movement from the hidden layer outputs to the hidden layer inputs.

See slides 6-7

So, those expressions, generally speaking, represent the movement backward through the nodes. And if we look again at the formulas of the feedforward algorithm, then we can see that those expressions are simply the derivative of the sigmoid function.

See slide 8

And we already know how to determine that. So, the equation for H_delta is going to look like this:

See slide 9
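As a minimal sketch of this step in NumPy: the array values and the `sigmoid` helper are made up for illustration, and `H_error` is only a placeholder here because we have not determined yet how to calculate it. The point is just that H_delta is an element-wise product with the sigmoid derivative.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values: 2 examples (rows), 2 hidden nodes (columns)
H_in = np.array([[0.5, -1.0],
                 [2.0,  0.0]])
H_out = sigmoid(H_in)

# Placeholder for the error arriving at the hidden layer outputs
H_error = np.array([[0.1, -0.2],
                    [0.3,  0.4]])

# Derivative of the sigmoid, expressed through its own output
sigmoid_derivative = H_out * (1 - H_out)

# Element-wise product (not a matrix multiplication)
H_delta = H_error * sigmoid_derivative
```

Since the derivative only needs `H_out`, we can reuse the values from the feedforward pass and don't have to evaluate the sigmoid function again.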

And now, let’s look at the last expressions of both formulas, respectively.

In the formula for W₂, we have the partial derivative of O_in w.r.t. W₂. This, so to say, represents the movement from the output layer inputs to the weights of weight matrix 2.

See slides 10-11

In the formula for W₁, we have the partial derivative of H_in w.r.t. W₁. This, so to say, represents the movement from the hidden layer inputs to the weights of weight matrix 1.

See slides 12-13

So, those expressions, generally speaking, represent the movement from the layer inputs to the weights. And if we look again at the formulas of the feedforward algorithm, then we can see that those expressions are simply the derivative of a matrix multiplication.

See slide 14

And we also already know how to determine that. Namely, it is simply the “opposite, corresponding element”. And since the opposite, corresponding element of W₁ is X, the equation for W_1-update looks like this:

See slide 15

So, the only new expression, really, is the one in the middle, namely the partial derivative of O_in w.r.t. H_out. This represents the movement from the output layer inputs to the hidden layer outputs.

See slides 16-17

So, it represents the movement from one layer to the other. That is something that we haven’t seen yet. But even this expression is not entirely new because, looking at the equations for the feedforward algorithm, we can see that it is again just the derivative of a matrix multiplication.

And here, the opposite, corresponding element of H_out is W₂. So, in order to determine H_error, we somehow have to multiply O_delta with W₂.

See slide 18

The only question now is: How do we multiply these two matrices together? So, in which order should we multiply them? And should we use transposes of any of those matrices?

And to answer those questions, let’s look again at the scalar-notation of the equations for the feedforward algorithm.

See slide 19

Scalar-Notation revisited

So far, we have only looked at the upper part of the neural net because we were only concerned with updating the weights of W₂. But now, we have to consider the whole neural net because we want to determine the partial derivative of MSE w.r.t. W₁.

See slide 20

So, let’s now rewrite the remaining two equations of the feedforward algorithm in scalar-notation. For that, we need a little bit more space.

See slide 21

Okay so now, let’s start with determining the input value of the first node in the hidden layer for example 1.

See slide 22

For that, we simply need to calculate the dot product of the x-values of example 1 and the respective weights that lead to this node.

See slide 23

In a similar way we can determine h_in ₂⁽¹⁾.

See slide 24

And then, we do the same for the second example.

See slide 25

After that, we put all those h_in into the sigmoid function to get the respective h_out.

See slide 26

So, nothing special going on really, just a lot of variables.
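To see that the scalar-notation really is the same calculation as the matrix-notation, here is a small sketch. The numbers are made up, and the storage convention (row of W₁ = input node, column = hidden node) is an assumption that matches the index pattern w_1-ij used in this post.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up numbers: 2 examples (rows) with 4 x-values each
X = np.array([[1.0, 0.5, -1.0,  2.0],
              [0.0, 1.5,  1.0, -0.5]])

# Assumed layout: W1[i, j] holds w_1-(i+1)(j+1),
# i.e. the weight from input node i+1 to hidden node j+1
W1 = np.array([[ 0.1, -0.2],
               [ 0.3,  0.4],
               [-0.5,  0.6],
               [ 0.7, -0.8]])

# Scalar-notation: one dot product per example and hidden node
h_in = np.zeros((2, 2))
for e in range(2):          # examples
    for j in range(2):      # hidden nodes
        for i in range(4):  # input nodes
            h_in[e, j] += X[e, i] * W1[i, j]
h_out = sigmoid(h_in)

# Matrix-notation: the same numbers in one step
H_in = X @ W1
H_out = sigmoid(H_in)

print(np.allclose(h_in, H_in), np.allclose(h_out, H_out))
```

Both comparisons should come out as `True`, which is just another way of saying that the loops and the matrix multiplication describe the same thing.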

And now, if we want to know how to determine the partial derivative of MSE w.r.t. W₁, let’s see again how we can determine the partial derivative of MSE w.r.t. a particular weight in W₁. So, let’s take w_1-11.

See slide 27

In order to determine this partial derivative, we need to again use the multivariable chain rule.

See slide 28

So, let’s again visualize the dependencies of the scalar-notation functions with a tree.

See slide 29

And, as you can see, the tree is so big that I actually couldn’t depict all branches. So, I just drew all the branches for the first example. For the second example, however, they look basically the same, just with a superscript of 2 instead of 1. So, it is not really a problem.

Okay so now, if we want to determine the partial derivative of MSE w.r.t. w_1-11, then we need to consider these 3 paths:

See slide 30

And obviously, there are three additional paths for the second example, but let’s just consider the first example, for now. So, let’s write down the derivatives that we need to determine for the first path (the one on the left).

See slide 31

Again, nothing really new is going on. So, let’s also write down the expressions for the other two paths.

See slide 32

And again, obviously we would also have to add the other 3 paths from the second example.

See slide 33

But I don’t have any more space, so we are just going to consider example 1.

So now, we can write down the actual equation for the partial derivative of MSE w.r.t. w_1-11. And for that, let’s color-code the individual expressions like we did in an earlier post when we transferred the scalar-notation back to matrix-notation in order to see how to determine the partial derivative of MSE w.r.t. W₂.

See slide 34

And there is also nothing really new going on. We basically already know how to determine all those expressions. The blue ones are the derivative of the cost function, the orange ones are the derivative of the sigmoid function and the green ones are the derivative of the dot product. So, let’s quickly go through the expressions of the first path.

The first two expressions, we have already written out before and they look like this:

See slide 35

Then, we need to multiply this with the partial derivative of o_in ₁⁽¹⁾ w.r.t. h_out ₁⁽¹⁾. So, we need to know what the opposite, corresponding element of h_out ₁⁽¹⁾ is.

See slide 36

And, as you can see, it is w_2-11.

See slide 37

So then, we need to multiply that with the derivative of h_out ₁⁽¹⁾ which is simply the derivative of the sigmoid function.

See slide 38

And finally, we need to multiply that with the opposite, corresponding element of w_1-11.

See slide 39

So, those are the equations for the first path. And in a similar way, we can determine the equations for the other two paths of the first example.

See slide 40

And this is now what the equations for the first example look like. And for the second example, they look basically the same, just with a superscript of 2. So, let’s just keep working with these equations.

See slide 41

And now, let’s rewrite them again in a more general way by using sigma notation. To do that, let’s first factor out the 1/N.

See slide 42

And then, we can rewrite the function with sigma notation where the sigma runs over all examples e.

See slide 43

And again, we can do that like this because the equations for the second example look exactly the same except that they have a superscript of 2.
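If you want to convince yourself that summing over the examples and over the three paths really gives the right derivative, you can compare it with a numerical estimate. The following is only a sketch under some assumptions that are not spelled out in this post: the network has no biases, the cost is the sum of squared errors halved and averaged over the N examples (so that the derivative of the cost is simply o_out minus y), and the data and weights are random.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, W2):
    """Feedforward pass without biases (assumption)."""
    H_out = sigmoid(X @ W1)
    O_out = sigmoid(H_out @ W2)
    return H_out, O_out

def mse(X, Y, W1, W2):
    """Assumed cost: squared errors, halved and averaged over examples."""
    _, O_out = forward(X, W1, W2)
    return np.sum((O_out - Y) ** 2) / (2 * len(X))

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))    # 2 examples, 4 input nodes
Y = rng.uniform(size=(2, 3))   # 3 output nodes
W1 = rng.normal(size=(4, 2))   # W1[0, 0] plays the role of w_1-11
W2 = rng.normal(size=(2, 3))   # W2[0, k] plays the role of w_2-1(k+1)

H_out, O_out = forward(X, W1, W2)
N = len(X)

grad = 0.0
for e in range(N):                 # the sigma over all examples
    path_sum = 0.0
    for k in range(3):             # the three paths via o_1, o_2, o_3
        cost_deriv = O_out[e, k] - Y[e, k]                # "blue"
        sigmoid_deriv = O_out[e, k] * (1 - O_out[e, k])   # "orange"
        path_sum += cost_deriv * sigmoid_deriv * W2[0, k]  # "green"
    # Shared tail of all three paths: sigmoid derivative at h_out_1, then x_1
    grad += path_sum * H_out[e, 0] * (1 - H_out[e, 0]) * X[e, 0]
grad /= N

# Numerical estimate: nudge w_1-11 slightly up and down
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (mse(X, Y, W1_plus, W2) - mse(X, Y, W1_minus, W2)) / (2 * eps)

print(grad, numeric)
```

The two printed numbers should agree up to a tiny numerical error. If your cost function is defined with a different constant factor, the analytical value is off by exactly that factor.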

So, this is now the final equation for how we determine the partial derivative of MSE w.r.t. w_1-11. But remember, our only goal is to understand how to determine H_error.

See slide 44

For that, we don’t need to consider the whole equation, only a specific part of it. So, let’s rewrite it a little bit so that it is then easier to see how we can translate this scalar-notation equation back to dealing with matrices.

First off, we can see that the last couple of expressions in each summand are the same.

See slide 45

So, we can factor them out.

See slide 46

In terms of the matrix-notation, these expressions refer to everything that comes after H_error. And we already know how to determine that. So, we are not interested in this part of the scalar-notation equation.

And then, let’s look at the first couple of expressions in each summand.

See slide 47

In terms of the matrix-notation, these expressions represent the first two steps, i.e. they represent O_delta. So, we can summarize them accordingly since our goal is to understand how we should multiply O_delta with W₂.

See slide 48

So, this is now our equation for determining the partial derivative of MSE w.r.t. w_1-11. And the only part that is really of interest to us, is what is going on in the brackets. And that’s because we want to understand how we should multiply O_delta with W₂.

But to really understand that, having the equation of the partial derivative of MSE w.r.t. only one weight is not enough. So, let’s write down this equation for another weight. But which one should we choose?

Well, we are not going to pick one of the weights of W₁ that also “goes” to node 1 in the hidden layer, i.e. w_1-21, w_1-31 or w_1-41. And that’s because if we chose one of those, then the only thing that would change in the equation is the subscript of the x (since the path in the tree would be exactly the same except for the last step). And, as already said before, we are not interested in this part of the equation.

What we are interested in, is how to multiply o_delta with a weight of W₂. So, in the tree, we are interested in the fork below o_in. So, that’s where we need to take a different path. Therefore, we are going to pick one of the weights of W₁ that “goes” to node 2 in the hidden layer, i.e. w_1-12, w_1-22, w_1-32 or w_1-42. And we choose w_1-12.

See slide 49

And the equation then looks like this:

See slide 50

So, the o_delta are going to be the same. And that’s because the paths in the tree up until the respective o_in are the same for the partial derivatives of MSE w.r.t. w_1-11 and w.r.t. w_1-12.

But after that, we take a different path than before. Namely, we go to the right instead of to the left. So, we need to determine the partial derivative of the particular o_in w.r.t. h_out ₂ (and not h_out ₁). This is a derivative of the dot product. So, let’s look again at the feedforward equations to see what the opposite, corresponding element of h_out ₂ is.

See slide 51

So, as you can see, for o_in ₁ the opposite, corresponding element of h_out ₂ is w_2-21. So, accordingly, we multiply o_delta ₁ with w_2-21. And in the same way, for o_in ₂ and o_in ₃ the opposite, corresponding elements of h_out ₂ are w_2-22 and w_2-23. So, we multiply o_delta ₂ and o_delta ₃ with those, respectively.

And then, for the sake of completeness, let’s also look at the rest of the equation (even though we don’t need it in order to understand how to determine H_error). So, for the next step in the tree, we need to determine the derivative of the sigmoid function. Only this time, we use h_out ₂ and not h_out ₁. And then, finally we multiply that with the opposite, corresponding element of w_1-12 which is again x₁.

See slide 52

Transfer to Matrices

Okay so, those are the two functions that are now going to help us understand how we should multiply O_delta with W₂ in order to calculate H_error.

See slide 53

So, let’s go back to our overview graphic and let’s see how we can transfer the scalar-notation equations back to dealing with matrices.

See slide 54

So, let’s start with the equation for the partial derivative of MSE w.r.t. w_1-11.

Here, in the brackets, we first want to multiply o_delta ₁ of a specific example e with w_2-11. So, when we are looking at example 1, then we want to multiply together these two values of O_delta and W₂, respectively:

See slide 55

And looking at the depiction of the neural net, what we want to multiply together are those two parts of the neural net:

See slide 56

To that, we add o_delta ₂⁽¹⁾ times w_2-12.

See slide 57

And to that, we add o_delta ₃⁽¹⁾ times w_2-13.

See slide 58

So, for example 1, we basically just want to calculate the dot product of the first row of O_delta and the first row of W₂.

See slide 59

And looking at the neural net, we can see that the weights, which we use in this calculation, are leading to node 1 in the hidden layer. So, the result of the dot product is going to be h_error ₁ of example 1.

See slide 60

Side note: Intuitively it makes sense that we would do the calculations in this way. Namely, when we determine the partial derivative of MSE w.r.t. w_1-11, then we want to know how the MSE changes if we slightly increase w_1-11 (so the weight going from node 1 in the input layer to node 1 in the hidden layer). And since a slight increase of w_1-11 would influence the MSE over the three highlighted paths, it makes sense that we need to add them together. And if that reminds you somewhat of the multivariable chain rule, then that’s no coincidence because that’s what it basically is.

And since we want to do these calculations for each of our examples e, we simply loop over the rows of O_delta to calculate the other h_error ₁.

See slides 61-62

And with that, we can now rewrite the whole expression in the brackets as h_error ₁^(e).

See slide 63
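Written as code, the expression in the brackets could look like the following sketch. The values of `O_delta` and `W2` are made up, and the layout of `W2` (row = hidden node, column = output node) is an assumption that follows the index pattern w_2-ij from above.

```python
import numpy as np

# Made-up values: 2 examples (rows), 3 output nodes (columns)
O_delta = np.array([[ 0.10, -0.20, 0.05],
                    [-0.30,  0.40, 0.25]])

# Assumed layout: W2[i, j] holds w_2-(i+1)(j+1),
# i.e. the weight from hidden node i+1 to output node j+1
W2 = np.array([[0.5, -1.0,  2.0],
               [1.5,  0.5, -0.5]])

# Example 1, written out term by term:
# o_delta_1 * w_2-11 + o_delta_2 * w_2-12 + o_delta_3 * w_2-13
h_error_1_ex1 = (O_delta[0, 0] * W2[0, 0]
                 + O_delta[0, 1] * W2[0, 1]
                 + O_delta[0, 2] * W2[0, 2])

# The same thing as a dot product, looping over the rows of O_delta
h_error_1 = np.array([np.dot(O_delta[e], W2[0])
                      for e in range(len(O_delta))])

print(h_error_1_ex1, h_error_1)
```

Each entry of `h_error_1` belongs to one example, and every one of them uses the first row of `W2` because those are the weights that are connected to node 1 in the hidden layer.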

So now, let’s look at the equation for the partial derivative of MSE w.r.t. w_1-12. And here, we basically do the same thing. The only difference is that we use the second row of W₂ (and not the first).

See slides 64-67

So, to recap, in order to calculate H_error, we want to calculate every possible dot product using the rows of O_delta and W₂. So again, it sounds a lot like a matrix multiplication. And in fact, we can do these calculations with a matrix multiplication (so that we can then take advantage of the speed of NumPy in our code). To do that, we have to take the transpose of W₂.

See slide 68

And now, we can multiply O_delta with the transpose of W₂ to get H_error.

See slide 69
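Here is a small sketch of why the transpose does the job. The shapes (2 examples, 2 hidden nodes, 3 output nodes) follow the neural net from this post, whereas the random values and the variable names are just assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
O_delta = rng.normal(size=(2, 3))   # rows: examples, columns: output nodes
W2 = rng.normal(size=(2, 3))        # rows: hidden nodes, columns: output nodes

# Every possible dot product of a row of O_delta with a row of W2
H_error_loop = np.zeros((2, 2))
for e in range(2):       # examples
    for j in range(2):   # hidden nodes
        H_error_loop[e, j] = np.dot(O_delta[e], W2[j])

# The same result with one matrix multiplication:
# (2 x 3) times (3 x 2) gives (2 x 2), one row per example
H_error = O_delta @ W2.T

print(H_error.shape, np.allclose(H_error, H_error_loop))
```

A matrix multiplication pairs rows of the first matrix with columns of the second one. Since we want to pair rows with rows, the rows of W₂ first have to become columns, and that is exactly what the transpose does. Without it, the inner dimensions (3 and 2) wouldn’t even match.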

And now, we finally have all the functions that we need to calculate W_1-update. So, let’s do that.

See slide 70

So, first to calculate H_error, we multiply this O_delta with the transpose of W₂.

See slide 71

Then, we need to element-wise multiply H_error with H_out and then with “1 minus H_out” to get H_delta.

See slide 72

And then in the last step, we multiply the transpose of X with H_delta and then divide the resulting matrix by N to get W_1-update.

See slide 73

And then, to execute the gradient step, we would multiply W_1-update with our learning rate and then subtract it from W₁ to get our new updated W₁.

See slide 74

And if we do that for both of our weight matrices, then we have executed one gradient step.
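Put together, one gradient step could be sketched roughly like this. It is not the implementation from this series, just an illustration under a few assumptions: no biases, random data and weights, an arbitrary learning rate, and an output error of `O_out - Y` (which corresponds to a squared-error cost that is halved and averaged over the N examples).

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
N = 2                               # number of examples
X = rng.normal(size=(N, 4))         # 4 input nodes
Y = rng.uniform(size=(N, 3))        # 3 output nodes
W1 = rng.normal(size=(4, 2))        # input layer -> hidden layer
W2 = rng.normal(size=(2, 3))        # hidden layer -> output layer
learning_rate = 0.1                 # arbitrary value

# Feedforward
H_out = sigmoid(X @ W1)
O_out = sigmoid(H_out @ W2)

# Backpropagation, output layer (O_error is an assumption, see above)
O_error = O_out - Y
O_delta = O_error * O_out * (1 - O_out)
W2_update = H_out.T @ O_delta / N

# Backpropagation, hidden layer (the steps from this post)
H_error = O_delta @ W2.T
H_delta = H_error * H_out * (1 - H_out)
W1_update = X.T @ H_delta / N

# Gradient step for both weight matrices
W1_new = W1 - learning_rate * W1_update
W2_new = W2 - learning_rate * W2_update

print(W1_update.shape, W2_update.shape)
```

Note that both updates are calculated with the old weights before any of them is changed. In particular, `H_error` uses the old `W2` and not `W2_new`.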

And with that, we have now executed one iteration of the feedforward and backpropagation algorithm. And since we have already implemented the feedforward algorithm in code, let’s now implement the backpropagation algorithm in code. And this will be the topic of the next post.
