This post is part of a series:
In the previous 3 posts, we have answered the question of how we determine the right parameters for the deep learning algorithm.
See slide 1
Namely, we need to run the backpropagation algorithm. And therefore, we need to determine the partial derivative of MSE with respect to (w.r.t.) W1 and W2. And that is what we are going to start doing in this post.
Visual Interpretation of the Equations
Before we do that, however, let’s first see how all the equations relate to our depiction of the neural net.
Namely, whereas the equations for the feedforward algorithm represent the movement forward through the net, the formulas for the backpropagation algorithm, so to say, represent the movement backward through the net.
So, for example, the equation “Oin = HoutW2” represents the movement from the hidden layer to the output layer (more precisely: from the hidden layer outputs to the output layer inputs). And the equation “Oout = sigmoid(Oin)” represents the movement through the nodes in the output layer.
See slide 2
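As a quick illustration, here is what these two equations could look like in NumPy. The numbers and the layer sizes (2 hidden nodes, 3 output nodes, 2 examples) are made up purely for demonstration:

```python
import numpy as np

def sigmoid(x):
    # element-wise logistic function
    return 1.0 / (1.0 + np.exp(-x))

# made-up hidden layer outputs: one row per example (2 examples, 2 hidden nodes)
H_out = np.array([[0.5, 0.8],
                  [0.1, 0.9]])

# made-up weights from the hidden layer to the output layer (2 x 3)
W2 = np.array([[0.2, 0.4, 0.6],
               [0.3, 0.5, 0.7]])

O_in = H_out @ W2        # movement from the hidden layer to the output layer
O_out = sigmoid(O_in)    # movement through the nodes in the output layer
```

So each row of O_out contains the three output values of the net for one example.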
On the other hand, the partial derivative of Oout w.r.t Oin represents the movement back through the nodes in the output layer. And the partial derivative of Oin w.r.t. Hout represents the movement back from the output layer to the hidden layer (more precisely: from the output layer inputs to the hidden layer outputs).
See slide 3
And that’s why I like using such a diagram of the neural net as a reference point. Namely, by looking at the neural net, you can basically see all the calculations that you need to do for the feedforward and the backpropagation algorithms (which might seem complicated and hard to remember if you just look at the formulas alone).
Okay so now, let’s see what the equations for the partial derivatives of MSE w.r.t. W1 and W2 actually look like. And with that knowledge, we can then implement the backpropagation algorithm in code.
See slide 4
So, as you can see, all the individual expressions in the formulas use capital letters, indicating that we are dealing with matrices. Accordingly, in each of those expressions, we have to determine the partial derivative w.r.t. a matrix, not just w.r.t. a regular scalar variable (like we have done in the previous posts where we introduced derivatives and partial derivatives).
And this means in order to determine the partial derivatives of MSE w.r.t. W1 and w.r.t. W2, we have to make use of matrix calculus. But this is something that you normally don’t learn in school and it can also get quite complicated, because you have to handle many components all at once.
And for that reason, we are now going to take a step back and represent all of the equations that we have so far using scalar variables, i.e. variables that just represent a single number. And then, we are not going to determine, for example, the partial derivative of MSE w.r.t. W2, but instead the partial derivatives of MSE w.r.t. all the individual weights in W2.
That way, we can understand what’s going on in the formulas even without knowing about matrix calculus. And then, once we are done, we can transfer what we have learned, back to dealing with the actual matrices because that way we can take advantage of the speed of NumPy in our code.
Okay, so let’s now rewrite the equations for the feedforward algorithm using scalar variables.
See slide 5
And since we are going to depict the neural net again in this upright representation, let’s flip around the equations so that they match the operations in the neural net.
See slide 6
And now, let’s first just look at the upper part of the neural net in more detail.
See slide 7
And therefore, let’s space out the respective functions so that we have more room.
See slide 8
So, those are now the functions that we need to rewrite using scalar variables.
So, to calculate the incoming value for the node on the far left in the output layer (oin 1), we need to multiply the output value of the node on the left in the hidden layer (hout 1) with the value of the weight that connects those two nodes (w2-11).
Side note: The “2” in w2-11 is just my way of indicating that this weight belongs to W2. So, it’s the weight of W2 that goes from node 1 to node 1 (and not the weight of W1 that goes from node 1 to node 1 – which we will be dealing with later).
See slide 9
To that, we then add hout 2 multiplied with w2-21.
See slide 10
So, that’s how we calculate the incoming value for output node 1. And in a similar fashion, we can calculate the incoming values for the other two output nodes.
See slide 11
Then, in the next step, to calculate the output of those nodes, we simply put the respective oin into our sigmoid function.
See slide 12
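Written out as plain Python with scalar variables, these calculations could look like this. All the values are made up just to make the structure visible:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# made-up values for the hidden layer outputs and the weights of W2
h_out_1, h_out_2 = 0.5, 0.8
w2_11, w2_21 = 0.2, 0.3   # weights going into output node 1
w2_12, w2_22 = 0.4, 0.5   # weights going into output node 2
w2_13, w2_23 = 0.6, 0.7   # weights going into output node 3

# incoming values of the three output nodes
o_in_1 = h_out_1 * w2_11 + h_out_2 * w2_21
o_in_2 = h_out_1 * w2_12 + h_out_2 * w2_22
o_in_3 = h_out_1 * w2_13 + h_out_2 * w2_23

# outputs of the three output nodes
o_out_1 = sigmoid(o_in_1)
o_out_2 = sigmoid(o_in_2)
o_out_3 = sigmoid(o_in_3)
```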
And those equations so far only consider one example. So, if we want to consider for instance two examples, then we need to do all of these calculations again for the second example.
See slide 13
And to indicate that these are actually different examples, we are going to use a superscript.
See slide 14
So, for example, to calculate oin 3 for the second example, we need to use hout 1 and hout 2 of the second example. The weights, however, are the same. So, you could imagine for example that we run our neural net twice to get the outputs for both examples.
And now, to see that the matrix and scalar notations are actually equivalent, let’s quickly take a look again at the matrix multiplication where we multiply Hout with W2 to get Oin. For the case where we have two examples, the matrices would look like this:
See slide 15
And the respective dot products that we calculate with this matrix multiplication look like this:
See slides 16-21
And, as you can see, the calculations are exactly the same as the ones that we have written down using the scalar-notation. So, the different notations are actually equivalent.
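You can also convince yourself of this equivalence numerically. The following sketch (with made-up numbers) computes Oin once as a matrix multiplication and once by writing out the individual dot products as in the scalar notation:

```python
import numpy as np

# made-up hidden outputs for two examples (one row per example)
H_out = np.array([[0.5, 0.8],
                  [0.1, 0.9]])
W2 = np.array([[0.2, 0.4, 0.6],
               [0.3, 0.5, 0.7]])

# matrix notation: one multiplication
O_in = H_out @ W2

# scalar notation: the same dot products written out by hand
O_in_scalar = np.empty((2, 3))
for e in range(2):        # examples
    for n in range(3):    # output nodes
        O_in_scalar[e, n] = (H_out[e, 0] * W2[0, n]
                             + H_out[e, 1] * W2[1, n])

print(np.allclose(O_in, O_in_scalar))  # True
```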
Okay so now, to finally calculate the MSE, we put all the oout and their respective y into the formula.
See slide 22
And if we write out the sigma that takes the sum over all the nodes n, then the formula looks like this:
See slide 23
And if we then also write out the sigma that takes the sum over all the examples e, then the formula looks like this:
See slide 24
Side note: Here, I just want to point out that if you put in the values for these variables and actually compute the MSE, then the MSE is just a regular number (as we have seen in the post where we introduced it). So, even though it’s denoted with capital letters, that is not an indication that it is a matrix (which we also denote with capital letters). It is just an abbreviation, and it looked kind of strange when I used lowercase letters. That’s why I am using capital letters. One easy way to remember the difference is that the matrices consist of only one letter, whereas the abbreviation consists of several.
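In code, the written-out double sum and the compact matrix version give the same single number. Here is a sketch with made-up outputs and labels, assuming the MSE is the mean of the squared errors over all examples and nodes (check your own definition for the exact constant):

```python
import numpy as np

# made-up outputs and labels for 2 examples and 3 output nodes
O_out = np.array([[0.6, 0.4, 0.7],
                  [0.3, 0.9, 0.5]])
Y = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

n_examples, n_nodes = O_out.shape

# written-out double sum over the examples e and the nodes n
total = 0.0
for e in range(n_examples):
    for n in range(n_nodes):
        total += (Y[e, n] - O_out[e, n]) ** 2
mse_loops = total / (n_examples * n_nodes)

# the same result as one vectorized expression
mse_vec = np.mean((Y - O_out) ** 2)
```

Both variants produce a plain scalar, which is exactly the point of the side note above.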
Multivariable Chain Rule
So, this is now the function for which we want to determine the partial derivative w.r.t. w2-11, w2-12, w2-21 and so on.
See slide 25
And just as a reminder, we want to do that because those are all the weights of W2.
See slide 26
So, if we know how to determine those partial derivatives, then we should be able to transfer that knowledge over to working with matrices again. And therefore, we should be able to determine the partial derivative of MSE w.r.t. W2 which was our initial goal.
See slide 27
Okay, so let’s start with the partial of MSE w.r.t. w2-11 which tells us how the MSE will change if we slightly increase this weight.
See slide 28
And just like before when we were dealing with matrices, we can use the chain rule to determine this partial derivative because we have intermediate steps where we put the result of one function into the next function.
But this time, because there are multiple variables in the MSE function and not just one, we have to use a special case of the chain rule, namely the multivariable chain rule.
See slide 29
So, suppose you have a function z that is dependent on two variables, x and y. And those variables, in turn, depend on other variables, u and v in this case. Then, you can calculate the partial derivative of z w.r.t. u and w.r.t. v like this:
See slide 30
This seems to be complicated at first sight, but there is an easy way to remember and work with this rule. Namely, you should visualize the dependencies of the functions with a tree diagram.
See slide 31
So, z is dependent on x and y (i.e. it is a function of x and y). And x and y, in turn, are both dependent on u and v.
And then, if we want to determine, for example the partial derivative of z w.r.t. u, we simply have to consider the two paths that lead from z to u.
See slide 32
So, we first multiply the partial derivative of z w.r.t. x with the partial derivative of x w.r.t. u. And then, we add the other path to that. So, we add the partial derivative of z w.r.t. y multiplied with the partial derivative of y w.r.t. u.
And even intuitively, this rule makes sense because if we want to take the partial derivative of z w.r.t. u, then what we want to know is: how does z change when we slightly increase u? And since u affects z over those two paths, it makes sense that we have to add them up.
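You can check the rule numerically. In this sketch, x(u, v), y(u, v) and z(x, y) are made-up example functions (not from our neural net); we compare the two-path chain rule result against a finite-difference approximation:

```python
# hypothetical example functions for the multivariable chain rule
def x(u, v): return u + 2 * v
def y(u, v): return u * v
def z(x_val, y_val): return x_val ** 2 + 3 * y_val

u0, v0 = 1.5, -0.7

# the two paths from z down to u
dz_dx = 2 * x(u0, v0)   # partial of z w.r.t. x
dx_du = 1               # partial of x w.r.t. u
dz_dy = 3               # partial of z w.r.t. y
dy_du = v0              # partial of y w.r.t. u

dz_du = dz_dx * dx_du + dz_dy * dy_du

# finite-difference approximation of the same derivative
h = 1e-6
dz_du_num = (z(x(u0 + h, v0), y(u0 + h, v0))
             - z(x(u0, v0), y(u0, v0))) / h

print(abs(dz_du - dz_du_num) < 1e-4)  # True
```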
So, that’s how the multivariable chain rule works and now let’s apply it to our partial derivative of MSE w.r.t. w2-11. Therefore, let’s also create such a tree diagram for our scalar-notation functions.
So first, as you can see, the MSE is dependent on oout 1, oout 2 and oout 3 of the first example and of the second example.
See slide 33
And that’s because the labels y are given. So, we can’t influence them by changing our weights.
Then, in the next step, those oout are dependent on their respective oin.
See slide 34
And then, oin 1 for both examples is dependent on w2-11 and w2-21.
See slide 35
And that’s because, right now, we are only looking at the upper part of the neural net.
See slide 36
So, we consider the hout as given since we don’t make any adjustments to the weights from W1 for now. Okay, so having said that, we can then see that oin 2 is dependent on w2-12 and w2-22.
See slide 37
And oin 3 is dependent on w2-13 and w2-23.
See slide 38
So, if we now want to determine the partial derivative of MSE w.r.t. w2-11, we have to consider those two paths:
See slide 39
So first, we have to take the partial derivative of MSE w.r.t. oout 1.
See slide 40
Then, we multiply it by the derivative of oout 1 w.r.t. oin 1.
See slide 41
And here, we use the normal derivative because oout 1 depends only on one variable. And then finally, we multiply that with the partial derivative of oin 1 w.r.t. w2-11.
See slide 42
And to that whole expression, we add the second path.
See slide 43
So, this is now the final formula for calculating the partial derivative of MSE w.r.t. w2-11. And now, let’s see what the equations for those individual expressions actually look like. And this will be the topic of the next post.
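Even before we derive those individual expressions, you can already approximate this partial derivative numerically: nudge w2-11 a tiny bit and see how the MSE changes. The following sketch does exactly that, with made-up values and a hypothetical helper `mse_for` (it treats the hout as given, since we are only looking at the upper part of the net):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mse_for(W2, H_out, Y):
    # hypothetical helper: MSE of the upper part of the net,
    # with the hidden outputs H_out treated as given
    O_out = sigmoid(H_out @ W2)
    return np.mean((Y - O_out) ** 2)

# made-up values: 2 examples, 2 hidden nodes, 3 output nodes
H_out = np.array([[0.5, 0.8],
                  [0.1, 0.9]])
W2 = np.array([[0.2, 0.4, 0.6],
               [0.3, 0.5, 0.7]])
Y = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

# central difference: nudge w2-11 (entry [0, 0] of W2) up and down
h = 1e-6
W2_plus, W2_minus = W2.copy(), W2.copy()
W2_plus[0, 0] += h
W2_minus[0, 0] -= h
grad_w2_11 = (mse_for(W2_plus, H_out, Y) - mse_for(W2_minus, H_out, Y)) / (2 * h)
```

Once we have derived the analytic formula, a check like this is a handy way to verify that the implementation of the backpropagation algorithm is correct.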