Basics of Deep Learning p.11 - Backpropagation explained Step by Step cont'd

1/21/2020

This post is part of a series:

Part 1: Introduction
Part 2: Feedforward Algorithm explained
Part 3: Implementing the Feedforward Algorithm in pure Python
Part 4: Implementing the Feedforward Algorithm in pure Python cont'd
Part 5: Implementing the Feedforward Algorithm with NumPy
Part 6: Backpropagation explained - Cost Function and Derivatives
Part 7: Backpropagation explained - Gradient Descent and Partial Derivatives
Part 8: Backpropagation explained - Chain Rule and Activation Function
Part 9: Backpropagation explained Step by Step
Part 10: Backpropagation explained Step by Step cont'd
Part 11: Backpropagation explained Step by Step cont'd
Part 12: Backpropagation explained Step by Step cont'd
Part 13: Implementing the Backpropagation Algorithm with NumPy
Part 14: How to train a Neural Net

Here are the corresponding slides for this post:

basics_of_dl_11.pdf

In the previous post, we left off after writing down the scalar-notation formula for the partial derivatives of MSE with respect to (w.r.t.) all weights in W₂.

See slide 1
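For reference, here is a reconstruction of that scalar-notation formula, pieced together from how this post describes its three parts (the slide remains the authoritative version). The index f stands for the hidden layer node and n for the output layer node, as in the text further down. Writing the weight as w^{(2)}_{fn} is a notational assumption.

```latex
\frac{\partial MSE}{\partial w^{(2)}_{fn}}
  = \frac{1}{N} \sum_{e=1}^{N}
    \underbrace{\left( o\_out_n^{(e)} - y_n^{(e)} \right)}_{\text{1. step (cost function)}}
    \cdot
    \underbrace{o\_out_n^{(e)} \left( 1 - o\_out_n^{(e)} \right)}_{\text{2. step (sigmoid)}}
    \cdot
    \underbrace{h\_out_f^{(e)}}_{\text{3. step (dot product)}}
```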

So now, we want to think about how we can transfer this equation back to the context of using matrices.

Transfer from Scalar-Notation to Matrix-Notation

To do that, let’s go back to this overview diagram, a version of which we have already seen in one of the earlier posts:

See slide 2

And let’s also quickly go through the feedforward algorithm again. So first, we multiply the input matrix X with W₁ to get H_in. And thereby we move from the inputs to the hidden layer inputs.

See slide 3

Then, to move through the nodes from the hidden layer inputs to the hidden layer outputs, we element-wise apply the sigmoid function to H_in to get H_out.

See slide 4

After that, we multiply H_out with W₂ to get O_in and move from the hidden layer outputs to the output layer inputs.

See slide 5

And then again, we move through the nodes by element-wise applying the sigmoid function to O_in to get O_out.

See slide 6

So that’s the feedforward algorithm. And then, to see how good the neural net currently is at making predictions, we put the output layer outputs together with the labels into the MSE function.

See slide 7

And in this case, we then get an MSE of 0.138.

See slide 8
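The feedforward pass and the cost calculation recapped above could be sketched in NumPy roughly like this. The number of examples (4), the number of input features (4), the random weights and the one-hot labels are made-up placeholders and not the values from the slides, so the printed MSE will not be 0.138. The cost is assumed to carry a factor of 1/(2N), which is consistent with the first chain-rule step being simply "o_out minus y".

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

N = 4                            # hypothetical number of examples
X = rng.random((N, 4))           # hypothetical inputs: 4 examples x 4 features
W1 = rng.normal(size=(4, 2))     # inputs -> 2 hidden nodes
W2 = rng.normal(size=(2, 3))     # 2 hidden nodes -> 3 output nodes
Y = np.eye(3)[[0, 1, 2, 0]]      # hypothetical one-hot labels, shape (4, 3)

H_in = X @ W1                    # inputs -> hidden layer inputs
H_out = sigmoid(H_in)            # hidden layer inputs -> hidden layer outputs
O_in = H_out @ W2                # hidden layer outputs -> output layer inputs
O_out = sigmoid(O_in)            # output layer inputs -> output layer outputs

# Assumed cost: squared errors summed over nodes and examples, divided by 2N
mse = np.sum((O_out - Y) ** 2) / (2 * N)
print(mse)
```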

And now, we want to know how we should adjust our W₂ (during the backpropagation algorithm) so that the MSE decreases and the neural net gets somewhat better at making decisions. So, we want to know the partial derivative of MSE w.r.t. W₂.

See slide 9

Therefore, we are going to look at the scalar-notation equation, which tells us how to determine the partial derivatives of MSE w.r.t. all the individual weights in W₂.

See slide 10

So now, we simply have to look at what we actually want to calculate in this equation and then see how that translates to dealing with the matrices.

See slide 11

Before we do that, however, I want to rearrange the matrix-notation formula a little bit. That’s because we are not going to calculate it in just one step like the formula suggests. Instead, we are going to calculate it one step at a time (similar to what we did during the feedforward algorithm).

So, the first step or matrix we are going to call O_error.

See slide 12

The second step is then called O_delta.

See slide 13

And then finally, in the last step, we determine W_2-update, i.e. the values that we use to execute one gradient descent step for W₂.

See slide 14

And we can do the calculations in this way because, in the original formula, we are just multiplying together the different expressions. So, we can also just simply do one multiplication at a time, instead of doing them all at once.

So, those are now the individual steps that we want to understand using the scalar-notation equation.

See slide 15

And to better see which part of the scalar-notation equation represents which step in the matrix-notation formulas, let’s colorize them accordingly.

See slide 16

1. Step of the Chain Rule: Output-Error-Matrix

Okay, so now let’s see what we actually want to calculate in the scalar-notation equation and how we can transfer it to dealing with matrices.

And as already said in the previous post, what we do in the scalar-notation equation is that we first multiply together all the “different-colored” expressions (which represent, generally speaking, the derivatives of the cost function, the sigmoid function and the dot product). And only then, once we have done that, we sum up over all examples e and then divide by N. And this is also what we are going to do with the matrices.

So first, let’s just look at the blue-colored parts which represent the first step of the chain rule.

And just as a reminder, in the scalar-notation equation, the “o_out _n^(e)” represents one specific element from the output layer outputs matrix O_out. Namely, the element of node n and example e. So, for instance, o_out ₁⁽³⁾ is the element in O_out in the third row and first column.

See slide 17

And o_out ₃⁽²⁾ is the element in O_out in the second row and third column.

See slide 18

The same logic applies to “y_n^(e)” and the label matrix Y.
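To make the index convention concrete, here is a small sketch. The matrix holds placeholder numbers (not the values from the slides), and the helper function `o_out` is hypothetical. It only translates the post's 1-based "node n, example e" notation into NumPy's 0-based "row, column" indexing.

```python
import numpy as np

# Placeholder matrix: 4 examples (rows) x 3 output nodes (columns)
O_out = np.arange(1, 13).reshape(4, 3)

def o_out(n, e):
    """Return o_out of node n and example e (both 1-based, as in the post)."""
    return O_out[e - 1, n - 1]   # example e -> row, node n -> column

print(o_out(1, 3))  # third row, first column  -> 7
print(o_out(3, 2))  # second row, third column -> 6
```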

So, what we want to do in the scalar-notation equation, is to simply subtract a specific y from the respective o_out. So, for example, from o_out ₁⁽¹⁾ we want to subtract y₁⁽¹⁾. And what that means in terms of the matrices is that from each element in O_out we want to subtract the respective element in Y. So, for example, from the element in row 1 and column 1 in O_out, we want to subtract the element from row 1 and column 1 in Y.

See slide 19

So, in terms of the matrix-notation what we want to do, is to simply element-wise subtract the label matrix from the output layer outputs matrix.

See slide 20

So, let’s calculate O_error.

See slide 21
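In NumPy, this first step is a single element-wise subtraction. The matrices below are illustrative placeholders. Only the entry in row 1 and column 1 (0.62 with a label of 1) is chosen to match the value of -0.38 that appears later in this post.

```python
import numpy as np

# Placeholder output layer outputs: 4 examples x 3 output nodes
O_out = np.array([[0.62, 0.30, 0.45],
                  [0.20, 0.71, 0.40],
                  [0.35, 0.25, 0.80],
                  [0.55, 0.48, 0.33]])

# Placeholder one-hot labels with the same shape as O_out
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

# 1. step of the chain rule: element-wise subtraction
O_error = O_out - Y

print(O_error[0, 0])  # approximately -0.38
```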

And by executing this calculation, we, so to speak, move from this point in the neural net:

See slide 22

To this point in the neural net:

See slide 23

So, just to clarify: During the feedforward algorithm, we call this point in the neural net O_out. And during the backpropagation algorithm, we call this point in the neural net O_error.

2. Step of the Chain Rule

So now, let’s look at the orange-colored parts of the equations which represent the second step of the chain rule. To better understand what is going on in this step, let’s rewrite the blue expression in the scalar-notation equation as o_error _n^(e).

See slide 24

So, what we want to do in this step is similar to what we did in the first step. Namely, we want to multiply a specific element of our O_error matrix with the respective element in the O_out matrix. And then, we multiply that with “1 minus the respective element in the O_out matrix”.

So, for example, if we take the element from row 1 and column 1 again, then we want to calculate “-0.38 * 0.62 * (1 – 0.62)”.

See slide 25

So, in terms of the matrix-notation what we want to do, is an element-wise multiplication between those matrices.

See slide 26

And, as you can see, the element-wise matrix multiplication has a special symbol (a dot with a circle around it). So, it’s not a regular matrix multiplication where we calculate dot products. When doing an element-wise matrix multiplication, you really just multiply together the respective elements of the matrices (which means that they have to have the same dimensions).

Side note: Here, I have to mention that the “1”, in the equation for calculating O_delta, is supposed to represent a matrix which has the same dimensions as O_out and whose elements are all of value “1”. This is maybe not the most precise form of notation, but I think for our purposes it will do.

So, let’s calculate O_delta.

See slide 27
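A possible NumPy version of this second step is sketched below, again with placeholder matrices in which only the top-left entry mirrors the "-0.38 * 0.62 * (1 – 0.62)" example. In NumPy, `*` is the element-wise product (while `@` would be the regular matrix multiplication), and the scalar `1` is broadcast to the shape of O_out, so the matrix of ones from the side note does not have to be built explicitly.

```python
import numpy as np

# Placeholder matrices: 4 examples x 3 output nodes
O_out = np.array([[0.62, 0.30, 0.45],
                  [0.20, 0.71, 0.40],
                  [0.35, 0.25, 0.80],
                  [0.55, 0.48, 0.33]])
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

O_error = O_out - Y                       # 1. step of the chain rule

# 2. step of the chain rule: element-wise products only
O_delta = O_error * O_out * (1 - O_out)

# Same calculation with the "1" written out as a matrix of ones
ones = np.ones_like(O_out)
O_delta_explicit = O_error * O_out * (ones - O_out)

print(O_delta[0, 0])  # -0.38 * 0.62 * 0.38, approximately -0.0895
```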

And by executing this calculation, we, so to speak, move from this point in the neural net:

See slide 28

Back through the nodes to this point in the neural net:

See slide 29

So, during the feedforward algorithm, we call this point in the neural net O_in. And during the backpropagation algorithm, we call this point in the neural net O_delta.

3. Step of the Chain Rule

So now, let’s look at the green-colored parts of the equations which represent the third step of the chain rule. To again better understand what is going on in this step, let’s rewrite the scalar-notation equation so that it now includes o_delta _n^(e).

See slide 30

So, what we want to do here is, for a particular example e, multiply o_delta of node n with h_out of node f. Which node n of o_delta and which node f of h_out we use depends on the weight w.r.t. which we want to determine the partial derivative of MSE. In other words, it depends on which hidden layer node f the weight is coming from and which output layer node n it is going to.

So, for example, let’s say we want to determine the partial derivative of MSE w.r.t. w_2-11.

See slide 31

So, in the neural net, it is the weight going from node 1 in the hidden layer to node 1 in the output layer.

See slide 32

In that case, if we plug those indices into the equation, we can see that we want to multiply o_delta ₁ with h_out ₁.

See slide 33

And we want to do that for all examples e, respectively. So, we want to element-wise multiply column 1 of O_delta and column 1 of H_out.

See slide 34

So, this is what we basically do at this third step. But then, let’s not forget what we said about the scalar-notation equation when we started with the first step of the chain rule.

See slide 35

Namely, what we do in this equation, is that we first multiply together all the “different-colored” expressions. And then, once we have done that (which we now have after the third step), we sum up over all examples e. And then, we divide that sum by N.

So, after we have multiplied together the respective elements of column 1 of O_delta and column 1 of H_out, we want to add up those products. So, basically what we want to do, is to calculate the dot product of those two columns.

See slide 36

And the result of that dot product, we then divide by N. And this will give us the value with which we should update w_2-11.

See slide 37
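For a single weight, this calculation could look as follows in NumPy. The values of O_delta and H_out are invented for illustration and are not taken from the slides. The variable name `grad_w2_11` is likewise just a label chosen for this sketch.

```python
import numpy as np

# Placeholder O_delta: 4 examples x 3 output nodes
O_delta = np.array([[-0.09,  0.06,  0.11],
                    [ 0.03, -0.06,  0.10],
                    [ 0.08,  0.05, -0.03],
                    [-0.11,  0.12,  0.07]])

# Placeholder H_out: 4 examples x 2 hidden nodes
H_out = np.array([[0.6, 0.4],
                  [0.5, 0.7],
                  [0.3, 0.8],
                  [0.9, 0.2]])

N = O_delta.shape[0]  # number of examples

# Weight from hidden node 1 to output node 1:
# dot product of column 1 of O_delta and column 1 of H_out, divided by N
grad_w2_11 = np.dot(O_delta[:, 0], H_out[:, 0]) / N

print(grad_w2_11)  # (-0.054 + 0.015 + 0.024 - 0.099) / 4 = -0.0285
```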

And the same kind of logic applies to the other weights of W₂, respectively.

See slides 38-42

So basically, to determine W_2-update, we want to calculate the dot product of each column in O_delta with each column in H_out. And this looks very similar to a matrix multiplication, right?

So, what can we do so that it actually becomes a true matrix multiplication (so that we then can take advantage of the speed of NumPy in our code)?

Well, we can simply take the transpose of O_delta (so that the rows become the columns).

See slide 43

And now, we can simply multiply O_delta^T with H_out to calculate all those dot products that we want.

See slide 44

And then, we divide each element of the resulting matrix by N to get W_2-update.

See slide 45
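The following sketch checks, on invented placeholder values, that one matrix multiplication with the transposed O_delta really yields all of those column-by-column dot products at once. The variable name `product` is only a label for this intermediate result.

```python
import numpy as np

# Placeholder matrices: 4 examples, 3 output nodes, 2 hidden nodes
O_delta = np.array([[-0.09,  0.06,  0.11],
                    [ 0.03, -0.06,  0.10],
                    [ 0.08,  0.05, -0.03],
                    [-0.11,  0.12,  0.07]])
H_out = np.array([[0.6, 0.4],
                  [0.5, 0.7],
                  [0.3, 0.8],
                  [0.9, 0.2]])
N = O_delta.shape[0]

# (3 x 4) @ (4 x 2) -> (3 x 2), then divide every element by N
product = O_delta.T @ H_out / N

# Entry [n, f] equals the dot product of column n of O_delta
# and column f of H_out, divided by N
for n in range(O_delta.shape[1]):
    for f in range(H_out.shape[1]):
        single = np.dot(O_delta[:, n], H_out[:, f]) / N
        assert np.isclose(product[n, f], single)

print(product.shape)  # (3, 2)
```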

So, this is how we can transfer the knowledge from the scalar-notation equation over to dealing with matrices again.

There is, however, a small problem with these calculations. Namely, the result of multiplying O_delta^T with H_out would be a 3x2 matrix.

See slide 46

And it would look like this:

See slide 47

So, we can’t simply subtract it from our W₂ (which is a 2x3 matrix), even though that is what we want to do to execute one gradient descent step.

See slide 48

To be able to do that we would have to again take the transpose of this W_2-update matrix.

See slide 49

This matrix we can element-wise subtract from W₂ and thereby update the respective weights. So, to calculate W_2-update, we have to take the transpose of the multiplication of O_delta^T with H_out.

See slide 50

So, this is now finally how we can determine W_2-update. But, as you can see, there are two transposes, which makes it kind of complicated. So, let’s simplify it somewhat.

Namely, instead of taking the transpose of O_delta, we can take the transpose of H_out.

See slide 51

And now, we multiply H_out^T with O_delta (instead of doing it the other way around).

See slide 52

This way, we are calculating the exact same dot products as before. But the resulting matrix is already in the correct shape. So, we only have to take one transpose which is computationally more efficient (and also a little less complicated to remember).

So, that’s now how we are going to determine W_2-update.

See slide 53

So, let’s calculate it.

See slide 54
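Here is a short sketch of that final formula, using invented placeholder values. It also checks that multiplying H_out^T with O_delta gives the same numbers as the earlier two-transpose version, which follows from the general rule that the transpose of "A^T times B" equals "B^T times A".

```python
import numpy as np

# Placeholder matrices: 4 examples, 3 output nodes, 2 hidden nodes
O_delta = np.array([[-0.09,  0.06,  0.11],
                    [ 0.03, -0.06,  0.10],
                    [ 0.08,  0.05, -0.03],
                    [-0.11,  0.12,  0.07]])
H_out = np.array([[0.6, 0.4],
                  [0.5, 0.7],
                  [0.3, 0.8],
                  [0.9, 0.2]])
N = O_delta.shape[0]

# Version with two transposes: ((3 x 4) @ (4 x 2)).T -> (2 x 3)
W2_update_two_transposes = (O_delta.T @ H_out).T / N

# Simplified version with one transpose: (2 x 4) @ (4 x 3) -> (2 x 3)
W2_update = H_out.T @ O_delta / N

print(W2_update.shape)                                   # (2, 3), same shape as W2
print(np.allclose(W2_update, W2_update_two_transposes))  # True
```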

And to execute one gradient descent step for W₂, we would then multiply W_2-update with our learning rate and subtract the result from our initial W₂.

See slide 55

And thereby we should be able to reduce the MSE somewhat.
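To see this effect on a toy example, here is a minimal sketch of one gradient descent step for W₂ only. All numbers (H_out, W₂, the labels and the learning rate of 0.1) are assumptions for illustration, and the cost is again assumed to be defined with the factor 1/(2N). The helper names `mse` and `forward` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def mse(O_out, Y):
    """Assumed cost: squared errors summed over nodes and examples, divided by 2N."""
    return np.sum((O_out - Y) ** 2) / (2 * len(Y))

# Placeholder hidden layer outputs (4 examples x 2 hidden nodes), weights and labels
H_out = np.array([[0.6, 0.4],
                  [0.5, 0.7],
                  [0.3, 0.8],
                  [0.9, 0.2]])
W2 = np.array([[0.2, -0.4,  0.1],
               [0.5,  0.3, -0.6]])
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
N = len(Y)

def forward(W2):
    """Second half of the feedforward pass: H_out -> O_out."""
    return sigmoid(H_out @ W2)

# The three steps of the chain rule
O_out = forward(W2)
O_error = O_out - Y
O_delta = O_error * O_out * (1 - O_out)
W2_update = H_out.T @ O_delta / N

# One gradient descent step with a hypothetical learning rate
learning_rate = 0.1
W2_new = W2 - learning_rate * W2_update

mse_before = mse(O_out, Y)
mse_after = mse(forward(W2_new), Y)
print(mse_before, mse_after)  # the second value should be slightly smaller
```

With such a small learning rate, the decrease per step is tiny. Many such steps (and updating W₁ as well) are needed for the net to actually improve.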

But, obviously, for each gradient descent step we have to update both of our weight matrices simultaneously. So, let’s now have a look at how we determine the equation for the partial derivative of MSE w.r.t. W₁.

See slide 56

And this will be the topic of the next post.
