Basics of Deep Learning p.10 - Backpropagation explained Step by Step cont'd

1/21/2020

This post is part of a series:

Part 1: Introduction
Part 2: Feedforward Algorithm explained
Part 3: Implementing the Feedforward Algorithm in pure Python
Part 4: Implementing the Feedforward Algorithm in pure Python cont'd
Part 5: Implementing the Feedforward Algorithm with NumPy
Part 6: Backpropagation explained - Cost Function and Derivatives
Part 7: Backpropagation explained - Gradient Descent and Partial Derivatives
Part 8: Backpropagation explained - Chain Rule and Activation Function
Part 9: Backpropagation explained Step by Step
Part 10: Backpropagation explained Step by Step cont'd
Part 11: Backpropagation explained Step by Step cont'd
Part 12: Backpropagation explained Step by Step cont'd
Part 13: Implementing the Backpropagation Algorithm with NumPy
Part 14: How to train a Neural Net

Here are the corresponding slides for this post:

basics_of_dl_10.pdf

In the previous post, we left off at the point where we had written down how to determine the partial derivative of MSE with respect to (w.r.t.) w_2-11.

See slide 1

And now, we want to see what the equations for all those individual expressions actually look like.

Derivative of the Cost Function

So, first off, we are going to determine the equation for the partial derivative of MSE w.r.t. o_out ₁⁽¹⁾.

See slide 2

Or more generally speaking, we are going to determine the derivative of our cost function.

See slide 3

And here, to make things easier to understand, we multiply out the bracket. So, we multiply “1/N” with each expression in the bracket.

See slide 4

And now, looking at this function, we can see that if we want to take the partial derivative of it w.r.t. o_out ₁⁽¹⁾, then all the other variables are treated like a constant. So, their derivatives are going to be zero. So, we actually only need to focus on the first expression to determine the partial derivative of MSE w.r.t. o_out ₁⁽¹⁾.

See slide 5

And here, since there is an outer function (the squared) and an inner function (the expression in the parentheses), we have to again use the chain rule. So, to make it really clear how that works, let’s break down this expression into two functions.

See slide 6

So, function “f” represents the inner function and function “g” represents the outer function. And now, to use the chain rule we need the derivatives of those two functions.

See slide 7

And they are easy to determine.

See slide 8

And because there is this “2/N” in the derivative of g, you will sometimes see the MSE defined with “1/2N” instead of “1/N”.

See slide 9

And that’s because this way, there is then this “2/2N” in the derivative of g.

See slide 10

So, the 2 cancels out and we are left with “1/N”.

See slide 11

This simply makes the derivative look somewhat nicer. So, from now on, we are going to define the MSE with “1/2N”. And we can do that, by the way, because it doesn’t change where the MSE has its minimum. It only halves the value of the MSE, and with it the value of its derivative.

Okay so now, to determine the derivative of g w.r.t. o_out ₁⁽¹⁾, according to the chain rule, we have to multiply the derivatives together.

See slide 12

And now, we can simply replace f with its actual equation to get our final function.

See slide 13

So, this is what the partial derivative of MSE w.r.t. o_out ₁⁽¹⁾ looks like, or more generally speaking, what the derivative of our cost function looks like.

See slide 14
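
To make this result concrete, here is a minimal sketch in pure Python that compares the formula with a numerical estimate of the slope. It rests on a few assumptions that are not spelled out in the text above: the error term is written as (o_out - y), N is the number of examples, and the MSE uses the “1/2N” convention. Under those assumptions, the partial derivative of MSE w.r.t. o_out ₁⁽¹⁾ is (o_out ₁⁽¹⁾ - y ₁⁽¹⁾) / N. All numbers and function names are made up for illustration.

```python
def mse(o_out, y):
    """MSE with the 1/(2N) convention.

    o_out and y are lists of examples; each example is a list with one
    value per output node. N is assumed to be the number of examples.
    """
    n_examples = len(o_out)
    total = 0.0
    for o_row, y_row in zip(o_out, y):
        for o, t in zip(o_row, y_row):
            total += (o - t) ** 2
    return total / (2 * n_examples)


def d_mse_d_o_out(o_out, y, e, n):
    """Analytic partial derivative w.r.t. output node n of example e."""
    return (o_out[e][n] - y[e][n]) / len(o_out)


# Made-up values: 2 examples, 3 output nodes each.
o_out = [[0.8, 0.3, 0.1], [0.2, 0.6, 0.4]]
y = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Index 0 stands for "example 1" and "node 1" (0-based indexing).
analytic = d_mse_d_o_out(o_out, y, e=0, n=0)

# Independent check: central finite difference on the MSE itself.
eps = 1e-6
plus = [row[:] for row in o_out]
minus = [row[:] for row in o_out]
plus[0][0] += eps
minus[0][0] -= eps
numeric = (mse(plus, y) - mse(minus, y)) / (2 * eps)

print(analytic, numeric)
```

Both values should agree up to rounding, which is what we expect if all the other terms of the MSE really do drop out of the derivative.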

Derivative of the Activation Function

So now, we can tackle the second expression: the derivative of o_out ₁⁽¹⁾ w.r.t. o_in ₁⁽¹⁾.

See slide 15

Or more generally speaking, we are going to determine the derivative of our activation function, the sigmoid function. And oftentimes, you will see that the sigmoid function is denoted with the lowercase Greek letter “sigma”. So, that is what we are also going to do here:

See slide 16

Side note: Here, I use x as the variable, just for convenience. That way, I don’t have to write o_in ₁⁽¹⁾ all the time and make it more confusing looking than it needs to be.

Okay so, for this function, we now want to determine the derivative.

See slide 17

And this might seem to be pretty complicated, but as it turns out, by doing some clever algebraic manipulations, the derivative of the sigmoid function is actually very simple. So, let’s see why.

First, we rewrite the function itself like this:

See slide 18

This means the same thing. And now to determine the derivative, we again have to use the chain rule and multiply the derivative of the outer function with the derivative of the inner function.

So, for the derivative of the outer function we bring down the minus one, multiply it with the expression in the parentheses and reduce the exponent by one.

See slide 19

And this we multiply with the derivative of the inner function which looks like this:

See slide 20

And then, we can multiply the “-1” with “-e^-x”.

See slide 21

This, we can rewrite in the same way as we did before with the sigmoid function itself, but just the other way around.

See slide 22

Here, we can, so to speak, get rid of the square by writing the function like this:

See slide 23

And now, we do our clever manipulation because for the second expression we are going to add a 1 and subtract a 1.

See slide 24

This doesn’t change anything about the expression but now we can rewrite it like this:

See slide 25

And here, the first expression in the parentheses is actually just one.

See slide 26

And the other two expressions left in this function look exactly like the sigmoid function itself.

See slide 27
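
For readers without the slides at hand, the steps described above can be written out as follows. This is my own transcription of the derivation from the text, so the exact grouping of the terms may differ from what the slides show.

```latex
\begin{align*}
\sigma(x) &= \frac{1}{1 + e^{-x}} = \left(1 + e^{-x}\right)^{-1} \\
\sigma'(x) &= -1 \cdot \left(1 + e^{-x}\right)^{-2} \cdot \left(-e^{-x}\right) \\
&= \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}} \\
&= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} \\
&= \frac{1}{1 + e^{-x}} \cdot \frac{1 + e^{-x} - 1}{1 + e^{-x}} \\
&= \frac{1}{1 + e^{-x}} \cdot \left(\frac{1 + e^{-x}}{1 + e^{-x}} - \frac{1}{1 + e^{-x}}\right) \\
&= \sigma(x) \cdot \bigl(1 - \sigma(x)\bigr)
\end{align*}
```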

So, we can calculate the derivative of the sigmoid function (i.e. its slope) by using the output of the sigmoid function itself.

So, for example, let’s say x is equal to zero.

See slide 28

If we now want to know the derivative or slope of the function at that point, then we first simply calculate the output of the sigmoid function at that point.

See slide 29

And then, we use that value to calculate the derivative.

See slide 30

Side note: The value of 0.25, by the way, is actually the biggest slope for this function. At every other point, the slope is smaller than this.
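
The example with x equal to zero can be reproduced with a few lines of pure Python. This is just a sketch: the function names are my own, and the finite-difference estimate is only there as an independent check of the formula.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative_from_output(s):
    """Slope of the sigmoid, given its output s = sigmoid(x)."""
    return s * (1.0 - s)


x = 0.0
s = sigmoid(x)                             # output of the sigmoid at x = 0
slope = sigmoid_derivative_from_output(s)  # slope calculated from that output

# Independent check: central finite difference around x = 0.
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

# Slopes at a few other (arbitrarily chosen) points, for comparison.
other_slopes = [
    sigmoid_derivative_from_output(sigmoid(v)) for v in (-4.0, -1.0, 1.0, 4.0)
]

print(s, slope, numeric, other_slopes)
```

The output of the sigmoid at zero is 0.5, so the slope is 0.5 * (1 - 0.5) = 0.25, and the slopes at the other points come out smaller than that.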

And the fact that the derivative of the sigmoid function can be calculated with the output of the sigmoid function comes in pretty handy. Because during the feedforward algorithm we actually calculate the output of the sigmoid function.

See slide 31

So, in order to determine the equation for the derivative of o_out ₁⁽¹⁾ w.r.t. o_in ₁⁽¹⁾, we simply need to make use of o_out ₁⁽¹⁾.

See slide 32

Derivative of the Dot Product

And with that, we can now get to the third expression, namely the derivative of o_in ₁⁽¹⁾ w.r.t. w_2-11.

See slide 33

Or more generally speaking, we are going to see what the derivative of the dot product looks like.

See slide 34

And this one is pretty easy because we simply have to derive this function w.r.t. w_2-11. So, we treat w_2-21 as a constant and the derivative is simply h_out ₁⁽¹⁾.

See slide 35

So, we simply have to multiply the expressions that we already have by h_out ₁⁽¹⁾.

See slide 36

And now, before we write out the remaining expressions of the partial derivative of MSE w.r.t. w_2-11, I would like to point out a certain pattern with regards to determining the partial derivative of the dot product.

Namely, if we determine the partial derivative of o_in ₁⁽¹⁾ w.r.t. w_2-21, then we can see that it is simply h_out ₂⁽¹⁾.

See slide 37

And, as we will see later on, in order to know how we should update the weights in W₁, we will also have to determine the partial derivative of o_in ₁⁽¹⁾ w.r.t. h_out ₁⁽¹⁾ and h_out ₂⁽¹⁾. And these derivatives are simply w_2-11 and w_2-21.

See slide 38

And the pattern that I wanted to point out here is that the partial derivative of the dot product w.r.t. any of those variables is always, so to speak, the “opposite, corresponding element”.

So, for example, if we take the partial derivative w.r.t. a weight, then the partial derivative is an h_out. If we take the partial derivative w.r.t. an h_out, then the partial derivative is a weight. So, it is always the opposite element.

But it is not just any opposite element, it is the opposite, corresponding element. And what I mean by that is that in the formula of the dot product, we multiply for example h_out ₁⁽¹⁾ with w_2-11. So, if we then take for example the partial derivative of o_in ₁⁽¹⁾ w.r.t. w_2-11, then the partial derivative is h_out ₁⁽¹⁾ because that’s the corresponding element of w_2-11 and not h_out ₂⁽¹⁾. And the same kind of reasoning applies to all the other partial derivatives.

And this pattern of the “opposite, corresponding element” will be important when we later on deal with matrices again. So, you can keep that in mind.
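
The pattern of the “opposite, corresponding element” can be checked numerically with a small sketch. I am assuming here that the dot product for output node 1 is o_in ₁ = h_out ₁ * w_2-11 + h_out ₂ * w_2-21 (no bias term), which is what the derivatives above imply. The helper `partial` and all values are made up for illustration.

```python
def o_in_1(h_out_1, h_out_2, w_11, w_21):
    """Dot product feeding output node 1 (assumed to have no bias term)."""
    return h_out_1 * w_11 + h_out_2 * w_21


def partial(func, args, index, eps=1e-6):
    """Central finite-difference estimate of d func / d args[index]."""
    plus = list(args)
    minus = list(args)
    plus[index] += eps
    minus[index] -= eps
    return (func(*plus) - func(*minus)) / (2 * eps)


h_out_1, h_out_2 = 0.6, 0.9  # made-up outputs of the hidden layer
w_11, w_21 = 0.4, -0.3       # made-up values for w_2-11 and w_2-21
args = (h_out_1, h_out_2, w_11, w_21)

d_h_out_1 = partial(o_in_1, args, 0)  # expected: w_2-11
d_h_out_2 = partial(o_in_1, args, 1)  # expected: w_2-21
d_w_11 = partial(o_in_1, args, 2)     # expected: h_out_1
d_w_21 = partial(o_in_1, args, 3)     # expected: h_out_2

print(d_w_11, d_w_21, d_h_out_1, d_h_out_2)
```

Each estimate should match the element that the respective variable is multiplied with in the dot product, and not the other element of the same kind.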

Finalizing the Partial Derivative of our first Weight in Weight Matrix 2

Okay, so now let’s write out the remaining expressions of the partial derivative of MSE w.r.t. w_2-11.

See slide 39

And now that we know what the derivatives of the cost function, the sigmoid function and the dot product generally look like, this is pretty easy to do.

See slide 40

So, the expressions look basically the same. The only difference is that we use the elements from example 2 (so the elements with a superscript of 2).

So, this is now the final equation for determining the partial derivative of MSE w.r.t. w_2-11 for the case that we have 2 examples. And, obviously, if we had more examples, then we would have to add more of such similar-looking expressions where the only difference is the superscript. And then, it would become impractical to write them all out. So, let’s rewrite this equation in a more general way so that it is more concise.

To do that, let’s first factor out the “1/N”.

See slide 41

And now, in a way, it is not directly connected to the partial derivatives anymore and that’s why I depicted it in black.

And then, to write the summation in the brackets in a more general way, we can use sigma notation again. So, since all those terms are the same except for the superscript, which runs over the number of examples that we have, we can rewrite the bracket with sigma notation like this:

See slide 42

So, that’s what the equation for the partial derivative of MSE w.r.t. w_2-11 looks like in a more general way. And what the function does is, for each individual example, it multiplies the different expressions together. And then, in the end, once that is done, those products are added up over all the examples e. And this final sum is then divided by N.

And this order – first multiplying the partial derivatives and only then, kind of separated from that, adding up the results of those multiplications and dividing by N – is really important when we then transfer this equation back to dealing with matrices.
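
Here is a minimal sketch of that order of operations in pure Python. It assumes the “1/2N” convention, an error term written as (o_out - y), N as the number of examples, and a dot product without a bias term. The data and the function names are hypothetical. Only output node 1 is modeled, because the MSE terms of the other output nodes do not depend on w_2-11.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Made-up data for 2 examples.
h_out = [[0.6, 0.9], [0.2, 0.7]]  # h_out[e] = [h_out_1, h_out_2] of example e
y_1 = [1.0, 0.0]                  # label of output node 1 for each example
w_11, w_21 = 0.4, -0.3            # made-up values for w_2-11 and w_2-21


def o_out_1(w_11, w_21, h):
    """Feedforward step for output node 1 of a single example."""
    return sigmoid(h[0] * w_11 + h[1] * w_21)


def mse_node_1(w_11, w_21):
    """The part of the MSE (1/(2N) convention) that depends on w_2-11."""
    n_examples = len(h_out)
    total = sum((o_out_1(w_11, w_21, h) - t) ** 2 for h, t in zip(h_out, y_1))
    return total / (2 * n_examples)


def d_mse_d_w_11(w_11, w_21):
    n_examples = len(h_out)
    total = 0.0
    for h, t in zip(h_out, y_1):
        o = o_out_1(w_11, w_21, h)
        # First: multiply the three partial derivatives for this example
        # (cost function * sigmoid * dot product).
        total += (o - t) * (o * (1.0 - o)) * h[0]
    # Only then: divide the sum over all examples by N.
    return total / n_examples


analytic = d_mse_d_w_11(w_11, w_21)

# Independent check: central finite difference w.r.t. w_2-11.
eps = 1e-6
numeric = (mse_node_1(w_11 + eps, w_21) - mse_node_1(w_11 - eps, w_21)) / (2 * eps)

print(analytic, numeric)
```

If the two printed values agree up to rounding, the chain of three partial derivatives reproduces the actual slope of the MSE w.r.t. w_2-11.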

The Partial Derivatives of all the Weights in Weight Matrix 2

So, that’s how you calculate the partial derivative of MSE w.r.t. w_2-11.

See slide 43

But this is just one weight of W₂. So, let’s now determine the next one which is going to be w_2-21.

See slide 44

For that, let’s look again at the tree diagram.

See slide 45

So, here are the paths again that we needed to consider when we determined the partial derivative of MSE w.r.t. w_2-11. And now, if we want to determine the partial derivative of MSE w.r.t. w_2-21, then the paths look almost the same.

See slide 46

The only difference is at the last step where we go to w_2-21 and not w_2-11. So, the partial derivatives also look basically the same, except for the last expressions where we determine the partial derivative of o_in ₁⁽¹⁾ w.r.t. w_2-21 and the partial derivative of o_in ₁⁽²⁾ w.r.t. w_2-21, respectively.

See slide 47

So then, accordingly, the equation for the partial derivative of MSE w.r.t. w_2-21 also looks basically the same.

See slide 48

The only difference is that, at the end, we multiply with h_out ₂^(e) instead of h_out ₁^(e). And that’s because h_out ₂ is the opposite, corresponding element of w_2-21. You can also see that in the neural net.

See slide 49

The opposite, corresponding element of the weight that goes from node 2 in the hidden layer to node 1 in the output layer, is obviously node 2 of the hidden layer and not node 1 of the hidden layer. Because node 1 has nothing to do with this weight.

Okay, so that’s what the equation for the partial derivative of MSE w.r.t. w_2-21 looks like. And in a similar way, we can determine the equations of the partial derivatives of MSE w.r.t. the remaining 4 weights of W₂.

See slide 50

And, as you can see, the same pattern with h_out ₁^(e) and h_out ₂^(e) is present. Otherwise, the only difference is that we use the values of the nodes to which the respective weights lead.

So, for example, w_2-12 goes from node 1 in the previous layer to node 2 in the following layer. And w_2-22 goes from node 2 to 2.

See slide 51

And then, accordingly, we use o_out of the respective node in the formulas (and obviously also its respective label).

See slide 52

And the same reasoning applies to the weights w_2-13 and w_2-23.

So, those are now all the calculations that we need to do in order to determine the gradients for all the weights of W₂.

See slide 53

And since those formulas follow a clear pattern, we can rewrite them in a more general way where we then just have one function that represents all those individual functions.

See slide 54

And here, for the more general formula, I replaced the subscripts of w. The “f” stands for: where the weight is coming “from”. And the “n” stands for: the “node” to which the weight is going.

So, I probably also could have chosen to use “f” and “t” as the subscripts to indicate “from” and “to”. But I wanted to use “n” because it basically has the same meaning as the “n” in the MSE formula (matrix-notation). So, using the “n”, I think, will make things clearer when we transfer the formula back to dealing with matrices.

Okay, so now, let’s write out the more general formula.

See slide 55

And if you now, for example, put w_2-23 into this equation, then you will see that it looks exactly the same as the equation above the line.

So, this equation now is the scalar-notation for finding the partial derivative of MSE w.r.t. all weights in W₂.

See slide 56
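
In scalar-notation, the general formula can be sketched as three nested loops: one over “f”, one over “n” and one over the examples e. As before, this is only an illustration under my own assumptions (the “1/2N” convention, an error term written as (o_out - y), N as the number of examples, no bias term). The nested-list layout W2[f][n], the 0-based indices and all values are hypothetical. The loops are deliberately kept in plain Python so that they mirror the scalar formula one to one.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# W2[f][n]: weight going from hidden node f to output node n (0-based).
W2 = [[0.4, -0.2, 0.1],
      [-0.3, 0.5, 0.8]]
h_out = [[0.6, 0.9], [0.2, 0.7]]        # 2 examples, 2 hidden nodes
y = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 2 examples, 3 output nodes


def forward(W2, h):
    """o_out of all output nodes for a single example."""
    n_outputs = len(W2[0])
    return [
        sigmoid(sum(h[f] * W2[f][n] for f in range(len(h))))
        for n in range(n_outputs)
    ]


def mse(W2):
    """MSE with the 1/(2N) convention over all examples and output nodes."""
    total = 0.0
    for h, targets in zip(h_out, y):
        for o, t in zip(forward(W2, h), targets):
            total += (o - t) ** 2
    return total / (2 * len(h_out))


def gradient_W2(W2):
    """Partial derivative of MSE w.r.t. every weight w_2-fn."""
    grad = [[0.0] * len(W2[0]) for _ in W2]
    for f in range(len(W2)):
        for n in range(len(W2[0])):
            total = 0.0
            for h, targets in zip(h_out, y):
                o = forward(W2, h)[n]
                # cost function * sigmoid * dot product, for this example
                total += (o - targets[n]) * o * (1.0 - o) * h[f]
            grad[f][n] = total / len(h_out)
    return grad


def numeric_gradient_W2(W2, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = [[0.0] * len(W2[0]) for _ in W2]
    for f in range(len(W2)):
        for n in range(len(W2[0])):
            plus = [row[:] for row in W2]
            minus = [row[:] for row in W2]
            plus[f][n] += eps
            minus[f][n] -= eps
            grad[f][n] = (mse(plus) - mse(minus)) / (2 * eps)
    return grad


analytic = gradient_W2(W2)
numeric = numeric_gradient_W2(W2)
for row_a, row_n in zip(analytic, numeric):
    print(row_a, row_n)
```

Note that the forward pass is recomputed inside the innermost loop, which is wasteful but keeps the correspondence to the formula easy to see.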

And since our initial goal was to find the partial derivative of MSE w.r.t. W₂, we can now think about how we can transfer this equation to the context of using matrices.

See slide 56

And this will be the topic of the next post.
