This post is part of a series:
In the previous post, we left off at the point where we had written down how to determine the partial derivative of MSE with respect to (w.r.t.) w_{211}.
See slide 1
And now, we want to see what the equations for all those individual expressions actually look like.
Derivative of the Cost Function
So, first off, we are going to determine the equation for the partial derivative of MSE w.r.t. o_{out} _{1}^{(1)}.
See slide 2
Or more generally speaking, we are going to determine the derivative of our cost function.
See slide 3
And here, to make things easier to understand, we multiply out the bracket. So, we multiply “1/N” with each expression in the bracket.
See slide 4
And now, looking at this function, we can see that if we want to take the partial derivative of it w.r.t. o_{out} _{1}^{(1)}, then all the other variables are treated like a constant. So, their derivatives are going to be zero. So, we actually only need to focus on the first expression to determine the partial derivative of MSE w.r.t. o_{out} _{1}^{(1)}.
See slide 5
And here, since there is an outer function (the squared) and an inner function (the expression in the parentheses), we have to again use the chain rule. So, to make it really clear how that works, let’s break down this expression into two functions.
See slide 6
So, function “f” represents the inner function and function “g” represents the outer function. And now, to use the chain rule we need the derivatives of those two functions.
See slide 7
And they are easy to determine.
See slide 8
And because there is this “2/N” in the derivative of g, you will sometimes see the MSE defined with “1/2N” instead of “1/N”.
See slide 9
And that’s because this way, there is then this “2/2N” in the derivative of g.
See slide 10
So, the 2 cancels out and we are left with “1/N”.
See slide 11
This simply makes the derivative look somewhat nicer. So, from now on, we are going to define the MSE with “1/2N”. And we can do that, by the way, because it doesn’t really change anything about the behavior of the MSE itself. Its magnitude is just scaled down somewhat.
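To make this concrete, here is a minimal Python sketch (not from the post; the labels and outputs are made-up numbers) showing that the 1/2N version only halves the loss value without changing its behavior:

```python
import numpy as np

def mse(y_true, y_pred):
    """MSE with the usual 1/N scaling."""
    return np.sum((y_true - y_pred) ** 2) / len(y_true)

def mse_half(y_true, y_pred):
    """MSE with the 1/(2N) scaling; only the magnitude changes."""
    return np.sum((y_true - y_pred) ** 2) / (2 * len(y_true))

y_true = np.array([1.0, 0.0])  # hypothetical labels for 2 examples
y_pred = np.array([0.8, 0.3])  # hypothetical network outputs

print(mse(y_true, y_pred))       # ~0.065
print(mse_half(y_true, y_pred))  # exactly half of the value above
```

Since both versions are minimized by the same outputs, training with either leads to the same weights; only the reported loss value differs.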
Okay so now, to determine the derivative of g w.r.t. o_{out} _{1}^{(1)}, according to the chain rule, we have to multiply the derivatives together.
See slide 12
And now, we can simply replace f with its actual equation to get our final function.
See slide 13
So, this is what the partial derivative of MSE w.r.t. o_{out} _{1}^{(1)} looks like, or more generally speaking, what the derivative of our cost function looks like.
See slide 14
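As a quick sketch of this result (again with made-up numbers, and assuming the 1/2N convention from above), the derivative of the cost w.r.t. each output can be computed like this:

```python
import numpy as np

def dcost_doutput(y_true, y_pred):
    """Partial derivative of the 1/(2N)-scaled MSE w.r.t. each output:
    the 2 from the square cancels the 1/2, leaving (o - y) / N."""
    return (y_pred - y_true) / len(y_true)

y_true = np.array([1.0, 0.0])  # hypothetical labels
y_pred = np.array([0.8, 0.3])  # hypothetical outputs
grads = dcost_doutput(y_true, y_pred)  # ~[-0.1, 0.15]
```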
Derivative of the Activation Function
So now, we can tackle the second expression: the derivative of o_{out} _{1}^{(1)} w.r.t. o_{in} _{1}^{(1)}.
See slide 15
Or more generally speaking, we are going to determine the derivative of our activation function, the sigmoid function. And oftentimes, you will see the sigmoid function denoted with the lowercase Greek letter “sigma”. So, that is what we are also going to do here:
See slide 16
Side note: Here, I use x as the variable, just for convenience. That way, I don’t have to write o_{in} _{1}^{(1)} all the time and make things look more confusing than they need to be.
Okay so, for this function, we now want to determine the derivative.
See slide 17
And this might seem pretty complicated, but as it turns out, by doing some clever algebraic manipulations, the derivative of the sigmoid function is actually very simple. So, let’s see why.
First, we rewrite the function itself like this:
See slide 18
This means the same thing. And now to determine the derivative, we again have to use the chain rule and multiply the derivative of the outer function with the derivative of the inner function.
So, for the derivative of the outer function we bring down the minus one, multiply it with the expression in the parentheses and reduce the exponent by one.
See slide 19
And this we multiply with the derivative of the inner function which looks like this:
See slide 20
And then, we can multiply the “1” with “e^{x}”.
See slide 21
This, we can rewrite in the same way as we did before with the sigmoid function itself, but just the other way around.
See slide 22
Here, we can, so to speak, get rid of the square by writing the function like this:
See slide 23
And now, we do our clever manipulation because for the second expression we are going to add a 1 and subtract a 1.
See slide 24
This doesn’t change anything about the expression but now we can rewrite it like this:
See slide 25
And here, the first expression in the parentheses is actually just one.
See slide 26
And then, the other two expressions left in this function, they look exactly like the sigmoid function itself.
See slide 27
So, we can calculate the derivative of the sigmoid function (i.e. its slope) by using the output of the sigmoid function itself.
So, for example, let’s say x is equal to zero.
See slide 28
If we now want to know the derivative or slope of the function at that point, then we first simply calculate the output of the sigmoid function at that point.
See slide 29
And then, we use that value to calculate the derivative.
See slide 30
Side note: The value of 0.25, by the way, is actually the biggest slope for this function. At every other point, the slope is smaller than this.
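This result is easy to check with a small Python sketch (a minimal illustration, not code from the post):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # The slope is computed from the sigmoid's own output: sigma * (1 - sigma)
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))               # 0.5
print(sigmoid_prime(0.0))         # 0.25, the largest slope anywhere on the curve
print(sigmoid_prime(2.0) < 0.25)  # True: the slope shrinks away from x = 0
```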
And the fact that the derivative of the sigmoid function can be calculated with the output of the sigmoid function comes in pretty handy, because during the feedforward algorithm we actually calculate the output of the sigmoid function.
See slide 31
So, in order to determine the equation for the derivative of o_{out} _{1}^{(1)} w.r.t. o_{in} _{1}^{(1)}, we simply need to make use of o_{out} _{1}^{(1)}.
See slide 32
Derivative of the Dot Product
And with that, we can now get to the third expression, namely the derivative of o_{in} _{1}^{(1)} w.r.t. w_{211}.
See slide 33
Or more generally speaking, we are going to see what the derivative of the dot product looks like.
See slide 34
And this one is pretty easy because we simply have to differentiate this function w.r.t. w_{211}. So, we treat everything other than w_{211} as a constant and the derivative is simply h_{out} _{1}^{(1)}.
See slide 35
So, we simply have to multiply the expressions that we already have with h_{out} _{1}^{(1)}.
See slide 36
And now, before we write out the remaining expressions of the partial derivative of MSE w.r.t. w_{211}, I would like to point out a certain pattern with regards to determining the partial derivative of the dot product.
Namely, if we determine the partial derivative of o_{in} _{1}^{(1)} w.r.t. w_{221}, then we can see that it is simply h_{out} _{2}^{(1)}.
See slide 37
And, as we will see later on, in order to know how we should update the weights in W_{1}, we will also have to determine the partial derivative of o_{in} _{1}^{(1)} w.r.t. h_{out} _{1}^{(1)} and h_{out} _{2}^{(1)}. And these derivatives are simply w_{211} and w_{221}.
See slide 38
And the pattern that I wanted to point out here is, that the partial derivative of the dot product w.r.t. any of those variables is always, so to speak, the “opposite, corresponding element”.
So, for example, if we take the partial derivative w.r.t. a weight, then the partial derivative is an h_{out}. If we take the partial derivative w.r.t. an h_{out}, then the partial derivative is a weight. So, it is always the opposite element.
But it is not just any opposite element, it is the opposite, corresponding element. And what I mean by that is that in the formula of the dot product, we multiply for example h_{out} _{1}^{(1)} with w_{211}. So, if we then take for example the partial derivative of o_{in} _{1}^{(1)} w.r.t. w_{211}, then the partial derivative is h_{out} _{1}^{(1)} because that’s the corresponding element of w_{211} and not h_{out} _{2}^{(1)}. And the same kind of reasoning applies to all the other partial derivatives.
And this pattern of the “opposite, corresponding element” will be important when we later on deal with matrices again. So, you can keep that in mind.
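The “opposite, corresponding element” pattern can be verified numerically with a tiny finite-difference check (a sketch with made-up values, not code from the post):

```python
def o_in(h1, h2, w211, w221):
    # Dot product feeding output node 1: o_in = h1*w211 + h2*w221
    return h1 * w211 + h2 * w221

# Hypothetical values, just for the check
h1, h2, w211, w221 = 0.6, 0.9, 0.4, -0.2

# Numerical derivative w.r.t. w211 via a small nudge
eps = 1e-6
num = (o_in(h1, h2, w211 + eps, w221) - o_in(h1, h2, w211, w221)) / eps

print(abs(num - h1) < 1e-4)  # True: d(o_in)/d(w211) is h1, its partner
# Likewise, the derivative w.r.t. h1 would come out as w211, and so on.
```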
Finalizing the Partial Derivative of our first Weight in Weight Matrix 2
Okay, so now let’s write out the remaining expressions of the partial derivative of MSE w.r.t. w_{211}.
See slide 39
And now that we know what the derivatives of the cost function, the sigmoid function and the dot product generally look like, this is pretty easy to do.
See slide 40
So, the expressions look basically the same. The only difference is that we use the elements from example 2 (so the elements with a superscript of 2).
So, this is now the final equation for determining the partial derivative of MSE w.r.t. w_{211} for the case that we have 2 examples. And, obviously, if we had more examples, then we would have to add more of these similar-looking expressions where the only difference is the superscript. And then, it would become impractical to write them all out. So, let’s rewrite this equation in a more general way so that it is more concise.
To do this, let’s first factor out the “1/N”.
See slide 41
And now, in a way, it is not directly connected to the partial derivatives anymore and that’s why I depicted it in black.
And then, to write the summation in the brackets in a more general way, we can use sigma notation again. So, since all those terms are the same except for the superscript, which runs over the number of examples that we have, we can rewrite the bracket with sigma notation like this:
See slide 42
So, that’s what the equation for the partial derivative of MSE w.r.t. w_{211} looks like in a more general way. And what the function does is, for each individual example, multiply the different expressions together. And then, in the end, once that is done, those products are added up over all the examples e. And this final sum is then divided by N.
And this order – first multiplying the partial derivatives and only then, kind of separated from that, adding up the results of those multiplications and dividing by N – is really important when we then transfer this equation back to dealing with matrices.
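As a sketch with made-up numbers for N = 2 examples (and using the 1/2N convention from earlier, so the factor 2 has already cancelled), the whole equation can be evaluated like this:

```python
import numpy as np

# Hypothetical values for output node 1 and hidden node 1, for 2 examples
y1     = np.array([1.0, 0.0])  # labels
o_out1 = np.array([0.7, 0.4])  # sigmoid outputs of output node 1
h_out1 = np.array([0.6, 0.5])  # outputs of hidden node 1

# For each example: cost derivative * sigmoid derivative * dot-product derivative
per_example = (o_out1 - y1) * o_out1 * (1.0 - o_out1) * h_out1

# Only afterwards: sum over the examples and divide by N
grad_w211 = per_example.sum() / len(y1)
print(grad_w211)
```

Note that the per-example products are formed first, and the averaging over examples happens strictly afterwards, mirroring the order described above.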
The Partial Derivatives of all the Weights in Weight Matrix 2
So, that’s how you calculate the partial derivative of MSE w.r.t. w_{211}.
See slide 43
But this is just one weight of W_{2}. So, let’s now determine the next one which is going to be w_{221}.
See slide 44
To do this, let’s look again at the tree diagram.
See slide 45
So, here are the paths again that we needed to consider when we determined the partial derivative of MSE w.r.t. w_{211}. And now, if we want to determine the partial derivative of MSE w.r.t. w_{221}, then the paths look almost the same.
See slide 46
The only difference is at the last step where we go to w_{221} and not w_{211}. So, the partial derivatives also look basically the same, except for the last expressions where we determine the partial derivative of o_{in} _{1}^{(1)} w.r.t. w_{221} and the partial derivative of o_{in} _{1}^{(2)} w.r.t. w_{221}, respectively.
See slide 47
So then, accordingly, the equation for the partial derivative of MSE w.r.t. w_{221} also looks basically the same.
See slide 48
The only difference is that, at the end, we multiply with h_{out} _{2}^{(e)} instead of h_{out} _{1}^{(e)}. And that’s because h_{out} _{2} is the opposite, corresponding element of w_{221}. You can also see that in the neural net.
See slide 49
The opposite, corresponding element of the weight that goes from node 2 in the hidden layer to node 1 in the output layer, is obviously node 2 of the hidden layer and not node 1 of the hidden layer. Because node 1 has nothing to do with this weight.
Okay, so that’s what the equation for the partial derivative of MSE w.r.t. w_{221} looks like. And in a similar way, we can determine the equations of the partial derivatives of MSE w.r.t. the remaining 4 weights of W_{2}.
See slide 50
And, as you can see, the same pattern with h_{out} _{1}^{(e)} and h_{out} _{2}^{(e)} is present. Otherwise, the only difference is that we use the values of the nodes to which the respective weights lead.
So, for example, w_{212} goes from node 1 in the previous layer to node 2 in the following layer. And w_{222} goes from node 2 to node 2.
See slide 51
And then, accordingly, we use o_{out} of the respective node in the formulas (and obviously also its respective label).
See slide 52
And the same reasoning applies to the weights w_{213} and w_{223}.
So, those are now all the calculations that we need to do in order to determine the gradients for all the weights of W_{2}.
See slide 53
And since those formulas follow a clear pattern, we can rewrite them in a more general way where we then just have one function that represents all those individual functions.
See slide 54
And here, for the more general formula, I replaced the subscripts of w. The “f” stands for the node the weight is coming “from”. And the “n” stands for the “node” the weight is going to.
So, I probably also could have chosen to use “f” and “t” as the subscripts to indicate “from” and “to”. But I wanted to use “n” because it basically has the same meaning as the “n” in the MSE formula (matrix notation). So, using the “n”, I think, will make things clearer when we transfer the formula back to dealing with matrices.
Okay, so now, let’s write out the more general formula.
See slide 55
And if you now, for example, put w_{223} into this equation, then you will see that it looks exactly the same as the equation above the line.
So, this equation now is the scalar notation for finding the partial derivative of MSE w.r.t. all the weights in W_{2}.
See slide 56
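To see the general formula in action, here is a sketch that loops over all six weights of W_{2} using the f/n subscripts (all values are made up; the shapes assume 2 hidden nodes and 3 output nodes as in this series):

```python
import numpy as np

# Hypothetical values: N = 2 examples, 2 hidden nodes, 3 output nodes
h_out = np.array([[0.6, 0.9],          # example 1: h_out_1, h_out_2
                  [0.5, 0.8]])         # example 2
o_out = np.array([[0.7, 0.2, 0.6],     # example 1: o_out_1, o_out_2, o_out_3
                  [0.4, 0.3, 0.5]])    # example 2
y     = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
N = len(y)

# grad[f, n] = partial derivative of MSE w.r.t. w_{2fn}:
# f = hidden node the weight comes *from*, n = output *node* it leads to
grad = np.zeros((2, 3))
for f in range(2):
    for n in range(3):
        per_example = (o_out[:, n] - y[:, n]) * o_out[:, n] * (1 - o_out[:, n]) * h_out[:, f]
        grad[f, n] = per_example.sum() / N
print(grad)
```

Every (f, n) pair runs the exact same three-factor chain; that regularity is what lets the six individual formulas collapse into the single general one.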
And since our initial goal was to find the partial derivative of MSE w.r.t. W_{2}, we can now think about how we can transfer this equation to the context of using matrices.
And this will be the topic of the next post.

Author: Just someone trying to explain his understanding of data science concepts
February 2020