This post is part of a series:
In the previous post, we left off at the point where we have written down the formula which represents the scalar-notation for finding the partial derivatives of MSE with respect to (w.r.t.) all weights in W2.
See slide 1
So now, we want to think about how we can transfer this equation back to the context of using matrices.
Transfer from Scalar-Notation to Matrix-Notation
Therefore, let’s go back to this overview diagram of which we have already seen a version in one of the earlier posts:
See slide 2
And let’s also quickly go through the feedforward algorithm again. So first, we multiply the input matrix X with W1 to get Hin. And thereby we move from the inputs to the hidden layer inputs.
See slide 3
Then, to move through the nodes from the hidden layer inputs to the hidden layer outputs, we element-wise apply the sigmoid function to Hin to get Hout.
See slide 4
After that, we multiply Hout with W2 to get Oin and move from the hidden layer outputs to the output layer inputs
See slide 5
And then again, we move through the nodes by element-wise applying the sigmoid function to Oin get Oout.
See slide 6
So that’s the feedforward algorithm. And then, to see how good the neural net currently is at making predictions, we put the output layer outputs together with the labels into the MSE function.
See slide 7
And in this case, we then get an MSE of 0.138.
See slide 8
And now, we want to know how we should adjust our W2 (during the backpropagation algorithm) so that the MSE decreases and the neural net gets somewhat better at making decisions. So, we want to know the partial derivative of MSE w.r.t. W2.
See slide 9
Therefor, we are going look at the scalar-notation equation, which tells us how to determine the partial derivatives of MSE w.r.t. all the individual weights in W2.
See slide 10
So now, we simply have to look at what we actually want to calculate in this equation and then see how that translates to dealing with the matrices.
See slide 11
Before we do that, however, I want to rearrange the matrix-notation formula a little bit. And that’s because we are not going to calculate it in just one step like the formula suggests. Instead what we are going to do, is to calculate it one step at a time (similar to what we did during the feedforward algorithm).
So, the first step or matrix we are going to call Oerror.
See slide 12
The second step is then called Odelta.
See slide 13
And then finally, in the last step, we determine the W2-update. So, the values that we use to execute one gradient descent step for W2.
See slide 14
And we can do the calculations in this way because, in the original formula, we are just multiplying together the different expressions. So, we can also just simply do one multiplication at a time, instead of doing them all at once.
So, those are now the individual steps that we want to understand using the scalar-notation equation.
See slide 15
And to better see which part of the scalar-notation equation represents which step in the matrix-notation formulas, let’s colorize them accordingly.
See slide 16
1. Step of the Chain Rule: Output-Error-Matrix
Okay, so now let’s see what we actually want to calculate in the scalar-notation equation and how we can transfer it to dealing with matrices.
And as already said in the previous post, what we do in the scalar-notation equation is that we first multiply together all the “different-colored” expressions (which represent, generally speaking, the derivatives of the cost function, the sigmoid function and the dot product). And only then, once we have done that, we sum up over all examples e and then divide by N. And this is also what we are going to do with the matrices.
So first, let’s just look at the blue-colored parts which represent the first step of the chain rule.
And just as a reminder, in the scalar-notation equation, the “oout n(e)” represents one specific element from the output layer outputs matrix Oout. Namely, the element of node n and example e. So, for instance, oout 1(3) is the element in Oout in the third row and first column.
See slide 17
And oout 3(2) is the element in Oout in the second row and third column.
See slide 18
The same logic applies to “yn(e)” and the label matrix Y.
So, what we want to do in the scalar-notation equation, is to simply subtract a specific y from the respective oout. So, for example, from oout 1(1) we want to subtract y1(1). And what that means in terms of the matrices is that from each element in Oout we want to subtract the respective element in Y. So, for example, from the element in row 1 and column 1 in Oout, we want to subtract the element from row 1 and column 1 in Y.
See slide 19
So, in terms of the matrix-notation what we want to do, is to simply element-wise subtract the label matrix from the output layer outputs matrix.
See slide 20
So, let’s calculate Oerror.
See slide 21
And by executing this calculation, we, so to say, move from this point in the neural net:
See slide 22
To this point in the neural net:
See slide 23
So, just to clarify: During the feedforward algorithm, we call this point in the neural net Oout. And during the backpropagation algorithm, we call this point in the neural net Oerror.
2. step of chain rule
So now, let’s look at the orange-colored parts of the equations which represent the second step of the chain rule. And therefore, to better understand what is going on in this step, let’s rewrite the blue expression in the scalar-notation equation as oerror n(e).
See slide 24
So, what we want to do in this step is similar to what we did in the first step. Namely, we want to multiply a specific element of our Oerror matrix with the respective element in the Oout matrix. And then, we multiply that with “1 minus the respective element in the Oout matrix”.
So, for example, if we take the element from row 1 and column 1 again, then we want to calculate “-0.38 * 0.62 * (1 – 0.62)”.
See slide 25
So, in terms of the matrix-notation what we want to do, is an element-wise multiplication between those matrices.
See slide 26
And, as you can see, the element-wise matrix multiplication has a special symbol (a dot with a circle around it). So, it’s not a regular matrix multiplication where we calculate dot products. When doing an element-wise matrix multiplication, you really just multiply together the respective elements of the matrices (which means that they have to have the same dimensions).
Side note: Here, I have to mention that the “1”, in the equation for calculating Odelta, is supposed to represent a matrix which has the same dimensions as Oout and whose elements are all of value “1”. This is maybe not the most precise form of notation, but I think for our purposes it will do.
So, let’s calculate Odelta.
See slide 27
And by executing this calculation, we, so to say, move from this point in the neural net:
See slide 28
Back through the nodes to this point in the neural net:
See slide 29
So, during the feedforward algorithm, we call this point in the neural net Oin. And during the backpropagation algorithm, we call this point in the neural net Odelta.
3. step of chain rule
So now, let’s look at the green-colored parts of the equations which represent the third step of the chain rule. And therefore, to again better understand what is going on in this step, let’s rewrite the scalar-notation equation so that it now includes odelta n(e).
See slide 30
So, what we want to do here is, for a particular example e, we want to multiply odelta of node n with hout of node f. And which node n of odelta we use or which node f of hout we use, depends on w.r.t. which weight we want to determine the partial derivative of MSE of. So, from which node the weight is coming from to which node the weight is going to.
So, for example, let’s say we want to determine the partial derivative of MSE w.r.t. w2-11.
See slide 31
So, in the neural net, it is the weight going from node 1 in the hidden layer to node 1 in the output layer.
See slide 32
In that case, if we plug in the numbers into the equation, then we can see that we want to multiply odelta 1 with hout 1.
See slide 33
And we want to do that for all examples e, respectively. So, we want to element-wise multiply column 1 of Odelta and column 1 of Hout.
See slide 34
So, this is what we basically do at this third step. But then, let’s not forget what we said about the scalar-notation equation when we started with the first step of the chain rule.
See slide 35
Namely, what we do in this equation, is that we first multiply together all the “different-colored” expressions. And then, once we have done that (which we now have after the third step), we sum up over all examples e. And then, we divide that sum by N.
So, after we have multiplied together the respective elements of column 1 of Odelta and column 1 of Hout, we want to add up those products. So, basically what we want to do, is to calculate the dot product of those two columns.
See slide 36
And the result of that dot product, we then divide by N. And this will give us the value with which we should update w2-11.
See slide 37
And the same kind of logic applies to the other weights of W2, respectively.
See slides 38-42
So basically, what we want to do to determine W2-update, for each column in Odelta we want to calculate the dot product with each column in Hout. And this looks very similar to a matrix multiplication, right?
So, what can we do so that it actually becomes a true matrix multiplication (so that we then can take advantage of the speed of NumPy in our code)?
Well, we can simply take the transpose of Odelta (so that the rows become the columns).
See slide 43
And now, we can simply multiply OdeltaT with Hout to calculate all those dot products that we want.
See slide 44
And then, we divide each element of the resulting matrix by N to get W2-update.
See slide 45
So, this is how we can transfer the knowledge from the scalar-notation equation over to dealing with matrices again.
There is, however, a small problem with these calculations. Namely, the result of multiplying OdeltaT with Hout would be a 3x2 matrix.
See slide 46
And it would look like this:
See slide 47
So, we can’t simply subtract it from our W2 which is what we want to do to execute one gradient step.
See slide 48
To be able to do that we would have to again take the transpose of this W2-update matrix.
See slide 49
This matrix we can element-wise subtract from W2 and thereby update the respective weights. So, to calculate W2-update, we have to take the transpose of the multiplication of OdeltaT with Hout.
See slide 50
So, this is now finally how we can determine W2-update. But, as you can see there are two transposes which makes it kind of complicated. So, let’s simplify it somewhat.
Namely, instead of taking the transpose of Odelta, we can take the transpose of Hout.
See slide 51
And now, we multiply HoutT with Odelta (instead of doing it the other way around).
See slide 52
This way, we are calculating the exact same dot products as before. But the resulting matrix is already in the correct shape. So, we only have to take one transpose which is computationally more efficient (and also a little less complicated to remember).
So, that’s now how we are going to determine W2-update.
See slide 53
So, let’s calculate it.
See slide 54
And what we then would have to do to execute one gradient step for W2, is to multiply W2-update with our learning rate and then we would subtract it from our initial W2.
See slide 55
And thereby we should be able to reduce the MSE somewhat.
But, obviously, for each gradient descent step we have to update both of our weight matrices simultaneously. So, let’s now have a look at how we determine the equation for the partial derivative of MSE w.r.t. W1.
See slide 56
And this will be the topic of the next post.
2/23/2022 02:52:12 am
Surfing on backpropagation explaination i've never found something as clear as this post. Thank you very much for your work and for this perfect implementation ready to use! I finally understood this algorithm. Do you also have the same application but using Tanh activation function? i tried to derivate it to find how to update weight but it is quite difficult. can you help me?
Leave a Reply.
Just someone trying to explain his understanding of data science concepts