This post is part of a series:
In the previous post, we left off at the point where we have written down the formula which represents the scalarnotation for finding the partial derivatives of MSE with respect to (w.r.t.) all weights in W_{2}.
See slide 1
So now, we want to think about how we can transfer this equation back to the context of using matrices.
Transfer from ScalarNotation to MatrixNotation
Therefore, let’s go back to this overview diagram of which we have already seen a version in one of the earlier posts:
See slide 2
And let’s also quickly go through the feedforward algorithm again. So first, we multiply the input matrix X with W_{1} to get H_{in}. And thereby we move from the inputs to the hidden layer inputs.
See slide 3
Then, to move through the nodes from the hidden layer inputs to the hidden layer outputs, we elementwise apply the sigmoid function to H_{in} to get H_{out}.
See slide 4
After that, we multiply H_{out} with W_{2} to get O_{in} and move from the hidden layer outputs to the output layer inputs
See slide 5
And then again, we move through the nodes by elementwise applying the sigmoid function to O_{in} get O_{out}.
See slide 6
So that’s the feedforward algorithm. And then, to see how good the neural net currently is at making predictions, we put the output layer outputs together with the labels into the MSE function.
See slide 7
And in this case, we then get an MSE of 0.138.
See slide 8
And now, we want to know how we should adjust our W_{2} (during the backpropagation algorithm) so that the MSE decreases and the neural net gets somewhat better at making decisions. So, we want to know the partial derivative of MSE w.r.t. W_{2}.
See slide 9
Therefor, we are going look at the scalarnotation equation, which tells us how to determine the partial derivatives of MSE w.r.t. all the individual weights in W_{2}.
See slide 10
So now, we simply have to look at what we actually want to calculate in this equation and then see how that translates to dealing with the matrices.
See slide 11
Before we do that, however, I want to rearrange the matrixnotation formula a little bit. And that’s because we are not going to calculate it in just one step like the formula suggests. Instead what we are going to do, is to calculate it one step at a time (similar to what we did during the feedforward algorithm).
So, the first step or matrix we are going to call O_{error}.
See slide 12
The second step is then called O_{delta}.
See slide 13
And then finally, in the last step, we determine the W_{2update}. So, the values that we use to execute one gradient descent step for W_{2}.
See slide 14
And we can do the calculations in this way because, in the original formula, we are just multiplying together the different expressions. So, we can also just simply do one multiplication at a time, instead of doing them all at once.
So, those are now the individual steps that we want to understand using the scalarnotation equation.
See slide 15
And to better see which part of the scalarnotation equation represents which step in the matrixnotation formulas, let’s colorize them accordingly.
See slide 16
1. Step of the Chain Rule: OutputErrorMatrix
Okay, so now let’s see what we actually want to calculate in the scalarnotation equation and how we can transfer it to dealing with matrices.
And as already said in the previous post, what we do in the scalarnotation equation is that we first multiply together all the “differentcolored” expressions (which represent, generally speaking, the derivatives of the cost function, the sigmoid function and the dot product). And only then, once we have done that, we sum up over all examples e and then divide by N. And this is also what we are going to do with the matrices.
So first, let’s just look at the bluecolored parts which represent the first step of the chain rule.
And just as a reminder, in the scalarnotation equation, the “o_{out} _{n}^{(e)}” represents one specific element from the output layer outputs matrix O_{out}. Namely, the element of node n and example e. So, for instance, o_{out} _{1}^{(3)} is the element in O_{out} in the third row and first column.
See slide 17
And o_{out} _{3}^{(2)} is the element in O_{out} in the second row and third column.
See slide 18
The same logic applies to “y_{n}^{(e)}” and the label matrix Y.
So, what we want to do in the scalarnotation equation, is to simply subtract a specific y from the respective o_{out}. So, for example, from o_{out} _{1}^{(1)} we want to subtract y_{1}^{(1)}. And what that means in terms of the matrices is that from each element in O_{out} we want to subtract the respective element in Y. So, for example, from the element in row 1 and column 1 in O_{out}, we want to subtract the element from row 1 and column 1 in Y.
See slide 19
So, in terms of the matrixnotation what we want to do, is to simply elementwise subtract the label matrix from the output layer outputs matrix.
See slide 20
So, let’s calculate O_{error}.
See slide 21
And by executing this calculation, we, so to say, move from this point in the neural net:
See slide 22
To this point in the neural net:
See slide 23
So, just to clarify: During the feedforward algorithm, we call this point in the neural net O_{out}. And during the backpropagation algorithm, we call this point in the neural net O_{error}.
2. step of chain rule
So now, let’s look at the orangecolored parts of the equations which represent the second step of the chain rule. And therefore, to better understand what is going on in this step, let’s rewrite the blue expression in the scalarnotation equation as o_{error} _{n}^{(e)}.
See slide 24
So, what we want to do in this step is similar to what we did in the first step. Namely, we want to multiply a specific element of our O_{error} matrix with the respective element in the O_{out} matrix. And then, we multiply that with “1 minus the respective element in the O_{out} matrix”.
So, for example, if we take the element from row 1 and column 1 again, then we want to calculate “0.38 * 0.62 * (1 – 0.62)”.
See slide 25
So, in terms of the matrixnotation what we want to do, is an elementwise multiplication between those matrices.
See slide 26
And, as you can see, the elementwise matrix multiplication has a special symbol (a dot with a circle around it). So, it’s not a regular matrix multiplication where we calculate dot products. When doing an elementwise matrix multiplication, you really just multiply together the respective elements of the matrices (which means that they have to have the same dimensions).
Side note: Here, I have to mention that the “1”, in the equation for calculating O_{delta}, is supposed to represent a matrix which has the same dimensions as O_{out} and whose elements are all of value “1”. This is maybe not the most precise form of notation, but I think for our purposes it will do.
So, let’s calculate O_{delta}.
See slide 27
And by executing this calculation, we, so to say, move from this point in the neural net:
See slide 28
Back through the nodes to this point in the neural net:
See slide 29
So, during the feedforward algorithm, we call this point in the neural net O_{in}. And during the backpropagation algorithm, we call this point in the neural net O_{delta}.
3. step of chain rule
So now, let’s look at the greencolored parts of the equations which represent the third step of the chain rule. And therefore, to again better understand what is going on in this step, let’s rewrite the scalarnotation equation so that it now includes o_{delta} _{n}^{(e)}.
See slide 30
So, what we want to do here is, for a particular example e, we want to multiply o_{delta} of node n with h_{out} of node f. And which node n of o_{delta} we use or which node f of h_{out} we use, depends on w.r.t. which weight we want to determine the partial derivative of MSE of. So, from which node the weight is coming from to which node the weight is going to.
So, for example, let’s say we want to determine the partial derivative of MSE w.r.t. w_{211}.
See slide 31
So, in the neural net, it is the weight going from node 1 in the hidden layer to node 1 in the output layer.
See slide 32
In that case, if we plug in the numbers into the equation, then we can see that we want to multiply o_{delta} _{1} with h_{out} _{1}.
See slide 33
And we want to do that for all examples e, respectively. So, we want to elementwise multiply column 1 of O_{delta} and column 1 of H_{out}.
See slide 34
So, this is what we basically do at this third step. But then, let’s not forget what we said about the scalarnotation equation when we started with the first step of the chain rule.
See slide 35
Namely, what we do in this equation, is that we first multiply together all the “differentcolored” expressions. And then, once we have done that (which we now have after the third step), we sum up over all examples e. And then, we divide that sum by N.
So, after we have multiplied together the respective elements of column 1 of O_{delta} and column 1 of H_{out}, we want to add up those products. So, basically what we want to do, is to calculate the dot product of those two columns.
See slide 36
And the result of that dot product, we then divide by N. And this will give us the value with which we should update w_{211}.
See slide 37
And the same kind of logic applies to the other weights of W_{2}, respectively.
See slides 3842
So basically, what we want to do to determine W_{2update}, for each column in O_{delta} we want to calculate the dot product with each column in H_{out}. And this looks very similar to a matrix multiplication, right?
So, what can we do so that it actually becomes a true matrix multiplication (so that we then can take advantage of the speed of NumPy in our code)?
Well, we can simply take the transpose of O_{delta} (so that the rows become the columns).
See slide 43
And now, we can simply multiply O_{delta}^{T} with H_{out} to calculate all those dot products that we want.
See slide 44
And then, we divide each element of the resulting matrix by N to get W_{2update}.
See slide 45
So, this is how we can transfer the knowledge from the scalarnotation equation over to dealing with matrices again.
There is, however, a small problem with these calculations. Namely, the result of multiplying O_{delta}^{T} with H_{out} would be a 3x2 matrix.
See slide 46
And it would look like this:
See slide 47
So, we can’t simply subtract it from our W_{2} which is what we want to do to execute one gradient step.
See slide 48
To be able to do that we would have to again take the transpose of this W_{2update} matrix.
See slide 49
This matrix we can elementwise subtract from W_{2} and thereby update the respective weights. So, to calculate W_{2update}, we have to take the transpose of the multiplication of O_{delta}^{T} with H_{out}.
See slide 50
So, this is now finally how we can determine W_{2update}. But, as you can see there are two transposes which makes it kind of complicated. So, let’s simplify it somewhat.
Namely, instead of taking the transpose of O_{delta}, we can take the transpose of H_{out}.
See slide 51
And now, we multiply H_{out}^{T} with O_{delta} (instead of doing it the other way around).
See slide 52
This way, we are calculating the exact same dot products as before. But the resulting matrix is already in the correct shape. So, we only have to take one transpose which is computationally more efficient (and also a little less complicated to remember).
So, that’s now how we are going to determine W_{2update}.
See slide 53
So, let’s calculate it.
See slide 54
And what we then would have to do to execute one gradient step for W_{2}, is to multiply W_{2update} with our learning rate and then we would subtract it from our initial W_{2}.
See slide 55
And thereby we should be able to reduce the MSE somewhat.
But, obviously, for each gradient descent step we have to update both of our weight matrices simultaneously. So, let’s now have a look at how we determine the equation for the partial derivative of MSE w.r.t. W_{1}.
See slide 56
And this will be the topic of the next post.
1 Comment
Livio
2/23/2022 02:52:12 am
Surfing on backpropagation explaination i've never found something as clear as this post. Thank you very much for your work and for this perfect implementation ready to use! I finally understood this algorithm. Do you also have the same application but using Tanh activation function? i tried to derivate it to find how to update weight but it is quite difficult. can you help me?
Reply
Leave a Reply. 
AuthorJust someone trying to explain his understanding of data science concepts Archives
November 2020
Categories
