This post is part of a series:
In the previous posts, we’ve answered the first question of how the deep learning algorithm makes a decision.
See slide 1
Namely, we put our input data into the neural net and then we feed it forward through all the layers until we reach the last layer which then produces the output of the neural net. And this process was accordingly called feedforward.
Cost Function – Sum of squared Errors
So now, we can start to tackle the second question which was: How do you determine the right parameters for the algorithm?
And for the deep learning algorithm, the parameters are the weights (represented by the lines in the diagram of the neural net). So, determining the right parameters means that we want to find such weights that the neural net is able to make correct predictions/decisions.
And therefore, obviously, it would be helpful to have a metric that tells us in a concrete and precise way how good the neural net actually is at making predictions. And in the context of deep learning, such a measure is called a cost function or a loss function.
See slide 2
So, what we are going to do now is, we are going to create such a cost function.
And one simple approach for doing that is to calculate the error. So, we simply subtract the label from our output.
See slide 3
Let’s say for the depicted neural net, we have determined the outputs of 4 examples. So, we have a 4x1 matrix.
See slide 4
From that, we simply element-wise subtract the label matrix (which, accordingly, is also a 4x1 matrix). And the result of that is the error matrix.
See slide 5
So, if our prediction is correct, meaning that the output and the respective label are the same, then the error is zero. And if our prediction is not correct, then the error is either 1 or -1.
And since we want a measure that tells us overall, how good the neural net is at making predictions, we simply sum up those errors. And then we have the “sum of errors” as our cost function.
See slide 6
But, as you can see in the example calculation, there is a problem with this metric. Namely, even though we predicted two of the four examples incorrectly, the error, nonetheless, is still zero. And that’s because the “1” and “-1” of the incorrect predictions cancel each other out. So, based on our current metric, the neural net seems to be perfect at making predictions. Whereas, in reality, it made two wrong predictions.
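To make this cancellation concrete, here is a tiny sketch in NumPy (the outputs and labels are hypothetical values matching scenario A: 4 examples, 2 of them predicted incorrectly):

```python
import numpy as np

output = np.array([1, 0, 1, 0])  # predictions of the neural net (illustrative)
labels = np.array([0, 0, 1, 1])  # true labels (illustrative)

error = output - labels          # element-wise "output - label"
sum_of_errors = error.sum()      # the "+1" and "-1" cancel out

print(error)          # [ 1  0  0 -1]
print(sum_of_errors)  # 0, even though 2 of the 4 predictions are wrong
```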
So, to get around this problem, we have to make sure that the errors of incorrect predictions are always positive. So, that they can’t cancel each other out. And one way to do that is to square the “output – label” calculation.
See slide 7
So, we simply calculate the “squared error”. And by doing that, our metric now actually indicates that there is some error associated with the predictions of the neural net. So, this “sum of squared errors” (SSE), is a metric that we could use as a cost function.
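In code, squaring is just one extra operation (again with illustrative values: 4 examples, 2 of them wrong):

```python
import numpy as np

output = np.array([1, 0, 1, 0])  # illustrative predictions
labels = np.array([0, 0, 1, 1])  # illustrative labels

squared_error = (output - labels) ** 2  # squaring makes every error positive
sse = squared_error.sum()               # sum of squared errors

print(squared_error)  # [1 0 0 1]
print(sse)            # 2 -> the metric now reflects the 2 wrong predictions
```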
Cost Function – Mean Squared Error
But there is also a problem with this metric. Actually, it’s not really a problem but more of a slight inconsistency. Namely, the more examples we have, the bigger the error is potentially going to be. And this might not really be what we want.
See slide 8
For example, let’s say there is a second scenario where our neural net makes predictions (scenario B). Here, there are 10 examples in total and 3 of those were predicted incorrectly. So, accordingly the SSE would be 3. So, it is higher than in scenario A. And this means that the neural net in scenario B seems to be worse at making predictions than in scenario A.
But, if you look at the percentages, then you can see that the neural net in scenario B predicted 30% of the examples incorrectly. Whereas, in scenario A it predicted 50% of the examples incorrectly. So, the neural net in scenario B is actually better at making predictions. And accordingly, this should be reflected in our cost function. So, in scenario B it should be lower than in scenario A.
And to make sure that that’s really the case, what we are going to do is, we are going to calculate the mean of the squared errors (MSE) and not just the sum.
See slide 9
So, in scenario A we divide the SSE by 4 (since there are 4 elements in the error matrix) and thereby we get a MSE of 0.5. And in scenario B we divide the SSE by 10 (since there are 10 elements in the error matrix) and thereby we get a MSE of 0.3. So now, our cost function actually indicates that the neural net in scenario B is better at making predictions than the neural net in scenario A.
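As a quick sketch, using illustrative squared-error values for the two scenarios (A: 2 of 4 wrong, B: 3 of 10 wrong):

```python
import numpy as np

# scenario A: 4 examples, 2 wrong -> SSE = 2
squared_error_a = np.array([1, 0, 0, 1])
# scenario B: 10 examples, 3 wrong -> SSE = 3
squared_error_b = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0])

mse_a = squared_error_a.mean()  # 2 / 4  = 0.5
mse_b = squared_error_b.mean()  # 3 / 10 = 0.3

print(mse_a, mse_b)  # 0.5 0.3 -> scenario B now correctly looks better
```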
And now, let’s think about how we calculate the MSE for the case when our neural net has several output nodes and not just one. After all, the neural net that we want to apply to the Iris flower data set has three output nodes.
See slide 10
So, in this case our output matrix doesn’t just have one column but three.
See slide 11
This, however, doesn’t actually change anything about the first part of the calculation where we determine the “squared error”.
See slide 12
We, again, simply element-wise subtract the label matrix from the output matrix. And then, we square the result of that to get the error matrix.
But now, when we sum up those squared errors over all the examples, then we still have a vector and not just a number like we did before where the neural net had just one output node.
See slide 13
To get just a number (or in linear algebra terms: a scalar), we have to calculate another sum. And this time it’s the sum across all nodes.
See slide 14
So, in this case, we then get a 6. And this sum, we then divide by the number of elements of the error matrix to calculate the MSE.
See slide 15
And, as you can see, we get a MSE of 0.5.
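A small sketch of this two-step summation, with hypothetical one-hot outputs and labels for 4 examples and 3 nodes (3 of the 4 examples predicted incorrectly):

```python
import numpy as np

output = np.array([[1, 0, 0],   # illustrative one-hot predictions
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])
labels = np.array([[0, 1, 0],   # illustrative one-hot labels
                   [1, 0, 0],
                   [0, 0, 1],
                   [0, 0, 1]])

squared_error = (output - labels) ** 2
per_node = squared_error.sum(axis=0)  # sum over all examples -> still a vector
total = per_node.sum()                # sum across all nodes  -> a scalar: 6
mse = total / squared_error.size      # divide by 4 * 3 = 12 elements

print(total, mse)  # 6 0.5
```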
And here, I want to point out that the “0.5” doesn’t mean that this neural net is equally good at making predictions as the neural net from scenario A.
See slide 16
Namely, if you look at the percentage of incorrect predictions, then you can see that in scenario C, 75% of the examples were predicted incorrectly whereas in scenario A, only 50% of the examples were predicted incorrectly. And this means that the neural net in scenario C is actually worse at making predictions than the neural net in scenario A.
So, to emphasize: the MSE doesn’t represent the proportion of incorrect predictions, which is the impression you might have gotten when we just looked at scenarios A and B (slide 9). It is, just like the name says, the “mean” of the “squared errors”.
And in scenario C, for every error that the neural net makes, we get two “1”s and not just one. And because of that, the neural net in scenario C just happens to have the same MSE as the neural net in scenario A.
So, one thing to keep in mind when dealing with the MSE is that you can only directly compare 2 neural nets with each other if they have the same number of output nodes.
Okay, so having said that, let’s now write down the calculations, that we did with the matrices, in a formal way.
So first, we subtracted the label matrix from the output matrix.
See slide 17
Then, we squared that.
See slide 18
Then, we took the sum over all the examples e.
See slide 19
And after that, we took the sum over all the nodes n.
See slide 20
And here, the order of the summations doesn’t actually matter. So, we could also switch them up and first take the sum over all the nodes n and then take the sum over all the examples e.
See slide 21
The result is the same.
And finally, we divided the sum by the number of elements in the error matrix. So, we divide by the number of examples times the number of nodes.
See slide 22
And to make the formula a little bit more concise, we are going to denote the number of elements of the error matrix with a capital “N” (which simply represents “e * n”).
See slide 23
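In code, assuming NumPy matrices for the outputs and labels, this whole formula collapses into a single function (the matrices here are hypothetical values matching scenario C: 3 of 4 one-hot predictions wrong):

```python
import numpy as np

def mse(output, labels):
    """Mean squared error: sum the squared errors over all examples (e)
    and all nodes (n), then divide by N = e * n elements."""
    return ((output - labels) ** 2).mean()

# illustrative 4x3 matrices (3 of the 4 examples predicted incorrectly)
output = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
labels = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]])

print(mse(output, labels))  # 0.5
```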
One composite function
So, this is now our final metric that we are going to use as our cost function.
See slide 24
And remember, our initial goal was to find such weights that the neural net is able to make correct decisions. So, what this now means is that we want to find such weights that the MSE is minimized. And the question now is: How do we do that?
Well, if you look at the neural net, you could get the impression that it consists of successive steps where we calculate the inputs of the hidden layer (Hin) by multiplying the matrix that contains the data set (X) with the matrix that contains the weights going from the input layer to the hidden layer (W1).
See slide 25
And those hidden layer inputs, we then put into the step function to get the hidden layer outputs (Hout).
See slide 26
And those, in turn, we multiply with W2 to get the output layer inputs.
See slide 27
And finally, we put Oin into the step function to get the output layer outputs (Oout) which are then the decision of our neural net.
See slide 28
And those output layer outputs, we then put into the function of the MSE (together with the labels of the data set) to see how good the neural net actually is at making predictions.
See slide 29
So, it seems like there are these clearly distinguished, successive steps. But actually, because you are always putting the result of one step into the next step, all those individual steps can be expressed using just one big function.
See slide 30
So, first we multiply X with W1.
See slide 31
This, we put into the step function.
See slide 32
The result of that, we multiply with W2.
See slide 33
And this whole expression, we again put into the step function.
See slide 34
And then finally, this whole expression, we put into the MSE formula.
See slide 35
So, this now is the singular function that represents all the individual, successive steps.
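As a sketch, assuming NumPy and a step activation, this composite function might look like the following (X, W1, W2 and the labels are just hypothetical values to make it runnable):

```python
import numpy as np

def step(x):
    # step activation: outputs 1 where the input is >= 0, else 0
    return (x >= 0).astype(float)

def cost(X, W1, W2, labels):
    """Feedforward plus MSE, expressed as one function of the weights."""
    H_out = step(X @ W1)      # hidden layer:  Hout = step(X * W1)
    O_out = step(H_out @ W2)  # output layer:  Oout = step(Hout * W2)
    return ((O_out - labels) ** 2).mean()

# tiny illustrative shapes, just to run the function
X = np.array([[1.0, 0.0], [0.0, 1.0]])
W1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
W2 = np.array([[1.0], [-1.0]])
labels = np.array([[1.0], [0.0]])

print(cost(X, W1, W2, labels))  # 0.0 for these particular weights
```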
And again, our goal was to make sure that the neural net is able to make correct predictions. So, what this now means is that we want to find the minimum of this function. Or in other words, we want to find the specific values for our variables, i.e. the weights, so that the function is minimized.
And I don’t know about you, but maybe that reminds you of something that you learned in school. Namely, one of the few things that I vaguely remember from math in school is something called derivatives and that you could use them to find the minima or maxima of a function.
So, let’s look at a simpler function than our formula for the MSE to refresh our memory of derivatives.
See slide 36
So, in this example, y is a function of w. And in the plot, you can see what it looks like.
And now, to denote the derivative of this function, we can either use Leibniz’ notation (“dy over dw”) or Lagrange’s notation (“f prime of w”).
See slide 37
In the context of deep learning and backpropagation, I personally prefer to use Leibniz’s notation. So, that’s what we are going to do moving forward.
Okay so now, let’s determine the derivative of the first expression of the function (w²). To do that, we simply use the power rule to bring the exponent down in front of the variable. And then, we decrease the exponent by 1.
See slide 38
For the next expression (2w), we again use the power rule.
See slide 39
And since w raised to the power of zero is simply one, we are actually just left with the 2.
See slide 40
And then, the last expression of the function is the “3”. This is a constant and the derivative of a constant is simply zero. And this means that we don’t need to add anything to the function of the derivative.
So, that’s now what the function of the derivative of y looks like. And what this function tells us is the slope of y for any given point.
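Putting the pieces together, the example function is y = w² + 2w + 3 and its derivative is dy/dw = 2w + 2 (this matches all the values that follow, e.g. y = 6 and a slope of 4 at w = 1). We can also sanity-check the derivative against a numerical approximation:

```python
def y(w):
    return w**2 + 2*w + 3

def dy_dw(w):
    # power rule: d/dw(w^2) = 2w, d/dw(2w) = 2, d/dw(3) = 0
    return 2*w + 2

# finite-difference check: slope over a tiny interval around w = 1
h = 1e-6
numeric = (y(1 + h) - y(1)) / h

print(y(1))               # 6
print(dy_dw(1))           # 4
print(round(numeric, 3))  # ~4.0
```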
So, for example, let’s say we want to find y and dy/dw when w is equal to 1.
See slide 41
Then, if we plug the 1 into the equation for y, then we see that y is equal to 6.
See slide 42
So, in the graph, we are at this particular point:
See slide 43
And if we now plug in the 1 into the equation for dy/dw, then we see that the derivative of y with respect to (w.r.t.) w is 4.
See slide 44
So, the slope of the graph at the point is 4. A way to visualize that, is to plot the tangent line for the graph at this point.
See slide 45
This straight line also has a slope of 4.
And what the slope tells us, is how our y changes if we change the w by a certain amount. So, for example, for the tangent line, if we increase our w by 1, then the y increases by 4. And we move from 6 to 10. But if we increase our w only by 0.5, then the y increases by 4 times that 0.5, which is 2. So, we move from 6 to 8.
For our curve, however, the slope is not constant as is the case with the straight line. But instead, it is always changing. Near a w of -1, the graph is almost flat. And, as you move away from the -1 (in either direction), the graph gets steeper and steeper.
So, what the derivative of y really tells us, is the instantaneous rate of change at any given point. So, it tells us how our y changes if we increase our w by an extremely tiny amount, almost zero.
So, for example, at the current point, if we increase our w by a tiny amount, then the y will increase by 4 times that tiny amount.
But let’s look at another point:
See slide 46
Here, when w is equal to 0, the derivative is only 2. So, if we increase w by a tiny amount, y will only increase by 2 times that amount.
Let’s see what happens when w is equal to 3:
See slide 47
Here, the derivative is 8. So, if we increase w by a tiny amount, y will increase by 8 times that amount.
And for the last example point, let’s look to the left side of the curve:
See slide 48
Here, the slope is negative. So, accordingly, if we increase w by a tiny amount, y will decrease by 4 times that amount.
So, to summarize, what the derivative of a function tells us is how our function changes if we slightly increase our variable. And hereby, it provides us two pieces of information. First, the sign of the derivative tells us the direction of the change.
See slide 49
So, in this case, it is negative which means that y would decrease if we slightly increase our w. If it is positive, then y would increase if we slightly increase our w.
See slide 50
And the second piece of information that the derivative of a function provides is the actual value of the derivative. This tells us something about the amount of the change.
See slide 51
So, it tells us if y changes a lot when we slightly increase the w.
See slide 52
Or if y changes not so much when we slightly increase the w.
See slide 53
And, in my opinion, this is probably the single most important concept for understanding how backpropagation works and what it does.
Analytical Approach for finding Extrema
Okay, so now that we know what derivatives are and how they work, let’s see how we can use them to find the minima or maxima of a function, which is why we introduced them in the first place.
And maybe you also remember from school how we can do that. Namely, the idea simply is that if you look at the graph, you can see that it reaches its minimum when w is equal to -1.
See slide 54
And the observation that is important here, is that the tangent line is flat. Thus, the slope of the graph at this point is 0.
So, if we set the equation of the derivative equal to 0, then we can solve it for w.
See slide 55
And that way, we can find the specific value for w where y is minimized or maximized (because at a maximum, the slope is also zero). And in this case, it is pretty straightforward.
See slide 56
So, as we have already seen in the graph, when w is -1, then the derivative or slope must be zero. And this means that y reaches a minimum or maximum when w is -1. And, as we can see in the graph, it is actually a minimum.
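We can quickly verify this in code: the derivative is zero at w = −1, and the function values right next to it are larger, so it is indeed a minimum:

```python
def y(w):
    return w**2 + 2*w + 3

def dy_dw(w):
    return 2*w + 2  # derivative of y

w_min = -2 / 2  # solving 2w + 2 = 0 analytically gives w = -1

print(dy_dw(w_min))                   # 0 -> flat tangent at w = -1
print(y(w_min) < y(w_min + 0.1) and
      y(w_min) < y(w_min - 0.1))      # True -> neighbors are higher: a minimum
```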
So, this is one way of how you can find the minimum of a function. And this way of doing it is an analytical approach. And that’s because we are setting the derivative equal to zero and then we solve the resulting equation by algebraically manipulating it in such a way that the w is isolated on one side.
But there are also cases where you can’t find the minimum with such an analytical approach. For example, let’s look at this function.
See slide 57
The function for the derivative looks like this:
See slide 58
So, let’s set that to 0 and try to solve for w:
See slide 59
Here, we could maybe first bring the 1 to the other side. And then, maybe factor out the w on the left side. But then, you are kind of stuck and you can’t isolate the w on just one side like we did before. So, you can’t determine the minimum of this function analytically.
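Even without isolating w, we can at least confirm numerically that a root of the derivative exists. Assuming the derivative shown on the slides is dy/dw = w⁵ + w − 1 (consistent with the PS at the end of this post), a simple bisection finds it:

```python
def dy_dw(w):
    return w**5 + w - 1

# dy_dw(0) = -1 and dy_dw(1) = 1, so a root must lie between 0 and 1;
# bisection repeatedly halves that interval
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    if dy_dw(mid) < 0:
        lo = mid
    else:
        hi = mid

print(round(mid, 5))  # ~0.75488
```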
And this now begs the question: How do we find the minimum of this function (since we can see in the graph that there actually is a minimum)? And the answer to this question, we are going to cover in the next post.
PS: As it turned out after I had already created this tutorial, there is actually an analytical solution for the “x⁵ + x − 1 = 0” equation.
What happened was that, originally, I used the “x⁵ − x + 1 = 0” equation from this Wikipedia article. And for that, there is actually no analytical solution, but only an approximation. However, I then switched the plus and minus signs in the equation, simply because the resulting graph suited my purpose of explaining gradient descent in the next post better.
So, I just assumed that there would also be no analytical solution for the “x⁵ + x − 1 = 0” equation, which turned out not to be the case. This fact, however, doesn’t change anything about the explanation of the gradient descent algorithm itself. It just means that you could have also solved the equation analytically.