This post is part of a series:
In the previous post, we left off with the observation that we need a metric that precisely tells us what the best feature is to split the data on.
See slide 1
So, what could such a metric look like?
“Best feature to split the data?”
Well, if we look back at where we made our first split decision, we split the data at a petal width of 0.8cm.
See slide 2
Why did we do that? We did it because it resulted in a pure separation of the Iris-setosa flowers, meaning that to the left of the line, there are only green dots. Or in other words, the proportion or probability of the green dots in that area is 1.
This is what we want, because let’s say we pick a new flower and it falls somewhere to the left of the decision boundary.
See slide 3
In this case, we can predict with 100% certainty that the new, unknown flower is an Iris-setosa. This changes, however, if we move the decision boundary slightly more to the right.
See slide 4
In this case, the probability of the green dots isn’t 1 anymore. Instead, it decreased to let’s say 0.9 since there are now some additional orange dots in the left area. Here, we would still predict that the new flower is an Iris-setosa since the majority of the dots in this area are green. But our certainty of that prediction has decreased because there is a small chance that the flower might actually be an Iris-versicolor.
So clearly, the probability of a class in a split area plays an important role for making predictions. Namely: the higher the probability, the higher the certainty of our predictions about new, unknown flowers that fall into that area.
See slide 5
And since the ultimate goal of our decision tree algorithm is to make such predictions, then clearly the metric we use to automatically build that decision tree must make use of probabilities in some way.
And in fact, there are many different such metrics and they have all one thing in common. They all manipulate the probability in some way to come up with a number that tells us where to best split the data.
An often-used metric and the one we are going to look at is called “Shannon Entropy”.
See slide 6
To calculate it, we multiply the probability of a class with the logarithm base 2 of that probability. And we do that for all the classes that we have and then we sum up those values. And lastly, we take the negative of that sum.
And this might seem complicated and unintuitive at first glance but let’s have a look what the different parts do and how we could interpret them.
As expected, the metric makes use of the probability of the classes. It even includes it twice in the formula. But only the second one is used in the way we thought about probability when we talked about how different probabilities affect the certainty of our predictions. So, let’s look at that first.
The formula does something strange with it. It calculates the logarithm base 2 of that probability. Why does it do that?
Well, earlier we stated that: the higher the probability of a class, the higher the certainty of our prediction (called statement 1 from now on). This can also be stated somewhat differently, namely: The higher the probability, the lower the uncertainty of our prediction (called statement 2 from now on).
See slide 7
Both statements ultimately have the same meaning, they are only expressed in a different way. And taking the logarithm of the probability represents that second statement. To be more precise, taking the negative of the logarithm of the probability. To understand why, let’s look at a diagram of the logarithm base 2.
See slide 8
And since we are dealing with probabilities, which can take on only values between 0 and 1, we only need to look at the part of the graph where x is between 0 and 1.
See slide 9
And what we can see here is that the logarithm function returns large negative numbers for low probabilities and as the probability increases, the function increases as well. So, it seems to actually represent statement 1.
But there is a problem. As the probability reaches its highest possible value, namely 1, the function returns a zero. So, for the decision boundary where there were only green dots on the left side of the split (scatter plot at the top), the certainty of our prediction would be zero, meaning that we have no certainty. This, obviously, makes no sense because this is the best possible scenario for making predictions and we are absolutely certain that the new flower is an Iris-setosa.
So, the “logarithm base 2”-graph, after all, doesn’t really represent the first statement. But if we flip it, simply by adding a minus in front of the logarithm, then it perfectly represents the second statement.
See slide 10
For low probabilities it returns high values. So, for example, in the left side of the bottom scatter plot, the probability of the orange dots is very low. And if we would predict the new flower to be an Iris-versicolor, then the uncertainty of that prediction would be very high. And that’s simply because it is much more likely to be an Iris-setosa. So, this scenario is appropriately reflected by the “negative logarithm base 2”-graph.
And then, as the probability increases, the “negative logarithm base 2”-graph decreases until it ultimately reaches 0 when the probability is 1. Meaning this time, that we have absolutely no uncertainty about our prediction. And that is a more accurate description of the situation that we encountered on the left side of the upper scatter plot.
So, as you can see, the “negative logarithm base 2”-graph perfectly represents the second statement. There is, however, one edge case and it is when the probability is 0. And that’s because you can’t take the logarithm of zero. So, by convention it is defined as being zero.
See slide 11
And if you think about it, it makes perfect sense. For instance, in the left side of the upper scatter plot, there are no orange and blue dots. So, their probability is 0. If we would now predict the new flower to be either an Iris-versicolor or an Iris-virginica, then our uncertainty about that prediction wouldn’t be infinitely high like the “negative logarithm base 2”-graph might suggest. But it would be 0 because we know that there is absolutely no chance that this flower is an Iris-versicolor or Iris-virginica.
So now, since this “negative logarithm base 2”-graph reflects statement 2, we can interpret the y-axis of the graph as the “uncertainty value” that is associated with predicting a new flower to be a certain class that has probability p.
See slide 12
But as you have probably already noticed, in the entropy formula, the minus sign is actually in front of the sigma and not in front of the logarithm. But this is not a problem because you could also just write the formula in such a way that the minus sign is in front of the logarithm.
See slide 13
This means the same thing because it doesn’t matter if you add up a bunch of negative numbers and then take the negative of this sum. Or if you just add up those numbers as positive values. The result is the same.
But writing the formula in this new way, we can now indicate that the expression with the second “p” in the formula can be interpreted as the uncertainty value.
See slide 14
So now, we know what the second “p” of the entropy formula does. But what about the first one? To explain that, we need to think about what the best and worst cases actually are for making predictions. So, which probabilities result in a low uncertainty and which probabilities result in a high uncertainty.
Best- and worst-case scenarios for making predictions?
The best case scenario for making predictions, we have already covered when we looked at the “uncertainty value”-part of the formula.
See slide 15
And it was either a probability of 0 or 1. And that’s because the resulting uncertainty would be zero. So, if we would just take that part of the formula as our final formula for the entropy, then the best case scenarios would be represented appropriately. But what about the worst case scenario?
Well, according to the “uncertainty value”-graph the worst case (where the uncertainty is the highest) would be a very low probability. This, however, doesn’t really make sense because we have to keep in mind that the probabilities of the different classes in a split area add up to one.
So, if there are for example just two classes (left side of the bottom scatter plot) and one of those classes has a low probability (orange dots), then the other inherently must have a high probability (green dots).
And obviously, we would then make our predictions based on the class with the high probability. And therefore, the uncertainty value of our prediction would be relatively low. So, after all, a low probability can’t be the worst-case scenario for making predictions.
But this begs the question: What is the worst-case scenario?
Well, if you think about it, the worst case really is when the probabilities of the different classes are equal. So, let’s say for example that there is an equal number of orange and blue dots on the right side of the upper scatter plot. If a new flower now falls into this area, then we are completely uncertain about our prediction because both types, Iris-versicolor and Iris-virginica, are equally likely.
And the situation gets even worse, the more classes there are. So, for example, if there would be an additional equal number of let’s say yellow dots in this area, then we would be even more uncertain about our prediction of what type this unknown flower is.
So, those are the two characteristics of the worst-case scenario for making predictions: the probabilities of the different classes being equal and, generally speaking, the more classes there are, the worse the situation. And accordingly, our metric should return higher values if those conditions are met.
And that’s what the first part of the entropy formula is for. It’s basically just a weighted sum.
See slide 16
And what it does is, it sums up the different uncertainty values associated with predicting all the different classes. But those values only contribute to the final sum proportionally to how prevalent the respective class is in a split area. So, if a class doesn’t appear that often, it shouldn’t have a large impact on the overall sum.
And this has an interesting effect because we know that if the probability of a class is low (orange dots on the left side of the bottom scatter plot), then the uncertainty of predicting that class is high. But we also know that if the probability of one class is low and if there are just two classes, then the probability of the other class must be high (green dots) and its uncertainty value correspondingly is low.
So, one class has a low weight but a high uncertainty value. And the other class has a high weight but a low uncertainty value. So, in a sense, they cancel each other out. And this results in our desired behavior that the entropy is maximized for a given number of different classes, when the probabilities of these classes are equal.
And since the formula is a sum over all the classes, it generally is larger, the more classes there are because you simply have to add up more elements. So, both conditions of the worst-case scenario for making predictions are being met by this formula.
And to show that this is really true, let’s look at the graph of the entropy for the case that there are just 2 classes.
See slide 17
At the bottom x-axis, the probability of the first class is depicted and at the top the probability of the second class is depicted. And as already said before they both have to add up to one. So, when the bottom probability is for example 0.1, the top has to be 0.9 and vice versa.
At the y-axis, the entropy is depicted which was simply calculated using the formula with the respective probabilities. And as you can see, it reaches its peak, which represents the worst case for making predictions, when both probabilities are equal. And as soon as one probability deviates from this point, the entropy starts to decrease until it reaches zero when either of the probabilities is 1 or 0. So, the graph also represents the best-case scenario for making predictions.
So, let’s now have a look at how the “max entropy”-value behaves as the number of classes increase.
See slide 18
As we have seen in the entropy-graph, “max entropy” means that the probabilities of the classes are equal. So, in case there are 3 classes, each class has a probability of 1/3. In case there are 4 classes, each has a probability of 1/4 and so on. And as you can see, the entropy values increase as the number of classes increases.
So, to conclude, the formula for the entropy appropriately reflects the best- and worst-case scenarios for making predictions. And with it we can, so to say, calculate an average uncertainty that is associated with making predictions in a split area.
See slide 19
So, for the left split area of the upper scatter plot, the average uncertainty is 0 because there is just one class. So, the probability is 1 and accordingly the entropy is 0. But there are obviously two such areas created by our decision boundary, so we have to calculate another entropy.
And therefore, we obviously need to know exactly how many orange and blue dots there are. And as you can see next to the legend, there is an equal number of orange and blue dots. Hence, we are exactly at the peak of the entropy diagram and thus, the entropy of that split area is 1.
And now, let’s also determine the entropy values for the decision boundary in the bottom scatter plot. Here, the probabilities aren’t so convenient that we can simply look up the entropy in the graph. So, we have to actually calculate them. Let’s start with the left side.
See slide 20
To calculate the entropy, you have to know that there are actually 6 orange dots and not just 5. And that’s because two have the exact same values and therefore they overlap. So, in total, there are 52 dots in the left area and accordingly the probability of the green dots is 46/52. And the probability of the orange dots is 6/52. And this equates to an entropy of 0.516.
In the same manner we can calculate the entropy of the right side.
See slide 21
So now, let’s compare the two decision boundaries with each other.
See slide 22
As we can see, the entropy on the left side of the split is lower in the upper scatter plot. Whereas, the entropy on the right side is lower in the bottom scatter plot. So, as we move the split more to the right (from a petal width of 0.8cm to 1.05cm), the situation for making predictions gets worse on the left side, but for the right side the situation gets better.
This is a problem. Namely, how do we now decide which of these splits is better?
And I know in this case it’s probably easy to say because the situation on the left side gets a lot worse, whereas the situation on the right side only improves slightly. But it might not always be this obvious, so we need a metric that exactly tells us which of these splits is better. And therefore, we calculate, so to say, the overall entropy for the entire graph that is the result of making a certain split.
See slide 23
And we compute it, similar as before, with a weighted sum. But this time it’s a sum over the two areas. So, the weights are the number of dots in the respective area divided by the number of dots in the whole plot. So, let’s calculate the overall entropy values for the decision boundary in the upper scatter plot.
See slide 24
And then, let’s do the same for the bottom scatter plot.
See slide 25
So, as we can see, the split in the upper scatter plot at a petal width of 0.8cm results in a lower overall entropy and therefore it is the better option out of those two splits.
Side note: I want to quickly mention that conventionally, this step of ultimately choosing between splits is done using a metric called “information gain”. However, calculating this overall entropy is essentially the same, but I think it has the advantage of being a little bit more intuitive than the information gain.
So now, we finally have a metric to decide which of these two splits is better. And this leaves only one final question to answer, namely: What are the splits that we should we compare with each other in the first place?
In our original analysis we decided to make the split at a petal width of 0.8. But we could have also chosen a petal width of 0.79 or 0.799 or 0.7999 and so on. So, there are infinitely many that we could potentially choose from which begs the question: How do we know which ones we should compare with each other?
Well, for a split to make sense, it must be between two dots. And ideally, it is exactly in the middle, so that it is as far away from both points as possible. So, a simple approach for finding potential splits is to consider all the values that lie exactly in the middle of two points as potential splits.
See slide 26
So, those are all the potential splits that you need to consider with respect to the petal width and length. And, obviously, you also need to do the same thing for the sepal length and width. And then, you simply calculate the overall entropy values for all those splits and choose the one that results in the lowest overall entropy.
“Best feature to split the data?” reviewed
So now, we can finally answer the question what it actually means to find the best feature to split the data.
See slide 27
Namely, first we need to determine all potential splits and then, we choose the one that results in the lowest overall entropy. And based on that, we split the data into two parts.
See slide 28
And this is now a decision tree algorithm that can be automated using code. And how to do that, you can check out in this series of posts.