This post is part of a series:
In this series of posts, we are going to cover how the decision tree algorithm works.
And since it is much easier to understand abstract concepts with an example, I am going to explain how the decision tree algorithm works by falling back on the often-used and well-known Iris flower data set. In fact, it is so well-known that it even has its own Wikipedia article.
But instead of just relying on the data set alone to explain the algorithm, I am going to give it some context by introducing a made-up scenario.
See slide 1
That way, I hope, it is easier to get an intuition about how the decision tree algorithm generally works.
So, imagine you are a flower grower and you are specialized in Iris flowers. You grow three different types of flowers. And even though they are different types, they look extremely similar to each other which means that you have to be a real expert to be able to distinguish them.
Let’s say you have a large field with thousands of these flowers and let’s also say that you can’t have separate fields for each type because they don’t grow well in monocultures. So, they have to be mixed-up with each other.
And then, when it’s time for the harvest you bring in some low-level workers and you want them to pick the flowers and you also want them to sort the flowers. But they are obviously no Iris flower experts which means that you have to train them on how to distinguish the different types of flowers before they can start to work.
The problem with this approach, however, is that every season there are new workers, so you have to train them anew every single time. And that is just a very repetitive and inefficient way of doing things. So, the question is: How can you get the workers to sort the flowers without having to train them?
Train-Test-Split of the Data
To solve that problem, you want to analyze the flowers and see if you can find some patterns that might help the workers to distinguish them. So, you randomly pick 50 flowers of each species and gather some data on them.
See slide 2
And now, before you start analyzing these flowers, you randomly pick 20 flowers out of the 150 that you collected.
See slide 3
And these will serve as your first test case. With them you can check if the patterns that you detected in your analysis of the remaining 130 flowers are actually valid and really help to distinguish the different types of Iris-flowers.
So, to start the analysis, one thing you could do is to calculate some statistics, for example the mean of the several features with regards to the different types of Iris-flowers.
See slide 4
And here you could see that on average the petal width of Iris-setosa flowers are much smaller than those of the other two types. The same holds true for the petal length.
But just looking at such raw numbers, it is generally relatively difficult to detect patterns. A better way for doing that is to visualize the data.
See slide 5
So here, you created a scatter plot that depicts the features of the sepal leaves, so the sepal width and length. And as you can see, the Iris-setosa flowers are clearly set apart from the other two types. But the versicolor and virginica flowers are quite interspersed. So, this graph is probably not very useful for deriving rules to distinguish the different types of Iris-flowers.
And that’s because you can’t easily separate out areas where there is just one type of flower. And what I mean by this can be for example seen at a sepal length of 7cm. Above that, there are only blue dots. So, what you could say to the worker is: “Check the sepal length of the flower and if it is larger than 7cm, then it’s an Iris-virginica.” But that’s about it what you can do in this graph.
So, let’s have a look at the petal leaves.
See slide 6
And here the graph looks much better because the Iris-setosa flowers are clearly distinguished from the other two as well. But additionally, Iris-versicolor and Iris-virginica are also more clearly separated from each other. So, let’s stick with this plot for deriving some rules.
Creating Decision Boundaries
Looking at the graph, one possible first rule for the flower pickers might be: “Okay, check the petal width of the flower and if it is smaller than let’s say 0.8 centimeters, then the flower is clearly an Iris-setosa.” So, let’s add that decision boundary to the plot.
See slide 7
And since all the dots left of this line are green, let’s shade this whole area with a green color.
See slide 8
And now, if we go back to the field and pick a new flower and it falls somewhere into this green area, so if it has a petal width smaller than 0.8cm, then we will classify it as an Iris-setosa. And this means that we only have to consider the remaining data points and see if we can find some other good decision boundaries.
So, let’s start with the petal width again. Here, looking at the scatter plot, the boundary clearly has to be between a petal width of 1.6cm and 1.7cm. Let’s add that.
See slide 9
And similar as before, all the dots to the right of this line are blue. So, let’s shade that area accordingly.
See slide 10
And now, if we want to separate out the orange dots from the remaining 3 blue dots, we need to create a decision boundary along the petal length. And it should be at a petal length of 4.9cm.
See slide 11
This way there are only orange dots in bottom area and therefore we can shade the area with an orange color.
See slide 12
And finally, we are only left with a small area which hasn’t been classified yet. Here, we could add one or maybe even two additional boundaries to isolate the orange dot from the blue dots. But we are not going to do that because the goal of our analysis is to somehow capture general characteristics of the different types of Iris-flowers.
For example, a general characteristic might be that, with regards to their petal leaves, the Iris-setosa flowers are the smallest flowers, Iris-versicolor are kind of the middle flowers and Iris-virginica are the biggest flowers. But since there are not many flowers in this area, it is probably not very representative for Iris flowers in general.
So, they are kind of outliers. The blue dots, for example, are virginica flowers that have a slightly smaller petal width than what is normal for Iris-virginica flowers. So, by making really intricate decision boundaries in this area we might include some information of those extreme examples into our decision making process. And this is not desirable if we want to capture the general traits of Iris flowers.
But, nonetheless, we have to classify this area in the case a new flower falls into it. And since there are 3 blue dots and only 1 orange dot, it is more likely that the flower would be an Iris-virginica. And for that reason, we are going to shade this area with a blue color.
See slide 13
And now, we are done with our analysis. We can classify any new flower that falls onto this diagram.
Testing the Decision Boundaries
So, let’s see how good the decision boundaries classify our 20 test flowers that we extracted from our original data set at the beginning of this post.
See slide 14
So, these are our 20 test flowers. Let’s add our decision boundaries to see how they would be classified.
See slide 15
And now let’s have a look what type these flowers actually are.
See slide 16
As we can see, all flowers that were predicted to be Iris-setosa, actually are Iris-setosa. The same is true for Iris-versicolors. There is only one misclassification in the Iris-virginica area. So, let me remove the shading so that it is easier to see.
See slide 17
There is one orange dot in this area. So, it actually is an Iris-versicolor but it was misclassified as an Iris-virginica. That means: 19 out of 20 flowers were classified correctly which results in an accuracy of 95%.
And you as the flower grower decide that this accuracy is sufficient and therefore you are going to implement this approach instead of training the flower pickers anew every season.
See slide 18
So, you simply give the flower pickers a ruler and a sheet of paper with this diagram on it and then they can use it, just like we did, to determine what type a specific flower is and then sort them accordingly. And with that, we have solved our initial problem.
And as it turns out, what we have done so far in this post to create those decision boundaries, is exactly how the decision tree algorithm works. In fact, we can also illustrate our diagram as a decision tree.
See slide 19
So, as you remember, the first boundary was the one at a petal width of 0.8cm. If the petal width of the respective flower is smaller or equal to that, then the flower is an Iris-setosa. Otherwise, we have to check the other boundaries.
The second boundary was the one at a petal width of 1.65cm. If the petal width of the respective flower is bigger than that, then the flower is an Iris-virginica. If it is smaller or equal to that, then we have to consider the last boundary.
The third boundary was at a petal length of 4.9cm. And if the petal length of the respective flower is smaller or equal to that, then the flower is an Iris-versicolor. Otherwise, it is an Iris-virginica.
And now we can use this decision tree just like the diagram to classify the flowers. So, let’s say for example that we pick a flower with the following values.
See slide 20
First, we have to check if the petal width is smaller or equal to 0.8cm. That’s not the case because it’s 1.1cm. So, we have to move to the right. Then, we check if the petal width is smaller or equal to 1.65cm. Yes it is. So, we have to move to the left. And lastly, we have to check if the petal length is smaller or equal to 4.9cm. It’s just 3cm. So, we have to move to the left and the flower would be classified as an Iris-versicolor.
And now, let’s also see where it would fall on the diagram with the decision boundaries.
See slide 21
And, as expected, it falls into the Iris-versicolor area. So, let’s finally check what type the flower actually is.
See slide 22
And it is indeed an Iris-versicolor.
Decision Tree Algorithm
So, as you can see, our decision boundaries can also be depicted as a decision tree. And the process (or algorithm) we went through to create it, looks something like this.
See slide 23
First, we need to check if the data is pure (meaning that there is just one class). If it is, then we classify the data and stop. If it is not, we need to find the best feature to split the data and then split it into two parts.
One part contains the data points with values that are smaller or equal to that feature. And the other part contains the data points with values that are bigger than that feature. And then, we repeat this process for both parts. And we keep doing that until all parts are eventually pure and reach the stop field.
So, to see this process in action, let’s quickly go through the creation of the decision boundaries again.
See slide 24
First, we need to check if the data is pure. This is not the case because there are 3 different classes. So, we have to the right. Then we need to find the best feature to split the data and we said it is a petal width of 0.8cm. So, we split the data into two parts.
See slide 25
Now, for the data points that have a petal width that is smaller or equal to 0.8cm, so for the dots on the left side of the decision boundary, we check again if the data is pure. And this time it is. So, we classify this area and stop which means we don’t have to consider this side anymore.
See slide 26
For the other part, the data is not pure. So, we need to move to the right and split it again into two parts.
See slide 27
This time the dots on the right side of the decision boundary are pure and we can classify them.
See slide 28
The other side is not pure, so we need to split the data again.
See slide 29
Here, the data points below the boundary are pure and we can classify them.
See slide 30
And for the other part, we get to a point where we, in our original analysis, deviated from this algorithm. Actually, we would need to split the data again because it is not pure. But we didn’t do it for the aforementioned reasons.
We could, however, easily implement that approach from the analysis. Simply by adding another process right before we check if the data is pure. What we could do for example is the to ask whether there are more than 5 data points.
See slide 31
If there are, then we follow the same algorithm as before. If there aren’t, then we classify the area based on the class which appears more often or if the different classes appear equally often, then simply pick one. So, since there are only 4 dots in this area, we have to go to the left and classify it.
See slide 32
And now we reached the stop field for all the splits of the data and the whole algorithm stops.
So, this is the specific algorithm we used to create those decision boundaries. But for the upcoming post, we are not going to consider the part where we check if there are more than 5 data points. We are going to stick with just the more basic version of the algorithm.
See slide 33
And the important question we need to answer now is: What does it mean “to find the best feature to split the data”?
See slide 34
We did that kind of intuitively but if we want to automate this process, i.e. create an algorithm, then we need a metric that precisely tells us what the best feature is to split the data. And what such a metric could look like will be the topic of the next post.