This post is part of a series:
In the previous post I introduced the Iris flower data set and a made-up scenario in which you imagine being a flower grower.
See slide 1
The problem you were trying to solve was to devise a method that would enable the flower pickers to sort the very similar-looking flowers, so that you wouldn’t have to train them anymore.
With the help of this scenario I showed how machine learning works at a high level. The final result of the post was an overview diagram depicting the different types of machine learning and the different elements needed to make a machine learn.
See slide 2
In this part of the series I am going to talk about just one element of this diagram, namely the algorithm. In the last post, I simply said that you show the algorithm a bunch of examples and then it will learn to predict the label of a particular example solely by looking at its features. In this post, and the next one, we are going to cover what that learning process actually looks like.
The specific algorithm we are going to do that with is the decision tree algorithm. To first get an intuition about how decision trees generally work, I want you to imagine again that you are the flower grower and that you have to solve the same problem as in the previous post. But this time, you are going to solve it without any machine learning at all.
Train-Test-Split of the Data
So, just like in the last post, you randomly pick 50 flowers of each type and you measure their leaves to gather some data on the Iris flowers.
See slide 3
And now, instead of using machine learning, you analyze this data yourself to see if you can find some patterns which might help the workers to distinguish the different types of flowers. But before you start, you randomly set aside 20 of the 150 flowers that you collected.
See slide 4
These will serve as your test set. With them you can check whether the patterns you detect in your analysis of the remaining 130 flowers are actually valid and really help to distinguish the different types of Iris flowers.
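This hold-out step can be sketched in a few lines of Python. The list of indices here is just a stand-in for the 150 measured flowers; the seed and the variable names are my own choices:

```python
import random

random.seed(1)  # fixed seed so the split is reproducible

# Stand-ins for the 150 measured flowers (50 per type)
flowers = list(range(150))
random.shuffle(flowers)

test_flowers = flowers[:20]    # 20 held-out flowers for the final check
train_flowers = flowers[20:]   # 130 flowers used for the analysis
```

Because the 20 flowers are picked randomly, they should be roughly representative of the whole collection.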
So, to start, one thing you could do is calculate some statistics, for example the mean of each feature for each of the three types of Iris flowers.
See slide 5
And here you can see that on average the petal width of Iris-setosa flowers is much smaller than that of the other two types. The same holds true for the petal length.
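Computing such per-type means is a small grouping exercise. The measurements below are made-up values, chosen only to lie in the rough ranges of the real data set:

```python
from collections import defaultdict

# Made-up (type, petal_width_cm) measurements for illustration
measurements = [
    ("Iris-setosa", 0.2), ("Iris-setosa", 0.3), ("Iris-setosa", 0.2),
    ("Iris-versicolor", 1.3), ("Iris-versicolor", 1.5),
    ("Iris-virginica", 2.1), ("Iris-virginica", 2.3),
]

totals = defaultdict(lambda: [0.0, 0])  # type -> [sum, count]
for flower_type, petal_width in measurements:
    totals[flower_type][0] += petal_width
    totals[flower_type][1] += 1

means = {t: s / n for t, (s, n) in totals.items()}
```

Even on these few toy values, the setosa mean comes out clearly smaller than the other two.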
But just looking at such raw numbers, it is generally quite difficult to detect patterns. A better way to do that is to visualize the data.
See slide 6
So here, you created a scatter plot that depicts the features of the sepal leaves, i.e. the sepal width and length. As you can see, the Iris-setosa flowers are clearly set apart from the other two types, but the versicolor and virginica flowers are quite interspersed. So, this graph is probably not very useful for deriving rules to distinguish the different types of Iris flowers.
That’s because you can’t easily separate out areas that contain just one type of flower. What I mean by this can be seen, for example, at a sepal length of 7cm. Above that, there are only blue dots. So, what you could say to the worker is: “Check the sepal length of the flower and if it is larger than 7cm, then it’s an Iris-virginica.” But that’s about all you can do with this graph.
So, let’s have a look at the petal leaves.
See slide 7
And here the graph looks much better: the Iris-setosa flowers are again clearly distinguished from the other two, and additionally, Iris-versicolor and Iris-virginica are more clearly separated from each other. So, let’s stick with this plot for deriving some rules.
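If you want to reproduce such a scatter plot yourself, a minimal matplotlib sketch could look like this. The points are made-up values in the rough ranges of the three types, not the real measurements:

```python
import matplotlib
matplotlib.use("Agg")          # render off-screen
import matplotlib.pyplot as plt

# Illustrative (petal_length_cm, petal_width_cm) points, made up to
# roughly match the ranges of the three types in the real data set
species_points = {
    "Iris-setosa":     [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2)],
    "Iris-versicolor": [(4.2, 1.3), (4.5, 1.5), (3.9, 1.2)],
    "Iris-virginica":  [(5.8, 2.1), (6.1, 2.3), (5.5, 2.0)],
}

fig, ax = plt.subplots()
for species, points in species_points.items():
    xs, ys = zip(*points)
    ax.scatter(xs, ys, label=species)  # one color per type

ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
ax.legend()
fig.savefig("petal_scatter.png")
```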
Creating Decision Boundaries
Looking at the graph, one possible first rule for the flower pickers might be: “Okay, check the petal width of the flower and if it is smaller than let’s say 0.8 centimeters, then the flower is clearly an Iris-setosa.” So, let’s add that decision boundary to the plot.
See slide 8
And since all the dots left of this line are green, let’s shade this whole area with a green color.
See slide 9
And now, if we go back to the field and pick a new flower and it falls somewhere into this green area, so if it has a petal width smaller than 0.8cm, then we will classify it as an Iris-setosa. And this means that we only have to consider the remaining data points and see if we can find some other good decision boundaries.
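As a tiny sketch, this first rule could be written as a function that either classifies a flower or defers the decision (the function name is my own):

```python
def first_rule(petal_width_cm):
    # Left of the 0.8 cm boundary everything is Iris-setosa
    if petal_width_cm < 0.8:
        return "Iris-setosa"
    return None  # not decided yet; further boundaries needed
```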
So, let’s start with the petal width again. Here, looking at the scatter plot, the boundary clearly has to be between a petal width of 1.6cm and 1.7cm. Let’s add that.
See slide 10
And similar to before, all the dots to the right of this line are blue. So, let’s shade that area accordingly.
See slide 11
And now, if we want to separate out the orange dots from the remaining 3 blue dots, we need to create a decision boundary along the petal length. And it should be at a petal length of 4.9cm.
See slide 12
This way there are only orange dots in the bottom area and therefore we can shade the area with an orange color.
See slide 13
And finally, we are only left with a small area which hasn’t been classified yet. Here, we could add one or maybe even two additional boundaries to isolate the orange dot from the blue dots. But we are not going to do that because the goal of our analysis is to somehow capture general characteristics of the different types of Iris-flowers.
For example, a general characteristic might be that, with regards to their petal leaves, the Iris-setosa flowers are the smallest flowers, Iris-versicolor are kind of the middle flowers and Iris-virginica are the biggest flowers. But since there are not many flowers in this area, it is probably not very representative for Iris flowers in general.
So, they are kind of outliers. The blue dots, for example, are virginica flowers that have a slightly smaller petal width than is normal for Iris-virginica flowers. By drawing really intricate decision boundaries in this area, we would encode information about those extreme examples into our decision-making process. And this is not desirable if we want to capture the general traits of Iris flowers.
But, nonetheless, we have to classify this area in case a new flower falls into it. And since there are 3 blue dots and only 1 orange dot, it is more likely that such a flower would be an Iris-virginica. For that reason, we are going to shade this area with a blue color.
See slide 14
And now, we are done with our analysis. We can classify any new flower that falls onto this diagram.
Testing the Decision Boundaries
So, let’s see how well the decision boundaries classify the 20 test flowers that we set aside from our original data set at the beginning of this post.
See slide 15
So, these are our 20 test flowers. Let’s add our decision boundaries to see how they would be classified.
See slide 16
And now let’s have a look at what type these flowers actually are.
See slide 17
As we can see, all flowers that were predicted to be Iris-setosa, actually are Iris-setosa. The same is true for Iris-versicolors. There is only one misclassification in the Iris-virginica area. So, let me remove the shading so that it is easier to see.
See slide 18
There is one orange dot in this area. So, it actually is an Iris-versicolor, but it was misclassified as an Iris-virginica. That means 19 out of 20 flowers were classified correctly, which results in an accuracy of 95%.
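The accuracy computation itself is simple enough to spell out as a small helper:

```python
def accuracy(predicted, actual):
    """Fraction of flowers whose predicted type matches the actual type."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Example shaped like our test result: 19 of 20 predictions match
predicted = ["Iris-virginica"] * 20
actual = ["Iris-virginica"] * 19 + ["Iris-versicolor"]
print(accuracy(predicted, actual))  # 0.95
```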
And you as the flower grower decide that this accuracy is sufficient and therefore you are going to implement this approach instead of training the flower pickers anew every season.
See slide 19
So, you simply give the flower pickers a sheet of paper with this diagram on it and then they can use it, just like we did, to determine what type a specific flower is and then sort them accordingly.
So, to sum up, we solved the same problem as in the last post. But this time we did it without using any machine learning at all.
But as it turns out, what we have done so far in this post to create those decision boundaries is exactly how the decision tree algorithm works. In fact, we can also illustrate our diagram as a decision tree.
See slide 20
So, as you remember, the first boundary was the one at a petal width of 0.8cm. If the petal width of the respective flower is smaller or equal to that, then the flower is an Iris-setosa. Otherwise, we have to check the other boundaries.
The second boundary was the one at a petal width of 1.65cm. If the petal width of the respective flower is bigger than that, then the flower is an Iris-virginica. If it is smaller or equal to that, then we have to consider the last boundary.
The third boundary was at a petal length of 4.9cm. And if the petal length of the respective flower is smaller or equal to that, then the flower is an Iris-versicolor. Otherwise, it is an Iris-virginica.
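Read from top to bottom, this tree translates directly into nested if/else checks. A sketch (the function and argument names are my own):

```python
def classify(petal_length_cm, petal_width_cm):
    """Our three decision boundaries, written as a decision tree."""
    # First boundary: petal width of 0.8 cm
    if petal_width_cm <= 0.8:
        return "Iris-setosa"
    # Second boundary: petal width of 1.65 cm
    if petal_width_cm > 1.65:
        return "Iris-virginica"
    # Third boundary: petal length of 4.9 cm
    if petal_length_cm <= 4.9:
        return "Iris-versicolor"
    return "Iris-virginica"
```

For instance, `classify(3.0, 1.1)` returns `"Iris-versicolor"`, matching the worked example that follows.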
And now we can use this decision tree just like the diagram to classify the flowers. So, let’s say for example that we pick a flower with the following values.
See slide 21
First, we have to check if the petal width is smaller or equal to 0.8cm. That’s not the case because it’s 1.1cm. So, we have to move to the right. Then, we check if the petal width is smaller or equal to 1.65cm. Yes it is. So, we have to move to the left. And lastly, we have to check if the petal length is smaller or equal to 4.9cm. It’s just 3cm. So, we have to move to the left and the flower would be classified as an Iris-versicolor.
And now, let’s also see where it would fall on the diagram with the decision boundaries.
See slide 22
And, as expected, it falls into the Iris-versicolor area. So, let’s finally check what type the flower actually is.
See slide 23
And it is indeed an Iris-versicolor.
Decision Tree Algorithm
So, as you can see, our decision boundaries can also be depicted as a decision tree. And the process, or algorithm, we went through to create it looks something like this.
See slide 24
First, we check if the data is pure (meaning that it contains just one class). If it is, then we classify the data and stop. If it is not, we need to find the best feature (and value) to split the data on and then split it into two parts.
One part contains the data points whose value for that feature is smaller than or equal to the split value, and the other part contains the data points whose value is bigger than that. Then, we repeat this process for both parts. And we keep doing that until all parts are eventually pure and reach the stop field.
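This process is naturally written as a recursion. The sketch below uses a naive stand-in for "find the best split": it tries every feature and every midpoint between neighboring values and keeps the split whose two parts contain the fewest minority points. The proper metric for this step is the topic of the next post, so treat `naive_best_split` as a placeholder, not the real thing:

```python
from collections import Counter

def naive_best_split(points, labels):
    """Stand-in split chooser: try every feature and every midpoint
    between neighboring values; keep the split whose two parts
    contain the fewest 'minority' points."""
    best = None
    for f in range(len(points[0])):
        values = sorted(set(p[f] for p in points))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [l for p, l in zip(points, labels) if p[f] <= t]
            right = [l for p, l in zip(points, labels) if p[f] > t]
            cost = (len(left) - max(Counter(left).values())
                    + len(right) - max(Counter(right).values()))
            if best is None or cost < best[0]:
                best = (cost, f, t)
    return best[1], best[2]

def build_tree(points, labels):
    """Recursive version of the flowchart: stop if the data is pure,
    otherwise split into two parts and repeat for both."""
    if len(set(labels)) == 1:          # pure -> classify and stop
        return labels[0]
    f, t = naive_best_split(points, labels)
    left = [(p, l) for p, l in zip(points, labels) if p[f] <= t]
    right = [(p, l) for p, l in zip(points, labels) if p[f] > t]
    return {"feature": f, "threshold": t,
            "left": build_tree([p for p, _ in left], [l for _, l in left]),
            "right": build_tree([p for p, _ in right], [l for _, l in right])}

def predict(tree, point):
    """Walk a flower down the tree to its class."""
    while isinstance(tree, dict):
        side = "left" if point[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree
```

On a small, cleanly separable toy sample this recursion produces exactly the kind of nested boundaries we drew by hand.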
So, to see this process in action, let’s quickly go through the creation of the decision boundaries again.
See slide 25
First, we need to check if the data is pure. This is not the case because there are 3 different classes. So, we have to move to the right. Then we need to find the best feature to split the data, and we said it is a petal width of 0.8cm. So, we split the data into two parts.
See slide 26
Now, for the data points that have a petal width that is smaller or equal to 0.8cm, so for the dots on the left side of the decision boundary, we check again if the data is pure. And this time it is. So, we classify this area and stop which means we don’t have to consider this side anymore.
See slide 27
For the other part, the data is not pure. So, we need to move to the right and split it again into two parts.
See slide 28
This time the dots on the right side of the decision boundary are pure and we can classify them.
See slide 29
The other side is not pure, so we need to split the data again.
See slide 30
Here, the data points below the boundary are pure and we can classify them.
See slide 31
And for the other part, we get to a point where we, in our original analysis, deviated from this algorithm. Actually, we would need to split the data again because it is not pure. But we didn’t do it for the aforementioned reasons.
We could, however, easily incorporate that approach from the analysis into the algorithm, simply by adding another check right before we test whether the data is pure. For example, we could ask whether there are more than 5 data points.
See slide 32
If there are, then we follow the same algorithm as before. If there aren’t, then we classify the area based on the class that appears most often, or, if the classes appear equally often, we simply pick one. So, since there are only 4 dots in this area, we have to go to the left and classify it.
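This extra check only changes the stopping condition. As a sketch (the threshold of 5 comes from the flowchart; the function name is my own):

```python
from collections import Counter

def stop_and_classify(labels, min_samples=5):
    """Extra check from the flowchart: with `min_samples` points or
    fewer, stop and label the area with the majority class (ties are
    broken by simply picking one of the tied classes)."""
    if len(labels) <= min_samples:
        return Counter(labels).most_common(1)[0][0]
    return None  # more than min_samples points: keep splitting
```

For the disputed area with its 3 Iris-virginica dots and 1 Iris-versicolor dot, this returns `"Iris-virginica"`, just as we decided by hand.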
See slide 33
And now we reached the stop field for all the splits of the data and the whole algorithm stops.
So, this is the specific algorithm we used to create those decision boundaries. But for the upcoming post, we are not going to consider the part where we check if there are more than 5 data points. We are going to stick with just the more basic version of the algorithm.
See slide 34
And the important question we need to answer now is: What does it mean “to find the best feature to split the data”?
See slide 35
We did that kind of intuitively but if we want to automate this process, i.e. create an algorithm, then we need a metric that precisely tells us what the best feature is to split the data. And what such a metric could look like will be the topic of the next post.