This post is part of a series:
In the previous post I introduced to you the Iris flower data set and a madeup scenario where you should imagine to be a flower grower. See slide 1 And the problem you were trying to solve was to devise a method which should enable the flower pickers to sort the very similar looking flowers so that you wouldn’t have to train them anymore. With the help of this scenario I showed how machine learning works at a high level. And the final result of the post was this overview diagram where I depicted the different types of machine learning and the different elements needed to make a machine learn. See slide 2 In this part of the series I am going to talk about just one element of this diagram, namely the algorithm. In the last post, I simply said that you show the algorithm a bunch of examples and then it will learn to predict the label of a particular example solely by looking at its features. In this post, and the next one, we are going to cover what that learning process actually looks like. And the specific algorithm we are going to do that with is the decision tree algorithm. And therefore, to first get an intuition about how decision trees generally work, I want you to imagine again that you are the flower grower and that you have to solve the same problem as in the previous post. But this time, you are going to solve it without any machine learning at all. TrainTestSplit of the DataSo, just like in the last post, you randomly pick 50 flowers of each type and you measure their leaves to gather some data on the Iris flowers. See slide 3 And now, instead of using machine learning, you analyze this data yourself to see if you can find some patterns which might help the workers to distinguish the different types of flowers. But before you start, you randomly pick out 20 flowers out of the 150 that you collected. See slide 4 And these will serve as your first test case. With them you can check if the patterns that you detected in your analysis of the remaining 130 flowers are actually valid and really help to distinguish the different types of Irisflowers. Statistics: MeanSo, to start, one thing you could do is to calculate some statistics, for example the mean of the several features with regards to the different types of Irisflowers. See slide 5 And here you could see that on average the petal width of Irissetosa flowers are much smaller than those of the other two types. The same holds true for the petal length. Data VisualizationBut just looking at such raw numbers, it is generally relatively difficult to detect patterns. A better way for doing that is to visualize the data. See slide 6 So here, you created a scatter plot that depicts the features of the sepal leaves, so the sepal width and length. And as you can see, the Irissetosa flowers are clearly set apart from the other two types. But the versicolor and virginica flowers are quite interspersed. So, this graph is probably not very useful for deriving rules to distinguish the different types of Irisflowers. And that’s because you can’t easily separate out areas where there is just one type of flower. And what I mean by this can be for example seen at a sepal length of 7cm. Above that, there are only blue dots. So, what you could say to the worker is: “Check the sepal length of the flower and if it is larger than 7cm, then it’s an Irisvirginica.” But that’s about it what you can do in this graph. So, let’s have a look at the petal leaves. See slide 7 And here the graph looks much better because the Irissetosa flowers are clearly distinguished from the other two as well. But additionally, Irisversicolor and Irisvirginica are also more clearly separated from each other. So, let’s stick with this plot for deriving some rules. Creating Decision BoundariesLooking at the graph, one possible first rule for the flower pickers might be: “Okay, check the petal width of the flower and if it is smaller than let’s say 0.8 centimeters, then the flower is clearly an Irissetosa.” So, let’s add that decision boundary to the plot. See slide 8 And since all the dots left of this line are green, let’s shade this whole area with a green color. See slide 9 And now, if we go back to the field and pick a new flower and it falls somewhere into this green area, so if it has a petal width smaller than 0.8cm, then we will classify it as an Irissetosa. And this means that we only have to consider the remaining data points and see if we can find some other good decision boundaries. So, let’s start with the petal width again. Here, looking at the scatter plot, the boundary clearly has to be between a petal width of 1.6cm and 1.7cm. Let’s add that. See slide 10 And similar as before, all the dots to the right of this line are blue. So, let’s shade that area accordingly. See slide 11 And now, if we want to separate out the orange dots from the remaining 3 blue dots, we need to create a decision boundary along the petal length. And it should be at a petal length of 4.9cm. See slide 12 This way there are only orange dots in bottom area and therefore we can shade the area with an orange color. See slide 13 And finally, we are only left with a small area which hasn’t been classified yet. Here, we could add one or maybe even two additional boundaries to isolate the orange dot from the blue dots. But we are not going to do that because the goal of our analysis is to somehow capture general characteristics of the different types of Irisflowers. For example, a general characteristic might be that, with regards to their petal leaves, the Irissetosa flowers are the smallest flowers, Irisversicolor are kind of the middle flowers and Irisvirginica are the biggest flowers. But since there are not many flowers in this area, it is probably not very representative for Iris flowers in general. So, they are kind of outliers. The blue dots, for example, are virginica flowers that have a slightly smaller petal width than what is normal for Irisvirginica flowers. So, by making really intricate decision boundaries in this area we might include some information of those extreme examples into our decision making process. And this is not desirable if we want to capture the general traits of Iris flowers. But, nonetheless, we have to classify this area in the case a new flower falls into it. And since there are 3 blue dots and only 1 orange dot, it is more likely that the flower would be an Irisvirginica. And for that reason, we are going to shade this area with a blue color. See slide 14 And now, we are done with our analysis. We can classify any new flower that falls onto this diagram. Testing the Decision BoundariesSo, let’s see how good the decision boundaries classify our 20 test flowers that we extracted from our original data set at the beginning of this post. See slide 15 So, these are our 20 test flowers. Let’s add our decision boundaries to see how they would be classified. See slide 16 And now let’s have a look what type these flowers actually are. See slide 17 As we can see, all flowers that were predicted to be Irissetosa, actually are Irissetosa. The same is true for Irisversicolors. There is only one misclassification in the Irisvirginica area. So, let me remove the shading so that it is easier to see. See slide 18 There is one orange dot in this area. So, it actually is an Irisversicolor but it was misclassified as an Irisvirginica. That means: 19 out of 20 flowers were classified correctly which results in an accuracy of 95%. And you as the flower grower decide that this accuracy is sufficient and therefore you are going to implement this approach instead of training the flower pickers anew every season. See slide 19 So, you simply give the flower pickers a sheet of paper with this diagram on it and then they can use it, just like we did, to determine what type a specific flower is and then sort them accordingly. So, to sum up, we solved the same problem as in the last post. But this time we did it without using any machine learning at all. Decision TreeBut as it turns out, what we have done so far in this post to create those decision boundaries, is exactly how the decision tree algorithm works. In fact, we can also illustrate our diagram as a decision tree. See slide 20 So, as you remember, the first boundary was the one at a petal width of 0.8cm. If the petal width of the respective flower is smaller or equal to that, then the flower is an Irissetosa. Otherwise, we have to check the other boundaries. The second boundary was the one at a petal width of 1.65cm. If the petal width of the respective flower is bigger than that, then the flower is an Irisvirginica. If it is smaller or equal to that, then we have to consider the last boundary. The third boundary was at a petal length of 4.9cm. And if the petal length of the respective flower is smaller or equal to that, then the flower is an Irisversicolor. Otherwise, it is an Irisvirginica. And now we can use this decision tree just like the diagram to classify the flowers. So, let’s say for example that we pick a flower with the following values. See slide 21 First, we have to check if the petal width is smaller or equal to 0.8cm. That’s not the case because it’s 1.1cm. So, we have to move to the right. Then, we check if the petal width is smaller or equal to 1.65cm. Yes it is. So, we have to move to the left. And lastly, we have to check if the petal length is smaller or equal to 4.9cm. It’s just 3cm. So, we have to move to the left and the flower would be classified as an Irisversicolor. And now, let’s also see where it would fall on the diagram with the decision boundaries. See slide 22 And, as expected, it falls into the Irisversicolor area. So, let’s finally check what type the flower actually is. See slide 23 And it is indeed an Irisversicolor. Decision Tree AlgorithmSo, as you can see, our decision boundaries can also be depicted as a decision tree. And the process, or algorithm, we went through to create it looks something like this.
See slide 24 First, we need to check if the data is pure (meaning that there is just one class). If it is, then we classify the data and stop. If it is not, we need to find the best feature to split the data and then split it into two parts. One part contains the data points with values that are smaller or equal to that feature. And the other part contains the data points with values that are bigger than that feature. And then, we repeat this process for both parts. And we keep doing that until all parts are eventually pure and reach the stop field. So, to see this process in action, let’s quickly go through the creation of the decision boundaries again. See slide 25 First, we need to check if the data is pure. This is not the case because there are 3 different classes. So, we have to the right. Then we need to find the best feature to split the data and we said it is a petal width of 0.8cm. So, we split the data into two parts. See slide 26 Now, for the data points that have a petal width that is smaller or equal to 0.8cm, so for the dots on the left side of the decision boundary, we check again if the data is pure. And this time it is. So, we classify this area and stop which means we don’t have to consider this side anymore. See slide 27 For the other part, the data is not pure. So, we need to move to the right and split it again into two parts. See slide 28 This time the dots on the right side of the decision boundary are pure and we can classify them. See slide 29 The other side is not pure, so we need to split the data again. See slide 30 Here, the data points below the boundary are pure and we can classify them. See slide 31 And for the other part, we get to a point where we, in our original analysis, deviated from this algorithm. Actually, we would need to split the data again because it is not pure. But we didn’t do it for the aforementioned reasons. We could, however, easily implement that approach from the analysis. Simply by adding another process right before we check if the data is pure. What we could do for example is the to ask whether there are more than 5 data points. See slide 32 If there are, then we follow the same algorithm as before. If there aren’t, then we classify the area based on the class which appears more often or if the different classes appear equally often, then simply pick one. So, since there are only 4 dots in this area, we have to go to the left and classify it. See slide 33 And now we reached the stop field for all the splits of the data and the whole algorithm stops. So, this is the specific algorithm we used to create those decision boundaries. But for the upcoming post, we are not going to consider the part where we check if there are more than 5 data points. We are going to stick with just the more basic version of the algorithm. See slide 34 And the important question we need to answer now is: What does it mean “to find the best feature to split the data”? See slide 35 We did that kind of intuitively but if we want to automate this process, i.e. create an algorithm, then we need a metric that precisely tells us what the best feature is to split the data. And what such a metric could look like will be the topic of the next post. 
AuthorJust someone trying to explain his understanding of data science concepts Archives
February 2020
Categories
