This post is part of a series:
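In the previous post, we saw how the Naive Bayes algorithm works. In this post, we are going to cover some more general points about the algorithm, namely: what’s “naive” about Naive Bayes, how to handle the problem of rare values, how to handle continuous features, and classification vs. regression.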
What’s “naive” about Naive Bayes

To address this point, let’s look again at this slide from the previous post:

See slide 1

Here, we were only considering the features “Sex” and “Pclass”, and we were looking at test passenger 3, a female who travels in the 1st class. Our goal was to estimate how many of the 549 nonsurvivors in the training data have the same combination of values as test passenger 3. Therefore, we simply multiplied the probability that a nonsurvivor is female (15%) by the probability that a nonsurvivor travels in the 1st class (15%) and by the total number of nonsurvivors (549).

And it is exactly in this calculation where the “Naive” in Naive Bayes comes into play. By multiplying the two probabilities together, we implicitly made the assumption that the two events Sex=female and Pclass=1 are independent. Or, more generally speaking, we made the assumption that the two features are independent.

So, what does this mean? It means that those two features don’t influence each other. For example, if we only look at the 81 female nonsurvivors (instead of looking at all 549 nonsurvivors), we should still see the same percentages for the different passenger classes. So, 15% of those 81 female nonsurvivors should travel in the 1st class, 18% in the 2nd class and 68% in the 3rd class. The other way around is also true: if we only look at the 80 nonsurvivors that travel in the 1st class, then 15% of those should be female and 85% should be male.

So now, let’s see if that is really the case. Since we are currently only considering two features, we can actually check what those percentages look like. Let’s start by looking at how the 81 female nonsurvivors are distributed across the different passenger classes.

See slide 2

And, as we can see, the percentages differ. Compared to all nonsurvivors, female nonsurvivors are less likely to travel in the 1st or 2nd class and are much more likely to travel in the 3rd class.
And now, let’s look at how many of the 80 nonsurvivors who travel in the 1st class are male and how many are female.

See slide 3

Here, the percentages differ as well. Compared to all nonsurvivors, nonsurvivors who travel in the 1st class are less likely to be female and more likely to be male.

From these two comparisons we can see that the two features “Sex” and “Pclass” are not truly independent of each other. So, they seem to influence each other. With respect to the nonsurvivors, there seems to be a negative correlation between being female and traveling in the 1st class. If the nonsurvivor is female, then the person is less likely to travel in the 1st class. And if the nonsurvivor travels in the 1st class, then the person is less likely to be female.

This kind of observation applies to practically every data set, not just the Titanic data set. There are basically always features that are in some way correlated with each other. So, making the assumption that all features of a data set are independent of each other is actually an extreme oversimplification of the real world (since it is highly unlikely to hold). You could even say that this assumption is a very naive assumption. And that’s why the algorithm is called “Naive” Bayes.

So now, just as a side note, let’s see what happens if we don’t make this naive assumption. In that case, we can’t make use of the “Multiplication Rule for independent events” (as we did in the “Bayes Theorem revisited” section of the previous post).

See slide 4

Instead, we need to use the “General Multiplication Rule”.

See slide 5

So, to calculate the probability of two events happening at the same time, we can’t just multiply together the probabilities of the individual events (“Multiplication Rule for independent events”). Instead, we need to multiply the probability of A by the probability of B given A.
Or we can also do it the other way around, multiplying the probability of B by the probability of A given B.

So now, let’s use the “General Multiplication Rule” to calculate how many of the 549 nonsurvivors we would expect to be female and to travel in the 1st class. To do that, we are going to make use of the new probabilities that we just determined. And, as we can see in the formula for the “General Multiplication Rule”, there are actually two ways of doing this calculation. The first one is to multiply the probability that a nonsurvivor is female by the probability that a nonsurvivor travels in the 1st class, given that the nonsurvivor is female.

See slide 6

Or we can multiply the probability that a nonsurvivor travels in the 1st class by the probability that a nonsurvivor is female, given that the nonsurvivor travels in the 1st class.

See slide 7

As you can see, in both cases we get the same result of 3. And again, since we are currently considering only two features, we can actually check how many nonsurvivors there are in the training data that are female and travel in the 1st class. So, let’s do that.

See slide 8

And, as we can see, by using the probabilities in this way, we are actually calculating the exact number of nonsurvivors in the training data that are female and travel in the 1st class. So, one might be tempted to think that this is a better way of estimating the number of survivors and nonsurvivors for a given combination of values.

However, there is a problem with making the calculation in this way. And that problem is the new conditional probabilities. In order to determine either of those, we actually have to know how many of the nonsurvivors in the training data are female and travel in the 1st class, namely 3.

See slide 9

So, we actually need to look up how often a specific combination of values appears in the training data.
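
To make the comparison concrete, here is a minimal Python sketch that redoes both calculations with the counts quoted in the text (549 nonsurvivors, 81 of them female, 80 of them in the 1st class, 3 of them both). The variable names are my own and not from the original slides.

```python
# Counts for the nonsurvivors in the training data, as quoted in the text.
n_nonsurvivors = 549
n_female = 81             # female nonsurvivors
n_first_class = 80        # nonsurvivors travelling in the 1st class
n_female_and_first = 3    # female nonsurvivors in the 1st class (the looked-up count)

p_female = n_female / n_nonsurvivors
p_first = n_first_class / n_nonsurvivors

# "Multiplication Rule for independent events" (the naive assumption):
# P(female and 1st) = P(female) * P(1st)
naive_estimate = n_nonsurvivors * p_female * p_first

# "General Multiplication Rule", first way:
# P(female and 1st) = P(female) * P(1st | female)
p_first_given_female = n_female_and_first / n_female
general_estimate_a = n_nonsurvivors * p_female * p_first_given_female

# "General Multiplication Rule", second way:
# P(female and 1st) = P(1st) * P(female | 1st)
p_female_given_first = n_female_and_first / n_first_class
general_estimate_b = n_nonsurvivors * p_first * p_female_given_first

print(f"naive estimate:         {naive_estimate:.1f}")
print(f"general rule (1st way): {general_estimate_a:.1f}")
print(f"general rule (2nd way): {general_estimate_b:.1f}")
```

Note that both conditional probabilities are computed from `n_female_and_first`, which is exactly the number we were trying to estimate in the first place. The naive estimate, on the other hand, only needs the two individual counts.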
And, as we have seen in the previous post, this lookup approach might work when we are only considering two features. But as soon as we are considering all the features of a data set, it usually doesn’t work anymore because the number of possible combinations grows exponentially. Therefore, those probabilities are most likely zero since we wouldn’t have any examples in the training data. So, our estimates would always be zero again and we wouldn’t know what our prediction should be.

So, after all, we actually have to make the naive assumption that all features are independent of each other. Otherwise, we wouldn’t be able to calculate the number of survivors and nonsurvivors for a given combination of values. And therefore, we wouldn’t be able to make predictions.

How to handle the problem of rare values

There are two ways in which “rare values” can cause problems for the algorithm. To see the first one, let’s look at the distribution of the feature “ParCh”.

See slide 10

Here, we can see that the majority of passengers (both survivors and nonsurvivors) travel with 0, 1 or 2 parents/children. Everything above that doesn’t happen very often. For example, there are only 5 passengers in total that travel with 3 parents/children. So, this would be an example of a “rare value”. However, with this particular value of 3 parents/children, there isn’t actually a problem. There are just low percentages.

Where we do have a problem is, for example, with the value “4”. Here, we can see that there were only 4 passengers in total that travel with 4 parents/children. So, the total number is similar to the value “3”. However, the problem is that all of those 4 passengers belong to just one class, namely “Not Survived”. So, we have zero examples of the value “4” that survived. Accordingly, the respective probability is zero. And that’s exactly where the problem lies.
Namely, if a passenger in the test set happens to travel with 4 parents/children, then we are always going to predict that this passenger died, as you can see, for instance, with test passenger 5.

See slide 11

And that’s because one of the probabilities for estimating the number of survivors is going to be zero.

See slide 12

Therefore, we are always going to estimate that there are zero survivors. So, in a sense, the probabilities of all the other features don’t really matter, even if they were all very high (which might indicate that a passenger survived).

For example, if we look at test passenger 5, then we can see that the “Sex” is female and the “Pclass” is 1. So, for those two features, the passenger has the same values as test passenger 3. And if you remember back to the previous post, where we just considered those two features to explain how the Naive Bayes algorithm works, we estimated the following numbers of survivors and nonsurvivors:

See slide 13

As we can see, there was an 88% chance that the respective passenger survived. So, if we just consider those two features, we might actually think that test passenger 5 survived. Let’s see what actually happened.

See slide 14

And she really did survive. But, as said before, we will never predict that she survived since she travels with 4 parents/children.

So, how can we get around this problem? Well, one of the simplest approaches is to just add one instance to every number in our probability table.

See slide 15

This way, there are no longer any probabilities that are zero. And all the other nonzero probabilities don’t really change that much since we have only increased the number of instances by one. So, the overall estimates shouldn’t really change either. And now, let’s see what our prediction for test passenger 5 is after we made this small change.
See slide 16

And, as we can see, now we are actually predicting that the passenger survives (and even with a relatively high certainty of 80%).

Side note: Adding one to every count like this is the simplest form of a more general technique called smoothing (it is commonly known as add-one or Laplace smoothing). With smoothing in general, you distribute the existing probabilities more evenly across all values of a feature. So, the probabilities of values with high probabilities will be decreased somewhat and the probabilities of values with low probabilities will be increased somewhat.

So, that’s the first way in which rare values can cause problems for the algorithm. The second thing that can happen with a rare value is that it only occurs in the test set.

See slide 17

As one can see, test passenger 6 travels with 9 parents/children. The problem with that value is that we didn’t have any passengers in our training data who travel with 9 parents/children. Therefore, we also don’t have probabilities for this value in our probability table. So, we actually can’t complete the calculation in the second step of the algorithm.

See slide 18

So, how can we deal with this problem? Well, probably the simplest approach is to just ignore the feature “ParCh” in this case for the second step of the algorithm.

See slide 19

If we do that, then we estimate that there is a 90% chance that test passenger 6 did not survive. So, let’s see if that is really the case.

See slide 20

And, as you can see, this prediction is correct. So, this is one possibility of how we can handle the problem when a rare value only appears in the test set and not in the training set.

Side note: This is just my own opinion. I haven’t really found anything in the literature about this. So, take that advice with a grain of salt. However, my reasoning for this approach is that this value is probably not very predictive anyway (since it is so rare). So, we might as well ignore it.
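
Both workarounds from this section can be sketched in a few lines of Python. This is only a rough illustration under some assumptions: the function names `probability_table` and `estimate` are made up for this example, and the counts are toy numbers rather than the real table from slide 10. Only the pattern for the value 4 (4 nonsurvivors, 0 survivors) is taken from the text. I also assume that "adding one instance" means adding 1 to every count and then recomputing the percentages from the new totals.

```python
def probability_table(counts_by_class, add_one=False):
    """Turn per-class value counts into per-class probabilities.

    With add_one=True, one instance is added to every count (including
    the zero counts) before the probabilities are computed.
    """
    # Every value that was seen in at least one class during training.
    all_values = sorted({v for counts in counts_by_class.values() for v in counts})
    extra = 1 if add_one else 0
    table = {}
    for label, counts in counts_by_class.items():
        adjusted = {v: counts.get(v, 0) + extra for v in all_values}
        total = sum(adjusted.values())
        table[label] = {v: n / total for v, n in adjusted.items()}
    return table


def estimate(class_total, passenger, tables, label):
    """Second step of the algorithm: class total times the feature probabilities.

    A feature whose value never appeared in the training data is ignored.
    """
    result = class_total
    for feature, value in passenger.items():
        probs = tables[feature][label]
        if value not in probs:
            continue  # value only occurs in the test set: skip this feature
        result *= probs[value]
    return result


# Toy counts for "ParCh" (hypothetical, except for the pattern of value 4).
parch_counts = {
    "survived": {0: 60, 1: 30, 2: 10},
    "not_survived": {0: 80, 1: 10, 2: 6, 4: 4},
}

plain = probability_table(parch_counts)
smoothed = probability_table(parch_counts, add_one=True)

rare_value_passenger = {"ParCh": 4}    # value with zero survivors in training
unseen_value_passenger = {"ParCh": 9}  # value that never appears in training

print(estimate(100, rare_value_passenger, {"ParCh": plain}, "survived"))
print(estimate(100, rare_value_passenger, {"ParCh": smoothed}, "survived"))
print(estimate(100, unseen_value_passenger, {"ParCh": smoothed}, "survived"))
```

Without the added instance, the estimate for the survivors is zero no matter what the other features say. With it, the estimate becomes small but nonzero. For the unseen value 9, the feature is simply skipped, so the estimate is determined by the remaining features (here there are none, so it stays at the class total).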
How to handle continuous features

The Naive Bayes algorithm that we covered in the previous post actually can’t handle continuous features. And that’s because, by definition, a continuous feature can (theoretically) take on infinitely many values.

See slide 21

This means that the 549 nonsurvivors, for example, would be spread across an infinite number of values. So, the great majority of the values would contain zero examples and, accordingly, most of the probabilities would be zero. And this means that it is highly likely that there is a zero in the second step of the algorithm, where we multiply a bunch of probabilities together.

See slide 22

Therefore, both of our estimates would (very likely) be zero and we wouldn’t know what our prediction should be.

Now, in reality, continuous features don’t really have infinitely many distinct values. But they do have an unusually high number of distinct values. Therefore, the examples that we have in our training data, e.g. the 549 nonsurvivors, would be spread across a large number of distinct values. So, at the very minimum, we would again run into the problem of rare values (which we discussed in the previous section).

So, how do we then deal with continuous features? Well, one thing that we can do is to transform them into categorical features by creating bins. For example, in the original Titanic data set (the one from Kaggle), there is actually no feature called “Age_Group”. Originally, it is just called “Age” and it is a continuous feature where the age is given in years (side note: there are 88 distinct values). So, for that reason, I created the feature “Age_Group” with the following 4 bins:
This way, one can ensure that there isn’t a single distinct value with a probability of zero.

Classification vs. Regression

The Naive Bayes algorithm that we covered in the previous post can only be used for classification and not for regression. That’s because, with a regression task, the label is continuous. So theoretically, there would be an infinite number of different classes.
See slide 23

So, we would run into the same problem as with continuous features. For example, in total there are 314 females in the training data set. So, if the label of this data set were continuous, then those 314 females would be spread across an infinite number of different classes. Therefore, the great majority of classes would contain zero females and only a few classes would contain one or maybe two females. So, most of the probabilities for the value “female” would be zero.

And this, obviously, applies to every other value in the data set as well. Because of that, the majority of the probabilities that we determine in the first step of the algorithm would be zero. So, when we are making our estimates in the second step of the algorithm, it is almost certain that there is at least one probability that is zero.

See slide 24

So, all of our estimates would be zero again and we wouldn’t know what our prediction should be. And that’s why Naive Bayes is only used for classification and not for regression.

And with that, we have reached the end of this tutorial. So, if you would like to know how to implement the Naive Bayes algorithm from scratch, you can check out this post.
Author: Just someone trying to explain his understanding of data science concepts
Archives: November 2020