This post is part of a series:
In the previous post, we saw how the Naive Bayes algorithm works. In this post, we are going to cover some more general points about the algorithm, namely:
What’s “naive” about Naive Bayes
To address this point, let’s look again at this slide from the previous post:
See slide 1
Here, we were only considering the features “Sex” and “Pclass”, and we were looking at test passenger 3, a female who travels in the 1st class. Our goal was to estimate how many of the 549 non-survivors in the training data have the same combination of values as test passenger 3. Therefore, we simply multiplied the probability that a non-survivor is female (15%) by the probability that a non-survivor travels in the 1st class (15%) and by the total number of non-survivors (549).
And it is exactly in this calculation that the “Naive” in Naive Bayes comes into play. Namely, by multiplying the two probabilities together, we implicitly made the assumption that the two events Sex=female and Pclass=1 are independent. Or, more generally speaking, we made the assumption that the two features are independent. So, what does this mean?
It means that those two features don’t influence each other. So, for example, if we only look at the 81 female non-survivors (instead of looking at all 549 non-survivors), we should still see the same percentages for the different passenger classes. So, 15% of those 81 female non-survivors should travel in the 1st class, 18% in the 2nd class and 68% in the 3rd class. The other way around is also true. Namely, if we only look at the 80 non-survivors that travel in the 1st class, then 15% of those should be female and 85% should be male.
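This expectation under independence can be sketched in a few lines of Python, using the percentages quoted above (the numbers come from the slides, not from re-computing the raw data):

```python
# Class distribution among ALL 549 non-survivors, as quoted above
class_probs = {"1st": 0.15, "2nd": 0.18, "3rd": 0.68}
female_non_survivors = 81

# If "Sex" and "Pclass" were truly independent, the 81 female
# non-survivors should follow the same distribution across classes
expected = {c: round(p * female_non_survivors) for c, p in class_probs.items()}
print(expected)  # {'1st': 12, '2nd': 15, '3rd': 55}
```

Whether the actual counts match these expected counts is exactly what we can check next.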
So now, let’s see if that is really the case. Namely, since we are currently only considering two features, we can actually check what those percentages look like. So, let’s start by looking at how the 81 female non-survivors are distributed across the different passenger classes.
See slide 2
And, as we can see, the percentages differ. Compared to all non-survivors, female non-survivors are less likely to travel in the 1st or 2nd class and are much more likely to travel in the 3rd class.
And now, let’s look at how many of the 80 non-survivors who travel in the 1st class are male and how many are female.
See slide 3
And here, the percentages differ as well. Compared to all non-survivors, non-survivors who travel in the 1st class are less likely to be female and more likely to be male.
From these two comparisons we can see that the two features “Sex” and “Pclass” are not truly independent of each other. So, they seem to influence each other. Namely, with respect to the non-survivors, there seems to be a negative correlation between being female and traveling in the 1st class. If the non-survivor is female, then the person is less likely to travel in the 1st class. And if the non-survivor travels in the 1st class, then the person is less likely to be female.
And this kind of observation applies to every data set, not just the Titanic data set. There are basically always features that are in some way correlated with each other. So, making the assumption that all features of a data set are independent of each other is actually an extreme over-simplification of the real world (since it is highly unlikely to be true). You could even say that this is a very naive assumption. And that’s why the algorithm is called “Naive” Bayes.
So now, just as a side note, let’s see what would happen if we didn’t make this naive assumption.
Namely, in that case, we couldn’t make use of the “Multiplication Rule for independent events” (as we did in the “Bayes Theorem revisited” section of the previous post).
See slide 4
Instead we would need to use the “General Multiplication Rule”.
See slide 5
So, to calculate the probability of two events happening at the same time, we can’t just multiply together the probabilities of the individual events (“Multiplication Rule for independent events”). Instead, we need to multiply the probability of A with the probability of B given A. Or we can also do it the other way around, multiplying the probability of B with the probability of A given B.
So now, let’s use the approach of the “General Multiplication Rule” to calculate how many of the 549 non-survivors we would expect to be female and to travel in the 1st class. So, we are going to make use of the new probabilities that we just determined.
And, as we can see in the formula for the “General Multiplication Rule”, there are actually two ways of doing this calculation. The first one is to multiply the probability that a non-survivor is female with the probability that a non-survivor travels in the 1st class, given that the non-survivor is female.
See slide 6
Or we can multiply the probability that a non-survivor travels in the 1st class with the probability that a non-survivor is female, given that the non-survivor travels in the 1st class.
See slide 7
As you can see, in both cases we get the same result of 3. And again, since we are currently considering only two features, we can actually check how many non-survivors there are in the training data that are female and travel in the 1st class. So, let’s do that.
See slide 8
And, as we can see, by using the probabilities in this way, we are actually calculating the exact number of non-survivors in the training data that are female and travel in the 1st class. So, one might be tempted to think that this is a better way of estimating the number of survivors and non-survivors for a given combination of values.
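As a quick sanity check, both versions of the “General Multiplication Rule” can be computed directly from the counts quoted in this section (549 non-survivors, 81 of them female, 80 in the 1st class, 3 both). This is just a sketch of the arithmetic above:

```python
non_survivors = 549
p_female = 81 / 549            # P(female | not survived)
p_first_given_female = 3 / 81  # P(1st class | female, not survived)
p_first = 80 / 549             # P(1st class | not survived)
p_female_given_first = 3 / 80  # P(female | 1st class, not survived)

# Both orderings of the General Multiplication Rule give the same result
est_1 = non_survivors * p_female * p_first_given_female
est_2 = non_survivors * p_first * p_female_given_first
print(est_1, est_2)  # both evaluate to 3 (up to floating-point rounding)
```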
However, there is a problem with making the calculation in this way. And that problem is the new conditional probabilities. In order to determine either of them, we actually have to know how many of the non-survivors in the training data are female and travel in the 1st class, namely 3.
See slide 9
So, we actually need to look up how often a specific combination of values appears in the training data. And, as we have seen in the previous post, this look-up approach might work when we are only considering two features. But, as soon as we are considering all the features of a data set, it usually doesn’t work anymore because the number of possible combinations grows exponentially. Most of those combinations won’t appear in the training data at all, so the corresponding probabilities would most likely be zero. So, our estimates would always be zero again and we wouldn’t know what our prediction should be.
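To see how quickly the number of combinations grows, here is a small sketch with hypothetical numbers of distinct values per feature (the exact cardinalities depend on the data set and are only illustrative here):

```python
# Hypothetical numbers of distinct values per feature (illustrative only)
distinct_values = {"Sex": 2, "Pclass": 3, "Age_Group": 4,
                   "SibSp": 7, "ParCh": 7, "Embarked": 3}

combinations = 1
for n in distinct_values.values():
    combinations *= n

print(combinations)  # 3528 possible value combinations
```

With only a few hundred training examples spread over thousands of combinations, most combinations necessarily contain zero examples.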
So, after all, we actually have to make the naive assumption that all features are independent of each other. Otherwise, we wouldn’t be able to calculate the number of survivors and non-survivors for a given combination of values. And therefore, we wouldn’t be able to make predictions.
How to handle the problem of rare values
There are two ways in which “rare values” can cause problems for the algorithm. To see the first one, let’s look at the distribution of the feature “ParCh”.
See slide 10
Here, we can see that the majority of passengers (both survivors and non-survivors) travel with 0, 1 or 2 parents/children. Everything above that doesn’t happen very often. For example, there are only 5 passengers in total that travel with 3 parents/children. So, this would be an example of a “rare value”.
However, with this particular value of 3 parents/children, there isn’t actually a problem. There are just low percentages. Where we do have a problem, however, is with the value “4”.
Here, we can see that there are only 4 passengers in total that travel with 4 parents/children. So, the total number is similar to the value “3”. However, the problem is that all 4 of those passengers belong to just one class, namely “Not Survived”. So, we have zero examples of the value “4” that survived. And, accordingly, the respective probability is zero.
And that’s exactly where the problem lies. Namely, if a passenger in the test set happens to travel with 4 parents/children, then we are always going to predict that this passenger died, as you can see, for instance, with test passenger 5.
See slide 11
And that’s because one of the probabilities for estimating the number of survivors is going to be zero.
See slide 12
And therefore, we are always going to estimate that there are zero survivors. So, in a sense, the probabilities of all the other features don’t really matter, even if they were all very high (which might indicate that the passenger survived).
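The effect is easy to see in code. The probabilities below are made up for illustration; the point is only that a single zero wipes out the whole product:

```python
# Hypothetical per-feature probabilities for the "Survived" class;
# the third one (e.g. ParCh=4) is zero because that value never
# occurred among the survivors in the training data
survivor_probs = [0.68, 0.47, 0.0, 0.31]
total_survivors = 342  # assuming 342 survivors in the training data

estimate = total_survivors
for p in survivor_probs:
    estimate *= p

print(estimate)  # 0.0 -- no matter how high the other probabilities are
```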
For example, if we look at test passenger 5, then we can see that the “Sex” is female and the “Pclass” is 1. So, for those two features, the passenger has the same values as test passenger 3. And if you remember back to the previous post, where we just considered those two features to explain how the Naive Bayes algorithm works, we estimated the following numbers of survivors and non-survivors:
See slide 13
As we can see, there was an 88% chance that the respective passenger survived. So, if we just consider those two features, we might actually think that test passenger 5 survived. So, let’s see what actually happened.
See slide 14
And she really did survive. But as said before, we will never predict that she survived since she travels with 4 parents/children. So, how can we get around this problem?
Well, one of the simplest approaches is to just add one instance to every number in our probability table.
See slide 15
This way, there are no zero probabilities anymore. And all the other non-zero probabilities don’t really change that much, since we have only increased each count by one. So, the overall estimates shouldn’t really change much either.
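The add-one idea can be sketched like this; the raw counts below are made up for illustration, not taken from the actual probability table:

```python
# Hypothetical raw counts of survivors per "ParCh" value
counts = {0: 233, 1: 65, 2: 40, 3: 3, 4: 0, 5: 1}

# Add one instance to every count so that no probability is exactly zero
smoothed = {value: count + 1 for value, count in counts.items()}
total = sum(smoothed.values())
probs = {value: count / total for value, count in smoothed.items()}

print(probs[4])  # small, but no longer zero
```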
And now, let’s see what our prediction for test passenger 5 is, after we made this small change.
See slide 16
And, as we can see, now we are actually predicting that the passenger survives (and even with a relatively high certainty of 80%).
Side note: Another way of dealing with such rare values is called smoothing. With this approach, you distribute the existing probabilities more evenly across all values of a feature. So, the probabilities of values with high probabilities will be decreased somewhat and the probabilities of values with low probabilities will be increased somewhat.
So, that’s the first way in which rare values can cause problems for the algorithm. The second thing that can happen with a rare value is that it happens to only occur in the test set.
See slide 17
As one can see, test passenger 6 travels with 9 parents/children. The problem with that value is that we didn’t have any passengers in our training data which travel with 9 parents/children. And therefore, we also don’t have probabilities for this value in our probability table. So, we actually can’t complete the calculation in the second step of the algorithm.
See slide 18
So, how can we deal with this problem?
Well, probably the simplest approach is to just ignore the feature “ParCh” in this case for the second step of the algorithm.
See slide 19
If we do that, then we estimate that there is a 90% chance that test passenger 6 did not survive. So, let’s see if that is really the case.
See slide 20
And, as you can see, this prediction is correct. So, this is one possibility for how we can handle the problem when a rare value only appears in the test set and not in the training set.
Side note: This is just my own opinion. I haven’t really found anything in the literature about this. So, take this advice with a grain of salt. However, my reasoning for this approach is that such a value is probably not very predictive anyway (since it is so rare). So, we might as well ignore it.
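One way to implement this “ignore unseen values” idea is to simply skip any feature value that has no entry in the probability table. The probabilities and class totals below are made up for illustration:

```python
# Hypothetical per-class probability tables; ParCh=9 never occurs in
# the training data, so it has no entry in either table
probs_not_survived = {("Sex", "male"): 0.85, ("Pclass", 3): 0.68, ("ParCh", 0): 0.76}
probs_survived = {("Sex", "male"): 0.32, ("Pclass", 3): 0.35, ("ParCh", 0): 0.66}

def estimate(class_total, table, passenger):
    result = class_total
    for feature, value in passenger.items():
        if (feature, value) in table:  # skip values missing from the table
            result *= table[(feature, value)]
    return result

test_passenger = {"Sex": "male", "Pclass": 3, "ParCh": 9}
not_survived = estimate(549, probs_not_survived, test_passenger)
survived = estimate(342, probs_survived, test_passenger)  # assuming 342 survivors
print("Not Survived" if not_survived > survived else "Survived")
```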
How to handle continuous features
The Naive Bayes algorithm that we covered in the previous post actually can’t handle continuous features. And that’s because, by definition, a continuous feature can (theoretically) take on infinitely many values.
See slide 21
This would mean that the 549 non-survivors, for example, would be spread across an infinite number of values. So, the great majority of the values would contain zero examples, and accordingly most of the probabilities would be zero. And this means that it is highly likely that there is a zero in the second step of the algorithm, where we multiply a bunch of probabilities together.
See slide 22
And therefore, both of our estimates would (very likely) be zero and we wouldn’t know what our prediction should be.
Now, in reality, continuous features don’t really have infinitely many distinct values. But they do have an unusually high number of distinct values. And therefore, the number of examples that we have in our training data, e.g. the 549 non-survivors, would be spread across a large number of distinct values. So, at the very minimum, we would again run into the problem of rare values (which we discussed in the previous section). So, how do we then deal with continuous features?
Well, one thing that we can do, is to transform them into categorical features by creating bins. For example, in the original Titanic data set (the one from Kaggle), there is actually no feature called “Age_Group”. Originally, it is just called “Age” and it is actually a continuous feature where the age is given in years (side note: there are 88 distinct values). So, for that reason, I created the feature “Age_Group” with the following 4 bins:
This way, one can ensure that no distinct value has a probability of zero.
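A binning step like this is straightforward to write by hand. The bin boundaries below are hypothetical, since the exact “Age_Group” boundaries are not listed here:

```python
# Hypothetical age bins (the actual "Age_Group" boundaries may differ)
def age_group(age):
    if age < 13:
        return "child"
    elif age < 20:
        return "teenager"
    elif age < 60:
        return "adult"
    else:
        return "senior"

ages = [4, 17, 35, 71]
print([age_group(a) for a in ages])  # ['child', 'teenager', 'adult', 'senior']
```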
Classification vs. Regression
The Naive Bayes algorithm that we covered in the previous post can only be used for classification, not regression. Namely, with a regression task, the label is continuous. So, theoretically, there would be an infinite number of different classes.
See slide 23
So, we would run into the same problem as with continuous features. For example, in total there are 314 females in the training data set. So, if the label of this data set were continuous, then those 314 females would be spread across an infinite number of different classes. Therefore, the great majority of classes would contain zero females and only a few classes would contain one or maybe two females. So, most of the probabilities for the value “female” would be zero.
And this, obviously, applies to every other value in the data set as well. And because of that, the majority of the probabilities that we determine in the first step of the algorithm would be zero. So, when we are making our estimations in the second step of the algorithm, it is almost certain that at least one probability is zero.
See slide 24
So, all of our estimates would be zero again and we wouldn’t know what our prediction should be. And that’s why Naive Bayes is only used for classification and not regression.
And with that, we have reached the end of this tutorial. So, if you would like to know how to implement the Naive Bayes algorithm from scratch, you can check out this post.