This post is part of a series:
In this post, we are going to cover how the Naive Bayes algorithm works. Bayes TheoremAnd, as the name implies, Naive Bayes is based on Bayes Theorem. And its formula looks like this: See slide 1 However, I am going to explain the algorithm without using the formula. This way, I think, one can get a more intuitive understanding of how the algorithm works. And then, once we have that, we will revisit Bayes Theorem and see that we have actually done the calculations according to the formula. Titanic Data SetSo, to understand how the algorithm works, we will need some data. And we are going to use the wellknown Titanic Data Set. See slide 2 This data set contains information about the passengers of the Titanic accident. So, each row represents one specific passenger. And in total, there are 891 rows. The label of the data set is whether the respective passenger did survive (1) or did not survive (0) the accident. And the available features are the following:
Side note: I got the data set from Kaggle. However, I made some changes to it. First of all, I didn’t use all of the features that were in the original data set. And secondly, I created the feature “Age_Group”. This feature was originally just called “Age” and it contained the age of the passenger given in years, for instance 21 years (why I made this particular change, I will cover in the “How to handle continuous features” section of the next post). So, if you want to look at the data yourself, you can either download it from Kaggle and prepare the data yourself. Or you can download the data from this GitHub repo where I have already prepared it. So, that’s our training data. And now, let’s look at the test set. See slide 3 Here, the label “Survived” is (obviously) missing. So, our goal is to predict if these 4 particular passengers survived or not. Therefor, let’s have a closer look at the first passenger. See slide 4 This passenger is a male, travels in the 3rd class, belongs to age group “Unknown”, embarked at harbor S, travels with 0 siblings/spouses and travels with 0 parents/children. And the question that we now want to answer is: Given that a passenger has these particular values, did that passenger survive or did that passenger not survive? So now, let’s think about how we could go about answering that question. Lookup ApproachAnd the probably simplest way of doing that, is to just look at all the passengers in our training data that have the exact same values as test passenger 1: See slide 5 And then, check how many of those survived and how many of those didn’t survive: See slide 6 And whatever happened more often, that is going to be our prediction for test passenger 1. So, as we can see, the great majority of those passengers, did not survive, namely 38 out of 42. And only 4 did survive. Let’s also see what that looks like expressed in percentages. See slide 7 Namely, 90% of those passengers did not survive and only 10% survived. So, accordingly, our prediction for test passenger 1 is that he did not survive, simply because it seems to be much more likely. So now, let’s check out what really happened to that passenger. See slide 8 And, as we can see, he really did not survive. So, our prediction was correct. So now, let’s do the same thing for the second test passenger. See slide 9 Here, there are 25 passengers in the training set that have the same combination of values as test passenger 2. And only 12% of those did not survive, whereas 88% did survive. So, accordingly, our prediction for this test passenger is going to be that she survived. So, let’s see if that is correct. See slide 10 And, in fact, she really did survive which means that our prediction was again correct. So now, let’s predict the third passenger in the test set. See slide 11 And here, something strange happens. Namely, we don’t have any examples in our training data that have the exact same combination of values as test passenger 3. So now, we have a problem because we don’t know what our prediction should be. So, what do we do? Well, maybe this is just an exception. If it is, then we might think about how we could handle those exceptions later on. If it is not, then we have a problem with our current approach. So, let’s look at test passenger 4. See slide 12 And here again, there are zero passengers in our training data. So, it doesn’t seem to be just an exception. So, let’s think about why it might happen that we have zero examples of a specific combination of values in our training data. Problem of the Lookup ApproachTherefor, let’s recap what happened so far: See slide 13 Namely, for the first two passengers in our test set, so for the first two specific combinations of values, we had some examples in our training data, 42 and 25 respectively. And for the last two specific combinations of values, we didn’t have any examples in our training data. Therefore, let’s think about how many possible combinations of values there actually are in total. To see how we can calculate that, let’s first consider just two features of our data set, namely “Sex” and “Pclass”. See slide 14 “Sex” can take on the values male or female and “Pclass” can take on the values 1, 2 or 3. So now, let’s write out all the possible combinations for those two features. See slide 15 So, we can either have a male that travels in the 1st, 2nd or 3rd class or we can have a female that travels in the 1st, 2nd or 3rd class. This means, in total there are 6 different combinations for these two features. See slide 16 Now, if we want to determine the number of possible combinations when we consider all the features in our data set (and not just “Sex” and “Pclass”), then actually writing out the combinations is obviously not practical. But luckily, we don’t have to because we can easily calculate the number of possible combinations. And therefor, we simply have to determine how many distinct values each feature has. So, let’s do that for “Sex” and “Pclass”. See slide 17 So, for “Sex” we have 2 distinct values and for “Pclass” we have 3 distinct values. And if we multiply those two numbers together, then we again get the result that there are 6 possible combinations in total. So, that’s how we can calculate the number of possible combinations. So now, let’s calculate the total number of possible combinations when we actually consider all the features in our data set. Therefor, as just mentioned, we first need to know the number of distinct values for each feature. See slide 18 And with that, we can finally calculate the number of possible combinations in our data set. See slide 19 So, as we can see, there are 3,528 possible combinations in total. And that’s exactly where the problem lies. Namely, the number of possible combinations is bigger than the number of examples that we actually have in our training set. So accordingly, there will be a lot of combinations for which we don’t have any examples in our training data. And here you also have to keep in mind that the 891 examples aren’t 891 different combinations. For instance, as we have seen earlier, there are 42 passengers that have the same combination of values as test passenger 1. So, let’s see how many distinct combinations there actually are in our training data. See slide 20 And, as you can see, there are only 199 distinct combinations. So, the number of possible combinations is actually much, much bigger than the number of combinations that we have in our training data. Therefore, when we use our lookup approach for a specific combination from the test set, it is very likely that we don’t have any examples of that specific combination in our training data. So, referring back to the previous section, having zero examples is actually not an exception, but it is more the rule. And this will be the case for basically any data set, not just the Titanic data set. And that’s because the number of possible combinations grows exponentially with the number of features (since we have to multiply together the number of distinct values for each feature). So, the number of possible combinations will basically always be bigger than the number of actual examples in the training data. So, after all, we can’t use the simple lookup approach for making predictions about the passengers from the test set. See slide 21 And that’s simply because we (most likely) won’t have enough data. So, what do we do now? Estimation Approach (i.e. Naive Bayes)Well, we might not have enough data to apply our lookup approach, but we do have some data. And we can actually use it to, at least, roughly estimate the number of survivors and nonsurvivors for a given specific combination. To be more precise, we can estimate how many of all the survivors in our training data we would expect to have a specific combination of values. And we can estimate how many of all the nonsurvivors in our training data we would expect to have that specific combination of values. So therefore, let’s see how many survivors and nonsurvivors there actually are in the training data. See slide 22 As one can see, there are 549 nonsurvivors and 342 survivors. And what we now want to do is: We want to estimate how many of the 549 nonsurvivors we would expect to have a certain combination of values, e.g. the combination of test passenger 3 (for which, as we have seen earlier, we actually have zero examples in our training data). And we want to estimate how many of the 342 survivors we would expect to have that certain combination of values. See slide 23 This way, we would have again two numbers that aren’t zero. And whichever number is higher, that is going to be our prediction for the respective test passenger. So, that’s the general idea of what we want to do now. But how do we actually implement it? Well, the main problem of our lookup approach is that we, so to say, consider all the different features in our training data at once. See slide 24 And because of that the number of possible combinations grows exponentially. And therefore, we then won’t have enough data (in most cases) to make use of the lookup approach. However, if we consider the features individually, then we actually do have enough data to, at least, find general patterns about survivors and nonsurvivors with regards to each individual feature. See slide 25 And by “general patterns” I mean how the values of a feature are distributed, e.g. with respect to “Sex”, how many of the survivors/nonsurvivors are male or female. And then, once we have these general patterns for each individual feature, we can, so to say, put them back together to estimate how many of the survivors or nonsurvivors we would expect to have a certain combination of values. See slide 26 So, this is what we want to do now. And since this probably sounded more complicated than it really is, let’s just see how this approach actually gets implemented. Then, I think, it should be pretty straightforward to understand how it works. So, let’s start with estimating how many of the 549 nonsurvivors we would expect to have the same combination of values as test passenger 3. Therefor, as just explained, we first need to determine the general patterns about the nonsurvivors with regards to each feature. Let’s start with “Sex”. See slide 27 So, if we only look at the 549 nonsurvivors in our training data, then we can see that 81 of those are female, which represents 15%. And we can also see that 468 are male, which represent 85% of the nonsurvivors. Side note: Even if we would have 10times more data, the percentages wouldn’t really change much. So, for that reason, we actually don’t need that much data to find reliable patterns. So now, let’s look at the next feature, namely “Pclass”. See slide 28 Here, we can see that 15% of the nonsurvivors travel in the 1st class, 18% in the 2nd class and 68% in the 3rd class. So now, we have the general patterns for nonsurvivors with respect to two features. And since there is actually no more space on the slide, let’s not consider the remaining features for now (this doesn’t change anything about the explanation of how the Naive Bayes algorithm works). See slide 29 Okay, so now that we have all the general patterns that we need, let’s see how many of the 549 nonsurvivors we would expect to have the same values as test passenger 3. So, let’s see how many of the 549 nonsurvivors we would expect to be female and to travel in the 1st class. So, how do we do that? Well, we know that 15% of the 549 nonsurvivors are female. And we also know that 15% of the 549 nonsurvivors travel in the 1st class. So, how many would we expect to have both values? The straightforward thing to do, is to simply multiply these two percentages together. You can think of it like this: Let’s say we have 10 white squares. See slide 30 And let’s also say that we can randomly assign 40% of the squares to be blue. And we can randomly assign 50% of the squares to be red. See slide 31 So, that are the rules. And now, let’s do one round of that. So, let’s randomly assign 40% of the squares to be blue. See slide 32 And let’s randomly assign 50% of the squares to be red. See slide 33 Now, if we would do many, many rounds of that, then, on average, we would expect to see exactly the result that we see in the slide. Namely, we would expect that 2 out of the 10 squares would be both, blue and red. And that’s because 50% times 40% is 20%. And 20% of 10 is 2. See slide 34 And the same kind of logic applies to our two features “Sex” and “Pclass”. So, let’s now calculate how many of the 549 nonsurvivors we would expect to be female and to travel in the 1st class. See slide 35 So, if we multiply 15% with 15% and that then with 549, we get 12.4. So, we would expect that (“on average”) 12.4 nonsurvivors of the 549 total nonsurvivors are both female and travel in the 1st class. And now, let’s see how good this estimation actually is. Namely, since we are currently only considering two of the features in the data set (and not all 6), we can actually look up how many of the nonsurvivors in the training data have both of these values (since there are now only 6 different combinations and not 3,528). See slide 36 And, as you can see, in the actual training data there are 3 passengers that are female and travel in the 1st class. So, our estimation of 12.4 is not quite right, but it is also not that far off. So, let’s see what our estimate for the survivors looks like. See slide 37 Here, we would estimate that, out of the 342 survivors, there are 93 that are female and travel in the 1st class. And this estimate is actually pretty close to the actual value. So overall, if we look at the percentages, then we can see that they are not quite the same. However, this difference is actually not that bad since we are not really interested in the concrete numbers. We are more interested in the general tendency. So, for a given combination, we are more interested in the fact if, generally speaking, more people survived or if more people died. And if we look at the percentages of the lookup approach and the estimation approach (i.e. Naive Bayes approach), then we can see that they are somewhat similar to each other. At the very minimum, they both tell us that, for the combination of test passenger 3, generally much more people survived than died. And that is what we are interested in to make our prediction. So accordingly, our prediction for test passenger 3 is that she survived. So, let’s see if that is actually the case. See slide 38 And, as we can see, she indeed survived. So, our prediction was correct. And with that, we now know how the Naive Bayes algorithm basically works. So, we could actually already go back to Bayes Theorem and check if we have done the calculations according to its formula. However, for the sake of completeness, let’s check what our estimates for test passenger 3 look like, when we again consider all the features and not just “Sex” and “Pclass”. See slide 39 And here, as you may remember from before, with the lookup approach, we had zero passengers in our training set that had the same combination of values as test passenger 3. So now, let’s see what numbers we get with the Naive Bayes approach (side note: Due to the limited space, I didn’t create a table for all the probabilities. Instead, I just wrote down the respective probabilities that we need for the calculation). See slide 40 And here, as you can see, the actual estimates are very small (since we are multiplying a lot of probabilities together). But looking at the percentages, we can see that we again estimate that the great majority of passengers, which have the same combination of values as test passenger 3, survived, namely 98%. So again, we would make the correct prediction that test passenger 3 survived. So, the Naive Bayes approach seems to be working. So, let’s now use it for test passenger 4. See slide 41 Here, we would estimate that 86% of the passengers in the training set with this combination of values did not survive. So, we would predict that this passenger didn’t survive. Let’s see if that is the case. See slide 42 And again, our prediction is correct. So now, let’s also see what the predictions are for the first two test passengers (for which the lookup approach actually still worked). Let’s start with test passenger 1. See slide 43 Here, we can see that the Naive Bayes algorithm also predicts that this passenger did not survive. And the percentages are actually pretty close to the lookup approach. So, the prediction is correct again. And finally, let’s see what the prediction of the Naive Bayes algorithm is for test passenger 2. See slide 44 And, as before with the lookup approach, we again predict that this passenger survived which is the correct prediction. But this time the percentages of the Naive Bayes approach are somewhat different from the lookup approach. However, the general tendency of both approaches is the same. So now, we have predicted all the test passengers and all those predictions were correct. However, this is just because I specifically picked these passengers from the test set. If we use the Naive Bayes algorithm for all the passengers in the test set (with a slight adjustment covered in the “How to handle the problem of rare values” section of the next post), then we actually get an overall accuracy of 77%. See slide 45 And with that, we have now an intuitive understanding of how the algorithm works. So let’s go back to Bayes Theorem and let’s check if we have actually done the calculations according to its formula. Bayes Theorem revisitedTherefor, for the sake of simplicity, we look again at this slide from before, where we just considered the two features “Sex” and “Pclass”:
See slide 46 And now, let’s look again at the formula of Bayes Theorem. See slide 47 This is what the general formula looks like and we need it to calculate the two respective probabilities in the table which is in the bottomright corner of the slide. So now, let’s look again at test passenger 3 and let’s see how we use Bayes Theorem to calculate the probability that a passenger with those values did not survive. See slide 48 So, what we want to know is: what is the probability that the passenger did not survive, given that the sex is female and passenger class is 1. And since this expression is somewhat lengthy, let’s rewrite it like this: See slide 49 So, the “S with subscript 0” represents that the passenger did not survive (subscript 1 would represent that the passenger survived), the “venus symbol” represents that the passenger is female and “1st” represents that the passenger travels in the 1st class. If we use those symbols, then the full formula looks like this: See slide 50 However, we can’t make use of the formula as it is. And that is because of the conditional probabilities. For example, in the numerator, the conditional probability that we need to determine from the data is this: Given that the passenger did not survive (so we are only looking at the nonsurvivors), what is the probability that the “Sex” is female and that the “Pclass” is 1? In this specific case, it is actually not a problem because we are only considering 2 features. However, if we would consider all the features of the data set, then we would have again the problem that there are way more possible combinations than actual examples in our data set. So, most likely, we wouldn’t have that specific combination in our data and the conditional probability would simply be zero. And therefore, the whole formula would be zero. So, what we need to do is, we need to express the conditional probabilities without the ANDoperator. And therefor, we can make use of the multiplication rule for independent events (what it means that events are independent, we will discuss in the “What’s ‘naive’ about Naive Bayes” section of the next post). See slide 51 This allows us to get rid of the ANDoperator and we can express the jointprobability with two individual probabilities. So again, this is the general form. Therefore, let’s apply it to our use case which is straightforward to do. See slide 52 However, in our Bayes Theorem formula, we have conditional probabilities and not just regular probabilities. So, how does the multiplication rule work with that? See slide 53 Actually, it is pretty simple. With the upper application of the multiplication rule, we want to calculate the probability that a passenger is female and travels in the 1st class. Therefor, we multiply the probability that a passenger is female with the probability that a passenger travels in the 1st class. With the lower application of the multiplication rule, we basically do the same thing. The only difference is that the condition is that the passenger did not survive. So, in the data, we are only looking at nonsurvivors. So, what we want to calculate is: What is the probability that a nonsurvivor is female and travels in the 1st class. Accordingly then, we just need to use the probability that a nonsurvivor is female and the probability that a nonsurvivor travels in the 1st class. See slide 54 So, this is how we can replace the probabilities with the ANDoperator in our Bayes Theorem formula. So, let’s do that. See slide 55 And this is now what the final formula looks like. And here, we don’t have the problem that a probability is zero because we are now, so to say, again considering the features individually and not as a combination. So, this is now the formula that we can actually work with. Therefore, let’s create some space so that we can fill in the actual numbers and calculate the probability that a passenger did not survive given that the person is a female and travels in the 1st class. See slide 56 And we are going to start with the numerator. The first expression is the probability that a passenger is female given that the passenger did not survive. And according to our table in the topright corner of the slide, this is 0.15. See slide 57 The next expression is the probability that the passenger travels in the 1st class given that the person did not survive. This is 0.15, as well. See slide 58 And the last expression is simply the probability that a passenger did not survive. So, this is simply the number of passengers that did not survive in our data set divided by the total number of passengers. And this probability we are actually going to write down as a fraction. See slide 59 Now, let’s get to the denominator. Here, the expressions in front of the plus sign are actually the same as the numerator. So, we just need to copy the values. See slide 60 The values after the plus sign, we can determine similar to before. Only this time, we are looking at the survivors. See slide 61 Now, before we are actually going to calculate this probability, we are going to rearrange the formula a little bit. See slide 62 So, we are bringing down the numerators in front of the fractions. And we are putting everything that is not a fraction into parentheses. And now, we are going to calculate what’s in the parentheses. See slide 63 In the next step, we are going to factor out the “1/891” in the denominator. See slide 64 And now, the “1/891” cancels out. See slide 65 So, as we can see, the probability that the passenger did not survive given that the person is a female and travels in 1st class is equal to 12.4 divided by 12.4 plus 93.0. And this is exactly how we calculated this probability before with our tables. So again, we get a probability of 12%. See slide 66 So, as you can see, with our algorithm we have actually done the calculations according to the formula of Bayes Theorem. And with that, we have now reached the end of this post. In the next post, we are going to cover some additional points about the algorithm in general.
0 Comments
Leave a Reply. 
AuthorJust someone trying to explain his understanding of data science concepts Archives
November 2020
Categories
