A Sophisticated Way to Learn Naive Bayes!

Machine learning algorithms do not always mean linear algebra and calculus; often the best solutions are the simple ones. This blog dives deep into one of the most efficient yet naive machine learning algorithms, one that is based on “probabilities”.

Naïve Bayes has its roots firmly grounded in probability. As we know, a probability is a way of measuring the chance of an event happening. An event is any occurrence that has a probability attached to it.

Example: A coin has two sides, head and tail. Here, getting a head or getting a tail is an event, and for a fair coin each has a probability of 0.5 attached to it.

If you ask, “Is Naïve Bayes really that simple?” and expect a “yes”, the answer comes with a condition, because Naïve Bayes is based on conditional probability. After all, we can all agree that almost everything comes and goes with a condition.

How are conditional probabilities different from simple probabilities?

Let us extend the above example. Say we take two coins and toss them, and we want the probability of getting a head on coin 1 given that coin 2 has already turned out to be a tail. In other words, with coin 2 being a tail, what is the probability of getting a head on coin 1?

Given two events, with one of them assumed to have already happened, conditional probability is the probability that the other event happens. Mathematically,

P(A|B) = P(A∩B) / P(B)

In simple terms, the left-hand side (LHS) is the probability of event A given that event B has already happened. It is computed by dividing the probability of both events A and B happening together by the probability of event B. Since we already know the probability of event B, we only need the probability of A and B occurring together to find the probability of A given B.
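
The two-coin question above can be checked by brute force. The following is a minimal sketch, assuming two fair coins; the helper names (`prob`, `coin1_head`, `coin2_tail`) are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair coins: (coin 1, coin 2), all four outcomes equally likely.
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event given as a true/false test on an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def coin1_head(o):
    return o[0] == "H"

def coin2_tail(o):
    return o[1] == "T"

p_b = prob(coin2_tail)                                        # P(B)
p_a_and_b = prob(lambda o: coin1_head(o) and coin2_tail(o))   # P(A ∩ B)
p_a_given_b = p_a_and_b / p_b                                 # P(A | B)

print(p_a_given_b)  # 1/2
```

The answer is still 1/2 because the second coin tells us nothing about the first one.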

Suppose that somebody secretly rolls two fair six-sided dice, and we wish to compute the probability that the face-up value of the first one is 2, given the information that their sum is no greater than 5.

  • Let D1 be the value rolled on die 1.
  • Let D2 be the value rolled on die 2.

Probability that D1 = 2

Table 1 shows the sample space of 36 combinations of rolled values of the two dice, each of which occurs with probability 1/36, with the numbers displayed in the red and dark gray cells being D1 + D2.

D1 = 2 in exactly 6 of the 36 outcomes; thus P(D1 = 2) = 6/36 = 1/6.

Probability that D1 + D2 ≤ 5

Table 2 shows that D1 + D2 ≤ 5 for exactly 10 of the 36 outcomes; thus P(D1 + D2 ≤ 5) = 10/36.

Probability that D1 = 2 given that D1 + D2 ≤ 5

Table 3 shows that for 3 of these 10 outcomes, D1 = 2.

Thus, the conditional probability is given by P(A|B) = P(A∩B)/P(B):

P(D1 = 2 | D1 + D2 ≤ 5) = P(D1 = 2 ∩ D1 + D2 ≤ 5) / P(D1 + D2 ≤ 5) = (3/36) / (10/36) = 3/10 = 0.3
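
The counts in the dice example can be reproduced by listing all 36 outcomes. This is only an illustrative sketch; the variable names are my own.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (D1, D2) outcomes for two fair six-sided dice.
rolls = list(product(range(1, 7), repeat=2))

# Event B: the sum is no greater than 5.
sum_le_5 = [(d1, d2) for d1, d2 in rolls if d1 + d2 <= 5]
# Event A ∩ B: the sum is no greater than 5 AND die 1 shows 2.
both = [(d1, d2) for d1, d2 in sum_le_5 if d1 == 2]

p_b = Fraction(len(sum_le_5), len(rolls))    # 10/36, printed in lowest terms
p_a_and_b = Fraction(len(both), len(rolls))  # 3/36, printed in lowest terms
p_a_given_b = p_a_and_b / p_b

print(len(sum_le_5), len(both))  # 10 3
print(p_a_given_b)               # 3/10
```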

The conditional probability part is stressed here because it serves as the backbone for the rest of the blog.

So, before throttling up on Naïve Bayes, there are two more terms worth knowing in advance: independent events and mutually exclusive events.

What are independent events?

Two events are said to be independent when P(A|B) = P(A) or, equivalently, P(B|A) = P(B); that is, when neither event influences the other.

For example, when two dice are thrown at random, getting a 5 (or any other number) on die 1 does not affect the outcome of die 2, and vice versa.

Proof: when two events are independent, P(A ∩ B) = P(A) * P(B), so P(A|B) = P(A ∩ B) / P(B) can be written as,

= P(A) * P(B) / P(B)

= P(A)
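
As a quick numerical check of independence on the two-dice example, here is a small sketch. The choice of "die 2 shows 3" as event B is arbitrary and only for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (die 1, die 2) outcomes.
rolls = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a true/false test on a roll."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

p_a = prob(lambda r: r[0] == 5)                      # A: die 1 shows 5
p_b = prob(lambda r: r[1] == 3)                      # B: die 2 shows 3 (arbitrary choice)
p_a_and_b = prob(lambda r: r[0] == 5 and r[1] == 3)  # A ∩ B

print(p_a_and_b == p_a * p_b)  # True: P(A ∩ B) = P(A) * P(B)
print(p_a_and_b / p_b == p_a)  # True: P(A | B) = P(A)
```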



What are mutually exclusive events?

Two events are said to be mutually exclusive when P(A|B) = 0 or P(B|A) = 0, i.e., when the two events cannot happen together.

This is easy to see with a single die: when a die is rolled once, the probability of getting a 3 given that the roll came up 6 is 0. Since 6 has already come up, there is no chance of 3 happening on that same roll.

Proof: when two events are mutually exclusive, P(A|B) = P(A ∩ B) / P(B) can be written as,

= 0/P (B)

= 0

Since events A and B cannot happen together, P(A ∩ B) = 0 for mutually exclusive events.
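
The single-die case can be sketched the same way: no face is both 3 and 6, so the joint probability, and with it the conditional probability, is zero. The names below are illustrative.

```python
from fractions import Fraction

faces = range(1, 7)  # one fair six-sided die

def prob(event):
    """Probability of an event given as a true/false test on a face."""
    return Fraction(sum(1 for f in faces if event(f)), 6)

p_b = prob(lambda f: f == 6)                  # B: the roll is 6
p_a_and_b = prob(lambda f: f == 3 and f == 6) # A ∩ B: impossible on one roll
p_a_given_b = p_a_and_b / p_b                 # P(A | B)

print(p_a_given_b)  # 0
```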

Now that we are familiar with the basics behind Naïve Bayes, it’s time to test our knowledge.

Consider a job interview in which the interviewer asks the participant to write out the spelling of the word rigour/rigor (British/American), one letter per piece of paper. A letter taken at random from the written word is found to be a vowel. Assume 40% of the participants are British and the remaining 60% are American. What is the probability that the writer is British?

Give it a try….

So, putting the problem in Bayesian terms, the task here is to find the probability that the participant is British given that the randomly chosen letter is a vowel.

From Bayes’ theorem we can write this as,

P(British | vowel) = P(British ∩ vowel) / P(vowel)

= P(vowel | British) * P(British) / (P(vowel | British) * P(British) + P(vowel | British’) * P(British’))

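So, finding the basic probabilities: “rigour” has 3 vowels among its 6 letters and “rigor” has 2 vowels among its 5 letters. Because the two spellings are not equally likely, P(vowel) has to be weighted by who is writing: P(vowel) = (3/6)(0.4) + (2/5)(0.6) = 0.44.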

P(British) = 0.4, P(British’) = 0.6

P(American) = 0.6, P(American’) = 0.4

P(vowel | British) = 3/6

P(vowel | British’) = 2/5

Substituting this in the above equation we get

P(British | vowel) = (3/6) * (0.4) / ((3/6) * (0.4) + (2/5) * (0.6))

= 0.2 / 0.44

= 5/11

So, we can conclude that the probability that the participant is British, given that the randomly chosen letter is a vowel, is 5/11.
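
The arithmetic of the rigour/rigor problem can be double-checked with exact fractions. This is a small sketch; the dictionary layout and names such as `vowel_fraction` are my own, not part of the problem statement.

```python
from fractions import Fraction

VOWELS = set("aeiou")

def vowel_fraction(word):
    """Chance that a letter drawn at random from `word` is a vowel."""
    return Fraction(sum(ch in VOWELS for ch in word), len(word))

# Priors from the problem: 40% British, 60% American.
prior = {"British": Fraction(4, 10), "American": Fraction(6, 10)}
spelling = {"British": "rigour", "American": "rigor"}

# Likelihoods P(vowel | nationality), counted from each spelling.
likelihood = {who: vowel_fraction(word) for who, word in spelling.items()}

# Total probability of drawing a vowel (the denominator of Bayes' theorem).
evidence = sum(likelihood[who] * prior[who] for who in prior)

posterior_british = likelihood["British"] * prior["British"] / evidence

print(likelihood["British"], likelihood["American"])  # 1/2 2/5
print(evidence)                                       # 11/25
print(posterior_british)                              # 5/11
```

Note that P(vowel) works out to 11/25 = 0.44, while the final answer P(British | vowel) is 5/11.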

So far, we have covered the basics required to understand Naïve Bayes; now it’s time to look into the math behind it.

Here, I have worked through a hands-on proof of Naïve Bayes, which will give you a holistic view of the algorithm.

I definitely agree that, be it Python or R, it’s just a matter of about 3 lines with libraries like scikit-learn. But I always prefer to know what happens behind the scenes, because I personally feel that knowing the internals of an algorithm helps us make well-informed decisions. The above images must have done their part in explaining the math behind Naïve Bayes.

It’s time to apply them to a real-world problem. I have taken an example of tennis data by Jeff Sackmann. The task is to predict whether a game of tennis can be played or not, given weather conditions such as temperature, humidity, wind speed, etc.

Recalling the conditional probability formula behind Bayes’ theorem, we know that

P(A|B) = P(A∩B) / P(B)

So, here event A will represent the features (predictors) and event B will represent the response variable (whether to play or not).

Our aim is, given a set of predictor values, to decide whether the conditions are suitable for the match to be played or not.

So, for example, we have to calculate the probability of every predictor value in the training dataset, like

and we should continue calculating the probability of every predictor value given (Play = Yes), and then repeat the same for the response (Play = No).
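
The training step described above boils down to counting. Below is a minimal sketch, not the exact table used in this post: it ASSUMES the classic 14-row play-tennis toy dataset found in many textbooks, and the column names (Outlook, Temperature, Humidity, Wind) and the `fit` helper are likewise assumptions for illustration.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# ASSUMPTION: the classic 14-row play-tennis toy table, not necessarily this post's data.
FEATURES = ("Outlook", "Temperature", "Humidity", "Wind")
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def fit(rows):
    """Return priors P(Play) and likelihoods P(feature = value | Play) by counting."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(Counter)  # (feature, label) -> counts of each value
    for *values, label in rows:
        for feature, value in zip(FEATURES, values):
            value_counts[(feature, label)][value] += 1
    priors = {label: Fraction(n, len(rows)) for label, n in class_counts.items()}
    likelihoods = {
        (feature, label): {v: Fraction(n, class_counts[label]) for v, n in counts.items()}
        for (feature, label), counts in value_counts.items()
    }
    return priors, likelihoods

priors, likelihoods = fit(ROWS)
print(priors["Yes"], priors["No"])               # 9/14 5/14
print(likelihoods[("Outlook", "Yes")]["Sunny"])  # 2/9
print(likelihoods[("Outlook", "No")]["Sunny"])   # 3/5
```

Values that never occur with a class simply have no entry here; a real implementation would usually add Laplace smoothing so that a missing combination does not force the whole product to zero.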

I have summarized the above calculation for quick reference,

That’s great, now we have all the training dataset probabilities handy. Trust me, we are 95% done.

Now it’s high time for the crops we have sown to bear fruit. Yes, it’s time to classify with our model.

Take the test instance as X. Let X be (Sunny, Mild, Normal, Weak). Given these predictor values, we have to conclude whether the conditions are suitable for playing or not.

From Bayes’ theorem, we know that

Adapting the above equation to our problem
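
In standard notation, this step usually reads as follows. This is a sketch of the textbook form, with symbols of my own choosing: y is the response (Play = Yes or No) and x_1, ..., x_4 are the four predictor values of X. The second line is where the "naive" assumption enters: the predictors are treated as independent of each other once the class is known.

```latex
% Bayes' theorem for the class y given the test instance X = (x_1, ..., x_4)
P(y \mid X) = \frac{P(X \mid y)\, P(y)}{P(X)}

% Naive assumption: predictors are conditionally independent given y,
% and P(X) is the same for every class, so it can be dropped when comparing classes.
P(y \mid x_1, \dots, x_4) \;\propto\; P(y) \prod_{i=1}^{4} P(x_i \mid y)
```

The class with the larger right-hand side is the prediction.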

Now we have to work on the substitutions. As we have already calculated the probability values, plugging them into the above equation gives us

So, from the above equations we can conclude that the game is on, since the score for Play = Yes comes out higher than the score for Play = No.
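
Here is a minimal sketch of the classification step for X = (Sunny, Mild, Normal, Weak). The probability values are NOT taken from this post's table; they are ASSUMED from the classic 14-row play-tennis toy dataset (9 Yes rows, 5 No rows), so treat the numbers as illustrative.

```python
from fractions import Fraction
from math import prod

# ASSUMPTION: priors and likelihoods from the classic 14-row play-tennis toy table.
priors = {"Yes": Fraction(9, 14), "No": Fraction(5, 14)}
likelihoods = {
    "Yes": {"Sunny": Fraction(2, 9), "Mild": Fraction(4, 9),
            "Normal": Fraction(6, 9), "Weak": Fraction(6, 9)},
    "No": {"Sunny": Fraction(3, 5), "Mild": Fraction(2, 5),
           "Normal": Fraction(1, 5), "Weak": Fraction(2, 5)},
}

x = ("Sunny", "Mild", "Normal", "Weak")  # the test instance X

# Unnormalized score per class: P(class) * product of P(value | class).
scores = {
    label: priors[label] * prod(likelihoods[label][value] for value in x)
    for label in priors
}

# Dividing by the total turns the scores into posteriors that sum to 1.
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}

prediction = max(scores, key=scores.get)

print(scores["Yes"], scores["No"])  # 16/567 6/875
print(prediction)                   # Yes
```

Under these assumed values the Yes score is about four times the No score, so the prediction matches the conclusion above: the game is on.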

So simple, right? Old is gold. Naïve Bayes rests on Bayes’ theorem, which is almost 260 years old, and it still stands tall among recently developed algorithms.

ML Data Associate | Amazon