Let's start the discussion with a simple problem: you toss two identical coins simultaneously 100 times. What is the expected number of times at least one heads comes up? For a single coin, heads and tails are mutually exclusive and together cover every outcome, so their probabilities must sum to 1; for a fair coin the two are equally likely, so each has probability 1/2. There are 4 possible ways two coins can land, {H, H}, {H, T}, {T, H} and {T, T}, and because the two tosses are independent, each of these has the same probability, 1/4. 3 out of the 4 possibilities contain at least one heads. Hence we expect about 75 of the 100 tosses to give at least one heads.
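
The counting argument above can be checked by brute-force enumeration. The following is only a small sketch; the variable names are illustrative, and it assumes two fair coins so that all four outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of tossing two fair coins:
# ('H','H'), ('H','T'), ('T','H'), ('T','T')
outcomes = list(product("HT", repeat=2))

# Outcomes that contain at least one heads.
favourable = [o for o in outcomes if "H" in o]

p_at_least_one_heads = Fraction(len(favourable), len(outcomes))
expected_in_100 = 100 * p_at_least_one_heads

print(p_at_least_one_heads)  # 3/4
print(expected_in_100)       # 75
```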

What if I now tell you that at least one coin came up tails, and ask again how often at least one heads comes up? The new information rules out {H, H}, leaving 3 equally likely possibilities, {H, T}, {T, H} and {T, T}, of which 2 contain at least one heads. So the probability of at least one heads, given at least one tails, is 2/3 rather than 3/4. In terms of counts: of the 100 tosses, about 75 are expected to show at least one tails, and about 50 of those (2 of the original 4 possibilities) also show at least one heads.
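
To make the effect of the extra information concrete, here is a short sketch that separates two quantities that are easy to mix up: the probability that a toss shows both a heads and a tails, and the probability of a heads once we already know a tails is present. It assumes "one coin comes up tails" means "at least one coin is tails"; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=2))

# Condition: at least one coin shows tails (this rules out H,H).
has_tails = [o for o in outcomes if "T" in o]

# Among those, the outcomes that also contain at least one heads.
heads_and_tails = [o for o in has_tails if "H" in o]

# Joint probability: both a heads and a tails appear in a toss.
p_joint = Fraction(len(heads_and_tails), len(outcomes))

# Conditional probability: at least one heads, given at least one tails.
p_conditional = Fraction(len(heads_and_tails), len(has_tails))

print(p_joint)        # 1/2
print(p_conditional)  # 2/3
```

The joint probability counts against all 4 outcomes, while the conditional probability counts only against the 3 outcomes that are still possible after the evidence is taken into account.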

Now consider a murder scene with a list of 5 possible suspects (3 males, 2 females). Assuming exactly one of them committed the murder and nothing else is known, the chance that any given suspect is the killer is 1/5; this is the prior belief for each suspect. Then the detective discovers long hair beside the corpse. This is evidence that the killer was female (assuming that the deceased was male and, for simplicity, treating the hair as conclusive). What should the beliefs about each suspect be now? The updated beliefs are 1/2 for each female suspect and 0 for each male suspect. Bayes’ rule tells us how to update prior beliefs given new evidence. Bayes’ Theorem can be expressed as:

Probability of event A given evidence B = (Probability, or likelihood, of observing B given that event A occurred * Probability of A) / (Probability of B), or in symbols, P(A | B) = P(B | A) * P(A) / P(B)
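
As a rough illustration, the formula can be applied to the suspect example. This sketch makes a strong simplifying assumption that is not a fact about real forensics: the long hair is observed with probability 1 if a female suspect is the killer and with probability 0 if a male suspect is. The function name `bayes_posterior` and the suspect labels are hypothetical.

```python
from fractions import Fraction

def bayes_posterior(priors, likelihoods):
    """Return P(hypothesis | evidence) for every hypothesis.

    priors[h]      = P(h)
    likelihoods[h] = P(evidence | h)
    """
    # P(evidence), summed over all hypotheses (law of total probability).
    p_evidence = sum(priors[h] * likelihoods[h] for h in priors)
    return {h: priors[h] * likelihoods[h] / p_evidence for h in priors}

suspects = ["M1", "M2", "M3", "F1", "F2"]  # 3 males, 2 females

# Prior belief: each suspect is equally likely to be the killer.
priors = {s: Fraction(1, 5) for s in suspects}

# Assumed likelihood of finding long hair, given each suspect is the killer.
likelihoods = {s: Fraction(1 if s.startswith("F") else 0) for s in suspects}

posteriors = bayes_posterior(priors, likelihoods)
for suspect, belief in posteriors.items():
    print(suspect, belief)
```

With these assumed likelihoods, each female suspect ends up with belief 1/2 and each male suspect with 0. Softer likelihoods (for example, a small chance that a male suspect left the hair) would give less extreme posteriors.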

Now let us solve a machine learning problem with Bayes’ Theorem. We are given a set of Amazon reviews for a household product, each labelled “positive” or “negative”. For a new incoming review, what are the chances that it is “negative”? Without any information about the word distribution in the labelled reviews, if there are 75 positive and 25 negative reviews, the probability of the incoming review being negative is 25/100 = 1/4. Now suppose the new review contains the word “broken”, and we know that 60% of all the reviews contain the word “broken” while 80% of the negative reviews contain it. What is the updated probability that the new review is negative? According to Bayes’ Theorem it is 0.8 * 0.25 / 0.6 = 1/3.
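
The arithmetic for the review example can be reproduced in a few lines. This is just a sketch using exact fractions; the variable names are illustrative, and the last quantity is derived from the stated numbers rather than given in the problem.

```python
from fractions import Fraction

p_negative = Fraction(25, 100)             # prior: 25 of 100 reviews are negative
p_broken_given_negative = Fraction(8, 10)  # 80% of negative reviews contain "broken"
p_broken = Fraction(6, 10)                 # 60% of all reviews contain "broken"

# Bayes' Theorem: P(negative | broken) = P(broken | negative) * P(negative) / P(broken)
p_negative_given_broken = p_broken_given_negative * p_negative / p_broken
print(p_negative_given_broken)  # 1/3

# Implied by the numbers above: the share of positive reviews containing "broken".
p_broken_given_positive = (p_broken - p_broken_given_negative * p_negative) / (1 - p_negative)
print(p_broken_given_positive)  # 8/15
```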

We see that, given the information about the word “broken” in the existing reviews and in the new review, the chance of the new review being negative increases from 1/4 to 1/3. This is the backbone of the Naive Bayes algorithm: instead of using information about a single word, we gather the same kind of information for every word in a bag-of-words representation, and then use Bayes’ Theorem to update the belief that the new review falls into each of the categories.
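
A minimal sketch of that idea follows. It is not a production classifier: the training reviews are invented purely for illustration, the function names `train` and `posterior` are hypothetical, and it adds two standard ingredients the text does not spell out, namely the "naive" assumption that words are independent given the class, and add-one (Laplace) smoothing so that unseen words do not force a probability of zero.

```python
import math
from collections import Counter

# Hypothetical toy training set (3 positive, 1 negative), for illustration only.
training = [
    ("works great love it", "positive"),
    ("great value works well", "positive"),
    ("sturdy and easy to use", "positive"),
    ("arrived broken and useless", "negative"),
]

def train(examples):
    """Count reviews per class and word occurrences per class."""
    class_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in examples:
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocab

def posterior(text, class_counts, word_counts, vocab):
    """Return P(class | words in text) under the naive independence assumption."""
    total_reviews = sum(class_counts.values())
    log_scores = {}
    for label, n_reviews in class_counts.items():
        n_words = sum(word_counts[label].values())
        score = math.log(n_reviews / total_reviews)  # log prior
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothed estimate of P(word | class).
            p_word = (word_counts[label][word] + 1) / (n_words + len(vocab))
            score += math.log(p_word)
        log_scores[label] = score
    # Normalise so the class probabilities sum to 1.
    top = max(log_scores.values())
    unnormalised = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(unnormalised.values())
    return {label: value / z for label, value in unnormalised.items()}

model = train(training)
result = posterior("it arrived broken", *model)
print(result)
```

On this toy data the prior for “negative” is 1/4, and a review mentioning “broken” should come out with a noticeably higher posterior for “negative”, mirroring the single-word calculation above. Working in log space avoids numerical underflow when many word probabilities are multiplied together.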