Machine Learning Fundamentals Part 2: Bayesian Classification

Given the task of classifying a product review as “positive” or “negative”, we first need to train a model using already labelled samples. For example, let’s say that 1000 reviews are available, each tagged “positive” or “negative”. To train a model on the distribution of words in these reviews, we compute the probabilities P(w1), P(w2), …, where P(wi) denotes the fraction of reviews containing the word wi, i.e. if the word w1 occurs in 30 out of 1000 reviews, then P(w1) = 0.03. Now let’s say that 800 out of the 1000 reviews are labelled “positive” while the remaining 200 are labelled “negative”. Then P(+) = 0.8 and P(-) = 0.2.
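
As a rough illustration, the counting step described above might look like the Python sketch below. The reviews list and all variable names are made up for the example; they are not from any particular library.

```python
from collections import Counter

# Hypothetical labelled data: (label, review text) pairs.
reviews = [
    ("positive", "great product works perfectly"),
    ("positive", "works great love it"),
    ("negative", "arrived broken very disappointed"),
    # ... in practice all 1000 labelled reviews would go here
]

doc_freq = Counter()      # number of reviews containing each word
class_counts = Counter()  # number of reviews per label

for label, text in reviews:
    class_counts[label] += 1
    for word in set(text.split()):  # set(): count a word once per review
        doc_freq[word] += 1

total = len(reviews)
p_word = {w: c / total for w, c in doc_freq.items()}            # P(wi)
p_class = {lbl: c / total for lbl, c in class_counts.items()}   # P(+), P(-)
```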

What we need to estimate is: given an unlabelled review and the set of words w’1, w’2, …. in it, which is a subset of the words w1, w2, …. from our training samples, what is the probability of the review being “positive” (the probability of it being “negative” is one minus that), i.e. P(+|w’1, w’2, ….). From our earlier tutorial we saw that, using Bayes’ Theorem:

P(+|w’1, w’2, ….) = (P(w’1, w’2, …. | +) * P(+))/P(w’1, w’2, ….)

Since the value P(w’1, w’2, ….) is just a scaling factor and is the same for both categories “positive” and “negative”, we can omit it. P(+) is already known to be 0.8. What remains is to estimate the probability P(w’1, w’2, …. | +), i.e. given that a review is “positive”, what is the likelihood of the review “generating” the words w’1, w’2, ….. Using the chain rule of probability:

P(w’1, w’2, …. | +) = P(w’1 | w’2, w’3, …., + ) * P(w’2 | w’3, w’4, …, +) * …* P(w’M|+)

From our training samples we could estimate each of the above product terms, but with a large number of terms it would take a lot of time to compute this distribution for each review, and it would not be feasible for any real-time prediction. Instead we assume that each word is independent of the other words and depends only on the category that generated it, i.e. P(w’1 | w’2, w’3, w’4, …., +) = P(w’1 | +), P(w’2 | w’3, w’4, w’5, …., +) = P(w’2 | +), and so on for the other words. This is the Naive Bayes assumption. With this assumption, we have:

P(w’1, w’2, …. | +) = P(w’1 | +) * P(w’2 | +) * …. * P(w’M | +)
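
A minimal sketch of this product, assuming the per-word estimates are stored in a plain dictionary (the function and argument names below are illustrative only):

```python
import math

def likelihood(words, word_probs):
    """P(w'1, ..., w'M | class) under the Naive Bayes assumption: a plain
    product of the per-word probabilities P(w'i | class). word_probs is a
    dict of those estimates, e.g. the smoothed values computed further below."""
    # Summing logs and exponentiating once avoids floating-point underflow
    # for long reviews; mathematically it is the same product.
    return math.exp(sum(math.log(word_probs[w]) for w in words))
```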

From the training samples we can easily estimate these terms beforehand, i.e. P(w’1 | +) is simply the fraction of the positive reviews that contain the word w’1. In fact, to account for words which do not appear in any “positive” review, and so to avoid multiplying by zero, the values P(w’i | +) are computed as

(1 + number of positive reviews in which w’i occurs) / (number of distinct words in the vocabulary + number of positive reviews)
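
A sketch of this smoothed estimate, reusing the hypothetical reviews list from the first snippet (the function name is illustrative):

```python
def smoothed_word_probs(reviews, target_label):
    """Add-one (Laplace) smoothed estimates of P(w | target_label) following
    the formula above: (1 + number of target-label reviews containing w)
    divided by (vocabulary size + number of target-label reviews)."""
    vocab = {w for _, text in reviews for w in text.split()}
    target_texts = [text for label, text in reviews if label == target_label]
    n_target = len(target_texts)
    return {
        w: (1 + sum(1 for text in target_texts if w in text.split()))
           / (len(vocab) + n_target)
        for w in vocab
    }
```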

After computing the quantity P(w’1, w’2, …. | +) (and similarly P(w’1, w’2, …. | -)), we can compute P(+|w’1, w’2, ….) by normalising over the two categories, since the shared denominator was dropped earlier. If this value is greater than 0.6 (or any other threshold above 0.5), we say that the review is positive; if P(+|w’1, w’2, ….) is less than 0.25, we say that the review is negative; and if it lies between 0.25 and 0.6, we say that we are not sure.
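
Tying the earlier sketches together, a rough sketch of this decision rule might look as follows; it reuses the illustrative likelihood and smoothed_word_probs helpers, and assumes P(+) = 0.8 and the 0.6 and 0.25 thresholds from the text:

```python
def classify(words, pos_word_probs, neg_word_probs, p_pos=0.8):
    """Return "positive", "negative" or "not sure" for a list of review words."""
    score_pos = p_pos * likelihood(words, pos_word_probs)
    score_neg = (1 - p_pos) * likelihood(words, neg_word_probs)
    p_positive = score_pos / (score_pos + score_neg)  # P(+ | w'1, w'2, ...)
    if p_positive > 0.6:
        return "positive"
    if p_positive < 0.25:
        return "negative"
    return "not sure"

# Example usage with the helpers sketched earlier:
# pos_word_probs = smoothed_word_probs(reviews, "positive")
# neg_word_probs = smoothed_word_probs(reviews, "negative")
# print(classify("arrived broken".split(), pos_word_probs, neg_word_probs))
```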

The above model can easily be extended to more than two classes (positive and negative); for example, in medical diagnosis the classes can be the list of possible diseases, while the counterparts of the words are the observed symptoms. Also, in the above example, instead of considering only single words we can consider N-grams, i.e. sequences of N consecutive words. Experimentally it is found that 2-grams or 3-grams perform better than single words (1-grams) in the overall accuracy of the classifier.
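
For illustration, extracting such N-grams from a review is a short helper; the counting then proceeds exactly as with single words (the function name is made up for the example):

```python
def ngrams(text, n=2):
    """Return the list of n-grams (contiguous sequences of n words) in a review."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# ngrams("arrived broken very disappointed", 2)
# -> ['arrived broken', 'broken very', 'very disappointed']
```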

Machine Learning Fundamentals Part 1: Bayesian Probability

Let’s start the discussion with a simple problem: you toss two identical coins simultaneously 100 times. What is the expected number of times at least one heads will come up? For a single fair coin, heads and tails are the only two possible outcomes and are mutually exclusive, so their probabilities sum to 1 and each equals 1/2. Since the two coins land independently, there are 4 equally likely ways they can come up, i.e. {H, H}, {H, T}, {T, H} and {T, T}. Now, 3 out of these 4 possibilities contain at least one heads, hence it is expected that about 75 out of 100 tosses of the two coins will give at least one heads.

What if I now say that one coin comes up tails, and ask for the expected number of times at least one heads will come up? Of the 4 equally likely outcomes, 2 of them ({H, T} and {T, H}) contain both a heads and a tails, so about 50 out of the 100 tosses will show at least one heads together with a tails. Put another way, restricting attention to the roughly 75 tosses in which a tails appears, about 50 out of those 75 (i.e. 2/3 of them) also show a heads: the new information about a tails changes the probability we assign to seeing a heads. This is the idea of conditional probability.
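
These numbers are easy to check with a small simulation; this is a quick Monte Carlo sketch, not part of the original argument, and any seed or trial count would do:

```python
import random

random.seed(0)
trials = 100_000
at_least_one_head = at_least_one_tail = head_and_tail = 0

for _ in range(trials):
    coins = (random.choice("HT"), random.choice("HT"))
    if "H" in coins:
        at_least_one_head += 1
    if "T" in coins:
        at_least_one_tail += 1
        if "H" in coins:
            head_and_tail += 1

print(at_least_one_head / trials)         # ~0.75: at least one heads
print(head_and_tail / trials)             # ~0.50: a heads together with a tails
print(head_and_tail / at_least_one_tail)  # ~0.67: heads given at least one tails
```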

Given a murder scene and a list of 5 possible suspects (3 males, 2 females), the chance that any one of them actually committed the murder is 1/5; this is the prior belief for each suspect. But then the detective discovers long hair beside the corpse. This becomes evidence that the killer was a female (assuming that the deceased was a male). So what should the beliefs about each suspect now be? The updated beliefs are 1/2 for each female suspect and 0 for each male suspect. Bayes’ rule helps us update the prior beliefs given a set of evidence. Expressing Bayes’ Theorem in words:

Probability of event A given evidence B = (Probability or Likelihood of observing B given event A occurred * Probability of A)/(Probability of B)
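
In symbols this is P(A | B) = P(B | A) * P(A) / P(B), which is a one-line function; the name bayes_update below is just illustrative:

```python
def bayes_update(p_b_given_a, p_a, p_b):
    """Bayes' theorem: posterior P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b
```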

Coming to solving a machine learning problem with Bayes’ Theorem: suppose we are given a set of Amazon reviews for a household product, each labelled with a “positive” or “negative” tag, and for a new incoming review we want the chances that it is “negative”. Without any information about the word distribution in the labelled reviews, if there are 75 positive and 25 negative reviews, then the probability of the incoming review being negative would be about 1/4. Now suppose our new review contains the word “broken”, and it is reported that 60% of all the reviews contain the word “broken” while 80% of the negative reviews contain it. What will be the updated probability of the new review being negative? According to Bayes’ Theorem it is 0.8 * 0.25 / 0.6 = 1/3.
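
Plugging the numbers from this example into the helper above:

```python
# P(negative | "broken") = P("broken" | negative) * P(negative) / P("broken")
print(bayes_update(p_b_given_a=0.8, p_a=0.25, p_b=0.6))  # 0.333...
```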

We see that, given the information about the word “broken” in the existing reviews and in the new review, the chance of the new review being negative increases. This is the backbone of the Naive Bayes algorithm: instead of information about a single word, we gather prior information about the whole bag of words and then, using Bayes’ Theorem, update the belief about the new event falling into one of the categories.