Machine Learning Fundamentals Part 2 : Bayesian Classification

Given the task of classifying a product review as “positive” or “negative”, first we need to train a model using already labelled samples. For example lets say that 1000 reviews are available with the tags “positive” or “negative” associated with each review. Now to train a model using the distribution of words in the sample reviews, compute the following probabilities P(w1), P(w2), … where P(wi) denotes the fraction of reviews containing the word wi i.e. if the word w1 occurs in 30 out of 1000 reviews, then P(w1) = 0.03. Now lets say that 800 out of 1000 reviews are labelled “positive” while remaining 200 are labelled as “negative”. Then P(+) = 0.8 and P(-) = 0.2.

What we need to estimate is, given an unlabelled review and the set of words w’1, w’2, …. in it, which is a subset of the words w1, w2, …. from our training samples, what is the probability of the review being “positive” (probability of it being “negative” would be 1 minus that), i.e. P(+|w’1, w’2, ….). From our earlier tutorial we saw that using Bayes’ Theorem:

P(+|w’1, w’2, ….) = (P(w’1, w’2, …. | +) * P(+))/P(w’1, w’2, ….)

Since the value P(w’1, w’2, ….) is just a scaling factor and is constant for both the categories “positive” and “negative”, we can omit that. P(+) is already known to be 0.8. We need to estimate the probability P(w’1, w’2, …. | +) i.e. given that a review is “positive” what is the likelihood of the review “generating” the words w’1, w’2, ….. Using the chain rule of probability :

P(w’1, w’2, …. | +) = P(w’1 | w’2, w’3, …., + ) * P(w’2 | w’3, w’4, …, +) * …* P(w’M|+)

From out training samples we can estimate each of the above product terms, but given a large number of terms, it will take a lot of time to find the above distribution for each review and will not be feasible for any real time prediction. Instead we assume that each word is independent of other words and each word is only dependent on the category that generated it i.e. P(w’1 | w’2, w’3, w’4, …., + ) = P(w’1 | +), P(w’2 | w’3, w’4, w’5, …., +) = P(w’2 | +), and so on for other words. This is the Naive Bayes assumption. With the above assumptions, we have :

P(w’1, w’2, …. | +) = P(w’1 | +) * P(w’2 | +) * …. P(w’M | +)

From the training samples we can easily estimate the above terms beforehand i.e. Given positive reviews only, P(w’1 | +) denotes what fraction of the positive reviews contains the word w’1. In fact, to account for words which are not present in “positive” reviews and subsequently avoid multiplying by a zero, the values P(w’i | +) are computed as

(1 + Number of positive reviews w’i occurs in)/(Number words + Number of positive reviews)

After computing the quantity P(w’1, w’2, …. | +) we can compute the quantity P(+|w’1, w’2, ….), which if greater than 0.6(or any other value greater than 0.5), we say that the review is positive. While for the negative case, we might say that if 1-P(+|w’1, w’2, ….) is less than 0.25, then the review is negative, else if it is between 0.25 and 0.6 then we say that we are not sure.

The above model can be easily extended to multiple classes instead of just two classes (positive and negative), for example in medical diagnosis, the classes can be the list of diseases, while counterparts for words will be the observed symptoms. Also in the above example, instead of just considering words, we can also consider N-grams i.e. sequence of words. Experimentally it is found that 2-grams or 3-grams perform better that single words (1-grams) in the overall accuracy of the classifier.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s