Machine Learning Fundamentals: Maximum Likelihood Estimation

On tossing a coin 10 times, heads land 7 times while tails land 3 times. Assuming the coin can be either fair or biased, we do not know the probability distribution of the outcomes. Let's suppose the probability of landing heads is p; then the probability of tails would be (1-p). What is the likelihood that this probability distribution generated the outcome, i.e. the probability of 7 heads and 3 tails? Since the tossing of a coin follows the Binomial distribution, we have

L(p) = [10! / (7! 3!)] p^7 (1-p)^3 = 120 p^7 (1-p)^3

According to the principle of Maximum Likelihood Estimation, the value of p should be chosen so as to maximize the likelihood of the outcome, or equivalently its logarithm. Computing the log-likelihood:

Log L(p) = log 120 + 7 log p + 3 log (1-p)

To maximize the log-likelihood, we differentiate the above w.r.t. p and set it to zero: 7/p − 3/(1-p) = 0, which gives p = 0.7.
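We can verify this numerically with a minimal sketch: evaluate the coin-toss log-likelihood over a grid of candidate values of p and confirm the maximum sits at 0.7. The function name is illustrative, not from any library.

```python
import math

# Log-likelihood of the coin-toss outcome from above:
# log L(p) = log 120 + 7 log p + 3 log(1 - p)
def coin_log_likelihood(p, heads=7, tails=3):
    binom_coeff = math.comb(heads + tails, heads)  # 10! / (7! 3!) = 120
    return math.log(binom_coeff) + heads * math.log(p) + tails * math.log(1 - p)

# Evaluate on a fine grid of candidate probabilities in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=coin_log_likelihood)

print(p_hat)  # maximum is attained at p = 0.7
```

The log-likelihood is concave in p, so the grid maximum agrees with the closed-form solution from setting the derivative to zero.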

MLE is used to estimate the parameters of a probability distribution given a set of independent and identically distributed observations. In the above case it was the distribution of coin tosses. The same logic extends to estimating the parameters of more complex probability distributions.

We are given a sequence of numbers x1, x2, …, xN drawn from a Normal distribution N(μ, σ) with unknown parameters μ and σ. Before we estimate μ and σ, we need to write down the likelihood that N(μ, σ) generated the sequence x1, x2, …, xN:

L(μ, σ | xi) = ∏i (1/√(2π)) (1/σ) exp(−(xi − μ)^2 / (2σ^2))

L(μ, σ | xi) = (1/√(2π))^N (1/σ)^N exp(−(1/(2σ^2)) ∑i (xi − μ)^2)

Log L(μ, σ | xi) = N log(1/√(2π)) − N log σ − (1/(2σ^2)) ∑i (xi − μ)^2

Taking the derivative of the above log-likelihood w.r.t. μ and setting it to zero, we get

μ = (x1 + x2 + …+ xN)/N,

which is the mean of the sequence of numbers. Next, taking the derivative w.r.t. σ and setting it to zero, we get

σ^2 = (1/N) ∑i (xi − μ)^2

Thus σ^2 is the variance of the sequence of numbers, and σ is the standard deviation. Given random observations and the probability distribution generating them, we can use MLE to derive the parameters of the distribution that generated the observations.
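The closed-form Gaussian MLE can be checked directly: draw samples from a known Normal distribution and confirm that the sample mean and the biased variance (dividing by N, not N-1) recover the true parameters. The sample parameters (5.0, 2.0) here are arbitrary choices for illustration.

```python
import math
import random

# Draw 10,000 samples from N(mu=5, sigma=2).
random.seed(0)
xs = [random.gauss(5.0, 2.0) for _ in range(10000)]

n = len(xs)
mu_hat = sum(xs) / n                              # MLE of mu: the sample mean
var_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # MLE of sigma^2: divides by N, not N-1
sigma_hat = math.sqrt(var_hat)

print(round(mu_hat, 2), round(sigma_hat, 2))  # close to 5.0 and 2.0
```

Note that the MLE of the variance divides by N; the unbiased estimator that divides by N-1 is a different (non-MLE) choice, and for large N the two are nearly identical.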

Combining predictions of two classifiers

In our problem of classifying a product review as “positive” or “negative”, we had earlier used a Bayesian classifier to predict the latest review, using the already tagged older reviews as training data. Instead of using the Bayesian classifier alone, let's suppose we also used the K-Nearest Neighbours algorithm along with Naive Bayes. How would you combine the results obtained from Naive Bayes and KNN to get the best overall accuracy? The task is simple when both algorithms predict the same output for a product review, but it becomes much harder when their outputs differ.

From the prediction data, for both Naive Bayes and KNN, scores above 0.75 are correct at least 90% of the time. Given that a review is reported as “positive” with score 0.77 by Naive Bayes and as “negative” with score 0.83 by KNN, which prediction do we choose? We do not have enough information to come to a conclusion right away. But if I now add that for scores between 0.75 and 0.85, KNN gives 40% incorrect predictions whereas Naive Bayes gives only 20%, which prediction will you choose? Although you are still uncertain, given the additional information you are now more inclined to choose the Naive Bayes prediction over KNN, and with high probability you are right.

We can further improve our chances of choosing the correct prediction by computing the percentage of accurate predictions conditioned on the class. In the above example, where Naive Bayes predicted “positive” and KNN predicted “negative”, instead of finding the percentage of correct (or incorrect) predictions between scores 0.75 and 0.85 for both algorithms, find the percentage of correct predictions with “positive” tags between scores 0.75 and 0.85 for Naive Bayes, and the percentage of correct predictions with “negative” tags between scores 0.75 and 0.85 for KNN, then compare the two percentages directly. Using class information helps because the distribution of scores for different classes might differ, so relying on an overall estimate is sometimes risky.

For example, given a few thousand reviews from the training set to validate on, each algorithm produces its own sequence of predictions and scores. Using the actual sentiment tags associated with the reviews, create a table for each algorithm with rows being score ranges and columns “Positive Predicted”, “Positive Correct”, “Negative Predicted”, “Negative Correct”.

Naive Bayes Scores | Positive Predicted | Positive Correct | Negative Predicted | Negative Correct
0.10 – 0.15        | 10                 | 1                | 13                 | 1
0.75 – 0.80        | 25                 | 22               | 18                 | 14
0.95 – 1.00        | 8                  | 8                | 5                  | 5

Similarly for KNN :

KNN Scores  | Positive Predicted | Positive Correct | Negative Predicted | Negative Correct
0.10 – 0.15 | 5                  | 0                | 7                  | 0
0.80 – 0.85 | 30                 | 20               | 24                 | 12
0.95 – 1.00 | 10                 | 10               | 7                  | 7

The entry (NB 0.75–0.80, Positive Predicted) denotes the number of reviews predicted as “positive” with scores between 0.75 and 0.80 by Naive Bayes; similarly, (NB 0.75–0.80, Positive Correct) denotes the number of reviews correctly predicted as “positive” within that score range, and likewise for the KNN table.

Then, on applying both algorithms to the test review, it is reported as “positive” with score 0.77 by Naive Bayes and as “negative” with score 0.83 by KNN. From the above tables we see that within the score range 0.75–0.80, Naive Bayes predicts “positive” reviews correctly 88% of the time (22 out of 25), whereas within the score range 0.80–0.85, KNN predicts “negative” reviews correctly only 50% of the time (12 out of 24). Thus we classify the review as “positive”, since we have greater confidence in that prediction. KNN might perform better overall in terms of accuracy, yet locally their performance can differ.
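A minimal sketch of this table-lookup combiner, using the example tables above; the table layout and helper names are illustrative, not from any library.

```python
# Per-bin (predicted, correct) counts keyed by score range, from the tables above.
nb_table = {
    (0.10, 0.15): {"positive": (10, 1), "negative": (13, 1)},
    (0.75, 0.80): {"positive": (25, 22), "negative": (18, 14)},
    (0.95, 1.00): {"positive": (8, 8), "negative": (5, 5)},
}
knn_table = {
    (0.10, 0.15): {"positive": (5, 0), "negative": (7, 0)},
    (0.80, 0.85): {"positive": (30, 20), "negative": (24, 12)},
    (0.95, 1.00): {"positive": (10, 10), "negative": (7, 7)},
}

def class_accuracy(table, score, label):
    # Find the score bin containing this score and return the fraction of
    # correct predictions for this class within the bin.
    for (lo, hi), stats in table.items():
        if lo <= score < hi or score == hi == 1.00:
            predicted, correct = stats[label]
            return correct / predicted if predicted else 0.0
    return 0.0

def combine(nb_label, nb_score, knn_label, knn_score):
    if nb_label == knn_label:  # agreement: nothing to resolve
        return nb_label
    # Disagreement: pick the classifier with higher per-class bin accuracy.
    nb_acc = class_accuracy(nb_table, nb_score, nb_label)
    knn_acc = class_accuracy(knn_table, knn_score, knn_label)
    return nb_label if nb_acc >= knn_acc else knn_label

print(combine("positive", 0.77, "negative", 0.83))  # "positive" (0.88 vs 0.50)
```

A real implementation would cover the full score range with bins and fall back to a wider window (as described below) when a bin has too few samples.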

We need not do the above calculations if both NB and KNN predict the same class, even if their scores differ. When an entry such as (NB 0.75–0.80, Positive Predicted) is very small, widen the window to 0.70–0.80 or 0.75–0.85, or take the average over both. In another variation, we can train a Decision Tree learner separately on the prediction data, where the input features to the Decision Tree would be the predicted classes and predicted scores from each of the ensemble learners, and the target labels would be the actual class labels.

In the Decision Tree variant, we would not have to manually select a constant small range of scores (like 0.80 to 0.85 and so on); the decision tree learner automatically selects ranges that minimize entropy, i.e. within each selected score range the predictions of one classifier dominate those of the other, making the decision easier. Use a 50-50 split on the validation samples: train the Decision Tree learner on 50% of them and validate on the remaining 50%. We will discuss in detail how to combine an ensemble of learners using the C5.0 Decision Tree with AdaBoost.
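A sketch of the data preparation for this variant: each validation review contributes one meta-feature row of [NB class, NB score, KNN class, KNN score], with the true sentiment as the target, split 50-50 for training and validating the meta-learner. The sample records and the label encoding are illustrative assumptions.

```python
# Hypothetical validation records: (nb_label, nb_score, knn_label, knn_score, true_label).
validation = [
    ("positive", 0.77, "negative", 0.83, "positive"),
    ("negative", 0.91, "negative", 0.88, "negative"),
    ("positive", 0.55, "positive", 0.60, "negative"),
    ("negative", 0.80, "positive", 0.79, "positive"),
]

def encode(label):
    # Encode class labels numerically for the tree learner.
    return 1 if label == "positive" else 0

# Meta-features: predicted class and score from each base learner.
X = [[encode(nb), nb_s, encode(knn), knn_s] for nb, nb_s, knn, knn_s, _ in validation]
y = [encode(true) for *_, true in validation]

# 50-50 split: fit the meta-learner on one half, validate on the other.
half = len(X) // 2
X_train, y_train = X[:half], y[:half]
X_valid, y_valid = X[half:], y[half:]

# A decision tree learner (e.g. scikit-learn's DecisionTreeClassifier, or C5.0
# as mentioned above) would now be fit on (X_train, y_train) and scored on
# (X_valid, y_valid); its score-threshold splits replace the hand-picked bins.
print(len(X_train), len(X_valid))
```

The tree's learned thresholds on the score features play the role of the manually chosen score ranges in the table-based approach.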