In our problem of classifying a product review as “positive” or “negative”, we had earlier used a Bayesian classifier to predict the latest review, using the already tagged older reviews as training data. Suppose that instead of using the Bayesian classifier alone, we also run the K-Nearest Neighbours algorithm alongside Naive Bayes. How would you combine the results obtained from Naive Bayes and KNN to get the best overall accuracy? The task is simple when both algorithms predict the same output for a product review, but it becomes much more difficult when their outputs differ.
From the prediction data, for both Naive Bayes and KNN, scores above 0.75 are correct at least 90% of the time. Given that a review is reported to be “positive” with score 0.77 by Naive Bayes and “negative” with score 0.83 by KNN, which prediction do we choose? We do not have enough information to come to a conclusion right away. But if I now add the information that for scores between 0.75 and 0.85, KNN gives 40% incorrect predictions whereas Naive Bayes gives only 20% incorrect predictions, which prediction will you choose? Although you are still uncertain, given this additional information you are now more inclined to choose the Naive Bayes prediction over KNN, and you will be right with high probability.
We can further improve our chances of choosing the correct prediction by computing the percentage of accurate predictions conditioned on the predicted class. In the above example, where Naive Bayes predicted “positive” and KNN predicted “negative”, instead of computing the percentage of correct (or incorrect) predictions between scores 0.75 and 0.85 for both algorithms, compute the percentage of correct predictions with “positive” tags between scores 0.75 and 0.85 for Naive Bayes, and the percentage of correct predictions with “negative” tags between scores 0.75 and 0.85 for KNN, then compare the two percentages directly. Using class information helps because the distribution of scores can differ between classes, so relying on an overall estimate is sometimes risky.
For example, given a few thousand reviews held out from the training set for validation, each algorithm produces its own sequence of predictions and scores. Using the actual sentiment tags associated with the reviews, create a table for each algorithm with rows being score ranges and columns “Positive Predicted”, “Positive Correct”, “Negative Predicted” and “Negative Correct”:
| Naive Bayes Scores | Positive Predicted | Positive Correct | Negative Predicted | Negative Correct |
|---|---|---|---|---|
| 0.10 – 0.15 | 10 | 1 | 13 | 1 |
| 0.75 – 0.80 | 25 | 22 | 18 | 14 |
| 0.95 – 1.00 | 8 | 8 | 5 | 5 |
Similarly for KNN:

| KNN Scores | Positive Predicted | Positive Correct | Negative Predicted | Negative Correct |
|---|---|---|---|---|
| 0.10 – 0.15 | 5 | 0 | 7 | 0 |
| 0.80 – 0.85 | 30 | 20 | 24 | 12 |
| 0.95 – 1.00 | 10 | 10 | 7 | 7 |
The entry (NB 0.75–0.80, Positive Predicted) denotes the number of reviews predicted as “positive” by Naive Bayes with scores between 0.75 and 0.80; similarly, (NB 0.75–0.80, Positive Correct) denotes the number of reviews “correctly” predicted as “positive” in that score range, and likewise for the KNN table.
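Building these tallies from a validation set can be sketched as follows. The function and field names here (`build_table`, `pos_pred`, etc.) are illustrative choices, not part of any library:

```python
# Sketch: build per-score-range tally tables like the ones above
# from a held-out validation set.
from collections import defaultdict

def build_table(predictions, bin_width=0.05):
    """predictions: list of (predicted_label, score, actual_label) tuples.
    Returns {bin_start: {"pos_pred", "pos_correct", "neg_pred", "neg_correct"}}."""
    table = defaultdict(lambda: {"pos_pred": 0, "pos_correct": 0,
                                 "neg_pred": 0, "neg_correct": 0})
    for predicted, score, actual in predictions:
        # Map the score to the start of its range, e.g. 0.77 -> 0.75
        bin_start = round(int(score / bin_width) * bin_width, 2)
        key = "pos" if predicted == "positive" else "neg"
        table[bin_start][key + "_pred"] += 1
        if predicted == actual:
            table[bin_start][key + "_correct"] += 1
    return dict(table)

# Toy validation data: (predicted, score, actual)
nb_val = [("positive", 0.77, "positive"), ("positive", 0.78, "negative"),
          ("negative", 0.76, "negative")]
nb_table = build_table(nb_val)
```

Here `nb_table[0.75]` accumulates every prediction whose score fell in the 0.75–0.80 range, counting predicted and correct totals separately per class.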
Then, on applying both algorithms to the test review, the review is reported to be “positive” with score 0.77 by Naive Bayes and “negative” with score 0.83 by KNN. From the above tables we see that for predictions within the score range 0.75–0.80, Naive Bayes correctly predicts “positive” reviews with 88% accuracy (22 out of 25), while for predictions within the score range 0.80–0.85, KNN correctly predicts “negative” reviews with only 50% accuracy (12 out of 24). Thus we classify the review as “positive”, since we have greater confidence in that prediction. KNN might perform better in terms of overall accuracy, but locally their performance can differ.
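The tie-breaking rule described above can be sketched as below. The table format (counts keyed by the start of each score range) and all function names are assumptions made for illustration:

```python
# Sketch: when two classifiers disagree, pick the prediction whose classifier
# has the higher class-conditional accuracy in the relevant score range.
def bin_accuracy(table, score, predicted, bin_width=0.05):
    """Fraction of correct predictions for the predicted class,
    within the score range that `score` falls into."""
    bin_start = round(int(score / bin_width) * bin_width, 2)
    row = table[bin_start]
    key = "pos" if predicted == "positive" else "neg"
    return row[key + "_correct"] / row[key + "_pred"]

def resolve(nb_table, knn_table, nb_pred, nb_score, knn_pred, knn_score):
    if nb_pred == knn_pred:          # agreement: nothing to compute
        return nb_pred
    nb_acc = bin_accuracy(nb_table, nb_score, nb_pred)
    knn_acc = bin_accuracy(knn_table, knn_score, knn_pred)
    return nb_pred if nb_acc >= knn_acc else knn_pred

# Worked example from the text: NB says "positive" (0.77, range 0.75-0.80,
# 22/25 = 88%); KNN says "negative" (0.83, range 0.80-0.85, 12/24 = 50%).
nb_table = {0.75: {"pos_pred": 25, "pos_correct": 22,
                   "neg_pred": 18, "neg_correct": 14}}
knn_table = {0.80: {"pos_pred": 30, "pos_correct": 20,
                    "neg_pred": 24, "neg_correct": 12}}
print(resolve(nb_table, knn_table, "positive", 0.77, "negative", 0.83))
# -> positive
```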
We need not do the above calculations when both NB and KNN predict the same class, even if their scores differ. In cases where a count such as (NB 0.75–0.80, Positive Predicted) is very small, widen the window to 0.70–0.80 or 0.75–0.85, or take the average of both. In another variation, we can train a Decision Tree learner separately on the prediction scores, where the input features to the Decision Tree are the predicted classes and predicted scores from each of the ensemble learners, and the target labels are the actual class labels.
In the Decision Tree variant, we do not have to manually select a fixed small range of scores (like 0.80 to 0.85 and so on); the decision tree learner automatically selects ranges such that entropy is minimized, i.e. within each selected score range, the predictions of one of the classifiers have much higher participation than the other's, so that predictions become easier. Use a 50-50 split on the validation samples: train the Decision Tree learner on 50% of them and validate on the remaining 50%. We will discuss in detail how to combine an ensemble of learners using the C5.0 Decision Tree with AdaBoost.
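To see how a tree learner picks score thresholds automatically instead of using fixed 0.05-wide ranges, here is a toy one-level decision "stump" on a single score feature. This is only a stand-in sketch; a real implementation would use a full decision-tree library such as the C5.0-with-AdaBoost setup mentioned above, and all names and data here are made up:

```python
# Sketch: learn the score threshold that best separates correct from
# incorrect predictions, the way a tree split does, instead of hand-picking
# fixed ranges like 0.80-0.85.
def best_threshold(scores, labels):
    """scores: prediction scores; labels: True if that prediction was correct.
    Returns the cut minimizing misclassifications for the rule score >= cut."""
    best_cut, best_err = None, float("inf")
    for cut in sorted(set(scores)):
        # Predict "correct" when score >= cut; count mistakes.
        err = sum((s >= cut) != l for s, l in zip(scores, labels))
        if err < best_err:
            best_cut, best_err = cut, err
    return best_cut, best_err

# Toy data: NB scores and whether NB's prediction was actually correct.
scores = [0.55, 0.60, 0.72, 0.78, 0.81, 0.90, 0.95]
correct = [False, False, False, True, True, True, True]
cut, err = best_threshold(scores, correct)
print(cut, err)  # -> 0.78 0
```

A full decision tree applies this kind of split recursively over all the meta-features (both classifiers' predicted classes and scores), which is what lets it carve out the entropy-minimizing score ranges automatically.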