How to measure the performance of a classifier: Accuracy vs. Efficiency

Many organizations face classification tasks: classifying reviews as positive or negative, emails as spam or non-spam, financial documents into loan types, a set of symptoms to a disease, and so on. There are scientific measures to assess the performance of a classifier, such as Precision, Recall and F-Score.

For a given category: True positive (TP) is the count of objects correctly predicted as that category. False positive (FP) is the count of objects predicted as that category that actually belong to a different one. True negative (TN) is the count of objects correctly predicted as a different category. False negative (FN) is the count of objects that truly belong to this category but were predicted as a different one.

Precision measures the correctness (accuracy) of the classifier's predictions for a category: given that reviews are predicted to be “positive”, what fraction of them are truly “positive”? That is, TP/(TP+FP). Similarly, Recall measures the coverage (efficiency) of the classifier for a category: given that reviews are truly “positive”, what fraction are predicted correctly as “positive”? That is, TP/(TP+FN).
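
As a minimal sketch (in Python, using hypothetical count variables rather than any particular library), these two definitions translate directly into code:

```python
def precision(tp, fp):
    """Of all objects predicted as the category, the fraction that was correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of all objects truly in the category, the fraction the classifier found."""
    return tp / (tp + fn)
```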

For example, suppose that out of 100 reviews for a product, 80 are truly positive and 20 truly negative. Validating a trained classifier on them, of the 80 truly positive reviews 60 are predicted as positive and 20 as negative, and of the 20 truly negative reviews 10 are predicted as positive and the remaining 10 as negative. Then the precision of the classifier for “positive” reviews is 60/70 ≈ 0.86 and its recall for “positive” reviews is 60/80 = 0.75. Similarly, the precision for “negative” reviews is 10/30 ≈ 0.33 and the recall is 10/20 = 0.5.
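
Plugging the counts from this example into the functions sketched above reproduces these numbers:

```python
# Counts for the "positive" category from the example above
tp_pos, fp_pos, fn_pos = 60, 10, 20
print(precision(tp_pos, fp_pos))  # 60/70 ≈ 0.86
print(recall(tp_pos, fn_pos))     # 60/80 = 0.75

# Counts for the "negative" category (TP here = negatives predicted negative)
tp_neg, fp_neg, fn_neg = 10, 20, 10
print(precision(tp_neg, fp_neg))  # 10/30 ≈ 0.33
print(recall(tp_neg, fn_neg))     # 10/20 = 0.5
```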

Reporting two different numbers for each category can be confusing when assessing the performance of a classifier. Precision is the more natural measure in a production scenario where the true labels are not available, while recall matters more to a client who wants to know how many of the loan applications they submitted were correctly validated by the classifier. Hence we generally use a single measure called the F-Score, the harmonic mean of precision and recall:

F-score = 2 * precision * recall / (precision + recall).
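
Applying this formula to the review example above (a sketch, reusing the precision and recall values computed earlier):

```python
def f_score(p, r):
    # Harmonic mean of precision (p) and recall (r)
    return 2 * p * r / (p + r)

print(f_score(60/70, 60/80))  # "positive" reviews: ≈ 0.80
print(f_score(10/30, 10/20))  # "negative" reviews: = 0.40
```

The single F-Score per category (0.80 for “positive”, 0.40 for “negative”) summarizes both measures at once, and the low “negative” score immediately flags where the classifier is weak.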
