The Kullback-Leibler (KL) divergence is a fundamental quantity in information theory that measures how much one probability distribution differs from another. If the true probability distribution of a sequence of events is denoted by P, then for any distribution Q used to model or approximate P, the information lost in the approximation is the KL divergence.

D_{KL} (P||Q) = ∑_{i} P_{i} log_{2} (P_{i} / Q_{i})
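As a concrete illustration, the sum above can be computed directly. A minimal Python sketch (the distributions are made up for the example):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits for two discrete distributions over the same support."""
    # Terms with p_i = 0 contribute nothing, by the convention 0 * log 0 = 0
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A skewed distribution P measured against a uniform Q
p = [0.7, 0.2, 0.1]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))  # > 0; zero only when P == Q
```

Note that D_{KL}(P||Q) is not symmetric in P and Q, which is why it is called a divergence rather than a distance.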

This quantity is related to the mutual information that we used earlier for feature selection. Mutual information is the KL divergence of the joint distribution of feature and class, P(Class, Feature), from the product of their marginal distributions, P(Class)*P(Feature).
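That relationship can be written down directly as code. A sketch that computes mutual information from a joint distribution given as a 2-D table of probabilities (the example tables are made up):

```python
import math

def mutual_information(joint):
    """I(X;Y) = D_KL(P(X,Y) || P(X)P(Y)) in bits; joint is a 2-D list of probabilities."""
    px = [sum(row) for row in joint]             # marginal over rows
    py = [sum(col) for col in zip(*joint)]       # marginal over columns
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

# Independent feature and class -> joint equals product of marginals -> MI = 0
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
# Perfectly informative feature -> MI = 1 bit
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
```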

We know the joint distribution P(Class, Feature), i.e. the probability that a document contains feature F and belongs to class C. If we approximate it by P(Class)*P(Feature), the approximation is exact if and only if the feature F is independent of the class C. But if, for documents belonging to class C, the probability of feature F occurring is higher than in other documents, then the approximation loses information, because it does not account for this dependence. The higher the KL divergence, the further the feature is from being equally probable in all classes.

Suppose the feature F is contained in N documents, and out of these N documents, N_{1} belong to class c_{1}, N_{2} to class c_{2} and N_{3} to class c_{3}. What is the likelihood of observing this distribution, given that the feature is equally likely to be present in all 3 classes?

Likelihood = N!/(N_{1}! N_{2}! N_{3}!) (1/3)^{N}
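This is the multinomial probability of the observed counts under a uniform 1/3 chance per class. A small Python sketch (the count vectors are made up for the example):

```python
import math

def multinomial_likelihood(counts):
    """Probability of observing the given class counts when the feature is
    equally likely in each of the len(counts) classes."""
    n = sum(counts)
    k = len(counts)
    coeff = math.factorial(n)
    for c in counts:
        coeff //= math.factorial(c)   # exact integer division, no rounding error
    return coeff * (1 / k) ** n

# One document per class: 3!/(1!1!1!) * (1/3)^3 = 6/27
print(multinomial_likelihood([1, 1, 1]))
# Heavily skewed counts are far less likely under the uniform model
print(multinomial_likelihood([10, 0, 0]))
```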

Taking the average log likelihood of the above quantity, we get:

L = (1/N) log (N!/(N_{1}! N_{2}! N_{3}!) (1/3)^{N})

Using Stirling’s approximation log N! ≈ N log N – N for large values of N (in base 2 the –N term becomes –N log_{2} e, but these correction terms cancel in the next step since N_{1} + N_{2} + N_{3} = N), we get
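The accuracy of this approximation can be checked numerically. A sketch using `math.lgamma`, which gives ln N! exactly as lgamma(N + 1):

```python
import math

# Compare ln(N!) with Stirling's approximation N ln N - N.
# The absolute error grows only like (1/2) ln(2*pi*N), so the per-document
# error (divided by N) vanishes as N grows, which is what the derivation needs.
for n in (10, 100, 1000):
    exact = math.lgamma(n + 1)        # ln(N!)
    approx = n * math.log(n) - n
    print(n, round(exact, 2), round(approx, 2), round((exact - approx) / n, 4))
```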

L = (1/N) (N log N – N – N_{1} log N_{1} – N_{2} log N_{2} – N_{3} log N_{3} + N_{1} + N_{2} + N_{3} – N log 3)

= (1/N) (N log N – N_{1} log N_{1} – N_{2} log N_{2} – N_{3} log N_{3} – N log 3), since N_{1} + N_{2} + N_{3} = N

= (1/N) (- N_{1} log (3N_{1}/N)- N_{2} log (3N_{2}/N) – N_{3} log (3N_{3}/N))

If we denote P_{i} = N_{i}/N and Q_{i} = 1/3, then the above average log likelihood equals – ∑_{i} P_{i} log_{2} (P_{i} / Q_{i}), i.e. the negative of the KL divergence.
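This identity can be verified numerically for a hypothetical count vector (the counts below are made up; `math.lgamma` gives the exact log-factorials, so the only gap between the two values is the Stirling error, which shrinks as N grows):

```python
import math

# Hypothetical counts N_1, N_2, N_3 of documents containing feature F per class
counts = [700, 200, 100]
n = sum(counts)

# Exact average log2-likelihood under the uniform multinomial model
log2_lik = (math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)) / math.log(2) \
           - n * math.log2(3)
avg_ll = log2_lik / n

# KL divergence of P_i = N_i/N from the uniform Q_i = 1/3
kl = sum((c / n) * math.log2((c / n) / (1 / 3)) for c in counts)

print(round(avg_ll, 4), round(-kl, 4))  # nearly equal for large N
```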

Hence the KL divergence is the negative of the average log likelihood of observing the joint distribution P(Class, Feature) under the assumption that the feature is equally probable in all classes, i.e. that features are drawn randomly with replacement from the feature space and placed in documents without any information about which class each document belongs to.