Feature Selection using Kullback-Leibler Divergence

The Kullback-Leibler (KL) divergence is a fundamental quantity of information theory that measures how close one probability distribution is to another. If the true probability distribution of a sequence of events is denoted by P, then for any distribution Q used to model or approximate P, the KL divergence is the information lost in the process.

$$D_{KL}(P \,\|\, Q) = \sum_i P_i \log_2 \frac{P_i}{Q_i}$$
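As a quick illustration, here is a minimal sketch of this formula in Python (standard library only); the distributions p and q below are made-up examples, not anything from a real corpus.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits for two discrete distributions given as lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up example: a skewed distribution P approximated by a uniform Q.
p = [0.7, 0.2, 0.1]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))  # > 0: information is lost by the approximation
print(kl_divergence(p, p))  # 0: nothing is lost when Q equals P
```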

This quantity is closely related to the mutual information we used earlier for feature selection: mutual information is the KL divergence of the joint distribution of feature and class, P(Class, Feature), from the product of their marginal distributions, P(Class) P(Feature).
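To make that connection concrete, the sketch below computes mutual information as exactly this KL divergence, starting from a small table of hypothetical document counts (the counts are invented for illustration).

```python
import math

# Hypothetical document counts: rows = classes, columns = feature absent / present.
counts = [[40, 10],   # class c1
          [15, 35]]   # class c2
total = sum(sum(row) for row in counts)

def mutual_information(counts, total):
    """Mutual information = D_KL( P(Class, Feature) || P(Class) * P(Feature) )."""
    class_marginal = [sum(row) / total for row in counts]
    feature_marginal = [sum(counts[c][f] for c in range(len(counts))) / total
                        for f in range(len(counts[0]))]
    mi = 0.0
    for c, row in enumerate(counts):
        for f, n in enumerate(row):
            p_joint = n / total
            if p_joint > 0:
                mi += p_joint * math.log2(p_joint / (class_marginal[c] * feature_marginal[f]))
    return mi

print(mutual_information(counts, total))  # 0 only if feature and class are independent
```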

We know the joint distribution P(Class, Feature), i.e. the probability that a document contains feature F and belongs to class C. If we approximate it with P(Class) P(Feature), the approximation is exact if and only if the feature F is independent of the class C. If, however, documents belonging to class C are more likely to contain feature F than other documents, the approximation loses exactly this information, and the KL divergence measures the loss. The higher the KL divergence, the further the feature is from being equally probable in all classes.

Suppose the feature F is contained in N documents, of which N1 belong to class c1, N2 to class c2 and N3 to class c3. What is the likelihood of observing this split, given that the feature is equally likely to be present in all 3 classes?

$$\text{Likelihood} = \frac{N!}{N_1!\, N_2!\, N_3!} \left(\frac{1}{3}\right)^{N}$$
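This multinomial likelihood is easy to evaluate numerically; the sketch below (with made-up counts N1, N2, N3) works in log space so the factorials do not overflow.

```python
from math import lgamma, log, exp

def equal_class_likelihood(counts):
    """Multinomial likelihood of the observed class counts under Q_i = 1/3 per class."""
    n = sum(counts)
    k = len(counts)
    log_coeff = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)  # log N!/(N1!...Nk!)
    return exp(log_coeff + n * log(1.0 / k))

# Made-up example: feature appears in 60 documents, split across 3 classes.
print(equal_class_likelihood([40, 15, 5]))   # skewed split: tiny likelihood
print(equal_class_likelihood([20, 20, 20]))  # even split: much larger likelihood
```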

Taking the average log-likelihood of this quantity, we get:

$$L = \frac{1}{N} \log \left( \frac{N!}{N_1!\, N_2!\, N_3!} \left(\frac{1}{3}\right)^{N} \right)$$

Using Stirling's approximation, $\log N! \approx N \log N - N$ for large values of N, we get

$$
\begin{aligned}
L &= \frac{1}{N} \left( N \log N - N - N_1 \log N_1 - N_2 \log N_2 - N_3 \log N_3 + N_1 + N_2 + N_3 - N \log 3 \right) \\
  &= \frac{1}{N} \left( N \log N - N_1 \log N_1 - N_2 \log N_2 - N_3 \log N_3 - N \log 3 \right), \quad \text{since } N_1 + N_2 + N_3 = N \\
  &= \frac{1}{N} \left( - N_1 \log \frac{3 N_1}{N} - N_2 \log \frac{3 N_2}{N} - N_3 \log \frac{3 N_3}{N} \right)
\end{aligned}
$$

If we denote $P_i = N_i / N$ and $Q_i = 1/3$, the average log-likelihood above equals $- \sum_i P_i \log (P_i / Q_i)$, i.e. the negative of the KL divergence (up to the base of the logarithm).
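The sketch below (again with invented counts) checks this numerically: the exact average log-likelihood approaches the negative KL divergence as N grows, the small gap being the lower-order terms that Stirling's approximation drops.

```python
from math import lgamma, log

def avg_log_likelihood(counts):
    """(1/N) * log of the multinomial likelihood under equal class probabilities."""
    n = sum(counts)
    k = len(counts)
    log_l = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts) + n * log(1.0 / k)
    return log_l / n

def neg_kl(counts):
    """- sum_i P_i log(P_i / Q_i) with P_i = N_i / N and Q_i = 1/k (natural log)."""
    n = sum(counts)
    k = len(counts)
    return -sum((c / n) * log((c / n) * k) for c in counts if c > 0)

# Made-up split in fixed proportions 4:2:1, scaled up to show the convergence.
for scale in (1, 10, 100):
    counts = [4 * scale, 2 * scale, 1 * scale]
    print(counts, avg_log_likelihood(counts), neg_kl(counts))
```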

Hence the KL divergence measures the negative of the average log-likelihood of observing the joint distribution P(Class, Feature) under the assumption that the feature is equally probable in all classes, i.e. that features are drawn at random (with replacement) from the feature space and placed into documents without any information about which class each document belongs to.
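Putting the pieces together, a feature-selection step based on this measure could look like the sketch below: for each feature, estimate $P_i$ from its per-class document counts, compute the KL divergence from the uniform distribution, and keep the highest-scoring features. The feature names and counts here are made up purely for illustration.

```python
from math import log2

def kl_from_uniform(class_counts):
    """KL divergence (in bits) of a feature's class distribution from the uniform one."""
    n = sum(class_counts)
    k = len(class_counts)
    return sum((c / n) * log2((c / n) * k) for c in class_counts if c > 0)

# Hypothetical per-class document counts for a few candidate features.
feature_counts = {
    "goal":     [95, 3, 2],    # concentrated in one class: high score
    "election": [5, 90, 5],
    "the":      [34, 33, 33],  # spread evenly across classes: near-zero score
}

# Rank features by divergence from "equally probable in all classes".
ranked = sorted(feature_counts.items(),
                key=lambda item: kl_from_uniform(item[1]),
                reverse=True)
for feature, counts in ranked:
    print(feature, round(kl_from_uniform(counts), 3))
```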
