Feature Selection using Kullback-Leibler Divergence

The Kullback-Leibler (KL) divergence is a fundamental quantity in information theory that measures how close two probability distributions are. If the true probability distribution of a sequence of events is denoted by P, then for any distribution Q that is used to model or approximate the true distribution P, the information lost in the process is the KL divergence.

DKL (P||Q) = ∑i Pi log2 (Pi / Qi)

This quantity is closely related to the mutual information that we used earlier for feature selection: mutual information is the KL divergence of the joint distribution of feature and class, P(Class, Feature), from the product of their marginal distributions P(Class)*P(Feature).

We know the joint distribution P(Class, Feature), i.e. the probability that a document contains feature F and also belongs to class C. If we approximate it with the product P(Class)*P(Feature), the approximation is exact if and only if the feature F is independent of the class C. If, however, the feature F occurs with higher probability in documents of class C than in all other documents, then the approximation loses information, because it does not account for this dependence. The higher the KL divergence, the further the feature is from being equally probable in all classes.
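As a minimal sketch (the function name and the example probabilities are illustrative, not from the text), the divergence of a feature's class distribution from the uniform distribution can be computed directly from the definition:

```python
import math

def kl_divergence(p, q):
    """KL divergence D_KL(P || Q) in bits (log base 2).

    Terms with p_i == 0 contribute nothing; q_i must be positive
    wherever p_i is positive, otherwise the divergence is infinite.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            total += pi * math.log2(pi / qi)
    return total

# A feature concentrated in one class diverges strongly from the
# uniform distribution (1/3, 1/3, 1/3); an evenly spread feature does not.
uniform = [1/3, 1/3, 1/3]
print(kl_divergence([0.8, 0.1, 0.1], uniform))   # clearly positive
print(kl_divergence([1/3, 1/3, 1/3], uniform))   # 0.0
```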

Suppose the feature F is contained in N documents, and out of these N documents, N1 belong to class c1, N2 to class c2 and N3 to class c3. What is the likelihood of observing this distribution, given that the feature is equally likely to be present in all the 3 classes?

Likelihood = N!/(N1! N2! N3!) * (1/3)^N

Taking the average log likelihood of the above quantity, we get:

L = (1/N) log (N!/(N1! N2! N3!) * (1/3)^N)

Using Stirling’s approximation log N! ≈ N log N – N, for large values of N, we get

L = (1/N) (N log N – N – N1 log N1 – N2 log N2 – N3 log N3 + N1 + N2 + N3 – N log 3)

= (1/N) (N log N – N1 log N1 – N2 log N2 – N3 log N3 – N log 3), since N1 + N2 + N3 = N

= (1/N) (- N1 log (3N1/N)- N2 log (3N2/N) – N3 log (3N3/N))

If we denote Pi  = (Ni /N) and Qi = 1/3, then we see that the above average log likelihood equals – ∑i Pi log2 (Pi / Qi), i.e. the negative of the KL Divergence.

Hence the KL divergence measures the negative of the average log likelihood of observing the joint probability distribution P(Class, Feature) under the assumption that the feature is equally probable in all the classes, i.e. features are selected randomly with replacement from a feature space and put in documents without any information about which class the document belongs to.
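The derivation above can be checked numerically. The sketch below (the counts N1, N2, N3 are hypothetical) computes the exact average log likelihood using math.lgamma for log N!, and compares it with the negative KL divergence from the uniform distribution; for large N the two agree up to a small Stirling correction term of order (log N)/N.

```python
import math

def avg_log_likelihood(counts):
    """(1/N) log2 of the multinomial likelihood of `counts` under a
    uniform class distribution (exact, using lgamma for log N!)."""
    n = sum(counts)
    k = len(counts)
    # lgamma(n + 1) = ln(n!), so this is the exact log likelihood in nats.
    log_l = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts) \
            + n * math.log(1.0 / k)
    return log_l / (n * math.log(2))   # per-observation value, in bits

def kl_from_counts(counts):
    """D_KL(P || uniform) in bits, with P_i = N_i / N and Q_i = 1/k."""
    n = sum(counts)
    k = len(counts)
    return sum((c / n) * math.log2((c / n) * k) for c in counts if c > 0)

counts = [300, 500, 200]               # N1, N2, N3; N = 1000
print(avg_log_likelihood(counts))      # close to the line below for large N
print(-kl_from_counts(counts))
```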


Feature Selection using Mutual Information

Given two random variables X and Y, mutual information measures how much knowing one of these variables reduces uncertainty about the other. For example, if X and Y are independent, then knowing X gives no information about Y and vice versa, so their mutual information is zero. At the other extreme, if X is a deterministic function of Y and Y is a deterministic function of X, then all information conveyed by X is shared with Y, i.e. knowing X determines the value of Y and vice versa. If the variable X takes on the values {x1, x2, …} and Y takes on the values {y1, y2, …}, then the mutual information of X and Y is defined as the expected value of the point-wise mutual information, i.e. log (P(X=xi, Y=yj)/P(X=xi)P(Y=yj))

MI (X, Y) = ∑ P(X=xi, Y=yj) * log (P(X=xi, Y=yj)/P(X=xi)P(Y=yj))        ∀ (X=xi, Y=yj)

All log values in further calculations are taken base 2.

If X and Y are mutually independent, then for every pair (X=xi, Y=yj), P(X=xi, Y=yj) = P(X=xi)P(Y=yj), so every term contributes log 1 = 0 and the mutual information is zero. Mutual information is also related to entropy (the uncertainty of the random variables):

MI (X, Y) = H(Y) – H(Y|X) = H(X) – H(X|Y)

where H(Y) is the entropy (uncertainty) of the variable Y, and H(Y|X) is the remaining uncertainty in Y after knowing X. Hence H(Y) – H(Y|X) is the uncertainty in Y removed by knowing that X also occurred. Let X be the event that the feature “Apple” occurs in a document, and Y the event that the feature “Steve Jobs” occurs in a document. If the document is about the Apple computer company founded by Steve Jobs, then P(X, Y), the joint probability of X and Y, is greater than the product of the probabilities P(X)P(Y), because the presence of the feature “Apple” increases the chances of the feature “Steve Jobs” being present and vice versa. Thus mutual information would be high in this case, compared to a document about the “Health benefits of fruits every day”. In terms of entropy, knowing that the feature “Apple” is present makes the occurrence of the feature “Steve Jobs” much more certain than if “Apple” had not occurred, and vice versa.

In the context of text classification, let X be the occurrence of the feature F in a document and Y the event that a document belongs to class C. The point-wise mutual information for a document containing feature F and also belonging to class C is:

PMI(F, C) = log (P(X=F, Y=C)/P(X=F)P(Y=C)),

Similarly, the PMI for a document containing feature F but not belonging to class C is PMI(F, CT); for a document not containing feature F but belonging to class C, PMI(FT, C); and for a document neither containing feature F nor belonging to class C, PMI(FT, CT). Here X=FT denotes the event that the document does not contain feature F and Y=CT denotes the event that the document does not belong to class C:

PMI(F, CT) = log (P(X=F, Y=CT)/P(X=F)P(Y=CT)),

PMI(FT, C) = log (P(X=FT, Y=C)/P(X=FT)P(Y=C)),

PMI(FT, CT) = log (P(X=FT, Y=CT)/P(X=FT)P(Y=CT)),

Thus the expected mutual information for feature F and class C is given as :

MI(F, C) = P(X=F, Y=C) * PMI(F, C) + P(X=F, Y=CT) * PMI(F, CT) + P(X=FT, Y=C) * PMI(FT, C) + P(X=FT, Y=CT) * PMI(FT, CT)
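As a sketch, the expected MI can be computed directly from the joint table of the four feature/class events, and cross-checked against the entropy identity MI(X, Y) = H(Y) – H(Y|X). The function names are my own; the joint probabilities below are illustrative (they correspond to class c1 in the worked example later in this section).

```python
import math

def entropy(probs):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """Expected MI from a joint table joint[i][j] = P(X=xi, Y=yj)."""
    px = [sum(row) for row in joint]           # marginal P(X)
    py = [sum(col) for col in zip(*joint)]     # marginal P(Y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (px[i] * py[j]))   # P * PMI
    return mi

# Joint distribution of (feature present/absent, in class / not in class):
joint = [[0.03, 0.17],    # P(F, C),   P(F, CT)
         [0.07, 0.73]]    # P(FT, C),  P(FT, CT)

mi = mutual_information(joint)

# Cross-check against the entropy identity MI = H(Y) - H(Y|X):
h_y = entropy([0.03 + 0.07, 0.17 + 0.73])      # H(Y) from P(C), P(CT)
h_y_given_x = 0.0
for row in joint:
    px_i = sum(row)
    h_y_given_x += px_i * entropy([p / px_i for p in row])
print(round(mi, 4), round(h_y - h_y_given_x, 4))   # both ≈ 0.0046
```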

For multiple classes Ci, to check whether a feature F is discriminative or not, find the expected MI for feature F over all the classes, i.e.

D(F) = ∑ P(Ci) * MI(F, Ci), where D(F) is the discrimination factor for feature F

From our earlier example, there are 1000 documents: 100 belong to class c1, 800 to c2 and 100 to c3. Out of these 1000 documents, 200 contain the feature F; of these 200, 30 belong to class c1, 120 to class c2 and the remaining 50 to class c3. Hence, for class c1,

PMI(F, c1) = 0.585, PMI(F, c1T) = -0.082, PMI(FT, c1) = -0.193, PMI(FT, c1T) = 0.020, Thus

MI(F, c1) = 0.0046

PMI(F, c2) = -0.415, PMI(F, c2T) = 1.0, PMI(FT, c2) = 0.087, PMI(FT, c2T) = -0.415, Thus

MI(F, c2) = 0.0399

PMI(F, c3) = 1.322, PMI(F, c3T) = -0.263, PMI(FT, c3) = -0.678, PMI(FT, c3T) = 0.059, Thus

MI(F, c3) = 0.0369

Thus the discrimination factor D(F) for the feature F is: D(F) = 0.1*0.0046 + 0.8*0.0399 + 0.1*0.0369 ≈ 0.036. We see that the feature F is more informative for classes c2 and c3 than for class c1.
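The worked example can be reproduced end to end. The sketch below (function and variable names are my own; the counts come from the example above) computes MI(F, ci) for each class from raw document counts and then the discrimination factor D(F):

```python
import math

# Corpus from the worked example: 1000 documents, the size of each class,
# and the number of documents per class that contain the feature F.
N = 1000
class_size = {'c1': 100, 'c2': 800, 'c3': 100}
with_f     = {'c1': 30,  'c2': 120, 'c3': 50}    # 200 documents in total

def mi_feature_class(n_total, n_class, n_feat, n_both):
    """Expected MI between 'contains F' and 'belongs to C', in bits."""
    # (joint count, feature-marginal count, class-marginal count)
    # for the four events (F,C), (F,CT), (FT,C), (FT,CT).
    cells = [
        (n_both,                              n_feat,           n_class),
        (n_feat - n_both,                     n_feat,           n_total - n_class),
        (n_class - n_both,                    n_total - n_feat, n_class),
        (n_total - n_feat - n_class + n_both, n_total - n_feat, n_total - n_class),
    ]
    mi = 0.0
    for joint, marg_x, marg_y in cells:
        if joint > 0:
            p = joint / n_total
            # PMI = log2( P(X,Y) / (P(X)P(Y)) ) = log2( joint*N / (mx*my) )
            mi += p * math.log2(joint * n_total / (marg_x * marg_y))
    return mi

n_feat = sum(with_f.values())   # 200 documents contain F
mi = {c: mi_feature_class(N, class_size[c], n_feat, with_f[c]) for c in class_size}
d_f = sum((class_size[c] / N) * mi[c] for c in class_size)

for c in sorted(mi):
    print(c, round(mi[c], 4))   # c1 0.0046, c2 0.0399, c3 0.0369
print('D(F) =', round(d_f, 3))  # D(F) = 0.036
```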

Mutual information is minimum (zero, i.e. maximum uncertainty) when the feature F is equally probable in all the 3 classes: the occurrence of the feature is then independent of the class, so knowing the feature tells us nothing about the class, whatever the distribution of classes across the set of documents.

Mutual information is maximum (minimum uncertainty) when the feature is present in only one of the classes, and such features become discriminative. In this example the feature F is slightly discriminative, as its value is greater than 0, but we cannot say that it is highly discriminative of any particular class.