Feature Selection using Mutual Information

Given two random variables X and Y, mutual information measures how much knowing one of these variables reduces uncertainty about the other. For example, if X and Y are independent, then knowing X does not give any information about Y and vice versa, so their mutual information is zero. At the other extreme, if X is a deterministic function of Y and Y is a deterministic function of X, then all information conveyed by X is shared with Y, i.e. knowing X determines the value of Y and vice versa. If the variable X takes on the values {x1, x2, …} and Y takes on the values {y1, y2, …}, then the mutual information of X and Y is defined as the expected value of the point-wise mutual information, i.e. log (P(X=xi, Y=yj)/P(X=xi)P(Y=yj)):

MI(X, Y) = ∑ P(X=xi, Y=yj) * log (P(X=xi, Y=yj)/P(X=xi)P(Y=yj)),        summed over all pairs (X=xi, Y=yj)

All logarithms in the calculations that follow are taken to base 2.
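As a small illustration of this definition, here is a minimal Python sketch (not part of the original post; the function name is my own) that computes MI(X, Y) from a joint probability table with base-2 logs, and checks the two extreme cases described above: independent variables give a mutual information of zero, while variables that fully determine each other share all of their information.

```python
import math

def mutual_information(joint):
    """MI(X, Y) from a joint distribution: joint[(x, y)] = P(X=x, Y=y), values summing to 1."""
    # Marginal distributions P(X=x) and P(Y=y)
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p

    mi = 0.0
    for (x, y), p in joint.items():
        if p > 0:
            # point-wise mutual information weighted by the joint probability
            mi += p * math.log2(p / (px[x] * py[y]))
    return mi

# Independent variables -> MI = 0
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(mutual_information(independent))   # 0.0

# X determines Y and vice versa -> MI = 1 bit (all information is shared)
deterministic = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(deterministic))  # 1.0
```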

If X and Y are mutually independent, then for every pair (X=xi, Y=yj) we have P(X=xi, Y=yj) = P(X=xi)P(Y=yj), so every term contributes log 1 = 0 and the mutual information is zero. Mutual information is also related to entropy (the uncertainty of a random variable):

MI (X, Y) = H(Y) – H(Y|X) = H(X) – H(X|Y)

where H(Y) is the entropy (uncertainty) of the variable Y and H(Y|X) is the remaining uncertainty in Y after knowing X. Hence H(Y) – H(Y|X) is the uncertainty in Y that is removed by knowing that X also occurred.

Let X be the event that the feature “Apple” occurs in a document, and Y the event that the feature “Steve Jobs” occurs in a document. If the document is about the Apple computer company founded by Steve Jobs, then P(X, Y), the joint probability of X and Y, is greater than the product of the probabilities P(X)P(Y), because the presence of the feature “Apple” increases the chances of the feature “Steve Jobs” being present and vice versa. The mutual information would therefore be high in this case, compared to a document on “Health benefits of fruits every day”. In terms of entropy, knowing that the feature “Apple” is present makes the occurrence of the feature “Steve Jobs” much more certain than if “Apple” had not occurred, and vice versa.

In the context of text classification, let X be the occurrence of the feature F in a document and Y the event that the document belongs to class C. The point-wise mutual information for a document containing feature F and also belonging to class C is:

PMI(F, C) = log (P(X=F, Y=C)/P(X=F)P(Y=C)),

Similarly, the PMI for a document containing feature F but not belonging to class C is PMI(F, CT), for a document not containing feature F but belonging to class C it is PMI(FT, C), and for a document neither containing feature F nor belonging to class C it is PMI(FT, CT), where X=FT denotes the event that the document does not contain feature F and Y=CT denotes the event that the document does not belong to class C:

PMI(F, CT) = log (P(X=F, Y=CT)/P(X=F)P(Y=CT)),

PMI(FT, C) = log (P(X=FT, Y=C)/P(X=FT)P(Y=C)),

PMI(FT, CT) = log (P(X=FT, Y=CT)/P(X=FT)P(Y=CT)),

Thus the expected mutual information for feature F and class C is given as:

MI(F, C) = P(X=F, Y=C) * PMI(F, C) + P(X=F, Y=CT) * PMI(F, CT) + P(X=FT, Y=C) * PMI(FT, C) + P(X=FT, Y=CT) * PMI(FT, CT)
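The same computation can be done directly from document counts. Below is a minimal Python sketch (the function and argument names are mine, not from any particular library) that estimates the four joint probabilities from counts, sums the probability-weighted PMI terms, and also sanity-checks the entropy identity MI(F, C) = H(C) – H(C|F) mentioned earlier. The call at the end uses the class c1 counts from the worked example that follows.

```python
import math

def mi_feature_class(n_docs, n_class, n_feature, n_both):
    """Expected mutual information MI(F, C) estimated from document counts.

    n_docs    -- total number of documents
    n_class   -- documents belonging to class C
    n_feature -- documents containing feature F
    n_both    -- documents containing F and belonging to C
    """
    p_f, p_c = n_feature / n_docs, n_class / n_docs
    # Joint probabilities of the four (F / FT, C / CT) combinations
    joint = {("F", "C"):   n_both / n_docs,
             ("F", "CT"):  (n_feature - n_both) / n_docs,
             ("FT", "C"):  (n_class - n_both) / n_docs,
             ("FT", "CT"): (n_docs - n_feature - n_class + n_both) / n_docs}
    marg_f = {"F": p_f, "FT": 1 - p_f}
    marg_c = {"C": p_c, "CT": 1 - p_c}

    # Expected MI = sum of P(joint) * PMI over the four cells
    mi = sum(p * math.log2(p / (marg_f[f] * marg_c[c]))
             for (f, c), p in joint.items() if p > 0)

    # Sanity check of the entropy identity: MI(F, C) = H(C) - H(C|F)
    h_c = -sum(p * math.log2(p) for p in marg_c.values() if p > 0)
    h_c_given_f = -sum(p * math.log2(p / marg_f[f])
                       for (f, c), p in joint.items() if p > 0)
    assert math.isclose(mi, h_c - h_c_given_f)
    return mi

# Class c1 from the worked example below: 1000 docs, 100 in c1,
# 200 containing F, 30 of which belong to c1
print(round(mi_feature_class(1000, 100, 200, 30), 5))  # ~0.0046 (0.00459 below, via rounded PMIs)
```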

For multiple classes Ci, to check whether a feature F is discriminative or not, compute the expected MI for feature F using the MI values of F with each of the classes, i.e.

D(F) = ∑ P(Ci) * MI(F, Ci), where D(F) is the discrimination factor for feature F
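Since D(F) is just the class-prior-weighted sum of the per-class MI values, a tiny sketch suffices (the function name is mine; the numbers plugged in are the MI values derived in the worked example below):

```python
def discrimination_factor(class_priors, mi_per_class):
    """D(F) = sum over classes Ci of P(Ci) * MI(F, Ci)."""
    return sum(p * mi for p, mi in zip(class_priors, mi_per_class))

# Class priors and per-class MI values from the worked example below
print(round(discrimination_factor([0.1, 0.8, 0.1],
                                  [0.00459, 0.04076, 0.0385]), 3))  # 0.037
```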

From our earlier example with 1000 documents, 100 documents belong to class c1, 800 to c2 and 100 to c3. Out of these 1000 documents, 200 contain the feature F: 30 of them belong to class c1, 120 to class c2 and the remaining 50 to class c3. Hence, for class c1,

PMI(F, c1) = 0.58, PMI(F, c1T) = -0.083, PMI(FT, c1) = -0.19, PMI(FT, c1T) = 0.02. Thus,

MI(F, c1) = 0.00459

Similarly, for class c2: PMI(F, c2) = -0.41, PMI(F, c2T) = 1.0, PMI(FT, c2) = 0.087, PMI(FT, c2T) = -0.41. Thus,

MI(F, c2) = 0.04076

And for class c3: PMI(F, c3) = 1.32, PMI(F, c3T) = -0.26, PMI(FT, c3) = -0.67, PMI(FT, c3T) = 0.06. Thus,

MI(F, c3) = 0.0385

Thus the discrimination factor D(F) for the feature F is D(F) = 0.037. We see that the feature F is more informative for classes c2 and c3 than for class c1.
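The whole worked example can be reproduced from the raw counts with the short sketch below (again a sketch of my own, not code from the post). It keeps full precision in the logarithms, whereas the calculation above plugs in the rounded PMI values, so MI(F, c2) and MI(F, c3) come out as roughly 0.0399 and 0.0369, and D(F) as roughly 0.036, i.e. the same conclusions up to rounding.

```python
import math

def mi_feature_class(n_docs, n_class, n_feature, n_both):
    """MI(F, C) from document counts (base-2 logs), as defined above."""
    #        count in cell,                          docs with X,         docs with Y
    cells = [(n_both,                                n_feature,           n_class),           # (F,  C)
             (n_feature - n_both,                    n_feature,           n_docs - n_class),  # (F,  CT)
             (n_class - n_both,                      n_docs - n_feature,  n_class),           # (FT, C)
             (n_docs - n_feature - n_class + n_both, n_docs - n_feature,  n_docs - n_class)]  # (FT, CT)
    return sum((n / n_docs) * math.log2(n * n_docs / (nx * ny))
               for n, nx, ny in cells if n > 0)

n_docs = 1000
class_size   = {"c1": 100, "c2": 800, "c3": 100}  # documents per class
with_feature = {"c1": 30,  "c2": 120, "c3": 50}   # documents in each class that contain F
n_feature = sum(with_feature.values())            # 200 documents contain F in total

d_f = 0.0
for c in class_size:
    mi = mi_feature_class(n_docs, class_size[c], n_feature, with_feature[c])
    d_f += (class_size[c] / n_docs) * mi          # weight by the class prior P(Ci)
    print(c, round(mi, 4))                        # c1 ~0.0046, c2 ~0.0399, c3 ~0.0369
print("D(F) =", round(d_f, 3))                    # ~0.036
```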

Mutual information is at its minimum (maximum uncertainty) when the feature F occurs with the same relative frequency in all the 3 classes, because the presence of the feature is then independent of the class. In that case every joint probability equals the product of the marginals, so the mutual information is 0 regardless of how the documents are distributed across the classes.

Mutual information is at its maximum (minimum uncertainty) when the feature occurs only in documents of one of the classes, and such features are highly discriminative. In this example the feature F is slightly discriminative, as its value is greater than 0, but we cannot say that it is highly discriminative of any particular class.
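A quick numeric illustration of these two extremes, using the same counting formula as above and some hypothetical counts:

```python
import math

def mi_from_counts(n_docs, n_class, n_feature, n_both):
    """MI(F, C) from document counts, same formula as in the worked example above."""
    cells = [(n_both, n_feature, n_class),
             (n_feature - n_both, n_feature, n_docs - n_class),
             (n_class - n_both, n_docs - n_feature, n_class),
             (n_docs - n_feature - n_class + n_both, n_docs - n_feature, n_docs - n_class)]
    return sum((n / n_docs) * math.log2(n * n_docs / (nx * ny))
               for n, nx, ny in cells if n > 0)

# F occurs in 20% of the documents of every class -> independent of the class, MI = 0
print(mi_from_counts(1000, 100, 200, 20), mi_from_counts(1000, 800, 200, 160))  # 0.0 0.0

# F occurs only in documents of class c1 -> far more informative about c1
print(round(mi_from_counts(1000, 100, 50, 50), 3))  # ~0.186
```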
