Feature Selection using Chi Square Test

Given two random variables X and Y, we propose a null hypothesis H0 stating that the two variables are independent, i.e. any pattern in the observed data is due to chance alone, and an alternative hypothesis Ha stating that the two variables are dependent, i.e. the observed pattern is not due to chance alone. The chi square test lets us decide whether to reject the null hypothesis in favour of the alternative.

Let us test the null hypothesis that gender is independent of the chosen stream of study. The random variable X takes on the values {Male, Female} and the random variable Y takes on the values {Engineering, Non-Engineering}. The following data is available:

Gender/Stream    Engineering    Non-Engineering    Total
Male             60             40                 100
Female           30             70                 100
Total            90             110                200

Suppose gender and stream of study were assigned independently at random. Then, under the null hypothesis of independence, the expected count in each cell is (row total * column total) / grand total. The expected number of males studying Engineering would be 100*90/200 = 45, males in Non-Engineering would be 100*110/200 = 55, females in Engineering 100*90/200 = 45, and females in Non-Engineering 100*110/200 = 55. The table of expected values is therefore:

Gender/Stream    Engineering    Non-Engineering    Total
Male             45             55                 100
Female           45             55                 100
Total            90             110                200
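To make the arithmetic explicit, here is a minimal Python sketch (the variable names are mine, not from the text) that derives the same expected counts from the row and column totals of the observed table:

```python
# Observed counts: rows = gender (Male, Female), columns = stream (Engineering, Non-Engineering)
observed = [[60, 40],
            [30, 70]]

row_totals = [sum(row) for row in observed]        # [100, 100]
col_totals = [sum(col) for col in zip(*observed)]  # [90, 110]
grand_total = sum(row_totals)                      # 200

# Under independence: expected[i][j] = row_total[i] * col_total[j] / grand_total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
print(expected)                                    # [[45.0, 55.0], [45.0, 55.0]]
```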

Given the observed and the expected data, the chi square value is

χ² = ∑ (Oi – Ei)² / Ei

where Oi denotes the observed values and Ei denotes the expected values. Hence

χ² = (60-45)²/45 + (40-55)²/55 + (30-45)²/45 + (70-55)²/55 ≈ 18.2

The chi square value alone does not tell us much. We need to convert it into a p-value, the probability of observing a chi square statistic at least this large if X and Y really were independent, using the χ² value and the degrees of freedom. The degrees of freedom are (number of categories in X – 1) * (number of categories in Y – 1) = (2-1)*(2-1) = 1. Looking up χ² = 18.2 with 1 degree of freedom in a chi square table (or any online calculator), the p-value is almost 0, so the choice of stream of study is dependent on gender and the null hypothesis is rejected. By convention, the cutoff for the p-value is 0.05: anything below it counts as strong evidence against the null hypothesis, while anything above it does not.
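If scipy is available, the statistic, degrees of freedom and p-value can all be reproduced in a few lines; this is only a sketch of one way to check the numbers above:

```python
from scipy.stats import chi2, chi2_contingency

observed = [[60, 40],
            [30, 70]]

# correction=False gives the plain Pearson statistic computed above
# (the default Yates continuity correction would change the value slightly).
stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(stat, 2), dof)    # 18.18 1
print(p_value)                # ~2e-05, far below the 0.05 cutoff

# Equivalently, straight from the statistic and the degrees of freedom:
print(chi2.sf(18.18, df=1))   # P(chi square with 1 dof >= 18.18)
```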

Coming back to our text classification problem, we propose the null hypothesis that the presence/absence of a feature F in a document is independent of the class the document belongs to. Let X be the event that the term F is present in a document and Y be the event that the document belongs to class c1. Suppose we have the following counts over 1000 documents:

Feature/Class    Belongs to class c1    Does not belong to class c1    Total
F present        30                     170                            200
F absent         70                     730                            800
Total            100                    900                            1000

Assuming the presence/absence of feature F is independent of class c1, the expected values are:

Feature/Class    Belongs to class c1    Does not belong to class c1    Total
F present        20                     180                            200
F absent         80                     720                            800
Total            100                    900                            1000

The chi square value is χ² = 6.944. Similarly, for classes c2 and c3 the chi square values are 62.5 and 62.5 respectively (computed from their own 2x2 tables, using the class counts given in the mutual information example below: 800 documents in c2 of which 120 contain F, and 100 documents in c3 of which 50 contain F). Hence the expected chi square value for feature F is

χ²(F) = P(c1) * 6.944 + P(c2) * 62.5 + P(c3) * 62.5,

where P(ci) is the probability of a document belonging to class ci. Thus

(1/10)*6.944 + (4/5)*62.5 + (1/10)*62.5 = 56.94,

and with 1 degree of freedom the p-value is again almost 0. Since the p-value is much less than 0.05, we can safely reject the null hypothesis and say that the presence of feature F is dependent on the class a document belongs to. Thus feature F is a good discriminative feature.
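The per-class chi square values and their weighted combination can be checked with a short Python sketch (the helper below is my own illustration; the counts for c2 and c3 are the ones used in the mutual information example later in the text):

```python
def chi_square_2x2(n_total, n_feature, n_class, n_both):
    """Pearson chi square for the 2x2 table of feature presence vs. class membership."""
    observed = [
        [n_both, n_feature - n_both],                                # F present: in class, not in class
        [n_class - n_both, n_total - n_feature - n_class + n_both],  # F absent:  in class, not in class
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n_total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

class_sizes = {"c1": 100, "c2": 800, "c3": 100}   # documents per class (1000 in total)
f_in_class  = {"c1": 30,  "c2": 120, "c3": 50}    # documents per class that contain F (200 in total)

chi2_per_class = {c: chi_square_2x2(1000, 200, class_sizes[c], f_in_class[c])
                  for c in class_sizes}
print(chi2_per_class)             # {'c1': 6.94..., 'c2': 62.5, 'c3': 62.5}

expected_chi2 = sum(class_sizes[c] / 1000 * chi2_per_class[c] for c in class_sizes)
print(round(expected_chi2, 2))    # 56.94
```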

Feature Selection using Mutual Information

Given two random variables X and Y, mutual information measures how much knowing one of the variables reduces uncertainty about the other. For example, if X and Y are independent, then knowing X gives no information about Y and vice versa, so their mutual information is zero. At the other extreme, if X is a deterministic function of Y and Y is a deterministic function of X, then all information conveyed by X is shared with Y, i.e. knowing X determines the value of Y and vice versa. If X takes on the values {x1, x2, …} and Y takes on the values {y1, y2, …}, then the mutual information of X and Y is defined as the expected value of the point-wise mutual information log (P(X=xi, Y=yj)/(P(X=xi)P(Y=yj))):

MI(X, Y) = ∑ P(X=xi, Y=yj) * log (P(X=xi, Y=yj) / (P(X=xi)P(Y=yj)))        ∀ (X=xi, Y=yj)

All log values in further calculations are taken base 2.

If X and Y are mutually independent, then for every pair (X=xi, Y=yj) we have P(X=xi, Y=yj) = P(X=xi)P(Y=yj), so each point-wise term is log 1 = 0 and the mutual information is zero. Mutual information is also related to entropy (the uncertainty of a random variable):

MI (X, Y) = H(Y) – H(Y|X) = H(X) – H(X|Y)

where H(Y) is the entropy (uncertainty) of the variable Y and H(Y|X) is the remaining uncertainty in Y after X is known. Hence H(Y) – H(Y|X) is the uncertainty in Y removed by knowing X. Let X be the event that the feature "Apple" occurs in a document and Y be the event that the feature "Steve Jobs" occurs in a document. If the document is about the Apple computer company founded by Steve Jobs, then P(X, Y), the joint probability of X and Y, is greater than the product of the probabilities P(X)P(Y), because the presence of the feature "Apple" increases the chances of the feature "Steve Jobs" being present and vice versa. Thus the mutual information would be high in this case, compared to a document about, say, the health benefits of eating fruit every day. In terms of entropy, knowing that the feature "Apple" is present makes the occurrence of the feature "Steve Jobs" much more certain than if "Apple" had not occurred, and vice versa.

In the context of text classification, let X be the event that feature F occurs in a document and Y be the event that the document belongs to class C. The point-wise mutual information for a document that contains feature F and also belongs to class C is:

PMI(F, C) = log (P(X=F, Y=C) / (P(X=F)P(Y=C))).

Similarly, we define the PMI for a document containing feature F but not belonging to class C, PMI(F, CT); for a document not containing feature F but belonging to class C, PMI(FT, C); and for a document neither containing feature F nor belonging to class C, PMI(FT, CT). Here X=FT denotes the event that the document does not contain feature F, and Y=CT denotes the event that the document does not belong to class C:

PMI(F, CT) = log (P(X=F, Y=CT) / (P(X=F)P(Y=CT))),

PMI(FT, C) = log (P(X=FT, Y=C) / (P(X=FT)P(Y=C))),

PMI(FT, CT) = log (P(X=FT, Y=CT) / (P(X=FT)P(Y=CT))).

Thus the expected mutual information for feature F and class C is given as :

MI(F, C) = P(X=F, Y=C) * PMI(F, C) + P(X=F, Y=CT) * PMI(F, CT) + P(X=FT, Y=C) * PMI(FT, C) + P(X=FT, Y=CT) * PMI(FT, CT)
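A minimal Python helper (the function and argument names are mine, not from the text) that computes MI(F, C) directly from the four cell counts, with logs taken base 2 as above:

```python
from math import log2

def mutual_information(n_total, n_feature, n_class, n_both):
    """MI(F, C): sum over the four cells of P(cell) * PMI(cell), logs base 2."""
    # Each entry: (joint count, marginal count of the F side, marginal count of the C side)
    cells = [
        (n_both,                                 n_feature,           n_class),            # F present, in C
        (n_feature - n_both,                     n_feature,           n_total - n_class),  # F present, not in C
        (n_class - n_both,                       n_total - n_feature, n_class),            # F absent,  in C
        (n_total - n_feature - n_class + n_both, n_total - n_feature, n_total - n_class),  # F absent,  not in C
    ]
    mi = 0.0
    for joint, f_marginal, c_marginal in cells:
        if joint == 0:
            continue  # a zero-probability cell contributes nothing to the sum
        p_joint = joint / n_total
        pmi = log2(p_joint / ((f_marginal / n_total) * (c_marginal / n_total)))
        mi += p_joint * pmi
    return mi
```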

For multiple classes Ci, to check whether a feature F is discriminative or not, we compute the expected MI of feature F over all the classes, i.e.

D(F) = ∑ P(Ci) * MI(F, Ci), where D(F) is the discrimination factor for feature F

From our earlier example with 1000 documents, 100 belong to c1, 800 to c2 and 100 to c3. Of the 200 documents containing feature F, 30 belong to class c1, 120 to class c2 and the remaining 50 to class c3. Hence, for class c1,

PMI(F, c1) = 0.58, PMI(F, c1T) = -0.083, PMI(FT, c1) = -0.19, PMI(FT, c1T) = 0.02, Thus

MI(F, c1) = 0.00459

PMI(F, c2) = -0.41, PMI(F, c2T) = 1.0, PMI(FT, c2) = 0.087, PMI(FT, c2T) = -0.41, Thus

MI(F, c2) = 0.04076

PMI(F, c3) = 1.32, PMI(F, c3T) = -0.26, PMI(FT, c3) = -0.67, PMI(FT, c3T) = 0.06, Thus

MI(F, c3) = 0.0385

Thus the discrimination factor for feature F is D(F) = 0.037. We see that the feature F is more informative for classes c2 and c3 than for class c1.
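Feeding the counts of this example into the mutual_information helper sketched above reproduces these figures; the exact results differ from the hand calculation in the third decimal place because the PMIs above were rounded before being combined:

```python
class_sizes = {"c1": 100, "c2": 800, "c3": 100}
f_in_class  = {"c1": 30,  "c2": 120, "c3": 50}

mi_per_class = {c: mutual_information(1000, 200, class_sizes[c], f_in_class[c])
                for c in class_sizes}
print({c: round(v, 4) for c, v in mi_per_class.items()})
# {'c1': 0.0046, 'c2': 0.0399, 'c3': 0.0369}

d_f = sum(class_sizes[c] / 1000 * mi_per_class[c] for c in class_sizes)
print(round(d_f, 3))    # ~0.036
```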

Mutual information is minimum (maximum remaining uncertainty about the class) when the feature F is equally probable in all the 3 classes: the presence of the feature is then independent of the class, and the mutual information is 0.

Mutual information is maximum (minimum uncertainty) when the feature occurs only in documents of one particular class, and such features are highly discriminative. In this example the feature F is mildly discriminative, since its value is greater than 0, but we cannot say that it is highly discriminative of any particular class.
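Both extremes can be illustrated with the same mutual_information helper: a feature that occurs at the same rate inside and outside a class is independent of it and carries zero information, while a feature that occurs in exactly the documents of one class carries the maximum possible information about that class, namely the entropy of the class variable (the counts below are synthetic, chosen only to illustrate the two cases):

```python
# Feature rate is 20% both inside and outside the class -> independence, MI = 0
print(mutual_information(1000, 200, 100, 20))    # 0.0

# Feature occurs in exactly the 200 documents of a 200-document class -> MI = H(C)
print(mutual_information(1000, 200, 200, 200))   # ~0.7219 = entropy of a 0.2/0.8 split
```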