Feature Selection using Chi Square Test

Given two random variables X and Y, we propose a null hypothesis H0 stating that the two variables are independent, i.e. any apparent association in the observed data is due to chance alone. We also propose an alternative hypothesis Ha stating that the two variables are dependent and that the observed distribution is not due to chance alone. Using the chi-square test, we either reject H0 in favour of Ha or fail to reject it.

Let us test the hypothesis that gender is independent of the choice of engineering as a stream of study. This is the null hypothesis. The random variable X takes on the values {Male, Female} and the random variable Y takes on the values {Engineering, Non-Engineering}. The following data is available:

Gender/Stream   Engineering   Non-Engineering   Total
Male            60            40                100
Female          30            70                100
Total           90            110               200

Suppose gender and the choice of stream were independent, as if each had been assigned at random. Then the expected count in each cell is (row total * column total) / grand total, so the expected number of males studying engineering would be 100*90/200 = 45. Similarly, the expected number of males in non-engineering would be 100*110/200 = 55, of females in engineering 100*90/200 = 45, and of females in non-engineering 100*110/200 = 55. The table of expected values is then:

Gender/Stream   Engineering   Non-Engineering   Total
Male            45            55                100
Female          45            55                100
Total           90            110               200
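The expected table can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `expected_counts` and the list-of-rows layout (without the Total row and column) are assumptions made for illustration.

```python
def expected_counts(observed):
    """Expected cell counts under independence.

    `observed` is a list of rows of counts, without the Total row/column.
    Each expected cell is (row total * column total) / grand total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


observed = [[60, 40],   # Male:   Engineering, Non-Engineering
            [30, 70]]   # Female: Engineering, Non-Engineering
print(expected_counts(observed))  # [[45.0, 55.0], [45.0, 55.0]]
```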

Given the observed and the expected data, the chi-square value is

χ² = ∑ (Oi - Ei)² / Ei

where Oi denotes the observed values and Ei denotes the expected values. Hence

χ² = (60-45)²/45 + (40-55)²/55 + (30-45)²/45 + (70-55)²/55 ≈ 18.18
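The same sum can be checked in code. The sketch below assumes the observed and expected tables are passed as lists of rows; `chi_square` is a hypothetical helper name, not a library function.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all cells of the two tables."""
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))


observed = [[60, 40], [30, 70]]
expected = [[45, 55], [45, 55]]
print(round(chi_square(observed, expected), 2))  # 18.18
```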

The chi-square value alone does not tell us much. We need the p-value, i.e. the probability of observing a chi-square value at least this large if X and Y were in fact independent, which is obtained from the χ² value and the degrees of freedom. The degrees of freedom for the given X and Y is the product (number of categories in X - 1) * (number of categories in Y - 1) = (2-1)*(2-1) = 1. Looking up the χ² value computed above with 1 degree of freedom (for example in a chi-square table or an online calculator), we get a p-value that is almost 0 (roughly 0.00002), so the null hypothesis is rejected: the choice of stream of study depends on gender. By convention, the cutoff for a p-value is 0.05. Below it, the observed data is considered too unlikely under the null hypothesis and the null hypothesis is rejected; above it, we fail to reject the null hypothesis.
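For 1 degree of freedom the p-value can be computed without any external tool, because a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so its upper-tail probability equals erfc(sqrt(χ²/2)). The sketch below relies on that identity and is valid only for 1 degree of freedom; the function name is an assumption made for illustration.

```python
import math


def chi_square_p_value_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    Uses P(Z^2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
    Not valid for other degrees of freedom.
    """
    return math.erfc(math.sqrt(statistic / 2.0))


p_value = chi_square_p_value_df1(18.18)
print(p_value < 0.05)  # True, so the null hypothesis is rejected
```

For tables with more than 1 degree of freedom, a statistics library or a chi-square table would be needed instead.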

Coming back to our text classification problem, we propose the null hypothesis that the presence or absence of a feature F in a document is independent of the class the document belongs to. Let X indicate whether the term F is present in a document and Y indicate whether the document belongs to class c1. The observed counts are:

Feature/Class   Belongs to class c1   Does not belong to class c1   Total
F present       30                    170                           200
F absent        70                    730                           800
Total           100                   900                           1000

Assuming the presence or absence of feature F is independent of the class c1, the expected values are:

Feature/Class   Belongs to class c1   Does not belong to class c1   Total
F present       20                    180                           200
F absent        80                    720                           800
Total           100                   900                           1000
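For a 2x2 table like this one, the chi-square statistic can also be computed directly from the four observed counts, without building the expected table first. The sketch below uses the standard closed form N(ad - bc)² / ((a+b)(c+d)(a+c)(b+d)), which is algebraically equivalent to summing (O - E)²/E over the four cells. It is a shortcut, not the step-by-step method used in the text, and the function name is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Closed form: N * (a*d - b*c)^2 / (row totals * column totals).
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# F present: 30 in c1, 170 not in c1; F absent: 70 in c1, 730 not in c1
print(round(chi_square_2x2(30, 170, 70, 730), 3))  # 6.944
```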

The chi-square value is χ² = 6.944. Similarly, for classes c2 and c3 the chi-square values are 62.5 and 62.5 respectively. Hence the expected chi-square value for feature F, averaged over the classes, is

χ²(F) = P(c1) * 6.944 + P(c2) * 62.5 + P(c3) * 62.5,

where P(ci) is the probability of a document belonging to class ci. With P(c1) = 1/10, P(c2) = 4/5 and P(c3) = 1/10, this gives

χ²(F) = (1/10)*6.944 + (4/5)*62.5 + (1/10)*62.5 = 56.94,

and using degrees of freedom = 1, we find that the p-value is almost 0. Since the p-value is much less than 0.05, we can safely reject the null hypothesis and conclude that feature F is probably dependent on the class a document belongs to. Thus feature F is a good discriminative feature.
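The class-weighted score for feature F can be sketched as follows. The per-class chi-square values and class probabilities are the ones quoted in the text; the function name `weighted_chi_square` is an assumption made for illustration.

```python
def weighted_chi_square(class_priors, class_chi_squares):
    """Average the per-class chi-square values, weighted by P(ci)."""
    return sum(p * x for p, x in zip(class_priors, class_chi_squares))


priors = [1 / 10, 4 / 5, 1 / 10]   # P(c1), P(c2), P(c3)
scores = [6.944, 62.5, 62.5]       # chi-square of F against c1, c2, c3
print(round(weighted_chi_square(priors, scores), 2))  # 56.94
```

Computing this score for every candidate feature and keeping the highest-scoring ones is one way to use it for feature selection.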