Given two random variables X and Y, we propose a null hypothesis H_{0}, which states that the two variables are independent, i.e. any apparent association in the observed data is due to chance alone. We then propose an alternative hypothesis H_{a}, which states that the two variables are dependent and the observed distribution is not due to chance alone. Using the chi-square test, we either reject the null hypothesis in favour of the alternative or fail to reject it.

Let us test the hypothesis that gender is independent of engineering as a stream of study. This is the null hypothesis. The random variable X takes on the values {Male, Female} and the random variable Y takes on the values {Engineering, Non-Engineering}. The following data are available:

Gender/Stream | Engineering | Non-Engineering | Total |

Male | 60 | 40 | 100 |

Female | 30 | 70 | 100 |

Total | 90 | 110 | 200 |

Suppose gender and the choice of whether to study Engineering were each determined at random, independently of one another. Under this assumption of independence, the expected count in each cell is the row total times the column total, divided by the grand total. The expected number of males studying Engineering would be 100*90/200 = 45. Similarly, the expected number of males in Non-Engineering would be 100*110/200 = 55, of females in Engineering 100*90/200 = 45, and of females in Non-Engineering 100*110/200 = 55. The table of expected values is then:

Gender/Stream | Engineering | Non-Engineering | Total |

Male | 45 | 55 | 100 |

Female | 45 | 55 | 100 |

Total | 90 | 110 | 200 |
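The expected counts above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `expected_counts` and the nested-list layout of the table are illustrative choices, not part of the original write-up.

```python
# Observed counts from the gender/stream table above.
observed = [
    [60, 40],  # Male:   Engineering, Non-Engineering
    [30, 70],  # Female: Engineering, Non-Engineering
]

def expected_counts(table):
    """Expected cell counts under independence:
    row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

expected = expected_counts(observed)
print(expected)  # [[45.0, 55.0], [45.0, 55.0]]
```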

Given the observed and the expected data, the chi-square value is

χ^{2} = ∑ (O_{i} – E_{i})^{2}/E_{i}

where O_{i} denotes the observed values and E_{i} denotes the expected values. Hence

χ^{2} = (60-45)^{2}/45 + (40-55)^{2}/55 + (30-45)^{2}/45 + (70-55)^{2}/55 = 18.2
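As a quick check of this arithmetic, the statistic can be computed directly from the observed and expected counts. The helper name `chi_square` is hypothetical and the cells are simply listed in reading order. This is a sketch, not a replacement for a statistics library.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Cells in reading order: male/eng, male/non-eng, female/eng, female/non-eng.
observed = [60, 40, 30, 70]
expected = [45, 55, 45, 55]

stat = chi_square(observed, expected)
print(round(stat, 2))  # 18.18
```

The result, 18.18, matches the 18.2 quoted above to one decimal place.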

The chi-square value alone does not give much information. We need the p-value: the probability, assuming X and Y are independent, of obtaining a χ^{2} value at least as large as the one observed. It is computed from the χ^{2} value and the degrees of freedom. The degrees of freedom for the given X and Y is the product (number of categories in X – 1) * (number of categories in Y – 1) = (2-1)*(2-1) = 1. Computing the p-value with an online calculator, we find that it is almost 0, so the null hypothesis is rejected: the choice of stream of study depends on gender. By convention, the cutoff for a p-value is 0.05. A p-value below the cutoff is taken as evidence against the null hypothesis, while a p-value above it means the data are consistent with independence and the null hypothesis is not rejected.
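For the special case of 1 degree of freedom, the p-value can also be computed with the Python standard library alone, because a chi-square variable with 1 degree of freedom is the square of a standard normal variable. The sketch below relies on that identity. The function name `p_value_df1` is an illustrative choice, and the shortcut does not apply to other degrees of freedom.

```python
import math

def p_value_df1(chi2):
    """Upper-tail probability P(X >= chi2) for a chi-square
    distribution with 1 degree of freedom.

    Uses P(Z^2 >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2))
    for a standard normal Z. Valid for 1 degree of freedom only.
    """
    return math.erfc(math.sqrt(chi2 / 2))

p = p_value_df1(18.18)
print(p < 0.05)  # True
```

With this function, a χ^{2} value of about 3.84 corresponds to the conventional 0.05 cutoff, so any value well above that leads to rejecting the null hypothesis at that level.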

Coming back to our text classification problem, we propose the null hypothesis that the presence or absence of a feature F in a document is independent of the class the document belongs to. Let X be the event that the term F is present in a document and Y be the event that the document belongs to class c_{1}:

Feature/Class | Belongs to class c_{1} | Does not belong to class c_{1} | Total |

F present | 30 | 170 | 200 |

F absent | 70 | 730 | 800 |

Total | 100 | 900 | 1000 |

Assuming the presence or absence of feature F is independent of the class c_{1}, the expected values are:

Feature/Class | Belongs to class c_{1} | Does not belong to class c_{1} | Total |

F present | 20 | 180 | 200 |

F absent | 80 | 720 | 800 |

Total | 100 | 900 | 1000 |
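The chi-square statistic for feature F and class c_{1} can be derived from the observed table in a single step. The sketch below assumes the 2x2 layout used above (rows: F present / F absent, columns: in c_{1} / not in c_{1}). The function name `chi_square_from_table` is hypothetical.

```python
def chi_square_from_table(table):
    """Chi-square statistic of a contingency table, with expected
    counts computed as row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: F present, F absent. Columns: in c1, not in c1.
feature_vs_c1 = [
    [30, 170],
    [70, 730],
]

stat_c1 = chi_square_from_table(feature_vs_c1)
print(round(stat_c1, 3))  # 6.944
```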

The chi-square value for this table is χ^{2} = 6.944. Similarly, for classes c_{2} and c_{3}, the chi-square values are 62.5 and 62.5 respectively. Hence the expected chi-square value for feature F, with each class weighted by its probability, is

χ^{2} (F) = P(c_{1}) * 6.944 + P(c_{2}) * 62.5 + P(c_{3}) * 62.5,

where P(c_{i}) is the probability of a document belonging to class c_{i}, thus

(1/10)*6.944 + (4/5)*62.5 + (1/10)*62.5 = 56.94,

and using the degrees of freedom = 1, we find that the p-value is almost 0. Since the p-value is much less than 0.05, we can safely reject the null hypothesis and say that feature F is probably dependent on the class a document belongs to. Thus feature F is a good discriminative feature.
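The class-weighted score for feature F can be reproduced as follows. The per-class chi-square values and class probabilities are taken directly from the text above. The dictionary names are illustrative only.

```python
# Class probabilities P(c_i) and per-class chi-square values from the text.
class_priors = {"c1": 1 / 10, "c2": 4 / 5, "c3": 1 / 10}
chi2_per_class = {"c1": 6.944, "c2": 62.5, "c3": 62.5}

# Expected chi-square value of the feature: sum of P(c_i) * chi2(F, c_i).
score = sum(class_priors[c] * chi2_per_class[c] for c in class_priors)
print(round(score, 2))  # 56.94
```

Repeating this computation for every candidate feature and ranking the features by the resulting score is one way to use the value for feature selection.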