🧠 Introduction
In statistics, we often want to know whether two categorical variables are related — for example,
“Is there an association between gender and preference for a product?”
or,
“Do observed outcomes match what we expected?”
To answer such questions, we use the Chi-Squared Test (χ² test) — a powerful non-parametric statistical test that compares observed frequencies with expected frequencies.
It’s one of the most widely used hypothesis tests in categorical data analysis.
📘 What is the Chi-Squared Test?
The Chi-Squared Test (χ²) evaluates whether there is a significant difference between the observed frequencies and the expected frequencies in one or more categories.
It is based on the Chi-Squared distribution, which is right-skewed and whose shape depends on the degrees of freedom (df).
⚙️ Formula
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
where:
- $O_i$ = Observed frequency in category $i$
- $E_i$ = Expected frequency in category $i$

If the observed and expected frequencies are similar, $\chi^2$ will be small.
If they differ greatly, $\chi^2$ will be large.
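To make the formula concrete, here is a minimal sketch in Python that applies it directly to a pair of observed/expected count lists (the numbers are made up purely for illustration):

```python
# Illustrative counts (not taken from the examples below).
observed = [18, 22, 20, 40]
expected = [20, 20, 20, 40]

# Apply the formula: sum over categories of (O_i - E_i)^2 / E_i.
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_squared, 3))  # 0.4 -> observed counts are close to expected
```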
📊 Types of Chi-Squared Tests
| Type | Purpose | Example | 
|---|---|---|
| 1. Chi-Squared Goodness of Fit Test | Checks if the observed distribution fits an expected theoretical distribution. | Do dice rolls produce equal probabilities for all 6 outcomes? | 
| 2. Chi-Squared Test of Independence | Checks if two categorical variables are independent or associated. | Is gender independent of ice cream flavor preference? | 
| 3. Chi-Squared Test for Homogeneity | Compares distributions across multiple groups. | Do different cities have the same distribution of favorite car brands? | 
🧩 1️⃣ The Chi-Squared Goodness of Fit Test
Purpose:
To determine whether a sample data distribution matches a known or expected distribution.
Steps:
- State hypotheses:
  - $H_0$: Observed frequencies follow the expected distribution.
  - $H_1$: Observed frequencies do not follow the expected distribution.
- Calculate expected frequencies: $E_i = n \times p_i$, where $n$ is the total sample size and $p_i$ is the expected probability for category $i$.
- Compute the test statistic: $\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$
- Find the degrees of freedom: $df = k - 1$, where $k$ is the number of categories.
- Compare the test statistic with the critical value from the Chi-Squared distribution table, or use a p-value.
 
📘 Example (Goodness of Fit)
A fair die should produce equal outcomes (each with probability 1/6).
We roll a die 60 times, and observe:
| Face | 1 | 2 | 3 | 4 | 5 | 6 | 
|---|---|---|---|---|---|---|
| Observed (O) | 8 | 10 | 9 | 11 | 12 | 10 | 
Expected frequency for each face: $E_i = 60/6 = 10$.

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(8-10)^2}{10} + \frac{(10-10)^2}{10} + \dots + \frac{(10-10)^2}{10} = 1.0$$

At $df = 5$ and significance level $\alpha = 0.05$, the critical value is 11.07.
✅ Since 1.0 < 11.07, we fail to reject $H_0$: the die appears fair.
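The same calculation can be reproduced with SciPy (assuming it is installed): `scipy.stats.chisquare` performs the goodness-of-fit test, and `scipy.stats.chi2.ppf` gives the critical value used above.

```python
from scipy.stats import chisquare, chi2

observed = [8, 10, 9, 11, 12, 10]    # die rolls from the table above
expected = [10] * 6                  # fair die: 60 rolls / 6 faces

stat, p = chisquare(f_obs=observed, f_exp=expected)
critical = chi2.ppf(1 - 0.05, df=5)  # critical value at alpha = 0.05, df = 5

print(round(stat, 2), round(p, 4))   # 1.0 and a large p-value (about 0.96)
print(round(critical, 2))            # about 11.07
```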
🧮 2️⃣ Chi-Squared Test of Independence
Purpose:
To determine whether two categorical variables are independent or associated.
Setup:
We use a contingency table (cross-tabulation).
| | Like Product | Don’t Like | Total |
|---|---|---|---|
| Male | 40 | 10 | 50 |
| Female | 30 | 20 | 50 |
| Total | 70 | 30 | 100 |
Steps:
- Hypotheses:
  - $H_0$: Gender and preference are independent.
  - $H_1$: Gender and preference are associated.
- Calculate expected frequencies: $E_{ij} = \frac{(\text{Row Total}) \times (\text{Column Total})}{\text{Grand Total}}$. For example, for the Male–Like cell: $E = \frac{50 \times 70}{100} = 35$.
- Compute the test statistic: $\chi^2 = \sum \frac{(O - E)^2}{E}$
- Degrees of freedom: $df = (r - 1)(c - 1)$, where $r$ is the number of rows and $c$ is the number of columns.
- Decision rule: compare the calculated χ² value with the critical χ² value from the table (based on df and α), or use the p-value (see the sketch below).
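For the decision rule, the critical value can also be obtained programmatically rather than from a printed table; a small sketch, assuming SciPy is available:

```python
from scipy.stats import chi2

alpha = 0.05
r, c = 2, 2                        # rows and columns of the contingency table
df = (r - 1) * (c - 1)             # degrees of freedom

critical_value = chi2.ppf(1 - alpha, df=df)
print(round(critical_value, 2))    # about 3.84 for df = 1
```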
📈 Example (Test of Independence)
Compute χ² for the above table:
| Category | O | E | (O−E)²/E | 
|---|---|---|---|
| Male–Like | 40 | 35 | 0.714 | 
| Male–Don’t | 10 | 15 | 1.667 | 
| Female–Like | 30 | 35 | 0.714 | 
| Female–Don’t | 20 | 15 | 1.667 | 
| Total | | | 4.762 |
$$\chi^2 = 4.76, \quad df = (2 - 1)(2 - 1) = 1$$

At $\alpha = 0.05$, the critical χ² value is 3.84.
✅ Since 4.76 > 3.84, we reject $H_0$: gender and preference are not independent (they are associated).
💻 Performing Chi-Squared Test in Python
```python
from scipy.stats import chi2_contingency

# Contingency table (rows: Male, Female; columns: Like, Don't Like)
data = [[40, 10],
        [30, 20]]

# Perform the Chi-Squared test of independence.
# correction=False turns off Yates' continuity correction so the result
# matches the hand calculation above.
chi2, p, dof, expected = chi2_contingency(data, correction=False)

print("Chi-Squared Statistic:", round(chi2, 3))
print("Degrees of Freedom:", dof)
print("P-value:", round(p, 4))
print("Expected Frequencies:\n", expected)
```
Output:
```
Chi-Squared Statistic: 4.762
Degrees of Freedom: 1
P-value: 0.0291
Expected Frequencies:
 [[35. 15.]
 [35. 15.]]
```
✅ Since p = 0.0291 < 0.05, we reject $H_0$.
There is a significant association between gender and product preference.
📊 Assumptions of Chi-Squared Test
- Data are frequencies, not percentages or proportions.
 - Observations are independent (each subject belongs to only one category).
 - Expected frequency ≥ 5 for most cells (a quick way to check this is shown after the list).
 - Categorical variables (nominal or ordinal data).
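
One practical way to check the expected-frequency rule of thumb is to inspect the `expected` array that `chi2_contingency` returns; a minimal sketch using the table from the earlier example:

```python
from scipy.stats import chi2_contingency

data = [[40, 10],
        [30, 20]]

# The fourth return value is the table of expected frequencies,
# so the "expected count >= 5 in each cell" rule can be checked directly.
_, _, _, expected = chi2_contingency(data)
print((expected >= 5).all())  # True -> the assumption holds for this table
```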
 
⚠️ Limitations
- Sensitive to small expected frequencies.
 - Does not show strength or direction of relationship.
 - Only detects association, not causation.
 - Not suitable for continuous data without grouping.
 
🔍 Real-World Applications
- Education: Association between study habits and pass/fail outcomes.
 - Business: Relationship between customer gender and product preference.
 - Healthcare: Link between treatment type and recovery status.
 - Politics: Relationship between age group and voting preference.
 - Marketing: Association between advertisement type and consumer response.
 
🧭 Final Thoughts
The Chi-Squared Test is one of the cornerstones of inferential statistics for categorical data.
It helps analysts, researchers, and educators determine whether patterns observed in sample data are meaningful or just due to chance.
By mastering this test, you gain a foundation for more advanced statistical topics such as logistic regression, contingency table analysis, and non-parametric modeling — all vital tools for modern data analysis and research.


