🧠 Introduction
In statistics, we often want to know whether two categorical variables are related — for example,
“Is there an association between gender and preference for a product?”
or,
“Do observed outcomes match what we expected?”
To answer such questions, we use the Chi-Squared Test (χ² test) — a powerful non-parametric statistical test that compares observed frequencies with expected frequencies.
It’s one of the most widely used hypothesis tests in categorical data analysis.
📘 What is the Chi-Squared Test?
The Chi-Squared Test (χ²) evaluates whether there is a significant difference between the observed frequencies and the expected frequencies in one or more categories.
It is based on the Chi-Squared distribution, which is right-skewed and whose shape depends on the degrees of freedom (df).
⚙️ Formula
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
where:
- $O_i$ = Observed frequency in category $i$
- $E_i$ = Expected frequency in category $i$

If the observed and expected frequencies are similar, $\chi^2$ will be small.
If they differ greatly, $\chi^2$ will be large.
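To make the formula concrete, here is a minimal sketch in Python that applies it directly to a pair of observed/expected count lists (the numbers are made up purely for illustration):

```python
# Illustrative counts (not taken from the examples below).
observed = [18, 22, 20, 40]
expected = [20, 20, 20, 40]

# Apply the formula: sum over categories of (O_i - E_i)^2 / E_i.
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_squared, 3))  # 0.4 -> observed counts are close to expected
```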
📊 Types of Chi-Squared Tests
| Type | Purpose | Example | 
|---|---|---|
| 1. Chi-Squared Goodness of Fit Test | Checks if the observed distribution fits an expected theoretical distribution. | Do dice rolls produce equal probabilities for all 6 outcomes? | 
| 2. Chi-Squared Test of Independence | Checks if two categorical variables are independent or associated. | Is gender independent of ice cream flavor preference? | 
| 3. Chi-Squared Test for Homogeneity | Compares distributions across multiple groups. | Do different cities have the same distribution of favorite car brands? | 
🧩 1️⃣ The Chi-Squared Goodness of Fit Test
Purpose:
To determine whether a sample data distribution matches a known or expected distribution.
Steps:
- State hypotheses:
  - $H_0$: Observed frequencies follow the expected distribution.
  - $H_1$: Observed frequencies do not follow the expected distribution.
- Calculate expected frequencies: $E_i = n \times p_i$, where $n$ is the total sample size and $p_i$ is the expected probability for category $i$.
- Compute the test statistic: $\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$
- Find the degrees of freedom: $df = k - 1$, where $k$ is the number of categories.
- Compare the test statistic with the critical value from the Chi-Squared distribution table, or use a p-value.
 
📘 Example (Goodness of Fit)
A fair die should produce equal outcomes (each with probability 1/6).
We roll a die 60 times, and observe:
| Face | 1 | 2 | 3 | 4 | 5 | 6 | 
|---|---|---|---|---|---|---|
| Observed (O) | 8 | 10 | 9 | 11 | 12 | 10 | 
Expected frequency for each face: $E_i = 60/6 = 10$.

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(8-10)^2}{10} + \frac{(10-10)^2}{10} + \dots + \frac{(10-10)^2}{10} = 1.0$$

At $df = 5$ and significance level $\alpha = 0.05$, the critical value is 11.07.
✅ Since 1.0 < 11.07, we fail to reject $H_0$: the die appears fair.
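The same calculation can be reproduced with SciPy (assuming it is installed): `scipy.stats.chisquare` performs the goodness-of-fit test, and `scipy.stats.chi2.ppf` gives the critical value used above.

```python
from scipy.stats import chisquare, chi2

observed = [8, 10, 9, 11, 12, 10]    # die rolls from the table above
expected = [10] * 6                  # fair die: 60 rolls / 6 faces

stat, p = chisquare(f_obs=observed, f_exp=expected)
critical = chi2.ppf(1 - 0.05, df=5)  # critical value at alpha = 0.05, df = 5

print(round(stat, 2), round(p, 4))   # 1.0 and a large p-value (about 0.96)
print(round(critical, 2))            # about 11.07
```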
🧮 2️⃣ Chi-Squared Test of Independence
Purpose:
To determine whether two categorical variables are independent or associated.
Setup:
We use a contingency table (cross-tabulation).
| | Like Product | Don’t Like | Total |
|---|---|---|---|
| Male | 40 | 10 | 50 |
| Female | 30 | 20 | 50 |
| Total | 70 | 30 | 100 |
Steps:
- Hypotheses:
  - $H_0$: Gender and preference are independent.
  - $H_1$: Gender and preference are associated.
- Calculate expected frequencies: $E_{ij} = \frac{(\text{Row Total}) \times (\text{Column Total})}{\text{Grand Total}}$. For example, for the Male–Like cell: $E = \frac{50 \times 70}{100} = 35$.
- Compute the test statistic: $\chi^2 = \sum \frac{(O - E)^2}{E}$
- Degrees of freedom: $df = (r - 1)(c - 1)$, where $r$ is the number of rows and $c$ is the number of columns.
- Decision rule: compare the calculated χ² value with the critical χ² value from the table (based on df and α), or use the p-value (see the sketch below).
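For the decision rule, the critical value can also be obtained programmatically rather than from a printed table; a small sketch, assuming SciPy is available:

```python
from scipy.stats import chi2

alpha = 0.05
r, c = 2, 2                        # rows and columns of the contingency table
df = (r - 1) * (c - 1)             # degrees of freedom

critical_value = chi2.ppf(1 - alpha, df=df)
print(round(critical_value, 2))    # about 3.84 for df = 1
```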
📈 Example (Test of Independence)
Compute χ² for the above table:
| Category | O | E | (O−E)²/E | 
|---|---|---|---|
| Male–Like | 40 | 35 | 0.714 | 
| Male–Don’t | 10 | 15 | 1.667 | 
| Female–Like | 30 | 35 | 0.714 | 
| Female–Don’t | 20 | 15 | 1.667 | 
| Total | | | 4.762 |
$$\chi^2 = 4.76, \quad df = (2 - 1)(2 - 1) = 1$$

At $\alpha = 0.05$, the critical χ² value is 3.84.
✅ Since 4.76 > 3.84, we reject $H_0$: gender and preference are not independent (they are associated).
💻 Performing Chi-Squared Test in Python
```python
from scipy.stats import chi2_contingency

# Contingency table (rows: Male, Female; columns: Like, Don't Like)
data = [[40, 10],
        [30, 20]]

# Perform the Chi-Squared test of independence.
# correction=False turns off Yates' continuity correction so the result
# matches the hand calculation above.
chi2, p, dof, expected = chi2_contingency(data, correction=False)

print("Chi-Squared Statistic:", round(chi2, 3))
print("Degrees of Freedom:", dof)
print("P-value:", round(p, 4))
print("Expected Frequencies:\n", expected)
```
Output:
```
Chi-Squared Statistic: 4.762
Degrees of Freedom: 1
P-value: 0.0291
Expected Frequencies:
 [[35. 15.]
 [35. 15.]]
```
✅ Since p = 0.0291 < 0.05, we reject $H_0$.
There is a significant association between gender and product preference.
📊 Assumptions of Chi-Squared Test
- Data are frequencies, not percentages or proportions.
 - Observations are independent (each subject belongs to only one category).
 - Expected frequency ≥ 5 for most cells (a quick way to check this is shown after the list).
 - Categorical variables (nominal or ordinal data).
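
One practical way to check the expected-frequency rule of thumb is to inspect the `expected` array that `chi2_contingency` returns; a minimal sketch using the table from the earlier example:

```python
from scipy.stats import chi2_contingency

data = [[40, 10],
        [30, 20]]

# The fourth return value is the table of expected frequencies,
# so the "expected count >= 5 in each cell" rule can be checked directly.
_, _, _, expected = chi2_contingency(data)
print((expected >= 5).all())  # True -> the assumption holds for this table
```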
 
⚠️ Limitations
- Sensitive to small expected frequencies.
 - Does not show strength or direction of relationship.
 - Only detects association, not causation.
 - Not suitable for continuous data without grouping.
 
🔍 Real-World Applications
- Education: Association between study habits and pass/fail outcomes.
 - Business: Relationship between customer gender and product preference.
 - Healthcare: Link between treatment type and recovery status.
 - Politics: Relationship between age group and voting preference.
 - Marketing: Association between advertisement type and consumer response.
 
🧭 Final Thoughts
The Chi-Squared Test is one of the cornerstones of inferential statistics for categorical data.
It helps analysts, researchers, and educators determine whether patterns observed in sample data are meaningful or just due to chance.
By mastering this test, you gain a foundation for more advanced statistical topics such as logistic regression, contingency table analysis, and non-parametric modeling — all vital tools for modern data analysis and research.


