Chi-square tests

The chi-square test is a statistical method for analysing categorical data. There are two types:

1. Goodness of fit

Known as:
* Chi-square for 1 sample
* Chi-square for given proportions

Purpose: Compare the observed frequencies with the frequencies expected under a theoretical distribution specified in the null hypothesis.

2. Goodness of association

Known as:
* Chi-square test for Independence
* Chi-square test for Homogeneity

Purpose: Determine whether there is an association between the categories of two variables.

General formula for both types:

\[ X^2=\sum\frac{(Observed-Expected)^2}{Expected} \]

___________________________________________________________________________

Example 1:

  1. Goodness of fit:

Assume that in a certain forest we counted 132 planted trees. Forest officials claim the ratio of Orange, Cedar and Banana trees is 1:2:1 (1+2+1 = 4). This means that the expected proportions are:

1/4 for Orange trees
2/4 (= 1/2) for Cedar trees
1/4 for Banana trees

The count was as follows: 52 Orange trees, 60 Cedar trees and 20 Banana trees.

Trees_observed <- c(52, 60, 20)
theoretical_proportion <- c(1/4, 1/2, 1/4)
test <- chisq.test(Trees_observed, p=theoretical_proportion)
test
## 
##  Chi-squared test for given probabilities
## 
## data:  Trees_observed
## X-squared = 16.606, df = 2, p-value = 0.0002478

Since the p-value is < 0.05, we reject the null hypothesis, which states that the proportion is 1/2 for Cedar trees and 1/4 for each of the other two types. Thus the forest does not follow the ratio claimed by the officials. In other words, the observed proportions are significantly different from the expected proportions.

If the claimed ratio were true, the expected counts of Orange, Cedar and Banana trees would be, respectively:

test$expected
## [1] 33 66 33

By hand: To calculate expected counts:

Total_trees_counted <- sum(Trees_observed)
Total_trees_counted
## [1] 132
1/4*Total_trees_counted # Orange Tree
## [1] 33
1/2*Total_trees_counted # Cedar Tree
## [1] 66
1/4*Total_trees_counted # Banana Tree
## [1] 33

Chi-square Equation:

\[ X^2=\frac{(52-33)^2}{33}+ \frac{(60-66)^2}{66}+ \frac{(20-33)^2}{33}=16.606 \]
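The same sum can be checked in R. This is only a sketch: the names `observed`, `claimed_proportion`, `expected`, `contributions` and `x_squared` are chosen here for illustration and are not part of the original code.

```r
# Observed counts and claimed proportions from the example above
observed <- c(Orange = 52, Cedar = 60, Banana = 20)
claimed_proportion <- c(1/4, 1/2, 1/4)

# Expected counts under the null hypothesis: proportion * total
expected <- claimed_proportion * sum(observed)

# One term of the sum per tree type, then the statistic itself
contributions <- (observed - expected)^2 / expected
x_squared <- sum(contributions)

round(contributions, 3)
round(x_squared, 3)
```

Looking at `contributions` shows which tree types drive the result: most of the statistic comes from the Orange and Banana terms, while Cedar contributes very little.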


To get the p-value for the calculated chi-square statistic 16.606 with 2 degrees of freedom (df = n-1, where n = 3 is the number of tree types), we can apply 1-pchisq().

pval = 1-pchisq(16.606,df=2)
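As an alternative sketch, `pchisq()` can return the upper tail directly through its `lower.tail` argument, which avoids the subtraction from 1 and keeps more precision when the p-value is very small. The variable names below are illustrative.

```r
x_squared <- 16.606
df <- 3 - 1  # three tree types, so 2 degrees of freedom

# Upper-tail probability computed directly
p_upper <- pchisq(x_squared, df = df, lower.tail = FALSE)

# The complement form used in the text
p_complement <- 1 - pchisq(x_squared, df = df)

p_upper
all.equal(p_upper, p_complement)
```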

Example 2:

  2. Goodness of association:

Assume we took 10 Cedar seedlings and 10 Orange seedlings and planted them in a forest in Germany. After a year we visited the planting site and counted the trees that survived and the ones that died.

##   Trees Survived Dead Total_trees
##   Cedar        7    3          10
##  Orange        4    6          10
##   Total       11    9          20
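The code that builds `dat` is not shown here. One plausible reconstruction, given as an assumption rather than the original code, is a 2 x 2 matrix holding only the observed counts. The totals must be left out, otherwise `chisq.test()` would treat them as extra categories.

```r
# Hypothetical reconstruction of `dat`: observed counts only, no totals
dat <- matrix(c(7, 3,
                4, 6),
              nrow = 2, byrow = TRUE,
              dimnames = list(Trees = c("Cedar", "Orange"),
                              Outcome = c("Survived", "Dead")))

# Add the row and column totals for display only
addmargins(dat)
```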

Now we want to know whether there is an association between tree type and survival on the new land. If there is no association, the observed counts should be equal, or at least similar, to the expected counts.

test <- chisq.test(dat, correct = FALSE)
test
## 
##  Pearson's Chi-squared test
## 
## data:  dat
## X-squared = 1.8182, df = 1, p-value = 0.1775

Since the p-value is > 0.05, we fail to reject the null hypothesis, which states that there is no association between tree type and survival on the new land.

By hand: To calculate expected counts:

\[ Expected=\frac{totalcolumn*totalrow}{totalobservation} \]

dat2
##   Trees         Survived            Dead Total_trees
##   Cedar (11*10)/20 = 5.5 (9*10)/20 = 4.5          10
##  Orange (11*10)/20 = 5.5 (9*10)/20 = 4.5          10
##   Total               11               9          20
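The expected counts in the table can also be computed for all cells at once. This is a sketch, assuming the observed counts are stored in a 2 x 2 matrix called `observed` (a name chosen here for illustration): the outer product of the row totals and column totals, divided by the grand total, applies the formula above to every cell.

```r
observed <- matrix(c(7, 3,
                     4, 6),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Trees = c("Cedar", "Orange"),
                                   Outcome = c("Survived", "Dead")))

# (row total * column total) / grand total, for every cell
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# chisq.test() stores the same matrix in its result;
# suppressWarnings() is used in case R warns about small expected counts
builtin_expected <- suppressWarnings(chisq.test(observed, correct = FALSE))$expected
all.equal(unname(expected), unname(builtin_expected))
```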

Chi-square Equation:

\[ X^2=\frac{(7-5.5)^2}{5.5}+ \frac{(4-5.5)^2}{5.5}+ \frac{(3-4.5)^2}{4.5} + \frac{(6-4.5)^2}{4.5}=1.8182 \]

As in the previous example, to get the p-value for the calculated chi-square statistic 1.8182 with 1 degree of freedom (df = (rows-1)*(columns-1) = (2-1)*(2-1) = 1), we can apply 1-pchisq().

pval = 1-pchisq(1.8182,df=1)
pval
## [1] 0.1775277

Note: In chisq.test(dat, correct = FALSE) the correct argument was set to FALSE. If it is left out, the default (correct = TRUE) applies the Yates continuity correction to every 2x2 table, regardless of how small the observed counts are. In this case the formula becomes:

\[ X^2=\sum\frac{(|Observed-Expected|-0.5)^2}{Expected} \]
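The corrected statistic can be reproduced by hand as a check. The sketch below uses illustrative variable names. It also follows the documented behaviour of `chisq.test()`, which never subtracts more than the absolute difference itself, so the amount subtracted is the smaller of 0.5 and the smallest |Observed-Expected|.

```r
observed <- matrix(c(7, 3,
                     4, 6),
                   nrow = 2, byrow = TRUE)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

abs_diff <- abs(observed - expected)

# Amount subtracted: 0.5, capped at the smallest absolute difference
yates <- min(0.5, abs_diff)

x_squared_corrected <- sum((abs_diff - yates)^2 / expected)
p_corrected <- pchisq(x_squared_corrected, df = 1, lower.tail = FALSE)

# Built-in test with the default correct = TRUE;
# suppressWarnings() is used in case R warns about small expected counts
builtin <- suppressWarnings(chisq.test(observed))

round(x_squared_corrected, 4)
all.equal(unname(builtin$statistic), x_squared_corrected)
```

The corrected statistic is smaller than the uncorrected 1.8182, so the correction makes the test more conservative.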

Firas Fneish
Biostatistician