Chi-square tests
The chi-square test is a statistical method for analyzing categorical data. There are two types:
1. Goodness of fit
Known as:
* Chi-square for 1 sample
* Chi-square for given proportions
Purpose: Compare the observed frequencies to the expected frequencies derived from the theoretical distribution specified in the null hypothesis.
2. Goodness of association
Known as:
* Chi-square test for Independence
* Chi-square test for Homogeneity
Purpose: Determine whether there is an association between two categorical variables.
General formula for both types:
\[ X^2=\sum\frac{(Observed-Expected)^2}{Expected} \]
___________________________________________________________________________
Example 1:
- Goodness of fit:
Assume that we counted 132 trees in a certain forest. Forest officials claim that the ratio of Orange, Cedar and Banana trees is 1:2:1 (1+2+1 = 4). The expected proportions are therefore:
1/4 for Orange trees
2/4 (= 1/2) for Cedar trees
1/4 for Banana trees
The observed counts were as follows: 52 Orange trees, 60 Cedar trees and 20 Banana trees.
Trees_observed <- c(52, 60, 20)
theoretical_proportion <- c(1/4, 1/2, 1/4)
test <- chisq.test(Trees_observed, p=theoretical_proportion)
test
##
## Chi-squared test for given probabilities
##
## data: Trees_observed
## X-squared = 16.606, df = 2, p-value = 0.0002478
Since the p-value is below 0.05, we reject the null hypothesis, which states that the proportion of Cedar trees is 1/2 and the proportion of each of the other two types is 1/4. The forest therefore does not follow the ratio claimed by the officials. In other words, the observed proportions differ significantly from the expected proportions.
If the claimed ratio were true, the expected counts of Orange, Cedar and Banana trees would be, respectively:
test$expected
## [1] 33 66 33
By hand:
To calculate expected counts:
Total_trees_counted <- sum(Trees_observed)
Total_trees_counted
## [1] 132
1/4*Total_trees_counted # Orange Tree
## [1] 33
1/2*Total_trees_counted # Cedar Tree
## [1] 66
1/4*Total_trees_counted # Banana Tree
## [1] 33
Chi-square Equation:
\[ X^2=\frac{(52-33)^2}{33}+ \frac{(60-66)^2}{66}+ \frac{(20-33)^2}{33}=16.606 \]
To get the p-value for the calculated chi-square statistic of 16.606 with 2 degrees of freedom (df = n - 1, where n = 3 is the number of tree types), we can apply 1-pchisq().
pval = 1-pchisq(16.606,df=2)
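The by-hand steps for Example 1 can also be collected into one short script. This is a minimal sketch: the variable names (observed, proportions, expected, chi_sq, p_value) are chosen here for illustration, and lower.tail = FALSE is used as an equivalent of 1-pchisq().

```r
# Observed counts and hypothesized proportions from Example 1
observed <- c(Orange = 52, Cedar = 60, Banana = 20)
proportions <- c(1/4, 1/2, 1/4)

# Expected counts under the null hypothesis
expected <- sum(observed) * proportions

# Chi-square statistic: sum of (O - E)^2 / E over the categories
chi_sq <- sum((observed - expected)^2 / expected)

# Upper-tail p-value with df = number of categories - 1
df <- length(observed) - 1
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

chi_sq
p_value
```

Both values should agree with the statistic and p-value reported by chisq.test() above.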
Example 2:
- Goodness of association
Assume we took 10 Cedar plants and 10 Orange plants and planted them in a forest in Germany. After a year, we visited the planting site and counted the trees that had survived and those that had died.
##    Trees Survived Dead Total_trees
##    Cedar        7    3          10
##   Orange        4    6          10
##    Total       11    9          20
Now we want to know whether there is an association between tree type and survival on the new land. If there is no association, the observed counts should be equal, or at least close, to the expected counts.
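The text does not show how the object dat is created. One way it might be built, assuming it is a 2 x 2 matrix holding only the observed counts (the Total row and column are left out, since chisq.test() would otherwise treat them as extra categories), is sketched here. The dimnames labels are illustrative.

```r
# Hypothetical construction of `dat`: a 2 x 2 matrix of observed counts,
# without the Total row and column shown in the table
dat <- matrix(
  c(7, 3,
    4, 6),
  nrow = 2, byrow = TRUE,
  dimnames = list(Trees = c("Cedar", "Orange"),
                  Outcome = c("Survived", "Dead"))
)

# Row and column totals, for comparison with the table
rowSums(dat)
colSums(dat)
```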
test <- chisq.test(dat, correct = FALSE)
test
##
## Pearson's Chi-squared test
##
## data: dat
## X-squared = 1.8182, df = 1, p-value = 0.1775
Since the p-value is above 0.05, we fail to reject the null hypothesis, which states that there is no association between tree type and survival on the new land.
By hand:
To calculate expected counts:
\[
Expected=\frac{\text{column total}\times\text{row total}}{\text{total observations}}
\]
dat2
##   Trees   Survived            Dead               Total_trees
##   Cedar   (11*10)/20 = 5.5    (9*10)/20 = 4.5    10
##   Orange  (11*10)/20 = 5.5    (9*10)/20 = 4.5    10
##   Total   11                  9                  20
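The same expected counts can be computed for all cells at once. The sketch below assumes the observed counts are stored in a 2 x 2 matrix without totals; the names observed and expected are illustrative. outer() multiplies every row total by every column total, and dividing by the grand total gives the expected count for each cell.

```r
# Observed 2 x 2 counts from the table (assumed layout: rows = tree type)
observed <- matrix(c(7, 3, 4, 6), nrow = 2, byrow = TRUE,
                   dimnames = list(c("Cedar", "Orange"), c("Survived", "Dead")))

# Expected count for each cell = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected
```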
Chi-square Equation:
\[ X^2=\frac{(7-5.5)^2}{5.5}+ \frac{(4-5.5)^2}{5.5}+ \frac{(3-4.5)^2}{4.5} + \frac{(6-4.5)^2}{4.5}=1.8182 \]
As in the previous example, to get the p-value for the calculated chi-square statistic of 1.8182 with 1 degree of freedom, we can apply 1-pchisq().
pval = 1-pchisq(1.8182,df=1)
pval
## [1] 0.1775277
Note: We ran chisq.test(dat, correct = FALSE), with the correct argument set to FALSE. If we omit it, the function defaults to correct = TRUE and applies the Yates continuity correction to every 2 by 2 table, regardless of the cell counts (the correction is capped so that it never exceeds the |Observed-Expected| differences themselves). In this case the formula becomes:
\[ X^2=\sum\frac{(|Observed-Expected|-0.5)^2}{Expected} \]
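To see the effect of the correction on Example 2, the corrected statistic can be computed by hand and compared with the default call. This is a sketch under the assumption that the observed counts form a 2 x 2 matrix without totals; the names chi_sq_yates, p_value_yates and default_test are illustrative. suppressWarnings() is used because R warns that the chi-squared approximation may be incorrect when an expected count is below 5 (here 4.5).

```r
# Observed counts from Example 2 and their expected counts
observed <- matrix(c(7, 3, 4, 6), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates-corrected statistic: subtract 0.5 from each |O - E| before squaring
chi_sq_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value_yates <- pchisq(chi_sq_yates, df = 1, lower.tail = FALSE)

# Compare with chisq.test() using its default correct = TRUE
default_test <- suppressWarnings(chisq.test(observed))
c(by_hand = chi_sq_yates, chisq_test = unname(default_test$statistic))
c(by_hand = p_value_yates, chisq_test = default_test$p.value)
```

The corrected statistic is smaller than the uncorrected 1.8182, so the corrected p-value is larger and the conclusion (fail to reject the null hypothesis) is unchanged.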