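Introduction: Based on a case study, this article shows how to use R to run a two-sample t-test to determine whether the means of two unrelated populations differ significantly. Using the time that 30 male and 30 female students (the samples) at South China Normal University (SCNU) spend online each day, we judge whether all male and female students at the university (the populations) differ significantly in the time they spend online. Topics covered: visualizing the data with box plots, the null hypothesis (H0), the alternative hypothesis (H1), using the Shapiro-Wilk test to check whether the data are normally distributed, using the F-test to check whether the variances are equal, comparing the p-value with the 0.05 significance level, and the t-test (for equal variances) and the Welch t-test (for unequal variances).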
The purpose of this article is not to explain everything behind the t-test. Instead, I offer a simplified, case-based way to use the tool that readers without a statistics background can follow. Focus on the big picture rather than the details. I recommend copying the code into RStudio and running it to check that you get the same results. Finally, the t-test is more complicated than what I describe below; readers who want to understand it in depth should consult further readings.
A two-sample t-test is used to compare the means of two unrelated groups of samples. Let's put the definition aside and look at a case study.
Case study
You want to know whether male and female students at SCNU differ in the amount of time they spend online each day. The only way to be 100% sure of the answer is to collect data from every student at the university (the population) and compare the two means. In reality, such a survey is not practical. However, you can take a random sample from each population and, by comparing the sample means, make an inference with some degree of certainty.
Sample data
# Hours that 30 male students at SCNU spend online each day
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
# Hours that 30 female students at SCNU spend online each day
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)
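Before plotting, it can help to look at the raw numbers behind the comparison. The following is a minimal sketch using base R only; the names group_summary and mean_diff are my own and do not appear elsewhere in the article.

```r
# Hours that 30 male and 30 female students at SCNU spend online each day
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

# Sample size, mean and standard deviation of each group
# (group_summary is an illustrative name, not part of the original code)
group_summary <- data.frame(
  Gender = c("Male", "Female"),
  n      = c(length(time_male), length(time_female)),
  mean   = c(mean(time_male), mean(time_female)),
  sd     = c(sd(time_male), sd(time_female))
)
print(group_summary)

# Difference between the two sample means, in hours per day
mean_diff <- mean(time_male) - mean(time_female)
print(mean_diff)
```

The sample means are about 6.13 hours for male students and 4.87 hours for female students. The rest of the article asks whether a gap of this size could plausibly be due to chance.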
Visualization
We can draw a box plot to visualize the data.
# Create a data frame
my_data <- data.frame(Gender = rep(c("Male", "Female"), each = 30), Hours = c(time_male, time_female) )
# Visualize the data using a box plot
# Install ggpubr only if it is not already installed
if (!requireNamespace("ggpubr", quietly = TRUE)) install.packages("ggpubr")
library("ggpubr")
ggboxplot(my_data, x = "Gender", y = "Hours", color = "Gender", palette = c("#00AFBB", "#E7B800"), ylab = "Hours", xlab = "Gender")
Hypotheses
To answer the research question, we form two hypotheses.
H0: The two population means are equal; any difference between the sample means is due to chance.
H1: The two population means are different; the difference is significant.
In statistics, "significant" means that a difference as large as the one observed would be unlikely to arise by chance alone if H0 were true.
However, before we conduct the t-test, we need to check (1) whether each set of data follows a normal distribution, and (2) whether the variances of the two sets of data are equal.
(1) Normal distribution
The Shapiro-Wilk test examines whether the data follow a normal distribution. The graph below shows a normal distribution.
# Shapiro-Wilk test
shapiro.test(time_male)
shapiro.test(time_female)
The two p-values are 0.3331 and 0.248, respectively. Since both are greater than the significance level of 0.05, neither set of data differs significantly from a normal distribution. In other words, the Shapiro-Wilk test gives us no reason to doubt normality, so we proceed on the assumption that the data are normally distributed.
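If you prefer to let R do the comparison with 0.05 instead of reading the printed output, the p-values can be pulled out of the test objects. This is a minimal sketch; alpha, p_male, p_female and normal_ok are names I introduce for illustration.

```r
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

# Significance level used throughout the article
alpha <- 0.05

# shapiro.test() returns a list; the p-value is stored in $p.value
p_male <- shapiro.test(time_male)$p.value
p_female <- shapiro.test(time_female)$p.value

# TRUE means "no significant departure from normality at the 0.05 level"
normal_ok <- c(Male = p_male > alpha, Female = p_female > alpha)
print(round(c(Male = p_male, Female = p_female), 4))
print(normal_ok)
```

Note that a TRUE here does not prove the data are normal; it only says the test found no significant evidence against normality.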
(2) Variances
Variance measures how spread out the numbers in a set are around their mean; roughly, it is the average squared distance from the mean. If the variances of the two groups are equal, we use the classic t-test. If not, we use the Welch t-test.
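To make the definition concrete, here is a small sketch that computes the sample variance step by step and checks it against R's built-in var(). The helper var_by_hand is hypothetical and only for illustration; in practice you would simply call var().

```r
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

# Sample variance: sum of squared distances from the mean, divided by n - 1
# (var_by_hand is an illustrative helper, not a built-in function)
var_by_hand <- function(x) {
  sum((x - mean(x))^2) / (length(x) - 1)
}

var_male <- var_by_hand(time_male)
var_female <- var_by_hand(time_female)
print(c(Male = var_male, Female = var_female))

# The built-in var() gives the same values
print(c(Male = var(time_male), Female = var(time_female)))
```

The male sample has the larger variance (about 4.12 versus 2.26), which is why the question of equal variances comes up at all.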
F-test (optional)
We can conduct an F-test to compare the variances of the two samples.
# F-test to compare the variances
res.ftest <- var.test(time_male, time_female)
res.ftest
The p-value is 0.111. It is greater than 0.05, which implies that the two variances are not significantly different. Therefore, we can use the classic t-test, which assumes equal variances.
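The F statistic reported by var.test() is simply the ratio of the two sample variances. The sketch below checks this and extracts the p-value so it can be compared with 0.05 in code; f_ratio and equal_var_ok are names of my own.

```r
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

# Ratio of the two sample variances (male over female)
f_ratio <- var(time_male) / var(time_female)

# var.test() reports the same ratio as its F statistic
res.ftest <- var.test(time_male, time_female)
print(c(by_hand = f_ratio, var_test = unname(res.ftest$statistic)))

# TRUE means the variances are not significantly different at the 0.05 level
equal_var_ok <- res.ftest$p.value > 0.05
print(equal_var_ok)
```

A ratio close to 1 means similar variances. Here the ratio is about 1.82, which the F-test does not consider significant for samples of this size.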
T-test
# T-test
res <- t.test(time_male, time_female, var.equal = TRUE)
res
The p-value is 0.007992. Since it is less than 0.05, we reject H0 in favor of H1. We therefore conclude that there is a significant difference between the average time that male and female students at SCNU spend online each day.
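For readers curious about where this p-value comes from, the following sketch recomputes the classic (equal-variance) t-test by hand with the standard pooled-variance formula and compares it with the output of t.test(). The intermediate names (pooled_var, se, t_stat, p_value) are my own.

```r
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

n1 <- length(time_male)
n2 <- length(time_female)

# Pooled variance: weighted average of the two sample variances
pooled_var <- ((n1 - 1) * var(time_male) + (n2 - 1) * var(time_female)) / (n1 + n2 - 2)

# Standard error of the difference between the two means
se <- sqrt(pooled_var * (1 / n1 + 1 / n2))

# t statistic, degrees of freedom and two-sided p-value
t_stat <- (mean(time_male) - mean(time_female)) / se
df <- n1 + n2 - 2
p_value <- 2 * pt(-abs(t_stat), df)

# Compare with the built-in test
res <- t.test(time_male, time_female, var.equal = TRUE)
print(c(t_by_hand = t_stat, t_builtin = unname(res$statistic)))
print(c(p_by_hand = p_value, p_builtin = res$p.value))
```

The hand calculation and t.test() agree, so in practice you only need the one-line call.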
Welch t-test
However, you do not have to run the F-test. You can simply assume unequal variances without comparing them and use the Welch t-test, which is considered the safer choice.
# Welch t-test to compare the means of two samples
res_welch <- t.test(time_male, time_female, var.equal = FALSE)
res_welch
Again, we focus on the p-value, which is 0.008172. Since it is less than the significance level of 0.05, we have reason to reject H0 in favor of H1.
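The Welch t-test differs from the classic t-test in how it computes the standard error and the degrees of freedom. Below is a sketch of the standard Welch-Satterthwaite calculation, checked against t.test(); the variable names are mine.

```r
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

n1 <- length(time_male)
n2 <- length(time_female)

# Each group's variance divided by its own sample size (no pooling)
v1 <- var(time_male) / n1
v2 <- var(time_female) / n2

# Welch t statistic
t_welch <- (mean(time_male) - mean(time_female)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value
p_welch <- 2 * pt(-abs(t_welch), df_welch)

# Compare with the built-in test
res_welch <- t.test(time_male, time_female, var.equal = FALSE)
print(c(df_by_hand = df_welch, df_builtin = unname(res_welch$parameter)))
print(c(p_by_hand = p_welch, p_builtin = res_welch$p.value))
```

Because both groups have 30 students, the t statistic is the same as in the classic test. Only the degrees of freedom drop (from 58 to roughly 53), which is why the Welch p-value is slightly larger.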
If you are not sure whether the two groups have equal variances, always use the Welch t-test. That way, you only need to check whether the data follow a normal distribution before running the Welch t-test. Finally, you draw your conclusion from the p-value, which is pretty easy!
Statistics lets us infer the behaviour of a population from the information in a representative sample of that population. This is exactly what we have done in this case study. Based on the amount of time that 30 male students and 30 female students at SCNU spend online each day, we used the t-test to infer whether there is a significant difference in time spent online between all the male and female students at SCNU!
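The simplified workflow described here (check normality, then run the Welch t-test, then compare the p-value with 0.05) can be wrapped in a small function. This is only a sketch: compare_two_means is a hypothetical helper of my own, and stopping with an error when normality is rejected is my assumption, since this article does not cover what to do in that case.

```r
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

# Hypothetical helper: normality check followed by the Welch t-test
compare_two_means <- function(x, y, alpha = 0.05) {
  # Step 1: Shapiro-Wilk test on each sample
  normal <- shapiro.test(x)$p.value > alpha && shapiro.test(y)$p.value > alpha
  if (!normal) {
    # Assumption: the non-normal case is outside the scope of this article
    stop("At least one sample departs significantly from normality.")
  }
  # Step 2: Welch t-test (does not assume equal variances)
  res <- t.test(x, y, var.equal = FALSE)
  # Step 3: compare the p-value with the significance level
  list(p_value = res$p.value, reject_h0 = res$p.value < alpha)
}

out <- compare_two_means(time_male, time_female)
print(out)
```

For the SCNU data, reject_h0 is TRUE, matching the conclusion reached step by step above.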
Code
# Hours that 30 male students at SCNU spend online each day
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
# Hours that 30 female students at SCNU spend online each day
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)
# Create a data frame
my_data <- data.frame(Gender = rep(c("Male", "Female"), each = 30), Hours = c(time_male, time_female) )
# Visualize the data using a box plot
# Install ggpubr only if it is not already installed
if (!requireNamespace("ggpubr", quietly = TRUE)) install.packages("ggpubr")
library("ggpubr")
p <- ggboxplot(my_data, x = "Gender", y = "Hours", color = "Gender", palette = c("#00AFBB", "#E7B800"), add = "jitter", short.panel.labs = FALSE)
# adding p-value
p + stat_compare_means(method = "t.test", label.y=10)
# whether they follow normal distribution or not
shapiro.test(time_male)
shapiro.test(time_female)
# whether they have equal variances or not
res.ftest <- var.test(time_male, time_female)
res.ftest
# T-test, if they have equal variances
res <- t.test(time_male, time_female, var.equal = TRUE)
res
# printing the p-value
res$p.value
# Visualization
ggdensity(my_data, x="Hours", add = "mean", rug = TRUE, color = "Gender", fill = "Gender", palette = c("#00AFBB", "#E7B800"))
# OR
# Welch t-test, which can be used whether or not the variances are equal
res_welch <- t.test(time_male, time_female, var.equal = FALSE)
res_welch
# printing the p-value
res_welch$p.value