R: How to Conduct a Two-Sample T-Test

2021-02-20 森林語言學筆記

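Introduction: Using a case study, this article shows how to run a two-sample t-test in R to decide whether the means of two unrelated populations differ significantly. We collect the daily time spent online by 30 male and 30 female students (the samples) at South China Normal University (SCNU) and use it to judge whether all male and all female students at the university (the populations) differ significantly in the time they spend online. The article covers: drawing a box plot to visualize the data, the null hypothesis (H0), the alternative hypothesis (H1), using the Shapiro-Wilk test to check whether the data are normally distributed, using the F-test to check whether the variances are equal, comparing the p-value with the significance level of 0.05, and the t-test (for equal variances) and the Welch t-test (for unequal variances).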

The purpose of this article is not to explain everything behind the t-test. Instead, I offer a simplified way to use the tool, based on a case study, that readers without a statistics background can follow. It is advisable to focus on the big picture rather than the details. Readers are encouraged to copy and paste the code into RStudio and check whether they get the same results. Lastly, the t-test is much more complicated than what I describe below. Readers who are interested in the statistical tool should consult further reading.

Vocabulary

population: 總體
sample: 樣本
mean: 均值
normal distribution: 正態分布
variance: 方差
significant difference: 顯著性差異
hypothesis: 假設

The two-sample t-test is used to compare the means of two unrelated (independent) groups. Let's put the definition aside and look at a case study.

Case study

You want to know whether male and female students at SCNU differ in the amount of time they spend online each day. The only way to be 100% sure of the answer is to collect the data from every student in the university (the population) and compare the two means. In reality, such a survey is not practical. However, you can take a random sample from each population, compare the sample means, and make your inference with some degree of certainty.

Sample data

# Hours per day that 30 male students at SCNU spend online

time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)

# Hours per day that 30 female students at SCNU spend online

time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)
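
Before testing anything, it can help to look at the sample means themselves. The following is a minimal sketch in base R; the names `summary_stats` and `mean_diff` are my own and are not part of the original code.

```r
# Sample data from the case study
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

# Sample size, mean and standard deviation of each group
# (summary_stats is an illustrative name, not from the original article)
summary_stats <- data.frame(
  group = c("Male", "Female"),
  n     = c(length(time_male), length(time_female)),
  mean  = c(mean(time_male), mean(time_female)),
  sd    = c(sd(time_male), sd(time_female))
)
print(summary_stats)

# Raw difference between the two sample means, in hours per day
mean_diff <- mean(time_male) - mean(time_female)
mean_diff
```

The male sample averages about 6.13 hours and the female sample about 4.87 hours, a difference of roughly 1.27 hours. The rest of the article asks whether a gap of this size could plausibly be due to chance.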

Visualization

We can draw a box plot to visualize the data.

# Create a data frame

my_data <- data.frame(Gender = rep(c("Male", "Female"), each = 30), Hours = c(time_male, time_female) )

# Visualize the data using box plot

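# Install ggpubr only if it is not already installed
if (!requireNamespace("ggpubr", quietly = TRUE)) install.packages("ggpubr")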

library("ggpubr")

ggboxplot(my_data, x = "Gender", y = "Hours", color = "Gender", palette = c("#00AFBB", "#E7B800"), ylab = "Hours", xlab = "Gender")
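
If you prefer not to install an extra package, a similar plot can probably be produced with base R alone. This is only a sketch: `boxplot()` is a base R function, `las = 1` is a base-graphics setting that keeps the axis labels horizontal, and the variable name `bp` is my own.

```r
# Sample data from the case study
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

my_data <- data.frame(
  Gender = rep(c("Male", "Female"), each = 30),
  Hours  = c(time_male, time_female)
)

# Base R box plot; groups are ordered alphabetically (Female, Male)
bp <- boxplot(
  Hours ~ Gender, data = my_data,
  col  = c("#E7B800", "#00AFBB"),
  las  = 1,
  xlab = "Gender", ylab = "Hours"
)

# The third row of bp$stats holds the median of each group
bp$stats[3, ]
```

The thick line inside each box is the median: 5 hours for the female sample and 6 hours for the male sample.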

Hypotheses

In order to answer the research question, we need to form two hypotheses.

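H0: The two population means are equal; any difference between the sample means is due to chance.

H1: The two population means are different; the difference between the sample means is significant.

In statistics, "significant" means that a difference as large as the one observed would be unlikely if only chance were at work. It does not mean that the difference is large or important.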
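
To make "due to chance" more concrete, one rough illustration is to shuffle the gender labels many times and see how often a difference as large as the observed one appears. Note that this label-shuffling (permutation) simulation is a different technique from the t-test used in this article and is shown only as a sketch. The seed, the number of shuffles (`n_perm`) and the variable names are arbitrary choices of mine.

```r
# Sample data from the case study
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

set.seed(1)      # arbitrary seed, only for reproducibility
n_perm <- 5000   # arbitrary number of shuffles

observed <- mean(time_male) - mean(time_female)
pooled   <- c(time_male, time_female)

# Under H0 the gender labels carry no information, so we shuffle them
perm_diffs <- replicate(n_perm, {
  shuffled <- sample(pooled)
  mean(shuffled[1:30]) - mean(shuffled[31:60])
})

# Share of shuffles with a difference at least as extreme as the observed one
chance_share <- mean(abs(perm_diffs) >= abs(observed))
chance_share
```

If `chance_share` is small, chance alone rarely produces a gap this large, which is the intuition behind rejecting H0.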

However, before we conduct the t-test, we need to know (1) whether the two sets of data follow a normal distribution; and (2) whether the variances of the two sets of data are equal.

(1) Normal distribution

The Shapiro-Wilk test examines whether the data are consistent with a normal distribution. The graph below shows a normal distribution.

# Shapiro-Wilk test

shapiro.test(time_male)

shapiro.test(time_female)

The two p-values are 0.3331 and 0.248, respectively. Since both are greater than the significance level of 0.05, neither set of data differs significantly from a normal distribution. In other words, the Shapiro-Wilk test gives us no reason to doubt normality, so we can proceed as if the data were normally distributed.
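
The comparison with 0.05 can also be done in code rather than by reading the printed output. The sketch below assumes only base R; `shapiro.test()` returns a list whose `p.value` element holds the p-value, and the names `alpha`, `p_male` and `p_female` are my own. A normal Q-Q plot is added as an optional visual check.

```r
# Sample data from the case study
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

alpha <- 0.05  # significance level used throughout the article

# Extract the p-values from the test results
p_male   <- shapiro.test(time_male)$p.value
p_female <- shapiro.test(time_female)$p.value

# TRUE means "no significant departure from normality"
c(male = p_male > alpha, female = p_female > alpha)

# Optional visual check: points close to the line suggest normality
qqnorm(time_male, main = "Normal Q-Q plot: male students")
qqline(time_male)
```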

(2) Variances

Variance measures how widely the numbers in a set are spread around their mean: it is the average of the squared deviations from the mean. If the variances of the two groups are equal, we use the classic t-test. If not, we use the Welch t-test.
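
As a quick illustration of that definition, the sample variance can be computed by hand and compared with R's built-in `var()`. This is a sketch; the helper name `manual_var` is hypothetical and not part of the original code. Note that the sample variance divides by n - 1 rather than n.

```r
# Sample data from the case study
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

# Hypothetical helper: sum of squared deviations divided by n - 1
manual_var <- function(x) {
  sum((x - mean(x))^2) / (length(x) - 1)
}

var_male   <- manual_var(time_male)
var_female <- manual_var(time_female)

c(male = var_male, female = var_female)

# Both should agree with the built-in function
all.equal(var_male, var(time_male))
all.equal(var_female, var(time_female))
```

The male sample has a variance of about 4.12 and the female sample about 2.26, so the male data are more spread out.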

F-test (optional)

We can conduct an F-test to compare the variances of the two samples.

# F-test to compare the variances

res.ftest <- var.test(time_male, time_female)

res.ftest

The p-value is 0.111. It is greater than 0.05, which implies that the variances are not significantly different. Therefore, we can use the classic t-test, which assumes equal variances.
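
For readers curious about what the F-test does under the hood, the sketch below rebuilds the test statistic as the ratio of the two sample variances and derives a two-sided p-value from the F distribution with `pf()`. The variable names are my own, and the comparison with `var.test()` is only a sanity check.

```r
# Sample data from the case study
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

# F statistic: ratio of the two sample variances
f_stat <- var(time_male) / var(time_female)

# Degrees of freedom: sample size minus one for each group
df1 <- length(time_male) - 1
df2 <- length(time_female) - 1

# Two-sided p-value from the F distribution
p_upper <- pf(f_stat, df1, df2, lower.tail = FALSE)
p_two_sided <- 2 * min(p_upper, 1 - p_upper)

c(F = f_stat, p = p_two_sided)

# Sanity check against the built-in test
res.ftest <- var.test(time_male, time_female)
all.equal(unname(res.ftest$statistic), f_stat)
all.equal(res.ftest$p.value, p_two_sided)
```

The ratio is about 1.82: the male variance is almost twice the female variance, yet with 30 observations per group this is not enough to be significant at the 0.05 level.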

T-test

# T-test

res <- t.test(time_male, time_female, var.equal = TRUE)

res

The p-value is 0.007992. Since it is less than 0.05, we can reject H0 in favor of H1. Therefore, we can conclude that there is a significant difference in the mean amount of time that male students and female students at SCNU spend online each day.
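
The same result can be reproduced step by step, which may help show where the p-value comes from. The sketch below uses the standard pooled-variance formulas; all variable names are my own and the final lines simply compare the manual result with `t.test()`.

```r
# Sample data from the case study
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

n1 <- length(time_male)
n2 <- length(time_female)

# Pooled variance: weighted average of the two sample variances
pooled_var <- ((n1 - 1) * var(time_male) + (n2 - 1) * var(time_female)) / (n1 + n2 - 2)

# Standard error of the difference between the two means
se <- sqrt(pooled_var * (1 / n1 + 1 / n2))

# t statistic and degrees of freedom
t_stat <- (mean(time_male) - mean(time_female)) / se
df     <- n1 + n2 - 2

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

c(t = t_stat, df = df, p = p_value)

# Sanity check against the built-in test
res <- t.test(time_male, time_female, var.equal = TRUE)
all.equal(res$p.value, p_value)
```

The t statistic is about 2.75 with 58 degrees of freedom: the difference between the means is roughly 2.75 times its standard error.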

Welch t-test

However, you do not have to run the F-test. You can simply assume unequal variances without comparing them and use the Welch t-test, which is generally considered the safer choice.

# Welch t-test to compare the means of two samples

res_welch <- t.test(time_male, time_female, var.equal = FALSE)

res_welch

Again, we focus on the p-value, which is 0.008172. Since it is less than the significance level of 0.05, we have reason to reject H0 in favor of H1.
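
The Welch test differs from the classic t-test in that it does not pool the variances and it adjusts the degrees of freedom (the Welch-Satterthwaite approximation). The sketch below recomputes both by hand; the variable names are my own and the last lines only check the result against `t.test()`.

```r
# Sample data from the case study
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

n1 <- length(time_male)
n2 <- length(time_female)

# Variance of each sample mean (no pooling)
v1 <- var(time_male) / n1
v2 <- var(time_female) / n2

# t statistic with the unpooled standard error
t_welch <- (mean(time_male) - mean(time_female)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value
p_welch <- 2 * pt(-abs(t_welch), df_welch)

c(t = t_welch, df = df_welch, p = p_welch)

# Sanity check against the built-in test
res_welch <- t.test(time_male, time_female, var.equal = FALSE)
all.equal(unname(res_welch$parameter), df_welch)
all.equal(res_welch$p.value, p_welch)
```

Because both groups have 30 observations, the t statistic is the same as in the classic test. Only the degrees of freedom drop, from 58 to about 53.4, which is why the p-value is slightly larger.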

If you are not sure whether the variances are equal, use the Welch t-test. You then only need to check whether the data follow a normal distribution, run the Welch t-test, and draw your conclusion from the p-value, which is pretty easy!
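
The simplified workflow described above (check normality, run the Welch t-test, compare the p-value with 0.05) could be wrapped in a small function. This is only a sketch: `compare_two_groups` is a hypothetical helper of my own, not a function from R or from any package.

```r
# Sample data from the case study
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

# Hypothetical helper that bundles the steps of this article
compare_two_groups <- function(x, y, alpha = 0.05) {
  # Step 1: Shapiro-Wilk test on each sample
  normal_x <- shapiro.test(x)$p.value > alpha
  normal_y <- shapiro.test(y)$p.value > alpha

  # Step 2: Welch t-test (does not assume equal variances)
  welch <- t.test(x, y, var.equal = FALSE)

  # Step 3: collect the pieces needed for the conclusion
  list(
    normality_ok = normal_x && normal_y,
    p_value      = welch$p.value,
    reject_h0    = welch$p.value < alpha,
    conf_int     = as.numeric(welch$conf.int)
  )
}

result <- compare_two_groups(time_male, time_female)
result
```

`reject_h0` is TRUE for this data set, in line with the conclusion above. `conf_int` is the 95% confidence interval that `t.test()` reports for the difference between the two means. If `normality_ok` were FALSE, the t-test result should be treated with caution.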

Statistics lets us draw conclusions about a population from a representative sample of that population. This is exactly what we have done in this case study. Based on the amount of time that 30 male students and 30 female students at SCNU spend online each day, we used the t-test to infer whether there is a significant difference in the mean time spent online between all male and all female students at SCNU!

Code

# Hours per day that 30 male students at SCNU spend online

time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)

# Hours per day that 30 female students at SCNU spend online

time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

# Create a data frame

my_data <- data.frame(Gender = rep(c("Male", "Female"), each = 30), Hours = c(time_male, time_female) )

# Visualize the data using box plot

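# Install ggpubr only if it is not already installed
if (!requireNamespace("ggpubr", quietly = TRUE)) install.packages("ggpubr")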

library("ggpubr")

p <- ggboxplot(my_data, x = "Gender", y = "Hours", color = "Gender", palette = c("#00AFBB", "#E7B800"), add = "jitter", short.panel.labs = FALSE)

# adding p-value

p + stat_compare_means(method = "t.test", label.y=10)

# whether they follow normal distribution or not

shapiro.test(time_male)

shapiro.test(time_female)

# whether they have equal variances or not

res.ftest <- var.test(time_male, time_female)

res.ftest

# T-test, if they have equal variances

res <- t.test(time_male, time_female, var.equal = TRUE)

res

# printing the p-value

res$p.value

# Visualization

ggdensity(my_data, x="Hours", add = "mean", rug = TRUE, color = "Gender", fill = "Gender", palette = c("#00AFBB", "#E7B800"))

# Welch t-test, which does not assume equal variances and can always be used

res_welch <- t.test(time_male, time_female, var.equal = FALSE)

res_welch

# printing the p-value

res_welch$p.value
