R: How to Conduct a Two-Sample T-test

2021-02-20 森林語言學筆記

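Introduction: Based on a case study, this article shows how to use R to run a two-sample t-test to decide whether the means of two unrelated populations differ significantly. Using the time that 30 male and 30 female students (the samples) at South China Normal University spend online each day, we test whether all male and all female students at the university (the populations) differ significantly in time spent online. Topics covered: visualizing the data with box plots, the null hypothesis (H0), the alternative hypothesis (H1), checking normality with the Shapiro-Wilk test, checking equality of variances with the F-test, comparing the p-value with the 0.05 significance level, and the t-test (equal variances) and Welch t-test (unequal variances).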

 

The purpose of this article is not to explain everything behind the t-test. Instead, I offer a simplified way to use the tool, based on a case study, that readers without a statistics background can follow. Focus on the big picture rather than the details. I recommend copying the code and running it in RStudio to check that you get the same results. Lastly, the t-test is more complicated than what I describe below; readers who are interested in it should consult further reading.

 

Vocabulary


population: 總體
sample: 樣本
mean: 均值
normal distribution: 正態分布
variance: 方差
significant difference: 顯著性差異
hypothesis: 假設

A two-sample t-test is used to compare the means of two unrelated groups. Let's put the definition aside and look at a case study.

 

Case study

 

You want to know whether male and female students at SCNU differ in the amount of time they spend online each day. The only way to be 100% sure of the answer is to collect data from every student at the university (the populations) and compare the two means. In reality, such a survey is not practical. However, you can take a random sample from each population and, by comparing the sample means, make an inference with some degree of certainty.

 

Sample data

 

# hours that 30 male students at SCNU spend online each day

time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)

 

# hours that 30 female students at SCNU spend online each day

time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)
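
Before plotting, it can help to look at the two sample means directly. The following is a minimal sketch using only base R functions (`length`, `mean`, `sd`); the variable name `mean_diff` is introduced here for illustration and does not appear in the rest of the article.

```r
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

# sample size of each group
length(time_male)
length(time_female)

# sample mean and standard deviation of each group
mean(time_male)
mean(time_female)
sd(time_male)
sd(time_female)

# observed difference between the two sample means (male minus female)
mean_diff <- mean(time_male) - mean(time_female)
mean_diff
```

The t-test below asks whether a difference of this size could plausibly arise by chance if the two population means were equal.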

Visualization

 

We can draw a box plot to visualize the data.

 

# Create a data frame

my_data <- data.frame(Gender = rep(c("Male", "Female"), each = 30),  Hours = c(time_male, time_female) )

 

# Visualize the data using box plot

install.packages("ggpubr")

library("ggpubr")

ggboxplot(my_data, x = "Gender", y = "Hours", color = "Gender", palette = c("#00AFBB", "#E7B800"), las = 1, ylab = "Hours", xlab = "Gender")

 

Hypotheses

 

In order to answer the research question, we need to form two hypotheses.

 

H0: No difference in mean / the difference is due to chance

H1: The two means are different / the difference is significant

 

In statistics, "significant" means that the observed difference is unlikely to be due to chance alone.

 

However, before we conduct the t-test, we need to know (1) whether the two sets of data follow a normal distribution; and (2) whether the variances of the two sets of data are equal.

 

(1) Normal distribution

 

The Shapiro-Wilk test examines whether the data follow a normal distribution.

[Figure: the bell-shaped curve of a normal distribution]

 

 

# Shapiro-Wilk test

shapiro.test(time_male)

shapiro.test(time_female)

The two p-values are 0.3331 and 0.248, respectively. Since both are greater than the significance level of 0.05, the two sets of data are not significantly different from a normal distribution. In other words, the test gives us no evidence against normality, so we proceed on the assumption that the data are normally distributed.
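
The comparison with 0.05 can also be done in code rather than by reading the printed output. This is a small sketch, assuming the same two vectors as above; `alpha`, `p_male` and `p_female` are names chosen here for illustration. `shapiro.test()` returns a list whose `p.value` element holds the p-value.

```r
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

# significance level used throughout the article
alpha <- 0.05

# extract the p-value of each Shapiro-Wilk test
p_male <- shapiro.test(time_male)$p.value
p_female <- shapiro.test(time_female)$p.value

# TRUE means "no significant departure from normality at the chosen level"
c(male = p_male > alpha, female = p_female > alpha)
```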

 

(2) Variances

 

Variance measures how spread out the numbers in a set are around their mean. If the variances of the two groups are equal, we use the classic t-test. If not, we use the Welch t-test.
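
To make the idea concrete, the sample variance can be computed by hand as the sum of squared deviations from the mean divided by n - 1, which is the definition that R's built-in `var()` uses. A minimal sketch; `n_male` and `var_male_manual` are illustrative names.

```r
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

# sample variance by hand: squared deviations from the mean, divided by n - 1
n_male <- length(time_male)
var_male_manual <- sum((time_male - mean(time_male))^2) / (n_male - 1)
var_male_manual

# the built-in function gives the same value
var(time_male)

# variance of the female sample, for comparison
var(time_female)
```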

 

F-test (optional)

 

We can conduct an F-test to compare the variances of the two samples.

 

# F-test to compare the variances

res.ftest <- var.test(time_male, time_female)

res.ftest

 

The p-value is 0.111. It is greater than 0.05, which implies that the variances are not significantly different. Therefore, we can use the classic t-test, which assumes equal variances.
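
For readers curious about where these numbers come from: the F statistic is simply the ratio of the two sample variances, and the two-sided p-value comes from the F distribution with n1 - 1 and n2 - 1 degrees of freedom. The sketch below reproduces the result of `var.test()` by hand using base R's `pf()`; `f_manual`, `p_lower` and `p_manual` are illustrative names.

```r
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

res.ftest <- var.test(time_male, time_female)

# the F statistic is the ratio of the two sample variances
f_manual <- var(time_male) / var(time_female)

# degrees of freedom: sample size minus one for each group
df1 <- length(time_male) - 1
df2 <- length(time_female) - 1

# two-sided p-value: twice the smaller tail probability
p_lower <- pf(f_manual, df1, df2)
p_manual <- 2 * min(p_lower, 1 - p_lower)

# manual values next to the ones reported by var.test()
c(F = f_manual, p = p_manual)
res.ftest$statistic
res.ftest$p.value
```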

 

T-test 

 

# T-test 

res <- t.test(time_male, time_female, var.equal = TRUE)

res

The p-value is 0.007992. Since it is less than 0.05, we reject H0 in favor of H1. We therefore conclude that there is a significant difference in the amount of time that male and female students at SCNU spend online each day.
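
The t statistic behind this p-value can be rebuilt step by step. The sketch below follows the textbook pooled-variance formula and compares the result with `t.test()`; the names `pooled_var`, `se_diff`, `t_manual` and `p_manual` are introduced here for illustration.

```r
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

res <- t.test(time_male, time_female, var.equal = TRUE)

n1 <- length(time_male)
n2 <- length(time_female)

# pooled variance: weighted average of the two sample variances
pooled_var <- ((n1 - 1) * var(time_male) + (n2 - 1) * var(time_female)) / (n1 + n2 - 2)

# standard error of the difference in means
se_diff <- sqrt(pooled_var * (1 / n1 + 1 / n2))

# t statistic: difference in means divided by its standard error
t_manual <- (mean(time_male) - mean(time_female)) / se_diff

# two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = n1 + n2 - 2)

# manual values next to the ones reported by t.test()
c(t = t_manual, p = p_manual)
res$statistic
res$p.value
```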

 

Welch t-test 

 

However, you do not have to run the F-test. You can simply assume unequal variances without comparing them and use the Welch t-test, which is generally considered the safer choice.

 

# Welch t-test to compare the means of two samples

res_welch <- t.test(time_male, time_female, var.equal = FALSE)

res_welch 

Again, we focus on the p-value, which is 0.008172. Since it is less than the significance level of 0.05, we have reason to reject H0 in favor of H1.
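
The Welch test differs from the classic test in two ways: it does not pool the variances, and it adjusts the degrees of freedom with the Welch-Satterthwaite formula. A sketch of that calculation, checked against `t.test()`; `se2_male`, `se2_female`, `t_welch`, `df_welch` and `p_welch` are illustrative names.

```r
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

res_welch <- t.test(time_male, time_female, var.equal = FALSE)

n1 <- length(time_male)
n2 <- length(time_female)

# squared standard error of each group mean (variances are NOT pooled)
se2_male <- var(time_male) / n1
se2_female <- var(time_female) / n2

# Welch t statistic
t_welch <- (mean(time_male) - mean(time_female)) / sqrt(se2_male + se2_female)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (se2_male + se2_female)^2 /
  (se2_male^2 / (n1 - 1) + se2_female^2 / (n2 - 1))

# two-sided p-value
p_welch <- 2 * pt(-abs(t_welch), df = df_welch)

# manual values next to the ones reported by t.test()
c(t = t_welch, df = df_welch, p = p_welch)
res_welch$statistic
res_welch$parameter
res_welch$p.value
```

Because both groups have the same size here, the Welch t statistic equals the pooled one; only the degrees of freedom, and hence the p-value, change slightly.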

 

If you are not sure whether the variances are equal, use the Welch t-test. Then you only need to check whether the data follow a normal distribution before running the test. Finally, you draw your conclusion from the p-value, which is pretty easy!
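
The simplified workflow just described (check normality, run the Welch t-test, compare the p-value with 0.05) could be wrapped in a small helper. Note that `compare_two_means()` is a hypothetical function written for this article, not part of base R or ggpubr, and it is only a sketch of the decision steps, not a complete analysis.

```r
time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)
time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

# hypothetical helper (an assumption of this sketch, not an existing R function)
compare_two_means <- function(x, y, alpha = 0.05) {
  # step 1: Shapiro-Wilk test on each sample
  normal_ok <- shapiro.test(x)$p.value > alpha &&
    shapiro.test(y)$p.value > alpha
  if (!normal_ok) {
    warning("At least one sample departs from normality; a t-test may not be appropriate.")
  }

  # step 2: Welch t-test, which does not assume equal variances
  res <- t.test(x, y, var.equal = FALSE)

  # step 3: compare the p-value with the significance level
  list(
    normal_ok = normal_ok,
    p_value = res$p.value,
    reject_h0 = res$p.value < alpha
  )
}

result <- compare_two_means(time_male, time_female)
result
```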

Statistics lets us draw inferences about a population from a representative sample of that population. That is exactly what we have done in this case study: from the amount of time that 30 male and 30 female students at SCNU spend online each day, we used a t-test to infer whether all male and all female students at SCNU differ significantly in the time they spend online.

Code 

 

# hours that 30 male students at SCNU spend online each day

time_male <- c(5, 6, 9, 6, 6, 10, 7, 4, 5, 5, 9, 7, 7, 6, 5, 10, 7, 2, 8, 5, 4, 6, 4, 5, 5, 3, 8, 7, 4, 9)

 

# hours that 30 female students at SCNU spend online each day

time_female <- c(5, 4, 6, 6, 6, 6, 6, 5, 4, 4, 4, 4, 3, 8, 7, 3, 4, 4, 6, 5, 5, 5, 5, 7, 4, 7, 2, 6, 3, 2)

 

# Create a data frame

my_data <- data.frame(Gender = rep(c("Male", "Female"), each = 30),  Hours = c(time_male, time_female) )

 

# Visualize the data using box plot

install.packages("ggpubr")

library("ggpubr")

p <- ggboxplot(my_data, x="Gender", y = "Hours", color = "Gender", palette = c("#00AFBB", "#E7B800"), las = 1, add = "jitter",  short.panel.labs = FALSE)

# adding p-value

p + stat_compare_means(method = "t.test", label.y=10)

# check whether each sample follows a normal distribution

shapiro.test(time_male)

shapiro.test(time_female)

 

# check whether the two samples have equal variances

res.ftest <-  var.test(time_male, time_female)

res.ftest

# T-test, if they have equal variances

res <- t.test(time_male, time_female, var.equal = TRUE)

res

# printing the p-value

res$p.value

# Visualization

ggdensity(my_data, x="Hours", add = "mean", rug = TRUE, color = "Gender", fill = "Gender", palette = c("#00AFBB", "#E7B800"))

OR

 

# Welch t-test, which can be used whether or not the variances are equal

res_welch <- t.test(time_male, time_female, var.equal = FALSE)

res_welch

# printing the p-value

res_welch$p.value
