本文主要學習和介紹:
比較兩組或者多組數據的平均值
然後自動添加P值或者顯著性水平到ggplot的各種圖形中(such as box plots, dot plots, bar plots and line plots …)
一、比較均值的方法The standard methods to compare the means of two or more groups in R, have been largely described at: comparing means in R.
常用的幾種比較方法如下:
MethodsR functionDescriptionT-testt.test()Compare two groups (parametric)Wilcoxon testwilcox.test()Compare two groups (non-parametric)ANOVAaov() or anova()Compare multiple groups (parametric)Kruskal-Walliskruskal.test()Compare multiple groups (non-parametric)如何使用這些方法進行計算和對結果進行解釋,可以參考以下連結:
Comparing one-sample mean to a standard known mean:
One-Sample T-test (parametric)
One-Sample Wilcoxon Test (non-parametric)
Comparing the means of two independent groups:
Unpaired Two Samples T-test (parametric)
Unpaired Two-Samples Wilcoxon Test (non-parametric)
Comparing the means of paired samples:
Paired Samples T-test (parametric)
Paired Samples Wilcoxon Test (non-parametric)
Comparing the means of more than two groups
Analysis of variance (ANOVA, parametric):
One-Way ANOVA Test in R
Two-Way ANOVA Test in R
Kruskal-Wallis Test in R (non parametric alternative to one-way ANOVA)
二、R中添加P值的函數介紹這裡主要使用的是ggpubr包的的兩個函數:
1. compare_means函數compare_means函數用於比較輸出一組或者多組的均值,使用不同的方法
compare_means(formula, data, method = "wilcox.test", paired = FALSE,
group.by = NULL, ref.group = NULL, ...)
formula: 畫圖的公式y ~ x。
a formula of the form x ~ group, where x is a numeric variable and group is a factor with one or multiple levels. For example, formula = TP53 ~ cancer_group. It’s also possible to perform the test for multiple response variables at the same time. For example, formula = c(TP53, PTEN) ~ cancer_group.
data: 使用的數據。
a data.frame containing the variables in the formula.
method: 使用的檢驗P值的方法。the type of test. Default is 「wilcox.test」. Allowed values include:
「t.test」 (parametric) and 「wilcox.test」" (non-parametric). Perform comparison between two groups of samples. If the grouping variable contains more than two levels, then a pairwise comparison is performed.
「anova」 (parametric) and 「kruskal.test」 (non-parametric). Perform one-way ANOVA test comparing multiple groups.
paired: 是否做配對檢驗。
a logical indicating whether you want a paired test. Used only in t.test and in wilcox.test.
group.by: 指定分組做P值檢驗。variables used to group the data set before applying the test. When specified the mean comparisons will be performed in each subset of the data formed by the different levels of the group.by variables.
ref.group: 比較均值時指定參考的組,就像比較case-control的數據時,指定control組為對照一樣,指定一個組做為參照。
a character string specifying the reference group. If specified, for a given grouping variable, each of the group levels will be compared to the reference group (i.e. control group). ref.group can be also 「.all.」. In this case, each of the grouping variable levels is compared to all (i.e. base-mean). 這裡的base-mean指的就是所有組的平均值
2.stat_compare_means函數功能與上一個函數相似,只是這個函數用於在ggplot圖形中添加平均值比較的P值或者顯著性水平
stat_compare_means(mapping = NULL, comparisons = NULL hide.ns = FALSE,
label = NULL, label.x = NULL, label.y = NULL, ...)
mapping: 和ggplot函數一樣指定aes()映射。
Set of aesthetic mappings created by aes().
comparisons: 由兩個向量組成的列表,這兩個向量就是要對比組的名字或者index值。
A list of length-2 vectors. The entries in the vector are either the names of 2 values on the x-axis or the 2 integers that correspond to the index of the groups of interest, to be compared.
hide.ns: 邏輯值,默認為FALSE。是否顯示不顯著時的符號ns。
logical value. If TRUE, hide ns symbol when displaying significance levels.
label: 指定標籤的類型,可以是P值(「p.format」)或者是顯著性水平(「p.signif」)(用*號和ns表示)。
character string specifying label type. Allowed values include 「p.signif」 (shows the significance levels), 「p.format」 (shows the formatted p value).
label.x,label.y: 指定label的坐標位置。
numeric values. coordinates (in data units) to be used for absolute positioning of the label. If too short they will be recycled.
…: 其他的參數。
other arguments passed to the function compare_means() such as method, paired, ref.group.
使用boxplot演示添加P值和顯著性水平1. 準備# 加載包
# library(tidyverse) #非必須
library(ggpubr)
# 示例數據:ToothGrowth
data("ToothGrowth")
head(ToothGrowth)
head(ToothGrowth) %>% knitr::kable(format = "markdown")
compare_means(len ~ supp, data = ToothGrowth) %>% knitr::kable()
Returned value is a data frame with the following columns:
.y.: the y variable used in the test.
p: the p-value
p.adj: the adjusted p-value. Default value for p.adjust.method = 「holm」
p.format: the formatted p-value
p.signif: the significance level.
method: the statistical test used to compare groups.
Create a box plot with p-values:
p <- ggboxplot(data = ToothGrowth,
x = "supp", y = "len",
color = "supp", palette = "jco",
add = "jitter")
# add = "jitter"參數添加箱線圖的數值點
# 添加p值
p + stat_compare_means()
# 再畫一幅,這次改變P值的統計方法
p + stat_compare_means(method = "t.test")
顯示P值的標籤位置可以通過如下參數來調整:label.x, label.y。
顯示P值的標籤默認為 compare_means() 返回值中的 method 和 p 的組合。也可以通過 aes() 函數指定為其他顯示形式。例如:
For example,
aes(label = ..p.format..) or aes(label = paste0(「p =」, ..p.format..)): display only the formatted p-value (without the method name)
aes(label = ..p.signif..): display only the significance level.
aes(label = paste0(..method.., 「\n」, 「p =」, ..p.format..)): Use line break (「\n」) between the method name and the p-value.
如:
p + stat_compare_means(aes(label = paste0(..method..,": ","p=",..p.format..)),
label.x = 1.5, label.y = 40, method = "t.test")
# 單獨顯示字符型向量的顯著水平
p + stat_compare_means(label = "p.signif", label.x = 1.5, label.y = 40)
## 單獨顯示P值
p + stat_compare_means(label = "p.format",
label.x = 1.5, label.y = 40, method = "t.test")
# 使用aes映射也可以達到同樣的效果,此時只有P值,需要連接P=
p + stat_compare_means(aes(label = paste("p = ",..p.format..)),
label.x = 1.5, label.y = 40, method = "t.test")
compare_means(len ~ supp, data = ToothGrowth, paired = TRUE)
# 使用ggpaired函數可視化配對數據
ggpaired(data = ToothGrowth,
x = "supp", y = "len",
color = "supp", line.color = "gray",
line.size = 0.4, palette = "jco") +
stat_compare_means(paired = T)
1. 全局比較
# 使用anova的方法, 全局比較
compare_means(len ~ dose, data = ToothGrowth, method = "anova")
# Default method = "kruskal.test" for multiple groups
ggboxplot(ToothGrowth, x = "dose", y = "len",
color = "dose", palette = "jco")+
stat_compare_means()
# Change method to anova
ggboxplot(ToothGrowth, x = "dose", y = "len",
color = "dose", palette = "jco")+
stat_compare_means(method = "anova")
2. 成對比較
如果分組超過兩個,默認將所有組兩兩比較, 默認方法為wilcox.test, 可以改為t.test
compare_means(len ~ dose, data = ToothGrowth)
# 可視化所指定的比較
# 指定比較的組,以下3組都指定
my_comparisons <- list( c("0.5", "1"), c("1", "2"), c("0.5", "2") )
ggboxplot(data = ToothGrowth,
x = "dose", y = "len",
color = "dose", palette = "jco") +
stat_compare_means(comparisons = my_comparisons) + # 指定成對比較, 並指定P值位置
stat_compare_means(label.y = 50) # 指定P值位置的這句代碼不能和上面的同一個函數的代碼合併,合併之後P值比價會顯示在一條線段上
# 將分組比較的P值label使用label.y指定位置
ggboxplot(ToothGrowth, x = "dose", y = "len",
color = "dose", palette = "jco")+
stat_compare_means(comparisons = my_comparisons, label.y = c(29, 35, 40))
# 去掉全局比較的數值
3. 指定一個參考組的兩兩比較
# Pairwise comparison against reference
compare_means(len ~ dose, data = ToothGrowth,
# 指定dose中0.5的組作為參考組
ref.group = "0.5",
method = "t.test") %>%
knitr::kable(format = "markdown")
# Visualize
ggboxplot(ToothGrowth, x = "dose", y = "len",
color = "dose", palette = "jco")+
stat_compare_means(method = "anova", label.y = 40)+ # 全局比較,使用anova方法
stat_compare_means(label = "p.signif", method = "t.test",
ref.group = "0.5") # 指定配對比價的refer group,比較對象為0.5的組
4. 相對於所有組的均值(base-mean)的多重兩兩比較的檢驗
# Comparison of each group against base-mean
compare_means(len ~ dose, data = ToothGrowth, ref.group = ".all.",
method = "t.test")
# .all.參數比較其中一組與所有的組的均值(base-mean),在組比較多時,比較方便
ggboxplot(data = ToothGrowth,
x = "dose", y = "len",
color = "dose", palette = "jco")+
stat_compare_means(method = "kruskal.test", label.y = 40)+
stat_compare_means(label = "p.format", method = "t.test", ref.group = ".all.")
下面使用骨髓瘤的數據來說明和所有樣本的均值(.all.)比較在一些典型的情況中是很有用的。
將依據患者molecular進行分組,繪製各個組別DEPDC1基因的表達水平。目的是比較各個組別之間是否存在差別,如果有差別,差別又在哪裡。
要回答以上問題,可以在所有7個組之間進行兩兩比較(pairwise comparison)。由於組別較多,將導致很多種組別組合,這將很難解釋。
一個簡單的解決辦法是將7組中的每一組與「.all.」(如base-mean)比較。當檢驗結果顯著時,可以得出結論:與所有組相比,xxx組的DEPDC1的表達水平顯著降低或顯著升高。
# Load myeloma data from GitHub
myeloma <- read.delim("https://raw.githubusercontent.com/kassambara/data/master/myeloma.txt")
# Perform the test
compare_means(DEPDC1 ~ molecular_group, data = myeloma,
ref.group = ".all.", method = "t.test")
head(myeloma)
# 可視化
# legend = "none"參數去掉圖形上方顯示的每種顏色的box代表哪種分組的信息
#
ggboxplot(myeloma, x = "molecular_group", y = "DEPDC1", color = "molecular_group",
add = "jitter", legend = "none") +
rotate_x_text(angle = 45)+
geom_hline(yintercept = mean(myeloma$DEPDC1), linetype = 2)+ # Add horizontal line at base mean
stat_compare_means(method = "anova", label.y = 1600)+ # Add global annova p-value
stat_compare_means(label = "p.signif", method = "t.test",
ref.group = ".all.") # Pairwise comparison against all, 此處添加hide.ns = TRUE參數,將隱藏不顯著的ns符號
From the plot above, we can conclude that DEPDC1 is significantly overexpressed in proliferation group and, it’s significantly downexpressed in Hyperdiploid and Low bone disease compared to all.
5. 多個分組變量compare_means(len ~ supp, data = ToothGrowth,
group.by = "dose")
Visualize (1/2). Create a multi-panel box plots facetted by group (here, 「dose」):
# 多個小圖facet
p <- ggboxplot(data = ToothGrowth,
x = "supp", y = "len",
color = "supp", palette = "jco",
add = "jitter",
facet.by = "dose",
short.panel.labs = F) # short.panel.labs默認為T,改為F顯示分組dose:
# 僅顯示P值,去除方法
p + stat_compare_means(label = "p.format")
# 或者顯示顯著性符號,而不是直接顯示P值
p + stat_compare_means(label = "p.signif", label.x = 1.5)
# To hide the 『ns』 symbol, use the argument hide.ns = TRUE.
Visualize (2/2). Create one single panel with all box plots. Plot y = 「len」 by x = 「dose」 and color by 「supp」:
# 將所有小圖組合放在一張圖裡
p <- ggboxplot(ToothGrowth, x = "dose", y = "len",
color = "supp", palette = "jco",
add = "jitter") # 去掉facet.by參數
p + stat_compare_means(aes(group = supp))
# 使用aes映射時,列名不能加雙引號
# 只顯示P值
p + stat_compare_means(aes(group = supp),label = "p.format")
# 單純的只顯示P的數值
p + stat_compare_means(aes(group = supp, label = paste(..p.format..)))
# 只顯示顯著性水平
p + stat_compare_means(aes(group = supp), label = "p.signif")
compare_means(len ~ supp, data = ToothGrowth,
group.by = "dose", paired = TRUE) # 增加paired參數
相比於非配對P檢驗,配對P檢驗的P值更大
# Box plot facetted by "dose"
p <- ggpaired(ToothGrowth, x = "supp", y = "len",
color = "supp", palette = "jco",
line.color = "gray", line.size = 0.4,
facet.by = "dose", short.panel.labs = FALSE)
# Use only p.format as label. Remove method name.
p + stat_compare_means(label = "p.format", paired = TRUE)
# 條形圖
ggbarplot(data = ToothGrowth, x = "dose", y = "len", add = "mean_se",
color = "dose",palette="jco", legend="none")+
stat_compare_means()+ # 總體比較
stat_compare_means(ref.group = "0.5", label = "p.signif", label.y = c(22,28))
# Line plot of mean +/-se
ggline(ToothGrowth, x = "dose", y = "len", add = "mean_se",
color = "black", linetype = 2)+
stat_compare_means() + # Global p-value
stat_compare_means(ref.group = "0.5", label = "p.signif",
label.y = c(22, 29))
# 分組條形圖
# position = position_dodge()調節supp分組的OJ組和VC組兩組樣本的柱狀圖的距離,默認值為重合(0)
ggbarplot(ToothGrowth, x = "dose", y = "len", add = "mean_se",
color = "supp", palette = "jco",
position = position_dodge(0.8))+
stat_compare_means(aes(group = supp), label = "p.signif", label.y = 29)
# 分組折線圖
ggline(ToothGrowth, x = "dose", y = "len", add = "mean_se",
color = "supp", palette = "jco")+
stat_compare_means(aes(group = supp), label = "p.signif",
label.y = c(16, 25, 29))
參考:
數據可視化——R語言為ggplot圖形添加P值和顯著性水平
ggpubr: Publication Ready Plots Add P-values and Significance Levels to ggplots
查看原文獲取更佳的排版顯示內容