本文內容來自《R 語言實戰》(R in Action, 2nd),有部分修改
描述性統計分析mtcars 數據集中的三個連續變量
mpg:每加侖汽油行駛英裡數
hp:馬力
wt:車重
selected_variables <- c("mpg", "hp", "wt")
head(mtcars[selected_variables])
mpg hp wt
Mazda RX4 21.0 110 2.620
Mazda RX4 Wag 21.0 110 2.875
Datsun 710 22.8 93 2.320
Hornet 4 Drive 21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant 18.1 105 3.460
兩個分類變量
am:變速箱類型
cyl:汽缸數
方法雲集summary(mtcars[selected_variables])
mpg hp wt
Min. :10.40 Min. : 52.0 Min. :1.513
1st Qu.:15.43 1st Qu.: 96.5 1st Qu.:2.581
Median :19.20 Median :123.0 Median :3.325
Mean :20.09 Mean :146.7 Mean :3.217
3rd Qu.:22.80 3rd Qu.:180.0 3rd Qu.:3.610
Max. :33.90 Max. :335.0 Max. :5.424
sapply() 函數格式
sapply(x, FUN, options)
fivenum() 返回圖基五數總括 (Tukey’s five-number summary)
sapply(
mtcars[selected_variables],
fivenum
)
mpg hp wt
[1,] 10.40 52 1.5130
[2,] 15.35 96 2.5425
[3,] 19.20 123 3.3250
[4,] 22.80 180 3.6500
[5,] 33.90 335 5.4240
自定義統計函數
在統計學中,峰度(Kurtosis)衡量實數隨機變量概率分布的峰態。峰度高就意味著方差增大是由低頻度的大於或小於平均值的極端差值引起的。
– wiki
在概率論和統計學中,偏度衡量實數隨機變量概率分布的不對稱性。
– wiki
my_stats <- function(x, na.omit=TRUE) {
if(na.omit) {
x <- x[!is.na(x)]
}
m <- mean(x)
n <- length(x)
s <- sd(x)
skew <- sum((x-m)^3/s^3)/n # 偏度 Skewness
kurt <- sum((x-m)^4/s^4)/n - 3 # 峰度 Kurtosis
return (
c(
n=n,
mean=m,
stdev=s,
skew=skew,
kurtosis=kurt
)
)
}
sapply(mtcars[selected_variables], my_stats)
mpg hp wt
n 32.000000 32.0000000 32.00000000
mean 20.090625 146.6875000 3.21725000
stdev 6.026948 68.5628685 0.97845744
skew 0.610655 0.7260237 0.42314646
kurtosis -0.372766 -0.1355511 -0.02271075
Hmisc 包中的 describe() 函數
library(Hmisc)
describe(mtcars[selected_variables])
mtcars[selected_variables]
3 Variables 32 Observations
mpg
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
32 0 25 0.999 20.09 6.796 12.00 14.34 15.43 19.20 22.80
.90 .95
30.09 31.30
lowest : 10.4 13.3 14.3 14.7 15.0, highest: 26.0 27.3 30.4 32.4 33.9
hp
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
32 0 22 0.997 146.7 77.04 63.65 66.00 96.50 123.00 180.00
.90 .95
243.50 253.55
lowest : 52 62 65 66 91, highest: 215 230 245 264 335
wt
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
32 0 29 0.999 3.217 1.089 1.736 1.956 2.581 3.325 3.610
.90 .95
4.048 5.293
lowest : 1.513 1.615 1.835 1.935 2.140, highest: 3.845 4.070 5.250 5.345 5.424
pastecs 包中的 stat.desc() 函數
library(pastecs)
stat.desc(mtcars[selected_variables])
mpg hp wt
nbr.val 32.0000000 32.0000000 32.0000000
nbr.null 0.0000000 0.0000000 0.0000000
nbr.na 0.0000000 0.0000000 0.0000000
min 10.4000000 52.0000000 1.5130000
max 33.9000000 335.0000000 5.4240000
range 23.5000000 283.0000000 3.9110000
sum 642.9000000 4694.0000000 102.9520000
median 19.2000000 123.0000000 3.3250000
mean 20.0906250 146.6875000 3.2172500
SE.mean 1.0654240 12.1203173 0.1729685
CI.mean.0.95 2.1729465 24.7195501 0.3527715
var 36.3241028 4700.8669355 0.9573790
std.dev 6.0269481 68.5628685 0.9784574
coef.var 0.2999881 0.4674077 0.3041285
psych 包中的 describe() 函數
library(psych)
註:可以使用 包名::函數名 的方式訪問被覆蓋的函數
psych::describe(mtcars[selected_variables])
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 32 20.09 6.03 19.20 19.70 5.41 10.40 33.90 23.50 0.61 -0.37 1.07
hp 2 32 146.69 68.56 123.00 141.19 77.10 52.00 335.00 283.00 0.73 -0.14 12.12
wt 3 32 3.22 0.98 3.33 3.15 0.77 1.51 5.42 3.91 0.42 -0.02 0.17
aggregate() 函數 by 參數
aggregate(
mtcars[selected_variables],
by=list(am=mtcars$am),
mean
)
am mpg hp wt
1 0 17.14737 160.2632 3.768895
2 1 24.39231 126.8462 2.411000
aggregate(
mtcars[selected_variables],
by=list(am=mtcars$am),
sd
)
am mpg hp wt
1 0 3.833966 53.90820 0.7774001
2 1 6.166504 84.06232 0.6169816
by() 函數
dstats <- function(x) {
return (sapply(x, my_stats))
}
by(
mtcars[selected_variables],
mtcars$am,
dstats
)
mtcars$am: 0
mpg hp wt
n 19.00000000 19.00000000 19.0000000
mean 17.14736842 160.26315789 3.7688947
stdev 3.83396639 53.90819573 0.7774001
skew 0.01395038 -0.01422519 0.9759294
kurtosis -0.80317826 -1.20969733 0.1415676
---
mtcars$am: 1
mpg hp wt
n 13.00000000 13.0000000 13.0000000
mean 24.39230769 126.8461538 2.4110000
stdev 6.16650381 84.0623243 0.6169816
skew 0.05256118 1.3598859 0.2103128
kurtosis -1.45535200 0.5634635 -1.1737358
doBy 包中的 summaryBy() 函數
var1 + var2 + var3 + … + varN ~ groupvar1 + groupvar2 + … + groupvarN
library(doBy)
summaryBy(
mpg + hp + wt ~ am,
data=mtcars,
FUN=my_stats
)
am mpg.n mpg.mean mpg.stdev mpg.skew mpg.kurtosis hp.n hp.mean hp.stdev hp.skew hp.kurtosis
1 0 19 17.14737 3.833966 0.01395038 -0.8031783 19 160.2632 53.90820 -0.01422519 -1.2096973
2 1 13 24.39231 6.166504 0.05256118 -1.4553520 13 126.8462 84.06232 1.35988586 0.5634635
wt.n wt.mean wt.stdev wt.skew wt.kurtosis
1 19 3.768895 0.7774001 0.9759294 0.1415676
2 13 2.411000 0.6169816 0.2103128 -1.1737358
psych 包中的 describeBy() 函數
describeBy(
mtcars[selected_variables],
list(am=mtcars$am)
)
Descriptive statistics by group
am: 0
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 19 17.15 3.83 17.30 17.12 3.11 10.40 24.40 14.00 0.01 -0.80 0.88
hp 2 19 160.26 53.91 175.00 161.06 77.10 62.00 245.00 183.00 -0.01 -1.21 12.37
wt 3 19 3.77 0.78 3.52 3.75 0.45 2.46 5.42 2.96 0.98 0.14 0.18
---
am: 1
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 13 24.39 6.17 22.80 24.38 6.67 15.00 33.90 18.90 0.05 -1.46 1.71
hp 2 13 126.85 84.06 109.00 114.73 63.75 52.00 335.00 283.00 1.36 0.56 23.31
wt 3 13 2.41 0.62 2.32 2.39 0.68 1.51 3.57 2.06 0.21 -1.17 0.17
直方圖
密度圖
箱線圖
點圖
頻數表和列聯表使用 vcd 包中的 Arthritis 數據集
library(vcd)
head(Arthritis)
ID Treatment Sex Age Improved
1 57 Treated Male 27 Some
2 46 Treated Male 29 None
3 77 Treated Male 30 None
4 17 Treated Male 32 Marked
5 36 Treated Male 46 Marked
6 23 Treated Male 58 Marked
table() 生成頻數統計表
one_d_table <- with(
Arthritis,
table(Improved)
)
one_d_table
Improved
None Some Marked
42 14 28
prop.table() 將頻數統計錶轉為比例值
prop.table(my_table)
Improved
None Some Marked
0.5000000 0.1666667 0.3333333
prop.table(my_table) * 100
Improved
None Some Marked
50.00000 16.66667 33.33333
table(
Arthritis$Treatment, # 行
Arthritis$Improved # 列
)
None Some Marked
Placebo 29 7 7
Treated 13 7 21
xtabs() 函數
my_table <- xtabs(
~ Treatment + Improved,
data=Arthritis
)
my_table
Improved
Treatment None Some Marked
Placebo 29 7 7
Treated 13 7 21
margin.table() 生成邊際頻數
1 表示每行生成一個邊際值,即為第 1 個維度生成邊際值
margin.table(my_table, 1)
Treatment
Placebo Treated
43 41
margin.table(my_table, 2)
Improved
None Some Marked
42 14 28
prop.table() 生成比例
1 表示沿行生成比例,即沿第一個維度計算比例
prop.table(my_table, 1)
Improved
Treatment None Some Marked
Placebo 0.6744186 0.1627907 0.1627907
Treated 0.3170732 0.1707317 0.5121951
prop.table(my_table, 2)
Improved
Treatment None Some Marked
Placebo 0.6744186 0.1627907 0.1627907
Treated 0.3170732 0.1707317 0.5121951
addmargins() 添加邊際和
addmargins(my_table)
Improved
Treatment None Some Marked Sum
Placebo 29 7 7 43
Treated 13 7 21 41
Sum 42 14 28 84
1 表示行,2 表示列。
下面代碼中的 1 表示按行求比例,即每行所有數值加和為 1
2 表示添加列方向的累加和,也就是為每行添加一個累加值
addmargins(
prop.table(
my_table,
1
),
2
)
Improved
Treatment None Some Marked Sum
Placebo 0.6744186 0.1627907 0.1627907 1.0000000
Treated 0.3170732 0.1707317 0.5121951 1.0000000
addmargins(
prop.table(
my_table,
2
),
1
)
Improved
Treatment None Some Marked
Placebo 0.6904762 0.5000000 0.2500000
Treated 0.3095238 0.5000000 0.7500000
Sum 1.0000000 1.0000000 1.0000000
gmodels 包中的 CrossTable() 函數
library(gmodels)
CrossTable(
Arthritis$Treatment,
Arthritis$Improved
)
Cell Contents
||
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
||
Total Observations in Table: 84
| Arthritis$Improved
Arthritis$Treatment | None | Some | Marked | Row Total |
|-|-|-|-|
Placebo | 29 | 7 | 7 | 43 |
| 2.616 | 0.004 | 3.752 | |
| 0.674 | 0.163 | 0.163 | 0.512 |
| 0.690 | 0.500 | 0.250 | |
| 0.345 | 0.083 | 0.083 | |
|-|-|-|-|
Treated | 13 | 7 | 21 | 41 |
| 2.744 | 0.004 | 3.935 | |
| 0.317 | 0.171 | 0.512 | 0.488 |
| 0.310 | 0.500 | 0.750 | |
| 0.155 | 0.083 | 0.250 | |
|-|-|-|-|
Column Total | 42 | 14 | 28 | 84 |
| 0.500 | 0.167 | 0.333 | |
|-|-|-|-|
my_table <- xtabs(
~ Treatment + Sex + Improved,
data=Arthritis
)
my_table
, , Improved = None
Sex
Treatment Female Male
Placebo 19 10
Treated 6 7
, , Improved = Some
Sex
Treatment Female Male
Placebo 7 0
Treated 5 2
, , Improved = Marked
Sex
Treatment Female Male
Placebo 6 1
Treated 16 5
ftable() 以一種更緊湊的形式輸出多維列聯表
ftable(my_table)
Improved None Some Marked
Treatment Sex
Placebo Female 19 7 6
Male 10 0 1
Treated Female 6 5 16
Male 7 2 5
邊際頻數
margin.table(
my_table,
1
)
Treatment
Placebo Treated
43 41
margin.table(
my_table,
2
)
Sex
Female Male
59 25
margin.table(
my_table,
3
)
Improved
None Some Marked
42 14 28
多維邊際頻數
margin.table(
my_table,
c(1, 3)
)
Improved
Treatment None Some Marked
Placebo 29 7 7
Treated 13 7 21
比例
ftable(
prop.table(
my_table,
c(1, 2)
)
)
Improved None Some Marked
Treatment Sex
Placebo Female 0.59375000 0.21875000 0.18750000
Male 0.90909091 0.00000000 0.09090909
Treated Female 0.22222222 0.18518519 0.59259259
Male 0.50000000 0.14285714 0.35714286
ftable(
addmargins(
prop.table(
my_table,
c(1, 2)
),
3 # 為第三維 (Improved) 增加列加和
)
) * 100
Improved None Some Marked Sum
Treatment Sex
Placebo Female 59.375000 21.875000 18.750000 100.000000
Male 90.909091 0.000000 9.090909 100.000000
Treated Female 22.222222 18.518519 59.259259 100.000000
Male 50.000000 14.285714 35.714286 100.000000
卡方檢驗適用於計數數據,可以檢驗數據與預期分布的擬合程度。在統計實踐中,卡方統計量的最常見用法是與 r x c 列聯表一起使用,以評估對變量間獨立性的零假設是否合理。
引自 [1]
chisq.test()
下面示例顯示治療情況和改善情況不獨立 (p值太小)
my_table <- xtabs(
~ Treatment + Improved,
data=Arthritis
)
my_table
Improved
Treatment None Some Marked
Placebo 29 7 7
Treated 13 7 21
chisq.test(my_table)
Pearson's Chi-squared test
data: my_table
X-squared = 13.055, df = 2, p-value = 0.001463
p 值表示從總體中抽取的樣本行變量與列變量是相互獨立的概率。
下面示例顯示性別和改善情況獨立
my_table <- xtabs(
~ Improved + Sex,
data=Arthritis
)
my_table
Sex
Improved Female Male
None 25 17
Some 12 2
Marked 22 6
chisq.test(my_table)
Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: my_table
X-squared = 4.8407, df = 2, p-value = 0.08889
可以實際列出所有可能出現的重排 (置換) 情況及其頻數,進而確定觀測結果的極端程度。這一操作被稱為費舍爾精確檢驗 (Fisher’s exact test)。
引自 [1]
my_table <- xtabs(
~ Treatment + Improved,
data=Arthritis
)
my_table
Improved
Treatment None Some Marked
Placebo 29 7 7
Treated 13 7 21
fisher.test(my_table)
Fisher's Exact Test for Count Data
data: my_table
p-value = 0.001393
alternative hypothesis: two.sided
假設兩個名義變量在第三個變量的每一層中都是條件獨立的。
下面代碼假設不存在三階交互作用 (治療情況 x 改善情況 x 性別)
my_table <- xtabs(
~ Treatment + Improved + Sex,
data=Arthritis
)
ftable(my_table)
Sex Female Male
Treatment Improved
Placebo None 19 10
Some 7 0
Marked 6 1
Treated None 6 7
Some 5 2
Marked 16 5
結果表明,治療與得到的改善在性別的每一水平下並不獨立
mantelhaen.test(my_table)
Cochran-Mantel-Haenszel test
data: my_table
Cochran-Mantel-Haenszel M^2 = 14.632, df = 2, p-value = 0.0006647
vcd 包的 assocstats() 函數
計算二維列聯表的 phi 係數,列聯繫數和 Cramer’s V 係數
my_table <- xtabs(
~ Treatment + Improved,
data=Arthritis
)
my_table
Improved
Treatment None Some Marked
Placebo 29 7 7
Treated 13 7 21
assocstats(my_table)
X^2 df P(> X^2)
Likelihood Ratio 13.530 2 0.0011536
Pearson 13.055 2 0.0014626
Phi-Coefficient : NA
Contingency Coeff.: 0.367
Cramer's V : 0.394
條形圖
馬賽克圖
關聯圖
…
相關相關係數
head(state.x77)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
Pearson 積差相關係數衡量兩個定量變量之間的線性相關程度。
Spearman 等級相關係數衡量分級定序變量之間的相關程度。
Kendall’s Tau 相關係數是一種非參數的等級相關度量。
cov() 計算協方差
cor() 計算相關係數
use:all.obs,everything,complete.obs,pairwise.complete.obs
method: pearson,spearman,kendall
默認參數 use="everything", method="pearson"
states <- state.x77[, 1:6]
cov(states)
Population Income Illiteracy Life Exp Murder HS Grad
Population 19931683.7588 571229.7796 292.8679592 -407.8424612 5663.523714 -3551.509551
Income 571229.7796 377573.3061 -163.7020408 280.6631837 -521.894286 3076.768980
Illiteracy 292.8680 -163.7020 0.3715306 -0.4815122 1.581776 -3.235469
Life Exp -407.8425 280.6632 -0.4815122 1.8020204 -3.869480 6.312685
Murder 5663.5237 -521.8943 1.5817755 -3.8694804 13.627465 -14.549616
HS Grad -3551.5096 3076.7690 -3.2354694 6.3126849 -14.549616 65.237894
Pearson 積差相關係數
cor(states)
Population Income Illiteracy Life Exp Murder HS Grad
Population 1.00000000 0.2082276 0.1076224 -0.06805195 0.3436428 -0.09848975
Income 0.20822756 1.0000000 -0.4370752 0.34025534 -0.2300776 0.61993232
Illiteracy 0.10762237 -0.4370752 1.0000000 -0.58847793 0.7029752 -0.65718861
Life Exp -0.06805195 0.3402553 -0.5884779 1.00000000 -0.7808458 0.58221620
Murder 0.34364275 -0.2300776 0.7029752 -0.78084575 1.0000000 -0.48797102
HS Grad -0.09848975 0.6199323 -0.6571886 0.58221620 -0.4879710 1.00000000
Spearman 等級相關係數
cor(states, method="spearman")
Population Income Illiteracy Life Exp Murder HS Grad
Population 1.0000000 0.1246098 0.3130496 -0.1040171 0.3457401 -0.3833649
Income 0.1246098 1.0000000 -0.3145948 0.3241050 -0.2174623 0.5104809
Illiteracy 0.3130496 -0.3145948 1.0000000 -0.5553735 0.6723592 -0.6545396
Life Exp -0.1040171 0.3241050 -0.5553735 1.0000000 -0.7802406 0.5239410
Murder 0.3457401 -0.2174623 0.6723592 -0.7802406 1.0000000 -0.4367330
HS Grad -0.3833649 0.5104809 -0.6545396 0.5239410 -0.4367330 1.0000000
非方形的相關矩陣
x <- states[, c(
"Population",
"Income",
"Illiteracy",
"HS Grad"
)]
y <- states[, c(
"Life Exp",
"Murder"
)]
cor(x, y)
Life Exp Murder
Population -0.06805195 0.3436428
Income 0.34025534 -0.2300776
Illiteracy -0.58847793 0.7029752
HS Grad 0.58221620 -0.4879710
偏相關是指在控制一個或多個定量變量時,另外兩個定量變量之間的相互關係。
ggm 包中的 pcor() 函數
library(ggm)
colnames(states)
[1] "Population" "Income" "Illiteracy" "Life Exp" "Murder" "HS Grad"
u 中前兩個數值表示計算相關向量的下標,其餘數值為條件變量下標
pcor(u, S)
pcor(
c(1, 5, 2, 3, 6),
cov(states)
)
[1] 0.3462724
polycor 包中的 hetcor() 計算一種混合的相關矩陣
library(polycor)
hetcor(states)
Two-Step Estimates
Correlations/Type of Correlation:
Population Income Illiteracy Life.Exp Murder HS.Grad
Population 1 Pearson Pearson Pearson Pearson Pearson
Income 0.2082 1 Pearson Pearson Pearson Pearson
Illiteracy 0.1076 -0.4371 1 Pearson Pearson Pearson
Life.Exp -0.06805 0.3403 -0.5885 1 Pearson Pearson
Murder 0.3436 -0.2301 0.703 -0.7808 1 Pearson
HS.Grad -0.09849 0.6199 -0.6572 0.5822 -0.488 1
Standard Errors:
Population Income Illiteracy Life.Exp Murder
Population
Income 0.1357
Illiteracy 0.1401 0.1152
Life.Exp 0.1411 0.1257 0.09337
Murder 0.1253 0.1344 0.07247 0.05604
HS.Grad 0.1404 0.08801 0.08129 0.0944 0.1086
n = 50
P-values for Tests of Bivariate Normality:
Population Income Illiteracy Life.Exp Murder
Population
Income 0.003105
Illiteracy 0.004307 0.02102
Life.Exp 0.01916 0.03275 0.1717
Murder 0.1015 0.174 0.2613 0.07242
HS.Grad 0.007223 0.114 0.005472 0.1134 0.02036
常用原假設為變量間不相關(即總體的相關係數為0)
cor.test()
alternative:two.side,less,greater
method:pearson,kendall,spearman
cor.test(states[,3], states[,5])
Pearson's product-moment correlation
data: states[, 3] and states[, 5]
t = 6.8479, df = 48, p-value = 1.258e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5279280 0.8207295
sample estimates:
cor
0.7029752
psych 包中 corr.test() 函數
corr.test(
states,
use="complete"
)
Call:corr.test(x = states, use = "complete")
Correlation matrix
Population Income Illiteracy Life Exp Murder HS Grad
Population 1.00 0.21 0.11 -0.07 0.34 -0.10
Income 0.21 1.00 -0.44 0.34 -0.23 0.62
Illiteracy 0.11 -0.44 1.00 -0.59 0.70 -0.66
Life Exp -0.07 0.34 -0.59 1.00 -0.78 0.58
Murder 0.34 -0.23 0.70 -0.78 1.00 -0.49
HS Grad -0.10 0.62 -0.66 0.58 -0.49 1.00
Sample Size
[1] 50
Probability values (Entries above the diagonal are adjusted for multiple tests.)
Population Income Illiteracy Life Exp Murder HS Grad
Population 0.00 0.59 1.00 1.0 0.10 1
Income 0.15 0.00 0.01 0.1 0.54 0
Illiteracy 0.46 0.00 0.00 0.0 0.00 0
Life Exp 0.64 0.02 0.00 0.0 0.00 0
Murder 0.01 0.11 0.00 0.0 0.00 0
HS Grad 0.50 0.00 0.00 0.0 0.00 0
To see confidence intervals of the correlations, print with the short=FALSE option
其他顯著性檢驗
ggm 包的 pcor.test() 函數
psych 包的 r.test() 函數
相關關係的可視化散點圖和散點圖矩陣
相關圖 (correlogram)
t 檢驗本節關注的變量為連續型的組間比較,假設其呈正態分布
library(MASS)
head(UScrime)
M So Ed Po1 Po2 LF M.F Pop NW U1 U2 GDP Ineq Prob Time y
1 151 1 91 58 56 510 950 33 301 108 41 394 261 0.084602 26.2011 791
2 143 0 113 103 95 583 1012 13 102 96 36 557 194 0.029599 25.2999 1635
3 142 1 89 45 44 533 969 18 219 94 33 318 250 0.083401 24.3006 578
4 136 0 121 149 141 577 994 157 80 102 39 673 167 0.015801 29.9012 1969
5 141 0 121 109 101 591 985 18 30 91 20 578 174 0.041399 21.2998 1234
6 121 0 110 118 115 547 964 25 44 84 29 689 126 0.034201 20.9995 682
Prob:監禁的概率
U1:14-24 歲年齡段城市男性失業率
U2:35-39 歲年齡段城市男性失業率
So:分類變量,是否為南方州
獨立樣本的 t 檢驗t.test(
Prob ~ So,
data=UScrime
)
Welch Two Sample t-test
data: Prob by So
t = -3.8954, df = 24.925, p-value = 0.0006506
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.03852569 -0.01187439
sample estimates:
mean in group 0 mean in group 1
0.03851265 0.06371269
可以拒絕南方各州和非南方各州擁有相同監禁概率的假設 (p < 0.001)
非獨立樣本的 t 檢驗非獨立組設計 dependent groups design
sapply(
UScrime[c("U1", "U2")],
function(x) (c(mean=mean(x), sd=sd(x)))
)
U1 U2
mean 95.46809 33.97872
sd 18.02878 8.44545
with(
UScrime,
t.test(U1, U2, paried=TRUE)
)
Welch Two Sample t-test
data: U1 and U2
t = 21.174, df = 65.261, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
55.69010 67.28862
sample estimates:
mean of x mean of y
95.46809 33.97872
方差分析(NOAA):第 9 章
組間差異的非參數檢驗兩組的比較兩組數據獨立,可以使用 Wilcoxon 秩和檢驗 (Mann-Whitney U 檢驗) 評估觀測是否是從相同的概率分布中抽得的。
with(
UScrime,
by(Prob, So, median)
)
So: 0
[1] 0.038201
--
So: 1
[1] 0.055552
wilcox.test(
Prob ~ So,
data=UScrime
)
Wilcoxon rank sum exact test
data: Prob by So
W = 81, p-value = 8.488e-05
alternative hypothesis: true location shift is not equal to 0
Wilcoxon 符號秩檢驗是非獨立樣本 t 檢驗的一種非參數替代方法,適用於兩組成對數據和無法保證正態性假設的情境。
sapply(
UScrime[c("U1", "U2")],
median
)
U1 U2
92 34
with(
UScrime,
wilcox.test(
U1, U2,
paired=TRUE
)
)
cannot compute exact p-value with ties
Wilcoxon signed rank test with continuity correction
data: U1 and U2
V = 1128, p-value = 2.464e-09
alternative hypothesis: true location shift is not equal to 0
單向設計 one-way design
各組獨立:Kruskal-Wallis 檢驗
各組不獨立:Friedman 檢驗
states <- data.frame(
state.region,
state.x77
)
head(states)
state.region Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Alabama South 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska West 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona West 2212 4530 1.8 70.55 7.8 58.1 15 113417
Arkansas South 2110 3378 1.9 70.66 10.1 39.9 65 51945
California West 21198 5114 1.1 71.71 10.3 62.6 20 156361
Colorado West 2541 4884 0.7 72.06 6.8 63.9 166 103766
kruskal.test(
Illiteracy ~ state.region,
data=states
)
Kruskal-Wallis rank sum test
data: Illiteracy by state.region
Kruskal-Wallis chi-squared = 22.672, df = 3, p-value = 4.726e-05
箱線圖
核密度圖
第 9 章和第 19 章介紹的圖形
參考參考文獻
[1] 面向數據科學家的實用統計學
R 語言實戰
《圖形初階》
《基本數據管理》
《高級數據管理》
《基本圖形》
題圖由 trilemedia 上傳到 Pixabay。