17 Statistical Hypothesis Tests in Python

2021-03-06 SharkFin (沙克芬)

Read the original on Yuque:

https://www.yuque.com/alipayqgthu1irbf/sharkfin/enrzni

Translated from

https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/

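A quick-reference guide to the 17 statistical hypothesis tests that you need in applied machine learning, with sample code in Python.

Although there are hundreds of statistical hypothesis tests that you could use, only a small subset is likely to be needed in a machine learning project.

In this post, you will find a cheat sheet of the most popular statistical hypothesis tests for a machine learning project, with examples using the Python API.

Each statistical test is presented in a consistent way, including:

The name of the test.

What the test is checking.

The key assumptions of the test.

How the test result is interpreted.

The Python API for using the test.

Note that when it comes to assumptions such as the expected distribution of the data or the sample size, the results of a given test are likely to degrade gracefully rather than become immediately unusable if an assumption is violated.

Generally, data samples need to be representative of the domain and large enough to expose their distribution to analysis.

In some cases, the data can be corrected to meet the assumptions. Two examples are correcting a nearly normal distribution to be normal by removing outliers, and using a correction to the degrees of freedom in a statistical test when the samples have differing variances.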
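
As a minimal sketch of the degrees-of-freedom correction, SciPy's `ttest_ind` accepts `equal_var=False`, which switches from Student's t-test to Welch's t-test. The two samples below are made up for illustration and are not taken from the original post.

```python
from scipy.stats import ttest_ind

# Illustrative samples (hypothetical data): the second has a much larger spread
narrow = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, -0.3, 0.1, 0.0, -0.1]
wide = [3.0, -2.5, 4.1, -3.8, 2.2, -1.9, 3.5, -4.0, 1.7, 0.7]

# Student's t-test: assumes both samples share one variance
stat_student, p_student = ttest_ind(narrow, wide)

# Welch's t-test: corrects the degrees of freedom for unequal variances
stat_welch, p_welch = ttest_ind(narrow, wide, equal_var=False)

print('Student: stat=%.3f, p=%.3f' % (stat_student, p_student))
print('Welch:   stat=%.3f, p=%.3f' % (stat_welch, p_welch))
```

Because the two samples have the same size, both variants report the same statistic; only the degrees of freedom, and therefore the p-value, change.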

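Finally, there may be multiple tests for a given concern, such as normality. We cannot get crisp answers to questions with statistics; instead, we get probabilistic answers. As such, we can arrive at different answers to the same question by considering the question in different ways. Hence the need for multiple different tests for some questions we may have about data.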
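
To illustrate, the sketch below runs two normality tests on the same sample. The sample is synthetic (drawn with a fixed seed, an assumption made only for this example); the two tests share a null hypothesis but compute different statistics and p-values.

```python
import numpy as np
from scipy.stats import shapiro, normaltest

# Hypothetical sample: 200 draws from a standard normal distribution, fixed seed
rng = np.random.default_rng(1)
sample = rng.normal(loc=0.0, scale=1.0, size=200)

# Two tests of the same null hypothesis (the sample is Gaussian)
results = {
    'Shapiro-Wilk': shapiro(sample),
    "D'Agostino K^2": normaltest(sample),
}
for name, (stat, p) in results.items():
    verdict = 'Probably Gaussian' if p > 0.05 else 'Probably not Gaussian'
    print('%s: stat=%.3f, p=%.3f -> %s' % (name, stat, p, verdict))
```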

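This tutorial is divided into 5 parts; they are:

1. Normality Tests

Shapiro-Wilk Test

D'Agostino's K^2 Test

Anderson-Darling Test

2. Correlation Tests

Pearson's Correlation Coefficient

Spearman's Rank Correlation

Kendall's Rank Correlation

Chi-Squared Test

3. Stationary Tests

Augmented Dickey-Fuller

Kwiatkowski-Phillips-Schmidt-Shin

4. Parametric Statistical Hypothesis Tests

Student's t-test

Paired Student's t-test

Analysis of Variance Test (ANOVA)

Repeated Measures ANOVA Test

5. Nonparametric Statistical Hypothesis Tests

Mann-Whitney U Test

Wilcoxon Signed-Rank Test

Kruskal-Wallis H Test

Friedman Test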

1. Normality Tests

This section lists statistical tests that you can use to check if your data has a Gaussian distribution.

Shapiro-Wilk Test

Python Code

# Example of the Shapiro-Wilk Normality Test
from scipy.stats import shapiro
data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
stat, p = shapiro(data)
print('stat=%.3f, p=%.3f' % (stat, p))
# H0: the sample was drawn from a Gaussian distribution
if p > 0.05:
    print('Probably Gaussian')
else:
    print('Probably not Gaussian')

D'Agostino's K^2 Test
Python Code
# Example of the D'Agostino's K^2 Normality Test
from scipy.stats import normaltest
data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
stat, p = normaltest(data)
print('stat=%.3f, p=%.3f' % (stat, p))
# H0: the sample was drawn from a Gaussian distribution
if p > 0.05:
    print('Probably Gaussian')
else:
    print('Probably not Gaussian')

Anderson-Darling Test
Python Code
# Example of the Anderson-Darling Normality Test
from scipy.stats import anderson
data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
result = anderson(data)
print('stat=%.3f' % (result.statistic))
# Compare the statistic against the critical value at each significance level
for i in range(len(result.critical_values)):
    sl, cv = result.significance_level[i], result.critical_values[i]
    if result.statistic < cv:
        print('Probably Gaussian at the %.1f%% level' % (sl))
    else:
        print('Probably not Gaussian at the %.1f%% level' % (sl))

2. Correlation Tests
This section lists statistical tests that you can use to check if two samples are related.

Pearson's Correlation Coefficient
Python Code
# Example of the Pearson's Correlation test
from scipy.stats import pearsonr
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579]
stat, p = pearsonr(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
# H0: there is no linear correlation between the samples
if p > 0.05:
    print('Probably independent')
else:
    print('Probably dependent')

Spearman's Rank Correlation
Python Code
# Example of the Spearman's Rank Correlation Test
from scipy.stats import spearmanr
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579]
stat, p = spearmanr(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
# H0: there is no monotonic relationship between the samples
if p > 0.05:
    print('Probably independent')
else:
    print('Probably dependent')

Kendall's Rank Correlation
Python Code
# Example of the Kendall's Rank Correlation Test
from scipy.stats import kendalltau
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579]
stat, p = kendalltau(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably independent')
else:
    print('Probably dependent')
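
The three coefficients measure different things, which a small sketch can make concrete. The data below is hypothetical: `y` always increases with `x`, but not linearly, so the rank-based coefficients reach 1 while Pearson's coefficient stays below 1.

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Hypothetical data: y increases with x, but not linearly
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [v ** 3 for v in x]

r, _ = pearsonr(x, y)       # strength of the linear relationship
rho, _ = spearmanr(x, y)    # rank-based, monotonic relationship
tau, _ = kendalltau(x, y)   # rank-based, concordant vs. discordant pairs

print('pearson=%.3f, spearman=%.3f, kendall=%.3f' % (r, rho, tau))
```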

Chi-Squared Test
Interpretation
H0: the two samples are independent.
H1: there is a dependency between the samples.
Python Code
# Example of the Chi-Squared Test
from scipy.stats import chi2_contingency
# Contingency table of observed counts (rows and columns are categories)
table = [[10, 20, 30], [6, 9, 17]]
stat, p, dof, expected = chi2_contingency(table)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably independent')
else:
    print('Probably dependent')
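
The extra return values `dof` and `expected` can be reproduced by hand, which may help when reading the result. The sketch below reuses the same contingency table and recomputes the expected counts, the degrees of freedom, and the statistic with NumPy; the variable names prefixed with `manual_` are introduced here for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[10, 20, 30], [6, 9, 17]])
stat, p, dof, expected = chi2_contingency(table)

# Expected counts under independence: (row total * column total) / grand total
row_totals = table.sum(axis=1, keepdims=True)
col_totals = table.sum(axis=0, keepdims=True)
manual_expected = row_totals * col_totals / table.sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
manual_dof = (table.shape[0] - 1) * (table.shape[1] - 1)

# Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected
manual_stat = ((table - manual_expected) ** 2 / manual_expected).sum()

print('dof=%d, manual dof=%d' % (dof, manual_dof))
print('stat=%.3f, manual stat=%.3f' % (stat, manual_stat))
print(np.allclose(expected, manual_expected))
```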
More Information
A Gentle Introduction to the Chi-Squared Test for Machine Learning
scipy.stats.chi2_contingency
Chi-Squared test on Wikipedia

3. Stationary Tests
This section lists statistical tests that you can use to check if a time series is stationary or not.

Augmented Dickey-Fuller Unit Root Test
Python Code
# Example of the Augmented Dickey-Fuller unit root test
from statsmodels.tsa.stattools import adfuller
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
stat, p, lags, obs, crit, t = adfuller(data)
print('stat=%.3f, p=%.3f' % (stat, p))
# H0: a unit root is present (the series is non-stationary)
if p > 0.05:
    print('Probably not Stationary')
else:
    print('Probably Stationary')
More Information
How to Check if Time Series Data is Stationary with Python
statsmodels.tsa.stattools.adfuller API.
Augmented Dickey–Fuller test, Wikipedia.
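
Kwiatkowski-Phillips-Schmidt-Shin
Tests whether a time series is trend stationary or not.
Assumptions
Observations are temporally ordered.
Interpretation
H0: the time series is trend-stationary.
H1: the time series is not trend-stationary.
Python Code
# Example of the Kwiatkowski-Phillips-Schmidt-Shin test
from statsmodels.tsa.stattools import kpss
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# The default regression='c' tests stationarity around a constant level;
# pass regression='ct' to test stationarity around a trend
stat, p, lags, crit = kpss(data)
print('stat=%.3f, p=%.3f' % (stat, p))
# Unlike the ADF test, H0 here is that the series is stationary
if p > 0.05:
	print('Probably Stationary')
else:
	print('Probably not Stationary')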

4. Parametric Statistical Hypothesis Tests
This section lists statistical tests that you can use to compare data samples.

Student's t-test
Python Code
# Example of the Student's t-test
from scipy.stats import ttest_ind
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = ttest_ind(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')

Paired Student's t-test
Python Code
# Example of the Paired Student's t-test
from scipy.stats import ttest_rel
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = ttest_rel(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')
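
Analysis of Variance Test (ANOVA)
Tests whether the means of two or more independent samples are significantly different.
Assumptions
Observations in each sample are independent and identically distributed (iid).
Observations in each sample are normally distributed.
Observations in each sample have the same variance.
Interpretation
H0: the means of the samples are equal.
H1: one or more of the means of the samples are unequal.
Python Code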
# Example of the Analysis of Variance Test
from scipy.stats import f_oneway
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]
stat, p = f_oneway(data1, data2, data3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')
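
Repeated Measures ANOVA Test
Tests whether the means of two or more paired samples are significantly different.
Assumptions
Observations in each sample are normally distributed and have the same variance.
Observations across each sample are paired.
Interpretation
H0: the means of the samples are equal.
H1: one or more of the means of the samples are unequal.
Python Code
Not available in SciPy. The statsmodels library provides statsmodels.stats.anova.AnovaRM for balanced repeated measures designs.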
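
Although SciPy has no repeated measures ANOVA, statsmodels provides `AnovaRM`. The sketch below is one possible way to use it, assuming the three samples from the ANOVA example are repeated measurements on the same ten subjects (position in each list identifies the subject). The column names `subject`, `condition`, and `value` are chosen here for illustration.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]

# AnovaRM expects long format: one row per (subject, condition) pair
rows = []
for condition, values in [('a', data1), ('b', data2), ('c', data3)]:
    for subject, value in enumerate(values):
        rows.append({'subject': subject, 'condition': condition, 'value': value})
frame = pd.DataFrame(rows)

# One within-subject factor with three levels
result = AnovaRM(frame, depvar='value', subject='subject', within=['condition']).fit()
table = result.anova_table
stat = table['F Value'].iloc[0]
p = table['Pr > F'].iloc[0]
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
```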

5. Nonparametric Statistical Hypothesis Tests

Mann-Whitney U Test
Python Code
# Example of the Mann-Whitney U Test
from scipy.stats import mannwhitneyu
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = mannwhitneyu(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')

Wilcoxon Signed-Rank Test
Python Code
# Example of the Wilcoxon Signed-Rank Test
from scipy.stats import wilcoxon
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = wilcoxon(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')

Kruskal-Wallis H Test
Python Code
# Example of the Kruskal-Wallis H Test
from scipy.stats import kruskal
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = kruskal(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')

Friedman Test
Python Code
# Example of the Friedman Test
from scipy.stats import friedmanchisquare
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]
stat, p = friedmanchisquare(data1, data2, data3)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably the same distribution')
else:
	print('Probably different distributions')
More Information
How to Calculate Nonparametric Statistical Hypothesis Tests in Python
scipy.stats.friedmanchisquare
Friedman test on Wikipedia

Further Reading
This section provides more resources on the topic if you are looking to go deeper.
A Gentle Introduction to Normality Tests in Python
How to Use Correlation to Understand the Relationship Between Variables
How to Use Parametric Statistical Significance Tests in Python
A Gentle Introduction to Statistical Hypothesis Tests
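
Summary
In this tutorial, you discovered the key statistical hypothesis tests that you may need to use in a machine learning project.
Specifically, you learned which tests to use for checking normality, correlation, and stationarity, and for comparing samples with parametric and nonparametric tests, along with the Python API for each.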
