Medical Statistics and R (醫學統計與R語言): Who Is Kendall? Should the Sample Size Be 10 Times the Number of Independent Variables?

2021-02-23 Medical Statistics and R (醫學統計與R語言)

WeChat public account: 醫學統計與R語言 (Medical Statistics and R)

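A great many Chinese-language papers describe the sample size for multiple linear regression like this:
"According to Kendall's rough rule for determining sample size, the sample size may be taken as 5 to 10 times the number of variables; allowing for loss to follow-up and non-cooperation, the sample is enlarged by a further 20% to give the number of cases to be surveyed." I spent a whole morning searching and still could not find the original source of this rule. The papers published in China mainly cite the book 《臨床流行病學臨床科研設計、測量與評價》 (Clinical Epidemiology: Clinical Research Design, Measurement and Evaluation), which contains the following description: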

If anyone can provide a reference for Kendall, please leave a comment.
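As a rough illustration of the arithmetic in the quoted rule, the sketch below takes 5 to 10 cases per variable and then inflates the result by 20%. The function name `kendall_rule_range` and its parameters are my own labels, and reading "enlarged by a further 20%" as multiplying by 1.2 is an assumption; the rule itself remains unsourced.

```python
def kendall_rule_range(n_variables, low_ratio=5, high_ratio=10, inflation_pct=20):
    """Planned sample-size range under the quoted 5-10x rule of thumb.

    Assumption: "enlarged by a further 20%" means base size * 1.20.
    Integer arithmetic is used so rounding up is free of float error.
    """
    def inflate(base):
        # Ceiling of base * (100 + inflation_pct) / 100
        return (base * (100 + inflation_pct) + 99) // 100

    return inflate(low_ratio * n_variables), inflate(high_ratio * n_variables)


if __name__ == "__main__":
    # Hypothetical study with 12 candidate variables
    low, high = kendall_rule_range(12)
    print(f"Planned sample size: {low} to {high}")
```

For 12 variables this gives a base of 60 to 120 cases, which the 20% allowance raises to 72 to 144.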

For rules of thumb on sample size in regression analysis, see the following:

1. From Wikipedia: https://en.wikipedia.org/wiki/One_in_ten_rule

In statistics, the one in ten rule is a rule of thumb for how many predictor parameters can be estimated from data when doing regression analysis (in particular proportional hazards models in survival analysis and logistic regression) while keeping the risk of overfitting low. The rule states that one predictive variable can be studied for every ten events.[1][2][3][4] For logistic regression the number of events is given by the size of the smallest of the outcome categories, and for survival analysis it is given by the number of uncensored events.[3]

For example, if a sample of 200 patients are studied and 20 patients die during the study (so that 180 patients survive), the one in ten rule implies that two pre-specified predictors can reliably be fitted to the total data. Similarly, if 100 patients die during the study (so that 100 patients survive), ten pre-specified predictors can be fitted reliably. If more are fitted, the rule implies that overfitting is likely and the results will not predict well outside the training data. It is not uncommon to see the 1:10 rule violated in fields with many variables (e.g. gene expression studies in cancer), decreasing the confidence in reported findings.[5]

Improvements

A "one in 20 rule" has been suggested, indicating the need for shrinkage of regression coefficients, and a "one in 50 rule" for stepwise selection with the default p-value of 5%.[4][6] Other studies, however, show that the one in ten rule may be too conservative as a general recommendation and that five to nine events per predictor can be enough, depending on the research question.[7]

More recently, a study has shown that the ratio of events per predictive variable is not a reliable statistic for estimating the minimum number of events for estimating a logistic prediction model.[8] Instead, the number of predictor variables, the total sample size (events + non-events) and the events fraction (events / total sample size) can be used to calculate the expected prediction error of the model that is to be developed.[9] One can then estimate the required sample size to achieve an expected prediction error that is smaller than a predetermined allowable prediction error value.[9]

Alternatively, three requirements for prediction model estimation have been suggested: the model should have a global shrinkage factor of ≥ .9, an absolute difference of ≤ .05 in the model's apparent and adjusted Nagelkerke R2, and a precise estimation of the overall risk or rate in the target population.[10] The necessary sample size and number of events for model development are then given by the values that meet these requirements.[10]

[1] Harrell, F. E. Jr.; Lee, K. L.; Califf, R. M.; Pryor, D. B.; Rosati, R. A. (1984). "Regression modelling strategies for improved prognostic prediction". Stat Med. 3 (2): 143–52. doi:10.1002/sim.4780030207.
[2] Harrell, F. E. Jr.; Lee, K. L.; Mark, D. B. (1996). "Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors" (PDF). Stat Med. 15 (4): 361–87. doi:10.1002/(sici)1097-0258(19960229)15:4<361::aid-sim168>3.0.co;2-4.
[3] Peduzzi, Peter; Concato, John; Kemper, Elizabeth; Holford, Theodore R.; Feinstein, Alvan R. (1996). "A simulation study of the number of events per variable in logistic regression analysis". Journal of Clinical Epidemiology. 49 (12): 1373–1379. doi:10.1016/s0895-4356(96)00236-3. PMID 8970487.
[4] "Chapter 8: Statistical Models for Prognostication: Problems with Regression Models". Archived from the original on October 31, 2004. Retrieved 2013-10-11.
[5] Ernest S. Shtatland, Ken Kleinman, Emily M. Cain. Model building in Proc PHREG with automatic variable selection and information criteria. Paper 206–30 in SUGI 30 Proceedings, Philadelphia, Pennsylvania April 10–13, 2005. http://www2.sas.com/proceedings/sugi30/206-30.pdf

[6] Steyerberg, E. W.; Eijkemans, M. J.; Harrell, F. E. Jr.; Habbema, J. D. (2000). "Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets". Stat Med. 19 (8): 1059–1079. doi:10.1002/(sici)1097-0258(20000430)19:8<1059::aid-sim412>3.0.co;2-0.
[7] Vittinghoff, E.; McCulloch, C. E. (2007). "Relaxing the Rule of Ten Events per Variable in Logistic and Cox Regression". American Journal of Epidemiology. 165 (6): 710–718. doi:10.1093/aje/kwk052. PMID 17182981.
[8] van Smeden, Maarten; de Groot, Joris A. H.; Moons, Karel G. M.; Collins, Gary S.; Altman, Douglas G.; Eijkemans, Marinus J. C.; Reitsma, Johannes B. (2016-11-24). "No rationale for 1 variable per 10 events criterion for binary logistic regression analysis". BMC Medical Research Methodology. 16 (1): 163. doi:10.1186/s12874-016-0267-3. ISSN 1471-2288. PMC 5122171. PMID 27881078.
[9] van Smeden, Maarten; Moons, Karel G. M.; de Groot, Joris A. H.; Collins, Gary S.; Altman, Douglas G.; Eijkemans, Marinus J. C.; Reitsma, Johannes B. (2018-01-01). "Sample size for binary logistic prediction models: Beyond events per variable criteria". Statistical Methods in Medical Research. 28: 962280218784726. doi:10.1177/0962280218784726. ISSN 1477-0334. PMID 29966490.
[10] Riley, Richard D.; Snell, Kym I. E.; Ensor, Joie; Burke, Danielle L.; Harrell, Frank E. Jr.; Moons, Karel G. M.; Collins, Gary S. (2018). "Minimum sample size for developing a multivariable prediction model: PART II - binary and time-to-event outcomes". Statistics in Medicine. 38 (7): 1276–1296. doi:10.1002/sim.7992. ISSN 1097-0258. PMC 6519266. PMID 30357870.
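The worked examples in the Wikipedia excerpt above can be reproduced with a few lines of Python. This is only a sketch for the binary-outcome case, where the limiting count is the smaller outcome category; the function name `max_predictors` and the `events_per_variable` parameter are my own labels, not part of any library.

```python
def max_predictors(n_events, n_non_events, events_per_variable=10):
    """Largest number of pre-specified predictors allowed by an
    events-per-variable rule of thumb for a binary outcome.

    The limiting count is the size of the smaller outcome category.
    """
    if events_per_variable <= 0:
        raise ValueError("events_per_variable must be positive")
    limiting = min(n_events, n_non_events)
    return limiting // events_per_variable


if __name__ == "__main__":
    # The two examples from the excerpt (200 patients in total)
    print(max_predictors(20, 180))    # 20 deaths, 180 survivors
    print(max_predictors(100, 100))   # 100 deaths, 100 survivors
    # Stricter variants mentioned under "Improvements"
    for epv in (20, 50):
        print(epv, max_predictors(100, 100, events_per_variable=epv))
```

With 100 events, the one in 20 and one in 50 variants cut the allowance from ten predictors to five and two respectively. As the excerpt notes, events per variable alone is no longer regarded as a reliable criterion, so this is best treated as a quick sanity check.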

2. Related references

Although there are more complex formulae, the general rule of thumb is no less than 50 participants for a correlation or regression with the number increasing with larger numbers of independent variables (IVs). Green (1991) provides a comprehensive overview of the procedures used to determine regression sample sizes. He suggests N > 50 + 8m (where m is the number of IVs) for testing the multiple correlation and N > 104 + m for testing individual predictors (assuming a medium-sized relationship). If testing both, use the larger sample size.

Although Green's (1991) formula is more comprehensive, there are two other rules of thumb that could be used. With five or fewer predictors (this number would include correlations), a researcher can use Harris's (1985) formula for yielding the absolute minimum number of participants. Harris suggests that the number of participants should exceed the number of predictors by at least 50 (i.e., total number of participants equals the number of predictor variables plus 50), a formula much the same as Green's mentioned above. For regression equations using six or more predictors, an absolute minimum of 10 participants per predictor variable is appropriate. However, if the circumstances allow, a researcher would have better power to detect a small effect size with approximately 30 participants per variable. For instance, Cohen and Cohen (1975) demonstrate that with a single predictor that in the population correlates with the DV at .30, 124 participants are needed to maintain 80% power. With five predictors and a population correlation of .30, 187 participants would be needed to achieve 80% power. Larger samples are needed when the DV is skewed, the effect size expected is small, there is substantial measurement error, or stepwise regression is being used (Tabachnick & Fidell, 1996).
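Green's two inequalities quoted above are easy to tabulate. The following is a minimal sketch, assuming a medium-sized relationship as in the quoted text; the function name `green_thresholds` and the dictionary keys are hypothetical labels of my own.

```python
def green_thresholds(m):
    """Values that N must exceed under Green's (1991) rules of thumb.

    m is the number of independent variables (IVs).
    """
    multiple_correlation = 50 + 8 * m   # N > 50 + 8m
    individual_predictors = 104 + m     # N > 104 + m
    return {
        "multiple_correlation": multiple_correlation,
        "individual_predictors": individual_predictors,
        # If testing both, use the larger of the two thresholds
        "both": max(multiple_correlation, individual_predictors),
    }


if __name__ == "__main__":
    for m in (5, 8, 12):
        print(m, green_thresholds(m))
```

With seven or fewer IVs the individual-predictor rule gives the larger threshold; from eight IVs upward the multiple-correlation rule dominates.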

[1] Green S B. How many subjects does it take to do a regression analysis[J]. Multivariate Behavioral Research, 1991, 26(3): 499-510.
[2] VanVoorhis C R W, Morgan B L. Understanding power and rules of thumb for determining sample sizes[J]. Tutorials in Quantitative Methods for Psychology, 2007, 3(2): 43-50. http://www.tqmp.org/RegularArticles/vol03-2/p043/p043.pdf
[3] Harris, R. J. (1985). A primer of multivariate statistics (2nd ed.). New York: Academic Press.
[4]https://www.youtube.com/watch?v=PD_xC3Xtqlw
[5]https://www.youtube.com/watch?v=v7YlRye7aMw
[6]https://books.google.com.sg/books?id=vlO2BQAAQBAJ&pg=PA212&lpg=PA212&dq=Rules+of+Thumb+++multiple++regression&source=bl&ots=XBb5fizvwZ&sig=ACfU3U3Y7649s2O4WpG4_8UOvR0gvecrNQ&hl=zh-CN&sa=X&redir_esc=y#v=onepage&q&f=false
