微信公眾號:醫學統計與R語言
簡介如果疾病的發生水平很低,極為不常見,病例在人群中所佔比再就非常小,那麼稱這個醫學事件為稀有事件。如果我們採用常見的現況流行病調查方法或隊列研究研究這種疾病,就會導致收集的數據中病例數與非病例數很不均衡。比如要探索研究該疾病的影響因素,通常的做法是對病例和非病例的兩類人群建立logistic回歸模型,然而由於資料中的病例所佔的比例遠遠低於非病例的比重,這就給稀有事件的統計分析帶來一系列問題,在這種情況下仍採用常規的logistic回歸方法就不適合了。
Example論文:"Breast Cancer Risk Factors in a Defined Population: Weighted Logistic Regression Approach for Rare Events"中METHOD描述如下:
Statistical analysis was performed using statistical analysis software SPSS version 11.5 (SPSS Inc., Chicago, USA). Relogit (logistic regression model for rare events) analysis with a weighting method, which was performed using the program Zelig, was used to determine risk factors . In the univariate model, odds ratios [ORs] and confidence intervals [CIs] were calculated. Covariates with a p-value of less than 0.25 were included into the multiple relogit model to determine risk factors.
Syntax- 輸入1:
install.packages("Hmisc")
library(Hmisc)
datalog <- spss.get("logistic.sav",to.data.frame=T)
library(psych)
headTail(datalog,5,5)
- 結果1:
sex age diabetes tumour lbp pancreatitis die
1 1 10 0 1 0 0 1
2 1 12 0 0 1 0 1
3 2 12 0 0 0 1 1
4 2 14 1 0 1 0 1
5 1 15 1 0 0 0 1
... ... ... ... ... ... ... ...
418 2 77 0 0 0 0 0
419 2 77 0 1 0 0 0
420 1 80 0 0 0 0 0
421 2 88 0 0 0 0 0
422 2 89 0 0 0 0 0
- 輸入2:
table(datalog$die)
- 結果2:
0 1
402 20
If you have a sample size of 1000 but only 20 events, you have a problem. If you have a sample size of 10,000 with 200 events, you may be OK. If your sample has 100,000 cases with 2000 events, you’re golden.There’s nothing wrong with the logistic model in such cases. The problem is that maximum likelihood estimation of the logistic model is well-known to suffer from small-sample bias. And the degree of bias is strongly dependent on the number of cases in the less frequent of the two categories. So even with a sample size of 100,000, if there are only 20 events in the sample, you may have substantial bias.
- 輸入3:
install.packages("Zelig")
library(Zelig)
zelig.out <- zelig(die~sex+age+diabetes+tumour+lbp+pancreatitis,data=datalog,model="relogit",tau=0.05)
summary(zelig.out)
預先校正法(prior correction),tau=0.05,為已知總體概率;
加權校正法(weight correction),在Zelig中增加case.control = "weighting"
- 結果3:
Model:
Call:
z5$zelig(formula = die ~ sex + age + diabetes + tumour + lbp +
pancreatitis, tau = 0.05, data = datalog)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.37037 -0.21257 -0.07337 -0.01692 2.63922
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.30000 1.15233 0.260 0.79460
sex 0.01184 0.61486 0.019 0.98464
age -0.16363 0.03832 -4.270 1.96e-05
diabetes 3.52070 1.22176 2.882 0.00396
tumour 2.36537 1.15653 2.045 0.04083
lbp 1.60375 0.66204 2.422 0.01542
pancreatitis 1.51434 0.85844 1.764 0.07772
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 161.008 on 421 degrees of freedom
Residual deviance: 80.129 on 415 degrees of freedom
AIC: 94.129
Number of Fisher Scoring iterations: 8
Next step: Use 'setx' method