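Regression and Interpretation
Model selection
Regression analysis
Simple regression (one-predictor linear regression)
Fit a linear (straight-line) relationship: Y = a + bX + e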
Y = dependent variable
X = independent variable
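a = intercept (constant)
b = slope
e = error term
In Stata: reg income education
In the output, the coefficient (Coef.) on education is the slope b
_cons is the intercept a
The slope b gives the change in y for each one-unit increase in x.
Whether b differs from zero is judged from its t statistic and p-value in the regression output. The fit of the equation is measured by the coefficient of determination R², the proportion of the total variation in y that is explained by variation in x. The closer R² is to 1, the better the fit.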
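As an illustration, the lines below fit a simple regression on Stata's bundled nlsw88 example data, which stands in here for the income and education data (an assumption: wage and grade are that dataset's hourly wage and years of schooling, not variables from these notes). It is only a sketch of where b, a and R² can be read off after reg.

```stata
* Stand-in data (assumption): Stata's bundled nlsw88, not the survey used in the notes
sysuse nlsw88, clear

* Simple regression of hourly wage on years of schooling
regress wage grade

* Slope b and intercept a, as stored by regress
display "b (slope)     = " _b[grade]
display "a (intercept) = " _b[_cons]

* R-squared: share of the variation in wage explained by grade
display "R-squared     = " e(r2)

* Test of H0: b = 0 (with one predictor this F statistic equals t squared)
test grade
```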
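Multiple regression
Model: Y = a + b1X1 + b2X2 + … + bkXk + e
b1: the change in y for each one-unit increase in x1, holding the other variables constant;
b2: the change in y for each one-unit increase in x2, holding the other variables constant;
bk: the change in y for each one-unit increase in xk, holding the other variables constant.
In Stata: reg var1 var2 var3 … (var1 is the dependent variable)
The significance of a regression coefficient shows the additional contribution of one independent variable (Xk) to the dependent variable (y), holding the other variables constant.
R² is the proportion of the total variation in Y that is explained by the fitted equation; the larger R² is, the greater the explanatory power of the equation. The explanatory power changes as different x variables are entered.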
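A hedged sketch of how the explanatory power changes as predictors are added, using the bundled nlsw88 data as a stand-in (an assumption: wage, grade, tenure and union belong to that dataset, not to these notes). Incomplete cases are dropped first so that both models use the same observations and their R² values are comparable.

```stata
* Stand-in data (assumption): Stata's bundled nlsw88
sysuse nlsw88, clear

* Keep only complete cases so both models are fitted to the same sample
drop if missing(wage, grade, tenure, union)

* One predictor
regress wage grade
scalar r2_small = e(r2)

* Three predictors
regress wage grade tenure union
scalar r2_large = e(r2)

display "R2, one predictor:    " r2_small
display "R2, three predictors: " r2_large
* Adjusted R2 penalises the extra predictors
display "adjusted R2 (large):  " e(r2_a)
```

On a fixed sample R² cannot fall when a predictor is added, which is why the adjusted R² is often reported alongside it.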
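Applications of regression
Dummy variables: regression treats x and Y as continuous variables, so when x is a categorical variable with n categories it has to be converted into (n-1) dummy variables coded 0/1. The omitted category is the reference group.
The coefficient b on a dummy variable is still the change in Y for a one-unit change in x. A binary dummy takes only two values (0 or 1), so a one-unit change means moving from 0 to 1. Common examples are the effect on income of being a Party member (1) rather than a non-member (0), or the effect on education of being male (1) rather than female (0).
Regression with a multi-category X in Stata:
xi: reg var1 var2 var3 i.var4
Here var4 is the multi-category variable, var1 is the dependent variable, and var2 and var3 are continuous or binary variables. From Stata 11 on, the xi: prefix can be omitted.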
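The sketch below shows three equivalent ways of entering a categorical predictor, using the bundled nlsw88 data as a stand-in (an assumption: union is a 0/1 variable and race has three categories coded 1 to 3 in that dataset). The prefix race_ for the generated dummies is a name chosen here for illustration.

```stata
* Stand-in data (assumption): Stata's bundled nlsw88
sysuse nlsw88, clear

* union is already a 0/1 dummy; i.race expands race into 2 dummies (category 1 omitted)
regress wage grade union i.race

* Same model with category 2 as the reference group
regress wage grade union ib2.race

* Explicit dummies: tabulate creates race_1, race_2, race_3
tabulate race, generate(race_)

* Enter n-1 = 2 of them; race_1 is the omitted reference category
regress wage grade union race_2 race_3
```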
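Nonlinear transformations: the effect of age on income, for example, is not simply linear. To fit this curved relationship, a squared age term is usually added.
gen agesq = age^2
Log-transform income (adding 1 keeps cases with zero income): gen lgincome = ln(income+1)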
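A minimal sketch of both transformations on the bundled nlsw88 data (an assumption: wage stands in for income; wages in that dataset are all positive, so no +1 is added before taking logs). The turning point -b1/(2*b2) is the age at which the fitted curve changes direction.

```stata
* Stand-in data (assumption): Stata's bundled nlsw88
sysuse nlsw88, clear

* Quadratic term created by hand, as in the notes
generate agesq = age^2
regress wage age agesq

* Turning point of the fitted parabola: -b1 / (2*b2)
display "turning point (age) = " -_b[age] / (2 * _b[agesq])

* Same model with factor-variable notation; no new variable is needed
regress wage c.age##c.age

* Log-transformed outcome
generate lnwage = ln(wage)
regress lnwage age agesq
```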
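Outliers
How are outliers identified?
After the regression, generate studentized residuals.
Examine the studentized residual of each observation: an absolute value between 2 and 3 deserves attention, and an absolute value above 3 is generally treated as an outlier.
In Stata:
quietly reg var1 var2 var3
predict student, rstudent
gen absstudent = abs(student)
gsort -absstudent
list var1 var2 var3 student in 1/20
If the outliers have a large effect on the results, either drop them or use robust regression.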
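A minimal sketch of the same check on the bundled nlsw88 data (a stand-in, not the data behind these notes; rstu and abs_rstu are names chosen here). It counts the observations beyond the two thresholds and then fits rreg, Stata's robust regression command, as one option when outliers matter.

```stata
* Stand-in data (assumption): Stata's bundled nlsw88
sysuse nlsw88, clear

quietly regress wage grade tenure
predict rstu, rstudent
generate abs_rstu = abs(rstu)

* How many observations fall in the "watch" band and beyond the outlier threshold
count if abs_rstu > 2 & abs_rstu <= 3
count if abs_rstu > 3 & !missing(abs_rstu)

* Largest residuals first (gsort puts missing values last in descending order)
gsort -abs_rstu
list wage grade tenure rstu in 1/10

* Robust regression downweights large residuals instead of dropping cases
rreg wage grade tenure
```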
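Multicollinearity
Why does an independent variable that was significant in an earlier model suddenly lose significance once another variable is added? The newly added variable is the likely cause: it may be collinear with variables already in the model. Two patterns can signal collinearity: first, coefficients change in size or sign in unexpected ways; second, R² is large but the regression coefficients are not significant.
Checking whether variables are collinear
1. Run
pwcorr var1 var2, sig
2. Run the vif (variance inflation factor) command after the regression (estat vif in current versions of Stata)
Variance inflation factor (VIF): the ratio of the variance of a coefficient when the explanatory variables are collinear to its variance when they are not. It is the reciprocal of the tolerance, and the larger the VIF, the more severe the collinearity. Rules of thumb: VIF < 10, no serious multicollinearity; 10 ≤ VIF < 100, strong multicollinearity; VIF ≥ 100, severe multicollinearity.
vif
1/vif (the tolerance) is the share of the variance of an x that is independent of all the other x variables
If 1/vif < 0.1, that is, vif > 10, collinearity is considered fairly serious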
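The VIF of a predictor x_j can be written as 1/(1 - R_j²), where R_j² comes from regressing x_j on the other predictors. The sketch below, on the bundled nlsw88 data as a stand-in (an assumption), runs estat vif and then reproduces one VIF by hand to show where the number comes from.

```stata
* Stand-in data (assumption): Stata's bundled nlsw88
sysuse nlsw88, clear

* Pairwise correlations with significance levels
pwcorr grade tenure ttl_exp, sig

* VIF and tolerance (1/VIF) for every predictor
regress wage grade tenure ttl_exp
estat vif

* By hand for tenure: regress it on the other predictors, same estimation sample
quietly regress tenure grade ttl_exp if e(sample)
display "tolerance(tenure) = " 1 - e(r2)
display "VIF(tenure)       = " 1 / (1 - e(r2))
```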
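Reporting regression results
Coefficient: the size of the effect of x on y, interpreted when the coefficient is significant
Standard error: a measure of the sampling error of the corresponding sample statistic. The smaller the standard error, the closer the sample statistic is likely to be to the population parameter and the more reliable the inference from sample to population. The standard error is therefore an indicator of the reliability of statistical inference.
Significance: *, **, ***; p<0.1 one star; p<0.05 two stars; p<0.01 three stars
Goodness of fit R²: the closer to 1, the better.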
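One way to lay out coefficients, stars and R² side by side is Stata's estimates table. The sketch below uses the bundled nlsw88 data as a stand-in (an assumption) and sets the star thresholds to the 0.1 / 0.05 / 0.01 convention listed above; m1 and m2 are names chosen here.

```stata
* Stand-in data (assumption): Stata's bundled nlsw88
sysuse nlsw88, clear

quietly regress wage grade
estimates store m1

quietly regress wage grade tenure i.race
estimates store m2

* Coefficients with stars at p<0.1, p<0.05, p<0.01, plus N and R-squared
estimates table m1 m2, b(%9.3f) star(.1 .05 .01) stats(N r2)

* Coefficients with standard errors (star cannot be combined with se)
estimates table m1 m2, b(%9.3f) se(%9.3f) stats(N r2)
```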
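Logistic model
Used when Y is a binary variable, that is, a dummy variable (0,1)
Odds
p is the probability that the event occurs and (1-p) the probability that it does not; the odds are the ratio of the two, p/(1-p). The higher the probability of the event, the higher the odds: the odds equal 1 when p = 0.5 and grow without bound as p approaches 1.
Odds ratio
The odds ratio compares the odds of an event in one group with the odds of the same event in a second group.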
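A short worked example of the two definitions; the probabilities 0.8 and 0.5 are made-up numbers chosen only for illustration.

```latex
\[
\text{odds} = \frac{p}{1-p}, \qquad
p = 0.5 \;\Rightarrow\; \text{odds} = 1, \qquad
p = 0.8 \;\Rightarrow\; \text{odds} = \frac{0.8}{0.2} = 4
\]
\[
\text{OR} = \frac{p_1/(1-p_1)}{p_2/(1-p_2)}, \qquad
p_1 = 0.8,\; p_2 = 0.5 \;\Rightarrow\; \text{OR} = \frac{4}{1} = 4
\]
```

Here the first group's odds are four times the second group's, although its probability is only 1.6 times as large, which is why an odds ratio should not be read as a ratio of probabilities.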
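Logit in Stata:
logit var1 var2 var3 var4
Reading logit coefficients: when x is continuous, each one-unit increase in x multiplies the odds of y by e^β; when x is a dummy variable, the odds for the indicated group are e^β times the odds for the reference group.
β is the value in the Coef. (coefficient) column
For example, the coefficient on Age (.0504744) means that each additional year of age multiplies the odds by 1.052 (exp(.0504744)=1.0517699); in other words, each additional year of age raises the odds of being a Party member by 5.2%.
exp (exponential) is the exponential function with the natural constant e as its base: exp(x) = e^x.
To have Stata display odds ratios directly instead of coefficients, add or to the command's options
logit var1 var2 var3 var4, or
or
logistic var1 var2 var3 var4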
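A hedged sketch of the three equivalent routes to an odds ratio, on the bundled nlsw88 data as a stand-in (an assumption: union membership, coded 0/1 in that dataset, plays the role of the binary outcome). The last line reproduces the Age example from the notes by hand.

```stata
* Stand-in data (assumption): Stata's bundled nlsw88
sysuse nlsw88, clear

* Coefficients on the log-odds scale
logit union grade age i.race

* Same model, reported as odds ratios
logit union grade age i.race, or

* logistic reports odds ratios by default
logistic union grade age i.race

* By hand: the odds ratio is exp(coefficient)
display "OR for grade = " exp(_b[grade])

* The Age coefficient quoted in the notes
display exp(.0504744)
```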
Logistic regression
Below we use the logit command to estimate a logistic regression model. The i. before rank indicates that rank is a factor variable (i.e., categorical variable), and that it should be included in the model as a series of indicator variables. Note that this syntax was introduced in Stata 11.
logit admit gre gpa i.rank
· In the output above, we first see the iteration log, indicating how quickly the model converged. The log likelihood (-229.25875) can be used in comparisons of nested models, but we won’t show an example of that here.
· Also at the top of the output we see that all 400 observations in our data set were used in the analysis (fewer observations would have been used if any of our variables had missing values).
· The likelihood ratio chi-square of 41.46 with a p-value of 0.0001 tells us that our model as a whole fits significantly better than an empty model (i.e., a model with no predictors).
· In the table we see the coefficients, their standard errors, the z-statistic, associated p-values, and the 95% confidence interval of the coefficients. Both gre and gpa are statistically significant, as are the three indicator variables for rank. The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.
· For every one unit change in gre, the log odds of admission (versus non-admission) increases by 0.002.
· For a one unit increase in gpa, the log odds of being admitted to graduate school increases by 0.804.
· The indicator variables for rank have a slightly different interpretation. For example, having attended an undergraduate institution with rank of 2, versus an institution with a rank of 1, decreases the log odds of admission by 0.675.
The coefficients can also be exponentiated and read as odds ratios. Stata does this for you if you use the or option. You could also use the logistic command.
Now we can say that for a one unit increase in gpa, the odds of being admitted to graduate school (versus not being admitted) increase by a factor of 2.23.
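The factor of 2.23 is simply the exponentiated gpa coefficient. The lines below redo that arithmetic from the rounded coefficients quoted above; the commented-out command shows how the same figures would be requested directly, assuming the example data (admit, gre, gpa, rank) are in memory.

```stata
* Odds ratios by hand from the rounded log-odds coefficients quoted in the text
display "gpa:         " exp(0.804)
display "rank 2 vs 1: " exp(-0.675)

* With the example data in memory (an assumption), Stata reports these directly:
* logit admit gre gpa i.rank, or
```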
Source: https://stats.idre.ucla.edu/stata/dae/logistic-regression/
Multinomial logistic regression
An extension of the logit model, used when Y is a categorical variable with more than two unordered categories
Multinomial logistic regression is used to model nominal outcome variables, in which the log odds of the outcomes are modeled as a linear combination of the predictor variables.
An example: Entering high school students make program choices among general program, vocational program and academic program. Their choice might be modeled using their writing score and their socioeconomic status.
Multinomial logistic regression
Below we use the mlogit command to estimate a multinomial logistic regression model. The i. before ses indicates that ses is an indicator variable (i.e., categorical variable), and that it should be included in the model. We have also used the option "base" to indicate the category we would want to use for the baseline comparison group. In the model below, we have chosen to use the academic program type as the baseline category.
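A sketch of what such a command looks like, under two assumptions that the text does not state: the example data with prog, ses and write are already in memory, and the academic program is coded 2 in prog. baseoutcome() is the full name of the option abbreviated as base.

```stata
* Assumptions: the example data are in memory, and academic is coded 2 in prog
* (check the coding first with: tabulate prog, and tabulate prog, nolabel)
mlogit prog i.ses write, baseoutcome(2)
```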
· In the output above, we first see the iteration log, indicating how quickly the model converged. The log likelihood (-179.98173) can be used in comparisons of nested models, but we won’t show an example of comparing models here.
· The likelihood ratio chi-square of 48.23 with a p-value < 0.0001 tells us that our model as a whole fits significantly better than an empty model (i.e., a model with no predictors).
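· The output above has two parts, labeled with the categories of the outcome variable prog. They correspond to the two equations below:
ln[ P(prog = general) / P(prog = academic) ] = b10 + b11(ses = 2) + b12(ses = 3) + b13·write
ln[ P(prog = vocation) / P(prog = academic) ] = b20 + b21(ses = 2) + b22(ses = 3) + b23·write
where b’s are the regression coefficients.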
· A one-unit increase in the variable write is associated with a .058 decrease in the relative log odds of being in general program vs. academic program.
· A one-unit increase in the variable write is associated with a .1136 decrease in the relative log odds of being in vocation program vs. academic program.
· The relative log odds of being in general program vs. in academic program will decrease by 1.163 if moving from the lowest level of ses (ses==1) to the highest level of ses (ses==3).
The ratio of the probability of choosing one outcome category over the probability of choosing the baseline category is often referred to as relative risk (and it is also sometimes referred to as odds, as we have just used to describe the regression parameters above). Relative risk can be obtained by exponentiating the linear equations above, yielding regression coefficients that are relative risk ratios for a unit change in the predictor variable. We can use the rrr option for the mlogit command to display the regression results in terms of relative risk ratios.
· The relative risk ratio for a one-unit increase in the variable write is .9437 (exp(-.0579284) from the output of the first mlogit command above) for being in general program vs. academic program.
· The relative risk ratio switching from ses = 1 to 3 is .3126 for being in general program vs. academic program. In other words, the expected risk of staying in the general program is lower for subjects who are high in ses.
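The relative risk ratios above are exponentiated coefficients, which can be checked by hand. The commented-out command shows how they would be requested directly, assuming the example data are in memory and academic is coded 2 in prog (neither is stated in the text).

```stata
* Relative risk ratios by hand from the coefficients quoted in the text
* write, general vs. academic
display exp(-.0579284)
* ses 3 vs. ses 1, general vs. academic (rounded coefficient)
display exp(-1.163)

* With the example data in memory (an assumption), rrr reports them directly:
* mlogit prog i.ses write, baseoutcome(2) rrr
```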
Source: https://stats.idre.ucla.edu/stata/dae/multinomiallogistic-regression/
Ordered logistic regression
Used when Y is an ordinal variable
An example for ordered logistic regression
A study looks at factors that influence the decision of whether to apply to graduate school. College juniors are asked if they are unlikely, somewhat likely, or very likely to apply to graduate school. Hence, our outcome variable has three categories. Data on parental educational status, whether the undergraduate institution is public or private, and current GPA are also collected. The researchers have reason to believe that the "distances" between these three points are not equal. For example, the "distance" between "unlikely" and "somewhat likely" may be shorter than the distance between "somewhat likely" and "very likely".
Description of the data
This hypothetical data set has a three-level variable called apply (coded 0, 1, 2), that we will use as our outcome variable. We also have three variables that we will use as predictors: pared, which is a 0/1 variable indicating whether at least one parent has a graduate degree; public, which is a 0/1 variable where 1 indicates that the undergraduate institution is public and 0 private; and gpa, which is the student’s grade point average.
Ordered logistic regression
Below we use the ologit command to estimate an ordered logistic regression model. The i. before pared indicates that pared is a factor variable (i.e., categorical variable), and that it should be included in the model as a series of indicator variables. The same goes for i.public.
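A sketch of the command this paragraph describes, built only from the variable names given in the data description and assuming that dataset is already in memory. The second line adds the or option to report proportional odds ratios instead of log-odds coefficients.

```stata
* Assumption: the example data (apply, pared, public, gpa) are in memory
ologit apply i.pared i.public gpa

* Same model, reported as proportional odds ratios
ologit apply i.pared i.public gpa, or
```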
In the output above, we first see the iteration log. At iteration 0, Stata fits a null model, i.e. the intercept-only model. It then moves on to fit the full model and stops the iteration process once the difference in log likelihood between successive iterations becomes sufficiently small. The final log likelihood (-358.51244) is displayed again. It can be used in comparisons of nested models. Also at the top of the output we see that all 400 observations in our data set were used in the analysis. The likelihood ratio chi-square of 24.18 with a p-value of 0.0000 tells us that our model as a whole is statistically significant, as compared to the null model with no predictors. The pseudo-R-squared of 0.0326 is also given.
In the table we see the coefficients, their standard errors, z-tests and their associated p-values, and the 95% confidence interval of the coefficients. Both pared and gpa are statistically significant; public is not. So for pared, we would say that for a one unit increase in pared (i.e., going from 0 to 1), we expect a 1.05 increase in the log odds of being in a higher level of apply, given all of the other variables in the model are held constant. For a one unit increase in gpa, we would expect a 0.62 increase in the log odds of being in a higher level of apply, given that all of the other variables in the model are held constant. The cutpoints shown at the bottom of the output indicate where the latent variable is cut to make the three groups that we observe in our data. Note that this latent variable is continuous. In general, these are not used in the interpretation of the results. The cutpoints are closely related to thresholds, which are reported by other statistical packages.
We can obtain odds ratios using the or option after the ologit command.
In the output above the results are displayed as proportional odds ratios. We would interpret these pretty much as we would odds ratios from a binary logistic regression. For pared, we would say that for a one unit increase in pared, i.e., going from 0 to 1, the odds of high apply versus the combined middle and low categories are 2.85 times greater, given that all of the other variables in the model are held constant. Likewise, the odds of the combined middle and high categories versus low apply are 2.85 times greater, given that all of the other variables in the model are held constant. For a one unit increase in gpa, the odds of the high category of apply versus the low and middle categories of apply are 1.85 times greater, given that the other variables in the model are held constant. Because of the proportional odds assumption (see below for more explanation), the same increase, 1.85 times, is found between low apply and the combined categories of middle and high apply.
One of the assumptions underlying ordered logistic (and ordered probit) regression is that the relationship between each pair of outcome groups is the same. In other words, ordered logistic regression assumes that the coefficients that describe the relationship between, say, the lowest versus all higher categories of the response variable are the same as those that describe the relationship between the next lowest category and all higher categories, etc. This is called the proportional odds assumption or the parallel regression assumption. Because the relationship between all pairs of groups is the same, there is only one set of coefficients (only one model).
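Written out, the assumption means that every split of the ordered outcome shares the same slopes and differs only in its cutpoint. The notation below is introduced here for illustration (β for the coefficients, κ_j for the cutpoints, J for the number of outcome categories) and follows the cutpoint parameterization in which a larger xβ means a higher category.

```latex
\[
\ln\frac{P(Y > j)}{P(Y \le j)} = \beta_1 x_1 + \dots + \beta_k x_k - \kappa_j,
\qquad j = 1, \dots, J-1
\]
\[
\frac{\text{odds}(Y > j \mid x_m + 1)}{\text{odds}(Y > j \mid x_m)} = e^{\beta_m}
\quad \text{for every } j
\]
```

With the three-category apply variable there are two cutpoints and one set of β, so exp(β) for a predictor is the same odds ratio whether the split is low versus middle and high, or low and middle versus high.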
Source: https://stats.idre.ucla.edu/stata/dae/ordered-logistic-regression/
References
1. Sun, Social Research Methods (course), 2016
2. Lawrence C. Hamilton, 《應用stata做統計分析》 (Statistics with Stata), Tsinghua University Press, 2017
3. https://stats.idre.ucla.edu/ (UCLA: Institute for Digital Research and Education)
y/2018.9.11