Stata Study Notes (4): Regression and Interpretation

2021-02-12 冒雨先生

 

Regression and Interpretation

Model selection

 


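Regression analysis

Simple regression (one-predictor linear regression)

Model a linear (straight-line) relationship: Y = a + bX + e

Y = dependent variable

X = independent variable

a = intercept (constant)

b = slope

e = error term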

 

Stata implementation: reg income education

In the output, the coefficient (Coef.) on education is the slope b

_cons is the intercept a

 

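The slope b gives the change in y for a one-unit increase in x.

Whether b differs from zero is tested with its t statistic (the t and P>|t| columns of the output). The overall fit of the equation is measured by the coefficient of determination R²: the proportion of the total variation in y that is explained by the variation in x. The closer R² is to 1, the better the fit.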
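As a rough illustration of what reg reports for a one-predictor model, the sketch below computes the intercept, slope and R² by hand in Python. The data values and the function name simple_ols are made up for the example; they are not from the original notes.

```python
# Minimal OLS sketch for Y = a + bX + e (illustrative data, not from the article).
def simple_ols(x, y):
    """Return (intercept a, slope b, R^2) for a one-predictor least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariation of x and y divided by variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    # R^2 = 1 - (unexplained variation / total variation).
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return a, b, 1 - ss_res / ss_tot


education = [1, 2, 3, 4, 5]  # hypothetical predictor values
income = [2, 4, 5, 4, 5]     # hypothetical outcome values
a, b, r2 = simple_ols(education, income)
print(f"_cons = {a:.2f}, Coef. = {b:.2f}, R-squared = {r2:.2f}")
```

With these toy numbers the slope is 0.6 and R² is 0.6, meaning that 60% of the variation in the outcome is accounted for by the predictor.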

 

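Multiple regression

The model extends to k predictors: Y = a + b1X1 + b2X2 + ... + bkXk + e

b1: the change in y for a one-unit increase in x1, holding the other variables constant;

b2: the change in y for a one-unit increase in x2, holding the other variables constant;

bk: the change in y for a one-unit increase in xk, holding the other variables constant.

Stata implementation: reg var1 var2 var3 ... (the first variable listed is the dependent variable)

The significance of a regression coefficient indicates the additional contribution of one independent variable (Xk) to the dependent variable (y), assuming the other variables are held constant.

R² is the proportion of the total variation in Y that is explained by the fitted equation; the larger R² is, the greater the explanatory power of the equation. Entering different x variables changes the explanatory power.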

 

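Regression applications

Dummy variables: regression treats the x and Y variables as continuous, so when x is a categorical variable with n categories it has to be converted into (n-1) 0/1 dummy variables.

The coefficient b of a dummy variable still describes how Y changes when x changes by one unit. A binary dummy takes only two values (0 or 1), so a one-unit change means moving from 0 to 1. Common examples are the effect on income of going from non-Party member (0) to Party member (1), or the effect on education of going from female (0) to male (1).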

 

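Regression in Stata when x is a multi-category variable:

xi: reg var1 var2 var3 i.var4

Here var4 is the multi-category variable, var1 is the dependent variable, and var2 and var3 are continuous or binary variables. From Stata 11 onward the i. factor-variable notation also works without the xi: prefix.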
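To make the (n-1) dummy coding concrete, here is a small Python sketch of the expansion that i.var4 performs behind the scenes. The function make_dummies, the _I_ column names and the sample values are hypothetical; the only assumption carried over from Stata is that the lowest level is the omitted reference category by default.

```python
def make_dummies(values, base=None):
    """Turn a categorical list into (n-1) 0/1 dummy columns, omitting the base level."""
    levels = sorted(set(values))
    if base is None:
        base = levels[0]  # assume the lowest level is the reference category
    return {
        f"_I_{level}": [1 if v == level else 0 for v in values]
        for level in levels
        if level != base
    }


region = [1, 2, 3, 2, 1]  # hypothetical 3-category variable
dummies = make_dummies(region)
for name, column in dummies.items():
    print(name, column)
```

Three categories yield two dummy columns; each coefficient is then read as the difference between that category and the omitted one.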

 

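Nonlinear transformations: the effect of age on income, for example, is not a simple linear relationship. To fit the curve, we usually add the square of age:

gen agesq = age^2

Log-transform income (adding 1 keeps zero incomes defined, since ln(0) is undefined):

gen lgincome = ln(income+1)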

 

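Outliers

How do we identify outliers?

After running the regression, generate the studentized residuals.

Inspect the studentized residual of each observation: absolute values between 2 and 3 deserve attention, and values above 3 are generally treated as outliers.

Stata implementation:

quietly reg var1 var2 var3

predict student, rstudent

gsort -student

list var1 var2 var3 student in 1/20

Note that gsort -student lists the largest positive residuals first; large negative residuals end up at the bottom of the data set and should be checked as well.

If the outliers have a large influence on the results, either drop them or use robust regression (for example, Stata's rreg command).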

 

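Collinearity

Why does a previously significant independent variable suddenly become insignificant after another variable is added? A likely reason is that the new variable is collinear with variables already in the model. Two warning signs of collinearity are: (1) coefficients change in size or sign in unexpected ways; (2) R² is large but the individual regression coefficients are not significant.

Checking whether variables are collinear

1. Look at the pairwise correlations:

pwcorr var1 var2, sig

2. Run the vif (variance inflation factor) command after the regression

Variance inflation factor (VIF): the ratio of a coefficient's variance when the explanatory variables are collinear to its variance when they are not. It is the reciprocal of the tolerance; the larger the VIF, the more severe the collinearity. A common rule of thumb: 0 < VIF < 10, no multicollinearity problem; 10 ≤ VIF < 100, strong multicollinearity; VIF ≥ 100, severe multicollinearity.

vif

1/VIF (the tolerance) is the share of the variance of a given x that is independent of all the other x variables

If 1/VIF < 0.1, that is, VIF > 10, collinearity is considered serious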
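The relationship between VIF and tolerance can be sketched numerically. The Python example below assumes the simplest case of exactly two predictors, where the R² from regressing one predictor on the other equals their squared correlation, so VIF = 1 / (1 - r²). The helper names and data are illustrative only and are not part of Stata.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(x1, x2):
    """Return (VIF, tolerance) when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    tolerance = 1 - r ** 2  # share of variance not shared with the other x
    return 1 / tolerance, tolerance


x1 = [1, 2, 3, 4, 5]  # hypothetical predictor
x2 = [2, 4, 5, 4, 5]  # hypothetical, moderately correlated predictor
vif, tolerance = vif_two_predictors(x1, x2)
print(f"VIF = {vif:.2f}, 1/VIF = {tolerance:.2f}")
```

Here the tolerance is 0.4 and the VIF is 2.5, well below the rule-of-thumb cutoff of 10. With more than two predictors, each VIF comes from regressing that predictor on all the others.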

 

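Presenting regression results

Coefficient: when significant, the size of the effect of x on y

Standard error: the standard error measures the sampling error of the corresponding sample statistic. The smaller the standard error, the closer the sample statistic is likely to be to the population parameter and the more reliable the inference from sample to population. The standard error is therefore an indicator of the reliability of statistical inference.

Significance: *, **, ***; p<0.1 one star; p<0.05 two stars; p<0.01 three stars

Goodness of fit R²: the closer R² is to 1, the better the fit.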
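The star convention above is easy to apply mechanically. A minimal Python sketch follows; the function names and the coefficient, standard error and p-value passed in are hypothetical illustration values, and journals differ in the cutoffs they use.

```python
def stars(p):
    """Map a p-value to stars: p<0.1 one, p<0.05 two, p<0.01 three."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.1:
        return "*"
    return ""


def format_cell(coef, se, p):
    """Format one table cell as 'coefficient + stars (standard error)'."""
    return f"{coef:.3f}{stars(p)} ({se:.3f})"


print(format_cell(0.804, 0.332, 0.015))
```

The p-value of 0.015 falls between 0.01 and 0.05, so the cell carries two stars.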

 

 

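Logistic model

Used when Y is a binary variable, that is, a 0/1 dummy variable

Odds

Let p be the probability that an event occurs and (1-p) the probability that it does not. The odds are the ratio of the two, p/(1-p), and can be read as another way of expressing how likely the event is. The higher the probability, the higher the odds: the odds equal 1 when p = 0.5 and grow without bound as p approaches 1.

Odds ratio

The odds ratio compares the odds of an event in one group with the odds in a second group: it is the first group's odds divided by the second group's odds.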
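The two definitions can be checked with a few lines of arithmetic. The Python sketch below uses made-up probabilities for two groups; the function names are illustrative.

```python
def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    return p / (1 - p)


def odds_ratio(p_group1, p_group2):
    """Odds of group 1 divided by the odds of group 2."""
    return odds(p_group1) / odds(p_group2)


p_group1, p_group2 = 0.8, 0.5  # hypothetical event probabilities
print(f"odds(0.8) = {odds(p_group1):.2f}")
print(f"odds(0.5) = {odds(p_group2):.2f}")
print(f"odds ratio = {odds_ratio(p_group1, p_group2):.2f}")
```

A probability of 0.8 corresponds to odds of 4, so odds are not capped at 1; the odds ratio of 4 says the first group's odds are four times those of the second.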

 

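Logit in Stata:

logit var1 var2 var3 var4

Interpreting logit coefficients: when x is continuous, each one-unit increase in x multiplies the odds of y by e^β; when x is a dummy variable, the odds of the indicated group are e^β times the odds of the reference group.

β is the Coef. column of the output

For example, a coefficient on age of .0504744 means that each additional year of age multiplies the odds by 1.052 (exp(.0504744) = 1.0517699); in other words, each additional year of age raises the odds of being a Party member by 5.2%.

exp (exponential) is the exponential function with the natural constant e as its base.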
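The conversion from a logit coefficient to an odds ratio is a single call to the exponential function. The Python sketch below reuses the age coefficient quoted above; the function names are illustrative.

```python
import math


def odds_ratio_from_coef(beta):
    """Factor by which the odds are multiplied for a one-unit increase in x."""
    return math.exp(beta)


def percent_change_in_odds(beta):
    """Percentage change in the odds for a one-unit increase in x."""
    return (math.exp(beta) - 1) * 100


age_coef = 0.0504744  # coefficient on age quoted in the text
print(f"odds ratio = {odds_ratio_from_coef(age_coef):.7f}")
print(f"change in odds = {percent_change_in_odds(age_coef):.1f}%")
```

Because the effect is multiplicative, a ten-year difference in age multiplies the odds by exp(10 × β), not by ten times the one-year percentage.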

 

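In Stata, to have the results display odds ratios directly instead of coefficients, add the or option to the command:

logit var1 var2 var3 var4, or

The logistic command reports odds ratios by default:

logistic var1 var2 var3 var4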

Logistic regression

Below we use the logit command to estimate a logistic regression model. The i. before rank indicates that rank is a factor variable (i.e., categorical variable), and that it should be included in the model as a series of indicator variables. Note that this syntax was introduced in Stata 11.

 

logit admit gre gpa i.rank

 

· In the output above, we first see the iteration log, indicating how quickly the model converged. The log likelihood (-229.25875) can be used in comparisons of nested models, but we won't show an example of that here.

· Also at the top of the output we see that all 400 observations in our data set were used in the analysis (fewer observations would have been used if any of our variables had missing values).

· The likelihood ratio chi-square of 41.46 with a p-value of 0.0001 tells us that our model as a whole fits significantly better than an empty model (i.e., a model with no predictors).

· In the table we see the coefficients, their standard errors, the z-statistic, associated p-values, and the 95% confidence interval of the coefficients. Both gre and gpa are statistically significant, as are the three indicator variables for rank. The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.

· For every one unit change in gre, the log odds of admission (versus non-admission) increases by 0.002.

· For a one unit increase in gpa, the log odds of being admitted to graduate school increases by 0.804.

· The indicator variables for rank have a slightly different interpretation. For example, having attended an undergraduate institution with rank of 2, versus an institution with a rank of 1, decreases the log odds of admission by 0.675.

 

To interpret the results as odds ratios, exponentiate the coefficients; Stata does this for you if you use the or option. You could also use the logistic command.

Now we can say that for a one unit increase in gpa, the odds of being admitted to graduate school (versus not being admitted) increase by a factor of 2.23.

Source: https://stats.idre.ucla.edu/stata/dae/logistic-regression/

 

Multinomial logistic regression


An extension of the logit model, used when Y is a multi-category (nominal) variable

Multinomial logistic regression is used to model nominal outcome variables, in which the log odds of the outcomes are modeled as a linear combination of the predictor variables.

An example: entering high school students make program choices among general program, vocational program and academic program. Their choice might be modeled using their writing score and their socioeconomic status.

 

Multinomial logistic regression

 

Below we use the mlogit command to estimate a multinomial logistic regression model. The i. before ses indicates that ses is an indicator variable (i.e., categorical variable), and that it should be included in the model. We have also used the option "base" to indicate the category we would want to use for the baseline comparison group. In the model below, we have chosen to use the academic program type as the baseline category.

 

· In the output above, we first see the iteration log, indicating how quickly the model converged. The log likelihood (-179.98173) can be used in comparisons of nested models, but we won't show an example of comparing models here.

· The likelihood ratio chi-square of 48.23 with a p-value < 0.0001 tells us that our model as a whole fits significantly better than an empty model (i.e., a model with no predictors).

· The output above has two parts, labeled with the categories of the outcome variable prog. They correspond to the two equations below:

ln(P(prog = general) / P(prog = academic)) = b10 + b11(ses = 2) + b12(ses = 3) + b13 write

ln(P(prog = vocation) / P(prog = academic)) = b20 + b21(ses = 2) + b22(ses = 3) + b23 write

where the b's are the regression coefficients.

· A one-unit increase in the variable write is associated with a .058 decrease in the relative log odds of being in general program vs. academic program.

· A one-unit increase in the variable write is associated with a .1136 decrease in the relative log odds of being in vocation program vs. academic program.

· The relative log odds of being in general program vs. in academic program will decrease by 1.163 if moving from the lowest level of ses (ses==1) to the highest level of ses (ses==3).

The ratio of the probability of choosing one outcome category over the probability of choosing the baseline category is often referred to as relative risk (and it is also sometimes referred to as odds, as we have just used to describe the regression parameters above). Relative risk can be obtained by exponentiating the linear equations above, yielding regression coefficients that are relative risk ratios for a unit change in the predictor variable. We can use the rrr option for the mlogit command to display the regression results in terms of relative risk ratios.

· The relative risk ratio for a one-unit increase in the variable write is .9437 (exp(-.0579284) from the output of the first mlogit command above) for being in general program vs. academic program.

· The relative risk ratio switching from ses = 1 to 3 is .3126 for being in general program vs. academic program. In other words, the expected risk of staying in the general program is lower for subjects who are high in ses.
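To show how the log relative odds of a multinomial model translate into probabilities, the Python sketch below converts one linear predictor per non-baseline outcome into predicted probabilities, and also reproduces the relative risk ratio for write quoted above. The linear predictor values (-0.5 and -1.0) are invented for illustration and are not taken from the UCLA output; only the coefficient -.0579284 comes from the text.

```python
import math


def mlogit_probabilities(linear_predictors, base):
    """Convert log relative odds (each vs. the base outcome) into probabilities.

    linear_predictors maps each non-base outcome to its linear predictor;
    the base outcome's linear predictor is fixed at 0.
    """
    exp_terms = {k: math.exp(v) for k, v in linear_predictors.items()}
    denominator = 1 + sum(exp_terms.values())
    probabilities = {k: v / denominator for k, v in exp_terms.items()}
    probabilities[base] = 1 / denominator
    return probabilities


# Hypothetical linear predictors for one student (not from the article's output).
eta = {"general": -0.5, "vocation": -1.0}
probs = mlogit_probabilities(eta, base="academic")
for outcome, p in probs.items():
    print(f"P(prog = {outcome}) = {p:.3f}")

# Relative risk ratio for write, general vs. academic (coefficient from the text).
rrr_write = math.exp(-0.0579284)
print(f"rrr for write = {rrr_write:.4f}")
```

The ratio of any outcome's probability to the baseline probability equals the exponentiated linear predictor, which is why exponentiated coefficients are read as relative risk ratios.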

Source: https://stats.idre.ucla.edu/stata/dae/multinomiallogistic-regression/

 

Ordered logistic regression


Used when Y is an ordinal variable

An example of ordered logistic regression

A study looks at factors that influence the decision of whether to apply to graduate school. College juniors are asked if they are unlikely, somewhat likely, or very likely to apply to graduate school. Hence, our outcome variable has three categories. Data on parental educational status, whether the undergraduate institution is public or private, and current GPA are also collected. The researchers have reason to believe that the "distances" between these three points are not equal. For example, the "distance" between "unlikely" and "somewhat likely" may be shorter than the distance between "somewhat likely" and "very likely".

 

Description of the data

This hypothetical data set has a three-level variable called apply (coded 0, 1, 2), that we will use as our outcome variable. We also have three variables that we will use as predictors: pared, which is a 0/1 variable indicating whether at least one parent has a graduate degree; public, which is a 0/1 variable where 1 indicates that the undergraduate institution is public and 0 private; and gpa, which is the student's grade point average.

 

Ordered logistic regression

Below we use the ologit command to estimate an ordered logistic regression model. The i. before pared indicates that pared is a factor variable (i.e., categorical variable), and that it should be included in the model as a series of indicator variables. The same goes for i.public.

In the output above, we first see the iteration log. At iteration 0, Stata fits a null model, i.e. the intercept-only model. It then moves on to fit the full model and stops the iteration process once the difference in log likelihood between successive iterations becomes sufficiently small. The final log likelihood (-358.51244) is displayed again. It can be used in comparisons of nested models. Also at the top of the output we see that all 400 observations in our data set were used in the analysis. The likelihood ratio chi-square of 24.18 with a p-value of 0.0000 tells us that our model as a whole is statistically significant, as compared to the null model with no predictors. The pseudo-R-squared of 0.0326 is also given.

In the table we see the coefficients, their standard errors, z-tests and their associated p-values, and the 95% confidence interval of the coefficients. Both pared and gpa are statistically significant; public is not. So for pared, we would say that for a one unit increase in pared (i.e., going from 0 to 1), we expect a 1.05 increase in the log odds of being in a higher level of apply, given all of the other variables in the model are held constant. For a one unit increase in gpa, we would expect a 0.62 increase in the log odds of being in a higher level of apply, given that all of the other variables in the model are held constant. The cutpoints shown at the bottom of the output indicate where the latent variable is cut to make the three groups that we observe in our data. Note that this latent variable is continuous. In general, these are not used in the interpretation of the results. The cutpoints are closely related to thresholds, which are reported by other statistical packages.

 

We can obtain odds ratios using the or option after the ologit command.

In the output above the results are displayed as proportional odds ratios. We would interpret these pretty much as we would odds ratios from a binary logistic regression. For pared, we would say that for a one unit increase in pared, i.e., going from 0 to 1, the odds of high apply versus the combined middle and low categories are 2.85 times greater, given that all of the other variables in the model are held constant. Likewise, the odds of the combined middle and high categories versus low apply are 2.85 times greater, given that all of the other variables in the model are held constant. For a one unit increase in gpa, the odds of the high category of apply versus the low and middle categories of apply are 1.85 times greater, given that the other variables in the model are held constant. Because of the proportional odds assumption (see below for more explanation), the same increase, 1.85 times, is found between low apply and the combined categories of middle and high apply.

One of the assumptions underlying ordered logistic (and ordered probit) regression is that the relationship between each pair of outcome groups is the same. In other words, ordered logistic regression assumes that the coefficients that describe the relationship between, say, the lowest versus all higher categories of the response variable are the same as those that describe the relationship between the next lowest category and all higher categories, etc. This is called the proportional odds assumption or the parallel regression assumption. Because the relationship between all pairs of groups is the same, there is only one set of coefficients (only one model).
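The proportional odds assumption can be illustrated numerically. The Python sketch below assumes the parameterization P(Y ≤ j) = 1 / (1 + exp(-(cut_j - xb))), where xb is the linear predictor and cut_j are the cutpoints. The cutpoint values, the coefficient and the function names are hypothetical and chosen only for illustration.

```python
import math


def logistic(z):
    """Logistic cumulative distribution function."""
    return 1 / (1 + math.exp(-z))


def ologit_probabilities(xb, cutpoints):
    """Category probabilities from a linear predictor and ascending cutpoints."""
    cumulative = [logistic(c - xb) for c in cutpoints] + [1.0]
    probabilities, previous = [], 0.0
    for value in cumulative:
        probabilities.append(value - previous)  # P(Y = j) = P(Y <= j) - P(Y <= j-1)
        previous = value
    return probabilities


def odds_above(xb, cutpoint):
    """Odds of falling in a category above the given cutpoint."""
    below = logistic(cutpoint - xb)
    return (1 - below) / below


cutpoints = [2.2, 4.3]  # hypothetical /cut1 and /cut2
beta = 1.05             # hypothetical coefficient on a 0/1 predictor
probs_0 = ologit_probabilities(0.0, cutpoints)
probs_1 = ologit_probabilities(beta, cutpoints)
print("x = 0:", [round(p, 3) for p in probs_0])
print("x = 1:", [round(p, 3) for p in probs_1])

# The odds ratio is the same at every cutpoint: that is the proportional odds property.
ratios = [odds_above(beta, c) / odds_above(0.0, c) for c in cutpoints]
print("odds ratios by cutpoint:", [round(r, 4) for r in ratios])
```

Both ratios equal exp(beta), so a single coefficient describes the shift across every split of the ordered outcome.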

 

Source: https://stats.idre.ucla.edu/stata/dae/ordered-logistic-regression/

 

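References

1. Sun, Social Research Methods (course), 2016

2. Lawrence C. Hamilton, Statistics with Stata (Chinese edition: 《應用Stata做統計分析》), Tsinghua University Press, 2017

3. https://stats.idre.ucla.edu/ (UCLA: Institute for Digital Research and Education)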

 

 

y/2018.9.11

 
