Lasso: How to Do Statistical Inference with the Lasso

2021-02-14, Stata Lianxh (連享會)

Author: Yang Jichao (Sun Yat-sen University)
E-Mail: yangjch7@mail2.sysu.edu.cn

Source： Stata Blogs: Using the lasso for inference in high-dimensional models

1. Introduction
2. Stata commands and usage
2.1 The PO method
2.2 The DS method
2.3 The XPO method
3. Conclusion
4. References
5. Related posts


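1. Introduction

High-dimensional models are increasingly common in applied research. The previous post, Stata:拉索開心讀懂-Lasso 入門 (an introduction to the lasso), showed how the lasso can be used to estimate coefficients in high-dimensional models. That post left one problem unsolved, however: statistical inference. This is in fact one of the main difficulties with the lasso. This post picks up where the previous one left off and discusses the Stata 16 commands for lasso-based inference in high-dimensional models.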

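To make the discussion concrete, this post uses an example from Sunyer et al. (2017), who estimate the effect of air pollution on the reaction time of primary school students. The model is:

$$\text{react}_i = \text{no2\_class}_i\,\gamma + \mathbf{x}_i\boldsymbol{\beta} + \epsilon_i$$

The variables are defined as follows:

Variable     Meaning
react        measure of the reaction time of student i
no2_class    measure of the NO2 level in the classroom of student i
x_i          vector of control variables that may need to be included in the model

We want to estimate the effect of no2_class on react, $\gamma$, together with its confidence interval. The problem is that the vector $\mathbf{x}_i$ contains a large number of potential controls, so the model is high-dimensional.

In fact, apart from the variable of interest, no2_class, inference on the other variables, and even their coefficient estimates, is of secondary importance. If we could identify the subset of controls in $\mathbf{x}_i$ that actually belongs in the model, we could simply regress react on no2_class and that subset.

The problem is that we do not know which controls belong in that subset, so they have to be selected from the data.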

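For variable selection, the previous post, Stata:拉索開心讀懂-Lasso 入門, described one approach, the postselection method. It has two steps. First, use the lasso to select the key controls. Second, regress the dependent variable on the variable of interest and the selected controls.

However, the postselection method applied directly does not support reliable inference. Leeb and Pötscher (2008) show that the estimator it produces is not asymptotically normal, and that the usual large-sample theory does not give reliable inference in finite samples either.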
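For concreteness, the naive two-step postselection procedure described above might look like the following sketch. It is an illustration of what should not be relied on for inference, not a recommended workflow. It assumes the breathe data and the control lists used later in this post, forces no2_class into the lasso with the parenthesized always-include syntax, and assumes that the stored result e(othervars_sel) of lasso holds the selected controls; the macro name selected is hypothetical.

```stata
* Setup: same data and control lists as in Section 2
webuse breathe, clear
local ccontrols "sev_home sev_school age precip age0 siblings_old siblings_young no2_home green_home noise_school"
local fcontrols "grade sex lbweight breastfeed msmoke feducation meducation overweight"
local ctrls "i.(`fcontrols') c.(`ccontrols') i.(`fcontrols')#c.(`ccontrols')"

* Step 1: lasso of react on the controls, always keeping no2_class in the model
lasso linear react (no2_class) `ctrls', selection(plugin)
local selected "`e(othervars_sel)'"    // assumed: names of the selected controls

* Step 2: OLS of react on no2_class and the selected controls
* The reported standard errors ignore the selection step in Step 1
regress react no2_class `selected', vce(robust)
```

The confidence interval printed by the final regress treats the selected model as if it had been fixed in advance, which is exactly the problem raised by Leeb and Pötscher (2008).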

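To address these problems, Belloni et al. (2012), Belloni, Chernozhukov, and Hansen (2014), Belloni, Chernozhukov, and Wei (2016), and Chernozhukov et al. (2018) derived three types of estimators that provide reliable inference on the coefficient of the variable of interest after the controls have been selected by the lasso: the partialing-out (PO), double-selection (DS), and cross-fit partialing-out (XPO) estimators. The corresponding Stata commands are:

Model        PO command     DS command    XPO command
linear       poregress      dsregress     xporegress
logit        pologit        dslogit       xpologit
Poisson      popoisson      dspoisson     xpopoisson
linear IV    poivregress    /             xpoivregress

The next section shows how to use these commands.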

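2. Stata commands and usage

This section continues the air pollution and reaction time example from the previous section. Using a subset of the variables in the Sunyer et al. (2017) data, it applies the PO, DS, and XPO methods for estimation and inference and gives some intuition for each method.

2.1 The PO method

Step 1: Load the data and store the continuous variables, the factor variables, and their interactions in local macros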

webuse breathe, clear // load the data

* Store the continuous controls in the local macro ccontrols
local ccontrols "sev_home sev_school age precip age0 siblings_old "
local ccontrols "`ccontrols' siblings_young no2_home green_home noise_school "

* Store the factor controls in the local macro fcontrols
local fcontrols "grade sex lbweight breastfeed msmoke "
local fcontrols "`fcontrols' feducation meducation overweight "

* Store the continuous controls, the factor controls, and their interactions in the local macro ctrls
local ctrls "i.(`fcontrols') c.(`ccontrols') "
local ctrls "`ctrls' i.(`fcontrols')#c.(`ccontrols') "

Next, describe the variables:

. describe react no2_class `fcontrols' `ccontrols'

variable name   type     format    label      variable label
---------------------------------------------------------------------------------
react           double   %10.0g             * Reaction time (ms)
no2_class       float    %9.0g                Classroom NO2 levels (ug/m3)
grade           byte     %9.0g     grade      Grade in school
sex             byte     %9.0g     sex        Sex
lbweight        byte     %18.0g    lowbw    * Low birthweight
breastfeed      byte     %19.0f    bfeed      Duration of breastfeeding
msmoke          byte     %10.0f    smoke    * Mother smoked during pregnancy
feducation      byte     %17.0g    edu        Father's education level
meducation      byte     %17.0g    edu        Mother's education level
overweight      byte     %32.0g    overwt   * Overweight by WHO/CDC definition
sev_home        float    %9.0g                Home socio-economic vulnerability index
sev_school      float    %9.0g                School socio-economic vulnerability index
age             float    %9.0g                Age (years)
precip          double   %10.0g               Daily total precipitation
age0            double   %4.1f                Age started school
siblings_old    byte     %1.0f                Number of older siblings in house
siblings_young  byte     %1.0f                Number of younger siblings in house
no2_home        float    %9.0g                Home NO2 levels (ug/m3)
green_home      double   %10.0g               Home greenness (NDVI), 300m buffer
noise_school    float    %9.0g                School noise levels (dB)

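Step 2: Run the PO regression

Use the poregress command. The controls() option specifies the potential controls; here they are the continuous variables, the factor variables, and their interactions defined in Step 1.

poregress react no2_class, controls(`ctrls') // PO method

The results are: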

Partialing-out linear model         Number of obs               =      1,084
                                    Number of controls          =          0
                                    Number of selected controls =          0
                                    Wald chi2(1)                =      11.39
                                    Prob > chi2                 =     0.0007

------------------------------------------------------------------------------
             |               Robust
       react |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   no2_class |   1.472821   .4363526     3.38   0.001      .617586    2.328057
------------------------------------------------------------------------------
Note: Chi-squared test is a Wald test of the coefficients of the variables
      of interest jointly equal to zero. Lassos select controls for model
      estimation. Type lassoinfo to see number of selected variables in each
      lasso.

Store the results as poplug:

estimates store poplug

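These results indicate that each additional microgram of NO2 per cubic meter increases mean reaction time by 1.47 milliseconds. Note that the output reports only the variable of interest, no2_class. Nothing is reported for the controls, which is a feature of this class of estimators.

The PO method uses several lassos to select the controls. To illustrate how it works, consider a simple linear model:

$$y = d\gamma + \mathbf{x}\boldsymbol{\beta} + \epsilon$$

where $y$ is the dependent variable, $d$ is the variable of interest, and $\mathbf{x}$ is the vector of potential controls. The PO estimator is computed in five steps:

1. Use a lasso of $y$ on $\mathbf{x}$ to select the controls $\tilde{\mathbf{x}}_y$ that predict $y$.
2. Regress $y$ on $\tilde{\mathbf{x}}_y$ and let $\tilde{y}$ be the residuals.
3. Use a lasso of $d$ on $\mathbf{x}$ to select the controls $\tilde{\mathbf{x}}_d$ that predict $d$.
4. Regress $d$ on $\tilde{\mathbf{x}}_d$ and let $\tilde{d}$ be the residuals.
5. Regress $\tilde{y}$ on $\tilde{d}$ to obtain the estimate of $\gamma$ and its standard error.

For more detail on these five steps, see the analysis and derivations in Chernozhukov, Hansen, and Spindler (2015a, b).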
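The five partialing-out steps can be sketched by hand with lasso, predict, and regress. This is a minimal illustration under stated assumptions, not what poregress does internally: it uses the postselection option of predict to obtain the post-lasso fitted values, the variable names yhat, dhat, ytilde, and dtilde are hypothetical, and the estimation samples may differ slightly from those of poregress, so the numbers need not match its output exactly.

```stata
* Setup: same data and control lists as in Step 1
webuse breathe, clear
local ccontrols "sev_home sev_school age precip age0 siblings_old siblings_young no2_home green_home noise_school"
local fcontrols "grade sex lbweight breastfeed msmoke feducation meducation overweight"
local ctrls "i.(`fcontrols') c.(`ccontrols') i.(`fcontrols')#c.(`ccontrols')"

* Steps 1-2: lasso of y on x, then residuals from the post-selection fit
lasso linear react `ctrls', selection(plugin)
predict double yhat, postselection
generate double ytilde = react - yhat

* Steps 3-4: lasso of d on x, then residuals from the post-selection fit
lasso linear no2_class `ctrls', selection(plugin)
predict double dhat, postselection
generate double dtilde = no2_class - dhat

* Step 5: regress the y residuals on the d residuals
regress ytilde dtilde, vce(robust)
```

The coefficient on dtilde plays the role of the no2_class coefficient reported by poregress.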

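Note: by default, poregress uses a lasso whose penalty parameter is chosen by the plugin method. Other selection methods can be requested with the selection() option; see also the previous post, Stata:拉索開心讀懂-Lasso 入門.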
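As a sketch of the selection() option mentioned in the note, the following runs the PO regression with the default plugin method and then with cross-validation. The seed value and the stored-result name pocv are arbitrary choices made for this illustration; rseed() is used only so that the cross-validation folds are reproducible.

```stata
* Setup: same data and control lists as in Step 1
webuse breathe, clear
local ccontrols "sev_home sev_school age precip age0 siblings_old siblings_young no2_home green_home noise_school"
local fcontrols "grade sex lbweight breastfeed msmoke feducation meducation overweight"
local ctrls "i.(`fcontrols') c.(`ccontrols') i.(`fcontrols')#c.(`ccontrols')"

* Default: plugin choice of the lasso penalty
poregress react no2_class, controls(`ctrls')
estimates store poplug

* Alternative: choose the penalty by cross-validation (slower)
poregress react no2_class, controls(`ctrls') selection(cv) rseed(12345)
estimates store pocv
```

Cross-validation typically selects more controls than the plugin method, so the two runs can differ both in the selected controls and in run time.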

Step 3: Further analysis

The lassocoef command shows which variables were selected in each lasso.

lassocoef ( ., for(react)) ( ., for(no2_class)) // compare the variables selected by each lasso

The results are:

----------------------------------------------
                      |   react    no2_class
----------------------+-----------------------
                  age |     x
                      |
   grade#c.green_home |
                  4th |     x
                      |
 grade#c.noise_school |
                  2nd |     x
                      |
            sex#c.age |
                    0 |     x
                      |
     feducation#c.age |
                    4 |     x
                      |
           sev_school |                x
               precip |                x
             no2_home |                x
           green_home |                x
         noise_school |                x
                      |
   grade#c.sev_school |
                  2nd |                x
                      |
                _cons |     x          x
----------------------------------------------
Legend:
  b - base level
  e - empty cell
  o - omitted
  x - estimated

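The output shows that the lasso for react selects age and four interaction terms, while the lasso for no2_class selects sev_school, precip, no2_home, green_home, noise_school, and one interaction term. Apart from some variables that enter the interaction terms of both lassos, the two sets of selected controls are very different.

The selection results can also be inspected with the lassoknots command:

lassoknots, for(react) // variable selection for react
lassoknots, for(no2_class) // variable selection for no2_class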
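The note at the bottom of the poregress output also points to lassoinfo. As a brief sketch, the postestimation commands can be run together after the PO regression; lassoinfo summarizes each lasso (selection method, penalty, and number of selected variables), which is a quick way to check how many controls each lasso kept.

```stata
* Setup: same data and control lists as in Step 1
webuse breathe, clear
local ccontrols "sev_home sev_school age precip age0 siblings_old siblings_young no2_home green_home noise_school"
local fcontrols "grade sex lbweight breastfeed msmoke feducation meducation overweight"
local ctrls "i.(`fcontrols') c.(`ccontrols') i.(`fcontrols')#c.(`ccontrols')"

quietly poregress react no2_class, controls(`ctrls')

* One row per lasso: dependent variable, selection method, lambda, number selected
lassoinfo

* Which controls each lasso selected
lassocoef (., for(react)) (., for(no2_class))
```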

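2.2 The DS method

The DS method is an extension of the PO method. Put simply, DS makes the estimates more robust by including extra controls. The syntax parallels that of PO: use the dsregress command and store the results as dsplug.

. dsregress react no2_class, controls(`ctrls') // DS method

. estimates store dsplug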

Double-selection linear model       Number of obs               =      1,084
                                    Number of controls          =          0
                                    Number of selected controls =          0
                                    Wald chi2(1)                =      11.39
                                    Prob > chi2                 =     0.0007

------------------------------------------------------------------------------
             |               Robust
       react |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   no2_class |   1.472821   .4363526     3.38   0.001      .617586    2.328057
------------------------------------------------------------------------------
Note: Chi-squared test is a Wald test of the coefficients of the variables
      of interest jointly equal to zero. Lassos select controls for model
      estimation. Type lassoinfo to see number of selected variables in each
      lasso.

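The DS estimates are similar to the PO estimates and are interpreted in the same way, so we do not repeat the discussion here.

The DS method has four steps. Write $y$ for the dependent variable (react), $d$ for the variable of interest (no2_class), and $\mathbf{x}$ for the vector of potential controls:

1. Use a lasso of $y$ on $\mathbf{x}$ to select the controls $\tilde{\mathbf{x}}_y$ that predict $y$.
2. Use a lasso of $d$ on $\mathbf{x}$ to select the controls $\tilde{\mathbf{x}}_d$ that predict $d$.
3. Let $\hat{\mathbf{x}}$ be the union of $\tilde{\mathbf{x}}_y$ and $\tilde{\mathbf{x}}_d$.
4. Regress $y$ on $d$ and $\hat{\mathbf{x}}$. The coefficient on $d$ and its standard error are the DS estimate and standard error.

Belloni, Chernozhukov, and Wei (2016) note that although DS and PO have the same large-sample properties, DS performed better in their simulations. In finite samples, DS may perform better because it includes the selected controls in a single regression rather than in two separate regressions.

As with PO, the DS results can be examined further with lassocoef and lassoknots. The analysis parallels that for PO, so it is not repeated here.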
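The four double-selection steps in this section can also be sketched by hand. This is a rough illustration rather than the internals of dsregress: it assumes that the stored result e(allvars_sel) of lasso holds the names of the selected controls in a form that regress accepts, and the macro names xy, xd, and xunion are hypothetical. The numbers need not match the dsregress output exactly.

```stata
* Setup: same data and control lists as in Section 2.1
webuse breathe, clear
local ccontrols "sev_home sev_school age precip age0 siblings_old siblings_young no2_home green_home noise_school"
local fcontrols "grade sex lbweight breastfeed msmoke feducation meducation overweight"
local ctrls "i.(`fcontrols') c.(`ccontrols') i.(`fcontrols')#c.(`ccontrols')"

* Step 1: controls that predict y
lasso linear react `ctrls', selection(plugin)
local xy "`e(allvars_sel)'"

* Step 2: controls that predict d
lasso linear no2_class `ctrls', selection(plugin)
local xd "`e(allvars_sel)'"

* Step 3: union of the two selected sets
local xunion : list xy | xd

* Step 4: a single regression of y on d and the union of selected controls
regress react no2_class `xunion', vce(robust)
```

The difference from the partialing-out sketch is visible in the last step: the selected controls enter one regression together with no2_class instead of two separate residual regressions.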

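2.3 The XPO method

The XPO method is also known as double machine learning (DML). Chernozhukov et al. (2018) show that it has better theoretical properties than PO and that it also performs better in finite samples. The most important difference between the two is that XPO requires a weaker sparsity condition. In practice, this means that XPO can provide more reliable inference: because it uses split-sample techniques, it can accommodate more controls.

Usage parallels that of PO. Use the xporegress command and store the results as xpoplug.

xporegress react no2_class, controls(`ctrls') // XPO method

estimates store xpoplug

The results are: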

Cross-fit partialing-out            Number of obs                =     1,084
linear model                        Number of controls           =         0
                                    Number of selected controls  =         0
                                    Number of folds in cross-fit =        10
                                    Number of resamples          =         1
                                    Wald chi2(1)                 =     11.39
                                    Prob > chi2                  =    0.0007

------------------------------------------------------------------------------
             |               Robust
       react |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   no2_class |   1.472821   .4363526     3.38   0.001      .617586    2.328057
------------------------------------------------------------------------------
Note: Chi-squared test is a Wald test of the coefficients of the variables
      of interest jointly equal to zero. Lassos select controls for model
      estimation. Type lassoinfo to see number of selected variables in each
      lasso.

The estimates and their interpretation are again similar to those for PO, so we do not repeat them.
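The header of the xporegress output reports 10 cross-fit folds and 1 resample, which are the defaults. As a sketch, these can be changed with the xfolds() and resample() options, and rseed() fixes the random sample split so that the results are reproducible. The particular values below (5 folds, 3 resamples, seed 12345) and the stored-result name xpo5 are arbitrary choices for illustration.

```stata
* Setup: same data and control lists as in Section 2.1
webuse breathe, clear
local ccontrols "sev_home sev_school age precip age0 siblings_old siblings_young no2_home green_home noise_school"
local fcontrols "grade sex lbweight breastfeed msmoke feducation meducation overweight"
local ctrls "i.(`fcontrols') c.(`ccontrols') i.(`fcontrols')#c.(`ccontrols')"

* 5 cross-fit folds, repeated over 3 random splits, with a fixed seed
xporegress react no2_class, controls(`ctrls') xfolds(5) resample(3) rseed(12345)
estimates store xpo5
```

More folds or resamples mean more lassos to fit, so run time grows accordingly.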

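As for how XPO works, the sample is first split into subsamples, and each subsample is called a fold. Taking 2 folds as an example, and writing $y$ for the dependent variable, $d$ for the variable of interest, and $\mathbf{x}$ for the vector of potential controls, the procedure has six steps:

1. Split the sample into two folds, A and B.
2. Using the data in fold A, select the controls and estimate the post-selection coefficients: a lasso of $y$ on $\mathbf{x}$ selects $\tilde{\mathbf{x}}_y$, and a regression of $y$ on $\tilde{\mathbf{x}}_y$ gives $\tilde{\boldsymbol{\beta}}_A$; a lasso of $d$ on $\mathbf{x}$ selects $\tilde{\mathbf{x}}_d$, and a regression of $d$ on $\tilde{\mathbf{x}}_d$ gives $\tilde{\boldsymbol{\delta}}_A$.
3. For the observations in fold B, compute the residuals $\tilde{y} = y - \tilde{\mathbf{x}}_y\tilde{\boldsymbol{\beta}}_A$ and $\tilde{d} = d - \tilde{\mathbf{x}}_d\tilde{\boldsymbol{\delta}}_A$.
4. Using the data in fold B, repeat step 2 to obtain $\tilde{\boldsymbol{\beta}}_B$ and $\tilde{\boldsymbol{\delta}}_B$.
5. For the observations in fold A, compute the residuals $\tilde{y}$ and $\tilde{d}$ using the fold-B coefficients.
6. Regress $\tilde{y}$ on $\tilde{d}$ in the full sample to obtain the estimate of $\gamma$ and its standard error.

The algorithm above splits the sample into 2 folds. The algorithms for 10 folds, or for k folds in general, are analogous.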
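The 2-fold cross-fitting idea can be sketched by hand as follows. This is a simplified illustration under stated assumptions, not the xporegress implementation: the folds are drawn with a simple random split, the post-lasso predictions come from the postselection option of predict, and the names fold, yhat1, yhat2, dhat1, dhat2, ytilde, and dtilde are hypothetical. The numbers will generally differ from the xporegress output, which by default uses 10 folds.

```stata
* Setup: same data and control lists as in Section 2.1
webuse breathe, clear
local ccontrols "sev_home sev_school age precip age0 siblings_old siblings_young no2_home green_home noise_school"
local fcontrols "grade sex lbweight breastfeed msmoke feducation meducation overweight"
local ctrls "i.(`fcontrols') c.(`ccontrols') i.(`fcontrols')#c.(`ccontrols')"

* Split the sample at random into two folds
set seed 12345
generate byte fold = cond(runiform() < 0.5, 1, 2)
generate double ytilde = .
generate double dtilde = .

forvalues k = 1/2 {
    * Fit on the other fold, then fill in residuals for fold k
    quietly lasso linear react `ctrls' if fold != `k', selection(plugin)
    predict double yhat`k', postselection
    quietly replace ytilde = react - yhat`k' if fold == `k'

    quietly lasso linear no2_class `ctrls' if fold != `k', selection(plugin)
    predict double dhat`k', postselection
    quietly replace dtilde = no2_class - dhat`k' if fold == `k'
}

* Final step: regress the cross-fit residuals of y on those of d
regress ytilde dtilde, vce(robust)
```

The key point is that the residuals for each observation are computed from coefficients estimated on the other fold, so selection and estimation never use the same observations.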

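3. Conclusion

In summary, of the three methods, XPO is the one most recommended, because it has better properties in both large and finite samples. Its drawback is that it is more time-consuming to compute.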
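To compare the three estimators side by side, the stored results can be combined with estimates table. The sketch below reruns all three commands so that it is self-contained; the display formats and the seed passed to rseed() are arbitrary choices for illustration.

```stata
* Setup: same data and control lists as in Section 2.1
webuse breathe, clear
local ccontrols "sev_home sev_school age precip age0 siblings_old siblings_young no2_home green_home noise_school"
local fcontrols "grade sex lbweight breastfeed msmoke feducation meducation overweight"
local ctrls "i.(`fcontrols') c.(`ccontrols') i.(`fcontrols')#c.(`ccontrols')"

quietly poregress react no2_class, controls(`ctrls')
estimates store poplug

quietly dsregress react no2_class, controls(`ctrls')
estimates store dsplug

quietly xporegress react no2_class, controls(`ctrls') rseed(12345)
estimates store xpoplug

* Coefficients and standard errors of no2_class under the three methods
estimates table poplug dsplug xpoplug, b(%9.4f) se(%9.4f)
```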

4. References

Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen. 2012. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80: 2369–2429.
Belloni, A., V. Chernozhukov, and C. Hansen. 2014. Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies 81: 608–650.
Belloni, A., V. Chernozhukov, and Y. Wei. 2016. Post-selection inference for generalized linear models with many controls. Journal of Business & Economic Statistics 34: 606–619.
Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins. 2018. Double/debiased machine learning for treatment and structural parameters. Econometrics Journal 21: C1–C68.
Chernozhukov, V., C. Hansen, and M. Spindler. 2015a. Post-selection and post-regularization inference in linear models with many controls and instruments. American Economic Review 105: 486–490.
Chernozhukov, V., C. Hansen, and M. Spindler. 2015b. Valid post-selection and post-regularization inference: An elementary, general approach. Annual Review of Economics 7: 649–688.
Leeb, H., and B. M. Pötscher. 2008. Sparse estimators and the oracle property, or the return of Hodges' estimator. Journal of Econometrics 142: 201–211.
Sunyer, J., et al. 2017. Traffic-related air pollution and attention in primary school children. Epidemiology 28(2): 181–189.
Wooldridge, J. M. 2020. Introductory Econometrics: A Modern Approach. 7th ed. Boston, MA: Cengage Learning.
Yang, Jichao. Lianxh (連享會) post: Stata:拉索開心讀懂-Lasso 入門.

5. Related posts

Note: the list of posts below was generated with the command:
lianxh lasso 高維隨機推斷
To install the latest version of the lianxh command:
ssc install lianxh, replace

New Stata command pdslasso: how to choose among many controls and instruments?
Stata: An introduction to ridge and lasso regression (Ridge, Lasso)
Stata Blogs - An introduction to the lasso in Stata
Stata: ritest, randomization inference
