Stata: Ultimate Matching with ultimatch

2021-01-18 Stata連享會

Author: 黃俊凱 (Renmin University of China)
Email: kopanswer@126.com

Related previous posts

1. Introduction to matching

2. The ultimatch command

3. Worked examples


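Related previous posts
Stata: Data merging and matching with merge and reclink
Stata: psestimate, selecting covariates for propensity score matching (PSM)
Stata: Generalized exact matching, Coarsened Exact Matching (CEM)
Mr. Wooldridge's question: pairing in PSM analysis (the tadpoles look for their mother)
Stata: psestimate, selecting matching variables for propensity score matching (PSM)
Stata: From matching to regression: exact matching, fuzzy matching, and PSM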

Note: The list above was generated automatically by the lianxh.ado command, using lianxh 匹配 PSM matching, m .
The lianxh.ado command can be installed as follows:

. net install lianxh.pkg, from(https://arlionn.gitee.io/lianxh) replace
For details, see the lianxh project page: https://gitee.com/arlionn/lianxh .
 
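1. Introduction to matching
1.1 Univariate matching
Univariate matching methods include exact matching, k-nearest neighbor matching, and radius (caliper) matching:
Exact matching: control observations whose matching variable equals that of the treated observation serve as its counterfactuals;
k-nearest neighbor matching: the k control observations closest to the treated observation serve as its counterfactuals;
Radius (caliper) matching: all control observations within a specified radius serve as its counterfactuals.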
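The three univariate rules above can be sketched in a few lines of Python. This is only an illustration of the definitions, not how ultimatch is implemented; the function names and the toy scores are hypothetical.

```python
def exact_match(x, controls):
    """Indices of controls whose matching variable equals x."""
    return [i for i, c in enumerate(controls) if c == x]


def knn_match(x, controls, k):
    """Indices of the k controls closest to x (ties broken by position)."""
    order = sorted(range(len(controls)), key=lambda i: abs(controls[i] - x))
    return order[:k]


def radius_match(x, controls, radius):
    """Indices of all controls within `radius` of x."""
    return [i for i, c in enumerate(controls) if abs(c - x) <= radius]


# Hypothetical control-group values of the matching variable.
controls = [0.10, 0.30, 0.32, 0.50, 0.90]
treated_value = 0.30

print(exact_match(treated_value, controls))
print(knn_match(treated_value, controls, 2))
print(radius_match(treated_value, controls, 0.25))
```

Exact matching can return no counterfactual at all for a continuous variable, which is why the nearest-neighbor and radius rules are usually preferred for scores and distances.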
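1.2 Multivariate matching
The core idea of multivariate matching is dimension reduction: the variables are collapsed into a single distance or score, and univariate matching is then applied to it. The main methods are:
Coarsened exact matching: for example, firms in the same year and industry in corporate finance, or students admitted to the same university in the economics of education;
Percentile rank matching: a nonparametric method. For any treated observation, …
Mahalanobis distance matching: Euclidean distance has two drawbacks for matching. The dimensions are measured in different units, and the dimensions are correlated, so the coordinate axes are not orthogonal. Mahalanobis distance standardizes and rotates the data (the two steps together amount to the inverse of the covariance matrix), transforming the multivariate distribution into one whose variables share the same scale and are mutually uncorrelated. If the sample covariance matrix is replaced by the identity matrix, Mahalanobis distance reduces to Euclidean distance;
Propensity score matching: goes one step further than Mahalanobis distance because it reflects the selection mechanism, and observations that are far apart can still have the same score. This has made it a popular matching method.
Note: For more theory on these matching methods, see "Stata by hand: a complete guide to matching methods, Part A (theory)" (WeChat version).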
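To make the relation between the two distances concrete, here is a minimal Python sketch for two matching variables. It computes the quadratic form with the inverse of a 2x2 covariance matrix; the data points are made up, and the helper names are hypothetical rather than anything provided by ultimatch.

```python
import math


def covariance_2d(points):
    """Sample covariance matrix ((sxx, sxy), (sxy, syy)) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return ((sxx, sxy), (sxy, syy))


def mahalanobis_2d(a, b, cov):
    """sqrt((a-b)' S^{-1} (a-b)) for a 2x2 covariance matrix S."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    dx, dy = a[0] - b[0], a[1] - b[1]
    # Closed-form inverse of a 2x2 matrix applied to the difference vector.
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)


def euclidean_2d(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Made-up (age, weight) pairs; the two variables are on different scales
# and positively correlated.
points = [(20, 60), (30, 70), (40, 75), (50, 90), (60, 95)]
cov = covariance_2d(points)

print(euclidean_2d(points[0], points[4]))
print(mahalanobis_2d(points[0], points[4], cov))

# With the identity matrix in place of the covariance matrix,
# the Mahalanobis distance is the Euclidean distance.
identity = ((1.0, 0.0), (0.0, 1.0))
print(mahalanobis_2d((1, 2), (4, 6), identity))
```

Rescaling one variable (say, measuring age in months instead of years) changes the Euclidean distance but leaves the Mahalanobis distance unchanged, because the covariance matrix is rescaled along with it.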
 
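2. The ultimatch command
2.1 Basic syntax
ultimatch [varlist] [if] [in], treated(var) [exact(varlist)] [draw(#)] [caliper(#)] [support] [single] [greedy] [between] [rank] [copy] [report(varlist)]
      [unit(varlist)] [unmatched] [exp(string)] [limit(string)]
Here varlist lists the matching variables (one or several), and var is the dummy variable that separates the treated group from the control group.
By default, ultimatch uses Mahalanobis distance matching with replacement. Every successful run of ultimatch also creates three auxiliary variables, _match, _distance, and _weight:
_match holds integer match IDs; observations with the same _match value are matched together.
_distance is the distance between a control observation and its matched treated observation:
for matched control observations, it equals the distance to the matched treated observation;
for matched treated observations, it is always 0.
_weight equals 1 for matched treated observations, and the _weight values of the control observations matched to one treated observation (the total weight) also sum to 1.
If a control observation is matched to several treated observations, its _weight is the sum of the corresponding weights. If the copy option is also specified, a copy of the control observation is created for each treated observation it is matched to, and each copy's _weight equals the share that the control observation holds among the counterfactuals of that treated observation.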
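The weighting rule can be illustrated with a small Python sketch. It assumes that the counterfactuals of one treated observation share a total weight of 1 equally, which is consistent with the fractional weights shown in the examples later in this post but is an assumption about ultimatch, not documented behavior. The match table and function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def control_weights(matches):
    """matches maps a treated id to the list of its matched control ids.

    Returns control id -> accumulated weight, assuming each treated
    observation spreads a total weight of 1 equally over its controls.
    """
    weights = defaultdict(Fraction)
    for controls in matches.values():
        share = Fraction(1, len(controls))
        for c in controls:
            weights[c] += share
    return dict(weights)


def copied_rows(matches):
    """One (control, treated, share) row per matched pair, as with copy."""
    return [
        (c, t, Fraction(1, len(controls)))
        for t, controls in matches.items()
        for c in controls
    ]


# Hypothetical matches: control C2 serves three treated observations.
matches = {"T1": ["C1", "C2"], "T2": ["C2"], "T3": ["C2", "C3", "C4"]}

for control, weight in sorted(control_weights(matches).items()):
    print(control, weight)
for row in copied_rows(matches):
    print(row)
```

Under this rule the control weights add up to the number of matched treated observations, so the weighted control group has the same size as the treated group.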
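2.2 Advanced options
Univariate matching methods
Caliper (radius) matching: caliper(real)
Caliper (radius) matching without replacement: caliper(real) greedy
With the greedy option, each control observation is retained for …
k-nearest neighbor matching: draw(integer)
k-nearest neighbor matching, searching backward and forward independently for k counterfactuals each: draw(integer) between
In addition, ultimatch by default treats all observations with the same score or distance as a single draw, which helps ease the "burden" on any one neighbor. The single option instead picks one observation at random, from those with the same score or distance, as the counterfactual.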
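The following Python sketch illustrates how tied neighbors can count as one draw and how a backward and forward search differs from a plain nearest-neighbor search. It is a reading of the option descriptions above, not ultimatch's code: the function names are hypothetical, and putting controls whose score equals the treated score on the backward side is an assumption.

```python
import random


def draw_levels(x, controls, k=1, single=False, rng=None):
    """Indices of controls on the k smallest distinct distances to x.

    All controls tied at one distance count as a single draw; with
    single=True one of the tied controls is picked at random instead.
    """
    rng = rng or random.Random(0)
    by_distance = {}
    for i, c in enumerate(controls):
        by_distance.setdefault(abs(c - x), []).append(i)
    picked = []
    for d in sorted(by_distance)[:k]:
        tied = by_distance[d]
        picked.extend([rng.choice(tied)] if single else tied)
    return picked


def draw_between(x, controls, k=1):
    """Search k distance levels backward (<= x) and forward (> x) separately."""
    lower = [i for i, c in enumerate(controls) if c <= x]
    upper = [i for i, c in enumerate(controls) if c > x]
    picked = []
    for side in (lower, upper):
        local = draw_levels(x, [controls[i] for i in side], k)
        picked.extend(side[j] for j in local)
    return picked


# Hypothetical scores, chosen as exact binary fractions so that ties are exact.
controls = [0.5, 0.625, 0.375, 0.75, 0.125]
treated_score = 0.5

print(draw_levels(treated_score, controls, k=1))
print(draw_levels(treated_score, controls, k=2))
print(draw_levels(treated_score, controls, k=2, single=True))
print(draw_between(treated_score, controls, k=1))
```

With k=2 the default returns three controls, because the two controls at distance 0.125 are tied and count as one draw, whereas single=True returns exactly two.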
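Multivariate matching methods
Coarsened exact matching: exact(varlist)
Percentile rank matching based on Euclidean distance: rank
Mahalanobis distance matching: the default method
Reporting differences in variables
Differences after matching: report(varlist)
Differences before and after matching: report(varlist) unmatched
Generating auxiliary variables
Whether an observation lies in the common support: support
The support option creates an auxiliary variable _support, which equals 1 if the observation lies in the common support.
Create a copy of a control observation for each treated observation it is matched to: copy
The _weight of a counterfactual observation is the sum of its individual weights. To create one copy per matched treated observation, each with its own _weight, specify the copy option.
Other options
Identify the units in panel data: unit(varlist)
When unit(varlist) is specified, the differences reported by report(varlist) are adjusted for clustered standard errors accordingly.
Custom matching conditions: exp(string)
Users can also define their own matching condition with exp(string). If the expression equals …
Additional constraints: limit(string)
The limit(string) option differs from percentile rank matching. Its argument is a list of variable-number pairs, where each number must lie between … the univariate percentile rank difference in age does not exceed …, the percentile rank difference in height does not exceed …, and the percentile rank difference in weight does not exceed …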
3. Worked examples
3.1 Data generating process
  clear
  tempfile tmp
  set obs 2000
  set seed 2000

*-Pre-treatment data
  gen byte period = 0 // pre-treatment
  label var period "Post-treatment indicator"
  gen long id = _n
  label var id "Individual id"
  gen byte gender = uniform() > 0.5
  label var gender "Gender"
  gen age = uniform()
  label var age "Age"
  gen fitness = normal(gender*0.25 - age + invnorm(uniform())*0.1) 
  label var fitness "Fitness level"
  gen weight = normal(-gender*0.25 + age*0.25 - fitness*0.25 + invnorm(uniform())*0.1)
  label var weight "Body weight"
  gen treated = normal(fitness + invnorm(uniform())*0.25) > 0.73
  label var treated "Treated group indicator"
  save `tmp'

*-Post-treatment data
  replace period = 1 // after treatment
  replace weight = weight + weight*(uniform()-0.5)*0.2 - weight*(fitness-0.5)*0.25

*-Append the pre- and post-treatment data
  append using `tmp'
  sort id period
  replace weight = int(30.5+100*weight)
  replace age = int(18.5+50*age)
  gen effect = treated*period // treatment effect (interaction term for DiD)
  label var effect "Treatment effect interaction"
  order id treated period effect gender age weight fitness  

*-Propensity score
  probit treated age gender weight if period == 0 // omitting "unobserved" selection 
  predict score // propensity score
  label var score "Propensity score"

  des
This gives us a data set on health status, described as follows:
  obs:         4,000                          
 vars:             9                          6 Nov 2020 18:24
--
              storage   display    value
variable name   type    format     label      variable label
--
id              long    %12.0g                Individual id
treated         float   %9.0g                 Treated group indicator
period          byte    %8.0g                 Post-treatment indicator
effect          float   %9.0g                 Treatment effect interaction
gender          byte    %8.0g                 Gender
age             float   %9.0g                 Age
weight          float   %9.0g                 Body weight
fitness         float   %9.0g                 Fitness level
score           float   %9.0g                 Propensity score
--
Sorted by: id  period
The sample is a panel of 2,000 individuals observed in two periods (4,000 observations). The variables treated, period, and effect are the group variable, the period variable, and the interaction term of the DID model; gender, age, and weight are the individual's gender, age, and body weight; fitness is the individual's health status; and score is the individual's propensity score for entering the treated group.
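Basic syntax
  ultimatch score if period == 0, treated(treated)
  des _match _weight _distance
Running the code above creates three variables:
_match, a non-negative, auto-incrementing sequence; control observations with the same _match value are matched to the same treated observation;
_weight, which equals 1 for matched treated observations and the accumulated matching weight for matched control observations. _weight corresponds to pweight-type weights;
_distance, the distance from a matched control observation to its nearest treated observation.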
              storage   display    value
variable name   type    format     label      variable label
---
_match          long    %12.0g                match id
_weight         double  %10.0g                pweight
_distance       double  %10.0g                neighbor distance
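By default, ultimatch reports the numbers of treated and control observations before and after matching. As shown below, before matching there are 387 treated and 1,613 control observations; after matching, the 387 treated observations are matched to 448 control observations.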
Nearest Neighbor

         Support |        Treated         Control
--+--
           Total |            387            1613
         Without |              0               0
            With |            387            1613
--+--
         Matched |            387             448
       Clustered |              0               0
        Clusters |            387             448
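Because _weight behaves like a pweight, one natural use of the matched sample is a weighted difference-in-differences comparison. The Python sketch below shows the arithmetic on made-up numbers. It is not part of the original example: the rows, the weights, and the function names are hypothetical, and it assumes each individual's matching weight from period 0 is carried over to period 1.

```python
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_did(rows):
    """rows: list of (treated, period, outcome, weight) tuples.

    Returns (treated post - treated pre) - (control post - control pre),
    using weighted means within each group-period cell.
    """
    cell = {}
    for g in (0, 1):
        for p in (0, 1):
            sel = [(y, w) for t, per, y, w in rows if t == g and per == p]
            cell[g, p] = weighted_mean([y for y, _ in sel], [w for _, w in sel])
    return (cell[1, 1] - cell[1, 0]) - (cell[0, 1] - cell[0, 0])


# Made-up matched sample: two treated individuals (weight 1 each) and two
# controls whose weights reflect how often they were used as counterfactuals.
rows = [
    (1, 0, 80, 1.0), (1, 1, 74, 1.0),
    (1, 0, 70, 1.0), (1, 1, 66, 1.0),
    (0, 0, 82, 1.5), (0, 1, 81, 1.5),
    (0, 0, 68, 0.5), (0, 1, 67, 0.5),
]

print(weighted_did(rows))
```

In Stata the same comparison would typically be run as a weighted regression on the group, period, and interaction variables; the sketch only makes the role of the weights explicit.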

Using the report(varlist) option
  cap drop _*
  ultimatch score if period == 0, treated(treated) report(gender age)
  cap drop _*
  ultimatch score if period == 0, treated(treated) report(gender age) unmatched	
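The report(varlist) option reports the post-matching differences between the treated and control groups in the variables in varlist, together with the corresponding t tests. Its sub-option unmatched additionally reports the pre-matching differences in varlist and their t tests.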
. ultimatch score if period == 0, treated(treated) report(gender age)
Nearest Neighbor

    Support |        Treated         Control
--+--
      Total |            387            1613
    Without |              0               0
       With |            387            1613
--+--
    Matched |            387             448
  Clustered |              0               0
   Clusters |            387             448
--+----
    Matched |        Treated         Control |    StdErr        t  p>|t|
--+--+-
     gender |     .599483204      .589147287 |  .0390341     0.26  0.791
        age |     34.8165375      34.6925065 |   .967815     0.13  0.898
--

. ultimatch score if period == 0, treated(treated) report(gender age) unmatched
Nearest Neighbor
--
    Support |     Treated         Control
--+----
      Total |         387            1613
    Without |           0               0
       With |         387            1613
--+----
    Matched |         387             448
  Clustered |           0               0
   Clusters |         387             448
--+-
  Unmatched |     Treated         Control |    StdErr        t  p>|t|
--+----+-
     gender |  .599483204      .458152511 |  .0281268     5.02  0.000
        age |  34.8165375      45.2461252 |   .783574   -13.31  0.000
--+----+-
    Matched |     Treated         Control |    StdErr        t  p>|t|
--+----+-
     gender |  .599483204      .589147287 |  .0390341     0.26  0.791
        age |  34.8165375      34.6925065 |   .967815     0.13  0.898
----
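Using the matching methods
ultimatch supports distance matching: Euclidean distance matching (euclid), percentile rank matching based on Euclidean distance (rank), and Mahalanobis distance matching (mahalanobis).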
  cap drop _*
  ultimatch gender age weight if period == 0, treated(treated) report(weight) euclid

  cap drop _*
  ultimatch gender age weight if period == 0, treated(treated) report(weight) rank

  cap drop _*
  ultimatch gender age weight if period == 0, treated(treated) report(weight) mahalanobis
The results are as follows:
. ultimatch gender age weight if period == 0, treated(treated) report(weight) euclid

Euclidean Distance-based Neighborhood Matching
--
    Support |     Treated         Control
--+----
      Total |         387            1613
    Without |           0               0
       With |         387            1613
--+----
    Matched |         387             600
  Clustered |           0               0
   Clusters |         387             600
--+-
    Matched |     Treated         Control |    StdErr        t  p>|t|       SDM
--+----+-
     weight |  73.4444444      73.4384413 |   .522879     0.01  0.991   0.00087
----

. ultimatch gender age weight if period == 0, treated(treated) report(weight) rank
Euclidean Distance-based Percentile Rank Neighborhood Matching
--
    Support |     Treated         Control
--+----
      Total |         387            1613
    Without |           0               0
       With |         387            1613
--+----
    Matched |         387             456
  Clustered |           0               0
   Clusters |         387             456
--+-
    Matched |     Treated         Control |    StdErr        t  p>|t|       SDM
--+----+-
     weight |  73.4444444      73.4470284 |   .541415    -0.00  0.996  -0.00038
----

. ultimatch gender age weight if period == 0, treated(treated) report(weight) mahalanobis 
Mahalanobis Distance-based Neighborhood Matching
--
    Support |     Treated         Control
--+----
      Total |         387            1613
    Without |           0               0
       With |         387            1613
--+----
    Matched |         387             503
  Clustered |           0               0
   Clusters |         387             503
--+-
    Matched |     Treated         Control |    StdErr        t  p>|t|       SDM
--+----+-
     weight |  73.4444444      73.4668389 |    .55054    -0.04  0.968  -0.00325
----
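ultimatch also supports score matching. Because score matching ultimately reduces to univariate matching, only the univariate options are covered here: the number of draws draw(integer), backward and forward sampling between, and the additional univariate percentile rank constraint limit(string).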
  cap drop _*
  qui ultimatch score if period == 0, treated(treated)
  sort period score treated
  list id treated _distance if _match == 1
    +--+
      |   id   treated   _dista~e |
      |--|
  38. | 1617         0          0 |
  39. |  328         1          0 |
      +--+
  cap drop _*
  qui ultimatch score if period == 0, treated(treated) draw(3)
  sort period score treated
  list id treated _distance if _match == 1
      +---+
      |   id   treated   _distance |
      |---|
  35. |  813         0   .00016189 |
  36. | 1922         0   .00004054 |
  37. | 1148         0   .00004054 |
  38. | 1617         0           0 |
  39. |  328         1           0 |
      +---+
  cap drop _*
  ultimatch score if period == 0, treated(treated) between
  sort period score treated
  list id treated _distance if _match == 1
      +--+
      |   id   treated   _dista~e |
      |--|
  38. | 1617         0          0 |
  39. |  328         1          0 |
  40. |  288         0   .0017835 |
  41. |  961         0   .0017835 |
      +--+
  cap drop _*
  ultimatch score if period == 0, treated(treated) between limit(age 1 weight 10)
  list id treated _distance if _match == 1
      +---+
      |   id   treated   _distance |
      |---|
 632. | 1617         0           0 |
 652. |  226         0   .03584137 |
 657. |  112         0   .03584137 |
1049. |  328         1           0 |
      +---+
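Using the single and greedy options
By default, ultimatch treats observations at the same distance as a single observation. The single option instead draws one of the nearest neighbors at random as the counterfactual, and the greedy option requests sampling without replacement.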
  cap drop _*
  ultimatch score if period == 0, treated(treated) report(gender age weight) single
  sort period score treated
  list id treated _match _distance if _match <= 3
      +-+
      |   id   treated   _match   _dista~e |
      |-|
  38. | 1617         0        1          0 |
  39. |  328         1        1          0 |
  42. | 1643         0        2          0 |
  45. | 1041         1        2          0 |
 121. |  905         0        3          0 |
      |-|
 122. |  646         1        3          0 |
      +-+
  cap drop _*
  ultimatch score if period == 0, treated(treated) report(gender age weight) single greedy
  sort period score treated
  list id treated _match _distance if _match <= 3
      +-+
      |   id   treated   _match   _dista~e |
      |-|
  38. | 1617         0        1          0 |
  39. |  328         1        1          0 |
  42. |  432         0        2          0 |
  45. | 1041         1        2          0 |
 121. | 1375         0        3          0 |
      |-|
 122. |  646         1        3          0 |
      +-+
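Using the copy and support options
ultimatch also provides a number of additional options. For example, the copy option and its sub-option full create copies of matched observations, flagged by the variable _copy. With copy alone, each treated observation still has a single copy after matching (_weight equals 1), while each control observation gets one copy per treated observation it is matched to, each with its own _match and _weight. With copy and full together, each treated observation also gets one copy per matched control observation, each with its own _match and _weight. As another example, the support option creates the variable _support, which equals 1 when the observation lies in the common support.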
  cap drop if _copy == 1
  cap drop _*
  ultimatch score if period == 0, treated(treated) report(gender age weight) copy
  tab _copy
  tab _weight if treated == 1
     copied |
observation |      Freq.     Percent        Cum.
--+
          1 |        212      100.00      100.00
--+
      Total |        212      100.00

. tab _weight if treated == 1

    pweight |      Freq.     Percent        Cum.
--+
          1 |        387      100.00      100.00
--+
      Total |        387      100.00
  cap drop if _copy == 1
  cap drop _*
  ultimatch score if period == 0, treated(treated) report(gender age weight) copy full
  tab _copy
  tab _weight if treated == 1	
     copied |
observation |      Freq.     Percent        Cum.
--+
          1 |        485      100.00      100.00
--+
      Total |        485      100.00

. tab _weight if treated == 1     

    pweight |      Freq.     Percent        Cum.
--+
   .1666667 |          6        0.91        0.91
         .2 |         30        4.55        5.45
        .25 |         72       10.91       16.36
   .3333333 |        123       18.64       35.00
         .5 |        216       32.73       67.73
          1 |        213       32.27      100.00
--+
      Total |        660      100.00
  cap drop if _copy == 1
  cap drop _*
  ultimatch score if period == 0, treated(treated) report(gender age weight) support
  tab _support
     common |
    support |      Freq.     Percent        Cum.
--+
          0 |         41        2.05        2.05
          1 |      1,959       97.95      100.00
--+
      Total |      2,000      100.00
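Using the exact(varlist) and exp(string) options
ultimatch also supports extensive customization. The exact(varlist) option performs coarsened exact matching, so the exactly matched variables show no difference after matching. The exp(string) option, together with the t. operator, imposes custom matching requirements. For example, exact(gender) and exp(gender==t.gender) both require matched observations to have the same gender.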
  cap drop _*
  ultimatch score if period == 0, treated(treated) report(gender age weight) exact(gender)
+-
  Matched |     Treated      Control |   StdErr       t  p>|t|       SDM
+-+----
   gender |  .599483204   .599483204 |  .038298   -0.00  1.000   0.00000
      age |  34.8165375   34.8165375 |  .956433   -0.00  1.000   0.00000
   weight |  73.4444444   73.4547804 |  .527746   -0.02  0.984  -0.00153
--
  cap drop _*
  ultimatch score if period == 0, treated(treated) report(gender age weight) exp(gender == t.gender)
----+-
 Matched |     Treated      Control |   StdErr       t  p>|t|       SDM
----+-+----
  gender |  .599483204   .599483204 |  .038298   -0.00  1.000   0.00000
     age |  34.8165375   34.8165375 |  .956433   -0.00  1.000   0.00000
  weight |  73.4444444   73.4547804 |  .527746   -0.02  0.984  -0.00153
-
 
 