One reason psmatch2 and teffects psmatch produce different estimates

2021-03-02 小熊的胡說八道

I will not go into the theory behind PSM here. This post only discusses one reason why teffects psmatch and psmatch2 can give different estimates in Stata.

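Stata 13 and later ship with the teffects suite, which can also perform PSM. The user-written psmatch2 command is still the more common choice for PSM, but its standard errors are problematic: every time it reports results it prints Note: S.E. does not take into account that the propensity score is estimated. The standard errors from teffects psmatch, by contrast, can be trusted.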

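The article Propensity Score Matching in Stata using teffects (link: https://www.ssc.wisc.edu/sscc/pubs/stata_psmatch.htm) explains psmatch2 and teffects psmatch in detail. It also notes that, once the options are aligned, the two commands should return the same coefficient: psmatch2 t x1 x2, out(y) logit ate and teffects psmatch (y) (t x1 x2), atet should yield the same ATT. Yet some researchers who run psmatch2 and teffects psmatch on the same data find that the estimates differ, sometimes to the point of opposite conclusions.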

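The reason lies in how psmatch2 handles nearest-neighbor matching. When several control units are equally close to a treated unit, psmatch2 without the ties option matches the treated unit to whichever of those controls it encounters first. The sort order of the data therefore affects the matched sample, and through it the estimate. With the ties option, psmatch2 uses the average outcome of all equally close controls as the match for the treated unit, which is what teffects psmatch does.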
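The two tie-handling rules described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the actual code of psmatch2 or teffects psmatch: the function name `att_nearest_neighbor`, the toy data, and the use of exact floating-point equality to detect ties are all assumptions made for the example.

```python
from statistics import mean

def att_nearest_neighbor(units, use_ties):
    """ATT from 1-nearest-neighbor matching on the propensity score.

    units: list of (pscore, treated, outcome) tuples in data order.
    use_ties=False keeps the first tied control in data order;
    use_ties=True averages the outcomes of all tied controls.
    """
    controls = [(p, y) for p, t, y in units if t == 0]
    effects = []
    for p, t, y in units:
        if t != 1:
            continue
        best = min(abs(p - pc) for pc, _ in controls)
        tied = [yc for pc, yc in controls if abs(p - pc) == best]
        counterfactual = mean(tied) if use_ties else tied[0]
        effects.append(y - counterfactual)
    return mean(effects)

# Hypothetical toy data: the first treated unit has two controls
# with exactly the same propensity score (a tie).
data = [
    (0.30, 1, 10.0),
    (0.30, 0, 4.0),
    (0.30, 0, 8.0),
    (0.60, 1, 12.0),
    (0.55, 0, 9.0),
]
# Same observations, with the two tied controls swapped.
reordered = [data[0], data[2], data[1], data[3], data[4]]

first_a = att_nearest_neighbor(data, use_ties=False)
first_b = att_nearest_neighbor(reordered, use_ties=False)
ties_a = att_nearest_neighbor(data, use_ties=True)
ties_b = att_nearest_neighbor(reordered, use_ties=True)
print(first_a, first_b, ties_a, ties_b)  # 4.5 2.5 3.5 3.5
```

Under the first-match rule the estimate moves from 4.5 to 2.5 when only the row order changes, while the tie-averaging rule gives 3.5 in both cases.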

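So, when the two commands give different results, for example when the ATT from psmatch2 treat xlist, out(y) logit ate differs substantially from the ATET from teffects psmatch (y) (treat xlist), atet, check whether your psmatch2 call includes the ties option. This is one possible cause of the difference.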
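Whether the ties option matters at all depends on whether the data contain tied propensity scores. A minimal Python sketch of such a check follows. The helper name `max_tied_matches` and the example scores are hypothetical, and ties are detected by exact equality of distances, which is a simplification.

```python
def max_tied_matches(pscores, treated):
    """Largest number of controls tied at the minimum distance
    for any treated unit (1 means no treated unit has a tie)."""
    controls = [p for p, t in zip(pscores, treated) if t == 0]
    worst = 0
    for p, t in zip(pscores, treated):
        if t == 1:
            dists = [abs(p - c) for c in controls]
            worst = max(worst, dists.count(min(dists)))
    return worst

# Hypothetical scores: two controls share the first treated unit's score.
with_ties = max_tied_matches([0.2, 0.2, 0.2, 0.5, 0.7], [1, 0, 0, 0, 1])
# Hypothetical scores with no ties at all.
without_ties = max_tied_matches([0.2, 0.25, 0.5, 0.7], [1, 0, 0, 1])
print(with_ties, without_ties)  # 2 1
```

A result of 1 means the first-match and tie-averaging rules pick the same controls, so the two approaches should agree on the point estimate.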

clear all

frames create data
frames change data

webuse cattaneo2

frames copy data frames1
frames change frames1

sum bweight mbsmoke mmarried c.mage##c.mage fbaby medu

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     bweight |      4,642     3361.68    578.8196        340       5500
     mbsmoke |      4,642    .1861267    .3892508          0          1
    mmarried |      4,642    .6996984    .4584385          0          1
        mage |      4,642    26.50452    5.619026         13         45
             |
      c.mage#|
      c.mage |      4,642    734.0564    305.2242        169       2025
-------------+---------------------------------------------------------
             |
       fbaby |      4,642    .4379578    .4961893          0          1
        medu |      4,642    12.68957    2.520661          0         17

1. Logit
logit mbsmoke mmarried c.mage##c.mage fbaby medu

Iteration 0:   log likelihood = -2230.7484  
Iteration 1: log likelihood = -2053.769
Iteration 2: log likelihood = -2043.2897
Iteration 3: log likelihood = -2043.2504
Iteration 4: log likelihood = -2043.2504

Logistic regression                             Number of obs     =      4,642
                                                LR chi2(5)        =     375.00
                                                Prob > chi2       =     0.0000
Log likelihood = -2043.2504                     Pseudo R2         =     0.0841

-------------------------------------------------------------------------------
      mbsmoke |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
     mmarried |  -1.145706   .0918962   -12.47   0.000     -1.32582    -.965593
         mage |    .321518   .0638472     5.04   0.000     .1963798    .4466563
              |
c.mage#c.mage |  -.0060368   .0011849    -5.09   0.000    -.0083592   -.0037144
              |
        fbaby |  -.3864258   .0880445    -4.39   0.000    -.5589898   -.2138618
         medu |  -.1420833   .0173215    -8.20   0.000    -.1760328   -.1081338
        _cons |  -2.950915   .8102504    -3.64   0.000    -4.538976   -1.362853
-------------------------------------------------------------------------------

predict pr_

(option pr assumed; Pr(mbsmoke))

2. psmatch2

2.1 using pscore()
psmatch2 mbsmoke, out(bweight) pscore(pr_) ate logit

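----------------------------------------------------------------------------------------
        Variable     Sample |    Treated     Controls   Difference         S.E.   T-stat
----------------------------+-----------------------------------------------------------
         bweight  Unmatched | 3137.65972   3412.91159  -275.251871   21.4528037   -12.83
                        ATT | 3137.65972   3334.84259   -197.18287   55.6185293    -3.55
                        ATU | 3412.91159   3164.00185  -248.909741            .        .
                        ATE |                          -239.281991            .        .
----------------------------+-----------------------------------------------------------
Note: S.E. does not take into account that the propensity score is estimated.

           | psmatch2:
 psmatch2: |   Common
 Treatment |  support
assignment | On suppor |     Total
-----------+-----------+----------
 Untreated |     3,778 |     3,778
   Treated |       864 |       864
-----------+-----------+----------
     Total |     4,642 |     4,642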

2.2 general
psmatch2 mbsmoke mmarried c.mage##c.mage fbaby medu, out(bweight) ate logit n(1)

Logistic regression                             Number of obs     =      4,642
                                                LR chi2(5)        =     375.00
                                                Prob > chi2       =     0.0000
Log likelihood = -2043.2504                     Pseudo R2         =     0.0841

-------------------------------------------------------------------------------
      mbsmoke |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
     mmarried |  -1.145706   .0918962   -12.47   0.000     -1.32582    -.965593
         mage |    .321518   .0638472     5.04   0.000     .1963798    .4466563
              |
c.mage#c.mage |  -.0060368   .0011849    -5.09   0.000    -.0083592   -.0037144
              |
        fbaby |  -.3864258   .0880445    -4.39   0.000    -.5589898   -.2138618
         medu |  -.1420833   .0173215    -8.20   0.000    -.1760328   -.1081338
        _cons |  -2.950915   .8102504    -3.64   0.000    -4.538976   -1.362853
-------------------------------------------------------------------------------
----------------------------------------------------------------------------------------
        Variable     Sample |    Treated     Controls   Difference         S.E.   T-stat
----------------------------+-----------------------------------------------------------
         bweight  Unmatched | 3137.65972   3412.91159  -275.251871   21.4528037   -12.83
                        ATT | 3137.65972   3334.84259   -197.18287   55.6185293    -3.55
                        ATU | 3412.91159   3164.00185  -248.909741            .        .
                        ATE |                          -239.281991            .        .
----------------------------+-----------------------------------------------------------
Note: S.E. does not take into account that the propensity score is estimated.

           | psmatch2:
 psmatch2: |   Common
 Treatment |  support
assignment | On suppor |     Total
-----------+-----------+----------
 Untreated |     3,778 |     3,778
   Treated |       864 |       864
-----------+-----------+----------
     Total |     4,642 |     4,642

3. teffects psmatch
teffects psmatch (bweight) (mbsmoke mmarried c.mage##c.mage fbaby medu),atet nn(1)

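Treatment-effects estimation                   Number of obs      =      4,642
Estimator      : propensity-score matching     Matches: requested =          1
Outcome model  : matching                                     min =          1
Treatment model: logit                                        max =         74
----------------------------------------------------------------------------------------
                       |              AI Robust
               bweight |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------------+----------------------------------------------------------------
ATET                   |
               mbsmoke |
(smoker vs nonsmoker)  |  -236.7848   26.57789    -8.91   0.000    -288.8765    -184.693
----------------------------------------------------------------------------------------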

Do the two commands give different estimates?

4. Are the results always different?
frames create frame2
frames change frame2

use http://ssc.wisc.edu/sscc/pubs/files/psm

sum

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          x1 |      1,000    -.012963    1.000053    -3.6593   3.084742
          x2 |      1,000   -.0246025    1.034555  -3.363018   3.399474
           t |      1,000        .333    .4715224          0          1
           y |      1,000    .3474242    1.957462  -5.494524   6.873514

psmatch2 t x1 x2, out(y) logit ate

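Logistic regression                             Number of obs     =      1,000
                                                LR chi2(2)        =     222.78
                                                Prob > chi2       =     0.0000
Log likelihood = -524.89072                     Pseudo R2         =     0.1751

------------------------------------------------------------------------------
           t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .9068298   .0885341    10.24   0.000     .7333062    1.080353
          x2 |   .8100408   .0816962     9.92   0.000     .6499192    .9701624
       _cons |  -.8528442   .0788823   -10.81   0.000    -1.007451   -.6982378
------------------------------------------------------------------------------
----------------------------------------------------------------------------------------
        Variable     Sample |    Treated     Controls   Difference         S.E.   T-stat
----------------------------+-----------------------------------------------------------
               y  Unmatched |  1.8910736  -.423243358   2.31431696   .109094342    21.21
                        ATT |  1.8910736   .930722886   .960350715   .168252917     5.71
                        ATU |-.423243358   .625587554   1.04883091            .        .
                        ATE |                           1.01936701            .        .
----------------------------+-----------------------------------------------------------
Note: S.E. does not take into account that the propensity score is estimated.

           | psmatch2:
 psmatch2: |   Common
 Treatment |  support
assignment | On suppor |     Total
-----------+-----------+----------
 Untreated |       667 |       667
   Treated |       333 |       333
-----------+-----------+----------
     Total |     1,000 |     1,000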

teffects psmatch (y) (t x1 x2), atet

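Treatment-effects estimation                   Number of obs      =      1,000
Estimator      : propensity-score matching     Matches: requested =          1
Outcome model  : matching                                     min =          1
Treatment model: logit                                        max =          1
------------------------------------------------------------------------------
             |              AI Robust
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ATET         |
           t |
   (1 vs 0)  |   .9603507   .1204748     7.97   0.000     .7242245    1.196477
------------------------------------------------------------------------------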

Now the two commands give the same estimate?

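5. The difference and its potential cause

Why do the two commands sometimes agree and sometimes disagree?

Possible cause of the difference:

psmatch2 matches the first control it encounters, even when other controls lie at the same distance;

Conjecture:

whereas teffects psmatch uses every control at the nearest distance. That is, the matching is nominally 1:1, but one observation can be equally distant from several others. This fits the output above: the teffects psmatch header reports max = 74 matches for the cattaneo2 data but max = 1 for the psm data.

6. Verification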
frame copy data simu,replace
frame change simu

(note: frame simu not found)


6.1 Obtain the propensity scores and sort
logit mbsmoke mmarried c.mage##c.mage fbaby medu

Iteration 0:   log likelihood = -2230.7484  
Iteration 1: log likelihood = -2053.769
Iteration 2: log likelihood = -2043.2897
Iteration 3: log likelihood = -2043.2504
Iteration 4: log likelihood = -2043.2504

Logistic regression                             Number of obs     =      4,642
                                                LR chi2(5)        =     375.00
                                                Prob > chi2       =     0.0000
Log likelihood = -2043.2504                     Pseudo R2         =     0.0841

-------------------------------------------------------------------------------
      mbsmoke |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
     mmarried |  -1.145706   .0918962   -12.47   0.000     -1.32582    -.965593
         mage |    .321518   .0638472     5.04   0.000     .1963798    .4466563
              |
c.mage#c.mage |  -.0060368   .0011849    -5.09   0.000    -.0083592   -.0037144
              |
        fbaby |  -.3864258   .0880445    -4.39   0.000    -.5589898   -.2138618
         medu |  -.1420833   .0173215    -8.20   0.000    -.1760328   -.1081338
        _cons |  -2.950915   .8102504    -3.64   0.000    -4.538976   -1.362853
-------------------------------------------------------------------------------

cap drop pr_
predict pr_

(option pr assumed; Pr(mbsmoke))

sort pr_ mbsmoke
gen id = _n

6.2 ATT computed by the two commands
teffects psmatch (bweight) (mbsmoke mmarried c.mage##c.mage fbaby medu),nn(1) atet
//teffects psmatch (bweight) (mbsmoke mmarried c.mage##c.mage fbaby medu),nn(1) ate

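Treatment-effects estimation                   Number of obs      =      4,642
Estimator      : propensity-score matching     Matches: requested =          1
Outcome model  : matching                                     min =          1
Treatment model: logit                                        max =         74
----------------------------------------------------------------------------------------
                       |              AI Robust
               bweight |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------------+----------------------------------------------------------------
ATET                   |
               mbsmoke |
(smoker vs nonsmoker)  |  -236.7848   26.57789    -8.91   0.000    -288.8765    -184.693
----------------------------------------------------------------------------------------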

psmatch2 mbsmoke mmarried c.mage##c.mage fbaby medu, out(bweight) ate logit n(1) //ATT = -248.515046

Logistic regression                             Number of obs     =      4,642
                                                LR chi2(5)        =     375.00
                                                Prob > chi2       =     0.0000
Log likelihood = -2043.2504                     Pseudo R2         =     0.0841

-------------------------------------------------------------------------------
      mbsmoke |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
     mmarried |  -1.145706   .0918962   -12.47   0.000     -1.32582    -.965593
         mage |    .321518   .0638472     5.04   0.000     .1963798    .4466563
              |
c.mage#c.mage |  -.0060368   .0011849    -5.09   0.000    -.0083592   -.0037144
              |
        fbaby |  -.3864258   .0880445    -4.39   0.000    -.5589898   -.2138618
         medu |  -.1420833   .0173215    -8.20   0.000    -.1760328   -.1081338
        _cons |  -2.950915   .8102504    -3.64   0.000    -4.538976   -1.362853
-------------------------------------------------------------------------------
----------------------------------------------------------------------------------------
        Variable     Sample |    Treated     Controls   Difference         S.E.   T-stat
----------------------------+-----------------------------------------------------------
         bweight  Unmatched | 3137.65972   3412.91159  -275.251871   21.4528037   -12.83
                        ATT | 3137.65972   3386.17477  -248.515046   54.7442913    -4.54
                        ATU | 3412.91159   3166.47327  -246.438327            .        .
                        ATE |                           -246.82486            .        .
----------------------------+-----------------------------------------------------------
Note: S.E. does not take into account that the propensity score is estimated.

           | psmatch2:
 psmatch2: |   Common
 Treatment |  support
assignment | On suppor |     Total
-----------+-----------+----------
 Untreated |     3,778 |     3,778
   Treated |       864 |       864
-----------+-----------+----------
     Total |     4,642 |     4,642

6.3 Retrieve the matching information from teffects psmatch
teffects psmatch (bweight) (mbsmoke mmarried c.mage##c.mage fbaby medu),nn(1) atet gen(match)
// -236.78475

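Treatment-effects estimation                   Number of obs      =      4,642
Estimator      : propensity-score matching     Matches: requested =          1
Outcome model  : matching                                     min =          1
Treatment model: logit                                        max =         74
----------------------------------------------------------------------------------------
                       |              AI Robust
               bweight |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------------+----------------------------------------------------------------
ATET                   |
               mbsmoke |
(smoker vs nonsmoker)  |  -236.7848   26.57789    -8.91   0.000    -288.8765    -184.693
----------------------------------------------------------------------------------------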

6.4 Keep the treated group and merge in the matched controls' data

6.4.1 Keep the treated observations
frame copy simu simu2
frame change simu2

keep id bweight mbsmoke match*
keep if mbsmoke == 1

(3,778 observations deleted)

6.4.2 Build the match lookup table
reshape long match,i(id) j(match_id)
keep if match != .

(note: j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 
> 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74)

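Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      864   ->   63936
Number of variables                  77   ->       5
j variable (74 values)                    ->   match_id
xij variables:
                match1 match2 ... match74   ->   match
-----------------------------------------------------------------------------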

6.4.3 Link the frames
frlink m:1 match,frame(simu id) gen(simu_21)

  (all observations in frame simu2 matched)

6.4.4 Fetch the outcome variable of the matched controls
frget bweight ,from(simu_21) pre(_0)

  (1 variable copied from linked frame)

6.5 Compute the treatment effect for each individual match
gen att = bweight - _0bweight

sum att
return list

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         att |     15,721   -238.8935    796.7148      -4082       3884



scalars:
r(N) = 15721
r(sum_w) = 15721
r(mean) = -238.8935182240315
r(Var) = 634754.4716125642
r(sd) = 796.7147994185649
r(min) = -4082
r(max) = 3884
r(sum) = -3755645

6.6 Compute the weighted sample mean
bysort id:gen num = _N
sum att [aweight = 1/num]
return list

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
         att |  15,721         864   -236.7848   808.3941      -4082       3884



scalars:
r(N) = 15721
r(sum_w) = 864
r(mean) = -236.7847508257834
r(Var) = 653501.0708042918
r(sd) = 808.394130857153
r(min) = -4082
r(max) = 3884
r(sum) = -204582.0247134768
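Weighting each treated-control pair by 1/num makes every treated unit count once, however many tied controls it has. A small Python sketch of that arithmetic follows, under the assumption that the long table holds one row per pair. The `pairs` data and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (treated_id, pair_effect) rows, one per treated-control pair.
pairs = [(1, 6.0), (1, 2.0), (2, 3.0), (3, -1.0), (3, 1.0), (3, 3.0)]

def weighted_att(pairs):
    """Mean of pair effects with weight 1/num, num = pairs per treated id."""
    counts = defaultdict(int)
    for tid, _ in pairs:
        counts[tid] += 1
    weights = [1.0 / counts[tid] for tid, _ in pairs]
    total = sum(w * e for w, (_, e) in zip(weights, pairs))
    return total / sum(weights)

def per_unit_att(pairs):
    """Average within each treated id first, then across treated ids."""
    by_id = defaultdict(list)
    for tid, e in pairs:
        by_id[tid].append(e)
    return mean(mean(v) for v in by_id.values())

naive = mean(e for _, e in pairs)   # every pair counts equally
weighted = weighted_att(pairs)      # every treated unit counts equally
per_unit = per_unit_att(pairs)
print(naive, weighted, per_unit)
```

The weighted mean equals the average of per-unit means, which is the tie-averaging rule. The unweighted mean over-represents treated units that have many tied controls, which is why 6.5 and 6.6 report different values.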

frames simu : teffects psmatch (bweight) (mbsmoke mmarried c.mage##c.mage fbaby medu),nn(1) atet

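Treatment-effects estimation                   Number of obs      =      4,642
Estimator      : propensity-score matching     Matches: requested =          1
Outcome model  : matching                                     min =          1
Treatment model: logit                                        max =         74
----------------------------------------------------------------------------------------
                       |              AI Robust
               bweight |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------------+----------------------------------------------------------------
ATET                   |
               mbsmoke |
(smoker vs nonsmoker)  |  -236.7848   26.57789    -8.91   0.000    -288.8765    -184.693
----------------------------------------------------------------------------------------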

frames simu : psmatch2 mbsmoke mmarried c.mage##c.mage fbaby medu, out(bweight) ate logit n(1)  ties

Logistic regression                             Number of obs     =      4,642
                                                LR chi2(5)        =     375.00
                                                Prob > chi2       =     0.0000
Log likelihood = -2043.2504                     Pseudo R2         =     0.0841

-------------------------------------------------------------------------------
      mbsmoke |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
     mmarried |  -1.145706   .0918962   -12.47   0.000     -1.32582    -.965593
         mage |    .321518   .0638472     5.04   0.000     .1963798    .4466563
              |
c.mage#c.mage |  -.0060368   .0011849    -5.09   0.000    -.0083592   -.0037144
              |
        fbaby |  -.3864258   .0880445    -4.39   0.000    -.5589898   -.2138618
         medu |  -.1420833   .0173215    -8.20   0.000    -.1760328   -.1081338
        _cons |  -2.950915   .8102504    -3.64   0.000    -4.538976   -1.362853
-------------------------------------------------------------------------------
----------------------------------------------------------------------------------------
        Variable     Sample |    Treated     Controls   Difference         S.E.   T-stat
----------------------------+-----------------------------------------------------------
         bweight  Unmatched | 3137.65972   3412.91159  -275.251871   21.4528037   -12.83
                        ATT | 3137.65972   3374.44447  -236.784751   26.0535546    -9.09
                        ATU | 3412.91159   3207.84728  -205.064318            .        .
                        ATE |                          -210.968337            .        .
----------------------------+-----------------------------------------------------------
Note: S.E. does not take into account that the propensity score is estimated.

           | psmatch2:
 psmatch2: |   Common
 Treatment |  support
assignment | On suppor |     Total
-----------+-----------+----------
 Untreated |     3,778 |     3,778
   Treated |       864 |       864
-----------+-----------+----------
     Total |     4,642 |     4,642

6.7 Compute the mean using only the first match
sum att if match_id == 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         att |        864    -248.515    817.1815      -3403       2750

return list

scalars:
r(sum) = -214717
r(max) = 2750
r(min) = -3403
r(sd) = 817.1815138978078
r(Var) = 667785.6266563131
r(mean) = -248.5150462962963
r(sum_w) = 864
r(N) = 864

frames simu : psmatch2 mbsmoke mmarried c.mage##c.mage fbaby medu, out(bweight) ate logit n(1)

Logistic regression                             Number of obs     =      4,642
                                                LR chi2(5)        =     375.00
                                                Prob > chi2       =     0.0000
Log likelihood = -2043.2504                     Pseudo R2         =     0.0841

-------------------------------------------------------------------------------
      mbsmoke |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
     mmarried |  -1.145706   .0918962   -12.47   0.000     -1.32582    -.965593
         mage |    .321518   .0638472     5.04   0.000     .1963798    .4466563
              |
c.mage#c.mage |  -.0060368   .0011849    -5.09   0.000    -.0083592   -.0037144
              |
        fbaby |  -.3864258   .0880445    -4.39   0.000    -.5589898   -.2138618
         medu |  -.1420833   .0173215    -8.20   0.000    -.1760328   -.1081338
        _cons |  -2.950915   .8102504    -3.64   0.000    -4.538976   -1.362853
-------------------------------------------------------------------------------
----------------------------------------------------------------------------------------
        Variable     Sample |    Treated     Controls   Difference         S.E.   T-stat
----------------------------+-----------------------------------------------------------
         bweight  Unmatched | 3137.65972   3412.91159  -275.251871   21.4528037   -12.83
                        ATT | 3137.65972   3386.17477  -248.515046   54.7442913    -4.54
                        ATU | 3412.91159   3166.47327  -246.438327            .        .
                        ATE |                           -246.82486            .        .
----------------------------+-----------------------------------------------------------
Note: S.E. does not take into account that the propensity score is estimated.

           | psmatch2:
 psmatch2: |   Common
 Treatment |  support
assignment | On suppor |     Total
-----------+-----------+----------
 Untreated |     3,778 |     3,778
   Treated |       864 |       864
-----------+-----------+----------
     Total |     4,642 |     4,642
