Covariate matching methods match units directly on the similarity of their observed characteristics, rather than reducing those characteristics to a single propensity score. Because they preserve the multivariate structure of the covariates, they can achieve better balance on each individual covariate.
Advantages Over Propensity Score Matching (PSM)
No information loss: Matching on the full covariate space preserves all covariate information
Better balance: Can achieve exact or near-exact balance on all covariates
Transparency: Easier to inspect and understand matches
Robustness: Does not depend on a correctly specified propensity score model, because matching operates non-parametrically on the covariates themselves
Mahalanobis Distance Matching
The Mahalanobis distance between units \(i\) and \(j\) is:
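In its standard form, with \(X_i\) the covariate vector of unit \(i\) and \(S\) the sample covariance matrix of the covariates:

\[
d_M(i, j) = \sqrt{(X_i - X_j)^\top S^{-1} (X_i - X_j)}
\]

A minimal base-R sketch of this calculation follows. The covariate values are made up for illustration, and `maha_dist` is a hypothetical helper name, not a function from any package. Base R's `mahalanobis()` returns the squared distance, so its square root serves as a cross-check.

```r
# Toy covariate matrix (made-up values): rows are units, columns are covariates
X <- cbind(age  = c(25, 30, 45, 22, 38),
           educ = c(12, 10, 16,  9, 14))
S <- cov(X)  # sample covariance matrix of the covariates

# Hypothetical helper: Mahalanobis distance between two covariate vectors
maha_dist <- function(xi, xj, S) {
  diff <- xi - xj
  sqrt(as.numeric(t(diff) %*% solve(S) %*% diff))
}

# Distance between units 1 and 2
d12 <- maha_dist(X[1, ], X[2, ], S)

# Cross-check against base R, which returns the squared distance
d12_check <- sqrt(as.numeric(mahalanobis(X[1, ], center = X[2, ], cov = S)))

# Nearest neighbor of unit 1 among the remaining units
d_all   <- sapply(2:nrow(X), function(j) maha_dist(X[1, ], X[j, ], S))
nearest <- which.min(d_all) + 1  # shift by one because unit 1 was excluded
```

Because the differences are scaled by \(S^{-1}\), covariates measured on large scales (such as earnings) do not dominate the distance, and correlated covariates are not double-counted.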
# Check imbalance
imbalance(group = df$treat, data = df, drop = "re78")
Multivariate Imbalance Measure: L1=1.000
Percentage of local common support: LCS=0.0%
Univariate Imbalance Measures:
             statistic   type        L1 min 25%       50%       75%     max
treat        1.0000000 (diff) 1.0000000   1   1     1.000     1.000    1.00
age         -2.2140868 (diff) 0.0000000   1   1     0.000    -6.000   -7.00
educ         0.1105147 (diff) 0.2170730   4   0     0.000     0.000   -2.00
race       224.0708295 (Chi2) 0.6404460  NA  NA        NA        NA      NA
married     -0.3236313 (diff) 0.3236313   0   0    -1.000    -1.000    0.00
nodegree     0.1113715 (diff) 0.1113715   0   0     0.000     0.000    0.00
re74     -3523.6628177 (diff) 0.0000000   0   0 -2547.047 -7985.660 9177.75
re75      -934.4291293 (diff) 0.0000000   0   0 -1086.726 -2064.135 6795.01
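The multivariate L1 statistic reported above compares the treated and control distributions over the cells formed by cross-tabulating the coarsened covariates: \(L_1 = \frac{1}{2} \sum_{\ell} |f_\ell - g_\ell|\), where \(f_\ell\) and \(g_\ell\) are the shares of treated and control units in cell \(\ell\). A value of 1, as in this output, means the two groups share no cells, which agrees with the 0% local common support. The base-R sketch below illustrates the calculation on made-up data with hand-picked bins; `l1_imbalance` is a hypothetical helper, and the `cem` package chooses its own bins, so its numbers will generally differ.

```r
# Toy data (made-up values): 3 treated and 3 control units
treat <- c(1, 1, 1, 0, 0, 0)
age   <- c(22, 24, 41, 23, 45, 50)
educ  <- c(9, 14, 13, 10, 8, 15)

# Coarsen each covariate into hand-picked bins
coarse <- data.frame(
  age_bin  = cut(age,  breaks = c(0, 30, 60)),
  educ_bin = cut(educ, breaks = c(0, 12, 20))
)

# Hypothetical helper: multivariate L1 over the cross-tabulated cells
l1_imbalance <- function(treat, coarse) {
  cell <- interaction(coarse, drop = TRUE)   # one label per occupied cell
  f <- prop.table(table(cell[treat == 1]))   # treated cell shares
  g <- prop.table(table(cell[treat == 0]))   # control cell shares
  0.5 * sum(abs(f - g))
}

# Partial overlap: two of the three cells in each group are shared
l1_partial <- l1_imbalance(treat, coarse)    # 1/3

# Complete separation: all treated units are under 30, all controls over 30
coarse_sep <- data.frame(
  age_bin  = cut(c(22, 24, 28, 41, 45, 50), breaks = c(0, 30, 60)),
  educ_bin = coarse$educ_bin
)
l1_separated <- l1_imbalance(treat, coarse_sep)  # 1
```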
# Estimate treatment effect with weights
fit <- lm(re78 ~ treat + age + educ + race + nodegree + married + re74 + re75,
          data = df, weights = mat$w)
summary(fit)
Call:
lm(formula = re78 ~ treat + age + educ + race + nodegree + married +
re74 + re75, data = df, weights = mat$w)
Weighted Residuals:
Min 1Q Median 3Q Max
-13127 0 0 0 27730
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1803.7914 6692.1386 -0.270 0.7879
treat 744.2106 972.4550 0.765 0.4454
age 31.9503 97.7658 0.327 0.7443
educ 505.1141 439.6080 1.149 0.2526
racehispan 4572.7844 2215.6024 2.064 0.0409 *
racewhite 2300.2659 1941.7918 1.185 0.2382
nodegree 281.2088 1887.4331 0.149 0.8818
married -4922.4051 2541.8938 -1.937 0.0549 .
re74 0.3736 0.5127 0.729 0.4675
re75 1.3726 0.6705 2.047 0.0426 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5829 on 136 degrees of freedom
Multiple R-squared: 0.1188, Adjusted R-squared: 0.06044
F-statistic: 2.036 on 9 and 136 DF, p-value: 0.03975
# CEM is not natively available in Python
# Use rpy2 to call R's cem package
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri

pandas2ri.activate()
ro.r('library(cem, quietly = TRUE)')

# Convert pandas df to R df (`data` is assumed to be a pandas DataFrame)
r_data = pandas2ri.py2rpy(data)

# Run CEM inside R so that `result` exists in the R session
ro.globalenv['r_data'] = r_data
ro.r('result <- cem("treat", data = r_data, drop = "re78")')

# Extract the CEM weights
weights = ro.r('result$w')
* Install cem
* ssc install cem
cem age educ black hispanic married nodegree re74 re75, treatment(treat)

* Check balance
imb age educ black hispanic married nodegree re74 re75, treatment(treat)

* Regression with CEM weights
reg re78 treat age educ black hispanic married nodegree re74 re75 [iw = cem_weights]
Genetic Matching
Concept
Genetic matching optimizes the weight matrix \(W\) in a generalized Mahalanobis distance to maximize covariate balance:
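In the formulation of Diamond and Sekhon (2013), the generalized distance is

\[
d(X_i, X_j) = \sqrt{(X_i - X_j)^\top (S^{-1/2})^\top \, W \, S^{-1/2} (X_i - X_j)}
\]

where \(S^{1/2}\) is the Cholesky factor of the covariate covariance matrix \(S\), so that \(S = S^{1/2} (S^{1/2})^\top\), and \(W\) is a positive definite weight matrix, in practice usually restricted to be diagonal. Setting \(W = I\) recovers the ordinary Mahalanobis distance. The base-R sketch below checks that property on made-up data; `gen_maha_dist` is a hypothetical helper written for illustration and is not how the `Matching` package computes distances internally.

```r
# Toy covariate matrix (made-up values)
X <- cbind(age  = c(25, 30, 45, 22, 38),
           educ = c(12, 10, 16,  9, 14))
S <- cov(X)

# Hypothetical helper: generalized Mahalanobis distance with weight matrix W
gen_maha_dist <- function(xi, xj, S, W) {
  S_inv_half <- solve(t(chol(S)))  # inverse of the lower Cholesky factor of S
  diff <- xi - xj
  M <- t(S_inv_half) %*% W %*% S_inv_half
  sqrt(as.numeric(t(diff) %*% M %*% diff))
}

# With W = I the distance reduces to the ordinary Mahalanobis distance
d_identity <- gen_maha_dist(X[1, ], X[2, ], S, diag(ncol(X)))
d_maha     <- sqrt(as.numeric(mahalanobis(X[1, ], center = X[2, ], cov = S)))

# Upweighting the first whitened coordinate (here, standardized age)
# increases the distance between units that differ in age
d_age_heavy <- gen_maha_dist(X[1, ], X[2, ], S, diag(c(10, 1)))
```

The genetic algorithm searches over the diagonal of \(W\), re-matching at each candidate and keeping the weights that yield the best balance statistics.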
library(Matching, quietly = TRUE)

# Data: the Matching package's own lalonde (445 observations);
# its variable names (black, hisp, nodegr, u74, u75) differ from df above
data(lalonde)
attach(lalonde)

# The covariates we want to match on
X <- cbind(age, educ, black, hisp, married, nodegr, u74, u75, re75, re74)

# The covariates we want to obtain balance on
BalanceMat <- cbind(age, educ, black, hisp, married, nodegr, u74, u75, re75, re74,
                    I(re74 * re75))

# Genetic matching to find optimal weights
genout <- GenMatch(Tr = treat, X = X, BalanceMatrix = BalanceMat,
                   estimand = "ATE", M = 1,
                   pop.size = 16, max.generations = 10, wait.generations = 1,
                   print.level = 0)
Loading required namespace: rgenoud
# Match using optimized weights
mout <- Match(Y = re78, Tr = treat, X = X, estimand = "ATE", Weight.matrix = genout)

# ATE results
summary(mout)
Estimate... 2079
AI SE...... 810.25
T-stat..... 2.5659
p.val...... 0.010291
Original number of observations.............. 445
Original number of treated obs............... 185
Matched number of observations............... 445
Matched number of observations (unweighted). 597
# Assess balance
mb <- MatchBalance(
  treat ~ age + educ + black + hisp + married + nodegr + u74 + u75 + re75 + re74,
  match.out = mout,
  nboots = 500
)
***** (V1) age *****
Before Matching After Matching
mean treatment........ 25.816 25.119
mean control.......... 25.054 24.994
std mean diff......... 10.655 1.8544
mean raw eQQ diff..... 0.94054 0.35511
med raw eQQ diff..... 1 0
max raw eQQ diff..... 7 8
mean eCDF diff........ 0.025364 0.0097054
med eCDF diff........ 0.022193 0.0083752
max eCDF diff........ 0.065177 0.025126
var ratio (Tr/Co)..... 1.0278 0.96842
T-test p-value........ 0.26594 0.52651
KS Bootstrap p-value.. 0.512 0.906
KS Naive p-value...... 0.7481 0.99172
KS Statistic.......... 0.065177 0.025126
***** (V2) educ *****
Before Matching After Matching
mean treatment........ 10.346 10.2
mean control.......... 10.088 10.225
std mean diff......... 12.806 -1.4276
mean raw eQQ diff..... 0.40541 0.10553
med raw eQQ diff..... 0 0
max raw eQQ diff..... 2 2
mean eCDF diff........ 0.028698 0.0075377
med eCDF diff........ 0.012682 0.0033501
max eCDF diff........ 0.12651 0.025126
var ratio (Tr/Co)..... 1.5513 1.0569
T-test p-value........ 0.15017 0.49442
KS Bootstrap p-value.. 0.014 0.712
KS Naive p-value...... 0.062873 0.99172
KS Statistic.......... 0.12651 0.025126
***** (V3) black *****
Before Matching After Matching
mean treatment........ 0.84324 0.8382
mean control.......... 0.82692 0.84045
std mean diff......... 4.4767 -0.60952
mean raw eQQ diff..... 0.016216 0.001675
med raw eQQ diff..... 0 0
max raw eQQ diff..... 1 1
mean eCDF diff........ 0.0081601 0.00083752
med eCDF diff........ 0.0081601 0.00083752
max eCDF diff........ 0.01632 0.001675
var ratio (Tr/Co)..... 0.92503 1.0114
T-test p-value........ 0.64736 0.65487
***** (V4) hisp *****
Before Matching After Matching
mean treatment........ 0.059459 0.08764
mean control.......... 0.10769 0.08764
std mean diff......... -20.341 0
mean raw eQQ diff..... 0.048649 0
med raw eQQ diff..... 0 0
max raw eQQ diff..... 1 0
mean eCDF diff........ 0.024116 0
med eCDF diff........ 0.024116 0
max eCDF diff........ 0.048233 0
var ratio (Tr/Co)..... 0.58288 1
T-test p-value........ 0.064043 1
***** (V5) married *****
Before Matching After Matching
mean treatment........ 0.18919 0.16854
mean control.......... 0.15385 0.16629
std mean diff......... 8.9995 0.59963
mean raw eQQ diff..... 0.037838 0.001675
med raw eQQ diff..... 0 0
max raw eQQ diff..... 1 1
mean eCDF diff........ 0.017672 0.00083752
med eCDF diff........ 0.017672 0.00083752
max eCDF diff........ 0.035343 0.001675
var ratio (Tr/Co)..... 1.1802 1.0108
T-test p-value........ 0.33425 0.31731
***** (V6) nodegr *****
Before Matching After Matching
mean treatment........ 0.70811 0.78202
mean control.......... 0.83462 0.78202
std mean diff......... -27.751 0
mean raw eQQ diff..... 0.12432 0
med raw eQQ diff..... 0 0
max raw eQQ diff..... 1 0
mean eCDF diff........ 0.063254 0
med eCDF diff........ 0.063254 0
max eCDF diff........ 0.12651 0
var ratio (Tr/Co)..... 1.4998 1
T-test p-value........ 0.0020368 1
***** (V7) u74 *****
Before Matching After Matching
mean treatment........ 0.70811 0.73258
mean control.......... 0.75 0.73034
std mean diff......... -9.1895 0.50714
mean raw eQQ diff..... 0.037838 0.001675
med raw eQQ diff..... 0 0
max raw eQQ diff..... 1 1
mean eCDF diff........ 0.020946 0.00083752
med eCDF diff........ 0.020946 0.00083752
max eCDF diff........ 0.041892 0.001675
var ratio (Tr/Co)..... 1.1041 0.99472
T-test p-value........ 0.33033 0.56385
***** (V8) u75 *****
Before Matching After Matching
mean treatment........ 0.6 0.64494
mean control.......... 0.68462 0.64944
std mean diff......... -17.225 -0.93815
mean raw eQQ diff..... 0.081081 0.0033501
med raw eQQ diff..... 0 0
max raw eQQ diff..... 1 1
mean eCDF diff........ 0.042308 0.001675
med eCDF diff........ 0.042308 0.001675
max eCDF diff........ 0.084615 0.0033501
var ratio (Tr/Co)..... 1.1133 1.0058
T-test p-value........ 0.068031 0.4143
***** (V9) re75 *****
Before Matching After Matching
mean treatment........ 1532.1 1295.4
mean control.......... 1266.9 1328.7
std mean diff......... 8.2363 -1.1277
mean raw eQQ diff..... 367.61 128.59
med raw eQQ diff..... 0 0
max raw eQQ diff..... 2110.2 8195.6
mean eCDF diff........ 0.050834 0.0082012
med eCDF diff........ 0.061954 0.0067002
max eCDF diff........ 0.10748 0.023451
var ratio (Tr/Co)..... 1.0763 1.0046
T-test p-value........ 0.38527 0.69008
KS Bootstrap p-value.. 0.036 0.792
KS Naive p-value...... 0.16449 0.99663
KS Statistic.......... 0.10748 0.023451
***** (V10) re74 *****
Before Matching After Matching
mean treatment........ 2095.6 2002
mean control.......... 2107 2053.7
std mean diff......... -0.23437 -1.0605
mean raw eQQ diff..... 487.98 197.67
med raw eQQ diff..... 0 0
max raw eQQ diff..... 8413 7870.3
mean eCDF diff........ 0.019223 0.0061813
med eCDF diff........ 0.0158 0.0050251
max eCDF diff........ 0.047089 0.023451
var ratio (Tr/Co)..... 0.7381 0.8693
T-test p-value........ 0.98186 0.58258
KS Bootstrap p-value.. 0.568 0.668
KS Naive p-value...... 0.97023 0.99663
KS Statistic.......... 0.047089 0.023451
Before Matching Minimum p.value: 0.0020368
Variable Name(s): nodegr Number(s): 6
After Matching Minimum p.value: 0.31731
Variable Name(s): married Number(s): 5
Entropy Balancing
Concept
Entropy balancing reweights control units so that their covariate moments exactly match those of the treatment group, while keeping the weights as close as possible to uniform base weights (the maximum entropy criterion).
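One way to see how the reweighting works is through the dual problem: the entropy-optimal weights take the exponential form \(w_i \propto \exp(Z_i^\top \lambda)\), where \(Z_i\) is the control unit's covariate vector centered at the treated-group means, and \(\lambda\) is chosen so that the weighted mean of \(Z\) is zero. The base-R sketch below balances first moments only, on simulated control data with made-up treated means. It assumes uniform base weights and normalizes the weights to sum to one; `eb_weights`, `dual`, and `dual_grad` are hypothetical helper names, and dedicated packages handle higher moments and convergence more carefully.

```r
set.seed(1)

# Simulated control-group covariates and made-up treated-group means
n_c <- 200
Xc <- cbind(age  = rnorm(n_c, mean = 30, sd = 8),
            educ = rnorm(n_c, mean = 11, sd = 2))
target <- c(age = 27, educ = 10.5)

# Center at the treated means and rescale for numerical stability;
# balancing Z at zero is equivalent to balancing Xc at `target`
Z <- sweep(Xc, 2, target, "-")
Z <- sweep(Z, 2, apply(Xc, 2, sd), "/")

# Weights implied by lambda: w_i proportional to exp(Z_i' lambda)
eb_weights <- function(lambda) {
  eta <- as.numeric(Z %*% lambda)
  w <- exp(eta - max(eta))  # subtract the max to avoid overflow
  w / sum(w)
}

# Dual objective (log-sum-exp) and its gradient (the weighted mean of Z)
dual <- function(lambda) {
  eta <- as.numeric(Z %*% lambda)
  max(eta) + log(sum(exp(eta - max(eta))))
}
dual_grad <- function(lambda) as.numeric(crossprod(Z, eb_weights(lambda)))

# Minimize the dual; at the optimum the gradient, i.e. the imbalance, is zero
opt <- optim(par = c(0, 0), fn = dual, gr = dual_grad, method = "BFGS",
             control = list(reltol = 1e-12, maxit = 500))
w <- eb_weights(opt$par)

# Reweighted control means, which should reproduce `target`
weighted_means <- colSums(w * Xc)
```

Because the weights are an exponential tilt of uniform weights, they stay strictly positive, and no control unit is discarded outright.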