Matching Methods

Covariate Matching

Introduction

Covariate matching methods directly match units based on the similarity of their observed characteristics, rather than reducing them to a single propensity score. These methods preserve the multivariate structure of covariates and can achieve better balance.

Advantages Over Propensity Score Matching (PSM)

  1. No information loss: Matching on full covariate space preserves all information
  2. Better balance: Can achieve exact or near-exact balance on all covariates
  3. Transparency: Easier to inspect and understand matches
  4. Robustness: Not sensitive to propensity score model specification (e.g., non-parametric)

Mahalanobis Distance Matching

The Mahalanobis distance between units \(i\) and \(j\) is:

\[ D_M(X_i, X_j) = \sqrt{(X_i - X_j)^\top S^{-1} (X_i - X_j)} \]

where \(S\) is the sample covariance matrix of covariates.

Why Not Euclidean Distance?

Euclidean distance treats all variables equally:

\[ D_E(X_i, X_j) = \sqrt{(X_i - X_j)^\top (X_i - X_j)} \]

Problems:

  • Variables with larger scales dominate the distance
  • Ignores correlation structure among covariates

Mahalanobis distance standardizes by \(S^{-1}\), making it:

  • Scale-invariant: Variables measured in different units are comparable
  • Correlation-adjusted: Accounts for dependencies among covariates

Properties

  1. Invariant under affine transformations: Distance unchanged if covariates are linearly transformed
  2. Multivariate: Operates in full covariate space
  3. Non-parametric: No model for treatment assignment required

Limitations

  • Curse of dimensionality: Performance degrades with many covariates
  • Sensitive to outliers: Extreme values can distort \(S\)
  • Assumes multivariate normality (implicitly)

Estimation

Again, we use the lalonde data from the MatchIt R package as described in Propensity Score Matching.

library(MatchIt, quietly = TRUE)
library(Matching, quietly = TRUE)

# Download the lalonde data 
df <- MatchIt::lalonde

# Mahalanobis distance matching
m.out <- matchit(
    treat ~ age + educ + race + married + nodegree + re74 + re75,
    data = df,
    method = "nearest",         # Nearest neighbor
    distance = "mahalanobis",   # Mahalanobis matching
    ratio = 1,                  # 1:1 matching
    replace = FALSE
)

# Example with exact matching
m.out2 <- matchit(
    treat ~ age + educ + race + nodegree + married + re74 + re75,
    data = df,
    distance = "mahalanobis",
    replace = FALSE,
    exact = ~ married + race)
Warning: Fewer control units than treated units in some `exact` strata; not all
treated units will get a match.
# Check Balance
summary(m.out2, un = TRUE)

Call:
matchit(formula = treat ~ age + educ + race + nodegree + married + 
    re74 + re75, data = df, distance = "mahalanobis", exact = ~married + 
    race, replace = FALSE)

Summary of Balance for All Data:
           Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
age              25.8162       28.0303         -0.3094     0.4400    0.0813
educ             10.3459       10.2354          0.0550     0.4959    0.0347
raceblack         0.8432        0.2028          1.7615          .    0.6404
racehispan        0.0595        0.1422         -0.3498          .    0.0827
racewhite         0.0973        0.6550         -1.8819          .    0.5577
nodegree          0.7081        0.5967          0.2450          .    0.1114
married           0.1892        0.5128         -0.8263          .    0.3236
re74           2095.5737     5619.2365         -0.7211     0.5181    0.2248
re75           1532.0553     2466.4844         -0.2903     0.9563    0.1342
           eCDF Max
age          0.1577
educ         0.1114
raceblack    0.6404
racehispan   0.0827
racewhite    0.5577
nodegree     0.1114
married      0.3236
re74         0.4470
re75         0.2876

Summary of Balance for Matched Data:
           Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
age              26.2414       25.4655          0.1084     0.5382    0.0703
educ             10.3276       10.2414          0.0429     0.7281    0.0145
raceblack         0.7500        0.7500          0.0000          .    0.0000
racehispan        0.0948        0.0948          0.0000          .    0.0000
racewhite         0.1552        0.1552          0.0000          .    0.0000
nodegree          0.6810        0.6379          0.0948          .    0.0431
married           0.2672        0.2672          0.0000          .    0.0000
re74            861.0568     2839.4506         -0.4049     0.3698    0.1534
re75            761.7040     1722.7378         -0.2985     0.4547    0.1284
           eCDF Max Std. Pair Dist.
age          0.2759          0.7398
educ         0.0517          0.4459
raceblack    0.0000          0.0000
racehispan   0.0000          0.0000
racewhite    0.0000          0.0000
nodegree     0.0431          0.2844
married      0.0000          0.0000
re74         0.4569          0.5753
re75         0.3276          0.4827

Sample Sizes:
          Control Treated
All           429     185
Matched       116     116
Unmatched     313      69
Discarded       0       0
# Visualize covariate balance
plot(m.out2, type = "density", which.xs = ~age + educ, interactive = FALSE)

# Extract matched data
matched_data <- match.data(m.out2)

# Estimate treatment effect
library(lmtest, quietly = TRUE)
library(sandwich, quietly = TRUE)

fit <- lm(re78 ~ treat + age + educ + race + married + nodegree + re74 + re75,
        data = matched_data, # Data already has matches
        weights = weights # Re-weights for PSM
        )

# Robust standard errors (no clustering needed with replacement matching)
coeftest(fit, vcov = vcovHC, type = "HC3")

t test of coefficients:

               Estimate  Std. Error t value Pr(>|t|)   
(Intercept) -3.6223e+03  3.5920e+03 -1.0084 0.314338   
treat        8.5002e+02  8.5536e+02  0.9938 0.321422   
age          8.0097e+01  5.7126e+01  1.4021 0.162277   
educ         5.7361e+02  2.1916e+02  2.6173 0.009473 **
racehispan   2.3218e+03  1.7286e+03  1.3432 0.180582   
racewhite    1.9451e+03  1.1677e+03  1.6658 0.097165 . 
married     -6.8797e+02  1.1151e+03 -0.6169 0.537909   
nodegree     5.6023e+02  1.2967e+03  0.4321 0.666119   
re74        -2.6058e-02  1.7309e-01 -0.1505 0.880474   
re75         4.1668e-01  2.5088e-01  1.6609 0.098144 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
import numpy as np
from scipy.spatial.distance import mahalanobis
from sklearn.covariance import EmpiricalCovariance
import pandas as pd

# Load data
data = pd.read_csv('lalonde.csv')

# Compute covariance matrix
X = data[['age', 'educ', 'race', 'married','nodegree', 're74', 're75']].values
cov = EmpiricalCovariance().fit(X)
S_inv = np.linalg.inv(cov.covariance_)

# Calculate distance between two units
dist = mahalanobis(X[i], X[j], S_inv)

# Using causalinference package
from causalinference import CausalModel

Y = data['re78'].values
D = data['treat'].values
X = X = data[['age', 'educ', 'race', 'married','nodegree', 're74', 're75']].values

causal = CausalModel(Y, D, X)
causal.est_via_matching(matches=1, bias_adj=True)
print(causal.estimates)
* Install mahapick if needed
* ssc install mahapick

* Load data 
import delimited "lalonde.csv", clear

mahapick treat age educ black hispanic married nodegree re74 re75, ///
    idvar(id) nummatches(1) scorevar(mahal_dist) matchedvar(matched)

* Keep matched observations
keep if matched == 1

* Estimate treatment effect
reg re78 treat age educ black hispanic married nodegree re74 re75

Coarsened Exact Matching (CEM)

Concept

CEM works by:

  1. Coarsening continuous variables into discrete bins
  2. Exact matching on coarsened variables
  3. Pruning strata with no treatment or control units
  4. Analysis on original (uncoarsened) data

Algorithm

Input: Treatment \(T\), covariates \(X\), outcome \(Y\)

  1. Define coarsening scheme \(C(X)\) (e.g., age → age groups)
  2. Sort units into strata based on \(C(X)\)
  3. Discard strata containing only treated or only control units
  4. Compute stratum weights to create balance
  5. Estimate treatment effect using weighted data

Monotonic Imbalance Bounding

CEM guarantees that finer coarsening cannot increase imbalance.

Let \(L(C(X))\) measure imbalance under coarsening \(C\). If \(C_1 \preceq C_2\) (finer), then:

\[ L(C_1(X)) \leq L(C_2(X)) \]

This provides ex ante control over balance.

Estimation

library(cem, quietly = TRUE)

# Automatic coarsening
mat <- cem(
    treatment = "treat",
    data = df,
    drop = "re78",
    keep.all = TRUE
)

Using 'treat'='1' as baseline group
# Check imbalance
imbalance(group = df$treat, data = df, drop = "re78")

Multivariate Imbalance Measure: L1=1.000
Percentage of local common support: LCS=0.0%

Univariate Imbalance Measures:

             statistic   type        L1 min 25%       50%       75%     max
treat        1.0000000 (diff) 1.0000000   1   1     1.000     1.000    1.00
age         -2.2140868 (diff) 0.0000000   1   1     0.000    -6.000   -7.00
educ         0.1105147 (diff) 0.2170730   4   0     0.000     0.000   -2.00
race       224.0708295 (Chi2) 0.6404460  NA  NA        NA        NA      NA
married     -0.3236313 (diff) 0.3236313   0   0    -1.000    -1.000    0.00
nodegree     0.1113715 (diff) 0.1113715   0   0     0.000     0.000    0.00
re74     -3523.6628177 (diff) 0.0000000   0   0 -2547.047 -7985.660 9177.75
re75      -934.4291293 (diff) 0.0000000   0   0 -1086.726 -2064.135 6795.01
# Estimate treatment effect with weights
fit <- lm(re78 ~ treat + age + educ + race + nodegree + married + re74 + re75, 
    data = df, 
    weights = mat$w)
summary(fit)

Call:
lm(formula = re78 ~ treat + age + educ + race + nodegree + married + 
    re74 + re75, data = df, weights = mat$w)

Weighted Residuals:
   Min     1Q Median     3Q    Max 
-13127      0      0      0  27730 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -1803.7914  6692.1386  -0.270   0.7879  
treat         744.2106   972.4550   0.765   0.4454  
age            31.9503    97.7658   0.327   0.7443  
educ          505.1141   439.6080   1.149   0.2526  
racehispan   4572.7844  2215.6024   2.064   0.0409 *
racewhite    2300.2659  1941.7918   1.185   0.2382  
nodegree      281.2088  1887.4331   0.149   0.8818  
married     -4922.4051  2541.8938  -1.937   0.0549 .
re74            0.3736     0.5127   0.729   0.4675  
re75            1.3726     0.6705   2.047   0.0426 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5829 on 136 degrees of freedom
Multiple R-squared:  0.1188,    Adjusted R-squared:  0.06044 
F-statistic: 2.036 on 9 and 136 DF,  p-value: 0.03975
# CEM not natively available in Python
# Use rpy2 to call R's cem package

import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
pandas2ri.activate()

ro.r('library(cem, quietly = TRUE)')

# Convert pandas df to R df
r_data = pandas2ri.py2rpy(data)

# Run CEM in R
result = ro.r.cem('treat', data=r_data, drop='re78')
weights = ro.r('result$w')
* Install cem
* ssc install cem

cem age educ black hispanic married nodegree re74 re75, treatment(treat)

* Check balance
imbalance age educ black hispanic married nodegree re74 re75, by(treat)

* Regression with CEM weights
reg re78 treat age educ black hispanic married nodegree re74 re75 [iw = cem_weights]

Genetic Matching

Concept

Genetic matching optimizes the weight matrix \(W\) in a generalized Mahalanobis distance to maximize covariate balance:

\[ D_{\text{gen}}(X_i, X_j) = \sqrt{(X_i - X_j)^\top W (X_i - X_j)} \]

where \(W\) is learned using a genetic algorithm. This follows an iterative, evolutionary pattern to determine the covariate matching set.

Genetic Algorithm

  1. Initialize population of weight matrices
  2. Evaluate balance for each weight matrix
  3. Select best-performing matrices (“fittest”)
  4. Crossover and mutate to create new generation
  5. Repeat until convergence

Balance Criteria

Minimize:

\[ B = \sum_{k=1}^p w_k b_k^2 \]

where \(b_k\) is a balance metric (e.g., standardized mean difference, KS statistic) for covariate \(k\).

Estimation

library(Matching, quietly = TRUE)

# Data
data(lalonde) # same as df above
attach(lalonde)

# The covariates we want to match on
X = cbind(age, educ, black, hisp, married, nodegr, u74, u75, re75, re74)
# The covariates we want to obtain balance on
BalanceMat <- cbind(age, educ, black, hisp, married, nodegr, u74, u75, re75, re74, I(re74*re75))

# Genetic matching to find optimal weights
genout <- GenMatch(
    Tr=treat, 
    X=X, 
    BalanceMatrix=BalanceMat, 
    estimand="ATE", 
    M=1,
    pop.size=16, 
    max.generations=10, 
    wait.generations=1, 
    print.level = 0)
Loading required namespace: rgenoud
# Match using optimized weights
mout <- Match(
    Y = re78,
    Tr = treat,
    X = X,
    estimand = "ATE",
    Weight.matrix = genout
)

# ATE Results
summary(mout)

Estimate...  2079 
AI SE......  810.25 
T-stat.....  2.5659 
p.val......  0.010291 

Original number of observations..............  445 
Original number of treated obs...............  185 
Matched number of observations...............  445 
Matched number of observations  (unweighted).  597 
# Assess balance
mb <- MatchBalance(
    treat ~ age + educ + black + hisp + married + nodegr + u74 + u75 + re75 + re74,
    match.out = mout,
    nboots = 500
)

***** (V1) age *****
                       Before Matching       After Matching
mean treatment........     25.816            25.119 
mean control..........     25.054            24.994 
std mean diff.........     10.655            1.8544 

mean raw eQQ diff.....    0.94054           0.35511 
med  raw eQQ diff.....          1                 0 
max  raw eQQ diff.....          7                 8 

mean eCDF diff........   0.025364         0.0097054 
med  eCDF diff........   0.022193         0.0083752 
max  eCDF diff........   0.065177          0.025126 

var ratio (Tr/Co).....     1.0278           0.96842 
T-test p-value........    0.26594           0.52651 
KS Bootstrap p-value..      0.512             0.906 
KS Naive p-value......     0.7481           0.99172 
KS Statistic..........   0.065177          0.025126 


***** (V2) educ *****
                       Before Matching       After Matching
mean treatment........     10.346              10.2 
mean control..........     10.088            10.225 
std mean diff.........     12.806           -1.4276 

mean raw eQQ diff.....    0.40541           0.10553 
med  raw eQQ diff.....          0                 0 
max  raw eQQ diff.....          2                 2 

mean eCDF diff........   0.028698         0.0075377 
med  eCDF diff........   0.012682         0.0033501 
max  eCDF diff........    0.12651          0.025126 

var ratio (Tr/Co).....     1.5513            1.0569 
T-test p-value........    0.15017           0.49442 
KS Bootstrap p-value..      0.014             0.712 
KS Naive p-value......   0.062873           0.99172 
KS Statistic..........    0.12651          0.025126 


***** (V3) black *****
                       Before Matching       After Matching
mean treatment........    0.84324            0.8382 
mean control..........    0.82692           0.84045 
std mean diff.........     4.4767          -0.60952 

mean raw eQQ diff.....   0.016216          0.001675 
med  raw eQQ diff.....          0                 0 
max  raw eQQ diff.....          1                 1 

mean eCDF diff........  0.0081601        0.00083752 
med  eCDF diff........  0.0081601        0.00083752 
max  eCDF diff........    0.01632          0.001675 

var ratio (Tr/Co).....    0.92503            1.0114 
T-test p-value........    0.64736           0.65487 


***** (V4) hisp *****
                       Before Matching       After Matching
mean treatment........   0.059459           0.08764 
mean control..........    0.10769           0.08764 
std mean diff.........    -20.341                 0 

mean raw eQQ diff.....   0.048649                 0 
med  raw eQQ diff.....          0                 0 
max  raw eQQ diff.....          1                 0 

mean eCDF diff........   0.024116                 0 
med  eCDF diff........   0.024116                 0 
max  eCDF diff........   0.048233                 0 

var ratio (Tr/Co).....    0.58288                 1 
T-test p-value........   0.064043                 1 


***** (V5) married *****
                       Before Matching       After Matching
mean treatment........    0.18919           0.16854 
mean control..........    0.15385           0.16629 
std mean diff.........     8.9995           0.59963 

mean raw eQQ diff.....   0.037838          0.001675 
med  raw eQQ diff.....          0                 0 
max  raw eQQ diff.....          1                 1 

mean eCDF diff........   0.017672        0.00083752 
med  eCDF diff........   0.017672        0.00083752 
max  eCDF diff........   0.035343          0.001675 

var ratio (Tr/Co).....     1.1802            1.0108 
T-test p-value........    0.33425           0.31731 


***** (V6) nodegr *****
                       Before Matching       After Matching
mean treatment........    0.70811           0.78202 
mean control..........    0.83462           0.78202 
std mean diff.........    -27.751                 0 

mean raw eQQ diff.....    0.12432                 0 
med  raw eQQ diff.....          0                 0 
max  raw eQQ diff.....          1                 0 

mean eCDF diff........   0.063254                 0 
med  eCDF diff........   0.063254                 0 
max  eCDF diff........    0.12651                 0 

var ratio (Tr/Co).....     1.4998                 1 
T-test p-value........  0.0020368                 1 


***** (V7) u74 *****
                       Before Matching       After Matching
mean treatment........    0.70811           0.73258 
mean control..........       0.75           0.73034 
std mean diff.........    -9.1895           0.50714 

mean raw eQQ diff.....   0.037838          0.001675 
med  raw eQQ diff.....          0                 0 
max  raw eQQ diff.....          1                 1 

mean eCDF diff........   0.020946        0.00083752 
med  eCDF diff........   0.020946        0.00083752 
max  eCDF diff........   0.041892          0.001675 

var ratio (Tr/Co).....     1.1041           0.99472 
T-test p-value........    0.33033           0.56385 


***** (V8) u75 *****
                       Before Matching       After Matching
mean treatment........        0.6           0.64494 
mean control..........    0.68462           0.64944 
std mean diff.........    -17.225          -0.93815 

mean raw eQQ diff.....   0.081081         0.0033501 
med  raw eQQ diff.....          0                 0 
max  raw eQQ diff.....          1                 1 

mean eCDF diff........   0.042308          0.001675 
med  eCDF diff........   0.042308          0.001675 
max  eCDF diff........   0.084615         0.0033501 

var ratio (Tr/Co).....     1.1133            1.0058 
T-test p-value........   0.068031            0.4143 


***** (V9) re75 *****
                       Before Matching       After Matching
mean treatment........     1532.1            1295.4 
mean control..........     1266.9            1328.7 
std mean diff.........     8.2363           -1.1277 

mean raw eQQ diff.....     367.61            128.59 
med  raw eQQ diff.....          0                 0 
max  raw eQQ diff.....     2110.2            8195.6 

mean eCDF diff........   0.050834         0.0082012 
med  eCDF diff........   0.061954         0.0067002 
max  eCDF diff........    0.10748          0.023451 

var ratio (Tr/Co).....     1.0763            1.0046 
T-test p-value........    0.38527           0.69008 
KS Bootstrap p-value..      0.036             0.792 
KS Naive p-value......    0.16449           0.99663 
KS Statistic..........    0.10748          0.023451 


***** (V10) re74 *****
                       Before Matching       After Matching
mean treatment........     2095.6              2002 
mean control..........       2107            2053.7 
std mean diff.........   -0.23437           -1.0605 

mean raw eQQ diff.....     487.98            197.67 
med  raw eQQ diff.....          0                 0 
max  raw eQQ diff.....       8413            7870.3 

mean eCDF diff........   0.019223         0.0061813 
med  eCDF diff........     0.0158         0.0050251 
max  eCDF diff........   0.047089          0.023451 

var ratio (Tr/Co).....     0.7381            0.8693 
T-test p-value........    0.98186           0.58258 
KS Bootstrap p-value..      0.568             0.668 
KS Naive p-value......    0.97023           0.99663 
KS Statistic..........   0.047089          0.023451 


Before Matching Minimum p.value: 0.0020368 
Variable Name(s): nodegr  Number(s): 6 

After Matching Minimum p.value: 0.31731 
Variable Name(s): married  Number(s): 5 

Entropy Balancing

Concept

Entropy balancing reweights control units to exactly match treatment group moments while minimizing information loss (maximum entropy).

Optimization Problem

Find weights \(\{w_i\}_{i:T_i=0}\) that:

  1. Balance constraints: \[ \sum_{i:T_i=0} w_i X_i = \bar{X}_T \]

  2. Minimize KL divergence: \[ \min \sum_{i:T_i=0} w_i \log(w_i / w_i^{(0)}) \]

where \(w_i^{(0)} = 1/n_0\) are uniform base weights.

Advantages

  • Exact balance on specified moments (means, variances, etc.)
  • Unique solution via convex optimization
  • No iteration required

Estimation

library(ebal, quietly = TRUE)

# Entropy balancing
eb.out <- ebalance(
  Treatment = treat,
  X = cbind(age, educ, black, married, re74, re75)
)
Converged within tolerance 
# Construct weights
weights <- c(eb.out$w, rep(1, sum(treat)))

# Estimate ATE
fit <- lm(re78 ~ treat, weights = weights)
summary(fit)

Call:
lm(formula = re78 ~ treat, weights = weights)

Weighted Residuals:
   Min     1Q Median     3Q    Max 
 -6848  -4518  -1626   2638  38706 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4518.0      371.2  12.172  < 2e-16 ***
treat         1874.1      633.8   2.957  0.00327 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5787 on 443 degrees of freedom
Multiple R-squared:  0.01935,   Adjusted R-squared:  0.01714 
F-statistic: 8.743 on 1 and 443 DF,  p-value: 0.003274

Comparison of Methods

Method Distance Pros Cons
Mahalanobis Multivariate Scale-invariant, preserves structure High-dimensional issues
CEM Exact (discretized) Monotonic balance, interpretable Information loss from coarsening
Genetic Optimized Data-adaptive, superior balance Computationally intensive
Entropy Reweighting Exact balance, efficient Requires all units, potential extreme weights

References