Matching Methods

Inverse Probability Weighting (IPW)

Introduction

Inverse probability weighting (IPW), also known as propensity score weighting, is an alternative to matching that uses the propensity score to create a pseudo-population where treatment is independent of measured confounders (Rosenbaum 1987; Hirano, Imbens, and Ridder 2003).

Rather than discarding unmatched units, IPW reweights the sample so that treated and control groups have similar covariate distributions. This approach:

  • Retains all observations (no discarding)
  • Directly estimates population-level treatment effects
  • Provides a link between causal inference and survey sampling methods

The Core Idea

Each observation is weighted by the inverse of the probability of the treatment it actually received:

\[ w_i = \frac{1}{P(\text{Treatment}_i | X_i)} \]

This upweights underrepresented observations and downweights overrepresented ones, creating balance in expectation.
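As a toy illustration (the numbers here are hypothetical, not from any dataset): a treated unit with a 25% chance of treatment receives weight 4, so it stands in for four similar units, three of which ended up untreated.

```python
# Toy illustration of inverse probability weights (hypothetical numbers).
def ipw_weight(treated: bool, e: float) -> float:
    """Weight = 1 / probability of the treatment actually received."""
    return 1 / e if treated else 1 / (1 - e)

# A treated unit with e(X) = 0.25 is rare among its covariate peers,
# so it is upweighted to represent them.
print(ipw_weight(True, 0.25))   # -> 4.0
# A control unit with the same e(X) is common, so its weight is modest.
print(ipw_weight(False, 0.25))  # -> 1.33...
```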

Propensity Score Review

As with propensity score matching, we first estimate the propensity score:

\[ e(X_i) = P(T_i = 1 | X_i) \]

typically using logistic regression:

\[ \log \left( \frac{e(X)}{1 - e(X)} \right) = X^\top \beta \]

The estimated propensity score \(\hat{e}(X_i)\) is then used to construct weights.

IPW Estimators

Average Treatment Effect (ATE)

To estimate the population average treatment effect, use ATE weights:

\[ w_i^{ATE} = \frac{T_i}{e(X_i)} + \frac{1 - T_i}{1 - e(X_i)} \]

  • Treated observations get weight \(1/e(X_i)\)
  • Control observations get weight \(1/(1-e(X_i))\)

The weighted mean difference, a Horvitz-Thompson-type estimator, estimates the ATE:

\[ \hat{\tau}^{ATE} = \frac{1}{n}\sum_{i=1}^n w_i^{ATE} \cdot Y_i \cdot (2T_i - 1) \]

Alternatively, the normalized (Hájek) estimator, which is usually more stable in practice, divides by the sum of the weights in each group: \[ \hat{\tau}^{ATE} = \frac{\sum_{i=1}^n \frac{T_i \cdot Y_i}{e(X_i)}}{\sum_{i=1}^n \frac{T_i}{e(X_i)}} - \frac{\sum_{i=1}^n \frac{(1-T_i) \cdot Y_i}{1-e(X_i)}}{\sum_{i=1}^n \frac{1-T_i}{1-e(X_i)}} \]

Target population: The entire population (both treated and untreated)
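A quick simulation makes the two ATE estimators concrete. This is an illustrative sketch (all variable names are made up), using the true propensity scores so both estimators can be checked against a known treatment effect of 2:

```python
# Sketch: Horvitz-Thompson vs. normalized (Hajek) ATE estimators on
# simulated confounded data with a known true ATE of 2.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))              # true propensity score
t = rng.binomial(1, e)                # confounded treatment assignment
y = 2.0 * t + x + rng.normal(size=n)  # outcome; true ATE = 2

w = t / e + (1 - t) / (1 - e)         # ATE weights

# Horvitz-Thompson: (1/n) * sum of w * y * (2t - 1)
tau_ht = np.mean(w * y * (2 * t - 1))

# Hajek: normalize the weights within each treatment arm
tau_hajek = (np.sum(t * y / e) / np.sum(t / e)
             - np.sum((1 - t) * y / (1 - e)) / np.sum((1 - t) / (1 - e)))

print(tau_ht, tau_hajek)  # both should land near the true ATE of 2
```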

Average Treatment Effect on the Treated (ATT)

To estimate the effect for the treated population, use ATT weights:

\[ w_i^{ATT} = T_i + (1 - T_i) \cdot \frac{e(X_i)}{1 - e(X_i)} \]

  • Treated observations get weight 1
  • Control observations get weight \(e(X_i)/(1-e(X_i))\)

This reweights controls to look like the treated group.

\[ \hat{\tau}^{ATT} = \frac{1}{n_1}\sum_{i=1}^n w_i^{ATT} \cdot Y_i \cdot (2T_i - 1) \]

Target population: The treated group
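The claim that ATT weights make controls "look like" the treated group can be checked directly: after weighting, the control covariate mean should move to the treated covariate mean. A simulated sketch (illustrative names, not the Lalonde data):

```python
# Sketch: ATT weights shift the control covariate distribution toward
# the treated distribution (simulated data).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))   # true propensity score
t = rng.binomial(1, e)

# Treated units get weight 1; controls get the odds of treatment
w_att = t + (1 - t) * e / (1 - e)

mean_treated = x[t == 1].mean()
mean_ctrl_raw = x[t == 0].mean()
mean_ctrl_w = np.average(x[t == 0], weights=w_att[t == 0])

# The raw control mean is far below the treated mean; the ATT-weighted
# control mean should be close to it.
print(mean_treated, mean_ctrl_raw, mean_ctrl_w)
```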

Overlap Weights

To improve stability by focusing on the region of common support, use overlap weights (Li, Morgan, and Zaslavsky 2018):

\[ w_i^{OW} = (1 - T_i) \cdot e(X_i) + T_i \cdot (1 - e(X_i)) \]

These weights emphasize the population where treatment assignment is most uncertain (near \(e(X) = 0.5\)) and downweight extreme propensity scores.
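The stability gain is easy to see numerically: overlap weights are bounded between 0 and 1, whereas ATE weights are unbounded as \(e(X)\) approaches 0 or 1. A small sketch with illustrative propensity values:

```python
# Sketch: overlap weights stay in [0, 1]; ATE weights blow up at
# extreme propensity scores (illustrative values).
import numpy as np

e = np.array([0.02, 0.30, 0.50, 0.70, 0.98])  # propensity scores
t = np.array([1,    1,    0,    0,    1])     # treatment indicators

w_ate = t / e + (1 - t) / (1 - e)       # ATE weights
w_ow = (1 - t) * e + t * (1 - e)        # overlap weights

print(w_ate)  # the treated unit with e = 0.02 gets weight 50
print(w_ow)   # every overlap weight is at most 1
```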

Stabilized Weights

Extreme propensity scores can lead to highly variable weights. Stabilized weights reduce variance by normalizing:

For ATE: \[ w_i^{ATE, stab} = \frac{P(T_i)}{P(T_i | X_i)} \]

where \(P(T_i = 1) = \bar{T} = n_1/n\) is the marginal treatment probability.

More explicitly: \[ w_i^{ATE, stab} = T_i \cdot \frac{\bar{T}}{e(X_i)} + (1 - T_i) \cdot \frac{1 - \bar{T}}{1 - e(X_i)} \]

Stabilized weights target the same estimand but are less variable: they average to 1, whereas the raw ATE weights average to 2 (Robins, Hernan, and Brumback 2000).
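A quick simulation (illustrative, not the Lalonde data) shows the effect of stabilization: the stabilized weights average to about 1 while the raw ATE weights average to about 2, and the stabilized weights have visibly smaller spread.

```python
# Sketch: stabilized ATE weights average to ~1 with lower variance than
# raw ATE weights, which average to ~2 (simulated data).
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))   # true propensity score
t = rng.binomial(1, e)
p_bar = t.mean()           # marginal treatment probability

w_raw = t / e + (1 - t) / (1 - e)
w_stab = t * p_bar / e + (1 - t) * (1 - p_bar) / (1 - e)

print(w_raw.mean(), w_stab.mean())  # ~2 vs. ~1
print(w_raw.var(), w_stab.var())    # stabilization shrinks the variance
```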

Workflow

Step 1: Estimate Propensity Scores

Fit a logistic regression (or more flexible model) to predict treatment:

# R example
ps_model <- glm(treatment ~ age + education + income + gender, 
                data = df, 
                family = binomial(link = "logit"))

df$propensity <- predict(ps_model, type = "response")

Step 2: Calculate Weights

Choose the appropriate weight for your estimand:

# ATE weights
df$weight_ate <- with(df, treatment / propensity + 
                            (1 - treatment) / (1 - propensity))

# ATT weights
df$weight_att <- with(df, treatment + 
                          (1 - treatment) * propensity / (1 - propensity))

# Stabilized ATE weights
p_treat <- mean(df$treatment)
df$weight_ate_stab <- with(df, 
  treatment * p_treat / propensity + 
  (1 - treatment) * (1 - p_treat) / (1 - propensity))

Step 3: Check Balance

Verify that weighting improves covariate balance by computing weighted standardized mean differences:

# Weighted SMD
library(cobalt, quietly = TRUE)
bal.tab(treatment ~ age + education + income + gender,
        data = df,
        weights = "weight_ate",
        method = "weighting")

Step 4: Estimate Treatment Effect

Fit a weighted regression:

# Weighted outcome model
effect_model <- lm(outcome ~ treatment, 
                data = df, 
                weights = weight_ate)
summary(effect_model)

Estimation

library(tidyverse, quietly = TRUE)
library(cobalt, quietly = TRUE)
# Load the lalonde data from the MatchIt package
data <- MatchIt::lalonde

# Step 1: Estimate propensity scores
ps_model <- glm(treat ~ age + educ + race + married + nodegree + re74 + re75,
        data = data,
        family = binomial(link = "logit"))

data <- data %>%
    mutate(propensity = predict(ps_model, type = "response"))

# Step 2: Calculate IPW weights
data <- data %>%
    mutate(
    # ATE weights
    weight_ate = treat / propensity + 
                (1 - treat) / (1 - propensity),
    
    # ATT weights  
    weight_att = treat + 
                 (1 - treat) * propensity / (1 - propensity),
    
    # Stabilized ATE weights
    weight_ate_stab = (treat * mean(treat) / propensity) + 
                      ((1 - treat) * (1 - mean(treat)) / (1 - propensity))
    )

# Step 3: Check balance
bal.tab(treat ~ age + educ + race + married + nodegree + re74 + re75,
        data = data,
        weights = "weight_ate",
        stats = c("m", "v"),
        thresholds = c(m = 0.1))
Note: `s.d.denom` not specified; assuming "pooled".
Balance Measures
               Type Diff.Adj        M.Threshold V.Ratio.Adj
age         Contin.  -0.1676 Not Balanced, >0.1      0.3689
educ        Contin.   0.1296 Not Balanced, >0.1      0.5657
race_black   Binary   0.0499     Balanced, <0.1           .
race_hispan  Binary   0.0047     Balanced, <0.1           .
race_white   Binary  -0.0546     Balanced, <0.1           .
married      Binary  -0.0944     Balanced, <0.1           .
nodegree     Binary  -0.0547     Balanced, <0.1           .
re74        Contin.  -0.2740 Not Balanced, >0.1      0.8208
re75        Contin.  -0.1579 Not Balanced, >0.1      0.9562

Balance tally for mean differences
                   count
Balanced, <0.1         5
Not Balanced, >0.1     4

Variable with the greatest mean difference
 Variable Diff.Adj        M.Threshold
     re74   -0.274 Not Balanced, >0.1

Effective sample sizes
           Control Treated
Unadjusted  429.    185.  
Adjusted    329.01   58.33
# Visualize weights
ggplot(data, aes(x = weight_ate, group = as.factor(treat), fill = as.factor(treat))) +
    geom_histogram(bins = 50, alpha = 0.6, position = "identity") +
    labs(title = "Distribution of ATE Weights",
        x = "Weight", y = "Count",
        fill = "Treatment") +
    scale_fill_manual(values = c("0" = "steelblue", "1" = "tomato"))

# Step 4: Estimate treatment effect
ate_model <- lm(re78 ~ treat, 
                data = data, 
                weights = weight_ate)

# Robust standard errors
library(sandwich, quietly = TRUE)
library(lmtest, quietly = TRUE)
coeftest(ate_model, vcov = vcovHC(ate_model, type = "HC3"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6422.84     366.30 17.5344   <2e-16 ***
treat         224.68     939.66  0.2391   0.8111    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Alternative: Use WeightIt package
library(WeightIt, quietly = TRUE)
W <- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
            data = data,
            method = "ps",
            estimand = "ATE")

summary(W)
                  Summary of weights

- Weight ranges:

          Min                                  Max
treated 1.172 |---------------------------| 40.077
control 1.009 |-|                            4.743

- Units with the 5 most extreme weights by group:
                                           
            137    124    116     68     10
 treated 13.545 15.988 23.297 23.389 40.077
            412    388    226    196    118
 control   4.03  4.059   4.24  4.523  4.743

- Weight statistics:

        Coef of Var   MAD Entropy # Zeros
treated       1.478 0.807   0.534       0
control       0.552 0.391   0.118       0

- Effective Sample Sizes:

           Control Treated
Unweighted  429.    185.  
Weighted    329.01   58.33
bal.tab(W)
Balance Measures
                Type Diff.Adj
prop.score  Distance   0.1360
age          Contin.  -0.1676
educ         Contin.   0.1296
race_black    Binary   0.0499
race_hispan   Binary   0.0047
race_white    Binary  -0.0546
married       Binary  -0.0944
nodegree      Binary  -0.0547
re74         Contin.  -0.2740
re75         Contin.  -0.1579

Effective sample sizes
           Control Treated
Unadjusted  429.    185.  
Adjusted    329.01   58.33
# Estimate effect
library(marginaleffects, quietly = TRUE)
fit <- glm(re78 ~ treat, 
            data = data, 
            weights = W$weights,
            family = gaussian())

avg_comparisons(fit, variables = "treat")

 Estimate Std. Error     z Pr(>|z|)   S 2.5 % 97.5 %
      225        578 0.389    0.697 0.5  -908   1357

Term: treat
Type: response
Comparison: 1 - 0
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load data
data = pd.read_csv('lalonde.csv')

# Step 1: Estimate propensity scores
# race is a string column, so one-hot encode it before fitting
X = pd.get_dummies(
    data[['age', 'educ', 'race', 'married', 'nodegree', 're74', 're75']],
    columns=['race'], drop_first=True)
y = data['treat']

# penalty=None fits an unpenalized logistic regression (sklearn >= 1.2),
# matching the plain logit models used above
ps_model = LogisticRegression(penalty=None, max_iter=1000)
ps_model.fit(X, y)

data['propensity'] = ps_model.predict_proba(X)[:, 1]

# Step 2: Calculate weights
p_treat = data['treat'].mean()

data['weight_ate'] = (data['treat'] / data['propensity'] + 
                      (1 - data['treat']) / (1 - data['propensity']))

data['weight_att'] = (data['treat'] + 
                      (1 - data['treat']) * data['propensity'] / 
                      (1 - data['propensity']))

data['weight_ate_stab'] = (data['treat'] * p_treat / data['propensity'] + 
                           (1 - data['treat']) * (1 - p_treat) / 
                           (1 - data['propensity']))

# Step 3: Check balance
def weighted_smd(data, var, treatment_col, weight_col):
    """Calculate weighted standardized mean difference"""
    treated = data[data[treatment_col] == 1]
    control = data[data[treatment_col] == 0]
    
    # Weighted means
    mean_t = np.average(treated[var], weights=treated[weight_col])
    mean_c = np.average(control[var], weights=control[weight_col])
    
    # Pooled standard deviation
    var_t = np.average((treated[var] - mean_t)**2, weights=treated[weight_col])
    var_c = np.average((control[var] - mean_c)**2, weights=control[weight_col])
    pooled_sd = np.sqrt((var_t + var_c) / 2)
    
    return (mean_t - mean_c) / pooled_sd

# Calculate SMDs for the numeric covariates (race would need its dummy columns)
covariates = ['age', 'educ', 'married', 'nodegree', 're74', 're75']
for var in covariates:
    smd = weighted_smd(data, var, 'treat', 'weight_ate')
    print(f"Weighted SMD for {var}: {smd:.3f}")

# Step 4: Estimate treatment effect with weighted regression
outcome_model = smf.wls('re78 ~ treat', 
                        data=data, 
                        weights=data['weight_ate']).fit()

print(outcome_model.summary())

# Get robust standard errors
print(outcome_model.get_robustcov_results(cov_type='HC3').summary())
* Load data 
import delimited "lalonde.csv", clear

* Step 1: Estimate propensity scores
logit treat age educ i.race married i.nodegree re74 re75
predict propensity, pr

* Step 2: Calculate IPW weights
* ATE weights
gen weight_ate = treat/propensity + (1-treat)/(1-propensity)

* ATT weights
gen weight_att = treat + (1-treat)*propensity/(1-propensity)

* Stabilized ATE weights
sum treat
local p_treat = r(mean)
gen weight_ate_stab = treat*`p_treat'/propensity + ///
                    (1-treat)*(1-`p_treat')/(1-propensity)

* Step 3: Check balance
* teffects package provides balance diagnostics
teffects ipw (re78) (treat age educ i.race married i.nodegree re74 re75), ///
    atet

tebalance summarize

* Or manually check weighted balance
foreach var of varlist age educ married nodegree re74 re75 {
    reg `var' treat [aweight=weight_ate]
}

* Step 4: Estimate treatment effect
* Weighted regression
reg re78 treat [aweight=weight_ate], robust

* Alternative: Use teffects directly
teffects ipw (re78) (treat age educ i.race married i.nodegree re74 re75), ///
    ate

* For ATT
teffects ipw (re78) (treat age educ i.race married i.nodegree re74 re75), ///
    atet

Advantages and Limitations

Advantages

  • Retains full sample - No observations discarded (unlike matching)
  • Efficient - Uses all available data
  • Flexible estimands - Easy to target ATE, ATT, or other parameters
  • Transparent - Clear connection to survey weighting methods
  • Doubly robust option - Can combine with outcome regression (see AIPW)

Limitations

  • Sensitive to extreme weights - Poor overlap can create unstable estimates
  • Model dependence - Relies heavily on correct propensity score specification
  • Extrapolation - May estimate effects in regions with little empirical support
  • Variance estimation - Requires careful handling of uncertainty in weights
  • Positivity violations - Fails when \(e(X) \approx 0\) or \(e(X) \approx 1\)
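These last two limitations can be quantified with the effective sample size (ESS), computed via Kish's formula \(n_{eff} = (\sum_i w_i)^2 / \sum_i w_i^2\) (the same quantity reported in the cobalt/WeightIt output above). A sketch with illustrative numbers shows how a single near-violation of positivity collapses the ESS:

```python
# Sketch: one near-positivity-violation observation collapses the
# effective sample size (Kish's formula; illustrative numbers).
import numpy as np

def kish_ess(w: np.ndarray) -> float:
    """Effective sample size: (sum w)^2 / sum(w^2)."""
    return w.sum() ** 2 / (w ** 2).sum()

e = np.full(100, 0.5)   # 100 treated units with moderate propensities...
e[0] = 0.001            # ...except one with e(X) near zero
w = 1 / e               # ATE weights for treated units

print(kish_ess(np.full(100, 2.0)))  # equal weights: ESS = 100.0
print(kish_ess(w))                  # one extreme weight: ESS collapses to ~1.4
```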

Diagnostics and Robustness

Weight Distribution

Check for extreme weights:

# Summary statistics
summary(df$weight_ate)

# Flag extreme weights
quantile(df$weight_ate, c(0.95, 0.99, 1.0))

# Visualize
ggplot(df, aes(x = weight_ate)) +
    geom_histogram() +
    geom_vline(xintercept = quantile(df$weight_ate, 0.99), 
            color = "red", linetype = "dashed")

Trimming Weights

If weights are extreme, consider:

  1. Truncation - Cap weights at a threshold
  2. Trimming propensity scores - Exclude observations with \(e(X) < 0.05\) or \(e(X) > 0.95\)
  3. Overlap weights - Use alternative weighting scheme

# Trim propensity scores
df_trimmed <- df %>%
    filter(propensity > 0.05 & propensity < 0.95)

# Truncate weights
max_weight <- quantile(df$weight_ate, 0.99)
df$weight_ate_trimmed <- pmin(df$weight_ate, max_weight)

Propensity Score Overlap

Visualize common support:

ggplot(df, aes(x = propensity, fill = factor(treatment))) +
    geom_density(alpha = 0.5) +
    labs(title = "Propensity Score Overlap",
        x = "Propensity Score",
        fill = "Treatment") +
    theme_minimal()

Comparison to Matching

Feature          IPW                        Matching
Sample size      Retains all data           May discard observations
Efficiency       Generally more efficient   Can lose precision
Estimand         Flexible (ATE/ATT/ATU)     Usually ATT
Extreme scores   Problematic                Excluded via caliper
Implementation   Straightforward weighting  Complex algorithms
Transparency     Less intuitive             More intuitive pairing

When to use IPW:

  • You want to estimate population ATE
  • Sample size is limited
  • Good propensity score overlap

When to use matching:

  • Poor overlap in propensity scores
  • You want ATT specifically
  • Pairing interpretation is important

Key Articles

Key papers on IPW:

  • Rosenbaum (1987) - Early development of propensity score methods
  • Hirano, Imbens, and Ridder (2003) - Efficient estimation with IPW
  • Robins, Hernan, and Brumback (2000) - Marginal structural models and stabilized weights
  • Li, Morgan, and Zaslavsky (2018) - Overlap weights for improved balance
  • Austin and Stuart (2015) - Comparison of propensity score methods

References

Austin, Peter C., and Elizabeth A. Stuart. 2015. “Moving Towards Best Practice When Using Inverse Probability of Treatment Weighting (IPTW) Using the Propensity Score to Estimate Causal Treatment Effects in Observational Studies.” Statistics in Medicine 34 (28): 3661–79. https://doi.org/10.1002/sim.6607.
Hirano, Keisuke, Guido W. Imbens, and Geert Ridder. 2003. “Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score.” Econometrica 71 (4): 1161–89. https://doi.org/10.1111/1468-0262.00442.
Li, Fan, Kari Lock Morgan, and Alan M. Zaslavsky. 2018. “Balancing Covariates via Propensity Score Weighting.” Journal of the American Statistical Association 113 (521): 390–400. https://doi.org/10.1080/01621459.2016.1260466.
Robins, James M., Miguel Angel Hernan, and Babette Brumback. 2000. “Marginal Structural Models and Causal Inference in Epidemiology.” Epidemiology 11 (5): 550–60. https://doi.org/10.1097/00001648-200009000-00011.
Rosenbaum, Paul R. 1987. “Model-Based Direct Adjustment.” Journal of the American Statistical Association 82 (398): 387–94. https://doi.org/10.1080/01621459.1987.10478441.