Sensitivity Analysis

Matching estimates are only valid if the ignorability assumption holds—that all confounders are observed. Since this assumption is untestable, sensitivity analyses assess how robust the results are to potential violations.

Assessing Common Support

Definition

Common support (or overlap) requires that for every covariate value, there is a positive probability of receiving both treatment and control:

\[ 0 < P(T = 1 | X = x) < 1 \quad \forall x \]

Without overlap, comparisons require extrapolation beyond the empirical data, making causal inference unreliable.

Diagnosing Lack of Overlap

1. Propensity Score Histograms

library(ggplot2, quietly = TRUE)
# Visualize propensity score distributions
ggplot(data, aes(x = pscore, fill = factor(treat))) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 30) +
  labs(x = "Propensity Score", y = "Count", fill = "Treatment") +
  theme_minimal()

Look for:

  • Non-overlapping regions at extreme values
  • Treatment group with no control counterparts (or vice versa)

2. Convex Hull Method

The convex hull identifies the multidimensional space where interpolation (not extrapolation) is possible (King and Zeng 2006).

  • Discard units outside the convex hull of the opposite group
  • More restrictive than propensity score trimming
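A sketch of the hull check for two covariates, assuming the geometry package (its convhulln and inhulln functions) is available; the data here are simulated purely for illustration:

```r
library(geometry, quietly = TRUE)

# Simulated 2-D covariates for treated and control units (illustrative)
set.seed(1)
X_treat   <- cbind(rnorm(100), rnorm(100))
X_control <- cbind(rnorm(100, sd = 1.5), rnorm(100, sd = 1.5))

# Keep only control units inside the convex hull of the treated covariates,
# i.e. where comparison is interpolation rather than extrapolation
hull   <- convhulln(X_treat)
inside <- inhulln(hull, X_control)
X_control_kept <- X_control[inside, , drop = FALSE]
```

With more covariates the hull becomes high-dimensional and increasingly empty, which is one reason propensity score trimming is the more common choice in practice.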

3. Discarding Units Outside Common Support

By propensity score:

# Define range
ps_min <- max(min(data$pscore[data$treat == 1]),
              min(data$pscore[data$treat == 0]))
ps_max <- min(max(data$pscore[data$treat == 1]),
              max(data$pscore[data$treat == 0]))

# Keep only units in common support (dplyr provides %>% and filter)
library(dplyr, quietly = TRUE)
data_trimmed <- data %>%
  filter(pscore >= ps_min & pscore <= ps_max)

Implications for Estimands

  • ATT: Discarding control units outside the treated range is usually acceptable, since the target population (the treated) is preserved
  • ATE: Discarding any units changes the target population and reduces external validity

Balance Diagnostics

After matching, verify that covariates are balanced between treated and control groups.

Standardized Mean Differences

The standardized mean difference (SMD) for covariate \(X\) is:

\[ \text{SMD} = \frac{\bar{X}_T - \bar{X}_C}{\sqrt{\frac{s_T^2 + s_C^2}{2}}} \]

where \(\bar{X}_T, \bar{X}_C\) are means and \(s_T^2, s_C^2\) are variances in treated and control groups.

Benchmark: SMD \(< 0.1\) indicates good balance (Rubin 2001)

Variance Ratios

Compare variances of continuous covariates:

\[ \text{VR} = \frac{s_T^2}{s_C^2} \]

Benchmark: \(0.5 < \text{VR} < 2.0\) (Rubin 2001)

Propensity Score Balance

  • Standardized difference of propensity score means: should be \(< 0.25\)
  • Variance ratio of propensity scores: should be between 0.5 and 2.0
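The balance metrics above can be computed directly from the group means and variances; a minimal base-R sketch on simulated data (the variable names and simulated values are illustrative, not from any real study):

```r
# Simulated covariate values in treated and control groups (illustrative)
set.seed(1)
x_t <- rnorm(200, mean = 0.25, sd = 1.1)   # treated
x_c <- rnorm(200, mean = 0.00, sd = 1.0)   # control

# Standardized mean difference with pooled-variance denominator
smd <- (mean(x_t) - mean(x_c)) / sqrt((var(x_t) + var(x_c)) / 2)

# Variance ratio
vr <- var(x_t) / var(x_c)

# Check against Rubin's (2001) benchmarks
c(smd_ok = abs(smd) < 0.1, vr_ok = vr > 0.5 && vr < 2.0)
```

In applied work the cobalt package computes these for every covariate at once, but the arithmetic is exactly this.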

Graphical Diagnostics

Love Plots

Display standardized mean differences before and after matching:

library(cobalt, quietly = TRUE)
bal.plot(m.out, var.name = "distance", which = "both")

love.plot(m.out, threshold = 0.1,
          abs = TRUE,
          var.order = "unadjusted")

QQ Plots

Compare empirical distributions of covariates:

plot(m.out, type = "qq", which.xs = c("age", "educ", "re74"))

Empirical CDFs

plot(m.out, type = "ecdf")

Avoid Using P-values

Do not use t-tests or other hypothesis tests to assess balance:

  • P-values conflate effect size with sample size
  • Large samples will show “significant” differences even for trivial imbalances
  • Balance is a design property, not a hypothesis test

Rosenbaum Sensitivity Analysis

Motivation

Even after matching, hidden bias from unobserved confounders \(U\) may remain. Rosenbaum bounds quantify how strong this bias must be to overturn the findings.

Setup

Let \(\pi_i = P(T_i = 1)\) be the true probability of treatment for unit \(i\). Define the odds of treatment:

\[ \text{Odds}_i = \frac{\pi_i}{1 - \pi_i} \]

For two matched units \(i\) and \(j\), the odds ratio is bounded by \(\Gamma\):

\[ \frac{1}{\Gamma} \leq \frac{\text{Odds}_i}{\text{Odds}_j} \leq \Gamma \]

Interpretation

  • \(\Gamma = 1\): No hidden bias (treatment is “as good as randomly assigned” conditional on \(X\))
  • \(\Gamma = 2\): One of two matched units could have up to twice the odds of receiving treatment because of an unobserved confounder
  • \(\Gamma = 3\): Up to three times the odds
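To make the scale of \(\Gamma\) concrete (the numbers here are illustrative arithmetic, not from the source): suppose matched unit \(j\) has \(\pi_j = 0.5\), so its odds of treatment equal 1. At \(\Gamma = 2\) the bound allows

\[ \frac{1}{2} \leq \frac{\pi_i}{1 - \pi_i} \leq 2 \quad \Rightarrow \quad \frac{1}{3} \leq \pi_i \leq \frac{2}{3} \]

so an unobserved confounder at \(\Gamma = 2\) could move a matched unit's treatment probability anywhere from \(1/3\) to \(2/3\) while the units look identical on \(X\).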

Procedure

  1. Choose a range of \(\Gamma\) values (e.g., 1.0, 1.1, 1.2, …, 3.0)
  2. For each \(\Gamma\), compute bounds on the p-value or treatment effect estimate
  3. Report the critical \(\Gamma\) at which the effect becomes insignificant

Estimation

library(rbounds, quietly = TRUE)
library(MatchIt, quietly = TRUE)

# Load the lalonde data shipped with MatchIt
df <- MatchIt::lalonde

# Perform matching
m.out <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                 data = df,
                 method = "nearest")
matched_data <- match.data(m.out)

# Split into groups
treat_y <- matched_data$re78[matched_data$treat == 1]
control_y <- matched_data$re78[matched_data$treat == 0]

# P-value sensitivity
psens(treat_y, control_y, Gamma = 3, GammaInc = 0.1)

 Rosenbaum Sensitivity Test for Wilcoxon Signed Rank P-Value 
 
Unconfounded estimate ....  0.1928 

 Gamma Lower bound Upper bound
   1.0      0.1928      0.1928
   1.1      0.0796      0.3709
   1.2      0.0285      0.5640
   1.3      0.0091      0.7300
   1.4      0.0026      0.8489
   1.5      0.0007      0.9226
   1.6      0.0002      0.9633
   1.7      0.0000      0.9837
   1.8      0.0000      0.9932
   1.9      0.0000      0.9973
   2.0      0.0000      0.9990
   2.1      0.0000      0.9996
   2.2      0.0000      0.9999
   2.3      0.0000      1.0000
   2.4      0.0000      1.0000
   2.5      0.0000      1.0000
   2.6      0.0000      1.0000
   2.7      0.0000      1.0000
   2.8      0.0000      1.0000
   2.9      0.0000      1.0000
   3.0      0.0000      1.0000

 Note: Gamma is Odds of Differential Assignment To
 Treatment Due to Unobserved Factors 
 
# Hodges-Lehmann point estimate sensitivity
hlsens(treat_y, control_y, Gamma = 3, GammaInc = 0.1)

 Rosenbaum Sensitivity Test for Hodges-Lehmann Point Estimate 
 
Unconfounded estimate ....  686.1615 

 Gamma Lower bound Upper bound
   1.0     686.160      686.16
   1.1     165.060      932.06
   1.2     -50.038     1308.00
   1.3    -407.340     1581.90
   1.4    -680.640     1853.70
   1.5    -938.240     2039.30
   1.6   -1219.100     2268.00
   1.7   -1452.600     2514.10
   1.8   -1716.700     2756.70
   1.9   -1960.600     3070.00
   2.0   -2185.800     3282.40
   2.1   -2422.000     3508.80
   2.2   -2630.700     3721.00
   2.3   -2820.400     3892.10
   2.4   -2943.000     4044.80
   2.5   -3081.200     4208.40
   2.6   -3223.600     4361.10
   2.7   -3355.900     4531.70
   2.8   -3472.100     4685.60
   2.9   -3607.900     4816.70
   3.0   -3720.700     4950.00

 Note: Gamma is Odds of Differential Assignment To
 Treatment Due to Unobserved Factors 
 

Output

  • Tables showing p-values and confidence intervals at different \(\Gamma\) levels
  • Identify \(\Gamma\) where effect loses significance

Interpretation

If the critical \(\Gamma\) is:

  • < 1.5: Results are very sensitive to hidden bias
  • 1.5 - 2.5: Moderately robust
  • > 2.5: Highly robust—unobserved confounder would need to be very strong

Python:

# No direct Rosenbaum bounds implementation in Python
# Use R via rpy2, or implement manually with a Wilcoxon signed-rank test
import rpy2.robjects as ro
ro.r('library(rbounds, quietly = TRUE)')
# Assumes treat_y and control_y have already been assigned in the R session
result = ro.r('psens(treat_y, control_y, Gamma = 2)')

Stata:

* After matching, save the matched sample
ssc install rbounds

* Sensitivity analysis
rbounds outcome, gamma(1(0.1)3)

Relative Correlation Restrictions

An alternative sensitivity approach bounds the treatment effect under assumptions about selection on unobservables relative to selection on observables (Altonji, Elder, and Taber 2005).

Framework

Assume:

\[ \text{Cov}(T, \epsilon) = \lambda \cdot \text{Cov}(T, X^\top \gamma) \]

where:

  • \(\epsilon\) contains unobserved confounders
  • \(\lambda\) is the relative selection ratio

Interpretation

  • \(\lambda = 0\): No selection on unobservables (standard OLS)
  • \(\lambda = 1\): Selection on unobservables is equal to selection on observables
  • \(\lambda > 1\): Unobservables are more important than observables

Critical value \(\lambda^*\): The value at which the treatment effect reaches zero

Implementation

library(rcrbounds, quietly = TRUE)
# Estimate bounds
rcr_result <- rcr(
  formula = outcome ~ treat | age + educ + married,
  data = mydata,
  rc_range = c(0, 10)
)

summary(rcr_result)
plot(rcr_result)

Coefficient Stability (Oster’s delta)

Oster (2019) proposes testing robustness by examining how coefficients change as controls are added.

Key Metrics

  1. Bias-adjusted coefficient \(\beta^*(\delta, R^2_{\max})\)
  2. Breakdown point \(\delta^*\): the value of \(\delta\) at which \(\beta^* = 0\)

Interpretation:

  • \(\delta^* > 1\): Unobservables would need to be more strongly selected than observables to eliminate the effect
  • \(\delta^* < 1\): Results are fragile to hidden confounding

Implementation

library(robomit, quietly = TRUE)

# Fit the baseline OLS model, used below in Oster's R2max heuristic
lm_model <- lm(outcome ~ treat + age + educ + married + re74 + re75,
               data = mydata)

# Compute delta*
delta_result <- o_delta(
  y = "outcome",
  x = "treat",
  con = "age + educ + married + re74 + re75",
  beta = 0,
  R2max = 1.3 * summary(lm_model)$r.squared,
  type = "lm",
  data = mydata
)

# Visualize
o_delta_rsq_viz(
  y = "outcome", x = "treat",
  con = "age + educ + married + re74 + re75",
  beta = 0,
  data = mydata
)

Falsification Tests

Placebo Outcomes

Test whether the treatment “affects” outcomes it should not affect (e.g., pre-treatment variables, irrelevant outcomes).

# Match as usual (the covariates listed here are illustrative)
m.out <- matchit(treat ~ age + educ + married + re74, data = mydata)

# Test on placebo outcome
placebo_model <- lm(pre_treatment_outcome ~ treat, 
                    data = match.data(m.out),
                    weights = weights)

# Should find no effect
summary(placebo_model)

Permutation Tests

Randomly reassign treatment status and re-estimate the effect. The true effect should be an outlier in the permutation distribution.
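A minimal sketch of such a permutation test on simulated data (the data-generating process and the difference-in-means estimator are illustrative; in practice you would permute within the matched sample and reuse your actual estimator):

```r
# Simulated data with a true treatment effect of 0.5 (illustrative)
set.seed(3)
n <- 200
treat <- rbinom(n, 1, 0.5)
y <- 1 + 0.5 * treat + rnorm(n)

# Observed difference in means
obs_effect <- mean(y[treat == 1]) - mean(y[treat == 0])

# Permutation distribution: reassign treatment at random, breaking
# any treatment-outcome link, and re-estimate each time
perm_effects <- replicate(1000, {
  t_perm <- sample(treat)
  mean(y[t_perm == 1]) - mean(y[t_perm == 0])
})

# Permutation p-value: how often a random reassignment matches or
# exceeds the observed effect in magnitude
p_perm <- mean(abs(perm_effects) >= abs(obs_effect))
```

If the true effect is real, `obs_effect` sits in the tail of `perm_effects` and `p_perm` is small.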

Untestable Assumptions

Some assumptions cannot be tested empirically:

  1. Unconfoundedness: We can never prove all confounders are observed
  2. SUTVA: Interference effects are hard to detect directly
  3. Functional form: Matching is somewhat robust, but outcome models still require assumptions

Strategies:

  • Domain knowledge: Use theory to justify covariate selection
  • Sensitivity analyses: Quantify how robust conclusions are
  • Robustness checks: Try multiple matching methods and specifications
  • Transparency: Report all decisions and diagnostics

References

Altonji, Joseph G., Todd E. Elder, and Christopher R. Taber. 2005. “Selection on Observed and Unobserved Variables: Assessing the Effectiveness of Catholic Schools.” Journal of Political Economy 113 (1): 151–84. https://doi.org/10.1086/426036.
King, Gary, and Langche Zeng. 2006. “The Dangers of Extreme Counterfactuals.” Political Analysis 14 (2): 131–59. https://doi.org/10.1093/pan/mpj004.
Oster, Emily. 2019. “Unobservable Selection and Coefficient Stability: Theory and Evidence.” Journal of Business and Economic Statistics 37 (2): 187–204. https://doi.org/10.1080/07350015.2016.1227711.
Rubin, Donald B. 2001. “Using Propensity Scores to Help Design Observational Studies: Application to the Tobacco Litigation.” Health Services and Outcomes Research Methodology 2 (3–4): 169–88. https://doi.org/10.1023/A:1020363010465.