Any time we are handling observational data (as opposed to that from a randomized study) and trying to measure the effect of some treatment, bias can potentially confound our estimates—undermining the analysis. To reduce this bias, matching attempts to find groups that are similar across measured variables, despite receiving alternate sides of treatment. Matching requires special care when treatment(s) (and related matching variables) happen at multiple multiple times, as in some longitudinal data.
The R package rsmatch implements various forms of matching on time-varying observational studies to help in causal inference.
This short vignette aims to provide enough information on time-varying matching and the rsmatch package for someone to start data analysis. The package implements two different approaches, and we recommend a thorough read-through of the corresponding paper to understand the methodology, potential pitfalls, and other crucial considerations.
brsmatch()
, based on the work of Li, Propert, and Rosenbaum
(2001)coxph_match()
, based on the work
of Lu (2005)We start by showing a data set that can illustrate the structure of a time-varying observational data set.
Consider the oasis
data with 51 patients from the Open Access Series of Imaging
Studies that are classified at each time point as having Alzheimer’s
disease (AD) or not.
data(oasis)
head(oasis, n = 10)
#> subject_id visit time_of_ad m_f educ ses age mr_delay e_tiv n_wbv asf
#> 1 OAS2_0002 1 NA M 12 -1 75 0 1678 0.736 1.046
#> 2 OAS2_0002 2 NA M 12 -1 76 560 1738 0.713 1.010
#> 3 OAS2_0002 3 NA M 12 -1 80 1895 1698 0.701 1.034
#> 4 OAS2_0007 1 3 M 16 -1 71 0 1357 0.748 1.293
#> 5 OAS2_0007 3 3 M 16 -1 73 518 1365 0.727 1.286
#> 6 OAS2_0007 4 3 M 16 -1 75 1281 1372 0.710 1.279
#> 7 OAS2_0009 1 NA M 12 2 68 0 1457 0.806 1.205
#> 8 OAS2_0009 2 NA M 12 2 69 576 1480 0.791 1.186
#> 9 OAS2_0010 1 NA F 12 3 66 0 1447 0.769 1.213
#> 10 OAS2_0010 2 NA F 12 3 68 854 1482 0.752 1.184
The data set has been modified to a “long” time-varying format with the following variables:
subject_id
- A unique patient identifiervisit
- An identifier of the time point, from 1 up to
4time_of_ad
- The visit in which a patient was
identified to have AD, or NA if this patient has not yet been identified
with ADm_f
, educ
,ses
,
age
- Baseline patient characteristics for gender,
education level, socioeconomic status, agemr_delay
, e_tiv
, n_wbv
,
asf
- Time-varying patient characteristics for MR delay
time, estimated total intracranial volume, and Atlas scaling
factor.Importantly, visits have a similar time difference between them. This sets up a panel data structure. The package does not help with data sets where subject visits happen irregularly or continuously.
This data set illustrates a few of the standard causal inference elements for time-varying matching, including
Unfortunately, the data is missing a key aspect for causal analysis, the outcome on which to measure our treatment effect. Open-source longitudinal data sets are somewhat scarce, and this was the most illustrative example we could find while including in the package. However, this data fully demonstrates the matching process. Applying a set of matches to determine the treatment effect is, thankfully, a typically straightforward process.
Our goal is to match subjects similar in their gender, education level, socioeconomic status, age, and other MRI measurements, but one of which moved from a cognitively normal state to AD in the next period and the other who remained cognitively normal. On these similar groups, we can hopefully isolate the effect of an AD diagnosis on an outcome of interest.
Based on the work of Li et al. (2001), the brsmatch()
function performs “Balanced Risk Set Matching”. Given our data
structure, it finds matches based on minimizing the Mahalanobis distance
between covariates. If balancing is desired, the model will try to
minimize the imbalance in terms of specified balancing covariates in the
final pair output. Each treated individual is matched to one other
individual. Full details are provided in the available reference.
For example, we can find 5 matched pairs based on all covariates with balance across gender and age:
pairs <- brsmatch(
n_pairs = 5,
data = oasis,
id = "subject_id",
time = "visit",
trt_time = "time_of_ad",
covariates = c("m_f", "educ", "ses", "age",
"mr_delay", "e_tiv", "n_wbv", "asf"),
balance = TRUE,
balance_covariates = c("m_f", "age")
)
na.omit(pairs)
#> subject_id pair_id type
#> 5 OAS2_0014 1 trt
#> 15 OAS2_0043 2 all
#> 22 OAS2_0079 2 trt
#> 32 OAS2_0112 3 all
#> 36 OAS2_0124 1 all
#> 40 OAS2_0140 5 all
#> 41 OAS2_0150 3 trt
#> 43 OAS2_0160 4 trt
#> 49 OAS2_0182 4 all
#> 50 OAS2_0184 5 trt
The output includes a long-format data frame with
subject_id
, pair_id
, and type
to
indicate whether the subject was considered in the treatment
(trt
) group, or the control (all
) group for
the indicated time point. The type
variable is helpful
because subjects who got AD could match in either the treatment or
control group depending on which visit they were matched in.
brsmatch()
considers all of the possibilities in finding an
optimal match. The output will include all subjects, with an
NA
value for pair_id
and type
for
unmatched subjects.
Taking, for example, the first pair, OAS2_0014 to OAS2_0124, we know that the matching happened at visit 2, and in that visit, the two subjects have very similar covariates.
first_pair <- pairs[which(pairs$pair_id == 1), "subject_id"]
oasis[which(oasis$subject_id %in% first_pair), ]
#> subject_id visit time_of_ad m_f educ ses age mr_delay e_tiv n_wbv asf
#> 11 OAS2_0014 1 2 M 16 3 76 0 1602 0.697 1.096
#> 12 OAS2_0014 2 2 M 16 3 77 504 1590 0.696 1.104
#> 80 OAS2_0124 1 NA M 16 3 70 0 1463 0.749 1.200
#> 81 OAS2_0124 2 NA M 16 3 71 472 1479 0.750 1.187
Full details of the brsmatch()
function’s arguments can
be found with ?brsmatch
. One option not used in this
example allows for exact matching constraints. These can be very useful
to improve the optimization speed for large data sets.
Based on the work of Lu (2005) “Propensity Score Matching with
Time-Dependent Covariates”, the coxpsmatch()
function
matches subjects based on time-dependent propensity scores from a Cox
proportional hazards model. If balancing is desired, the model will try
to minimize the imbalance in terms of specified balancing covariates in
the final pair output. Each treated individual is matched to one other
individual.
The coxpsmatch()
interface is similar to
brsmatch()
in both inputs and outputs as the following
example shows. However, coxpsmatch()
can pair all
individuals if desired and does not allow (or need) the option of
separate balance variables.
pairs <- coxpsmatch(
n_pairs = 5,
data = oasis,
id = "subject_id",
time = "visit",
trt_time = "time_of_ad",
covariates = c("m_f", "educ", "ses", "age",
"mr_delay", "e_tiv", "n_wbv", "asf")
)
na.omit(pairs)
#> subject_id pair_id type
#> 3 OAS2_0009 1 all
#> 5 OAS2_0014 2 trt
#> 16 OAS2_0046 1 trt
#> 29 OAS2_0104 5 trt
#> 32 OAS2_0112 5 all
#> 34 OAS2_0114 4 trt
#> 36 OAS2_0124 2 all
#> 39 OAS2_0139 4 all
#> 41 OAS2_0150 3 trt
#> 49 OAS2_0182 3 all
In this example, OAS2_0014 still matches to OAS2_0124, but other matches differ.
second_pair <- pairs[which(pairs$pair_id == 2), "subject_id"]
oasis[which(oasis$subject_id %in% second_pair), ]
#> # A tibble: 4 × 11
#> subject_id visit time_of_ad m_f educ ses age mr_delay e_tiv n_wbv asf
#> <chr> <int> <dbl> <chr> <int> <fct> <int> <int> <int> <dbl> <dbl>
#> 1 OAS2_0014 1 2 M 16 3 76 0 1602 0.697 1.10
#> 2 OAS2_0014 2 2 M 16 3 77 504 1590 0.696 1.10
#> 3 OAS2_0124 1 NA M 16 3 70 0 1463 0.749 1.2
#> 4 OAS2_0124 2 NA M 16 3 71 472 1479 0.75 1.19
This quick vignette provides a primer to get you started with time-varying matching methods. Some necessary concepts not covered include:
time_lag
option in both functions)Li, Yunfei Paul, Kathleen J Propert, and Paul R Rosenbaum. 2001. “Balanced Risk Set Matching.” Journal of the American Statistical Association 96 (455): 870–82.
Lu, Bo. 2005. “Propensity Score Matching with Time-Dependent Covariates.” Biometrics 61 (3): 721–28.