Introduction: Program Evaluation in Education

In education, support programs are offered with a particular goal in mind, and in most cases, that goal is to increase the students’ likelihood of success. Whether that success is short-term (completion of a course) or long-term (achievement of a degree) depends on the intent of the program. A support program can be broad and open to all students, or targeted and open to a subset of students.

From an accountability perspective, program evaluation is critical for institutions to understand if their programs are effective, and if so, by how much. Quantifying participation, utilization, student perception, and outcomes of participants are important elements in understanding the value of the program. However, these metrics cannot answer an important question of interest: did the program truly increase student success? That is, in the absence of the program, would students fare in the same way (i.e., program had no effect) or would they have been worse off (i.e., program did have positive impact)? Along the same vein, one might also ask whether the measured impact would apply to all students if the program were expanded.

These causal effect questions could be answered by a randomized-controlled experiment where the program administrator or investigator could randomly assign students to treatment (program) and control (no program) groups, and from there, assess the program’s effectiveness by comparing the two groups. However, in most institutional settings, conducting an experiment is infeasible and impractical as institutions would prefer to implement their programs sooner to support students, collecting evidence along the way to demonstrate accountability to stakeholders. Moreover, withholding support programs from a control group may not be justifiable if students are eligible for the program, which makes it even more challenging to make the case for an experiment. In an observational setting (non-experimental) as described, estimating the causal impact is much more difficult due to confounding induced by student self-selection into the program. That is, students who choose to participate may be quite different than students that do not participate. When this happens, it is unclear whether a perceived positive impact from program participation could be attributed to the program or to other factors that differ between the two groups (confounding).

A popular and fine method for accounting for potential confounders is to adjust for them in a regression model. This allows one to interpret the contrast of interest (e.g., mean difference, odds ratio, relative rates, etc.) as conditional among populations that have equal measurements in the variables being adjusted for.

Another popular quasi-experimental approach relies on estimating the students’ likelihood of participation in the program, the so-called propensity score. Assuming one can estimate this well using the available data, the estimated propensity scores aid in an apples-to-apples comparison by one of two subsequent steps:

Match participants and non-participants based on the estimated propensity scores (similar values are matched), and from this matched sample where some observations are excluded, draw inference between the two groups. If there are no unmeasured confounders (an untestable assumption), then conditioning on the propensity scores in this way allows one to estimate the causal effect of the program.
Use the estimated propensity scores to weight each observation in the participant group and non-participant group. The weights could be derived to allow one to estimate the average treatment effect (ATE) on the entire population (i.e., the effect if the program were to generalize to the entire student population) or the average treatment effect on the treated (ATT; the effect if the program was specific to those that self-selected into the program). The idea is that some demographic attributes are under/over-represented in each of the two groups, and re-weighting each student allows one to construct a sample that is representative of the target population.

This vignette illustrates the latter approach using the twang package in R. The twang package allows one to estimate the propensity scores using gradient boosted trees, a data-driven, nonparametric tree-based method that estimates the propensity score by combining many decision trees built sequentially, each modeling the residuals from the previous tree. Estimating the propensity scores this way allows one to rely on a robust algorithm for arriving at the estimated propensity scores rather than investing additional effort to manually model the propensity scores using a parametric model like logistic regression. Once the propensity score weights are derived, one could leverage the survey package in R to incorporate the weights in drawing inference via functions like svyglm (Generalized Linear Models for complex survey samples).

Although this vignette leverages R, the folks at RAND have also released code and tutorials for STATA, SAS, and other platforms. The interested reader should go there for additional tutorials.

Toy Data Set: Writing Center at a Community College

To assist with demonstrating how one might leverage propensity score weights in program evaluation, we employ the writing_center data set from the IRexamples package. First, let’s load the data.

library(IRexamples) ## load package with data
data(writing_center) ## load data
## For a description of the data, type ?writing_center in the console
head(writing_center) ## print top few observations

##   Student_ID        Term Section_ID Instructor_ID Main_CourseID Gender Age
## 1          1 Year 1 Fall        112          9569         ENG 1   Male  21
## 2          2 Year 1 Fall         89          1515         ENG 1 Female  24
## 3          3 Year 1 Fall         99          2357         ENG 1 Female  24
## 4          9 Year 1 Fall        102          1672         ENG 1   Male  24
## 5         11 Year 1 Fall        102          1672         ENG 1   Male  26
## 6         14 Year 1 Fall        102          1672         ENG 1 Female  21
##             Ethnicity FirstGen Military FosterYouth DSPS FinAid
## 1     Mixed Ethnicity        0        0           0    0      1
## 2     Mixed Ethnicity        0        0           0    0      0
## 3   Hispanic / Latino        0        0           0    0      1
## 4 White, Non-Hispanic        1        0           0    0      0
## 5               Asian        1        0           0    0      0
## 6 White, Non-Hispanic        0        0           0    0      0
##   Units_Attempted_Beg_Of_Term GPA_Beg_Of_Term TermUnitsAttempted K12_Student
## 1                        42.0        2.761905                 10           0
## 2                        93.5        2.374332                 10           0
## 3                        42.0        3.309524                 10           0
## 4                        49.0        2.020408                 10           0
## 5                        25.0        1.780000                  7           0
## 6                        29.0        3.137931                 13           0
##   First_Time_College_Student Nonresident_Tuition_Exempt International
## 1                          0                          0             0
## 2                          0                          0             0
## 3                          0                          1             0
## 4                          0                          0             0
## 5                          0                          0             0
## 6                          0                          0             0
##   Nonresident WR_Center Main_Course_SuccessFlag Main_Course_GradePoints HS_GPA
## 1           0         0                       1                       2     NA
## 2           0         1                       0                       1     NA
## 3           0         0                       1                       4  2.113
## 4           0         0                       1                       3  2.249
## 5           0         0                       1                       2  3.100
## 6           0         0                       1                       4     NA
##   Online N_Center_Visits Center_Attendance_Hours N_Conf
## 1      1              NA                      NA     NA
## 2      0              17                18.35194      1
## 3      0              NA                      NA     NA
## 4      0              NA                      NA     NA
## 5      0              NA                      NA     NA
## 6      0              NA                      NA     NA

For a description of each variable in the data set, bring up the help file for the data set by typing ?writing_center in the R console.

In a nutshell, the data set contains students’ first enrollment/attempt in ENG 1, a first year college level writing course, over several years. Students may choose to receive additional support outside of the classroom at the college’s writing center by enrolling in a 0.5-unit writing center/conference support course. Students may use the writing center as a study center where they have access to computers, dictionaries, thesauri, etc. At the center, students may also schedule 1:1 conferences with the instructor on duty, to get assistance and/or feedback on their writing by bringing a writing prompt or draft of their paper. Students registered for this support course are required to spend at least 12 hours in the center, and to have at least 2 formal conferences.

Before evaluating the impact of the writing center, let’s first explore the writing_center data set.

library(dplyr) ## Assist with data manipulations
nrow(writing_center) ## Number of rows
## [1] 4727
length(unique(writing_center$Student_ID)) ## Distinct students
## [1] 4727
unique(writing_center$Main_CourseID) ## primary courses
## [1] "ENG 1"
table(writing_center$Term) ## terms/semesters
## 
##   Year 1 Fall   Year 2 Fall Year 2 Spring Year 2 Summer Year 3 Spring 
##          1121          1266           878           218          1074 
## Year 3 Summer 
##           170

length(unique(writing_center$Section_ID)) ## Num sections
## [1] 224
length(unique(writing_center$Instructor_ID)) ## Num instructors
## [1] 52

## Overall success rates and grade points in the ENG 1 course.
writing_center %>%
  group_by(WR_Center) %>%
  summarize_at(.vars=c('Main_Course_SuccessFlag', 'Main_Course_GradePoints')
               , .funs=mean
               )
## # A tibble: 2 x 3
##   WR_Center Main_Course_SuccessFlag Main_Course_GradePoints
##       <dbl>                   <dbl>                   <dbl>
## 1         0                   0.767                    2.53
## 2         1                   0.782                    2.58

## Online modality and Age by writing center utilization
writing_center %>%
  group_by(WR_Center) %>%
  summarize_at(.vars=c('Online', 'Age')
               , .funs=mean
  )
## # A tibble: 2 x 3
##   WR_Center Online   Age
##       <dbl>  <dbl> <dbl>
## 1         0 0.132   19.8
## 2         1 0.0171  21.9

Based on this exploration, students that enrolled in the writing center course have an overall ENG 1 course success rate of 78.2%, whereas those that did not enroll in the writing center have an overall ENG 1 course success rate of 76.7%. On the surface, it appears the writing center is marginally effective at increasing course success.

Estimating Propensity Score Weights Using the `twang` Package

In evaluating the writing center, let’s assume that course success in ENG 1 is the outcome of interest (Main_Course_SuccessFlag). This is a binary outcome.

Before evaluating the writing center, we need to identify variables that we would like to attempt to control for, i.e., variables for which we would like to achieve balance when comparing the two groups, writing center users and non-writing center users. By balance, we mean that both groups have the same distribution of the covariate, approximating what we would expect under a randomized-controlled experiment. Identifying adjustment variables is typically the most difficult step in any analysis of observational data as it requires careful consideration. One strategy is to list out all factors that one believes causally impacts the outcome of interest. From this list, identify the variables that may be correlated with program participation. This latter list constitutes a list of potential confounders that could be included in the estimation of the propensity scores. Anything on the list for which we do not have data or measurements on should be acknowledged and listed as a limitation of the analysis.

Suppose the adjustment variables have been identified. Before proceeding with the analysis, let’s convert any variables with a character data type to a factor data type. That way, any function that uses these variables in modeling will know to treat it as a group variable (e.g., creating dummy or indicator variables on the fly). From there, we could leverage the ps function from the twang package to return a list of gradient boosted tree models, modeling the likelihood that a student participates in the writing center.

library(twang) ## package has ps function for modeling propensity scores
adj_vars <- c('Age', 'Ethnicity', 'FirstGen', 'Military', 'FosterYouth', 'DSPS', 'FinAid', 'Units_Attempted_Beg_Of_Term', 'GPA_Beg_Of_Term', 'TermUnitsAttempted', 'K12_Student', 'First_Time_College_Student', 'International', 'Nonresident_Tuition_Exempt', 'Nonresident', 'Online', 'HS_GPA') ## list of pre-specified adjustment variables

## Which variables are character?
names(writing_center)[sapply(writing_center, is.character)]
## [1] "Term"          "Main_CourseID" "Gender"        "Ethnicity"

## Convert data type: character to factor
for (i in seq_along(adj_vars)) {
  if (is.character(writing_center[[adj_vars[i]]])) {
    writing_center[[adj_vars[i]]] <- factor(writing_center[[adj_vars[i]]])
  }
}

## Set random seed
set.seed(123)

## Fit propensity score model
ps_fit_att <- ps(WR_Center ~ . # Use all variables
                 , data=writing_center %>% select(WR_Center, one_of(adj_vars)) ## include only "outcome" (writing center utilization) and explanatory variables
                 , estimand='ATT' # on the treated, can be ATE
                 , verbose=FALSE # don't print intermediate logs
                 )
## above step takes some time because it is fitting many trees iteratively to fit the unexplained residuals

length(ps_fit_att) ## number of objects returned
## [1] 18
names(ps_fit_att) ## names of objects
##  [1] "gbm.obj"    "treat"      "treat.var"  "desc"       "ps"        
##  [6] "w"          "sampw"      "estimand"   "version"    "datestamp" 
## [11] "parameters" "alerts"     "iters"      "balance.ks" "balance.es"
## [16] "balance"    "n.trees"    "data"

In the previous code, we returned the results of ps to an object named ps_fit_att. When we called ps, we specified the estimand to be estimand='ATT', which states that our goal is to draw inference on the average treatment effect among the treated. Alternatively, one could have specified estimand='ATE' which would state that our goal is to draw inference on the average treatment effect for the population. The choice of estimand should depend on the research question. The following are a few guiding questions that might help the analyst decide which to use:

Can the program be implemented for all students? If this is not feasible, then the average treatment effect (ATE) for all students is a hypothethical estimand and may not make sense. ATT should probably be used.
Is the goal to gather evidence to expand the program to everyone? If so, then the average treatment effect (ATE) for all students should probably be used.
Is the goal to assess the impact of the program among those that participated? If so, then ATT should be used.

From the previous code, the returned object ps_fit_att actually contains a lot of data objects and information for the user to select which gradient boosted tree model (which tree to stop at) the user wants to use to estimate the propensity scores. A user would typically choose from one of two stopping points or trees:

The tree that minimizes the average difference in the standardized means of covariates between treatment vs. control (es.mean.ATT).
The tree that minimizes the average difference in the maximum difference in the distribution of each covariate between treatment vs. control, as measured by the Komolgorov-Smirnoff statistic (ks.mean.ATT).

For each covariate, balance could be measured by calculating the standardized mean for each group, then taking the difference in standardized means between treatment and control. A covariate is balanced for the two groups, treatment and control, if this mean difference is small. If one were to leverage a set of propensity score weights to weight each obversvation and calculate the mean differences, then the hope is that the difference would be small for each covariate, which would suggest that the two groups are similar on each covariate. This could be visualized as below:

plot(ps_fit_att, plots='es') ## plots=3
#dev.copy(png, filename='ps_plot_es.png', width=9, height=5, units='in', res=150)
#dev.off()

First, let’s restrict our attention to the left pane, es.mean.ATT. This pane shows us results of a tree that minimizes the average standardized difference for the ATT estimand. Each point represents the mean difference between treatment and control for one covariate on the vertical axis. On the horizontal axis, the mean difference is shown for the “Unweighted” (no adjustment) scenario and the “Weighted” (adjusted) scenario. The visualization attempts to show the lack of balance across covariates before incorporating propensity score weights, and the (hopefully) balance in covariates after incorporating propensity score weights. A standardized difference below 0.1 would indicate reasonable balance.

The second pane on the right shows the same results but for propensity scores estimated using a tree that minimizes the average maximum difference in the covariate distributions. Typically, results for both trees are very similar, so the recommendation is to pick one tree, es.mean.ATT or ks.mean.ATT, and stick with that tree for the remainder of the analysis. For this vignette, we pick es.mean.ATT to continue our illustration.

One could also review the distribution of the propensity score weights, as seen below.

plot(ps_fit_att, plots='boxplot') ## plots=2
#dev.copy(png, filename='ps_plot_boxplot.png', width=9, height=5, units='in', res=150)
#dev.off()
## plot(ps_fit_att, plots=4) ## p-values of mean differences
## plot(ps_fit_att, plots=5) ## p-values of KS statistics

Focusing on the left pane of the previous visualization, we see that those in group 2 (treatment) generally have higher propensity to be in the treatment group compared to those in group 1 (control). In general, although the distributions differ between the two groups, there is still overlap across the propensity score space (0-1). If there is very little overlap (e.g., those in treatment are concentrated towards 1 and those in control are concentrated towards 0), then the use of propensity scores may not be as useful in drawing causal inference.

The following shows summaries for each variable using the bal.table function. First, we look at the results for the unweighted case. Then, we look at the results for the weighted case. The columns tx.mean, ct.mean, and std.eff.sz show the mean of the covariate in the treatment group, mean of the covariate in the control group, and the standardized difference between the two groups. The std.eff.sz is what is being visualized when plot(ps_fit_att, plots='es') was called previously.

## Unweighted
bal.table(ps_fit_att)$unw
##                                tx.mn  tx.sd  ct.mn  ct.sd std.eff.sz    stat
## Age                           21.872  7.548 19.846  4.651      0.268   8.651
## Ethnicity:African American     0.027  0.163  0.017  0.130      0.063   3.305
## Ethnicity:Asian                0.291  0.454  0.310  0.462     -0.041      NA
## Ethnicity:Decline to State     0.022  0.148  0.021  0.145      0.006      NA
## Ethnicity:Hispanic / Latino    0.158  0.364  0.118  0.323      0.108      NA
## NA                                NA     NA     NA     NA         NA      NA
## Ethnicity:Native American      0.001  0.029  0.001  0.024      0.010      NA
## Ethnicity:Pacific Islander     0.001  0.029  0.002  0.047     -0.048      NA
## Ethnicity:White, Non-Hispanic  0.322  0.467  0.316  0.465      0.012      NA
## FirstGen                       0.381  0.486  0.341  0.474      0.082   2.432
## Military                       0.021  0.142  0.021  0.143     -0.002  -0.051
## FosterYouth                    0.007  0.083  0.006  0.075      0.015   0.452
## DSPS                           0.127  0.333  0.065  0.246      0.186   5.845
## FinAid                         0.591  0.492  0.481  0.500      0.223   6.590
## Units_Attempted_Beg_Of_Term   16.939 19.032 14.490 16.989      0.129   3.918
## GPA_Beg_Of_Term                2.155  1.536  1.923  1.560      0.151   4.466
## TermUnitsAttempted            10.827  4.384 10.997  4.495     -0.039  -1.139
## K12_Student                    0.009  0.097  0.021  0.142     -0.115  -3.004
## First_Time_College_Student     0.294  0.456  0.370  0.483     -0.167  -4.881
## International                  0.086  0.280  0.083  0.275      0.011   0.320
## Nonresident_Tuition_Exempt     0.029  0.168  0.036  0.187     -0.042  -1.224
## Nonresident                    0.035  0.184  0.023  0.151      0.064   1.980
## Online                         0.017  0.130  0.132  0.338     -0.883 -16.802
## HS_GPA                         2.977  0.588  3.113  0.492     -0.232  -6.276
## HS_GPA:<NA>                    0.251  0.434  0.139  0.346      0.258   8.013
##                                pval    ks ks.pval
## Age                           0.000 0.114   0.000
## Ethnicity:African American    0.002 0.010   0.002
## Ethnicity:Asian                  NA 0.019      NA
## Ethnicity:Decline to State       NA 0.001      NA
## Ethnicity:Hispanic / Latino      NA 0.039      NA
## NA                               NA    NA      NA
## Ethnicity:Native American        NA 0.000      NA
## Ethnicity:Pacific Islander       NA 0.001      NA
## Ethnicity:White, Non-Hispanic    NA 0.006      NA
## FirstGen                      0.015 0.040   0.127
## Military                      0.959 0.000   1.000
## FosterYouth                   0.651 0.001   1.000
## DSPS                          0.000 0.062   0.002
## FinAid                        0.000 0.110   0.000
## Units_Attempted_Beg_Of_Term   0.000 0.101   0.000
## GPA_Beg_Of_Term               0.000 0.072   0.000
## TermUnitsAttempted            0.255 0.036   0.206
## K12_Student                   0.003 0.011   1.000
## First_Time_College_Student    0.000 0.076   0.000
## International                 0.749 0.003   1.000
## Nonresident_Tuition_Exempt    0.221 0.007   1.000
## Nonresident                   0.048 0.012   1.000
## Online                        0.000 0.115   0.000
## HS_GPA                        0.000 0.128   0.000
## HS_GPA:<NA>                   0.000 0.112   0.000

## Weighted
bal.table(ps_fit_att)$es.mean.ATT
##                                tx.mn  tx.sd  ct.mn  ct.sd std.eff.sz   stat
## Age                           21.872  7.548 21.229  6.730      0.085  1.845
## Ethnicity:African American     0.027  0.163  0.026  0.159      0.009  0.522
## Ethnicity:Asian                0.291  0.454  0.292  0.454     -0.001     NA
## Ethnicity:Decline to State     0.022  0.148  0.018  0.134      0.027     NA
## Ethnicity:Hispanic / Latino    0.158  0.364  0.145  0.352      0.035     NA
## NA                                NA     NA     NA     NA         NA     NA
## Ethnicity:Native American      0.001  0.029  0.000  0.013      0.023     NA
## Ethnicity:Pacific Islander     0.001  0.029  0.002  0.040     -0.025     NA
## Ethnicity:White, Non-Hispanic  0.322  0.467  0.325  0.468     -0.007     NA
## FirstGen                       0.381  0.486  0.371  0.483      0.021  0.566
## Military                       0.021  0.142  0.021  0.142      0.000  0.001
## FosterYouth                    0.007  0.083  0.007  0.084     -0.004 -0.100
## DSPS                           0.127  0.333  0.106  0.308      0.062  1.653
## FinAid                         0.591  0.492  0.566  0.496      0.051  1.388
## Units_Attempted_Beg_Of_Term   16.939 19.032 16.332 18.657      0.032  0.670
## GPA_Beg_Of_Term                2.155  1.536  2.107  1.539      0.031  0.839
## TermUnitsAttempted            10.827  4.384 10.834  4.407     -0.002 -0.043
## K12_Student                    0.009  0.097  0.012  0.107     -0.023 -0.699
## First_Time_College_Student     0.294  0.456  0.301  0.459     -0.016 -0.452
## International                  0.086  0.280  0.088  0.283     -0.007 -0.189
## Nonresident_Tuition_Exempt     0.029  0.168  0.033  0.180     -0.026 -0.739
## Nonresident                    0.035  0.184  0.031  0.173      0.023  0.619
## Online                         0.017  0.130  0.024  0.153     -0.053 -1.692
## HS_GPA                         2.977  0.588  3.002  0.544     -0.044 -1.066
## HS_GPA:<NA>                    0.251  0.434  0.240  0.427      0.024  0.599
##                                pval    ks ks.pval
## Age                           0.065 0.039   0.203
## Ethnicity:African American    0.806 0.001   0.806
## Ethnicity:Asian                  NA 0.000      NA
## Ethnicity:Decline to State       NA 0.004      NA
## Ethnicity:Hispanic / Latino      NA 0.013      NA
## NA                               NA    NA      NA
## Ethnicity:Native American        NA 0.001      NA
## Ethnicity:Pacific Islander       NA 0.001      NA
## Ethnicity:White, Non-Hispanic    NA 0.003      NA
## FirstGen                      0.572 0.010   1.000
## Military                      0.999 0.000   1.000
## FosterYouth                   0.920 0.000   1.000
## DSPS                          0.098 0.021   0.916
## FinAid                        0.165 0.025   0.747
## Units_Attempted_Beg_Of_Term   0.503 0.028   0.630
## GPA_Beg_Of_Term               0.402 0.022   0.869
## TermUnitsAttempted            0.966 0.015   0.996
## K12_Student                   0.485 0.002   1.000
## First_Time_College_Student    0.651 0.007   1.000
## International                 0.850 0.002   1.000
## Nonresident_Tuition_Exempt    0.460 0.004   1.000
## Nonresident                   0.536 0.004   1.000
## Online                        0.091 0.007   1.000
## HS_GPA                        0.286 0.024   0.862
## HS_GPA:<NA>                   0.549 0.011   1.000

For the interested user, one could also view the important variables in the tree model.

summary(ps_fit_att$gbm.obj, n.trees=ps_fit_att$desc$es.mean.ATT$n.trees)
#dev.copy(png, filename='ps_plot_influence.png', width=9, height=5, units='in', res=150)
#dev.off()

##                                                     var    rel.inf
## HS_GPA                                           HS_GPA 22.4510331
## Age                                                 Age 18.5602099
## Online                                           Online 15.5637054
## TermUnitsAttempted                   TermUnitsAttempted 12.6056086
## Units_Attempted_Beg_Of_Term Units_Attempted_Beg_Of_Term  8.4724233
## GPA_Beg_Of_Term                         GPA_Beg_Of_Term  6.8915135
## Ethnicity                                     Ethnicity  5.0042068
## FinAid                                           FinAid  4.7646862
## DSPS                                               DSPS  4.5630620
## Military                                       Military  0.3394101
## First_Time_College_Student   First_Time_College_Student  0.2447535
## Nonresident                                 Nonresident  0.1859316
## Nonresident_Tuition_Exempt   Nonresident_Tuition_Exempt  0.1816653
## FirstGen                                       FirstGen  0.1717908
## FosterYouth                                 FosterYouth  0.0000000
## K12_Student                                 K12_Student  0.0000000
## International                             International  0.0000000

The default visualizations provided in the twang package are very useful from an analyst’s perspective. However, one might want to do some tweaking to better present the impact of the propensity score weights on balance in the covariates to a wider audience.

library(ggplot2) # Visualization package
library(ggthemes) # themes
library(forcats) # reorder categories

# Store balance table results for both unweight and weighted scenarios in a single data frame
d_demo_summ <- bind_rows(
  bal.table(ps_fit_att)$unw %>% 
    mutate(Comparison_Type='Unadj.', Variable=rownames(bal.table(ps_fit_att)$unw))
  , bal.table(ps_fit_att)$es.mean.ATT %>% 
    mutate(Comparison_Type='Adj.', Variable=rownames(bal.table(ps_fit_att)$es.mean.ATT))
  ) %>%
  # filter(!(Variable %in% c('NA'))) %>%
  filter(!is.na(std.eff.sz)) %>%
  select(Comparison_Type, Variable, std.eff.sz)

# Visualize
d_demo_summ %>% 
  arrange(desc(Comparison_Type), desc(std.eff.sz)) %>%
  group_by(Variable) %>%
  mutate(std.eff.sz0=std.eff.sz[Comparison_Type=='Unadj.']) %>%
  ungroup %>% 
  ggplot(data=., mapping=aes(x=fct_reorder(factor(Variable), std.eff.sz0), y=std.eff.sz, group=Variable)) +
  geom_point(mapping=aes(size=ifelse(Comparison_Type=='Unadj.', 1, 2))) +
  scale_size_continuous(range = c(1,4)) +
  geom_line() +
  geom_hline(yintercept=0, lty=2) +
  coord_flip() +
  ylab('Standardized Difference') +
  xlab('') +
  ggtitle('Writing Center vs. Not') +
  theme_tufte(base_family='Arial Narrow') +
  theme(text=element_text(size=14)) +
  # theme_ipsum(grid=FALSE) +
  # scale_color_manual(values=c('slategrey', 'darkblue'), guide=FALSE) +
  theme(legend.position='none') +
  labs(caption='Small dots: unweighted difference\nLarge dots: weighted difference')
#dev.copy(png, filename='ps_plot_balance.png', width=9, height=5, units='in', res=150)
#dev.off()

From this visualization, we see that those that enrolled in the writing center tend to be older than those that did not enroll in the writing center (top row, small dot). However, after incorporating the propensity score weights, the difference in age is much closer to zero (top row, large dot). Similarly, for the last row, those not enrolled in the writing center have higher rates of online modality in the primary ENG 1 course compared to those enrolled in the writing center (bottom row, small dot). However, after incorporating the propensity score weights, the difference in online modality is much closer to zero (bottow row, large dot). This visualization shows that the propensity score weights are doing a reasonable job at attaining balance in the covariates (large dots close to 0).

Statistical Inference Using Propensity Score Weights Using the `survey` Package

Before incorporating the propensity score weights, let’s first illustrate how one might estimate the impact of the writing center on course success using logistic regression.

fit_unw <- glm(Main_Course_SuccessFlag ~ WR_Center
             , data=writing_center
             , family=binomial(link='logit')
               )
summary(fit_unw)
## 
## Call:
## glm(formula = Main_Course_SuccessFlag ~ WR_Center, family = binomial(link = "logit"), 
##     data = writing_center)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7446   0.7019   0.7288   0.7288   0.7288  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.19026    0.03964   30.03   <2e-16 ***
## WR_Center    0.08521    0.08117    1.05    0.294    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5093.0  on 4726  degrees of freedom
## Residual deviance: 5091.9  on 4725  degrees of freedom
## AIC: 5095.9
## 
## Number of Fisher Scoring iterations: 4

# Odds ratio
exp(fit_unw$coef)
## (Intercept)   WR_Center 
##    3.287952    1.088943

Those that enrolled in the writing center are associated with a 1.09-fold increase in the odds of course success.

Now, to incorporate weights, we need to rely on the survey package in order to properly calculate standard errors in our inference. Note that by default, the survey package uses robust standard errors, which is a very good default to guard against model mis-specification.

library(survey) # Load package
# Extract weights
writing_center$w_ATT <- get.weights(ps_fit_att, stop.method='es.mean')
# Create a survey design package to be used as a data object subsequently
design_ps_att <- svydesign(ids= ~ 1, weights= ~ w_ATT, data=writing_center)
# Fit model
fit_att <- svyglm(Main_Course_SuccessFlag ~ WR_Center, design=design_ps_att, family=binomial(link='logit'))
summary(fit_att)
## 
## Call:
## svyglm(formula = Main_Course_SuccessFlag ~ WR_Center, design = design_ps_att, 
##     family = binomial(link = "logit"))
## 
## Survey design:
## svydesign(ids = ~1, weights = ~w_ATT, data = writing_center)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.09391    0.05080  21.534   <2e-16 ***
## WR_Center    0.18156    0.08717   2.083   0.0373 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1.000212)
## 
## Number of Fisher Scoring iterations: 5

# Odds ratio
exp(fit_att$coef)
## (Intercept)   WR_Center 
##    2.985924    1.199090

# 95% confidence interval
exp(confint(fit_att))
##                2.5 %   97.5 %
## (Intercept) 2.702948 3.298526
## WR_Center   1.010771 1.422495

After incorporating the propensity score weights, the writing center is associated with a 1.20-fold increase in the odds for course success (95% CI: 1.01 – 1.42).

Exercise: Doubly Robust Estimation

The so-called doubly-robust estimator can be estimated by incorporating adjustment variables into the weighted regression model in the previous section (adding adjustment variables to the right of ~ in the call to svyglm). This estimator incorporates both propensity score weights and adjustment variables in the regression (doubly), and is robust in the sense that only one of the following is required to be true to consistently estimate that causal effect of the writing center:

the regression model is correctly-specified and represents the causal mechanism of course success, or
the propensity score model is correctly-specified and represents the treatment-selection mechanism.

Both are unknown to the user.

As an exercise, the reader is encouraged to try this on their own by adapting the code from the previous section.

Other Considerations In Evaluation

The previous analysis shows a small effect on course success from the writing center. As described in the data description, students enrolled in the writing center section (what is represented by the WR_Center flag) are required to visit the writing center and have conferences with the instructor on duty. There are other things one might consider for comparison in evaluating the writing center that may be insightful for the program administrator:

those that actually visited the center vs. those that did not visit and those not enrolled in the writing center (N_Center_Visits);
those that actually had writing conferences vs. not (N_Conf);
those that had at least X number of writing conferences vs. not (N_Conf); and
those that spend at least X amount of time in the center vs. not (Center_Attendance_Hours).

Finally, one should also assess the assumptions that underly their analysis. In this vignette, we leveraged propensity scores and weighted logistic regression. Are there other potential confounders that we left out in the analysis? What else could explain the effect that we are observing? Regression assumes the observations are independent. Is this truly the case? Groups of students enroll in a section. Each section is taught by an instructor. Given how students are naturally clustered in a section, they do not seem independent of one another (think of instructors that have high course success rates or low success rates: knowing whether or not a student was successful in the course reveals information about whether or classmate was successful in the course). One might want to account for this clustering or instructor-level effect. Not addressing for the non-independent observations yields invalid inference from both an estimation and a quantifaction of uncertainty (standard error) perspectives. Accounting for instructor-level effects is illustrated in another vignette.

Appendix: R and R Package Versions

This vignette was generated using an R session with the following packages. There may be some discrepancies when the reader replicates the code caused by version mismatch.

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] survey_4.0           survival_3.2-11      Matrix_1.3-4        
##  [4] forcats_0.5.1        ggthemes_4.2.4       ggplot2_3.3.3       
##  [7] twang_2.3            dplyr_1.0.6          IRexamples_0.0.1    
## [10] RevoUtils_11.0.2     RevoUtilsMath_11.0.0
## 
## loaded via a namespace (and not attached):
##  [1] prettydoc_0.4.1     gbm_2.1.8           tidyselect_1.1.1   
##  [4] xfun_0.23           bslib_0.2.5.1       purrr_0.3.4        
##  [7] mitools_2.4         splines_4.0.2       lattice_0.20-44    
## [10] colorspace_2.0-1    vctrs_0.3.8         generics_0.1.0     
## [13] htmltools_0.5.1.1   yaml_2.2.1          utf8_1.2.1         
## [16] rlang_0.4.11        jquerylib_0.1.4     pillar_1.6.1       
## [19] glue_1.4.2          withr_2.4.2         DBI_1.1.1          
## [22] xgboost_1.4.1.1     RColorBrewer_1.1-2  jpeg_0.1-8.1       
## [25] lifecycle_1.0.0     stringr_1.4.0       MatrixModels_0.5-0 
## [28] munsell_0.5.0       gtable_0.3.0        evaluate_0.14      
## [31] labeling_0.4.2      latticeExtra_0.6-29 knitr_1.33         
## [34] fansi_0.5.0         highr_0.9           xtable_1.8-4       
## [37] scales_1.1.1        jsonlite_1.7.2      farver_2.1.0       
## [40] png_0.1-7           digest_0.6.27       stringi_1.4.6      
## [43] cli_2.5.0           tools_4.0.2         magrittr_2.0.1     
## [46] sass_0.4.0          tibble_3.1.2        crayon_1.4.1       
## [49] pkgconfig_2.0.3     ellipsis_0.3.2      data.table_1.14.0  
## [52] rmarkdown_2.9       rstudioapi_0.13     R6_2.5.0           
## [55] compiler_4.0.2

Example 1: Propensity Score Weights Using Gradient Boosted Trees

Vinh Nguyen

2023-10-05

Introduction: Program Evaluation in Education

Toy Data Set: Writing Center at a Community College

Estimating Propensity Score Weights Using the `twang` Package

Statistical Inference Using Propensity Score Weights Using the `survey` Package

Exercise: Doubly Robust Estimation

Other Considerations In Evaluation

Appendix: R and R Package Versions

Introduction: Program Evaluation in Education

Toy Data Set: Writing Center at a Community College

Estimating Propensity Score Weights Using the twang Package

Statistical Inference Using Propensity Score Weights Using the survey Package

Exercise: Doubly Robust Estimation

Other Considerations In Evaluation

Appendix: R and R Package Versions

Estimating Propensity Score Weights Using the `twang` Package

Statistical Inference Using Propensity Score Weights Using the `survey` Package