Adapting evaluation design to challenging circumstances

In 2018, we designed a quasi-experimental evaluation to measure the impact of Southern New Hampshire University’s (SNHU) tertiary education program in Cape Town, South Africa. Our recruitment efforts revealed differences between the recruited comparison group and the treatment group that made our original evaluation design unviable. Below we describe how we amended our design, what we were still able to learn and what the results could inform, and how we were transparent about what we could and could not measure.

Southern New Hampshire University (SNHU) designed a tertiary education program, which employs a blended learning, competencies-based approach in low- and middle-income countries. In Rwanda, the program is implemented with Kepler, a local education organization, and targets low-income students who otherwise might have few opportunities to enroll in a university program. Our six-year evaluation of that program found that SNHU-Kepler produced large and statistically significant improvements in student learning and post-graduation employment outcomes, compared to a matched comparison group. When SNHU introduced the program in South Africa with its implementing partner, the Scalabrini Centre, we wanted to understand the program’s impact in a new country, with a new implementing partner. Unlike the program in Kigali, this program specifically targeted students from refugee and asylum-seeker communities.

As in the original study in Rwanda, SNHU and the Scalabrini Centre did not want to randomly select students to receive access to the program, since this would mean withholding the program from the most deserving applicants. We, therefore, decided to mimic the evaluation design used in Rwanda, which had worked well: recruit a group of students from local universities and then construct a comparison group by matching university students to SNHU-Scalabrini students on observable characteristics. In Rwanda, a similar recruitment and screening process had identified a large number of university students whose demographic backgrounds and baseline learning levels were similar to those of SNHU-Kepler students, and who had not applied to the program for likely arbitrary reasons (e.g., they were unaware of the program).

We ran into two primary challenges with this process in Cape Town. First, as described in a previous blog post, we faced difficulties finding and recruiting enough students with refugee or asylum seeker status enrolled in universities. Second, on average, the comparison group differed from the treatment group on key socioeconomic demographics and baseline learning levels. Moreover, the relatively small comparison group did not allow enough flexibility to drop students who did not match with the treatment students. We were ultimately left with only a fraction of university students who were appropriate matches for SNHU-Scalabrini students, leaving us with an insufficient sample size to rigorously estimate program effects.

As a result, we had to pause and reconsider our design. We needed to either abandon the evaluation or figure out how to amend our approach midway through the study. We were committed to understanding how the program impacts students from refugee and asylum-seeker communities since SNHU was focusing on these students as it scaled the original program beyond Kigali. We, therefore, decided to amend the evaluation design to include three methods (subgroup matching, difference-in-differences, and weighted difference-in-differences), even if this meant a more complex interpretation of results or limitations around impact claims. Below we reflect on lessons learned during this process.

Lesson 1: Triangulating estimates using several methods can provide indicative evidence

The differences between treatment and comparison groups meant the original evaluation design was no longer viable. We, therefore, explored alternatives and defined what these designs would and would not inform about the program’s impact:

1. Matched design using a subgroup of comparable SNHU-Scalabrini and university students [1]. While there were some SNHU-Scalabrini students whom we were still able to match with university students, these students had higher baseline learning levels on average. We therefore would not be able to use the results from this group to generalize the program’s impact to all SNHU-Scalabrini students. Moreover, the relatively small sample of matched pairs meant a less precise impact estimate.

2. Difference-in-differences. We also considered a difference-in-differences design [2], which compares the differences in the learning gains of both groups across time. While this method would allow us to consider all SNHU-Scalabrini students in the analysis, an unbiased estimate of program impact, in this case, rests on the assumption that university students and SNHU-Scalabrini students would have similar learning trajectories in the absence of the program. Given the observed differences between the two groups of students, this assumption was not realistic.

3. Weighted difference-in-differences. We also considered a weighted difference-in-differences design [3], in which SNHU-Scalabrini and comparison students who are similar on observable characteristics are weighted more heavily in the analysis. While this increases the similarity across the two groups and reduces bias, it was less clear for which student population our estimates were relevant.
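
To make the second and third designs concrete, here is a minimal Python sketch of a two-period difference-in-differences estimate, with optional weights. It is an illustration under stated assumptions, not the analysis code used in the study. The field names (`treated`, `baseline`, `followup`, `propensity`) and the scores are made up, and the weight formula is the common textbook form of inverse probability of treatment weighting, which may differ from the exact weights the study used.

```python
def iptw_weight(treated, propensity):
    """Textbook inverse-probability-of-treatment weight (an assumption here):
    1/p for treated students, 1/(1-p) for comparison students."""
    return 1.0 / propensity if treated else 1.0 / (1.0 - propensity)


def weighted_mean(values, weights):
    """Weighted average; with equal weights this is the plain mean."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def did_estimate(students, weights=None):
    """Difference-in-differences on learning gains.

    Returns (mean gain of treated) - (mean gain of comparison),
    where gain = followup score - baseline score.
    """
    if weights is None:
        weights = [1.0] * len(students)
    gains = {True: [], False: []}
    group_weights = {True: [], False: []}
    for student, weight in zip(students, weights):
        group = bool(student["treated"])
        gains[group].append(student["followup"] - student["baseline"])
        group_weights[group].append(weight)
    return (weighted_mean(gains[True], group_weights[True])
            - weighted_mean(gains[False], group_weights[False]))


# Made-up scores purely for illustration; not data from the study.
students = [
    {"treated": True, "baseline": 40, "followup": 55, "propensity": 0.8},
    {"treated": True, "baseline": 60, "followup": 70, "propensity": 0.5},
    {"treated": False, "baseline": 50, "followup": 58, "propensity": 0.5},
    {"treated": False, "baseline": 70, "followup": 74, "propensity": 0.2},
]

plain = did_estimate(students)
weights = [iptw_weight(s["treated"], s["propensity"]) for s in students]
weighted = did_estimate(students, weights)
print(plain, weighted)
```

In this toy data the unweighted estimate is 6.5 points, and the weighted estimate differs because each student contributes unequally to the group averages. Neither number is an unbiased effect unless the two groups would have followed similar trajectories without the program, which is the assumption questioned above.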

As no single design would give us what we needed, we decided that we would gain a better understanding of the program’s impact by using all three designs. In addition to the concerns about potential bias, each approach also involved a different SNHU-Scalabrini student population. As such, we did not expect the treatment effects to be the same. We, therefore, decided to compare estimates across the three models to understand the direction of the program’s impact: positive, negative, or null. Before collecting endline data, we worked with SNHU-Scalabrini to pre-specify how we would interpret the results from these models: If the estimates all pointed in the same direction, we would infer that the program had an impact in that direction. If the estimates pointed in different directions, we would not be able to present an indication of impact. The triangulation across different approaches would still provide an opportunity to generate hypotheses, though, to explain the different results and suggest specific further research.
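
The pre-specified interpretation rule can be written down as a small Python sketch. How an estimate gets classified as "null" is not described in this post, so the `tolerance` parameter below is a hypothetical stand-in for whatever criterion (for example, statistical significance) an evaluation team would agree on in advance. The model names and numbers are also illustrative.

```python
def direction(estimate, tolerance=0.0):
    """Classify one estimate as positive, negative, or null.

    `tolerance` is a hypothetical band around zero treated as null.
    """
    if estimate > tolerance:
        return "positive"
    if estimate < -tolerance:
        return "negative"
    return "null"


def triangulate(estimates, tolerance=0.0):
    """Apply the rule: report a direction only if every model agrees."""
    directions = {name: direction(value, tolerance)
                  for name, value in estimates.items()}
    unique = set(directions.values())
    verdict = unique.pop() if len(unique) == 1 else "inconclusive"
    return verdict, directions


# Illustrative numbers only; not results from the study.
verdict, detail = triangulate(
    {"subgroup_matching": 0.30, "did": -0.10, "weighted_did": 0.20},
    tolerance=0.05,
)
print(verdict, detail)
```

Writing the rule down before seeing endline data, in whatever form, is what keeps the interpretation from being adjusted after the fact.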

Lesson 2: Use a detailed understanding of the program context to hypothesize differential impact estimates

At the midline evaluation, we found inconsistent directions across the three estimates [4]. We, therefore, relied heavily on our understanding of the program model to hypothesize the why and how behind the results.

First, when interpreting the estimate generated by subgroup matching, we sought to understand why similar students might learn various subjects differently in the SNHU-Scalabrini program compared to local universities. For instance, the SNHU-Scalabrini program involves intensive, in-person instruction in computer literacy and the English language before students transition to self-paced coursework. How did this differ from first-year coursework at local universities? Did local universities offer remedial courses to students who needed them? Or did access to and utilization of computers differ across the various programs? Our understanding of the SNHU-Scalabrini program’s coursework and curricula led us to several hypotheses about why the program would affect these students differently than local university programs.

We then sought to understand why we saw differences across the estimates. For example, were different types of students within the SNHU-Scalabrini program learning at different rates? If so, would the targeted instruction in computer literacy and English language at the start of the program help or hinder the progress of certain subgroups of students? And how might the transition to self-paced learning affect these subgroups? Again, we relied on program context to add nuance to our interpretations and hypothesize why interpretations across the models might be different.

Lesson 3: Conduct secondary analysis to probe the biases of individual models

To probe our hypotheses of how and why impact estimates differed across the three models, we supplemented our primary analysis with ample secondary analysis. We first assessed differences between our matched students and unmatched students to generate ideas that might explain differing results across the methods. For example, understanding which specific skills and sociodemographic features were most dissimilar suggested possible explanations for why the estimate for our matched sample was different from that for the full sample.

We then explored the learning trajectories of these different types of students. For example, we looked at the difference in learning gains for SNHU-Scalabrini students who had baseline scores below the mean versus those who had baseline scores above the mean. This provided indicative evidence of how different types of students were able to grow specific skills over the course of the program and highlighted possible program refinements for SNHU-Scalabrini to explore. For example, differential learning gains between student types raised ideas around whether access to remedial learning resources or supplemental material could be useful. We conducted a similar analysis for the comparison group, to understand trends in learning gains. Such analysis provided insight into differences in impact estimates generated by the three models. When we presented the results to SNHU and the Scalabrini Centre (see Lesson 4), these granular findings provided a starting point for discussing how and why the program may affect different students’ learning abilities.
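
The baseline split described above can be sketched in a few lines of Python. The field names and scores are hypothetical, and placing students who score exactly at the mean in the upper group is an assumption made here for simplicity.

```python
from statistics import fmean


def gains_by_baseline_group(students):
    """Average learning gain for students below vs. at-or-above the
    group's mean baseline score (gain = followup - baseline)."""
    cutoff = fmean(s["baseline"] for s in students)
    below = [s["followup"] - s["baseline"]
             for s in students if s["baseline"] < cutoff]
    above = [s["followup"] - s["baseline"]
             for s in students if s["baseline"] >= cutoff]
    return {
        "cutoff": cutoff,
        "below_mean_gain": fmean(below) if below else None,
        "above_mean_gain": fmean(above) if above else None,
    }


# Made-up scores for one group of students; not data from the study.
program_students = [
    {"baseline": 40, "followup": 55},
    {"baseline": 50, "followup": 60},
    {"baseline": 60, "followup": 65},
    {"baseline": 70, "followup": 74},
]
summary = gains_by_baseline_group(program_students)
print(summary)
```

Running the same function separately on the comparison group gives the parallel view of learning trends mentioned above. Comparisons like this are descriptive only: lower scorers often gain more partly because of regression to the mean.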

Figure 1: Exploring differences in learning gains between high and low performers

Figure 2: Understanding the learning trajectory of the matched subgroup and overall sample

Lesson 4: Be transparent about what you can and cannot claim

As we outline above, there were key limitations to our evaluation design. First, we were not able to generate a precise impact estimate on learning outcomes, only an indication of whether the impact was positive, negative, or null. It was important to be transparent with SNHU and Scalabrini that we would not be able to deliver a result that should drive scale-up decisions. Our results would rather highlight possible program refinements for them to explore. Second, given the mixed midline results and potential for similarly mixed endline results, we set clear parameters of what was needed to generate an indication of the program’s impact before conducting the analysis so that SNHU and Scalabrini had a clear understanding of how to use the results.

To ensure SNHU and Scalabrini understood how and when to use the results, we presented both the reasons for the change and the new evaluation design in a non-technical way. This exercise included intuitive explanations of technical concepts, accompanied by visualizations of the different evaluation design options.

Figure 3: Graphical representation of the parallel trends assumption behind a difference-in-differences analysis

Figure 4: Visualization used to explain which students would be included in subgroup matching analysis

Then, when presenting the results, we first presented the overall findings: whether or not we found a consistent direction of impact. We discussed the three different results, why they might be different, and what insights the results could provide for the SNHU-Scalabrini program going forward. This process ensured SNHU and Scalabrini understood what we could and could not say about the program’s impact.

Knowing what we know now about recruitment challenges, we likely would have designed the evaluation differently in the beginning. Given we did not foresee many of the challenges to the recruitment process and target population demographics, though, we re-designed our evaluation approach to still offer insight into the program’s impact. Through this process, we learned how to pivot the evaluation design in response to real-world constraints, and still generate valuable evidence to understand the program’s impact.

  1. We used propensity score matching, a technique in which we predict the likelihood of receiving the treatment based on observable characteristics (covariates) and match on that likelihood. Because only a subsample of the treatment group matched to a subsample of the comparison group, we included only the students in both groups who matched.
  2. A difference-in-differences model estimates the treatment effect by comparing the average change over time in an outcome for the treatment group to the average change over time for the comparison group.
  3. This technique is called Inverse Probability of Treatment Weighting, which is designed to mitigate biases due to observable differences between treatment and comparison groups. The weights represent the inverse of the probability of selection into treatment. In other words, we use observable characteristics, such as baseline learning outcomes, age, gender, household wealth, high school type, years of computer use, and whether both parents are alive, to predict a student’s likelihood of being in the SNHU program. We take the inverse of these likelihoods to create the sample weights.
  4. We have yet to present results publicly because the endline results are incomplete. Endline data collection may be further delayed due to COVID-19.
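
As an illustration of the matching step in note 1, the Python sketch below pairs students on propensity scores that have already been estimated. The specific choices here (greedy one-to-one nearest-neighbor matching, no replacement, a `caliper` of 0.05) are assumptions for the example. The post does not say which matching algorithm or caliper the study used, and the student IDs and scores are made up.

```python
def match_on_propensity(treated, comparison, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, comparison: dicts mapping student ID -> propensity score.
    caliper: hypothetical maximum allowed score gap for a valid match.
    Returns a list of (treated_id, comparison_id) pairs; students with
    no comparison student inside the caliper stay unmatched.
    """
    available = dict(comparison)
    pairs = []
    # Process treated students in order of propensity score.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each comparison student is used once
    return pairs


# Made-up propensity scores; not data from the study.
treated_scores = {"t1": 0.30, "t2": 0.62, "t3": 0.95}
comparison_scores = {"c1": 0.32, "c2": 0.60, "c3": 0.70}
pairs = match_on_propensity(treated_scores, comparison_scores)
print(pairs)
```

Here `t3` stays unmatched because no comparison student has a score within the caliper, which mirrors how only a subsample of each group ends up in the matched analysis.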