A new paper equips researchers with the tools they need to make evaluations more decision-relevant. Read the full report here.
A tree in Mukobela Chiefdom, Southern Province, Zambia, taken as part of the IS Nano pilot. ©IDinsight/Natasha Siyumbwa; graphics by Torben Fischer.
Minister: I am considering implementing a school feeding program, but I would only like to do so if there is evidence of its effectiveness. Can you help with that?
Evaluator: Why yes, we can help you figure that out with an impact evaluation! Let’s see: similar programs have increased children’s height-for-age z-scores (HAZ) by .2 standard deviations, so a sample size calculation says we’ll need 40 schools in the treatment group and 40 in the control group.
Minister: Okay, but if we are going to scale this up, I want to be sure that the program actually increases HAZ by at least .2. If not, this program is not going to be worthwhile for me. Does that change anything?
Evaluator: Um, uhh, yeah, uh, give me a sec…
The situation above seems straightforward: a policy-maker wants to scale up a program only if its impact is above a certain threshold. They would like a researcher to help them determine whether the program works and whether its effect exceeds that threshold. Simple as it sounds, this situation is surprisingly difficult to handle with evaluators’ “standard” toolkit. The issue: we want to determine whether the effect is greater than .2, but unless we have a huge sample size, we will only be able to statistically reject an effect of .2 if the measured effect is much larger than .2. This is likely not what the minister has in mind.
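To see the problem concretely, here is a rough sketch in Python (the sample size is hypothetical and we ignore school-level clustering for simplicity; since HAZ is already in standard-deviation units, the outcome SD is set to 1) of the smallest point estimate that would let us reject the one-sided null that the effect is .2 or smaller at the 5 percent level:

```python
from scipy import stats

sigma = 1.0            # HAZ outcomes are already in SD units
n_per_arm = 1000       # hypothetical: e.g. 40 schools x 25 pupils per arm
se = sigma * (2 / n_per_arm) ** 0.5   # SE of a difference in means
z_crit = stats.norm.ppf(0.95)         # one-sided 5% critical value

# Smallest estimate that rejects H0: effect <= .2 at the 5% level
min_estimate = 0.2 + z_crit * se
print(f"Need an estimated effect of at least {min_estimate:.2f} SD")
```

Even with 1,000 pupils per arm, the estimate must come in well above .2 before we can reject the threshold, which is exactly the bind the evaluator is in.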
In a new paper (Fischer, Johnson, and Stein 2021), we present a methodological guide for designing impact evaluations when the audience is a specific decision-maker, as opposed to the general public (as may be the case for an academic paper). We argue that in certain, commonly encountered scenarios, orienting the evaluation’s design and inference to the decision-maker’s needs can make it easier for evidence to inform policy. In addition, having a specific decision-maker in mind opens the door to Bayesian analysis. The paper provides a practical guide to designing and analyzing evaluations to inform specific decisions, using both frequentist and Bayesian approaches.
We first consider less drastic deviations from standard evaluation tools, staying within the standard frequentist paradigm. For instance, we encourage researchers to consider whether both positive and negative effects are really decision-relevant. If not, one-sided hypothesis tests allow for smaller sample sizes while providing the same degree of decision-relevant information. Further, the decision-maker may not need a huge amount of certainty that treatment effects are positive in order to scale or continue a program. In these cases, researchers may be able to relax the ‘conventional’ 5 percent significance level to a higher one, say 20 percent. Again, such designs produce decision-relevant evidence with a smaller sample size while maintaining high statistical power.
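As a rough illustration of the savings (a standard normal-approximation power formula in Python; the specific numbers here are ours, not the paper’s), here is the required sample size per arm for detecting a .2 SD effect with 80 percent power under a two-sided 5 percent test, a one-sided 5 percent test, and a one-sided 20 percent test:

```python
from scipy import stats

def n_per_arm(alpha, power, delta, sigma=1.0, two_sided=True):
    """Normal-approximation sample size per arm for a difference in means."""
    a = alpha / 2 if two_sided else alpha
    z_alpha = stats.norm.ppf(1 - a)
    z_power = stats.norm.ppf(power)
    return 2 * (sigma * (z_alpha + z_power) / delta) ** 2

print(n_per_arm(0.05, 0.80, 0.2))                   # two-sided 5%:  ~392
print(n_per_arm(0.05, 0.80, 0.2, two_sided=False))  # one-sided 5%:  ~309
print(n_per_arm(0.20, 0.80, 0.2, two_sided=False))  # one-sided 20%: ~142
```

Moving from a two-sided 5 percent test to a one-sided 20 percent test roughly cuts the required sample size by nearly two-thirds in this illustration.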
Going back to our school feeding example above, how would we tackle this within a frequentist framework? For this circumstance, we suggest a ‘double-barreled hypothesis test.’ We set the parameters so that, in order to recommend the program, one must reject a null hypothesis of the program having a zero or smaller effect with high confidence (5 percent level of significance) and reject the hypothesis of the program having an effect of less than .2 with a much lower level of confidence (20 percent level of significance). With this setup, the researcher can recommend scale-up to the minister if they are very confident that the program has a positive effect, and modestly confident that the effect is larger than the minister’s desired threshold of .2.
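This decision rule can be written down directly. A minimal sketch in Python (the function name and inputs are ours, for illustration), assuming a normally distributed effect estimate:

```python
from scipy import stats

def recommend_scale_up(tau_hat, se, threshold=0.2):
    """Double-barreled test: both one-sided nulls must be rejected."""
    # Barrel 1: reject H0 of a zero-or-smaller effect at the 5% level
    confident_positive = (tau_hat - 0.0) / se > stats.norm.ppf(0.95)
    # Barrel 2: reject H0 of an effect below `threshold` at the 20% level
    clears_threshold = (tau_hat - threshold) / se > stats.norm.ppf(0.80)
    return bool(confident_positive and clears_threshold)

print(recommend_scale_up(0.25, 0.05))  # both barrels pass -> True
print(recommend_scale_up(0.15, 0.05))  # positive, but barrel 2 fails -> False
```

Note that an estimate can be strongly significantly different from zero and still fail the second barrel, which is the whole point of the design.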
While frequentist methods can be adapted to some of these tough decision frameworks, in many cases Bayesian analysis is much better suited to the task. In Bayesian analysis, we take the decision-maker’s original beliefs about the program’s effectiveness (priors) and combine them with the data from the experiment to form final beliefs about impact (the posterior). Besides explicitly taking the decision-maker’s prior beliefs into account, Bayesian analysis allows the researcher to make explicit probabilistic statements about the results, such as “There is an 83 percent probability that the impact of the program is greater than .2.” This can make explaining results to decision-makers much easier.
Our paper provides an accessible primer for evaluators versed in frequentist methods who would like to explore a Bayesian approach. We provide an overview of how to apply Bayesian analysis to impact evaluations, and walk through how to formulate priors, conduct the analysis, and interpret the results. We also explain how to determine sample size for Bayesian evaluations, ensuring that the evaluation has acceptable rates of Type-I error and statistical power. We provide a code repository (in R) for all our examples to help readers become acquainted with Bayesian techniques. We hope people will find it accessible, and that it will ease the entry into Bayesian analysis for the uninitiated.
Coming back to our school feeding program, how do we approach this using Bayesian analysis? First, we have to elicit the minister’s prior beliefs and decision rule. After an in-depth conversation, she states that she would be comfortable scaling the program if there is a greater than 70 percent chance of the program having an effect > 0.2 SD. Ex-ante, she thinks there is around a 16 percent chance that the school feeding program will have a negative effect, and a 16 percent chance that the effect is greater than .2. These beliefs are the key input into the model, though we also have to make assumptions about other parameters, such as the regression constant. Of course, we would not want to take every decision-maker’s stated beliefs at face value, and the paper discusses the elicitation methods and checks researchers should use when engaging in Bayesian analysis. Further, it is always possible to specify an uninformative prior, which is akin to assuming that we have no ex-ante information about the program. Armed with the full set of priors (and a simulated dataset), we can calculate the full posterior distribution of the treatment effect.
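If we approximate the minister’s beliefs with a normal prior, her two tail probabilities pin it down exactly. A short sketch (our own illustration of the mechanics, not the paper’s code) of backing out the prior’s mean and SD:

```python
from scipy import stats

# Elicited beliefs: P(effect < 0) = 0.16 and P(effect > 0.2) = 0.16
lo, hi = 0.0, 0.2
tail = 0.16

# Equal tails imply the prior mean sits midway between the two cut-points
prior_mean = (lo + hi) / 2                       # = 0.1
# The lower tail pins down the SD: Phi((lo - mean) / sd) = 0.16
prior_sd = (lo - prior_mean) / stats.norm.ppf(tail)

print(f"Prior: N({prior_mean:.2f}, {prior_sd:.3f}^2)")
```

Conveniently, a 16 percent tail corresponds to roughly one standard deviation under a normal distribution, so the minister’s statements translate to a prior centered at 0.1 with an SD of about 0.1.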
In the figure below, we show the posterior distribution of the treatment effect. Here we have high confidence (97 percent) that the treatment effect is greater than zero. At the same time, there is only a 33 percent chance that the treatment effect is greater than .2, which falls short of the minister’s required 70 percent. We therefore would not recommend scale-up.
Overall, we hope that this paper can help researchers design evaluations that are better tailored to the decision and context of their audience. After reading our paper, we hope that the unfortunate researcher in the intro will be much better prepared.
Minister: I’m still waiting for a response.
Evaluator: Yes ma’am, establishing a threshold of .2 will not be a problem. But let’s have a conversation about your priors…
Thoughts? Comments? Feedback? We’d love to hear from you in the comments!
Fischer, T., Johnson, D., and Stein, D. (February 2021). Informing Specific Decisions with Rigorous Evidence: Designing and Analyzing Decision-Focused Evaluations. Technical Report.