More outcomes, more problems: a practical guide to multiple hypothesis testing in impact evaluations (part 1)

Dan Stein 12 November 2019

This is part one of a three-part series on multiple hypothesis testing in impact evaluations. In this first post, I discuss the circumstances under which you need to worry about multiple hypothesis testing. In the second post, I share how to avoid the need for multiple testing corrections. In the third post, I cover common methods for making those corrections. Note that the ideas detailed in this first post are not new, and draw heavy inspiration from excellent posts on this subject by Cyrus Samii, Daniel Lakens, and Jan Vanhove.

IDinsight Associate Madhav Seth works at our Delhi office. ©IDinsight/Siobhan McDonough

As standards of evidence become more stringent in applied economics, researchers need to take seriously the concept of multiple hypothesis testing. But even for seasoned researchers, it can be difficult to understand exactly when we need to apply multiple hypothesis corrections, because the answer depends not just on the number and type of tests you run, but on how you expect the findings to be used. Because most academic research is not connected to a specific use, it is often unclear when corrections should be applied. IDinsight’s research, by contrast, is generally connected to specific decisions, which can clarify when corrections are needed. In this series, we explore norms and techniques for research targeted at specific decision-makers as well as at general audiences, hopefully providing guidance for both situations.

At a high level, a multiple testing correction is necessary when you have a number of hypotheses and will take a certain action if any one of them is found to be true. However, if each hypothesis will individually lead to its own separate action, then the hypotheses can be treated as independent of one another, and no correction is needed.

Let’s illustrate with an example. Say we are conducting an impact evaluation of an unconditional cash transfer program, and have three outcomes in mind: (monthly) consumption, (value of) assets, and food security (measured with an index). We produce estimates of the treatment effect on each of these three outcomes separately. Do we need to correct our inference for multiple hypotheses?

If we know how the results will be used, then the situation is relatively straightforward to navigate. For instance, assume there are two decision-makers acting separately. The first is the Ministry of Social Protection, which is mostly concerned with economic outcomes. It views both consumption and assets as reliable measures of well-being, and will be happy to fund a continuation of the cash transfer program if either of these measures shows significant improvements. The second is the Ministry of Health, which is only concerned about food security. It will be happy to fund this program as long as there are significant improvements in food security.

In this case, since the Ministry of Social Protection is making a decision based jointly on the consumption and asset results, we need to correct for these multiple hypotheses. If we don’t, the probability of falsely rejecting the null of no effect for at least one of consumption or assets will generally be higher than our desired tolerance for Type I error (usually 5%).
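To make the size of the problem concrete, here is a minimal sketch of the arithmetic. It assumes the consumption and asset tests are statistically independent and that both null hypotheses are true; neither assumption comes from the example above, and real outcomes are usually correlated. The function name is made up for illustration, and the Bonferroni threshold (dividing the 5% level by the number of tests) is used only because it is the simplest correction to show.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false rejection across n_tests independent
    tests, each run at level alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


alpha = 0.05   # desired tolerance for Type I error
n_tests = 2    # consumption and assets

# Each outcome tested at 5% with no correction.
unadjusted = familywise_error_rate(alpha, n_tests)

# Bonferroni: test each outcome at alpha / n_tests instead.
bonferroni = familywise_error_rate(alpha / n_tests, n_tests)

print(f"No correction: {unadjusted:.4f}")   # 0.0975
print(f"Bonferroni:    {bonferroni:.4f}")   # 0.0494
```

Under these assumptions, the Ministry of Social Protection would wrongly see at least one significant result almost 10% of the time without a correction. If the two tests are correlated, the uncorrected rate differs from this figure, but it still lies somewhere between 5% and 10%.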

But since the Ministry of Health is making its decision solely on the basis of the food security results, we do not have to include this test in the multiple hypothesis correction, and can simply perform inference on it as usual. Only one test feeds into this decision, so the standard p-value correctly reflects the probability of Type I error.
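One way to picture this logic is to group the hypotheses by the decision they feed into and correct only within each group. The sketch below does this with a Bonferroni adjustment (multiplying each p-value by the number of tests in its group). The p-values are invented purely for illustration, the function and dictionary names are hypothetical, and Bonferroni is just one of several possible corrections.

```python
def bonferroni_within_families(p_values, families):
    """Adjust each p-value by the number of tests in its decision family.

    p_values: dict mapping outcome name -> unadjusted p-value
    families: dict mapping decision-maker -> list of outcome names
    """
    adjusted = {}
    for outcomes in families.values():
        n_tests = len(outcomes)
        for outcome in outcomes:
            # Bonferroni adjustment, capped at 1.
            adjusted[outcome] = min(1.0, p_values[outcome] * n_tests)
    return adjusted


# Hypothetical p-values, made up for this illustration only.
p_values = {"consumption": 0.03, "assets": 0.20, "food_security": 0.04}

# Each decision-maker defines one family of hypotheses.
families = {
    "ministry_of_social_protection": ["consumption", "assets"],
    "ministry_of_health": ["food_security"],
}

adjusted = bonferroni_within_families(p_values, families)
for outcome, p_adj in adjusted.items():
    print(f"{outcome}: unadjusted={p_values[outcome]:.2f}, adjusted={p_adj:.2f}")
```

With these made-up numbers, the food security p-value is unchanged because its family contains a single test, while the consumption result moves from 0.03 to 0.06 and no longer clears the 5% threshold for the Ministry of Social Protection.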

However, let’s say I was writing up these results without a specific audience or decision-maker in mind. Then I’m really not sure what I would do. The responsible thing would be to report all results, with both adjusted and unadjusted inference. Then the reader can decide whether the adjusted or unadjusted results are most relevant for them. For instance, if the reader were assembling a systematic review on whether cash transfers affect food security, they would be most interested in the unadjusted results. If they were conducting a review on whether cash transfers have any effect on welfare, they might be most interested in the adjusted results. This is a lot of responsibility to place on a reader, especially for a subtle and confusing topic like this. But I’m not sure there is a better way forward.

As we discuss in further detail in the following posts in this series, the academic literature seems to be defaulting to requiring multiple hypothesis corrections whenever there are multiple outcomes, which is the conservative choice. But the researcher can take steps to avoid needing to perform multiple testing corrections, as discussed in the next post.