Why you don’t always need a baseline — Part 1

Jeff McManus 18 September 2020

If you’re singing the blues, you need a bass line. But if you’re doing an impact evaluation, you might not need a baseline. Read Part 2: When to collect baseline data and when not to

It is common practice in impact evaluations to collect ‘baseline’ data, that is, outcome data measured before the program begins. This data is often used to check that treatment and control groups have similar average outcomes at the start of the study, i.e. that they are ‘balanced’. In this post, I describe why this use of baseline data is misguided.

There are many good reasons for collecting baseline data, such as measuring treatment effects more precisely or strengthening non-experimental designs. But collecting baseline data can be expensive, and it may be more cost-effective to skip the baseline and spend research funds on a larger endline, multiple follow-up surveys, or other research activities.

Deciding whether or not to do a baseline is critical to the usefulness of evidence generated from an impact evaluation. To help clarify this tradeoff, I’ve created the following decision tree; I describe each consideration in more detail below.

[Figure: Whether or not to collect baseline data: a decision tree]

Is it necessary to prove that your study groups are balanced at baseline?

A common misconception in impact evaluation is that it’s necessary to show that treatment and control groups are ‘balanced’ at baseline. Balance is typically understood to mean that differences in treatment and control group average outcomes at baseline are not statistically significant. However, several things are wrong with this line of reasoning: showing balance at baseline is neither sufficient nor necessary for causal inference, and lack of statistical significance is not proof of balance.

Outside of the lab, outcomes are influenced by many factors. Take test scores as an example. Two students who score the same on a test at the beginning of the school year may have very different scores by the end of the year. Maybe one student has a better teacher than the other or a more supportive home environment. Maybe one student works harder than the other. Maybe one student is just less comfortable with taking tests at first but gets the hang of it by the end of the year.

Now imagine comparing a group of students to another group of students. Even if students were randomly assigned to groups and those groups have the same average scores to start, they might not end up at the same place. One group might have, by chance, better teachers or be more comfortable taking tests. If you assign one group to receive a program and the other one to not (i.e. to a control group), it’s possible that the program has no effect, yet one group outperforms the other.

This possibility is why researchers report p-values alongside point estimates of treatment effects, and why those p-values are never 0: there’s always some chance that the difference observed is due to a “lucky” or “unlucky” random assignment. Fortunately, if our sample is large enough, the odds are on our side: the likelihood of imbalance leading to a large difference in outcomes becomes really small. Collecting baseline data can reduce the likelihood of imbalance, but it can’t fully eliminate it.
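This can be seen in a quick simulation (a sketch I've added for illustration, with made-up data): draw random "students", split them into treatment and control purely by chance with no treatment effect at all, and watch the average gap between the groups shrink as the sample grows.

```python
import random
import statistics

def average_chance_gap(n_per_group, n_reps=2000, seed=0):
    """Average absolute gap in group means when assignment is pure chance
    and the 'program' does nothing at all."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_reps):
        # Hypothetical outcome scores; no treatment effect anywhere.
        scores = [rng.gauss(0, 1) for _ in range(2 * n_per_group)]
        rng.shuffle(scores)
        treat, control = scores[:n_per_group], scores[n_per_group:]
        gaps.append(abs(statistics.mean(treat) - statistics.mean(control)))
    return statistics.mean(gaps)

gap_small_sample = average_chance_gap(25)    # small study
gap_large_sample = average_chance_gap(400)   # larger study
# Chance gaps never vanish, but they shrink steadily as the sample grows.
```

The gap never reaches exactly zero in any finite sample, which is precisely why p-values are never exactly zero either.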

When discussing the “meaning of the standard Table 1” (the table in impact evaluation papers that shows balance checks), Bruhn & McKenzie (2012) quote the late great statistician Doug Altman:

“[when] treatment allocation was properly randomized, a difference of any sort between the two groups … will necessarily be due to chance … performing a significance test to compare baseline variables is to assess the probability of something having occurred by chance when we know that it did occur by chance. Such a procedure is clearly absurd.”

In other words, randomization, not baseline balance, is what makes causal claims possible. If the program was not randomly assigned, then the researcher has to make the case that assignment was as-good-as-random. Showing baseline balance may help persuade sceptical readers that randomization was done properly or that assignment was as good as random, but it is neither necessary nor sufficient for these claims (Imai et al. (2008) provide proofs of this and other common fallacies in causal inference).
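Altman's point can be demonstrated numerically. In the hypothetical simulation below (my own addition, not from the post), assignment is genuinely random, yet a t-test on a baseline covariate flags a "significant imbalance" roughly 5% of the time, exactly as the test is designed to do:

```python
import random
import statistics

def t_stat(a, b):
    """Two-sample t statistic (pooled-variance form)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * statistics.variance(a)
           + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

rng = random.Random(1)
reps, rejections = 2000, 0
for _ in range(reps):
    # A made-up pre-program covariate, identical in distribution for everyone.
    covariate = [rng.gauss(50, 10) for _ in range(200)]
    rng.shuffle(covariate)                      # random assignment
    if abs(t_stat(covariate[:100], covariate[100:])) > 1.96:
        rejections += 1                         # flagged as "imbalanced" at the 5% level

share_flagged = rejections / reps               # hovers around 0.05 by construction
```

The 5% rejection rate is baked in by the significance threshold: the test "detects" chance differences that we already know arose by chance.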

When I propose not doing a baseline, I often hear the concern: “but how will we know an effect isn’t due to differences between treatment and control at baseline?” True, we won’t know. We will never know whether a difference in outcomes is due to an imbalance in outcomes at baseline, or to any other imbalance (measurable or unmeasurable) for that matter. But we can calculate the exact probability that the difference is due to chance (including imbalance at baseline) rather than to a program effect. This calculation is what’s captured in the p-value. And if my sample is large enough, then the probability of imbalance leading to large differences between treatment and control isn’t quite 0, but it’s often small enough not to get in the way of informing program decisions. In short, if the assumption of (as-if) random assignment is sound, then you can quantify the likelihood that an observed difference between treatment and control groups is due to imbalance.
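That probability-of-chance calculation is exactly what a permutation test makes concrete. The sketch below (my own illustration, with invented data) re-randomizes assignment many times and asks how often chance alone produces a gap at least as large as the one observed:

```python
import random
import statistics

def permutation_pvalue(treat, control, n_perm=5000, seed=0):
    """Share of random re-assignments whose treatment-control gap is at
    least as large as the observed one, i.e. the chance that 'luck of
    the draw' alone explains the difference."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(treat) - statistics.mean(control))
    pooled = list(treat) + list(control)
    n_t = len(treat)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        gap = abs(statistics.mean(pooled[:n_t]) - statistics.mean(pooled[n_t:]))
        if gap >= observed:
            extreme += 1
    return extreme / n_perm

# Fake endline scores: here the 'program' shifts outcomes up by one unit.
rng = random.Random(7)
control = [rng.gauss(0, 1) for _ in range(100)]
treat = [rng.gauss(1, 1) for _ in range(100)]
p = permutation_pvalue(treat, control)    # tiny: chance is a poor explanation
```

No baseline data appears anywhere in this calculation; randomization alone is what licenses the inference.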

To be clear, baseline imbalance can be concerning, particularly if it shows up in your outcome variable. For outcomes that are highly correlated over time, like test scores, the possibility of even a small imbalance at baseline can make your treatment effect estimates a lot less precise. As McKenzie tells us, again quoting Altman, “a small imbalance in a variable highly correlated with the outcome of interest can be far more important than a large and significant imbalance for a variable uncorrelated with the variable of interest.”

So if you can collect any data at baseline, focus on your key outcome or highly correlated characteristics since they will give you the biggest bang for your buck in terms of improving precision. But if you can’t collect baseline data, don’t sweat it too much: your p-values will reflect the likelihood of imbalance in these characteristics.
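To see the "bang for your buck" in numbers, here is a back-of-the-envelope simulation (all figures invented for illustration): when the baseline measure is highly correlated with the endline, adjusting for it removes most of the outcome variance that standard errors are built from.

```python
import random
import statistics

rng = random.Random(2)
n = 500
baseline = [rng.gauss(0, 1) for _ in range(n)]
# Assume an endline highly correlated with baseline (roughly rho = 0.9 here).
endline = [0.9 * b + rng.gauss(0, 0.44) for b in baseline]

# Simple OLS adjustment: regress endline on baseline and keep the residuals.
mb, me = statistics.mean(baseline), statistics.mean(endline)
cov = sum((b - mb) * (e - me) for b, e in zip(baseline, endline)) / (n - 1)
slope = cov / statistics.variance(baseline)
residuals = [e - me - slope * (b - mb) for b, e in zip(baseline, endline)]

raw_var = statistics.variance(endline)     # what an endline-only comparison faces
adj_var = statistics.variance(residuals)   # after controlling for baseline
# adj_var is a small fraction of raw_var, so standard errors shrink accordingly.
```

With a weakly correlated baseline variable, by contrast, the residual variance would barely move, which is why uncorrelated covariates buy you little precision.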