When to collect baseline data; and when not to — Part 2

Jeff McManus 21 September 2020

Part two in our series about baseline data explains when it is worth investing in baseline data, and when the costs outweigh the benefits.

In the previous post, I discussed a common misconception in impact evaluations: that it’s necessary for treatment and control groups to be ‘balanced’ at baseline. In fact, a baseline is neither necessary nor sufficient to make causal claims. But there are some good reasons for doing a baseline, as well as some reasons to avoid doing one.

Whether or not to collect baseline data: A decision tree

When should you do a baseline?

In roughly decreasing order of importance:

1. You didn’t do a Randomized Controlled Trial (RCT)

If your treatment and control groups come from different populations, then collecting baseline data can help you build more credible evaluation designs that reduce or eliminate time-invariant differences between your study groups. For certain outcomes, it’s often more convincing to argue that your program group, in the absence of treatment, would have had a similar change in outcomes to your comparison group over the study period (difference-in-differences) than to claim that they would have the same outcomes at all points in time (cross-section comparison). You can also use baseline data to ‘match’ individuals who are starting with the same outcomes, enforcing baseline balance across your groups, as we did when evaluating a blended learning university program in Rwanda, improved chickens in Ethiopia, and household solar lighting systems in Uganda. However, caution is advised when employing such designs: while baseline imbalance can be a cause for concern in difference-in-differences designs, matching on baseline values can sometimes create more bias than it eliminates. There has been a flood of new research pressure-testing and improving difference-in-differences and matching designs; see these two summaries from the World Bank Development Impact blog and this excellent tutorial from Zeldow and Hatfield before starting your next quasi-experimental evaluation.
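To make the contrast between the two comparisons concrete, here is a minimal sketch with made-up numbers. The function names and data are hypothetical and only illustrate the arithmetic; a real analysis would use a regression with standard errors and would test the parallel-trends assumption.

```python
from statistics import mean

def diff_in_diff(treat_base, treat_end, comp_base, comp_end):
    """Change in the program group minus change in the comparison group."""
    treat_change = mean(treat_end) - mean(treat_base)
    comp_change = mean(comp_end) - mean(comp_base)
    return treat_change - comp_change

def cross_section_diff(treat_end, comp_end):
    """Endline-only comparison; ignores where each group started."""
    return mean(treat_end) - mean(comp_end)

# Hypothetical scores: the program group starts 5 points ahead.
treat_base, treat_end = [50, 55, 60], [58, 63, 68]
comp_base, comp_end = [45, 50, 55], [48, 53, 58]

did = diff_in_diff(treat_base, treat_end, comp_base, comp_end)
naive = cross_section_diff(treat_end, comp_end)
print(f"difference-in-differences: {did}, cross-section: {naive}")
```

In this toy example the cross-section comparison attributes the pre-existing 5-point gap to the program, while difference-in-differences nets it out.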

2. You need to estimate treatment effects more precisely

Controlling for baseline values can greatly improve the precision of your treatment effect estimate. It can help ensure you have actionable results. Collecting baseline data is particularly important when your sample size is constrained by program size. If the fixed costs to implementing the program in a new school or village are high, or if the program can only be rolled out where the implementer is already working, then increasing sample size may be a non-starter. Adding a round of baseline data collection may be the only way to gain as much precision as you need.

However, the informative value of baseline data varies widely by outcome, depending on how correlated the outcome is over time.[1] McKenzie 2012 explores this tradeoff, showing that as autocorrelation falls, the precision gains from collecting baseline data fall as well.[2] McKenzie further shows that in some cases it’s more efficient to skip the baseline, conduct multiple endlines, and average measurements across rounds, rather than conduct a single baseline and endline.
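A rough way to see this tradeoff is to compare the variance of each estimator to an endline-only comparison of the same sample size. The sketch below assumes the simplest textbook case (one baseline, one endline, equal outcome variance in both rounds, large samples), in which difference-in-differences has relative variance 2(1 - rho) and ANCOVA has relative variance 1 - rho^2. The function names are my own, and the formulas are a simplification rather than a substitute for a proper power calculation.

```python
import math

def relative_variance(estimator: str, rho: float) -> float:
    """Variance of the treatment effect estimate relative to an
    endline-only comparison with the same sample size (1.0 = no gain).
    Assumes one baseline, one endline, equal variances across rounds."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must be between -1 and 1")
    if estimator == "endline_only":
        return 1.0
    if estimator == "diff_in_diff":
        return 2.0 * (1.0 - rho)
    if estimator == "ancova":
        return 1.0 - rho ** 2
    raise ValueError(f"unknown estimator: {estimator}")

def equivalent_endline_multiplier(estimator: str, rho: float) -> float:
    """How many times larger an endline-only sample would need to be
    to match the precision of the given estimator."""
    rv = relative_variance(estimator, rho)
    return math.inf if rv == 0 else 1.0 / rv

for rho in (0.3, 0.5, 0.75, 0.9):
    did = equivalent_endline_multiplier("diff_in_diff", rho)
    ancova = equivalent_endline_multiplier("ancova", rho)
    print(f"rho={rho}: DiD x{did:.2f}, ANCOVA x{ancova:.2f}")
```

A multiplier below 1 means the baseline makes the estimate noisier than an endline-only comparison, and a multiplier below 2 means doubling the endline sample would have been the better use of the same number of surveys.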

So how do you know what autocorrelation will be ahead of time? Certain types of outcomes have reliably higher or lower autocorrelation. Test scores, for example, tend to be highly autocorrelated: in our evaluation of Educate Girls’ remedial education program in rural India, 1-year autocorrelation in reading scores was 0.82 and 2-year autocorrelation was 0.73. Doing a baseline can improve precision more than doubling or tripling the endline sample size in such cases. On the other hand, economic outcomes, like business profits and household income, tend to have much lower autocorrelation, so that controlling for baseline values in your analysis does little to improve precision. Since autocorrelation varies across contexts, subgroups, and the time elapsed between baseline and endline, I’d recommend finding data from a population similar to your study group to estimate autocorrelation. (IPA and J-PAL provide free access to much of their data, and we’re in the process of preparing a similar dataverse for IDinsight data.)
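If you do find a comparable panel dataset, estimating autocorrelation is just a correlation between the same units' outcomes in two rounds. Here is a minimal sketch using only the Python standard library; the function name and the scores are hypothetical, and in practice you would also want to condition on the covariates you plan to include in your model.

```python
import math

def autocorrelation(round_a, round_b):
    """Pearson correlation between the same units' outcomes in two rounds.
    round_a[i] and round_b[i] must refer to the same unit."""
    if len(round_a) != len(round_b) or len(round_a) < 2:
        raise ValueError("need two equal-length rounds with at least 2 units")
    n = len(round_a)
    mean_a = sum(round_a) / n
    mean_b = sum(round_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(round_a, round_b))
    var_a = sum((a - mean_a) ** 2 for a in round_a)
    var_b = sum((b - mean_b) ** 2 for b in round_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("outcome has no variation in at least one round")
    return cov / math.sqrt(var_a * var_b)

# Hypothetical test scores for the same five students, one year apart.
scores_year1 = [10, 20, 30, 40, 50]
scores_year2 = [12, 25, 29, 45, 52]
rho = autocorrelation(scores_year1, scores_year2)
print(f"estimated autocorrelation: {rho:.2f}")
```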

A second but less important way that baseline data can improve precision is through stratification. Stratification enforces balance, and if you stratify on baseline values of your outcome, you effectively eliminate a potential source of imbalance. But stratification only really matters in small samples: Bruhn & McKenzie show that in samples of 300 observations or more, stratification (and similar methods to achieve better balance) doesn’t yield much more precision than simple randomization. You can also often find relevant government or administrative data at the level of granularity that you need for stratifying, realizing part of the gains from stratifying on baseline values at much lower cost.[3]
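For readers who have not implemented it before, stratified randomization can be sketched in a few lines: group units by stratum, then randomize within each group. The function name, school IDs, and strata below are all hypothetical, and the handling of odd-sized strata (a coin flip for the leftover unit) is just one reasonable choice.

```python
import random

def stratified_assignment(units, stratum_of, seed=0):
    """Assign roughly half of each stratum to treatment.
    units: list of unit ids; stratum_of: dict mapping id -> stratum label."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of[unit], []).append(unit)

    assignment = {}
    for label in sorted(strata):
        members = strata[label]
        rng.shuffle(members)
        n_treat = len(members) // 2
        # Odd-sized stratum: a coin flip decides where the extra unit goes.
        if len(members) % 2 == 1 and rng.random() < 0.5:
            n_treat += 1
        for i, unit in enumerate(members):
            assignment[unit] = "treatment" if i < n_treat else "control"
    return assignment

# Hypothetical example: 8 schools split into low and high baseline scores.
schools = [f"school_{i}" for i in range(8)]
stratum_of = {s: ("low" if i < 4 else "high") for i, s in enumerate(schools)}
assignment = stratified_assignment(schools, stratum_of, seed=42)
for school in schools:
    print(school, stratum_of[school], assignment[school])
```

The stratum labels could just as well come from administrative data, such as district or school size, as from a baseline survey.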

3. You need a reliable list of evaluation participants before the program starts since the program could influence drop-out.

In most evaluations, you will need pre-program information on study units or clusters to enable random or quasi-random program assignment. But you usually don’t need to collect baseline data to get this information; existing administrative data sources (such as a list of villages in the most recent census) will typically have enough information for program assignment.

An exception to this rule happens if you are (quasi)-randomizing at the level of the clusters (such as villages or schools) and the program affects which individuals are in those clusters. For instance, a community nutrition program could reduce under-five mortality for malnourished children. Or a public works program may make people less likely to migrate to cities for jobs. If you don’t have a list of program participants from before the program started (because you randomly assigned or matched clusters but not individuals), and you only sample from individuals who are present at endline, then you risk introducing bias to your impact estimates: those who remain in treatment and control groups at endline may not be comparable to each other.

Collecting baseline data allows you to document who was present before the program started and get their contact information in case they drop out and you need to follow up (or if you need to pivot to phone-based data collection for the endline, as many researchers have done due to COVID-19). Baseline data is critical for estimating how the program impacts the average person assigned to receive it (intention-to-treat effects) as well as the average person who actually received it (treatment-on-the-treated effects).

Even if it’s too expensive or not possible to follow up with everyone who dropped out, having data on all participants at the beginning of the program gives you options. For instance, you can sample from the list of attriters and follow up only with that subsample, weighting their outcomes to account for the attriters you don’t survey. You can use baseline data to understand what types of people are more or less likely to remain in your study, and whether drop-out is likely correlated with treatment assignment and outcomes. Even if attrition is related to being in the program, knowing who dropped out can help you estimate bounds on your treatment effect.
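The subsample-and-weight idea can be sketched as follows for a single study arm. This is an illustration under strong assumptions: the tracked attriters are a simple random sample of all attriters from the baseline list, and every sampled attriter is actually found. The function name and numbers are hypothetical.

```python
def mean_with_tracked_attriters(stayer_outcomes, tracked_attriter_outcomes,
                                n_attriters):
    """Mean outcome for one study arm when only a random subsample of
    attriters is surveyed. Each tracked attriter stands in for
    n_attriters / n_tracked attriters from the baseline list."""
    n_tracked = len(tracked_attriter_outcomes)
    if n_attriters > 0 and n_tracked == 0:
        raise ValueError("no tracked attriters to represent those who left")
    weight = n_attriters / n_tracked if n_tracked else 0.0
    weighted_total = sum(stayer_outcomes) + weight * sum(tracked_attriter_outcomes)
    return weighted_total / (len(stayer_outcomes) + n_attriters)

# Hypothetical arm: 4 people still present, 4 attriters, 2 of them tracked.
stayers = [10, 12, 14, 16]
tracked = [6, 8]
adjusted_mean = mean_with_tracked_attriters(stayers, tracked, n_attriters=4)
stayers_only_mean = sum(stayers) / len(stayers)
print(f"stayers only: {stayers_only_mean}, reweighted: {adjusted_mean}")
```

In this made-up case, ignoring attriters overstates the arm's mean because those who left had lower outcomes; the baseline list is what tells you there were four attriters to account for.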

4. You want to compare impact estimates for high baseline performers vs. low baseline performers.

You may want to know if a program is benefitting those at the top or bottom of the distribution: is it reducing inequality or making it worse? Collecting baseline data facilitates this analysis, though it may not be necessary. For instance, you may be able to use data from standardized tests administered by the school prior to the program. Even if the school test is different from yours, or was administered months prior to the start of the program, it may be sufficient to explore heterogeneity if performance on that test is correlated with performance on your test.

It is also possible to use proxy characteristics to understand who benefits more or less. If you are evaluating a school program you could explore whether students from different castes, races, ethnicities, wealth groups or income groups are differentially impacted. This data could all be collected during endline. It isn’t exactly the same as assessing whether the program improves test scores more for those who started with higher or lower scores. But knowing whether the program is reaching students from historically privileged or marginalized groups may tell you a lot about whether the program is increasing or decreasing inequality.
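The comparison of high and low baseline performers described in this section amounts to estimating the treatment effect separately within baseline subgroups. A minimal sketch, assuming a median split on a prior score (from your own baseline or a school test) and using hypothetical records and names:

```python
from statistics import mean, median

def effect_by_baseline_half(records):
    """records: list of (prior_score, treated, endline_score) tuples.
    Returns the treatment-minus-control difference in mean endline scores
    for units below the median prior score and for units at or above it."""
    cutoff = median([prior for prior, _, _ in records])
    effects = {}
    for label, in_group in (("low", lambda p: p < cutoff),
                            ("high", lambda p: p >= cutoff)):
        treated = [y for p, t, y in records if in_group(p) and t]
        control = [y for p, t, y in records if in_group(p) and not t]
        effects[label] = mean(treated) - mean(control)
    return effects

# Hypothetical students: (prior score, treated?, endline score).
records = [
    (30, True, 50), (32, False, 38), (35, True, 54), (38, False, 42),
    (60, True, 70), (62, False, 66), (65, True, 74), (68, False, 70),
]
effects = effect_by_baseline_half(records)
print(effects)
```

With real data you would estimate this with an interaction term in a regression and report standard errors, since subgroup differences are often noisy.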

5. You want to benchmark your treatment effect to changes over time.

When you do an impact evaluation, you usually want a reference point for effect sizes: e.g. how does this program’s effect compare with similar programs in other contexts? You can usually find a range of relevant benchmarks in past studies, and Bayesian methods can help you interpret these benchmarks to figure out what effect size you might ‘expect’ given past studies (e.g. Vivalt 2020). As a last resort, you might fall back on a rule of thumb like Cohen’s d (< 0.2 SD -> small effect, > 0.5 SD -> large effect), though as Cohen himself advises, “avoid the use of these conventions … in favour of exact values provided by theory or experience” (p. 184).

Sometimes the most relevant benchmark, though, is business-as-usual change over time. You probably can’t guess what this benchmark will be for your study population; if you could, you wouldn’t need a control group. In such cases, collecting baseline data allows you to benchmark how much the treatment group changed above and beyond the control group. For instance, in our evaluation of Educate Girls’ program, we found that treatment students outperformed control students by 0.44 SD after 3 years. What does that really mean? By measuring growth in the control group, we see that 0.44 SD is equivalent to 1.14 additional years of business-as-usual school in this setting, or an extra 3–4 months of school per year.
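The conversion from standard deviations to years of business-as-usual schooling is a simple ratio: the treatment effect divided by the control group's annual growth. The sketch below is my own illustration of that arithmetic, not the exact calculation from the evaluation. The control-group means and SD are hypothetical, chosen so that the implied growth (about 0.39 SD per year) is consistent with the two figures quoted above.

```python
def annual_growth_sd(baseline_mean, endline_mean, sd, years):
    """Average yearly change in the control group, in SD units."""
    return (endline_mean - baseline_mean) / sd / years

def effect_in_business_as_usual_years(effect_sd, control_growth_sd_per_year):
    """Express a treatment effect as years of ordinary control-group growth."""
    if control_growth_sd_per_year <= 0:
        raise ValueError("control group must show positive growth")
    return effect_sd / control_growth_sd_per_year

# Hypothetical control-group scores over a 3-year study.
control_growth = annual_growth_sd(baseline_mean=20.0, endline_mean=31.58,
                                  sd=10.0, years=3)
extra_years = effect_in_business_as_usual_years(0.44, control_growth)
print(f"control growth: {control_growth:.3f} SD/year, "
      f"effect: {extra_years:.2f} years of business-as-usual schooling")
```

Without a baseline, the numerator is still estimable from endline data alone, but the denominator (how much the control group grew) is not.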

Besides improving impact estimation, baseline data may help implementers better understand their target population and thus inform program design. We’ll be exploring the beauty of the baseline beyond causal estimation in an upcoming blog post.

When should you not do a baseline?

You should not do a baseline when you can use the money more effectively elsewhere. There are lots of ways you could use funding intended for a baseline to improve the decision-relevance of your research, such as:

  • A larger endline sample to improve precision and do interesting subgroup analysis
  • Multiple follow-ups to measure effects over time or improve precision (a la McKenzie 2012)
  • Expanding the survey instrument to measure other outcomes
  • An extra treatment arm to test a program variant or benchmark against a different type of intervention (e.g. a cash-only arm)
  • Collecting data to strengthen the credibility of your impact estimates, such as tracking down attriters or measuring test-retest reliability
  • Collecting data on implementation costs to facilitate cost-effectiveness analysis

Skipping the baseline could also be a way of managing financial risk. Karthik Muralidharan points out that there’s often a risk of an intervention not happening, particularly when you’re working with government partners. In these cases, it may make sense to avoid the risk of implementation failure and just plan for a larger endline sample.

Doing a baseline could also put a lot of logistical pressure on the implementer. For instance, educational interventions are often designed to run for a full semester or school year, but students can’t be tested until they’re back in classrooms. Waiting to conduct a 4-week baseline would force the implementer to change the design of their program or extend the school year; in either case, the evaluation results may not reflect the impact of the program at scale.

Deciding when to do a baseline in an impact evaluation

Every situation is different: the benefits and costs of doing a baseline will depend on the evidence needs, methods and data, program constraints, and research budget. At IDinsight, these considerations have sometimes pushed us toward doing a baseline, as in our evaluations of conditional cash transfers to incentivize immunizations in Nigeria, disseminating improved sweet potato varieties to smallholder farmers in Tanzania and Uganda, and unconditional cash transfers to households in a refugee settlement in Uganda. Other times we have opted not to do a baseline, as with our evaluations of a poverty graduation program in Kenya and Uganda, school handwashing interventions in the Philippines, and providing information to farmers in India on the appropriate amount of fertilizer to use. Given the unique considerations for each evaluation, it’s important to weigh the tradeoffs when deciding whether or not to do a baseline.

Jeff McManus is a Senior Economist on the Technical Team at IDinsight. Jeff oversees the technical design and analysis of impact evaluations, process evaluations, and machine learning applications. He also leads technical training at the organization, and he designed and delivers the curriculum for the semi-annual technical ‘boot camp’ for new staff.

  1. What matters, technically, is the correlation between baseline and endline outcomes conditional on other covariates in your model. If you can find data that is highly predictive of outcomes from another source (say, government data on school enrollment, infrastructure, and student demographics), then it could negate the informative value of baseline data.
  2. The gains from collecting baseline data depend on the analytical set-up. In a difference-in-differences model, it’s more efficient to skip the baseline and double the endline sample if autocorrelation is less than 0.75. In an ANCOVA model, it’s more efficient to skip the baseline and double the endline sample if autocorrelation is less than sqrt(0.5), or about 0.71. See McKenzie 2012 for proofs. h/t to Johannes Haushofer for pointing this out.
  3. Even if government or administrative data contains measurement error, stratified randomization will likely yield estimates that are at least as precise as simple randomization. The one exception is if you stratify into very small groups (like pairs); then the degrees of freedom adjustment necessary when stratifying may increase your standard errors. See Bruhn & McKenzie (2009) for a discussion of the optimal stratum size when using data that is weakly correlated with outcomes of interest.