Health workers using the MomConnect Platform ©ReachDigital
Generative AI (GenAI) tools are evolving rapidly, reshaping how we work, communicate, and access information. These tools hold vast potential for transforming social program delivery, particularly in resource-constrained environments. From AI tutors helping students learn to chatbots providing agricultural advice to smallholder farmers, the social sector is increasingly embracing AI-powered solutions.
This enthusiasm is well-founded. As highlighted by Dhaliwal and Hou, AI has massive potential to fight poverty and climate change through improved targeting, increased access to services, enhanced frontline worker capabilities, reduced biases, and boosted efficiency. But as with any innovation promising social impact, we should have rigorous evidence before scaling these solutions widely.
However, there’s growing concern in the social sector that traditional impact evaluations — particularly randomized controlled trials (RCTs) — are too slow for the fast-moving pace of AI development. Some argue we should abandon these “gold standard” evaluation approaches entirely in favor of more agile methods.
Our response? Don’t throw the baby out with the bathwater.
We suggest (0) ensuring the model and product are functioning as expected, and then doing (1) an impact evaluation to estimate a baseline level of impact. Then, as the model or product evolves, (2) iterate with A/B tests, (3) using proxy outcomes when necessary. At some point, (4) assess when enough has changed about the world, the users, or the product, to justify conducting another impact evaluation. And then the cycle repeats. For those familiar, this is a slightly different sequence, and more cyclical, than what was proposed in AI Evaluation Framework for the Development Sector, by The Agency Fund, Center for Global Development, and J-PAL.
Impact evaluations remain critical for understanding whether AI interventions actually deliver on their promise. They allow us to measure real-world impact (not just, for example, engagement metrics), provide cost-effectiveness estimates for resource allocation decisions, and surface unintended consequences—both positive and negative. This last point is especially important for AI, given how much we still don’t know about these technologies’ effects.
The concerns about traditional evaluations are valid, though. Implementing rigorous evaluations takes months or years, largely because meaningful outcomes like improved learning or higher crop yields take time to emerge. Meanwhile, GenAI tools evolve at lightning speed. By the time evaluation results are available, the tested version may be outdated, limiting the external validity of findings. Plus, conducting a full impact evaluation for every iteration is rarely financially viable (though some impact evaluations can be fast and cheap, so this does not hold in every case).
We propose keeping the rigor of impact evaluations while adapting how and when they are used. Here’s the approach:
First, let the program stabilize. Don’t measure impact too early, as adoption and effects may take time to emerge. Instead, evaluate through the early stages of product development using the AI Evaluation Framework for the Development Sector: ensure (1) the AI model works as intended, (2) the product using the AI model functions properly, and (3) you see early shifts in user knowledge, attitudes, and behaviors. Only after this early-stage prototyping and iteration should you move to impact evaluation.
Then, conduct a well-timed impact evaluation. AI tools evolve quickly, but there’s usually a point when at least your product and theory of change have stabilized enough to meaningfully assess impact. This first evaluation establishes whether your intervention impacts real-world outcomes and provides an evidence base upon which future modifications can be tested.
Once you’ve established baseline impact, you can use A/B testing to nimbly assess impact (or at least the direction of impact) as you iterate on your product. This has the benefit of not needing to recruit new subjects who will be randomized into treatment (users) and pure control (non-users) groups. You can simply use new or existing users. Initially, you can compare your established baseline product (A) to new versions (B). If B is an improvement, that becomes the new status quo, and new iterations will be compared to that. (You can also test several versions at once in a multi-armed experiment, such as an A/B/C/D test.)
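To make the A/B comparison above concrete, here is a minimal sketch of one common way to compare two versions on a binary outcome: a two-proportion z-test. The function name, the choice of test, and the counts are illustrative assumptions, not something prescribed by the framework; a real analysis would also need to account for how users are randomized and for running multiple tests.

```python
import math


def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Compare success rates of version A and version B.

    Returns (difference B - A, z statistic, two-sided p-value).
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Everyone succeeded or everyone failed in both arms: no evidence of a difference.
        return p_b - p_a, 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Hypothetical counts: users who completed a weekly lesson under each version.
diff, z, p = two_proportion_ztest(success_a=420, n_a=1000, success_b=465, n_b=1000)
print(f"B - A = {diff:.3f}, z = {z:.2f}, p = {p:.3f}")
```

If the difference is positive and unlikely to be due to chance, version B would become the new status quo in this sketch, and the next iteration would be compared against it.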
If you can use the same impact outcomes (as in your impact evaluation) to assess the relative effectiveness of A/B versions, great. Any incremental improvements can be added to your original impact estimate to get a rough sense of how much these iterations are improving impact. Of course, assuming that impacts are additive comes with some non-trivial assumptions, discussed below.
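As a rough illustration of the additive bookkeeping described above, the sketch below sums a baseline impact estimate with the increments from adopted A/B versions. It assumes, strongly, that effects are additive and that the estimates are statistically independent; the function name and all numbers are hypothetical.

```python
import math


def cumulative_impact(baseline, increments):
    """Add A/B-test increments to a baseline impact estimate.

    `baseline` and each item in `increments` are (estimate, standard_error)
    pairs in the same outcome units. Assumes additive effects and
    independent estimates, which may not hold in practice.
    """
    total = baseline[0] + sum(est for est, _ in increments)
    # Under independence, variances add, so standard errors combine in quadrature.
    variance = baseline[1] ** 2 + sum(se ** 2 for _, se in increments)
    return total, math.sqrt(variance)


# Hypothetical example: baseline impact from the impact evaluation,
# plus gains from two adopted product iterations.
total, se = cumulative_impact((5.0, 1.5), [(0.8, 0.5), (0.4, 0.6)])
print(f"rough cumulative impact: {total:.1f} (SE {se:.2f})")
```

Note that the combined standard error grows with every increment added, which is one way to see why the running estimate becomes less trustworthy the further it drifts from the original evaluation.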
Sometimes, measuring final outcomes in A/B tests isn’t practical—they may take too long to materialize or require expensive independent data collection (the original concern with impact evaluations motivating this post). In these cases, you may need proxy outcomes: metrics that are highly correlated with your impact outcome, measured earlier in your theory of change, and ideally captured automatically by your tech solution. These are often referred to as “mediators” when they explain some of the impact pathway between intervention and outcome, and “surrogates” when they (combined into an index) can fully predict the impact outcome.
The key is identifying proxies that are theoretically linked to your final outcome and, ideally, empirically validated through your impact evaluation. For instance, the frequency and timing of AI tutor usage might correlate with test score improvements, or the types of questions asked of an AI business coach might predict how impactful the AI business coach is. Because you need to test correlations of proxies with final impact outcomes, they should be identified and measured before and alongside your impact evaluation.
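One simple first check on a candidate proxy is its correlation with the final outcome in your impact evaluation data. The sketch below computes a Pearson correlation by hand; the variable names and data are hypothetical, and a high correlation alone does not establish that a proxy is a valid mediator or surrogate.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between a proxy (xs) and a final outcome (ys)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-student data collected during the impact evaluation.
weekly_tutor_sessions = [1, 2, 2, 3, 4, 5, 5, 6]
test_score_gain = [2.0, 3.5, 3.0, 4.0, 6.5, 6.0, 7.5, 8.0]
print(f"proxy-outcome correlation: {pearson_r(weekly_tutor_sessions, test_score_gain):.2f}")
```

In practice you would want to check this relationship within the treatment group and confirm that shifts in the proxy track shifts in the outcome, not just that the two move together across users.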
With proxy outcomes, you may not capture the full magnitude of program impact, but you can at least determine whether modifications are moving impact in the right direction compared to your baseline version (assuming there aren’t other unmeasured “mediators” driving impact downwards—again, a strong assumption, along with other assumptions discussed below).
At some point, you’ll need to question whether results from your original impact evaluation are still informative, even after accounting for improvements measured through interim A/B tests.
Drawing inspiration from the generalizability puzzle framework, ask yourself: Has enough changed about (i) the baseline conditions (or, in this case, your counterfactual), (ii) the implementation (or in this case, the program or product design), or (iii) user behavioral responses, such that your original findings may no longer hold? More concretely, (i) are you now serving a fundamentally different population, or has the non-AI status quo changed? For example, have learning levels improved for the whole population, not just your EdTech users, since the last evaluation? Perhaps because of some simultaneous education reforms? Or (ii) has the product changed considerably since it was last evaluated, perhaps, increasing in scope or complexity? For example, did your original program just deliver math tutoring, and now it includes science, or incorporates a community chat feature? Or (iii) has the novelty of the product worn off, changing how users engage? For example, have users figured out how to game the tests without actually learning the content?
If you conclude that none of the above are true, you can continue with the prior step. However, this is the strong assumption we’ve been hinting at. And the further out you are from the original impact evaluation, the more tenuous this assumption becomes.
If significant changes have occurred in any of these areas (the context, the users, or the product), it’s probably time for a fresh impact evaluation with a pure control group. And then you repeat the steps above (including reassessing, before A/B testing, whether (a) the AI model works as intended and (b) the product using the AI model functions properly).
As GenAI tools enter classrooms, clinics, and small businesses worldwide, we need evidence-informed approaches that balance rigor with agility. Impact evaluations remain essential, but they’re most valuable when thoughtfully timed and complemented by ongoing monitoring through A/B testing and proxy indicators.
The goal isn’t to slow down innovation, but to ensure that when we scale AI solutions, we’re scaling interventions that truly make a difference in people’s lives. The stakes are too high, and the potential too great, to do otherwise.