Picture credits: Poltu Shyamal from Getty Images
The recent AI evaluation framework proposed by the Center for Global Development, J-PAL, and The Agency Fund offers a lucid roadmap for funders, policymakers, and implementers to systematically assess whether AI tools work. It clarifies what to test at each stage of AI product maturity through four distinct “levels” of evaluation.
The framework is anchored to a “user funnel”, a tool originally used by tech product teams to understand how users move from initial exposure to active use, retention, and behavior change.
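As a rough illustration of how such a funnel is typically read, here is a minimal sketch with entirely made-up numbers (none of these stages' counts come from the framework or any real deployment):

```python
# Minimal user-funnel sketch. Stage names and counts are illustrative
# assumptions, not data from the framework or any real product.
funnel = [
    ("exposed", 10_000),          # offered or shown the tool
    ("activated", 4_000),         # tried it at least once
    ("retained", 1_500),          # still using it weeks later
    ("behavior_change", 600),     # observed change in practice
]

previous = None
for stage, count in funnel:
    if previous:
        print(f"{stage:16s} {count:6d}  ({count / previous:.0%} of previous stage)")
    else:
        print(f"{stage:16s} {count:6d}")
    previous = count
```

The point of the sketch is simply that a funnel measures drop-off among users; the argument below is that some failure modes never show up in these numbers at all.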
However, the user journey is only part of the story in even simple interventions, and a smaller part of the story in more complex systems. What’s missing is a systematic look at how the tool interacts with the wider context in which it is deployed.
There are three cases where we need to interrogate processes beyond the user funnel.
A purely user-focused evaluation framework that fails to interrogate these broader factors can be misleading: an otherwise strong product might deliver null or even negative impacts in a Level 4 trial, and Levels 1–3 will not have been designed to pick up why.
Implementation research asks: within the program delivery system, are the right actors doing the right things, at the right time, now that the AI tool is in place?
Crucially, this level is not about estimating counterfactual impact; it is about testing whether the AI product/application is implemented as planned within a programmatic ecosystem and ready for a fair impact test.
“Implementation research is the study of the ‘constellation of processes’ intended to get an intervention into use within a specific setting” (Damschroder et al. 2009)
Implementation research—often called implementation science—emerged as public health, education, and social protection sectors repeatedly found that effective interventions failed when introduced in real-world settings. Foundational work by Fixsen et al. (2005) and Damschroder et al. (2009) showed that this “voltage drop” often stems from how an intervention interacts with existing systems, incentives, and frontline actors. Frameworks such as CFIR (Damschroder et al. 2009), RE-AIM (Glasgow et al. 1999), and PARIHS (Kitson et al. 1998) formalized this insight by focusing on feasibility, fidelity, adoption, and system readiness.
Implementation science is not a new concept in the development sector; it is the hard-won lesson of decades of public health interventions. Oral Rehydration Therapy (ORT), for example, languished not because the salt-sugar solution was chemically flawed, but because of how it was delivered: a systematic review of ORT implementation in LMICs found that clinical norms favored IV drips, that mothers culturally prioritized “stopping” diarrhea over rehydration, and that “socio-cultural factors,” “lack of political commitment,” and “weak distribution systems” were the primary barriers to uptake, not the efficacy of the solution itself. Success ultimately required a massive pivot in delivery strategy: moving the intervention out of the clinic and into the community by training mothers and community health workers to prepare the solution themselves.
This lens is crucial for AI in development. Applying implementation science to AI helps determine whether the surrounding system is prepared to support the tool—before we judge its impact.
To see why this lens is needed for AI applications, consider two types of tools:
Consider an AI tool that helps midwives screen for high-risk pregnancies in rural clinics. In theory, the machine learning model classifies risk with decent accuracy → midwives refer more high-risk women to hospitals → survival rates improve.
But reducing maternal mortality requires more than changing the midwife’s behavior. Zoom out from the user-focused theory of change (the dark blue path in Figure 1), and you see a broader ecosystem: patient expectations, supervision norms, reporting burdens, transportation, and whether referral hospitals actually have the capacity to handle more high-risk patients flagged by an algorithm. Interrogating program implementation can track whether the necessary enablers are present (and prohibitive barriers absent), and therefore whether the final steps in the program’s theory of change can be realized.
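One back-of-the-envelope way to see why these system-level links matter: if the steps in the theory of change are treated as roughly independent, the share of flagged women who ultimately benefit is approximately the product of the success rates at each link, so a weak link outside the user funnel caps impact no matter how accurate the model is. The sketch below uses purely illustrative rates, not estimates from any real program.

```python
# Illustrative chain-of-links calculation. All rates are assumptions
# made up for the example, and the links are treated as independent.
links = [
    ("model correctly flags a high-risk case", 0.85),  # inside the user funnel
    ("midwife acts on the flag",               0.80),  # inside the user funnel
    ("woman can reach the referral hospital",  0.60),  # transport, cost, norms
    ("hospital has capacity to treat her",     0.50),  # system-level constraint
]

cumulative = 1.0
for link, rate in links:
    cumulative *= rate
    print(f"{link:42s} {rate:.0%}  cumulative: {cumulative:.0%}")

# With these made-up numbers, a strong model and compliant users still
# translate into benefits for only about a fifth of flagged cases,
# because of the two links that sit outside the user funnel.
```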
Likewise, for an AI tutor to help a student succeed, factors beyond the student’s user journey play an important role. Does the tool complement or displace other learning activities, a strong determinant of the effectiveness of EdTech interventions? The introduction of the tool may shift how teachers plan lessons, grade work, and interact with students in class, for good or ill. In more radical implementations, teachers’ roles are reinvented, shifting from direct instruction to motivating and coaching students while the AI delivers most of the curriculum. Teachers may struggle to play the role the intervention intends. So even if the students (the users) follow the script, the expected impacts may not be realized because other actors do not. Investigating program delivery can identify broken links in the program logic that a Level 3 evaluation may fail to pick up, before committing to a more rigorous Level 4 impact evaluation.
Implementation science provides the conceptual lens; process evaluation is an empirical tool to interrogate implementation dynamics. A process evaluation documents what happened during delivery and then compares it to what was expected, so we can understand where things aligned and where they did not.
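To make that concrete, here is a minimal, hypothetical sketch of what such a planned-versus-observed comparison might look like for the maternal-screening example above; the indicators, targets, and figures are all assumptions for illustration, not part of any actual evaluation.

```python
# Hypothetical process-evaluation check: planned vs. observed delivery
# indicators. All indicator names and numbers are illustrative.
indicators = [
    # (indicator, planned, observed)
    ("share of midwives trained on the tool",       0.95, 0.85),
    ("screenings logged per clinic per week",       20,   9),
    ("flagged cases referred to hospital",          0.90, 0.40),
    ("referral hospitals reporting spare capacity", 1.00, 0.55),
]

for name, planned, observed in indicators:
    ratio = observed / planned
    status = "on track" if ratio >= 0.8 else "implementation gap"
    print(f"{name:46s} planned={planned:>5} observed={observed:>5}  {status}")
```

Gaps like these point to where the delivery system, rather than the tool itself, needs attention before a Level 4 trial can give the intervention a fair test.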
If we want AI investments to translate into real improvements in learning, health, and livelihoods, we must treat implementation as a first-order question. Process evaluations and implementation science principles help determine whether the delivery system is ready to support the tool, so that impact evaluations can test the intervention fairly and results are interpreted accurately.