
Bringing implementation science to AI: The case for process evaluations in the AI evaluation framework

Picture credits: Poltu Shyamal from Getty Images

The recent AI evaluation framework proposed by the Center for Global Development, J-PAL, and The Agency Fund offers a lucid roadmap for funders, policymakers, and implementers to systematically assess whether AI tools work. It clarifies what to test at each stage of AI product maturity through four distinct “levels” of evaluation:

  • Model performance (level 1) tests whether the underlying model is accurate, reliable, and unbiased.
  • Product evaluations (level 2) examine whether users can access, understand, and meaningfully interact with the tool.
  • User evaluations (level 3) assess whether the tool actually shifts behaviors or intermediate outcomes as intended.
  • Impact evaluations (level 4) measure whether these changes translate into broader development outcomes at scale.

The framework is anchored to a “user funnel”, a tool originally used by tech product teams to understand how users move from initial exposure to active use, retention, and behavior change.
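
To make the funnel concrete, here is a minimal Python sketch of how a product team might compute stage-to-stage conversion from usage data. The stage names and counts below are hypothetical, not taken from the framework.

```python
# Illustrative user funnel; all stage names and counts are made up.
funnel = [
    ("exposed", 10_000),        # users who saw or were offered the tool
    ("activated", 4_000),       # completed a first meaningful use
    ("retained", 1_500),        # still active after four weeks
    ("behavior_change", 600),   # shifted the target behavior
]

# Conversion rate between each pair of adjacent stages.
for (stage, n), (next_stage, next_n) in zip(funnel, funnel[1:]):
    print(f"{stage} -> {next_stage}: {next_n / n:.0%}")
```

Drop-offs between stages point to where a level 2 or level 3 evaluation should dig in.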

However, the user journey is only part of the story in even simple interventions, and a smaller part of the story in more complex systems. What’s missing is a systematic look at how the tool interacts with the wider context in which it is deployed.

The case for implementation research: Uncovering the “black box” of program delivery

There are three cases where we need to interrogate processes beyond the user funnel:

  1. When the user is not the beneficiary. Tools for frontline workers (the users)—midwives, teachers, extension agents—aim to benefit someone else downstream: a pregnant woman, student, farmer, or patient. The frontline worker may use the tool flawlessly. But if the end beneficiary does not receive, understand, or act on the information, impact will not follow.
  2. When the tool directly or indirectly influences other program implementers to behave in a way that can impact outcomes. AI tools can shift the work of people who never touch the app, whether that’s an intended part of program design (e.g., teachers shifting from lecturing to coaching when a tutor app delivers core instruction) or an unintended effect (e.g., teachers becoming less motivated to teach a topic they believe the app has “taken over,” reducing effort or preparation). If routines, responsibilities, or incentives shift, these changes need to be understood and supported, or mitigated.
  3. When the tool is implemented in relation to other programmatic systems already in place. Digital tools rarely operate alone: they must either fit into existing systems (e.g., data or reporting systems) or transform them. For example, if an AI tool flags patients needing follow-up, that information must flow into existing health record systems and be used by the providers who see them. If data cannot be synced, matched to the right person, or acted on, outcomes do not improve, even when the tool itself is used correctly.
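
As a sketch of the integration problem in the third case: an AI follow-up flag only improves outcomes if it can be matched to a record in the existing health system and surfaced to a provider who can act on it. Everything below (the ID crosswalk, field names, and records) is hypothetical, purely to illustrate where flags can get lost.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Flag:
    patient_id: str  # identifier used by the AI tool
    reason: str

@dataclass
class HealthRecord:
    national_id: str                  # identifier used by the health system
    assigned_provider: Optional[str]  # None if no provider is on record

# Hypothetical crosswalk between the tool's IDs and the registry's IDs.
id_crosswalk = {"app-001": "NID-552", "app-002": "NID-873"}
registry = {
    "NID-552": HealthRecord("NID-552", assigned_provider="clinic_7"),
    "NID-873": HealthRecord("NID-873", assigned_provider=None),
}

def route_flag(flag: Flag) -> str:
    """Trace whether an AI flag can actually reach a provider."""
    national_id = id_crosswalk.get(flag.patient_id)
    if national_id is None:
        return "LOST: no ID match, flag never reaches the registry"
    record = registry[national_id]
    if record.assigned_provider is None:
        return "STALLED: matched to a record, but no provider to act on it"
    return f"DELIVERED to {record.assigned_provider}"

for pid in ("app-001", "app-002", "app-999"):
    print(pid, "->", route_flag(Flag(pid, "needs follow-up")))
```

Each failure mode (“LOST,” “STALLED”) is an implementation breakdown that a purely user-focused evaluation of the tool would never surface.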

A purely user-focused evaluation framework that overlooks these broader factors risks a scenario in which an otherwise strong product delivers null or even negative impacts in a Level 4 trial, and Levels 1–3 will not have been designed to pick up why.

Implementation research is established — AI should use it

Implementation research asks: given the AI tool, are the right actors in the program delivery system doing the right things at the right time?

Crucially, this level is not about estimating counterfactual impact; it is about testing whether the AI product/application is implemented as planned within a programmatic ecosystem and ready for a fair impact test.

“Implementation research is the study of the ‘constellation of processes’ intended to get an intervention into use within a specific setting” (Damschroder et al. 2009)

Implementation research—often called implementation science—emerged as public health, education, and social protection sectors repeatedly found that effective interventions failed when introduced in real-world settings. Foundational work by Fixsen et al. (2005) and Damschroder et al. (2009) showed that this “voltage drop” often stems from how an intervention interacts with existing systems, incentives, and frontline actors. Frameworks such as CFIR (Damschroder et al. 2009), RE-AIM (Glasgow et al. 1999), and PARIHS (Kitson et al. 1998) formalized this insight by focusing on feasibility, fidelity, adoption, and system readiness.

Implementation science is not a new concept in the development sector; it is the hard-won lesson of decades of public health interventions. Oral Rehydration Therapy (ORT), for example, languished not because the salt-sugar solution was chemically flawed. A systematic review of ORT implementation in LMICs found that clinical norms favored IV drips and that mothers culturally prioritized “stopping” diarrhea over rehydration; “socio-cultural factors,” “lack of political commitment,” and “weak distribution systems” were the primary barriers to uptake, not the efficacy of the solution itself. Success ultimately required a massive pivot in delivery strategy: moving the intervention out of the clinic and into the community by training mothers and community health workers to prepare the solution themselves.

This lens is crucial for AI in development. Applying implementation science to AI helps determine whether the surrounding system is prepared to support the tool—before we judge its impact.

Two examples of where the user journey is only part of the story

To see why this lens is needed for AI applications, consider two types of tools: 

Example 1: AI risk prediction for midwives

Consider an AI tool that helps midwives screen for high-risk pregnancies in rural clinics. In theory, machine learning models classify risk with decent accuracy → midwives refer more high-risk women to hospitals → survival rates improve.
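
A stylized version of that chain’s first link, using an invented logistic score and referral threshold (none of the coefficients come from a real model):

```python
import math

def risk_score(age: int, prior_complications: int, anemia: bool) -> float:
    """Toy logistic risk score; coefficients are invented for illustration."""
    z = -3.0 + 0.04 * age + 0.9 * prior_complications + 1.2 * anemia
    return 1 / (1 + math.exp(-z))

def recommend_referral(score: float, threshold: float = 0.5) -> bool:
    return score >= threshold

s = risk_score(age=34, prior_complications=2, anemia=True)
print(f"risk={s:.2f}, refer={recommend_referral(s)}")  # risk=0.80, refer=True
```

The code stops exactly where the theory of change does not: whether the referral is accepted, transport exists, and the hospital has capacity are all outside the model.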

But reducing maternal mortality requires more than changing the midwife’s behavior. Zoom out from the user-focused theory of change (the dark blue path in Figure 1), and you see a broader ecosystem: patient expectations, supervision norms, reporting burdens, transportation, and whether referral hospitals actually have the capacity to handle more high-risk patients flagged by an algorithm. Interrogating program implementation can track whether the necessary enablers (or prohibitive barriers) are present, and thus whether the final steps in a program’s theory of change can be realized.

Example 2: AI tutors 

Likewise, for an AI tutor for the student to succeed, factors beyond the student’s user journey can play an important role. Does the tool complement or displace other learning activities, a strong determinant of the effectiveness of EdTech interventions? The introduction of the tool may shift how teachers plan lessons, grade work, and interact with students in class, for good or ill. In more radical implementations, teachers’ roles are reinvented, shifting from direct instruction to motivating and coaching students while the AI delivers most of the curriculum. Teachers may struggle to play the role the intervention intends. So even if the students (the users) follow the script, the expected impacts may not be realized because other actors do not. Investigating program delivery can identify broken links in the program logic that a Level 3 evaluation may fail to pick up, before committing to a more rigorous Level 4 impact evaluation.

Using process evaluations to understand program delivery

Implementation science provides the conceptual lens; process evaluation is an empirical tool to interrogate implementation dynamics. A process evaluation documents what happened during delivery and then compares it to what was expected, so we can understand where things aligned and where they did not. It proceeds along the following lines: 

  • Start from a theory of change: Extend the usual product/user theory of change to include the other human actors who interact with the product and influence outcomes, as well as organizational structures (guidelines, financing flows, reporting requirements). Explicitly articulate what each actor must do differently for the AI intervention to work. Make sure to surface the assumptions that link each step: what needs to be true for inputs to lead to activities, for activities to generate outputs, and for outputs to translate into outcomes. These assumptions are often where the chain breaks.
  • Adopt an implementation framework: Established Implementation Science (IS) frameworks like CFIR (Consolidated Framework for Implementation Research) can be incorporated into a process evaluation to define and track assumptions and activities along the theory of change, using implementation outcomes such as: 
    • Appropriateness: Is the AI compatible with the existing workflows and culture? 
    • Feasibility: Can the intervention be successfully carried out with available resources (e.g., reliable internet, electricity, time)?
    • Fidelity: Is the intervention being delivered as intended? (e.g., are teachers using the AI for lesson planning or just as a distraction?)
    • Sustainability: Can the intervention continue once external support is reduced?
  • Use mixed methods: Combine routine and administrative data, structured process indicators, document review, qualitative methods (interviews, observations, focus groups), and quantitative methods (representative surveys) to interrogate weak links in the theory of change.
  • Iterate on program design (not just the model or product): Treat findings as input into redesigning training, supervision, accountability structures, and integration with existing systems, not only as reasons to tweak the model or interface.
  • Stage-gate impact evaluation: Make satisfactory implementation improvements an explicit precondition for proceeding to Level 4 impact evaluations. This protects scarce evaluation resources and reduces the risk of misattributing implementation failure to “AI not working.”
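
As a minimal sketch of the stage-gate step, assuming hypothetical implementation-outcome scores on a 0–1 scale and arbitrary thresholds a team might set for itself:

```python
# Hypothetical implementation-outcome scores (0-1) from a process evaluation.
scores = {
    "appropriateness": 0.8,
    "feasibility": 0.7,
    "fidelity": 0.4,        # e.g., teachers not using the tool as intended
    "sustainability": 0.6,
}

# Arbitrary minimum thresholds to clear before a Level 4 trial.
gates = {
    "appropriateness": 0.6,
    "feasibility": 0.6,
    "fidelity": 0.7,
    "sustainability": 0.5,
}

failing = [name for name, score in scores.items() if score < gates[name]]
if failing:
    print(f"Hold the Level 4 trial; strengthen implementation first: {failing}")
else:
    print("Proceed to the Level 4 impact evaluation")
```

The point is not the particular numbers but the discipline: implementation evidence is checked explicitly before impact evaluation resources are committed.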

Conclusion

If we want AI investments to translate into real improvements in learning, health, and livelihoods, we must treat implementation as a first-order question. Process evaluations and implementation science principles help determine whether the delivery system is ready to support the tool, so that impact evaluations can test the intervention fairly and results are interpreted accurately.