
How AI evaluation works in practice: Insights from implementers

Isha Fuletra and Suzin You, 25 February 2026

Image: ©Paula Bronstein/Getty Images/Images of Empowerment

AI systems are increasingly being used in public-facing settings: helping doctors triage patients, guiding users through government services, supporting legal processes, and answering questions on a wide range of topics at scale. As this shift happens, evaluation becomes urgent and unavoidable for organizations building, funding, and deploying AI.

At the same time, evaluation becomes a difficult and often messy problem. What does it really mean to “evaluate” an AI solution that is evolving quickly, embedded in everyday routines, used differently across settings and users, and expected to influence behavior in ways that are not easily measured – all while operating under tight time and resource constraints?

To help organizations answer this question, IDinsight, in collaboration with the Center for Global Development and The Agency Fund, and supported by the Gates Foundation, is developing a living AI evaluation playbook. Our end goal is to create a resource that helps social sector organizations evaluate their AI systems rigorously and sustainably for greater impact, grounded in the realities they face.

An early version of this playbook was developed by The Agency Fund, based on the 4-level AI evaluation framework they formulated together with J-PAL and the Center for Global Development. The framework defines four levels for evaluating an AI product for social impact: first the AI model’s outputs, then the product, then user affect and behavior change, and, finally, the development outcome in question.

We wanted to expand on this work by understanding how organizations are approaching AI evaluations in practice. With that goal, we interviewed practitioners building and deploying AI systems across health, social protection, justice, and behavior change. What emerged was not a single model of “good evaluation,” but a set of recurring patterns shaped by organizational mission, practical constraints, and the realities of implementation.

This blog summarizes what we have learned from them.

1. Evaluation begins with the most consequential question

Our early research shows that the playbook must evolve to support a ‘risk-first’ entry point, allowing teams to dive into their most pressing uncertainties rather than following a strictly linear path.

For some teams, the most consequential risk was safety – whether their tool could cause harm if it behaved unexpectedly. For others, it was desirability – whether users would engage with the system at all. At RightWalk Foundation, early AI evaluation centered on this latter question of desirability – whether users would meaningfully engage with an AI agent designed to make a government-run apprenticeship portal more accessible. Rather than optimizing for the most cost-effective model or agent persona upfront, the team focused on a basic but consequential question: did the system actually help users navigate a complex public service?

The case was similar for Cliniva, a women’s primary health care provider focused on long-term behavior change. Because they were trying to scale in-person support via a WhatsApp chatbot, the team skipped rigorous model output evaluations, built a prototype that worked well enough, and brought it straight to users to get their reactions. At that stage, early signals on user preference were more informative than deeper optimization of the underlying model’s performance.

This did not mean that model evaluation was absent. Most teams described running checks on underlying models alongside other forms of evaluation, with the depth and rigor shaped by risk, feasibility, and resource constraints. The result is a multi-pronged approach to evaluating AI solutions, rather than a strictly linear progression. The 4-level AI evaluation framework provided an initial lens for us, but we learned that practitioners focus on generating enough clarity, at the right time, to decide whether to continue, adapt, or pause.

Taken together, this suggests that evaluation often occurs when different parts of an AI system are at different stages of maturity. Organizations may be learning about user behavior or product performance while the underlying model is still being refined. In this context, the value of the broader evaluation ecosystem – methods, tools, and external evaluators – lies less in certifying that every layer has been fully assessed, and more in helping teams make sense of partial, imperfect signals and use them to guide decisions.

2. Product and user evaluation are deeply intertwined in practice

We learned that for the playbook to be truly practical, it must move toward a more integrated model where product performance and user behavior are viewed as two sides of the same coin. The questions practitioners ask tend to cut across both those layers at once.

Teams often report tracking indicators – whether users complete tasks, follow recommendations, return to the tool, or correct the AI – that simultaneously reflect how the product is performing and how users are engaging with it. These indicators are not easily attributable to one level alone. A drop-off, for example, could point to usability issues, model limitations, unclear content, or mismatched user expectations. As a result, evaluation questions often sit at the intersection of product behavior and user response, rather than neatly within one category.
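To make these indicators concrete, here is a minimal sketch of how such signals might be computed from an interaction log. The event fields and the three rates below are illustrative assumptions, not metrics any of the organizations we interviewed reported using.

```python
from collections import defaultdict

# Hypothetical interaction log: the field names (user_id, event, task_completed,
# corrected_ai) are illustrative assumptions, not a real schema.
events = [
    {"user_id": "u1", "event": "session_start", "task_completed": False, "corrected_ai": False},
    {"user_id": "u1", "event": "session_start", "task_completed": True,  "corrected_ai": False},
    {"user_id": "u2", "event": "session_start", "task_completed": False, "corrected_ai": True},
]

def engagement_indicators(events):
    """Share of users who completed a task, corrected the AI, or returned."""
    by_user = defaultdict(list)
    for e in events:
        by_user[e["user_id"]].append(e)

    n = len(by_user)
    completed = sum(any(e["task_completed"] for e in evs) for evs in by_user.values())
    corrected = sum(any(e["corrected_ai"] for e in evs) for evs in by_user.values())
    returned = sum(len(evs) > 1 for evs in by_user.values())  # more than one session
    return {
        "task_completion_rate": completed / n,
        "correction_rate": corrected / n,
        "return_rate": returned / n,
    }

print(engagement_indicators(events))
# e.g. {'task_completion_rate': 0.5, 'correction_rate': 0.5, 'return_rate': 0.5}
```

The point of the sketch is that the same log lines feed all three rates at once, which is why a single number rarely tells a team whether the product, the model, or user behavior is the issue.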

This was evident in how Pinky Promise, a women’s reproductive health service, described its evaluation approach. The team tracks outcomes such as medication adherence and symptom resolution. These measures capture whether users trust and follow guidance, whether the care pathway is functioning as intended, and whether their AI-supported solution is effective in practice. Internally, these are not treated as separate “user” versus “product” metrics; they are viewed as integrated signals of system performance.

Dalberg Data Insights described a similar perspective across multiple AI deployments, while emphasizing that evaluation must become more contextual at the user and product levels. Rather than isolating product evaluation from user evaluation, teams examined how people interacted with AI-enabled features within workflows. Moments where users hesitated, made manual corrections, developed workarounds, or disengaged became especially informative. These interactions revealed not only whether users engaged with the system, but whether product design, model behavior, and automated workflows fit the task, supported user judgment, and enabled meaningful changes in practice.

Across several such use cases, what stood out was that questions at the intersection of product behavior and user response were often the most informative. They revealed how the system worked in practice – across model performance, product design, and user behavior – rather than forcing signals into categories that mattered less for decision-making in small teams whose members wear many hats.

3. Domain experts are central to evaluating accuracy and safety

Across sectors, particularly in specialized domains such as medicine and law, teams emphasized that automated methods and tools alone are rarely sufficient to assess AI model accuracy, robustness, or safety. Instead, domain experts play a central role in defining what “good” looks like and in identifying unacceptable failures.

Intelehealth, a non-profit delivering high-quality healthcare via technology where there is no doctor, is developing an AI-enabled clinical decision support tool for telemedicine doctors. Until recently, physicians reviewed most AI-generated diagnoses and treatment recommendations; now, as more training data becomes available, the team is starting to use LLM judges trained on those human evaluations. The extent to which doctors need to edit or override the AI is treated as a practical signal of model quality and safety.

At Adalat AI, an organization specializing in AI- and LLM-driven solutions to tackle case backlogs and improve accuracy in legal proceedings, legal experts are responsible for curating the datasets that define the desired output and shape the AI’s behavior. These expert inputs become the reference points engineers rely on; without them, it is often unclear what the model should be optimizing for in practice.
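To illustrate what an LLM judge grounded in expert reference answers can look like, here is a minimal sketch. The model name, rubric, and example case are assumptions for illustration, not the actual pipeline of Intelehealth, Adalat AI, or any other organization mentioned here.

```python
# Minimal LLM-as-judge sketch (illustrative assumptions throughout).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are a clinical reviewer. Compare the AI recommendation
with the expert reference answer. Reply with a single digit:
1 = clinically equivalent and safe, 2 = minor issues, 3 = unsafe or incorrect.

Case summary: {case}
AI recommendation: {ai_answer}
Expert reference: {reference}"""

def judge(case: str, ai_answer: str, reference: str) -> int:
    """Grade one AI answer against an expert-curated reference answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(case=case, ai_answer=ai_answer, reference=reference),
        }],
    )
    return int(response.choices[0].message.content.strip()[0])

# Example: a score of 3 would be flagged for physician review.
score = judge(
    case="Adult patient, 3 days of fever and productive cough.",
    ai_answer="Start a broad-spectrum antibiotic without further assessment.",
    reference="Assess vitals and red-flag symptoms before deciding on antibiotics.",
)
print(score)
```

Even with such a judge in place, the teams we spoke with keep experts in the loop: how often doctors edit or override the AI remains the practical quality signal.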

Teams were candid about the limitations of this approach. Domain-expert review is slow, costly, and difficult to scale. Yet for systems operating in high-stakes, context-dependent environments, practitioners consistently described it as unavoidable.

4. Impact is evaluated through proxies that support learning in evolving systems

Across interviews, teams did describe measuring impact, but these efforts were typically lightweight and pragmatic. This reflects both practical constraints and the nature of AI systems themselves.

First, many intended impact goals – such as improved health outcomes, better access to justice, or livelihood gains – take time to materialize. Rather than waiting years for these outcomes to emerge, organizations often track intermediate indicators that plausibly sit on the pathway to impact. For instance, Intelehealth looks at consultation time, diagnostic accuracy, and medical appropriateness as early signals of clinical efficiency and effectiveness; Pinky Promise tracks medication adherence and symptom resolution; RightWalk focuses on whether users complete administrative workflows with less friction; and Adalat AI treats productivity gains as a proxy for downstream justice outcomes.

Second, because AI systems evolve continuously, evaluating them as fixed interventions over long time horizons is often impractical. Teams therefore rely more on frequent, directional signals that support ongoing judgment and iteration, even when those signals stop short of definitive impact attribution.

Finally, rigorous impact evaluations such as randomized controlled trials (RCTs) are time- and resource-intensive and require specialized expertise. Many organizations building AI solutions in social impact contexts are small or early-stage, and struggle to run multi-year studies alongside active product development.

Taken together, these accounts point to the need for outcome-focused evaluation approaches that can keep pace with fast-evolving AI systems, providing timely, directional signals to support learning and decision-making as tools change – a shift from traditional long-term impact studies.

Going forward

A central learning from these conversations so far is that AI evaluation in practice is shaped by trade-offs between what is at stake, what is feasible, and what needs to be learned next. We firmly believe that understanding these trade-offs is essential if evaluation methods, frameworks, and tools are to reflect how evaluation actually happens. 

This blog is the first in a series on AI evaluation in the field, with more insights to follow as we develop the AI Evaluation Playbook. We see this playbook as a practical, evolving resource – one that grows alongside field experience and technological advancements. As this work develops, ongoing input from practitioners remains central. We thank Adalat AI, Cliniva, Dalberg Data Insights, Digital Green, Intelehealth, Jacaranda Health, mDoc, Myna Mahila Foundation, Pinky Promise, Reach Digital Health, and RightWalk Foundation for their time and invaluable insights that have contributed to some of these findings.

If your organization is building or deploying AI for social impact and would like to share what you are learning, or contribute your experience as a ‘case study’ for other organizations, we would welcome the conversation. You can reach out to Sid Ravinutala at sid.ravinutala@idinsight.org or Isha Fuletra at isha.fuletra@idinsight.org.