Improving equity in AI for estimating household incomes

A community health worker collecting data ©IDinsight

Decision-makers’ challenge

Governments around the world face the complex task of designing policies and allocating limited resources to meet the essential needs of their populations. Many are increasingly using data science and AI to support decisions such as how to target social programs and deliver services more efficiently. While these tools have the potential to improve cost-effectiveness and promote equity, their success depends heavily on the quality and representativeness of the underlying data. Models built on outdated surveys or incomplete administrative records often miss key populations. When such models are used to make real-world decisions, they can lead to biased outcomes that exclude or misclassify the very groups they aim to support.

In one Eastern African country, the government is rolling out a new system to expand access to healthcare. The system is designed such that all households contribute to a health insurance fund based on their financial capacity, and the resulting fund can then be used to subsidize healthcare for low-income households. To implement this, a Proxy Means Test (PMT) model was developed using national household survey data to estimate incomes. However, the available data had several gaps. It was somewhat outdated, did not adequately represent the poorest regions, and struggled to distinguish income levels across different groups. These limitations raised concerns that the model could overestimate the incomes of low-income families, putting them at risk of unaffordable contributions.

Fig 1: Original model – Predict household income directly
Fig 1.5: Original model – Overestimates incomes for low-income households (Note: all the households in this dataset are low-income)

Impact opportunity

IDinsight’s Data Science team and DataDelta team partnered with the government to strengthen the system’s equity and cost-effectiveness. Our work focused on improving how household financial capacity is assessed, with a particular focus on protecting the most vulnerable. We recommended new modeling techniques, gathered fresh data from underrepresented regions, and created a system for households to contest their estimated contributions. Together, these efforts have helped lay the foundation for a more inclusive and accountable health financing system.

Our approach

Fig 2: New model – Classify by income, supplement with additional high-quality data and an appeals mechanism
Fig 2.5: New model – Performs better at classifying households in the right income bracket (Note: all the households in this dataset are low-income)

Two-Step Model Design

At the core of our work was the development of a two-step model (Fig 2) to estimate household income more accurately and fairly. In the first step, households are grouped into broad income categories (low, middle, and high) based on a select set of indicators. In the second step, separate models are applied within each group to estimate income levels more precisely. This design gave us finer control in selecting indicators that were informative for each income group. Crucially, it also allowed us to tailor the low-income group model for greater equity, without drastically affecting performance for the middle- or high-income groups.
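
To make the two-step structure concrete, here is a minimal sketch in Python. It is illustrative only and is not the model built for this project: the indicator names (`rooms`, `owns_vehicle`, `has_electricity`, `livestock`), the asset-score thresholds, and every coefficient are hypothetical placeholders, and a real first step would be a trained classifier rather than a hand-written rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# A household is a mapping from indicator name to value.
# All indicator names below are hypothetical placeholders.
Household = Dict[str, float]


def classify_group(hh: Household) -> str:
    """Step 1: place a household in a broad income group.

    Toy rule on a made-up asset score; a real first step would be a
    trained classifier.
    """
    score = hh["rooms"] + 2 * hh["owns_vehicle"] + hh["has_electricity"]
    if score < 3:
        return "low"
    if score < 6:
        return "middle"
    return "high"


# Step 2: one estimator per group, each free to use its own indicators.
# Coefficients are invented for illustration only.
GROUP_ESTIMATORS: Dict[str, Callable[[Household], float]] = {
    "low": lambda hh: 40 + 10 * hh["rooms"] + 15 * hh["livestock"],
    "middle": lambda hh: 150 + 30 * hh["rooms"] + 80 * hh["owns_vehicle"],
    "high": lambda hh: 500 + 60 * hh["rooms"] + 200 * hh["owns_vehicle"],
}


@dataclass
class TwoStepModel:
    classify: Callable[[Household], str]
    estimators: Dict[str, Callable[[Household], float]]

    def predict(self, hh: Household) -> Tuple[str, float]:
        group = self.classify(hh)            # step 1: broad income group
        income = self.estimators[group](hh)  # step 2: within-group estimate
        return group, income


model = TwoStepModel(classify_group, GROUP_ESTIMATORS)
example = {"rooms": 1, "owns_vehicle": 0, "has_electricity": 1, "livestock": 2}
group, income = model.predict(example)
print(group, income)
```

Because each group has its own estimator, the low-income estimator can be changed (for example, to use different indicators) without touching the other two, which is the property the design relies on.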

Our data collection effort was specifically designed to strengthen both steps of this model. First, we focused on improving how accurately households were placed into the correct income group, especially for those at the lower end of the income spectrum who were previously underrepresented in national data. Second, we collected richer data, with additional relevant indicators to better distinguish income levels within each group. By doing so, we addressed critical data gaps that often lead to misclassification—particularly of low-income households. 

Targeted Data Collection

To improve the model’s accuracy and fairness, we focused our data collection on low-income households that were missing or underrepresented in national datasets. Working with a local government agency, we selected areas known to have high poverty levels across 20 counties. Over a short three-week period, our teams collected data from over 2,000 households to strengthen the model’s ability to identify financial vulnerability.

We piloted a new pin-drop sampling method to efficiently select households within these localities. Local community health workers and trained local enumerators helped us locate and engage sampled households, ensuring better participation and stronger local trust.

Throughout the survey, we used SurveyStream-enabled real-time data quality checks and ongoing cleaning and analysis to improve the reliability of the training dataset. This helped ensure that the model was built on high-quality, ground-truthed data that reflects the realities of the country’s most vulnerable families.

Automated Corrections and Appeals

We incorporated rules into the model to automatically adjust contribution estimates in cases where certain indicators strongly suggested misclassification. For example, households receiving social assistance or led by someone retired or unemployed were flagged for reassessment. We also developed a process to support household appeals, making the system more responsive and citizen-friendly.
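
A rule layer of this kind could be sketched as follows. This is a hedged illustration, not the deployed logic: the field names (`receives_social_assistance`, `head_employment_status`) are hypothetical, the two rules simply mirror the examples mentioned above, and the choice to review only predictions above the low-income group is our assumption for the sketch. The sketch only flags households; how a flagged contribution is then adjusted is not specified here.

```python
from typing import Dict, List

# Hypothetical indicator names; the two rules mirror the examples in the text.
NON_EARNING_STATUSES = {"retired", "unemployed"}


def misclassification_flags(hh: Dict[str, object]) -> List[str]:
    """Return the reasons, if any, that suggest a household may be low-income."""
    reasons = []
    if hh.get("receives_social_assistance"):
        reasons.append("receives social assistance")
    if hh.get("head_employment_status") in NON_EARNING_STATUSES:
        reasons.append("household head is retired or unemployed")
    return reasons


def review_prediction(hh: Dict[str, object], predicted_group: str) -> Dict[str, object]:
    """Flag a prediction for reassessment when the rules contradict it.

    Assumption: only predictions above the "low" group need a second look.
    """
    reasons = misclassification_flags(hh) if predicted_group != "low" else []
    return {
        "predicted_group": predicted_group,
        "flag_for_reassessment": bool(reasons),
        "reasons": reasons,
    }


result = review_prediction(
    {"receives_social_assistance": True, "head_employment_status": "employed"},
    predicted_group="middle",
)
print(result)
```

Returning the reasons alongside the flag keeps the correction explainable, which also helps when a household later appeals its estimate.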

Inclusive AI

This work was part of IDinsight’s broader Inclusive AI initiative, a collaboration between our DataDelta and Data Science teams. Our goal is to build machine learning models that are equitable, transparent, and grounded in data that reflects the realities of the communities they serve. By combining rigorous model development with high-quality data collection, we ensure that tech-enabled decisions, such as setting health insurance contributions, are fair and decision-relevant.

The results

  • Developed a fairer and more flexible model for estimating household income.
  • Used new data and targeted adjustments to better identify and protect low-income households.
  • On the validation data we collected, the model, together with automated corrections and an appeals process, placed low-income households in the correct income group 95% of the time (see Fig 2.5).
  • Reduced the chances of low-income families being asked to contribute more than they could afford.
  • Helped design a more equitable system for setting health insurance contributions based on real financial capacity.
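
For readers who want to compute a figure like the share of low-income households placed in the correct group, the metric is the fraction of truly low-income households whose predicted group is also low. The sketch below shows that calculation under our own naming (`share_in_correct_group` is a hypothetical helper), using toy labels rather than project data.

```python
from typing import Sequence


def share_in_correct_group(
    true_groups: Sequence[str],
    predicted_groups: Sequence[str],
    group: str = "low",
) -> float:
    """Fraction of households truly in `group` that were predicted as `group`."""
    if len(true_groups) != len(predicted_groups):
        raise ValueError("true and predicted labels must have the same length")
    predictions_for_group = [
        p for t, p in zip(true_groups, predicted_groups) if t == group
    ]
    if not predictions_for_group:
        raise ValueError(f"no households with true group {group!r}")
    correct = sum(1 for p in predictions_for_group if p == group)
    return correct / len(predictions_for_group)


# Toy labels for illustration only, not project data.
true_groups = ["low"] * 5
predicted_groups = ["low", "low", "middle", "low", "low"]
print(share_in_correct_group(true_groups, predicted_groups))
```

When every household in the validation set is low-income, as noted for Fig 2.5, this share is the same as overall accuracy on that set.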