IDinsight previously led a workshop with Educate Girls’ team in India. ©Rob Sampson/IDinsight
We recently rebuilt our machine learning (ML) model for Educate Girls (EG). Part of the motivation was paying down some tech debt (fixing shortcuts we had taken), but the main reason was to incorporate the feedback and lessons on the model's performance from the last few years. We had also received new data from recent data collection efforts and surveys. In this post, I'll talk about the new model: its strengths, weaknesses, and what's next.
In a previous article, we wrote about how we used ML to reduce the operational costs of locating out-of-school girls for EG. I recommend reading that article for additional detail, but here's a quick recap: EG wants to identify villages with a large number of out-of-school girls, where it sets up support programs to help these girls enroll, stay in school, and learn. Finding villages with a high number of out-of-school girls used to be a lot of work: EG's field staff would go from village to village in a district, collecting data on the number of out-of-school girls in each. Using this data, they would come up with a list of villages to work in that maximized the number of girls they could support.
This is the process that EG used to follow:
In 2018 and 2019, we built a village-level ML model that used 2011 census data, the 2017 and 2018 DISE surveys, and 2017 district-level out-of-school data from ASER to predict the number of out-of-school girls in each village. By following the model's predictions, EG was able to reach between 50 and 100 percent more out-of-school girls for the same budget, which meant identifying more girls for the same cost and scaling its programs faster.
This is the new process that EG follows:
But a model is never perfect. As EG followed our team’s suggestions and collected actual on-the-ground data, we learned where the model was doing well and where it wasn’t. EG also went to some additional villages that were recommended by district government officials or that they heard about through word-of-mouth. With these, we had new insights into our model’s performance and additional data to refine our model.
We had built the original model as a proof of concept. Now, having established its value, we needed to set it up as a repeatable and extensible process. We wanted to be able to run the entire pipeline from ingestion of the latest data to visualizations of predictions for any new geography to which EG wishes to expand. Making changes to the model, like adding new features or new learners, should be simple. Model performance should also be easy to monitor.
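To give a sense of the shape we were going for, here is a minimal sketch of such a pipeline. The file names, columns, and feature logic are placeholders rather than EG's actual codebase; the point is that each stage is a plain function, so new features or learners slot in without touching the rest.

```python
# Illustrative pipeline shape: ingest -> build features -> score a new geography.
# File names, columns, and the feature logic are placeholders, not EG's codebase.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def ingest(census_path: str, dise_path: str) -> pd.DataFrame:
    """Merge the latest raw extracts into one village-level table."""
    census = pd.read_csv(census_path)
    dise = pd.read_csv(dise_path)
    return census.merge(dise, on="village_id")


def build_features(villages: pd.DataFrame) -> pd.DataFrame:
    """Keep all feature engineering in one place so new features are one-line additions."""
    villages = villages.copy()
    villages["literacy_gap"] = villages["male_literacy"] - villages["female_literacy"]
    return villages


def score_new_geography(train: pd.DataFrame, target: pd.DataFrame, feature_cols: list) -> pd.Series:
    """Fit on districts with outcome data and predict for every village in a new geography."""
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(train[feature_cols], train["oos_girls"])
    return pd.Series(model.predict(target[feature_cols]), index=target.index)
```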
While analyzing performance, we realized that the old model did worse at predicting out-of-school girls in geographies far away from the districts in the training set. This is what you might expect: ML models do well when the data to be scored comes from the same distribution as the training data. The greater the difference between the two distributions, the greater the loss in performance.
The districts where we have outcome data for 100% of the villages are shown in blue; the districts where we only have a small, biased sample of villages are shown in orange; and the districts where we need to predict out-of-school girls for each village are shown in green.
The geographic distance between these is quite large, and you'd expect there to be some covariate shift (see Box 1 below). Training models that are robust to such shifts is an area of active research; see the next steps for a discussion of possible approaches. One of the key ideas in the literature (Kuang et al., 2020) is to discover invariant relationships within the data: what relationships between the features and the outcome will be stable across all our geographies?
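One quick way to gauge how severe such a shift is (a standard diagnostic rather than something specific to this project) is a classifier two-sample test: train a classifier to tell training-district villages apart from target-district villages using only the features. If it barely beats chance, the distributions are similar; if it separates them easily, expect a bigger performance drop. A minimal sketch, with hypothetical data frames:

```python
# Rough covariate-shift diagnostic: can a classifier distinguish training
# villages from target villages using only the features? An AUC near 0.5
# suggests similar distributions; an AUC near 1.0 suggests substantial shift.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def shift_auc(train_features: pd.DataFrame, target_features: pd.DataFrame) -> float:
    X = pd.concat([train_features, target_features], ignore_index=True)
    y = np.r_[np.zeros(len(train_features)), np.ones(len(target_features))]  # 0 = train, 1 = target
    clf = GradientBoostingClassifier(random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```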
There are methods to discover these invariant relationships, which we'll discuss below. But given time constraints, we used a shortcut: we drew on subject matter experts to construct a number of simple models based on various causal theories, and then ensembled these to arrive at the final prediction.
This article by the United Nations Office for the Coordination of Humanitarian Affairs lists 20 reasons why kids might be out of school. All of these are true to varying degrees, and some are more salient in certain geographies. For a number of years, our team and Educate Girls' team in India have been working on the challenges of delivering quality education. We reached out to Educate Girls and IDinsight experts to understand the most salient factors that affect girls' enrollment. Here is what we collected:
For each of these theories, we constructed a model, trying weighted linear or binomial models (using statsmodels), hierarchical models (using sklmer), and ML models like Random Forests, XGBoost, or LASSO (using sklearn). We measured the performance of each model and kept those that had (a) a low mean cross-validation score and (b) low variation in the cross-validation scores.
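As a rough illustration of that screening step, the selection logic looks something like the sketch below. The feature groups, learners, and thresholds are hypothetical placeholders, not the actual child models.

```python
# Sketch of screening candidate child models: keep those whose cross-validation
# error is both low on average and stable across folds. Feature groups and
# thresholds are illustrative placeholders, not the real theories of change.
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score


def screen_child_models(X, y, candidates, max_mean_error, max_error_std):
    """candidates: dict of name -> (unfitted learner, feature columns)."""
    kept = {}
    for name, (learner, cols) in candidates.items():
        errors = -cross_val_score(learner, X[cols], y, cv=5,
                                  scoring="neg_mean_absolute_error")
        if errors.mean() < max_mean_error and errors.std() < max_error_std:
            kept[name] = (learner, cols)
    return kept


# Hypothetical candidates, one per causal theory.
candidates = {
    "school_access": (LinearRegression(), ["dist_to_school_km", "road_access"]),
    "household_poverty": (Lasso(alpha=0.1), ["asset_index", "female_literacy"]),
    "school_infrastructure": (RandomForestRegressor(n_estimators=300, random_state=0),
                              ["num_schools", "girls_toilets", "population"]),
}
```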
Once we had a set of weak but stable models, we blended these using another learner to produce our final outcome. Here’s what the architecture looks like:
The model has two tiers — child models and the parent model. Each of the child models only sees a subset of the features based on a theory of change. The parent model receives the predictions of each of these child models as features in addition to a subset of the features not used by any of the child models.
The simplest parent model is the “averager” — it takes the prediction of each of the child models and returns their average. This worked remarkably well. We gained only a small improvement in cross-validation score by using a Random Forest or XGBoost as the parent model.
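Here is a stripped-down sketch of the two-tier structure, assuming a dict of child models like the hypothetical one above. The averager parent is just a column-wise mean of the child predictions.

```python
# Two-tier sketch: each child sees only its own feature subset; the parent
# averages the child predictions. Out-of-fold predictions would be used if a
# learned parent (e.g. a Random Forest) replaced the simple average.
import numpy as np
from sklearn.model_selection import cross_val_predict


def fit_children(children, X, y):
    """children: dict of name -> (unfitted learner, feature columns)."""
    for learner, cols in children.values():
        learner.fit(X[cols], y)
    return children


def child_predictions(children, X):
    """Stack each child's predictions as one column."""
    return np.column_stack([learner.predict(X[cols])
                            for learner, cols in children.values()])


def averager_predict(children, X_new):
    """The simplest parent model: average the child predictions."""
    return child_predictions(children, X_new).mean(axis=1)


def parent_training_features(children, X, y):
    """Out-of-fold child predictions, for training a learned parent without leakage."""
    return np.column_stack([cross_val_predict(learner, X[cols], y, cv=5)
                            for learner, cols in children.values()])
```

A learned parent, like the Random Forest or XGBoost mentioned above, would instead be trained on the out-of-fold child predictions plus the features not used by any child.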
By using this new architecture, we were able to reduce our cross-validation score by almost 20%.
Why did the new architecture help? First, the performance of causal models tends to be more stable than that of correlation-based models across different environments. Though we are not doing any causal inference here to measure a treatment effect, each of the child models is based on a theoretical causal model of the world. A number of these have been well studied and shown to hold in certain settings. By using such stable models, we reduce the size of our error in new geographies.
Second, when you have a set of weak statistical learners, ensembling them has been shown to produce better performance. Sculley et al. had a gem of a paper in 2015 called "Hidden Technical Debt in Machine Learning Systems". I keep coming back to this paper when designing a new ML-based solution; there is just so much good advice packed into it. One of its suggestions for avoiding entanglement is to "isolate models and serve ensembles." Ensembles work best if the errors of the child models are uncorrelated. In our architecture, each of the child models only sees a small subset of the features, and there is little overlap between the columns seen by each model. Though this gives us no guarantees (the columns used by different child models may still be correlated), it reduces the likelihood of correlated errors across the child models.
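A quick way to sanity-check that (again, an illustrative sketch rather than part of the original write-up) is to look at how correlated the child models' out-of-fold errors are; if two children make the same mistakes, the ensemble gains little from keeping both.

```python
# Pairwise correlation of the child models' out-of-fold residuals. Low
# off-diagonal values are what make the ensemble worth more than its parts.
import pandas as pd
from sklearn.model_selection import cross_val_predict


def residual_correlations(children, X, y):
    """children: dict of name -> (unfitted learner, feature columns)."""
    residuals = {name: y - cross_val_predict(learner, X[cols], y, cv=5)
                 for name, (learner, cols) in children.items()}
    return pd.DataFrame(residuals).corr()
```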
Despite all this, the proof is in the pudding. We are still waiting on final on-the-ground results to measure performance and decide on the next steps to take.
In 2021, EG further expanded its operations in Uttar Pradesh. Not only are these districts far from the ones where EG has an existing footprint, they are also relatively more urban. And then there is the pandemic, whose effect on the landscape is hard to estimate. In all, there are legitimate concerns of substantial dataset shift, and the improvements we have made to the model do not give us guarantees or bounds on our error.
There is a good overview of methods in the introduction of Kuang et al. (2020). There are two major ways to deal with such dataset shifts: the first reweights the training data using the (unlabeled) features of the test set, so that the model focuses on training examples that look like the data it will score (Shimodaira, 2000; Huang et al., 2007; Sugiyama et al., 2008; Bickel et al., 2009); the second uses multiple training datasets, collected in different environments, to learn relationships that remain stable across environments (Peters et al., 2016; Kuang et al., 2020).
Fortunately, we have both — access to test set features and multiple datasets from each round of expansion. Unfortunately, these methods are complex to implement and computationally expensive. Experimentation and tuning will take substantial effort. We are waiting on the results of the latest round of data collection to see how our model performed before investing in these more robust approaches.
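To give a flavor of the first family of methods, here is a generic sketch of discriminative importance weighting in the spirit of Bickel et al. (2009), not our production code: each training village is reweighted by an estimate of how much it resembles the target geography.

```python
# Importance weighting sketch: weight each training village by an estimate of
# p_target(x) / p_train(x), obtained from a classifier that separates training
# rows from (unlabeled) target rows. A weighted learner then focuses on the
# parts of the training data that resemble the new geography.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestRegressor


def importance_weights(X_train, X_target, clip=10.0):
    X = np.vstack([X_train, X_target])
    domain = np.r_[np.zeros(len(X_train)), np.ones(len(X_target))]  # 1 = target
    clf = GradientBoostingClassifier(random_state=0).fit(X, domain)
    p_target = clf.predict_proba(X_train)[:, 1]
    ratio = p_target / np.clip(1.0 - p_target, 1e-6, None)
    return np.clip(ratio, None, clip)  # clip extreme weights for stability


def fit_weighted(X_train, y_train, X_target):
    weights = importance_weights(X_train, X_target)
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(X_train, y_train, sample_weight=weights)
    return model
```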
There are still millions of girls out of school in India. There is much work to be done. We will continue to invest in improving our model to support EG as they extend to new geographies and support a greater number of girls to go to school and learn. Stay posted for future posts on this work.
Anqi Liu and Brian Ziebart. Robust classification under sample selection bias. In Advances in neural information processing systems, pages 37–45, 2014.
Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.
Jiayuan Huang, Arthur Gretton, Karsten M Borgwardt, Bernhard Schölkopf, and Alex J Smola. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, pages 601–608, 2007.
Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947–1012, 2016.
Kun Kuang, Ruoxuan Xiong, Peng Cui, Susan Athey, and Bo Li. Stable prediction with model misspecification and agnostic distribution shift. In Proceedings of the AAAI Conference on Artificial Intelligence, 34(04), pages 4485–4492, 2020. https://doi.org/10.1609/aaai.v34i04.5876
Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems, pages 1433–1440, 2008.
Miroslav Dudík, Steven J Phillips, and Robert E Schapire. Correcting sample selection bias in maximum entropy density estimation. In Advances in neural information processing systems, pages 323–330, 2006.
Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10(Sep):2137–2155, 2009.