IDinsight previously led a workshop with Educate Girls’ team in India. ©Rob Sampson/IDinsight
We recently rebuilt our machine learning (ML) model for Educate Girls (EG). Part of the motivation was paying down some tech debt (fixing shortcuts we had taken), but the main reason was to incorporate the feedback and lessons on the model's performance from the last few years. We had also received new data from recent data collection efforts and surveys. In this post, I'll talk about the new model: its strengths, weaknesses, and what's next.
In a previous article, we wrote about how we used ML to reduce the operational costs of locating out-of-school girls for EG. I recommend reading that article for additional detail, but here's a quick recap: EG wants to identify villages with a large number of out-of-school girls, where it sets up support programs to help these girls enroll, stay in school, and learn. Finding villages with a high number of out-of-school girls used to be a lot of work: EG's field staff would go from village to village in a district, collecting data on the number of out-of-school girls in each. Using this data, they would come up with a list of villages to work in that maximized the number of girls they could support.
This is the process that EG used to follow:
In 2018 and 2019, we built a village-level ML model that used 2011 census data, the 2017 and 2018 DISE surveys, and 2017 district-level out-of-school data from ASER to predict the number of out-of-school girls in each village. By following the model's predictions, EG was able to reach between 50 and 100 percent more out-of-school girls for the same budget, which meant identifying more girls for the same cost and scaling its programs faster.
This is the new process that EG follows:
But a model is never perfect. As EG followed our team’s suggestions and collected actual on-the-ground data, we learned where the model was doing well and where it wasn’t. EG also went to some additional villages that were recommended by district government officials or that they heard about through word-of-mouth. With these, we had new insights into our model’s performance and additional data to refine our model.
We had built the original model as a proof of concept. Now, having established its value, we needed to set it up as a repeatable and extensible process. We wanted to be able to run the entire pipeline from ingestion of the latest data to visualizations of predictions for any new geography to which EG wishes to expand. Making changes to the model, like adding new features or new learners, should be simple. Model performance should also be easy to monitor.
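To give a sense of the shape we were going for, here is a minimal sketch of such a pipeline. The file names, columns, and feature logic are placeholders rather than EG's actual codebase; the point is that each stage is a plain function, so new features or learners slot in without touching the rest.

```python
# Illustrative pipeline shape: ingest -> build features -> score a new geography.
# File names, columns, and the feature logic are placeholders, not EG's codebase.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def ingest(census_path: str, dise_path: str) -> pd.DataFrame:
    """Merge the latest raw extracts into one village-level table."""
    census = pd.read_csv(census_path)
    dise = pd.read_csv(dise_path)
    return census.merge(dise, on="village_id")


def build_features(villages: pd.DataFrame) -> pd.DataFrame:
    """Keep all feature engineering in one place so new features are one-line additions."""
    villages = villages.copy()
    villages["literacy_gap"] = villages["male_literacy"] - villages["female_literacy"]
    return villages


def score_new_geography(train: pd.DataFrame, target: pd.DataFrame, feature_cols: list) -> pd.Series:
    """Fit on districts with outcome data and predict for every village in a new geography."""
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(train[feature_cols], train["oos_girls"])
    return pd.Series(model.predict(target[feature_cols]), index=target.index)
```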
While analyzing performance, we realized that the old model did worse at predicting out-of-school girls in geographies far away from the districts in the training set. This is what you might expect: ML models do well when the data to be scored comes from the same distribution as the training data. The greater the difference between the two distributions, the greater the loss in performance.
The districts where we have outcome data for 100% of the villages are shown in blue; the districts where we only have a small, biased sample of villages are shown in orange; and the districts where we need to predict out-of-school girls for each village are shown in green.
The geographic distance between these is quite large, and you'd expect there to be some covariate shift (see Box 1 below). Training models that are robust to such shifts is an area of active research; see the next steps for a discussion of possible approaches. One of the key ideas in the literature (Kuang et al., 2020) is to discover invariant relationships within the data: what relationships between the features and the outcome will be stable across all our geographies?
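One quick way to gauge how severe such a shift is (a standard diagnostic rather than something specific to this project) is a classifier two-sample test: train a classifier to tell training-district villages apart from target-district villages using only the features. If it barely beats chance, the distributions are similar; if it separates them easily, expect a bigger performance drop. A minimal sketch, with hypothetical data frames:

```python
# Rough covariate-shift diagnostic: can a classifier distinguish training
# villages from target villages using only the features? An AUC near 0.5
# suggests similar distributions; an AUC near 1.0 suggests substantial shift.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def shift_auc(train_features: pd.DataFrame, target_features: pd.DataFrame) -> float:
    X = pd.concat([train_features, target_features], ignore_index=True)
    y = np.r_[np.zeros(len(train_features)), np.ones(len(target_features))]  # 0 = train, 1 = target
    clf = GradientBoostingClassifier(random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```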
There are methods to discover these invariant relationships, which we'll discuss below. But given time constraints, we used a shortcut: we drew on subject matter experts to construct a number of simple models based on various causal theories, and then ensembled these to arrive at the final prediction.
This article by the United Nations Office for the Coordination of Humanitarian Affairs lists 20 reasons why kids might be out of school. All of these are true to varying degrees, and some are more salient in certain geographies. For a number of years, our team and Educate Girls' team in India have been working on the challenges of delivering quality education. We reached out to Educate Girls and IDinsight experts to understand the most salient factors that affect girls' enrollment. Here is what we collected:
For each of these theories, we constructed a model, trying weighted linear or binomial models (using statsmodels), hierarchical models (using sklmer), and ML models like Random Forests, XGBoost, or LASSO (using sklearn). We measured the performance of each model and kept those that had (a) a low mean cross-validation score and (b) low variation in the cross-validation scores.
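As a rough illustration of that screening step, the selection logic looks something like the sketch below. The feature groups, learners, and thresholds are hypothetical placeholders, not the actual child models.

```python
# Sketch of screening candidate child models: keep those whose cross-validation
# error is both low on average and stable across folds. Feature groups and
# thresholds are illustrative placeholders, not the real theories of change.
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score


def screen_child_models(X, y, candidates, max_mean_error, max_error_std):
    """candidates: dict of name -> (unfitted learner, feature columns)."""
    kept = {}
    for name, (learner, cols) in candidates.items():
        errors = -cross_val_score(learner, X[cols], y, cv=5,
                                  scoring="neg_mean_absolute_error")
        if errors.mean() < max_mean_error and errors.std() < max_error_std:
            kept[name] = (learner, cols)
    return kept


# Hypothetical candidates, one per causal theory.
candidates = {
    "school_access": (LinearRegression(), ["dist_to_school_km", "road_access"]),
    "household_poverty": (Lasso(alpha=0.1), ["asset_index", "female_literacy"]),
    "school_infrastructure": (RandomForestRegressor(n_estimators=300, random_state=0),
                              ["num_schools", "girls_toilets", "population"]),
}
```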
Once we had a set of weak but stable models, we blended these using another learner to produce our final outcome. Here’s what the architecture looks like:
The model has two tiers — child models and the parent model. Each of the child models only sees a subset of the features based on a theory of change. The parent model receives the predictions of each of these child models as features in addition to a subset of the features not used by any of the child models.
The simplest parent model is the “averager” — it takes the prediction of each of the child models and returns their average. This worked remarkably well. We gained only a small improvement in cross-validation score by using a Random Forest or XGBoost as the parent model.
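Here is a stripped-down sketch of the two-tier structure, assuming a dict of child models like the hypothetical one above. The averager parent is just a column-wise mean of the child predictions.

```python
# Two-tier sketch: each child sees only its own feature subset; the parent
# averages the child predictions. Out-of-fold predictions would be used if a
# learned parent (e.g. a Random Forest) replaced the simple average.
import numpy as np
from sklearn.model_selection import cross_val_predict


def fit_children(children, X, y):
    """children: dict of name -> (unfitted learner, feature columns)."""
    for learner, cols in children.values():
        learner.fit(X[cols], y)
    return children


def child_predictions(children, X):
    """Stack each child's predictions as one column."""
    return np.column_stack([learner.predict(X[cols])
                            for learner, cols in children.values()])


def averager_predict(children, X_new):
    """The simplest parent model: average the child predictions."""
    return child_predictions(children, X_new).mean(axis=1)


def parent_training_features(children, X, y):
    """Out-of-fold child predictions, for training a learned parent without leakage."""
    return np.column_stack([cross_val_predict(learner, X[cols], y, cv=5)
                            for learner, cols in children.values()])
```

A learned parent, like the Random Forest or XGBoost mentioned above, would instead be trained on the out-of-fold child predictions plus the features not used by any child.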
By using this new architecture, we were able to reduce our cross-validation score by almost 20%.
Why did the new architecture help? First, the performance of causal models tends to be more stable than that of correlation-based models across different environments. Though we are not doing any causal inference here to measure a treatment effect, each of the child models is based on a theoretical causal model of the world. A number of these have been well studied and shown to hold in certain settings. By using such stable models, we reduce the size of our error in new geographies.
Second, when you have a set of weak statistical learners, ensembling them has been shown to produce better performance. Sculley et al. had a gem of a paper in 2015 called "Hidden Technical Debt in Machine Learning Systems". I keep coming back to this paper when designing a new ML-based solution; there is just so much good advice packed into it. One of its suggestions for avoiding entanglement is to "isolate models and serve ensembles." Ensembles work best if the errors of the child models are uncorrelated. In our architecture, each of the child models only sees a small subset of the features, and there is little overlap between the columns seen by each model. Though this gives us no guarantees (the columns used by different child models may still be correlated), it reduces the likelihood of correlated errors across the child models.
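A quick way to sanity-check that (again, an illustrative sketch rather than part of the original write-up) is to look at how correlated the child models' out-of-fold errors are; if two children make the same mistakes, the ensemble gains little from keeping both.

```python
# Pairwise correlation of the child models' out-of-fold residuals. Low
# off-diagonal values are what make the ensemble worth more than its parts.
import pandas as pd
from sklearn.model_selection import cross_val_predict


def residual_correlations(children, X, y):
    """children: dict of name -> (unfitted learner, feature columns)."""
    residuals = {name: y - cross_val_predict(learner, X[cols], y, cv=5)
                 for name, (learner, cols) in children.items()}
    return pd.DataFrame(residuals).corr()
```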
Despite all this, the proof is in the pudding. We are still waiting on final on-the-ground results to measure performance and decide on the next steps to take.
In 2021, EG further expanded its operations in Uttar Pradesh. Not only are these districts far from the ones where EG has an existing footprint, they are also relatively more urban. And then there is the pandemic, whose effect on the landscape is hard to estimate. In all, there are legitimate concerns of substantial dataset shift, and the improvements we have made to the model do not give us guarantees or bounds on our error.
There is a good overview of methods in the introduction of Kuang et al. (2020). There are two major ways to deal with such dataset shifts: the first reweights the training data using the (unlabeled) features of the test set, so that the model focuses on training examples that look like the data it will score (Shimodaira, 2000; Huang et al., 2007; Sugiyama et al., 2008; Bickel et al., 2009); the second uses multiple training datasets, collected in different environments, to learn relationships that remain stable across environments (Peters et al., 2016; Kuang et al., 2020).
Fortunately, we have both — access to test set features and multiple datasets from each round of expansion. Unfortunately, these methods are complex to implement and computationally expensive. Experimentation and tuning will take substantial effort. We are waiting on the results of the latest round of data collection to see how our model performed before investing in these more robust approaches.
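To give a flavor of the first family of methods, here is a generic sketch of discriminative importance weighting in the spirit of Bickel et al. (2009), not our production code: each training village is reweighted by an estimate of how much it resembles the target geography.

```python
# Importance weighting sketch: weight each training village by an estimate of
# p_target(x) / p_train(x), obtained from a classifier that separates training
# rows from (unlabeled) target rows. A weighted learner then focuses on the
# parts of the training data that resemble the new geography.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestRegressor


def importance_weights(X_train, X_target, clip=10.0):
    X = np.vstack([X_train, X_target])
    domain = np.r_[np.zeros(len(X_train)), np.ones(len(X_target))]  # 1 = target
    clf = GradientBoostingClassifier(random_state=0).fit(X, domain)
    p_target = clf.predict_proba(X_train)[:, 1]
    ratio = p_target / np.clip(1.0 - p_target, 1e-6, None)
    return np.clip(ratio, None, clip)  # clip extreme weights for stability


def fit_weighted(X_train, y_train, X_target):
    weights = importance_weights(X_train, X_target)
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(X_train, y_train, sample_weight=weights)
    return model
```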
There are still millions of girls out of school in India. There is much work to be done. We will continue to invest in improving our model to support EG as they extend to new geographies and support a greater number of girls to go to school and learn. Stay posted for future posts on this work.
Anqi Liu and Brian Ziebart. Robust classification under sample selection bias. In Advances in neural information processing systems, pages 37–45, 2014.
Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.
Jiayuan Huang, Arthur Gretton, Karsten M Borgwardt, Bernhard Schölkopf, and Alex J Smola. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, pages 601–608, 2007.
Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947–1012, 2016.
Kun Kuang, Ruoxuan Xiong, Peng Cui, Susan Athey, and Bo Li. Stable prediction with model misspecification and agnostic distribution shift. In Proceedings of the AAAI Conference on Artificial Intelligence, 34(04), pages 4485–4492, 2020. https://doi.org/10.1609/aaai.v34i04.5876
Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems, pages 1433–1440, 2008.
Miroslav Dudík, Steven J Phillips, and Robert E Schapire. Correcting sample selection bias in maximum entropy density estimation. In Advances in neural information processing systems, pages 323–330, 2006.
Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10(Sep):2137–2155, 2009.