The reality behind a machine learning dataset – Part 2: Practical learnings for development, engineering, and data science

Victor Zhenyi Wang 24 November 2022

A surveyor mapping plot boundaries ©IDinsight

In The Open Society and its Enemies (1945), Karl Popper introduces “piecemeal social engineering,” his framework for building up social institutions incrementally, informed by experimentation and evidence. This contrasts with the more prevalent “utopian social engineering” of his time, which he criticized for lofty, abstract ideals that largely ignored practicality; indeed, today we might regard such methods as colonial and paternalistic. For Popper, the “piecemeal engineer knows, like Socrates, how little he knows. He knows that we can learn only from our mistakes.” 1

In that spirit, I want to begin with lessons we have learned in trying to apply engineering principles and methods to help our partners increase their social impact. I hope that our learnings can be helpful for others in the sector.


One of the biggest constraints we have experienced as data professionals working in the development sector is the additional degrees of separation between the work we do and the ultimate beneficiaries of that work. By its nature, data already abstracts and reduces the richness of full contextual information. In a model-building context, getting a good grasp of this contextual information is critical to building a useful model.

Originally, much of our work took a top-down approach guided by outputs from analysis, which is the most common framework for data science work. We would get some data and perform data analysis, followed by feature engineering, model selection, validation, and then production. Over time, we realized that this approach tended to create outputs that were misaligned with the theories of change employed by our partner organizations.

This has led to a more “product” minded approach to data science. Today, we start all projects by working closely with our partners to understand their theories of change, generate user stories, and then perform analysis relative to those user stories. 

For example, we worked with a political advocacy organization to help them match more people with social welfare benefits. Our initial focus was on improving a model we had developed for them in an earlier phase of the project. Through a series of workshops, however, we understood that the more pressing need for their organization, and the one with a much larger potential impact, was making their data systems more efficient. This led us to pivot away from our initial work in order to pursue work that was more aligned with our partner’s strategic vision and theories of change.

Cost Efficacy

Computing resources for data management and data science can be expensive. For an extreme example, estimates for OpenAI’s GPT-3’s training costs range from $4.6 million to $12 million per training cycle. This creates a power asymmetry (even within high-income countries) when it comes to equitable access to the latest machine learning methods as the costs for training and hosting (putting aside other associated costs for now) grow astronomical.

Unsurprisingly, many of our partners have expressed concern that substantial compute costs might overwhelm any benefits coming from our work. Empirically, we have found that costs associated with our deliverables usually make up a slim proportion of an organization’s ongoing costs. Even so, reducing ongoing and development costs makes our work more cost-effective, thus increasing our contribution to social impact.2

This means implementing best-in-class data science practices under tight resource constraints. For instance, it was standard practice in my previous job to spin up large virtual machines or other resources on demand to do everything from experimentation to running entire model pipelines to automatically training hundreds of models for selection. However, this practice would not be acceptable for our work at IDinsight. We have found the following set of practices to be helpful for us:

Optimize code to decrease memory and compute requirements

This is a common software engineering practice, but generally not with the intention of cost reduction. We have found substantial cost savings in creative ways to minimize memory usage or reduce code complexity, which can allow smaller compute resources without a reduction in performance. For instance, in one project in which we used word embeddings as part of a message-matching application, we found that normalizing the saved word vectors within the embedding ahead of time saved substantial computation time, thus obviating the need to size up our EC2 instance.
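As a minimal sketch of the pre-normalization idea (not the code from the project itself): if every stored vector is scaled to unit length once, when the embedding is saved, then cosine similarity at query time reduces to a plain dot product and no norms need to be recomputed per comparison. The toy vectors and the function names `normalize`, `prenormalize`, and `cosine_prenormalized` are assumptions for illustration only; a real application would likely use a vectorized numerical library.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def prenormalize(embedding):
    """Done once, ahead of time: normalize every stored word vector."""
    return {word: normalize(vec) for word, vec in embedding.items()}


def cosine_prenormalized(u, v):
    """For unit-length inputs, cosine similarity is just the dot product."""
    return sum(a * b for a, b in zip(u, v))


# Toy 3-dimensional embedding (hypothetical values, for illustration only).
embedding = {
    "ration": [3.0, 4.0, 0.0],
    "food": [6.0, 8.0, 0.0],
    "school": [0.0, 0.0, 2.0],
}
unit_embedding = prenormalize(embedding)  # saved alongside the model

# At query time there are no square roots or divisions per comparison.
sim_related = cosine_prenormalized(unit_embedding["ration"], unit_embedding["food"])
sim_unrelated = cosine_prenormalized(unit_embedding["ration"], unit_embedding["school"])
```

The saving comes from moving the norm computation out of the per-request path: it is paid once per stored vector instead of once per comparison.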

Rightsizing resources

It is well known that the main cloud vendors provide fairly generous free tiers. Often these are limited to specific resources for a fixed amount of time or a fixed number of GB per month. However, they do help reduce development costs considerably. For example, on AWS, instead of a SageMaker notebook mounted on an r4.xlarge, a t3.medium in the free tier might mean no additional development costs if there’s no need for a very powerful instance or for running the instance constantly!

In this regard, it is extra important to ensure that resources are appropriately sized for the task required. A very large and powerful database is just not required to store a fixed few thousand rows of data. In fact, in this case, it probably makes sense to go serverless.

Align solution architecture with use cases

Suppose we want to deploy an API to AWS for a mobile application that is used by a few hundred people, each making no more than 50 calls a day between 9am and 6pm on weekdays. We could host everything on a conventional EC2 instance. However, this would be fairly costly since we would be paying for server time even when no one is accessing the API. Provided the application does not need very low latency, we could think about a serverless solution such as a Lambda function. This would meet the requirements and cost substantially less.

In addition to minimizing costs, we have found that it is important to break down and model potential cost scenarios with our partners in order to come up with a solution architecture that makes sense for their budget and capacities. This exercise is useful because many cloud vendors are not immediately transparent with their pricing, and it is especially valuable when our partners have had negative experiences with surprising bills in the past.
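A cost scenario of this kind can be sketched in a few lines. The example below compares an always-on server with a pay-per-request service for the API workload described above. Everything numeric here is an assumption: 300 users stands in for “a few hundred”, 22 is an assumed number of weekdays per month, and the prices are round placeholders rather than quotes from any vendor. Real figures should come from the vendor’s current pricing pages.

```python
def monthly_requests(users, calls_per_user_per_day, active_days_per_month):
    """Upper bound on API calls per month for the described usage pattern."""
    return users * calls_per_user_per_day * active_days_per_month


def always_on_cost(hourly_rate, hours_per_month=730):
    """A conventional server is billed for every hour, used or not."""
    return hourly_rate * hours_per_month


def pay_per_request_cost(requests, price_per_million_requests, compute_cost_per_request):
    """A serverless function is billed per request plus per-request compute."""
    request_fee = requests / 1_000_000 * price_per_million_requests
    return request_fee + requests * compute_cost_per_request


# Scenario from the text: a few hundred users (300 assumed), at most
# 50 calls a day each, weekdays only (22 working days assumed).
requests = monthly_requests(users=300, calls_per_user_per_day=50, active_days_per_month=22)

# PLACEHOLDER prices for illustration only, not real vendor pricing.
server = always_on_cost(hourly_rate=0.05)
serverless = pay_per_request_cost(
    requests,
    price_per_million_requests=0.20,
    compute_cost_per_request=0.000002,
)

print(f"{requests:,} requests/month")
print(f"always-on server: {server:.2f} per month")
print(f"pay-per-request:  {serverless:.2f} per month")
```

Re-running the same functions with a partner’s own usage estimates and best- and worst-case prices gives a quick sense of how sensitive each architecture is to growth in traffic.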

Model Interpretability

This one is specific to data scientists. For most of our careers outside of the development sector, one of the most important goals of our modeling work was to minimize some error function in order to train the most performant model. However, in our experience, the most accurate model is usually not the best one for social impact. When considering model performance, we have found that other context-dependent factors are just as important as minimal errors.

We might consider two models indistinguishable provided both fall within an acceptable margin of error and both are useful for the task at hand, even if one nominally outperforms the other by a wide margin. As practitioners, however, we have seen that model performance by itself does not always translate to better decision-making or better outcomes for beneficiaries. In particular, if one model is more transparent, more explainable, and more interpretable,3 we ought to prefer that one even at the expense of greater error. Here are a couple of reasons why we believe this to be the case.

Change management

Model accuracy does not necessarily lead to better uptake in usage by our partners. A big part of our work in advising our partners lies in building trust in our methods and ensuring that those involved understand the model in order for the outputs to be useful. This means that some classes of models may be preferable to others depending on the context of our client organization. 

For example, we have found that most tree-based models are fairly intuitive for people who have a somewhat technical background (e.g. they understand linear regression), especially when paired with interpretation methods such as SHAP. As a result, when we have invested time in making sure our partners understand our modeling work and are comfortable interpreting the outputs themselves, we have noticed more trust in our work overall.

Bias, variance, and imperfect information 

Another reason why we might focus on interpretability and explainability is that we are frequently data constrained in our work. Training data is often expensive (especially if it is survey data), limited, outdated, and riddled with data quality issues, so it can be very easy to overfit. Consequently, issues such as covariate shift become serious concerns for our work.

As such, we have found that an easily interpretable model might be more suitable, as it can be easier to figure out what might have gone wrong. For example, we worked with an organization to predict the number of out-of-school girls in different regions in India. We found substantial variation in performance across districts and states, suggesting that the model did not generalize well. However, the model was a legible ensemble of sub-models that represented different aspects of theories of change for why girls might be out of school. Inspecting these sub-models suggested to us that we were likely missing covariates, which led to a round of model improvements.
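A check for this kind of regional variation can be sketched as a per-group error breakdown. The example below is an illustration under stated assumptions, not the analysis from the project: the district names, the counts, and the rule of flagging any group whose error exceeds twice the median group error are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def mae_by_group(records):
    """Mean absolute error per group from (group, actual, predicted) tuples."""
    errors = defaultdict(list)
    for group, actual, predicted in records:
        errors[group].append(abs(actual - predicted))
    return {group: sum(errs) / len(errs) for group, errs in errors.items()}


def flag_outliers(group_mae, ratio=2.0):
    """Flag groups whose error exceeds `ratio` times the median group error."""
    threshold = ratio * median(group_mae.values())
    return sorted(group for group, err in group_mae.items() if err > threshold)


# Hypothetical held-out predictions: (district, actual count, predicted count).
records = [
    ("district_a", 120, 110), ("district_a", 80, 90),
    ("district_b", 200, 195), ("district_b", 150, 160),
    ("district_c", 60, 140), ("district_c", 90, 30),
]

group_mae = mae_by_group(records)
flagged = flag_outliers(group_mae)

for district, err in sorted(group_mae.items()):
    print(f"{district}: MAE = {err:.1f}")
print("Investigate:", flagged)
```

A flagged group is only a prompt for investigation. With an interpretable model, the next step is to look at which component behaves differently there, for example to see whether a relevant covariate is missing.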

Finally, it may be worth investing in interpretation tools for our models specifically with users at our partner organizations in mind. For the above prediction project, we created a small application that mapped recommended villages to visit topographically. For another project involving matching messages to a database of topics, we created a tool that allowed less technical users to experiment with how our model would assign potential messages. Our experience in this area has led to our partners becoming more confident with our tools, which has greatly aided our collaborative efforts; for instance, partners have picked up potential covariate drift issues for us.

  1. Poverty of Historicism, p67
  2. Argument that cost-effectiveness increases social impact: Suppose an organization with N units of funding contributes X units of some notion of the “good” per unit of funding, and IDi work acts as a multiplier p > 1 at a cost of K. Compare two worlds, W1 and W2, where in W1 we work with the org and in W2 we do not: U(W1) = (N-K)*X*p and U(W2) = N*X, so our uplift = U(W1) - U(W2) = N*X*(p-1) - K*X*p. Thus reducing K increases uplift proportionally. If you reject the above, here is an alternate argument: 1) Two pieces of work have the same quality if they improve our partner’s social impact by the same amount. 2) Our partner organization can allocate excess funding efficiently. 3) If we can provide the same quality of work at a lower cost, then we have no reason not to prefer the cheaper option. 4) We ought to prefer the cheaper option as it generates more uplift.
  3. We acknowledge that attributes such as explainable, transparent, and interpretable are not necessarily intrinsic to specific types of models. For instance, there are many ways to make neural networks interpretable despite the perception that they are not. However, many such methods do not make models easier to interpret for non-technical audiences or in resource-constrained settings, e.g. if our partners do not have a data science team of their own.