Why IDinsight is investing in building data systems for primary data collection activities

Mitali Roy Mathur 25 February 2022

IDinsight is innovating to provide high-quality data to decision-makers efficiently and cost-effectively.

Survey of an Auxiliary Nursing Midwife (ANM) in Bahraich, Uttar Pradesh, India. ©Prabhat Sharma/IDinsight

I will never forget the day that I woke up to a flurry of missed calls and WhatsApp messages that all contained the same message: “it’s not working!” 

It was the first day of data collection for our first COVID-19 phone survey. My teammates and I had worked diligently to design, prepare, and launch a mobile phone survey to understand the impacts of COVID-19 on rural households in India. Policymakers needed to make informed decisions quickly about how to manage the pandemic and asked IDinsight for support. Given this urgency, we had only twelve days to field a survey with a representative sample of over 5,000 households across eight states in India.

I frantically opened my laptop to the Google Sheet tracker and saw my defeat: importrange internal error. The Google Sheet, for which we had spent hours defining formulas to manage assignments, productivity, and data quality, had crashed under the influx of data we received. This meant that surveyors did not know who to survey, and as a team, we were unable to monitor data quality and survey progress. Luckily, we were able to fix the error with a temporary solution: splitting our Google Sheets into smaller ones that could better process our data. Although we completed data collection in time, resolving the issue cost us considerable time and effort.

Thanks to IDinsight’s continued investment in data systems, I have not had this experience since.

The Need for Systems

Data on Demand (DoD) aims to accelerate the do-learn-improve cycle for policy changes and programmatic interventions by generating affordable, high-quality primary data at quicker speeds. By facilitating access to data “on demand,” DoD hopes to transform the way the social sector innovates, learns, and improves. In this way, DoD has the potential to affect policies that could improve millions of lives. 

One of our main activities on DoD is to conduct data collection at scale. The data collection process involves many moving pieces, and it is inefficient to start from square one for each survey. Furthermore, managing a remote surveyor workforce with purely manual systems requires a tremendous amount of effort, especially when management processes need to be informed by real-time data coming in from the field. To tackle these challenges, the DoD team has been collaborating with IDinsight’s Data Science, Engineering and Monitoring Systems (DSEM) team to design and create a platform called SurveyStream, which consists of automated systems that can be used to better manage all stages of the data collection process.

The chart below highlights the various tasks involved in managing primary data collection activities:

As you can imagine, executing these tasks on short timelines, at scale, and across different regions can be quite challenging, especially if each task is done manually.

Let’s take a deeper look into one of these task categories: assignments. The goal of this task is to ensure that surveyors know who to survey. While this may seem simple, there are a few considerations that complicate the task:

  1. How will you generate assignments? Will you randomly assign surveyors to households? What if surveyors have different language competencies? How can you optimize assignments so that surveyors do not need to travel long distances?
  2. How will you communicate assignments? How will you inform surveyors of who they should visit? Will you distribute printouts or send emails? If it is a survey in which assignments depend on past attempts (e.g., phone surveys), how will you communicate these dynamic assignments to surveyors?
  3. How will you dynamically update assignments? What do you do if a surveyor drops out midway through data collection? Who is responsible for reassigning their assignments to another surveyor? How will you inform the surveyor team of such updates so surveyors are aware of additional assignments? How can you shuffle around assignments to optimize productivity?
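
To make the first consideration concrete, here is a minimal sketch of one way assignments could be generated. It is an illustration only, not the method used by the DoD team: the function name, the record fields ("language", "district"), and the greedy "least-loaded eligible surveyor" rule are all assumptions made for this example.

```python
def assign_households(households, surveyors):
    """Greedy sketch: give each household to the least-loaded surveyor
    who speaks its language and works in its district.

    households: list of dicts with keys "id", "language", "district"
    surveyors:  list of dicts with keys "id", "languages" (a set), "district"
    Returns (assignments, unassigned): a dict of surveyor id -> household ids,
    and a list of household ids for which no surveyor is eligible.
    """
    assignments = {s["id"]: [] for s in surveyors}
    unassigned = []
    for hh in households:
        eligible = [
            s for s in surveyors
            if hh["language"] in s["languages"] and s["district"] == hh["district"]
        ]
        if not eligible:
            unassigned.append(hh["id"])
            continue
        # Fewest households so far wins; ties are broken by surveyor id
        # so that the result is reproducible.
        chosen = min(eligible, key=lambda s: (len(assignments[s["id"]]), s["id"]))
        assignments[chosen["id"]].append(hh["id"])
    return assignments, unassigned


# Hypothetical example data.
surveyors = [
    {"id": "S1", "languages": {"Hindi"}, "district": "Bahraich"},
    {"id": "S2", "languages": {"Hindi", "Bhojpuri"}, "district": "Bahraich"},
]
households = [
    {"id": "H1", "language": "Hindi", "district": "Bahraich"},
    {"id": "H2", "language": "Bhojpuri", "district": "Bahraich"},
    {"id": "H3", "language": "Hindi", "district": "Bahraich"},
    {"id": "H4", "language": "Tamil", "district": "Bahraich"},
]
assignments, unassigned = assign_households(households, surveyors)
print(assignments)
print(unassigned)
```

Even this toy version surfaces the questions above: households with no eligible surveyor need a human decision, and a greedy rule balances workload only roughly and ignores travel distance entirely.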

If every task is executed individually and manually, some surveyors may well not receive their assignments on time. For example, a surveyor may not get their assignment sheet, or may find that the households they are meant to visit are too far away. A surveyor may also drop out without anyone reassigning their households to someone else, so that certain respondents are never surveyed.

When breaking down each of the other tasks involved in surveyor and data collection management, a parallel set of challenges emerges.

SurveyStream: Data Systems for Primary Data Collection

To execute multiple data collection exercises efficiently while also innovating to make data collection faster, cheaper, and of higher quality, we have been collaborating with the DSEM team for the past two years to create automated systems, which together form a platform we call SurveyStream. These systems are meant to standardize and automate the surveyor management and data collection tasks described above. Systems can be designed to fit each project’s needs: they can be as simple as a Google Sheet or as complex as a customized web application.[1] On the DoD team, we have found that system requirements depend on the scope of the data collection exercise. For smaller pilots, we have successfully managed processes using Google Sheets, Google Data Studio, and Google Forms, but for large-scale surveys, we have found the need to invest in more technologically advanced systems to handle each task and manage large volumes of data.

The table below summarizes a select few of the SurveyStream systems our teams have created to improve the efficiency and ease of primary data collection tasks:

Data Quality Calculations: For more details on the data quality checks conducted by the Data on Demand team, please see the descriptions here: https://www.idinsight.org/article/5-steps-to-constructing-a-composite-data-quality-index-to-assess-overall-surveyor-performance/

All raw survey data fed into the above systems is extracted from the survey platform we use, SurveyCTO, using an API.[2][3] Our data is stored on a secure cloud-based server, which is safer than project teammates keeping files locally on their computers. These systems are integrated through a shared database and data pipeline back end, called SurveyStream. In the future, we hope to integrate SurveyStream features such as the data quality and assignment systems into the web application so that they can be configured and monitored on a more user-friendly platform than rudimentary interfaces like Google Sheets.
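
As a rough illustration of the "shared database" idea, the sketch below loads survey submissions into a single table that other systems could read from. It is not SurveyStream's actual pipeline: it uses a local SQLite database in place of a cloud server, omits the step that pulls data from the SurveyCTO API, and assumes a hypothetical table layout. The KEY and CompletionDate names are meant to mirror SurveyCTO's standard export fields; surveyor_id and consent are made-up form fields.

```python
import sqlite3


def load_submissions(conn, rows):
    """Insert submissions into a shared table, replacing any row that has
    the same KEY so that repeated pulls do not create duplicates."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS submissions (
               key TEXT PRIMARY KEY,
               surveyor_id TEXT,
               completion_date TEXT,
               consent TEXT
           )"""
    )
    conn.executemany(
        """INSERT OR REPLACE INTO submissions
               (key, surveyor_id, completion_date, consent)
           VALUES (:KEY, :surveyor_id, :CompletionDate, :consent)""",
        rows,
    )
    conn.commit()


# Hypothetical rows, shaped like records already fetched from the survey platform.
first_pull = [
    {"KEY": "uuid:a1", "surveyor_id": "S1",
     "CompletionDate": "2022-02-01 10:15:00", "consent": "yes"},
    {"KEY": "uuid:b2", "surveyor_id": "S2",
     "CompletionDate": "2022-02-01 11:40:00", "consent": "no"},
]
# A later pull that overlaps with the first one.
second_pull = [first_pull[1]]

conn = sqlite3.connect(":memory:")
load_submissions(conn, first_pull)
load_submissions(conn, second_pull)
row_count = conn.execute("SELECT COUNT(*) FROM submissions").fetchone()[0]
print(row_count)
```

Because every downstream system reads from the same table, a productivity tracker and an assignment system would see the same set of submissions without anyone downloading raw files to a laptop.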

Let’s return to the example above on the tasks involved in sharing assignments with surveyors so that they know who to survey. With the Productivity Tracker, Web App, and Email Assignments systems described above, many of the challenges are resolved.

  1. How will you generate assignments? While project teams still need to match surveyors to targets, the details on each surveyor hired for the project are standardized, enabling IDinsight teammates to easily pair surveyors to targets by matching variables such as location and language. The web application provides a nice interface to allow surveyor supervisors to adjust these assignments. 
  2. How will you communicate assignments? Instead of printing PDFs (which might change if surveyors drop out), updating Google Sheets and expecting surveyors to follow along, or manually sending messages to each surveyor, the email assignment system automatically sends emails to surveyors with an updated list of targets at an adjustable frequency. These tables take into account any reassignments made on the web application and remove targets whose respondents have completed or refused the survey (as calculated in the productivity tracker). In this way, surveyors do not need to manually keep track of who they visited and who is still left to visit, so there is less chance that they will “miss” speaking to someone.
  3. How will you dynamically update assignments? Supervisors of surveyors can edit assignments on the web application if a surveyor drops out or if they want to optimize productivity. The changes made here will be reflected in the email assignments sent to surveyors.
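
The logic described in points 2 and 3 can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the SurveyStream implementation: the function names, the status labels ("completed", "refused"), and the email wording are hypothetical, and actually sending the message (for example over SMTP) is left out.

```python
from email.message import EmailMessage

# Statuses after which a target no longer needs a visit or call (assumed labels).
DONE_STATUSES = {"completed", "refused"}


def reassign(assignments, dropped, replacement):
    """Move every target of a surveyor who dropped out to a replacement."""
    updated = {s: list(t) for s, t in assignments.items() if s != dropped}
    updated.setdefault(replacement, []).extend(assignments.get(dropped, []))
    return updated


def remaining_targets(assignments, status_by_target):
    """Keep only targets that are neither completed nor refused."""
    return {
        surveyor: [t for t in targets if status_by_target.get(t) not in DONE_STATUSES]
        for surveyor, targets in assignments.items()
    }


def build_email(surveyor_email, targets):
    """Compose (but do not send) an assignment email for one surveyor."""
    msg = EmailMessage()
    msg["To"] = surveyor_email
    msg["Subject"] = f"Your survey assignments ({len(targets)} remaining)"
    msg.set_content("\n".join(targets) if targets else "No pending assignments.")
    return msg


# Hypothetical example: S2 drops out and S1 takes over; H1 is already done.
assignments = {"S1": ["H1", "H3"], "S2": ["H2"]}
status_by_target = {"H1": "completed", "H2": "pending"}

current = remaining_targets(reassign(assignments, "S2", "S1"), status_by_target)
message = build_email("s1@example.org", current["S1"])
print(current)
print(message["Subject"])
```

Running these steps on a schedule is what keeps each email current: the list is always rebuilt from the latest reassignments and statuses, so nobody has to maintain it by hand.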

We have found that the systems built for assignments reduce the time spent on assignment-related tasks by ~50%. As a result, IDinsight teammates can direct their time towards other aspects of the project or other innovation-related work.

The beauty of each automated solution lies in its replicability, flexibility, accuracy, security, and scalability. 

  • Replicability: Instead of creating new spreadsheets or writing new lines of code for each data collection exercise, we can simply “reuse” existing systems built for previous rounds. 
  • Flexibility: Each system is configurable by teammates based on the context of each survey. This enables systems to adapt to the specifics of each survey (e.g., the frequency of data quality checks or the unique survey form identifier).
  • Accuracy: All systems have been tested using previous survey data, and are significantly less error-prone than manual tasks done in a rush during peak periods of data collection.
  • Security: Manual methods of survey management require project team members to routinely download raw survey data onto their laptops in order to run analysis and generate information on quality and productivity. A significant benefit of moving to an automated system is that sensitive data is stored and managed in a secure cloud-based environment, and the majority of users only interact with outputs that contain less-sensitive metadata that is needed for survey management. Using these more sophisticated systems, access permissions for viewing sensitive information can be more tightly controlled and monitored as well.
  • Scalability: For large-scale surveys, manual solutions are inefficient: Google Sheets has cell count limits and data downloads can be slow. A data system built on the cloud can store and process data in a highly scalable way, even auto-scaling resources to ensure project teams can access the information they need when load on the system spikes.
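
To illustrate the replicability and flexibility points, a per-survey configuration might look something like the sketch below. The class and field names are hypothetical and are not taken from SurveyStream; the idea is only that the same code can be reused across surveys while a small set of settings changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurveyConfig:
    """Settings that vary by survey; everything else is shared code."""
    form_id: str                               # unique survey form identifier
    dq_check_interval_hours: int = 24          # how often data quality checks run
    assignment_email_interval_hours: int = 24  # how often assignment emails go out

    def __post_init__(self):
        if not self.form_id:
            raise ValueError("form_id must be non-empty")
        if self.dq_check_interval_hours <= 0 or self.assignment_email_interval_hours <= 0:
            raise ValueError("intervals must be positive")


# Two hypothetical surveys reusing the same systems with different settings.
phone_survey = SurveyConfig(form_id="covid_phone_r1", dq_check_interval_hours=6)
household_survey = SurveyConfig(form_id="household_baseline")
print(phone_survey)
print(household_survey)
```

Validating the settings when they are created means a misconfigured survey fails before data collection starts rather than partway through it.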

The Future of SurveyStream

IDinsight is continuing to support the work of the DoD and DSEM teams in building and improving SurveyStream. Now that the key datasets and features used for survey management are being stored and managed through the core SurveyStream platform, our opportunities for building more sophisticated features and doing more advanced analysis of survey data will grow. We are currently exploring building features for optimizing assignments based on GPS locations, tools to visualize sampling frames, estimating population sizes from satellite imagery, ranking surveyors in terms of suitability for a survey, predicting poor data quality, photo analysis, and more!

The efficiency benefits gained from each system are substantial – surveys can be run and managed more rapidly and with fewer errors, freeing up time for teammates to work on other productive activities, and reducing overall survey costs. We strongly believe that a continued investment in designing and developing these systems will enable the DoD team to take on more projects in which high-quality data is collected at a higher frequency and lower cost.

 

  1. Donald Lobo, Sanjeev Dharap, “How the Social Sector Thinks about Tech is Wrong,” India Development Review, October 22, 2021, https://idronline.org/article/technology/how-the-social-sector-thinks-about-technology-is-wrong/
  2. Jeenu Thomas, “Python & SurveyCTO: We bring you PySurveyCTO,” SurveyCTO Blog, November 2, 2020, https://www.surveycto.com/blog/idinsight-phython-surveycto/
  3. The source code for the Python and SurveyCTO API is publicly available here: https://github.com/IDinsight/surveycto-python