Improving efficiency in data collection with automated grid sampling

Example of a printable map created from clustering grids in Mulobezi district, Zambia ©IDinsight

When we begin a survey project, the first challenge we face is identifying who to survey, and how to do it efficiently. In this blog post, we discuss grid-based sampling – a method that enables researchers to sample targets for surveys using geospatial data. We describe a tool that we built to make grid-based sampling easier and more efficient. In a follow-up post, we will talk about the different projects and geographies for which we used grid-based sampling, and discuss the advantages and disadvantages for each.

While planning a survey, we typically use census data to partition survey regions into reasonably sized chunks, or enumeration areas (EAs). We then use the same data to list people, households, administrative entities, etc. within these EAs, and randomly sample a subset of this list so that it is representative of the population while balancing survey logistics. There are many approaches to this random sampling, but the process described above is by far the most common for rigorous surveys, such as those conducted by national statistics agencies or the World Bank.
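A minimal sketch of the first stage of this process, selecting EAs with probability proportional to size, might look like the following; the EA names and household counts are invented for illustration:

```python
import random

# Hypothetical enumeration areas with estimated household counts.
eas = {"EA-1": 120, "EA-2": 80, "EA-3": 200, "EA-4": 50}

def sample_eas_pps(eas, n, seed=0):
    """Select n EAs with probability proportional to household count,
    with replacement (a simplified PPS scheme)."""
    rng = random.Random(seed)
    names = list(eas)
    weights = [eas[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

selected = sample_eas_pps(eas, n=2)
```

In a real survey this would be followed by a second stage, listing and sampling households within each selected EA.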

Sampling frames (for households) are typically created using demographic information: voter rolls, census data, employment registries, etc. However, researchers and surveyors are increasingly relying on geographic information – i.e. remote-sensing data – to estimate population counts in geographic “grids”. This is due to two reasons: first, remote-sensing data can replace or augment census data in cases where the latter is inaccurate, expensive, or unavailable; and second, the geographic grids provide demographic information at a much higher resolution, thus enabling researchers to create smaller EAs and sample households more efficiently while being representative.

Illustration of census-based versus grid-based sampling processes. PSU = Primary Sampling Unit. PPES = probability proportionate to estimated population. Source: Thomson et al., 2017.

Grid-based sampling

With the advent of freely available satellite imagery and sophisticated machine learning methods to process them, it is now possible to obtain population estimates for grids of a certain resolution (30m x 30m, 100m x 100m, etc.) tiling any geographic area. We can use this data as a sampling frame from which to select and demarcate EAs, either independently or in tandem with other sources of demographic data – this is known as grid-based sampling.
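As a rough illustration of how gridded population estimates become a sampling frame, the sketch below flattens a toy population raster into a list of inhabited cells with centroid coordinates. Real inputs would come from raster files (e.g. GeoTIFFs) at the resolutions mentioned above, and the field names here are our own:

```python
import numpy as np

# Toy 4x4 "population raster": each cell is an estimated head count
# for one 100m x 100m grid square.
pop = np.array([
    [0,  5,  0, 12],
    [3,  0,  8,  0],
    [0, 20,  0,  0],
    [7,  0,  0,  4],
])

def raster_to_frame(pop, origin=(0.0, 0.0), cell_size=100.0):
    """Flatten a population grid into a sampling frame: one record per
    inhabited cell, with its centroid coordinates and population estimate."""
    frame = []
    rows, cols = pop.shape
    for r in range(rows):
        for c in range(cols):
            if pop[r, c] > 0:
                x = origin[0] + (c + 0.5) * cell_size
                y = origin[1] + (r + 0.5) * cell_size
                frame.append({"x": x, "y": y, "pop": int(pop[r, c])})
    return frame

frame = raster_to_frame(pop)
```

Each record in `frame` is then a candidate unit for EA demarcation and sampling.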

While grid-based sampling offers potential improvements to the traditional sampling approaches, it still presents challenges of its own. It requires expertise in understanding and manipulating remote-sensing data to obtain the information relevant to a specific survey. Furthermore, generating EAs from a grid-based sampling frame is difficult since it sometimes requires sophisticated clustering algorithms to group survey targets (see this post on how we tackled this problem).

At IDinsight, we have increasingly been using grid-based sampling for various projects in India and the Philippines. As we expand our data collection services via DataDelta and other efforts to more countries and continents, we expect to have more and more use cases for grid-based sampling. This means we also have to anticipate and mitigate the problems with grid-based sampling.

Automating grid-based sampling

For past projects, teammates have created gridded sampling frames using software like QGIS. However, these tools require users to manually specify EAs, demand a lot of familiarity with geospatial data and the associated software, and leave project teams to find gridded population density maps of the appropriate resolution on their own. Since many freely available maps are generated using machine learning tools, it is hard for non-technical teammates to assess their accuracy and reliability. Finally, while the sampling frame by itself is useful, it takes significant effort to convert the frame and corresponding EAs into a format (e.g. maps or tables) that a field team can use to guide enumerators.

FlowMinder provides a site with an automated tool for grid-based sampling that does not require users to find their own population density maps. However, it only uses population density maps with a grid resolution of 100m x 100m, which may be too coarse for the majority of surveys. FlowMinder has also stopped actively developing the tool, which means it would be difficult to request new features or adapt it to emerging contexts.

To address the difficulties outlined above, the Data Science, Engineering, and Monitoring Systems (DSEM) team at IDinsight built a tool to automate grid-based sampling, modeled along the lines of FlowMinder's tool. Users provide gridded population density estimates and shapefiles delineating the boundaries of the geographic region they are interested in, and the tool automatically creates a sampling frame from these inputs. We also implemented an efficient clustering algorithm that can combine grids into appropriately sized EAs. Clustering is a useful part of the process since we often want to sample a certain number of households per EA. When EAs have very few households (e.g. in the Philippines), we risk visiting many of them without finding enough households, since the underlying data we use to create these EAs is not 100% accurate. Clustering grids to form new EAs allows us to sample enough households per EA to maintain scientific rigour while remaining resource-effective.
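Our actual clustering algorithm is more sophisticated (see the linked post), but the core idea of combining grids until each cluster reaches a target population can be sketched greedily. This toy version processes cells in the order given and ignores spatial contiguity, which a real implementation must enforce:

```python
def cluster_cells(cells, target_pop):
    """Greedily group grid cells into clusters whose total population
    reaches target_pop. Cells are taken in the order supplied (e.g. a
    space-filling-curve order); contiguity is not checked here."""
    clusters, current, total = [], [], 0
    for cell in cells:
        current.append(cell)
        total += cell["pop"]
        if total >= target_pop:
            clusters.append(current)
            current, total = [], 0
    if current:  # attach any leftover cells to the last cluster
        if clusters:
            clusters[-1].extend(current)
        else:
            clusters.append(current)
    return clusters

# Illustrative grid cells with population estimates.
cells = [{"id": i, "pop": p} for i, p in enumerate([5, 12, 3, 8, 20, 7, 4])]
clusters = cluster_cells(cells, target_pop=15)
```

Every resulting cluster meets the population floor, which is what lets a survey team count on finding enough households per EA.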

The tool also creates printable maps and .kml files that can be uploaded to Google Maps, to help pinpoint survey households and help enumerators navigate to them.
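As an illustration of the .kml output, a minimal KML file of sampled points can be assembled with plain string templating. The household IDs and coordinates below are made up; note that KML expects longitude,latitude order:

```python
def households_to_kml(points):
    """Emit a minimal KML document with one Placemark per sampled
    household, given (name, lon, lat) tuples."""
    placemarks = "".join(
        f"<Placemark><name>{name}</name>"
        f"<Point><coordinates>{lon},{lat}</coordinates></Point></Placemark>"
        for name, lon, lat in points
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
        f"{placemarks}</Document></kml>"
    )

kml = households_to_kml([("HH-001", 25.21, -16.33), ("HH-002", 25.24, -16.31)])
```

A file like this can be imported into Google Maps ("My Maps") so enumerators can navigate to each point on a phone.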

As we pointed out, it is not always easy to find population density maps or shapefiles, or to assess their accuracy. However, we can use historical survey data from DataDelta and other sources to calibrate these maps, and compare the density maps and tool outputs to ground truth from data collection in order to estimate mismatches. Moreover, from user interviews and from our experiments while building the tool, we have begun compiling a database of potential sources of population density maps and shapefiles. We anticipate that this will reduce duplicated effort across project teams while adding to the store of collective knowledge across the organization.

Schematic of automated grid-sampling tool.

How is the tool useful?

We have already piloted the tool for a few different projects – in India, Zambia, Morocco, and the Philippines. The clustering algorithm to combine population grids into EAs, in particular, has proved useful to project teams, since they can quickly experiment with different sizes for EAs during the planning stage, and also re-draw the areas as circumstances change on the ground (more details on the algorithm in this blog post). The algorithm is also flexible in that it can cluster grids based on population, geographic area, rooftop numbers, etc.: essentially on any metric that can be observed or inferred from satellite data.

We anticipate that automating many of the steps involved in creating sampling frames will free up a lot of time for teammates, while also helping plan and budget for projects better. Moreover, the flexibility and relative ease of using the tool, specifically with respect to the clustering algorithm, means that we can quickly create sampling frames for different scenarios: for example, smaller and bigger EAs, fewer or more enumerators, enumeration based on population size or geographic size, etc. We hope that this will help project teams plan better, and also adapt their plans quickly to evolving circumstances during a survey project.

Example printable map created from clustering grids in Mulobezi district, Zambia.

What’s next?

Despite the uses we envision for it, our tool does not satisfactorily address all the problems with grid-based sampling. First, while we've made a lot of progress on the clustering algorithm, we have also run into challenges. For instance, in the Philippines, the algorithm did not scale well: the clusters were often non-contiguous and spanned large geographical features (e.g. rivers, mountains, highways) that made it very difficult for surveyors to conduct listings in them. Dr. Sarchil Qader and his team at WorldPop have developed a QGIS-based tool that can create EAs conforming to geographic features and estimate the time it would take an enumerator to traverse them. We are exploring ways to integrate this pre-EA tool with ours, in order to mitigate some of the issues with our clustering.

Another problem is the low accuracy of the population density maps and shapefiles. We are considering ways to include uncertainty estimates for population values in each grid, flag grids or regions where the information is unreliable, and use historical survey data to calibrate outputs from the tool.
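One simple way to flag unreliable grids, assuming the population map also ships per-cell standard deviations (the field names below are hypothetical), is to threshold the coefficient of variation of each cell's estimate:

```python
def flag_unreliable(cells, max_cv=0.5):
    """Return the ids of grid cells whose coefficient of variation
    (std / mean of the population estimate) exceeds max_cv."""
    return [
        c["id"] for c in cells
        if c["pop_mean"] > 0 and c["pop_sd"] / c["pop_mean"] > max_cv
    ]

cells = [
    {"id": "g1", "pop_mean": 40.0, "pop_sd": 8.0},  # cv = 0.2, reliable
    {"id": "g2", "pop_mean": 10.0, "pop_sd": 9.0},  # cv = 0.9, flagged
]
flagged = flag_unreliable(cells)
```

Flagged cells could then be excluded from the frame, or earmarked for extra field verification during listing.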

The tool also requires, at minimum, someone with reasonable expertise in Python to run it, and expertise with cloud infrastructure to scale it up to reasonably sized survey areas. We are working to enable code- and language-agnostic ways to use the tool.

Overall, we are optimistic that the tool in its current form will be useful to project teams planning surveys, particularly where traditional data sources are unreliable for creating sampling frames. Our tool is especially handy if project teams need to cluster grids based on different metrics, and to simulate different scenarios while designing the survey – look out for part 2 of this blog post, where we will discuss the use cases of grid-based sampling surveys at IDinsight in more detail. 

We look forward to the tool (and future updates) contributing to impact by supporting our teammates and streamlining surveys we undertake.