How we created Nano’s sampling frame, using geographical segmentation and publicly available indicators of population.
Gathering accurate information in rural areas is often costly and difficult — households are spread out and hard to reach or find. Because it is challenging to survey constituents, local leaders often lack important knowledge about their community members’ needs. This makes it hard for them to make evidence-based decisions about how to allocate resources or which services to provide. Given IDinsight’s focus on data-driven decisions in the social sector, we see this as an important information gap. To help close it, IDinsight is developing an innovative system (Nano) that seeks to transform how chiefs in Zambia collect and use information. Nano provides chiefs with accurate information that enables them to resolve local challenges in a timely manner.
Nano is demand-driven and, therefore, gathers data that chiefs request. In the early scoping stages of this project, we identified the need to collect household-level data from a representative sample of households. However, there was no reliable and up-to-date list of households (or villages) that we could use as a sampling frame (and from which to draw a sample). Because we want to operate Nano at scale and at low cost, censusing the whole chiefdom to create a sampling frame was not an option. Instead, we wanted to create a sampling methodology that was economical, scalable, and easy for surveyors to implement.
This blog post outlines the geographic sampling approach we used to create such a sampling frame and draw a representative sample. At a high level, we work with the electronic map of the area of interest (step 1), segment the area into smaller cells (step 2), identify the cells with a non-zero likelihood of household presence (steps 3–5), and sample cells for surveying (steps 6–7).
Step 1: Plotting all the shapefiles on the map.
We started by plotting the boundary for the study population. In our case, we had access to a chiefdom’s boundary shapefile. Next, we included any other shapefiles that may serve as natural borders to smaller segments (roads, waterways, mountains — whatever is relevant and available). The chiefdom boundary, together with the roads and waterways, is shown in Figure 1.
The Humanitarian Data Exchange (HumData) is an incredible source for all sorts of datasets (the completeness of its public datasets can be questionable, but having some information is better than having no information at all).
Figure 1: Chiefdom, rivers, and roads boundaries
Step 2: Dividing the full study area into smaller cells (we called them enumeration areas — EAs).
We segmented the chiefdom into a grid of 500×500 meter cells (Figure 2). The EA size was somewhat arbitrary: we wanted cells small enough to be manageable for the surveyors on the ground, but large enough to contain two households on average. Each EA was then trimmed to natural borders, so that landscape features (or infrastructure) would not split EAs and surveyors could more easily stay within EA boundaries.
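The gridding step can be sketched in plain Python. This is a simplification of our actual workflow (which operates on shapefiles): it assumes the boundary has already been projected to a meter-based CRS, and the `make_grid` helper and bounding-box numbers below are purely illustrative.

```python
CELL_SIZE = 500  # meters, matching the 500x500 m EAs described above

def make_grid(min_x, min_y, max_x, max_y, cell_size=CELL_SIZE):
    """Return a list of (min_x, min_y, max_x, max_y) bounding boxes, one per EA."""
    cells = []
    y = min_y
    while y < max_y:
        x = min_x
        while x < max_x:
            cells.append((x, y, x + cell_size, y + cell_size))
            x += cell_size
        y += cell_size
    return cells

# A toy 2 km x 2 km study area yields a 4 x 4 grid of 16 EAs.
grid = make_grid(0, 0, 2000, 2000)
print(len(grid))  # 16
```

In practice, each cell would then be clipped to the chiefdom boundary and trimmed to the natural borders plotted in step 1.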
Figure 2: Segmentation of Chiefdom into EAs
Step 3: Determining areas with high probability of household presence.
A large part of the chiefdom is uninhabited land, and sending surveyors on a forest walk did not seem like a cost-effective use of their time, so we needed to identify EAs with some indication of household presence. This is where Facebook’s population dataset and OpenStreetMap’s rooftop dataset were useful (hereafter called the “FB” and “roofs” datasets, respectively).
The advantage of using both datasets was that they were constructed differently. Facebook used a machine-learning algorithm to identify populated areas, whereas the OpenStreetMap dataset was produced by a volunteer-based structure-tagging exercise. If each dataset has systematic exclusion errors, the reasons for exclusion are likely to differ between the two. Therefore, using both datasets made it more likely that the sampling frame would not systematically exclude certain types of households. The two datasets are plotted in Figure 3.
Figure 3: Facebook population and roof datasets
Step 4: Identifying EAs with non-zero household presence.
Next, we superimposed the roof and FB datasets onto the grid and classified each EA into one of the four categories (shown in Figure 4):
Visual examination revealed significant agreement between the two data sources (Figures 4.1 and 4.4). However, a sizable number of EAs have population according to the FB dataset but no tagged roofs (Figure 4.2), and the FB dataset’s coverage is more complete than that of the roofs dataset (Figure 4.3). To avoid systematic exclusion errors caused by the methods used to construct the FB and roofs datasets, our final sampling frame contains all EAs with a positive Facebook population projection or at least one tagged roof (i.e. the union of EAs in categories 1–3).
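The classification and the union rule can be expressed compactly. The sketch below is a minimal illustration; the field names (`roofs`, `fb_pop`) and EA records are invented, not taken from our codebase.

```python
def classify_ea(roof_count, fb_population):
    """Return the Figure 4 category (1-4) for one EA."""
    if roof_count > 0 and fb_population > 0:
        return 1  # both sources agree the EA is inhabited
    if fb_population > 0:
        return 2  # FB population only, no tagged roofs
    if roof_count > 0:
        return 3  # tagged roofs only, zero FB population
    return 4      # neither source: excluded from the frame

def build_frame(eas):
    """Sampling frame = union of categories 1-3."""
    return [ea for ea in eas if classify_ea(ea["roofs"], ea["fb_pop"]) != 4]

eas = [
    {"id": "EA-01", "roofs": 3, "fb_pop": 12.5},
    {"id": "EA-02", "roofs": 0, "fb_pop": 4.0},
    {"id": "EA-03", "roofs": 1, "fb_pop": 0.0},
    {"id": "EA-04", "roofs": 0, "fb_pop": 0.0},
]
print([ea["id"] for ea in build_frame(eas)])  # ['EA-01', 'EA-02', 'EA-03']
```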
Figure 4.1: EAs with at least one roof and non-zero FB population (19% of EAs)
Figure 4.2: EAs with no roofs but non-zero FB population (14% of EAs)
Figure 4.3: EAs with at least one roof but zero FB population (1% of EAs)
Figure 4.4: EAs with no roofs and zero FB population (66% of EAs)
To ground-truth this methodology, we used household locations from a sub-area of the chiefdom (collected during a previous round of surveying) and found that only 3% of category 4 EAs contained at least one household (i.e. an exclusion error of 3%) — a reasonable rate. In a subsequent round of data collection, we found that about 98%, 98%, and 88% of EAs in categories 1, 2, and 3, respectively, contained at least one household.
The final sampling frame, which is the union of category 1–3 EAs, is displayed in Figure 5.
Figure 5: Sampling Frame
Step 5: Determining the number of EAs in the sample.
After we created the sampling frame, we took a random sample of EAs and visited all households within each sampled EA. We determined that, in order to measure the key indicators’ means with the required precision, we needed approximately 500 households in total.
But how did we determine the number of EAs that we needed to visit in order to ensure that the final sample contains the right number of households? This was a bit tricky and entailed some guesswork. Using data from a previous round of surveying, we calculated the approximate number of households that we expected to find in each type of EA (categories 1–3), as well as the percentage of EAs that we expected to be completely empty. These projections were used to scale up the number of EAs that we needed to sample.
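A back-of-the-envelope version of this scaling looks as follows. The category shares are derived from Figure 4 (19%, 14%, and 1% of all EAs, renormalised within the frame) and the occupancy rates are those reported above, but the expected households per occupied EA are invented for illustration — the post does not report our actual projections.

```python
import math

def eas_needed(target_households, frame_share, occupancy, hh_per_occupied_ea):
    """frame_share: category -> share of the sampling frame;
    occupancy: category -> probability an EA contains at least one household;
    hh_per_occupied_ea: category -> expected households, given occupancy."""
    expected_per_sampled_ea = sum(
        frame_share[c] * occupancy[c] * hh_per_occupied_ea[c] for c in frame_share
    )
    return math.ceil(target_households / expected_per_sampled_ea)

n = eas_needed(
    target_households=500,
    frame_share={1: 0.56, 2: 0.41, 3: 0.03},   # from Figure 4, renormalised
    occupancy={1: 0.98, 2: 0.98, 3: 0.88},     # ground-truthing rates above
    hh_per_occupied_ea={1: 3.0, 2: 1.5, 3: 1.0},  # invented for illustration
)
print(n)  # 220
```

The empty-EA probabilities enter through the occupancy terms: category 3 EAs, for example, contribute less per sampled EA because 12% of them turn out to contain no households.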
Step 6: Taking a random sample of EAs.
We took a simple random sample (SRS) of EAs, each with equal probability of selection (Figure 6 — illustrative, not a sample we used for data collection). Another option is to sample with probability proportional to population size, which yields lower variance in the estimates (Lohr, Sampling: Design and Analysis). However, this approach requires accurate population estimates, and ex ante it was unclear how well the FB data would perform. Using roof density as a proxy for population size may also be problematic, since in more built-up areas roof density may even be negatively correlated with household counts (for example, market centres have a high density of roofs but relatively few resident households).
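The SRS draw itself is a one-liner with Python’s standard library. The EA identifiers, frame size, and seed below are illustrative; fixing the seed simply makes the draw reproducible and auditable.

```python
import random

def sample_eas(frame_ids, n_eas, seed=42):
    """Simple random sample without replacement: every EA has equal probability."""
    rng = random.Random(seed)  # fixed seed => reproducible draw
    return sorted(rng.sample(frame_ids, n_eas))

frame_ids = [f"EA-{i:04d}" for i in range(1, 1001)]  # a toy frame of 1,000 EAs
sample = sample_eas(frame_ids, 25)
print(len(sample))  # 25
```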
Figure 6: Random sample (illustrative, not a sample we used for data collection)
Step 7: Collecting data and monitoring progress.
During fieldwork, the surveyors were instructed to visit all households within their assigned EAs. They were equipped with a tablet and an electronic map. Shapefiles of the EAs were uploaded onto the map and were clearly demarcated for easy navigation. Surveyors were instructed to visit all EAs on their maps, stay within EA borders, and survey all households within those areas. If an EA contained no households, they were required to “check in” to the EA as proof that it had been visited. Surveyors’ progress and compliance with these procedures were monitored remotely, as described in our blog post here.
While we used this sampling approach in isolation, it can be combined with more traditional sampling methods. For example, a common approach is to randomly select villages, census all households in those villages, digitize the data, and select a household sample. This approach may be expensive if villages are geographically diffuse and/or cover a large area. Another option is to use household listings provided by village heads as a sampling frame. However, such listings may not be up to date and could lead to a less representative sample if specific types of households (e.g. newer or remote households) are missing from the village records. Alternatively, if using village listings or conducting a census is not logistically or financially possible, surveyors can be instructed to do a random walk or employ another on-the-ground sampling strategy. These approaches require surveyors to fully understand and adhere to the protocol, and to employ it diligently (actually covering all the remote areas) with little supervision.

In such cases, geographical segmentation can be integrated with electronic maps to save costs and tighten monitoring. Specifically, given access to village-level GPS coordinates (village centroids), it is possible to use FB population density to define village borders, segment each village into smaller areas, and instruct surveyors to survey only randomly selected areas. Under this approach, surveyors have clearly defined geographical areas to cover, can be held accountable to their work plans more easily, and can be monitored remotely.
More generally, geo-tagged granular datasets can be used not only to create sampling frames but also to identify areas for targeted interventions or to create stratification variables at the right level of geography. Big data is entering development quickly, and we should take full advantage of it.
We welcome comments, suggestions, or feedback from others who have used geographic sampling approaches.
The code and data required to implement steps 1–5 of this strategy are found here and use Python. Users will need to download the relevant shapefiles and save them to the Shapefiles folder. The initial study area boundary shapefile should be in a projected coordinate system that uses meters (e.g. EPSG:32735 for Southern Africa), while the rest of the shapefiles should be in the standard longitude–latitude coordinate system (EPSG:4326).
The sample projection uses the output of the Python code and was done in Stata (the Stata code is not included because it is straightforward to replicate). The shapefiles can be loaded into QGIS to easily visualize the data.