Click here to read Part 1.
Governments and NGOs often need to combine datasets from different sources to identify delivery gaps in existing programs or measure a program’s effects on particular demographic groups. In a previous post, we explained the challenges researchers face when matching datasets using Hindi string identifiers and illustrated the need for a customised fuzzy matching algorithm.
This post expands on how we addressed a similar problem when, as part of testing the suitability of using voter rolls as sampling frames (explained below), we had to match names on voter rolls with names from a household listing. Our algorithm improves upon existing off-the-shelf fuzzy matching programs by using a stepwise approach tailored to the nuances of the Devanagari script. Below we describe how we were able to use this algorithm and how we assessed its performance. Researchers can access our full code here and tailor it to their use cases.
IDinsight’s Data on Demand (DoD) service provides government and non-profit decision-makers with just-in-time information about citizens’ preferences, conditions, and behaviors. Using novel methods adaptable to diverse topics, DoD returns accurate survey results within days at a low cost, propelling the do-learn-improve cycle for policy changes and programmatic interventions. By testing diverse methodological innovations, including alternative sampling methods, we are also consistently examining ways to improve DoD’s time and cost efficiencies.
Towards this, we recently tested whether publicly available voter rolls in India are a suitable alternative sampling frame to an expensive household-listing exercise (i.e. mapping and enumerating all households in the area). To assess the comprehensiveness and accuracy of voter rolls, we conducted a full household listing and matched the eligible voters from our listing to the voters on the corresponding voter rolls. We ran this exercise in 13 urban and rural locations in northern India, collecting information on 7,769 voting-age individuals.
Our approach was to first match self-reported voter ID numbers with those listed on the voter rolls since, at least in theory, each voter has their own unique ID. However, of the 7,769 voting-age individuals from our household listing, 14.5 per cent reported they were not registered to vote, 43.2 per cent did not provide their voter ID number, and 9.7 per cent of self-reported voter ID numbers did not match with a voter ID in the publicly available voter rolls.
To account for these cases, we built a customized algorithm to “fuzzy match” individuals, which relied on a combination of three types of information:
In theory, we could have relied on Stata’s reclink command, or one of several user-written fuzzy matching programs that are specific to Devanagari, to identify approximate matches for the names. However, with experimentation, we found that we could nearly double the match rates by taking a stepwise approach. False matches in fuzzy matching algorithms propagate: an early false match incorrectly removes an individual from the match pool, which leads the algorithm to make further false matches with other individuals in later steps. For this reason, we started with the matches we were most sure about, removed them from the match pool, and then matched the remaining individuals on progressively less strict criteria.
A simplified example of this stepwise approach: we first matched on the exact names of the individual and their primary relation, then matched exactly on the individual’s name but fuzzily on the relation’s name, then vice versa, then fuzzily on both the individual’s and the relation’s name, and so on. By completing the matches we were more sure about before moving to the matches we were less sure about, our algorithm made fewer false matches.
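The stepwise idea can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the function names, the romanised example names, the 0.75 similarity threshold, the use of `difflib.SequenceMatcher` as the fuzzy comparison, and the rule of accepting only unambiguous candidates are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def exact(a, b):
    """Strict test: the two strings are identical."""
    return a == b


def similar(a, b, threshold=0.75):
    """Loose test: similarity ratio at or above an illustrative threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


# Each step is (test for the individual's name, test for the relation's name),
# ordered from the strictest criteria to the loosest.
STEPS = [(exact, exact), (exact, similar), (similar, exact), (similar, similar)]


def stepwise_match(listing, rolls):
    """Match records step by step, shrinking the match pool after every match.

    `listing` and `rolls` map a record id to a (name, relation_name) tuple.
    Returns a dict of listing id -> roll id.
    """
    unmatched_listing = dict(listing)
    unmatched_rolls = dict(rolls)
    matches = {}
    for name_test, relation_test in STEPS:
        for lid, (name, relation) in list(unmatched_listing.items()):
            candidates = [
                rid
                for rid, (r_name, r_relation) in unmatched_rolls.items()
                if name_test(name, r_name) and relation_test(relation, r_relation)
            ]
            # Assumption: accept a match only when exactly one candidate remains.
            if len(candidates) == 1:
                rid = candidates[0]
                matches[lid] = rid
                del unmatched_listing[lid]
                del unmatched_rolls[rid]
    return matches


# Hypothetical records, romanised for readability.
listing = {"h1": ("vimla", "ramesh"), "h2": ("bimla", "ramesh")}
rolls = {"v1": ("vimla", "ramesh"), "v2": ("vimla", "rames")}
result = stepwise_match(listing, rolls)
```

In this toy example, a single fuzzy pass would find two plausible roll entries for `h1`. Because the exact step runs first and removes `v1` from the pool, only `v2` is left for `h2` by the time the loosest step runs.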
We also found that customizing the algorithm to the transcription-related idiosyncrasies of the specific datasets we were using improved match rates. Here, we not only accounted for errors common when transcribing Devanagari (like swapping the commonly interchangeable letters ‘v’ and ‘b’; see the previous post for a more detailed explanation of why this is important), but also addressed errors particular to Hindi-based administrative datasets. For instance, voter rolls inconsistently attach the word ‘Devi’ to a female voter’s name, even if this word is not part of her official name. To account for these discrepancies, we removed all instances of ‘Devi’ from the names before matching.
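A cleaning step of this kind might look like the sketch below. The function name `normalise`, the choice to drop ‘Devi’ (देवी) only when it appears as a separate word, and the choice to map व (‘v’) onto ब (‘b’) are assumptions for illustration; a real implementation would cover many more letter pairs and dataset-specific quirks.

```python
import unicodedata

# Assumption: honorifics are dropped only when they appear as whole words.
HONORIFICS = {"देवी"}  # 'Devi'

# Assumption: treat 'v' (व) and 'b' (ब) as the same letter by mapping one onto the other.
INTERCHANGEABLE = str.maketrans({"व": "ब"})


def normalise(name):
    """Return a cleaned version of a Devanagari name for comparison."""
    # Put the text in a consistent Unicode form before comparing characters.
    name = unicodedata.normalize("NFC", name)
    # Drop honorifics first: 'देवी' itself contains 'व', so it must be
    # removed before the letter substitution changes its spelling.
    tokens = [token for token in name.split() if token not in HONORIFICS]
    return " ".join(tokens).translate(INTERCHANGEABLE)


# Two hypothetical spellings of the same name now compare as equal.
same = normalise("विमला देवी") == normalise("बिमला")
```

Note that the order of operations matters here, and that a name consisting only of ‘Devi’ would be reduced to an empty string, so such cases would need separate handling.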
On running our algorithm, we found that:
Table 1 lists some examples of names that were fuzzy matched across the census and voter rolls despite having different spellings.
Our dataset provided a unique opportunity to assess the performance of our matching algorithm. About 2,530 individuals in our household listing provided their voter ID number, so for each of these individuals we can find the corresponding ID in the voter rolls. Since we know the ‘true’ matches for these 2,530 individuals, we can test the performance of our algorithm by running it on this group and counting how many correct and incorrect matches it makes.
Our algorithm performs well, matching 93.1% of individuals with the correct entry in voter rolls. As a point of comparison, Stata’s reclink command only finds half as many matches (47.5%). Our algorithm also makes half as many incorrect matches as reclink (2.3% vs 4.9%) and leaves far fewer individuals unmatched with the correct entry in the voter rolls (4.6% vs 47.7%).
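The three rates reported above (correct, incorrect, and unmatched) can be computed from a set of known ‘true’ matches along the following lines. This is a sketch under assumptions: the function name `evaluate`, the dictionary-based inputs, and the toy ids are hypothetical and are not taken from our code or data.

```python
def evaluate(predicted, truth):
    """Score a matcher against known true matches.

    `truth` maps each listing id to its true roll id (known from the voter ID).
    `predicted` maps listing ids to the roll id chosen by the matcher;
    a listing id absent from `predicted` counts as unmatched.
    Returns the share of individuals in each of the three outcomes.
    """
    total = len(truth)
    correct = sum(1 for lid, rid in truth.items() if predicted.get(lid) == rid)
    incorrect = sum(
        1 for lid, rid in truth.items() if lid in predicted and predicted[lid] != rid
    )
    unmatched = total - correct - incorrect
    return {
        "correct": correct / total,
        "incorrect": incorrect / total,
        "unmatched": unmatched / total,
    }


# Toy data: two correct matches, one wrong match, one individual left unmatched.
truth = {"h1": "v1", "h2": "v2", "h3": "v3", "h4": "v4"}
predicted = {"h1": "v1", "h2": "v2", "h3": "v9"}
rates = evaluate(predicted, truth)
```

Because every individual falls into exactly one of the three outcomes, the shares sum to one, which is a useful sanity check when comparing two matchers on the same group.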
Given our algorithm’s better performance, we recommend it over existing off-the-shelf fuzzy matching programs. The Stata code for our matching algorithm is available here. We also provide the Python code for another fuzzy matching use case at IDinsight, based on the same algorithm, where we matched student names in our client’s database with attendance records at government schools. We hope that these resources will help other researchers generate better insights from combining different administrative datasets.