Part 2: What’s in a name: Combining datasets when unique identifiers are missing

Ruchika Joshi, Jeff McManus, Jeenu Thomas, 22 October 2020

Governments and NGOs often need to combine datasets from different sources to identify delivery gaps in existing programs or measure a program’s effects on particular demographic groups. In a previous post, we explain the challenges researchers face when matching datasets using Hindi string identifiers and illustrate the need for a customised fuzzy matching algorithm.

This post expands on how we addressed a similar problem when, as part of testing the suitability of using voter rolls as sampling frames (explained below), we had to match names on voter rolls with names from a household listing. Our algorithm improves upon existing off-the-shelf fuzzy matching programs by using a stepwise approach tailored to the nuances of the Devanagari script. Below we describe how we were able to use this algorithm and how we assessed its performance. Researchers can access our full code here and tailor it to their use cases.

Matching individuals in a household census with voter rolls

IDinsight’s Data on Demand (DoD) service provides government and non-profit decision-makers with just-in-time information about citizens’ preferences, conditions, and behaviors. Using novel methods adaptable to diverse topics, DoD returns accurate survey results within days at a low cost, propelling the do-learn-improve cycle for policy changes and programmatic interventions. By testing diverse methodological innovations, including alternative sampling methods, we are also consistently examining ways to improve DoD’s time and cost efficiencies.

Towards this, we recently tested whether publicly available voter rolls in India are a suitable alternative sampling frame to an expensive household-listing exercise (i.e. mapping and enumerating all households in the area). To assess the comprehensiveness and accuracy of voter rolls, we conducted a full household listing and matched the eligible voters from our listing to the voters on the corresponding voter rolls. We ran this exercise in 13 urban and rural locations in northern India, collecting information on 7,769 voting-age individuals.

Our approach was to first match self-reported voter ID numbers with those listed on voter rolls since, at least in theory, these IDs are supposed to be unique identifiers such that each voter should have their own ID. However, of the 7,769 voting-age individuals from our household listing, 14.5 per cent of the respondents reported they were not registered to vote, 43.2 per cent did not provide their voter ID number, and 9.7 per cent of self-reported voter ID numbers did not match with a voter ID in the publicly available voter rolls.
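As a minimal sketch of this first step, the Python below splits a household listing into records that match a voter roll on ID and records left over for name-based matching. The field names (`name`, `voter_id`) and the sample IDs are hypothetical and are not taken from our datasets or our published code.

```python
def match_on_voter_id(census, voter_roll):
    """Split census records into ID-matched pairs and a remainder."""
    # Index the voter roll by ID so each lookup is a single dictionary access.
    roll_by_id = {row["voter_id"]: row for row in voter_roll}
    matched, unmatched = [], []
    for person in census:
        vid = person.get("voter_id")
        if vid and vid in roll_by_id:
            matched.append((person, roll_by_id[vid]))
        else:
            # No ID reported, or the reported ID is not on the roll.
            unmatched.append(person)
    return matched, unmatched


# Hypothetical sample records for illustration only.
census = [
    {"name": "sunita", "voter_id": "ABC1234567"},
    {"name": "ramesh", "voter_id": None},         # did not provide an ID
    {"name": "vimla", "voter_id": "ZZZ0000000"},  # ID not found on the roll
]
voter_roll = [
    {"name": "sunita devi", "voter_id": "ABC1234567"},
    {"name": "bimla", "voter_id": "XYZ7654321"},
]

matched, unmatched = match_on_voter_id(census, voter_roll)
```

The `unmatched` list is the group that then needs name-based matching.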

Our customized “fuzzy matching” algorithm

To account for these cases, we built a customized algorithm to “fuzzy match” individuals, which relied on a combination of three types of information:

Different spellings of their name and their primary relation’s name (i.e. father or husband name);
Information on each individual available in the voter rolls (gender, age, marital status, whether individuals were listed in the same household as other individuals); and
Whether names matched exactly or approximately (with varying degrees of similarity between matched names).¹

In theory, we could have relied on Stata’s reclink command, or one of several user-written fuzzy matching programs that are specific to Devanagari, to identify approximate matches for the names. However, with experimentation, we found that we could nearly double the match rates by taking a stepwise approach. False matches in fuzzy matching algorithms propagate: an early false match incorrectly removes an individual from the match pool, which leads the algorithm to make false matches with other individuals in later steps. For this reason, we started with the matches we were most sure about, removed them from the match pool, and then matched the remaining individuals on progressively less strict criteria.

A simplified example of this stepwise approach is that we first matched on the exact names of the individual and their primary relation, followed by exactly matching on the individual’s name but fuzzy matching on the relation name, then vice-versa, followed by fuzzy matching on both the individual’s and the relation’s name, and so on. By completing matches we were more sure about, before moving to matches we were less sure about, our algorithm led to fewer false matches.
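The simplified sequence above can be sketched in Python as follows. This is an illustration under stated assumptions rather than our Stata implementation: the similarity measure (`difflib.SequenceMatcher`), the 0.8 threshold, the rule of accepting only unambiguous candidates, and the field names are all choices made for this example.

```python
from difflib import SequenceMatcher


def exact(a, b):
    return a == b


def similar(a, b, threshold=0.8):
    """Approximate match: similarity ratio at or above an illustrative threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


# Steps ordered from strictest to loosest: (label, name test, relation-name test).
STEPS = [
    ("exact/exact", exact, exact),
    ("exact/fuzzy", exact, similar),
    ("fuzzy/exact", similar, exact),
    ("fuzzy/fuzzy", similar, similar),
]


def stepwise_match(census, roll):
    """Match step by step, removing matched records from both pools."""
    census_pool, roll_pool = list(census), list(roll)
    matches = []
    for label, name_test, relation_test in STEPS:
        for person in list(census_pool):
            candidates = [
                r for r in roll_pool
                if name_test(person["name"], r["name"])
                and relation_test(person["relation"], r["relation"])
            ]
            if len(candidates) == 1:  # accept only unambiguous matches
                matches.append((person["id"], candidates[0]["id"], label))
                census_pool.remove(person)
                roll_pool.remove(candidates[0])
    return matches, census_pool


# Hypothetical records: two people share a name but have differently spelled relations.
census = [
    {"id": "c1", "name": "ramesh kumar", "relation": "suresh kumar"},
    {"id": "c2", "name": "ramesh kumar", "relation": "sures kumar"},
]
roll = [
    {"id": "r1", "name": "ramesh kumar", "relation": "suresh kumar"},
    {"id": "r2", "name": "ramesh kumar", "relation": "sureshh kumar"},
]

matches, leftover = stepwise_match(census, roll)
```

In this toy data, fuzzy matching alone would leave `c2` with two plausible candidates. Because the exact step first takes `r1` out of the pool, only `r2` remains when the looser step runs.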

We also found that customizing the algorithm to the transcription-related idiosyncrasies of the specific datasets we were using improved match rates. Here, we not only accounted for errors common when transcribing Devanagari (like swapping the commonly interchangeable letters ‘v’ and ‘b’: see previous post for a more detailed explanation of why this is important), but also addressed errors particular to Hindi-based administrative datasets. For instance, voter rolls inconsistently attach the word ‘Devi’ to a female voter’s name, even if this word is not part of her official name. To account for these discrepancies, we removed all instances of ‘Devi’ from the names used in our matching algorithm.²
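A hedged sketch of this kind of clean-up, applied to transliterated names, might look like the Python below. Only two rules are shown, the ‘v’/‘b’ swap and the removal of ‘Devi’, and the decision to drop ‘Devi’ only when it appears as a separate word is an assumption made for this example. The function and constant names are hypothetical.

```python
import re

# Illustrative rules only; these two are the examples mentioned in the text.
INTERCHANGEABLE = str.maketrans({"v": "b"})  # treat 'v' and 'b' as the same letter
DROPPED_WORDS = {"devi"}                      # inconsistently attached on voter rolls


def normalise_name(name):
    """Lowercase, drop inconsistent words, and collapse interchangeable letters."""
    tokens = re.split(r"\s+", name.strip().lower())
    # Remove dropped words before the letter swap, so 'devi' is still recognisable.
    tokens = [t for t in tokens if t and t not in DROPPED_WORDS]
    return " ".join(tokens).translate(INTERCHANGEABLE)


census_name = normalise_name("Vimla")
roll_name = normalise_name("Bimla Devi")
```

After this step both spellings reduce to the same string, so they can be caught by an exact comparison before any fuzzy step is needed.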

On running our algorithm, we found that:

44 per cent of matches were based on voter ID numbers.
About 10 per cent of matches were based on the exact same spelling of the individual’s name and their relative’s name.
20 per cent of matches were based on very similar spellings that differed due to transcription inconsistencies.
The remaining 25 per cent³ of matches were based on fuzzy name matching and different types of information provided in voter rolls, such as voter gender, age, and house number.

Table 1 lists some examples of names that were fuzzy matched across the census and voter rolls despite having different spellings.⁴

*Table 1: Examples of Fuzzy Matched Names*

How do we know our algorithm worked?

Our dataset provided a unique opportunity to assess the performance of our matching algorithm. About 2,530 individuals in our household listing provided their voter ID number, and so for each of these individuals, we can find the corresponding matching ID in the voter rolls. Since we know the ‘true’ matches for these 2,530 individuals, we can test the performance of our algorithm by running it on this group and counting up how many correct and incorrect matches are made.⁵

Our algorithm performs well, matching 93.1% of individuals with the correct entry in voter rolls. As a point of comparison, Stata’s reclink command only finds half as many matches (47.5%). Our algorithm also makes half as many incorrect matches as reclink (2.3% vs 4.9%) and leaves far fewer individuals unmatched with the correct entry in the voter rolls (4.6% vs 47.7%).
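The three rates reported above (correct, incorrect, unmatched) can be computed from a set of known true matches with a few lines of Python. This is a sketch with hypothetical identifiers and toy data, not the evaluation code we ran.

```python
def evaluate(predicted, truth):
    """Return the shares of correct, incorrect, and unmatched individuals.

    predicted: census id -> roll id chosen by the algorithm (missing if unmatched)
    truth:     census id -> true roll id, known here from voter ID numbers
    """
    total = len(truth)
    correct = sum(1 for c, r in truth.items() if predicted.get(c) == r)
    unmatched = sum(1 for c in truth if predicted.get(c) is None)
    incorrect = total - correct - unmatched  # matched, but to the wrong entry
    return {
        "correct": correct / total,
        "incorrect": incorrect / total,
        "unmatched": unmatched / total,
    }


# Toy example: four individuals with known true matches.
truth = {"c1": "r1", "c2": "r2", "c3": "r3", "c4": "r4"}
predicted = {"c1": "r1", "c2": "r9", "c3": "r3"}  # c2 is wrong, c4 is unmatched

rates = evaluate(predicted, truth)
```

The three shares always sum to one, which is a useful sanity check when comparing two matching methods on the same group.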

Given our algorithm’s better performance, we recommend this over existing off-the-shelf fuzzy matching programs. The Stata code for our matching algorithm is available here. We also provide the Python code for another fuzzy matching use case at IDinsight, based on the same algorithm, where we matched student names in our client’s database with attendance records at government schools. We hope that these resources will help other researchers to generate better insights from combining different administrative datasets.

1. Since Hindi is commonly spoken in all four states, we instructed both surveyors and data entry operators for voter rolls to enter names in Devanagari. We then transliterated names from both datasets using a modified version of Polyglot, a Python-based natural language pipeline that supports language transliteration. This minimized inconsistencies arising from non-standardised transliteration. We thank Skye Hersh for her technical support on the transliteration process.
2. The full list of spelling changes we made under the first type of information above, tailored to the voter rolls dataset, is available in the read.me file.
3. 1 percent of the matches were implemented manually.
4. The names output in Latin script are somewhat hard to read because of the modifications made to the Python package. However, since the transliteration rules were the same across both datasets, these non-standard spellings do not interfere with the matching process.
5. Technically this test gives us an upper bound on match error since the algorithm must find fuzzy matches in a larger pool (7,769 census entries and 9,351 voter roll entries, compared to 5,239 census entries and 6,821 voter roll entries once you remove voter ID matches).