Part 1: What’s in a name: Combining datasets when unique identifiers are missing

Part 1: The Fuzzy Matching Problem. 

Click here to read Part 2.

Governments and NGOs often need to combine datasets from different sources to identify delivery gaps in existing programs or measure a program’s effects on particular demographic groups. A common hurdle when combining datasets is that most don’t include a unique and reliable identifier (e.g. voter ID number, Aadhaar number, house address) to facilitate a clean and easy match. In the absence of such an identifier, researchers need to find another way to relate records from one dataset to another.

For example, one may want to merge a child’s school record with an NGO’s database to help better target the NGO’s educational services. However, the child’s name, ‘सुनीता’, is spelt in English as Sunita by the school and Suneeta by the NGO. Or one may want to merge Census data with school infrastructure data to predict the number of out-of-school children for every village in northern India, but the village name ‘ग्वालदे’ is spelt as ‘Gwalde’ and ‘Gvalde’ across the two sources.

Differences in spelling can occur for many reasons: words are misspelt when data is entered, nouns are pronounced differently across regions and contexts, or individuals may spell words differently when transcribing speech to text. These issues multiply when individuals must transliterate words from one script to another, e.g. recording an individual’s Hindi name in the Latin script instead of Devanagari.¹

Enter “fuzzy matching.” Fuzzy matching algorithms match strings of text that are approximately, rather than exactly, the same (like “Sunita” and “Suneeta”). By measuring how similar two name strings are, researchers can match names that are similar enough. Similarity can be quantified in several ways. Many commonly used algorithms rely on the Levenshtein distance: the minimum number of single-character insertions, deletions, or substitutions needed to change one string into the other. For example, “Sunita” and “Suneeta” are at a distance of 2 (substitute ‘i’ with ‘e’, then insert a second ‘e’).
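To make the Levenshtein distance concrete, here is a minimal sketch in plain Python. The function names `levenshtein` and `similarity` are illustrative choices for this post, not part of any particular package, and the normalization by the longer string's length is just one common way to turn a distance into a score.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Illustrative score in [0, 1]: 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("sunita", "suneeta"))  # 2
print(levenshtein("gwalde", "gvalde"))   # 1
```

A researcher would then accept a pair as a match when the score clears a threshold chosen for the dataset at hand.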

However, publicly available fuzzy matching algorithms are optimized for the Latin script, which poses a challenge when matching data recorded in languages written in other scripts, such as Hindi. As a result, when comparing two approximately similar strings that are Latin transliterations of Hindi names, these algorithms are likely to overestimate the distance between them. There are two situations where this can happen:

First, the same individual’s name may have been recorded differently in Devanagari. For example, the same individual may be named ‘वसंती’ (Vasanti) in one dataset and ‘बसंती’ (Basanti) in the other because the sounds ‘v’ and ‘b’ are commonly swapped in Hindi dialects. In this case, a fuzzy matching algorithm applied to the Latin transliterations will treat these two names as more different than they really are. The Levenshtein distance assigns this difference a value of 1 (one substitution), when it should be closer to 0 (since ‘b’ and ‘v’ are commonly interchanged in Devanagari). If instead the two words were “Basanti” and “Pasanti”, then the distance should be closer to 1 (since ‘b’ and ‘p’ are rarely interchanged in Devanagari).
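One way to express this idea in code is a weighted edit distance, where substitutions between commonly interchanged letters cost less than 1. The sketch below is only an illustration: the `SIMILAR_PAIRS` table and the cost of 0.1 for ‘b’/‘v’ are assumptions made for this example, not values from our matching code, and any real table would need to be tuned to the language and dataset.

```python
# Hypothetical table of commonly interchanged letters and their
# reduced substitution cost. The 0.1 value is an assumption.
SIMILAR_PAIRS = {frozenset("bv"): 0.1}


def substitution_cost(x: str, y: str) -> float:
    """0 for identical letters, a reduced cost for listed pairs, else 1."""
    if x == y:
        return 0.0
    return SIMILAR_PAIRS.get(frozenset((x, y)), 1.0)


def weighted_levenshtein(a: str, b: str) -> float:
    """Edit distance with unit-cost insertions and deletions and
    letter-dependent substitution costs."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1.0,                          # delete ca
                curr[j - 1] + 1.0,                      # insert cb
                prev[j - 1] + substitution_cost(ca, cb),  # substitute
            ))
        prev = curr
    return prev[-1]


print(weighted_levenshtein("basanti", "vasanti"))  # 0.1
print(weighted_levenshtein("basanti", "pasanti"))  # 1.0
```

With this table, “Basanti” and “Vasanti” are nearly identical, while “Basanti” and “Pasanti” remain a full edit apart.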

Second, the same name may appear differently in the two datasets because the transliteration process (from Devanagari to the Latin script) was not standardized across them. Some characters in Devanagari can map to more than one Latin character upon transliteration, which can lead to spelling inconsistencies in transliterated Hindi names. For example, the name ‘भेरु’ could be transliterated as Bhairoo or Bheru. In this case, a fuzzy matching algorithm will treat these names as different, even though in Devanagari they were recorded in exactly the same way.
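A simple way to reduce this second kind of mismatch is to collapse known transliteration variants to a single canonical spelling before comparing names. The sketch below assumes a small, hand-written rule list (`VARIANT_RULES`) built only from the examples in this post; it is hypothetical, far from complete, and would merge some genuinely different names if applied blindly.

```python
# Hypothetical, ordered rules mapping Latin spelling variants of the same
# Devanagari vowel to one canonical form. Built from this post's examples.
VARIANT_RULES = [
    ("ai", "e"),  # Bhairoo / Bheru
    ("oo", "u"),  # Bhairoo / Bheru
    ("ee", "i"),  # Suneeta / Sunita
]


def normalize(name: str) -> str:
    """Lowercase a name and collapse known transliteration variants."""
    key = name.strip().lower()
    for variant, canonical in VARIANT_RULES:
        key = key.replace(variant, canonical)
    return key


print(normalize("Bhairoo"), normalize("Bheru"))   # bheru bheru
print(normalize("Suneeta"), normalize("Sunita"))  # sunita sunita
```

Names that share a normalized key can be treated as exact matches, and a fuzzy distance can then be applied to the normalized keys for the remaining pairs.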

Off-the-shelf fuzzy matching programs, like Stata’s reclink program or user-written fuzzy matching packages, perform poorly in such cases: they miss true matches and produce unacceptably high rates of false matches. Such algorithms need to be customized to capture the unique features of each language, and even each dataset, in order to correctly match names, or other relevant string patterns, across datasets.

Our Data on Demand (DoD) team recently faced this challenge while matching names of voters identified through a household listing process with those appearing in publicly available voter rolls. In Part 2 of this post, we describe our matching process in more detail and provide access to our Stata and Python code, which can be tailored to other use cases.

For a great intro on the fuzzy matching problem, check out this blog post by Data Ladder.

1. While manual transliteration by people is often unavoidable, where possible we recommend recording names in their native language and then using a module like Polyglot to apply a standardized transliteration to Latin text.