Record Linkage and De-duplication

What is Record Linkage and De-duplication?

It is often necessary to combine data from multiple sources to get a complete picture.

E.g., you have a dataset on researchers, and another on patent holders, and you want to determine the correlation between research activity and the granting of patents. For this you need to match a given researcher with a patent holder, if possible.

Such record linkage can be difficult because of differences in the way data is collected or recorded in the two datasets. E.g., suppose there is a researcher David A. Miller from Stanford, CA, and a patent holder David Andrew Miller from Fairhaven, NJ. Are they the same person? It seems unlikely since they live in different places. But maybe these datasets were collected at different times, and the person moved in between.

Similarly, when linking restaurants, is Art’s Deli the same as Art’s Delicatessen?

De-duplication is when the same dataset has potentially multiple records for the same entity and we want to remove the duplicates. The techniques used are very similar to those for record linkage.

Why is Record Linkage Useful?

Record linkage can save costs by reusing existing datasets instead of collecting data afresh.

How is Record Linkage Done?

It is easy if both datasets have a unique identifier such as a Social Security Number, particularly when we can assume it was entered correctly.
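
When both datasets do share such a key, linkage is essentially a database join. A minimal sketch with pandas, using made-up column names and values:

    import pandas as pd

    # Hypothetical tables that share a clean, unique identifier column ("person_id").
    researchers = pd.DataFrame({"person_id": [101, 102], "name": ["D. Miller", "J. Chen"]})
    patent_holders = pd.DataFrame({"person_id": [102, 103], "patent": ["P-9", "P-4"]})

    # With a trustworthy key, record linkage is just an exact join on that key.
    linked = researchers.merge(patent_holders, on="person_id", how="inner")
    print(linked)  # only person_id 102 appears in both tables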

Otherwise the steps are —

  1. Identify the fields that will be used to link records. These are the link keys. E.g., first name, last name, year of birth, address, etc. Prefer name over year of birth, since name takes many more distinct values, making coincidental matches less likely.

  2. Clean and standardize all link keys (a small sketch follows this list).

  3. For each link key, define a scoring function S(v1, v2) that measures the similarity (or distance) between any two values v1 and v2 the key can take.

  4. For every potential pair, combine the scores for different link keys to decide whether the pair is a match or not.
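
A small sketch of the cleaning and standardization in step 2; the normalization rules and the abbreviation table below are illustrative choices, not a standard:

    def standardize(value: str) -> str:
        """Lowercase, trim, collapse whitespace, and expand a few abbreviations."""
        abbreviations = {"deli": "delicatessen", "st.": "street", "ave.": "avenue"}
        words = value.lower().strip().split()
        return " ".join(abbreviations.get(w, w) for w in words)

    print(standardize("  Art's   Deli "))      # art's delicatessen
    print(standardize("Art's Delicatessen"))   # art's delicatessen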

Scoring the Similarity Between Records

For fields with continuous values, the absolute difference between the two values is usually good enough.

Scoring similarity for text fields is challenging. Common scoring functions are different kinds of edit distances:

Edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other.

Levenshtein Distance

Minimum number of single-character insertions, deletions, and substitutions required to convert one string into another. E.g., the distance between Peter and Pedr is 2 (substitute t with d, then delete an e), while the distance between Miera and Meila is 3.
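
A straightforward dynamic-programming sketch of Levenshtein distance (its running time is proportional to the product of the string lengths):

    def levenshtein(s: str, t: str) -> int:
        """Minimum number of single-character insertions, deletions,
        and substitutions needed to turn s into t."""
        m, n = len(s), len(t)
        # dist[i][j] = distance between s[:i] and t[:j]
        dist = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dist[i][0] = i
        for j in range(n + 1):
            dist[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                                 dist[i][j - 1] + 1,         # insertion
                                 dist[i - 1][j - 1] + cost)  # substitution
        return dist[m][n]

    print(levenshtein("Peter", "Pedr"))   # 2
    print(levenshtein("Miera", "Meila"))  # 3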

Damerau-Levenshtein Distance

Minimum number of single-character insertions, deletions, substitutions, and transpositions of adjacent characters required to convert one string into another. The distance between Peter and Pedr stays at 2, since no adjacent transposition helps, but the distance between Miera and Meila drops from 3 to 2, because the swap of i and e now counts as a single operation.
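
A sketch of the same dynamic program extended with one extra case for adjacent transpositions (strictly speaking, this is the restricted, optimal-string-alignment variant of Damerau-Levenshtein):

    def damerau_levenshtein(s: str, t: str) -> int:
        """Levenshtein distance extended with transpositions of adjacent
        characters (restricted / optimal-string-alignment variant)."""
        m, n = len(s), len(t)
        dist = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dist[i][0] = i
        for j in range(n + 1):
            dist[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                                 dist[i][j - 1] + 1,         # insertion
                                 dist[i - 1][j - 1] + cost)  # substitution
                if (i > 1 and j > 1 and s[i - 1] == t[j - 2]
                        and s[i - 2] == t[j - 1]):
                    # transposition of two adjacent characters counts as one operation
                    dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
        return dist[m][n]

    print(damerau_levenshtein("Peter", "Pedr"))   # 2 (same as plain Levenshtein)
    print(damerau_levenshtein("Miera", "Meila"))  # 2 (ie -> ei is one transposition)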

The distances above can be slow to compute over large datasets. A more popular measure, similar in spirit, is the Jaro-Winkler distance. It is normalized to values between 0 (completely different strings) and 1 (identical strings), so despite its name it behaves like a similarity score.
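
Jaro-Winkler has a few minor variants; below is a sketch of one common formulation (prefix weight 0.1 and a prefix cap of 4 characters, which are the usual defaults). In practice one would typically call an existing implementation from a string-matching library rather than write it by hand.

    def jaro(s: str, t: str) -> float:
        """Jaro similarity: based on matching characters within a window
        and the number of transpositions among them."""
        if s == t:
            return 1.0
        if not s or not t:
            return 0.0
        window = max(max(len(s), len(t)) // 2 - 1, 0)
        s_matched = [False] * len(s)
        t_matched = [False] * len(t)
        matches = 0
        for i, ch in enumerate(s):
            lo, hi = max(0, i - window), min(len(t), i + window + 1)
            for j in range(lo, hi):
                if not t_matched[j] and t[j] == ch:
                    s_matched[i] = t_matched[j] = True
                    matches += 1
                    break
        if matches == 0:
            return 0.0
        s_ordered = [s[i] for i, used in enumerate(s_matched) if used]
        t_ordered = [t[j] for j, used in enumerate(t_matched) if used]
        transpositions = sum(a != b for a, b in zip(s_ordered, t_ordered)) // 2
        return (matches / len(s) + matches / len(t)
                + (matches - transpositions) / matches) / 3

    def jaro_winkler(s: str, t: str, prefix_weight: float = 0.1) -> float:
        """Jaro similarity boosted for a shared prefix of up to 4 characters."""
        base = jaro(s, t)
        prefix = 0
        for a, b in zip(s, t):
            if a != b or prefix == 4:
                break
            prefix += 1
        return base + prefix * prefix_weight * (1 - base)

    print(round(jaro_winkler("Peter", "Pedr"), 3))   # about 0.83 under this formulation
    print(round(jaro_winkler("Peter", "Peter"), 3))  # 1.0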

Once the similarity (or distance) for each link key has been computed, we combine the scores using rules, probabilistic approaches, or machine-learning-based approaches. We’ll discuss just the first two.

Rule Based Approaches

Say we have two lists of individuals. Both lists have the first name, last name, and year of birth. One possible linkage rule would be to link all pairs of records such that —

  • the Jaro-Winkler distance of first names > 0.9, and

  • the Jaro-Winkler distance of last names > 0.9, and

  • the first 3 digits of the year of birth are the same.
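
As a sketch, the rule above might be coded like this, reusing the jaro_winkler function sketched in the scoring section (any library implementation would work equally well); the example records are made up:

    def is_match(rec_a: dict, rec_b: dict) -> bool:
        """Rule-based linkage: every condition in the rule must hold."""
        return (jaro_winkler(rec_a["first_name"], rec_b["first_name"]) > 0.9
                and jaro_winkler(rec_a["last_name"], rec_b["last_name"]) > 0.9
                and str(rec_a["year_of_birth"])[:3] == str(rec_b["year_of_birth"])[:3])

    a = {"first_name": "David", "last_name": "Miller", "year_of_birth": 1961}
    b = {"first_name": "David", "last_name": "Millar", "year_of_birth": 1963}
    print(is_match(a, b))  # True: similar names, and both years start with "196"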

Such methods work only when there are just a few fields and the practitioner has enough domain knowledge to craft such rules.

Probabilistic Record Linkage

(We give a simplified description so that you can have a feel for the method. In practice, this method requires some additional sophisticated techniques such as EM-based optimization.)

Here we assume that the scoring function for any field assigns a categorical value, such as one of “good match”, “intermediate match”, and “bad match”. So if using the Jaro-Winkler distance, discretize it into a few levels.
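
For example, a Jaro-Winkler score could be bucketed into the three levels as follows; the cut-off values are arbitrary choices for illustration:

    def discretize(score: float) -> str:
        """Map a Jaro-Winkler similarity in [0, 1] to a categorical level.
        The 0.92 / 0.75 cut-offs are illustrative, not standard values."""
        if score >= 0.92:
            return "good match"
        if score >= 0.75:
            return "intermediate match"
        return "bad match"

    print(discretize(0.96))  # good match
    print(discretize(0.83))  # intermediate match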

Say A and B are two lists of individuals we want to link. Further suppose we have random samples of A and B, call them A’ and B’, which form our training data. We require that every pair in A’ × B’ be labelled as a true match or a non-match. Call the matched pairs M and the unmatched pairs U.

Suppose we want to determine matches in the rest of A and B using two fields: first name and year of birth.

Consider any potential pair (a, b). Let’s say the scoring function for first name assigns the pair the value “intermediate match”. We determine the probability of seeing such a value among the matches in M, P(first name has “intermediate match” | M), by looking at the fraction of pairs in M that have this value for the field first name.

Similarly, we determine the probability of seeing such a value among the non-matches, P(first name has “intermediate match” | U).

We compute such probabilities corresponding to year of birth as well. Suppose the score for year of birth for (a, b) is “good match”. We then compute P(year has “good match” | M) and P(year has “good match” | U).

Now we are ready to find the ratio of the probabilities, assuming independence between first name and year of birth.

R = ( P(first name has “intermediate match” | M) * P(year has “good match” | M) ) / ( P(first name has “intermediate match” | U) * P(year has “good match” | U) ).

Again based on the training data, we determine thresholds T1 and T2 (with T1 > T2): if the ratio R is above T1 we declare (a, b) a match, if it is below T2 a non-match, and pairs in between are sent for manual clerical review.
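
A simplified sketch of this computation, with made-up training labels and illustrative thresholds (a real implementation would estimate these quantities more carefully, e.g. with EM):

    from collections import Counter

    def conditional_probs(labelled_pairs, field):
        """Estimate P(level | M) and P(level | U) for one field by counting
        over labelled training pairs given as (field_levels, is_match) tuples."""
        counts = {True: Counter(), False: Counter()}
        totals = {True: 0, False: 0}
        for levels, is_match in labelled_pairs:
            counts[is_match][levels[field]] += 1
            totals[is_match] += 1
        return {label: {lvl: c / totals[label] for lvl, c in counts[label].items()}
                for label in (True, False)}

    # Made-up labelled training pairs: discretized field scores plus a match label.
    training = [
        ({"first_name": "good match", "year": "good match"}, True),
        ({"first_name": "intermediate match", "year": "good match"}, True),
        ({"first_name": "bad match", "year": "bad match"}, False),
        ({"first_name": "intermediate match", "year": "bad match"}, False),
        ({"first_name": "bad match", "year": "good match"}, False),
    ]
    probs = {field: conditional_probs(training, field) for field in ("first_name", "year")}

    def likelihood_ratio(levels):
        """R = product over fields of P(level | M) / P(level | U),
        assuming the fields are independent given the match status."""
        eps = 1e-6  # avoids division by zero for levels unseen in training
        r = 1.0
        for field, level in levels.items():
            r *= probs[field][True].get(level, eps) / probs[field][False].get(level, eps)
        return r

    T1, T2 = 10.0, 0.1  # illustrative thresholds; in practice tuned on training data
    pair = {"first_name": "intermediate match", "year": "good match"}
    R = likelihood_ratio(pair)
    decision = "match" if R > T1 else ("non-match" if R < T2 else "clerical review")
    print(R, decision)  # here R = 1.5 * 3.0 = 4.5, so the pair goes to clerical review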

This method is useful when there are many fields but no clear sense of exactly how each one helps in the linkage.

Blocking

If one dataset has n records and another dataset has m records, we have to do the above computation for each of the n*m pairs. For n = m = 1 million, we need to examine 1 trillion pairs.

Instead, we reduce the number of pairs we examine by clustering the records into subgroups (blocks) and only examining pairs that fall in the same subgroup.

For instance, we could make a blocking key by concatenating the first letter of the last name with the postal code. Then only pairs that have the same value for the blocking key are examined. This can save a lot of time.
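
A sketch of this blocking scheme, assuming records are dicts with made-up last_name and postal_code fields:

    from collections import defaultdict
    from itertools import product

    def blocking_key(record: dict) -> str:
        """First letter of the last name concatenated with the postal code."""
        return record["last_name"][:1].upper() + str(record["postal_code"])

    def candidate_pairs(list_a, list_b):
        """Group both lists by blocking key and only pair records that share a key."""
        blocks_a, blocks_b = defaultdict(list), defaultdict(list)
        for rec in list_a:
            blocks_a[blocking_key(rec)].append(rec)
        for rec in list_b:
            blocks_b[blocking_key(rec)].append(rec)
        for key in blocks_a.keys() & blocks_b.keys():
            yield from product(blocks_a[key], blocks_b[key])

    A = [{"last_name": "Miller", "postal_code": "94305"},
         {"last_name": "Chen",   "postal_code": "07701"}]
    B = [{"last_name": "Millar", "postal_code": "94305"},
         {"last_name": "Chen",   "postal_code": "07701"}]
    print(sum(1 for _ in candidate_pairs(A, B)))  # 2 of the 4 possible pairs survive blocking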

Privacy Issues

Sometimes record linkage may reveal the identity of persons in a dataset who were supposed to remain anonymous. E.g., a Harvard researcher was able to identify 40% of anonymous donors in a DNA study because the data included zip codes, dates of birth, and genders, and the researcher combined these with voter rolls, etc.

So mechanisms for ensuring privacy are vital. For this reason, record linkage is often performed by a trusted third party rather than the data users. Research into cryptographic techniques for linking encrypted data is ongoing.