Record Linkage

Due: Friday, Feb 26th at 4pm

You must work alone on this assignment.

In this assignment, you will take a pair of datasets containing restaurant names and addresses and link them, i.e., find records in the two datasets that refer to the same restaurant. This task is non-trivial when there are discrepancies in the names and addresses of the same restaurants across the two data sources.

This assignment will give you experience in probabilistic record linkage, a technique that you may find valuable for your project and when working with real-world data in the future.

Here is an overview of the tasks you will perform. Some of these will make more sense after you have read the full description.

  1. The restaurant datasets are tables of names and addresses provided by the restaurant review companies Zagat and Fodor’s, and are in the files zagat.csv and fodors.csv. Your first task is to create a pandas dataframe for each dataset consisting of four columns: (i) the sequentially-numbered index from the file, (ii) the restaurant name, (iii) the city, and (iv) the address (except for the city). Call the dataframes zagat and fodors.

  2. For record linkage, you will use a training dataset of known links, available in the file known_links.csv. For a given row, the first column indicates the index of a restaurant in the Zagat dataset, and the second column indicates the index of the same restaurant in the Fodor’s dataset. Your next task is to read this dataset into a pandas dataframe and generate a dataframe of the three string (non-index) columns from the Zagat dataset, and the three columns from the Fodor’s dataset, for the same restaurants. Call a pair from the known links a match, and the dataframe matches.

  3. You will also need a dataset of pairs that are not matches. We will call such a pair an unmatch. You will create this by choosing random pairs from zagat and fodors. Call this the unmatches dataframe. (The terms match and unmatch follow the convention in record linkage.)

  4. You will use the Jaro-Winkler distance to measure the similarity between a field in a pair of records (one each from zagat and fodors). We have provided a set of histograms that show the distribution of these scores, for each of the fields (name, city, and address), for matches and for unmatches. You should examine these histograms to understand the rationale for the algorithm we will adopt.

  5. To link records, you will use (a simplified version of) the probabilistic record linkage approach. In this part you will represent the distances between the fields of a pair of restaurants as a tuple of three values. Each field will be rated as having a "low", "medium", or "high" similarity across the two datasets. (It turns out that categorizing the similarity values into three buckets works better than the two that would result from a simple cutoff.) Since each tuple has three slots, and each can be one of three values, there are 27 possible tuples. Your next task is to count the occurrence of each tuple for the matches and the unmatches. Here you will use the matches and unmatches dataframes constructed earlier.

  6. Next, you will place the 27 different possible tuples into three categories: match_tuples, possible_tuples, and unmatch_tuples. If a test pair maps to a tuple in the first set, we classify it as a match; if it maps to a tuple in the second set, we classify it as a possible match; else we classify it as an unmatch. (We will trust the decisions made by our algorithm for the clear matches and unmatches, but possible matches need to be inspected manually.)

  7. It will then be time to find matches. You will iterate through all potential pairs from the two datasets, for each pair rate the similarities of the three fields and generate the corresponding tuple, and then determine the category to which it belongs. Based on that set, you will mark the pair as a match, possible match, or unmatch. You will print the number of each type you find, and also return all matches, possible matches, and unmatches from your overall function.

  8. Iterating through all possible pairs is time consuming in general. In the final step, you will repeat the above algorithm, but only consider potential pairs that also have the same value for city.

Create restaurant dataframes

The restaurant datasets are tables of names, cities, and addresses provided by the restaurant review companies Zagat and Fodor’s, and are in the files zagat.csv and fodors.csv. Your first task is to create a pandas dataframe for each dataset consisting of four columns: (i) the index number from the csv file, (ii) the restaurant name, (iii) the city, and (iv) the address (except for the city).

The review companies actually provide the restaurant information as a single string. We have split this string into the name, city, and address using regular expressions for you. Because an automated process was used, there may be restaurants for which this extraction was performed imperfectly. This is not an ideal situation, but it is realistic, and the data is in sufficiently good shape for the probabilistic algorithm we will employ.

  • It is best not to name the restaurant-name column name, because pandas dataframes have a predefined attribute called name, and you will get unexpected results if you use the dataframe_name.column_name style of referring to that column.
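
For concreteness, here is a minimal sketch of loading the two files. It assumes the CSV files have no header row and four comma-separated columns in the order described above; the column labels used here (such as restaurant_name) are illustrative choices, not required names.

import pandas as pd

def load_restaurants(filename):
    # Assumed layout: index, restaurant name, city, address, no header row.
    # The restaurant-name column is deliberately not called "name" (see the note above).
    cols = ["index", "restaurant_name", "city", "address"]
    return pd.read_csv(filename, names=cols, header=None)

zagat = load_restaurants("zagat.csv")
fodors = load_restaurants("fodors.csv")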

Create the matches dataframe

For record linkage, you will use a training dataset of known links, available in the file known_links.csv. It contains 50 manually-linked pairs of restaurants. In other words, a human has chosen some of the rows in one dataset, and determined the corresponding rows in the other dataset. The first restaurant in the pair is from Zagat and the second from Fodor’s. Rows in the file correspond to these pairs, and the two columns give the indices for a restaurant within the corresponding files.

Using this file and the other dataframes, create a new dataframe called matches with six columns: the name, city, and address from Zagat, and the corresponding three fields from Fodor’s.
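
One possible way to assemble this dataframe is sketched below. It assumes known_links.csv has two unlabeled columns (the Zagat index, then the Fodor’s index) and reuses the illustrative column labels from the earlier sketch; adjust both to your actual layout.

import pandas as pd

links = pd.read_csv("known_links.csv", names=["z_index", "f_index"], header=None)

fields = ["restaurant_name", "city", "address"]
z_part = (zagat.set_index("index")
               .loc[links["z_index"], fields]
               .reset_index(drop=True))
f_part = (fodors.set_index("index")
                .loc[links["f_index"], fields]
                .reset_index(drop=True))
z_part.columns = ["z_name", "z_city", "z_address"]
f_part.columns = ["f_name", "f_city", "f_address"]
matches = pd.concat([z_part, f_part], axis=1)  # 50 rows, six columns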

Create the unmatches dataframe

It turns out that for record linkage, you also need a dataset of unmatches. For this, choose 1000 random restaurants from zagat and pair them with 1000 random restaurants from fodors. To make the random choices, the pandas sample method for sampling a dataframe is suitable.

Technically, the random pairs may include some that actually match, but the overwhelming majority will be unmatches, and we will ignore this minor source of error.

Why are we using 1000 random pairs for unmatches, when we only have 50 for matches? Because we can, and the more pairs we have, the better the accuracy of our record linkage. Of course, having more matches would have been helpful as well, but that requires expensive manual work.

The random choices you make will influence the rest of your results. This makes it hard to debug a program if the bug only shows up for some random choices. Furthermore, making the same random choices in your program as we do in ours makes it easier for us to check your results. Therefore, we want you to seed the random generator with a fixed number. Specifically, use the following calls to choose the samples:

zs = zagat.sample(1000, replace=True, random_state=1234)
fs = fodors.sample(1000, replace=True, random_state=5678)

These calls will each always return the same 1000 rows because the random state is fixed through the seeds 1234 and 5678. In normal circumstances, once your code is working, you would remove these hard-coded seeds. But, we ask that you leave them in to facilitate testing of your submitted code.
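
A rough sketch of turning these two samples into the unmatches dataframe, mirroring the six-column layout used for matches (the column labels are again illustrative):

import pandas as pd

fields = ["restaurant_name", "city", "address"]
z_part = zs[fields].reset_index(drop=True)
f_part = fs[fields].reset_index(drop=True)
z_part.columns = ["z_name", "z_city", "z_address"]
f_part.columns = ["f_name", "f_city", "f_address"]
unmatches = pd.concat([z_part, f_part], axis=1)  # 1000 random pairs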

Explore Jaro-Winkler scores in histograms

To link records, we take a potential pair, one each from zagat and fodors, and decide if their strings are sufficiently similar. Our results are better when the strings are broken up into natural components (like name, city, and address), which we have done for you.

When the components are strings, a natural measure of the distance between two strings is the edit distance, also known as the Levenshtein distance. It is defined as the minimum number of single-character insertions, deletions, and substitutions required to convert one string into another. In practice, however, the Levenshtein distance is computationally intensive, and a related measure, called the Jaro-Winkler distance, is used instead.

The exact definition of the Jaro-Winkler distance is somewhat technical (but for those interested, available here). Also, although it is called a distance, it actually measures the similarity between two strings: a complete match (“Medici” and “Medici”) has a score of 1, and a complete mismatch (“Medici” and “Subway”) has a score of 0.

You will use the Python library jellyfish to compute the Jaro-Winkler distance between two strings. Install it using the standard:

pip3 install --user --upgrade jellyfish

To find the Jaro-Winkler distance, use, e.g.,:

import jellyfish
score = jellyfish.jaro_winkler("medici", "noodles etc.")

To make sense of the algorithm we will be using for record linkage, you need to gain an understanding of the distribution of the Jaro-Winkler distance between names, cities, and addresses of pairs of restaurants, both when the pairs are matches and when they are unmatches.

Here is a link to a PDF file containing six histograms, labelled appropriately. The first pair of histograms shows the Jaro-Winkler distances between restaurant names, one histogram for matches and one for unmatches. The second and third pairs are analogous, but for cities and addresses, respectively.

Estimate Tuple probabilities given match or unmatch

Observe that the probability of a match does not increase monotonically with the similarity for a field; rather, there are clusters in which this probability is high. For instance, for the address field, the probability is lower in the range 0.80-0.95 than just outside this range. The clusters are even more pronounced for the unmatch pairs.

The above implies that a single “cut-off” on the similarity for a field, or even a group of fields, will not serve our purpose. So we break up the range of similarity values into discrete chunks, which we will call "low", "medium", and "high", and determine the probability of a match for each combination of field-wise chunks separately.

We have adopted (a simplified version of) the probabilistic record linkage approach proposed by Fellegi and Sunter. Provided in utils.py is a simple utility function get_jw_category() that takes a Jaro-Winkler distance and returns the string "low", "medium", or "high", essentially breaking the range of the Jaro-Winkler score into three blocks. We apply this function to all three fields: names, cities, and addresses. Thus for any pair of restaurants, we can create a tuple (name_category, city_category, address_category) that represents the similarity of the fields for that pair. Whether, during record linkage, the pair should be classified as a match or unmatch, or something in between, will depend on whether that tuple was most closely associated with matches or unmatches when we trained the algorithm, and our tolerance for error.
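
For example, a small helper along the following lines could build the tuple for a single pair. It assumes get_jw_category is importable from the provided utils.py and that each record is represented as a (name, city, address) triple of strings:

import jellyfish
from utils import get_jw_category

def similarity_tuple(z_fields, f_fields):
    # Each argument is a (name, city, address) triple of strings.
    # Returns (name_category, city_category, address_category).
    return tuple(get_jw_category(jellyfish.jaro_winkler(z, f))
                 for z, f in zip(z_fields, f_fields))

# e.g. similarity_tuple(("medici", "chicago", "1327 e 57th st."),
#                       ("cafe medici", "chicago", "1327 e. 57th"))
# might yield something like ("medium", "high", "high").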

Specifically, we determine whether a tuple should be classified as a match, possible match, or unmatch, based on estimates of the probability that a matched pair results in that tuple as well as the probability that an unmatched pair results in that tuple. Formally, assume that \((r_1, r_2)\) is a potential pair, \(t(r_1, r_2)\) is the tuple formed from its field similarities, and \(T\) is the set of all possible tuples. For every \(w \in T\) we need estimates for two quantities:

\[\begin{split}\begin{array}{l} m(w) \text{ defined as }\mathrm{P}[t(r_1, r_2) = w \ |\ (r_1, r_2) \text{ is a match}], \text{ and} \\ u(w) \text{ defined as } \mathrm{P}[t(r_1, r_2) = w \ |\ (r_1, r_2) \text{ is an unmatch}]. \end{array}\end{split}\]

Compute estimates for the former by iterating through all the pairs in matches, determining their tuples, and counting the frequency of each of the 27 possible tuples (combinations of the three similarity levels in the three different positions) during this process. The relative frequency for a tuple \(w\) is the estimate of the probability for \(w\) given a matching pair. Similarly find an estimate for the probability given an unmatch by iterating through tuples for all pairs in unmatches.
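
A sketch of this counting step, assuming the six-column matches and unmatches dataframes built earlier (Zagat fields first) and the similarity_tuple helper sketched above; tuples that never occur simply have an estimated probability of 0:

from collections import Counter

def tuple_probabilities(pairs_df):
    # Estimate the relative frequency of each similarity tuple among the pairs
    # in pairs_df (columns: z_name, z_city, z_address, f_name, f_city, f_address).
    counts = Counter()
    for row in pairs_df.itertuples(index=False):
        counts[similarity_tuple(row[:3], row[3:])] += 1
    return {w: count / len(pairs_df) for w, count in counts.items()}

m = tuple_probabilities(matches)    # estimates of m(w)
u = tuple_probabilities(unmatches)  # estimates of u(w)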

Partition Tuples into match, possible match, and unmatch sets

How should we decide which tuples belong in which of the three different categories: match_tuples, possible_tuples, and unmatch_tuples? It depends on our tolerance for two kinds of errors:

  • false positive rate, the probability that we classify an actual unmatch as a match, and

  • false negative rate, the probability that we classify an actual match as an unmatch.

Assume that \(\mu\) is the maximum false positive rate and \(\lambda\) is the maximum false negative rate we are willing to accept, and, given these are satisfied, we want to maximize the number of our matches.

In order to classify which tuples should be associated with matches, unmatches, and possible matches, we do the following:

  1. Assign all tuples \(w\) that have \(m(w) = 0\) and \(u(w) = 0\) to possible_tuples. Intuition: We have not seen either matches or unmatches with these similarity levels in our training data, so our algorithm cannot make any inferences about them, and human intervention will be required.

  2. Sort the rest of the 27 tuples such that the tuples with \(u(w) = 0\) are at the front of the list, and the remaining are in decreasing (non-increasing) order of \(m(w)/u(w)\). Intuition: We want a ranked order of how likely a tuple is to be associated with a match. If we have never seen a tuple associated with an unmatch in our training data, then we will assume that any pair with the same tuple is a match. We prioritize these “especially good” tuples in our sorted order. After those, we rank the tuples by the ratio of how often they have been associated with matches to how often they have been associated with unmatches, and put the ones most likely to be associated with matches earlier.

  3. Starting at the first tuple in this sorted order, find the last tuple \(w_1\) such that the cumulative sum of \(u(w)\) values (for \(w\) from first to \(w_1\)) is at most \(\mu\). Add all tuples from the first till \(w_1\) (inclusive) to match_tuples. Intuition: We want to limit ourselves to no more than a specified false positive rate (\(\mu\)). We have sorted the tuples in order of their being most likely to be associated with matches. We take as many as we can, starting with the best ones, until we are at the point where taking another one would cause us to be likely to exceed our false positive rate because the likelihood that a pair is actually an unmatch has become too high.

  4. Starting at the last tuple (the first tuple in reverse order), find the furthest-from-last tuple \(w_2\) such that the cumulative sum of \(m(w)\) values is at most \(\lambda\). Add tuples from the last till \(w_2\) (inclusive) to unmatch_tuples. Intuition: We choose the worst tuples, in terms of being likely to be associated with a match, and assign them to be used for unmatches. If we overdo this, then we might exceed our false negative rate, so we stop just before the point where this would happen.

  5. Add any remaining tuples (in the ones that were sorted) to possible_tuples. Intuition: We are unwilling to let an algorithm make any decisions about pairs that have these tuples, because we have already used up our false positive and false negative budget. Any such restaurant pairs will need to be looked at by a human.

There is a subtlety in this description: the above only works when \(w_1\) happens to be ahead of \(w_2\). If they “cross over” instead, then this means that some of the tuples could sensibly be used to assign matches or unmatches, and either way, we would not exceed our targets for false positives or false negatives.

What should we do, if faced with this situation? It comes down to our priorities. If we want to prioritize having more matches (and we do), then we should count all of these ambiguous tuples as being associated with matches. Remember, in this scenario, this is okay, since we still will not exceed our false positive rate in the process.
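
Here is a rough sketch of the partitioning step, under the assumption that m and u are dictionaries mapping tuples to their estimated probabilities (as in the earlier sketch), with absent tuples treated as probability 0:

import itertools

def partition_tuples(m, u, mu, lambda_):
    levels = ["low", "medium", "high"]
    all_tuples = list(itertools.product(levels, repeat=3))  # 27 tuples

    # Step 1: tuples never seen in training go straight to possible_tuples.
    possible_tuples = [w for w in all_tuples
                       if m.get(w, 0) == 0 and u.get(w, 0) == 0]
    rest = [w for w in all_tuples if w not in possible_tuples]

    # Step 2: u(w) == 0 tuples first, then decreasing m(w)/u(w).
    def sort_key(w):
        uw = u.get(w, 0)
        return (0, -m.get(w, 0)) if uw == 0 else (1, -m.get(w, 0) / uw)
    rest.sort(key=sort_key)

    # Step 3: walk forward while the cumulative u(w) stays within mu.
    i, cum_u = 0, 0.0
    while i < len(rest) and cum_u + u.get(rest[i], 0) <= mu:
        cum_u += u.get(rest[i], 0)
        i += 1
    match_tuples = rest[:i]

    # Step 4: walk backward while the cumulative m(w) stays within lambda_.
    # The j >= i bound resolves a "cross over" in favor of matches.
    j, cum_m = len(rest) - 1, 0.0
    while j >= i and cum_m + m.get(rest[j], 0) <= lambda_:
        cum_m += m.get(rest[j], 0)
        j -= 1
    unmatch_tuples = rest[j + 1:]

    # Step 5: whatever is left in the middle stays a possible match.
    possible_tuples += rest[i:j + 1]
    return set(match_tuples), set(possible_tuples), set(unmatch_tuples)

This is only a sketch of steps 1–5 above; in particular, the tie-breaking among the \(u(w) = 0\) tuples is not prescribed by the assignment, and the order chosen here is just one reasonable option.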

Find matches, possible matches, and unmatches

Once you have the three sets for given maximum false positive and false negative rates, simply iterate through every pair in zagat and fodors to create three DataFrames of matches, possible matches, and unmatches. All three DataFrames should have six columns: the name, city, and address from Zagat, followed by the name, city, and address from Fodor’s. Return these three DataFrames from the find_matches function, in the order above.

Block on City

Iterating through all potential pairs is computationally prohibitive for large datasets. So the number of potential pairs is often reduced by applying a constraint, such as only considering pairs that have the same value for a blocking key. Implement a version of the record linkage algorithm that only considers pairs of restaurants that have the exact same city in both datasets.

This version will be extremely similar to the previous one; thus, you should have one implementation that simply takes whether to block on city, or not, as a parameter, and proceeds accordingly.

Note: Do not change how you assess the training data in the version of the algorithm that performs blocking. Compute the frequencies of different tuples, and the partitioning of the 27 tuples, exactly as before. The only change should be that, when you iterate through all pairs of restaurants, you disregard any pair where the two entries do not have identical cities.
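
The classification loop itself might look roughly like the sketch below, with blocking reduced to a single check that skips pairs whose cities differ. It assumes the illustrative column labels and the similarity_tuple helper from the earlier sketches:

import pandas as pd

def classify_all_pairs(zagat, fodors, match_tuples, possible_tuples,
                       unmatch_tuples, block_on_city):
    fields = ["restaurant_name", "city", "address"]
    matched, possible, unmatched = [], [], []
    for _, z in zagat.iterrows():
        for _, f in fodors.iterrows():
            if block_on_city and z["city"] != f["city"]:
                continue  # blocking: only consider pairs with identical cities
            w = similarity_tuple(tuple(z[fields]), tuple(f[fields]))
            pair = tuple(z[fields]) + tuple(f[fields])
            if w in match_tuples:
                matched.append(pair)
            elif w in possible_tuples:
                possible.append(pair)
            else:
                unmatched.append(pair)
    cols = ["z_name", "z_city", "z_address", "f_name", "f_city", "f_address"]
    return (pd.DataFrame(matched, columns=cols),
            pd.DataFrame(possible, columns=cols),
            pd.DataFrame(unmatched, columns=cols))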

Task checklist

Write Python code, with appropriate auxiliary functions and documentation, that implements the following function:

find_matches(mu, lambda_, block_on_city)

which takes mu, the maximum false positive rate, lambda_, the maximum false negative rate, and block_on_city, a boolean that indicates whether to block on the city or not. It should return a 3-tuple containing DataFrames with the matches, possible matches, and unmatches it found, in that order.
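
For example, a call might look like the following; the rate values here are purely illustrative, not thresholds prescribed by the assignment:

matches_df, possible_df, unmatches_df = find_matches(0.005, 0.005,
                                                     block_on_city=False)
print(len(matches_df), len(possible_df), len(unmatches_df))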

In particular, ensure that you:

  1. Create zagat, fodors, matches, and unmatches.

  2. Determine relative frequencies for 27 tuples corresponding to pairs from matches and unmatches.

  3. Sort the tuples and create the sets match_tuples, possible_tuples, and unmatch_tuples.

  4. Generate matches, possible matches, and unmatches for all potential pairs from zagat and fodors, possibly after blocking on city.

  5. Return the generated matches, possible matches, and unmatches in DataFrames.

Getting started

We have seeded your repository with a directory for this assignment. To pick it up, change to your cmsc12200-win-21-username directory (where the string username should be replaced with your username), run git pull to make sure that your local copy of the repository is in sync with the server, and then run git pull upstream master to pick up the distribution.

Submission

Follow the usual submission instructions.

Acknowledgement

This assignment was originally designed and written by Amitabh Chaudhary.