Record Linkage¶
Due: February 19th at 4:30pm CST
You must work alone for this project.
In this assignment, you will identify or link records from two datasets that refer to the same restaurant. The datasets, which come from two restaurant review companies, contain the names and addresses of a group of restaurants. The task of linking this information is non-trivial when the restaurants’ names and addresses are not guaranteed to be identical across the two data sets.
This assignment will give you experience in probabilistic record linkage, a technique that you will find useful when working with real-world data in the future. We will train an algorithm to classify pairs of entries as matches, unmatches (that is, not a match), and possible matches. The training algorithm has several steps:
Convert each restaurant name and address into three fields: a name, city, and street address.
Compute the probability of a match using hand-labeled match data.
Compute the probability of an unmatch using randomly-chosen unmatch data.
Use the computed probabilities to train the classifier.
Once you have trained the algorithm, we will apply it to the full datasets.
Getting started¶
Using the invitation URL provided on Ed Discussion, create a repository for your PA #4 files.
After you accept the assignment, GitHub will create your repository and send you an email when it is ready. Normally, the repository is ready within seconds and you can simply refresh the page, though it can occasionally take a few minutes.
Next, make sure you've set the GITHUB_USERNAME variable by running the following (replacing replace_me with your GitHub username):

GITHUB_USERNAME=replace_me

(Remember that you can double-check whether the variable is properly set by running echo $GITHUB_USERNAME.)
And finally, run these commands (if you don’t recall what some of these commands do, they are explained at the end of the Git Tutorial):
cd ~/capp30122
git clone git@github.com:uchicago-CAPP30122-win-2022/pa4-$GITHUB_USERNAME.git
cd pa4-$GITHUB_USERNAME
You will find the files you need for the programming assignment directly in the root of your repository.
The directory pa4 includes the following files:

- record_linkage.py: skeleton file for the assignment (you will add code to this file only)
- test_record_linkage.py: test code
- util.py: utility functions
- data: directory containing the data used for testing
- output: directory containing the expected output
Data¶
We will be using data provided by Zagat and Fodor’s. The companies
provide the information for a given restaurant as a single string. We
have split these strings into name, city, and address fields using
regular expressions and constructed two CSV files (data/zagat.csv
and data/fodors.csv
). This automated extraction process generates
a less-than-perfect split for some restaurants. While the results are
not perfect, the process used is realistic, and the data is in
sufficiently good shape to use with the probabilistic algorithm that
we will employ.
The files data/zagat.csv
and data/fodors.csv
each contain four
columns: index, restaurant name, city, and street address. Here, for
example, are the first few rows from data/zagat.csv
:
index,restaurant name,city,street address
0,Apple Pan The,West LA,10801 W. Pico Blvd.
1,Arnie Morton's of Chicago,Los Angeles,435 S. La Cienega Blvd.
2,Art's Deli,Studio City,12224 Ventura Blvd.
3,Asahi Ramen,West LA,2027 Sawtelle Blvd.
4,Baja Fresh,Westlake Village,3345 Kimber Dr.
You must read the review data into Pandas dataframes. When you load
the data, we recommend (1) using the index
column, which contains
a unique identifier (an integer) for each row, as the row index and
(2) setting the type for the remaining columns to str
.
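For concreteness, here is a minimal sketch of one way to load the review data following these recommendations; it assumes the column names shown in the sample above, and the exact structure of your loading code is up to you:

import pandas as pd

# Read every column except the index as a string; use the integer
# "index" column as the row index.
str_cols = {"restaurant name": str, "city": str, "street address": str}
zagat = pd.read_csv("data/zagat.csv", index_col="index", dtype=str_cols)
fodors = pd.read_csv("data/fodors.csv", index_col="index", dtype=str_cols)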
In addition to the review data, we have also provided a training
dataset of known links in a file named data/known_links.csv
.
This file contains 50 manually-linked pairs of restaurants. In other
words, a human chose some of the rows in one dataset, and determined
the corresponding rows in the other dataset. The first restaurant in
the pair is from the Zagat dataset and the second from the Fodor’s
dataset. Rows in the known-links file correspond to these pairs, and
the two columns give the indexes for a restaurant within the
corresponding files. For example, the first non-header line of the
known-links file is “269,386”, so the row with index 269 in the Zagat
dataset is deemed to be a “match” with the row with index 386 in the
Fodor’s dataset.
We’ve also provided a file, data/unmatch_pairs.csv
, that contains
1000 pairs of randomly-chosen indexes, one from each review dataset.
We will deem these pairs to be unmatches. Why are we using 1000
random pairs for unmatches, when we only have 50 for matches? The more pairs we have, the
better the accuracy of our record linkage. Of course, having more matches
would also have been helpful, but that requires expensive manual work.
Your first task is to load the data.
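Here is a rough sketch of loading the pair files as well; it assumes both files have a header row (verify this against the files themselves), and the variable names are ours, not part of the skeleton:

import pandas as pd

# Each row holds a (Zagat index, Fodor's index) pair; we only need the
# two columns positionally, so the header names do not matter here.
known_links = pd.read_csv("data/known_links.csv")
unmatch_pairs = pd.read_csv("data/unmatch_pairs.csv")

matched_pairs = list(known_links.itertuples(index=False, name=None))
random_unmatches = list(unmatch_pairs.itertuples(index=False, name=None))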
Computing the probability of a match or an unmatch¶
We will explain the process of converting dataframes along with the matched and unmatched pairs into something more abstract that can be used to help classify other restaurants as matches or unmatches in two steps. First, we will explain how to compute similarity tuples and then how to use the resulting tuples to estimate the probability of a match or an unmatch.
Constructing similarity tuples¶
In this assignment, we will compute the similarity of the name, city, and address fields separately. A natural measure of similarity for strings is the edit distance, also known as the Levenshtein distance. It is defined as the minimum number of single-character insertions, deletions, and substitutions required to convert one string to another. Computing the Levenshtein distance is computationally expensive, so in practice a related measure, called the Jaro-Winkler distance, is used instead.
The exact definition of the Jaro-Winkler distance is somewhat technical (but available here). Also, although it is called a distance, it actually measures the similarity between two strings: a complete match (“Medici” and “Medici”) has a score of 1.0, and a complete mismatch (“Medici” and “Subway”) has a score of 0.0.
You will use the Python library jellyfish to compute the Jaro-Winkler distance between two strings. It is already installed on the CS machines, but you will need to install it on your VM using the following command:
$ sudo -H pip3 install jellyfish
Specifically, we’ll use the function jellyfish.jaro_winkler
to
compute the Jaro-Winkler distance. To make this discussion more
concrete, here are the results of computing the Jaro-Winkler score for
the name, city, and address fields from the first matched pair (index
269 from the Zagat dataset and index 386 from the Fodor’s dataset):
In [1]: import jellyfish
In [2]: jellyfish.jaro_winkler('Ritz-Carlton Restaurant',
...: 'Restaurant Ritz-Carlton Atlanta')
Out[2]: 0.7786561264822135
In [3]: jellyfish.jaro_winkler('Atlanta', 'Atlanta')
Out[3]: 1.0
In [4]: jellyfish.jaro_winkler('181 Peachtree St.',
...: '181 Peachtree St.')
Out[4]: 1.0
As we will see in class, the probability of a match does not increase monotonically with the Jaro-Winkler similarity scores, and so a "cut-off" on the similarity for a single field or several fields will not serve our purpose. Instead, we break up the range of similarity values into discrete chunks, which we will call "low", "medium", and "high", and determine the probability of a match for each combination of field-wise chunks separately.
We have provided a simple utility function get_jw_category() in util.py that breaks up the range of Jaro-Winkler similarity scores into three categories. It takes a Jaro-Winkler distance and returns the string "low", "medium", or "high". We will apply this function to all three fields: names,
cities, and addresses. Thus, for any pair of rows from the review
datasets, we can create a similarity tuple (name_category,
city_category, address_category) that represents the similarity of the
fields for the corresponding pair of rows. Applying this process to
the rows that correspond to the indexes 269 and 386 in the Zagat and
Fodor’s datasets respectively yields the similarity tuple: ('low',
'high', 'high')
.
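As a quick sketch of this step (assuming the zagat and fodors dataframes from the earlier sketch; the helper name similarity_tuple is ours, not part of the skeleton):

import jellyfish
import util

def similarity_tuple(zagat_row, fodors_row):
    """Build the (name, city, address) similarity-category tuple for a pair of rows."""
    return tuple(
        util.get_jw_category(jellyfish.jaro_winkler(zagat_row[col], fodors_row[col]))
        for col in ("restaurant name", "city", "street address")
    )

# For example, the pair (269, 386) should yield ('low', 'high', 'high')
print(similarity_tuple(zagat.loc[269], fodors.loc[386]))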
Estimating the probabilities¶
Whether, during record linkage, a pair of rows should be classified as a match, unmatch, or something in between will depend on whether its (name_category, city_category, address_category) tuple was most closely associated with matches or unmatches when we trained the algorithm, and our tolerance for error.
Specifically, we will determine whether a tuple should be classified as a match, possible match, or unmatch based on estimates of the probability that a matched pair results in that tuple as well as the probability that an unmatched pair results in that tuple. Formally, assume that \((r_1, r_2)\) is a potential pair, \(t(r_1, r_2)\) is the tuple formed from its field similarities, and \(T\) is the set of all possible tuples. For every \(w \in T\) we need estimates for two quantities:

- \(m(w)\), the probability that a matched pair produces similarity tuple \(w\), that is, \(P(t(r_1, r_2) = w \mid (r_1, r_2) \text{ is a match})\), and
- \(u(w)\), the probability that an unmatched pair produces similarity tuple \(w\), that is, \(P(t(r_1, r_2) = w \mid (r_1, r_2) \text{ is an unmatch})\).
Your task is to compute estimates for the former by iterating through all the pairs of rows corresponding to the known matches, determining their similarity tuples, and counting the frequency of each of the 27 possible similarity tuples (combinations of the three similarity levels in the three different positions) during this process. We’ll use the relative frequency for a tuple \(w\) as our estimate of the probability for \(w\) given a matching pair. Similarly find an estimate for the probability given an unmatch by iterating through unmatch pairs and computing the frequency of the corresponding similarity tuples.
To make this more concrete: your second task is to construct two dictionaries: one that maps similarity tuples to match probabilities and the other that maps similarity tuples to unmatch probabilities.
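A minimal sketch of this counting step, assuming the matched_pairs and random_unmatches lists and the similarity_tuple helper from the earlier sketches:

def tuple_probabilities(pairs, zagat, fodors):
    """Map each similarity tuple observed in pairs to its relative frequency."""
    counts = {}
    for z_idx, f_idx in pairs:
        w = similarity_tuple(zagat.loc[z_idx], fodors.loc[f_idx])
        counts[w] = counts.get(w, 0) + 1
    return {w: n / len(pairs) for w, n in counts.items()}

match_probs = tuple_probabilities(matched_pairs, zagat, fodors)
unmatch_probs = tuple_probabilities(random_unmatches, zagat, fodors)

Tuples that never occur for a given kind of pair are simply absent from the corresponding dictionary (an implicit probability of zero).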
Partition tuples into match, possible match, and unmatch sets¶
This step is tricky. It is important that you read this section extremely carefully.
The goal of this step is to compute a dictionary that maps each
similarity tuple to one of "match"
, "possible match"
,
or "unmatch"
.
How should we decide which tuples belong in which of the three
different categories: match
, possible match
, and unmatch
?
It depends on our tolerance for two kinds of errors:
false positive rate, the probability that we classify an actual unmatch as a match, and
false negative rate, the probability that we classify an actual match as an unmatch.
Assume that \(\mu\) is the maximum false positive rate and \(\lambda\) is the maximum false negative rate we are willing to accept, and, given these are satisfied, we want to maximize the number of our matches.
In order to classify which tuples should be associated with matches, unmatches, and possible matches, we do the following:
1. Label tuples \(w\) that have \(m(w) = 0\) and \(u(w) = 0\) with "possible match". Intuition: we have not seen either matches or unmatches with these similarity levels in our training data, so our algorithm cannot make any inferences about them, and human intervention will be required. Given the data we are using for this assignment, the following nine tuples will be labeled with "possible match" under this rule:

   ('high', 'low', 'high')
   ('high', 'low', 'low')
   ('high', 'low', 'medium')
   ('low', 'low', 'high')
   ('low', 'low', 'medium')
   ('low', 'medium', 'high')
   ('medium', 'high', 'low')
   ('medium', 'low', 'medium')
   ('medium', 'medium', 'medium')
2. Sort the remaining 18 tuples so that the tuples with \(u(w) = 0\) come first, and the rest follow in decreasing (non-increasing) order of \(m(w)/u(w)\). Intuition: we want a ranked order of how likely a tuple is to be associated with a match. If we have never seen a tuple associated with an unmatch in our training data, then we will assume that any pair with the same tuple is a match. We prioritize these "especially good" tuples in our sorted order. After those, we rank the tuples in terms of the ratio of how often they have been associated with matches versus how often they have been associated with unmatches, and put the ones most likely to be associated with matches earlier.

Getting the order right is crucial, so we have provided a function util.sort_prob_tuples that takes a list of tuples of the form \((w, m(w), u(w))\) and sorts them appropriately for you. Notice that the tuples taken by the function include a similarity tuple along with its match probability and its unmatch probability. Here is the ordering our code yields from this process:
| Index | (similarity tuple, match probability, unmatch probability) |
|---|---|
| 0 | (('high', 'high', 'high'), 0.27999999999999997, 0.0) |
| 1 | (('high', 'medium', 'high'), 0.19999999999999998, 0.0) |
| 2 | (('high', 'medium', 'low'), 0.1, 0.0) |
| 3 | (('medium', 'high', 'high'), 0.08, 0.0) |
| 4 | (('high', 'high', 'low'), 0.06, 0.0) |
| 5 | (('high', 'high', 'medium'), 0.06, 0.0) |
| 6 | (('low', 'high', 'high'), 0.06, 0.0) |
| 7 | (('medium', 'medium', 'high'), 0.06, 0.0) |
| 8 | (('high', 'medium', 'medium'), 0.02, 0.0) |
| 9 | (('medium', 'high', 'medium'), 0.02, 0.0) |
| 10 | (('medium', 'low', 'high'), 0.02, 0.0) |
| 11 | (('medium', 'low', 'low'), 0.02, 0.001) |
| 12 | (('low', 'medium', 'medium'), 0.02, 0.008) |
| 13 | (('low', 'high', 'low'), 0.0, 0.060000000000000046) |
| 14 | (('low', 'high', 'medium'), 0.0, 0.003) |
| 15 | (('low', 'low', 'low'), 0.0, 0.7900000000000006) |
| 16 | (('low', 'medium', 'low'), 0.0, 0.1370000000000001) |
| 17 | (('medium', 'medium', 'low'), 0.0, 0.001) |
3. Starting at the first tuple in the resulting list, \((w_0, m(w_0), u(w_0))\), find the last tuple \((w_i, m(w_i), u(w_i))\) such that the cumulative sum of the \(u(w_j)\) values (for \(0 \le j \le i\)) is at most \(\mu\). Label these similarity tuples \(w_j\), for \(0 \le j \le i\), with "match". Intuition: we want to limit ourselves to no more than a specified false positive rate (\(\mu\)). We have sorted the tuples in order of being most likely to be associated with matches. We take as many as we can, starting with the best ones, until we reach the point where taking another one would likely cause us to exceed our false positive rate, because the likelihood that a pair is actually an unmatch has become too high.

For example, given the above list and a value of \(0.005\) for \(\mu\), we would label the similarity tuples from the first 12 elements (index 0 through 11) with "match". The sum of the unmatch probabilities from these elements is \(0.001\). Notice that the unmatch probability for the next element is \(0.008\), which would push our cumulative sum above \(\mu = 0.005\).

4. Starting at the last tuple, \((w_{n-1}, m(w_{n-1}), u(w_{n-1}))\), find the farthest-from-last tuple \((w_k, m(w_k), u(w_k))\) such that the cumulative sum of the \(m(w_j)\) values (for \(k \le j \le n-1\)) is at most \(\lambda\). Label the similarity tuples \(w_j\), for \(k \le j \le n-1\), with "unmatch". Intuition: we choose the worst tuples, in terms of being likely to be associated with a match, and assign them to be used for unmatches. If we overdo this, then we might exceed our false negative rate, so we stop just before the point where this would happen.

For example, given the above list and a value of \(0.005\) for \(\lambda\), we would label the similarity tuples from index 13 through index 17 (inclusive) with "unmatch". The sum of the match probabilities from these elements is 0.0. The tuple at index 12 has a match probability of \(0.02\); including it would push the cumulative sum of the match probabilities above \(\lambda = 0.005\).

5. Tuples from the sorted list that are still unlabeled should be labeled with "possible match". Intuition: we are unwilling to let an algorithm make any decisions about pairs that have these tuples, because we have already used up our false positive and false negative budgets. Any such restaurant pairs will need to be looked at by a human.

In our running example, given \(\mu = 0.005\) and \(\lambda = 0.005\), the tuple at index 12 in the list above would be labeled with "possible match".
There is a subtlety in this description: the above only works if \(i < k\). If they "cross over" instead, then this means that some of the tuples could sensibly be used to assign matches or unmatches, and either way, we would not exceed our targets for false positives or false negatives.
What should we do if faced with this situation? We prefer having more matches, and so we will label these ambiguous tuples with "match". Remember, in this scenario, this is okay, since we still will not exceed our false positive rate in the process. If we were to prefer having more unmatches, then we would label these ambiguous tuples with "unmatch". (FYI, this situation occurs with the above sorted list when \(\mu = 0.005\) and \(\lambda = 0.05\).)
Computing the labels for the similarity tuples is an essential part of this algorithm. Stop and make sure your code is producing the correct labels before you move on to the next task. Here are a few test cases you can run:
| Test | \(\mu\) | \(\lambda\) | matches | possible matches | unmatches |
|---|---|---|---|---|---|
| 1 | 0.005 | 0.005 | 0-11 | 12-12 | 13-17 |
| 2 | 0.0 | 0.01 | 0-10 | 11-12 | 13-17 |
| 3 | 0.0 | 0.03 | 0-10 | 11-11 | 12-17 |
| 4 | 0.005 | 0.05 | 0-11 | (none) | 12-17 |
The ranges refer to the sorted list of probability tuples above and
are inclusive. So, the first row means that you should label the
similarity tuples from index 0 through index 11 (inclusive) in the
sorted list above with "match"
, the similarity tuple from index 12
with "possible match"
, and similarity tuples from index 13 through
the end of the list with "unmatch"
. The last case tests the
cross-over situation. Notice that we have biased towards matches.
To reiterate: for this part, your code will need to use the match and unmatch probabilities along with \(\mu\) and \(\lambda\) to generate a dictionary that maps each similarity tuple to one of "match", "possible match", or "unmatch".
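Here is a minimal sketch of one way to implement this step. It assumes the match_probs and unmatch_probs dictionaries from the previous task and that util.sort_prob_tuples returns the sorted list; the function name and structure are ours, not a required design:

import itertools
import util

def label_tuples(match_probs, unmatch_probs, mu, lambda_):
    # Map every one of the 27 similarity tuples to "match",
    # "possible match", or "unmatch".
    labels = {}
    sortable = []
    for w in itertools.product(("low", "medium", "high"), repeat=3):
        m_w = match_probs.get(w, 0.0)
        u_w = unmatch_probs.get(w, 0.0)
        if m_w == 0.0 and u_w == 0.0:
            # Step 1: never seen in the training data
            labels[w] = "possible match"
        else:
            sortable.append((w, m_w, u_w))

    # Step 2: tuples with u(w) == 0 first, then decreasing m(w)/u(w)
    sorted_tuples = util.sort_prob_tuples(sortable)

    # Step 3: take "match" labels from the front while the cumulative
    # unmatch probability stays at most mu
    last_match = -1
    cum_u = 0.0
    for idx, (_, _, u_w) in enumerate(sorted_tuples):
        if cum_u + u_w > mu:
            break
        cum_u += u_w
        last_match = idx

    # Step 4: take "unmatch" labels from the back while the cumulative
    # match probability stays at most lambda_
    first_unmatch = len(sorted_tuples)
    cum_m = 0.0
    for idx in range(len(sorted_tuples) - 1, -1, -1):
        m_w = sorted_tuples[idx][1]
        if cum_m + m_w > lambda_:
            break
        cum_m += m_w
        first_unmatch = idx

    # Step 5 (and the cross-over case): anything labeled by neither pass is a
    # "possible match"; in the cross-over case the tie goes to "match".
    for idx, (w, _, _) in enumerate(sorted_tuples):
        if idx <= last_match:
            labels[w] = "match"
        elif idx >= first_unmatch:
            labels[w] = "unmatch"
        else:
            labels[w] = "possible match"
    return labels

You can check a sketch like this against the table of test cases above before wiring it into the rest of your code.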
Find matches, possible matches, and unmatches¶
Given a dictionary that maps similarity tuples to labels, you can now
classify any pair of rows from the Zagat and Fodor’s datasets as a
"match"
, a "possible match"
, or an "unmatch"
simply by
converting the pair of rows into a similarity tuple and then looking up
the resulting tuple in the label dictionary constructed in the previous task.
When time is not a factor, we will compare every row in one dataset to every row in the other. As you will see, this process is computationally expensive. So, when time is a factor, we will use a technique known as blocking. Specifically, we will only classify pairs that have the exact same value for the city field. We will ignore all other pairs (and assume that they do not match.)
Specifically, your task is to complete the function:
def find_matches(output_filename, mu, lambda_, block_on_city=False):
which takes output_filename, the name of the output file, mu, the maximum false positive rate, lambda_, the maximum false negative rate, and block_on_city, a boolean that indicates whether to block on the city or not. (lambda is a reserved word in Python, so notice that the parameter name ends with a trailing underscore, i.e., lambda_.) The function should generate a CSV file as described below.
When block_on_city
is false, your function should iterate through
every possible pair of rows in the Zagat and Fodor’s datasets and
classify the pair as a "match"
, a "possible match"
, or an
"unmatch"
.
When block_on_city
is true, your function should classify only the
pairs in which the cities match exactly. More concretely, when
looking for matches for a given row in the Zagat dataset, you should
consider only those rows from the Fodor’s dataset that have exactly
the same city as the Zagat row.
The final result of your function will be a CSV file that includes a row for each pair that you classify. Each row will have the Zagat row index, the Fodor's row index, and the label ("match", "possible match", or "unmatch").
Important: You should write the rows to the CSV file as you generate them. Pandas does not perform well when working with multiple large dataframes at once. As a result, your code will perform very poorly if you construct an intermediate data structure with the indexes and labels before you output the results.
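A rough sketch of this classification loop, assuming the zagat and fodors dataframes, the similarity_tuple helper, and a labels dictionary from the partitioning step (all names come from the earlier sketches, not the skeleton):

import csv

def write_matches(output_filename, zagat, fodors, labels, block_on_city=False):
    with open(output_filename, "w", newline="") as f:
        writer = csv.writer(f)
        for z_idx, z_row in zagat.iterrows():
            if block_on_city:
                # Blocking: compare only rows whose city fields match exactly
                candidates = fodors[fodors["city"] == z_row["city"]]
            else:
                candidates = fodors
            for f_idx, f_row in candidates.iterrows():
                w = similarity_tuple(z_row, f_row)
                # Write each row as it is generated rather than building an
                # intermediate data structure
                writer.writerow([z_idx, f_idx, labels[w]])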
Testing¶
We have provided the usual pytest code for find_matches, but it is quite slow (e.g., testing our solution takes roughly 100 seconds). Before you run it, you should do some simple sanity checking.
First, make sure you have checked that your training code is producing the correct labeling for the similarity tuples for different values of \(\mu\) and \(\lambda\). See the table of tests for labeling similarity tuples above.
Second, given an output file (say, output/basic_expected.csv
) in
the expected format you can count the number of items labeled as
"match"
, "possible match"
, "unmatch"
using the following
linux commands:
$ cut -f3 -d"," output/basic_expected.csv | sort | uniq -c
271 match
1618 possible match
174534 unmatch
Replace output/basic_expected.csv with the appropriate file name to count the number of matches, etc., in another file.
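If you prefer to check from within Python, here is a quick sketch using pandas; it assumes the output file has no header row, and the filename follows the "_actual.csv" convention described below:

import pandas as pd

results = pd.read_csv("output/basic_actual.csv", header=None,
                      names=["zagat_index", "fodors_index", "label"])
print(results["label"].value_counts())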
Here is a table that shows the expected counts for the different test cases we use in our test code:
| Prefix | \(\mu\) | \(\lambda\) | block_on_city | matches | possible matches | unmatches |
|---|---|---|---|---|---|---|
| output/basic | 0.005 | 0.005 | False | 271 | 1618 | 174534 |
| output/push_match_to_possible | 0.0 | 0.01 | False | 120 | 1769 | 174534 |
| output/pickup_unmatch_from_possible | 0.0 | 0.03 | False | 120 | 427 | 175876 |
| output/cross_over | 0.005 | 0.05 | False | 271 | 276 | 175876 |
| output/basic_blocking | 0.005 | 0.005 | True | 75 | 6 | 9379 |
The first column is a prefix for the output filename. To construct
the filename with the expected results, which can be found in the
output
directory, append "_expected.csv"
to the prefix. When
you run your code by hand, we recommend using the prefix appended with
“_actual.csv” for the output file (as we do in the test code).
Once your code has passed these two sanity checks, give the test code a try by running the following Linux command:
$ py.test -xv
As in previous assignments, you can obtain your test score by running py.test followed by ./grader.py.
Grading¶
Programming assignments will be graded according to a general rubric. Specifically, we will assign points for completeness, correctness, design, and style. (For more details on the categories, see our PA Rubric page.)
The exact weights for each category will vary from one assignment to another. For this assignment, the weights will be:
Completeness: 50%
Correctness: 15%
Design: 25%
Style: 10%
Notice that design is a big component of this assignment. We will not be providing design advice specifically for this assignment. You should think very carefully about how to decompose the work for this assignment into functions. Don't be surprised if you revise your decomposition a few times. We did!
Cleaning up¶
Before you submit your final solution, you should remove:

- any print statements that you added for debugging purposes, and
- all in-line comments of the form "YOUR CODE HERE" and "REPLACE …".
Also, check your code against the style guide. Did you use good variable names? Do you have any lines that are too long? And so on.
Make sure you have included header comments, that is, the triple-quote strings that describe the purpose, inputs, and return values of each function, for every function you have written.
As you clean up, you should periodically save your file and run your code through the tests to make sure that you have not broken it in the process.
Submission¶
You must submit your work through Gradescope (linked from our Canvas site). In the "Programming Assignment #4" assignment, simply upload the file record_linkage.py (do not upload any other file!). Please note:
You are allowed to make as many submissions as you want before the deadline.
Please make sure you have read and understood our Late Submission Policy.
Your completeness score is determined solely based on the automated tests, but we may adjust your score if you attempt to pass tests by rote (e.g., by writing code that hard-codes the expected output for each possible test input).
Gradescope will report the test score it obtains when running your code. If there is a discrepancy between the score you get when running our grader script, and the score reported by Gradescope, please let us know so we can take a look at it.
Acknowledgments¶
This assignment was originally designed and written by Amitabh Chaudhary.