Data Cleaning and Record Linkage¶
Due: Thursday, Feb 16th at 5pm
You may work alone or in a pair on this assignment.
In this assignment you will take a pair of datasets containing restaurant names and addresses, clean them, and link them, i.e., find records in the two datasets that refer to the same restaurant.
This assignment will give you extensive experience in data cleaning and munging with pandas, as well as experience with matplotlib and with probabilistic record linkage.
Here is an overview of the various tasks you’ll perform. Some of these will make more sense after you have read the full description.
- The restaurant datasets are lists of names and addresses provided by the restaurant review companies Zagat and Fodor's, and are in the files zagat.txt and fodors.txt. Your first task is to create a pandas dataframe for each consisting of four columns: (i) the original full name and address string, and extracted columns consisting of (ii) the restaurant name, (iii) the city, and (iv) the address (except for the city). Call the dataframes zagat and fodors.
- For record linkage you will use a training dataset of known links, available in the file known_links.txt. Your next task is to read this dataset into a pandas dataframe and find the correspondence between the pairs in the known links and the rows in zagat and fodors. Call a pair from the known links a match, and the dataframe matches.
- You will also need a dataset of pairs that are not matches. We will call such a pair an unmatch. You will create this by choosing random pairs from zagat and fodors. Call this the unmatches dataframe. (The terms match and unmatch follow the convention in record linkage.)
- You will use the Jaro-Winkler distance to measure the similarity between a field in a pair of records (one each from zagat and fodors). Your fourth task will be to explore the distribution of the Jaro-Winkler scores when a pair is a match and compare it to when the pair is an unmatch. (This step is not strictly required for record linkage, but will help clarify the rationale behind the approach we adopt.) For this you will plot histograms of the distance values for six cases, and create a pdf file containing all your histograms.
- To link records, you will use (a simplified version of) the probabilistic record linkage approach. In this approach you will represent the distances between the fields of a pair of restaurants as a vector of three integers. There are 27 possible such vectors. Your next task is to find the probability of each vector given whether the pair it came from is a match or an unmatch. Here you will use the matches and unmatches dataframes.
- Next, you will order the vectors and partition them into three sets: match_vectors, possible_vectors, and unmatch_vectors. If a test pair maps to a vector in the first set, we classify it as a match; if it maps to a vector in the second set, we classify it as a possible match; otherwise we classify it as an unmatch. (Possible matches need to be verified manually.)
- It will then be time to find matches. You will iterate through all potential pairs, find the corresponding vector for each, and determine the set it belongs to. Based on the set, you will mark the pair as a match, possible match, or unmatch. You will print the number of each type you find, and also store all matches in a csv file.
- Iterating through all possible pairs is time-consuming in general. In the final step, you will repeat the above step but only consider potential pairs that also have the same value for name, city, or address.
Create restaurant dataframes¶
The restaurant datasets are lists of names and addresses provided by
the restaurant review companies Zagat and Fodor’s, and are in the
files zagat.txt and fodors.txt. Your first task is to create
a pandas dataframe for each consisting of four columns: (i) the
original full name and address string, (ii) the restaurant name, (iii)
the city, and (iv) the address (except for the city). You will extract
the latter three columns from the first, and use them for record
linkage later on.
Extracting new fields from text¶
pandas provides the function extract() that takes a regular
expression to create new fields from text data. Assume, e.g., that the
original name and address strings from Zagat are in a column named
nameaddress in zagat. Then the call:
zagat['nameaddress'].str.extract(r'^([^\d]*)(\d.*)$', expand=True)
returns a dataframe with two columns—each corresponding to a group
in the regular expression—the first containing the longest
prefix of the nameaddress
string that contains no digits, and the
second containing the rest of the string. The columns of the returned
dataframe can be appended to the zagat
dataframe using the pandas
function concat().
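For example, the extraction and concatenation steps can be combined along the following lines. This is only a sketch: it assumes the original strings are in zagat['nameaddress'], and the new column names restaurant_name and rest_of_string are illustrative choices, not required ones.
import pandas as pd

# Each regex group becomes one column of the returned dataframe.
extracted = zagat['nameaddress'].str.extract(r'^([^\d]*)(\d.*)$', expand=True)
extracted.columns = ['restaurant_name', 'rest_of_string']  # illustrative names

# Append the extracted columns to zagat, aligning rows on the index.
zagat = pd.concat([zagat, extracted], axis=1)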
Use (possibly multiple calls to) extract()
to create the name, city,
and address columns. We found the following useful in this data munging process:
- Build the regular expression(s) you need by iterative refinement: look at some of the data, try out an initial regular expression, observe where it works and where it fails, and improve upon it. If the regular expression doesn't match a given string, extract() returns null values. So if you want to look at the strings where your regular expression failed, use, e.g., zagat.loc[df.iloc[:, 0].isnull()], in which df is the name of the dataframe returned by extract().
- Even so, you may not find a single regular expression that extracts all three columns. It may be easier to extract the columns in stages, using separate calls to extract(). Further, there may be special cases that are unlike all other data and are best handled by hard-coding their values. In the approximately 850 rows in the combined data, we hard-coded the extraction for about 20 restaurants. You should not hard-code more than 50 cases.
- Addresses often start with a street number, but not always, and sometimes the restaurant name itself contains a number as well.
- City names often begin after the name of a road, i.e., after a substring ending with “Blvd.”, “Ave.”, “PCH”, or “Broadway”. It is simple to build a complete list of such suffixes by iterative refinement.
- We recommend using a jupyter notebook, at least for the iterative refinement process.
- You should be comfortable with the different indexing and selection mechanisms in pandas.
- It is best not to name the restaurant name column name, because pandas dataframes have a predefined attribute called name, and you get unexpected results if you use the dataframename.columnname style of referring to a column.
Create the matches dataframe¶
For record linkage, you will use a training dataset of
known links, available in the file known_links.txt
. It contains 50
pairs of restaurant names and addresses that are known to refer to the
same restaurant. The first in the pair is from Zagat and the second
from Fodor’s. Your next data munging task is to read this dataset
into a pandas dataframe called matches
. You will find read_csv()
unsuitable for
this due to the irregular formatting in known_links.txt
. For
example, sometimes a restaurant string is in a single line, and
sometimes split across two lines. We recommend writing custom Python
code to read the dataset.
You will also find that not all restaurant strings are exactly equal to the corresponding ones in the Zagat and Fodor’s datasets, and will have to correct those. This correspondence will be useful during record linkage.
Create the unmatches dataframe¶
It turns out that for record linkage you also need a dataset of
unmatches.
For this choose 1000 random restaurants
from zagat
and pair them with 1000 random restaurants from fodors.
There are several ways to make random choices: Python’s random library, Numpy functions
for creating random arrays,
or pandas' own sample function for sampling a dataframe.
Technically the random pairs may include some pairs that actually match, but the overwhelming number will be unmatches, and we’ll ignore this minor source of error.
Why are we using 1000 random pairs for unmatches, when we only have 50 for matches? Because we can, and the more pairs we have, the better the accuracy of our record linkage. Of course, having more matches would also be helpful, but that requires expensive manual work.
Caution. The random choices you make will influence the rest of your results. This makes it hard to debug a program if the bug only shows up for some random choices. So during the development and debugging process, it is useful to seed the random generator with a fixed number. E.g., when using pandas the following call:
zagat.sample(1000, replace = True, random_state = 1234)
always returns the same 1000 rows because the random state is fixed
through the seed 1234. Once your code is bug-free you should remove
the random_state
option.
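For example, the unmatches dataframe can be assembled by sampling each dataframe and placing the samples side by side. This is only a sketch; the '_z' and '_f' suffixes are an illustrative way to keep the Zagat and Fodor's columns apart.
import pandas as pd

# Sample 1000 rows from each dataframe; reset the indexes so that concat
# pairs the rows positionally.  random_state is fixed only for debugging.
zagat_sample = zagat.sample(1000, replace=True, random_state=1234).reset_index(drop=True)
fodors_sample = fodors.sample(1000, replace=True, random_state=5678).reset_index(drop=True)

# Place the two samples side by side, one random pair per row.
unmatches = pd.concat([zagat_sample.add_suffix('_z'),
                       fodors_sample.add_suffix('_f')], axis=1)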
Explore Jaro-Winkler scores, plot histograms¶
To link records we take a potential pair, one each from zagat
and fodors
, and
decide if their strings are similar enough. Our results are better
when the strings are broken up into natural components, which you have
already accomplished—into the components name, city, and address.
When the components are strings, a natural measure of similarity, or rather the distance between two strings, is the edit distance, also known as the Levenshtein distance. It is defined as the minimum number of single-character insertions, deletions, and substitutions required to convert one string to another. In practice, however, the Levenshtein distance is computationally intensive, and a related measure, called the Jaro-Winkler distance is used instead.
The exact definition of the Jaro-Winkler distance is somewhat technical (but available here). Also, although it is called a distance, it actually measures the similarity between two strings: a complete match (“Medici” and “Medici”) has a score of 1, and a complete mismatch (“Medici” and “Subway”) has a score of 0.
You will use the Python library jellyfish to compute the Jaro-Winkler distance between two strings. Install it using the standard:
sudo pip3 install -U jellyfish
To find the Jaro-Winkler distance use, e.g.,:
import jellyfish
score = jellyfish.jaro_winkler("medici", "noodles etc.")
Your next task is to explore the distribution of the Jaro-Winkler distance between names, cities, and addresses of pairs of restaurants, both when the pairs are matches and when they are unmatches.
Your specific task here is to create a pdf file named histograms.pdf, containing six histograms, labelled appropriately.
The first pair of histograms is for the Jaro-Winkler distances between
names for restaurants, one each for matches and unmatches. The second
and third are similar, except for cities and addresses.
Here you will find the pandas function plot.hist() useful, as well as the matplotlib.pyplot functions figure(), subplot(), xlabel(), title(), and savefig(). See, e.g., the tutorial on working with multiple figures.
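For example, the six plots can be laid out on a single page along the following lines. This is only a sketch: it assumes that matches and unmatches each hold Jaro-Winkler scores in columns named name_jw, city_jw, and address_jw, which are illustrative names.
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 12))
for i, field in enumerate(['name_jw', 'city_jw', 'address_jw']):
    for j, (label, df) in enumerate([('matches', matches), ('unmatches', unmatches)]):
        plt.subplot(3, 2, 2 * i + j + 1)      # 3 rows x 2 columns of histograms
        plt.hist(df[field], bins=20, range=(0, 1))
        plt.title('{} for {}'.format(field, label))
        plt.xlabel('Jaro-Winkler score')
plt.tight_layout()
plt.savefig('histograms.pdf')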
Our histograms are in histograms_sample.pdf. Your results will vary because of different random choices, and possibly different extraction processing.
Estimate vector probabilities given match or unmatch¶
Observe that the probability of a match does not increase monotonically with the similarity for a field; rather, there are clusters in which this probability is high. For instance, in our results, for the address field, the probability is lower in the range 0.80-0.95 than just outside this range. The clusters are even more pronounced for the unmatch pairs.
The above implies that a single “cut-off” on the similarity for a field, or even a group of fields, will not serve our purpose. So we break up the range of similarity values into discrete chunks, and determine the probability of a match for each chunk separately from that of neighboring chunks. Each chunk (in the 3-dimensional space of name, city, and address similarities) can be identified by a vector as described below.
We have adopted (a simplified version of) the probabilistic record
linkage approach proposed by Fellegi and Sunter.
Provided in utils.py
is a simple utility function
get_jw_category()
that takes a Jaro-Winkler distance and returns
an integer category between 0 and 2, essentially breaking the range of
the Jaro-Winkler score into three blocks. We apply this on all three
fields: names, cities, and addresses. Thus for any pair of
restaurants, we can create a vector (name_category, city_category,
address_category) that represents the similarity of the fields for
that pair. Whether, during record linkage, the pair should be
classified as a match or unmatch, or something in between, will depend
on certain probabilities associated with the corresponding vector.
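For example, the vector for a pair could be computed along the following lines. This is only a sketch: the field names restaurant_name, city, and address are placeholders for whatever column names you chose during extraction, and get_jw_category() is the utility provided in utils.py.
import jellyfish
import utils

def build_vector(zagat_row, fodors_row):
    # Compute the (name_category, city_category, address_category) vector
    # for one potential pair; the field names are placeholders.
    categories = []
    for field in ['restaurant_name', 'city', 'address']:
        score = jellyfish.jaro_winkler(zagat_row[field], fodors_row[field])
        categories.append(utils.get_jw_category(score))
    return tuple(categories)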
We determine whether a vector should be classified as a match, possible match, or unmatch, based on estimates of the probability that a pair results in that vector given that the pair is a match, as well as given that the pair is an unmatch. Formally, assume that \((r_1, r_2)\) is a potential pair, \(v(r_1, r_2)\) is the vector formed from its field similarities, and \(V\) is the set of all possible vectors. For every \(w \in V\) we need estimates for two quantities:
- \(m(w)\), the probability that \(v(r_1, r_2) = w\) given that \((r_1, r_2)\) is a match, and
- \(u(w)\), the probability that \(v(r_1, r_2) = w\) given that \((r_1, r_2)\) is an unmatch.
Compute estimates for the former by iterating through all the pairs in
matches
, determining their vectors, and counting the frequency of
each of the 27 possible vectors during this process. The relative
frequency for a vector \(w\) is the estimate of the probability
for \(w\) given a matching pair. Similarly find an estimate for
the probability given an unmatch by iterating through vectors for all
pairs in unmatches
.
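For example, the relative frequencies can be tallied along the following lines. This is only a sketch: it assumes the pairs are stored with '_z' and '_f' suffixed columns (as in the unmatches sketch above) and reuses the build_vector() sketch from the previous section.
from collections import Counter

def estimate_probabilities(pairs_df):
    # Estimate the probability of each vector as its relative frequency
    # among the given pairs; vectors that never occur get probability 0.
    fields = ['restaurant_name', 'city', 'address']   # illustrative names
    counts = Counter()
    for _, row in pairs_df.iterrows():
        zagat_half = {f: row[f + '_z'] for f in fields}
        fodors_half = {f: row[f + '_f'] for f in fields}
        counts[build_vector(zagat_half, fodors_half)] += 1
    total = len(pairs_df)
    return {vector: count / total for vector, count in counts.items()}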
Partition vectors into match, possible match, and unmatch sets¶
How should we partition the set of vectors \(V\) into the three different sets match_vectors, possible_vectors, and unmatch_vectors? It depends on our tolerance for two kinds of errors:
- false positive rate, the probability that we classify an actual unmatch as a match, and
- false negative rate, the probability that we classify an actual match as an unmatch.
Assume that \(\mu\) is the maximum false positive rate and \(\lambda\) is the maximum false negative rate we are willing to tolerate; given that these are satisfied, we want to maximize the number of our matches.
In order to partition the set of vectors \(V\), we do the following—
- Assign all vectors \(w\) that have \(m(w) = 0\) and \(u(w) = 0\) to possible_vectors.
- Order the rest of the vectors so that those with \(u(w) = 0\) come before all others, and the remaining vectors follow in decreasing (non-increasing) order of \(m(w)/u(w)\).
- Beginning from the first vector in the order above, find the last vector \(w_1\) such that the cumulative sum of the \(u(w)\) values (for \(w\) from the first vector to \(w_1\)) is at most \(\mu\). Add all vectors from the first through \(w_1\) (inclusive) to match_vectors.
- Beginning from the last vector and moving in reverse order, find the furthest-from-last vector \(w_2\) such that the cumulative sum of the \(m(w)\) values is at most \(\lambda\). Add all vectors from the last through \(w_2\) (inclusive) to unmatch_vectors.
- Add any remaining vectors (among those that were ordered) to possible_vectors.
You should first convince yourself that the above approach makes sense. You will then realize that the above only works when \(w_1\) happens to be ahead of \(w_2\). Work out the version of the algorithm when this is not the case, and implement the complete algorithm.
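As a small illustration of the second step only (not the full partitioning), the ordering can be expressed with a sort key, assuming m and u are dictionaries mapping vectors to the probabilities estimated earlier:
def order_vectors(vectors, m, u):
    # Sketch of step 2: vectors with u(w) == 0 go in front of everyone else,
    # the rest follow in decreasing order of m(w)/u(w).  The input is assumed
    # to exclude vectors with both m(w) == 0 and u(w) == 0 (handled in step 1).
    def sort_key(w):
        if u.get(w, 0) == 0:
            return (0, 0)
        return (1, -m.get(w, 0) / u[w])
    return sorted(vectors, key=sort_key)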
Find matches and count possible matches and unmatches¶
Once you have the three sets for given maximum false positive and false negative rates, simply iterate through every pair from zagat and fodors to determine counts of matches, possible matches, and unmatches.
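A sketch of this counting loop, assuming build_vector() from earlier and the three sets from the previous step:
import itertools

counts = {'match': 0, 'possible match': 0, 'unmatch': 0}
for i, j in itertools.product(zagat.index, fodors.index):
    vector = build_vector(zagat.loc[i], fodors.loc[j])
    if vector in match_vectors:
        counts['match'] += 1
    elif vector in possible_vectors:
        counts['possible match'] += 1
    else:
        counts['unmatch'] += 1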
Block on name, city, or address¶
Iterating through all potential pairs is computationally prohibitive for large datasets. So the number of potential pairs is often reduced by applying a constraint, such as only considering pairs that have the same value for a blocking key. Implement a version of the record linkage algorithm in which name, city, or address can be supplied as the blocking key.
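With pandas, one convenient way to apply a blocking key is a merge on that column, which keeps only the pairs that agree on the key. A sketch, assuming both dataframes have a column named city (an illustrative choice of key and column name):
# Only pairs that agree on the blocking key survive the merge; name or address
# would work the same way, provided both dataframes use the same column name.
candidates = zagat.merge(fodors, on='city', suffixes=('_z', '_f'))
# Each row of candidates is one potential pair; classify it as before.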
Task checklist¶
Write Python code, with appropriate auxiliary functions and documentation, that implements the following function:
find_matches(mu, lambda_, outfile, block_on)
which takes mu, the maximum false positive rate, lambda_, the maximum false negative rate, outfile, the file name in which to store the matches, and block_on, which indicates the blocking key to use, if any. It should create the file histograms.pdf, store all the matches it finds in outfile in csv format (Zagat name and address followed by Fodor's name and address), and return a 3-tuple containing the number of matches, possible matches, and unmatches it found.
In particular, ensure you—
- Create zagat, fodors, matches, and unmatches.
- Plot histograms of name, city, and address similarity for pairs from matches and unmatches.
- Determine relative frequencies for the 27 vectors corresponding to pairs from matches and unmatches.
- Order the vectors and create the sets match_vectors, possible_vectors, and unmatch_vectors.
- Count the number of matches, possible matches, and unmatches for all potential pairs from zagat and fodors, possibly after applying a blocking key.
- Store the matches in a csv file.
Getting started¶
Follow these start-up instructions if you plan to work alone.
Follow these start-up instructions if you are going to work with the same partner as you had for PA #2 or PA #3.
Follow these start-up instructions if you are going to work in a new pair.
Once you follow these instructions, your repository will contain a directory named pa4.
Submission¶
Follow these submission instructions if you are working alone.
Follow these submission instructions if you are working in a pair.