==========================================
Data Cleaning and Record Linkage
==========================================

.. COMMENT Clarifications that should be incorporated:
   The exact format/columns of the found matches file that is to be written.
   The exact names of columns to use in the ``block_on`` option.

**Due: Thursday, Feb 16th at 5pm**

You may work alone or in a pair on this assignment.

In this assignment you will take a pair of datasets containing restaurant
names and addresses, clean them, and *link them*, i.e., find records in the
two datasets that refer to the same restaurant. This assignment will give
you extensive experience in data cleaning and munging with pandas, as well
as experience in using matplotlib and in probabilistic record linkage.

Here is an overview of the various tasks you'll perform. Some of these will
make more sense after you have read the full description.

#. The restaurant datasets are lists of names and addresses provided by the
   restaurant review companies Zagat and Fodor's, and are in the files
   ``zagat.txt`` and ``fodors.txt``. Your first task is to create a pandas
   dataframe for each, consisting of four columns: (i) the original full
   name and address string, and extracted columns consisting of (ii) the
   restaurant name, (iii) the city, and (iv) the address (except for the
   city). Call the dataframes ``zagat`` and ``fodors``.

#. For record linkage you will use a *training dataset* of known links,
   available in the file ``known_links.txt``. Your next task is to read this
   dataset into a pandas dataframe and find the correspondence between the
   pairs in the known links and the rows in ``zagat`` and ``fodors``. Call a
   pair from the known links a *match*, and the dataframe ``matches``.

#. You will also need a dataset of pairs that are not matches. We will call
   such a pair an *unmatch*. You will create this by choosing random pairs
   from ``zagat`` and ``fodors``. Call this the ``unmatches`` dataframe.
   (The terms *match* and *unmatch* follow the convention in record
   linkage.)

#. You will use the Jaro-Winkler distance to measure the similarity between
   a field in a pair of records (one each from ``zagat`` and ``fodors``).
   Your fourth task will be to explore the distribution of the Jaro-Winkler
   scores when a pair is a match and compare it to when the pair is an
   unmatch. (This step is not strictly required for record linkage, but it
   will help clarify the rationale behind the approach we adopt.) For this
   you will plot histograms of the distance values for six cases, and
   create a pdf file containing all your histograms.

#. To link records, you will use (a simplified version of) the
   probabilistic record linkage approach. In it you will represent the
   distances between the fields of a pair of restaurants as a vector of
   three integers. There are 27 possible such vectors. Your next task is to
   find the probability of each vector given that the pair it came from is
   a match, and given that it is an unmatch. Here you will use the
   ``matches`` and ``unmatches`` dataframes.

#. Next, you will order the vectors and partition them into three sets:
   ``match_vectors``, ``possible_vectors``, and ``unmatch_vectors``. If a
   *test* pair maps to a vector in the first set, we classify it as a
   match; if it maps to a vector in the second set, we classify it as a
   *possible match*; else we classify it as an unmatch. (Possible matches
   need to be verified manually.)

#. It will then be time to find matches. You will iterate through all
   potential pairs, for each pair find the corresponding vector and the set
   it belongs to, and, based on that set, mark the pair as a match,
   possible match, or unmatch. You will print the number of each type you
   find, and also store all matches in a csv file.

#. Iterating through all possible pairs is time consuming in general. In
   the final step, you will repeat the previous step but only consider
   potential pairs that also have the same value for either name, city, or
   address.

Create restaurant dataframes
------------------------------------

The restaurant datasets are lists of names and addresses provided by the
restaurant review companies Zagat and Fodor's, and are in the files
``zagat.txt`` and ``fodors.txt``. Your first task is to create a pandas
dataframe for each, consisting of four columns: (i) the original full name
and address string, (ii) the restaurant name, (iii) the city, and (iv) the
address (except for the city). You will extract the latter three columns
from the first, and use them for record linkage later on.

Extracting new fields from text
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pandas provides the function ``extract()``, which takes a regular
expression and creates new fields from text data. Assume, e.g., that the
original name and address strings from Zagat are in a column named
``nameaddress`` in ``zagat``. Then the call::

    zagat['nameaddress'].str.extract(r'^([^\d]*)(\d.*)$', expand=True)

returns a dataframe with two columns---each corresponding to a group in the
regular expression---the first containing the longest prefix of the
``nameaddress`` string that contains no digits, and the second containing
the rest of the string. The columns of the returned dataframe can be
appended to the ``zagat`` dataframe using the pandas function ``concat()``.

Use (possibly multiple calls to) ``extract()`` to create the name, city,
and address columns; a short sketch of this staged extraction appears after
the list of tips below. We found the following useful in this data munging
process:

- Build the regular expression(s) you need by iterative refinement---look
  at some of the data, try out an initial regular expression, observe where
  it works and where it fails, and improve upon it. If the regular
  expression doesn't match a given string, ``extract()`` returns null
  values. So if you want to look at the strings where your regular
  expression failed, use, e.g., ``zagat.loc[df.iloc[:, 0].isnull()]``, in
  which ``df`` is the name of the dataframe returned by ``extract()``.

- Even so, you may not find a single regular expression that extracts all
  three columns. It may be easier to extract the columns in stages, using
  separate calls to ``extract()``. Further, there may be special cases that
  are unlike all other data and are best handled by hard-coding their
  values. In the approximately 850 rows in the combined data, we hard-coded
  the extraction for about 20 restaurants. You should not hard-code more
  than 50 cases.

- Addresses often start with a street number, but not always, and sometimes
  the restaurant name itself contains a number as well.

- City names often begin after the name of a road, i.e., after a substring
  ending with "Blvd.", "Ave.", "PCH", or "Broadway". It is simple to build
  a complete list of such suffixes by iterative refinement.

- We recommend using a Jupyter notebook, at least for the iterative
  refinement process.

- You should be comfortable with the different indexing and selection
  mechanisms in pandas.

- It is best not to name the restaurant name column ``name``, because
  pandas dataframes have a predefined attribute called ``name``, and you
  get unexpected results if you use the ``dataframename.columnname`` style
  of referring to a column.

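Here is a minimal sketch of such a staged extraction. It is not the
solution: it assumes the raw strings are already in a ``nameaddress``
column, the column names ``restaurant``, ``rest``, ``address``, and
``city`` are just illustrative choices, and the second regular expression
is a deliberately crude starting point that you will need to refine::

    import pandas as pd

    # Two made-up strings in the style of the Zagat data, only to make the
    # sketch self-contained; in your code, zagat already has a nameaddress
    # column read from zagat.txt.
    zagat = pd.DataFrame({'nameaddress': [
        "Mel's Drive-In 1650 Sunset Blvd. Hollywood",
        "Noodle House 325 Ocean Ave. Santa Monica"]})

    # Stage 1: split each string into a digit-free prefix (candidate name)
    # and the remainder (street address followed by the city).
    parts = zagat['nameaddress'].str.extract(r'^([^\d]*)(\d.*)$', expand=True)
    parts.columns = ['restaurant', 'rest']

    # Stage 2: peel the city off the end of the remainder.  The list of
    # road suffixes is incomplete; grow it by iterative refinement.
    addr_city = parts['rest'].str.extract(
        r'^(.*(?:Blvd\.|Ave\.|St\.|PCH|Broadway))\s*(.*)$', expand=True)
    addr_city.columns = ['address', 'city']

    # Stitch the extracted columns onto the original dataframe; rows where
    # an extraction failed show up as nulls and still need attention.
    zagat = pd.concat([zagat, parts[['restaurant']], addr_city], axis=1)
    failed = zagat[zagat['address'].isnull()]
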
.. comment
   **Sample results**. The first 100 rows we extracted from the Zagat
   dataset are in the file ``zagat_extracted.csv``. Please use it only to
   get a sense of what is expected; our results may have errors. If your
   code performs better, please do not change it to match our results.

Create the matches dataframe
---------------------------------------------

For record linkage, you will use a *training dataset* of known links,
available in the file ``known_links.txt``. It contains 50 pairs of
restaurant names and addresses that are known to refer to the same
restaurant. The first in each pair is from Zagat and the second from
Fodor's.

Your next data munging task is to read this dataset into a pandas dataframe
called ``matches``. You will find ``read_csv()`` unsuitable for this due to
the irregular formatting in ``known_links.txt``. For example, sometimes a
restaurant string fits on a single line, and sometimes it is split across
two lines. We recommend writing custom Python code to read the dataset. You
will also find that not all restaurant strings are exactly equal to the
corresponding ones in the Zagat and Fodor's datasets, and you will have to
correct those. This correspondence will be useful during record linkage.

Create the unmatches dataframe
--------------------------------------------------

It turns out that for record linkage you also need a dataset of unmatches.
For this, choose 1000 random restaurants from ``zagat`` and pair them with
1000 random restaurants from ``fodors``. There are several ways to make
random choices: Python's ``random`` library, NumPy's functions for creating
random arrays, or pandas' own ``sample()`` function for sampling a
dataframe. Technically the random pairs may include some pairs that
actually match, but the overwhelming majority will be unmatches, and we'll
ignore this minor source of error.

Why are we using 1000 random pairs for unmatches, when we only have 50
pairs for matches? Because we can, and the more pairs we have, the better
the accuracy of our record linkage. Of course, having more matches would
have been helpful as well, but that requires expensive manual work.

**Caution**. The random choices you make will influence the rest of your
results. This makes it hard to debug a program if the bug only shows up for
some random choices. So during the development and debugging process, it is
useful to seed the random generator with a fixed number. E.g., when using
pandas, the call::

    zagat.sample(1000, replace=True, random_state=1234)

always returns the same 1000 rows because the random state is fixed through
the seed 1234. Once your code is bug-free, you should remove the
``random_state`` option.

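If you use pandas for the sampling, the construction of ``unmatches`` can
be as short as the following sketch. It assumes ``zagat`` and ``fodors``
already exist; the ``z_``/``f_`` column prefixes are just one convenient
(and entirely optional) way to keep the two sides of each pair apart::

    import pandas as pd

    # Sample 1000 rows from each dataframe.  The fixed random_state values
    # are for debugging only and should be dropped from the final version.
    zs = zagat.sample(1000, replace=True, random_state=1234)
    fs = fodors.sample(1000, replace=True, random_state=5678)

    # Pair them positionally: row i of zs together with row i of fs forms
    # one (presumed) unmatch.
    unmatches = pd.concat(
        [zs.reset_index(drop=True).add_prefix('z_'),
         fs.reset_index(drop=True).add_prefix('f_')],
        axis=1)
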
Explore Jaro-Winkler scores, plot histograms
-------------------------------------------------

To link records we take a potential pair, one each from ``zagat`` and
``fodors``, and decide if their strings are similar enough. Our results are
better when the strings are broken up into natural components, which you
have already done: name, city, and address.

When the components are strings, a natural measure of similarity (or
rather, of the distance between two strings) is the *edit distance*, also
known as the Levenshtein distance. It is defined as the minimum number of
single-character insertions, deletions, and substitutions required to
convert one string to another. In practice, however, the Levenshtein
distance is computationally intensive, and a related measure, called the
Jaro-Winkler distance, is used instead. The exact definition of the
Jaro-Winkler distance is somewhat technical (but easy to find online).
Also, although it is called a distance, it actually measures the similarity
between two strings: a complete match ("Medici" and "Medici") has a score
of 1, and a complete mismatch ("Medici" and "Subway") has a score of 0.

You will use the Python library ``jellyfish`` to compute the Jaro-Winkler
distance between two strings. Install it using the standard::

    sudo pip3 install -U jellyfish

To find the Jaro-Winkler distance use, e.g.::

    import jellyfish

    score = jellyfish.jaro_winkler("medici", "noodles etc.")

Your next task is to explore the distribution of the Jaro-Winkler distance
between names, cities, and addresses of pairs of restaurants, both when the
pairs are matches and when they are unmatches. Your specific task here is
to create a pdf file named ``histograms.pdf`` containing six histograms,
labelled appropriately. The first pair of histograms is for the
Jaro-Winkler distances between restaurant names, one histogram each for
matches and unmatches. The second and third pairs are similar, except for
cities and addresses. Here you will find useful the pandas function
``plot.hist()`` and the ``matplotlib.pyplot`` functions ``figure()``,
``subplot()``, ``xlabel()``, ``title()``, and ``savefig()``; see, e.g., the
matplotlib tutorial on working with multiple figures.

Our histograms are in ``histograms_sample.pdf``. Your results will vary
because of different random choices, and possibly different extraction
processing.

Estimate vector probabilities given match or unmatch
-------------------------------------------------------------

Observe that the probability of a match does not increase monotonically
with the similarity for a field; rather, there are clusters in which this
probability is high. For instance, in our results, for the *address* field,
the probability is lower in the range 0.80-0.95 than just outside this
range. The clusters are even more pronounced for the unmatch pairs.

This implies that a single "cut-off" on the similarity for a field, or even
for a group of fields, will not serve our purpose. So we break up the range
of similarity values into discrete chunks, and determine the probability of
a match for each chunk separately from that of neighboring chunks. Each
chunk (in the 3-dimensional space of name, city, and address similarities)
can be identified by a vector as described below. We have adopted (a
simplified version of) the probabilistic record linkage approach proposed
by Fellegi and Sunter.

Provided in ``utils.py`` is a simple utility function ``get_jw_category()``
that takes a Jaro-Winkler distance and returns an integer category between
0 and 2, essentially breaking the range of the Jaro-Winkler score into
three blocks. We apply this to all three fields: names, cities, and
addresses. Thus for any pair of restaurants, we can create a vector
(name_category, city_category, address_category) that represents the
similarity of the fields for that pair. Whether, during record linkage, the
pair should be classified as a match, an unmatch, or something in between
will depend on certain probabilities associated with the corresponding
vector.

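As a concrete illustration, here is a minimal sketch of how the vector for
a single pair might be computed. It assumes the extracted columns are named
``restaurant``, ``city``, and ``address`` in both dataframes (adjust to
your own column names), and ``pair_vector()`` is a hypothetical helper, not
something supplied with the assignment::

    import jellyfish
    from utils import get_jw_category   # supplied in utils.py

    def pair_vector(z_row, f_row):
        # Similarity vector (name, city, address categories) for one
        # Zagat/Fodor's pair of rows.  str() guards against missing fields.
        return tuple(
            get_jw_category(jellyfish.jaro_winkler(str(z_row[col]),
                                                   str(f_row[col])))
            for col in ('restaurant', 'city', 'address'))

    # For example, the vector for row 0 of zagat paired with row 0 of fodors:
    # v = pair_vector(zagat.loc[0], fodors.loc[0])
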
Some of these vectors are more common in the match condition and others in
the unmatch condition. So we identify the "good" vectors that are more
likely to indicate a match---based on our training data and the level of
error we are willing to tolerate---and mark all pairs that map to a good
vector as a match, and the rest as an unmatch.

.. COMMENTED OUT
   Specifically, we take the following approach. One of the inputs to
   ``find_matches()`` is ``sim_buckets``, an array of values that defines
   how to break your similarity values into buckets. Suppose
   ``sim_buckets = [a, b, c]``; then the buckets and their corresponding
   intervals are as shown in the table below.

   =========== ==================
   Bucket      Similarity values
   =========== ==================
   0           [0, a]
   1           (a, b]
   2           (b, c]
   3           (c, 1.0]
   =========== ==================

   Observe that some ranges are open (endpoint exclusive) and some closed
   (endpoint inclusive). Thus the values 0, a/2, a, b/2, 1.0 would fall
   into buckets 0, 0, 0, 1, and 3 respectively.

We determine whether a vector should be classified as a match, possible
match, or unmatch based on estimates of the probability that a pair results
in that vector given that the pair is a match, and given that the pair is
an unmatch. Formally, assume that :math:`(r_1, r_2)` is a potential pair,
:math:`v(r_1, r_2)` is the vector formed from its field similarities, and
:math:`V` is the set of all possible vectors. For every :math:`w \in V` we
need estimates for two quantities:

.. math::

   \begin{array}{l}
   m(w), \text{ defined as } \mathrm{P}[v(r_1, r_2) = w \ |\ (r_1, r_2) \text{ is a match}], \text{ and} \\
   u(w), \text{ defined as } \mathrm{P}[v(r_1, r_2) = w \ |\ (r_1, r_2) \text{ is an unmatch}].
   \end{array}

Compute estimates for the former by iterating through all the pairs in
``matches``, determining their vectors, and counting the frequency of each
of the 27 possible vectors during this process. The relative frequency for
a vector :math:`w` is the estimate of the probability of :math:`w` given a
matching pair. Similarly, find an estimate for the probability given an
unmatch by iterating through the vectors for all pairs in ``unmatches``.

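A minimal sketch of this counting step follows. Here ``pair_vector()`` is
the hypothetical helper sketched earlier, and ``match_pairs`` and
``unmatch_pairs`` stand for whatever iterables of (Zagat row, Fodor's row)
pairs you build from ``matches`` and ``unmatches``::

    from collections import Counter
    from itertools import product

    def vector_probabilities(vectors):
        # Relative frequency of each of the 27 possible similarity vectors
        # in a list of observed (name, city, address) category tuples, one
        # tuple per pair.  Vectors that never occur get probability 0.
        counts = Counter(vectors)
        total = len(vectors)
        return {w: counts[w] / total for w in product(range(3), repeat=3)}

    # m(w) is estimated from the known matches, u(w) from the random
    # unmatches.
    m = vector_probabilities([pair_vector(z, f) for z, f in match_pairs])
    u = vector_probabilities([pair_vector(z, f) for z, f in unmatch_pairs])
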
Partition vectors into match, possible match, and unmatch sets
----------------------------------------------------------------------

How should we partition the set of vectors :math:`V` into the three
different sets ``match_vectors``, ``possible_vectors``, and
``unmatch_vectors``? It depends on our tolerance for two kinds of errors:

- the **false positive rate**, the probability that we classify an actual
  unmatch as a match, and

- the **false negative rate**, the probability that we classify an actual
  match as an unmatch.

Assume that :math:`\mu` is the maximum false positive rate and
:math:`\lambda` is the maximum false negative rate we are willing to
tolerate, and that, given these are satisfied, we want to maximize the
number of our matches. To partition the set of vectors :math:`V`, we do the
following:

#. Assign all vectors :math:`w` that have :math:`m(w) = 0` and
   :math:`u(w) = 0` to ``possible_vectors``.

#. Order the rest of the vectors such that the vectors with
   :math:`u(w) = 0` come before all others, and the remaining vectors are
   in decreasing (non-increasing) order of :math:`m(w)/u(w)`.

#. Beginning from the first vector in the order above, find the last vector
   :math:`w_1` such that the cumulative sum of the :math:`u(w)` values (for
   :math:`w` from the first vector to :math:`w_1`) is at most :math:`\mu`.
   Add all vectors from the first through :math:`w_1` (inclusive) to
   ``match_vectors``.

#. Beginning from the last vector and working in reverse order, find the
   furthest-from-last vector :math:`w_2` such that the cumulative sum of
   the :math:`m(w)` values is at most :math:`\lambda`. Add all vectors from
   the last through :math:`w_2` (inclusive) to ``unmatch_vectors``.

#. Add any remaining vectors (among the ones that were ordered) to
   ``possible_vectors``.

You should first convince yourself that the above approach makes sense. You
will then realize that **the above only works when** :math:`w_1` **happens
to be ahead of** :math:`w_2`. Work out the version of the algorithm for
when this is not the case, and implement the complete algorithm.

Find matches and count possible matches and unmatches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you have the three sets for given maximum false positive and false
negative rates, simply iterate through every pair in ``zagat`` and
``fodors`` to determine the counts of matches, possible matches, and
unmatches.

Block on name, city, or address
-------------------------------------------------------------------

Iterating through all potential pairs is computationally prohibitive for
large datasets, so the number of potential pairs is often reduced by
applying a constraint, such as only considering pairs that have the same
value for a *blocking key*. Implement a version of the record linkage
algorithm in which either name, city, or address is supplied as a blocking
key.

Task checklist
-------------------------------------------------

Write Python code, with appropriate auxiliary functions and documentation,
that implements the following function::

    find_matches(mu, lambda_, outfile, block_on)

which takes ``mu``, the maximum false positive rate, ``lambda_``, the
maximum false negative rate, ``outfile``, the file name in which to store
the matches, and ``block_on`` to indicate the blocking key to use, if any.
It should create the file ``histograms.pdf``, store all the matches it
finds in ``outfile`` in csv format (Zagat name and address followed by
Fodor's name and address), and return a 3-tuple containing the number of
matches, possible matches, and unmatches it found.

In particular, ensure that you:

#. Create ``zagat``, ``fodors``, ``matches``, and ``unmatches``.

#. Plot histograms of name, city, and address similarity for pairs from
   ``matches`` and ``unmatches``.

#. Determine relative frequencies for the 27 vectors corresponding to pairs
   from ``matches`` and ``unmatches``.

#. Order the vectors and create the sets ``match_vectors``,
   ``possible_vectors``, and ``unmatch_vectors``.

#. Count the number of matches, possible matches, and unmatches for all
   potential pairs from ``zagat`` and ``fodors``, possibly after applying a
   blocking key.

#. Store the matches in a csv file.

Getting started
---------------

Follow the start-up instructions for working alone if you plan to work
alone. Follow the start-up instructions for existing pairs if you are going
to work with the same partner as you had for PA #2 or PA #3. Follow the
start-up instructions for new pairs if you are going to work in a new pair.
Once you follow these instructions, your repository will contain a
directory named ``pa4``.

Submission
----------

Follow the submission instructions for working alone if you are working
alone. Follow the submission instructions for pairs if you are working in a
pair.