==========================================
Data Cleaning and Record Linkage
==========================================

.. COMMENT Clarifications that should be incorporated:
   The exact format/columns of the found matches file that is to be written.
   The exact names of columns to use in the ``block_on`` option.

**Due: Thursday, Feb 16th at 5pm**

You may work alone or in a pair on this assignment.

In this assignment you will take a pair of datasets containing restaurant
names and addresses, clean them, and *link them*, i.e., find records in the
two datasets that refer to the same restaurant. This assignment will give
you extensive experience in data cleaning and munging with pandas, as well
as experience in using matplotlib and in probabilistic record linkage.

Here is an overview of the various tasks you'll perform. Some of these will
make more sense after you have read the full description.

#. The restaurant datasets are lists of names and addresses provided by the
   restaurant review companies Zagat and Fodor's, and are in the files
   ``zagat.txt`` and ``fodors.txt``. Your first task is to create a pandas
   dataframe for each, consisting of four columns: (i) the original full
   name and address string, and extracted columns consisting of (ii) the
   restaurant name, (iii) the city, and (iv) the address (except for the
   city). Call the dataframes ``zagat`` and ``fodors``.

#. For record linkage you will use a *training dataset* of known links,
   available in the file ``known_links.txt``. Your next task is to read this
   dataset into a pandas dataframe and find the correspondence between the
   pairs in the known links and the rows in ``zagat`` and ``fodors``. Call a
   pair from the known links a *match*, and the dataframe ``matches``.

#. You will also need a dataset of pairs that are not matches. We will call
   such a pair an *unmatch*. You will create this by choosing random pairs
   from ``zagat`` and ``fodors``. Call this the ``unmatches`` dataframe.
   (The terms *match* and *unmatch* follow the convention in record
   linkage.)

#. You will use the Jaro-Winkler distance to measure the similarity between
   a field in a pair of records (one each from ``zagat`` and ``fodors``).
   Your fourth task will be to explore the distribution of the Jaro-Winkler
   scores when a pair is a match and compare it to when the pair is an
   unmatch. (This step is not strictly required for record linkage, but it
   will help clarify the rationale behind the approach we adopt.) For this
   you will plot histograms of the distance values for six cases, and
   create a pdf file containing all your histograms.

#. To link records, you will use (a simplified version of) the
   probabilistic record linkage approach. In it you will represent the
   distances between the fields of a pair of restaurants as a vector of
   three integers. There are 27 possible such vectors. Your next task is to
   find the probability of each vector given that the pair it came from is
   a match, and given that it is an unmatch. Here you will use the
   ``matches`` and ``unmatches`` dataframes.

#. Next, you will order the vectors and partition them into three sets:
   ``match_vectors``, ``possible_vectors``, and ``unmatch_vectors``. If a
   *test* pair maps to a vector in the first set, we classify it as a
   match; if it maps to a vector in the second set, we classify it as a
   *possible match*; else we classify it as an unmatch. (Possible matches
   need to be verified manually.)

#. It will then be time to find matches. You will iterate through all
   potential pairs, for each pair find the corresponding vector and the set
   it belongs to, and, based on that set, mark the pair as a match,
   possible match, or unmatch. You will print the number of each type you
   find, and also store all matches in a csv file.

#. Iterating through all possible pairs is time consuming in general. In
   the final step, you will repeat the previous step but only consider
   potential pairs that also have the same value for either name, city, or
   address.

Create restaurant dataframes
------------------------------------

The restaurant datasets are lists of names and addresses provided by the
restaurant review companies Zagat and Fodor's, and are in the files
``zagat.txt`` and ``fodors.txt``. Your first task is to create a pandas
dataframe for each, consisting of four columns: (i) the original full name
and address string, (ii) the restaurant name, (iii) the city, and (iv) the
address (except for the city). You will extract the latter three columns
from the first, and use them for record linkage later on.

Extracting new fields from text
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pandas provides the function ``extract()``, which takes a regular
expression and creates new fields from text data. Assume, e.g., that the
original name and address strings from Zagat are in a column named
``nameaddress`` in ``zagat``. Then the call::

    zagat['nameaddress'].str.extract(r'^([^\d]*)(\d.*)$', expand=True)

returns a dataframe with two columns---each corresponding to a group in the
regular expression---the first containing the longest prefix of the
``nameaddress`` string that contains no digits, and the second containing
the rest of the string. The columns of the returned dataframe can be
appended to the ``zagat`` dataframe using the pandas function ``concat()``.

Use (possibly multiple calls to) ``extract()`` to create the name, city,
and address columns; a short sketch of this staged extraction appears after
the list of tips below. We found the following useful in this data munging
process:

- Build the regular expression(s) you need by iterative refinement---look
  at some of the data, try out an initial regular expression, observe where
  it works and where it fails, and improve upon it. If the regular
  expression doesn't match a given string, ``extract()`` returns null
  values. So if you want to look at the strings where your regular
  expression failed, use, e.g., ``zagat.loc[df.iloc[:, 0].isnull()]``, in
  which ``df`` is the name of the dataframe returned by ``extract()``.

- Even so, you may not find a single regular expression that extracts all
  three columns. It may be easier to extract the columns in stages, using
  separate calls to ``extract()``. Further, there may be special cases that
  are unlike all other data and are best handled by hard-coding their
  values. In the approximately 850 rows in the combined data, we hard-coded
  the extraction for about 20 restaurants. You should not hard-code more
  than 50 cases.

- Addresses often start with a street number, but not always, and sometimes
  the restaurant name itself contains a number as well.

- City names often begin after the name of a road, i.e., after a substring
  ending with "Blvd.", "Ave.", "PCH", or "Broadway". It is simple to build
  a complete list of such suffixes by iterative refinement.

- We recommend using a Jupyter notebook, at least for the iterative
  refinement process.

- You should be comfortable with the different indexing and selection
  mechanisms in pandas.

- It is best not to name the restaurant name column ``name``, because
  pandas dataframes have a predefined attribute called ``name``, and you
  get unexpected results if you use the ``dataframename.columnname`` style
  of referring to a column.

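Here is a minimal sketch of such a staged extraction. It is not the
solution: it assumes the raw strings are already in a ``nameaddress``
column, the column names ``restaurant``, ``rest``, ``address``, and
``city`` are just illustrative choices, and the second regular expression
is a deliberately crude starting point that you will need to refine::

    import pandas as pd

    # Two made-up strings in the style of the Zagat data, only to make the
    # sketch self-contained; in your code, zagat already has a nameaddress
    # column read from zagat.txt.
    zagat = pd.DataFrame({'nameaddress': [
        "Mel's Drive-In 1650 Sunset Blvd. Hollywood",
        "Noodle House 325 Ocean Ave. Santa Monica"]})

    # Stage 1: split each string into a digit-free prefix (candidate name)
    # and the remainder (street address followed by the city).
    parts = zagat['nameaddress'].str.extract(r'^([^\d]*)(\d.*)$', expand=True)
    parts.columns = ['restaurant', 'rest']

    # Stage 2: peel the city off the end of the remainder.  The list of
    # road suffixes is incomplete; grow it by iterative refinement.
    addr_city = parts['rest'].str.extract(
        r'^(.*(?:Blvd\.|Ave\.|St\.|PCH|Broadway))\s*(.*)$', expand=True)
    addr_city.columns = ['address', 'city']

    # Stitch the extracted columns onto the original dataframe; rows where
    # an extraction failed show up as nulls and still need attention.
    zagat = pd.concat([zagat, parts[['restaurant']], addr_city], axis=1)
    failed = zagat[zagat['address'].isnull()]
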
.. comment
   **Sample results**. The first 100 rows we extracted from the Zagat
   dataset are in the file ``zagat_extracted.csv``. Please use it only to
   get a sense of what is expected; our results may have errors. If your
   code performs better, please do not change it to match our results.

Create the matches dataframe
---------------------------------------------

For record linkage, you will use a *training dataset* of known links,
available in the file ``known_links.txt``. It contains 50 pairs of
restaurant names and addresses that are known to refer to the same
restaurant. The first in each pair is from Zagat and the second from
Fodor's.

Your next data munging task is to read this dataset into a pandas dataframe
called ``matches``. You will find ``read_csv()`` unsuitable for this due to
the irregular formatting in ``known_links.txt``. For example, sometimes a
restaurant string fits on a single line, and sometimes it is split across
two lines. We recommend writing custom Python code to read the dataset. You
will also find that not all restaurant strings are exactly equal to the
corresponding ones in the Zagat and Fodor's datasets, and you will have to
correct those. This correspondence will be useful during record linkage.

Create the unmatches dataframe
--------------------------------------------------

It turns out that for record linkage you also need a dataset of unmatches.
For this, choose 1000 random restaurants from ``zagat`` and pair them with
1000 random restaurants from ``fodors``. There are several ways to make
random choices: Python's ``random`` library, NumPy's functions for creating
random arrays, or pandas' own ``sample()`` function for sampling a
dataframe. Technically the random pairs may include some pairs that
actually match, but the overwhelming majority will be unmatches, and we'll
ignore this minor source of error.

Why are we using 1000 random pairs for unmatches, when we only have 50
pairs for matches? Because we can, and the more pairs we have, the better
the accuracy of our record linkage. Of course, having more matches would
have been helpful as well, but that requires expensive manual work.

**Caution**. The random choices you make will influence the rest of your
results. This makes it hard to debug a program if the bug only shows up for
some random choices. So during the development and debugging process, it is
useful to seed the random generator with a fixed number. E.g., when using
pandas, the call::

    zagat.sample(1000, replace=True, random_state=1234)

always returns the same 1000 rows because the random state is fixed through
the seed 1234. Once your code is bug-free, you should remove the
``random_state`` option.

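If you use pandas for the sampling, the construction of ``unmatches`` can
be as short as the following sketch. It assumes ``zagat`` and ``fodors``
already exist; the ``z_``/``f_`` column prefixes are just one convenient
(and entirely optional) way to keep the two sides of each pair apart::

    import pandas as pd

    # Sample 1000 rows from each dataframe.  The fixed random_state values
    # are for debugging only and should be dropped from the final version.
    zs = zagat.sample(1000, replace=True, random_state=1234)
    fs = fodors.sample(1000, replace=True, random_state=5678)

    # Pair them positionally: row i of zs together with row i of fs forms
    # one (presumed) unmatch.
    unmatches = pd.concat(
        [zs.reset_index(drop=True).add_prefix('z_'),
         fs.reset_index(drop=True).add_prefix('f_')],
        axis=1)
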
Explore Jaro-Winkler scores, plot histograms
-------------------------------------------------

To link records we take a potential pair, one each from ``zagat`` and
``fodors``, and decide if their strings are similar enough. Our results are
better when the strings are broken up into natural components, which you
have already done: name, city, and address.

When the components are strings, a natural measure of similarity (or
rather, of the distance between two strings) is the *edit distance*, also
known as the Levenshtein distance. It is defined as the minimum number of
single-character insertions, deletions, and substitutions required to
convert one string to another. In practice, however, the Levenshtein
distance is computationally intensive, and a related measure, called the
Jaro-Winkler distance, is used instead. The exact definition of the
Jaro-Winkler distance is somewhat technical (but easy to find online).
Also, although it is called a distance, it actually measures the similarity
between two strings: a complete match ("Medici" and "Medici") has a score
of 1, and a complete mismatch ("Medici" and "Subway") has a score of 0.

You will use the Python library ``jellyfish`` to compute the Jaro-Winkler
distance between two strings. Install it using the standard::

    sudo pip3 install -U jellyfish

To find the Jaro-Winkler distance use, e.g.::

    import jellyfish

    score = jellyfish.jaro_winkler("medici", "noodles etc.")

Your next task is to explore the distribution of the Jaro-Winkler distance
between names, cities, and addresses of pairs of restaurants, both when the
pairs are matches and when they are unmatches. Your specific task here is
to create a pdf file named ``histograms.pdf`` containing six histograms,
labelled appropriately. The first pair of histograms is for the
Jaro-Winkler distances between restaurant names, one histogram each for
matches and unmatches. The second and third pairs are similar, except for
cities and addresses. Here you will find useful the pandas function
``plot.hist()`` and the ``matplotlib.pyplot`` functions ``figure()``,
``subplot()``, ``xlabel()``, ``title()``, and ``savefig()``; see, e.g., the
matplotlib tutorial on working with multiple figures.

Our histograms are in ``histograms_sample.pdf``. Your results will vary
because of different random choices, and possibly different extraction
processing.

Estimate vector probabilities given match or unmatch
-------------------------------------------------------------

Observe that the probability of a match does not increase monotonically
with the similarity for a field; rather, there are clusters in which this
probability is high. For instance, in our results, for the *address* field,
the probability is lower in the range 0.80-0.95 than just outside this
range. The clusters are even more pronounced for the unmatch pairs.

This implies that a single "cut-off" on the similarity for a field, or even
for a group of fields, will not serve our purpose. So we break up the range
of similarity values into discrete chunks, and determine the probability of
a match for each chunk separately from that of neighboring chunks. Each
chunk (in the 3-dimensional space of name, city, and address similarities)
can be identified by a vector as described below. We have adopted (a
simplified version of) the probabilistic record linkage approach proposed
by Fellegi and Sunter.

Provided in ``utils.py`` is a simple utility function ``get_jw_category()``
that takes a Jaro-Winkler distance and returns an integer category between
0 and 2, essentially breaking the range of the Jaro-Winkler score into
three blocks. We apply this to all three fields: names, cities, and
addresses. Thus for any pair of restaurants, we can create a vector
(name_category, city_category, address_category) that represents the
similarity of the fields for that pair. Whether, during record linkage, the
pair should be classified as a match, an unmatch, or something in between
will depend on certain probabilities associated with the corresponding
vector.

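As a concrete illustration, here is a minimal sketch of how the vector for
a single pair might be computed. It assumes the extracted columns are named
``restaurant``, ``city``, and ``address`` in both dataframes (adjust to
your own column names), and ``pair_vector()`` is a hypothetical helper, not
something supplied with the assignment::

    import jellyfish
    from utils import get_jw_category   # supplied in utils.py

    def pair_vector(z_row, f_row):
        # Similarity vector (name, city, address categories) for one
        # Zagat/Fodor's pair of rows.  str() guards against missing fields.
        return tuple(
            get_jw_category(jellyfish.jaro_winkler(str(z_row[col]),
                                                   str(f_row[col])))
            for col in ('restaurant', 'city', 'address'))

    # For example, the vector for row 0 of zagat paired with row 0 of fodors:
    # v = pair_vector(zagat.loc[0], fodors.loc[0])
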
Some of these vectors are more common in the match condition and others in
the unmatch condition. So we identify the "good" vectors that are more
likely to indicate a match---based on our training data and the level of
error we are willing to tolerate---and mark all pairs that map to a good
vector as a match, and the rest as an unmatch.

.. COMMENTED OUT
   Specifically, we take the following approach. One of the inputs to
   ``find_matches()`` is ``sim_buckets``, an array of values that defines
   how to break your similarity values into buckets. Suppose
   ``sim_buckets = [a, b, c]``; then the buckets and their corresponding
   intervals are as shown in the table below.

   =========== ==================
   Bucket      Similarity values
   =========== ==================
   0           [0, a]
   1           (a, b]
   2           (b, c]
   3           (c, 1.0]
   =========== ==================

   Observe that some ranges are open (endpoint exclusive) and some closed
   (endpoint inclusive). Thus the values 0, a/2, a, b/2, 1.0 would fall
   into buckets 0, 0, 0, 1, and 3 respectively.

We determine whether a vector should be classified as a match, possible
match, or unmatch based on estimates of the probability that a pair results
in that vector given that the pair is a match, and given that the pair is
an unmatch. Formally, assume that :math:`(r_1, r_2)` is a potential pair,
:math:`v(r_1, r_2)` is the vector formed from its field similarities, and
:math:`V` is the set of all possible vectors. For every :math:`w \in V` we
need estimates for two quantities:

.. math::

   \begin{array}{l}
   m(w), \text{ defined as } \mathrm{P}[v(r_1, r_2) = w \ |\ (r_1, r_2) \text{ is a match}], \text{ and} \\
   u(w), \text{ defined as } \mathrm{P}[v(r_1, r_2) = w \ |\ (r_1, r_2) \text{ is an unmatch}].
   \end{array}

Compute estimates for the former by iterating through all the pairs in
``matches``, determining their vectors, and counting the frequency of each
of the 27 possible vectors during this process. The relative frequency for
a vector :math:`w` is the estimate of the probability of :math:`w` given a
matching pair. Similarly, find an estimate for the probability given an
unmatch by iterating through the vectors for all pairs in ``unmatches``.

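A minimal sketch of this counting step follows. Here ``pair_vector()`` is
the hypothetical helper sketched earlier, and ``match_pairs`` and
``unmatch_pairs`` stand for whatever iterables of (Zagat row, Fodor's row)
pairs you build from ``matches`` and ``unmatches``::

    from collections import Counter
    from itertools import product

    def vector_probabilities(vectors):
        # Relative frequency of each of the 27 possible similarity vectors
        # in a list of observed (name, city, address) category tuples, one
        # tuple per pair.  Vectors that never occur get probability 0.
        counts = Counter(vectors)
        total = len(vectors)
        return {w: counts[w] / total for w in product(range(3), repeat=3)}

    # m(w) is estimated from the known matches, u(w) from the random
    # unmatches.
    m = vector_probabilities([pair_vector(z, f) for z, f in match_pairs])
    u = vector_probabilities([pair_vector(z, f) for z, f in unmatch_pairs])
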
Partition vectors into match, possible match, and unmatch sets
----------------------------------------------------------------------

How should we partition the set of vectors :math:`V` into the three
different sets ``match_vectors``, ``possible_vectors``, and
``unmatch_vectors``? It depends on our tolerance for two kinds of errors:

- the **false positive rate**, the probability that we classify an actual
  unmatch as a match, and

- the **false negative rate**, the probability that we classify an actual
  match as an unmatch.

Assume that :math:`\mu` is the maximum false positive rate and
:math:`\lambda` is the maximum false negative rate we are willing to
tolerate, and that, given these are satisfied, we want to maximize the
number of our matches. To partition the set of vectors :math:`V`, we do the
following:

#. Assign all vectors :math:`w` that have :math:`m(w) = 0` and
   :math:`u(w) = 0` to ``possible_vectors``.

#. Order the rest of the vectors such that the vectors with
   :math:`u(w) = 0` come before all others, and the remaining vectors are
   in decreasing (non-increasing) order of :math:`m(w)/u(w)`.

#. Beginning from the first vector in the order above, find the last vector
   :math:`w_1` such that the cumulative sum of the :math:`u(w)` values (for
   :math:`w` from the first vector to :math:`w_1`) is at most :math:`\mu`.
   Add all vectors from the first through :math:`w_1` (inclusive) to
   ``match_vectors``.

#. Beginning from the last vector and working in reverse order, find the
   furthest-from-last vector :math:`w_2` such that the cumulative sum of
   the :math:`m(w)` values is at most :math:`\lambda`. Add all vectors from
   the last through :math:`w_2` (inclusive) to ``unmatch_vectors``.

#. Add any remaining vectors (among the ones that were ordered) to
   ``possible_vectors``.

You should first convince yourself that the above approach makes sense. You
will then realize that **the above only works when** :math:`w_1` **happens
to be ahead of** :math:`w_2`. Work out the version of the algorithm for
when this is not the case, and implement the complete algorithm.

Find matches and count possible matches and unmatches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you have the three sets for given maximum false positive and false
negative rates, simply iterate through every pair in ``zagat`` and
``fodors`` to determine the counts of matches, possible matches, and
unmatches.

Block on name, city, or address
-------------------------------------------------------------------

Iterating through all potential pairs is computationally prohibitive for
large datasets, so the number of potential pairs is often reduced by
applying a constraint, such as only considering pairs that have the same
value for a *blocking key*. Implement a version of the record linkage
algorithm in which either name, city, or address is supplied as a blocking
key.

Task checklist
-------------------------------------------------

Write Python code, with appropriate auxiliary functions and documentation,
that implements the following function::

    find_matches(mu, lambda_, outfile, block_on)

which takes ``mu``, the maximum false positive rate, ``lambda_``, the
maximum false negative rate, ``outfile``, the file name in which to store
the matches, and ``block_on`` to indicate the blocking key to use, if any.
It should create the file ``histograms.pdf``, store all the matches it
finds in ``outfile`` in csv format (Zagat name and address followed by
Fodor's name and address), and return a 3-tuple containing the number of
matches, possible matches, and unmatches it found.

In particular, ensure that you:

#. Create ``zagat``, ``fodors``, ``matches``, and ``unmatches``.

#. Plot histograms of name, city, and address similarity for pairs from
   ``matches`` and ``unmatches``.

#. Determine relative frequencies for the 27 vectors corresponding to pairs
   from ``matches`` and ``unmatches``.

#. Order the vectors and create the sets ``match_vectors``,
   ``possible_vectors``, and ``unmatch_vectors``.

#. Count the number of matches, possible matches, and unmatches for all
   potential pairs from ``zagat`` and ``fodors``, possibly after applying a
   blocking key.

#. Store the matches in a csv file.

Getting started
---------------

Follow the start-up instructions for working alone if you plan to work
alone. Follow the start-up instructions for existing pairs if you are going
to work with the same partner as you had for PA #2 or PA #3. Follow the
start-up instructions for new pairs if you are going to work in a new pair.
Once you follow these instructions, your repository will contain a
directory named ``pa4``.

Submission
----------

Follow the submission instructions for working alone if you are working
alone. Follow the submission instructions for pairs if you are working in a
pair.