======
Pandas
======

Pandas is a Python library for working with data. Pandas is a large library with many useful features. In this lab, we will focus on concepts that will be useful for PA #6.

Getting started
---------------

Navigate (``cd``) to your |repo_name| directory. Run ``git pull upstream master`` to collect the lab materials and ``git pull`` to sync with your personal repository.

We will be using data from the `Current Population Survey <https://www.census.gov/programs-surveys/cps.html>`__ in this lab. In your lab directory, please run the following command on the Linux command line to get the data::

   $ ./get_files.sh

The ``lab7`` directory contains a file named ``lab7.py``, where you will write your code for this lab. We have included code to import pandas as ``pd``.

Fire up ``ipython3`` and use it to experiment with Pandas in each of the tasks in this lab. Once you're sure you've arrived at the right Pandas code for a given task, add it to ``lab7.py``. You can then run this file to see the result of all the tasks. Please note that ``lab7.py`` will fail to run until you've implemented Task 1 below.

Reading the data
----------------

As mentioned, we will be using a sample dataset from the Current Population Survey for this assignment. The file ``morg_d07_strings.csv`` contains a modified version of the 2007 MORG data, which we downloaded from the Census Bureau's website. The file is in `comma-separated value (CSV) <https://en.wikipedia.org/wiki/Comma-separated_values>`__ format. It can be understood to represent a table with multiple rows and columns (in fact, the CSV format is supported by most spreadsheet programs, and you can try opening the file in Excel, LibreOffice Calc, etc.). The first line of the file is the *header* of the file. It contains the names of the columns, separated by commas.
After the header, each line in the file represents a *row* in the table, with each value in the row (corresponding to the columns specified in the header) separated by a comma. A common way to refer to a value in a row is as a *field*. So, if a CSV file has an ``age`` column, in an individual row we would refer to the ``age`` field (instead of column, which tends to refer to an entire column of values).

Each row in ``morg_d07_strings.csv`` corresponds to the survey data obtained from a unique individual. We consider the following variables for each individual in this assignment (although there are a lot more variables available in the MORG datasets):

- ``h_id``: a string that serves as a unique identifier, which we created by concatenating several variables in the original MORG datasets.
- ``age``: an integer value specifying the age of the individual.
- ``gender``: the gender (or sex) recorded for the individual.
- ``race``: the race recorded for the individual.
- ``ethnicity``: the ethnicity recorded for the individual.
- ``employment_status``: the employment status recorded for the individual.
- ``hours_worked_per_week``: an integer that specifies the usual weekly work hours of the individual.
- ``earnings_per_week``: a float that indicates the weekly earnings of the individual.

The CSV file has a column for each of these variables. Here are the first few lines of the file::

   h_id,age,gender,race,ethnicity,employment_status,hours_worked_per_week,earnings_per_week
   1_1_1,32,Female,BlackOnly,Non-Hispanic,Working,40.0,1250.0
   1_2_2,80,Female,WhiteOnly,Non-Hispanic,Others2,,
   1_3_3,20,Female,BlackOnly,Non-Hispanic,Others2,,
   1_4_4,28,Male,WhiteOnly,Non-Hispanic,Working,40.0,1100.0
   1_5_5,32,Male,WhiteOnly,Non-Hispanic,Working,52.0,1289.23

The original data uses codes for the different string variables. We replaced the codes with strings to simplify the process of using the data.
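To see how the header and rows described above map onto a dataframe, here is a minimal sketch (using a tiny made-up CSV held in a string, not the lab data):

```python
# Sketch: a small CSV with a header row becomes a dataframe with named
# columns. The text below mimics the structure of morg_d07_strings.csv
# with made-up values.
import io

import pandas as pd

csv_text = """h_id,age,gender
1_1_1,32,Female
1_2_2,80,Female
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())  # header row -> column names
print(len(df))              # each remaining line -> one row
print(df.loc[0, "age"])     # one field in one row
```

The same call works on a filename instead of a ``StringIO`` object, which is how you will read the actual data file.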
**Task 1**: Use ``pd.read_csv`` to read the sample data into a pandas dataframe and save the result in a variable named ``morg_df``. Use ``h_id``, which uniquely identifies each row, as the row index.

You may find it helpful to set the maximum number of rows and columns you see in your IPython session. You can do so using `pandas' options <https://pandas.pydata.org/pandas-docs/stable/options.html>`__::

   In [X]: pd.options.display.max_columns = 15

   In [X]: pd.options.display.max_rows = 20

Basic indexing
--------------

If you are unsure of how to perform the following tasks, make sure to review the section on `DataFrames <https://classes.cs.uchicago.edu/archive/2018/fall/12100-1/textbook/html/working_with_data/pandas/index.html#dataframes>`__ in the course textbook.

**Task 2**: Extract the ``"age"`` column from ``morg_df``.

**Task 3**: Extract the row that corresponds to ``h_id`` 1_2_2 from ``morg_df``.

**Task 4**: Use slicing to extract the first four rows of ``morg_df``.

Missing values
--------------

The ``read_csv`` function inserts ``NaN``, which stands for "Not a Number", when a field is missing. It often makes sense to replace these values with something more appropriate for the application. Before we talk about how to do that, let's figure out which columns in our dataframe have missing values.

We can answer this question using the ``isna`` method in combination with the ``any`` function. For example::

   any(morg_df["hours_worked_per_week"].isna())

will yield ``True``, whereas::

   any(morg_df["age"].isna())

yields ``False``.

**Task 5**: Write a loop that identifies the columns that have at least one missing value and constructs a dictionary that maps these column names to zero.

**Task 6**: Use the ``fillna`` method to replace the missing values in the columns you identified in the previous task with zero.
Hint: take a careful look at the `documentation <https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html>`__ and you'll notice that the dictionary you constructed in the previous task will be useful for this task.

Categorical values
------------------

Categoricals are a pandas data type used to represent variables that can take on only a limited, and usually fixed, number of possible values. We can replace a column of strings with categorical values using the ``.astype`` method and assignment::

   morg_df["gender"] = morg_df["gender"].astype("category")

**Task 7**: Convert the four string columns (ethnicity, gender, race, and employment status) to categoricals.

Converting floats to bins
-------------------------

On occasion, it is useful to group floating point values, such as ages or salaries, into bins. Pandas provides a function `cut <http://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html#pandas.cut>`__ that serves this purpose. Let's look at some code for adding an "age bin" column to the MORG dataframe. But first, let's look at the range of ages seen in our data::

   In [19]: morg_df["age"].min()
   Out[19]: 16

   In [20]: morg_df["age"].max()
   Out[20]: 85

We'd like to group the data into 9 bins. For this lab, we'll use ``range(16, 89, 8)`` for this purpose. This approach is not ideal, since the last bin includes four years that do not correspond to the ages of any of the people in our dataset. The bins need to be labeled. For simplicity, we'll label the bins with integers. Keep in mind that we need :math:`N+1` boundaries for :math:`N` bins (and :math:`N` labels).
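As a quick check of the boundary/label arithmetic above, here is a small sketch (plain Python, no lab data needed):

```python
# Sketch: the bin boundaries and labels described above.
# range(16, 89, 8) yields 10 boundary values, which define 9 bins,
# so we need 9 labels (here, the integers 0 through 8).
boundaries = list(range(16, 89, 8))
labels = list(range(len(boundaries) - 1))

print(boundaries)       # [16, 24, 32, 40, 48, 56, 64, 72, 80, 88]
print(len(boundaries))  # 10 boundaries ...
print(len(labels))      # ... for 9 bins and 9 labels
```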
Once we have defined the bin boundaries and the labels, we can use ``cut`` to compute a new column where the *i*-th entry corresponds to the appropriate bin for the age of the person in the *i*-th row of ``morg_df``::

   boundaries = range(16, 89, 8)
   morg_df["age_bin"] = pd.cut(morg_df["age"],
                               bins=boundaries,
                               labels=range(len(boundaries)-1),
                               include_lowest=True,
                               right=False)

We use ``include_lowest=True`` because we have at least one person who is 16 and we want to include their age in a bin. We use ``right=False`` because we do not want to include 88 in our last bin.

**Task 8**: Add a column (``hwpw_bin``) to ``morg_df`` for hours worked per week that assigns a person's hours worked per week to one of ten buckets, evenly spaced from the fewest number of hours worked (0) to the most (99), inclusive.

Once you have finished this task, take a look at the types of the columns using the ``dtypes`` property. You should see something like the following::

   In [12]: morg_df.dtypes
   Out[12]:
   age                         int64
   gender                   category
   race                     category
   ethnicity                category
   employment_status        category
   hours_worked_per_week     float64
   earnings_per_week         float64
   age_bin                  category
   hwpw_bin                 category
   dtype: object

Filtering
---------

As we discussed in class, the ability to filter the rows of a dataframe based on a boolean condition is very useful. If you don't remember how to filter, take another look at the section on filtering in the course textbook.

**Task 9**: Use filtering to extract all rows that correspond to a person who works 35 or more hours per week.

**Task 10**: Use filtering to extract the rows that correspond to the people who are not working.

**Task 11**: Use filtering to extract the rows that correspond to people who worked at least 35 hours per week or who earned more than $1000 per week.

Counting
--------

Counting the number of occurrences of different values is a very common operation.
We saw two ways to do this task in class: using ``value_counts`` and ``groupby`` plus ``size``::

   In [18]: es_counts = morg_df["employment_status"].value_counts()

   In [19]: es_counts
   Out[19]:
   Working                       159477
   Others1                        52664
   Others2                        39360
   Unable to work or disabled     15489
   With a job but not at work      8213
   Looking                         7391
   Layoff                          1426
   Name: employment_status, dtype: int64

   In [20]: morg_df.groupby("employment_status").size()
   Out[20]:
   employment_status
   Layoff                          1426
   Looking                         7391
   Others1                        52664
   Others2                        39360
   Unable to work or disabled     15489
   With a job but not at work      8213
   Working                       159477
   dtype: int64

In both cases, the result is a Pandas Series (see the `relevant section of the textbook <https://classes.cs.uchicago.edu/archive/2018/fall/12100-1/textbook/html/working_with_data/pandas/index.html#series>`__).

**Task 12**: Use the value count method to count the number of people recorded for different races/race combinations and then print the five most frequent races (or combinations thereof).

**Task 13**: Use the group by approach to count the number of people in each race/gender combination.

Note that you can get the values sorted by the employment status for either approach using the Series ``sort_index`` method::

   In [31]: morg_df["employment_status"].value_counts().sort_index()
   Out[31]:
   Layoff                          1426
   Looking                         7391
   Others1                        52664
   Others2                        39360
   Unable to work or disabled     15489
   With a job but not at work      8213
   Working                       159477
   Name: employment_status, dtype: int64

Merging two dataframes
----------------------

Combining data from different datasets is a very common task. We'll explain how to do this task in Pandas using two very simple dataframes. The first contains information about students::

   In [41]: students = pd.read_csv("data/students.csv")

   In [42]: students
   Out[42]:
     First Name   Last Name  UCID                     Email        Major
   0        Sam       Spade  1234        spade@uchicago.edu    Sociology
   1      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics
   2   Sherlock      Holmes  2222  bakerstreet@uchicago.edu   Psychology
   3       V.I.  Warshawski  7654          viw@uchicago.edu  Mathematics

The second contains information about grades for a specific course::

   In [63]: grades = pd.read_csv("data/grades.csv")

   In [64]: grades
   Out[64]:
      Course  UCID  Score Grade
   0  CS 121  2789     90    A-
   1  CS 121  1234     65     C
   2  CS 121  7654     85    B+
   3  CS 121  9999    100    A+

Notice that the ``grades`` dataframe does not include the students' names, email addresses, etc. To associate a name with each grade, we need to combine the two dataframes. More specifically, we can produce a new dataframe that contains information from both the ``students`` dataframe and the ``grades`` dataframe using the ``pd.merge`` function. We'll use this function with four arguments: the two dataframes, the name of the column to use to "match up" the rows in the different frames, and the method we want to use to deal with rows from one dataframe that have no mate in the other dataframe (for example, Sherlock Holmes (UCID: 2222) does not appear in the ``grades`` dataframe and the student with UCID 9999 does not appear in the ``students`` dataframe).

Since the ``UCID`` column occurs in both dataframes, we can use it to match up the rows that correspond to specific students in the two dataframes. The column or columns that are used to match up the rows are known as the merge or join key.

If we use the most basic kind of merge (known as an *inner join*), the result will contain three rows: one for each UCID that appears in both the ``students`` dataframe and the ``grades`` dataframe::

   In [79]: pd.merge(students, grades, on="UCID", how="inner")
   Out[79]:
     First Name   Last Name  UCID               Email        Major  Course  Score Grade
   0        Sam       Spade  1234  spade@uchicago.edu    Sociology  CS 121     65     C
   1      Nancy        Drew  2789  ndrew@uchicago.edu  Mathematics  CS 121     90    A-
   2       V.I.  Warshawski  7654    viw@uchicago.edu  Mathematics  CS 121     85    B+

Notice that the columns from the first argument (``students``) are followed by the corresponding columns from the second argument (``grades``) minus ``UCID``, and that each row contains information for the same UCID. Sherlock Holmes does not appear in the result, because there is no row with his ``UCID`` (2222) in the ``grades`` dataframe. Also, notice that UCID 9999, which appears in the ``grades`` dataframe, does not appear in the result, because it has no mate in the ``students`` dataframe.

To include Sherlock Holmes (but not UCID 9999) in the result, we need to switch the merge operation (``how``) from ``"inner"`` to ``"left"``, meaning that we want to do a left merge (aka left outer join)::

   In [80]: pd.merge(students, grades, on="UCID", how="left")
   Out[80]:
     First Name   Last Name  UCID                     Email        Major  Course  Score Grade
   0        Sam       Spade  1234        spade@uchicago.edu    Sociology  CS 121   65.0     C
   1      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics  CS 121   90.0    A-
   2   Sherlock      Holmes  2222  bakerstreet@uchicago.edu   Psychology     NaN    NaN   NaN
   3       V.I.  Warshawski  7654          viw@uchicago.edu  Mathematics  CS 121   85.0    B+

In this version, Sherlock Holmes has the value ``NaN`` for both his score and his grade, because he did not have a grade in the ``grades`` dataframe (and we did a left join).

Correspondingly, if we want to include UCID 9999 in our result, but not Sherlock Holmes, then we can do a ``"right"`` merge (aka right outer join)::

   In [81]: pd.merge(students, grades, on="UCID", how="right")
   Out[81]:
     First Name   Last Name  UCID               Email        Major  Course  Score Grade
   0        Sam       Spade  1234  spade@uchicago.edu    Sociology  CS 121     65     C
   1      Nancy        Drew  2789  ndrew@uchicago.edu  Mathematics  CS 121     90    A-
   2       V.I.  Warshawski  7654    viw@uchicago.edu  Mathematics  CS 121     85    B+
   3        NaN         NaN  9999                 NaN          NaN  CS 121    100    A+

Notice that UCID 9999 has NaNs for all of the ``students`` fields, since it did not appear in the ``students`` dataframe.
Finally, we can include both Sherlock Holmes and UCID 9999 in our result by doing an ``"outer"`` merge (aka outer join)::

   In [82]: pd.merge(students, grades, on="UCID", how="outer")
   Out[82]:
     First Name   Last Name  UCID                     Email        Major  Course  Score Grade
   0        Sam       Spade  1234        spade@uchicago.edu    Sociology  CS 121   65.0     C
   1      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics  CS 121   90.0    A-
   2   Sherlock      Holmes  2222  bakerstreet@uchicago.edu   Psychology     NaN    NaN   NaN
   3       V.I.  Warshawski  7654          viw@uchicago.edu  Mathematics  CS 121   85.0    B+
   4        NaN         NaN  9999                       NaN          NaN  CS 121  100.0    A+

So far, each row in one dataframe matched with at most one row in the other dataframe. Let's look at an example where the second dataframe contains grades for multiple courses and the same UCID occurs more than once. Here's the dataframe::

   In [85]: extended_grades = pd.read_csv("extended_grades.csv")

   In [86]: extended_grades
   Out[86]:
      Course  UCID  Score Grade
   0  CS 121  2789     90    A-
   1  CS 121  1234     65     C
   2  CS 121  7654     85    B+
   3  CS 121  9999    100    A+
   4  CS 122  2789     87    B+
   5  CS 122  1234     97     A
   6  CS 122  8888     76    B-

Notice that UCIDs 2789 and 1234 both occur twice. Here's what we get when we do an inner merge of the ``students`` dataframe and the ``extended_grades`` dataframe::

   In [88]: pd.merge(students, extended_grades, on="UCID", how="inner")
   Out[88]:
     First Name   Last Name  UCID               Email        Major  Course  Score Grade
   0        Sam       Spade  1234  spade@uchicago.edu    Sociology  CS 121     65     C
   1        Sam       Spade  1234  spade@uchicago.edu    Sociology  CS 122     97     A
   2      Nancy        Drew  2789  ndrew@uchicago.edu  Mathematics  CS 121     90    A-
   3      Nancy        Drew  2789  ndrew@uchicago.edu  Mathematics  CS 122     87    B+
   4       V.I.  Warshawski  7654    viw@uchicago.edu  Mathematics  CS 121     85    B+

We get two entries each for Sam Spade and Nancy Drew: one for each course. V.I. Warshawski appears only once, because she has a grade for only one course.
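The duplicate-key behavior above can be checked with a small runnable sketch; here the two frames are built in memory (with only a couple of columns) rather than read from the lab's CSV files:

```python
# Sketch: inner merge where some keys in the right frame repeat.
import pandas as pd

students = pd.DataFrame({
    "UCID": [1234, 2789, 2222, 7654],
    "Last Name": ["Spade", "Drew", "Holmes", "Warshawski"],
})
extended_grades = pd.DataFrame({
    "Course": ["CS 121", "CS 121", "CS 121", "CS 121",
               "CS 122", "CS 122", "CS 122"],
    "UCID": [2789, 1234, 7654, 9999, 2789, 1234, 8888],
    "Grade": ["A-", "C", "B+", "A+", "B+", "A", "B-"],
})

merged = pd.merge(students, extended_grades, on="UCID", how="inner")
# UCIDs 1234 and 2789 each match two grade rows, 7654 matches one, and
# 2222 (Holmes) matches none, so the inner merge yields 5 rows.
print(len(merged))                     # 5
print((merged["UCID"] == 1234).sum())  # 2
```

Each matching pair of rows produces one output row, which is why a student with two grades appears twice.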
As in the previous example, doing a left merge will add an entry for Sherlock Holmes::

   In [89]: pd.merge(students, extended_grades, on="UCID", how="left")
   Out[89]:
     First Name   Last Name  UCID                     Email        Major  Course  Score Grade
   0        Sam       Spade  1234        spade@uchicago.edu    Sociology  CS 121   65.0     C
   1        Sam       Spade  1234        spade@uchicago.edu    Sociology  CS 122   97.0     A
   2      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics  CS 121   90.0    A-
   3      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics  CS 122   87.0    B+
   4   Sherlock      Holmes  2222  bakerstreet@uchicago.edu   Psychology     NaN    NaN   NaN
   5       V.I.  Warshawski  7654          viw@uchicago.edu  Mathematics  CS 121   85.0    B+

Similarly, doing a right merge will add entries for UCIDs 8888 and 9999::

   In [90]: pd.merge(students, extended_grades, on="UCID", how="right")
   Out[90]:
     First Name   Last Name  UCID               Email        Major  Course  Score Grade
   0        Sam       Spade  1234  spade@uchicago.edu    Sociology  CS 121     65     C
   1        Sam       Spade  1234  spade@uchicago.edu    Sociology  CS 122     97     A
   2      Nancy        Drew  2789  ndrew@uchicago.edu  Mathematics  CS 121     90    A-
   3      Nancy        Drew  2789  ndrew@uchicago.edu  Mathematics  CS 122     87    B+
   4       V.I.  Warshawski  7654    viw@uchicago.edu  Mathematics  CS 121     85    B+
   5        NaN         NaN  9999                 NaN          NaN  CS 121    100    A+
   6        NaN         NaN  8888                 NaN          NaN  CS 122     76    B-

Finally, doing an outer merge will include data for Sherlock Holmes and both UCIDs 8888 and 9999, as well as data for all the students who appeared in both dataframes::

   In [91]: pd.merge(students, extended_grades, on="UCID", how="outer")
   Out[91]:
     First Name   Last Name  UCID                     Email        Major  Course  Score Grade
   0        Sam       Spade  1234        spade@uchicago.edu    Sociology  CS 121   65.0     C
   1        Sam       Spade  1234        spade@uchicago.edu    Sociology  CS 122   97.0     A
   2      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics  CS 121   90.0    A-
   3      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics  CS 122   87.0    B+
   4   Sherlock      Holmes  2222  bakerstreet@uchicago.edu   Psychology     NaN    NaN   NaN
   5       V.I.  Warshawski  7654          viw@uchicago.edu  Mathematics  CS 121   85.0    B+
   6        NaN         NaN  9999                       NaN          NaN  CS 121  100.0    A+
   7        NaN         NaN  8888                       NaN          NaN  CS 122   76.0    B-

**Task 14**: Use ``pd.merge`` and other pandas methods to produce a dataframe that contains a count of grades by major::

     Grade        Major  Count
   0     A    Sociology      1
   1    A-  Mathematics      1
   2    B+  Mathematics      2
   3     C    Sociology      1

Hint: you'll need methods you learned in this lab plus a couple (``reset_index`` and ``rename``) that you'll need to figure out from the documentation.

When finished
-------------

When finished with the lab, please check in your work (assuming you are inside the lab directory):

.. code::

   git add lab7.py
   git commit -m "Finished with lab7"
   git push

No, we are not grading your work. We just want to make sure your repository is in a clean state and that you have access to your work both at CSIL and on your VM.