======
Pandas
======

Pandas is a Python library for working with data. Pandas is a large library with many useful features. In this lab, we will focus on concepts that will be useful for PA #6.

Getting started
---------------

Navigate (``cd``) to your |repo_name| directory. Run ``git pull upstream master`` to collect the lab materials and ``git pull`` to sync with your personal repository.

We will be using data from the `Current Population Survey <https://www.census.gov/programs-surveys/cps.html>`__ in this lab. In your lab directory, please run the following command on the Linux command line to get the data::

   $ ./get_files.sh

The ``lab7`` directory contains a file named ``lab7.py``, where you will write your code for this lab. We have included code to import pandas as ``pd``.

Fire up ``ipython3`` and use it to experiment with Pandas in each of the tasks in this lab. Once you're sure you've arrived at the right Pandas code for a given task, add it to ``lab7.py``. You can then run this file to see the result of all the tasks. Please note that ``lab7.py`` will fail to run until you've implemented Task 1 below.

Reading the data
----------------

As mentioned, we will be using a sample dataset from the Current Population Survey for this assignment. The file ``morg_d07_strings.csv`` contains a modified version of the 2007 MORG data, which we downloaded from the Census Bureau's website. The file is in `comma-separated value (CSV) <https://en.wikipedia.org/wiki/Comma-separated_values>`__ format. It can be understood to represent a table with multiple rows and columns (in fact, the CSV format is supported by most spreadsheet programs, and you can try opening the file in Excel, LibreOffice Calc, etc.). The first line of the file is the *header* of the file. It contains the names of the columns, separated by commas.
After the header, each line in the file represents a *row* in the table, with each value in the row (corresponding to the columns specified in the header) separated by a comma. A common way to refer to a value in a row is as a *field*. So, if a CSV file has an ``age`` column, in an individual row we would refer to the ``age`` field (instead of column, which tends to refer to an entire column of values).

Each row in ``morg_d07_strings.csv`` corresponds to the survey data obtained from a unique individual. We consider the following variables for each individual in this assignment (although there are a lot more variables available in the MORG datasets):

- ``h_id``: a string that serves as a unique identifier, which we created by concatenating several variables in the original MORG datasets.
- ``age``: an integer value specifying the age of the individual.
- ``gender``: the gender (or sex) recorded for the individual.
- ``race``: the race recorded for the individual.
- ``ethnicity``: the ethnicity recorded for the individual.
- ``employment_status``: the employment status recorded for the individual.
- ``hours_worked_per_week``: an integer that specifies the usual weekly work hours of the individual.
- ``earnings_per_week``: a float that indicates the weekly earnings of the individual.

The CSV file has a column for each of these variables. Here are the first few lines of the file::

   h_id,age,gender,race,ethnicity,employment_status,hours_worked_per_week,earnings_per_week
   1_1_1,32,Female,BlackOnly,Non-Hispanic,Working,40.0,1250.0
   1_2_2,80,Female,WhiteOnly,Non-Hispanic,Others2,,
   1_3_3,20,Female,BlackOnly,Non-Hispanic,Others2,,
   1_4_4,28,Male,WhiteOnly,Non-Hispanic,Working,40.0,1100.0
   1_5_5,32,Male,WhiteOnly,Non-Hispanic,Working,52.0,1289.23

The original data uses codes for the different string variables. We replaced the codes with strings to simplify the process of using the data.
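To see how the header and rows described above map onto a dataframe, here is a minimal sketch (using a tiny made-up CSV held in a string, not the lab data):

```python
# Sketch: a small CSV with a header row becomes a dataframe with named
# columns. The text below mimics the structure of morg_d07_strings.csv
# with made-up values.
import io

import pandas as pd

csv_text = """h_id,age,gender
1_1_1,32,Female
1_2_2,80,Female
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())  # header row -> column names
print(len(df))              # each remaining line -> one row
print(df.loc[0, "age"])     # one field in one row
```

The same call works on a filename instead of a ``StringIO`` object, which is how you will read the actual data file.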
**Task 1**: Use ``pd.read_csv`` to read the sample data into a pandas dataframe and save the result in a variable named ``morg_df``. Use ``h_id``, which uniquely identifies each row, as the row index.

You may find it helpful to set the maximum number of rows and columns you see in your IPython session. You can do so using `pandas' options <https://pandas.pydata.org/pandas-docs/stable/options.html>`__::

   In [X]: pd.options.display.max_columns = 15

   In [X]: pd.options.display.max_rows = 20

Basic indexing
--------------

If you are unsure of how to perform the following tasks, make sure to review the section on `DataFrames <https://classes.cs.uchicago.edu/archive/2018/fall/12100-1/textbook/html/working_with_data/pandas/index.html#dataframes>`__ in the course textbook.

**Task 2**: Extract the ``"age"`` column from ``morg_df``.

**Task 3**: Extract the row that corresponds to ``h_id`` 1_2_2 from ``morg_df``.

**Task 4**: Use slicing to extract the first four rows of ``morg_df``.

Missing values
--------------

The ``read_csv`` function inserts ``NaN``, which stands for "Not a Number", when a field is missing. It often makes sense to replace these values with something more appropriate for the application. Before we talk about how to do that, let's figure out which columns in our dataframe have missing values.

We can answer this question using the ``isna`` method in combination with the ``any`` function. For example::

   any(morg_df["hours_worked_per_week"].isna())

will yield ``True``, whereas::

   any(morg_df["age"].isna())

yields ``False``.

**Task 5**: Write a loop that identifies the columns that have at least one missing value and constructs a dictionary that maps these column names to zero.

**Task 6**: Use the ``fillna`` method to replace the missing values in the columns you identified in the previous task with zero.
Hint: take a careful look at the `documentation <https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html>`__ and you'll notice that the dictionary you constructed in the previous task will be useful for this task.

Categorical values
------------------

Categoricals are a pandas data type used to represent variables that can take on only a limited, and usually fixed, number of possible values. We can replace a column of strings with categorical values using the ``.astype`` method and assignment::

   morg_df["gender"] = morg_df["gender"].astype("category")

**Task 7**: Convert the four string columns (ethnicity, gender, race, and employment status) to categoricals.

Converting floats to bins
-------------------------

On occasion, it is useful to group floating point values, such as ages or salaries, into bins. Pandas provides a function `cut <http://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html#pandas.cut>`__ that serves this purpose. Let's look at some code for adding an "age bin" column to the MORG dataframe. But first, let's look at the range of ages seen in our data::

   In [19]: morg_df["age"].min()
   Out[19]: 16

   In [20]: morg_df["age"].max()
   Out[20]: 85

We'd like to group the data into 9 bins. For this lab, we'll use ``range(16, 89, 8)`` for this purpose. This approach is not ideal, since the last bin includes four years that do not correspond to the ages of any of the people in our dataset. The bins need to be labeled. For simplicity, we'll label the bins with integers. Keep in mind that we need :math:`N+1` boundaries for :math:`N` bins (and :math:`N` labels).
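As a quick check of the boundary/label arithmetic above, here is a small sketch (plain Python, no lab data needed):

```python
# Sketch: the bin boundaries and labels described above.
# range(16, 89, 8) yields 10 boundary values, which define 9 bins,
# so we need 9 labels (here, the integers 0 through 8).
boundaries = list(range(16, 89, 8))
labels = list(range(len(boundaries) - 1))

print(boundaries)       # [16, 24, 32, 40, 48, 56, 64, 72, 80, 88]
print(len(boundaries))  # 10 boundaries ...
print(len(labels))      # ... for 9 bins and 9 labels
```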
Once we have defined the bin boundaries and the labels, we can use ``cut`` to compute a new column where the *i*-th entry corresponds to the appropriate bin for the age of the person in the *i*-th row of ``morg_df``::

   boundaries = range(16, 89, 8)
   morg_df["age_bin"] = pd.cut(morg_df["age"],
                               bins=boundaries,
                               labels=range(len(boundaries)-1),
                               include_lowest=True,
                               right=False)

We use ``include_lowest=True`` because we have at least one person who is 16 and we want to include their age in a bin. We use ``right=False`` because we do not want to include 88 in our last bin.

**Task 8**: Add a column (``hwpw_bin``) to ``morg_df`` for hours worked per week that assigns a person's hours worked per week to one of ten buckets, evenly spaced from the fewest number of hours worked (0) to the most (99), inclusive.

Once you have finished this task, take a look at the types of the columns using the ``dtypes`` property. You should see something like the following::

   In [12]: morg_df.dtypes
   Out[12]:
   age                         int64
   gender                   category
   race                     category
   ethnicity                category
   employment_status        category
   hours_worked_per_week     float64
   earnings_per_week         float64
   age_bin                  category
   hwpw_bin                 category
   dtype: object

Filtering
---------

As we discussed in class, the ability to filter the rows of a dataframe based on a boolean condition is very useful. If you don't remember how to filter, take another look at the section on filtering in the course textbook.

**Task 9**: Use filtering to extract all rows that correspond to a person who works 35 or more hours per week.

**Task 10**: Use filtering to extract the rows that correspond to the people who are not working.

**Task 11**: Use filtering to extract the rows that correspond to people who worked at least 35 hours per week or who earned more than $1000 per week.

Counting
--------

Counting the number of occurrences of different values is a very common operation.
We saw two ways to do this task in class: using ``value_counts`` and ``groupby`` plus ``size``::

   In [18]: es_counts = morg_df["employment_status"].value_counts()

   In [19]: es_counts
   Out[19]:
   Working                       159477
   Others1                        52664
   Others2                        39360
   Unable to work or disabled     15489
   With a job but not at work      8213
   Looking                         7391
   Layoff                          1426
   Name: employment_status, dtype: int64

   In [20]: morg_df.groupby("employment_status").size()
   Out[20]:
   employment_status
   Layoff                          1426
   Looking                         7391
   Others1                        52664
   Others2                        39360
   Unable to work or disabled     15489
   With a job but not at work      8213
   Working                       159477
   dtype: int64

In both cases, the result is a Pandas Series (see the `relevant section of the textbook <https://classes.cs.uchicago.edu/archive/2018/fall/12100-1/textbook/html/working_with_data/pandas/index.html#series>`__).

**Task 12**: Use the value count method to count the number of people recorded for different races/race combinations and then print the five most frequent races (or combinations thereof).

**Task 13**: Use the group by approach to count the number of people in each race/gender combination.

Note that you can get the values sorted by the employment status for either approach using the Series ``sort_index`` method::

   In [31]: morg_df["employment_status"].value_counts().sort_index()
   Out[31]:
   Layoff                          1426
   Looking                         7391
   Others1                        52664
   Others2                        39360
   Unable to work or disabled     15489
   With a job but not at work      8213
   Working                       159477
   Name: employment_status, dtype: int64

Merging two dataframes
----------------------

Combining data from different datasets is a very common task. We'll explain how to do this task in Pandas using two very simple dataframes. The first contains information about students::

   In [41]: students = pd.read_csv("data/students.csv")

   In [42]: students
   Out[42]:
     First Name   Last Name  UCID                     Email        Major
   0        Sam       Spade  1234        spade@uchicago.edu    Sociology
   1      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics
   2   Sherlock      Holmes  2222  bakerstreet@uchicago.edu   Psychology
   3       V.I.  Warshawski  7654          viw@uchicago.edu  Mathematics

The second contains information about grades for a specific course::

   In [63]: grades = pd.read_csv("data/grades.csv")

   In [64]: grades
   Out[64]:
      Course  UCID  Score Grade
   0  CS 121  2789     90    A-
   1  CS 121  1234     65     C
   2  CS 121  7654     85    B+
   3  CS 121  9999    100    A+

Notice that the ``grades`` dataframe does not include the students' names, email addresses, etc. To associate a name with each grade, we need to combine the two dataframes. More specifically, we can produce a new dataframe that contains information from both the ``students`` dataframe and the ``grades`` dataframe using the ``pd.merge`` function. We'll use this function with four arguments: the two dataframes, the name of the column to use to "match up" the rows in the different frames, and the method we want to use to deal with rows from one dataframe that have no mate in the other dataframe (for example, Sherlock Holmes (UCID: 2222) does not appear in the ``grades`` dataframe and the student with UCID 9999 does not appear in the ``students`` dataframe).

Since the ``UCID`` column occurs in both dataframes, we can use it to match up the rows that correspond to specific students in the two dataframes. The column or columns that are used to match up the rows are known as the merge or join key.

If we use the most basic kind of merge (known as an *inner join*), the result will contain three rows: one for each UCID that appears in both the ``students`` dataframe and the ``grades`` dataframe::

   In [79]: pd.merge(students, grades, on="UCID", how="inner")
   Out[79]:
     First Name   Last Name  UCID               Email        Major  Course  Score Grade
   0        Sam       Spade  1234  spade@uchicago.edu    Sociology  CS 121     65     C
   1      Nancy        Drew  2789  ndrew@uchicago.edu  Mathematics  CS 121     90    A-
   2       V.I.  Warshawski  7654    viw@uchicago.edu  Mathematics  CS 121     85    B+

Notice that the columns from the first argument (``students``) are followed by the corresponding columns from the second argument (``grades``) minus ``UCID``, and that each row contains information for the same UCID. Sherlock Holmes does not appear in the result, because there is no row with his ``UCID`` (2222) in the ``grades`` dataframe. Also, notice that UCID 9999, which appears in the ``grades`` dataframe, does not appear in the result, because it has no mate in the ``students`` dataframe.

To include Sherlock Holmes (but not UCID 9999) in the result, we need to switch the merge operation (``how``) from ``"inner"`` to ``"left"``, meaning that we want to do a left merge (aka left outer join)::

   In [80]: pd.merge(students, grades, on="UCID", how="left")
   Out[80]:
     First Name   Last Name  UCID                     Email        Major  Course  Score Grade
   0        Sam       Spade  1234        spade@uchicago.edu    Sociology  CS 121   65.0     C
   1      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics  CS 121   90.0    A-
   2   Sherlock      Holmes  2222  bakerstreet@uchicago.edu   Psychology     NaN    NaN   NaN
   3       V.I.  Warshawski  7654          viw@uchicago.edu  Mathematics  CS 121   85.0    B+

In this version, Sherlock Holmes has the value ``NaN`` for both his score and his grade, because he did not have a grade in the ``grades`` dataframe (and we did a left join).

Correspondingly, if we want to include UCID 9999 in our result, but not Sherlock Holmes, then we can do a ``"right"`` merge (aka right outer join)::

   In [81]: pd.merge(students, grades, on="UCID", how="right")
   Out[81]:
     First Name   Last Name  UCID               Email        Major  Course  Score Grade
   0        Sam       Spade  1234  spade@uchicago.edu    Sociology  CS 121     65     C
   1      Nancy        Drew  2789  ndrew@uchicago.edu  Mathematics  CS 121     90    A-
   2       V.I.  Warshawski  7654    viw@uchicago.edu  Mathematics  CS 121     85    B+
   3        NaN         NaN  9999                 NaN          NaN  CS 121    100    A+

Notice that UCID 9999 has NaNs for all of the ``students`` fields, since it did not appear in the ``students`` dataframe.
Finally, we can include both Sherlock Holmes and UCID 9999 in our result by doing an ``"outer"`` merge (aka outer join)::

   In [82]: pd.merge(students, grades, on="UCID", how="outer")
   Out[82]:
     First Name   Last Name  UCID                     Email        Major  Course  Score Grade
   0        Sam       Spade  1234        spade@uchicago.edu    Sociology  CS 121   65.0     C
   1      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics  CS 121   90.0    A-
   2   Sherlock      Holmes  2222  bakerstreet@uchicago.edu   Psychology     NaN    NaN   NaN
   3       V.I.  Warshawski  7654          viw@uchicago.edu  Mathematics  CS 121   85.0    B+
   4        NaN         NaN  9999                       NaN          NaN  CS 121  100.0    A+

So far, each row in one dataframe matched with at most one row in the other dataframe. Let's look at an example where the second dataframe contains grades for multiple courses and the same UCID occurs more than once. Here's the dataframe::

   In [85]: extended_grades = pd.read_csv("extended_grades.csv")

   In [86]: extended_grades
   Out[86]:
      Course  UCID  Score Grade
   0  CS 121  2789     90    A-
   1  CS 121  1234     65     C
   2  CS 121  7654     85    B+
   3  CS 121  9999    100    A+
   4  CS 122  2789     87    B+
   5  CS 122  1234     97     A
   6  CS 122  8888     76    B-

Notice that UCIDs 2789 and 1234 both occur twice. Here's what we get when we do an inner merge of the ``students`` dataframe and the ``extended_grades`` dataframe::

   In [88]: pd.merge(students, extended_grades, on="UCID", how="inner")
   Out[88]:
     First Name   Last Name  UCID               Email        Major  Course  Score Grade
   0        Sam       Spade  1234  spade@uchicago.edu    Sociology  CS 121     65     C
   1        Sam       Spade  1234  spade@uchicago.edu    Sociology  CS 122     97     A
   2      Nancy        Drew  2789  ndrew@uchicago.edu  Mathematics  CS 121     90    A-
   3      Nancy        Drew  2789  ndrew@uchicago.edu  Mathematics  CS 122     87    B+
   4       V.I.  Warshawski  7654    viw@uchicago.edu  Mathematics  CS 121     85    B+

We get two entries each for Sam Spade and Nancy Drew: one for each course. V.I. Warshawski appears only once, because she has a grade for only one course.
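The duplicate-key behavior above can be checked with a small runnable sketch; here the two frames are built in memory (with only a couple of columns) rather than read from the lab's CSV files:

```python
# Sketch: inner merge where some keys in the right frame repeat.
import pandas as pd

students = pd.DataFrame({
    "UCID": [1234, 2789, 2222, 7654],
    "Last Name": ["Spade", "Drew", "Holmes", "Warshawski"],
})
extended_grades = pd.DataFrame({
    "Course": ["CS 121", "CS 121", "CS 121", "CS 121",
               "CS 122", "CS 122", "CS 122"],
    "UCID": [2789, 1234, 7654, 9999, 2789, 1234, 8888],
    "Grade": ["A-", "C", "B+", "A+", "B+", "A", "B-"],
})

merged = pd.merge(students, extended_grades, on="UCID", how="inner")
# UCIDs 1234 and 2789 each match two grade rows, 7654 matches one, and
# 2222 (Holmes) matches none, so the inner merge yields 5 rows.
print(len(merged))                     # 5
print((merged["UCID"] == 1234).sum())  # 2
```

Each matching pair of rows produces one output row, which is why a student with two grades appears twice.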
As in the previous example, doing a left merge will add an entry for Sherlock Holmes::

   In [89]: pd.merge(students, extended_grades, on="UCID", how="left")
   Out[89]:
     First Name   Last Name  UCID                     Email        Major  Course  Score Grade
   0        Sam       Spade  1234        spade@uchicago.edu    Sociology  CS 121   65.0     C
   1        Sam       Spade  1234        spade@uchicago.edu    Sociology  CS 122   97.0     A
   2      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics  CS 121   90.0    A-
   3      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics  CS 122   87.0    B+
   4   Sherlock      Holmes  2222  bakerstreet@uchicago.edu   Psychology     NaN    NaN   NaN
   5       V.I.  Warshawski  7654          viw@uchicago.edu  Mathematics  CS 121   85.0    B+

Similarly, doing a right merge will add entries for UCIDs 8888 and 9999::

   In [90]: pd.merge(students, extended_grades, on="UCID", how="right")
   Out[90]:
     First Name   Last Name  UCID               Email        Major  Course  Score Grade
   0        Sam       Spade  1234  spade@uchicago.edu    Sociology  CS 121     65     C
   1        Sam       Spade  1234  spade@uchicago.edu    Sociology  CS 122     97     A
   2      Nancy        Drew  2789  ndrew@uchicago.edu  Mathematics  CS 121     90    A-
   3      Nancy        Drew  2789  ndrew@uchicago.edu  Mathematics  CS 122     87    B+
   4       V.I.  Warshawski  7654    viw@uchicago.edu  Mathematics  CS 121     85    B+
   5        NaN         NaN  9999                 NaN          NaN  CS 121    100    A+
   6        NaN         NaN  8888                 NaN          NaN  CS 122     76    B-

Finally, doing an outer merge will include data for Sherlock Holmes and both UCIDs 8888 and 9999, as well as data for all the students who appeared in both dataframes::

   In [91]: pd.merge(students, extended_grades, on="UCID", how="outer")
   Out[91]:
     First Name   Last Name  UCID                     Email        Major  Course  Score Grade
   0        Sam       Spade  1234        spade@uchicago.edu    Sociology  CS 121   65.0     C
   1        Sam       Spade  1234        spade@uchicago.edu    Sociology  CS 122   97.0     A
   2      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics  CS 121   90.0    A-
   3      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics  CS 122   87.0    B+
   4   Sherlock      Holmes  2222  bakerstreet@uchicago.edu   Psychology     NaN    NaN   NaN
   5       V.I.  Warshawski  7654          viw@uchicago.edu  Mathematics  CS 121   85.0    B+
   6        NaN         NaN  9999                       NaN          NaN  CS 121  100.0    A+
   7        NaN         NaN  8888                       NaN          NaN  CS 122   76.0    B-

**Task 14**: Use ``pd.merge`` and other pandas methods to produce a dataframe that contains a count of grades by major::

     Grade        Major  Count
   0     A    Sociology      1
   1    A-  Mathematics      1
   2    B+  Mathematics      2
   3     C    Sociology      1

Hint: you'll need methods you learned in this lab plus a couple (``reset_index`` and ``rename``) that you'll need to figure out from the documentation.

When finished
-------------

When finished with the lab, please check in your work (assuming you are inside the lab directory):

.. code::

   git add lab7.py
   git commit -m "Finished with lab7"
   git push

No, we are not grading your work. We just want to make sure your repository is in a clean state and that you have access to your work both at CSIL and on your VM.