Team Tutorial #7: Pandas

Pandas is a Python library for working with data. Pandas is a large library with many useful features. In this team tutorial, we will focus on concepts that will be useful for PA #7, an optional, bonus assignment focusing on Pandas.

Getting started

If this is your first time working through a Team Tutorial, please see the “Getting started” section of Team Tutorial #1 to set up your Team Tutorials repository.

To get the files for this tutorial, navigate to your Team Tutorials repository and pull the new material:

cd ~/cmsc12100
cd team-tutorials-$GITHUB_USERNAME
git pull upstream main

We will be using data from the Current Population Survey in this tutorial. In your tt7/ directory, please run the following command on the Linux command line to get the data:

$ ./get_files.sh

The tt7 directory contains a file named tt7.py, where you will write your code for this tutorial. We have included code to import pandas as pd.

Fire up ipython3 and use it to experiment with Pandas in each of the tasks in this tutorial. Once you’re sure you’ve arrived at the right Pandas code for a given task, add it to tt7.py. You can then run this file to see the result of all the tasks. Please note that tt7.py will fail to run until you’ve implemented Task 1 below.

Reading the data

As mentioned, we will be using a sample dataset from the Current Population Survey for this tutorial. The file morg_d07_strings.csv contains a modified version of the 2007 MORG data, which we downloaded from the Census Bureau’s website.

The file is in comma-separated value (CSV) format. It can be understood to represent a table with multiple rows and columns (in fact, the CSV format is supported by most spreadsheet programs, and you can try opening the file in Excel, Libreoffice Calc, etc.). The first line of the file is the header of the file. It contains the names of the columns, separated by commas. After the header, each line in the file represents a row in the table, with each value in the row (corresponding to the columns specified in the header) separated by a comma. A common way to refer to a value in a row is as a field. So, if a CSV file has an age column, in an individual row we would refer to the age field (instead of column, which tends to refer to an entire column of values).

Each row in morg_d07_strings.csv corresponds to the survey data obtained from a unique individual. We consider the following variables for each individual in this assignment (although there are a lot more variables available in the MORG datasets):

  • h_id: a string that serves as a unique identifier, which we created by concatenating several variables in the original MORG datasets.

  • age: an integer value specifying the age of the individual.

  • gender: the gender (or sex) recorded for the individual.

  • race: the race recorded for the individual.

  • ethnicity: the ethnicity recorded for the individual.

  • employment_status: the employment status record for the individual.

  • hours_worked_per_week: an integer that specifies the usual weekly work hours of the individual.

  • earnings_per_week: a float that indicates the weekly earnings of the individual.

The CSV file has a column for each of these variables. Here are the first few lines of the file:

h_id,age,gender,race,ethnicity,employment_status,hours_worked_per_week,earnings_per_week
1_1_1,32,Female,BlackOnly,Non-Hispanic,Working,40.0,1250.0
1_2_2,80,Female,WhiteOnly,Non-Hispanic,Others2,,
1_3_3,20,Female,BlackOnly,Non-Hispanic,Others2,,
1_4_4,28,Male,WhiteOnly,Non-Hispanic,Working,40.0,1100.0
1_5_5,32,Male,WhiteOnly,Non-Hispanic,Working,52.0,1289.23

The original data uses codes for the different string variables. We replaced the codes with strings to simplify the process of using the data.

Task 1: Use pd.read_csv to read the sample data into a pandas dataframe and save the result in a variable named morg_df. Use h_id, which uniquely identifies each row, and as the row index.

You may find it helpful to set the maximum numbers of rows and columns you see in your IPython session. You can do so using pandas’ options:

In [X]: pd.options.display.max_columns = 15

In [X]: pd.options.display.max_rows = 20

Basic indexing

If you are unsure of how to perform the following tasks, make sure to review the section on DataFrames in the course textbook or watch the Pandas lecture videos again.

Task 2: Extract the "age" column from morg_df.

Task 3: Extract the row that corresponds to h_id 1_2_2 from morg_df.

Task 4: Use slicing to extract the first four rows of morg_df.

Missing values

The read_csv function inserts NaN, which stands for “Not a Number”, when a field is missing. It often makes sense to replace these values with something more appropriate for the application. Before we talk about how to do that let’s figure out which columns in our dataframe have missing values. We can answer this question using the isna method in combination with the any function. For example:

any(morg_df.loc[:, "hours_worked_per_week"].isna())

will yield True, where as:

any(morg_df.loc[:, "age"].isna())

yields False.

Task 5: Write a loop that identifies the columns that have at least one missing value and constructs a dictionary that maps these column names to zero. Recall that you can access the column names for a dataframe via the columns attribute.

Task 6: Use the fillna method to replace the missing values in the columns you identified in the previous task with zero. Hint: take a careful look at the documentation and you’ll notice that the dictionary you constructed in the previous task will be useful for this task.

Categorical values

Categoricals are a pandas data type used to represent variables that can take on only a limited, and usually fixed, number of possible values. We can replace a columns of strings with categorical values using the .astype method and assignment:

morg_df.loc[:, "gender"] = morg_df.loc[:, "gender"].astype("category")

Task 7: Convert the four string columns (ethnicity, gender, race, and employment status) to categoricals.

Converting floats to bins

On occasion, it is useful to group floating point values, such as ages or salaries, into bins. Pandas provides a function cut that serves this purpose.

Let’s look at some code for adding an “age bin” column to the Morg dataframe. But first, let’s look at the range of ages seen in our data:

In [19]: morg_df.loc[:, "age"].min()
Out[19]: 16

In [20]: morg_df.loc[:, "age"].max()
Out[20]: 85

We’d like to group the data into 9 bins. For this tutorial, we’ll use range(16, 89, 8) for this purpose. This approach is not ideal, since the last bin includes four years that do not correspond to the ages of any of the people in our dataset.

The bins need to be labeled. For simplicity, we’ll label the bins with integers. Keep in mind that we need \(N+1\) boundaries for \(N\) bins (and \(N\) labels).

Once we have defined the bin boundaries and the labels, we can use cut to compute a new column where the ith entry corresponds to the appropriate bin for the age of the person in the ith row of morg_df:

boundaries = range(16, 89, 8)
morg_df.loc[:, "age_bin"] = pd.cut(morg_df.loc[:, "age"],
                                   bins=boundaries,
                                   labels=range(len(boundaries)-1),
                                   include_lowest=True, right=False)

We use include_lowest=True because we have at least one person who is 16 and we want to include his age in a bin. We use right=False, because we do not want to include 89 in our last bin.

Task 8: Add a column (hwpw_bin) to morg_df for hours worked per week that assigns a person’s hours worked per week to one of ten buckets, evenly spaced from the fewest number of hours worked (0) to the most (99) inclusive.

Once you have finished this task, take a look types of the columns using the dtypes property. You should see something like the following:

 In [12]: morg_df.dtypes
 Out[12]:
 age                         int64
 gender                   category
 race                     category
 ethnicity                category
 employment_status        category
 hours_worked_per_week     float64
 earnings_per_week         float64
 age_bin                  category
 hwpw_bin                 category
dtype: object

Filtering

As we discussed in the lecture videos, the ability to filter the rows of a dataframe based on a boolean condition is very useful. If you don’t remember how to filter, take another look at the section on filtering in the course textbook or watch the lecture videos again.

Task 9: Use filtering to extract all rows that correspond to a person who works 35 or more hours per week.

Task 10: Use filtering to extract the rows that correspond to the people who are not working.

Task 11: Use filtering to extract the rows that correspond to people who worked at least 35 hours per week or who earned more than $1000 per week. Hint: it might be useful to review the documentation on boolean indexing.

Counting

Counting the number of occurrences of different values is a very common operation. We can perform this task in several different ways. For instance, using value_counts and groupby plus size:

In [18]: es_counts = morg_df.loc[:, "employment_status"].value_counts()

In [19]: es_counts
Out[19]:
Working                       159477
Others1                        52664
Others2                        39360
Unable to work or disabled     15489
With a job but not at work      8213
Looking                         7391
Layoff                          1426
Name: employment_status, dtype: int64


In [19]: morg_df.groupby("employment_status").size()
Out[19]:
employment_status
Layoff                          1426
Looking                         7391
Others1                        52664
Others2                        39360
Unable to work or disabled     15489
With a job but not at work      8213
Working                       159477
dtype: int64

In both cases, the result is a Pandas Series (see the relevant section of the textbook

Task 12: Use the value count method to count the number of people recorded for different races/race combinations and then print the five most frequent races (or combinations thereof).

Task 13: Use the group by approach to count the number of people in each race/race combination.

Note that you can get the values sorted by the employment status for either approach using the Series sort_index method:

In [31]: morg_df.loc[:, "employment_status"].value_counts().sort_index()
Out[31]:
Layoff                          1426
Looking                         7391
Others1                        52664
Others2                        39360
Unable to work or disabled     15489
With a job but not at work      8213
Working                       159477
Name: employment_status, dtype: int64

Merging two dataframes

Combining data from different datasets is a very common task. We’ll explain how to do this task in Pandas using two very simple dataframes. The first contains information about students:

In [41]: students = pd.read_csv("data/students.csv")

In [42]: students
Out[42]:
  First Name   Last Name  UCID                     Email        Major
0        Sam       Spade  1234        spade@uchicago.edu    Sociology
1      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics
2   Sherlock      Holmes  2222  bakerstreet@uchicago.edu   Psychology
3       V.I.  Warshawski  7654          viw@uchicago.edu  Mathematics

The second contains information about grades for a specific course:

In [63]: grades = pd.read_csv("data/grades.csv")

In [64]: grades
Out[64]:
   Course  UCID  Score Grade
0  CS 121  2789     90    A-
1  CS 121  1234     65     C
2  CS 121  7654     85    B+
3  CS 121  9999    100    A+

Notice that the grades dataframe does not include the students’ names, email addresses, etc. To associate a name with each grade, we need to combine the two dataframes.

More specifically, we can produce a new dataframe that contains information from both the students dataframe and the grades dataframe using the pd.merge function. We’ll use this function with four arguments: the two data frames, the name of the column to use to “match up” the rows in the different frames, and the method we want to use to deal with rows from one dataframe that have no mate in the other dataframe (for example, Sherlock Holmes (UCID: 2222) does not appear in the grades dataframe and the student with UCID 9999 does not appear in the students dataframe).

Since the UCID column occurs in both dataframes, we can use it to match-up the rows that correspond to specific students in the two dataframes. The column or columns that are used to match up the rows are known as the merge or join key.

If we use the most basic kind of merge (known as an inner join, which is also the default kind of merge for pd.merge), the result will contain three rows: one for each UCID that appears in both the students dataframe and the grades dataframe:

In [79]: pd.merge(students, grades, on="UCID", how="inner")
Out[79]:
  First Name   Last Name  UCID               Email        Major  Course  Score Grade
0        Sam       Spade  1234  spade@uchicago.edu    Sociology  CS 121     65     C
1      Nancy        Drew  2789  ndrew@uchicago.edu  Mathematics  CS 121     90    A-
2       V.I.  Warshawski  7654    viw@uchicago.edu  Mathematics  CS 121     85    B+

Notice that the columns from the first argument (students) are followed by the corresponding columns from the second argument (grades) minus the UCID and that each row contains information for the same UCID. Sherlock Holmes does not appear in the result, because there is no row with his UCID (2222) in the grades dataframe. Also, notice that UCID 9999, which appears in the grades dataframe, does not appear in the result, because it has no mate in the students dataframe.

To include Sherlock Holmes (but not UCID 9999) in the result, we need to switch the merge operation (how) from "inner" to "left", meaning that we want to do a left merge (aka left outer join):

In [80]: pd.merge(students, grades, on="UCID", how="left")
Out[80]:
  First Name   Last Name  UCID                     Email        Major  Course  Score Grade
0        Sam       Spade  1234        spade@uchicago.edu    Sociology  CS 121   65.0     C
1      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics  CS 121   90.0    A-
2   Sherlock      Holmes  2222  bakerstreet@uchicago.edu   Psychology     NaN    NaN   NaN
3       V.I.  Warshawski  7654          viw@uchicago.edu  Mathematics  CS 121   85.0    B+

In this version, Sherlock Holmes has the value NaN for both his score and his grade, because he did not have a grade in the grades dataframe (and we did a left join).

Correspondingly, if we want to include UCID 9999 in our result, but not Sherlock Holmes, then we can do a "right" merge (aka right outer join):

In [81]: pd.merge(students, grades, on="UCID", how="right")
Out[81]:
  First Name   Last Name  UCID               Email        Major  Course  Score Grade
0        Sam       Spade  1234  spade@uchicago.edu    Sociology  CS 121     65     C
1      Nancy        Drew  2789  ndrew@uchicago.edu  Mathematics  CS 121     90    A-
2       V.I.  Warshawski  7654    viw@uchicago.edu  Mathematics  CS 121     85    B+
3        NaN         NaN  9999                 NaN          NaN  CS 121    100    A+

Notice that UCID has NaNs for all of the students fields, since UCID 9999 did not appear in the students dataframe.

Finally, we can include both Sherlock Holmes and UCID 9999 in our result, by doing an "outer" merge (aka, outer join):

In [82]: pd.merge(students, grades, on="UCID", how="outer")
Out[82]:
  First Name   Last Name  UCID                     Email        Major  Course  Score Grade
0        Sam       Spade  1234        spade@uchicago.edu    Sociology  CS 121   65.0     C
1      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics  CS 121   90.0    A-
2   Sherlock      Holmes  2222  bakerstreet@uchicago.edu   Psychology     NaN    NaN   NaN
3       V.I.  Warshawski  7654          viw@uchicago.edu  Mathematics  CS 121   85.0    B+
4        NaN         NaN  9999                       NaN          NaN  CS 121  100.0    A+

So far, each row in one dataframe matched with at most one row in the other dataframe. Let’s look at an example, where the second dataframe contains grades for multiple courses and the same UCID occurs more than once. Here’s the dataframe:

In [85]: extended_grades = pd.read_csv("data/extended_grades.csv")

In [86]: extended_grades
Out[86]:
   Course  UCID  Score Grade
0  CS 121  2789     90    A-
1  CS 121  1234     65     C
2  CS 121  7654     85    B+
3  CS 121  9999    100    A+
4  CS 122  2789     87    B+
5  CS 122  1234     97     A
6  CS 122  8888     76    B-

Notice that UCIDs 2789 and 1234 both occur twice.

Here’s what we get when we do inner merge of the students dataframe and the extended_grades dataframe:

In [88]: pd.merge(students, extended_grades, on="UCID", how="inner")
Out[88]:
  First Name   Last Name  UCID               Email        Major  Course  Score Grade
0        Sam       Spade  1234  spade@uchicago.edu    Sociology  CS 121     65     C
1        Sam       Spade  1234  spade@uchicago.edu    Sociology  CS 122     97     A
2      Nancy        Drew  2789  ndrew@uchicago.edu  Mathematics  CS 121     90    A-
3      Nancy        Drew  2789  ndrew@uchicago.edu  Mathematics  CS 122     87    B+
4       V.I.  Warshawski  7654    viw@uchicago.edu  Mathematics  CS 121     85    B+

We get two entries for Sam Spade and Nancy Drew: once for each course. V.I. Warshawski appears only once, because she has a grade for only one course.

As in the previous example, doing a left merge will add an entry for Sherlock Holmes:

In [89]: pd.merge(students, extended_grades, on="UCID", how="left")
Out[89]:
  First Name   Last Name  UCID                     Email        Major  Course  Score Grade
0        Sam       Spade  1234        spade@uchicago.edu    Sociology  CS 121   65.0     C
1        Sam       Spade  1234        spade@uchicago.edu    Sociology  CS 122   97.0     A
2      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics  CS 121   90.0    A-
3      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics  CS 122   87.0    B+
4   Sherlock      Holmes  2222  bakerstreet@uchicago.edu   Psychology     NaN    NaN   NaN
5       V.I.  Warshawski  7654          viw@uchicago.edu  Mathematics  CS 121   85.0    B+

While doing a right merge, will add entries for UCIDs 8888 and 9999:

In [90]: pd.merge(students, extended_grades, on="UCID", how="right")
Out[90]:
  First Name   Last Name  UCID               Email        Major  Course  Score Grade
0        Sam       Spade  1234  spade@uchicago.edu    Sociology  CS 121     65     C
1        Sam       Spade  1234  spade@uchicago.edu    Sociology  CS 122     97     A
2      Nancy        Drew  2789  ndrew@uchicago.edu  Mathematics  CS 121     90    A-
3      Nancy        Drew  2789  ndrew@uchicago.edu  Mathematics  CS 122     87    B+
4       V.I.  Warshawski  7654    viw@uchicago.edu  Mathematics  CS 121     85    B+
5        NaN         NaN  9999                 NaN          NaN  CS 121    100    A+
6        NaN         NaN  8888                 NaN          NaN  CS 122     76    B-

Finally, doing an outer merge will include data for Sherlock Holmes and both UCID 8888 and 9999 as well as data for all the students who appeared in both dataframes:

In [91]: pd.merge(students, extended_grades, on="UCID", how="outer")
Out[91]:
  First Name   Last Name  UCID                     Email        Major  Course  Score Grade
0        Sam       Spade  1234        spade@uchicago.edu    Sociology  CS 121   65.0     C
1        Sam       Spade  1234        spade@uchicago.edu    Sociology  CS 122   97.0     A
2      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics  CS 121   90.0    A-
3      Nancy        Drew  2789        ndrew@uchicago.edu  Mathematics  CS 122   87.0    B+
4   Sherlock      Holmes  2222  bakerstreet@uchicago.edu   Psychology     NaN    NaN   NaN
5       V.I.  Warshawski  7654          viw@uchicago.edu  Mathematics  CS 121   85.0    B+
6        NaN         NaN  9999                       NaN          NaN  CS 121  100.0    A+
7        NaN         NaN  8888                       NaN          NaN  CS 122   76.0    B-

Task 14 Use pd.merge and other pandas methods to produce a dataframe that contains a count of grades by major:

  Grade        Major  Count
0     A    Sociology      1
1    A-  Mathematics      1
2    B+  Mathematics      2
3     C    Sociology      1

Hint: you’ll need methods you learned in this tutorial plus a couple (reset_index and rename) that you’ll need to figure out from the documentation.

When finished

Once you finish working on the tutorial, you should add, commit, and push the files in the tt7 directory. No, we won’t be looking at them or grading them, but this ensures you can access those files later on if you start working on a different computer, and also allows us to look at them if you do have any specific questions about your solutions.