Pandas¶
Pandas is a Python library for working with data. Pandas is a large library with many useful features. In this lab, we will focus on concepts that will be useful for PA #6.
Getting started¶
Navigate (cd
) to your
cmsc12100-aut-18-username
directory. Run git pull upstream master
to collect the lab materials and git pull
to sync with your
personal repository.
We will be using data from the Current Population Survey in this lab. In your lab directory, please run the following command on the Linux command line to get the data:
$ ./get_files.sh
The lab7
directory contains a file named lab7.py
, where you
will write your code for this lab. We have included code to import
pandas as pd
.
Fire up ipython3
and use it to experiment with Pandas in each
of the tasks in this lab. Once you’re sure you’ve arrived at the right
Pandas code for a given task, add it to lab7.py
. You can then run this
file to see the result of all the tasks. Please note that lab7.py
will fail to run until you’ve implemented Task 1 below.
Reading the data¶
As mentioned, we will be using a sample dataset from the Current
Population Survey for this assignment. The file
morg_d07_strings.csv
contains a modified version of the 2007 MORG
data, which we downloaded from the Census Bureau’s website.
The file is in comma-separated value (CSV) format. It
can be understood to represent a table with multiple rows and columns
(in fact, the CSV format is supported by most spreadsheet programs,
and you can try opening the file in Excel, Libreoffice Calc, etc.).
The first line of the file is the header of the file. It contains
the names of the columns, separated by commas. After the header, each
line in the file represents a row in the table, with each value in
the row (corresponding to the columns specified in the header)
separated by a comma. A common way to refer to a value in a row is as
a field. So, if a CSV file has an age
column, in an individual
row we would refer to the age
field (instead of column, which
tends to refer to an entire column of values).
Each row in morg_d07_strings.csv
corresponds to the survey data
obtained from a unique individual. We consider the following variables
for each individual in this assignment (although there are a lot more
variables available in the MORG datasets):
h_id
: a string that serves as a unique identifier, which we created by concatenating several variables in the original MORG datasets.age
: an integer value specifying the age of the individual.gender
: the gender (or sex) recorded for the individual.race
: the race recorded for the individual.ethnicity
: the ethnicity recorded for the individual.employment_status
: the employment status record for the individual.hours_worked_per_week
: an integer that specifies the usual weekly work hours of the individual.earnings_per_week
: a float that indicates the weekly earnings of the individual.
The CSV file has a column for each of these variables. Here are the first few lines of the file:
h_id,age,gender,race,ethnicity,employment_status,hours_worked_per_week,earnings_per_week
1_1_1,32,Female,BlackOnly,Non-Hispanic,Working,40.0,1250.0
1_2_2,80,Female,WhiteOnly,Non-Hispanic,Others2,,
1_3_3,20,Female,BlackOnly,Non-Hispanic,Others2,,
1_4_4,28,Male,WhiteOnly,Non-Hispanic,Working,40.0,1100.0
1_5_5,32,Male,WhiteOnly,Non-Hispanic,Working,52.0,1289.23
The original data uses codes for the different string variables. We replaced the codes with strings to simplify the process of using the data.
Task 1: Use pd.read_csv
to read the sample data into a pandas
dataframe and save the result in a variable named morg_df
. Use
h_id
, which uniquely identifies each row, and as the row index.
You may find it helpful to set the maximum numbers of rows and columns you see in your IPython session. You can do so using pandas’ options:
In [X]: pd.options.display.max_columns = 15
In [X]: pd.options.display.max_rows = 20
Basic indexing¶
If you are unsure of how to perform the following tasks, make sure to review the section on DataFrames in the course textbook.
Task 2: Extract the "age"
column from morg_df
.
Task 3: Extract the row that corresponds to h_id
1_2_2 from morg_df
.
Task 4: Use slicing to extract the first four rows of morg_df
.
Missing values¶
The read_csv
function inserts NaN
, which stands for “Not a
Number”, when a field is missing. It often makes sense to replace
these values with something more appropriate for the application.
Before we talk about how to do that let’s figure out which columns in
our dataframe have missing values. We can answer this question using
the isna
method in combination with the any
function. For
example:
any(morg_df["hours_worked_per_week"].isna())
will yield True
, where as:
any(morg_df["age"].isna())
yields False
.
Task 5: Write a loop that identifies the columns that have at least one missing value and constructs a dictionary that maps these column names to zero.
Task 6: Use the fillna
method to replace the missing values in
the columns you identified in the previous task with zero. Hint: take
a careful look at the documentation
and you’ll notice that the dictionary you constructed in the previous
task will be useful for this task.
Categorical values¶
Categoricals are a pandas data type used to represent variables that
can take on only a limited, and usually fixed, number of possible
values. We can replace a columns of strings with categorical values
using the .astype
method and assignment:
morg_df["gender"] = morg_df["gender"].astype("category")
Task 7: Convert the four string columns (ethnicity, gender, race, and employment status) to categoricals.
Converting floats to bins¶
On occasion, it is useful to group floating point values, such as ages or salaries, into bins. Pandas provides a function cut that serves this purpose.
Let’s look at some code for adding an “age bin” column to the Morg dataframe. But first, let’s look at the range of ages seen in our data:
In [19]: morg_df["age"].min()
Out[19]: 16
In [20]: morg_df["age"].max()
Out[20]: 85
We’d like to group the data into 9 bins. For this lab, we’ll use
range(16, 89, 8)
for this purpose. This approach is not ideal,
since the last bin includes four years that do not correspond to the
ages of any of the people in our dataset.
The bins need to be labeled. For simplicity, we’ll label the bins with integers. Keep in mind that we need \(N+1\) boundaries for \(N\) bins (and \(N\) labels).
Once we have defined the bin boundaries and the labels, we can use
cut
to compute a new column where the ith entry corresponds to the
appropriate bin for the age of the person in the ith row of morg_df
:
boundaries = range(16, 89, 8)
morg_df["age_bin"] = pd.cut(morg_df["age"],
bins=boundaries,
labels=range(len(boundaries)-1),
include_lowest=True, right=False)
We use include_lowest=True
because we have at least one person who
is 16 and we want to include his age in a bin. We use
right=False
, because we do not want to include 89 in our last bin.
Task 8: Add a column (hwpw_bin
) to morg_df
for hours worked per week that assigns a person’s hours worked per week to one of ten buckets, evenly spaced from the fewest number of hours worked (0) to the most (99) inclusive.
Once you have finished this task, take a look types of the columns
using the dtypes
property. You should see something like the
following:
In [12]: morg_df.dtypes
Out[12]:
age int64
gender category
race category
ethnicity category
employment_status category
hours_worked_per_week float64
earnings_per_week float64
age_bin category
hwpw_bin category
dtype: object
Filtering¶
As we discussed in class, the ability to filter the rows of a dataframe based on a boolean condition is very useful. If you don’t remember how to filter, take another look at the section on filtering in the course textbook.
Task 9: Use filtering to extract all rows that correspond to a person who works 35 or more hours per week.
Task 10: Use filtering to extract the rows that correspond to the people who are not working.
Task 11: Use filtering to extract the rows that correspond to people who worked at least 35 hours per week or who earned more than $1000 per week.
Counting¶
Counting the number of occurrences of different values is a very
common operation. We saw two ways to do this task in class: using
value_counts
and groupby
plus size
:
In [18]: es_counts = morg_df["employment_status"].value_counts()
In [19]:
Out[19]: es_counts
Working 159477
Others1 52664
Others2 39360
Unable to work or disabled 15489
With a job but not at work 8213
Looking 7391
Layoff 1426
Name: employment_status, dtype: int64
In [19]: morg_df.groupby("employment_status").size()
Out[19]:
employment_status
Layoff 1426
Looking 7391
Others1 52664
Others2 39360
Unable to work or disabled 15489
With a job but not at work 8213
Working 159477
dtype: int64
In both cases, the result is a Pandas Series (see the relevant section of the textbook
Task 12: Use the value count method to count the number of people recorded for different races/race combinations and then print the three five most frequent races (or combinations thereof).
Task 13: Use the group by approach to count the number of people in each race/gender combination.
Note that you can get the values sorted by the employment status for either approach using the Series sort_index
method:
In [31]: morg_df["employment_status"].value_counts().sort_index()
Out[31]:
Layoff 1426
Looking 7391
Others1 52664
Others2 39360
Unable to work or disabled 15489
With a job but not at work 8213
Working 159477
Name: employment_status, dtype: int64
Merging two dataframes¶
Combining data from different datasets is a very common task. We’ll explain how to do this task in Pandas using two very simple dataframes. The first contains information about students:
In [41]: students = pd.read_csv("data/students.csv")
In [42]: students
Out[42]:
First Name Last Name UCID Email Major
0 Sam Spade 1234 spade@uchicago.edu Sociology
1 Nancy Drew 2789 ndrew@uchicago.edu Mathematics
2 Sherlock Holmes 2222 bakerstreet@uchicago.edu Psychology
3 V.I. Warshawski 7654 viw@uchicago.edu Mathematics
The second contains information about grades for a specific course:
In [63]: grades = pd.read_csv("data/grades.csv")
In [64]: grades
Out[64]:
Course UCID Score Grade
0 CS 121 2789 90 A-
1 CS 121 1234 65 C
2 CS 121 7654 85 B+
3 CS 121 9999 100 A+
Notice that the grades
dataframe does not include the students’
names, email addresses, etc. To associate a name with each grade, we
need to combine the two dataframes.
More specifically, we can produce a new dataframe that contains
information from both the students
dataframe and the grades
dataframe using the pd.merge
function. We’ll use this function
with four arguments: the two data frames, the name of the column to
use to “match up” the rows in the different frames, and the method we
want to use to deal with rows from one dataframe that have no mate in
the other dataframe (for example, Sherlock Holmes (UCID: 2222) does
not appear in the grades
dataframe and the student with UCID 9999
does not appear in the students
dataframe.
Since the UCID
column occurs in both dataframes, we can use it as
to match-up the rows that correspond to specific students in the two
dataframes. The column or columns that are used to match up the rows
are known as the merge or join key.
If we use the most basic kind of merge (known as an inner join), the
result will contain three rows: one for each UCID that appears in both
the students
dataframe and the grades
dataframe:
In [79]: pd.merge(students, grades, on="UCID", how="inner")
Out[79]:
First Name Last Name UCID Email Major Course Score Grade
0 Sam Spade 1234 spade@uchicago.edu Sociology CS 121 65 C
1 Nancy Drew 2789 ndrew@uchicago.edu Mathematics CS 121 90 A-
2 V.I. Warshawski 7654 viw@uchicago.edu Mathematics CS 121 85 B+
Notice that the columns from the first argument (students
) are
followed by the corresponding columns from the second argument
(grades
) minus the UCID
and that each row contains information
for the same UCID. Sherlock Holmes does not appear in the result,
because there is no row with his UCID
(2222) in the grades ``
dataframe. Also, notice that UCID 9999, which appears in the
``grades
dataframe, does not appear in the result, because it has no
mate in the students
dataframe.
To include Sherlock Holmes (but not UCID 9999) in the result, we need
to switch the merge operation (how
) from "inner"
to
"left"
, meaning that we want to do a left merge (aka left outer
join):
In [80]: pd.merge(students, grades, on="UCID", how="left")
Out[80]:
First Name Last Name UCID Email Major Course Score Grade
0 Sam Spade 1234 spade@uchicago.edu Sociology CS 121 65.0 C
1 Nancy Drew 2789 ndrew@uchicago.edu Mathematics CS 121 90.0 A-
2 Sherlock Holmes 2222 bakerstreet@uchicago.edu Psychology NaN NaN NaN
3 V.I. Warshawski 7654 viw@uchicago.edu Mathematics CS 121 85.0 B+
In this version, Sherlock Holmes has the value NaN
for both his
score and his grade, because he did not have a grade in the grades
dataframe (and we did a left join).
Correspondingly, if we want to include UCID 9999 in our result, but
not Sherlock Holmes, then we can do a "right"
merge (aka right
outer join):
In [81]: pd.merge(students, grades, on="UCID", how="right")
Out[81]:
First Name Last Name UCID Email Major Course Score Grade
0 Sam Spade 1234 spade@uchicago.edu Sociology CS 121 65 C
1 Nancy Drew 2789 ndrew@uchicago.edu Mathematics CS 121 90 A-
2 V.I. Warshawski 7654 viw@uchicago.edu Mathematics CS 121 85 B+
3 NaN NaN 9999 NaN NaN CS 121 100 A+
Notice that UCID has NaNs for all of the students
fields, since
UCID 9999 did not appear in the students
dataframe.
Finally, we can include both Sherlock Holmes and UCID 9999 in our
result, by doing an "outer"
merge (aka, outer join):
In [82]: pd.merge(students, grades, on="UCID", how="outer")
Out[82]:
First Name Last Name UCID Email Major Course Score Grade
0 Sam Spade 1234 spade@uchicago.edu Sociology CS 121 65.0 C
1 Nancy Drew 2789 ndrew@uchicago.edu Mathematics CS 121 90.0 A-
2 Sherlock Holmes 2222 bakerstreet@uchicago.edu Psychology NaN NaN NaN
3 V.I. Warshawski 7654 viw@uchicago.edu Mathematics CS 121 85.0 B+
4 NaN NaN 9999 NaN NaN CS 121 100.0 A+
So far, each row in one dataframe matched with at most one row in the other dataframe. Let’s look at an example, where the second dataframe contains grades for multiple courses and the same UCID occurs more than once. Here’s the dataframe:
In [85]: extended_grades = pd.read_csv("extended_grades.csv")
In [86]: extended_grades
Out[86]:
Course UCID Score Grade
0 CS 121 2789 90 A-
1 CS 121 1234 65 C
2 CS 121 7654 85 B+
3 CS 121 9999 100 A+
4 CS 122 2789 87 B+
5 CS 122 1234 97 A
6 CS 122 8888 76 B-
Notice that UCIDs 2789 and 1234 both occur twice.
Here’s what we get when we do inner merge of the students dataframe
and the extended_grades
dataframe:
In [88]: pd.merge(students, extended_grades, on="UCID", how="inner")
Out[88]:
First Name Last Name UCID Email Major Course Score Grade
0 Sam Spade 1234 spade@uchicago.edu Sociology CS 121 65 C
1 Sam Spade 1234 spade@uchicago.edu Sociology CS 122 97 A
2 Nancy Drew 2789 ndrew@uchicago.edu Mathematics CS 121 90 A-
3 Nancy Drew 2789 ndrew@uchicago.edu Mathematics CS 122 87 B+
4 V.I. Warshawski 7654 viw@uchicago.edu Mathematics CS 121 85 B+
We get two entries for Sam Spade and Nancy Drew: once for each course. V.I. Warshawski appears only once, because she has a grade for only one course.
As in the previous example, doing a left merge will add an entry for Sherlock Holmes:
In [89]: pd.merge(students, extended_grades, on="UCID", how="left")
Out[89]:
First Name Last Name UCID Email Major Course Score Grade
0 Sam Spade 1234 spade@uchicago.edu Sociology CS 121 65.0 C
1 Sam Spade 1234 spade@uchicago.edu Sociology CS 122 97.0 A
2 Nancy Drew 2789 ndrew@uchicago.edu Mathematics CS 121 90.0 A-
3 Nancy Drew 2789 ndrew@uchicago.edu Mathematics CS 122 87.0 B+
4 Sherlock Holmes 2222 bakerstreet@uchicago.edu Psychology NaN NaN NaN
5 V.I. Warshawski 7654 viw@uchicago.edu Mathematics CS 121 85.0 B+
While doing a right merge, will add entries for UCIDs 8888 and 9999:
In [90]: pd.merge(students, extended_grades, on="UCID", how="right")
Out[90]:
First Name Last Name UCID Email Major Course Score Grade
0 Sam Spade 1234 spade@uchicago.edu Sociology CS 121 65 C
1 Sam Spade 1234 spade@uchicago.edu Sociology CS 122 97 A
2 Nancy Drew 2789 ndrew@uchicago.edu Mathematics CS 121 90 A-
3 Nancy Drew 2789 ndrew@uchicago.edu Mathematics CS 122 87 B+
4 V.I. Warshawski 7654 viw@uchicago.edu Mathematics CS 121 85 B+
5 NaN NaN 9999 NaN NaN CS 121 100 A+
6 NaN NaN 8888 NaN NaN CS 122 76 B-
Finally, doing an outer merge will include data for Sherlock Holmes and both UCID 8888 and 9999 as well as data for all the students who appeared in both dataframes:
In [91]: pd.merge(students, extended_grades, on="UCID", how="outer")
Out[91]:
First Name Last Name UCID Email Major Course Score Grade
0 Sam Spade 1234 spade@uchicago.edu Sociology CS 121 65.0 C
1 Sam Spade 1234 spade@uchicago.edu Sociology CS 122 97.0 A
2 Nancy Drew 2789 ndrew@uchicago.edu Mathematics CS 121 90.0 A-
3 Nancy Drew 2789 ndrew@uchicago.edu Mathematics CS 122 87.0 B+
4 Sherlock Holmes 2222 bakerstreet@uchicago.edu Psychology NaN NaN NaN
5 V.I. Warshawski 7654 viw@uchicago.edu Mathematics CS 121 85.0 B+
6 NaN NaN 9999 NaN NaN CS 121 100.0 A+
7 NaN NaN 8888 NaN NaN CS 122 76.0 B-
Task 14 Use pd.merge
and other pandas methods to produce a dataframe that contains a count of grades by major:
Grade Major Count
0 A Sociology 1
1 A- Mathematics 1
2 B+ Mathematics 2
3 C Sociology 1
Hint: you’ll need methods you learned in this lab plus a couple
(reset_index
and rename
) that you’ll need to figure out from
the documentation.
When finished¶
When finished with the lab please check in your work (assuming you are inside the lab directory):
git add lab7.py
git commit -m "Finished with lab7"
git push
No, we are not grading your work. We just want to make sure your repository is in a clean state and that you have access to work your on both at CSIL and on your VM.