Team Tutorial #3: Dictionaries¶
For instructions on how to get started on these tutorials, please see Team Tutorial #1.
Structure of this tutorial¶
In this tutorial, you will practice using dictionaries, a useful data type built into Python. A dictionary (dict for short) is a generalization of arrays/lists that associates keys with values. In computer science, this data type is also referred to as an associative array or a map.
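As a quick preview, here is a minimal sketch of the basic operations on a dictionary. The stock data is made up for illustration and is not part of the tutorial files.

```python
# Made-up inventory data, used only for illustration.
stock = {"apples": 3, "pears": 5}

# Look up the value associated with a key.
apples = stock["apples"]

# Add a new key/value pair, then update an existing one.
stock["plums"] = 12
stock["pears"] = stock["pears"] - 1

# Check whether a key is present and count the entries.
has_kiwis = "kiwis" in stock
num_items = len(stock)

# Remove a key/value pair.
del stock["apples"]

print(apples, has_kiwis, num_items)  # 3 False 3
print(stock)                         # {'pears': 4, 'plums': 12}
```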
By the end of this tutorial, you should be able to:
Perform basic operations on dictionaries,
Use dictionaries to count multiple different items at once,
Iterate through the key/value pairs in a dictionary, and
Construct nested dictionaries to describe data with a complex structure.
You will need many of these skills for Programming Assignment #3. This tutorial has two main sections:
Part 1: Simple practice: In this part, we will demonstrate basic operations of and common uses for dictionaries, and then you will practice these operations.
Part 2: Extended activity: The second part will provide further practice with nested dictionaries and composition. Then, you will make a decision about code design.
Please note that, while you could do both of these activities with your team in one sitting, it may be better to work through the first part earlier in the module, and the second part once you’ve become more comfortable with dictionaries.
Getting started¶
If this is your first time working through a Team Tutorial, please see the “Getting started” section of Team Tutorial #1 to set up your Team Tutorials repository.
To get the files for this tutorial, navigate to your Team Tutorials repository and pull the new material:
cd ~/capp30121
cd team-tutorials-$GITHUB_USERNAME
git pull upstream main
You will find the files you need in the tt3 directory.
Simple practice: Dictionaries¶
In this section, we will show you how to create dictionaries and perform common operations, and then we will give you some tasks to work on.
When you get to the tasks, we suggest that one person on the team have Visual Studio Code (VS Code) and ipython3 open and visible to everyone, and that you all collaboratively decide on the answer to each task.
If you get stumped on any of these tasks, try re-reading the relevant sections of the book. The concepts in this tutorial are discussed in the Dictionaries and Sets chapter of the book. If you get really stumped, don’t hesitate to ask for help.
Data¶
We will be using data from the Consumer Financial Protection Bureau’s Consumer Complaint Database. Each complaint has information such as:
the company,
the date the complaint was received,
a unique ID,
the issue,
the product,
the consumer’s complaint narrative,
the company’s public response,
the consumer’s home state,
the consumer’s ZIP code,
and so on.
We have included code in the file cfpb.py to define a variable CFPB_16. This variable contains information on 1000 complaints that the Consumer Financial Protection Bureau (CFPB) received in 2016.
Dictionaries as a simple data representation¶
We could store this information in a list:
complaint_as_list = ['Wells Fargo & Company',
                     '07/29/2013',
                     '468882',
                     'Managing the loan or lease',
                     'Consumer Loan',
                     '',
                     'Closed with explanation',
                     'VA',
                     '24540',
                     '',
                     'N/A',
                     'No',
                     '07/30/2013',
                     '',
                     'Vehicle loan',
                     'Phone',
                     '',
                     'Yes',
                     2]
But then we would have to keep track of the fact that the name of the company is at index 0 (complaint_as_list[0]), the date received is at index 1 (complaint_as_list[1]), and so on.
Code written this way is cryptic to read (especially if the reader is unfamiliar with the structure of the data), which makes it more likely that you will accidentally introduce bugs. It is also unnecessarily difficult to update the code if the structure of the data changes (for example, if a new version of the data stops tracking the ZIP code, or starts tracking the time the complaint was received).
Dictionaries allow us to use more meaningful values to access the different parts of a complaint. In particular, we can use strings as keys. Here, for example, is the same complaint represented using a dictionary:
complaint_as_dict = {'Company': 'Wells Fargo & Company',
                     'Date received': '07/29/2013',
                     'Complaint ID': '468882',
                     'Issue': 'Managing the loan or lease',
                     'Product': 'Consumer Loan',
                     'Consumer complaint narrative': '',
                     'Company response to consumer': 'Closed with explanation',
                     'State': 'VA',
                     'ZIP code': '24540',
                     'Company public response': '',
                     'Consumer consent provided?': 'N/A',
                     'Consumer disputed?': 'No',
                     'Date sent to company': '07/30/2013',
                     'Sub-issue': '',
                     'Sub-product': 'Vehicle loan',
                     'Submitted via': 'Phone',
                     'Tags': '',
                     'Timely response?': 'Yes',
                     'Wells Fargo & Company': 2}
Given such a dictionary, we can extract the name of the company using the string "Company" as the index or key: complaint_as_dict["Company"]. We can extract the home state of the complainant using the expression complaint_as_dict["State"].
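One thing to watch for when indexing by key: the key must match exactly. The sketch below uses a trimmed-down complaint with only two of the keys shown above, and shows that asking for a key that is not present (here, a lowercase "state") raises a KeyError.

```python
# A trimmed-down complaint with just two of the keys shown above.
complaint = {"Company": "Wells Fargo & Company", "State": "VA"}

state = complaint["State"]

# Keys are case-sensitive: "state" is not the same key as "State".
try:
    complaint["state"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(state)          # VA
print(lookup_failed)  # True
```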
Notice that although the types of the keys are all the same (strings), the types of the values associated with the keys are not: most are strings, but one is an integer. This arrangement is common, but not required.
From this point onward, we will represent a complaint as a dictionary using the representation shown above. The file cfpb.py provides a variable CFPB_16, which is a list of complaints (which is to say, a list of dictionaries).
Open the file cfpb.py in VS Code and an ipython3 window. We will write our code in cfpb.py and test it in ipython3. When you start ipython3, it’s a good idea to set up autoreload so that the changes you make in your .py files will be reflected in the interpreter:
In [1]: %load_ext autoreload
In [2]: %autoreload 2
In [3]: import cfpb
Dictionaries provide a mechanism for mapping keys (often, but not always, strings) to values. They are often used to represent multi-part data, like the CFPB complaint data discussed above.
Printing out the whole list is overwhelming, but just like any list, we can access individual elements. Here is the complaint at index 20 of the list:
In [4]: cfpb.CFPB_16[20]
Out[4]:
{'Issue': 'Incorrect information on credit report',
'Company': 'Experian',
'Experian': 3,
'Consumer complaint narrative': '',
'State': '',
'Complaint ID': '1725857',
'Sub-issue': 'Account status',
'Company public response': 'Company chooses not to provide a public response',
'Submitted via': 'Referral',
'Consumer disputed?': 'No',
'Date sent to company': '01/05/2016',
'Date received': '01/04/2016',
'Consumer consent provided?': 'N/A',
'Sub-product': '',
'Tags': '',
'Product': 'Credit reporting',
'Company response to consumer': 'Closed with explanation',
'Timely response?': 'Yes',
'ZIP code': ''}
Here is the issue associated with that complaint:
In [5]: cfpb.CFPB_16[20]["Issue"]
Out[5]: 'Incorrect information on credit report'
Let’s say we want to see all the issues involved in this set of complaints. We could iterate through the complaints and, for each one, access its "Issue" key:
In [6]: for complaint in cfpb.CFPB_16:
   ...:     print(complaint["Issue"])
   ...:
If you run this code, you’ll get a long printout of issues (1000 lines, corresponding to the 1000 complaints). However, this list contains a lot of redundancy: there are many different complaints about the same issue. It might be more informative to store these issues as a set; in a set, we ignore duplicate elements.
Note
Python has a built-in set data structure that will be useful for this task. The expression set() creates an empty set. The add method can be used to add an element to the set. For example, executing this code:
s = set()
s.add("a")
s.add("b")
s.add("a")
s.add("c")
print(s)
yields the set:
{'a', 'b', 'c'}
You can also pass a list to the set constructor, as in set(["a", "b", "a", "c"]), and it will construct a set from the elements of the list. Note that sets do not preserve order, so {'a', 'b', 'c'} and {'b', 'c', 'a'} are both possible results from evaluating set(["a", "b", "a", "c"]).
We start by making an empty set:
In [7]: issues = set()
Then, we iterate through the list of complaints, and for each one, we add its issue to the set:
In [8]: for complaint in cfpb.CFPB_16:
   ...:     issues.add(complaint["Issue"])
   ...:
We get a (relatively) small set of issues:
In [9]: issues
Out[9]:
{'APR or interest rate',
'Account opening, closing, or management',
'Adding money',
'Advertising and marketing',
'Application, originator, mortgage broker',
'Billing disputes',
'Billing statement',
"Can't contact lender",
"Can't repay my loan",
"Can't stop charges to bank account",
"Charged fees or interest I didn't expect",
'Closing/Cancelling account',
'Communication tactics',
"Cont'd attempts collect debt not owed",
'Credit card protection / Debt protection',
'Credit decision / Underwriting',
'Credit determination',
'Credit line increase/decrease',
'Credit monitoring or identity protection',
"Credit reporting company's investigation",
'Customer service / Customer relations',
'Dealing with my lender or servicer',
'Delinquent account',
'Deposits and withdrawals',
'Disclosure verification of debt',
'Disclosures',
'False statements or representation',
'Fees',
'Forbearance / Workout plans',
'Fraud or scam',
'Getting a loan',
'Identity theft / Fraud / Embezzlement',
'Improper contact or sharing of info',
'Improper use of my credit report',
'Incorrect information on credit report',
'Late fee',
'Loan modification,collection,foreclosure',
'Loan servicing, payments, escrow account',
'Making/receiving payments, sending money',
'Managing the loan or lease',
'Managing, opening, or closing account',
'Money was not available when promised',
'Other',
'Other fee',
'Other service issues',
'Other transaction issues',
'Overdraft, savings or rewards features',
'Payment to acct not credited',
'Payoff process',
'Privacy',
'Problems caused by my funds being low',
'Problems when you are unable to pay',
"Received a loan I didn't apply for",
'Rewards',
'Settlement process and costs',
'Shopping for a loan or lease',
'Taking out the loan or lease',
'Taking/threatening an illegal action',
'Transaction issue',
'Unable to get credit report/credit score',
'Unauthorized transactions/trans. issues',
'Unsolicited issuance of credit card',
'Using a debit or ATM card'}
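The loop-and-add pattern above can also be written as a set comprehension. The sketch below compares the two on three made-up complaint dictionaries that contain only the "Issue" key; the toy_complaints name is ours, not part of cfpb.py.

```python
# Three made-up complaint dictionaries with only the key we need.
toy_complaints = [
    {"Issue": "Late fee"},
    {"Issue": "Billing disputes"},
    {"Issue": "Late fee"},
]

# Loop-and-add version, as in the text.
issues_loop = set()
for complaint in toy_complaints:
    issues_loop.add(complaint["Issue"])

# Equivalent set comprehension.
issues_comp = {complaint["Issue"] for complaint in toy_complaints}

print(issues_comp == issues_loop)  # True
print(len(issues_comp))            # 2
```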
Our first few tasks take a list of complaints and compute a simple value.
Task 1: In cfpb.py, write a function
def find_companies(complaints):
that takes a list of complaints and returns a list (or set — see above) of the companies that received at least one complaint.
Remember: we’ve included a variable called CFPB_16 that contains information from 1000 complaints in 2016. You will be using this variable when testing these functions, as shown below.
For example, in the interpreter you can run:
In [10]: cfpb.find_companies(cfpb.CFPB_16)
We’re not including the full output because it includes 256 companies. Your function should also return 256 companies! (Hint: you can determine the size of a list or set using the len function.)
Note
You may have noticed that the value of variable CFPB_16 is not written out explicitly in cfpb.py. If you’re curious, this data is stored in the cfpb16_1000.json file and loaded at the top of the cfpb.py file. The file uses a format called JSON that we will describe in more detail later in the quarter; for now, you do not need to concern yourself with that file. Simply use the CFPB_16 variable we provide.
Task 2: In cfpb.py, write a function
def count_complaints_about(complaints, company_name):
that takes a list of complaint dictionaries and the name of a company as a string and returns the number of complaints received for that company.
You can test this function like this:
In [11]: cfpb.count_complaints_about(cfpb.CFPB_16, "Citibank")
Out[11]: 39
Counting with dictionaries¶
The complaint representation we discussed in the last section is static in the sense that the contents of a complaint dictionary do not change. Dictionaries are also used in more dynamic ways. For example, let’s say we wanted to compute the number of complaints received per company.
Our goal is to compute a dictionary that maps a company name to the number of complaints received about that company. The dictionary would include an entry for every company that received at least one complaint.
We could start by using the result of Task 1 to initialize a dictionary that maps each company name to zero:
by_company = {}
for company in find_companies(complaints):
    by_company[company] = 0
(You can follow along with this example by writing this code in a file, and then copy-pasting it into the interpreter to test it.)
We could then loop over the complaints, extracting the company from the complaint and updating the associated count appropriately.
for complaint in complaints:
    c = complaint["Company"]
    by_company[c] = by_company[c] + 1
This approach requires two passes over the data: one to identify the companies and one to compute the counts. It would be better to do the computation in one pass over the data.
We can use the in operator, which allows us to check whether a given key has a value associated with it in a dictionary, and initialize the value associated with the key, if necessary.
Given this operation, we can re-write our counting code as follows:
by_company = {}
for complaint in complaints:
    c = complaint["Company"]
    if c in by_company:
        by_company[c] = by_company[c] + 1
    else:
        by_company[c] = 1
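Note that, applied to a dictionary, the in operator looks only at the keys, never at the values. A small sketch with made-up counts:

```python
# Made-up counts, used only for illustration.
counts = {"Citibank": 2, "Experian": 1}

key_present = "Citibank" in counts      # True: "Citibank" is a key
value_as_key = 2 in counts              # False: 2 is a value, not a key
value_present = 2 in counts.values()    # True: search the values explicitly
key_absent = "Equifax" not in counts    # True: there is no such key

print(key_present, value_as_key, value_present, key_absent)
```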
We could simplify this code a bit using not in, as in:
by_company = {}
for complaint in complaints:
    c = complaint["Company"]
    if c not in by_company:
        by_company[c] = 0
    by_company[c] = by_company[c] + 1
To simplify further, the get method for dictionaries allows us to specify a value to use as a default if a key does not appear in a dictionary:
by_company = {}
for complaint in complaints:
    c = complaint["Company"]
    by_company[c] = by_company.get(c, 0) + 1
At this point, we could put this code into a function, add it to the file cfpb.py, and do further testing.
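For comparison, Python’s standard library also provides collections.Counter, a dictionary subclass designed for exactly this kind of tallying. The tutorial does not require it; the sketch below, on made-up data, only shows that it produces the same counts as the get-based loop.

```python
from collections import Counter

# Made-up complaints with only the key we need.
toy_complaints = [
    {"Company": "Citibank"},
    {"Company": "Experian"},
    {"Company": "Citibank"},
]

# The get-based pattern from the text.
by_company = {}
for complaint in toy_complaints:
    c = complaint["Company"]
    by_company[c] = by_company.get(c, 0) + 1

# The same tally using Counter.
by_company_counter = Counter(c["Company"] for c in toy_complaints)

print(by_company)                              # {'Citibank': 2, 'Experian': 1}
print(dict(by_company_counter) == by_company)  # True
```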
Task 3: Write a function:
def count_by_state(complaints):
that takes a list of complaint dictionaries and returns a dictionary that maps a state to the number of complaints reported from that state.
You can test this function like this:
In [12]: cfpb.count_by_state(cfpb.CFPB_16)
We’re not including the full output, but you can spot-check that "CA" has 122 complaints, and that there is an entry in the dictionary for the empty string "" with 8 complaints (presumably from complaints where the state was not specified).
Iterating over dictionaries¶
We often want to perform an action on every key or key/value pair in a dictionary. We can use three dictionary methods, .keys(), .values(), and .items(), which return “list-like structures” (dictionary view objects, which we will cover in more detail in future lectures) that contain the keys, values, and key/value tuples, respectively. For example, if we want to print out the results of the previous function, we can write (again, we can write this in a file and then copy it into the interpreter):
state_cnts = count_by_state(complaints)
for key in state_cnts.keys():
    print(key, state_cnts[key])
print()
Because iterating over the keys of a dictionary is so common, Python allows you to omit the .keys() call:
for key in state_cnts:
    print(key, state_cnts[key])
print()
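Iterating over a dictionary yields its keys in the order in which they were inserted (this is guaranteed as of Python 3.7), which is not necessarily the order you want for output. If we want to print the results in alphabetical order by key, we can call the built-in sorted function: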
for key in sorted(state_cnts):
    print(key, state_cnts[key])
print()
Finally, the items() method is useful when, as in our example, you need both the keys and the values:
for key, cnt in sorted(state_cnts.items()):
    print(key, cnt)
print()
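To see what these calls produce, here is a small sketch on made-up counts (the numbers are not taken from the CFPB data). It shows the results of sorted on a dictionary and on its items, and one use of values.

```python
# Made-up counts; the "" key stands for complaints with no state recorded.
state_cnts = {"IL": 4, "CA": 7, "": 1}

# sorted() on a dictionary returns a list of its keys in sorted order.
sorted_keys = sorted(state_cnts)

# sorted() on items() returns (key, value) tuples ordered by key.
sorted_items = sorted(state_cnts.items())

# values() is handy when the keys do not matter, for example for a total.
total = sum(state_cnts.values())

print(sorted_keys)   # ['', 'CA', 'IL']
print(sorted_items)  # [('', 1), ('CA', 7), ('IL', 4)]
print(total)         # 12
```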
Warning: you should never remove mappings (that is, key/value pairs) from a dictionary as you iterate over that dictionary! Python will typically stop with a RuntimeError, and in any case the loop will not behave the way that you expect.
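The sketch below illustrates the warning on made-up counts and shows two common ways to avoid the problem: iterate over a snapshot of the keys, or build a new dictionary that keeps only the entries you want. The variable names are ours.

```python
# Made-up counts; we want to drop the entries whose count is zero.
state_cnts = {"IL": 4, "NY": 0, "CA": 7}

# Unsafe: deleting from the dictionary we are iterating over.
unsafe = dict(state_cnts)  # work on a copy so state_cnts stays intact
try:
    for state in unsafe:
        if unsafe[state] == 0:
            del unsafe[state]
    raised = False
except RuntimeError:
    raised = True

# Safe option 1: iterate over a snapshot (a list) of the keys.
cleaned = dict(state_cnts)
for state in list(cleaned):
    if cleaned[state] == 0:
        del cleaned[state]

# Safe option 2: build a new dictionary with only the entries to keep.
nonzero = {state: cnt for state, cnt in state_cnts.items() if cnt != 0}

print(raised)   # True
print(cleaned)  # {'IL': 4, 'CA': 7}
print(nonzero)  # {'IL': 4, 'CA': 7}
```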
Task 4: Write a function:
def state_with_most_complaints(cnt_by_state):
that takes the output of your function from Task 3 and determines the state with the most complaints. You can break ties arbitrarily.
You can test this as follows:
In [13]: by_state = cfpb.count_by_state(cfpb.CFPB_16)
In [14]: cfpb.state_with_most_complaints(by_state)
Out[14]: 'CA'
Nested dictionaries¶
Dictionaries can be nested, that is, the value associated with a key can itself be a dictionary. For example, we might have a dictionary that maps each company name to another dictionary that maps a state to the number of complaints about that company in that state. Here’s an abridged version of this dictionary computed using the 2016 complaint data:
by_company_by_state = {
    'Bank of Hawaii': {'HI': 1},
    'DriveTime': {'FL': 2},
    'MB Financial, INC': {'IL': 1},
    'Specialized Loan Servicing LLC': {'AZ': 1, 'CA': 1},
    'The Money Source Inc': {'NJ': 1},
    'Transworld Systems Inc.': {'IL': 1, 'PA': 3, 'TN': 1, 'TX': 2, 'VA': 1}
}
The expression by_company_by_state['The Money Source Inc']['NJ'] would yield 1, the number of complaints made from New Jersey about this company.
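Reading from a nested dictionary combines the operations we have already seen. The sketch below uses three entries copied from the abridged dictionary above; the totals variable is a hypothetical addition for illustration.

```python
# Three entries copied from the abridged dictionary shown above.
by_company_by_state = {
    "DriveTime": {"FL": 2},
    "Specialized Loan Servicing LLC": {"AZ": 1, "CA": 1},
    "Transworld Systems Inc.": {"IL": 1, "PA": 3, "TN": 1, "TX": 2, "VA": 1},
}

# Two lookups in a row: first the company, then the state.
pa_count = by_company_by_state["Transworld Systems Inc."]["PA"]

# get on the inner dictionary supplies a default for a state with no entry.
fl_count = by_company_by_state["Transworld Systems Inc."].get("FL", 0)

# items() gives each company together with its inner dictionary.
totals = {}
for company, by_state in by_company_by_state.items():
    totals[company] = sum(by_state.values())

print(pa_count)  # 3
print(fl_count)  # 0
print(totals["Transworld Systems Inc."])  # 8
```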
Task 5: Write a function:
def count_by_company_by_state(complaints):
that takes a list of complaints and computes the by_company_by_state dictionary described above.
You can test this function like this:
In [15]: cfpb.count_by_company_by_state(cfpb.CFPB_16)
This will produce a lot of output, but you can spot-check your output using the values in the by_company_by_state dictionary shown above.
Dictionaries can also map keys to lists or even to lists of dictionaries.
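As a minimal sketch of a dictionary that maps keys to lists, the following groups a made-up list of words by first letter. It reuses the not in pattern from the counting section, starting an empty list the first time a key is seen.

```python
# Made-up data, unrelated to the CFPB complaints.
words = ["apple", "avocado", "banana", "blueberry", "cherry"]

by_letter = {}
for word in words:
    first = word[0]
    if first not in by_letter:
        by_letter[first] = []  # start a new list the first time we see this key
    by_letter[first].append(word)

print(by_letter)
# {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}
```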
For more practice with nested dictionaries, complete the extended activity.
Extended activity: Nested dictionaries and composition¶
For the extended activity, we will keep working with the CFPB complaints data.
Task 6: Write a function:
def complaints_by_company(complaints):
that takes a list of complaint dictionaries and returns a dictionary that maps the name of a company to a list of the complaint dictionaries that concern that company.
You can test this function like this:
In [16]: cfpb.complaints_by_company(cfpb.CFPB_16)
You can spot-check the resulting dictionary by checking the following:
Does the dictionary have 256 entries?
Does 'The Money Source Inc' contain a list with a single complaint?
Does 'Specialized Loan Servicing LLC' contain a list with two complaints, one in AZ and another in CA?
Task 7: Write a function:
def count_by_company_by_state_2(complaints):
that has the exact same behavior as the count_by_company_by_state function from Task 5; that is, it takes a list of complaints and computes the by_company_by_state dictionary. But this time, instead of computing the by_company_by_state dictionary directly, the function should compute it by composing the complaints_by_company function from Task 6 with the count_by_state function from Task 3 (you will also use a for loop).
You can test count_by_company_by_state_2 the same way you tested count_by_company_by_state. You can also test that these two functions give the same output as each other. (Hint: you can use the == operator to check if two dictionaries are equal.)
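Two dictionaries are equal when they contain the same key/value pairs, regardless of the order in which the pairs were added, and the comparison also applies to nested dictionaries. A small sketch with made-up values:

```python
a = {"IL": 1, "PA": 3}
b = {"PA": 3, "IL": 1}  # same pairs, added in a different order
c = {"IL": 1, "PA": 4}  # same keys, one different value

same_pairs = a == b
different_value = a == c

# The comparison also looks inside nested dictionaries.
nested_1 = {"Some Company": {"IL": 1, "PA": 3}}
nested_2 = {"Some Company": {"PA": 3, "IL": 1}}
nested_equal = nested_1 == nested_2

print(same_pairs, different_value, nested_equal)  # True False True
```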
Task 8: With your team, compare the implementations of count_by_company_by_state and count_by_company_by_state_2, and discuss the advantages and disadvantages of each. Some aspects to consider:
Which is easier to read and understand?
Which uses less code?
Which is easier to debug?
Which is faster? As a proxy for speed, you can ask: Which requires fewer passes through the data?
Which implementation would you choose?
When finished¶
Once you finish working on the tutorial, you should add, commit, and push the files in the tt3 directory. We won’t be looking at them or grading them, but pushing ensures that you can access those files later if you start working on a different computer, and it also allows us to look at them if you have specific questions about your solutions.