Team Tutorial #3: Dictionaries

For instructions on how to get started on these tutorials, please see Team Tutorial #1.

Structure of this tutorial

In this tutorial, you will practice using dictionaries, a useful data type built into Python. A dictionary (dict for short) is a generalization of arrays/lists that associates keys with values. In computer science, this data type is also referred to as an associative array or a map.

By the end of this tutorial, you should be able to:

  1. Perform basic operations on dictionaries,

  2. Use dictionaries to count multiple different items at once,

  3. Iterate through the key/value pairs in a dictionary, and

  4. Construct nested dictionaries to describe data with a complex structure.

You will need many of these skills for Programming Assignment #3. This tutorial has two main sections:

  • Part 1: Simple practice: In this part, we will demonstrate basic operations on dictionaries and their common uses, and then you will practice these operations.

  • Part 2: Extended activity: The second part will provide further practice with nested dictionaries and composition. Then, you will make a decision about code design.

Please note that, while you could do both of these activities with your team in one sitting, it may be better to work through the first part earlier in the module, and the second part once you’ve become more comfortable with dictionaries.

Getting started

If this is your first time working through a Team Tutorial, please see the “Getting started” section of Team Tutorial #1 to set up your Team Tutorials repository.

To get the files for this tutorial, navigate to your Team Tutorials repository and pull the new material:

cd ~/cmsc12100
cd team-tutorials-$GITHUB_USERNAME
git pull upstream main

You will find the files you need in the tt3 directory.

Simple practice: Dictionaries

In this section, we will show you how to create dictionaries and perform common operations, and then we will give you some tasks to work on.

When you get to the tasks, we suggest that one person on the team have Visual Studio Code (VS Code) and ipython3 open and visible to everyone, and that you all collaboratively decide on the answer to each task.

If you get stumped on any of these tasks, you may want to see if you can figure out the answer by re-reading the relevant sections of the book. The concepts in this tutorial are discussed in the Dictionaries and Sets chapter of the book. If you get really stumped, don’t hesitate to ask for help.

Data

We will be using data from the Consumer Financial Protection Bureau’s Consumer Complaint Database. Each complaint has information such as:

  • the company,

  • the date the complaint was received,

  • a unique ID,

  • the issue,

  • the product,

  • the consumer’s complaint narrative,

  • the company’s public response,

  • the consumer’s home state,

  • the consumer’s ZIP code,

  • and so on.

We have included code in the file cfpb.py to define a variable CFPB_16. This variable contains information on 1000 complaints that the Consumer Financial Protection Bureau (CFPB) received in 2016.

Dictionaries as a simple data representation

We could store this information in a list:

complaint_as_list = [
     'Wells Fargo & Company',
     '07/29/2013',
     '468882',
     'Managing the loan or lease',
     'Consumer Loan',
     '',
     'Closed with explanation',
     'VA',
     '24540',
     '',
     'N/A',
     'No',
     '07/30/2013',
     '',
     'Vehicle loan',
     'Phone',
     '',
     'Yes',
     2]

But then we would have to keep track of the fact that the name of the company is at index 0 (complaint_as_list[0]), the date received is at index 1 (complaint_as_list[1]), and so on. Code written this way is cryptic to read, especially for a reader who is unfamiliar with the structure of the data, and so it is more likely to contain accidental bugs. It is also unnecessarily difficult to update if the structure of the data changes (for example, if a new version of the data stops tracking the ZIP code, or starts tracking the time the complaint was received).

Dictionaries allow us to use more meaningful values to access the different parts of a complaint. In particular, we can use strings as keys. Here, for example, is the same complaint represented using a dictionary:

complaint_as_dict = {
    'Company': 'Wells Fargo & Company',
    'Date received': '07/29/2013',
    'Complaint ID': '468882',
    'Issue': 'Managing the loan or lease',
    'Product': 'Consumer Loan',
    'Consumer complaint narrative': '',
    'Company response to consumer': 'Closed with explanation',
    'State': 'VA',
    'ZIP code': '24540',
    'Company public response': '',
    'Consumer consent provided?': 'N/A',
    'Consumer disputed?': 'No',
    'Date sent to company': '07/30/2013',
    'Sub-issue': '',
    'Sub-product': 'Vehicle loan',
    'Submitted via': 'Phone',
    'Tags': '',
    'Timely response?': 'Yes',
    'Wells Fargo & Company': 2}

Given such a dictionary, we can extract the name of the company using the string "Company" as the index or key: complaint_as_dict["Company"]. We can extract the home state of the complainant using the expression complaint_as_dict["State"].

Notice that although the types of the keys are all the same (strings), the types of values associated with the keys are different. This arrangement is common, but not required.
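As a minimal sketch of these lookups, the snippet below uses a cut-down complaint with only three of the fields shown above. The variable name small_complaint is ours and is not part of cfpb.py.

```python
# A cut-down complaint with three of the fields from the example above.
# The name small_complaint is hypothetical (it is not defined in cfpb.py).
small_complaint = {
    "Company": "Wells Fargo & Company",
    "State": "VA",
    "ZIP code": "24540",
}

# Index with a key to get the associated value.
company = small_complaint["Company"]
state = small_complaint["State"]

# The in operator checks whether a key is present.
has_zip = "ZIP code" in small_complaint
has_tags = "Tags" in small_complaint

# Indexing with a key that is not present raises KeyError.
try:
    small_complaint["Tags"]
    missing_key_raised = False
except KeyError:
    missing_key_raised = True

print(company, state, has_zip, has_tags, missing_key_raised)
```

Note that keys are matched exactly, so "state" and "State" are different keys.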

From this point onward, we will represent a complaint as a dictionary using the representation shown above. The file cfpb.py provides a variable CFPB_16, which is a list of complaints (which is to say, a list of dictionaries).

Open the file cfpb.py in VS Code and an ipython3 window. We will write our code in cfpb.py and test it in ipython3. When you start ipython3, it’s a good idea to set up autoreload so that the changes you make in your .py files will be reflected in the interpreter:

In [1]: %load_ext autoreload

In [2]: %autoreload 2

In [3]: import cfpb

Dictionaries provide a mechanism for mapping keys (often, but not always, strings) to values. They are often used to represent multi-part data, like the CFPB complaint data discussed above.

Printing out the whole list is overwhelming, but just like any list, we can access individual elements. Here is the complaint at index 20 of the list:

In [4]: cfpb.CFPB_16[20]
Out[4]:
 {'Issue': 'Incorrect information on credit report',
 'Company': 'Experian',
 'Experian': 3,
 'Consumer complaint narrative': '',
 'State': '',
 'Complaint ID': '1725857',
 'Sub-issue': 'Account status',
 'Company public response': 'Company chooses not to provide a public response',
 'Submitted via': 'Referral',
 'Consumer disputed?': 'No',
 'Date sent to company': '01/05/2016',
 'Date received': '01/04/2016',
 'Consumer consent provided?': 'N/A',
 'Sub-product': '',
 'Tags': '',
 'Product': 'Credit reporting',
 'Company response to consumer': 'Closed with explanation',
 'Timely response?': 'Yes',
 'ZIP code': ''}

Here is the issue associated with that complaint:

In [5]: cfpb.CFPB_16[20]["Issue"]
Out[5]: 'Incorrect information on credit report'

Let’s say we want to see all the issues involved in this set of complaints. We could iterate through the complaints and for each one, access its "Issue" key:

In [6]: for complaint in cfpb.CFPB_16:
   ...:     print(complaint["Issue"])
   ...:

If you run this code, you’ll get a long printout of issues (1000 lines, corresponding to the 1000 complaints). However, this list contains a lot of redundancy: there are many different complaints about the same issue. It might be more informative to store these issues as a set; in a set, we ignore duplicate elements.

Note

Python has a built-in set data structure that will be useful for this task. The expression set() creates an empty set. The add method can be used to add an element to the set. For example, executing this code:

s = set()
s.add("a")
s.add("b")
s.add("a")
s.add("c")
print(s)

yields the set:

{'a', 'b', 'c'}

You can also pass a list to the set constructor set(["a", "b", "a", "c"]) and it will construct a set from the elements of the list. Note that sets do not preserve order, so {'a', 'b', 'c'} and {'b', 'c', 'a'} are both possible results from evaluating set(["a", "b", "a", "c"]).
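The following sketch, using the same four letters, shows how a set drops duplicates and how to ask for its size or test membership. The variable names are chosen only for illustration.

```python
# Build a set from a list that contains a duplicate.
items = ["a", "b", "a", "c"]
unique_items = set(items)

print(len(items))            # 4: the list keeps the duplicate
print(len(unique_items))     # 3: the set does not
print("a" in unique_items)   # True
print("z" in unique_items)   # False

# sorted returns a list, which gives a predictable order for display.
print(sorted(unique_items))  # ['a', 'b', 'c']
```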

We start by making an empty set:

In [7]: issues = set()

Then, we iterate through the list of complaints, and for each one, we add its issue to the set:

In [8]: for complaint in cfpb.CFPB_16:
   ...:     issues.add(complaint["Issue"])
   ...:

We get a (relatively) small set of issues:

In [9]: issues
Out[9]:
 {'APR or interest rate',
 'Account opening, closing, or management',
 'Adding money',
 'Advertising and marketing',
 'Application, originator, mortgage broker',
 'Billing disputes',
 'Billing statement',
 "Can't contact lender",
 "Can't repay my loan",
 "Can't stop charges to bank account",
 "Charged fees or interest I didn't expect",
 'Closing/Cancelling account',
 'Communication tactics',
 "Cont'd attempts collect debt not owed",
 'Credit card protection / Debt protection',
 'Credit decision / Underwriting',
 'Credit determination',
 'Credit line increase/decrease',
 'Credit monitoring or identity protection',
 "Credit reporting company's investigation",
 'Customer service / Customer relations',
 'Dealing with my lender or servicer',
 'Delinquent account',
 'Deposits and withdrawals',
 'Disclosure verification of debt',
 'Disclosures',
 'False statements or representation',
 'Fees',
 'Forbearance / Workout plans',
 'Fraud or scam',
 'Getting a loan',
 'Identity theft / Fraud / Embezzlement',
 'Improper contact or sharing of info',
 'Improper use of my credit report',
 'Incorrect information on credit report',
 'Late fee',
 'Loan modification,collection,foreclosure',
 'Loan servicing, payments, escrow account',
 'Making/receiving payments, sending money',
 'Managing the loan or lease',
 'Managing, opening, or closing account',
 'Money was not available when promised',
 'Other',
 'Other fee',
 'Other service issues',
 'Other transaction issues',
 'Overdraft, savings or rewards features',
 'Payment to acct not credited',
 'Payoff process',
 'Privacy',
 'Problems caused by my funds being low',
 'Problems when you are unable to pay',
 "Received a loan I didn't apply for",
 'Rewards',
 'Settlement process and costs',
 'Shopping for a loan or lease',
 'Taking out the loan or lease',
 'Taking/threatening an illegal action',
 'Transaction issue',
 'Unable to get credit report/credit score',
 'Unauthorized transactions/trans. issues',
 'Unsolicited issuance of credit card',
 'Using a debit or ATM card'}

Our first few tasks take a list of complaints and compute a simple value.

Task 1: In cfpb.py, write a function

def find_companies(complaints):

that takes a list of complaints and returns a list (or set — see above) of the companies that received at least one complaint.

Remember: we’ve included a variable called CFPB_16 that contains information from 1000 complaints in 2016. You will be using this variable when testing these functions, as shown below.

For example, in the interpreter you can run

In [10]: cfpb.find_companies(cfpb.CFPB_16)

We’re not including the full output because it includes 256 companies. Your function should also return 256 companies! (Hint: you can determine the size of a list or set using the len function.)

Note

You may have noticed that the value of variable CFPB_16 is not written out explicitly in cfpb.py. If you’re curious, this data is stored in the cfpb16_1000.json file and loaded at the top of the cfpb.py file. The file uses a format called JSON that we will describe in more detail later in the quarter; for now, you do not need to concern yourself with that file. Simply use the CFPB_16 variable we provide.

Task 2: In cfpb.py, write a function

def count_complaints_about(complaints, company_name):

that takes a list of complaint dictionaries and the name of a company as a string and returns the number of complaints received for that company.

You can test this function like this:

In [11]: cfpb.count_complaints_about(cfpb.CFPB_16, "Citibank")
Out[11]: 39

Counting with dictionaries

The complaint representation we discussed in the last section is static in the sense that the contents of a complaint dictionary do not change. Dictionaries are also used in more dynamic ways. For example, let’s say we wanted to compute the number of complaints received per company.

Our goal is to compute a dictionary that maps a company name to the number of complaints received about that company. The dictionary would include an entry for every company that received at least one complaint.

We could start by using the result of Task 1 to initialize a dictionary that maps each company name to zero:

by_company = {}
for company in find_companies(complaints):
    by_company[company] = 0

(You can follow along with this example by writing this code in a file, and then copy-pasting it into the interpreter to test it.)

We could then loop over the complaints, extracting the company from the complaint and updating the associated count appropriately.

for complaint in complaints:
    c = complaint["Company"]
    by_company[c] = by_company[c] + 1

This approach requires two passes over the data: one to identify the companies and one to compute the counts. It would be better to do the computation in one pass over the data.

We can use the in operator, which allows us to check whether a given key has a value associated with it in a dictionary, and initialize the value associated with the key, if necessary. Given this operation, we can re-write our counting code as follows:

by_company = {}

for complaint in complaints:
    c = complaint["Company"]
    if c in by_company:
        by_company[c] = by_company[c] + 1
    else:
        by_company[c] = 1

We could simplify this code a bit using not in, as in:

by_company = {}

for complaint in complaints:
    c = complaint["Company"]
    if c not in by_company:
        by_company[c] = 0
    by_company[c] = by_company[c] + 1

To simplify further, the get method for dictionaries allows us to specify a value to use as a default if a key does not appear in a dictionary:

by_company = {}

for complaint in complaints:
    c = complaint["Company"]
    by_company[c] = by_company.get(c, 0) + 1

At this point, we could put this code into a function, add it to the file cfpb.py, and do further testing.
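The behavior of get on its own can be checked with a small sketch. The dictionary toy_counts and the short list of names below are made up for illustration and do not come from the CFPB data.

```python
# Hypothetical starting counts (not taken from the real data).
toy_counts = {"Experian": 3}

present = toy_counts.get("Experian", 0)  # key exists: the stored value
absent = toy_counts.get("Citibank", 0)   # key missing: the default we supplied
no_default = toy_counts.get("Citibank")  # default omitted: None

# get only looks a key up; it never adds the key to the dictionary.
still_missing = "Citibank" not in toy_counts

# The counting idiom from above, applied to a short list of names.
for name in ["Citibank", "Experian", "Citibank"]:
    toy_counts[name] = toy_counts.get(name, 0) + 1

print(present, absent, no_default, still_missing)
print(toy_counts)
```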

Task 3: Write a function:

def count_by_state(complaints):

that takes a list of complaint dictionaries and returns a dictionary that maps a state to the number of complaints reported from that state.

You can test this function like this:

In [12]: cfpb.count_by_state(cfpb.CFPB_16)

We’re not including the full output, but you can spot-check that "CA" has 122 complaints, and that there is an entry in the dictionary for the empty string "" with 8 complaints (presumably from complaints where the state was not specified).

Iterating over dictionaries

We often want to perform an action on every key or key/value pair in a dictionary. We can use three dictionary methods, .keys(), .values(), and .items(), which return “list-like structures” (dictionary view objects, which we will cover in more detail in future lectures) that contain the keys, values, and key/value tuples respectively. For example, if we want to print out the results of the previous function, we can write (again, we can write this in a file and then copy it into the interpreter):

state_cnts = count_by_state(complaints)
for key in state_cnts.keys():
    print(key, state_cnts[key])
print()

Because iterating over the keys of a dictionary is common, Python allows you to omit the .keys():

for key in state_cnts:
    print(key, state_cnts[key])
print()

A dictionary generates its keys in the order in which they were inserted (guaranteed since Python 3.7), which is not necessarily a useful order for display. If we want to print the results in alphabetical order by key, we can call the built-in sorted function:

for key in sorted(state_cnts):
    print(key, state_cnts[key])
print()

Finally, the items() method is useful when, as in our example, you need both the keys and the values:

for key, cnt in sorted(state_cnts.items()):
    print(key, cnt)
print()
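To see what these calls produce, here is a small sketch with a made-up dictionary. The counts for "CA" and "" match the spot-check values mentioned earlier; the count for "VA" is invented for illustration.

```python
# Hypothetical state counts; only "CA" and "" use values quoted in this tutorial.
toy_counts = {"VA": 5, "CA": 122, "": 8}

# sorted on a dictionary returns a new list of its keys in sorted order.
keys = sorted(toy_counts)
print(keys)    # ['', 'CA', 'VA']

# items() yields (key, value) tuples; tuples sort by their first element first.
pairs = sorted(toy_counts.items())
print(pairs)   # [('', 8), ('CA', 122), ('VA', 5)]

# values() is handy when the keys do not matter, for example to total the counts.
total = sum(toy_counts.values())
print(total)   # 135
```

Notice that the empty string sorts before every other key.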

Warning: you should never remove mappings (that is, key/value pairs) from a dictionary as you iterate over that dictionary! It will not behave the way that you expect: Python will typically raise a RuntimeError ("dictionary changed size during iteration").
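The sketch below illustrates the problem and one common way around it, which is to loop over a copy of the keys made with list. The dictionary toy_counts is made up for illustration.

```python
# Hypothetical counts; the "VA" value is invented.
toy_counts = {"CA": 122, "": 8, "VA": 5}

# Removing a key while looping over the dictionary itself fails:
# Python notices the size change on the next step of the loop.
try:
    for state in toy_counts:
        if state == "":
            del toy_counts[state]
    raised = False
except RuntimeError:
    raised = True
print(raised)   # True

# One safe alternative: loop over a list copy of the keys instead.
toy_counts = {"CA": 122, "": 8, "VA": 5}
for state in list(toy_counts):
    if state == "":
        del toy_counts[state]
print(toy_counts)   # {'CA': 122, 'VA': 5}
```

Another option is to build a new dictionary that contains only the mappings you want to keep.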

Task 4: Write a function:

def state_with_most_complaints(cnt_by_state):

that takes the output of your function from Task 3 and determines the state with the most complaints. You can break ties arbitrarily.

You can test this as follows:

In [13]: by_state = cfpb.count_by_state(cfpb.CFPB_16)

In [14]: cfpb.state_with_most_complaints(by_state)
Out[14]: 'CA'

Nested dictionaries

Dictionaries can be nested, that is, the value associated with a key can itself be a dictionary. For example, we might have a dictionary that maps each company name to another dictionary that maps a state to the number of complaints about that company in that state. Here’s an abridged version of this dictionary computed using the 2016 complaint data:

by_company_by_state = {
     'Bank of Hawaii': {'HI': 1},
     'DriveTime': {'FL': 2},
     'MB Financial, INC': {'IL': 1},
     'Specialized Loan Servicing LLC': {'AZ': 1, 'CA': 1},
     'The Money Source Inc': {'NJ': 1},
     'Transworld Systems Inc.': {'IL': 1, 'PA': 3, 'TN': 1, 'TX': 2, 'VA': 1}
    }

The expression by_company_by_state['The Money Source Inc']['NJ'] would yield 1, the number of complaints made from New Jersey about this company.
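As a sketch of how to read such a structure, the snippet below uses three of the entries shown above. The names totals and hi_count, and the chained get lookup, are our own illustration and not part of cfpb.py.

```python
# Three entries copied from the abridged dictionary above.
by_company_by_state = {
    "Specialized Loan Servicing LLC": {"AZ": 1, "CA": 1},
    "The Money Source Inc": {"NJ": 1},
    "Transworld Systems Inc.": {"IL": 1, "PA": 3, "TN": 1, "TX": 2, "VA": 1},
}

# Two indexing steps: the outer key selects a company, the inner key a state.
nj_count = by_company_by_state["The Money Source Inc"]["NJ"]

# Iterating over items() gives each company with its inner dictionary.
totals = {}
for company, by_state in by_company_by_state.items():
    totals[company] = sum(by_state.values())

# Chained get calls avoid a KeyError when either key might be missing.
hi_count = by_company_by_state.get("The Money Source Inc", {}).get("HI", 0)

print(nj_count, hi_count)
print(totals)
```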

Task 5: Write a function:

def count_by_company_by_state(complaints):

that takes a list of complaints and computes the by_company_by_state dictionary described above.

You can test this function like this:

In [15]: cfpb.count_by_company_by_state(cfpb.CFPB_16)

This will produce a lot of output, but you can spot-check your output using the values in the by_company_by_state dictionary shown above.

Dictionaries can also map keys to lists or even to lists of dictionaries.

For more practice with nested dictionaries, complete the extended activity.

Extended activity: Nested dictionaries and composition

For the extended activity, we will keep working with the CFPB complaints data.

Task 6: Write a function:

def complaints_by_company(complaints):

that takes a list of complaint dictionaries and returns a dictionary that maps the name of a company to a list of the complaint dictionaries that concern that company.

You can test this function like this:

In [16]: cfpb.complaints_by_company(cfpb.CFPB_16)

You can spot-check the resulting dictionary by checking the following:

  • Does the dictionary have 256 entries?

  • Does 'The Money Source Inc' contain a list with a single complaint?

  • Does 'Specialized Loan Servicing LLC' contain a list with two complaints, one in AZ and another in CA?

Task 7: Write a function:

def count_by_company_by_state_2(complaints):

that has the exact same behavior as the count_by_company_by_state function from Task 5; that is, it takes a list of complaints and computes the by_company_by_state dictionary. But this time, instead of computing the by_company_by_state dictionary directly, the function should compute by_company_by_state by composing the complaints_by_company function from Task 6 with the count_by_state function from Task 3 (you will also use a for loop).

You can test count_by_company_by_state_2 the same way you tested count_by_company_by_state. You can also test that these two functions give the same output as each other. (Hint: you can use the == operator to check if two dictionaries are equal.)
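The sketch below shows how == behaves on small, made-up dictionaries: two dictionaries are equal when they have the same keys mapped to equal values, regardless of the order in which the keys were inserted, and the comparison extends to nested dictionaries.

```python
# Same mappings, inserted in a different order.
a = {"CA": 1, "AZ": 1}
b = {"AZ": 1, "CA": 1}
same_flat = (a == b)

# The comparison also looks inside nested dictionaries.
nested_a = {"Specialized Loan Servicing LLC": {"AZ": 1, "CA": 1}}
nested_b = {"Specialized Loan Servicing LLC": {"CA": 1, "AZ": 1}}
same_nested = (nested_a == nested_b)

# A single differing value makes the dictionaries unequal.
c = {"CA": 2, "AZ": 1}
different = (a != c)

print(same_flat, same_nested, different)   # True True True
```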

Task 8: With your team, compare the implementations of count_by_company_by_state and count_by_company_by_state_2, and discuss the advantages and disadvantages of each. Some aspects to consider:

  1. Which is easier to read and understand?

  2. Which uses less code?

  3. Which is easier to debug?

  4. Which is faster? As a proxy for speed, you can ask: Which requires fewer passes through the data?

Which implementation would you choose?

When finished

Once you finish working on the tutorial, you should add, commit, and push the files in the tt3 directory. No, we won’t be looking at them or grading them, but this ensures you can access those files later on if you start working on a different computer, and also allows us to look at them if you do have any specific questions about your solutions.