Lab #3: Dictionaries

Introduction

The objective of this lab is to give you practice using dictionaries, a useful data type built into Python. A dictionary (dict for short) is a generalization of arrays/lists that associates keys with values. In computer science, this data type is also referred to as an associative array or a map.

By the end of the lab, you should be able to:

  • Perform basic operations on dictionaries

  • Apply dictionaries in several usage scenarios

Getting started

Open up a terminal and navigate (cd) to your cmsc12100-aut-20-username directory, where username is your CNetID. Run git pull upstream master to collect the lab materials and git pull to sync with your personal repository.

Once you have collected the lab materials, navigate to the lab3 directory and fire up ipython3.

You will do your work in this lab in the file named lab3/cfpb.py. Make sure you import this module and activate autoreload:

In [1]: %load_ext autoreload

In [2]: %autoreload 2

In [3]: import cfpb

Data

We will be using data from the Consumer Financial Protection Bureau’s Consumer Complaint Database. Each complaint has information such as:

  • the company,

  • the date the complaint was received,

  • a unique ID,

  • the issue,

  • the product,

  • the consumer’s complaint narrative,

  • the company’s public response,

  • the consumer’s home state, and

  • the consumer’s ZIP code.

We have included code in the file cfpb.py to define a variable CFPB_16. This variable contains information on 1000 complaints that the Consumer Financial Protection Bureau (CFPB) received in 2016.

Dictionaries as a simple data representation

Dictionaries provide a mechanism for mapping keys (often, but not always, strings) to values. They are often used to represent multi-part data, like the CFPB complaint data discussed above.

We could store this information in a list:

complaint_as_list = [
    'Wells Fargo & Company',
    '07/29/2013',
    '468882',
    'Managing the loan or lease',
    'Consumer Loan',
    '',
    'Closed with explanation',
    'VA',
    '24540',
    ...]

But then we would have to keep track of the fact that the name of the company is at index 0 (complaint_as_list[0]), the date received is at index 1 (complaint_as_list[1]), and so on. Dictionaries allow us to use more meaningful values to access the different parts of a complaint. In particular, we can use strings as keys. Here, for example, is the same complaint represented using a dictionary:

complaint_as_dict = {
    'Company': 'Wells Fargo & Company',
    'Company public response': '',
    'Company response to consumer': 'Closed with explanation',
    'Complaint ID': '468882',
    'Consumer complaint narrative': '',
    'Consumer consent provided?': 'N/A',
    'Consumer disputed?': 'No',
    'Date received': '07/29/2013',
    'Date sent to company': '07/30/2013',
    'Issue': 'Managing the loan or lease',
    'Product': 'Consumer Loan',
    'State': 'VA',
    'Sub-issue': '',
    'Sub-product': 'Vehicle loan',
    'Submitted via': 'Phone',
    'Tags': '',
    'Timely response?': 'Yes',
    'Wells Fargo & Company': 2,
    'ZIP code': '24540'}

Given such a dictionary, we can extract the name of the company using the string "Company" as the index or key: complaint_as_dict["Company"]. We can extract the home state of the complainant using the expression complaint_as_dict["State"].

Notice that although the types of the keys are all the same (strings), the types of values associated with the keys are different. This arrangement is common, but not required.
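
The basic operations used throughout this lab can be tried on a small dictionary. The sketch below uses a hypothetical, abridged complaint (the variable name complaint and its three fields are chosen only for illustration) to show lookup, membership testing, adding a key, and what happens when a key is missing:

```python
# A hypothetical, abridged complaint used only for illustration.
complaint = {"Company": "Wells Fargo & Company",
             "State": "VA",
             "ZIP code": "24540"}

# Look up a value by its key.
company = complaint["Company"]

# Check whether a key is present before using it.
has_issue = "Issue" in complaint

# Add a new key/value pair (assigning to an existing key would overwrite it).
complaint["Issue"] = "Managing the loan or lease"

# Looking up a missing key with square brackets raises a KeyError.
try:
    complaint["Product"]
    missing_key_raised = False
except KeyError:
    missing_key_raised = True

print(company, has_issue, len(complaint), missing_key_raised)
```

The KeyError case is worth remembering: square-bracket lookup assumes the key is present, so code that cannot be sure should test with in first.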

We will start with a few tasks that take a list of complaints that use the representation as shown above and compute a simple value.

Task 1

Write a function:

def count_complaints_about(complaints, company_name):

that takes a list of complaint dictionaries and the name of a company as a string and returns the number of complaints received for that company.

Remember: we’ve included a variable called CFPB_16 that contains information from 1000 complaints in 2016. You will be using this variable when testing these functions, as shown below.

For example:

In [1]: cfpb.count_complaints_about(cfpb.CFPB_16, "Citibank")
Out[1]: 39

If you’re curious, this data is stored in the cfpb16_1000.json file and loaded at the top of the cfpb.py file. The file uses a format called JSON that we will describe in more detail later in the quarter; for now, you do not need to concern yourself with that file. Simply use the CFPB_16 variable we provide.

Task 2

Write a function:

def find_companies(complaints):

that takes a list of complaints and returns a list (or set) of the companies that received at least one complaint.

Note

Python has a built-in set data structure that will be useful for this task. The expression set() creates an empty set. The add method can be used to add an element to the set. For example, executing this code:

s = set()
s.add("a")
s.add("b")
s.add("a")
s.add("c")
print(s)

yields the set:

{'a', 'b', 'c'}

You can also pass a list to the set constructor set(["a", "b", "a", "c"]) and it will construct a set from the elements of the list. Note that sets do not preserve order, so {'a', 'b', 'c'} and {'b', 'c', 'a'} are both possible results from evaluating set(["a", "b", "a", "c"]).
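
As a small sketch of these properties, the code below builds a set from the same list, counts its elements, and checks membership. The variable names are chosen only for illustration. Because sets are unordered, the example uses the built-in sorted function whenever a predictable order matters:

```python
letters = ["a", "b", "a", "c"]

# Building a set from a list drops the duplicate "a".
unique = set(letters)
num_unique = len(unique)

# Membership tests work the same way as for lists.
has_b = "b" in unique
has_z = "z" in unique

# Two sets are equal if they hold the same elements, in any order.
same = unique == {"b", "c", "a"}

# sorted() returns a list with the elements in a predictable order.
ordered = sorted(unique)

print(num_unique, has_b, has_z, same, ordered)
```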

You can test this function like this:

cfpb.find_companies(cfpb.CFPB_16)

We’re not including the full output because it includes 256 companies. Your function should also return 256 companies!

Counting with dictionaries

The complaint representation we discussed in the last section is static in the sense that the contents of a complaint dictionary do not change. Dictionaries are also used in more dynamic ways. For example, let’s say we wanted to compute the number of complaints received per company.

Our goal is to compute a dictionary that maps a company name to the number of complaints received about that company. The dictionary would include an entry for every company that received at least one complaint.

We could start by using the result of Task 2 to initialize a dictionary that maps each company name to zero.

by_company = {}
for company in find_companies(complaints):
    by_company[company] = 0

And then loop over the complaints, extracting the company from the complaint and updating the associated count appropriately.

for complaint in complaints:
    c = complaint["Company"]
    by_company[c] = by_company[c] + 1

This approach requires two passes over the data: one to identify the companies and one to compute the counts. It would be better to do the computation in one pass over the data.

We can use the in operator to check whether a given key already has a value associated with it in a dictionary and, if it does not, initialize that value. Using this operator, we can rewrite our counting code as follows:

by_company = {}

for complaint in complaints:
    c = complaint["Company"]
    if c in by_company:
        by_company[c] = by_company[c] + 1
    else:
        by_company[c] = 1

We could simplify this code a bit by using not in:

by_company = {}

for complaint in complaints:
    c = complaint["Company"]
    if c not in by_company:
        by_company[c] = 0
    by_company[c] = by_company[c] + 1

Finally, the get method for dictionaries allows us to specify a value to use as a default if a key does not appear in a dictionary:

by_company = {}

for complaint in complaints:
    c = complaint["Company"]
    by_company[c] = by_company.get(c, 0) + 1
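
To see that these approaches agree, here is a minimal sketch that applies two of them to a toy problem, counting the letters in a word. The word and the variable names are made up for illustration and are not part of the lab data:

```python
word = "mississippi"

# Approach 1: test for the key with "not in" before updating.
counts_in = {}
for letter in word:
    if letter not in counts_in:
        counts_in[letter] = 0
    counts_in[letter] = counts_in[letter] + 1

# Approach 2: let get() supply 0 when the key is not present yet.
counts_get = {}
for letter in word:
    counts_get[letter] = counts_get.get(letter, 0) + 1

# get() only reads the dictionary; asking about "z" does not add it.
z_count = counts_get.get("z", 0)

print(counts_in == counts_get)
print(z_count, "z" in counts_get)
```

Note that get never modifies the dictionary on its own. The new entry appears only because the result is assigned back with by_company[c] = ....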

Task 3

Write a function:

def count_by_state(complaints):

that takes a list of complaint dictionaries and returns a dictionary that maps a state to the number of complaints reported from that state.

You can test this function like this:

cfpb.count_by_state(cfpb.CFPB_16)

We’re not including the full output, but you can spot-check that "CA" has 122 complaints, and that there is an entry in the dictionary for the empty string "" with 8 complaints (presumably from complaints where the state was not specified).

Iterating over dictionaries

We often want to perform an action on every key or key/value pair in a dictionary. Three dictionary methods, .keys(), .values(), and .items(), return “list-like structures” (dictionary view objects, which we will cover in more detail in future lectures) that contain the keys, values, and key/value tuples, respectively. For example, if we want to print out the results of the previous function, we can write:

state_cnts = count_by_state(complaints)
for key in state_cnts.keys():
    print(key, state_cnts[key])
print()

Iterating over the keys of a dictionary is sufficiently common that you can omit the .keys():

for key in state_cnts:
    print(key, state_cnts[key])
print()

Since Python 3.7, iterating over a dictionary yields the keys in the order in which they were inserted, which is not necessarily alphabetical. If we want to print the results sorted by key, we need to add a call to the built-in sorted function:

for key in sorted(state_cnts):
    print(key, state_cnts[key])
print()

Finally, the items() method is useful when, as in our example, you need both the keys and the values:

for key, cnt in sorted(state_cnts.items()):
    print(key, cnt)
print()
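
The values() method has not appeared in the examples so far. As a brief sketch, using made-up counts rather than the lab data, it is handy when only the values matter, for example to compute a total:

```python
# Made-up counts, used only for illustration.
state_cnts = {"CA": 3, "IL": 2, "VA": 1}

# values() yields only the counts, which is enough to compute a total.
total = 0
for cnt in state_cnts.values():
    total = total + cnt

# The built-in sum function accepts the same view directly.
same_total = sum(state_cnts.values())

print(total, same_total)
```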

Warning: you should never remove mappings (that is, key/value pairs) from a dictionary as you iterate over that dictionary!
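
The sketch below illustrates why, using made-up counts. In CPython, deleting a key inside a loop over the same dictionary typically raises a RuntimeError on the next step of the loop. One common workaround, shown here as an option rather than the only approach, is to loop over a copy of the keys made with list():

```python
# Made-up counts; the empty string stands for an unspecified state.
state_cnts = {"CA": 3, "": 2, "VA": 1}

# Unsafe: the dictionary changes size while we are iterating over it.
try:
    for state in state_cnts:
        if state == "":
            del state_cnts[state]
    failed = False
except RuntimeError:
    failed = True

# Safe: list() copies the keys first, so the loop does not depend on
# the dictionary staying the same size.
state_cnts = {"CA": 3, "": 2, "VA": 1}
for state in list(state_cnts):
    if state == "":
        del state_cnts[state]

print(failed, state_cnts)
```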

Task 4

Write a function:

def state_with_most_complaints(cnt_by_state):

that takes the output of your function from Task 3 and determines the state with the most complaints. You can break ties arbitrarily.

You can test this as follows:

In [8]: by_state = cfpb.count_by_state(cfpb.CFPB_16)

In [9]: cfpb.state_with_most_complaints(by_state)
Out[9]: 'CA'

Nested dictionaries

Dictionaries can be nested, that is, the value associated with a key can itself be a dictionary. For example, we might have a dictionary that maps each company name to another dictionary that maps a state to the number of complaints about that company in that state. Here’s an abridged version of this dictionary computed using the 2016 complaint data:

by_company_by_state = {
    'Bank of Hawaii': {'HI': 1},
    'DriveTime': {'FL': 2},
    'MB Financial, INC': {'IL': 1},
    'Specialized Loan Servicing LLC': {'AZ': 1, 'CA': 1},
    'The Money Source Inc': {'NJ': 1},
    'Transworld Systems Inc.': {'IL': 1, 'PA': 3, 'TN': 1, 'TX': 2, 'VA': 1}
}

The expression by_company_by_state['The Money Source Inc']['NJ'] would yield 1, the number of complaints made from New Jersey about this company.
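
Reading from a nested dictionary takes one lookup per level. The following sketch uses two entries copied from the abridged dictionary above and shows a two-step lookup, a lookup with a default for a state that has no entry, and a nested loop over every company and state. It covers only reading the structure and makes no claim about how it should be built:

```python
# Two entries copied from the abridged dictionary above.
by_company_by_state = {
    'Specialized Loan Servicing LLC': {'AZ': 1, 'CA': 1},
    'Transworld Systems Inc.': {'IL': 1, 'PA': 3, 'TN': 1, 'TX': 2, 'VA': 1},
}

# Two lookups: first by company, then by state.
pa_count = by_company_by_state['Transworld Systems Inc.']['PA']

# The inner value is an ordinary dictionary, so the usual tools apply.
inner = by_company_by_state['Specialized Loan Servicing LLC']
num_states = len(inner)
ny_count = inner.get('NY', 0)   # no entry for NY, so the default is used

# A nested loop visits every (company, state, count) combination.
for company, by_state in sorted(by_company_by_state.items()):
    for state, cnt in sorted(by_state.items()):
        print(company, state, cnt)
```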

Task 5

Write a function:

def count_by_company_by_state(complaints):

that takes a list of complaints and computes the by_company_by_state dictionary described above.

You can test this function like this:

cfpb.count_by_company_by_state(cfpb.CFPB_16)

This will produce a lot of output, but you can spot-check your output using the values in the by_company_by_state dictionary shown above.

Dictionaries can also map keys to lists or even to lists of dictionaries.
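
As a toy illustration of a dictionary whose values are lists, the sketch below groups a few made-up words by their length. The words and variable names are assumptions for the example only. It also shows that keys do not have to be strings:

```python
# Made-up words, used only for illustration.
words = ["loan", "bank", "debt", "card", "lease"]

by_length = {}
for w in words:
    n = len(w)
    # Start an empty list the first time a length is seen.
    if n not in by_length:
        by_length[n] = []
    # Append to the list stored under this key.
    by_length[n].append(w)

print(by_length)
```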

Task 6

Write a function:

def complaints_by_company(complaints):

that takes a list of complaint dictionaries and returns a dictionary that maps the name of a company to a list of the complaint dictionaries that concern that company.

You can test this function like this:

cfpb.complaints_by_company(cfpb.CFPB_16)

You can spot-check the resulting dictionary by checking the following:

  • Does the dictionary have 256 entries?

  • Does 'The Money Source Inc' contain a list with a single complaint?

  • Does 'Specialized Loan Servicing LLC' contain a list with two complaints, one in AZ and another in CA?

When finished

When you are finished with the lab, please check in your work (assuming you are inside the lab directory):

git add cfpb.py
git commit -m "Finished with lab3"
git push

No, we’re not grading your work. We just want to make sure your repository is in a clean state and that your work is saved to your repository (and to our Git server).