Computer Science with Applications 3

PA1: Analyzing the White House visitor logs using MapReduce

Due: Monday, April 26 at 4pm.

You must work alone on this assignment.

You will be working with visitor logs from the Obama White House. The attributes in this dataset are described here. Also, you can see a spreadsheet of the data here. Note there are datasets from 2009/2010, 2011, 2012, 2013, 2014, 2015, and 2016. We'll work with the data from 2009/2010.

If you did not set up your repository as part of Lab 1, then please start by doing so.

Your task is to write MapReduce code to generate the following information:

  1. A list of the guests who visited at least ten times (task1.py).
  2. A list of the ten most frequently-visited staff members (task2.py).
  3. A list of the guests who visited at least once in both 2009 and 2010 (task3.py).
  4. A list of the people who were both guests and staff (in other words, someone who both visited and was visited) in 2009 and/or 2010. A person counts as long as they visited at any point across the two years and were visited at any point across the two years, whether or not these two events occurred in the same year (task4.py).

The terms "guest" and "staff" simply refer to the roles: someone who is a guest is visiting the White House, while a staff member is someone who works there and is being visited. Individuals such as the First Lady, whom you may not consider technically to be "staff," still qualify in this simplistic definition. In other words, these terms are being used to provide intuition, not to create a situation where you are having to make difficult determinations about who constitutes "staff members."

In this writeup, we use the terms "guest" and "staff," but you will also encounter the terms "visitor" and "visitee," which are equivalent, respectively.

For full credit, you must write a combiner for every task where a meaningful combiner is possible.
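As a reminder of what makes a combiner "meaningful," here is a minimal sketch in plain Python (deliberately not using mrjob, and not one of the four tasks). It counts words rather than visitors. The functions mapper, combiner, and reducer mirror the generator shape of the corresponding MRJob methods; the helpers group and run are hypothetical stand-ins for the shuffle step and the job runner. The point it illustrates is that the final answer must be the same whether or not the combiner runs.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every word on one input line."""
    for word in line.split():
        yield word, 1


def combiner(key, counts):
    """Pre-sum the counts produced by a single mapper task."""
    yield key, sum(counts)


def reducer(key, counts):
    """Sum all counts for a key across every mapper task."""
    yield key, sum(counts)


def group(pairs):
    """Group (key, value) pairs by key, as the shuffle step would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run(chunks, use_combiner):
    """Simulate a job; each chunk of lines plays the role of one mapper task."""
    shuffled = []
    for chunk in chunks:
        pairs = [kv for line in chunk for kv in mapper(line)]
        if use_combiner:
            pairs = [kv
                     for key, values in group(pairs).items()
                     for kv in combiner(key, values)]
        shuffled.extend(pairs)
    return dict(kv
                for key, values in group(shuffled).items()
                for kv in reducer(key, values))


chunks = [["a b a", "b c"], ["a c c"]]
with_combiner = run(chunks, use_combiner=True)
without_combiner = run(chunks, use_combiner=False)
```

Summation works here because adding partial sums gives the same total as adding the individual values. An operation without that property (for example, one that needs to see every value for a key at once) may not admit a meaningful combiner.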

The 2009/2010 dataset is small enough that you will be able to write Python code to check your answers. When you can validate the results of a MapReduce calculation using a separate, non-MapReduce approach, this gives you a high degree of confidence that your MapReduce implementation is correct. Of course, it is not particularly compelling to use MapReduce for data sets that do not need it, other than for instructional purposes. But, even for large data sets, performing this kind of double-check on a manageably small subset of the data is wise.
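One possible shape for such a check is sketched below, using only the standard library. The column names (guest_last, guest_first, and so on) and the three sample rows are made up for illustration; substitute the attribute names from the dataset description. The helper count_by is likewise hypothetical.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for the real CSV file; the header names
# here are assumptions, not the dataset's actual attribute names.
SAMPLE = """guest_last,guest_first,staff_last,staff_first
Doe,Jane,Smith,Pat
Doe,Jane,Lee,Sam
Roe,Rick,Smith,Pat
"""


def count_by(rows, columns):
    """Count how many rows share each combination of the given columns."""
    counts = Counter()
    for row in rows:
        key = tuple(row[c] for c in columns)
        counts[key] += 1
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
guest_counts = count_by(rows, ["guest_last", "guest_first"])
staff_counts = count_by(rows, ["staff_last", "staff_first"])
```

To check against the real data, you could replace io.StringIO(SAMPLE) with an open file handle on data.csv and compare the resulting counts with the output of the corresponding MapReduce job.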

Write your MapReduce versions in Python using mrjob. You can do your development on vDesk (NoMachine) or in Visual Studio Code, ssh'd into a departmental machine.

Install mrjob with the following:

pip3 install --user --upgrade --ignore-installed mrjob

Unfortunately, due to a regression (a feature that was removed or a bug that was introduced) in recent versions of mrjob, we need to fix a problem with it ourselves. On a departmental machine, type:

code ~/.local/lib/python3.8/site-packages/mrjob/sim.py
If this file does not exist, run
cd
find . -name sim.py
and pass the path that find prints to the code command instead.

Go to line 413, which should be the return statement for the function _num_reducers. If it is at a slightly different line number, look for this return statement. Then, delete the expression being returned and hard-code in the return value 1/2. In other words, the function should simply say return 1/2. Save your changes and quit.

To run a particular task, if the downloaded White House data is in data.csv:

python3 taskx.py data.csv
where x is the task number (1 through 4). You will receive a warning message about "No configs", which you can ignore.

Place your files in a directory called pa1 in your git repository and perform the git push and chisubmit steps you did in 122 (the assignment name in chisubmit is pa1). You should place your code in four python files with names task1.py, task2.py, etc. as above.

Please take care not to commit the CSV file with the data in it, as this is a large file that, across all the students in our class, takes up a substantial amount of space.

Note: you must use chisubmit to register for and submit this assignment, as in prior courses in the sequence. If you are not familiar with this tool or have any questions on how to use it, please ask.