Computer Science with Applications 3

PA1: Analyzing the White House visitor logs using MapReduce

Due: Monday, April 25 at 5pm. No late chips.

You must work alone on this assignment.

You will be working with visitor logs from the White House. The attributes in this dataset are described here. Also, you can see a spreadsheet of the data here. Note there are datasets from 2009/2010, 2011, 2012, 2013, 2014, 2015, and 2016. We'll work with the data from 2009/2010.

Your task is to write MapReduce code to generate the following information:

A list of the people who visited at least ten times.
A list of the ten most frequently-visited people. The dataset has some obvious anomalies that you will want to address when you answer this question.
A list of the people who visited at least once in both 2009 and 2010.
A list of the people who were both visitors and visitees (in other words, someone who is visited) in 2009 and 2010.

For full credit, you must write combiners for each of these tasks whenever it is possible to have a meaningful combiner.
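To see why a combiner is worth writing, here is a toy, pure-Python simulation of the map, combine, and reduce phases for a simple counting job. It is only a sketch and does not use mrjob: the function names, the run_job helper, and the sample names are all made up for illustration. The point is that the combiner pre-sums counts on each mapper's output, so fewer pairs are shuffled to the reducer while the final totals stay the same.

```python
from collections import defaultdict


def mapper(record):
    """Emit (key, 1) for one input record; here the record is already a name."""
    yield record, 1


def combiner(key, counts):
    """Pre-sum the counts produced by a single mapper task."""
    yield key, sum(counts)


def reducer(key, counts):
    """Sum the partial counts that arrive from every mapper task."""
    yield key, sum(counts)


def group_by_key(pairs):
    """Collect values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_job(splits, use_combiner=True):
    """Run the toy pipeline over several input splits, one per mapper task.

    Returns the final totals and the number of pairs sent to the reducers.
    """
    shuffled = []
    for split in splits:
        pairs = [pair for record in split for pair in mapper(record)]
        if use_combiner:
            pairs = [out
                     for key, counts in group_by_key(pairs).items()
                     for out in combiner(key, counts)]
        shuffled.extend(pairs)
    totals = {}
    for key, counts in group_by_key(shuffled).items():
        for out_key, total in reducer(key, counts):
            totals[out_key] = total
    return totals, len(shuffled)


# Hypothetical input: two splits, each handled by its own mapper task.
splits = [["ann", "bob", "ann"], ["ann", "bob"]]
with_combiner, shuffled_with = run_job(splits, use_combiner=True)
without_combiner, shuffled_without = run_job(splits, use_combiner=False)
print(with_combiner == without_combiner, shuffled_with, shuffled_without)
# prints: True 4 5
```

A combiner is only meaningful when the reducer's operation can be applied to partial results without changing the answer, as with sums and counts. The framework may run a combiner zero or more times, so your reducer must produce the right answer either way.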

The 2009/2010 dataset is small enough that you will be able to write Python code to check your answers. Validating the results of a MapReduce calculation with a separate, non-MapReduce approach gives you a high degree of confidence that your MapReduce implementation is correct. Of course, it is not particularly compelling to use MapReduce for data sets that do not need it, other than for instructional purposes. But even for large data sets, performing this kind of double-check on a manageably small subset of the data is wise.
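As an illustration of such a double-check, the sketch below counts visits per person with the standard library only and returns everyone at or above a threshold. The column names NAMELAST and NAMEFIRST, the upper-casing of names, and the tiny in-memory sample are assumptions made for the example; check them against the actual header of the data file and against whatever normalization your MapReduce job applies.

```python
import csv
import io
from collections import Counter


def visitors_at_least(csv_file, threshold=10):
    """Return the set of (last, first) names that appear at least `threshold` times."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # NAMELAST / NAMEFIRST are assumed column names; adjust to the real header.
        name = (row["NAMELAST"].strip().upper(), row["NAMEFIRST"].strip().upper())
        counts[name] += 1
    return {name for name, n in counts.items() if n >= threshold}


# Hypothetical three-row sample standing in for the real CSV file.
sample = io.StringIO(
    "NAMELAST,NAMEFIRST\n"
    "Doe,Jane\n"
    "doe,jane\n"
    "Roe,Richard\n"
)
print(visitors_at_least(sample, threshold=2))
# prints: {('DOE', 'JANE')}
```

To use it for checking, open the real file with open(path, newline="") and compare the resulting set with the names your MapReduce job emits; any name that appears in only one of the two sets points to a discrepancy worth investigating.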

Write your MapReduce versions in Python using mrjob. You can do your development on your VirtualBox VM from CS 122. Once you have everything working, you can also try running your jobs on Amazon Elastic MapReduce. See below.

To install mrjob on your VirtualBox VM:

sudo pip2 install mrjob
sudo pip3 install mrjob

If you'd like to try running one or more of your jobs on Amazon Elastic MapReduce, you'll need to wait until our lab on 4/21, which covers setup. Then, you'll be able to just add -r emr to your python command:

python script.py -r emr data.csv

Note that -r emr has to come after the script name and before the data file name, and that the command needs to be run on a machine that has the .mrjob.conf file set up. Also, expect jobs on EMR to take longer because of setup time; our calculations are still not large enough for the benefit of running on EMR to outweigh that overhead. (As always, be sure to confirm that your EMR resources have terminated.)

There will be a post on Piazza with information about setting up git and chisubmit to submit this assignment. Place your files in a directory called pa1 in your git repository and perform the git push and chisubmit steps you did in 122. You should place your code in four Python files, one per task, with names task1.py, task2.py, etc.