Algorithms

In this lab we will play around with a few algorithms. The algorithms themselves are not explained in detail in the lab write-up. If you are unsure of how they work, please ask your lab instructor to explain them or refer to the reference literature below. Understanding the algorithms is central to this lab and will help you come up with a good project. Here is the outline of the lab:

  • Top-k with a max-heap
  • k-means
  • k-nearest-neighbor

Resources

Note that Top-k is not a standard name, so if you want to find more information about it, you may need to search using various descriptive phrases, such as "finding the k most frequent elements".

Examples

First, make sure that you have the uchicago-cs/cmsc12300 GitHub repository. You should already be familiar with how to retrieve this from GitHub.

In cmsc12300/examples, the professor has put implementations of various data analysis algorithms, written in pure Python. You should start by reading the examples/data_analysis/README file, which can also be viewed directly on the uchicago-cs/cmsc12300 repository page. Please make sure to follow the instructions on how to add the src directory to your PYTHONPATH. If you want to avoid doing this every time, you can run something like this from the data_analysis folder:

$ echo export PYTHONPATH=`pwd`/src:\$PYTHONPATH >> ~/.bashrc

Now refresh your shell configuration by running source ~/.bashrc, then go to the bin folder and make sure you can execute the files, for instance by trying:

$ ./topk_heap.py -h

Run the commands with -h to understand how to use them, since it won't be explained as part of the lab.

Top-k

The Top-k problem is to find the k most frequent entries in a corpus. We can use a hash map to keep counts of the data, and then use a heap to keep track of the k largest counts while scanning linearly through the list of counts. The sample implementation operates on integers: topk_heap.py takes a data file with one integer per line. We will need some data to test this on and will generate some below; first, though, here is a rough sketch of the counting-plus-heap idea.
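
This is only an illustrative sketch, not the repository's topk_heap.py (which reads its input from a file and has a command-line interface); the function name top_k is made up here.

import heapq
from collections import Counter

def top_k(numbers, k):
    counts = Counter(numbers)            # hash map: value -> count
    heap = []                            # min-heap of the k largest counts seen so far
    for value, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, value))
        elif count > heap[0][0]:         # beats the smallest of the current top k
            heapq.heapreplace(heap, (count, value))
    return sorted(heap, reverse=True)    # (count, value) pairs, most frequent first

Because the heap never holds more than k entries, each update costs \(\mathcal{O}(\log k)\), which is where the complexity difference discussed in the exercises below comes from.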

Let us say we have a corpus of only the numbers \(\{0, ..., n-1\}\) . Word counts can be seen as drawn from a multinomial distribution. If we look at each sample (word occurrence) individually, we can consider them as drawn from a categorical distribution of n categories:

\begin{equation*} X_1, ..., X_N \;\;\; \text{i.i.d.}, \;\;\; X_i \sim \text{Cat}(n, \{p_0, ..., p_{n-1}\}) \end{equation*}

This just means that there is \(p_i\) probability to generate the number \(i\) . Naturally, we must require that \(\sum_{i=0}^{n-1} p_i = 1\) .

Exercises

In the following exercises, many of the details are up to you, such as using standard input/output or file handlers.

  • Write a program (in Python, for instance) that generates the samples \(X_1, ..., X_N\) and outputs them one number per line. We will use:

    \begin{equation*} p_i = \left\{ \begin{array}{l l} \frac{1}{3} & \quad \text{if $i = 0$ or $i = 1$} \\ \frac{1}{3(n-2)} & \quad \text{otherwise} \end{array} \right. \end{equation*}

    where n is set to 10,000,000 (the number of categories) and N is set to 1,000,000 (the number of samples). This means there is a 1/3 probability of drawing a 0, a 1/3 probability of drawing a 1, and the remaining numbers are drawn uniformly from whatever probability is left. You should be able to implement this using only numpy.random.randint, in a very simple and straightforward way; a sketch of one possible approach appears after this list. Do not worry about making the sampling function general, since generality at this dimensionality can cost you a lot.

  • Write another program that reads this file and naively finds the k most frequent numbers. By naively, I mean keeping a count of each entry and then sorting the entries by count (also sketched after this list). Set k to 2, but keep the algorithm general.

    Run your program prefixed with time to check performance. This is not ideal, since it will include Python's startup time, so if you want to do this properly, you are welcome to do it in IPython, prefixing the appropriate command with the %timeit magic.

  • Now, try running topk_heap.py on the same file and see the performance difference. The improvement might seem modest, but that is just because I didn't want to set the category size too big. Also, the running time includes both loading the Python libraries and loading the data itself, so we can't really see how many times faster topk_heap.py is.

  • Theoretically, we should go from \(\mathcal{O}(n \log n)\) to \(\mathcal{O}(n \log k)\), which can make a significant difference. Try to work out why this is the case, and make sure you understand the algorithm. Check your answer with the lab instructor.
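
Here is one possible sketch for the first two exercises above. It is only a sketch: the file name samples.txt, the use of np.savetxt, and the overall structure are arbitrary choices, not requirements.

import numpy as np

n = 10 ** 7    # number of categories
N = 10 ** 6    # number of samples

# Pick one of three equally likely outcomes per sample: 0, 1, or "something else".
samples = np.random.randint(0, 3, size=N)
other = samples == 2
# Replace the "something else" outcomes with uniform draws from {2, ..., n-1},
# so each of those values ends up with probability (1/3) * 1/(n-2).
samples[other] = np.random.randint(2, n, size=other.sum())

np.savetxt("samples.txt", samples, fmt="%d")

And a naive top-k that counts every entry and then sorts by count:

import sys
from collections import Counter

k = 2
with open(sys.argv[1]) as f:             # e.g. samples.txt
    counts = Counter(int(line) for line in f)
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for value, count in ranked[:k]:
    print("%d\t%d" % (value, count))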

Top-k Lossy

There is another implementation of Top-k that allows some level of error in exchange for speed and a low memory footprint. It is explained in Approximate Frequency Counts over Data Streams. The algorithm can be run through topk_lossy.py. Try running it on the data that you generated, specifying an appropriate frequency threshold that gives you only the zeros and ones.
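
For intuition, here is a minimal sketch of the lossy counting algorithm described in that paper. It is not the repository's topk_lossy.py, whose interface and details may differ; the function name, argument names, and return format are all made up.

import math

def lossy_count(stream, epsilon):
    # Each stored count underestimates the true count by at most epsilon * n,
    # where n is the number of items seen so far.
    width = int(math.ceil(1.0 / epsilon))    # bucket width
    counts = {}                              # item -> [count, max possible undercount]
    bucket = 1
    n = 0
    for item in stream:
        n += 1
        if item in counts:
            counts[item][0] += 1
        else:
            counts[item] = [1, bucket - 1]
        if n % width == 0:                   # bucket boundary: prune infrequent items
            counts = dict((it, cd) for it, cd in counts.items()
                          if cd[0] + cd[1] > bucket)
            bucket += 1
    return counts

To report only the items whose frequency exceeds some threshold s, you would then keep the items whose stored count is at least (s - epsilon) * n.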

k-means

The k-means algorithm finds k partitions of a data set, where each element in a partition is closer to that partition's centroid (mean) than to any other partition's centroid. The problem is known to be computationally difficult (NP-hard), but there are simple iterative algorithms that converge well to local optima.
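
To make the iteration concrete, here is a minimal sketch of the standard iterative scheme (Lloyd's algorithm) in numpy. It is not the repository's implementation: it runs a fixed number of iterations instead of testing a convergence threshold, and it does not handle empty clusters.

import numpy as np

def kmeans(X, k, iterations=20):
    # X is an (N, d) array; start from k randomly chosen data points.
    centroids = X[np.random.choice(len(X), k, replace=False)]
    for _ in range(iterations):
        # Assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels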

Old Faithful

To get some kind of intuition for this algorithm, let us start with 2D data. We will use data from the Old Faithful geyser at Yellowstone National Park (the faithful.txt file used below).

The first column denotes the duration of an eruption and the second the waiting time between eruptions. As it turns out, this data is distinctly bimodal, so setting k to 2 should give us good results.

Exercises

  • Run the data through kmeans_2d.py. The script expects the number of dimensions on the first line of the input file, so add a 2 at the top of faithful.txt.
  • What is the stopping criterion, and how does it use the threshold that you specify through -t CUTOFF? You will need to open the file and examine the code to answer this.

Note that since this k-means implementation uses Euclidean distances, it favors round partitions. The two dimensions in this case have different units and scales, although their spreads are similar when expressed as fractions of each axis's total range. For this reason, it would be better to normalize the axes first, for instance by rescaling them both to the range -1 to 1, before running our clustering algorithm. This also lets us use a single stopping threshold for most of our data. The same approach is valuable for many other algorithms and situations, such as linear regression.
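
As an example, rescaling each axis of a data matrix to the range -1 to 1 takes only a few lines of numpy. This assumes faithful.txt already has the dimension count added as its first line, as in the exercise above; the variable names are arbitrary.

import numpy as np

X = np.loadtxt("faithful.txt", skiprows=1)    # skip the dimension count on the first line
lo, hi = X.min(axis=0), X.max(axis=0)
X_scaled = 2.0 * (X - lo) / (hi - lo) - 1.0   # each column now spans exactly [-1, 1]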

MNIST

We can also use k-means on higher-dimensional data, for instance the MNIST data set: a collection of 28-by-28-pixel grayscale images of well-posed handwritten digits. If we treat each pixel location as a dimension, we can run k-means in this high-dimensional space to cluster the digits into different classes. First, you need to download the MNIST data set (four files: the training and test images and their labels).

Put all four files in a folder of their own somewhere, and make sure to decompress them (run gunzip *.gz in that folder, if you're lazy).

Exercises

  • Run kmeans_mnist.py with various numbers of clusters. What is the most obvious choice? What happens if we decrease or increase the number?

k-nearest-neighbor

Clustering can sometimes be interesting on a data set like MNIST, but what we really want to do is train a classifier to identify the correct digit in images we give it. One of the simplest approaches is k-nearest-neighbor. To classify an image, we first find the k closest data points in the training data, in the same high-dimensional Euclidean space as in the k-means case above. The labels of the training data are known, so once we have the nearest neighbors, we take a plurality vote among them to get the predicted label.
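
As a rough sketch (not knn_mnist.py), the core of such a classifier fits in a few lines of numpy. Here train_X is assumed to be an (N, 784) array of flattened training images, train_y the corresponding integer labels, and x a single flattened test image; all of these names are placeholders.

import numpy as np

def knn_classify(train_X, train_y, x, k):
    distances = np.linalg.norm(train_X - x, axis=1)   # Euclidean distance to every training image
    nearest = np.argsort(distances)[:k]               # indices of the k closest training images
    votes = np.bincount(train_y[nearest])             # count the labels among the neighbors
    return votes.argmax()                             # plurality vote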

Exercises

  • Play around with knn_mnist.py, specifying the index to classify using -i. Try different values of k, for instance on index 33, which is a non-trivial case.
  • What are the trade-offs of a small as opposed to a large k value? How can you decide k in a principled way?
  • Is this algorithm parallelizable? What do you think the main bottlenecks are?

Python generators

We're going to switch gears for a bit and focus on something called iterators and generators in Python, which you will need to know how to use before we move on to the next topic.

An iterator object is any object that can be iterated through in a linear fashion. There is only one operation that we are concerned with, and it is called next(). When there are no more elements in the iterator, it will raise a StopIteration exception. Lists, dictionaries and tuples are all iterable in Python, and to explicitly create an iterator from an iterable object we use iter():

>>> g = iter([0, 1, 2])
>>> g.next()
0
>>> g.next()
1
>>> g.next()
2
>>> g.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

A generator is another name for an iterator that is created much like a function definition, by inserting the yield keyword. Check this out:

def gen():
    yield 0
    yield 1
    yield 2

Calling gen() now returns an iterator that is indistinguishable from iter([0, 1, 2]). When we invoke gen(), the function body is not executed; instead, execution pauses at the very top of the function. Once we call next() on this object, execution continues until it reaches a yield, returns that value, and pauses again. If we call next() again, it simply continues to the next yield.

Invoking next() directly is neither the most elegant nor the most typical way of interacting with an iterator/generator; it is much more common to extract the values through the following familiar patterns:

>>> for i in gen():
...     print i
0
1
2
>>> sum(gen())
3
>>> list(gen())
[0, 1, 2]

Converting to a list is possible, as seen above, if the generator is finite. However, the whole point of iterators is to avoid having to store the entire sequence in memory at once. Iterators can even be infinite.
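
For instance, the following generator never terminates, so converting it to a list would hang; instead you can take a finite prefix with itertools.islice:

>>> from itertools import islice
>>> def naturals():
...     i = 0
...     while True:      # never raises StopIteration
...         yield i
...         i += 1
...
>>> list(islice(naturals(), 5))
[0, 1, 2, 3, 4]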

Exercises

mrjob

Amazon's Elastic MapReduce (EMR) allows you to easily run MapReduce jobs on an AWS cluster. The Python package mrjob further simplifies this process if you want to implement your jobs in Python.

mrjob example

Taken from the mrjob documentation (Writing jobs in Python), a simple example of a word counter is:

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordFreqCount.run()

The mapper function should yield tuples of (key, value), in this case (word, 1), indicating that you found one occurrence of word. These values are then combined using combiner, which is called once per word with all of that word's counts as a generator of only ones. In the above example, the counts are combined to yield something like (word, n), where n is the partial word count from a single process. Note that both mapper and combiner are called in parallel across several processes.

Finally, the outputs of all processes are gathered by key and merged using reducer. In this case, reducer again takes counts as a generator, but this time they won't all be ones. It may seem unnecessary to have both combiner and reducer, since they look identical in this example. However, combiner runs within each process and reduces the amount of information that has to be transmitted to wherever reducer is called. The fact that they look identical only happens in some situations, such as for linear reduction functions or max/min operations.
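
To try this out, you could save the class above in a file and run it locally on any text file; the file names here (word_freq_count.py and some_book.txt) are just placeholders:

$ python word_freq_count.py some_book.txt

By default, mrjob runs the whole job in a single local process; the same script can later be pointed at a cluster with -r emr, as described in the next section.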

k-means mrjob

In this section, you should create the config file and set everything up, but you don't have to execute the commands.

The script src/cs123/mrjob/kmeans.py uses mrjob to run k-means in parallel. By default, it will run the jobs locally, but by specifying -r emr it will run them on an AWS cluster. For this, you have to create ~/.mrjob.conf with the following information:

runners:
  emr:
    aws_access_key_id: <AWS Access Key>
    aws_secret_access_key: <AWS Secret Access Key>
    aws_region: us-east-1
    ec2_key_pair: cslab
    ec2_key_pair_file: ~/cslab.pem
    ec2_instance_type: m1.small
    num_ec2_instances: 2

    # Less important
    base_tmp_dir: /tmp
    cmdenv:
      TZ: America/Chicago

  local:
    base_tmp_dir: /tmp

Notice that you have to fill in the same information as in your ~/.awsconfig file, as well as the path to your key pair file. The ec2_instance_type setting selects the instance type (m1.small here; this does not work on t1.micro) and num_ec2_instances specifies how many instances you want to run the job on.

You could try kmeans.py on examples/data_analysis/data/5cluster.txt, but it will take much longer with -r emr than running it locally, so there is no gain in this example. The reason is that the data set is small to begin with, and simply running with -r emr has to start the instances, which takes time.

If you want to keep instances running and perform several jobs without having to start them and shut them down all the time, you can create job flows and reuse them. How to do this with mrjob is explained in Reusing Job Flow.

Exercises

  • What is the reduction function for the parallel k-means? Are combiner and reducer identical as in the word count example?

Terminating

To terminate all instances associated with a job flow, first you have to know the Job Flow ID, which you can check by:

$ aws emr describe-job-flows --output table

Once you know the ID, say j-<number>, then you can terminate it (and all the instances along with it) by running:

$ aws emr terminate-job-flows --job-flow-ids j-<number>

Make sure to terminate job flows once you are done with them for a while, since otherwise they will generate unnecessary charges. Verify that the instances are shutting down, and eventually shut down, by running aws ec2 describe-instances or through the web interface.