Algorithms

In this lab we will work with a few algorithms. The algorithms themselves are not explained in detail in the lab write-up. If you are unsure of how they work, please ask your lab instructor to explain them or refer to the reference literature below. Understanding the algorithms is central to this lab. Here is the outline of the lab:

Getting you to think about a few data analysis algorithms:

Top-k with a max-heap

k-means

k-nearest-neighbor

Brief introduction to some new Python syntax we will need to use in the future.

Resources

k-means clustering
k-nearest-neighbor
Approximate Frequency Counts over Data Streams (Lossy Top-k)

Note that Top-k is not a standard name, so if you want to find more information about it, you might want to try various descriptive phrases.

Examples

Start by setting up your repository for this course, which will give you access to the files you need for this lab. To do so, remotely connected to a departmental machine, type cs-setup-script cmsc12300-spr-21. Having done so, you should have a repository directory which contains a lab1 folder, and should also be set up to use chisubmit.

To use the files for this lab, you need to temporarily add the code you just retrieved to your environment. You will need to do this whenever you make a new terminal window during the course of this lab. First, cd into your CS 123 repository, then the lab1 directory. Then:

$ export PATH=`pwd`/bin:$PATH
$ export PYTHONPATH=`pwd`/src:$PYTHONPATH

Test that this worked by making sure you can execute the files, by trying:

$ topk_heap.py -h

Run the commands with -h to understand how to use them, since it won't be explained as part of the lab.

Top-k

The Top-k problem finds the k most frequent entries in a corpus. We can use a dictionary to keep counts of the data, and then use a heap to keep track of the k largest entries while linearly going through the list of counts. The sample implementation operates on integers and topk_heap.py takes a data file with one integer per line. We need some data to test this on, so we will generate some.

Let us say we have a corpus of only the numbers $\{0, ..., n-1\}$. Word counts can be seen as drawn from a multinomial distribution. If we look at each sample (word occurrence) individually, we can consider them as drawn from a categorical distribution of n categories:

\begin{equation*} X_1, ..., X_N, \;\;\; \text{i.i.d}\;\;\; X_i \sim \text{Cat}(n, \{p_0, ..., p_n\}) \end{equation*}

This just means that there is $p_i$ probability to generate the number $i$. Naturally, we must require that $\sum_{i=0}^{n-1} p_i = 1$.

Exercises

In the following exercises, you have discretion as to how to structure and write your code.

Write a program (in Python, for instance) that generates the samples $X_1, ..., X_N$ and outputs them one number per line. We will use:

\begin{equation*} p_i = \left\{ \begin{array}{l l} \frac{1}{3} & \quad \text{if $i = 0$ or $i = 1$} \\ \frac{1}{3(n-2)} & \quad \text{otherwise} \end{array} \right. \end{equation*}

where n is set to 100,000,000 (number of categories) and N is set to 10,000,000 (number of samples). This means that there is 1/3 probability to draw a 0, 1/3 to draw a 1, and the rest are drawn uniformly from whatever probability is left. You should be able to implement this using only numpy.random.randint, in a very simple and straightforward way. Do not worry about making the sampling function general, since generality at this dimensionality can cost you a lot.
Write another program that reads this file and naively finds the k most frequent numbers. By naively, I mean keeping a count of each entry and then sorting them by count. Set k to 2, but the algorithm should still be general.

Run your file prefixed with time to check performance. This is not ideal, since it will include Python loading times, so if you want to do this right, you are welcome to do it in ipython3, prefixing the appropriate command with the %timeit directive.
Now, try running topk_heap.py on the same file and see the performance difference. The performance improvement might seem modest, but it would become more stark with a larger category size. Also, running time will include both loading Python libraries, as well as loading the data itself, so we can't really see how many times faster topk_heap.py is.
Theoretically, we should go from $\mathcal{O}(n \log n)$ to $\mathcal{O}(n \log k)$, which can make a significant difference.

Top-k Lossy

There is another implementation of Top-k, which allows some level of error, in favor of speed and a low memory footprint. This is explained in Approximate Frequency Counts over Data Streams. The algorithm can be run through topk_lossy.py. Try running it on the data that you generated, specifying an appropriate minimum frequency of occurrence that will give you only the zeros and ones.

k-means

For the subsequent parts of this lab, you will need to use NoMachine / vDesk, rather than Visual Studio Code, as you need a graphical connection to the departmental machine. (Please remember you need to update PATH and PYTHONPATH as described near the top of this lab if you are switching connections at this point.)

The k-means algorithm finds k partitions of a data set, where each element in a partition is closer to the partition's centroid (mean) than any other partition's centroid. The problem is known to be computationally difficult (NP-hard), but there are simple iterative algorithms with good convergence to local optima.

Old Faithful

To get some kind of intuition for this algorithm, let us start with 2D data. We will use data from the Old Faithful geyser at Yellowstone National Park:

faithful.txt

The first column denotes the duration of an eruption and the second the waiting time between eruptions. As it turns out, this data is distinctly bimodal, so setting k to 2 should give us good results.

Exercises

Run the data through kmeans_2d.py. This script takes a file that specifies the number of dimensions on the first line of the file, so add a 2 at the top of faithful.txt.
What is the stopping criterion and how does it use the threshold that you specify through -t CUTOFF? You will need to open up the file to examine how it works to be able to answer this.

Note that since this k-means implementation uses Euclidean distances, it favors round partitions. The two dimensions in this case are of different unit and scale, but with similar variance as expressed in multiples of the total range of the data along each axis. For this reason, it would be better to normalize the axes first, for instance by stretching them both to -1 to 1, before running our clustering algorithm. This will also allow us to use a single stopping threshold for most of our data. This same approach is valuable to many other algorithms and situations, such as linear regression.

MNIST

We can also use it for higher dimensional data, for instance the MNIST data set. It is a data set of 28 by 28 grayscale pixels of well-posed handwritten digits. If we consider each pixel location a dimension, then we can run k-means on this high-dimensional space, to cluster the digits into different classes. First, you need to download the MNIST data set:

MNIST

Put all four files in one otherwise empty folder somewhere, and make sure to decompress them (run gunzip *.gz in that folder).

Exercises

Run kmeans_mnist.py with various numbers of clusters. What is the most obvious choice? What happens if we decrease or increase the number?

k-nearest-neighbor

Clustering can sometimes be interesting on a data set like MNIST, but what we really want to do is to train a classifier to identify the correct digit in images we give it. One of the simplest ways is using k-nearest-neighbor. This involves first finding the k closest data points in the training data in the same high-dimensional Euclidean space as in the k-means case above. In the training data, the labels are known, so once we know the nearest neighbors, we do a plurality vote among them to get the label.

Exercises

Play around with knn_mnist.py, specifying the index to classify using -i. Try different values of k, for instance on index 33, which is a non-trivial case.
What are the trade-offs of a small as opposed to a large k value? How can you decide k in a principled way?

Python generators

We're going to switch gears for a bit and focus on something called iterators and generators in Python, which you will need to know how to use in future labs and assignments.

An iterator object is any object that can be iterated through in a linear fashion. There is only one operation that we are concerned with, and it is called __next__(). When there are no more elements in the iterator, it will raise a StopIteration exception. Lists, dictionaries and tuples are all iterable in Python, and to explicitly create an iterator from an iterable object we use iter():

>>> g = iter([0, 1, 2])
>>> g.__next__()
0
>>> g.__next__()
1
>>> g.__next__()
2
>>> g.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

A generator is another name for an iterator that was created similarly to how we define functions, inserting the yield keyword. Here is an example:

def gen():
    yield 0
    yield 1
    yield 2

Calling gen() now returns an iterator that will be indistinguishable to iter([0, 1, 2]). When we invoke gen(), instead of executing the function, it pauses at the very top of the function. Once we call __next__() on this object, it will continue the execution until it reaches yield and then return that value and pause again. If we call it again, it will simply go to the next yield.

Invoking __next__() directly is not the most elegant nor typical way of interacting with an iterator/generator, and it is instead much more common to extract the values through the following familiar patterns:

>>> for in gen():
...     print i
0
1
2
>>> sum(gen())
3
>>> list(gen())
[0, 1, 2]

Converting to a list is possible, as seen above, if the generator is finite. However, the whole point of iterators is to avoid having to store the entire sequence in memory at once. Iterators can even be infinite.

Exercises

In the definition of gen(), insert print statements all over the generator function and then create an instance of it and step through it using __next__(). This should give you a sense of the flow of execution.
Now, modify gen to take one parameter n and return a generator that returns the following list of integers: 0, 2, 4, 6, ..., 2*(n-1)