K-Nearest Neighbors with the MNIST Dataset

In this programming assignment, we will revisit the MNIST handwritten digit dataset and the K-Nearest Neighbors algorithm. We have already seen how this algorithm is implemented in Python, and we will now implement it in C++ with a few modifications. This assignment will allow you to familiarize yourself with C++ programming in general, and with C++ objects and the STL in particular. It will also give you a chance to see the difference in performance between Python and C++.

The MNIST Dataset

The MNIST Dataset contains 70,000 images of handwritten digits (zero through nine), divided into a 60,000-image training set and a 10,000-image testing set. For example:

[Figure: a sample handwritten digit image from the MNIST dataset]

It has become a classic dataset for testing machine learning algorithms because it is so easy to work with. Each image is 28x28 pixels, and the images are stored together in binary files; however, you will not have to worry about the binary representation of the images, as we are providing the code that reads the images from the files.

More specifically, we are providing you with a class called MNISTDataset which, given the image and label files, will load the entire dataset into memory, storing each image (and its label) in an MNISTImage object. However, MNISTImage is an abstract class; one of your tasks in this homework will be to implement several possible internal representations of an MNIST image.
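
To give you a rough picture before you open the code, the class relationships look something like the sketch below. The exact declarations (method signatures, attribute names, constructor parameters) are in the provided header files; the types shown here are assumptions for illustration:

    // Simplified sketch of the provided class hierarchy; the authoritative
    // declarations are in the header files in your repository.
    class MNISTImage {
    public:
        virtual ~MNISTImage() {}

        // Each internal representation defines its own notion of distance.
        virtual double distance(const MNISTImage *other) const = 0;
    };

    class MNISTDataset {
    public:
        // Loads every image and its label into memory, storing each image
        // as an object of an MNISTImage-derived class.
        // (Constructor and accessors omitted; see the header file.)
    };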

K-Nearest Neighbors

K-Nearest Neighbors (or KNN) is a simple classification algorithm that is surprisingly effective. However, to work well, it requires a training dataset: a set of data points where each point is labeled (i.e., where it has already been correctly classified). If we set K to 1 (i.e., if we use a 1-NN algorithm), then we can classify a new data point by looking at all the points in the training dataset, and choosing the label of the point that is nearest to the new point. If we use higher values of K, then we look at the K nearest points, and choose the most frequent label among those points. For example, with K = 3, if the three nearest training points are labeled 7, 7, and 1, the new point is classified as a 7.

As we saw when we ran KNN on the MNIST dataset in Python, even 1-NN produces very good results. However, classifying the entire testing set could take several hours. If we reimplement the exact same algorithm in C++, we will only improve our running time by a constant factor (the complexity of the algorithm remains the same regardless of the programming language). The main cost in the algorithm we presented in Python was the distance computation: we regarded each 28x28 pixel image as a point in 784-dimensional space, and computed the Euclidean distance between two such points for every pair of images we compared.
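
Concretely, if we treat two images as vectors x and y of 784 pixel values each, this distance is

    d(x, y) = \sqrt{ \sum_{i=1}^{784} (x_i - y_i)^2 }

and this sum has to be evaluated once for every test/training image pair. The strategies below reduce the cost of this computation by shrinking the number of coordinates involved (or by replacing the computation entirely).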

In this assignment, we will explore four additional strategies for computing the distance between images:

  • Downsampling: keep only a subset of the bytes in the image array (by a factor of 2, 4, or 8).
  • Scaling: shrink the image by averaging blocks of pixels (by a factor of 2, 4, or 7).
  • Sum: reduce each image to the sum of its pixel values, and compute the distance between the sums.
  • Random: use random distances.

Additionally, for extra credit, you will be able to implement strategies of your own, but you will only get credit for them if they have an error rate lower than 20% (see the Bonus Task below).

The code

We have uploaded several C++ files to your PhoenixForge repository that, when compiled together, produce a program that will classify the entire MNIST testing dataset using one of the distance strategies described above.

To build the program, just run make:

make

This will generate a mnist-knn executable.

To build the tests, run:

make tests

This will generate a mnist-knn-tests executable.

The program accepts several command-line parameters (all of them appear in the examples in the Notes section):

  • -d: the directory containing the MNIST data files.
  • -k: the value of k (the number of nearest neighbors to consider).
  • -t: the distance strategy to use.
  • -n: the number of testing images to classify (0 means the entire testing set).
  • -v: verbose mode (see the Notes section).

The -t parameter accepts the following values:

  • full: Euclidean distance between all 784 coordinates.
  • downsample2: Downsampling by a factor of 2 (this is explained in more detail below).
  • downsample4: Downsampling by a factor of 4.
  • downsample8: Downsampling by a factor of 8.
  • scale2: Scaling the image by a factor of 2 (this is explained in more detail below).
  • scale4: Scaling the image by a factor of 4.
  • scale7: Scaling the image by a factor of 7.
  • sum: Distance between the sums of the coordinates.
  • random: Random distances.

After the program completes its analysis, it will print out a single line with the following information:

TYPE K N CORRECT INCORRECT ERROR

Where:

  • TYPE is the distance strategy that was used (the value of the -t parameter).
  • K is the number of neighbors (the value of the -k parameter).
  • N is the number of testing images that were classified.
  • CORRECT is the number of correctly classified images.
  • INCORRECT is the number of incorrectly classified images.
  • ERROR is the error rate, as a percentage (100 * INCORRECT / N).

Your Tasks

You must complete the following tasks in this homework:

Task 1: Implement the ArrayMNISTImage and FullArrayMNISTImage classes (25 points)

The ArrayMNISTImage class is the parent class for all the classes that store an MNIST image as an array of bytes. The class constructor should not allocate memory, since it doesn't yet know how much memory will be needed (this is left to the derived classes; e.g., the downsampling implementation doesn't need to store the full array, just a portion of it). Even though we won't be creating ArrayMNISTImage instances directly, you should still initialize its attributes to sensible values, such as data to NULL and size to 0. This is good practice: it avoids leaving the attributes with indeterminate, bug-prone values in case we forget to initialize them later on.

You must also implement the distance method, which computes the Euclidean distance between the arrays of two images. This method must work for any values of data and size: even though the ArrayMNISTImage constructor leaves the array empty, the method will be called on derived objects that hold real data, so do not assume any particular array length.
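
As a rough sketch of the computation involved, written here as a standalone function with illustrative names (in the actual method, a and b would be the data attributes of the two ArrayMNISTImage objects, and n their common size):

    #include <cmath>
    #include <cstddef>
    #include <cstdint>

    // Hedged sketch: Euclidean distance between two byte arrays of length n.
    double euclideanDistance(const uint8_t *a, const uint8_t *b, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            double diff = (double)a[i] - (double)b[i];   // per-pixel difference
            sum += diff * diff;                          // accumulate squares
        }
        return std::sqrt(sum);                           // root of the sum
    }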

The FullArrayMNISTImage class is a child class of ArrayMNISTImage, and stores the entire image as a flat array. You must implement its constructor, which must create a copy of the array for the object.

Our solution to this task required 15 lines of code.

Task 2: Implement the 1-NN algorithm (35 points)

In file knn.cpp you will implement the following function:

uint32 KNN(MNISTDataset &trainSet, MNISTDataset &testSet, uint32 k, uint32 max, bool verbose)

This function takes the training set and the testing set, and runs KNN with k neighbors, using the first max images in the testing set (if max is 0, then all the images are used). The function should not print anything to the console if verbose is set to false.

The function returns the number of images in the testing set that have been correctly labeled.

To implement this function, you will have to use the methods of the MNISTDataset and MNISTImage classes. You will find a description of these classes and their methods in their header files.

At this point, you can assume that k will always be 1. This simplifies the algorithm considerably, since you do not have to keep track of multiple neighbors.
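
In outline, the k = 1 case looks like the sketch below. The accessor names used here (getNumImages, getImage, getLabel) and the double return type of distance are hypothetical stand-ins; use whatever the header files actually declare:

    // Hedged 1-NN sketch; the accessor names are hypothetical.
    uint32 numTest = (max == 0) ? testSet.getNumImages() : max;
    uint32 correct = 0;
    for (uint32 i = 0; i < numTest; i++) {                // each test image
        const MNISTImage *test = testSet.getImage(i);
        double bestDist = -1.0;                           // -1 == none yet
        uint32 bestLabel = 0;
        for (uint32 j = 0; j < trainSet.getNumImages(); j++) {
            double d = test->distance(trainSet.getImage(j));
            if (bestDist < 0.0 || d < bestDist) {         // closer neighbor
                bestDist = d;
                bestLabel = trainSet.getLabel(j);
            }
        }
        if (bestLabel == testSet.getLabel(i))             // correctly labeled?
            correct++;
    }
    return correct;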

Our solution to this task required 40 lines of code.

Task 3: Implement the DownsampleMNISTImage class (10 points)

The DownsampleMNISTImage class is a child class of ArrayMNISTImage, and stores a downsampled version of the image. This means that, instead of storing the full array of bytes as specified in the MNIST file, this class will only store a portion of those bytes. This portion is determined by the downsampling factor: if the factor is 2, only the bytes at indexes divisible by 2 will be stored; if the factor is 4, only the bytes at indexes divisible by 4 will be stored; and so on.

So, if the array of bytes is { 10, 20, 30, 40, 50, 60, 70, 80 } and the downsampling factor is 2, then the size attribute of the object should be set to 4, and the following array should be stored: { 10, 30, 50, 70 }.
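
The copy itself is a short loop, along the lines of this hedged sketch (origData and origSize stand in for whatever the constructor actually receives; data and size are the ArrayMNISTImage attributes, assumed here to be a uint8_t array and its length):

    // Keep only the bytes at indexes divisible by factor.
    size = (origSize + factor - 1) / factor;   // ceiling division
    data = new uint8_t[size];
    for (uint32 i = 0; i < size; i++)
        data[i] = origData[i * factor];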

You must implement only the constructor of this class. You do not need to implement the distance method, since the one inherited from ArrayMNISTImage will work fine.

Our solution to this task required 5 lines of code.

Task 4: Implement the ScaleMNISTImage class (15 points)

The ScaleMNISTImage class is a child class of ArrayMNISTImage, and stores a scaled version of the image. This is similar to what DownsampleMNISTImage does, except that it is a bit smarter about reducing the size of the array.

To scale the image, we will use a scaling factor. If we are given a scaling factor of 4, that means that we will divide the image into 4x4 squares, and reduce each square into a single pixel. So, a 28x28 pixel image would become a 7x7 image.

To determine the value of each new pixel, we simply take the average of all the pixels in the n x n square that is being reduced into that pixel.
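
Assuming a 28x28 image stored in row-major order and a factor n that divides 28 evenly (as 2, 4, and 7 all do), the averaging could look like the following sketch, where orig stands in for the original 784-byte array:

    // Hedged sketch: reduce each n x n block of a 28x28 image to its average.
    const uint32 WIDTH = 28;
    uint32 newWidth = WIDTH / n;                      // e.g., 7 when n == 4
    size = newWidth * newWidth;
    data = new uint8_t[size];
    for (uint32 by = 0; by < newWidth; by++) {        // block row
        for (uint32 bx = 0; bx < newWidth; bx++) {    // block column
            uint32 total = 0;
            for (uint32 y = 0; y < n; y++)
                for (uint32 x = 0; x < n; x++)
                    total += orig[(by * n + y) * WIDTH + (bx * n + x)];
            data[by * newWidth + bx] = (uint8_t)(total / (n * n));
        }
    }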

You must implement only the constructor of this class. You do not need to implement the distance method, since the one inherited from ArrayMNISTImage will work fine.

Our solution to this task required 15 lines of code.

Task 5: Implement the SumMNISTImage class (5 points)

The SumMNISTImage class is a child class of MNISTImage, which reduces the MNIST image into a single scalar: the sum of the values of all the pixels. You must implement the constructor of this class, along with the distance method.
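
A hedged sketch of both pieces follows. The names origData, origSize, and otherSum are illustrative, and "distance between the sums" is taken here to mean their absolute difference (check the header files for the exact contract):

    // Constructor: reduce the image to a single scalar, the sum of its pixels.
    uint32 sum = 0;
    for (uint32 i = 0; i < origSize; i++)
        sum += origData[i];

    // distance: absolute difference between the two images' sums (otherSum
    // would be obtained after a dynamic_cast, as described in the Notes).
    double d = (sum > otherSum) ? (double)(sum - otherSum)
                                : (double)(otherSum - sum);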

Our solution to this task required 5 lines of code.

Task 6: Implement the K-NN algorithm (20 points)

Modify your implementation of the KNN function to support values of k greater than 1. For a given image in the testing set, this will require keeping track of the k best distances as you compare the test image with all the training images. Although there are several ways of doing this, we suggest you use the STL's priority_queue data structure.
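
One way to structure this is sketched below: maintain a max-heap whose top element is the worst of the k best candidates seen so far, and evict that top whenever a closer neighbor appears. After all training images have been examined, the heap contains the k nearest neighbors, and you can choose the most frequent label among them. The Neighbor type and helper function are illustrative, not part of the provided code:

    #include <cstdint>
    #include <queue>
    #include <utility>

    // (distance, label) pairs; std::pair compares by its first element, so
    // the default priority_queue (a max-heap) keeps the LARGEST distance on
    // top -- i.e., the worst of the current best k.
    typedef std::pair<double, uint32_t> Neighbor;

    // Offer one training image's (distance, label) pair to the best-k heap.
    void considerNeighbor(std::priority_queue<Neighbor> &best,
                          double dist, uint32_t label, uint32_t k)
    {
        if (best.size() < k) {
            best.push(Neighbor(dist, label));   // heap not full yet
        } else if (dist < best.top().first) {
            best.pop();                         // evict the farthest candidate
            best.push(Neighbor(dist, label));
        }
    }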

Our solution to this task required adding 45 lines of code to the solution for Task 2.

Bonus Task: Implement additional MNISTImage derived classes (15 extra points)

Come up with an algorithm to compare MNIST images (or find an existing algorithm) and implement it as a derived class of MNISTImage. Your algorithm must meet the following criteria:

  • It must run at least twice as fast as the full method.
  • It must have an error rate of less than 20%.

Take into account that, besides implementing an additional MNISTImage derived class, you must modify the createImage method of MNISTDataset so that the program will recognize your additional method when using the -t command-line parameter.
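
Conceptually, this means adding one more branch to a factory method that dispatches on the strategy name. The sketch below is purely illustrative ("mymethod" and MyMNISTImage are hypothetical names for your strategy and class; match the actual structure of createImage):

    // Illustrative sketch only: dispatch on the -t string.
    if (type == "sum")
        return new SumMNISTImage(/* ... */);
    else if (type == "mymethod")          // hypothetical new strategy
        return new MyMNISTImage(/* ... */);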

Notes

  • Once you have implemented Tasks 1 and 2, you will have enough code to start testing your solution on the MNIST dataset. When we run our solution with the following parameters:

    ./mnist-knn -d ./ -k 1 -t full -n 250
    

    We get the following result:

    full 1 250 247 3 1.2
    

    This takes approximately 45 seconds to run.

  • Since it can take several seconds to go through a meaningful subset of the dataset, and the only feedback you get is the final line printed by the program, we suggest you make good use of the -v option: when it is set, have the KNN function print some information every time it classifies an image. For example, when the -v option is set, our solution prints the following for each image:

    IDX LABEL_EXPECTED LABEL_CLASSIFIED [XXX]
    

    Where IDX is the index of the image in the data set, LABEL_EXPECTED is the expected label, LABEL_CLASSIFIED is the label determined by KNN, and the literal string "XXX" is printed when the labels don't match.

  • Your implementation of distance is going to require "downcasting" an MNISTImage pointer to one of its derived types. For example, in the distance implementation in ArrayMNISTImage you will need access to the data attribute, which is not defined in MNISTImage. This kind of cast should not be done with a C-style cast (e.g., (ArrayMNISTImage*) img), which performs no runtime type check. Instead, you have to do the following:

    const ArrayMNISTImage *arrayimg = dynamic_cast<const ArrayMNISTImage*>(img);
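
    Note that dynamic_cast returns NULL when the object being cast is not actually an instance of the target type, so you can check arrayimg before dereferencing it.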
    
  • By default, our Makefile builds the program with compiler optimization flags turned on. If you want to debug your program with gdb or with Eclipse, you will have to build it like this:

    make DEBUG=yes