We have seen that threads provide a powerful mechanism for performing multiple tasks concurrently. Sometimes, the number of threads we need is known beforehand. For example, if we want to process an image, we can divide it up and assign each portion of the image to a separate thread (beyond a certain point, adding more threads doesn't help, but once we find that "sweet spot", the same number of threads will work every time we re-run the application on the same data).
However, in some applications, the number of tasks is not known beforehand and might even change throughout the lifetime of an application. For example, a web server receives a constant stream of requests for web pages and most modern web servers assign a thread to each request. If the web server spawns a new thread every time a new request comes along, it could end up spawning too many threads if there is a spike in web traffic (and those threads could use up valuable system resources, to the point where the server becomes saturated and can't deal with any more requests).
A common solution to this problem is the thread pool pattern, where a "pool" of threads is pre-spawned when the application starts running, and is always available to deal with incoming tasks. If the number of tasks ever exceeds the number of threads, the tasks are placed in a queue until a thread becomes available.
So, instead of potentially spawning more threads than the application can handle, we make these extra tasks wait. Another advantage is that the cost of starting up the threads is incurred when the application starts, instead of when a task is run (since we don't spawn a thread for each task; the threads are already there and are assigned to work on incoming tasks).
In this assignment, you will be implementing a thread pool to handle the 1.USA.gov clicks public stream. 1.USA.gov URLs are created whenever anyone shortens a .gov or .mil URL using bitly. The 1.USA.gov public stream provides real-time data every time anyone clicks on a 1.USA.gov URL (each click is represented as a JSON object).
Implementing a thread pool requires understanding locks and condition variables, both of which will be necessary to maintain a queue of tasks and to coordinate the threads in the thread pool. This assignment will help you practice using these synchronization primitives and will allow you to continue practicing C++.
The thread pool pattern is typically implemented using a queue, N threads (typically referred to as "worker threads"), and a data structure that encapsulates a single task to be run by a thread.
When the thread pool is created, N threads are created and are initially idle (since there is no work to do). When a task is submitted to the thread pool, it is placed in the task queue. If the queue goes from being empty to being non-empty (as will happen when the first task is submitted), an idle thread (selected arbitrarily) is notified that there is work to do. That thread then extracts the first task in the queue and runs it.
When the queue starts to accumulate tasks, the threads no longer have to receive notifications that there is more work to do. Instead, once a thread is done with the task it was working on, it checks the queue again; if there are tasks in it, it takes the first task and runs it. If not (if the queue is empty), it goes to sleep and waits for the queue to be non-empty.
A common feature of thread pools is that the number of threads also varies depending on how many tasks are waiting to be run. For example, if we created a pool with 4 threads, but suddenly there are 100 tasks waiting in the task queue, it would make sense to spawn more threads. Once the queue is empty, those extra threads can be destroyed to free up resources.
In this assignment, we will assume that the pool is created with N threads, and that this number does not change over time. Furthermore, the queue will have a maximum size of 10 tasks, meaning that any tasks submitted when the queue has 10 tasks in it will simply be dropped.
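To make the waiting, notifying, and dropping behavior concrete, here is a minimal sketch of a bounded task queue built on a single mutex and a single condition variable. The names (BoundedTaskQueue, submit, take) are hypothetical; this is not the code you will write for this assignment, just an illustration of the pattern.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// Hypothetical sketch of the queue at the heart of a thread pool.
class BoundedTaskQueue {
public:
    // Called by the producer. Drops the task if the queue is already full.
    bool submit(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mtx_);
        if (tasks_.size() >= 10)            // queue is full: drop the task
            return false;
        tasks_.push(std::move(task));
        cv_.notify_one();                   // wake one idle worker, if any
        return true;
    }

    // Called by a worker. Blocks until a task is available, then removes and returns it.
    std::function<void()> take() {
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [this] { return !tasks_.empty(); });
        std::function<void()> task = std::move(tasks_.front());
        tasks_.pop();
        return task;
    }

private:
    std::queue<std::function<void()>> tasks_;
    std::mutex mtx_;
    std::condition_variable cv_;
};

Note that this sketch calls notify_one on every submission, not only when the queue goes from empty to non-empty; that is slightly more conservative than the description above, but it is still correct, since a worker that is not waiting simply never sees the notification.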
1.USA.gov is a URL shortening service. Any time a .gov or .mil URL is shortened using bitly, a 1.USA.gov URL is created. For example http://1.usa.gov/awQHfg is a short URL that redirects to http://www.cancer.gov/.
1.USA.gov provides a live feed of clicks on 1.USA.gov URLs. You can see this live feed here: http://developer.usa.gov/1usagov. Each click is represented as a single JSON object. For example:
{ "a": "Mozilla\/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident\/4.0)", "c": "US", "nk": 0, "tz": "America\/New_York", "gr": "PA", "g": "awQHfg", "h": "19WPvQ8", "l": "rocklandantibodies", "hh": "1.usa.gov", "r": "direct", "u": "http:\/\/www.cancer.gov\/", "t": 1369838391, "hc": 1369083379, "cy": "Lemoyne", "ll": [ 40.252499, -76.897003 ] }
You will not have to worry about parsing the JSON, as we will be providing the code that handles this. Nonetheless, notice how the data includes several interesting pieces of information, such as the web browser used, the URL being clicked (http://www.cancer.gov/), and even the longitude/latitude where the click happened. You can find a more complete description of each data field here.
In this assignment, you will be able to hook up your thread pool to the live feed, but will also be able to run it on a file containing clicks.
We have uploaded several C++ files to your PhoenixForge repository that handle most of the functionality of dealing with the 1.USA.gov data. The only thing missing is the thread pool.
The following are files we provide which you should not have to modify:
The following are files you will be modifying:
To build the program, just run make:
make
This will generate a usagov-clicks executable.
The program accepts several command-line parameters:
After the program completes its analysis, it will print out a single line with the following information:
FILE,NTHREADS,ELAPSED,N,DROPPED
Where:
You must complete the following tasks in this homework:
The USAGovClickData class is used to aggregate information about multiple clicks. This means that there will be a single USAGovClickData object shared by all the threads and, as each thread processes a click, it will update the data in USAGovClickData (this is actually done in USAGovClickTask). However, USAGovClickData is not currently thread-safe. You must update it so that there can be no race conditions when the threads access the shared USAGovClickData object.
Our solution to this task required less than 5 lines of code.
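As a rough illustration of the shape of this change (not the actual USAGovClickData code, whose fields and methods differ), protecting a shared aggregation object usually means adding a mutex member and acquiring it in every method that touches the shared data:

#include <map>
#include <mutex>
#include <string>

// Hypothetical sketch only; the field and method names are illustrative.
class ClickDataSketch {
public:
    void addClick(const std::string& countryCode, bool newUser) {
        std::lock_guard<std::mutex> lock(mtx_);   // released automatically when the method returns
        ++numClicks_;
        if (newUser)
            ++numNewUsers_;
        ++clicksByCountry_[countryCode];
    }

private:
    std::mutex mtx_;
    int numClicks_ = 0;
    int numNewUsers_ = 0;
    std::map<std::string, int> clicksByCountry_;
};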
To implement the thread pool, you will have to implement the constructor and schedule method of the ThreadPool class (30 points) and the run method of the WorkerThread class (30 points). You do not need to implement the stop method of the ThreadPool class yet.
When implementing the thread pool, you must take the following into account:
Our solution to this task required 10 lines of code in WorkerThread and 20 lines of code in ThreadPool. It can be implemented with a single mutex and a single condition variable.
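To give a sense of the worker side, here is a hedged sketch of a worker loop, reusing the hypothetical BoundedTaskQueue sketched earlier (the assignment's WorkerThread::run plays this role, but its exact structure is up to you):

#include <functional>

// Hypothetical worker loop: wait for a task, run it, go back to waiting.
// Note that this loop never returns; stopping it cleanly is the subject of the next task.
void workerLoop(BoundedTaskQueue& queue) {
    while (true) {
        std::function<void()> task = queue.take();  // blocks while the queue is empty
        task();                                     // run the task without holding the lock
    }
}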
The worker threads in a thread pool (including the one you are building here) will continue to run tasks (and to wait for tasks to appear in the queue) until explicitly told to stop. This is in contrast to other thread examples we have seen, where a thread is given a single finite task to work on; if we want to wait until such a thread completes, we just call its join method.
Here, we need to explicitly stop the threads. However, C++11 threads do not have a convenient stop method. Instead, you will need to figure out a way to communicate to the threads that they should stop processing the task queue (and that the run() method should return).
This is a tricky task, although it ultimately requires only a few lines of code: our solution required five extra lines in WorkerThread and five lines to implement the stop method. No extra synchronization primitives should be needed.
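One common approach (a sketch of one possible design, not the only acceptable one) is a shared boolean flag that the workers check inside their wait loop; stop sets the flag while holding the lock and then wakes every waiting thread so it can notice the flag and return:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// Hypothetical sketch: the same mutex and condition variable protect
// both the task queue and the stop flag.
std::mutex mtx;
std::condition_variable cv;
std::queue<std::function<void()>> tasks;
bool stopping = false;

void stop() {
    {
        std::lock_guard<std::mutex> lock(mtx);
        stopping = true;
    }
    cv.notify_all();   // wake every idle worker so it can observe the flag
}

void workerRun() {
    while (true) {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [] { return stopping || !tasks.empty(); });
        if (stopping)                  // exit immediately; tasks still in the queue are abandoned
            return;
        std::function<void()> task = std::move(tasks.front());
        tasks.pop();
        lock.unlock();                 // run the task without holding the lock
        task();
    }
}

Abandoning whatever tasks remain in the queue is consistent with the sample runs shown below, where tasks that were scheduled but never processed simply go unaccounted for.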
The USAGovClickData class currently aggregates some pretty dull data: the number of clicks, the number of new users, and a per-country breakdown of the clicks. Modify this class (and the main.cpp file) to generate a more interesting analysis of the click data.
We have included a clicks.json file that you can use to test your implementation.
We suggest that you start by testing your implementation with the thread pool set to a single thread. That will allow you to verify that you are processing the click data correctly. You can try running your solution like this:
./usagov-clicks -f clicks.json -t 1 -v
It should print something like this:
Task 1 scheduled.
Task 2 scheduled.
Task 3 scheduled.
Task 4 scheduled.
Task 5 scheduled.
Task 6 scheduled.
Task 7 scheduled.
Task 8 scheduled.
Task 9 scheduled.
Task 10 scheduled.
Task 11 scheduled.
Task 12 scheduled.
Task 13 scheduled.
Task 14 scheduled.
Task 15 scheduled.
Task 16 scheduled.
Task 17 scheduled.
Task 18 scheduled.
Task 19 scheduled.
Task 20 scheduled.
Task 21 scheduled.
Task 22 scheduled.
Task 23 scheduled.
Task 24 scheduled.
Task 25 scheduled.
Task 26 scheduled.
Task 27 scheduled.
Task 28 scheduled.
Task 29 scheduled.
Task 30 scheduled.
Task 31 scheduled.
Task 32 scheduled.
Task 33 scheduled.
Task 34 scheduled.
Task 35 scheduled.
Task 36 scheduled.
Task 37 scheduled.
Task 38 scheduled.
Task 39 scheduled.
Task 40 scheduled.
Task 41 scheduled.
Task 42 scheduled.
Running time: 0.00275696
Received 42 clicks.
Dropped 0 clicks.
Processed 42 clicks.
2 of the clicks were from new users
Breakdown by country:
?? 2
AU 1
CA 2
DO 1
FR 1
PR 1
SA 1
SE 1
US 32
Next, try introducing arbitrary durations and intervals for the tasks, to verify that the thread pool correctly detects when it has reached its maximum capacity. For example, running this:
./usagov-clicks -f clicks.json -t 1 -v -i 10 -d 20
should produce something like this:
Task 1 scheduled.
Task 2 scheduled.
Task 3 scheduled.
Task 4 scheduled.
Task 5 scheduled.
Task 6 scheduled.
Task 7 scheduled.
Task 8 scheduled.
Task 9 scheduled.
Task 10 scheduled.
Task 11 scheduled.
Task 12 scheduled.
Task 13 scheduled.
Task 14 scheduled.
Task 15 scheduled.
Task 16 scheduled.
Task 17 scheduled.
Task 18 scheduled.
Task 19 scheduled.
Task 20 scheduled.
Task 21 scheduled.
Task 22 NOT scheduled.
Task 23 scheduled.
Task 24 NOT scheduled.
Task 25 scheduled.
Task 26 NOT scheduled.
Task 27 scheduled.
Task 28 NOT scheduled.
Task 29 scheduled.
Task 30 NOT scheduled.
Task 31 scheduled.
Task 32 NOT scheduled.
Task 33 scheduled.
Task 34 NOT scheduled.
Task 35 scheduled.
Task 36 NOT scheduled.
Task 37 scheduled.
Task 38 NOT scheduled.
Task 39 scheduled.
Task 40 NOT scheduled.
Task 41 scheduled.
Task 42 NOT scheduled.
Running time: 0.442897
Received 42 clicks.
Dropped 11 clicks.
Processed 22 clicks.
2 of the clicks were from new users
Breakdown by country:
?? 2
AU 1
CA 2
DO 1
FR 1
SA 1
US 14
The exact tasks that are dropped may vary from run to run, but you should nonetheless see some tasks being dropped. In this case, because the thread pool has only one thread, the task queue eventually fills up (to its limit of ten tasks) and has to start dropping tasks. If you run the following (the same as above, but with two threads):
./usagov-clicks -f clicks.json -t 2 -v -i 10 -d 20
You should not see any dropped tasks.
Also note that the dropped and processed tasks in the previous run added up to 33, which means 9 tasks were unaccounted for. These tasks were entered into the queue but were effectively cancelled when we chose to exit the program.
Finally, if you want to connect to the live stream (using the -w option), the program will keep on running until interrupted. Just press Control+C, and the program will stop reading from the live feed and will print out a summary of the collected data.
By default, our Makefile builds the program with compiler optimization flags turned on. If you want to debug your program with gdb or with Eclipse, you will have to build it like this:
make DEBUG=yes