We have seen that threads provide a powerful mechanism for performing multiple tasks concurrently. Sometimes, the number of threads we need is known beforehand. For example, if we want to process an image, we can divide it up and assign each portion of the image to a separate thread (beyond a certain point, adding more threads doesn't help, but once we find that "sweet spot", the same number of threads will work every time we re-run the application on the same data).
However, in some applications, the number of tasks is not known beforehand and might even change throughout the lifetime of an application. For example, a web server receives a constant stream of requests for web pages and most modern web servers assign a thread to each request. If the web server spawns a new thread every time a new request comes along, it could end up spawning too many threads if there is a spike in web traffic (and those threads could use up valuable system resources, to the point where the server becomes saturated and can't deal with any more requests).
A common solution to this problem is the thread pool pattern, where a "pool" of threads is pre-spawned when the application starts running, and is always available to deal with incoming tasks. If the number of tasks ever exceeds the number of threads, the tasks are placed in a queue until a thread becomes available.
So, instead of potentially spawning more threads than the application can handle, we make these extra tasks wait. Another advantage is that the cost of starting up the threads is incurred when the application starts, instead of when a task is run (since we don't spawn a thread for each task; the threads are already there and are assigned to work on incoming tasks).
In this assignment, you will be implementing a thread pool to handle the 1.USA.gov clicks public stream. 1.USA.gov URLs are created whenever anyone shortens a .gov or .mil URL using bitly. The 1.USA.gov public stream provides real-time data every time anyone clicks on a 1.USA.gov URL (each click is represented as a JSON object).
Implementing a thread pool requires understanding locks and condition variables, both of which will be necessary to maintain a queue of tasks and to coordinate the threads in the thread pool. This assignment will help you practice using these synchronization primitives and will allow you to continue practicing C++.
The thread pool pattern is typically implemented using a queue, N threads (typically referred to as "worker threads"), and a data structure that encapsulates a single task to be run by a thread.
When the thread pool is created, N threads are created and are initially idle (since there is no work to do). When a task is submitted to the thread pool, it is placed in the task queue. If the queue goes from being empty to being non-empty (as will happen when the first task is submitted), an idle thread (selected arbitrarily) is notified that there is work to do. That thread then extracts the first task in the queue and runs it.
When the queue starts to accumulate tasks, the threads no longer have to receive notifications that there is more work to do. Instead, once a thread is done with the task it was working on, it checks the queue again; if there are tasks in it, it takes the first task and runs it. If not (if the queue is empty), it goes to sleep and waits for the queue to be non-empty.
A common feature of thread pools is that the number of threads also varies depending on how many tasks are waiting to be run. For example, if we created a pool with 4 threads, but suddenly there are 100 tasks waiting in the task queue, it would make sense to spawn more threads. Once the queue is empty, those extra threads can be destroyed to free up resources.
In this assignment, we will assume that the pool is created with N threads, and that this number does not change over time. Furthermore, the queue will have a maximum size of 10 tasks, meaning that any tasks submitted when the queue has 10 tasks in it will simply be dropped.
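To make the waiting, notifying, and dropping behavior concrete, here is a minimal sketch of a bounded task queue built on a single mutex and a single condition variable. The names (BoundedTaskQueue, submit, take) are hypothetical; this is not the code you will write for this assignment, just an illustration of the pattern.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// Hypothetical sketch of the queue at the heart of a thread pool.
class BoundedTaskQueue {
public:
    // Called by the producer. Drops the task if the queue is already full.
    bool submit(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mtx_);
        if (tasks_.size() >= 10)            // queue is full: drop the task
            return false;
        tasks_.push(std::move(task));
        cv_.notify_one();                   // wake one idle worker, if any
        return true;
    }

    // Called by a worker. Blocks until a task is available, then removes and returns it.
    std::function<void()> take() {
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [this] { return !tasks_.empty(); });
        std::function<void()> task = std::move(tasks_.front());
        tasks_.pop();
        return task;
    }

private:
    std::queue<std::function<void()>> tasks_;
    std::mutex mtx_;
    std::condition_variable cv_;
};

Note that this sketch calls notify_one on every submission, not only when the queue goes from empty to non-empty; that is slightly more conservative than the description above, but it is still correct, since a worker that is not waiting simply never sees the notification.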
1.USA.gov is a URL shortening service. Any time a .gov or .mil URL is shortened using bitly, a 1.USA.gov URL is created. For example http://1.usa.gov/awQHfg is a short URL that redirects to http://www.cancer.gov/.
1.USA.gov provides a live feed of clicks on 1.USA.gov URLs. You can see this live feed here: http://developer.usa.gov/1usagov. Each click is represented as a single JSON object. For example:
{ "a": "Mozilla\/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident\/4.0)", "c": "US", "nk": 0, "tz": "America\/New_York", "gr": "PA", "g": "awQHfg", "h": "19WPvQ8", "l": "rocklandantibodies", "hh": "1.usa.gov", "r": "direct", "u": "http:\/\/www.cancer.gov\/", "t": 1369838391, "hc": 1369083379, "cy": "Lemoyne", "ll": [ 40.252499, -76.897003 ] }
You will not have to worry about parsing the JSON, as we will be providing the code that handles this. Nonetheless, notice how the data includes several interesting pieces of information, such as the web browser used, the URL being clicked (http://www.cancer.gov/), and even the longitude/latitude where the click happened. You can find a more complete description of each data field here.
In this assignment, you will be able to hook up your thread pool to the live feed, but will also be able to run it on a file containing clicks.
We have uploaded several C++ files to your PhoenixForge repository that handle most of the functionality of dealing with the 1.USA.gov data. The only thing missing is the thread pool.
The following are files we provide which you should not have to modify:
The following are files you will be modifying:
To build the program, just run make:
make
This will generate a usagov-clicks executable.
The program accepts several command-line parameters:
After the program completes its analysis, it will print out a single line with the following information:
FILE,NTHREADS,ELAPSED,N,DROPPED
Where:
You must complete the following tasks in this homework:
The USAGovClickData class is used to aggregate information about multiple clicks. This means that there will be a single USAGovClickData object shared by all the threads and, as each thread processes a click, it will update the data in USAGovClickData (this is actually done in USAGovClickTask). However, USAGovClickData is not currently thread-safe. You must update it so that there can be no race conditions when the threads access the shared USAGovClickData object.
Our solution to this task required less than 5 lines of code.
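As a rough illustration of the shape of this change (not the actual USAGovClickData code, whose fields and methods differ), protecting a shared aggregation object usually means adding a mutex member and acquiring it in every method that touches the shared data:

#include <map>
#include <mutex>
#include <string>

// Hypothetical sketch only; the field and method names are illustrative.
class ClickDataSketch {
public:
    void addClick(const std::string& countryCode, bool newUser) {
        std::lock_guard<std::mutex> lock(mtx_);   // released automatically when the method returns
        ++numClicks_;
        if (newUser)
            ++numNewUsers_;
        ++clicksByCountry_[countryCode];
    }

private:
    std::mutex mtx_;
    int numClicks_ = 0;
    int numNewUsers_ = 0;
    std::map<std::string, int> clicksByCountry_;
};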
To implement the thread pool, you will have to implement the constructor and schedule method of the ThreadPool class (30 points) and the run method of the WorkerThread class (30 points). You do not need to implement the stop method of the ThreadPool class yet.
When implementing the thread pool, you must take the following into account:
Our solution to this task required 10 lines of code in WorkerThread and 20 lines of code in ThreadPool. It can be implemented with a single mutex and a single condition variable.
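To give a sense of the worker side, here is a hedged sketch of a worker loop, reusing the hypothetical BoundedTaskQueue sketched earlier (the assignment's WorkerThread::run plays this role, but its exact structure is up to you):

#include <functional>

// Hypothetical worker loop: wait for a task, run it, go back to waiting.
// Note that this loop never returns; stopping it cleanly is the subject of the next task.
void workerLoop(BoundedTaskQueue& queue) {
    while (true) {
        std::function<void()> task = queue.take();  // blocks while the queue is empty
        task();                                     // run the task without holding the lock
    }
}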
The worker threads in a thread pool (including the one you are building here) will continue to run tasks (and to wait for tasks to appear in the queue) until explicitly told to stop. This is in contrast to other thread examples we have seen, where a thread is given a single finite task to work on; if we want to wait until such a thread completes, we just call its join method.
Here, we need to explicitly stop the threads. However, C++11 threads do not have a convenient stop method. Instead, you will need to figure out a way to communicate to the threads that they should stop processing the task queue (and that the run() method should return).
This is a tricky task, although it ultimately requires only a few lines of code: our solution required five extra lines in WorkerThread and five lines to implement the stop method. No extra synchronization primitives should be needed.
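One common approach (a sketch of one possible design, not the only acceptable one) is a shared boolean flag that the workers check inside their wait loop; stop sets the flag while holding the lock and then wakes every waiting thread so it can notice the flag and return:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

// Hypothetical sketch: the same mutex and condition variable protect
// both the task queue and the stop flag.
std::mutex mtx;
std::condition_variable cv;
std::queue<std::function<void()>> tasks;
bool stopping = false;

void stop() {
    {
        std::lock_guard<std::mutex> lock(mtx);
        stopping = true;
    }
    cv.notify_all();   // wake every idle worker so it can observe the flag
}

void workerRun() {
    while (true) {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [] { return stopping || !tasks.empty(); });
        if (stopping)                  // exit immediately; tasks still in the queue are abandoned
            return;
        std::function<void()> task = std::move(tasks.front());
        tasks.pop();
        lock.unlock();                 // run the task without holding the lock
        task();
    }
}

Abandoning whatever tasks remain in the queue is consistent with the sample runs shown below, where tasks that were scheduled but never processed simply go unaccounted for.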
The USAGovClickData class currently aggregates some pretty dull data: the number of clicks, the number of new users, and a per-country breakdown of the clicks. Modify this class (and the main.cpp file) to generate a more interesting analysis of the click data.
We have included a clicks.json file that you can use to test your implementation.
We suggest that you start by testing your implementation with the thread pool set to a single thread. That will allow you to verify that you are processing the click data correctly. You can try running your solution like this:
./usagov-clicks -f clicks.json -t 1 -v
It should print something like this:
Task 1 scheduled.
Task 2 scheduled.
Task 3 scheduled.
Task 4 scheduled.
Task 5 scheduled.
Task 6 scheduled.
Task 7 scheduled.
Task 8 scheduled.
Task 9 scheduled.
Task 10 scheduled.
Task 11 scheduled.
Task 12 scheduled.
Task 13 scheduled.
Task 14 scheduled.
Task 15 scheduled.
Task 16 scheduled.
Task 17 scheduled.
Task 18 scheduled.
Task 19 scheduled.
Task 20 scheduled.
Task 21 scheduled.
Task 22 scheduled.
Task 23 scheduled.
Task 24 scheduled.
Task 25 scheduled.
Task 26 scheduled.
Task 27 scheduled.
Task 28 scheduled.
Task 29 scheduled.
Task 30 scheduled.
Task 31 scheduled.
Task 32 scheduled.
Task 33 scheduled.
Task 34 scheduled.
Task 35 scheduled.
Task 36 scheduled.
Task 37 scheduled.
Task 38 scheduled.
Task 39 scheduled.
Task 40 scheduled.
Task 41 scheduled.
Task 42 scheduled.
Running time: 0.00275696
Received 42 clicks.
Dropped 0 clicks.
Processed 42 clicks.
2 of the clicks were from new users
Breakdown by country:
?? 2
AU 1
CA 2
DO 1
FR 1
PR 1
SA 1
SE 1
US 32
Next, try introducing arbitrary durations and intervals for the tasks, to verify that the thread pool correctly detects when it has reached its maximum capacity. For example, running this:
./usagov-clicks -f clicks.json -t 1 -v -i 10 -d 20
should produce something like this:
Task 1 scheduled.
Task 2 scheduled.
Task 3 scheduled.
Task 4 scheduled.
Task 5 scheduled.
Task 6 scheduled.
Task 7 scheduled.
Task 8 scheduled.
Task 9 scheduled.
Task 10 scheduled.
Task 11 scheduled.
Task 12 scheduled.
Task 13 scheduled.
Task 14 scheduled.
Task 15 scheduled.
Task 16 scheduled.
Task 17 scheduled.
Task 18 scheduled.
Task 19 scheduled.
Task 20 scheduled.
Task 21 scheduled.
Task 22 NOT scheduled.
Task 23 scheduled.
Task 24 NOT scheduled.
Task 25 scheduled.
Task 26 NOT scheduled.
Task 27 scheduled.
Task 28 NOT scheduled.
Task 29 scheduled.
Task 30 NOT scheduled.
Task 31 scheduled.
Task 32 NOT scheduled.
Task 33 scheduled.
Task 34 NOT scheduled.
Task 35 scheduled.
Task 36 NOT scheduled.
Task 37 scheduled.
Task 38 NOT scheduled.
Task 39 scheduled.
Task 40 NOT scheduled.
Task 41 scheduled.
Task 42 NOT scheduled.
Running time: 0.442897
Received 42 clicks.
Dropped 11 clicks.
Processed 22 clicks.
2 of the clicks were from new users
Breakdown by country:
?? 2
AU 1
CA 2
DO 1
FR 1
SA 1
US 14
The exact tasks that are dropped may vary from run to run, but you should nonetheless see some tasks being dropped. In this case, because the thread pool has only one thread, the task queue eventually fills up (to its limit of ten tasks) and has to start dropping tasks. If you run the following (the same as above, but with two threads):
./usagov-clicks -f clicks.json -t 2 -v -i 10 -d 20
You should not see any dropped tasks.
Also note that the dropped and processed tasks in the previous run added up to 33, which means 9 tasks were unaccounted for. These tasks were entered into the queue but were effectively cancelled when we chose to exit the program.
Finally, if you want to connect to the live stream (using the -w option), the program will keep on running until interrupted. Just press Control+C, and the program will stop reading from the live feed and will print out a summary of the collected data.
By default, our Makefile builds the program with compiler optimization flags turned on. If you want to debug your program with gdb or with Eclipse, you will have to build it like this:
make DEBUG=yes