152: Week 9

Due in one week before you scheduled lab session.

Goals for this homework

Continue practicing implementation with structs
Better understand hash tables and how the components work together

You may work in your duet partnership on this assignment if you wish.

Set Up

Create a hw9 directory for this assignment.

We have provided some starting files. Like other warmups, and unlike your project, the design is well specified so that you can just fill in the implementation.

As with any container class, there are three very separate jobs for you to do -

Write the structure that's being stored in the hash table
Write the container data structure
Write the code that uses the data structure

In this project,

A wordcount structure, holding a word and a count, is stored in the hash table.
The data structure is a hash table.
You are evaluating different collision resolution techniques to see how they perform. The real use of this setup is that you could keep a count of how many times each word is seen in a file to calculate the frequency different words appear. However, for this project, we don't care about the application, we're substituting that application for a main function that exercises the different implementations of the hash table to obtain performance information.

Here are some files to start you out.

word.h - this file defines struct wordcount, the data type that we will work with for this project. wordcount counts the word inserted and the number of inserts. In this project we do not really ask you to count the occurence of a word, so the field of count will always be 1, but in a real applicatin you might need to increment the count. A function called get_key takes in a wordcount struct (actually a void * that we know is a wordcount struct) and returns an integer. This function is used as the hash function in the hash table. Below, we will describe how to calculate the integer.
word.c - This file defines functions relevant to struct wordcount. Implement the functions with our provided function signatures and descriptions.
hash.h - This declares a set of functions that will be needed by the hash table in order to store and manage the strings it is holding. We will be making a hash table that uses different methods to handle collisions, and the specific method of insertion and searching is dependent on the collision handling mechanism, which is determined at runtime.
hash.c - The implementation of the hash functions.
main.c - the main function that reads input from the user, creates the hash table, and tests the hash table. We have provided the implementation of print_hash_table_entries so that everyone's output is consistent.

You will also need the following files (you might as well touch and add them to your respository now):

Makefile - makes executable named hash_table

Problem: wordcount

Finish the functions relevant to manipulating struct wordcount according to our provided function signatures and descriptions.

Problem: hash table

Complete this hash table implementation by providing definitions for the functions specified in the header files. The function generate_hash and get_key merits some additional explanation, which appears below in this document. The rest of the functions are specified in the header files in some detail; please review those files carefully. For our exercise, we assume that we are storing (key,value) pairs such as (12345, "uchicago"). The key being stored is an unsigned long int, but the value is a void pointer that can be used to store any type of value.

When displaying the hash table (print_hash_table_entries), you must follow our format: the bucket indeces must appear as numbers at the left, enclosed in square brackets, one per line, with the contents of each bucket following it. If the hash-table is chained, each Key,Value pair is shown in parenthesis with a "->" symbol. The key is what get_key returns, and the value is what print_word returns.

You need two functions from word.c - a compare function and a get_key function. You will use compare_words that you already wrote in a previous assignment. Then you will need to write get_key as specified below.

You need to write a function (get_key) to get from the wordcount structs stored in the hash table to a key. In this case, the key is not guaranteed to be unique because the keyspace is larger than the 2^64. However, our compare operation in wordcount does more than compare keys - it compares words - so we're okay.

Follow this recipe for computing an unsigned long int key for a string:

1. Initialize an unsigned long int result to 17 (unsigned long res = 17;).
2. Loop over the characters in the string. For every character, set the result to be 37 times itself plus the ASCII numeric value of the character.

(This hashing algorithm recipe is from Effective Java by Joshua Bloch.)

For example, the hash code for the word FORK, given that the ASCII values of the characters in FORK are 70, 79, 82 and 75, is as follows:


res = 17
res = 37*17+70     = 699
res = 37*699+79    = 25942
res = 37*25942+82  = 959936
res = 37*959936+75 = 35517707
(As you can see, the numbers get big.)

Now that your wordcount struct has produced an integer key, the hash table needs to take that key and map it to a location within the hash table. You need to convert this long integer into an integer index into the hash table array (generate_hash).

We will go with the simplest of approaches here. To identify the index of FORK in a hash table of size, say, 33, you compute the remainder of the key when divided by 33. In this case 35517707%33 is 5, so FORK would be inserted into and/or located in the bucket with index 5 (the sixth bucket in the table).

Once you have implemented the hash function above, you will need to implement methods to handle collisions in your hash table. Instead of choosing a single function, the purpose of this assignment is to analyze the performance of three popular mechanisms - chaining (linked-list), linear probing, and quadratic probing.

Chaining: Each hash table entry corresponds to a linked list. Each entry that has been hashed into an index that is already occupied by other hash table entries is simply added to the linked list at that index. This will require you to implement the insert_chained() and search_chained() functions.
Linear Probing: If a hash table entry is already occupied, search through the hash table entries in linear fashion until you find an empty slot and place the entry there. The logic for insertion and search will have to be implemented in the insert_linear() and search_linear() functions respectively.
Quadratic Probing: Similar to linear probing above, but instead of iterating through the list in linear fashion, you will probe for empty slots in the hash table quadratically, (i.e. using a square of the current iteration number). Therefore the list of entries you would look at for a particular hash value H would be H+1²,H+2²,H+3² and so on. The logic for insertion and search will have to be implemented in the insert_quadratic() and search_quadratic() functions respectively.

Please note that in creating and using the hash table functions, you can only use the generic create_hash_table(), insert() and search() functions. Since we're interested in the performance of these individual collisions handling mechanisms, we will need to measure the number of comparisons needed to perform the requested operation (insert or search). This can be done by passing an integer value (opcount) by reference into each of the functions. Starting from 0, for every entry of the hash table (or chained list item) that has been visited, you should increment opcount by 1. This will allow us to see exactly how many operations it takes for each of the mechanisms to insert and search through the hash table.

The insert functions return an integer value, denoting success or failure of the insert (1 on success, 0 on failure), while the search funcitons, while the search functions return the node item that matches the key.

Putting it all together

We want to be able test the correctness and evaluate the relative merits/demerits of each collision handling mechanism. For this, we should read words from a file and insert/search them in sequence. The input files should have exactly one word per line with operation "insert" or "search". We have provided the functions insert_search_from_file() to perform insert and search operations. The function do_all_tests() in main.c can be used to test all three hash table implementations on the same file and print out the average operations required for each hash table type.

We provide the following three test data files for you to evaluate each collision resolution mechanism.

Another experiment to perform is how bucket size (the number of slots in the hash table) affects the hash table performance. Try hash table size of 1500, 2000, 5000, 10000, and report the number of operations needed.

Report

Given your implementation and the results you have seen of the various collision resolution mechanims, write a technical paper, that reports on what you found. This report should be named "[CNetID]-report.pdf", and added to your repository in the following format:

Heading - (Your full name, Uchicago Email, Lab section date and time)
Abstract (2 paragraph summary of purpose and results)
Motivation (introduce what hash tables are, what collision-resolution techniques are, and why this study is important)
Methods (Explain how the experiment was carried out - how was the test data generated? In our case, the test data is provided by us. What collision-resolution techniques were tested?)
Results (Graphs showing the performance of different collision-resolution techniques with different input sets. Also include memory requirements and any other relevant results.)
Discussion (your interpretation of the results. Which was the best?)
Conclusions (one paragraph, which is similar to the second paragraph of the abstract)

Please use our provided report template.

Submit

Don't forget to add and commit all of your files!

CS152 W18 Project 2