An Auto-Completing Keyboard using Tries¶

Due: Thursday, January 12 at 5pm

The purpose of this assignment is to give you more practice working with recursion and a recursive data structure. You must work alone for this assignment.

Introduction¶

Chances are pretty high that you’ve used a smart phone with a touch-screen keyboard. As you type, the phone is trying to guess what you’re typing in the background.

If it is very likely that you are typing a specific word, the phone may show a suggested completion on the screen and give you the opportunity to accept or reject it. For instance, if you start to type “aardv” then if you are typing a word in the English dictionary, it must be “aardvark” and the phone suggests this word.

Alternatively, if the letters you are typing do not appear to belong to a word, or at least not to a common word, then the phone recognizes this situation, too. It might present a graphical cue that the word is misspelled. Or, it might check whether there are words in the dictionary whose sequence of keystrokes are very close to the ones you entered on the keyboard (for instance, “dit” is, key-for-key, close to”for,” so perhaps that is what you meant).

How does the phone manage the information it needs to make these decisions? Clearly, it has a built-in dictionary. But how does it search that dictionary efficiently as you are typing?

A suitable and commonly-used data type for this task is called a trie. A trie represents sequences (in this case, of letters) and is suitable for determining whether a partial sequence is the prefix to at least one entry. A trie can tell us instantly if a sequence is a “dead end” that doesn’t have any valid completions; and when a sequence does have completions, we can count them and enumerate them in a straightforward fashion.

Tries¶

We will start by describing a trie conceptually.

A trie is a recursive data structure, that is, a trie is made by taking smaller tries and combining them.

A trie starts with a node. A node contains information about the first k letters of a sequence, as well as links to sub-tries. These links are labelled by the next letter in the sequence.

As we consider a particular sequence, we start at the beginning node, then follow the link for the first letter in the sequence. This step takes us to a new node. At this node, we can check the information stored to determine statistics about the sequence so far. We will have access to a “count,” which tells us how many words exist that begin with that letter. We can also check to see if there is a word that ends at that letter. As an example, if the first letter is “a,” then we start at the initial node and follow the link to the node representing all words that begin with “a.” We will find a count indicating how many such words exist, and an indication that “a” itself is indeed a word even without adding additional characters.

As another example, the figure below shows the complete trie for a language consisting only of the words “a,” “an,” “and,” “are,” and “bee.”

Now, let’s talk about how to accomplish useful tasks using a trie.

Getting information for a sequence¶

If we have a sequence of letters corresponding to the beginning of a word, we can traverse the trie to find the information corresponding to that sequence.

Starting with the first letter and the first node in the trie, we follow the link for that letter to get to a new node. We continue following links from successive nodes corresponding to successive letters until we have taken all of the letters in the sequence into account. We are now at a node representing that sequence as a prefix to words. At that node will be a count, indicating how many words begin with that prefix; also, there may be a flag indicating that the sequence itself is a word.

If, along the way, we were to discover that the link for the next letter in our sequence does not exist, then the sequence is not a valid prefix for any words, and we could stop at that point and conclude that the word is misspelled (or otherwise not in our dictionary).

Each time we follow a link to reach a new node, we arrive at a node while we are holding on to a list of letters in our sequence yet to be followed. Because this situation is very similar to the situation we are in when we begin traversing the structure (we have the entire sequence of letters yet to be considered, and are at the starting node), the problem of writing the traversal code naturally lends itself to recursion.

Completing a sequence¶

Once we have arrived at the node corresponding to the entire prefix, we can generate all possible completions of that sequence into words.

We do this by traversing every link out of the node corresponding to our prefix (thus considering all possible first letters in a completion), and every link from each of those nodes (all possible second letters), and so on, until there are no more links to follow. Every time we traverse a link, we keep track of the sequence of letters we would need to add to our prefix to arrive at the currently-visited node. Every time we encounter a node marked final (in other words, one that corresponds to a full word), we add the sequence of letters used to arrive at that node, to our list of possible completions.

The traversal of the nodes and links in this scenario again lends itself to a recursive implementation.

Building a trie from a list of words¶

We can build a trie from a list of words one word at a time. You may assume that the list does not contain any duplicates. (We will give you lists of words to use in this assignment. If you look at this data, it may appear that the large word list contains duplicates. But, it actually does not, because case matters. For example, “a” and “A” are considered different words.)

After creating an initial node corresponding to the empty prefix, we add each individual word according to the following process.

Starting at the initial node, we follow the link for the first letter in the word. If it does not exist, we create that link and the destination node. We then repeat this process for each subsequent letter until we have consumed all of the letters in the word. At the final node, we mark the node as final, indicating that a complete word corresponds to that prefix. At every node we visited along the way, we also increment the count of words with that prefix by one.

As with the other two operations, the repeated traversal of nodes and links lends itself to recursive programming.

Representing tries¶

To represent tries in Python, we recommend you use a dictionary. The benefit of a dictionary is that you can easily look up the next character in a string, because it can be the key in the dictionary.

More specifically, we suggest that you represent a node as a dictionary, with a key for the count at that node, a key that is set to true if the node represents a “final” sequence, that is, the completion of a word, and keys for every letter that succeeds the node along the way to longer sequences that eventually form words.

The count and final values can simply be numbers and boolean values, respectively. The values for the letter-keyed entries should themselves be tries. Because of this, the values at those keys are themselves dictionaries, defined the same way as the outer node: a count value and final flag, and some number of letter-based keys. Because the value at a letter-based key is itself a trie, or sub-trie, this data structure is considered recursive.

Your tasks: building and using tries¶

Your task is to complete the following functions in the file trie_dict.py:

def create_trie_node():

which creates a new node for a trie.

def add_word(word, trie)

which takes a word (a string) and adds it to the trie. As described above, a trie is internally represented as a dictionary, but it is up to you to figure out the details of this internal representation.

def is_word(word, trie)

which determines whether a particular word is a complete word according to the trie.

def get_completions(prefix, trie)

which returns a list of all of the suffixes of a specified prefix which, together, represent a word contained in the trie. Specifically, the function should return a list containing just the suffixes, not the entire word. This function should return the empty list if prefix does not appear in the trie. For example, if we called get_completions on the small example trie shown above with the prefix “as”, the function would return the empty list.

def num_completions(prefix, trie)

which returns the number of full words beginning with the specified prefix in the trie. IMPORTANT: Do not implement this function by simply returning len(get_completions(prefix, trie)). When using a trie, generating all the completions and then counting them is not the most efficient way of getting the number of completions.

It is up to you to decide how to structure your code beyond the functions we have suggested; for instance, the design of any helper functions is up to you. But, there are certain operations that you will likely find yourself doing repeatedly, so you are strongly encouraged to identify these and write them once, in helper functions.

Getting started¶

We have seeded your repository with a directory for this assignment. To pick it up, change to your cs122-win-17-username directory (where the string username should be replaced with your username) and then run the command: git pull upstream master. You should also run git pull to make sure your local copy of your repository is in sync with the server.

Here is a description of the contents of the directory:

trie_dict.py – you will implement the functions at the top of this file.
trie_list.py – an implementation of the above functions using a list rather than a trie. Using a list makes this inefficient, but it will allow you to get familiar with the functions by calling them with toy examples.
trie_shell.py – user-interface implementation
test_trie.py – a file for your tests here.
web2 – a copy of the words from the 1934 edition of Miriam-Webster’s Dictionary
five – a simple list of words with only the five words shown in the example diagram

You can use the list-based implementation to familiarize yourself with the expected behavior of the code.

The file trie_list.py contains a list-based implementation of the above functions. The file trie_shell.py contains code for a simple user interface that uses these functions to implement a text-based interface to write a message with auto-complete features. You can run the list version of the program using this command:

python3 trie_list.py WORDS_FILE

Where WORDS_FILE is a file with words (one per line) to be loaded into the trie. We provide two files:

A small example list of words named five with only the five words in the earlier example diagram.
A large list of words, web2, based on the 1934 version of Miriam-Webster’s Dictionary, for you to use once your trie implementation is complete. Because this real-world dictionary has hundreds of thousands of words, you should not be alarmed if it takes ten seconds or more to load it in.

The provided code, with our list-based implementation of the functions, behaves in the same way as your final trie-based implementation should. We strongly encourage you to play around with this version before you start writing your trie-based solution.

The user interface starts with a prompt for you to write a word. For each character you write, the functions above are called to determine if there are matches in the trie for the partial word you have written. If there are no matches, our code writes out a message to that effect. If there is one match, our code completes the word based on the match you provide. If there is more than one match, then our code continues to wait for further characters to be entered. If you press space before an auto-completion, we leave the word as you typed it and allow you to start entering the next word. If you press the tab key in the middle of a word, then we print out a list of possible completions based on the results of calling your function. However, if there are more than ten completions, we only print a message to that effect, to avoid printing an excessive amount of information to the screen.

Writing your own tests¶

We suggest that you test each of your functions thoroughly. You should write your tests in the provided test_trie.py file. We will not grade this file, but we may provide some feedback on your tests. Your tests could, for instance, print out the trie that you have generated after inserting a couple of words; or verify that the words you have inserted come back as valid words.

For your tests, you may want to use a very small list of words that you create by hand. For example, test_trie.py includes the calls to add_word necessary to build the simple five-word trie shown earlier. If you print out this trie, you should get something similar to this:

{'a': {'count': 4,
       'final': True,
       'n': {'count': 2, 'd': {'count': 1, 'final': True}, 'final': True},
       'r': {'count': 1, 'e': {'count': 1, 'final': True}}},
 'b': {'count': 1, 'e': {'count': 1, 'e': {'count': 1, 'final': True}}},
 'count': 5}

(Note that you have the discretion to only have the final key present when it is true (its absence implicitly indicates False), or you can always have this key present and set to True or False. Depending upon your approach, the output may differ by having final keys for all nodes, and this is acceptable. Note also that dictionaries may present keys and values in any order, so if your dictionary prints the sequence of keys and values in a different order, you should not be alarmed. But, make sure the sub-tries are nested in the same way.)

However, visually inspecting printouts of data structures is tedious and error-prone. You should try to write tests that call your functions, and then verify that the data structure behaves as expected. For example, if you built the five-word trie correctly, and then call get_completions("a", t), you should verify that it returns four strings: "", "n", "nd", "re". One way of writing these kind of tests in Python is by using assertions (https://wiki.python.org/moin/UsingAssertionsEffectively). In a nutshell, assertions allow you to assert a condition and, if that condition is false, then the program will exit with an error. If test_trie.py is full of assertions that depend on the trie working correctly, and you can run test_trie.py without any errors, then you passed all the assertions.

Please note that we expect you to write basic python code in test_trie.py, we do not expect you to write tests that are suitable for use with pytest.

Using the auto-complete shell to test your code¶

You can run the following command to use the provided user interface with your implementation:

python3 trie_dict.py WORDS_FILE

If your functions are implemented correctly, it should be possible to type the following sentence and, as you do, the program should auto-complete the letters written in parentheses on its own (note: you have to use the web2 file for this):

The aardv(ark) had a multifacet(ed) similarit(y) to native species of phytopl(ankton) that permeate the maritim(e) region of predominatel(y) equatoriall(y) situated nations

There are no words that start with ‘nations’

The word “nations” should register as misspelled, even though it is not, because Webster’s Dictionary does not include plurals. The word “permeate” and the word “situated” should be recognized as complete words and result in a space being inserted for you, though no letters will be completed by the program (Webster’s Dictionary does not have words starting with these prefixes that are longer.)

As a further test, if you type the letters “antic” and then tab, you should be told that there are 145 completions. If you then continue typing to yield the partial word “anticip” and press tab again, you should receive the following suggested completions (in any order):

anticipatable
anticipate
anticipation
anticipative
anticipatively
anticipatorily
anticipator
anticipatory
anticipant

Submission¶

To submit your assignment, make sure that you have:

put your name at the top of your file,
registered for the assignment using chisubmit,
added, committed, and pushed trie_dict.py and test_trie.py to the git server, and
run the chisubmit submission command.

Remember to push your code to the server early and often! Also, remember that you can submit as many times as you like before the deadline.