Analyzing Candidate Tweets¶
Due: Friday, October 29 at 4:30pm CDT.
You must work alone on this assignment.
The purpose of this assignment is to give you a chance to practice
using dictionaries to represent data and as a mechanism for mapping keys to values that change over time,
using functions to avoid repeated code,
structuring a task into logical, easy-to-test pieces, and
reading the documentation for library functions.
Introduction¶
On April 18th, 2017, Theresa May, then Prime Minister of the United Kingdom, announced that she would call a snap election. This news came as a bit of a surprise since the next election was not due to be held for several years. Her stated reason for calling the election was a desire to strengthen the UK Government’s hand in negotiations with the European Union (EU) over Britain’s exit from the EU (colloquially referred to as Brexit). While the election did not necessarily play out to Prime Minister May’s satisfaction, it did yield a trove of tweets that we can mine for insight into British politics.
Unlike US presidential elections, which seem to last forever, in the UK, general elections are quite short. The period between when a general election is officially called and the date of the election is typically six weeks or so. During this pre-election period, known as purdah, civil servants are restricted from certain activities that might be viewed as biased towards the party currently in power. Purdah ran from April 22nd until June 9th for the 2017 General Election.
For this assignment, we’ll be analyzing tweets sent from the official Twitter accounts of four parties: the Conservative and Unionist Party (@Conservatives), the Labour Party (@UKLabour), the Liberal Democrats (@LibDems), and the Scottish National Party (@theSNP) during purdah.
We’ll ask questions such as:
What was @Conservatives’s favorite hashtag during purdah? [#bbcqt]
Who was mentioned at least 50 times by @UKLabour? [@jeremycorbyn]
What words occurred most often in @theSNP’s tweets? [snp, scotland, our, have, more]
What two-word phrases occurred most often in @LibDems’s tweets? [stand up, stop tories, will stand, theresa may, lib dems]
For those of you who do not follow British politics, a few notes:
Jeremy Corbyn was the leader of the Labour Party during the 2017 election.
Tim Farron was the leader of the Liberal Democrats during the 2017 election.
Theresa May was the leader of the Conservatives during the 2017 election.
Nicola Sturgeon has been the leader of the Scottish National Party since 2014.
The Conservatives are also known as the Tories.
The hashtag #bbcqt refers to BBC Question Time, a political debate program broadcast by the British Broadcasting Corporation.
Getting started¶
Before you start, make sure to review the Coursework Basics page. Then, using the invitation URL provided on Ed Discussion, create a repository for your PA #3 files.
Next, make sure you’ve
set the GITHUB_USERNAME
variable by running the following (replacing replace_me
with your GitHub username):
GITHUB_USERNAME=replace_me
(remember you can double-check whether the variable is properly set by
running echo $GITHUB_USERNAME
).
And finally, run these commands (if you don’t recall what some of these commands do, they are explained at the end of the Git Tutorial):
cd ~/cmsc12100
mkdir pa3-$GITHUB_USERNAME
cd pa3-$GITHUB_USERNAME
git init
git remote add origin git@github.com:uchicago-cmsc12100-aut-21/pa3-$GITHUB_USERNAME.git
git remote add upstream git@github.com:uchicago-cmsc12100-aut-21/pa3.git
git pull upstream main
git branch -M main
git push -u origin main
You will find the files you need for the programming assignment directly in the root of your
repository, including a README.txt
file that explains what each file is. Make sure you read that file.
On this assignment, you may assume that the input passed to your functions has the correct format. You may not alter any of the input that is passed to your functions. In general, it is bad style to modify a data structure passed as input to a function, unless that is the explicit purpose of the function. If someone calls your function, they might have other uses for the data and should not be surprised by unexpected changes.
Because the data files for this assignment are large, we did not
include them in the GitHub distribution. Instead, please run the shell
script get_files.sh
to download the data. To run the
script, make sure you are in the root of your repository and run this command from the
Linux command-line:
$ ./get_files.sh
This script will create directories named data
and tests
and
will download one file per party:
Conservatives.json
UKLabour.json
LibDems.json
theSNP.json
plus some files used by the test code.
Then as usual, fire up ipython3
from the Linux command-line,
set up autoreload, and import your code and the json library as
follows:
$ ipython3
In [1]: %load_ext autoreload
In [2]: %autoreload 2
In [3]: import basic_algorithms, analyze
In [4]: import json
The json
library will be needed when you recreate our tests in
ipython3
.
Part 1: Basic algorithms¶
A common problem in data analysis is to count and sort tokens. A token can be any value, including a word or phrase, a symbol, a hashtag in a tweet, etc. It is useful to have efficient algorithms for counting and sorting distinct tokens. For example, we might
want to find the most frequent keyword used in a document or the top
10 most popular links clicked on a website. In Part 1, you will
implement two such algorithms: find_top_k
and find_min_count
.
You will also implement an algorithm for finding the most salient
(that is, the most important or notable) tokens for a document in the
context of a document collection.
In this part, you will add code to basic_algorithms.py
to implement
the algorithms described in the following subsections.
Task 1.1: Counting distinct tokens¶
Write a helper function
def count_tokens(tokens):
that counts distinct tokens. This function takes as input a list of tokens and returns a dictionary that maps tokens to counts. In the example below, the tokens are strings, but in general the tokens can have any immutable type.
For example, if we have a list:
['A', 'B', 'C', 'A']
the function should yield (the exact order of the key-value pairs in the dictionary is not important):
{'A': 2, 'B': 1, 'C': 1}
Notes:
Do not use Python libraries other than what is already imported. (For example, you may not use collections.Counter.)
If you use Python’s list.count method, your solution will be inefficient and you will not receive credit for this task.
We provide test code for many of the tasks in this assignment, but not this one. However, you will call this function many times later in the assignment. Thus, it is even more important than usual to test your implementation using ipython3 before moving on.
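For intuition, here is a minimal sketch of the dictionary-based counting this task calls for. The variable names are ours; your implementation may differ, as long as it makes a single pass over the list.

def count_tokens(tokens):
    """Map each distinct token to the number of times it occurs in tokens."""
    counts = {}
    for token in tokens:
        # get returns 0 the first time we see a token
        counts[token] = counts.get(token, 0) + 1
    return counts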
Task 1.2: Top k¶
Write a function
def find_top_k(tokens, k):
that takes a list of tokens and a nonnegative integer k, and returns a list of the k tokens that occur most frequently in the list. The result should be sorted in descending order of token frequency. If some of the tokens are tied with the same number of occurrences, those tokens should be listed in alphabetical order (or more generally, sorted order).
Here is an example use of this function:
In [5]: l = ['D', 'B', 'C', 'D', 'D', 'B', 'D', 'C', 'D', 'A']
In [6]: basic_algorithms.find_top_k(l, 2)
Out[6]: ['D', 'B']
In the above example, 'D'
occurs most frequently, with five occurrences, so it is listed first. Then, 'B'
and 'C'
are tied for next most frequent, with two occurrences each. Since 'B'
is alphabetically before 'C'
, it is 'B'
that shows up next in the output list.
To do this computation, you will need to:
Count the number of times each token appears,
Convert that data into a list of (token, count) tuples,
Sort the resulting list in decreasing order by count, and
Extract the k tokens that occurred most frequently.
The standard Python sort
function sorts pairs using the first value as
the primary key and the second value as the secondary key (to break
ties) and so, is not suitable for this task. Instead, you need a sort
function that uses the second value as the primary key and the first
value to break ties. We have provided such a function:
sort_count_pairs
in util.py
. To learn about this function,
you can look at the docstring or import util
into ipython3
and
then run help(util.sort_count_pairs)
.
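For intuition only, the ordering that sort_count_pairs is meant to produce can also be expressed with the built-in sorted by negating the count and letting the token break ties. The snippet below is just an illustration of the required order; in your solution, use the provided util.sort_count_pairs.

pairs = [('B', 2), ('C', 2), ('D', 5), ('A', 1)]
ordered = sorted(pairs, key=lambda pair: (-pair[1], pair[0]))
# ordered == [('D', 5), ('B', 2), ('C', 2), ('A', 1)]
top_2 = [token for (token, count) in ordered[:2]]   # ['D', 'B']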
You can test your code for this task by running:
$ py.test -xv -k top_k test_basic_algorithms.py
(As usual, we use $
to signal the Linux command-line
prompt. You should not type the $
.)
If you have not done so already, you will need to follow the instructions for downloading the data in the Getting started section before you run this test.
Task 1.3: Minimum number of occurrences¶
Write a function
def find_min_count(tokens, min_count):
that computes the set of tokens in a list that occur at least some specified minimum number of times.
Here is an example use of this function:
In [7]: l = ['D', 'B', 'C', 'D', 'D', 'B', 'D', 'C', 'D', 'A']
In [8]: basic_algorithms.find_min_count(l, 2)
Out[8]: {'D', 'B', 'C'}
Recall that sets are not ordered, so the result of this call could
just as easily have been {'B', 'C', 'D'}
.
After writing this function you should be able to pass all tests in:
$ py.test -xv -k min_count test_basic_algorithms.py
Salient tokens¶
The most frequent tokens in a list or document may not be the most salient. For example, in an English-language document, the fact that the words “a”, “an”, and “the” occur frequently is not at all surprising, and we wouldn’t consider those words to be especially important or relevant to that document in particular.
Term frequency–inverse document frequency (tf–idf for short) is a statistic designed to reflect how important a word is to a document, in the context of a collection or corpus of documents. The most salient words in a document are the ones that occur frequently in that particular document, but not in the document corpus as a whole. Tf–idf is often used as a weighting factor in information retrieval and text mining.
Definitions¶
Tf–idf is defined as the product of the term frequency (tf) and the inverse document frequency (idf). We give precise formulas for both of these quantities below, but generally here is how you can think of them:
The term frequency reflects the number of times a term (token) appears in a particular document. If the term appears frequently in the document, it will have a high tf, and thus a high tf–idf. This reflects the idea that the term is more salient to that document.
The inverse document frequency reflects the number of different documents that a term (token) appears in, but with an inverse relationship. If a term appears in many different documents, then it will have a low idf, and thus a low tf–idf. This reflects the idea that, while the term may be common overall, it is not salient to any particular document.
There are several variants of both term frequency and inverse document frequency that can be used to compute tf–idf. We will be using augmented frequency as our measure of term frequency; this prevents bias towards longer documents as compared to the basic version of term frequency. We will use the basic version of inverse document frequency.
The augmented frequency of a term \(t\) in a document \(d\) is defined as

\[\tf(t, d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max(\{f_{t^\prime,d} : t^\prime \in d\})}\]

where \(f_{t,d}\) is the number of times the term \(t\) appears in the document \(d\). The expression \(\max(\{f_{t^\prime,d}: t^\prime \in d\})\) is the maximum number of times that any term appears in document \(d\).
The basic inverse document frequency of a term \(t\) in a document collection \(D\) is defined as

\[\idf(t, D) = \ln\left(\frac{\lvert D \rvert}{\lvert \{d \in D : t \in d\} \rvert}\right)\]
The expression \(\lvert D \rvert\) is the number of documents in the document collection
\(D\). The expression \(\lvert \{d \in D : t \in d\} \rvert\) is
the total number of documents that the term \(t\) occurs in. Use the natural log (math.log
) in your calculation of \(\idf\).
Then, the term frequency–inverse document frequency is defined as

\[\text{tf--idf}(t, d, D) = \tf(t, d) \cdot \idf(t, D)\]
where \(t\) is a term, \(d\) is a document (collection of terms), \(D\) is a collection of documents, and \(\tf\) and \(\idf\) are defined above.
Given a threshold \(T\), the set of salient words for a document \(d\) in a document collection \(D\) is defined as the set of words in the document \(d\) that have a tf–idf score above the threshold \(T\); that is,

\[\text{salient}(d, D, T) = \{t \in d : \text{tf--idf}(t, d, D) > T\}\]
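To make the definitions concrete, here is a sketch of the two formulas as Python helper functions. The function names and parameters are ours, not part of the assignment: term_counts maps each term in one document to its raw count \(f_{t,d}\), and docs is a list of documents (lists of terms).

import math

def augmented_tf(term_counts, term):
    # 0.5 + 0.5 * f_{t,d} / (maximum count of any term in the document)
    max_count = max(term_counts.values())
    return 0.5 + 0.5 * term_counts[term] / max_count

def basic_idf(docs, term):
    # natural log of (number of documents / number of documents containing the term)
    num_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / num_containing)

For example, basic_idf([['D', 'B', 'D', 'C', 'D', 'C', 'C'], ['D', 'A', 'A'], ['D', 'B'], []], 'D') evaluates to math.log(4 / 3), matching the value for 'D' in the example below.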
Example¶
In this section, we will work through the computation of the salient tokens for each document in a sample document collection. Here is the sample collection:
Document Collection:
[['D', 'B', 'D', 'C', 'D', 'C', 'C'],
['D', 'A', 'A'],
['D', 'B'],
[]]
Here are the augmented term frequency values for each document:
| Document | Token | tf |
|---|---|---|
| ['D', 'B', 'D', 'C', 'D', 'C', 'C'] | B | 0.6666666666666666 |
| | C | 1.0 |
| | D | 1.0 |
| ['D', 'A', 'A'] | A | 1.0 |
| | D | 0.75 |
| ['D', 'B'] | B | 1.0 |
| | D | 1.0 |

There are no augmented term frequency values for the last document because it is empty.
Here are the inverse document frequency values for the document collection:
| Token | idf |
|---|---|
| A | 1.3862943611198906 |
| B | 0.6931471805599453 |
| C | 1.3862943611198906 |
| D | 0.28768207245178085 |
Here is the salience computation for each document using a threshold of 0.4.
| Document | Token | tf–idf | Salient? |
|---|---|---|---|
| ['D', 'B', 'D', 'C', 'D', 'C', 'C'] | B | 0.46209812037329684 | True |
| | C | 1.3862943611198906 | True |
| | D | 0.28768207245178085 | False |
| ['D', 'A', 'A'] | A | 1.3862943611198906 | True |
| | D | 0.21576155433883565 | False |
| ['D', 'B'] | B | 0.6931471805599453 | True |
| | D | 0.28768207245178085 | False |
Notice that the empty document has no salient tokens.
Here is the set of salient tokens for each document (still using a threshold of 0.4).
| Document | Set of salient tokens |
|---|---|
| ['D', 'B', 'D', 'C', 'D', 'C', 'C'] | {'C', 'B'} |
| ['D', 'A', 'A'] | {'A'} |
| ['D', 'B'] | {'B'} |
| [] | set() |
Finally, the result of computing the salient tokens for the documents in the document collection is:
[{'C', 'B'}, {'A'}, {'B'}, set()]
Task 1.4: Computing salient tokens¶
Write a function
def find_salient(docs, threshold):
that takes a collection of documents and a floating point threshold and finds the set of salient words for each document (as defined in the previous section). The result of the function will be a list of sets, with one set per document.
You should think carefully about how to organize the code for implementing this algorithm. Do not put all the code for this task into one function.
Some things to keep in mind to write a correct, well-designed, efficient solution:
You should compute an intermediate data structure to keep track of the inverse document frequencies for the words in the document collection.
You should compute a data structure with a count of each word in a document, or a data structure with the (augmented) term frequency for each word in a document, or possibly both.
Remember that the set of salient words for an empty document is just the empty set.
Avoid doing the same calculation twice in your code (whether by repeating code, or by calling a function twice on the same inputs). For example, the inverse document frequency of a word applies to the whole collection; don’t re-calculate it for every document.
While the mathematical definition of tf–idf involves the (mathematical) functions \(\tf(t, d)\) and \(\idf(t, D)\), your (Python) helper functions might not correspond to those functions exactly. An efficient implementation in Python may include helper functions with more or fewer parameters.
You can use the functions you wrote earlier in the assignment as helper functions.
Testing your intermediate functions (in
ipython3
) as you go will make it easier to complete this task successfully.
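To illustrate the points above, here is one possible organization. This is a sketch under our own structure (with the token counting inlined); your decomposition may differ, for example by reusing count_tokens and other helpers.

import math

def find_salient(docs, threshold):
    """Sketch: compute the set of salient tokens for each document in docs."""
    # Compute each inverse document frequency exactly once for the collection.
    doc_freq = {}
    for doc in docs:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    idf = {term: math.log(len(docs) / df) for term, df in doc_freq.items()}

    salient = []
    for doc in docs:
        if not doc:
            # An empty document has no salient tokens.
            salient.append(set())
            continue
        counts = {}
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
        max_count = max(counts.values())
        tf = {term: 0.5 + 0.5 * count / max_count for term, count in counts.items()}
        salient.append({term for term in tf if tf[term] * idf[term] > threshold})
    return salient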
You can run our tests for this task by running the command:
$ py.test -xv -k salient test_basic_algorithms.py
Analyzing Tweets¶
Now that you have implemented the basic algorithms, you can start
analyzing Twitter feeds. For the rest of the tasks, put your code in
analyze.py
.
While we provide function headers for each of the required tasks, the structure of the rest of the code is up to you. Some tasks can be done cleanly with one function, others cannot. We expect you to look for common sub-tasks to abstract into functions and we expect you to reuse previously completed functions. This process of function decomposition and thoughtful reuse of functions is one of the keys to writing clean code. You may place your auxiliary functions anywhere in the file that makes sense to you. Remember to follow the course style guide, which requires that you write proper docstrings (header comments) for every function.
Data¶
Twitter enables us to search for tweets with particular properties,
say, from a particular user, containing specific terms, and within a
given range of dates. There are several Python libraries that
simplify the process of using this feature of Twitter. We used the
TwitterSearch library to
gather tweets from the Twitter feeds of @Conservatives
,
@UKLabour
, @theSNP
, and @LibDems
from the purdah period,
and we stored the resulting data in JSON files.
If you look in the data
directory that you downloaded earlier, you
will find files including:
Conservatives.json
UKLabour.json
LibDems.json
theSNP.json
Representing tweets¶
A single tweet is represented by a dictionary that contains a lot of
information: creation time, hashtags used, users mentioned, text of
the tweet, etc. For example, here is a tweet sent in mid-May by
@UKLabour
(variable tweet0
mentioned below):
RT @UKPatchwork: .@IainMcNicol and @UKLabour encourage you to #GetInvolved and get registered to vote here: https://t.co/2Lf9M2q3mP #GE2017…
and here is an abridged version of the corresponding tweet dictionary that includes a few of the 20+ key/value pairs:
{'created_at': 'Thu May 18 19:44:01 +0000 2017',
'entities': {'hashtags': [{'indices': [62, 74], 'text': 'GetInvolved'},
{'indices': [132, 139], 'text': 'GE2017'}],
'symbols': [],
'urls': [{'display_url': 'gov.uk/register-to-vo…',
'expanded_url': 'http://gov.uk/register-to-vote',
'indices': [108, 131],
'url': 'https://t.co/2Lf9M2q3mP'}],
'user_mentions': [{'id': 1597669326,
'id_str': '1597669326',
'indices': [3, 15],
'name': 'Patchwork Foundation',
'screen_name': 'UKPatchwork'},
{'id': 105573429,
'id_str': '105573429',
'indices': [18, 30],
'name': 'Iain McNicol',
'screen_name': 'IainMcNicol'},
{'id': 14291684,
'id_str': '14291684',
'indices': [35, 44],
'name': 'The Labour Party',
'screen_name': 'UKLabour'}]},
'text': 'RT @UKPatchwork: .@IainMcNicol and @UKLabour encourage you to #GetInvolved and get registered to vote here: https://t.co/2Lf9M2q3mP #GE2017…'}
Collectively, hashtags, symbols, user mentions, and URLs are referred
to as entities. These entities can be accessed through the
"entities"
key in the tweet’s dictionary. The value associated with
the key "entities"
is itself a dictionary that maps keys
representing entity types ("hashtags"
, "symbols"
, "urls"
,
"user_mentions"
) to lists of entities of that type. In general
the “entities” portion of a tweet dictionary has the structure:
'entities': {'key1': [{'subkey1': value11, 'subkey2': value21},
{'subkey1': value12, 'subkey2': value22},
...],
'key2': [{'subkey3': value31, 'subkey4': value41},
{'subkey3': value32, 'subkey4': value42},
...],
...}
For example, here’s a subset of the entities from the sample tweet:
'entities': {'hashtags': [{'indices': [62, 74], 'text': 'GetInvolved'},
{'indices': [132, 139], 'text': 'GE2017'}],
'urls': [{'display_url': 'gov.uk/register-to-vo…',
'expanded_url': 'http://gov.uk/register-to-vote',
'indices': [108, 131],
'url': 'https://t.co/2Lf9M2q3mP'}], ...}
Each individual entity is represented with a dictionary, whose form
depends on the entity type. For example, a hashtag will
include information about the text
of the hashtag and the
indices
where it occurred in the tweet:
{'indices': [62, 74], 'text': 'GetInvolved'}
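For example, given an entities dictionary with this structure, the text of every hashtag can be collected with one key lookup followed by a subkey lookup per entity (a quick illustration with made-up variable names):

entities = {'hashtags': [{'indices': [62, 74], 'text': 'GetInvolved'},
                         {'indices': [132, 139], 'text': 'GE2017'}]}
hashtag_texts = [hashtag['text'] for hashtag in entities['hashtags']]
# hashtag_texts == ['GetInvolved', 'GE2017']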
Exploring the data¶
To simplify the process of exploring the data and testing, we have
provided code in load_tweets.py
to load the tweets for you. Here
it is:
"""
Load the tweets for the Analyze Candidate Tweets assignment.
"""
import util
Conservatives = util.get_json_from_file("data/Conservatives.json")
UKLabour = util.get_json_from_file("data/UKLabour.json")
theSNP = util.get_json_from_file("data/theSNP.json")
LibDems = util.get_json_from_file("data/LibDems.json")
# sample tweet from the "Data" section
tweet0 = UKLabour[651]
# sample tweet from the "Pre-processing step" and "Representing
# n-grams" sections.
tweet1 = UKLabour[55]
The first four variables defined by load_tweets.py
—Conservatives
, UKLabour
, theSNP
, and LibDems
—define lists of
tweets. We have also defined variables for a few specific tweets used
in examples. tweet0
is the sample tweet shown above. tweet1
is used in the preprocessing discussion below.
Run this code in ipython3
to gain access to these variables:
In [9]: run load_tweets
(You could import the file instead, but then you would need to qualify
every name with load_tweets
, for example, load_tweets.theSNP
.)
We encourage you to play around with a couple of the tweet dictionaries to become familiar with how to access information in the data structure before you move on to the next task.
Part 2: Finding commonly-occurring entities¶
What are @theSNP
’s favorite hashtags? Which URLs are referenced at
least 5 times by @LibDems
? To answer these questions we must
extract the desired entities (hashtags, user mentions, etc.) from the
parties’ tweets and process them.
Common parameters for Part 2¶
The next two tasks will use two of the same parameters. To avoid repetition later, we describe them here:
tweets is a list of dictionaries representing tweets.
entity_desc is a tuple with three values:
the entity type of interest (the key),
the information of interest for that type of entity (the subkey), and
a boolean indicating whether the values are case sensitive.
For example, we use the tuple ("hashtags", "text", False) to describe hashtags, because:
"hashtags" is the entity type (and the key for extracting the hashtags out of the entities dictionary),
"text" is the subkey (for extracting the text of the hashtag from the entity dictionary), and
we are treating hashtags as case insensitive: when working with hashtags we will consider "#bbcqt" (for example) to be the same as "#BBCQT". (On the other hand, URLs are case sensitive, so we’d use ("urls", "url", True) as the entity description for URLs.)
(You will find the string lower
method useful for handling entities that are not case sensitive.)
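As a sketch of how an entity description might be used (the helper name and structure below are ours, not required), the key, subkey, and case-sensitivity flag can be unpacked from the tuple and applied to every entity of that type in every tweet:

def extract_entity_values(tweets, entity_desc):
    """Sketch: collect one value (for example, the hashtag text) per matching entity."""
    key, subkey, case_sensitive = entity_desc
    values = []
    for tweet in tweets:
        for entity in tweet['entities'][key]:
            value = entity[subkey]
            if not case_sensitive:
                value = value.lower()
            values.append(value)
    return values

A list built this way can then be passed directly to the functions you wrote in Part 1.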
Task 2.1: Top k entities¶
For Task 2.1, you will write a function that finds the top k most
common entities in a list of tweets using the algorithm you wrote in
basic_algorithms.py
. You must complete the function
def find_top_k_entities(tweets, entity_desc, k):
where the first two parameters are as described above and k
is an
integer. This function, which is in analyze.py
, should return a
list of the k
most common entities. As in Task 1.2, the most common
entity should come first, the next most common entity should come
second, etc.
Here’s a sample call using the tweets from @theSNP
with
("hashtags", "text", False)
as the entity description and a value
of three for k
:
In [10]: analyze.find_top_k_entities(theSNP, ("hashtags", "text", False), 3)
Out[10]: ['votesnp', 'ge17', 'snpbecause']
After completing this task, you should pass all tests in:
$ py.test -xv -k top_k_entities test_analyze.py
The first few tests are simplified examples to test corner cases, while the remaining tests use data from the data folder.
If your counts are off, make sure you are handling case sensitivity properly.
Task 2.2: Minimum number of occurrences¶
For Task 2.2, you will find all entities that appear a minimum number of
times using the min_count
algorithm you wrote earlier. You must
complete the function
def find_min_count_entities(tweets, entity_desc, min_count):
where the first two parameters are as described above and
min_count
is an integer that specifies the minimum number of times
an entity must occur to be included in the result. This function
should return the set of the entities that occurred at least min_count
times.
Here’s a sample use of this function using the tweets from
@LibDems
:
In [11]: analyze.find_min_count_entities(LibDems, ("user_mentions", "screen_name", True), 100)
Out[11]: {'LibDemPress', 'LibDems', 'LiberalBritain', 'timfarron'}
After completing this task, you should pass all tests in:
$ py.test -xv -k min_count_entities test_analyze.py
Part 3: Analyzing n-grams¶
What additional insights might we gain by analyzing words and word sequences?
In this part, you will apply the three algorithms described earlier to contiguous sequences of n words, which are known as n-grams. Before you apply these algorithms to a candidate’s tweets, you will pre-process the tweets to remove distractions (like symbols, punctuation, and common words—see below) and then extract the n-grams. Your solution should use helper functions to avoid duplicated code.
Abridged text
To simplify the work of debugging these tasks, we constructed an
abridged version of the text of each tweet by replacing the emojis and
symbols with spaces. You can find the abridged text in a tweet’s
dictionary under the key: abridged_text
.
You must use the abridged text for this part of the assignment.
To-do: Pre-processing step¶
You will need to pre-process the text of the tweets: you will convert the (abridged) text into a list of strings. While it is up to you how to structure your code, your code must have the same effect as following the steps below in order. You can find all named constants in the file analyze.py
and can refer to them by their names in your code.
Turn the abridged text of a tweet into a list of words. We define a word to be any sequence of characters delimited by whitespace. For example, “abc”, “10”, and “#2017election.” are all words by this definition.
Remove any leading and trailing punctuation from each word. We have defined a constant, PUNCTUATION, that specifies which characters constitute punctuation for this assignment. Note that apostrophes, as in the word “it’s”, should not be removed, because they occur in the middle of the word.
Eliminate a word if stripping the punctuation from it yields the empty string ("").
For tasks that are not case sensitive, convert the word to lower case. In some tasks, we will want to treat the words as case sensitive; in others we will not.
For the tasks that require it, eliminate all stop words. Stop words are commonly-occurring words, such as “and”, “of”, and “to”, that convey little insight into what makes one tweet different from another. We defined the set of stop words we will use for this assignment in the constant STOP_WORDS.
Remove URLs, hashtags, and mentions. We defined a constant STOP_PREFIXES (which includes "#", "@", and "http"). You should complete this step by eliminating all words that begin with any of the strings in STOP_PREFIXES.
Notes and Suggestions
While we described the pre-processing step in terms of which words to eliminate, it is actually much easier to write code that constructs a list of the words that should be included, rather than repeatedly removing words that should be excluded. In particular, you cannot remove elements from a list inside a for loop that is iterating through that list (Python may not stop you from trying, but it will produce unexpected behavior).
We suggest you write a function that performs the above steps (one possible shape for such a function is sketched after the example below). Your preprocessing function needs to handle the fact that in some cases we care about the original case of the words and in others we don’t. Similarly, for some tasks, we will want to remove stop words and for others we will not. You are welcome to implement additional helper functions, but it would be bad practice to write two essentially identical functions, where one converts the words to lower case and the other does not, or where one removes stop words and the other does not (and remember that you are being graded on having good coding practice).
Important: You will find the lower, split, strip, and startswith methods from the string API useful for this step. You will save yourself a lot of time and unnecessary work if you read about these methods in detail and experiment with them in ipython3 before you start writing code.
Example
Here is the abridged text from tweet1
(as defined in
load_tweets.py
):
Things you need to vote
Polling card?
ID?
A dog?
Just yourself
You've got until 10pm – #VoteLabour now… https://t.co/sCDJY1Pxc9
Pre-processing it for a task that is not case sensitive and for which stop words should be removed would yield:
['things', 'need', 'vote', 'polling', 'card', 'id', 'dog',
'just', 'yourself', "you've", 'got', 'until', '10pm', 'now']
Pre-processing it for a task that is case sensitive and for which stop words should not be removed would yield:
['Things', 'you', 'need', 'to', 'vote', 'Polling', 'card',
'ID', 'A', 'dog', 'Just', 'yourself', "You've", 'got',
'until', '10pm', 'now']
We strongly recommend testing your pre-processing code by hand in
ipython3
before you move on to the next part of the assignment.
Not doing so is likely to cost you a lot of debugging time later.
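Here is one possible shape for such a pre-processing function. This is only a sketch under our own parameter names; it assumes PUNCTUATION is a string of characters and that STOP_WORDS and STOP_PREFIXES are iterables of strings, as provided in analyze.py.

def preprocess(abridged_text, case_sensitive, remove_stop_words):
    """Sketch: turn a tweet's abridged text into a cleaned list of words."""
    words = []
    for word in abridged_text.split():
        # strip leading and trailing punctuation only
        word = word.strip(PUNCTUATION)
        if word == "":
            continue
        if not case_sensitive:
            word = word.lower()
        if remove_stop_words and word in STOP_WORDS:
            continue
        # drop URLs, hashtags, and mentions
        if any(word.startswith(prefix) for prefix in STOP_PREFIXES):
            continue
        words.append(word)
    return words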
To-do: Representing n-grams¶
Your code should compute the n-grams of a tweet after pre-processing the tweet’s abridged text. These n-grams should be represented as tuples of strings. Consider the following list (which came from pre-processing):
['things', 'need', 'vote', 'polling', 'card', 'id', 'dog',
'just', 'yourself', "you've", 'got', 'until', '10pm', 'now']
Taking \(n = 2\) would yield the following bi-grams (2-grams):
[('things', 'need'),
('need', 'vote'),
('vote', 'polling'),
('polling', 'card'),
('card', 'id'),
('id', 'dog'),
('dog', 'just'),
('just', 'yourself'),
('yourself', "you've"),
("you've", 'got'),
('got', 'until'),
('until', '10pm'),
('10pm', 'now')]
Notes
The n-gram does not “circle back” to the beginning. For example, the last word of the tweet and the first word of the tweet do not comprise a bi-gram (so ('now', 'things') is not included).
You should not combine words from different tweets in a single n-gram.
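One compact way to build the n-grams from an already pre-processed list of words is to take consecutive slices (a sketch only; the helper name is ours, and any equivalent loop is fine):

def make_ngrams(words, n):
    """Sketch: consecutive n-word tuples from a single tweet's word list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# make_ngrams(['things', 'need', 'vote'], 2)
# == [('things', 'need'), ('need', 'vote')]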
Common parameters for Part 3¶
The rest of the tasks have three parameters in common:
tweets is a list of dictionaries representing tweets (for example, theSNP).
n is the number of words in an n-gram.
case_sensitive is a boolean that is True if the task is case sensitive.
Task 3.1: Top k n-grams¶
You will apply your previously written find_top_k
function to find
the most commonly occurring n-grams. Your task is to implement the
function:
def find_top_k_ngrams(tweets, n, case_sensitive, k):
where the first three parameters are as described above and k
is
an integer. This function should
return a list of the k
most common n
-grams. As in Task 1.2,
the most common n
-gram should come first, followed by the second
most common, etc.
Stop words should be removed for this task.
Here’s a sample use of this function using the tweets from @theSNP
with n=2
, case_sensitive=False
and k=3
:
In [12]: analyze.find_top_k_ngrams(theSNP, 2, False, 3)
Out[12]: [('nicola', 'sturgeon'), ('read', 'more'), ('stand', 'up')]
After completing this task, you should pass all tests in:
$ py.test -xv -k top_k_ngrams test_analyze.py
Task 3.2: Minimum number of n-gram occurrences¶
Similarly, you will apply your find_min_count
function to find the
n-grams that appear at least a minimum number of times. Your task is
to implement the function:
def find_min_count_ngrams(tweets, n, case_sensitive, min_count):
where the first three parameters are as described above and
min_count
specifies the minimum number of times an n-gram must
occur to be included in the result. This function should return a
set of the n-grams that occur at least a minimum number of times.
Stop words should be removed for this task.
Here’s a sample use of this function using the tweets from @LibDems with n=2, case_sensitive=True, and min_count=100:
In [13]: analyze.find_min_count_ngrams(LibDems, 2, True, 100)
Out[13]:
{('Lib', 'Dems'),
('RT', 'Only'),
('Theresa', 'May'),
('can', 'stop'),
('stand', 'up'),
('stop', 'Tories'),
('will', 'stand')}
After completing this task, you should pass all tests in:
$ py.test -xv -k min_count_ngrams test_analyze.py
Task 3.3: Salient n-grams¶
Finally, you will use your find_salient
function from Part 1 to
find the salient n-grams in each tweet in a collection. Your task is to
implement the function:
def find_salient_ngrams(tweets, n, case_sensitive, threshold):
where the first three parameters are as described above and
threshold
is the cutoff tf–idf score above which an n-gram is considered salient. This function should return a list of sets of
salient n-grams, one set per tweet.
Stop words should not be removed for this task.
Here’s a sample use of this function:
In [14]: tweets = [ {"abridged_text": "the cat in the hat" },
...: {"abridged_text": "don't let the cat on the hat" },
...: {"abridged_text": "the cat's hat" },
...: {"abridged_text": "the hat cat" }]
...:
In [15]: analyze.find_salient_ngrams(tweets, 2, False, 1.3)
Out[15]:
[{('cat', 'in'), ('in', 'the')},
{('cat', 'on'), ("don't", 'let'), ('let', 'the'), ('on', 'the')},
{("cat's", 'hat'), ('the', "cat's")},
{('hat', 'cat')}]
Notice that when constructing the n-grams for this task, we did not
remove the stop words (for example, “the” is in STOP_WORDS
, but
appears in the result).
After completing this task, you should pass all tests in:
$ py.test -xv -k salient_ngrams test_analyze.py
Grading¶
The assignment will be graded according to the Rubric for Programming Assignments. Make sure to read that page carefully; the remainder of this section explains aspects of the grading that are specific to this assignment.
In particular, your completeness score will be determined solely on the basis of the automated tests, which provide a measure of how many of the tasks you have completed:
| Grade | Percent tests passed |
|---|---|
| Exemplary | at least 95% |
| Satisfactory | at least 80% |
| Needs Improvement | at least 50% |
| Ungradable | less than 50% |
For example, if your implementation passes 85% of the tests, you will earn an S (satisfactory) score for completeness.
The code quality score will be based on the general criteria in the Rubric for Programming Assignments but, in this programming assignment, we will also be paying special attention to a few issues in particular.
The most common class of issues can be summarized as “iterating through a data structure too many times.” It includes the following:
Iterating through a dictionary to find a key instead of accessing the key directly. Don’t write

for key, val in d.items():
    if key == "Golden Snitch":
        print(val)

Instead write

print(d["Golden Snitch"])
Computing the same value multiple times within the same task. Are you computing all the frequencies for a document, and then re-computing them from scratch to find the maximum frequency? You could instead store the frequencies, and then refer to those stored values when finding the maximum frequency. More generally, is there anywhere that you call a function multiple times with the exact same inputs within a task? Instead, you could call that function just once and store the output.
The use of unnecessary loops: Are you looping through a list of tokens multiple times when a single pass through it is enough? Be careful with the count method! It uses a for loop internally, so each call to that method on a list counts as a full pass through the list.
Constructing unnecessary lists: Are you constructing a list that is not actually needed to perform a certain computation? For example, in your pre-processing code, are you making a list of words with punctuation removed, then another list with punctuation removed and with the empty string removed, then another list with everything converted to lowercase? Try to instead take care of all of these steps in a single pass of the list, to construct a single, final output list.
We will also be paying special attention to the following:
Not using decomposition: Is your code for pre-processing and your code for computing n-grams within your definition of the find_top_k_ngrams function? You need to decompose the problem into small tasks and implement those tasks in helper functions.
Repeating code instead of using helper functions: You should never copy-paste the code from one function to another if you need to repeat a task. Instead, put code into a helper function that you can call multiple times.
Writing redundant helper functions: Do you have two or more functions with very similar code? Do you have two or more functions that compute the same result (even if the code looks different)? Think about how you can use the existing functions in your code before writing new ones.
Using a while loop when a for loop would be more appropriate: When iterating over a list or a defined range of values, you should always use a for loop.
While these are the main things we care about in this assignment, please remember that it is not possible for us to give you an exhaustive list of every single thing that could affect your code quality score (and that thinking in those terms is generally counterproductive to learning how to program; see our How to Learn in this Class page for more details).
In general, avoiding all of the above issues will increase the likelihood of getting an E; if your code has a few (but not most) of the above issues, that will likely result in an S; if your code suffers from most of those issues, it will likely get an N. That said, to reiterate, we could also find other issues in your code that we did not anticipate, but that end up affecting your code quality score. When that happens, we encourage you to see these as opportunities to improve your code in future assignments (or as specific things to change if you decide to resubmit the assignment).
And don’t forget that style also matters! You could avoid all of the above issues, and still not get an E because of style issues in your code. Make sure to review the general guidelines in the Rubric for Programming Assignments, as well as our Style Guide.
Cleaning up¶
Before you submit your final solution, you should remove:
any print statements that you added for debugging purposes, and
all in-line comments of the form “YOUR CODE HERE” and “REPLACE …”
Also, check your code against the style guide. Did you use good variable and function names? Do you have any lines that are too long? Did you remember to include a header comment for all of your functions?
Do not remove header comments, that is, the triple-quote strings that describe the purpose, inputs, and return values of each function.
As you clean up, you should periodically save your file and run your code through the tests to make sure that you have not broken it in the process.
Submission¶
You will be submitting your work through Gradescope (linked from our Canvas site). The process will be the same as with previous coursework: Gradescope will fetch your files directly from your PA3 repository on GitHub, so it is important that you remember to commit and push your work! You should also get into the habit of making partial submissions as you make progress on the assignment; remember that you’re allowed to make as many submissions as you want before the deadline.
To submit your work, go to the “Gradescope” section on our Canvas site. Then, click on “Programming Assignment #3”. If you completed previous assignments, Gradescope should already be connected to your GitHub account. If it isn’t, you will see a “Connect to GitHub” button. Pressing that button will take you to a GitHub page asking you to confirm that you want to authorize Gradescope to access your GitHub repositories. Just click on the green “Authorize gradescope” button.
Then, under “Repository”, make sure to select your uchicago-cmsc12100-aut-21/pa3-$GITHUB_USERNAME.git
repository.
Under “Branch”, just select “main”.
Finally, click on “Upload”. An autograder will run, and will report back a score. Please note that this autograder runs the exact same tests (and the exact same grading script) described in Testing Your Code. If there is a discrepancy between the tests when you run them on your computer, and when you submit your code to Gradescope, please let us know.