Analyzing Candidate Tweets

Due: Friday, October 30 at 9pm CDT.

You must work alone on this assignment.

The purpose of this assignment is to give you a chance to practice

  • using dictionaries to represent data and as a mechanism for mapping keys to values that change over time,

  • using functions to avoid repeated code,

  • structuring a task into logical, easy-to-test pieces, and

  • reading the documentation for library functions.

Introduction

On April 18th, 2017, Theresa May, then Prime Minister of the United Kingdom, announced that she would call a snap election. This news came as a bit of a surprise since the next election was not due to be held for several years. Her stated reason for calling the election was a desire to strengthen the UK Government’s hand in negotiations with the European Union (EU) over Britain’s exit from the EU (colloquially referred to as Brexit). While the election did not necessarily play out to Prime Minister May’s satisfaction, it did yield a trove of tweets that we can mine for insight into British politics.

Unlike US presidential elections, which seem to last forever, in the UK, general elections are quite short. The period between when a general election is officially called and the date of the election is typically six weeks or so. During this pre-election period, known as purdah, civil servants are restricted from certain activities that might be viewed as biased towards the party currently in power. Purdah ran from April 22nd until June 9th for the 2017 General Election.

For this assignment, we’ll be analyzing tweets sent from the official Twitter accounts of four parties: the Conservative and Unionist Party (@Conservatives), the Labour Party (@UKLabour), the Liberal Democrats (@LibDems) and the Scottish National Party (@theSNP) during purdah. We’ll ask questions such as:

  • What was @Conservatives’s favorite hashtag during purdah? [#bbcqt]

  • Who was mentioned at least 50 times by @UKLabour? [@jeremycorbyn]

  • What words occurred most often in @theSNP’s tweets? [snp, scotland, our, have, more]

  • What two-word phrases occurred most often in @LibDems’s tweets? [stand up, stop tories, will stand, theresa may, lib dems]

For those of you who do not follow British politics, a few notes:

  1. Jeremy Corbyn was the leader of the Labour Party during the 2017 election.

  2. Tim Farron was the leader of the Liberal Democrats during the 2017 election.

  3. Theresa May was the leader of the Conservatives during the 2017 election.

  4. Nicola Sturgeon has been the leader of the Scottish National Party since 2014.

  5. The Conservatives are also known as the Tories.

  6. The hashtag #bbcqt refers to BBC Question Time, a political debate program broadcast by the British Broadcasting Corporation (BBC).

Getting started

Before you start working on the assignment’s tasks, please take a moment to follow the steps described on the Coursework Basics page to get the files for this assignment (these steps are the same as the ones you followed to get the files for previous assignments). Please note that you will not be able to start working on the assignment until you fetch the assignment files, which will appear in a pa3 directory in your repository.

On this assignment, you may assume that the input passed to your functions has the correct format. You may not alter any of the input that is passed to your functions. In general, it is bad style to modify a data structure passed as input to a function, unless that is the explicit purpose of the function. If someone calls your function, they might have other uses for the data and should not be surprised by unexpected changes.

Because the data files for this assignment are large, we did not include them in the GitLab distribution. Instead, please run the shell script get_files.sh to download the data. To run the script, change to your pa3 directory and run this command from the Linux command-line:

$ ./get_files.sh

This script will create directories named data and tests and will download one file per party:

  • Conservatives.json

  • UKLabour.json

  • LibDems.json

  • theSNP.json

plus some files used by the test code.

Then as usual, fire up ipython3 from the Linux command-line, set up autoreload, and import your code and the json library as follows:

$ ipython3

In [1]: %load_ext autoreload

In [2]: %autoreload 2

In [3]: import basic_algorithms, analyze

In [4]: import json

The json library will be needed when you recreate our tests in ipython3.

Part 1: Basic algorithms

Algorithms for efficiently counting and sorting distinct tokens, or unique values, are widely used in data analysis. For example, we might want to find the most frequent keyword used in a document or the top 10 most popular links clicked on a website. In Part 1, you will implement two such algorithms: find_top_k and find_min_count. You will also implement an algorithm for finding the most salient (that is, the most important or notable) tokens for a document in the context of a document collection.

In this part, you will add code to basic_algorithms.py to implement the algorithms described in the following subsections.

Task 1.1: Counting distinct tokens

The first step is to write a helper function, count_tokens, that counts distinct tokens. This function takes as input a list of tokens (strings in our examples, but they can be any immutable type) and returns a dictionary that maps tokens to counts.

For example, if we have a list:

['A', 'B', 'C', 'A']

the function should yield (the exact order of the key-value pairs in the dictionary is not important):

{'A': 2, 'B': 1, 'C': 1}

Notes:

  1. Do not use Python libraries other than what is already imported. (For example, you may not use collections.Counter.)

  2. If you use Python’s list.count method, your solution will be inefficient and you will not receive credit for this task.

We provide test code for many of the tasks in this assignment, but not this one. We encourage you to test your implementation using ipython3.
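To make the dictionary-update pattern concrete, here is a minimal sketch of one possible implementation (yours may be organized differently):

def count_tokens(tokens):
    """
    Count the number of times each distinct token occurs in a list.
    A minimal sketch of one possible approach.
    """
    counts = {}
    for token in tokens:
        # start at zero for a token we have not seen before, then add one
        counts[token] = counts.get(token, 0) + 1
    return counts

For example, count_tokens(['A', 'B', 'C', 'A']) returns {'A': 2, 'B': 1, 'C': 1}.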

Task 1.2: Top K

You will write the algorithm find_top_k, which takes a list of tokens and a nonnegative integer K, and returns a list of the K tokens that occur most frequently in the list. The result should be sorted in descending order of token frequency.

Here is an example use of this function:

In [4]: l = ['D', 'B', 'C', 'D', 'D', 'B', 'D', 'C', 'D', 'A']

In [5]: basic_algorithms.find_top_k(l, 2)
Out[5]: ['D', 'B']

To do this computation, you will need to:

  1. Count the number of times each token appears.

  2. Convert that data into a list of (token, count) tuples.

  3. Sort the resulting list in decreasing order by count.

  4. Extract the K tokens that occurred most frequently.

The standard Python sort function sorts pairs using the first value as the primary key and the second value as the secondary key (to break ties), and so it is not suitable for this task. Instead, you need a sort function that uses the second value as the primary key and the first value to break ties. We have provided such a function, sort_count_pairs, in util.py. To learn about this function, you can read its docstring or import util into ipython3 and run help(util.sort_count_pairs).
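To make these four steps concrete, here is one possible shape for find_top_k. It reuses count_tokens from Task 1.1 and assumes (check the docstring for the actual interface) that util.sort_count_pairs takes a list of (token, count) tuples and returns a new list sorted as described above:

import util

def find_top_k(tokens, k):
    """
    Find the k most frequently occurring tokens.
    A sketch, not the only reasonable design.
    """
    counts = count_tokens(tokens)                    # step 1: count each token
    pairs = list(counts.items())                     # step 2: (token, count) tuples
    sorted_pairs = util.sort_count_pairs(pairs)      # step 3: assumed to sort by count, descending
    return [token for token, _ in sorted_pairs[:k]]  # step 4: keep the first k tokens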

You can test your code for this task by running:

$ py.test -xv -k top_k test_basic_algorithms.py

(As usual, we use $ to signal the Linux command-line prompt. You should not type the $.)

If you have not done so already, please follow the instructions for downloading the data in the Getting started section before you run this test.

Task 1.3: Minimum number of occurrences

You will write the algorithm find_min_count, which computes a set of the tokens in a list that occur at least some specified minimum number of times.

Here is an example use of this function:

In [6]: l = ['D', 'B', 'C', 'D', 'D', 'B', 'D', 'C', 'D', 'A']

In [7]: basic_algorithms.find_min_count(l, 2)
Out[7]: {'D', 'B', 'C'}

Recall that sets are not ordered, so the result of this call could just as easily have been {'B', 'C', 'D'}.
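Since find_min_count needs the same counts as find_top_k, it can be a thin layer over count_tokens. A minimal sketch:

def find_min_count(tokens, min_count):
    """
    Find the set of tokens that occur at least min_count times (a sketch).
    """
    counts = count_tokens(tokens)
    # keep only the tokens whose count reaches the threshold
    return {token for token, count in counts.items() if count >= min_count}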

After writing this function you should be able to pass all tests in:

$ py.test -xv -k min_count test_basic_algorithms.py

Salient tokens

The most frequent tokens in the list may not be the most salient. For example, if the list contains words from an English-language document, the fact that the words “a”, “an”, and “the” occur frequently is not at all surprising.

According to Wikipedia, term frequency–inverse document frequency (aka tf–idf) is a statistic designed to reflect how important a word is to a document in a collection or corpus, and it is often used as a weighting factor in information retrieval and text mining. A word or term is considered salient to a particular document if it occurs frequently in that document, but not in the document collection overall.

Definitions

Term frequency–inverse document frequency is defined as:

\[ \DeclareMathOperator{\tfidf}{\textsf{tf_idf}} \DeclareMathOperator{\tf}{\textsf{tf}} \DeclareMathOperator{\idf}{\textsf{idf}} \DeclareMathOperator{\salient}{\textsf{salient}} \tfidf(t, d, D) = \tf(t, d) \cdot \idf(t, D)\]

where \(t\) is a term, \(d\) is a document (collection of terms), \(D\) is a collection of documents, and \(\tf\) and \(\idf\) are defined below.

There are several variants of both term frequency (\(\tf\)) and inverse document frequency (\(\idf\)) that can be used to compute \(\tfidf\). We will be using augmented frequency as our measure of term frequency to prevent bias towards longer documents, and we will use basic inverse document frequency.

The augmented frequency of a term \(t\) in a document \(d\) is defined as

\[\tf(t, d) = 0.5 + 0.5 \cdot \left ( \frac{f_{t,d}}{\max(\{f_{t^\prime,d}: t^\prime \in d\})} \right )\]

where \(f_{t,d}\) is the number of times the term \(t\) appears in the document \(d\).

The basic inverse document frequency of a term \(t\) in a document collection \(D\) is defined as

\[\idf(t, D) = \log \left ( \frac{N}{\lvert \{d \in D : t \in d\} \rvert} \right )\]

where \(N\) is the number of documents in the document collection \(D\) and where \(\lvert \{d \in D : t \in d\} \rvert\) is the number of documents that term \(t\) occurs in. Use the natural log (math.log) in your calculation of \(\idf\).

Given a threshold \(T\), the set of salient words for a document \(d\) in a document collection \(D\) is defined as:

\[\salient(d, D, T) = \{ t : t \in d \text{ and } \tfidf(t, d, D) > T \}.\]

Example

In this section, we will work through the computation of the salient tokens for each document in a sample document collection. Here is the sample collection:

  [['D', 'B', 'D', 'C', 'D', 'C', 'C'],
   ['D', 'A', 'A'],
   ['D', 'B'],
   []]

Here are the augmented term frequency values for each document:

  Document: ['D', 'B', 'D', 'C', 'D', 'C', 'C']
  Token: TF
    B:  0.6666666666666666
    C:  1.0
    D:  1.0

  Document: ['D', 'A', 'A']
  Token: TF
    A:  1.0
    D:  0.75

  Document: ['D', 'B']
  Token: TF
    B:  1.0
    D:  1.0

  Document: []

There are no augmented term frequency values for the last document because it is empty.

Here are the inverse document frequency values for the document collection:

  Token: IDF
    A: 1.3862943611198906
    B: 0.6931471805599453
    C: 1.3862943611198906
    D: 0.28768207245178085

Here is the salience computation for each document using a threshold of 0.4.

  Document: ['D', 'B', 'D', 'C', 'D', 'C', 'C']
  Token: Salient? (TF-IDF)
    B: True (0.46209812037329684)
    C: True (1.3862943611198906)
    D: False (0.28768207245178085)
  Salient tokens: {'B', 'C'}

  Document: ['D', 'A', 'A']
  Token: Salient? (TF-IDF)
    A: True (1.3862943611198906)
    D: False (0.21576155433883565)
  Salient tokens: {'A'}

  Document: ['D', 'B']
  Token: Salient? (TF-IDF)
    B: True (0.6931471805599453)
    D: False (0.28768207245178085)
  Salient tokens: {'B'}

  Document: []
  Salient tokens: set()

Notice that the empty document has no salient tokens.

Finally, the result of computing the salient tokens for the documents in the document collection is:

    [{'B', 'C'}, {'A'}, {'B'}, set()]
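If your intermediate values do not match this example, it can help to reproduce one number by hand. For instance, for the token 'B' in the first document: 'B' occurs once, the most frequent tokens ('C' and 'D') each occur three times, and 'B' appears in 2 of the 4 documents:

import math

tf_b = 0.5 + 0.5 * (1 / 3)   # augmented term frequency of 'B' in the first document
idf_b = math.log(4 / 2)      # 4 documents in the collection; 'B' occurs in 2 of them
print(tf_b * idf_b)          # roughly 0.4621, matching B's entry in the table above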

Task 1.4: Computing salient tokens

In this task, you will write the algorithm find_salient, which takes a collection of documents (docs) and a floating point threshold (threshold) and finds the set of salient words for each document (as defined in the previous section). The result of the function will be a list of sets, with one set per document.

You should think carefully about how to organize the code for implementing this algorithm. Do not put all the code for this task into one function (one possible decomposition is sketched after the list below).

Some things to keep in mind:

  • You will need an intermediate data structure to keep track of the inverse document frequencies for the words in the document collection.

  • You may find it helpful to compute a data structure with the term frequencies for the words in a document, but it is not strictly necessary.

  • Remember that the set of salient words for an empty document is just the empty set.

  • Testing your intermediate functions (in ipython3) as you go will make it easier to complete this task successfully.
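As a starting point, here is one possible decomposition; the function bodies are left for you to fill in, and the helper names are only suggestions:

import math

def compute_idf(docs):
    """Map each token in the collection to its inverse document frequency."""
    # Suggestion: for each token, count the number of documents it occurs in
    # (sets are handy here), then apply math.log(N / num_docs_with_token).
    ...

def find_salient_in_doc(doc, idf, threshold):
    """Compute the set of salient tokens for a single document."""
    # Suggestion: compute the augmented term frequencies for doc, multiply by
    # the precomputed idf values, and keep the tokens above the threshold.
    # Remember that an empty document has no salient tokens.
    ...

def find_salient(docs, threshold):
    """Compute the salient tokens for each document in a collection."""
    idf = compute_idf(docs)
    return [find_salient_in_doc(doc, idf, threshold) for doc in docs]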

You can run our tests for this task by running the command:

$ py.test -xv -k salient test_basic_algorithms.py

Analyzing Tweets

Now that you have implemented the basic algorithms, you can start analyzing Twitter feeds. For the rest of the tasks, put your code in analyze.py.

While we provide function headers for each of the required tasks, the structure of the rest of the code is up to you. Some tasks can be done cleanly with one function, others cannot. We expect you to look for common sub-tasks to abstract into functions and we expect you to reuse previously completed functions. This process of function decomposition and thoughtful reuse of functions is one of the keys to writing clean code. You may place your auxiliary functions anywhere in the file that makes sense to you. Remember to follow the course style guide, which requires that you write proper docstrings (header comments) for every function.

Data

Twitter enables us to search for tweets with particular properties, say, from a particular user, containing specific terms, and within a given range of dates. There are several Python libraries that simplify the process of using this feature of Twitter. We used the TwitterSearch library to gather tweets from the Twitter feeds of @Conservatives, @UKLabour, @theSNP, and @LibDems from the purdah period, and we stored the resulting data in JSON files.

If you look in the data directory that you downloaded earlier, you will find files including:

  • Conservatives.json

  • UKLabour.json

  • LibDems.json

  • theSNP.json

Representing tweets

A single tweet is represented by a dictionary that contains a lot of information: creation time, hashtags used, users mentioned, text of the tweet, etc. For example, here is a tweet sent in mid-May by @UKLabour (variable tweet0 mentioned below):

RT @UKPatchwork: .@IainMcNicol and @UKLabour encourage you to #GetInvolved and get registered to vote here: https://t.co/2Lf9M2q3mP #GE2017…

and here is an abridged version of the corresponding tweet dictionary that includes a few of the 20+ key/value pairs:

{'created_at': 'Thu May 18 19:44:01 +0000 2017',
 'entities': {'hashtags': [{'indices': [62, 74], 'text': 'GetInvolved'},
                           {'indices': [132, 139], 'text': 'GE2017'}],
              'symbols': [],
              'urls': [{'display_url': 'gov.uk/register-to-vo…',
                        'expanded_url': 'http://gov.uk/register-to-vote',
                        'indices': [108, 131],
                        'url': 'https://t.co/2Lf9M2q3mP'}],
              'user_mentions': [{'id': 1597669326,
                                 'id_str': '1597669326',
                                 'indices': [3, 15],
                                 'name': 'Patchwork Foundation',
                                 'screen_name': 'UKPatchwork'},
                               {'id': 105573429,
                                'id_str': '105573429',
                                'indices': [18, 30],
                                'name': 'Iain McNicol',
                                'screen_name': 'IainMcNicol'},
                               {'id': 14291684,
                                'id_str': '14291684',
                                'indices': [35, 44],
                                'name': 'The Labour Party',
                                'screen_name': 'UKLabour'}]},
 'text': 'RT @UKPatchwork: .@IainMcNicol and @UKLabour encourage you to #GetInvolved and get registered to vote here: https://t.co/2Lf9M2q3mP #GE2017…'}

Collectively, hashtags, symbols, user mentions, and URLs are referred to as entities. These entities can be accessed through the "entities" key in the tweet’s dictionary. The value associated with the key "entities" is itself a dictionary that maps keys representing entity types ("hashtags", "symbols", "urls", "user_mentions") to lists of entities of that type. In general the entities data structure has the form:

'entities': {'key1': [{'subkey1': value11, 'subkey2': value21},
                      {'subkey1': value12, 'subkey2': value22},
                      ...],
             'key2': [{'subkey3': value31, 'subkey4': value41},
                      {'subkey3': value32, 'subkey4': value42},
                     ...],
             ...}

For example, here’s a subset of the entities from the sample tweet:

'entities': {'hashtags': [{'indices': [62, 74], 'text': 'GetInvolved'},
                          {'indices': [132, 139], 'text': 'GE2017'}],
             'urls': [{'display_url': 'gov.uk/register-to-vo…',
                       'expanded_url': 'http://gov.uk/register-to-vote',
                       'indices': [108, 131],
                       'url': 'https://t.co/2Lf9M2q3mP'}], ...}

Each individual entity is represented with a dictionary, whose form depends on the entity type. For example, a hashtag will include information about the text of the hashtag and the indices where it occurred in the tweet:

{'indices': [62, 74], 'text': 'GetInvolved'}
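For instance, given a tweet dictionary like the one above (available as tweet0 once you run load_tweets.py, described in the next section), you can collect the text of every hashtag with a list comprehension:

[hashtag["text"] for hashtag in tweet0["entities"]["hashtags"]]
# -> ['GetInvolved', 'GE2017']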

Exploring the data

To simplify the process of exploring the data and testing, we have provided code in load_tweets.py to load the tweets for you. Here it is:

"""
Load the tweets for the Analyze Candidate Tweets assignment.
"""

import util

Conservatives = util.get_json_from_file("data/Conservatives.json")
UKLabour = util.get_json_from_file("data/UKLabour.json")
theSNP = util.get_json_from_file("data/theSNP.json")
LibDems = util.get_json_from_file("data/LibDems.json")

# sample tweet from the "Data" section
tweet0 = UKLabour[651]

# sample tweet from the "Pre-processing step" and "Representing
# N-grams" sections.
tweet1 = UKLabour[55]

The first four variables defined by load_tweets.py (Conservatives, UKLabour, theSNP, and LibDems) are lists of tweets. We have also defined variables for a few specific tweets used in examples: tweet0 is the sample tweet shown above, and tweet1 is used in the pre-processing discussion below.

Run this code in ipython3 to gain access to these variables:

In [8]: run load_tweets

(You could import the file instead, but then you would need to qualify every name with load_tweets, for example, load_tweets.theSNP.)

We encourage you to play around with a couple of the tweet dictionaries to become familiar with how to access information in the data structure before you move on to the next task.

Part 2: Finding commonly-occurring entities

What are @theSNP’s favorite hashtags? Which URLs are referenced at least 5 times by @LibDems? To answer these questions we must extract the desired entities (hashtags, user mentions, etc.) from the parties’ tweets and process them.

Common parameters for Part 2

The next two tasks have two parameters in common. To avoid repetition later, we describe them here:

  • tweets is a list of dictionaries representing tweets.

  • entity_desc is a tuple with three values: the first is the type of entity of interest (the key), the second is the information of interest for that type of entity (the subkey), and the third is a boolean indicating whether the values are case sensitive. For example, we use the tuple ("hashtags", "text", False) to describe hashtags: "hashtags" is the key for extracting the hashtags from the entities dictionary, "text" is the subkey for extracting the text of a hashtag from the entity dictionary, and, for our purposes, hashtags are not case sensitive, so "#bbcqt" is treated as the same as "#BBCQT". URLs, on the other hand, are case sensitive, so we would use ("urls", "url", True) as the entity description for URLs.

(You will find the string lower method useful for handling entities that are not case sensitive.)
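To make this description concrete, here is a hedged sketch of a helper that collects the values described by an entity_desc from a single tweet; the name and exact structure are up to you:

def extract_entities(tweet, entity_desc):
    """
    Return a list of the values described by entity_desc that occur in a
    single tweet. One possible helper, not a required interface.
    """
    key, subkey, case_sensitive = entity_desc
    values = []
    for entity in tweet["entities"][key]:
        value = entity[subkey]
        if not case_sensitive:
            # entities that are not case sensitive are normalized to lower case
            value = value.lower()
        values.append(value)
    return values

Applying such a helper to every tweet in the list gives you the tokens to feed to the algorithms from Part 1.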

Task 2.1: Top K entities

For Task 2.1, you will write a function that finds the top k most common entities in a list of tweets using the algorithm you wrote in basic_algorithms.py. You must complete the following function:

def find_top_k_entities(tweets, entity_desc, k):

The first two parameters are as described above and k is an integer. This function, which is in analyze.py, should return a list of the k most common entities. As in Task 1.2, the most common entity should come first, the next most common entity should come second, etc.

Here’s a sample call using the tweets from @theSNP with ("hashtags", "text", False) as the entity description and a value of three for k:

In [13]: analyze.find_top_k_entities(theSNP, ("hashtags", "text", False), 3)
Out[13]: ['votesnp', 'ge17', 'snpbecause']

After completing this task, you should pass all tests in:

$ py.test -xv -k top_k_entities test_analyze.py

The first few tests are simplified examples to test corner cases, while the remaining tests use data from the data directory.

If your counts are off, make sure you are handling case sensitivity properly.

Task 2.2: Minimum number of occurrences

For Task 2.2, you will find all entities that appear a minimum number of times using the min_count algorithm you wrote earlier. You must complete the function:

def find_min_count_entities(tweets, entity_desc, min_count):

where the first two parameters are as described above and min_count is an integer that specifies the minimum number of times an entity must occur to be included in the result. This function should return a set of the entities that occurred at least min_count times.

Here’s a sample use of this function using the tweets from @LibDems:

In [14]: analyze.find_min_count_entities(LibDems, ("user_mentions", "screen_name", True), 100)
Out[14]: {'LibDemPress', 'LibDems', 'LiberalBritain', 'timfarron'}

After completing this task, you should pass all tests in:

$ py.test -xv -k min_count_entities test_analyze.py

Part 3: Analyzing N-grams

What additional insights might we gain by analyzing words and word sequences?

In this part, you will apply the three algorithms described earlier to contiguous sequences of \(N\) words, which are known as n-grams. Before you apply these algorithms to a candidate’s tweets, you will pre-process the tweets to reveal useful words and then extract the n-grams. Your solution should use helper functions to avoid duplicated code.

Abridged text

To simplify the work of debugging these tasks, we constructed an abridged version of the text of each tweet by replacing the emoji and symbols with spaces. You can find the abridged text in a tweet’s dictionary under the key "abridged_text".

You must use the abridged text for this part of the assignment.

To-do: Pre-processing step

You will need to pre-process the text of the tweets: you will convert the (abridged) text into a list of strings. You must follow the steps below in order so that your solution passes our tests. You can find all named constants in the file analyze.py and can refer to them by their names in your code.

  1. We define a word to be any sequence of characters delimited by whitespace. For example, “abc”, “10”, and “#2017election.” are all words by this definition. Turn the abridged text of a tweet into a list of words.

  2. We have defined a constant, PUNCTUATION, that specifies which characters constitute punctuation for this assignment. Remove any leading and trailing punctuation from each word. Note that apostrophes, as in the word “it’s”, should not be removed, because they occur in the middle of the word. You should eliminate a word if stripping the punctuation from it yields the empty string ("").

  3. In some tasks, we will want to treat the words as case sensitive; in others, we will not. For tasks that are not case sensitive, convert the word to lower case.

  4. Stop words are commonly-occurring words, such as “and”, “of”, and “to”, that convey little insight into what makes one tweet different from another. We have defined a set of stop words in the constant STOP_WORDS. For the tasks that require it, eliminate any word from the list of words that is included in STOP_WORDS.

  5. Finally, we want to remove URLs, hashtags, and mentions. We defined a constant STOP_PREFIXES (which includes "#", "@", and "http") to help with this task. Eliminate all words that begin with any of the strings in STOP_PREFIXES.

Notes and Suggestions

  • While we described the pre-processing step in terms of which words to eliminate, it is actually much easier to write code that constructs a list of the words that should be included, rather than repeatedly removing words that should be excluded.

  • We suggest you write a function that performs the above steps. Your preprocessing function needs to handle the fact that in some cases we care about the original case of the words and in others we do not. Similarly, for some tasks we will want to remove stop words and for others we will not. You are welcome to implement additional helper functions, but it would be bad practice to write two essentially identical functions, where one converts the words to lower case and the other does not, or where one removes stop words and the other does not; this will be reflected in the grading.

  • Important: You will find the lower, split, strip, and startswith methods from the string API useful for this step. You will save yourself a lot of time and unnecessary work if you read about these methods in detail and experiment with them in ipython3 before you start writing code (a brief demonstration follows this list).
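If you have not used these string methods before, a few minutes of experimenting in ipython3 will pay off. For example (the punctuation and prefix strings below are illustrative stand-ins for the PUNCTUATION and STOP_PREFIXES constants in analyze.py):

"vote!".strip("!?.,")                         # -> 'vote'  (removes leading/trailing characters listed in the argument)
"it's".strip("'")                             # -> "it's"  (strip only touches the ends, so the inner apostrophe stays)
"#VoteLabour".startswith(("#", "@", "http"))  # -> True    (startswith accepts a tuple of prefixes)
"Polling card?".lower().split()               # -> ['polling', 'card?']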

Example

Here is the abridged text from tweet1 (as defined in load_tweets.py):

Things you need to vote
Polling card?
ID?
A dog?
Just yourself
You've got until 10pm – #VoteLabour now… https://t.co/sCDJY1Pxc9

Pre-processing it for a task that is not case sensitive and for which stop words should be removed would yield:

['things', 'need', 'vote', 'polling', 'card', 'id', 'dog',
 'just', 'yourself', "you've", 'got', 'until', '10pm', 'now']

Pre-processing it for a task that is case sensitive and for which stop words should not be removed would yield:

['Things', 'you', 'need', 'to', 'vote', 'Polling', 'card',
 'ID', 'A', 'dog', 'Just', 'yourself', "You've", 'got',
 'until', '10pm', 'now']

We strongly recommend testing your cleaning code by hand in ipython3 before you move on to the next part of the assignment.

To-do: Representing N-grams

Your code should compute the n-grams of a tweet after pre-processing the tweet’s abridged text. These n-grams should be represented as tuples of strings. Consider the following list (which came from pre-processing):

['things', 'need', 'vote', 'polling', 'card', 'id', 'dog',
 'just', 'yourself', "you've", 'got', 'until', '10pm', 'now']

Taking \(N = 2\) would yield the following bi-grams (2-grams):

[('things', 'need'),
 ('need', 'vote'),
 ('vote', 'polling'),
 ('polling', 'card'),
 ('card', 'id'),
 ('id', 'dog'),
 ('dog', 'just'),
 ('just', 'yourself'),
 ('yourself', "you've"),
 ("you've", 'got'),
 ('got', 'until'),
 ('until', '10pm'),
 ('10pm', 'now')]

Notes

  • The n-grams do not “circle back” to the beginning. That is, the last word of the tweet and the first word of the tweet do not form an n-gram (so ('now', 'things') is not included).

  • You should not combine words from different tweets in a single n-gram.
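One compact way to construct the n-grams for a single tweet from its pre-processed word list (a sketch; any equivalent construction is fine) is shown below. Because the slice stops at the end of the list, the n-grams never wrap around, and because the function is applied to one tweet at a time, words from different tweets are never combined:

def make_ngrams(words, n):
    """
    Return the list of n-grams (as tuples) for one tweet's word list.
    """
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]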

Common parameters for Part 3

The rest of the tasks have three parameters in common:

  • tweets is a list of dictionaries representing tweets (for example, theSNP)

  • n is the number of words in an n-gram.

  • case_sensitive is a boolean that is True if the task is case sensitive.

Task 3.1: Top K n-grams

You will apply your previously written find_top_k function to find the most commonly occurring n-grams. Your task is to implement the function:

def find_top_k_ngrams(tweets, n, case_sensitive, k):

where the first three parameters are as described above and k is an integer. This function should return a list of the k most common n-grams. As in Task 1.2, the most common n-gram should come first, followed by the second most common, etc.

Here’s a sample use of this function using the tweets from @theSNP with n=2, case_sensitive=False and k=3:

In [16]: analyze.find_top_k_ngrams(theSNP, 2, False, 3)
Out[16]: [('nicola', 'sturgeon'), ('read', 'more'), ('stand', 'up')]

After completing this task, you should pass all tests in:

$ py.test -xv -k top_k_ngrams test_analyze.py

Task 3.2: Minimum number of n-gram occurrences

Similarly, you will apply your find_min_count function to find the n-grams that appear at least a minimum number of times. Your task is to implement the function:

def find_min_count_ngrams(tweets, n, case_sensitive, min_count):

where the first three parameters are as described above and min_count specifies the minimum number of times an n-gram must occur to be included in the result. This function should return a set of the n-grams that occur at least a minimum number of times.

Here’s a sample use of this function using the tweets from @LibDems with n=2, case_sensitive=True, and min_count=100:

In [17]: analyze.find_min_count_ngrams(LibDems, 2, True, 100)
Out[17]:
{('Lib', 'Dems'),
 ('RT', 'Only'),
 ('Theresa', 'May'),
 ('can', 'stop'),
 ('stand', 'up'),
 ('stop', 'Tories'),
 ('will', 'stand')}

After completing this task, you should pass all tests in:

$ py.test -xv -k min_count_ngrams test_analyze.py

Task 3.3: Salient n-grams

Finally, you will use your find_salient function from Part 1 to find the salient n-grams in each tweet in a collection. Your task is to implement the function:

def find_salient_ngrams(tweets, n, case_sensitive, threshold):

where the first three parameters are as described above and threshold is the \(\tfidf\) threshold for deciding that an n-gram is salient. This function should return a list of sets of salient n-grams, one set per tweet.

Here’s a sample use of this function:

In [68]: tweets = [ {"abridged_text": "the cat in the hat" },
    ...:            {"abridged_text": "don't let the cat on the hat" },
    ...:            {"abridged_text": "the cat's hat" },
    ...:            {"abridged_text": "the hat cat" }]
    ...:
In [69]: analyze.find_salient_ngrams(tweets, 2, False, 1.3)
Out[69]:
[{('cat', 'in'), ('in', 'the')},
 {('cat', 'on'), ("don't", 'let'), ('let', 'the'), ('on', 'the')},
 {("cat's", 'hat'), ('the', "cat's")},
 {('hat', 'cat')}]

Notice that when constructing the n-grams for this task, we did not remove the stop words (for example, “the” is in STOP_WORDS, but appears in the result).

After completing this task, you should pass all tests in:

$ py.test -xv -k salient_ngrams test_analyze.py

Grading

Programming assignments will be graded according to a general rubric. Specifically, we will assign points for completeness, correctness, design, and style. (For more details on the categories, see our PA Rubric page.)

The exact weights for each category will vary from one assignment to another. For this assignment, the weights will be:

  • Completeness: 50%

  • Correctness: 15%

  • Design: 20%

  • Style: 15%

The completeness part of your score will be determined using automated tests. To get your score for the automated tests, simply run the grader script, as described in our Testing Your Code page.

While we are telling you many of the functions to implement in this assignment, some of the tasks will benefit from using helper functions. Your design score will depend largely on whether you make adequate use of helper functions, as well as whether your code is well structured and easy to read.

As usual, we may deduct points if we spot errors in your code that are not explicitly captured by the tests. In this assignment, we will also be paying special attention to the efficiency of your solutions. For example, if you write a solution that uses a doubly-nested loop when a single loop would’ve been enough (or which iterates over a data structure multiple times when a single pass would’ve been enough) we would deduct correctness points for this.

Finally, remember that you must include docstrings in any functions you write (and you must make sure they conform to the format specified in the style guide). Do not remove the header comments on any of the functions we provide.

Cleaning up

Before you submit your final solution, you should remove

  • any print statements that you added for debugging purposes and

  • all in-line comments of the form: “YOUR CODE HERE” and “REPLACE …”

Also, check your code against the style guide. Did you use good variable names? Are any of your lines too long? And so on.

As you clean up, you should periodically save your file and run your code through the tests to make sure that you have not broken it in the process.

Submission

You must submit your work through Gradescope (linked from our Canvas site). In the “Programming Assignment #3” assignment, simply upload files basic_algorithms.py and analyze.py (do not upload any other files!). Please note:

  • You are allowed to make as many submissions as you want before the deadline.

  • Please make sure you have read and understood our Late Submission Policy.

  • Your completeness score is determined solely based on the automated tests, but we may adjust your score if you attempt to pass tests by rote (e.g., by writing code that hard-codes the expected output for each possible test input).

  • Gradescope will report the test score it obtains when running your code. If there is a discrepancy between the score you get when running our grader script, and the score reported by Gradescope, please let us know so we can take a look at it.