Analyzing Candidate Tweets¶
Due: Friday, October 25th at 4pm
You must work alone on this assignment.
The purpose of this assignment is to give you experience with using dictionaries to represent data and as a mechanism for mapping keys to values that change over time. You will also get a chance to practice using functions to avoid repeated code and to structure a task into logical, easy-to-test pieces.
Introduction¶
On April 18th, 2017, Theresa May, then Prime Minister of the United Kingdom, announced that she would call a snap election. This news came as a bit of a surprise since the next election was not due to be held until 2020. Her stated reason for calling the election was a desire to strengthen the UK Government’s hand in negotiations with the European Union (EU) over Britain’s exit from the EU (colloquially referred to as Brexit). While the election did not necessarily play out to Prime Minister May’s satisfaction, it did yield a trove of tweets that we can mine for insight into British politics.
Unlike US presidential elections, which seem to last forever, in the UK, the period between when a general election is officially called and the date the election is held is quite short, typically six weeks or so. During this pre-election period, known as purdah, civil servants are restricted from certain activities that might be viewed as biased towards the party currently in power. Purdah ran from April 22nd until June 9th for the 2017 General Election.
For this assignment, we’ll be analyzing tweets sent from the official Twitter feeds of four parties: the Conservative and Unionist Party (@Conservatives), the Labour Party (@UKLabour), the Liberal Democrats (@LibDems) and the Scottish National Party (@theSNP) during purdah.
We’ll ask questions such as:
- What was @Conservatives’s favorite hashtag during purdah? [#bbcqt]
- Who was mentioned at least 50 times by @UKLabour? [@jeremycorbyn]
- What words occurred most often in @theSNP’s tweets? [snp, scotland, our, have, more]
- What two-word phrases occurred most often in @LibDems’s tweets? [stand up, stop tories, will stand, theresa may, lib dems]
For those of you who do not follow British politics, a few notes:
- The hashtag #bbcqt refers to BBC Question Time, a political debate program on the British Broadcasting Corporation (BBC).
- Jeremy Corbyn is the leader of the Labour Party.
- Theresa May was the leader of the Conservatives during the 2017 election.
- Nicola Sturgeon is the leader of the Scottish National Party.
- Tim Farron was the leader of the Liberal Democrats during the 2017 election.
- The Conservatives are also known as the Tories.
Getting Started¶
Accessing the pa3 directory¶
We have seeded your repository with a directory (pa3) for this assignment. To pick up pa3, change to your cmsc12100-aut-19-username directory (where the string username should be replaced with your username) and then run the command: git pull upstream master. You should also run git pull to make sure any changes you previously pushed to GitLab are integrated into the local copy of your repository.
A description of the contents of the pa3 directory follows. There are two files you will be modifying:
- basic_algorithms.py – you will add code for Part 1 to this file.
- analyze.py – you will add code for Part 2 to this file.
Two other files contain the test code we will use throughout this assignment. You will not modify these files:
- test_basic_algorithms.py – this file contains test code for the algorithms you will implement for Part 1 of this assignment.
- test_analyze.py – test code for Part 2 of this assignment.
Two files contain functions that will be helpful in completing the assignment. You will not modify these files:
- load_tweets.py – this file contains code to load the tweets for the four different parties to use during testing.
- util.py – this file contains a few useful functions.
Accessing the tweet data and the test files¶
Finally, the last two parts of the distribution are the tweet data and the test files. To pick up these subdirectories, run:
$ sh get_files.sh
from the linux command-line in your pa3
directory.
- data/ – a directory for the tweet files. These files are large and should not be added to your repository.
- tests/ – a directory with the input to the test code. These files should not be added to your repository.
Your VM must be connected to the network to use the get_files
script. Also, if you wish to use both CSIL & the Linux servers and
your VM, you will need to run the get_files.sh
script twice, once
for CSIL & the Linux servers and once for your VM.
Auto-reload in IPython¶
Recall that you can set-up ipython3
to reload code automatically
when it is modified by running the following commands after you fire
up ipython3
:
In [1]: %load_ext autoreload
In [2]: %autoreload 2
In [3]: import basic_algorithms, analyze
In [4]: import json
Part 1: Basic algorithms¶
Algorithms for efficiently counting and sorting distinct tokens, or
unique values, are widely used in data analysis. For example, we might
want to find the most frequent keyword used in a document or the top
10 most popular links clicked on a website. In Part 1, you will
implement two such algorithms: find_top_k
and find_min_count
.
You will also implement an algorithm for finding the most salient
(that is, the most important or notable) tokens in a list.
In this part, you will add code to basic_algorithms.py
to implement
the algorithms described in the following subsections.
Testing Part 1¶
We have provided code,
test_basic_algorithms.py
, that runs test cases for each of the
algorithms below. As in the previous assignments, our test code uses the
pytest
testing framework. You can use the -k
flag and the
name of the algorithm (top_k
, min_count
, and most_salient
)
to run the test code for a specific algorithm. For example, running
the following command from the Linux command-line:
$ py.test -xv -k min_count test_basic_algorithms.py
will run the tests for the min count algorithm. Recall that the
-x
and -v
flags, which indicate respectively that pytest
should stop on the first error and generate verbose output, can be
combined into a single flag -xv
.
(As usual, we use $
to signal the Linux command-line prompt.
You should not type the $
.)
Task 1.1: Counting distinct tokens¶
The first step is to write a helper function that counts distinct
tokens, count_tokens
.
This function takes as input a list of tokens (in our case strings, but they can be any comparable and hashable type), and returns a new list with each distinct token and the number of times it appears in the input list. To do so:
- Use a dictionary to count the number of times each unique value occurs,
- Extract a list of (key, count) pairs from the dictionary,
- Return this list.
For example, if we have a list:
['A', 'B', 'C', 'A']
the function should yield (the exact order of the tuples in the list is not important):
[('A', 2), ('B', 1), ('C', 1)]
Notes:
- Do not use python libraries other than what is already imported. (For example, you may not use collections.Counter.)
- If you use python’s list.count method, your solution will be inefficient, and it is a sign that you are probably not following the instructions above to use a dictionary to perform the counting. You will not receive credit if you use the list.count method.
We have not provided test code for this function. We do, however,
encourage you to test your implementation using ipython3
.
Ordering tuples¶
While the return value of count_tokens
does not have to be ordered
in any particular way, subsequent tasks may require that you provide
a sorted list of pairs. We provide a function, named sort_count_pairs
,
that will handle sorting
the pairs for you. Given a python list of tuples of (string, integer):
[(k0, v0), (k1, v1), ...]
the function sorts the list in descending order by the integer values.
We use the keys (k0
, k1
, etc) to break ties and order them in
ascending lexicographic (alphabetical) order.
Note: this differs from the default sort order for pairs in python which uses the first item in the pair as the primary sort key and the second value as the secondary sort key and sorts both in ascending order.
For example, given the list:
[('D', 5), ('C', 2), ('A', 3), ('B', 2)]
our function sort_count_pairs
should return:
[('D', 5), ('A', 3), ('B', 2), ('C', 2)]
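You do not need to implement this sorting yourself. For the curious, though, an ordering like this can be expressed with Python’s built-in sorted and a key function that negates the count (this is just a sketch of the idea, not necessarily how the provided function is written):
def sort_count_pairs_sketch(pairs):
    """
    Sort (key, count) pairs in descending order of count,
    breaking ties in ascending lexicographic order of the keys.
    """
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))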
Task 1.2: Top K¶
You will write the algorithm find_top_k
, which computes the \(K\) tokens that occur
most frequently in the list. To do this computation, you will need
to:
- Count the tokens
- Sort the resulting pairs using the supplied function sort_count_pairs.
- Return the first \(K\) pairs. These pairs must be sorted in descending order of the counts.
Here is an example use of this function:
In [4]: l = ['D', 'B', 'C', 'D', 'D', 'B', 'D', 'C', 'D', 'A']
In [5]: basic_algorithms.find_top_k(l, 2)
Out[5]: [('D', 5), ('B', 2)]
You can test your code by running:
$ py.test -xv -k top_k test_basic_algorithms.py
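Here is one possible shape for this function (a sketch only; it assumes you have written count_tokens as described above and that the provided sort_count_pairs can be imported from util — check the distributed code for where that function actually lives):
import util

def find_top_k(tokens, k):
    """
    Find the k tokens that occur most frequently in a list.
    """
    pairs = count_tokens(tokens)                  # (token, count) pairs
    sorted_pairs = util.sort_count_pairs(pairs)   # descending by count
    return sorted_pairs[:k]                       # keep only the first k pairs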
Task 1.3: Minimum number of occurrences¶
You will write the algorithm find_min_count
, which finds the tokens in a list that
occur at least some specified minimum number of times. To perform
this algorithm, you will need to:
- Count the tokens
- Build a list of the tokens and associated counts that meet the threshold.
- Return that list in descending order by count.
Here is an example use of this function:
In [6]: l = ['D', 'B', 'C', 'D', 'D', 'B', 'D', 'C', 'D', 'A']
In [7]: basic_algorithms.find_min_count(l, 2)
Out[7]: [('D', 5), ('B', 2), ('C', 2)]
After writing this function you should be able to pass all tests in:
$ py.test -xv -k min_count test_basic_algorithms.py
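The structure closely mirrors find_top_k. A sketch, under the same assumptions about count_tokens and util.sort_count_pairs as above:
import util

def find_min_count(tokens, min_count):
    """
    Find the tokens that occur at least min_count times,
    in descending order by count.
    """
    pairs = count_tokens(tokens)
    frequent = [(tok, n) for (tok, n) in pairs if n >= min_count]
    return util.sort_count_pairs(frequent)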
Salient tokens¶
The most frequent tokens in the list may not be the most salient. For example, if the list contains words in an English-language document, the fact that the words “a”, “an”, and “the” occur frequently is not at all surprising.
According to Wikipedia, term frequency–inverse document frequency (aka tf–idf) is a statistic designed to reflect how important a word is to a document in a collection or corpus and is often used as a weighting factor in information retrieval and text mining. A word or term is considered salient to a particular document if it occurs frequently in that document, but not in the document corpus overall.
Term frequency–inverse document frequency is defined as:
\[\mathrm{tf\text{-}idf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)\]
where \(t\) is a term, \(d\) is a document (a collection of terms), \(D\) is the collection of documents, and \(\mathrm{tf}\) and \(\mathrm{idf}\) are defined below.
There are several variants of both term frequency (\(\mathrm{tf}\)) and inverse document frequency (\(\mathrm{idf}\)) that can be used to compute tf–idf. We will be using augmented frequency as our measure of term frequency, and we will use vanilla inverse document frequency.
The augmented frequency of a term \(t\) in a document \(d\) is defined as
\[\mathrm{tf}(t, d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max \{\, f_{t',d} : t' \in d \,\}}\]
where \(f_{t,d}\) is the number of times the term \(t\) appears in the document \(d\).
The vanilla inverse document frequency of a term \(t\) in a document collection \(D\) is defined as
\[\mathrm{idf}(t, D) = \ln\!\left(\frac{N}{|\{\, d \in D : t \in d \,\}|}\right)\]
where \(N\) is the number of documents in the document collection \(D\).
Use the natural log (math.log) in the \(\mathrm{idf}\) computation.
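To make the two formulas above concrete, here is a sketch of how the augmented term frequency and the vanilla inverse document frequency could be computed (the function names and the decomposition are only suggestions, not requirements):
import math

def augmented_tf(doc):
    """
    Map each term in a document (a non-empty list of terms) to its
    augmented term frequency: 0.5 + 0.5 * (f / max_f).
    """
    counts = {}
    for term in doc:
        counts[term] = counts.get(term, 0) + 1
    max_count = max(counts.values())   # fails on an empty document
    return {term: 0.5 + 0.5 * (n / max_count) for term, n in counts.items()}

def vanilla_idf(term, docs):
    """
    Inverse document frequency of a term that appears in at least one
    of the documents in docs: ln(N / number of documents containing it).
    """
    num_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / num_containing)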
Task 1.4: Computing salient tokens¶
You will write the algorithm find_most_salient, which takes a collection of documents
and an integer k
and returns a list of the k
most salient
terms, that is, the terms with the highest tf–idf, for each
document. The definition for tf–idf is given above.
For this function, for each non-empty document, you will need to:
- build a dictionary that maps each word to its tf–idf
- sort the resulting (term, tf–idf) pairs in decreasing order by tf–idf
- extract a list with the
k
most salient terms and add it to the result.
Note that, depending on how you write your code, your \(\tfidf\) computation might not work for an empty document. If you encounter an empty document, just add an empty list to the result and move on to the next document.
Here is an example use of this function:
In [7]: l = [["D", "B", "D", "C", "D", "C", "C"],
...: ["D", "A", "A"],
...: ["D", "B"],
...: [],
...: ["E", "B"]]
...: basic_algorithms.find_most_salient(l, 2)
...:
Out[7]: [['C', 'D'], ['A', 'D'], ['B', 'D'], [], ['E', 'B']]
Think carefully about how to organize the code for this task. Don’t put all of the code in one function!
We strongly suggest testing any helper functions you write by hand
in ipython3
. Once you are done with that process, you can run our
tests by running the command:
$ py.test -xv -k most_salient test_basic_algorithms.py
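One possible organization, assuming helpers along the lines of the augmented_tf and vanilla_idf sketches above (the tie-breaking rule below, decreasing tf–idf with ties broken alphabetically, is consistent with the example output, but verify it against the tests):
def find_most_salient(docs, k):
    """
    For each document, find the k terms with the highest tf-idf.
    """
    result = []
    for doc in docs:
        if not doc:
            result.append([])   # empty document: add an empty list
            continue
        tf = augmented_tf(doc)
        tfidf = {term: tf[term] * vanilla_idf(term, docs) for term in tf}
        # decreasing tf-idf, ties broken in ascending alphabetical order
        ranked = sorted(tfidf, key=lambda term: (-tfidf[term], term))
        result.append(ranked[:k])
    return result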
Part 2: Analyzing Tweets¶
Now that you have implemented the basic algorithms, you can start
analyzing Twitter feeds. For the rest of the tasks, put your code in
analyze.py
.
While we provide function headers for each of the required tasks, the structure of the rest of the code is up to you. Some tasks can be done cleanly with one function, others cannot. We expect you to look for sub-tasks that are common to multiple tasks and to reuse previously completed functions. This process of function decomposition and thoughtful reuse of functions is one of the keys to writing clean code. You may place your auxiliary functions anywhere in the file that makes sense to you. Remember to follow the course style guide, which requires that you write proper function headers for each auxiliary function.
Testing Part 2¶
We have provided code for testing this part in test_analyze.py
.
Our test suite contains tests for checking your code on the examples
shown below, on various corner cases, and larger data sets:
$ py.test -xv test_analyze.py
Data¶
Files¶
Twitter enables us to search for tweets with particular
properties, say, from a particular user, containing specific terms,
and within a given range of dates. There are several Python libraries
that simplify the process of using this feature of Twitter. We used
the TwitterSearch
library to gather tweets from the Twitter feeds of @Conservatives
,
@UKLabour
, @theSNP
, and @LibDems
from the purdah period,
and we stored the resulting data in
JSON files.
If you look in the data
directory that you downloaded earlier, you
will find files including:
Conservatives.json
UKLabour.json
LibDems.json
theSNP.json
Exploring the data¶
To simplify the process of exploring the data and testing, we have provided
example code in load_tweets.py
to load tweet files and assign the
results to variables, one per party. This code is described below:
import util
Conservatives = util.get_json_from_file("data/Conservatives.json")
UKLabour = util.get_json_from_file("data/UKLabour.json")
theSNP = util.get_json_from_file("data/theSNP.json")
LibDems = util.get_json_from_file("data/LibDems.json")
util.py
defines some useful helper routines for loading the stored tweets.
The code above loads the set of tweets into each of the variables (more about the structure of
the data later).
We have also defined variables for a few specific tweets used in the examples, as described in the code below:
# sample tweet from the "Data" section
tweet0 = UKLabour[651]
# sample tweet from the "Pre-processing step" and "Representing
# N-grams" sections.
tweet1 = UKLabour[55]
We encourage you to run this code in ipython3
to gain access to
these variables:
In [8]: run load_tweets
You could also import the file, but then you would need to qualify
every name with load_tweets
, for example, load_tweets.theSNP
.
Representing tweets¶
We encourage you to play around with extracting information about hashtags, user mentions, etc. for a few tweets before you start working on the tasks for Part 2. To do so, you will have to understand how information in a tweet is structured.
Each of the variables, theSNP
, Conservatives
, UKLabour
, and LibDems
,
defines a list of tweets. All of the analysis code that you will be writing
will operate on such lists.
A single tweet is represented by a dictionary that contains a lot of
information: creation time, hashtags used, users mentioned, text of the tweet, etc.
For example, here is a tweet sent in mid-May by @UKLabour
(variable tweet0
above):
RT @UKPatchwork: .@IainMcNicol and @UKLabour encourage you to #GetInvolved and get registered to vote here: https://t.co/2Lf9M2q3mP #GE2017…
and here is an abridged version of the corresponding tweet dictionary that includes a few of the 20+ key/value pairs:
{'created_at': 'Thu May 18 19:44:01 +0000 2017',
'entities': {'hashtags': [{'indices': [62, 74], 'text': 'GetInvolved'},
{'indices': [132, 139], 'text': 'GE2017'}],
'symbols': [],
'urls': [{'display_url': 'gov.uk/register-to-vo…',
'expanded_url': 'http://gov.uk/register-to-vote',
'indices': [108, 131],
'url': 'https://t.co/2Lf9M2q3mP'}],
'user_mentions': [{'id': 1597669326,
'id_str': '1597669326',
'indices': [3, 15],
'name': 'Patchwork Foundation',
'screen_name': 'UKPatchwork'},
{'id': 105573429,
'id_str': '105573429',
'indices': [18, 30],
'name': 'Iain McNicol',
'screen_name': 'IainMcNicol'},
{'id': 14291684,
'id_str': '14291684',
'indices': [35, 44],
'name': 'The Labour Party',
'screen_name': 'UKLabour'}]},
'text': 'RT @UKPatchwork: .@IainMcNicol and @UKLabour encourage you to #GetInvolved and get registered to vote here: https://t.co/2Lf9M2q3mP #GE2017…'}
Collectively, hashtags, symbols, user mentions, and URLs are referred
to as entities. These entities can be accessed through the
"entities"
key in the tweet dictionary. The value associated with
the key "entities"
is itself a dictionary that maps keys
representing entity types ("hashtags"
, "symbols"
, "urls"
,
"user_mentions"
) to lists of entities of that type. The entities
themselves are represented with dictionaries dependent upon the entity
type.
Part 2a: Finding commonly-occurring entities¶
What are @theSNP
’s favorite hashtags? Which URLs are referenced at
least 5 times by @LibDems
? To answer these questions we must
extract the desired entities (hashtags, user mentions, etc.) from the
parties’ tweets and process them.
Entities¶
The entities
dictionary maps a key to a list of
entities associated with that key (each of which is a dictionary):
'entities': {'key': [{'subkey1': value11, 'subkey2': value12},
{'subkey1': value21, 'subkey2': value22}]}
For example, the hashtags contained in a tweet are represented by:
'entities': {'hashtags': [{'indices': [62, 74], 'text': 'GetInvolved'},
{'indices': [132, 139], 'text': 'GE2017'}]}
Common parameters for Tasks 2.1 and 2.2¶
The next two tasks will use two of the same parameters. To avoid repetition later, we describe them here:
- tweets is a list of dictionaries representing tweets.
- entity_key is a pair, such as ("hashtags", "text"), where the first item in the pair is the type of entity of interest and the second is the key of interest for that type of entity. For example, we use "hashtags" to extract the information about hashtags from the entities dictionary and "text" to get the actual text of each hashtag.
For the tasks that use tweets, we will ignore case differences and so,
for example, we will consider "#bbcqt"
to be the same as
"#BBCQT"
. In this part of the assignment, you should convert
everything to lowercase. (The string lower
method is useful for
this purpose.) If you find yourself under-counting some entities,
verify that you are correctly ignoring case differences.
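A small helper along these lines can serve both of the next two tasks (the name extract_entities is our own suggestion, not something the assignment requires):
def extract_entities(tweets, entity_key):
    """
    Return a list with one lowercased value per entity occurrence
    for the given (entity type, subkey) pair.
    """
    entity_type, subkey = entity_key
    values = []
    for tweet in tweets:
        for entity in tweet["entities"][entity_type]:
            values.append(entity[subkey].lower())
    return values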
Task 2.1: Top K entities¶
For Task 2.1, you will write a function that finds the
top k
most common entities in a list of tweets
using the algorithm you wrote in basic_algorithms.py
.
You must complete the following function:
def find_top_k_entities(tweets, entity_key, k):
The first two parameters are as described above and k
is an
integer. This function, which is in analyze.py
, should return a
sorted list of (entity, count) pairs for the k
most common
entities, where count
is the number of occurrences of entity
.
Here’s a sample call using the tweets from @theSNP
with ("hashtags",
"text")
as the entity key and k=3
:
In [13]: analyze.find_top_k_entities(theSNP, ("hashtags", "text"), 3)
Out[13]: [('votesnp', 625), ('ge17', 428), ('snpbecause', 195)]
After completing this task, you should pass all tests in:
$ py.test -xv -k top_k_entities test_analyze.py
The first few tests are simplified examples to test corner cases, while the remaining tests use real data in the data folder.
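With a helper like the extract_entities sketch above and your Part 1 code, the function itself can be quite short. A sketch (assuming analyze.py imports basic_algorithms):
import basic_algorithms

def find_top_k_entities(tweets, entity_key, k):
    """
    Find the k most common entities of the given type in a list of tweets.
    """
    values = extract_entities(tweets, entity_key)
    return basic_algorithms.find_top_k(values, k)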
Task 2.2: Minimum number of occurrences¶
For Task 2.2, we will find all entities that appear a minimum number of
times using the min_count
algorithm you wrote earlier. You must
complete the function:
def find_min_count_entities(tweets, entity_key, min_count):
where the first two parameters are as described above and
min_count
is an integer that specifies the minimum number of times
an entity must occur to be included in the result. This function
should return a sorted list of (entity, count) pairs.
Here’s a sample use of this function using the tweets from
@LibDems
with ("user_mentions", "screen_name")
as the entity key
and min_count=100
:
In [14]: analyze.find_min_count_entities(LibDems, ("user_mentions", "screen_name"), 100)
Out[14]:
[('libdems', 568),
('timfarron', 547),
('liberalbritain', 215),
('libdempress', 115)]
After completing this task, you should pass all tests in:
$ py.test -xv -k min_count_entities test_analyze.py
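The structure mirrors Task 2.1, again assuming the hypothetical extract_entities helper and the basic_algorithms import from the previous sketch:
def find_min_count_entities(tweets, entity_key, min_count):
    """
    Find entities of the given type that occur at least min_count times.
    """
    values = extract_entities(tweets, entity_key)
    return basic_algorithms.find_min_count(values, min_count)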
Part 2b: Analyzing N-grams¶
What additional insights might we gain by analyzing words and word sequences?
In this part, you will apply the three algorithms described earlier to contiguous sequences of \(N\) words, which are known as n-grams. Before you apply these algorithms to a candidate’s tweets, you will pre-process the tweets to reveal useful words and then extract the n-grams. Your solution should use helper functions to avoid duplicated code.
To-do: Pre-processing step¶
You will need to pre-process the text of the tweets.
The pre-processing step converts the text of a tweet into a list of
strings. It is important to follow the order of the steps below
precisely so that your solution passes our tests. You can find all
named constants in the file analyze.py
and can refer to them by
their names in your code.
To simplify the work of debugging this task, we have added an abridged
version of the text of each tweet to the tweet’s dictionary under the
key: abridged_text
. To construct the abridged version, we replaced
emoji and symbols with spaces.
You should use the abridged version of the text of the tweets for this part of the assignment.
1. We will define a word to be any sequence of characters delimited by whitespace. For example, “abc”, “10”, and “#2017election.” are all words by this definition. Turn the abridged text of a tweet into a list of lowercase words.
2. Next, we must remove any leading and trailing punctuation. We have
defined a constant, PUNCTUATION
, that specifies which characters
constitute punctuation for this assignment. Note that apostrophes, as
in the word “it’s”, should not be removed, because they occur in the
middle of the word. Remove any leading and trailing punctuation in each word.
3. Then, we want to focus attention on the important words in the tweets. Stop words are commonly-occurring words, such as “and”, “of”, and “to”, that convey little insight into what makes one tweet different from another. We have defined a set of stop words in the constant STOP_WORDS. For the first two tasks below, you should eliminate all words that are in the set STOP_WORDS.
4. Finally, we want to remove URLs, hashtags, and mentions. We defined a
constant STOP_PREFIXES
(which includes "#"
, "@"
, and "http"
). Eliminate
all words that begin with any of the strings in STOP_PREFIXES.
Please note that while we described the pre-processing step in terms of which words to remove, it is actually much easier to write code that determines the words that should be included, rather than removing the words that should be excluded.
We suggest you write a function that performs the above steps, but in a way that allows you to easily skip Step 3 if necessary (as you will need to skip that step in one of the tasks below). Please note that you are welcome to implement additional helper functions, but it would be bad practice to write two essentially identical functions, where one performs Step 3 but the other does not; this will be reflected in the grading.
Here is an example: pre-processing the text (see tweet1
defined in load_tweets.py
)
Things you need to vote
Polling card?
ID?
A dog?
Just yourself
You've got until 10pm – #VoteLabour now… https://t.co/sCDJY1Pxc9
would yield:
('things', 'need', 'vote', 'polling', 'card', 'id', 'dog',
'just', 'yourself', "you've", 'got', 'until', '10pm', 'now')
You will find the lower
, split
, strip
, and startswith
methods from the string API
useful for this step. You will save yourself a lot of time and
unnecessary work if you read about these methods in detail before
you start writing code.
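Here is a sketch of one way to structure the pre-processing, with a flag that lets a caller skip Step 3 (the function name and the flag are illustrative; PUNCTUATION, STOP_WORDS, and STOP_PREFIXES are the constants defined in analyze.py, and the choice to drop words that become empty after stripping punctuation is an assumption you should check against the tests):
def preprocess(text, remove_stop_words=True):
    """
    Convert the abridged text of a tweet into a list of cleaned words.
    """
    words = []
    for word in text.lower().split():          # Step 1: lowercase and split
        word = word.strip(PUNCTUATION)         # Step 2: strip punctuation
        if not word:
            continue                           # assumption: drop emptied words
        if remove_stop_words and word in STOP_WORDS:
            continue                           # Step 3: drop stop words
        if any(word.startswith(p) for p in STOP_PREFIXES):
            continue                           # Step 4: drop URLs, hashtags, mentions
        words.append(word)
    return words
Note that the example result above is shown as a tuple; converting the list with tuple(words) before returning is a one-line change if that is what the tests expect.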
To-do: Representing N-grams¶
Your code should compute the n-grams of a tweet after pre-processing
the tweet’s abridged text. These n-grams should be represented as
tuples of strings. Given a value of 2 for \(N\), the above
@UKLabour
tweet would yield the following bi-grams (2-grams):
[('things', 'need'),
('need', 'vote'),
('vote', 'polling'),
('polling', 'card'),
('card', 'id'),
('id', 'dog'),
('dog', 'just'),
('just', 'yourself'),
('yourself', "you've"),
("you've", 'got'),
('got', 'until'),
('until', '10pm'),
('10pm', 'now')]
Notice that the n-gram does not “circle back” to the beginning. That is, the
last word of the tweet and the first word of the tweet do not comprise an
n-gram
(notice: (‘now’, ‘things’) is not included).
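A short sketch of how the n-grams could be built from a pre-processed list of words (again, how you decompose this work is up to you):
def make_ngrams(words, n):
    """
    Return the list of n-grams (as tuples) in a list of words,
    without wrapping around to the beginning.
    """
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
For a list with fewer than n words, this yields an empty list.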
Common parameters for Tasks 2.3–2.5¶
The rest of the tasks have two parameters in common:
- tweets is a list of dictionaries representing tweets (for example, theSNP).
- n is the number of words in an n-gram.
Task 2.3: Top K n-grams¶
You will apply your previously
written find_top_k
function to find the most commonly occurring
n-grams. Your task is to implement the function:
def find_top_k_ngrams(tweets, n, k):
where the first two parameters are as described above and k
is the
desired number of n-grams. This function should return a sorted list
of the (n-gram, count) pairs for the k
most
common n
-grams.
Here’s a sample use of this function using the tweets from @theSNP
with n=2
and k=3
:
In [16]: analyze.find_top_k_ngrams(theSNP, 2, 3)
Out[16]: [(('nicola', 'sturgeon'), 82), (('read', 'more'), 69), (('stand', 'up'), 55)]
After completing this task, you should pass all tests in:
$ py.test -xv -k top_k_ngrams test_analyze.py
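A sketch, assuming the hypothetical preprocess and make_ngrams helpers above; the shared loop is factored into its own helper so that Tasks 2.3 and 2.4 can reuse it:
import basic_algorithms

def all_ngrams(tweets, n, remove_stop_words=True):
    """
    Collect the n-grams from every tweet's abridged text into one list.
    """
    ngrams = []
    for tweet in tweets:
        words = preprocess(tweet["abridged_text"], remove_stop_words)
        ngrams.extend(make_ngrams(words, n))
    return ngrams

def find_top_k_ngrams(tweets, n, k):
    """
    Find the k most common n-grams in a list of tweets.
    """
    return basic_algorithms.find_top_k(all_ngrams(tweets, n), k)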
Task 2.4: Minimum number of n-gram occurrences¶
Similarly, you will apply your find_min_count
function to find the
n-grams that appear at least a minimum number of times. Your task is
to implement the function:
def find_min_count_ngrams(tweets, n, min_count):
where the first two parameters are as described above and
min_count
specifies the minimum number of times an n-gram must
occur to be included in the result. This function should return a
sorted list of (n-gram, count) pairs.
Here’s a sample use of this function using the tweets from @LibDems
with n=2
and min_count=100
:
In [17]: analyze.find_min_count_ngrams(LibDems, 2, 100)
Out[17]:
[(('stand', 'up'), 189),
(('stop', 'tories'), 189),
(('will', 'stand'), 166),
(('theresa', 'may'), 125),
(('lib', 'dems'), 116),
(('can', 'stop'), 104),
(('only', 'can'), 100)]
After completing this task, you should pass all tests in:
$ py.test -xv -k min_count_ngrams test_analyze.py
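Using the same hypothetical all_ngrams helper from the previous sketch:
def find_min_count_ngrams(tweets, n, min_count):
    """
    Find the n-grams that occur at least min_count times across the tweets.
    """
    return basic_algorithms.find_min_count(all_ngrams(tweets, n), min_count)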
Task 2.5: Most salient n-grams¶
Finally, you will use your find_most_salient
function from Part 1
to find the most salient n-grams in each tweet in a collection.
Your task is to implement the function:
def find_most_salient_ngrams(tweets, n, k):
where the first two parameters are as described above and k
is the
desired number of salient n-grams for each tweet. This function
should return a list of lists of n-grams, sorted by salience (that is,
in decreasing order of tf–idf score).
Here’s a sample use of this function:
In [68]: tweets = [ {"abridged_text": "the cat in the hat" },
...: {"abridged_text": "don't let the cat on the hat" },
...: {"abridged_text": "the cat's hat" },
...: {"abridged_text": "the hat cat" }]
...: analyze.find_most_salient_ngrams(tweets, 2, 2)
Out[68]:
[[('cat', 'in'), ('in', 'the')],
[('cat', 'on'), ("don't", 'let')],
[("cat's", 'hat'), ('the', "cat's")],
[('hat', 'cat'), ('the', 'hat')]]
Notice that when constructing the n-grams for this task, we do not remove the stop words (for example, “the” is in STOP_WORDS, but appears in the result).
After completing this task, you should pass all tests in:
$ py.test -xv -k most_salient_ngrams test_analyze.py
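A sketch that builds one list of n-grams per tweet (keeping the stop words, as noted above) and hands the resulting collection to your Part 1 function; it assumes the hypothetical preprocess and make_ngrams helpers and that analyze.py imports basic_algorithms:
def find_most_salient_ngrams(tweets, n, k):
    """
    Find the k most salient n-grams in each tweet; stop words are kept.
    """
    docs = []
    for tweet in tweets:
        words = preprocess(tweet["abridged_text"], remove_stop_words=False)
        docs.append(make_ngrams(words, n))
    return basic_algorithms.find_most_salient(docs, k)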
Grading¶
Programming assignments will be graded according to a general rubric. Specifically, we will assign points for completeness, correctness, design, and style. (For more details on the categories, see our PA Rubric page.)
The exact weights for each category will vary from one assignment to another. For this assignment, the weights will be:
- Completeness: 50%
- Correctness: 15%
- Design: 20%
- Style: 15%
While we are telling you many of the functions to implement in this assignment, some of the tasks will benefit from using helper functions. Your design score will depend largely on whether you make adequate use of helper functions, as well as whether your code is well structured and easy to read.
As usual, we may deduct points if we spot errors in your code that are not explicitly captured by the tests. In this assignment, we will also be paying special attention to the efficiency of your solutions. For example, if you write a solution that uses a doubly-nested loop when a single loop would’ve been enough (or which iterates over a data structure multiple times when a single pass would’ve been enough) we would deduct correctness points for this.
Finally, remember that you must include header comments in any functions you write (and you must make sure they conform to the format specified in the style guide). Do not remove the header comments on any of the functions we provide.
Obtaining your test score¶
Like previous assignments, you can obtain your test score by running py.test
followed by ../common/grader.py
.
Continuous Integration¶
Continuous Integration (CI) is available for this assignment. For more details, please see our Continuous Integration page. We strongly encourage you to use CI in this and subsequent assignments.
Submission¶
To submit your assignment, make sure that you have:
- put your name at the top of your file,
- registered for the assignment using chisubmit,
- added, committed, and pushed basic_algorithms.py and analyze.py to the git server, and
- run the chisubmit submission command for pa3.
Remember to push your code to the server early and often! Also, remember that you can submit as many times as you like before the deadline.