Analyzing Candidate Tweets¶
Due: Friday, October 30 at 3pm CDT.
You must work alone on this assignment.
The purpose of this assignment is to give you a chance to practice
using dictionaries to represent data and as a mechanism for mapping keys to values that change over time,
using functions to avoid repeated code,
structuring a task into logical, easy-to-test pieces, and
reading the documentation for library functions.
Introduction¶
On April 18th, 2017, Theresa May, then Prime Minister of the United Kingdom, announced that she would call a snap election. This news came as a bit of a surprise since the next election was not due to be held for several years. Her stated reason for calling the election was a desire to strengthen the UK Government’s hand in negotiations with the European Union (EU) over Britain’s exit from the EU (colloquially referred to as Brexit). While the election did not necessarily play out to Prime Minister May’s satisfaction, it did yield a trove of tweets that we can mine for insight into British politics.
Unlike US presidential elections, which seem to last forever, in the UK, general elections are quite short. The period between when a general election is officially called and the date of the election is typically six weeks or so. During this pre-election period, known as purdah, civil servants are restricted from certain activities that might be viewed as biased towards the party currently in power. Purdah ran from April 22nd until June 9th for the 2017 General Election.
For this assignment, we’ll be analyzing tweets sent from the official
Twitter accounts of four parties: the Conservative and Unionist Party
(@Conservatives
), the Labour Party (@UKLabour
), the Liberal Democrats
(@LibDems
) and the Scottish National Party (@theSNP
) during purdah.
We’ll ask questions such as:
What was
@Conservatives
’s favorite hashtag during purdah? [#bbcqt
]Who was mentioned at least 50 times by
@UKLabour
? [@jeremycorbyn
]What words occurred most often in
@theSNP
’s tweets? [snp
,scotland
,our
,have
,more
]What two-word phrases occurred most often in
@LibDems
’s tweets? [stand up
,stop tories
,will stand
,theresa may
,lib dems
]
For those of you who do not follow British politics, a few notes:
Jeremy Corbyn was the leader of the Labour Party during the 2017 election.
Tim Farron was the leader of the Liberal Democrats during the 2017 election.
Theresa May was the leader of the Conservatives during the 2017 election.
Nicola Sturgeon has been the leader of the Scottish National Party since 2014.
The Conservatives are also known as the Tories.
The hashtag
#bbcqt
refers to BBC Question Time, a political debate program on the British Broadcasting Company.
Getting started¶
Before you start working on the assignment’s tasks, please take a
moment to follow the steps described in Coursework Basics page
to get the files for this assignment (these steps will be the same as
the ones you followed to get the files previous assignments. Please
note that you will not be able to start working on the assignment
until you fetch the assignment files (which will appear in a pa3
directory in your repository)
On this assignment, you may assume that the input passed to your functions has the correct format. You may not alter any of the input that is passed to your functions. In general, it is bad style to modify a data structure passed as input to a function, unless that is the explicit purpose of the function. If someone calls your function, they might have other uses for the data and should not be surprised by unexpected changes.
Because that data files for this assignment are large, we did not
include them the GitLab distribution. Instead, please run the shell
script get_files.sh
to download the data. To run the
script, change to your pa3
directory and run this command from the
Linux command-line:
$ ./get_files.sh
This script will create directories named data
and tests
and
will download one file per party:
Conservatives.json
UKLabour.json
LibDems.json
theSNP.json
plus some files used by the test code.
Then as usual, fire up ipython3
from the Linux command-line,
set up autoreload, and import your code and the json library as
follows:
$ ipython3
In [1]: %load_ext autoreload
In [2]: %autoreload 2
In [3]: import basic_algorithms, analyze
In [4]: import json
The json
library will be needed when you recreate our tests in
ipython3
.
Part 1: Basic algorithms¶
Algorithms for efficiently counting and sorting distinct tokens, or
unique values, are widely used in data analysis. For example, we might
want to find the most frequent keyword used in a document or the top
10 most popular links clicked on a website. In Part 1, you will
implement two such algorithms: find_top_k
and find_min_count
.
You will also implement an algorithm for finding the most salient
(that is, the most important or notable) tokens for a document in the
context of a document collection.
In this part, you will add code to basic_algorithms.py
to implement
the algorithms described in the following subsections.
Task 1.1: Counting distinct tokens¶
The first step is to write a helper function, count_tokens
, that
counts distinct tokens. This function takes as input a list of tokens
(in our example strings, but they can be any immutable type), and returns
a dictionary that maps tokens to counts.
For example, if we have a list:
['A', 'B', 'C', 'A']
the function should yield (the exact order of the key-value pairs in the dictionary is not important):
{'A': 2, 'B': 1, 'C': 1}
Notes:
Do not use Python libraries other than what is already imported. (For example, you may not use
collections.Counter
.)If you use Python’s
list.count
method, your solution will be inefficient and you will not receive credit for this task.
We provide test code for many of the tasks in this assignment, but
not this one. We encourage you to test your implementation using
ipython3
.
Task 1.2: Top K¶
You will write the algorithm find_top_k
, which takes a list of
tokens and a nonnegative integer K, and returns a list of the the K tokens that occur most
frequently in the list. The result should be sorted in descending
order of token frequency.
Here is an example use of this function:
In [4]: l = ['D', 'B', 'C', 'D', 'D', 'B', 'D', 'C', 'D', 'A']
In [5]: basic_algorithms.find_top_k(l, 2)
Out[5]: ['D','B']
To do this computation, you will need to:
Count the number of times each token appears,
Convert that data into a list of (token, count) tuples
Sort the resulting list in decreasing order by count.
Extract the K tokens that occurred most frequently.
The standard Python sort function sorts pairs using the first value as
the primary key and the second value as the secondary key (to break
ties) and so, is not suitable for this task. Instead, you need a sort
function that uses the second value as the primary key and the first
value to break ties. We have provided such a function:
sort_count_pairs
in util.py
. To learn about this function,
you can look at the docstring or import util
into ipython3
and
then run help(util.sort_count_pairs)
.
You can test your code for this task by running:
$ py.test -xv -k top_k test_basic_algorithms.py
(As usual, we use $
to signal the Linux command-line
prompt. You should not type the $
.)
If you have not done so already, please follow the instructions for downloading the data in the Getting started section before you run this test.
Task 1.3: Minimum number of occurrences¶
You will write the algorithm find_min_count
, which computes a set
of the tokens in a list that occur at least some specified minimum
number of times.
Here is an example use of this function:
In [6]: l = ['D', 'B', 'C', 'D', 'D', 'B', 'D', 'C', 'D', 'A']
In [7]: basic_algorithms.find_min_count(l, 2)
Out[7]: {'D', 'B', 'C'}
Recall that sets are not ordered, so the result of this call could
just as easily have been {'B', 'C', 'D'}
.
After writing this function you should be able to pass all tests in:
$ py.test -xv -k min_count test_basic_algorithms.py
Salient tokens¶
The most frequent tokens in the list may not be the most salient. For example, if the list contains words from an English-language document, the fact the words “a”, “an”, and “the” occur frequently is not at all surprising.
According to Wikipedia, term frequency–inverse document frequency (aka tf–idf) is a statistic designed to reflect how important a word is to a document in a collection or corpus and is often used as a weighting factor in information retrieval and text mining. A word or term is considered salient to a particular document if it occurs frequently in that document, but not in the document corpus over all.
Definitions
Term frequency–inverse document frequency is defined as:
where \(t\) is a term, \(d\) is a document (collection of terms), \(D\) is a collection of documents, and \(\tf\) and \(\idf\) are defined below.
There are several variants of both term frequency (\(\tf\)) and inverse document frequency (\(\idf\)) that can be used to compute \(\tfidf\). We will be using augmented frequency as our measure of term frequency to prevent bias towards longer documents, and we will use basic inverse document frequency.
The augmented frequency of a term \(t\) in a document \(d\) is defined as
where \(f_{t,d}\) is the number of times the term \(t\) appears in the document \(d\).
The basic inverse document frequency of a term \(t\) in a document collection \(D\) is defined as
where \(N\) is the number of documents in the document collection
\(D\) and where \(\lvert \{d \in D : t \in d\} \rvert\) is
the number of documents that term \(t\) occurs in. Use the natural log (math.log
) in your calculation of \(\idf\).
Given a threshold \(T\), the set of salient words for a document \(d\) in a document collection \(D\) is defined as:
Example
In this section, we will work through the computation of the salient tokens for each document in a sample document collection. Here is the sample collection:
[['D', 'B', 'D', 'C', 'D', 'C', 'C'],
['D', 'A', 'A'],
['D', 'B'],
[]]
Here are the augmented term frequency values for each document:
Document: ['D', 'B', 'D', 'C', 'D', 'C', 'C']
Token: TF
B: 0.6666666666666666
C: 1.0
D: 1.0
Document: ['D', 'A', 'A']
Token: TF
A: 1.0
D: 0.75
Document: ['D', 'B']
Token: TF
B: 1.0
D: 1.0
Document: []
There are no augmented term frequency values for the last document because it is empty.
Here are the inverse document frequency values for the document collection:
Token: IDF:
A: 1.3862943611198906
B: 0.6931471805599453
C: 1.3862943611198906
D: 0.28768207245178085
Here is the salience computation for each document using a threshold of 0.4.
Document: ['D', 'B', 'D', 'C', 'D', 'C', 'C']
Token: The token is salient (TF_IDF)
B: True (0.46209812037329684)
C: True (1.3862943611198906)
D: False (0.28768207245178085)
Salient tokens: {'B', 'C'}
Document: ['D', 'A', 'A']
Token: The token is salient (TF_IDF)
A: True (1.3862943611198906)
D: False (0.21576155433883565)
Salient tokens: {'A'}
Document: ['D', 'B']
Token: The token is salient (TF_IDF)
B: True (0.6931471805599453)
D: False (0.28768207245178085)
Salient tokens: {'B'}
Document: []
Salient tokens: set()
Notice that empty document has no salient tokens.
Finally, the result of computing the salient tokens for the documents in the document collection is:
[{'B', 'C'}, {'A'}, {'B'}, set()]
Task 1.4: Computing salient tokens¶
In this task, you will write the algorithm find_salient
, which
takes a collection of documents (docs
) and a floating point
threshold (threshold
) and finds the set of salient words for each
document (as defined in the previous section). The result of the
function will be a list of sets, with one set per document.
You should think carefully about how to organize the code for implementing this algorithm. Do not put all the code for this task into one function.
Some things to keep in mind:
You will need an intermediate data structure to keep track of the inverse document frequencies for the words in the document collection.
You may find it helpful to compute a data structure with the term frequencies for the words in a document, but it is not strictly necessary.
Remember that the set of salient words for an empty document is just the empty set.
Testing your intermediate functions (in
ipython3
) as you go will make it easier to complete this task successfully.
You can run our tests for this task by running the command:
$ py.test -xv -k salient test_basic_algorithms.py
Analyzing Tweets¶
Now that you have implemented the basic algorithms, you can start
analyzing Twitter feeds. For the rest of the tasks, put your code in
analyze.py
.
While we provide function headers for each of the required tasks, the structure of the rest of the code is up to you. Some tasks can be done cleanly with one function, others cannot. We expect you to look for common sub-tasks to abstract into functions and we expect you to reuse previously completed functions. This process of function decomposition and thoughtful reuse of functions is one of the keys to writing clean code. You may place your auxiliary functions anywhere in the file that makes sense to you. Remember to follow the course style guide, which requires that you write proper docstrings (header comments) for every function.
Data¶
Twitter enables us to search for tweets with particular properties,
say, from a particular user, containing specific terms, and within a
given range of dates. There are several Python libraries that
simplify the process of using this feature of Twitter. We used the
TwitterSearch library to
gather tweets from the Twitter feeds of @Conservatives
,
@UKLabour
, @theSNP
, and @LibDems
from the purdah period,
and we stored the resulting data in JSON files.
If you look in the data
directory that you downloaded earlier, you
will find files including:
Conservatives.json
UKLabour.json
LibDems.json
theSNP.json
Representing tweets¶
A single tweet is represented by a dictionary that contains a lot of
information: creation time, hashtags used, users mentioned, text of
the tweet, etc. For example, here is a tweet sent in mid-May by
@UKLabour
(variable tweet0
mentioned below):
RT @UKPatchwork: .@IainMcNicol and @UKLabour encourage you to #GetInvolved and get registered to vote here: https://t.co/2Lf9M2q3mP #GE2017…
and here is an abridged version of the corresponding tweet dictionary that includes a few of the 20+ key/value pairs:
{'created_at': 'Thu May 18 19:44:01 +0000 2017',
'entities': {'hashtags': [{'indices': [62, 74], 'text': 'GetInvolved'},
{'indices': [132, 139], 'text': 'GE2017'}],
'symbols': [],
'urls': [{'display_url': 'gov.uk/register-to-vo…',
'expanded_url': 'http://gov.uk/register-to-vote',
'indices': [108, 131],
'url': 'https://t.co/2Lf9M2q3mP'}],
'user_mentions': [{'id': 1597669326,
'id_str': '1597669326',
'indices': [3, 15],
'name': 'Patchwork Foundation',
'screen_name': 'UKPatchwork'},
{'id': 105573429,
'id_str': '105573429',
'indices': [18, 30],
'name': 'Iain McNicol',
'screen_name': 'IainMcNicol'},
{'id': 14291684,
'id_str': '14291684',
'indices': [35, 44],
'name': 'The Labour Party',
'screen_name': 'UKLabour'}]},
'text': 'RT @UKPatchwork: .@IainMcNicol and @UKLabour encourage you to #GetInvolved and get registered to vote here: https://t.co/2Lf9M2q3mP #GE2017…'}
Collectively, hashtags, symbols, user mentions, and URLs are referred
to as entities. These entities can be accessed through the
"entities"
key in the tweet’s dictionary. The value associated with
the key "entities"
is itself a dictionary that maps keys
representing entity types ("hashtags"
, "symbols"
, "urls"
,
"user_mentions"
) to lists of entities of that type. In general
the entities data structure has the form:
'entities': {'key1': [{'subkey1': value11, 'subkey2': value21},
{'subkey1': value12, 'subkey2': value22},
...],
'key2': [{'subkey3': value31, 'subkey4': value41},
{'subkey3': value32, 'subkey4': value42},
...],
...}
For example, here’s a subset of the entities from the sample tweet:
'entities': {'hashtags': [{'indices': [62, 74], 'text': 'GetInvolved'},
{'indices': [132, 139], 'text': 'GE2017'}],
'urls': [{'display_url': 'gov.uk/register-to-vo…',
'expanded_url': 'http://gov.uk/register-to-vote',
'indices': [108, 131],
'url': 'https://t.co/2Lf9M2q3mP'}], ...}
Each individual entity is represented with a dictionary, whose form
depends on the entity type. For example, a hashtag will
include information about the text
of the hashtag and the
indices
where it occurred in the tweet:
{'indices': [62, 74], 'text': 'GetInvolved'}
Exploring the data¶
To simplify the process of exploring the data and testing, we have
provided code in load_tweets.py
to load the tweets for you. Here
it is:
"""
Load the tweets for the Analyze Candidate Tweets assignment.
"""
import util
Conservatives = util.get_json_from_file("data/Conservatives.json")
UKLabour = util.get_json_from_file("data/UKLabour.json")
theSNP = util.get_json_from_file("data/theSNP.json")
LibDems = util.get_json_from_file("data/LibDems.json")
# sample tweet from the "Data" section
tweet0 = UKLabour[651]
# sample tweet from the "Pre-processing step" and "Representing
# N-grams" sections.
tweet1 = UKLabour[55]
The first four variables defined by load_tweets.py
—Conservatives
, UKLabour
, theSNP
, and LibDems
—define lists of
tweets. We have also defined variables for a few specific tweets used
in examples. tweet0
is the sample tweet shown above. tweet1
is used in the preprocessing discussion below.
Run this code in ipython3
to gain access to these variables:
In [8]: run load_tweets
(You could import the file instead, but then you would need to qualify
every name with load_tweets
, for example, load_tweets.theSNP
.)
We encourage you to play around with a couple of the tweet dictionaries to become familiar with how to access information in the data structure before you move on to the next task.
Part 2: Finding commonly-occurring entities¶
What are @theSNP
’s favorite hashtags? Which URLs are referenced at
least 5 times by @LibDems
? To answer these questions we must
extract the desired entities (hashtags, user mentions, etc.) from the
parties’ tweets and process them.
Common parameters for Part 2¶
The next two tasks will use two of the same parameters. To avoid repetition later, we describe them here:
tweets
is a list of dictionaries representing tweets.entity_desc
is a tuple with three values: the first is the type of entity of interest (the key), the second is the information of interest for that type of entity (the subkey), and the third is a boolean indicating whether the values are case sensitive. For example, we use the tuple("hashtags", "text", False)
to describe hashtags, because"hashtags"
is the key for extracting the hashtags out of the entities dictionary,"text"
is the subkey for extracting the text of the hashtag from the entity dictionary, and, for our purposes, we will usually treat hashtags as case insensitive. For example, when working with hashtags we will treat"#bbcqt"
to be the same as"#BBCQT"
. URLs on the other hand are case sensitive, so we’d use("urls", "url", True)
as the entity description for URLs.
(You will find the string lower
method useful for handling entities that are not case sensitive.)
Task 2.1: Top K entities¶
For Task 2.1, you will write a function that finds the top k
most
common entities in a list of tweets using the algorithm you wrote in
basic_algorithms.py
. You must complete the following function:
def find_top_k_entities(tweets, entity_desc, k):
The first two parameters are as described above and k
is an
integer. This function, which is in analyze.py
, should return a
list of the k
most common entities. As in Task 1.2, the most common
entity should come first, the next most common entity should come
second, etc.
Here’s a sample call using the tweets from @theSNP
with
("hashtags", "text", False)
as the entity description and a value
of three for k
:
In [13]: analyze.find_top_k_entities(theSNP, ("hashtags", "text", False), 3)
Out[13]: ['votesnp', 'ge17', 'snpbecause']
After completing this task, you should pass all tests in:
$ py.test -xv -k top_k_entities test_analyze.py
The first few tests are simplified examples to test corner cases, while the remaining tests use data from the data folder.
If your counts are off, make sure you are handling case sensitivity properly.
Task 2.2: Minimum number of occurrences¶
For Task 2.2, you will find all entities that appear a minimum number of
times using the min_count
algorithm you wrote earlier. You must
complete the function:
def find_min_count_entities(tweets, entity_desc, min_count):
where the first two parameters are as described above and
min_count
is an integer that specifies the minimum number of times
an entity must occur to be included in the result. This function
should return set of the entities occurred at least min_count
times.
Here’s a sample use of this function using the tweets from
@LibDems
:
In [14]: analyze.find_min_count_entities(LibDems, ("user_mentions", "screen_name", True), 100)
Out[14]: {'LibDemPress', 'LibDems', 'LiberalBritain', 'timfarron'}
After completing this task, you should pass all tests in:
$ py.test -xv -k min_count_entities test_analyze.py
Part 3: Analyzing N-grams¶
What additional insights might we gain by analyzing words and word sequences?
In this part, you will apply the three algorithms described earlier to contiguous sequences of \(N\) words, which are known as n-grams. Before you apply these algorithms to a candidate’s tweets, you will pre-process the tweets to reveal useful words and then extract the n-grams. Your solution should use helper functions to avoid duplicated code.
Abridged text
To simplify the work of debugging these tasks, we constructed an
abridged version of the text of each tweet by replacing the emoji and
symbols with spaces. You can find the abridged text in a tweet’s
dictionary under the key: abridged_text
.
You must use the abridged text for this part of the assignment.
To-do: Pre-processing step¶
You will need to pre-process the text of the tweets: you
will convert the (abridged) text into a list of
strings. You must follow the steps below in order
so that your solution passes our tests. You can find all
named constants in the file analyze.py
and can refer to them by
their names in your code.
We define a word to be any sequence of characters delimited by whitespace. For example, “abc”, “10”, and “#2017election.” are all words by this definition. Turn the abridged text of a tweet into a list of words.
We have defined a constant,
PUNCTUATION
, that specifies which characters constitute punctuation for this assignment. Remove any leading and trailing punctuation from each word. Note that apostrophes, as in the word “it’s”, should not be removed, because they occur in the middle of the word. You should eliminate a word, if stripping the punctuation from it yields the empty string (""
).In some tasks, we will want to treat the words as case sensitive, in others we will not. For tasks that are not case sensitive, convert the word to lower case.
Stop words are commonly-occurring words, such as “and”, “of”, and “to” that convey little insight into what makes one tweet different from another. We defined a set of stop words in the defined constant
STOP_WORDS
. For the tasks that require it, eliminate any word from the list of words that is included inSTOP_WORDS
.Finally, we want to remove URLs, hashtags, and mentions. We defined a constant
STOP_PREFIXES
(which includes"#"
,"@"
, and"http"
) to help with this task. Eliminate all words that begin with any of the strings inSTOP_PREFIXES
.
Notes and Suggestions
While we described the pre-processing step in terms of which words to eliminate, it is actually much easier to write code that constructs a list of the words that should be included, rather than repeatedly removing words that should be excluded.
We suggest you write a function that performs the above steps. Your preprocessing function needs to handle the fact that in some cases we care about the original case of the words and in others we don’t. Similarly, for some tasks, we will want to remove stop words and for others we will not. You are welcome to implement additional helper functions, but it would be bad practice to write two essentially identical functions, where one converts the words to lower case and the other does not or where one removes stop words and the other does not; this will be reflected in the grading.
Important: You will find the
lower
,split
,strip
, andstartswith
methods from the string API useful for this step. You will save yourself a lot of time and unnecessary work if you read about these methods in detail and experiment with them inipython3
before you start writing code.
Example
Here is the abridged text from tweet1
(as defined in
load_tweets.py
):
Things you need to vote
Polling card?
ID?
A dog?
Just yourself
You've got until 10pm – #VoteLabour now… https://t.co/sCDJY1Pxc9
Pre-processing it for a task that is not case sensitive and for which stop words should be removed would yield:
['things', 'need', 'vote', 'polling', 'card', 'id', 'dog',
'just', 'yourself', "you've", 'got', 'until', '10pm', 'now']
Pre-processing it for a task that is case sensitive and for which stop should not be removed would yield:
['Things', 'you', 'need', 'to', 'vote', 'Polling', 'card',
'ID', 'A', 'dog', 'Just', 'yourself', "You've", 'got',
'until', '10pm', 'now']
We strongly recommend testing your cleaning code by hand in
ipython3
before you move on to the next part of the assignment.
To-do: Representing N-grams¶
Your code should compute the n-grams of a tweet after pre-processing the tweet’s abridged text. These n-grams should be represented as tuples of strings. Consider the following list (which came from pre-processing):
['things', 'need', 'vote', 'polling', 'card', 'id', 'dog',
'just', 'yourself', "you've", 'got', 'until', '10pm', 'now']
Taking \(N = 2\) would yield the following bi-grams (2-grams):
[('things', 'need'),
('need', 'vote'),
('vote', 'polling'),
('polling', 'card'),
('card', 'id'),
('id', 'dog'),
('dog', 'just'),
('just', 'yourself'),
('yourself', "you've"),
("you've", 'got'),
('got', 'until'),
('until', '10pm'),
('10pm', 'now')]
Notes
The n-gram does not “circle back” to the beginning. That is, the last word of the tweet and the first word of the tweet do not comprise an
n-gram
(so('now', 'things')
is not included).You should not combine words from different tweets in a single n-gram.
Common parameters for Part 3¶
The rest of the tasks have three parameters in common:
tweets
is a list of dictionaries representing tweets (for example,theSNP
)n
is the number of words in an n-gram.case_sensitive
is a boolean that isTrue
if the task is case sensitive.
Task 3.1: Top K n-grams¶
You will apply your previously written find_top_k
function to find
the most commonly occurring n-grams. Your task is to implement the
function:
def find_top_k_ngrams(tweets, n, case_sensitive, k):
where the first three parameters are as described above and k
is
an integer. This function should
return a list of the k
most common n
-grams. As in Task 1.2,
the most common n
-gram should come first, followed by the second
most common, etc.
Here’s a sample use of this function using the tweets from @theSNP
with n=2
, case_sensitive=False
and k=3
:
In [16]: analyze.find_top_k_ngrams(theSNP, 2, False, 3)
Out[16]: [('nicola', 'sturgeon'), ('read', 'more'), ('stand', 'up')]
After completing this task, you should pass all tests in:
$ py.test -xv -k top_k_ngrams test_analyze.py
Task 3.2: Minimum number of n-gram occurrences¶
Similarly, you will apply your find_min_count
function to find the
n-grams that appear at least a minimum number of times. Your task is
to implement the function:
def find_min_count_ngrams(tweets, n, case_sensitive, min_count):
where the first three parameters are as described above and
min_count
specifies the minimum number of times an n-gram must
occur to be included in the result. This function should return a
set of the n-grams that occur at least a minimum number of times.
Here’s a sample use of this function using the tweets from
@LibDems
with n=2
, and min_count=100
:
In [17]: analyze.find_min_count_ngrams(LibDems, 2, True, 100)
Out[17]:
{('Lib', 'Dems'),
('RT', 'Only'),
('Theresa', 'May'),
('can', 'stop'),
('stand', 'up'),
('stop', 'Tories'),
('will', 'stand')}
After completing this task, you should pass all tests in:
$ py.test -xv -k min_count_ngrams test_analyze.py
Task 3.3: Salient n-grams¶
Finally, you will use your find_salient
function from Part 1 to
find the salient n-grams in each tweet in a collection. Your task is to
implement the function:
def find_salient_ngrams(tweets, n, case_sensitive, threshold):
where the first three parameters are as described above and
threshold
is the \(\tfidf\) threshold for deciding that a
n-gram is salient. This function should return a list of sets of
salient n-grams, one set per tweet.
Here’s a sample use of this function:
In [68]: tweets = [ {"abridged_text": "the cat in the hat" },
...: {"abridged_text": "don't let the cat on the hat" },
...: {"abridged_text": "the cat's hat" },
...: {"abridged_text": "the hat cat" }]
...:
In [69]: analyze.find_salient_ngrams(tweets, 2, False, 1.3)
Out[69]:
[{('cat', 'in'), ('in', 'the')},
{('cat', 'on'), ("don't", 'let'), ('let', 'the'), ('on', 'the')},
{("cat's", 'hat'), ('the', "cat's")},
{('hat', 'cat')}]
Notice that when constructing the n-grams for this task, we did not
remove the stop words (for example, “the” is in STOP_WORDS
, but
appears in the result).
After completing this task, you should pass all tests in:
$ py.test -xv -k salient_ngrams test_analyze.py
Grading¶
Programming assignments will be graded according to a general rubric. Specifically, we will assign points for completeness, correctness, design, and style. (For more details on the categories, see our PA Rubric page.)
The exact weights for each category will vary from one assignment to another. For this assignment, the weights will be:
Completeness: 50%
Correctness: 15%
Design: 20%
Style: 15%
The completeness part of your score will be determined using automated tests. To get your score for the automated tests, simply run the grader script, as described in our Testing Your Code page.
While we are telling you many of the functions to implement in this assignment, some of the tasks will benefit from using helper functions. Your design score will depend largely on whether you make adequate use of helper functions, as well as whether your code is well structured and easy to read.
As usual, we may deduct points if we spot errors in your code that are not explicitly captured by the tests. In this assignment, we will also be paying special attention to the efficiency of your solutions. For example, if you write a solution that uses a doubly-nested loop when a single loop would’ve been enough (or which iterates over a data structure multiple times when a single pass would’ve been enough) we would deduct correctness points for this.
Finally, remember that you must include docstrings in any functions you write (and you must make sure they conform to the format specified in the style guide). Do not remove the header comments on any of the functions we provide.
Cleaning up¶
Before you submit your final solution, you should, remove
any
print
statements that you added for debugging purposes andall in-line comments of the form: “YOUR CODE HERE” and “REPLACE …”
Also, check your code against the style guide. Did you use good variable names? Do you have any lines that are too long, etc.
As you clean up, you should periodically save your file and run your code through the tests to make sure that you have not broken it in the process.
Submission¶
You must submit your work through Gradescope (linked from our Canvas
site). In the “Programming Assignment #3” assignment, simply upload
files basic_algorithms.py
and analyze.py
(do not upload any
other files!). Please note:
You are allowed to make as many submissions as you want before the deadline.
Please make sure you have read and understood our Late Submission Policy
Your completeness score is determined solely based on the automated tests, but we may adjust your score if you attempt to pass tests by rote (e.g., by writing code that hard-codes the expected output for each possible test input).
Gradescope will report the test score it obtains when running your code. If there is a discrepancy between the score you get when running our grader script, and the score reported by Gradescope, please let us know so we can take a look at it.