Analyzing Candidate Tweets¶
Due: Friday, October 21st at 4pm
You may work alone or in a pair on this assignment.
The purpose of this assignment is to give you experience with using dictionaries to represent data and as a mechanism for mapping keys to values that change over time.
Introduction¶
The candidates in the 2016 presidential primaries and general election have used Twitter as a way to communicate with their supporters (and detractors) and gain media attention. In this assignment, you will analyze tweets from Hillary Clinton’s (HRC) and Donald Trump’s (DJT) Twitter feeds to gain some insight into how they use Twitter as a campaign tool. We’ll ask questions such as:
- What is HRC’s favorite hashtag? [#demdebate]
- Who has DJT mentioned at least 100 times? [realdonaldtrump, cnn, foxnews]
- What words occur most often in HRC’s tweets? [our, with, are, have, not]
- What two-word phrases occur most often in DJT’s tweets? [i will, make america, great again, i am, america great]
- How do the candidates’ feeds change over time? [Not as much as you might think!]
Getting started¶
See these start-up instructions if you intend to work alone.
See these start-up instructions if you intend to work with the same partner as in PA #2.
See these start-up instructions if you intend to work in a NEW pair.
Here is a description of the contents of this directory:
- basic_algorithms.py – you will add code for the basic analysis algorithms to this file.
- test_basic_algorithms.py – this file contains very basic test code for the basic algorithms.
- analyze.py – you will add your code for extracting information from tweets and analyzing it to this file. Note that this file includes several constants and two functions that you may find useful. They are documented with code comments, and you should make sure to read and understand them before you start writing code.
- test_analyze_writeup.py – test code for the examples in the writeup.
- util.py – this file contains a few useful functions.
- data/ – a directory for the data files. Since the data files are very large, they are not included in the Git repository. See below for instructions on how to obtain the data files.
The data/ directory contains a file called get_files.sh that will download the data files. To download the data files, go into the data directory and run this command:
./get_files.sh
This script will download the complete HRC dataset (hrc-all.json), the complete DJT dataset (djt-all.json), and several smaller datasets that may be useful for testing purposes. Please note that you must be connected to the network to use this script.
Do not add the data to your repository! If you wish to use both CSIL & the Linux servers and your VM, you will need to run the get_files.sh script twice: once for CSIL & the Linux servers and once for your VM.
Part 1: Basic Algorithms¶
In this part, you will implement three algorithms: top k, min count, and frequent. In Part 2, you will use these algorithms to analyze data extracted from tweets.
Before we describe the algorithms you will implement, we first discuss a function that we have provided for ordering tuples.
Ordering tuples
All three of the algorithms we describe in this section compute an ordered list of (key, value) pairs. We provide a function, named sort_count_pairs, that will handle sorting the pairs for you. It takes a list of the form:
[(k0, v0), (k1, v1), ...]
and sorts it. The natural sort order for pairs uses the first item in the pair as the primary sort key and the second item as the secondary sort key. This ordering is not desirable for our purposes. Instead, we will use the values (v0, v1, etc.) as the primary sort key. When sorting the pairs, our sort method puts larger values earlier in the ordering than smaller values. If two values are the same, then the keys (k0, k1, etc.) are used as the secondary sort key; they are ordered in increasing order.
For example, given the list:
[('D', 1), ('C', 2), ('A', 5), ('B', 2)]
our function would yield:
[('A', 5), ('B', 2), ('C', 2), ('D', 1)]
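To see what this ordering amounts to, here is a minimal sketch using Python’s built-in sorted; the name sort_pairs_sketch is illustrative only, and in the assignment you should simply call the provided sort_count_pairs:

def sort_pairs_sketch(pairs):
    # Negating the count makes larger counts sort first; ties are then
    # broken by the key in increasing order.
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))

print(sort_pairs_sketch([('D', 1), ('C', 2), ('A', 5), ('B', 2)]))
# [('A', 5), ('B', 2), ('C', 2), ('D', 1)]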
Top K
The first algorithm, Top K, which will be familiar from class, computes the \(K\) items that occur most frequently in the list. To do this computation, you will need to:
- use a dictionary to count the number of times each unique value occurs,
- extract a list of (key, count) pairs from the dictionary,
- sort the pairs using the supplied function and finally,
- pick off the first \(K\) pairs.
The first step can and should be done with one pass over the data.
Given the following list:
l = ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'A', 'D']
and a value of two for \(K\), this algorithm would yield:
[('A', 5), ('B', 2)]
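A minimal sketch of these four steps appears below. The name find_top_k_sketch and the inline sorted call are illustrative; your solution should use the provided sort_count_pairs instead:

def find_top_k_sketch(items, k):
    # One pass over the data to count occurrences of each unique item.
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    # Extract (key, count) pairs and sort: count descending, key ascending.
    pairs = sorted(counts.items(), key=lambda p: (-p[1], p[0]))
    # Pick off the first k pairs.
    return pairs[:k]

l = ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'A', 'D']
print(find_top_k_sketch(l, 2))   # [('A', 5), ('B', 2)]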
Minimum number of occurrences
The second algorithm, Min Count, finds the items in a list that occur at least some specified minimum number of times. To perform this algorithm, you will need to:
- compute the counts,
- build a list of the items and associated counts that meet the threshold, and then
- sort it using the supplied function.
Running this algorithm on the list l defined above with a minimum count of two would yield the following list:
[('A', 5), ('B', 2), ('C', 2)]
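Min Count shares the counting step with Top K. One possible sketch (again using an inline sort where your solution should call sort_count_pairs) is:

def find_min_count_sketch(items, min_count):
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    # Keep only the items whose counts meet the threshold, then sort.
    pairs = [(key, count) for key, count in counts.items() if count >= min_count]
    return sorted(pairs, key=lambda p: (-p[1], p[0]))

l = ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'A', 'D']
print(find_min_count_sketch(l, 2))   # [('A', 5), ('B', 2), ('C', 2)]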
Frequent items
The previous two algorithms require space proportional to the number of unique items in the list. The third algorithm, Frequent, finds the items in a sequence with a frequency that exceeds a \(1/K\) fraction of \(N\), the total number of items. This algorithm uses space proportional to \(K\), which is likely to be much smaller than \(N\).
The frequent algorithm uses a data structure \(D\) with up to \(K-1\) counters, each associated with a particular item, to track an approximation to the frequency of the corresponding items in the list. For a given list item \(I\), you will update the counters using the following rules:
- If item \(I\) occurs in \(D\), then increment by one the count associated with \(I\).
- If item \(I\) does not occur in \(D\) and there are fewer than \(K-1\) items in \(D\), then add \(I\) with a value of 1 to \(D\).
- If item \(I\) does not occur in \(D\) and there are \(K-1\) items in \(D\), then decrement all of the counters by one and remove any items with a count of zero from \(D\).
Once this data structure is computed, you will need to extract the (key, count) pairs from it and sort them using our sort method.
Before we look at the result of applying this algorithm to the list l, let’s look at a more straightforward example. Given the list:
["A", "A", "B", "B", "A", "B", "C", "B", "A"]
and a value of three for \(K\), the frequent algorithm will yield:
[('A', 3), ('B', 3)]
Notice that while the items identified by the frequent algorithm are correct, the counts are not. The algorithm computes an approximation, not the exact count.
Returning to our earlier example, running the algorithm on the list l defined above with a value of three for \(K\) would yield:
[('A', 3), ('D', 1)]
This result may seem a bit odd. The algorithm guarantees that the result will include any item whose frequency exceeds \(N/K\), which is why 'A' occurs in the result. If there are fewer than \(K-1\) items with a frequency that exceeds \(N/K\), then the result may include items that occur less frequently, which explains the presence of 'D' in the result.
It is instructive to think through what happens when running this algorithm with the list l defined above and two as the value of \(K\). The data structure \(D\) will contain one key, 'A', with an associated value of one after processing the last 'A' in the list. When the last value ('D') in the list is processed, the third rule (decrement all the counts) will be invoked and, as a result, the data structure \(D\) will be empty when the algorithm finishes.
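To make the update rules concrete, here is a sketch of the Frequent algorithm. The function name is illustrative, and the final inline sort should be replaced by a call to sort_count_pairs in your solution:

def find_frequent_sketch(items, k):
    D = {}   # at most k - 1 counters, keyed by item
    for item in items:
        if item in D:
            # Rule 1: the item already has a counter; increment it.
            D[item] += 1
        elif len(D) < k - 1:
            # Rule 2: there is room for a new counter.
            D[item] = 1
        else:
            # Rule 3: decrement every counter and drop those that hit zero.
            for key in list(D):
                D[key] -= 1
                if D[key] == 0:
                    del D[key]
    return sorted(D.items(), key=lambda p: (-p[1], p[0]))

print(find_frequent_sketch(["A", "A", "B", "B", "A", "B", "C", "B", "A"], 3))
# [('A', 3), ('B', 3)]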
This algorithm and other similar algorithms can also be used with streaming data. See the paper Finding Frequent Items in Data Streams by Cormode and Hadjieleftheriou for a good summary and an explanation of the relationship between the frequencies reported in the results and the true frequencies.
Task 0
Your first task is to implement the three algorithms described above.
We have provided code that runs a few test cases for each of the algorithms (see test_basic_algorithms.py). Our code assumes that you have defined three functions (find_top_k, find_min_count, and find_frequent), each of which takes a list and an integer as parameters and returns a sorted list of tuples. We encourage you to inspect the tests and ask yourself “Why did they choose these tests?” and “What tests are obviously missing?”
As in the previous assignments, our test code uses the pytest testing framework. You can use the -k flag and the name of the algorithm (top_k, min_count, or frequent) to run the test code for a specific algorithm. For example,
py.test test_basic_algorithms.py -k min_count
will run the tests for the min count algorithm.
Data¶
Twitter makes it possible to search for tweets with particular properties, say, from a particular user, containing specific terms, and within a given range of dates. There are several Python libraries that simplify the process of using this feature of Twitter. We used the TwitterSearch library to gather tweets from HRC and DJT’s Twitter feeds from mid-November 2015 through mid-October 2016 and stored the resulting data in JSON files. We have provided code that reads the data from the JSON files back into a list of dictionaries, one per tweet.
The representation of a tweet contains a lot of information: creation time, hashtags used, users mentioned, text of the tweet, etc. For example, here is the fourth tweet from HRC from Feb 1st:
'RT @HillaryforIA: The #iacaucus starts in 24 hours! #ImWithHer https://t.co/bF5fyfKbFt'
and here is an abbreviated version of the corresponding tweet dictionary that includes a few of the 20+ key/value pairs:
{'created_at': 'Mon Feb 01 01:18:37 +0000 2016',
'entities': {
'hashtags': [{'indices': [22, 31], 'text': 'iacaucus'},
{'indices': [52, 62], 'text': 'ImWithHer'}],
'symbols': [],
'urls': [],
'user_mentions': [{'id': 3056142064,
'id_str': '3056142064',
'indices': [3, 16],
'name': 'Hillary for Iowa',
'screen_name': 'HillaryforIA'}]},
'text': 'RT @HillaryforIA: The #iacaucus starts in 24 hours! #ImWithHer https://t.co/bF5fyfKbFt',
}
Notice that the value associated with entities, which contains information about hashtags, users mentioned, URLs included, etc., is itself a dictionary that maps strings to lists of dictionaries. The structure of the different entity dictionaries is dependent upon the entity type. For example, if we wanted to find the users mentioned in a tweet, we’d extract the "screen_name" values from the dictionaries in the list associated with the "user_mentions" key in the "entities" dictionary.
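For instance, a sketch of that extraction (using a stripped-down, made-up tweet dictionary rather than a full one) might look like this:

def mentioned_users_sketch(tweet):
    # Pull the screen name out of each user-mention entity, ignoring case.
    mentions = tweet["entities"]["user_mentions"]
    return [mention["screen_name"].lower() for mention in mentions]

tweet = {"entities": {"user_mentions": [{"screen_name": "HillaryforIA"}]}}
print(mentioned_users_sketch(tweet))   # ['hillaryforia']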
Part 2: Analyzing Tweets¶
Now that you have implemented the basic analysis algorithms, you can move on to analyzing the Twitter feeds. Your code for this part and all subsequent parts should be added to the file analyze.py. This file contains a main block to allow the program to be run from the command line. Run:
python3 analyze.py --help
to get a description of the arguments.
While we provide function headers for each of the required tasks, the structure of the rest of the code is entirely up to you. We will note that while some tasks can be done cleanly with one function, others cannot. Also, we recommend looking for sub-tasks that are common to multiple tasks and re-using previously completed functions.
Finding commonly-occurring entities¶
What are DJT’s favorite hashtags? Which URLs are referenced at least 5 times by HRC? Are there users who represent at least 5% of the mentions in DJT’s tweets? Answering these questions requires us to extract the desired entities (hashtags, user mentions, etc) from the candidate’s tweets and process them appropriately.
For these tasks, we will ignore case differences and so, for example, we will consider “demdebate” to be the same as “DemDebate”.
Common Parameters
The next three tasks will use many of the same parameters. To avoid repetition later, we describe them here:
- tweets will be a list of tweet dictionaries,
- entity_key will be one of "hashtags", "urls", or "user_mentions", and
- value_key will be one of "text", "url", or "screen_name".
Task 1: Top K entities
For Task 1, you must complete the function:
def find_top_k_entities(tweets, entity_key, value_key, k):
where the first three parameters are as described above and k is an integer. This function should return a sorted list of (entity, count) pairs with the k most frequently occurring entities and their associated counts.
Running this function on the tweets in hrc-feb.json with "hashtags" as the entity key, "text" as the value key, and a value of three for k, for example, would yield:
[('imwithher', 70), ('demdebate', 49), ('demtownhall', 28)]
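One possible shape for this function, assuming your Part 1 find_top_k is available in basic_algorithms.py, is sketched below; Tasks 2 and 3 follow the same pattern with find_min_count and find_frequent, respectively.

from basic_algorithms import find_top_k

def find_top_k_entities_sketch(tweets, entity_key, value_key, k):
    # Collect one lower-cased value per entity across all tweets, then
    # reuse the Top K algorithm from Part 1.
    values = []
    for tweet in tweets:
        for entity in tweet["entities"][entity_key]:
            values.append(entity[value_key].lower())
    return find_top_k(values, k)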
Task 2: Minimum number of occurrences
For Task 2, you must complete the function:
def find_min_count_entities(tweets, entity_key, value_key, min_count):
where the first three parameters are as described above and min_count is an integer that specifies the minimum number of times an entity must occur to be included in the result. This function should return a sorted list of (entity, count) pairs.
Running this function on the tweets in djt-feb.json with "hashtags" as the entity key, "text" as the value key, and a value of 15 for min_count, for example, would yield:
[('trump2016', 54), ('makeamericagreatagain', 50), ('fitn', 16)]
FYI, #FITN stands for “First In The Nation”.
Task 3: Frequent entities
For Task 3, you must implement the function:
def find_frequent_entities(tweets, entity_key, value_key, k):
where the first three parameters are as described above and k is an integer. This function should return a sorted list of (entity, count) pairs computed using the frequent algorithm.
Running this function on the tweets in hrc-feb.json with "hashtags" as the entity key, "text" as the value key, and a value of three for k, for example, would yield:
[('imwithher', 6),
('iacaucus', 3)]
Testing
We have provided test cases for the basic analysis algorithms and for the tweet-based examples shown in the writeup. Beyond the few examples that we provided, you are on your own for testing these and all subsequent tasks.
Analyzing N-grams¶
If looking at frequently occurring hashtags or finding all the users mentioned some minimum number of times yields some insight into a candidate, what might we learn by identifying commonly occurring pairs of words, word triples, etc?
In this part, you will apply the three algorithms described earlier to contiguous sequences of \(N\) words, which are known as n-grams. Before you can apply these algorithms to a candidate’s tweets, you will pre-process the tweets to try to reveal salient words.
Pre-processing step
The pre-processing step converts the text of a tweet into a list or tuple of strings.
We will define a word to be any sequence of characters delimited by whitespace. For example, “abc”, “7xy”, and “#2016election.” are all words by this definition.
Once you extract the words from a tweet, you should remove leading and trailing punctuation and convert the word to lower case. We have defined a constant, PUNCTUATION, that specifies what constitutes punctuation for this assignment. Note that apostrophes, as in the word “it’s”, should not be removed, because they occur in the middle of the word.
Finally, you should eliminate words that appear in a specified set of stop words (for example, “an”, “of”, and “to”) or that start with a specified set of stop prefixes (for example, "#", "@", and "http").
Here is an example: pre-processing the text from the fourth tweet from DJT on Feb 1st:
'RT @kevcirilli: CEDAR RAPIDS –\n\nTRUMP\'S DAUGHT IVANKA:\n\n"I can just say without equivocation, my father will make America great again."'
with the sets STOP_WORDS["djt"] and STOP_PREFIXES["default"] should yield:
('cedar', 'rapids', "trump's", 'daught', 'ivanka', 'i', 'can', 'just',
'say', 'without', 'equivocation', 'my', 'father', 'will', 'make',
'america', 'great', 'again')
You will find the lower, split, strip, and startswith methods from the str API useful for this step.
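Here is one way the pre-processing step might be sketched. The PUNCT string is only a stand-in for the PUNCTUATION constant in analyze.py, the example text is made up, and the exact order of operations is yours to work out:

PUNCT = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~'   # illustrative stand-in (deliberately excludes "#" and "@")

def preprocess_sketch(text, stop_words, stop_prefixes):
    words = []
    for word in text.split():
        # Strip leading/trailing punctuation only, so interior apostrophes
        # (as in "it's") survive, then convert to lower case.
        word = word.strip(PUNCT).lower()
        if not word or word in stop_words:
            continue
        # Drop words that begin with any stop prefix ("#", "@", "http", ...).
        if any(word.startswith(prefix) for prefix in stop_prefixes):
            continue
        words.append(word)
    return tuple(words)

print(preprocess_sketch("RT @SomeUser: Make America great again! #example",
                        stop_words={"rt"}, stop_prefixes={"#", "@", "http"}))
# ('make', 'america', 'great', 'again')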
Representing N-grams
N-grams should be represented as tuples of strings. Given a value of 2 for \(N\), the above DJT tweet would yield the following bi-grams (2-grams):
('cedar', 'rapids')
('rapids', "trump's")
("trump's", 'daught')
('daught', 'ivanka')
('ivanka', 'i')
('i', 'can')
('can', 'just')
('just', 'say')
('say', 'without')
('without', 'equivocation')
('equivocation', 'my')
('my', 'father')
('father', 'will')
('will', 'make')
('make', 'america')
('america', 'great')
('great', 'again')
You might ask “Why use tuples and not lists to represent n-grams?” Tuples of strings are immutable (that is, cannot be modified) and as a result, can be used as keys for dictionaries. Lists, on the other hand, cannot be used as dictionary keys because the data structure used to implement dictionaries relies on the fact that keys are not mutable.
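A sketch of how a pre-processed tuple of words might be turned into a list of n-grams (the helper name is illustrative):

def ngrams_sketch(words, n):
    # Each n-gram is a tuple of n consecutive words; tuples, unlike lists,
    # are hashable and can therefore be used as dictionary keys.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams_sketch(('make', 'america', 'great', 'again'), 2))
# [('make', 'america'), ('america', 'great'), ('great', 'again')]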
Common parameters
As in Tasks 1-3, Tasks 4-6 have several parameters in common:
- tweets is a list of tweet dictionaries,
- n is the number of words in a sequence,
- stop_words is a set of words (strings) to ignore, and
- stop_prefixes is a set of strings that describe prefixes for words that should be ignored.
Task 4: Top K N-grams
Your task is to implement the function:
def find_top_k_ngrams(tweets, n, stop_words, stop_prefixes, k):
where the first four parameters are as described above and k is the desired number of n-grams. This function should return a sorted list of (n-gram, integer count of occurrences) pairs for the k most frequently occurring n-grams.
Running this function on the tweets in hrc-feb.json with two as the value for n, the Clinton stop words (STOP_WORDS["hrc"]), the default stop prefixes (STOP_PREFIXES["default"]), and three as the value for k would yield:
[(('new', 'hampshire'), 22),
(('south', 'carolina'), 14),
(('stand', 'with'), 12)]
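Assuming helpers along the lines of the pre-processing and n-gram sketches above, and the find_top_k you wrote for Part 1, Task 4 can be little more than glue code; Tasks 5 and 6 follow the same pattern with find_min_count and find_frequent.

from basic_algorithms import find_top_k

def find_top_k_ngrams_sketch(tweets, n, stop_words, stop_prefixes, k):
    # Gather the n-grams from every tweet's text, then reuse Top K.
    # preprocess_sketch and ngrams_sketch are the hypothetical helpers
    # sketched earlier in this writeup.
    all_ngrams = []
    for tweet in tweets:
        words = preprocess_sketch(tweet["text"], stop_words, stop_prefixes)
        all_ngrams.extend(ngrams_sketch(words, n))
    return find_top_k(all_ngrams, k)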
Task 5: Minimum number of n-gram occurrences
Your task is to implement the function:
def find_min_count_ngrams(tweets, n, stop_words, stop_prefixes, min_count):
where the first four parameters are as described above and min_count specifies the minimum number of times an n-gram must occur to be included in the result. This function should return a sorted list of (n-gram, integer count of occurrences) pairs.
Running this function on the tweets in djt-feb.json with two as the value for n, the Trump stop words (STOP_WORDS["djt"]), the default stop prefixes (STOP_PREFIXES["default"]), and 30 as the value for min_count would yield:
[(('ted', 'cruz'), 36),
(('south', 'carolina'), 35),
(('i', 'will'), 32),
(('new', 'hampshire'), 31)]
Task 6: Frequent N-grams
Your task is to implement the function:
def find_frequent_ngrams(tweets, n, stop_words, stop_prefixes, k):
where the first four parameters are as described above and k is an integer. This function should return a sorted list of (n-gram, integer approximate count) pairs as computed using the Frequent algorithm.
Running this function on the tweets in hrc-feb.json with two as the value for n, the Clinton stop words (STOP_WORDS["hrc"]), the default stop prefixes (STOP_PREFIXES["default"]), and three as the value for k would yield:
[(('few', 'iowans'), 1)]
By Month Analysis¶
A candidate’s focus shifts over the course of an election. The number of voters who participate in the primaries is relatively small and their values are often not in line with those of the much larger general election electorate. The oft-discussed need to “pivot to the center” for the general election reflects this difference. An interesting question is whether this shift is visible in the candidates’ Twitter feeds. (In retrospect, 2016 may not be the best year to illustrate this principle.)
Task 7
We’ll look at this question by calculating the Top K n-grams by month. Specifically, your last task is to implement the function:
def find_top_k_ngrams_per_month(tweets, n, stop_words, stop_prefixes, k):
which takes the same parameters as find_top_k_ngrams and returns a list of tuples of the form ((year, month), list of the top k n-grams and their counts for that month, sorted using our function). The resulting list should be sorted using the standard sort function.
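One possible sketch of the grouping step is shown below: bucket the tweets by (year, month) using the created_at field shown in the example tweet above, then run your Task 4 function on each bucket. The helper name and the use of strptime are illustrative; util.py may already provide something similar.

from datetime import datetime

def group_tweets_by_month_sketch(tweets):
    # Bucket tweets by (year, month) parsed from strings such as
    # 'Mon Feb 01 01:18:37 +0000 2016'.
    by_month = {}
    for tweet in tweets:
        created = datetime.strptime(tweet["created_at"],
                                    "%a %b %d %H:%M:%S %z %Y")
        by_month.setdefault((created.year, created.month), []).append(tweet)
    return by_month

Calling find_top_k_ngrams on each bucket and sorting the resulting ((year, month), ...) tuples with the built-in sorted function would then complete the task.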
Running this function on the tweets in djt-all.json with two as the value for n, the combined stop words (STOP_WORDS["both"]), the default stop prefixes (STOP_PREFIXES["default"]), and three as the value for k would yield:
[((2015, 11),
[(('make', 'america'), 10),
(('south', 'carolina'), 9),
(('america', 'great'), 8)]),
((2015, 12),
[(('i', 'will'), 34), (('i', 'am'), 26), (('great', 'again'), 19)]),
((2016, 1), [(('i', 'will'), 33), (('ted', 'cruz'), 28), (('i', 'am'), 16)]),
((2016, 2),
[(('ted', 'cruz'), 36), (('south', 'carolina'), 35), (('i', 'will'), 32)]),
((2016, 3), [(('i', 'am'), 24), (('i', 'will'), 24), (('ted', 'cruz'), 19)]),
((2016, 4),
[(('new', 'york'), 28), (('i', 'will'), 22), (('great', 'again'), 15)]),
((2016, 5),
[(('i', 'will'), 25),
(('elizabeth', 'warren'), 16),
(('goofy', 'elizabeth'), 16)]),
((2016, 6), [(('i', 'will'), 15), (('join', 'me'), 14), (('i', 'am'), 12)]),
((2016, 7),
[(('bernie', 'sanders'), 19),
(('our', 'country'), 17),
(('make', 'america'), 12)]),
((2016, 8), [(('i', 'will'), 14), (('i', 'am'), 13), (('join', 'me'), 10)]),
((2016, 9),
[(('i', 'will'), 16), (('join', 'me'), 16), (('make', 'america'), 14)]),
((2016, 10),
[(('i', 'will'), 5), (('join', 'me'), 5), (('about', 'her'), 4)])]
Because the results can be a bit hard to decipher in this form, we have provided a function, util.pretty_print_by_month, that takes the output of find_top_k_ngrams_per_month and prints it in an easier-to-read form. For example, here is the result of calling pretty_print_by_month on the list above:
November 2015: make america 10
south carolina 9
america great 8
December 2015: i will 34
i am 26
great again 19
January 2016: i will 33
ted cruz 28
i am 16
February 2016: ted cruz 36
south carolina 35
i will 32
March 2016: i am 24
i will 24
ted cruz 19
April 2016: new york 28
i will 22
great again 15
May 2016: i will 25
elizabeth warren 16
goofy elizabeth 16
June 2016: i will 15
join me 14
i am 12
July 2016: bernie sanders 19
our country 17
make america 12
August 2016: i will 14
i am 13
join me 10
September 2016: i will 16
join me 16
make america 14
October 2016: i will 5
join me 5
about her 4
Submission¶
See these submission instructions if you are working alone.
See these submission instructions if you are working in the same pair as for PA #2.
See these submission instructions if you are working in a NEW pair.