Analyzing Candidate Tweets

Due: Friday, October 21st at 4pm

You may work alone or in a pair on this assignment.

The purpose of this assignment is to give you experience with using dictionaries to represent data and as a mechanism for mapping keys to values that change over time.

Introduction

The candidates in the 2016 presidential primaries and general election have used Twitter as a way to communicate with their supporters (and detractors) and gain media attention. In this assignment, you will analyze tweets from Hillary Clinton’s (HRC) and Donald Trump’s (DJT) Twitter feeds to gain some insight into how they use Twitter as a campaign tool. We’ll ask questions such as:

  • What is HRC’s favorite hashtag? [#demdebate]
  • Who has DJT mentioned at least 100 times? [realdonaldtrump, cnn, foxnews]
  • What words occur most often in HRC’s tweets? [our, with, are, have, not]
  • What two-word phrases occur most often in DJT’s tweets? [i will, make america, great again, i am, america great]
  • How do the candidates’ feeds change over time? [Not as much as you might think!]

Getting started

See these start-up instructions if you intend to work alone.

See these start-up instructions if you intend to work with the same partner as in PA #2.

See these start-up instructions if you intend to work in a NEW pair.

Here is a description of the contents of this directory:

  • basic_algorithms.py – you will add code for the basic analysis algorithms to this file.
  • test_basic_algorithms.py – this file contains very basic test code for the basic algorithms.
  • analyze.py – you will add your code for extracting information from tweets and analyzing it to this file. Note that this file includes several constants and two functions that you may find useful. They are documented with code comments; make sure to read and understand them before you start writing code.
  • test_analyze_writeup.py – test code for the examples in the writeup.
  • util.py – this file contains a few useful functions.
  • data/ – a directory for the data files. Since the data files are very large they are not included in the Git repository. See below for instructions on how to obtain the data files.

The data/ directory contains a file called get_files.sh that will download the data files. To download the data files, go into the data directory and run this command:

./get_files.sh

This script will download the complete HRC dataset (hrc-all.json), the complete DJT dataset (djt-all.json), and several smaller datasets that may be useful for testing purposes. Please note that you must be connected to the network to use this script.

Do not add the data to your repository! If you wish to use both CSIL & the Linux servers and your VM, you will need to run the get_files.sh script twice: once for CSIL & the Linux servers and once for your VM.

Part 1: Basic Algorithms

In this part, you will implement three algorithms: top k, min count, and frequent. In Part 2, you will use these algorithms to analyze data extracted from tweets.

Before we describe the algorithms you will implement, we first discuss a function that we have provided for ordering tuples.

Ordering tuples

All three of the algorithms we describe in this section compute an ordered list of (key, value) pairs. We provide a function, named sort_count_pairs, that will handle sorting the pairs for you. It takes a list of the form:

[(k0, v0), (k1, v1), ...]

and sorts it. Python’s natural sort order for pairs uses the first item in each pair as the primary sort key and the second item as the secondary sort key. That ordering is not what we want here. Instead, our function uses the values (v0, v1, etc.) as the primary sort key and puts larger values earlier in the ordering than smaller values. If two values are the same, the keys (k0, k1, etc.) are used as the secondary sort key, ordered in increasing order.

For example, given the list:

[('D', 1), ('C', 2), ('A', 5), ('B', 2)]

our function would yield:

[('A', 5), ('B', 2), ('C', 2), ('D', 1)]
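
For reference, this ordering can be expressed with a sort key that negates the count. Here is a minimal sketch of the idea; the provided sort_count_pairs may be implemented differently, so use the provided function in your code.

# A sketch of the ordering used by sort_count_pairs: decreasing by count,
# with ties broken by increasing key. Use the provided function, not this one.
def sort_count_pairs_sketch(pairs):
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))

# sort_count_pairs_sketch([('D', 1), ('C', 2), ('A', 5), ('B', 2)])
# => [('A', 5), ('B', 2), ('C', 2), ('D', 1)]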

Top K

The first algorithm, Top K, which will be familiar from class, computes the \(K\) items that occur most frequently in the list. To do this computation, you will need to:

  1. use a dictionary to count the number of times each unique value occurs,
  2. extract a list of (key, count) pairs from the dictionary,
  3. sort the pairs using the supplied function and finally,
  4. pick off the first \(K\) pairs.

The first step can and should be done with one pass over the data.

Given the following list:

l = ['A', 'B', 'C', 'A', 'A', 'B', 'A', 'C', 'A', 'D']

and a value of two for \(K\), this algorithm would yield:

[('A', 5), ('B', 2)]
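
Here is one possible sketch of those four steps. It assumes that sort_count_pairs can be imported from util.py; check where the function is actually defined in your copy of the code.

from util import sort_count_pairs   # assumption: sort_count_pairs lives in util.py

def find_top_k(items, k):
    '''Return the k most frequently occurring items (a sketch).'''
    counts = {}
    for item in items:                  # one pass over the data
        counts[item] = counts.get(item, 0) + 1
    pairs = list(counts.items())        # (key, count) pairs
    return sort_count_pairs(pairs)[:k]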

Minimum number of occurrences

The second algorithm, Min Count, finds the items in a list that occur at least some specified minimum number of times. To perform this algorithm, you will need to:

  1. compute the counts,
  2. build a list of the items and associated counts that meet the threshold, and then
  3. sort it using the supplied function.

Running this algorithm on the list l defined above with a minimum count of two would yield the following list:

[('A', 5), ('B', 2), ('C', 2)]
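
A minimal sketch of this algorithm, again assuming the provided sort_count_pairs:

def find_min_count(items, min_count):
    '''Return the items occurring at least min_count times (a sketch).'''
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    pairs = [(item, n) for item, n in counts.items() if n >= min_count]
    return sort_count_pairs(pairs)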

Frequent items

The previous two algorithms require space proportional to the number of unique items in the list. The third algorithm, Frequent, finds the items in a sequence with a frequency that exceeds a \(1/K\) fraction of \(N\), the total number of items. This algorithm uses space proportional to \(K\), which is likely to be much smaller than \(N\).

The frequent algorithm uses a data structure \(D\) with up to \(K-1\) counters, each associated with a particular item, to track an approximation to the frequency of the corresponding items in the list. For a given list item \(I\), you will update the counters using the following rules:

  • If item \(I\) occurs in \(D\), then increment by one the count associated with \(I\).
  • If item \(I\) does not occur in \(D\) and there are fewer than \(K-1\) items in \(D\), then add \(I\) with a value of 1 to \(D\).
  • If item \(I\) does not occur in \(D\) and there are \(K-1\) items in \(D\), then decrement all of the counters by one and remove any items with a count of zero from \(D\).

Once this data structure is computed, you will need to extract the (key, count) pairs from it and sort them using our sort method.

Before we look at the result of applying this algorithm to the list l, let’s look at a more straightforward example. Given the list:

["A", "A", "B", "B", "A", "B", "C", "B", "A"]

and a value of three for \(K\), the frequent algorithm will yield:

[('A', 3), ('B', 3)]

Notice that while the items identified by the frequent algorithm are correct, the counts are not. The algorithm computes an approximation, not the exact count.

Returning to our earlier example, running the algorithm on the list l defined above with a value of three for \(K\) would yield:

[('A', 3), ('D', 1)]

This result may seem a bit odd. The algorithm guarantees that the result will include any item whose frequency exceeds \(N/K\), which is why 'A' occurs in the result. If there are fewer than \(K-1\) items with a frequency that exceeds \(N/K\), then the result may include items that occur less frequently, which explains the presence of 'D' in the result.

It is instructive to think through what happens when running this algorithm with the list l defined above and two as the value of \(K\). The data structure \(D\) will contain one key, 'A', with an associated value of one after processing the last 'A' in the list. When the last value ('D') in the list is processed, the third rule (decrement all the counts) will be invoked and, as a result, the data structure \(D\) will be empty when the algorithm finishes.
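
Putting the three rules together, here is a sketch of the frequent algorithm. It reproduces the examples above, but treat it as a starting point, not a definitive implementation.

def find_frequent(items, k):
    '''Approximate the items occurring more than len(items)/k times (a sketch).'''
    counters = {}                       # the data structure D: item -> counter
    for item in items:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter and drop any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return sort_count_pairs(list(counters.items()))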

This algorithm and other similar algorithms can also be used with streaming data. See the paper Finding Frequent Items in Data Streams by Cormode and Hadjieleftheriou for a good summary and an explanation of the relationship between the frequencies reported in the results and the true frequencies.

Task 0

Your first task is to implement the three algorithms described above. We have provided code that runs a few test cases for each of the algorithms (see test_basic_algorithms.py). Our code assumes that you have defined three functions—find_top_k, find_min_count, and find_frequent—each of which takes a list and an integer as parameters and returns a sorted list of tuples. We encourage you to inspect the tests and ask yourself “Why did they choose these tests?” and “What tests are obviously missing?”

As in the previous assignments, our test code uses the pytest testing framework. You can use the -k flag and the name of an algorithm (top_k, min_count, or frequent) to run the test code for that algorithm. For example,

py.test test_basic_algorithms.py -k min_count

will run the tests for the min count algorithm.

Data

Twitter makes it possible to search for tweets with particular properties, say, from a particular user, containing specific terms, and within a given range of dates. There are several Python libraries that simplify the process of using this feature of Twitter. We used the TwitterSearch library to gather tweets from HRC and DJT’s Twitter feeds from mid-November 2015 through mid-October 2016 and stored the resulting data in JSON files. We have provided code that reads the data from the JSON files back into a list of dictionaries, one per tweet.
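
For illustration only, here is a sketch of how a file holding a single JSON list of tweets could be read with the standard json module; the provided function handles the actual file format, so check util.py before rolling your own.

import json

def read_tweets_sketch(filename):
    '''Read a JSON file containing a list of tweet dictionaries (a sketch).'''
    with open(filename) as f:
        return json.load(f)

# tweets = read_tweets_sketch("data/hrc-feb.json")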

The representation of a tweet contains a lot of information: creation time, hashtags used, users mentioned, text of the tweet, etc. For example, here is the fourth tweet from HRC from Feb 1st.

"'RT @HillaryforIA: The #iacaucus starts in 24 hours! #ImWithHer https://t.co/bF5fyfKbFt'"

and here is an abbreviated version of the corresponding tweet dictionary that includes a few of the 20+ key/value pairs:

{'created_at': 'Mon Feb 01 01:18:37 +0000 2016',
 'entities': {'hashtags': [{'indices': [22, 31], 'text': 'iacaucus'},
                           {'indices': [52, 62], 'text': 'ImWithHer'}],
              'symbols': [],
              'urls': [],
              'user_mentions': [{'id': 3056142064,
                                 'id_str': '3056142064',
                                 'indices': [3, 16],
                                 'name': 'Hillary for Iowa',
                                 'screen_name': 'HillaryforIA'}]},
 'text': 'RT @HillaryforIA: The #iacaucus starts in 24 hours! #ImWithHer https://t.co/bF5fyfKbFt'}

Notice that the value associated with entities, which contains information about hashtags, users mentioned, URLs included, etc., is itself a dictionary that maps strings to lists of dictionaries. The structure of the different entity dictionaries depends on the entity type. For example, if we wanted to find the users mentioned in a tweet, we’d extract the "screen_name" values from the dictionaries in the list associated with the "user_mentions" key in the "entities" dictionary.
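
For example, here is a sketch of pulling the mentioned screen names out of a single tweet dictionary (the function name is ours, not part of the required interface):

def mentioned_users_sketch(tweet):
    '''Extract the screen names mentioned in one tweet (a sketch).'''
    return [mention["screen_name"]
            for mention in tweet["entities"]["user_mentions"]]

# For the HRC tweet above, this yields ['HillaryforIA'].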

Part 2: Analyzing Tweets

Now that you have implemented the basic analysis algorithms, you can move on to analyzing the Twitter feeds. Your code for this part and all subsequent parts should be added to the file analyze.py. This file contains a main block to allow the program to be run from the command-line. Run:

python3 analyze.py --help

to get a description of the arguments.

While we provide function headers for each of the required tasks, the structure of the rest of the code is entirely up to you. We will note that while some tasks can be done cleanly with one function, others cannot. Also, we recommend looking for sub-tasks that are common to multiple tasks and re-using previously completed functions.

Finding commonly-occurring entities

What are DJT’s favorite hashtags? Which URLs are referenced at least 5 times by HRC? Are there users who represent at least 5% of the mentions in DJT’s tweets? Answering these questions requires us to extract the desired entities (hashtags, user mentions, etc.) from the candidates’ tweets and process them appropriately.

For these tasks, we will ignore case differences and so, for example, we will consider “demdebate” to be the same as “DemDebate”.

Common Parameters

The next three tasks will use many of the same parameters. To avoid repetition later, we describe them here:

  • tweets will be a list of tweet dictionaries,
  • entity_key will be one of "hashtags", "urls", or "user_mentions", and
  • value_key will be one of "text", "url", or "screen_name".

Task 1: Top K entities

For Task 1, you must complete the function:

def find_top_k_entities(tweets, entity_key, value_key, k):

where the first three parameters are as described above and k is an integer. This function should return a sorted list of (entity, count) pairs with the k most frequently occurring entities and their associated counts.

Running this function on the tweets in hrc-feb.json with "hashtags" as the entity type, "text" as the value key, and a value of three for k, for example, would yield:

[('imwithher', 70), ('demdebate', 49), ('demtownhall', 28)]
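
One natural structure, sketched below, is to extract the (lower-cased) entity values once and then reuse find_top_k from Part 1. The helper name and the import are our assumptions; organize your code however you see fit.

from basic_algorithms import find_top_k   # assumption: your Part 1 code lives here

def extract_entity_values(tweets, entity_key, value_key):
    '''Collect the lower-cased values for one entity type (a helper sketch).'''
    values = []
    for tweet in tweets:
        for entity in tweet["entities"][entity_key]:
            values.append(entity[value_key].lower())
    return values

def find_top_k_entities(tweets, entity_key, value_key, k):
    return find_top_k(extract_entity_values(tweets, entity_key, value_key), k)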

Task 2: Minimum number of occurrences

For Task 2, you must complete the function:

def find_min_count_entities(tweets, entity_key, value_key, min_count):

where the first three parameters are as described above and min_count is an integer that specifies the minimum number of times an entity must occur to be included in the result. This function should return a sorted list of (entity, count) pairs.

Running this function on the tweets in djt-feb.json with "hashtags" as the entity type, "text" as the value key, and a value of 15 for min_count, for example, would yield:

[('trump2016', 54), ('makeamericagreatagain', 50), ('fitn', 16)]

FYI, #FITN stands for “First In The Nation”.

Task 3: Frequent entities

For Task 3, you must implement the function:

def find_frequent_entities(tweets, entity_key, value_key, k):

where the first three parameters are as described above and k is an integer. This function should return a sorted list of (entity, count) pairs computed using the frequent algorithm.

Running this function on the tweets in hrc-feb.json with "hashtags" as the entity type, "text" as the value key, and a value of three for k, for example, would yield:

[('imwithher', 6),
 ('iacaucus', 3)]

Testing

We have provided test cases for the basic analysis algorithms and for the tweet-based examples shown in the writeup. Beyond the few examples that we provided, you are on your own for testing these and all subsequent tasks.

Analyzing N-grams

If looking at frequently occurring hashtags or finding all the users mentioned some minimum number of times yields some insight into a candidate, what might we learn by identifying commonly occurring pairs of words, word triples, etc?

In this part, you will apply the three algorithms described earlier to contiguous sequences of \(N\) words, which are known as n-grams. Before you can apply these algorithms to a candidate’s tweets, you will pre-process the tweets to try to reveal salient words.

Pre-processing step

The pre-processing step converts the text of a tweet into a list or tuple of strings.

We will define a word to be any sequence of characters delimited by whitespace. For example, “abc”, “7xy”, and “#2016election.” are all words by this definition.

Once you extract the words from a tweet, you should remove leading and trailing punctuation and convert the word to lower case. We have defined a constant, PUNCTUATION, that specifies what constitutes punctuation for this assignment. Note that apostrophes, as in the word “it’s”, should not be removed, because they occur in the middle of the word.

Finally, you should eliminate words that appear in a specified set of stop words (for example, “an”, “of”, and “to”) or that start with a specified set of stop prefixes (for example, "#", "@", and "http").

Here is an example: pre-processing the text from the fourth tweet from DJT on Feb 1st:

'RT @kevcirilli: CEDAR RAPIDS –\n\nTRUMP\'S DAUGHT IVANKA:\n\n"I can just say without equivocation, my father will make America great again."'

with the sets STOP_WORDS["djt"] and STOP_PREFIXES["default"] should yield:

('cedar', 'rapids', "trump's", 'daught', 'ivanka', 'i',  'can',  'just',
 'say', 'without', 'equivocation', 'my', 'father', 'will', 'make',
 'america', 'great', 'again')

You will find the lower, split, strip, and startswith methods from the str API useful for this step.
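
Here is a sketch of the pre-processing step. The exact behavior depends on the definitions of PUNCTUATION, STOP_WORDS, and STOP_PREFIXES in analyze.py (for example, whether "#" and "@" count as punctuation), so check those constants and the expected output above before settling on an ordering of the checks.

def preprocess_text_sketch(text, stop_words, stop_prefixes):
    '''Convert tweet text into a cleaned tuple of words (a sketch).'''
    words = []
    for word in text.split():
        # One possible ordering: filter prefixed words before stripping,
        # so that "@kevcirilli:" is dropped even if "@" is punctuation.
        if any(word.startswith(prefix) for prefix in stop_prefixes):
            continue
        word = word.strip(PUNCTUATION).lower()
        if not word or word in stop_words:
            continue
        words.append(word)
    return tuple(words)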

Representing N-grams

N-grams should be represented as tuples of strings. Given a value of 2 for \(N\), the above DJT tweet would yield the following bi-grams (2-grams):

('cedar', 'rapids')
('rapids', "trump's")
("trump's", 'daught')
('daught', 'ivanka')
('ivanka', 'i')
('i', 'can')
('can', 'just')
('just', 'say')
('say', 'without')
('without', 'equivocation')
('equivocation', 'my')
('my', 'father')
('father', 'will')
('will', 'make')
('make', 'america')
('america', 'great')
('great', 'again')

You might ask “Why use tuples and not lists to represent n-grams?” Tuples of strings are immutable (that is, they cannot be modified) and as a result can be used as keys for dictionaries. Lists, on the other hand, cannot be used as dictionary keys because the data structure used to implement dictionaries relies on the fact that keys are not mutable.
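
A sketch of turning a pre-processed tuple of words into its n-grams:

def make_ngrams_sketch(words, n):
    '''Return the contiguous n-grams of a word sequence as tuples (a sketch).'''
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# make_ngrams_sketch(('make', 'america', 'great', 'again'), 2)
# => [('make', 'america'), ('america', 'great'), ('great', 'again')]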

Common parameters

As in Tasks 1-3, Tasks 4-6 have several parameters in common:

  • tweets is a list of tweet dictionaries,
  • n is the number of words in a sequence,
  • stop_words is a set of words (strings) to ignore, and
  • stop_prefixes is a set of strings that describe prefixes for words that should be ignored.

Task 4: Top K N-grams

Your task is to implement the function:

def find_top_k_ngrams(tweets, n, stop_words, stop_prefixes, k):

where the first four parameters are as described above and k is the desired number of n-grams. This function should return a sorted list of the (n-gram, integer count of occurrences) pairs for the k most frequently occurring n-grams.

Running this function on the tweets in hrc-feb.json with two as the value for n, the Clinton stop words (STOP_WORDS["hrc"]), the default stop prefixes (STOP_PREFIXES["default"]), and three as the value for k would yield:

[(('new', 'hampshire'), 22),
 (('south', 'carolina'), 14),
 (('stand', 'with'), 12)]
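
As with the entity tasks, one reasonable sketch is to pre-process every tweet, collect all of the resulting n-grams into one list, and reuse find_top_k. The helper names (preprocess_text_sketch, make_ngrams_sketch) refer to the sketches earlier in this writeup, not to required functions.

def all_ngrams_sketch(tweets, n, stop_words, stop_prefixes):
    '''Collect the n-grams from every tweet's text into one list (a sketch).'''
    ngrams = []
    for tweet in tweets:
        words = preprocess_text_sketch(tweet["text"], stop_words, stop_prefixes)
        ngrams.extend(make_ngrams_sketch(words, n))
    return ngrams

def find_top_k_ngrams(tweets, n, stop_words, stop_prefixes, k):
    return find_top_k(all_ngrams_sketch(tweets, n, stop_words, stop_prefixes), k)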

Task 5: Minimum number of n-gram occurrences

Your task is to implement the function:

def find_min_count_ngrams(tweets, n, stop_words, stop_prefixes, min_count):

where the first four parameters are as described above and min_count specifies the minimum number of times an n-gram must occur to be included in the result. This function should return a sorted list of (n-gram, integer count of occurrences) pairs.

Running this function on the tweets in djt-feb.json with two as the value for n, the Trump stop words (STOP_WORDS["djt"]), the default stop prefixes (STOP_PREFIXES["default"]), and 30 as the value for min_count would yield:

[(('ted', 'cruz'), 36),
 (('south', 'carolina'), 35),
 (('i', 'will'), 32),
 (('new', 'hampshire'), 31)]

Task 6: Frequent N-grams

Your task is to implement the function:

def find_frequent_ngrams(tweets, n, stop_words, stop_prefixes, k):

where the first four parameters are as described above and k is an integer. This function should return a sorted list of (n-gram, integer approximate count) pairs as computed using the Frequent algorithm.

Running this function on the tweets in hrc-feb.json with two as the value for n, the Clinton stop words (STOP_WORDS["hrc"]), the default stop prefixes (STOP_PREFIXES["default"]), and three as the value for k would yield:

[(('few', 'iowans'), 1)]

By Month Analysis

A candidate’s focus shifts over the course of an election. The number of voters who participate in the primaries is relatively small and their values are often not in line with those of the much larger general-election electorate. The oft-discussed need to “pivot to the center” for the general election reflects this difference. An interesting question is whether this shift is visible in the candidates’ Twitter feeds. (In retrospect, 2016 may not be the best year to illustrate this principle.)

Task 7

We’ll look at this question by calculating the Top K n-grams by month. Specifically, your last task is to implement the function:

def find_top_k_ngrams_per_month(tweets, n, stop_words, stop_prefixes, k):

which takes the same parameters as find_top_k_ngrams and returns a list of tuples of the form ((year, month), list of the top k n-grams and their counts sorted using our function). The resulting list should be sorted using the standard sort function. Running this function on the tweets in djt-all.json with two as the value for n, the combined stop words (STOP_WORDS["both"]), the default stop prefixes (STOP_PREFIXES["default"]), and three as the value for k would yield:

[((2015, 11),
  [(('make', 'america'), 10),
   (('south', 'carolina'), 9),
   (('america', 'great'), 8)]),
 ((2015, 12),
  [(('i', 'will'), 34), (('i', 'am'), 26), (('great', 'again'), 19)]),
 ((2016, 1), [(('i', 'will'), 33), (('ted', 'cruz'), 28), (('i', 'am'), 16)]),
 ((2016, 2),
  [(('ted', 'cruz'), 36), (('south', 'carolina'), 35), (('i', 'will'), 32)]),
 ((2016, 3), [(('i', 'am'), 24), (('i', 'will'), 24), (('ted', 'cruz'), 19)]),
 ((2016, 4),
  [(('new', 'york'), 28), (('i', 'will'), 22), (('great', 'again'), 15)]),
 ((2016, 5),
  [(('i', 'will'), 25),
   (('elizabeth', 'warren'), 16),
   (('goofy', 'elizabeth'), 16)]),
 ((2016, 6), [(('i', 'will'), 15), (('join', 'me'), 14), (('i', 'am'), 12)]),
 ((2016, 7),
  [(('bernie', 'sanders'), 19),
   (('our', 'country'), 17),
   (('make', 'america'), 12)]),
 ((2016, 8), [(('i', 'will'), 14), (('i', 'am'), 13), (('join', 'me'), 10)]),
 ((2016, 9),
  [(('i', 'will'), 16), (('join', 'me'), 16), (('make', 'america'), 14)]),
 ((2016, 10),
  [(('i', 'will'), 5), (('join', 'me'), 5), (('about', 'her'), 4)])]

Because the results can be a bit hard to decipher in this form, we have provided a function, util.pretty_print_by_month, that takes the output of find_top_k_ngrams_per_month and prints it in an easier-to-read form. For example, here is the result of calling pretty_print_by_month on the list above:

November 2015:   make america       10
                 south carolina     9
                 america great      8

December 2015:   i will             34
                 i am               26
                 great again        19

January 2016:    i will             33
                 ted cruz           28
                 i am               16

February 2016:   ted cruz           36
                 south carolina     35
                 i will             32

March 2016:      i am               24
                 i will             24
                 ted cruz           19

April 2016:      new york           28
                 i will             22
                 great again        15

May 2016:        i will             25
                 elizabeth warren   16
                 goofy elizabeth    16

June 2016:       i will             15
                 join me            14
                 i am               12

July 2016:       bernie sanders     19
                 our country        17
                 make america       12

August 2016:     i will             14
                 i am               13
                 join me            10

September 2016:  i will             16
                 join me            16
                 make america       14

October 2016:    i will             5
                 join me            5
                 about her          4
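
A hedged sketch of the grouping step: parse the month out of each tweet's created_at string (the format is shown in the example tweet above), bucket the tweets by (year, month), and apply find_top_k_ngrams to each bucket. The helper name is ours, and the strptime format string is an assumption based on the example; verify it against the data.

from datetime import datetime

def tweets_by_month_sketch(tweets):
    '''Group tweets into a dictionary keyed by (year, month) (a sketch).'''
    by_month = {}
    for tweet in tweets:
        created = datetime.strptime(tweet["created_at"],
                                    "%a %b %d %H:%M:%S %z %Y")
        by_month.setdefault((created.year, created.month), []).append(tweet)
    return by_month

def find_top_k_ngrams_per_month(tweets, n, stop_words, stop_prefixes, k):
    by_month = tweets_by_month_sketch(tweets)
    result = [(month, find_top_k_ngrams(month_tweets, n,
                                        stop_words, stop_prefixes, k))
              for month, month_tweets in by_month.items()]
    return sorted(result)    # standard sort: by (year, month)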

Submission

See these submission instructions if you are working alone.

See these submission instructions if you are working in the same pair as for PA #2.

See these submission instructions if you are working in a NEW pair.