Text Processing

In this lecture we will discuss a number of ways to process text data, with an emphasis on how to use regular expressions.

Regular Expressions

Regular expressions are a language for matching text patterns. As we'll see, it is a very expressive language, where I can specify patterns as simple as "does this string contain the word 'foo'?" to patterns to check whether a given string is an e-mail address or a URL. We typically refer to the specification of a single pattern as a regular expression (singular).

We will discuss regular expressions in the context of Python, but they appear in many other contexts.

To use regular expressions in Python, we need to import the "re" module.

In [1]:
import re

A simple pattern

Before we get into the language details of regular expressions, let's see how the re module works with a simple example that doesn't require understanding much about regular expressions. Let's say we have the following strings:

In [2]:
seq0 = "AAACCCTTTGGG"
seq1 = "AAGCGTTGGG"

And we want to determine whether the string GTT appears in these strings. We already know how to do this with string operations:

In [3]:
"GTT" in seq0
Out[3]:
False
In [4]:
"GTT" in seq1
Out[4]:
True

However, "GTT" is a valid regular expression: it simply specified the pattern GTT (i.e., literally that string of characters). So, we can instead use the re module. The findall function takes a string and a regular expression, and returns a list with every ocurrence of the pattern:

In [5]:
re.findall("GTT", seq0)
Out[5]:
[]
In [6]:
re.findall("GTT", seq1)
Out[6]:
['GTT']

The regex language

Regular expressions can be used to specify patterns that are much more complex than the simple example shown above. For example suppose we are given a text where we want to extract all the valid e-mail address in the text. Let's start with this simple string:

In [7]:
s = "purple alice-b@google.com max @uchicago foo@bar alan@cs.uchicago.edu XYZ@a...edu ?&#@%$!"

If we didn't have regular expressions, we would probably split the string, and then try to determine whether each individual string is an e-mail or not. We could start by checking whether it includes an @ sign, but that would also include Twitter handles. We could check whether it includes an @ sign and a dot sign, but that could include invalid e-mail addresses.

Regular expressions will give us a way to specify a "valid e-mail address" pattern in a compact way (and finding those patterns in a straightforward way). As we'll see, the language will allow us to easily specify sets of characters (letters, numbers, alphanumeric, custom sets, etc.), repetition of patterns, groups of patterns, etc.

First of all, when specifying patterns in a regular expression we can use the period character to represent any character (except newlines). We can use the findall function to return all the (non-overlapping) substrings that match a pattern. For example, the pattern .@. matches any character, followed by an at-sign, followed by any character:

In [8]:
re.findall(r".@.", s)
Out[8]:
['b@g', ' @u', 'o@b', 'n@c', 'Z@a', '#@%']

Note: We use "raw" strings when specifying regular expressions (r"..."). There is a good reason for that, but not one that is worth getting into in detail. If you'd like to learn more about why raw strings are preferable, read The Backlash Plague in Python's regular expression documentation.

This looks like it may be a good stepping stone towards an e-mail pattern, but notice how it includes portions of invalid e-mails address: ' @u is part of a Twitter handle, and o@b, 'Z@a', and '#@%' are not part of valid e-mail addresses.

Fortunately, regular expressions also define a number of other sets of characters:

  • \d: Matches any digit (0-9)
  • \s: Matches any whitespace character (spaces, newlines, tabs, etc.)
  • \w: Matches any alphanumeric character, plus the underscore (the w stands for "word")

So, let's try \w@\w

In [9]:
re.findall(r"\w@\w", s)
Out[9]:
['b@g', 'o@b', 'n@c', 'Z@a']

That seems to get us closer to what we want (it still includes two invalid e-mail address, o@b and Z@a, but let's worry about that later)

So far, we're only matching one character at a time, but we would like to actually match multiple characters before and after the at sign. Regular expressions allow us to add repetition qualifiers:

  • *: Matches 0 or more occurrences of the preceding character/pattern
  • +: Matches 1 or more occurrences of the preceding character/pattern
  • ?: Matches 0 or 1 occurrences of the precending character/pattern
  • {m,n}: Matches m to n occurrences of the precending character/pattern
In [10]:
re.findall(r"\w+@\w+", s)
Out[10]:
['b@google', 'foo@bar', 'alan@cs', 'XYZ@a']

We're getting closer, but \w does not include the hyphen or dot character. In regular expressions, we can specify custom sets of characters in square brackets. For example, [ab] is a set containing a and b and [ab]+ is "one or more occurrences of either a or b:

In [11]:
re.findall(r"[ab]+", "---aab--a---bbaabbaba-------b--ba")
Out[11]:
['aab', 'a', 'bbaabbaba', 'b', 'ba']

So, we can try including hyphens and dots in our e-mail regex:

In [12]:
re.findall(r"[\w.-]+@[\w.-]+", s)
Out[12]:
['alice-b@google.com', 'foo@bar', 'alan@cs.uchicago.edu', 'XYZ@a...edu']

Note: The hyphen appears at the end because hyphens also have a special meaning when specifying sets of characters. In particular, it is used to specify ranges of characters. For example, [a-z] is the set of characters "between a and z". If we want to include the literal hyphen character in a set of characters, it must appear as the last character, as done above.

Now we're getting all the valid e-mails, but some of the invalid e-mails are still slipping through! We could tweak this by saying that the domain (the part after the at sign) should include an alphanumeric string (including hyphens) followed by a dot followed by another alphanumeric string:

In [13]:
re.findall(r"[\w.-]+@[\w-]+\.[\w-]+", s)
Out[13]:
['alice-b@google.com', 'alan@cs.uchicago']

But now we're not matching all of alan@cs.uchicago.edu. To do this, we need to specify the repetition of a specific pattern, not just a character or set of characters. In particular, the domain will be an alphanumeric string (including hyphens), followed by one or more occurrences of a dot followed by an alphanumeric string. "A dot followed by an alphanumeric string" is a pattern (not a set of characters), because we expect the dot to appear in a specific position, not anywhere in the string. We can apply the + qualifier to that pattern by putting the pattern between (?: and ):

(?:\.[\w-]+)+

Note that we put a backslash before the dot so that it is interpreted as "literally the dot character" (as opposed to the special meaning that the dot character has in regular expressions)

In [14]:
re.findall(r"[\w.-]+@[\w-]+(?:\.[\w-]+)+", s)
Out[14]:
['alice-b@google.com', 'alan@cs.uchicago.edu']

Parsing e-mail addresses is actually much more complicated. A regex that captures (most) e-mail addresses is the following:

In [15]:
email_regex = "(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|\"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"

Source: http://emailregex.com/ (which, in turn, is based on the standard definition in RFC 5322

In [16]:
re.findall(email_regex, s)
Out[16]:
['alice-b@google.com', 'alan@cs.uchicago.edu']

Group matching

Sometimes, we may want to match a pattern while also taking into account where that pattern appears. For example, let's look at an e-mail message:

In [17]:
msg = open("sample-email.txt").read()
In [18]:
print(msg)
From: jdoe@machine.example
To: mary@example.net
Subject: Saying Hello
Date: Fri, 21 Nov 1997 09:55:06 -0600
Message-ID: <1234@local.machine.example>

This is a message just to say hello.
So, "Hello".

The first lines are called the headers of the e-mail. There are a lot of different headers, and we will only be concerned with the "From" and "To" headers, which specify the sender and recipient of the e-mail. Let's say we want to extract the sender of an e-mail; we could use the regular expression we specified before:

In [19]:
re.findall(email_regex, msg)
Out[19]:
['jdoe@machine.example', 'mary@example.net', '1234@local.machine.example']

However, this returns two e-mails (the From and To) and something that happens to match the e-mail regex, but is not actually an e-mail (it is an identifier for the e-mail message). We could be tempted to say "the first e-mail is the sender's e-mail", but that may not always work. For example, suppose we have multiple e-mail messages in one file:

In [20]:
emails = open("sample-emails.txt").read()
In [21]:
print(emails)
From: jdoe@machine.example
To: mary@example.net
Subject: Jeff's e-mail?
Date: Fri, 21 Nov 1997 09:55:06 -0600
Message-ID: <1234@local.machine.example>

What is Jeff's e-mail?


From: mary@example.net
To: jdoe@machine.example
Reply-To: smith@home.example
Subject: Re: Jeff's e-mail?
Date: Fri, 21 Nov 1997 10:01:10 -0600
Message-ID: <3456@example.net>
In-Reply-To: <1234@local.machine.example>
References: <1234@local.machine.example>

Jeff's e-mail is jeff@example.net, but
you may want to e-mail his assistant
rob@example.net


To: smith@home.example
From: jdoe@machine.example
Subject: Re: Jeff's e-mail?
Date: Fri, 21 Nov 1997 11:00:00 -0600
Message-ID: <abcd.1234@local.machine.test>
In-Reply-To: <3456@example.net>
References: <1234@local.machine.example> <3456@example.net>

Thanks!
In [22]:
re.findall(email_regex, emails)
Out[22]:
['jdoe@machine.example',
 'mary@example.net',
 '1234@local.machine.example',
 'mary@example.net',
 'jdoe@machine.example',
 'smith@home.example',
 '3456@example.net',
 '1234@local.machine.example',
 '1234@local.machine.example',
 'jeff@example.net',
 'rob@example.net',
 'smith@home.example',
 'jdoe@machine.example',
 'abcd.1234@local.machine.test',
 '3456@example.net',
 '1234@local.machine.example',
 '3456@example.net']

Extracting the sender e-mails from the above can be tricky. Fortunately, we can write a regular expression that matches all the e-mail addresses that are preceded by "From: "

In [23]:
re.findall(r"From: " + email_regex, msg)
Out[23]:
['From: jdoe@machine.example']
In [24]:
re.findall(r"From: " + email_regex, emails)
Out[24]:
['From: jdoe@machine.example',
 'From: mary@example.net',
 'From: jdoe@machine.example']

While we easily remove the "From: " using string processing functions, we can also use groups in regular expressions to specify the actual part of the pattern that we're interested in. A group is any pattern in parentheses:

In [25]:
re.findall(r"From: (" + email_regex + ")", emails)
Out[25]:
['jdoe@machine.example', 'mary@example.net', 'jdoe@machine.example']

The above regular expression only specifies one group, but we can specify multiple groups in a single regular expression. For example, let's take our original e-mail regex, and put parentheses around the part of the pattern before the @ sign and in the part after the @ sign:

In [26]:
re.findall(r"([\w.-]+)@([\w-]+(?:\.[\w-]+)+)", emails)
Out[26]:
[('jdoe', 'machine.example'),
 ('mary', 'example.net'),
 ('1234', 'local.machine.example'),
 ('mary', 'example.net'),
 ('jdoe', 'machine.example'),
 ('smith', 'home.example'),
 ('3456', 'example.net'),
 ('1234', 'local.machine.example'),
 ('1234', 'local.machine.example'),
 ('jeff', 'example.net'),
 ('rob', 'example.net'),
 ('smith', 'home.example'),
 ('jdoe', 'machine.example'),
 ('abcd.1234', 'local.machine.test'),
 ('3456', 'example.net'),
 ('1234', 'local.machine.example'),
 ('3456', 'example.net')]
In [27]:
re.findall(r"From: ([\w.-]+)@([\w-]+(?:\.[\w-]+)+)", emails)
Out[27]:
[('jdoe', 'machine.example'),
 ('mary', 'example.net'),
 ('jdoe', 'machine.example')]

Notice how we now get a list of tuples back, with each tuple containing the values for each group within the matched pattern.

When using a regular expression where we only expect, at most, one match, we can also use the search function, which returns a Match object that provides a better interface for working with these groups:

In [28]:
m = re.search(r"([\w.-]+)@([\w-]+(?:\.[\w-]+)+)", "alan@cs.uchicago.edu")
In [29]:
m.groups()
Out[29]:
('alan', 'cs.uchicago.edu')
In [30]:
m.group(0)
Out[30]:
'alan@cs.uchicago.edu'
In [31]:
m.group(1)
Out[31]:
'alan'
In [32]:
m.group(2)
Out[32]:
'cs.uchicago.edu'

We can also name each group by using (?P<groupname>...) instead of just parentheses:

In [33]:
m = re.search(r"(?P<user>[\w.-]+)@(?P<domain>[\w-]+(?:\.[\w-]+)+)", "alan@cs.uchicago.edu")
In [34]:
m.group("user")
Out[34]:
'alan'
In [35]:
m.group("domain")
Out[35]:
'cs.uchicago.edu'

The Enron e-mail dataset

The Enron Corporation was an American energy, commodities, and services company that collapsed due to corporate fraud and corruption. The Federal Energy Regulatory Commission publicly posted all the information it used in its investigation, including several hundreds of thousands of e-mails.

This email dataset has been used in a number of scholarly works on text processing, network analysis, etc. We will do a very simple analysis of the dataset, which we downloaded from https://www.cs.cmu.edu/~enron/. We converted the dataset into the mbox file format because it will make it easy to process using regular expressions, while still allowing us to process the dataset one message at a time.

Once you have downloaded the datasets, you will need to uncompress them:

gunzip emails-100k.mbx.gz
gunzip emails-all.mbx.gz

We are going to count how many e-mails are sent from each e-mail address. We can simply use the same regular expression we used earlier to extract the e-mail address from the "From" header. However, because we are going to check a lot of data, we are going to compile the regular expression so we can reuse it efficiently:

In [36]:
from_regex = re.compile(r"From: (" + email_regex + ")")

The from_regex variable now contains an object with an efficient internal representation of the regular expression (produced from the string specification we passed to compile, and which we were earlier passing to findall and search)

The following code is the standard "iterate over a list of things and count them" code, but where we use a regex to extract the e-mail address in the From header. Notice how we don't read the entire file into memory (it's 1.5GB!). Instead, we read it line by line and search each line using the regex.

In [37]:
enron = open("emails-all.mbx")

from_counts = {}
for l in enron:
    m = from_regex.search(l)
    if m is not None:
        email = m.group(1)
        if email not in from_counts:
            from_counts[email] = 0
        from_counts[email] += 1

We will use the Counter object from the collections module to produce the top 10 entries in the from_counts dictionary:

In [38]:
import collections

c = collections.Counter(from_counts)
c.most_common(10)
Out[38]:
[('kay.mann@enron.com', 16735),
 ('vince.kaminski@enron.com', 14368),
 ('jeff.dasovich@enron.com', 11411),
 ('pete.davis@enron.com', 9152),
 ('chris.germany@enron.com', 8801),
 ('sara.shackleton@enron.com', 8780),
 ('enron.announcements@enron.com', 8587),
 ('tana.jones@enron.com', 8490),
 ('steven.kean@enron.com', 6759),
 ('kate.symes@enron.com', 5441)]

Interestingly, there are not that many e-mails from Ken Lay or Jeff Skilling:

In [39]:
from_counts["kenneth.lay@enron.com"]
Out[39]:
36
In [40]:
from_counts["jeff.skilling@enron.com"]
Out[40]:
108

Avoid reinventing the wheel: the email module

Regular expressions provide a quick and convenient way to search for patterns in (potentially large) datasets.

Wait, forgot to escape a space. Wheeeeee[taptaptap]eeeeee.

However, whenever you find yourself parsing data that follows a standard format (like e-mails do), you should avoid reinventing the wheel. Someone has probably already implemented code to parse data in that format, and they probably account for a lot more corner cases than you do. For example, the From and To headers in an e-mail can list senders/recipients in a variety of ways:

In [41]:
s = open("sample-email-multiple-recipients.txt").read()
In [42]:
print(s)
From: "Joe Q. Public" <john.q.public@example.com>
To: Mary Smith <mary@x.test>, jdoe@example.org, Who? <one@y.test>
Cc: <boss@nil.test>, "Giant; "Big" Box" <sysservices@example.net>
Date: Tue, 1 Jul 2003 10:52:37 +0200
Message-ID: <5678.21-Nov-1997@example.com>

Hi everyone.


Our regular expression does not account for this! While we could tweak it, it turns out that Python already includes an email module to parse e-mail messages.

In [43]:
import email
In [44]:
msg = email.message_from_string(s)

We can access the headers using a dictionary-like syntax:

In [45]:
msg["from"]
Out[45]:
'"Joe Q. Public" <john.q.public@example.com>'
In [46]:
msg["date"]
Out[46]:
'Tue, 1 Jul 2003 10:52:37 +0200'
In [47]:
msg.items()
Out[47]:
[('From', '"Joe Q. Public" <john.q.public@example.com>'),
 ('To', 'Mary Smith <mary@x.test>, jdoe@example.org, Who? <one@y.test>'),
 ('Cc', '<boss@nil.test>, "Giant; "Big" Box" <sysservices@example.net>'),
 ('Date', 'Tue, 1 Jul 2003 10:52:37 +0200'),
 ('Message-ID', '<5678.21-Nov-1997@example.com>')]

Note: The message is not a dictionary. It just provides dictionary-like access to the headers. Notice how the header names are case-insensitive.

Accessing the body of an e-mail is slightly more complicated because we have to account for messages with attachments, etc.

In [48]:
def get_email_body(msg):
    if msg.is_multipart():
        for part in msg.walk():
            if part.get_content_type() == 'text/plain':
                return part.get_payload()
    else:
        return msg.get_payload()
In [49]:
get_email_body(msg)
Out[49]:
'Hi everyone.\n\n'

Parsing the addresses in the From, To, and CC headers can be easily done with email.utils.parseaddr and email.utils.getaddresses.

In [50]:
msg["from"]
Out[50]:
'"Joe Q. Public" <john.q.public@example.com>'
In [51]:
email.utils.parseaddr(msg["from"])
Out[51]:
('Joe Q. Public', 'john.q.public@example.com')
In [52]:
msg["to"]
Out[52]:
'Mary Smith <mary@x.test>, jdoe@example.org, Who? <one@y.test>'
In [53]:
email.utils.getaddresses([msg["to"]])
Out[53]:
[('Mary Smith', 'mary@x.test'),
 ('', 'jdoe@example.org'),
 ('Who?', 'one@y.test')]
In [54]:
def get_all_recipients(msg):
    tos = msg.get_all('to', [])
    ccs = msg.get_all('cc', [])
    return [a for _, a in email.utils.getaddresses(tos + ccs)]
In [55]:
get_all_recipients(msg)
Out[55]:
['mary@x.test',
 'jdoe@example.org',
 'one@y.test',
 'boss@nil.test',
 'sysservices@example.net']

We can now write some code to count all the (sender, recipient) pairs in the Enron email dataset. We use the mailbox module, which allows us to easily iterate over each message in the file.

In [56]:
import mailbox
In [57]:
enron = mailbox.mbox("emails-100k.mbx")
In [58]:
pairs_counts = {}
for msg in enron:
    _, from_addr = email.utils.parseaddr(msg["from"])
    recipients = get_all_recipients(msg)
    for r in recipients:
        pair = (from_addr, r)
        if pair not in pairs_counts:
            pairs_counts[pair] = 0
        pairs_counts[pair] += 1

This code takes longer to run than our previous code (even when using a smaller dataset) because, for every message, we have to parse the entire e-mail message (previously, we were only running a regular expression that targetted the From header exclusively). This is a tradeoff: we can get better performance with a regular expression, but we may be limited by how complete our regular expression is. However, regular expressions are still useful on their own in many cases.

In [59]:
import collections

c = collections.Counter(pairs_counts)
c.most_common(10)
Out[59]:
[(('pete.davis@enron.com', 'pete.davis@enron.com'), 3635),
 (('pete.davis@enron.com', 'ryan.slinger@enron.com'), 1812),
 (('pete.davis@enron.com', 'geir.solberg@enron.com'), 1810),
 (('pete.davis@enron.com', 'mark.guzman@enron.com'), 1810),
 (('pete.davis@enron.com', 'craig.dean@enron.com'), 1664),
 (('pete.davis@enron.com', 'bert.meyers@enron.com'), 1207),
 (('pete.davis@enron.com', 'leaf.harasin@enron.com'), 1207),
 (('pete.davis@enron.com', 'monika.causholli@enron.com'), 1146),
 (('pete.davis@enron.com', 'bill.williams.iii@enron.com'), 1075),
 (('pete.davis@enron.com', 'jbryson@enron.com'), 1075)]