Regular Expressions¶

The objective of this lab is to practice using regular expressions.

A formal definition of regular expressions¶

Given an alphabet Σ (a collection of symbols),

the empty string is a regular expression;
any character, c, in Σ is a regular expression that matches the character c;
if α and β are regular expressions then, αβ is a regular expression that matches any string that marches α concatenated with a string that matches β;
if α and β are regular expressions then, α|β is a regular expression that matches a string that matches α or a string that matches β and
if α is a regular expression then, α* is a regular expression that matches zero or more occurrences of α

To make this concept more concrete, we will first look at a Linux utility that uses regular expressions to match lines in a file or files and then see how these same concepts are expressed in Python.

Grep¶

We are going to start our exploration of regular expressions using egrep, which is part of the grep family of Linux utilities (grep, egrep, fgrep). These programs, which are invoked on the command-line, are used to determine which lines in a file (or files) match a pattern that is specified using a regular expression. We will be using egrep, because its regular expression syntax is similar to Python’s regular expression syntax.

We will be using the file test0.txt in our examples. Before moving on, please open the file in your favorite editor (or use cat test0.txt from the Linux command-line) and take a quick look at it. You’ll see each line contains a string over the alphabet {a,b,c,d,1,2,3,4}.

We will start with the simplest type of pattern: a sequence of characters that should be matched exactly. For example, to match the pattern “aabb” in the file “test0.txt” run the command:

--input--
egrep 'aabb' test0.txt
--output--
ccccccccccccaabb
aaaaaaaaabb

Notice that a line is said to match a pattern if any part of the line matches the pattern.

Finding exact matches is not very interesting, so we will generalize our pattern language. The first generalization is to add a way to match any single character. The standard symbol for this purpose is the character ”.” Try this example:

--input--
egrep 'aa.bb' test0.txt
 --output--
aacbb            #aacbb matches aa.bb
aacbbbaaaaaaaaaa
aaaaaaaaabb      #aaabb matches aa.bb

As noted above, the “*” character matches to zero of more repetitions of the previous pattern. Try the pattern a*:

--input--
grep "a*" test0.txt
--output--
aacbb
abbb
aaab
ccccccccccccaabb
aacbbbaaaaaaaaaa
abbccccccccaaaaaaa
a
b
aa
bb
aaaaaaaaabb
ddddddddddddd
ddddd\dddddddd
ddd4ddd
123

Notice that grep returns every line in the file, because every line has zero or more a’s. Here’s a more interesting example:

--input--
egrep  'aa.*bb' test0.txt
--output
aacbb
ccccccccccccaabb
aacbbbaaaaaaaaaa
aaaaaaaaabb

If we want to match a pattern only at the beginning of the line, we can use the ‘^’ symbol at the start of the pattern. To match the end of the line, we can use the ‘$’ symbol at the end the pattern. To get every line that begins with a ‘c’ try:

--input--
egrep '^c' test0.txt
--output--
ccccccccccccaabb

To get every line that ends with a ‘b’ try:

--input--
egrep 'b$' test0.txt
--output--
aacbb
abbb
aaab
ccccccccccccaabb
b
bb
aaaaaaaaabb

As is typical, special characters need to be escaped. For instance, tab == \t. To match a ”.”, use “\.” and to match ‘\’ try the pattern ‘\\’. For example:

--input--
egrep '\\' test0.txt
--output--
ddddd\dddddddd

A list of characters enclosed in square brackets means that we want to match one of the specified characters exactly once. For example [a-zA-Z] matches a single lower or upper case letter and [ \t] matches a space or a tab. To look for every line that has either a ‘c’ or a ‘4’ try the pattern [c4]:

--input--
egrep '[c4]' test0.txt
--output--
aacbb
ccccccccccccaabb
aacbbbaaaaaaaaaa
abbccccccccaaaaaaa
ddd4ddd

Try the pattern [0-9] to find every line with a number.

--input--
egrep '[0-9]' test0.txt
--output--
ddd4ddd
123

The square bracket notation allows us to specify a limit range of choices for matching a single character. The ‘|’ notation allows us to specifiy a set of choices for more substantial patterns. For example,

--input--
egrep "Planck|Bohr" chant.txt
--output--
Gimme Plancks constant H
Gimme the Bohr radius A

The ‘|’ operator has very low precedence. Also, spaces immediately before or after the ‘|’ operator are interpreted as part of the desired pattern. For example, the pattern “Planck | Bohr” matches different lines than the pattern “Plack|Bohr”.

--input--
egrep "Planck | Bohr" chant.txt
--output--
Gimme the Bohr radius A

Exercises¶

Use egrep and the file chant.txt for the following exercises:

Find the lines that have the word “light”.
Find the lines that start with ‘G’, and then find lines that start with any letter except ‘G’.
Find lines that end in an ‘O’ or an ‘I’.

Use egrep and the file Chicago.txt for the following exercises:

Find potential names. Two consecutive words with upper-cased first letters.
Find all dates like June 14, 1998 or August 1, 2001
Find all city/state pairs like Chicago, IL

Regular Expressions in Python¶

Regular expressions in Python are similar. Here is a look at some patterns Python can specify.

Patterns¶

As with grep, patterns can include specific sequences of characters to match, periods (match any character), or sets of characters or ranges of characters within square brackets (match any one character from the specified set). In addition to these, Python has basic patterns, repetitions, and alternations.

Basic Patterns

Here are is a useful subset of the basic patterns provided by Python.

Pattern: \d, which is equivalent to [0-9] and -matches a single digit.
Pattern: \w, which is equivalent to [a-zA-Z0-9_] and matches a single character that is a letter, a number, or an underscore.
Pattern: \s, which matches a single white space character (space, tab, etc).

Examples

A 10-digit phone number \d\d\d-\d\d\d-\d\d\d\d

Repetition

If β is a regular expression then recall that β* is a regular expression that matches 0 or more occurrences of β. Additionally, β+ is a regular expression that matches one or more occurrences of β. β+ is shorthand for ββ*. Additionally,

Pattern:β{n} will match to exactly n consecutive occurrences of β
Pattern:β{n1,n2} will match from n1 to n2 consecutive occurrences of β.

Examples

A 10-digit phone number `` d{3}-d{3}-d{4}``
State Abbreviation `` [A-Z]{2}``
IL License Plate `` [A-Z0-9]{1,6}``

Alternations

Alternations are a way to specify a choice: match this or match that.

Pattern: `` α|β``. This pattern matches the pattern α or the pattern β.
Patterns: `` β?`` This pattern matches zero or one occurrences of the pattern β

Example:

For example, Given the string, “henry@yahoo.com meg4@gmail.com lamont@gmail.com”, the regular expression [a-zA-Z]+[0-9]@gmail will match the substring “meg4@gmail.com”. In contrast, the regular expression, [a-zA-Z]+[0-9]?@gmail.com, which match the substrings “meg4@gmail.com” and “lamont@gmail.com”.

Search¶

For using regular expressions in Python, the first step is to import regular expressions using the statement:

import re

We will focus on two functions from the re library: re.search() and re.findall(). re.search(pattern,string) finds the very first instance of the pattern in the string. This expression returns a match object, if there is a match and None otherwise. If there is a match, the string that matched the pattern can be obtained by calling match.group(), where match is the result of calling re.search. Generally, it is common to use:

match=re.search(pattern,string)
if match:
    print match.group()

This example looks for an email address in the phrase. Notice that only the first email address is listed.

Email address example

--input--
email="python peanut@yahoo.com java  panda@gmail.com Thanksgiving henry@uchicago.edu peanutbutter"
match=re.search('[\w.-]+@[\w.-]+',email)
print match.group()
--output--
peanut@yahoo.com

To find every instance of a pattern in a string use the findall function, which returns a list.

Examples:

--input--
email="python peanut@yahoo.com java  panda@gmail.com Thanksgiving henry@uchicago.edu peanutbutter"
matches=re.findall('[\w.-]+@[\w.-]+',email)
print matches
--output--
['peanut@yahoo.com', 'panda@gmail.com', 'henry@uchicago.edu']

Exercises¶

Use ipython3 for the following exercises:

Find the pattern that locates the word in the email string (above) that starts with T.
Add the numbers in the string “1pe2anu3tbu5ttera7nd0je9lly”

Reading a File in Python¶

Here are some strategies for finding patterns in a file.

Read in a file, save it to a list of strings, and then close the file.

f=open(filename,'r')
text=f.read()
f.close

Then use regular expressions normally on the string text.

Code Analysis¶

Regular expressions can be used to analyze code in a large program. Examine file “play.py”

Find the names of all Python function definitions in the file play.py.
Find the names of all the nested functions in the file play.py.

Groups and Tuples¶

Matches can be grouped according to the pattern. The group function allows the whole match to be separated into groups.

match=re.search("(pattern1)(pattern2)", string)
if match:
    print match.group(1) # prints the string that matches pattern 1
    print match.group(2) #  ' '  matches pattern 2

Example

--input--
match=re.search('\w+)(\s)(\d.+)', "Apple 3.99")
print (match.group(1 ))
print (match.group(3))
    --output--
    Apple       # group 1 is (\w+)
    3.99        # group 3 is (\d.+)

The findall function returns a list of tuples that match the pattern.

tuples=re.findall('(pattern1)(pattern2)',string)
for t in tuples:
    print(t[0], t[1])  # prints pattern1, pattern2

Exercise¶

The file emails.txt contains email addresses for a set of fictional students. Write Python code to create a list of usernames associated with uchicago accounts in the file.

Step 1: read in the file.
Step 2: create a Python regular expression to identity the addresses with a uchicago domain and to extract the associated username.

Exercise¶

Generate an index of Python functions and their associated source files. Use all the Python files in the distribution as example files.

Write a function that takes the name of a Python source file and outputs a list of the names of the top-level functions defined in that file.
Using the previous function, write a function that takes a list of filenames for Python programs and outputs a list of (function name, source file) tuples.
Sort the resulting list by function name.

Patterns in Files¶

The files Macbeth.txt, RJ.txt, and errors.txt are Shakespeare plays provided from the Gutenberg Project. Examine these plays for common phrases and comparisons.

Count the number of times Romeo and Juliet’s names are used in the play RJ.txt.
Create a function that takes a prefix and a file and then creates a list of words from that file that follow the prefix. Common phrases include : “Thou art ” , “thou shalt ”, and “If thou ”.
Note: Try limiting the sizes of the words that follow the prefix that match the pattern, so to exclude short words like “the” and “as”. Define your pattern to ignore the case of the first letter in the prefix.
Also popular are comparisons like “As (1) as (2)”. Create a dictionary for these comparisons. Mapping the value 1 as the key to the value 2. If something is compared multiple times, add the word to the value in the dictionary without replacing the existing value.
```
As pretty as gentle        dictionary[pretty]=gentle
As pretty as pale          dictionary[pretty]=gentle,pale
```