Regular Expressions¶
The objective of this lab is to practice using regular expressions.
Please run git pull
on the upstream repository to pick up the necessary files for this lab. This time around not all solutions are provided but majority of them are provided. Please ask on Ed if you are having difficulty with the exercises that do not have a solution.
A formal definition of regular expressions¶
Given an alphabet Σ (a collection of symbols),
the empty string is a regular expression;
any character, c, in Σ is a regular expression that matches the character c;
if α and β are regular expressions then, αβ is a regular expression that matches any string that matches α concatenated with a string that matches β;
if α and β are regular expressions then, α|β is a regular expression that matches a string that matches α or a string that matches β, and
if α is a regular expression then, α* is a regular expression that matches zero or more occurrences of α.
To make this concept more concrete, we will first look at a Linux utility that uses regular expressions to match lines in a file or files and then see how these same concepts are expressed in Python.
Grep¶
We are going to start our exploration of regular expressions using egrep, which is part of the grep family of Linux utilities (grep, egrep, fgrep). These programs, which are invoked on the command-line, are used to determine which lines in a file (or files) match a pattern that is specified using a regular expression. We will be using egrep, because its regular expression syntax is similar to Python’s regular expression syntax.
We will be using the file test0.txt
in our examples. Before moving
on, please open the file in your favorite editor (or use cat
test0.txt
from the Linux command-line) and take a quick look at
it. You’ll see each line contains a string over the alphabet
{a,b,c,d,1,2,3,4}.
We will start with the simplest type of pattern: a sequence of
characters that should be matched exactly. For example, to match the
pattern 'aabb'
in the file "test0.txt"
run the command:
--input--
egrep 'aabb' test0.txt
--output--
ccccccccccccaabb
aaaaaaaaabb
Notice that a line is said to match a pattern if any part of the line matches the pattern.
Finding exact matches is not very interesting, so we will generalize our
pattern language. The first generalization is to add a way to match any
single character. The standard symbol for this purpose is the character
.
Try this example:
--input--
egrep 'aa.bb' test0.txt
--output--
aacbb #aacbb matches aa.bb
aacbbbaaaaaaaaaa
aaaaaaaaabb #aaabb matches aa.bb
As noted above, the *
character matches to zero of more repetitions
of the previous pattern. Try the pattern a*
:
--input--
grep "a*" test0.txt
--output--
aacbb
abbb
aaab
ccccccccccccaabb
aacbbbaaaaaaaaaa
abbccccccccaaaaaaa
a
b
aa
bb
aaaaaaaaabb
ddddddddddddd
ddddd\dddddddd
ddd4ddd
123
1.23
Notice that grep returns every line in the file, because every line has zero or more a’s. Here’s a more interesting example:
--input--
egrep 'aa.*bb' test0.txt
--output
aacbb
ccccccccccccaabb
aacbbbaaaaaaaaaa
aaaaaaaaabb
If we want to match a pattern only at the beginning of the line, we can
use the ^
symbol at the start of the pattern. To match the end of the
line, we can use the $
symbol at the end the pattern. To get every
line that begins with a c
try:
--input--
egrep '^c' test0.txt
--output--
ccccccccccccaabb
To get every line that ends with a b
try:
--input--
egrep 'b$' test0.txt
--output--
aacbb
abbb
aaab
ccccccccccccaabb
b
bb
aaaaaaaaabb
As is typical, special characters need to be escaped. For instance,
use \t
for tab. To match a period, use \.
and to match a
backslash (\
) try the pattern \\
. For example:
--input--
egrep '\.' test0.txt
--output--
1.23
--input--
egrep '\\' test0.txt
--output--
ddddd\dddddddd
A list of characters enclosed in square brackets means that we want to
match one of the specified characters exactly once. For example
[a-zA-Z]
matches a single lower or upper case letter. To look for
every line that has either a c
or a 4
try the pattern '[c4]'
:
--input--
egrep '[c4]' test0.txt
--output--
aacbb
ccccccccccccaabb
aacbbbaaaaaaaaaa
abbccccccccaaaaaaa
ddd4ddd
Try the pattern '[0-9]'
to find every line with a number.
--input--
egrep '[0-9]' test0.txt
--output--
ddd4ddd
123
Adding a ^
as the first character inside the square brackets
indicates that the pattern matches the complement of the listed
characters (i.e., the pattern matches every character except those
listed). For example, '[^0-9']
matches every character except a
digit.
The square bracket notation allows us to specify a limit range of choices for matching a single character. The ‘|’ notation allows us to specify a set of choices for more substantial patterns. For example,
--input--
egrep 'Planck|Bohr' chant.txt
--output--
Gimme Plancks constant H
Gimme the Bohr radius A
The |
operator has very low precedence. Also, spaces immediately
before or after the |
operator are interpreted as part of the desired
pattern. For example, the pattern 'Planck | Bohr'
matches different
lines than the pattern 'Plack|Bohr'
.
--input--
egrep 'Planck | Bohr' chant.txt
--output--
Gimme the Bohr radius A
Exercises¶
Use egrep and the file chant.txt for the following exercises:
Find the lines that have the word light or radius.
Find the lines that start with G, and then find lines that start with any letter except G.
Find lines that end in an O or an I.
Use egrep and the file Chicago.txt for the following exercises:
Find all dates like June 14, 1998 or August 1, 2001
Find all city/state pairs like Chicago, IL. Assume that city names start with a capital letter and are a single word and that all pairs of capital letters count as state abbreviations. So, your pattern should match
"Farmville, ZZ"
:
Regular Expressions in Python¶
Regular expressions in Python are similar. Here is a look at some patterns Python can specify.
Patterns¶
As with grep, patterns can include specific sequences of characters to match, periods (match any character), or sets of characters or ranges of characters within square brackets (match any one character from the specified set). In addition to these, Python has basic patterns, repetitions, and alternations.
Basic Patterns
Here are is a useful subset of the basic patterns provided by Python.
Pattern:
\d
, which is equivalent to[0-9]
and -matches a single digit.Pattern:
\w
, which is equivalent to[a-zA-Z0-9_]
and matches a single character that is a letter, a number, or an underscore.Pattern:
\s
, which matches a single white space character (space, tab, etc).
Examples
A 10-digit phone number
\d\d\d-\d\d\d-\d\d\d\d
Repetition
If β is a regular expression then recall that β* is a regular expression that matches 0 or more occurrences of β. Additionally, β+ is a regular expression that matches one or more occurrences of β. β+ is shorthand for ββ*. Additionally,
Pattern:
β{n}
will match to exactly n consecutive occurrences of βPattern:
β{n1,n2}
will match from n1 to n2 consecutive occurrences of β.
Examples
A 10-digit phone number
\d{3}-\d{3}-\d{4}
State Abbreviation
[A-Z]{2}
IL License Plate
[A-Z0-9]{1,6}
Alternations
Alternations are a way to specify a choice: match this or match that.
Pattern: `` α|β``. This pattern matches the pattern α or the pattern β.
Patterns: `` β?`` This pattern matches zero or one occurrences of the pattern β
Example:
For example, Given the string, “henry@yahoo.com meg4@gmail.com lamont@gmail.com”, the regular expression
[a-zA-Z]+[0-9]@gmail
will match the substring “meg4@gmail.com”. In contrast, the regular expression,[a-zA-Z]+[0-9]?@gmail.com
, which match the substrings “meg4@gmail.com” and “lamont@gmail.com”.
Search¶
For using regular expressions in Python, the first step is to import regular expressions using the statement:
import re
We will focus on two functions from the re library: re.search() and
re.findall(). re.search(pattern, string)
finds the very first
instance of the pattern in the string. This expression returns a match
object, if there is a match, and None
otherwise. If there is a match,
the string that matched the pattern is obtained by calling
match.group()
, where match
is the result of calling
re.search
. Generally, it is common to use:
match=re.search(pattern, string)
if match:
print(match.group())
This example looks for an email address in the phrase. Notice that only the first email address is listed.
Email address example
--input--
email="python peanut@yahoo.com java panda@gmail.com Thanksgiving henry@uchicago.edu peanutbutter"
match=re.search('[\w.-]+@[\w.-]+',email)
print(match.group())
--output--
peanut@yahoo.com
To find every instance of a pattern in a string use the findall
function, which returns a list.
Examples:
--input--
email="python peanut@yahoo.com java panda@gmail.com Thanksgiving henry@uchicago.edu peanutbutter"
matches=re.findall('[\w.-]+@[\w.-]+',email)
print(matches)
--output--
['peanut@yahoo.com', 'panda@gmail.com', 'henry@uchicago.edu']
Exercises¶
Use ipython3 for the following exercises:
Find the pattern that locates the word in the
email
string (above) that starts with T.Compute the sum of the numbers in the string “1pe2anu3tbu5ttera7nd0je9lly”
Reading a File in Python¶
Here are some strategies for finding patterns in a file.
Read in a file, save it to a list of strings, and then close the file.
with open(filename) as f:
text=f.read()
Then use regular expressions normally on the string text
.
Code Analysis¶
Regular expressions can be used to analyze code in a large program. Examine file “play.py”
(In-class) Find the names of all Python function definitions in the file
play.py
.Find the names of all the nested functions in the file
play.py
.
Groups and Tuples¶
Matches can be grouped according to the pattern. The group function allows the whole match to be separated into groups.
match=re.search("(pattern1)(pattern2)", string)
if match:
print(match.group(1)) # prints the string that matches pattern 1
print(match.group(2)) # ' ' matches pattern 2
Example
--input--
match=re.search('\w+)(\s)(\d.+)', "Apple 3.99")
print(match.group(1 ))
print(match.group(3))
--output--
Apple # group 1 is (\w+)
3.99 # group 3 is (\d.+)
The findall function returns a list of tuples that match the pattern.
tuples=re.findall('(pattern1)(pattern2)', string)
for t in tuples:
print(t[0], t[1]) # prints pattern1, pattern2
Exercise¶
The file emails.txt
contains email addresses for a set of fictional
students. Write Python code to create a list of usernames associated
with uchicago accounts in the file.
Step 1: read in the file.
Step 2: create a Python regular expression to identity the addresses with a uchicago domain and to extract the associated username.
Exercise¶
Generate an index of Python functions and their associated source files. Use all the Python files in the distribution as example files.
All three parts below.
Write a function that takes the name of a Python source file and outputs a list of the names of the top-level functions defined in that file.
Using the previous function, write a function that takes a list of filenames for Python programs and outputs a list of (function name, source file) tuples.
Sort the resulting list by function name.
Patterns in Files¶
The files Macbeth.txt
, RJ.txt
, and errors.txt
are Shakespeare plays
provided from the Gutenberg Project. Examine these plays for common
phrases and comparisons.
Count the number of times Romeo and Juliet’s names are used in the play
RJ.txt
.Create a function that takes a prefix and a file and then creates a list of words from that file that follow the prefix. Common phrases include : “Thou art ” , “thou shalt “, and “If thou “.
Note: Try limiting the sizes of the words that follow the prefix that match the pattern, so to exclude short words like “the” and “as”. Define your pattern to ignore the case of the first letter in the prefix.
Also popular are comparisons like “As (1) as (2)”. Create a dictionary for these comparisons. Mapping the value 1 as the key to the value 2. If something is compared multiple times, add the word to the value in the dictionary without replacing the existing value. For example, given this text:
As pretty as gentle As pretty as pale
the resulting dictionary should match
"pretty"
to["gentle", "pale"]
.