Regular Expressions

The objective of this lab is to practice using regular expressions.

A formal definition of regular expressions

Given an alphabet Σ (a collection of symbols),

  • the empty string is a regular expression;
  • any character, c, in Σ is a regular expression that matches the character c;
  • if α and β are regular expressions then, αβ is a regular expression that matches any string that marches α concatenated with a string that matches β;
  • if α and β are regular expressions then, α|β is a regular expression that matches a string that matches α or a string that matches β and
  • if α is a regular expression then, α* is a regular expression that matches zero or more occurrences of α

To make this concept more concrete, we will first look at a Linux utility that uses regular expressions to match lines in a file or files and then see how these same concepts are expressed in Python.

Grep

We are going to start our exploration of regular expressions using egrep, which is part of the grep family of Linux utilities (grep, egrep, fgrep). These programs, which are invoked on the command-line, are used to determine which lines in a file (or files) match a pattern that is specified using a regular expression. We will be using egrep, because its regular expression syntax is similar to Python’s regular expression syntax.

We will be using the file test0.txt in our examples. Before moving on, please open the file in your favorite editor (or use cat test0.txt from the Linux command-line) and take a quick look at it. You’ll see each line contains a string over the alphabet {a,b,c,d,1,2,3,4}.

We will start with the simplest type of pattern: a sequence of characters that should be matched exactly. For example, to match the pattern “aabb” in the file “test0.txt” run the command:

--input--
egrep 'aabb' test0.txt
--output--
ccccccccccccaabb
aaaaaaaaabb

Notice that a line is said to match a pattern if any part of the line matches the pattern.

Finding exact matches is not very interesting, so we will generalize our pattern language. The first generalization is to add a way to match any single character. The standard symbol for this purpose is the character ”.” Try this example:

--input--
egrep 'aa.bb' test0.txt
 --output--
aacbb            #aacbb matches aa.bb
aacbbbaaaaaaaaaa
aaaaaaaaabb      #aaabb matches aa.bb

As noted above, the “*” character matches to zero of more repetitions of the previous pattern. Try the pattern a*:

--input--
grep "a*" test0.txt
--output--
aacbb
abbb
aaab
ccccccccccccaabb
aacbbbaaaaaaaaaa
abbccccccccaaaaaaa
a
b
aa
bb
aaaaaaaaabb
ddddddddddddd
ddddd\dddddddd
ddd4ddd
123

Notice that grep returns every line in the file, because every line has zero or more a’s. Here’s a more interesting example:

--input--
egrep  'aa.*bb' test0.txt
--output
aacbb
ccccccccccccaabb
aacbbbaaaaaaaaaa
aaaaaaaaabb

If we want to match a pattern only at the beginning of the line, we can use the ‘^’ symbol at the start of the pattern. To match the end of the line, we can use the ‘$’ symbol at the end the pattern. To get every line that begins with a ‘c’ try:

--input--
egrep '^c' test0.txt
--output--
ccccccccccccaabb

To get every line that ends with a ‘b’ try:

--input--
egrep 'b$' test0.txt
--output--
aacbb
abbb
aaab
ccccccccccccaabb
b
bb
aaaaaaaaabb

As is typical, special characters need to be escaped. For instance, tab == \t. To match a ”.”, use “\.” and to match ‘\’ try the pattern ‘\\’. For example:

--input--
egrep '\\' test0.txt
--output--
ddddd\dddddddd

A list of characters enclosed in square brackets means that we want to match one of the specified characters exactly once. For example [a-zA-Z] matches a single lower or upper case letter and [ \t] matches a space or a tab. To look for every line that has either a ‘c’ or a ‘4’ try the pattern [c4]:

--input--
egrep '[c4]' test0.txt
--output--
aacbb
ccccccccccccaabb
aacbbbaaaaaaaaaa
abbccccccccaaaaaaa
ddd4ddd

Try the pattern [0-9] to find every line with a number.

--input--
egrep '[0-9]' test0.txt
--output--
ddd4ddd
123

The square bracket notation allows us to specify a limit range of choices for matching a single character. The ‘|’ notation allows us to specifiy a set of choices for more substantial patterns. For example,

--input--
egrep "Planck|Bohr" chant.txt
--output--
Gimme Plancks constant H
Gimme the Bohr radius A

The ‘|’ operator has very low precedence. Also, spaces immediately before or after the ‘|’ operator are interpreted as part of the desired pattern. For example, the pattern “Planck | Bohr” matches different lines than the pattern “Plack|Bohr”.

--input--
egrep "Planck | Bohr" chant.txt
--output--
Gimme the Bohr radius A

Exercises

Use egrep and the file chant.txt for the following exercises:

  • Find the lines that have the word “light”.
  • Find the lines that start with ‘G’, and then find lines that start with any letter except ‘G’.
  • Find lines that end in an ‘O’ or an ‘I’.

Use egrep and the file Chicago.txt for the following exercises:

  • Find potential names. Two consecutive words with upper-cased first letters.
  • Find all dates like June 14, 1998 or August 1, 2001
  • Find all city/state pairs like Chicago, IL

Regular Expressions in Python

Regular expressions in Python are similar. Here is a look at some patterns Python can specify.

Patterns

As with grep, patterns can include specific sequences of characters to match, periods (match any character), or sets of characters or ranges of characters within square brackets (match any one character from the specified set). In addition to these, Python has basic patterns, repetitions, and alternations.

Basic Patterns

Here are is a useful subset of the basic patterns provided by Python.

  • Pattern: \d, which is equivalent to [0-9] and -matches a single digit.
  • Pattern: \w, which is equivalent to [a-zA-Z0-9_] and matches a single character that is a letter, a number, or an underscore.
  • Pattern: \s, which matches a single white space character (space, tab, etc).

Examples

  • A 10-digit phone number \d\d\d-\d\d\d-\d\d\d\d

Repetition

If β is a regular expression then recall that β* is a regular expression that matches 0 or more occurrences of β. Additionally, β+ is a regular expression that matches one or more occurrences of β. β+ is shorthand for ββ*. Additionally,

  • Pattern:β{n} will match to exactly n consecutive occurrences of β
  • Pattern:β{n1,n2} will match from n1 to n2 consecutive occurrences of β.

Examples

  • A 10-digit phone number `` d{3}-d{3}-d{4}``
  • State Abbreviation `` [A-Z]{2}``
  • IL License Plate `` [A-Z0-9]{1,6}``

Alternations

Alternations are a way to specify a choice: match this or match that.

  • Pattern: `` α|β``. This pattern matches the pattern α or the pattern β.
  • Patterns: `` β?`` This pattern matches zero or one occurrences of the pattern β

Example:

Reading a File in Python

Here are some strategies for finding patterns in a file.

Read in a file, save it to a list of strings, and then close the file.

f=open(filename,'r')
text=f.read()
f.close

Then use regular expressions normally on the string text.

Code Analysis

Regular expressions can be used to analyze code in a large program. Examine file “play.py”

  • Find the names of all Python function definitions in the file play.py.
  • Find the names of all the nested functions in the file play.py.

Groups and Tuples

Matches can be grouped according to the pattern. The group function allows the whole match to be separated into groups.

match=re.search("(pattern1)(pattern2)", string)
if match:
    print match.group(1) # prints the string that matches pattern 1
    print match.group(2) #  ' '  matches pattern 2

Example

--input--
match=re.search('\w+)(\s)(\d.+)', "Apple 3.99")
print (match.group(1 ))
print (match.group(3))
    --output--
    Apple       # group 1 is (\w+)
    3.99        # group 3 is (\d.+)

The findall function returns a list of tuples that match the pattern.

tuples=re.findall('(pattern1)(pattern2)',string)
for t in tuples:
    print(t[0], t[1])  # prints pattern1, pattern2

Exercise

The file emails.txt contains email addresses for a set of fictional students. Write Python code to create a list of usernames associated with uchicago accounts in the file.

  • Step 1: read in the file.
  • Step 2: create a Python regular expression to identity the addresses with a uchicago domain and to extract the associated username.

Exercise

Generate an index of Python functions and their associated source files. Use all the Python files in the distribution as example files.

  • Write a function that takes the name of a Python source file and outputs a list of the names of the top-level functions defined in that file.
  • Using the previous function, write a function that takes a list of filenames for Python programs and outputs a list of (function name, source file) tuples.
  • Sort the resulting list by function name.

Patterns in Files

The files Macbeth.txt, RJ.txt, and errors.txt are Shakespeare plays provided from the Gutenberg Project. Examine these plays for common phrases and comparisons.

  • Count the number of times Romeo and Juliet’s names are used in the play RJ.txt.

  • Create a function that takes a prefix and a file and then creates a list of words from that file that follow the prefix. Common phrases include : “Thou art ” , “thou shalt ”, and “If thou ”.

  • Note: Try limiting the sizes of the words that follow the prefix that match the pattern, so to exclude short words like “the” and “as”. Define your pattern to ignore the case of the first letter in the prefix.

  • Also popular are comparisons like “As (1) as (2)”. Create a dictionary for these comparisons. Mapping the value 1 as the key to the value 2. If something is compared multiple times, add the word to the value in the dictionary without replacing the existing value.

    As pretty as gentle        dictionary[pretty]=gentle
    As pretty as pale          dictionary[pretty]=gentle,pale