Regular Expressions

Regular expressions provide a syntax for text processing that is supported by many programming languages and software tools. A regular expression describes a concrete or abstract pattern of characters. In its simplest form, a pattern is a concrete string, like 'abc'. The advantage of regular expressions is that such a string can also be described more abstractly, for example as 'a sequence of three word characters', 'a lower case a followed by some character followed by a lower case c', or 'three characters that are not digits'.
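The three abstract descriptions above can be sketched in Perl as follows. This is only an illustration: the metacharacters \w (word character), . (any character except newline) and \D (non-digit) are standard Perl regular expression syntax but are not introduced in the text above, and the variable names are made up for this example.

```perl
use strict;
use warnings;

# The concrete string from the text
my $string = 'abc';

# 'a sequence of three word characters'
my $three_word_chars = ($string =~ m/\w\w\w/) ? 1 : 0;

# 'a lower case a followed by some character followed by a lower case c'
my $a_any_c = ($string =~ m/a.c/) ? 1 : 0;

# 'three characters that are not digits'
my $three_non_digits = ($string =~ m/\D\D\D/) ? 1 : 0;

print "three word characters: $three_word_chars\n";
print "a, any character, c:   $a_any_c\n";
print "three non-digits:      $three_non_digits\n";
```

All three patterns match 'abc', so each line prints 1. The same patterns would also match other strings, such as 'axc' for the second one, which is what makes them abstract.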

Typical regular expression operations are matching and substituting. In a matching operation, a string is tested for a substring that corresponds to the pattern, that is, whether the string contains the pattern. In a substituting operation, the text matched by a pattern, if found, is replaced by a replacement string.

Pattern Matching

Perl is famous for its regular expression capabilities. So far we have done only simple string comparisons and tested for equality or inequality. Perl can do a lot more than this, like testing strings for concrete and abstract character patterns. In its simplest form, it can test whether a string is contained in another string.

In order to do pattern matching, we need two new operators: a matching operator and a binding operator.

The matching operator looks as follows:

m/pattern/

This operator comes with a number of modifiers (to be placed at the end after the second slash), some of which are discussed below. The m (match) before the first slash is optional but should be included to make the context of the operator clear.

The string or pattern between the slashes does not need to be put in quotation marks. In fact, a quotation mark will be considered part of the pattern, as will punctuation marks and spaces. Variable interpolation is possible.
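A minimal sketch of these points, assuming a made-up sample string: spaces and quotation marks between the slashes are matched literally, and a variable inside the pattern is replaced by its value. The variable names are hypothetical.

```perl
use strict;
use warnings;

my $text = 'say "Hello world" twice';

# The space between the slashes is part of the pattern
my $with_space    = ($text =~ m/Hello world/) ? 1 : 0;
my $without_space = ($text =~ m/Helloworld/)  ? 1 : 0;

# A quotation mark is matched literally, too
my $with_quote = ($text =~ m/"Hello/) ? 1 : 0;

# Variable interpolation: the value of $word is used as the pattern
my $word = 'twice';
my $interpolated = ($text =~ m/$word/) ? 1 : 0;

print "with space:    $with_space\n";
print "without space: $without_space\n";
print "with quote:    $with_quote\n";
print "interpolated:  $interpolated\n";
```

Only the pattern without the space fails to match, because the text contains 'Hello world' and not 'Helloworld'.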

A regular expression without a binding operator matches against the default input variable, $_.

Note: Other Perl functions accept patterns as well, like split().
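The following sketch illustrates both remarks under simple assumptions: a match without a binding operator is tested against $_, and split() takes a pattern as its first argument. The sample data and variable names are invented for this example.

```perl
use strict;
use warnings;

my @hits;
for ('apple pie', 'cherry tart', 'apple cake') {
    # No binding operator: the match is tested against $_
    push @hits, $_ if m/apple/;
}

# split() uses a pattern to decide where to cut the string
my @fields = split(m/,/, 'red,green,blue');

print "hits: ", join(' | ', @hits), "\n";
print "fields: @fields\n";
```

Inside the loop, $_ holds the current list element, so m/apple/ selects the two strings that contain 'apple'.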

The binding operator looks as follows:

=~

or in its negated form

!~

This operator binds a scalar expression to the pattern match. It could be paraphrased as 'contains' or, in its negated form, 'does not contain'. A typical application looks as follows:

if ($myVar =~ m/Hello!/)  { ... }

If the string value of $myVar contains "Hello!" in any position (beginning, end, or anywhere in between), the expression will evaluate to TRUE.

if ($myVar !~ m/Hello!/)  { ... }

would test for the contrary.
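As a small illustration of both binding operators, the sketch below sorts a few sample strings into those that contain "Hello!" and those that do not. The list contents and the array names are assumptions made for this example.

```perl
use strict;
use warnings;

my @greetings = ('Hello! How are you?', 'Well, Hello!', 'hello there');
my (@contains, @lacks);

for my $myVar (@greetings) {
    # =~ is true if the pattern occurs anywhere in the string
    if ($myVar =~ m/Hello!/) { push @contains, $myVar; }

    # !~ is true if the pattern does not occur in the string
    if ($myVar !~ m/Hello!/) { push @lacks, $myVar; }
}

print "contains: ", scalar(@contains), "\n";
print "lacks:    ", scalar(@lacks), "\n";
```

The third string ends up in @lacks: matching is case sensitive by default, and 'hello there' has neither the capital H nor the exclamation mark.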

Example script

Copy the text of the script and run it in Perl. You also have to create a file called "test.txt" in the same directory as the script. This file should contain a series of lines with numbers, some of which should include '11' or '99'. Then change the search patterns and/or run the script with a different file as input.

# findPattern.pl
# program opens a file, reads it, and tests the input
# whether it CONTAINS a string pattern;
# if it does, it prints the line to STDOUT

# first, create a variable that holds the path to and 
# name of source file

$source = "test.txt"; # we assume file is in same dir as script

# second, create a filehandle named 'SOURCE'

open(SOURCE, "<", $source)   # "<" opens the file for reading
     or die "Cannot open $source: $!";   # stop if the file is missing

# get input from SOURCE

while (<SOURCE>) # < > is line input operator
{
     # remove trailing newline char
     chomp($_);

     # move input to var to free $_ 
     $line = $_;    
	    
     # test if $line contains '11' or '99'
     if ($line =~ m/11/ ||  $line =~ m/99/)
     {
          print $line . "\n";
     }
}

print "\n\n";

close(SOURCE);  # for good measure, recommended on Macs

Substitutions

Once a pattern is matched, the matched text can easily be replaced by another string. The syntax is similar to matching: the substitution operator is prefixed by an 's' and consists of a pattern and a replacement string.

s/pattern/replacement/

In the following example, the first occurrence of 'John' in the string contained in $line will be replaced by 'Jane'.

$line =~ s/John/Jane/;

If a pattern operator (matching or substitution) is not bound to a left-hand side expression, it operates on the default input variable, $_.

s/John/Jane/;

is the same as

$_ =~ s/John/Jane/;
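A short sketch of both forms, with invented sample strings. It also uses the return value of s///, which is the number of substitutions made; that detail is standard Perl behavior but is not mentioned in the text above.

```perl
use strict;
use warnings;

# Bound substitution: only the first 'John' is replaced
my $line = 'John met John';
my $changed = ($line =~ s/John/Jane/);

# Unbound substitution: operates on $_, which the loop
# aliases to $default
my $default = 'Dear John';
for ($default) {
    s/John/Jane/;
}

print "$line ($changed substitution)\n";
print "$default\n";
```

After running this, $line holds 'Jane met John' and $default holds 'Dear Jane'.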

Modifiers

The behavior of matching and substitution operators can be changed by adding one or more specified modifiers to the expression:

g
matches every occurrence (otherwise, only the first occurrence will be matched; this is important for substitutions)
$line = "foot's food";
$line =~ s/foo/bar/g;
$line is now 'bart's bard'; without the global modifier it would be 'bart's food'
i
ignores case
$line = "Bart's bard";
$line =~ s/bar/foo/i;
$line is now 'foot's bard': only the first match, 'Bar', is replaced; with both modifiers (gi) it would be 'foot's food'
m
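treats the string as multiple lines: the anchors ^ and $ match at the start and end of each line within the string instead of only at the start and end of the whole string. A pattern without anchors can match anywhere in a multi-line string with or without this modifier.
$line = "<i>The text consists \n of two lines</i>";
$line =~ s/<(\/?)i>/<${1}b>/g;
This substitution changes the italics HTML tags to bold. It needs the g modifier to replace both the opening and the closing tag; the m modifier is not required here because the pattern contains no anchors.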
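The effect of the three modifiers can be compared side by side in a short script. This is a sketch with made-up variable names; the last two cases use the ^ anchor (start of line), which is standard Perl syntax but not introduced in the text above.

```perl
use strict;
use warnings;

# g: replace every occurrence, not only the first
my $global = "foot's food";
$global =~ s/foo/bar/g;

# i: ignore case (only the first match is replaced without g)
my $nocase = "Bart's bard";
$nocase =~ s/bar/foo/i;

# g and i combined
my $both = "Bart's bard";
$both =~ s/bar/foo/gi;

# m: ^ matches at the start of every line in the string
my $multi = "line one\nline two";
$multi =~ s/^line/LINE/gm;

# without m: ^ matches only at the start of the string
my $single = "line one\nline two";
$single =~ s/^line/LINE/g;

print "$global\n$nocase\n$both\n$multi\n$single\n";
```

The last two cases differ only in the m modifier: with it, both lines start with 'LINE'; without it, only the first line is changed.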
