Regular Expressions Introduced - Page 2
July 16, 2001
We will be working with two regular expression operators. The
first is the match operator and the second is the search and
replace operator. A match expression is written between a pair of
forward slashes. The match operator is expressed with an m
placed in front of the pair of forward slashes:
m/expression/
Most programmers do not put the m operator in front of a
regular expression because Perl automatically recognizes a
regular expression when it sees the pair of forward slashes. So
for example, if we wanted to search for all occurrences of my
first name in a file, it might look like the code below.
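my $counter = 0;
while (<>) {
    # Count every occurrence on the current line, in any case
    $counter++ while m/Jonathan/gi;
}
# Print the total once, after all the lines have been read
print "Found Jonathan $counter times\n";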
To have the script read through a file, you would pass the
file to it on the command line with the redirection operator,
the < character:
perl match.pl < index.html
The script will read through each line of the index.html
file searching for my name. When my name is found, it will
increment the $counter variable. After the script
has processed all the lines in the file, it will print the number
of occurrences of the text string Jonathan.
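The counting loop described above can be tried without a file. The following is a minimal sketch, assuming a small hard-coded list of lines (hypothetical sample data standing in for index.html). It uses while rather than if so that every occurrence on a line is counted, not just the first.

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for the lines of index.html.
my @lines = (
    "Written by Jonathan\n",
    "jonathan and JONATHAN on the same line\n",
    "No match on this line\n",
);

my $counter = 0;
for my $line (@lines) {
    # In scalar context, each /g match resumes where the previous one
    # ended, so the increment runs once per occurrence on the line.
    $counter++ while $line =~ /Jonathan/gi;
}

# One match on the first line plus two on the second.
print "Found Jonathan $counter times\n";
```

With this sample data the script reports three occurrences.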
There are two characters after the trailing forward slash in the
regular expression above (g and i).
Those are called modifiers. They have a special meaning
and change the way the regular expressions work. In this case,
the g modifier tells Perl to keep searching for my
name on a line even if it has already found an occurrence of my
name. Also called the global modifier, it keeps searching for as
many matches as it can find in the string. Otherwise, it would
simply stop looking after it found the first occurrence. That
would give us an inaccurate count. The second modifier,
i, tells Perl to ignore the case of the string. So
if there was an occurrence of my name without a capital J, it
would still find a match. Or if someone made my whole name upper-
case, it would still match because the i modifier
had been turned on.
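The effect of the two modifiers can be checked directly. This is a small sketch, assuming a hypothetical sample string. It counts matches by evaluating the match in list context, which is a different counting technique from the loop used in the script above.

```perl
use strict;
use warnings;

# Hypothetical sample string with three spellings of the name.
my $text = "Jonathan met Jonathan and JONATHAN";

# Assigning a list-context match through an empty list "() ="
# yields the number of items the match returned.
my $plain       = () = $text =~ /Jonathan/;    # stops at the first match
my $global      = () = $text =~ /Jonathan/g;   # every exact-case match
my $global_case = () = $text =~ /Jonathan/gi;  # every match in any case

print "no modifiers: $plain\n";
print "g only:       $global\n";
print "g and i:      $global_case\n";
```

For this string the three counts are 1, 2, and 3.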
Perl regular expressions also allow us to use special character
classes that represent word characters, digits, and whitespace.
These character classes are written as a backslash followed by a
character: \w for word characters, \d for digits, and \s for
whitespace. Word characters include the letters a through z in
either case, the digits 0 through 9, and the underscore. These special character
classes can be used as a shorthand for building a regular
expression. For example, sometimes people spell my name Jon
instead of Jonathan. We don't want to miss a nickname when
counting the occurrences of my name, so we should have a regular
expression that catches both. Let's change the line containing
the regular expression in the example above:
$counter++ while /Jon(\w\w\w\w\w)?/gi;
There, now it will match Jon or Jon plus five word characters.
There's something else new here. The character classes are
surrounded by parentheses and there's a question mark after them.
The question mark is called a quantifier because it specifies how
many times the expression before it may occur. The question mark
means zero or one occurrence, which makes the expression
optional. That is, if five word characters follow Jon, they
become part of the match, and if not, Jon still matches by
itself. There are two other common quantifiers. They are
+ and *. The + quantifier will match one or more
instances of the expression and the * quantifier will match zero
or more instances of the expression. A quantifier modifies the
expression to its immediate left. In this case, that would be the
expression inside the parentheses.
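To make the difference between the three quantifiers concrete, here is a short sketch that prints the text each pattern matches for a few test words. The helper matched_text and the test words are hypothetical names introduced only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text a pattern matched in a word,
# or a placeholder when the pattern does not match at all.
sub matched_text {
    my ($word, $pattern) = @_;
    # The outer parentheses capture the whole match in $1.
    return $word =~ /($pattern)/ ? $1 : "(no match)";
}

my @patterns = (
    [ '?', qr/Jon(\w\w\w\w\w)?/ ],  # Jon, optionally followed by five word characters
    [ '+', qr/Jon\w+/ ],            # Jon followed by one or more word characters
    [ '*', qr/Jon\w*/ ],            # Jon followed by zero or more word characters
);

for my $word ("Jon", "Jonathan", "Jonny") {
    for my $entry (@patterns) {
        my ($label, $pattern) = @$entry;
        printf "%-8s %s  =>  %s\n", $word, $label, matched_text($word, $pattern);
    }
}
```

Note that Jon\w+ does not match the bare word Jon, because + requires at least one more word character, while the ? and * versions both do.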
Another way to perform the match using the + quantifier would be:
$counter++ while /Jon(\w+)?/gi;
That would match Jon plus one or more word characters, or just
Jon. We could also have written it a different way using the *
quantifier:
$counter++ while /Jon\w*/gi;
Since the * quantifier was placed directly after the \w character
class, it will match zero or more word characters after Jon. We
don't need the ? quantifier, because the pattern already matches
Jon by itself or Jon plus any number of word characters.
Unfortunately, both versions are less than optimal because they
might match a name other than Jonathan. There is a way to specify
the exact number of characters:
$counter++ while /Jon(\w{5})?/gi;
That will match Jon or Jon plus exactly five word characters. Of
course, that might not work either. There might be a Jonothon,
who is a different person. So to be exact, we actually need to
spell out the characters that may or may not occur:
$counter++ while /Jon(athan)?/gi;
There we go. Now the only text the pattern can match is Jon or
Jonathan. (Nothing in the pattern says where the name must end,
though, so the Jon at the start of a longer word such as Jonothon
is still counted.) Another problem we may need to solve is
alternate spellings of my name. Sometimes people will end my name
with 'on' instead of 'an'. To make sure we match the correct and
incorrect endings, we need to add a custom character class. Above
we used special character classes that are built into Perl, but a
character class can also be a list of characters that you
specify. Custom character classes in a regular expression are
surrounded by square brackets:
$counter++ while /Jon(ath[ao]n)?/gi;
So in the regular expression above, a match will occur if it
finds Jon, Jonathan, or Jonathon. As you can see, regular
expressions are very flexible and powerful. There are additional
features in regular expressions that allow you to match any
expression you can dream up. Later, I'll show you an example of a
very complex regular expression that someone developed for
parsing XML.
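As a quick check of the final pattern from this page, the sketch below prints which part of each test name the pattern matches. The helper matched_part and the list of names are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the portion of a name matched by the
# final pattern from this page, or a placeholder if nothing matches.
sub matched_part {
    my ($name) = @_;
    # The outer parentheses capture the whole match in $1.
    return $name =~ /(Jon(ath[ao]n)?)/i ? $1 : "(no match)";
}

for my $name ("Jon", "Jonathan", "Jonathon", "JONATHON", "Jonothon", "Bob") {
    printf "%-9s matched: %s\n", $name, matched_part($name);
}
```

Jonathan and Jonathon are matched in full, in any case. Jonothon is matched only as far as Jon, because the optional group requires the letters ath after Jon.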
Weaving Magic With Regular Expressions