How the Engine Works - Page 12
March 9, 2001
We've now seen most of the syntax behind regular expression
matching and plenty of examples of it in action. The code that
does all the matching is called perl's 'regular expression
engine'. You might now be wondering about the exact rules applied
by this engine when determining whether or not a piece of text
matches. And how much of it matches what. From what our examples
have shown us, let us make some deductions about the engine's
operation.
Our first expression, ([a-z]+) plucked out a
set of one-or-more lower-case letters. The first such set that
perl came across was 'silly'. The next character
after 'y' was a space, and so no longer matched the
expression.
Rule one: Once the engine starts matching, it will keep
matching a character at a time for as long as it can. Once it
sees something that doesn't match, however, it has to stop. In
this example, it can never get beyond a character that is not a
lower case letter. It has to stop as soon as it encounters one.
Next, we looked for a series of word characters, using
(\w+). The engine started looking at the beginning
of the string and found one, '1'. The next character was not a
word character (it was a colon), and so the engine had to stop.
Rule two: Unlike me, the engine is eager. It's eager
to start work and eager to finish, and it starts matching as soon
as possible in the string; if the first character doesn't match,
try and start matching from the second. Then take every
opportunity to finish as quickly as possible.
Then we tried this: ([a-z]+)(.*)([a-z]+). The result
we got with this was a little strange. Let's look at it again:
> perl matchtest2.plx
Enter a regular expression: ([a-z]+)(.*)([a-z]+)
The text matches the pattern '([a-z]+)(.*)([a-z]+)'.
$1 is 'silly'
$2 is ' sentence (495,a) *BUT* one which will be usefu'
$3 is 'l'
>
Our first group was the same as what matched before - nothing new
there. When we could no longer match lower case letters, we
switched to matching anything we could. Now, this could take up
the rest of the string, but that wouldn't allow a match for the
third group. We have to leave at least one lower-case letter.
So, the engine started to reverse back along the string, giving
characters up one by one. It gave up the closing bracket, the 3,
then the opening bracket, and so on, until we got to the first
thing that would satisfy all the groups and let the match go
ahead - namely a lower-case letter: the 'l' at the end of
'useful'.
From this, we can draw up the third rule:
Rule three: Like me, in this case, the engine is greedy. If
you use the + or * operators, they will
try and steal as much of the string as possible. If the rest of
the expression does not match, it grudgingly gives up a character
at a time and tries to match again, in order to find the fullest
possible match.
We can turn a greedy match into a non-greedy match by putting the
? operator after either the plus or star. For
instance, let's turn this example into a non-greedy version:
([a-z]+)(.*?)([a-z]+). This gives us an entirely
different result:
> perl matchtest2.plx
Enter a regular expression: ([a-z]+)(.*?)([a-z]+)
The text matches the pattern '([a-z]+)(.*?)([a-z]+)'.
$1 is 'silly'
$2 is ' '
$3 is 'sentence'
>
Now we've shut off rule three, rule two takes over. The smallest
possible match for the second group was a single space. First, it
tried to get nothing at all, but then the third group would be
faced with a space. This wouldn't match. So, we grudgingly accept
the space and try and finish again. This time the third group has
some lower case letters, and that can match as well.
What if we turn off greediness in all three groups, and say this:
([a-z]+?)(.*?)([a-z]+?)
> perl matchtest2.plx
Enter a regular expression: ([a-z]+?)(.*?)([a-z]+?)
The text matches the pattern '([a-z]+?)(.*?)([a-z]+?)'.
$1 is 's'
$2 is ''
$3 is 'i'
>
What about this? Well, the smallest possible match for the first
group is the 's' of silly. We asked it to find one character or
more, and so the smallest it could find was one. The second group
actually matched no characters at all. This left the third group
facing an 'i', which it took to complete the match.
Our last example included an alternation:
> perl matchtest2.plx
Enter a regular expression: e(\w|n\w+)
The text matches the pattern 'e(\w|n\w+)'.
$1 is 'n'
>
The engine took the first branch of the alternation and matched a
single character, even though the second branch would actually
satisfy greed. This leads us onto the fourth rule:
Rule four: Again like me, the regular expression engine
hates decisions . If there are two branches, it will
always choose the first one, even though the second one might
allow it to gain a longer match.
To summarize:
The regular expression engine starts as soon as it
can, grabs as much as it can, then tries to finish as soon as it
can, while taking the first decision available to
it.
Try It Out - Page 11
Beginning Perl
Working with RegExps - Page 13
|