Repetition - Page 9
March 9, 2001
We've now moved from matching a specific character to a more
general type of character - when we don't know (or don't care)
exactly what the character will be. Now we're going to see what
happens when we want to talk about a more general quantity of
characters: more than three digits in a row; two to four capital
letters, and so on. The metacharacters that we use to deal with a
number of characters in a row are called quantifiers.
Indefinite Repetition
The easiest of these is the question mark. It should suggest
uncertainty - something may be there, or it may not. That's
exactly what it does: stating that the immediately preceding
character(s) - or metacharacter(s) - may appear once, or not at
all. It's a good way of saying that a particular character or
group is optional. To match either 'he' or 'she', you can put:
> perl matchtest.plx
Enter some text to find: \bs?he\b
The text matches the pattern '\bs?he\b'.
>
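As a minimal sketch of the same pattern inside a script, the snippet below wraps it in a helper. The sub name find_pronoun and the sample strings are assumptions made for illustration, not part of matchtest.plx.

```perl
use strict;
use warnings;

# Return the first substring matched by \bs?he\b, or undef if there is none.
# (find_pronoun is a hypothetical helper name.)
sub find_pronoun {
    my ($text) = @_;
    return $text =~ /\b(s?he)\b/ ? $1 : undef;
}

for my $sample ('he thought', 'she said', 'the shed') {
    my $hit = find_pronoun($sample);
    print "'$sample': ", defined $hit ? "matched '$hit'" : 'no match', "\n";
}
```

The last sample should not match: the word boundaries stop the pattern from finding 'he' inside 'the' or 'she' inside 'shed'.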
To make a series of characters (or metacharacters) optional,
group them in parentheses as before. Did he say 'what the Entish
is' or 'what the Entish word is'? Either will do:
> perl matchtest.plx
Enter some text to find: what the Entish (word )?is
The text matches the pattern 'what the Entish (word )?is'.
>
Notice that we had to put the space inside the group: otherwise
the pattern demands two spaces between 'Entish' and 'is' whenever
'word' is absent, whereas our text only has one:
> perl matchtest.plx
Enter some text to find: what the Entish (word)? is
'what the Entish (word)? is' was not found.
>
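The difference between the two groupings can be sketched side by side. The variable names and the two sample phrases below are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $space_inside  = qr/what the Entish (word )?is/;   # space is part of the optional group
my $space_outside = qr/what the Entish (word)? is/;   # space is always required

for my $text ('what the Entish is', 'what the Entish word is') {
    printf "%-26s inside: %-3s outside: %s\n", "'$text'",
        ($text =~ $space_inside  ? 'yes' : 'no'),
        ($text =~ $space_outside ? 'yes' : 'no');
}
```

With the space outside the group, only the phrase that actually contains 'word' matches, because the space before the group and the space after it must both be present.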
As well as matching something one or zero times, you can match
something one or more times. We do this with the plus sign - to
match an entire word without specifying how long it should be,
you can say:
> perl matchtest.plx
Enter some text to find: \b\w+\b
The text matches the pattern '\b\w+\b'.
>
In this case, we match the first available word - I.
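To see which word the pattern picks up, a capture group can be added around \w+. This is a sketch only: first_word is a hypothetical helper, and the sample strings are stand-ins rather than the test string used by matchtest.plx.

```perl
use strict;
use warnings;

# Return the first run of word characters in the text, or undef if none.
# (first_word is a hypothetical helper name.)
sub first_word {
    my ($text) = @_;
    return $text =~ /\b(\w+)\b/ ? $1 : undef;
}

print first_word(q{"I wonder what the Entish is}), "\n";   # I
print first_word('  hello, world'), "\n";                  # hello
```

The matcher takes the leftmost place where a match can begin, so leading punctuation or spaces are skipped and the first word is returned.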
If, on the other hand, you have something which may be there any
number of times but might not be there at all - zero or one or
many - you need what's called 'Kleene's star': the *
quantifier. So, to find a capital letter after any - but possibly
no - spaces at the start of the string, what would you do? The
start of the string, then any number of whitespace characters,
then a capital:
> perl matchtest.plx
Enter some text to find: ^\s*[A-Z]
'^\s*[A-Z]' was not found.
>
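The same pattern can be tried against a few strings in a script. The helper name starts_with_capital and the sample strings are assumptions for illustration.

```perl
use strict;
use warnings;

# True if the text begins with optional whitespace followed by a capital letter.
# (starts_with_capital is a hypothetical helper name.)
sub starts_with_capital {
    my ($text) = @_;
    return $text =~ /^\s*[A-Z]/ ? 1 : 0;
}

for my $sample ('Hello', '   Hello', '"Hello', 'hello') {
    printf "%-10s %s\n", "[$sample]",
        starts_with_capital($sample) ? 'matches' : 'no match';
}
```

Only the first two samples should match: \s* happily matches zero spaces, but it cannot skip over a quote mark or turn a lowercase letter into a capital.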
Of course, our test string begins with a quote, so the above
pattern won't match, but, sure enough, if you take away that
first quote, the pattern will match fine. Let's review the three
quantifiers:
/bea?t/ | Matches either 'beat' or 'bet'
/bea+t/ | Matches 'beat', 'beaat', 'beaaat'…
/bea*t/ | Matches 'bet', 'beat', 'beaat'…
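The summary above can be checked with a short sketch. The ^ and $ anchors are an addition here, so that each pattern has to account for the whole word, and the %pattern hash is a name invented for this example.

```perl
use strict;
use warnings;

# Anchored versions of the three patterns (the ^ and $ are additions).
my %pattern = (
    '?' => qr/^bea?t$/,   # zero or one 'a'
    '+' => qr/^bea+t$/,   # one or more 'a'
    '*' => qr/^bea*t$/,   # zero or more 'a'
);

for my $word (qw(bet beat beaat)) {
    my @hits = grep { $word =~ $pattern{$_} } ('?', '+', '*');
    print "$word: matched by @hits\n";
}
```

Running it should report that 'bet' is matched by ? and *, 'beat' by all three, and 'beaat' by + and * only.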
Novice Perl programmers tend to go to town on combinations of dot
and star, and the results often surprise them, particularly when
it comes to searching-and-replacing. We'll explain the rules of
the regular expression matcher shortly, but bear the following in
mind:
A regular expression should hardly ever start or
finish with a starred character.
You should also consider the fact that .* and
.+ in the middle of a regular expression will match
as much of your string as they possibly can. We'll look more at
this 'greedy' behavior later on.
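Both warnings can be illustrated with a small sketch. The sample strings and variable names are assumptions chosen for this example.

```perl
use strict;
use warnings;

# Greedy: .* runs to the LAST quote it can reach, not the nearest one.
my $text = "say 'yes' and 'no' now";
my ($captured) = $text =~ /'(.*)'/;
print "captured: $captured\n";    # captured: yes' and 'no

# A pattern that is only a starred character can match the empty string,
# so a global replace fires at every position in the string.
(my $copy = 'abc') =~ s/x*/-/g;
print "replaced: $copy\n";        # replaced: -a-b-c-
```

In the second case there is no 'x' anywhere in the string, yet the substitution still succeeds four times, which is the kind of surprise that starred characters at the edge of a pattern tend to cause.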