Reoccurring Character Classes - Page 7
February 23, 2001
Some character classes are going to come up again and again: the
digits, the letters, and the various types of whitespace. Perl
provides us with some neat shortcuts for these. Here are the most
common ones, and what they represent:
| Shortcut | Expansion | Description |
|---|---|---|
| \d | [0-9] | Digits 0 to 9. |
| \w | [0-9A-Za-z_] | A 'word' character allowable in a Perl variable name. |
| \s | [ \t\n\r] | A whitespace character: that is, a space, a tab, a newline or a return. |
Also, the negative forms of the above:
| Shortcut | Expansion | Description |
|---|---|---|
| \D | [^0-9] | Any non-digit. |
| \W | [^0-9A-Za-z_] | A non-'word' character. |
| \S | [^ \t\n\r] | A non-blank character. |
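As a quick check of the two tables, the following minimal sketch tests a handful of single characters against some of the shortcuts. The sample characters and the %is hash are made up for illustration and are not part of matchtest.plx.

```perl
use strict;
use warnings;

# Hypothetical sample characters, chosen only for illustration.
my @samples = ('7', 'x', '_', ' ', '-');

# For each character, record 1 if it matches the shortcut and 0 if not.
my %is;
for my $ch (@samples) {
    $is{$ch} = {
        d => ($ch =~ /\d/ ? 1 : 0),   # digit
        w => ($ch =~ /\w/ ? 1 : 0),   # 'word' character
        s => ($ch =~ /\s/ ? 1 : 0),   # whitespace
        D => ($ch =~ /\D/ ? 1 : 0),   # anything that is not a digit
    };
    printf "'%s': \\d=%d \\w=%d \\s=%d \\D=%d\n",
        $ch, @{ $is{$ch} }{qw(d w s D)};
}
```

Note that the underscore counts as a 'word' character, while the hyphen matches none of \d, \w or \s.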
So, if we wanted to see if there was a five-letter word in the
sentence, you might think we could do this:
> perl matchtest.plx
Enter some text to find: \w\w\w\w\w
The text matches the pattern '\w\w\w\w\w'.
>
But that's not right - there are no five-letter words in the
sentence! The problem is, we've only asked for five letters in a
row, and any word with at least five letters contains five in a
row, so it will match that pattern. We actually matched 'wonde',
which was the first possible series of five letters in a row. To
actually get a five-letter word, we might consider deciding that
the word must appear in the middle of the sentence, that is,
between two spaces:
> perl matchtest.plx
Enter some text to find: \s\w\w\w\w\w\s
'\s\w\w\w\w\w\s' was not found.
>
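To see exactly which text the first pattern picked up, we can print the special variable $&, which holds whatever the last successful match covered. This is only a sketch: the sentence below is an assumption, chosen to be consistent with the results described here, since the sentence matchtest.plx really searches is not shown in this section.

```perl
use strict;
use warnings;

# Assumed sample sentence; the one matchtest.plx really uses is not shown here.
my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# Five word characters in a row: this can match inside a longer word.
my $first_five = $text =~ /\w\w\w\w\w/ ? $& : undef;
print "Matched: $first_five\n" if defined $first_five;

# Five word characters with a whitespace character on each side.
my $spaced = $text =~ /\s\w\w\w\w\w\s/ ? 1 : 0;
print "Between spaces: ", ($spaced ? "found" : "not found"), "\n";
```

With this sentence, the first pattern stops partway through 'wonder', and the second finds nothing because no word between two spaces is exactly five letters long.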
Word Boundaries
The problem with that is, when we're looking at text, words
aren't always between two spaces. They can be followed by or
preceded by punctuation, or appear at the beginning or end of a
string, or otherwise next to non-word characters. To help us
properly search for words in these cases, Perl provides the
special \b metacharacter. The interesting thing
about \b is that it doesn't actually match any
character in particular. Rather, it matches the point between
something that isn't a word character (either \W or
one of the ends of the string) and something that is (a word
character), hence \b for boundary. So, for example,
to look for one-letter words:
> perl matchtest.plx
Enter some text to find: \s\w\s
'\s\w\s' was not found.
> perl matchtest.plx
Enter some text to find: \b\w\b
The text matches the pattern '\b\w\b'.
As the 'I' was preceded by a quotation mark rather than a space,
\s couldn't match there - but a word boundary does the job. Later,
we'll learn how to tell perl how many repetitions of a character or
group of characters we want to match without spelling it out
directly.
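The same comparison can be sketched as a short script. As before, the sentence is an assumption chosen to agree with the results above, and the variable names are made up for illustration. The last test returns to the earlier question of five-letter words, this time using \b on both sides.

```perl
use strict;
use warnings;

# Assumed sample sentence; the one matchtest.plx really uses is not shown here.
my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# A one-letter word between two whitespace characters. The 'I' follows a
# quotation mark, not a space, so this does not match.
my $between_spaces = $text =~ /\s\w\s/ ? 1 : 0;

# The same search using word boundaries; $& holds the matched text.
my $one_letter = $text =~ /\b\w\b/ ? $& : undef;

# Exactly five word characters with a boundary at each end.
my $five_letter = $text =~ /\b\w\w\w\w\w\b/ ? 1 : 0;

print "Between spaces: ", ($between_spaces ? "found" : "not found"), "\n";
print "One-letter word: $one_letter\n" if defined $one_letter;
print "Five-letter word: ", ($five_letter ? "found" : "not found"), "\n";
```

Because \b matches a position rather than a character, $& contains only the 'I' itself, with nothing from either side of it.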
What, then, if we wanted to match anything at all? You might
consider something like [\w\W] or
[\s\S], for instance. Actually, this is quite a
common operation, so Perl provides an easy way of specifying it -
a full stop, which matches any single character (except, by
default, a newline). What about an 'r' followed by two characters -
any two characters - and then an 'h'?
> perl matchtest.plx
Enter some text to find: r..h
The text matches the pattern 'r..h'.
>
Is there anything after the full stop in our sentence?
> perl matchtest.plx
Enter some text to find: \..
'\..' was not found.
>
What's that? One backslashed full stop to mean a full stop, then
a plain one to mean 'anything at all'.
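Here is a minimal sketch of both searches, again using an assumed sentence that agrees with the results above. It also checks one detail worth knowing: by default, a plain full stop matches any character except a newline.

```perl
use strict;
use warnings;

# Assumed sample sentence; the one matchtest.plx really uses is not shown here.
my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# An 'r', then any two characters, then an 'h'.
my $r_h = $text =~ /r..h/ ? $& : undef;
print "Matched: '$r_h'\n" if defined $r_h;

# A literal full stop followed by any one character.
my $after_stop = $text =~ /\../ ? 1 : 0;
print "After the full stop: ", ($after_stop ? "found" : "not found"), "\n";

# By default, the plain full stop does not match a newline.
my $dot_newline = "\n" =~ /./ ? 1 : 0;
print "Dot matches newline: ", ($dot_newline ? "yes" : "no"), "\n";
```

In this sentence, the two 'anything' characters turn out to be a space and a 'w', spanning the end of 'wonder' and the start of 'what'.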
Beginning Perl