POSIX and Unicode Classes - Page 8
February 23, 2001
Perl 5.6.0 introduced a few more character classes into the mix.
The first group is defined by the POSIX (Portable Operating System
Interface) standard, and is therefore also available in a number of
other applications. The more common POSIX character classes are:
| Shortcut | Expansion | Description |
| [[:alpha:]] | [a-zA-Z] | An alphabetic character. |
| [[:alnum:]] | [0-9A-Za-z] | An alphabetic or numeric character. |
| [[:digit:]] | \d | A digit, 0-9. |
| [[:lower:]] | [a-z] | A lower case letter. |
| [[:upper:]] | [A-Z] | An upper case letter. |
| [[:punct:]] | [!"#$%&'()*+,-./:;<=>?@\[\\\]^_`{|}~] | A punctuation character - note the escaped characters [, \, and ]. |
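As a minimal sketch of how these classes are written in practice, the following script classifies a few single characters. The sample characters and the %class_of hash are hypothetical, chosen only for illustration; the point to notice is the double brackets, since [:alpha:] and friends are only valid inside a bracketed character class.

```perl
use strict;
use warnings;

# Hypothetical sample characters, chosen only for illustration.
my @samples = ('a', 'Z', '7', '!', ' ');

# Classify each sample with a POSIX class. The outer brackets form the
# character class; the inner [:name:] selects the POSIX set.
my %class_of;
for my $char (@samples) {
    $class_of{$char} =
          $char =~ /^[[:upper:]]$/ ? 'upper'
        : $char =~ /^[[:lower:]]$/ ? 'lower'
        : $char =~ /^[[:digit:]]$/ ? 'digit'
        : $char =~ /^[[:punct:]]$/ ? 'punct'
        :                            'other';
    print "'$char' is $class_of{$char}\n";
}
```

POSIX classes can also be combined with ordinary ranges inside the same brackets, for example [[:digit:]a-f].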
The Unicode standard also defines 'properties', which apply to
some characters. For instance, the 'IsUpper' property matches any
upper-case character, in whichever language or alphabet. If you
know the property you are trying to match, you can use the syntax
\p{} to match it: an upper-case character, for instance, is
\p{IsUpper}.
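To illustrate the difference between a Unicode property and an ASCII range, here is a small sketch, assuming a reasonably modern Perl. The sample characters are hypothetical and are written as code-point escapes so the script does not depend on the encoding of the source file.

```perl
use strict;
use warnings;

# Hypothetical samples: A, E-acute (U+00C9), Greek Omega (U+03A9),
# a, and e-acute (U+00E9).
my @chars = ('A', "\x{C9}", "\x{3A9}", 'a', "\x{E9}");

# \p{IsUpper} matches upper-case letters in any alphabet;
# [A-Z] only knows about the 26 unaccented Latin capitals.
my @unicode_upper = grep { /^\p{IsUpper}$/ } @chars;
my @ascii_upper   = grep { /^[A-Z]$/ } @chars;

# Print code points rather than the characters themselves, to avoid
# any dependence on the terminal's encoding.
printf "\\p{IsUpper} matched: %s\n",
    join ' ', map { sprintf('U+%04X', ord $_) } @unicode_upper;
printf "[A-Z] matched: %s\n",
    join ' ', map { sprintf('U+%04X', ord $_) } @ascii_upper;
```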
Alternatives
Instead of giving a series of acceptable characters, you may want
to say 'match either this or that'. The 'either-or' operator in a
regular expression uses the same symbol as the bitwise 'or'
operator, |. So, to match either 'yes' or 'maybe' in our
example, we could say this:
> perl matchtest.plx
Enter some text to find: yes|maybe
The text matches the pattern 'yes|maybe'.
>
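The same alternation can be tried directly in a script. This is only a sketch: the sample lines below are hypothetical and are not the text that matchtest.plx searches.

```perl
use strict;
use warnings;

# Hypothetical sample lines, for illustration only.
my @lines = ('yes, please', 'maybe later', 'no thanks');

# /yes|maybe/ succeeds if either alternative occurs anywhere in the line.
my @hits = grep { /yes|maybe/ } @lines;
print "matched: $_\n" for @hits;
```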
That's either 'yes' or 'maybe'. But what if we wanted either
'yes' or 'yet'? To get alternatives on part of an expression, we
need to group the options. In a regular expression, grouping is
always done with parentheses:
> perl matchtest.plx
Enter some text to find: ye(s|t)
The text matches the pattern 'ye(s|t)'.
>
If we had forgotten the parentheses, we would have tried to
match either 'yes' or 't'. In this case, we'd still get a
positive match, but it wouldn't be doing what we want - we'd get
a match for any string with a 't' in it, whether the words 'yes'
or 'yet' were there or not.
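The effect of the missing parentheses is easy to see on a handful of words. The word list here is hypothetical; 'cat' is included because it contains a 't' but neither 'yes' nor 'yet'.

```perl
use strict;
use warnings;

# Hypothetical word list, for illustration only.
my @words = ('yes', 'yet', 'cat', 'yellow');

# Grouped: 'ye' followed by either 's' or 't'.
my @grouped = grep { /ye(s|t)/ } @words;

# Ungrouped: the alternation splits the whole pattern,
# so this means 'yes' or a bare 't'.
my @ungrouped = grep { /yes|t/ } @words;

print "ye(s|t) matched: @grouped\n";
print "yes|t   matched: @ungrouped\n";
```

The ungrouped pattern picks up 'cat' as well, which is exactly the unwanted match described above.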
You can match either 'this' or 'that' or 'the other' by adding
more alternatives:
> perl matchtest.plx
Enter some text to find: (this)|(that)|(the other)
'(this)|(that)|(the other)' was not found.
>
However, in this case, it's more efficient to separate out the
common elements:
> perl matchtest.plx
Enter some text to find: th(is|at|e other)
'th(is|at|e other)' was not found.
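As a quick check that factoring out the common 'th' does not change which strings match, the sketch below runs both patterns over the same hypothetical phrases. It only compares match results; it makes no attempt to measure speed.

```perl
use strict;
use warnings;

# Hypothetical phrases, for illustration only.
my @phrases = ('this way', 'that way', 'the other way', 'those ways', 'then');

my $separate = qr/(this)|(that)|(the other)/;
my $factored = qr/th(is|at|e other)/;

# Collect any phrase on which the two patterns disagree.
my @mismatch = grep {
    ($_ =~ $separate ? 1 : 0) != ($_ =~ $factored ? 1 : 0)
} @phrases;

my @hits = grep { $_ =~ $factored } @phrases;
print "matched: $_\n" for @hits;
print "patterns disagree on: @mismatch\n" if @mismatch;
```

One difference to keep in mind: the two patterns capture different text. The first form fills $1, $2 or $3 with the whole word or phrase, while the factored form puts only the part after 'th' into $1.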
You can also nest alternatives. Say you want to match one of
these patterns:
- 'the' followed by whitespace or a letter,
- 'or'
You might put something like this:
> perl matchtest.plx
Enter some text to find: (the(\s|[a-z]))|or
The text matches the pattern '(the(\s|[a-z]))|or'.
>
It looks fearsome, but break it down into its components. Our two
alternatives are:
- (the(\s|[a-z]))
- or
The second part is easy, while the first contains 'the' followed
by two alternatives: \s and [a-z].
Hence: either 'the' followed by either a whitespace character or a
lower case letter, or 'or'. We can, in fact, tidy this up a little,
by replacing (\s|[a-z]) with the less cluttered
[\sa-z].
> perl matchtest.plx
Enter some text to find: (the[\sa-z])|or
The text matches the pattern '(the[\sa-z])|or'.
>
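To see that the nested alternation and the character class accept the same strings, the following sketch tests both on a few hypothetical samples, including some that should fail ('the' with nothing after it, upper-case text, and 'the' followed by a digit).

```perl
use strict;
use warnings;

# Hypothetical samples, for illustration only.
my @samples = ('the end', 'theme', 'or', 'the', 'THE END', 'the1');

my $nested = qr/(the(\s|[a-z]))|or/;   # alternation inside a group
my $tidy   = qr/(the[\sa-z])|or/;      # same choice as one character class

my @nested_hits = grep { $_ =~ $nested } @samples;
my @tidy_hits   = grep { $_ =~ $tidy } @samples;

print "nested: ", join(', ', @nested_hits), "\n";
print "tidy:   ", join(', ', @tidy_hits), "\n";
```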