Transliteration - Page 16
March 23, 2001
While we're looking at regular expressions, we should briefly
consider another operator. While it's not directly associated
with regexps, the transliteration operator has a lot in common
with them and adds a very useful facility to the matching and
substitution techniques we've already seen.
The operator pairs up the characters in its two arguments, one by
one, and uses these pairings to substitute individual characters
in the referenced string. It uses the syntax tr/one/two/ and (as
with the matching and substitution operators) works on the
special variable $_ unless you bind it to another variable with
=~ or !~. With tr/one/two/, it replaces all the 'o's in the
referenced string with 't's, all the 'n's with 'w's, and all the
'e's with 'o's.
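To see those pairings at work, here is a minimal sketch. The sample strings and variable names are made up for illustration; only tr/one/two/ comes from the text above.

```perl
use strict;
use warnings;

# With no binding operator, tr/// works on $_.
$_ = "one tone";
tr/one/two/;                  # o -> t, n -> w, e -> o
my $default_result = $_;      # "two ttwo"

# Use =~ to point it at a different variable.
my $word = "none";            # hypothetical sample string
$word =~ tr/one/two/;         # "wtwo"

print "$default_result\n$word\n";
```

Notice that the 't' already in "tone" is left alone, because 't' does not appear in the first argument.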
Let's say you wanted to replace, for some reason, all the numbers
in a string with letters. You might say something like this:
$string =~ tr/0123456789/abcdefghij/;
This would turn, say, "2011064" into
"cabbage". You can use ranges in transliteration, but
not character classes such as \d or \w. We could write the above as:
$string =~ tr/0-9/a-j/;
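As a quick check that the two forms behave the same, here is a small sketch. The variable names are assumptions made for the example, and the upper-casing line is an extra illustration of ranges rather than something taken from the text.

```perl
use strict;
use warnings;

my $long  = "2011064";
my $short = "2011064";

$long  =~ tr/0123456789/abcdefghij/;   # every pairing spelled out
$short =~ tr/0-9/a-j/;                 # the same pairings, written as ranges

# Ranges work for letters too: upper-case a copy of the result.
my $shout = $short;
$shout =~ tr/a-z/A-Z/;

print "$long $short $shout\n";
```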
The return value of this operator is, by default, the number of
characters matched with those in the first argument. You can
therefore use the transliteration operator to count the number of
occurrences of certain characters. For example, to count the
number of vowels in a string, you can use:
my $vowels = $string =~ tr/aeiou//;
Note that this will not actually change any of the vowels in
the variable $string. As the second argument is
empty, each character is simply paired with itself, so the string
is left as it was.
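Here is a minimal sketch of the counting idiom, using a made-up sample string. It also confirms that the string itself is untouched afterwards.

```perl
use strict;
use warnings;

my $string = "transliteration";        # hypothetical sample string
my $before = $string;

# tr/// returns how many characters it matched.
my $vowels = $string =~ tr/aeiou//;    # a, i, e, a, i, o

print "$vowels vowels in $string\n";
```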
The transliteration operator can also take the
/d modifier, which deletes any characters listed on the
left that have no corresponding character on the right. So,
to get rid of all spaces in a string quickly, you could use this
line:
$string =~ tr/ //d;
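The sketch below tries /d on two made-up strings. The second case is an assumption added for illustration: it mixes characters that have a partner on the right (which are replaced) with one that does not (which is deleted).

```perl
use strict;
use warnings;

# Delete every space; the return value is still the match count.
my $string  = "a b  c d";
my $removed = $string =~ tr/ //d;      # four spaces removed

# a, b and c have partners; the space has none, so it is deleted.
my $mixed = "a cab";
$mixed =~ tr/abc /ABC/d;

print "$string ($removed removed), $mixed\n";
```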
Common Blunders
There are a few common mistakes people tend to make when writing
regexps. We've already seen that /a*b*c*/ will
happily match any string at all, since it matches each letter
zero times. What else can go wrong?
Forgetting To Group: /Bam{2}/ will match
'Bamm', while /(Bam){2}/ will match 'BamBam', so be
careful when choosing which one to use. The same goes for
alternation: /Simple|on/ will match 'Simple' and
'on', while /Sim(ple|on)/ will match both 'Simple'
and 'Simon'. Use parentheses so the alternation covers only
the part you intend.
Getting The Anchors Wrong: ^ goes at the
beginning, $ goes at the end. A dollar anywhere else
in the pattern makes Perl try to interpolate a variable.
Forgetting To Escape Special Characters: Do you want them
to have a special meaning? If not, put a backslash in front of
them. These are the characters to be careful
of: . * ? + [ ] ( ) { } ^ $ | and of course
\ itself.
Not Counting from Zero: The first entry in an array is
given the index zero.
Counting from Zero: I know, I know! All along I've been
telling you that computers start counting from zero.
Nevertheless, there's always the odd exception - the first
backreference is $1. Don't blame Perl though - it
took this behavior from a language called awk, which
uses $1 for the first field of a line.
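The blunders above can be tried out in one short sketch. All the sample strings and variable names here are assumptions made for illustration, and the patterns are anchored with ^ and $ so that each test has to match the whole string.

```perl
use strict;
use warnings;

# Grouping: {2} repeats only the item just before it.
my $char_repeat  = "Bamm"   =~ /^Bam{2}$/   ? 1 : 0;   # 'm' twice
my $group_repeat = "BamBam" =~ /^(Bam){2}$/ ? 1 : 0;   # 'Bam' twice
my $wrong_choice = "BamBam" =~ /^Bam{2}$/   ? 1 : 0;   # no match

# Alternation: without inner parentheses, | splits the whole pattern.
my ($loose) = "Simon" =~ /(Simple|on)/;      # captures only "on"
my ($tight) = "Simon" =~ /(Sim(ple|on))/;    # captures "Simon"

# Anchors and escaping: \$ and \. are literal, a final $ is an anchor.
my $price     = 'Total: $4.99';              # single quotes, no interpolation
my $at_start  = $price =~ /^Total/   ? 1 : 0;
my $literal   = $price =~ /\$4\.99$/ ? 1 : 0;
my $unescaped = 'a+b' =~ /^a+b$/     ? 1 : 0;   # + means "one or more a"
my $escaped   = 'a+b' =~ /^a\+b$/    ? 1 : 0;   # \+ is a literal plus

# Counting: arrays start at index 0, backreferences start at $1.
my @parts = ("2001", "03", "23");
my $first_element = $parts[0];
my $first_capture;
if ("2001-03-23" =~ /^(\d+)-(\d+)-(\d+)$/) {
    $first_capture = $1;
}

print "$loose $tight $first_element $first_capture\n";
```

The unescaped /^a+b$/ fails on 'a+b' because the plus sign is read as a quantifier rather than as a character to match.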