Try it out: Rhyming Dictionary - Page 6
February 23, 2001
Let's see one more example of this, where we'll combine looking
for matches with looking through the lines in a file:
Imagine yourself as a poor poet. In fact, not just poor, but
downright bad - so bad, you can't even think of a rhyme for
'pink'. So, what do you do? You do what every sensible poet does
in this situation, and you write the following Perl program:
#!/usr/bin/perl
# rhyming.plx
use warnings;
use strict;
my $syllable = "ink";
while (<>) {
    print if /$syllable$/;
}
We can now feed it a file of words, and find those that end in
'ink':
> perl rhyming.plx wordlist.txt
blink
bobolink
brink
chink
clink
>
For a really thorough result, you'll need to use a file
containing every word in the dictionary - though be prepared to
wait if you do! For the sake of the example, however, any
text-based file will do (though it'll help if it's in English). A
bobolink, in case you're wondering, is a migratory American
songbird, otherwise known as a ricebird or reedbird.
How It Works
With the loops and tests we learned in the last chapter, this
program is really very easy:
while (<>) { print if /$syllable$/;}
We've not looked at file access yet, so you may not be familiar
with the while (<>) {...} construction used here. In this example
it opens the file that's been specified on the command line and
loops through it one line at a time, feeding each line into the
special variable $_ - this is what we'll be matching.
Once each line of the file has been fed into $_, we test to see
if it matches the pattern, which is our syllable, 'ink', anchored
to the end of the line (with $). If so, we print it out.
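Written out longhand, the loop might look like the following sketch. It is not the program from the text: to stay self-contained it reads from an in-memory "file" (the contents of $words are made up for illustration) rather than from the command line, and it names the line variable explicitly instead of relying on $_.

```perl
use warnings;
use strict;

my $syllable = "ink";

# Stand-in for wordlist.txt: an in-memory "file" with assumed contents.
my $words = "blink\nblank\nbrink\npinky\n";
open(my $fh, '<', \$words) or die "Cannot open in-memory file: $!";

my @rhymes;
# Read one line at a time, as while (<>) does implicitly with $_.
while (defined(my $line = <$fh>)) {
    # Keep the line only if it ends in our syllable.
    push @rhymes, $line if $line =~ /$syllable$/;
}
close $fh;

print @rhymes;    # prints "blink" and "brink", one per line
```

Each line still carries its newline, which is why the words can be printed without adding one.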
The important thing to note here is that perl treats the 'ink' as
the last thing on the line, even though there is a newline at the
end of $_. This is because the $ anchor matches either at the very
end of the string or just before a newline that ends it - we'll
look at this behavior in more detail later.
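A small experiment makes the newline behavior easier to see. The helper name ends_in_ink is ours, chosen purely for illustration; the sketch only assumes the default behavior of $ with no extra regular expression options.

```perl
use warnings;
use strict;

# Hypothetical helper: does the string end in 'ink' as far as $ is concerned?
sub ends_in_ink {
    my ($string) = @_;
    return $string =~ /ink$/ ? 1 : 0;
}

print ends_in_ink("blink"),     "\n";   # 1: 'ink' is at the very end
print ends_in_ink("blink\n"),   "\n";   # 1: the single trailing newline is ignored
print ends_in_ink("blink\n\n"), "\n";   # 0: only one final newline is ignored
print ends_in_ink("blinks\n"),  "\n";   # 0: an 's' follows 'ink'
```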
Shortcuts and Options
This is all very well if we know exactly what it is we're
trying to find, but finding patterns means more than just
locating exact pieces of text. We may want to find a three-digit
number, the first word on the line, four or more letters all in
capitals, and so on.
We can begin to do this using character classes - these
aren't just single characters, but something that signifies that
any one of a set of characters is acceptable. To specify this, we
put the characters we consider acceptable inside square brackets.
Let's go back to our matchtest program, using the
same test string:
$_ = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);
> perl matchtest.plx
Enter some text to find: w[aoi]nder
The text matches the pattern 'w[aoi]nder'.
>
What have we done? We've tested whether the string contains a
'w', followed by either an 'a', an 'o', or an 'i', followed by
'nder'; in effect, we're looking for either of 'wander',
'wonder', or 'winder'. Since the string contains 'wonder', the
pattern is matched.
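The same test can be tried without the interactive prompt. The sketch below is a stand-in for matchtest, not the program itself: the helper name found and the second pattern, 'w[ae]nder', are our own additions for comparison.

```perl
use warnings;
use strict;

my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# Hypothetical helper: true if the pattern occurs anywhere in $text.
sub found {
    my ($pattern) = @_;
    return $text =~ /$pattern/ ? 1 : 0;
}

for my $pattern ('w[aoi]nder', 'w[ae]nder') {
    if (found($pattern)) {
        printf "The text matches the pattern '%s'.\n", $pattern;
    } else {
        printf "'%s' was not found.\n", $pattern;
    }
}
```

The first pattern matches because 'o' is in the class; the second does not, since the text contains neither 'wander' nor 'wender'.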
Conversely, we can say that any character is acceptable except
those in a given set - we can 'negate the character class'. To do
this, the character class should start with a ^, like so:
> perl matchtest.plx
Enter some text to find: th[^eo]
'th[^eo]' was not found.
>
So, we're looking for 'th' followed by something that is neither
an 'e' nor an 'o'. But all we have is 'the' and 'thought', so this
pattern does not match.
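To see which text a negated class actually picks up, it can help to print the matched portion. This is only a sketch: the helper first_match is hypothetical, and it uses capturing parentheses and $1 to retrieve the matched text. The pattern 'th[^e]' is added for contrast.

```perl
use warnings;
use strict;

my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# Hypothetical helper: return the first piece of $text that matches
# the pattern, or undef if nothing matches.
sub first_match {
    my ($pattern) = @_;
    return $text =~ /($pattern)/ ? $1 : undef;
}

for my $pattern ('th[^eo]', 'th[^e]') {
    my $found = first_match($pattern);
    if (defined $found) {
        printf "'%s' matched '%s'.\n", $pattern, $found;
    } else {
        printf "'%s' was not found.\n", $pattern;
    }
}
```

With 'e' as the only excluded character, the 'tho' of 'thought' becomes an acceptable match.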
If the characters you wish to match form a sequence in the
character set you're using - ASCII or Unicode, depending on your
perl version - you can use a hyphen to specify a range of
characters, rather than spelling out the entire range. For
instance, the numerals can be represented by the character class
[0-9]. A lower case letter can be matched with
[a-z]. Are there any numbers in our quote?
> perl matchtest.plx
Enter some text to find: [0-9]
'[0-9]' was not found.
>
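Several ranges can be checked against the quote in one go. As before, this is an illustrative sketch rather than the matchtest program, and the helper name found is an assumption of ours.

```perl
use warnings;
use strict;

my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# Hypothetical helper: true if the pattern occurs anywhere in $text.
sub found {
    my ($pattern) = @_;
    return $text =~ /$pattern/ ? 1 : 0;
}

# Try a digit range, a lower-case range and an upper-case range.
for my $pattern ('[0-9]', '[a-z]', '[A-Z]') {
    printf "%-6s %s\n", $pattern, found($pattern) ? "found" : "not found";
}
```

Only the digit range fails: the quote has plenty of lower-case letters and a few capitals, but no numerals.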
You can use one or more of these ranges alongside other
characters in a character class, so long as they stay inside the
brackets. If you wanted to match a digit and then a letter from
'A' to 'F', you would say [0-9][A-F]. However, to match a single
hexadecimal digit, you would write [0-9A-F], or [0-9A-Fa-f] if you
wished to include lower-case letters.
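The difference between the three classes shows up clearly on a few sample strings. The helper names and the samples below are made up for illustration; each helper simply reports whether its pattern occurs anywhere in the string.

```perl
use warnings;
use strict;

# Hypothetical helpers wrapping the three patterns from the text.
sub has_digit_then_letter { return $_[0] =~ /[0-9][A-F]/  ? 1 : 0 }
sub has_upper_hex_digit   { return $_[0] =~ /[0-9A-F]/    ? 1 : 0 }
sub has_any_hex_digit     { return $_[0] =~ /[0-9A-Fa-f]/ ? 1 : 0 }

# Columns: sample, [0-9][A-F], [0-9A-F], [0-9A-Fa-f]
for my $sample ('7F', 'F7', 'c', 'G') {
    printf "%-3s %d %d %d\n",
        $sample,
        has_digit_then_letter($sample),
        has_upper_hex_digit($sample),
        has_any_hex_digit($sample);
}
```

Note that 'F7' fails the first test because [0-9][A-F] demands two characters in that order, whereas the single classes are satisfied by either character on its own.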
Beginning Perl