Pulling Tags Like Taffy: TokeParser
August 9, 1999
Reading the current temperature from the Weather Underground
web page was fairly easy, partly because there was only one
piece of data to grab and partly because it was easily located
within an HTML comment tag. By now our confidence is swelling
and our ambition rising ...
The act of analyzing an HTML document (or any document, for
that matter) and trying to make some sense out of it,
programmatically speaking, is known as parsing. In
simple terms, we can say that we parsed the Weather
Underground page for the current temperature comment tag.
Parsing, though, can be a very complex business. For
one, parsing a document requires an understanding of that
document's structure -- for instance, HTML documents
are built using markup rules inherent in the HTML standard.
But this is the Perl you need to know and, thankfully,
you don't really need to know any of that. All
you do need to know, and befriend, is the Perl module
HTML::TokeParser
("token parser", with a mysteriously missing
"n"). Using this parsing module we can write ginsu
knife Perl scripts that slice and dice through a web page,
carving out just the data we're looking for. In doing
so, we'll cook up a sample web page analyzer that produces a
summary of a given page's code.
TokeParser is not childproof, but it is about as simple as
a complex parser can be, and that's a good thing. As usual,
we must first include the HTML::TokeParser module at the top
of our Perl script.
#!/usr/bin/perl
use CGI;
use LWP::Simple;
use HTML::TokeParser;
Now that TokeParser is chomping at the bit for some HTML to
parse, we must feed it. The easiest way to feed some
source to TokeParser is by retrieving a web page into a local
variable, as we've seen with each earlier example.
Construct a URL and retrieve page into local variable.
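#create the CGI object that holds any submitted form parameters
$cgiobject=CGI->new;
#retrieve web page
$fetchURL=$cgiobject->param("url");
#no "url" parameter supplied: fall back to the default URL
unless ($fetchURL)
{$fetchURL=""}
$webPage=get($fetchURL);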
This snippet of code looks for a parameter named
"url" supplied by a form submission (we'll see why
later); if none is supplied, it defaults to the URL of this
very fine publication. Then, leveraging the LWP::Simple module
that we've come to cherish, the specified page is retrieved
into the variable $webPage.
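The defaulting logic described above can be sketched on its own, without touching the network. In this sketch the helper name pick_url and the fallback address http://www.example.com/ are hypothetical stand-ins, not the article's real default, and the CGI objects are built from hash references purely so the example can run from the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CGI;

# Hypothetical fallback address, used only for this illustration.
my $DEFAULT_URL = "http://www.example.com/";

# Return the "url" form parameter, or $default when it is missing or empty.
sub pick_url {
    my ($cgi, $default) = @_;
    my $url = $cgi->param("url");    # scalar context: first value or undef
    return (defined $url && length $url) ? $url : $default;
}

# Simulate two form submissions: one with a "url" field, one without.
my $with    = CGI->new({ url => "http://www.example.org/page.html" });
my $without = CGI->new({});

print pick_url($with,    $DEFAULT_URL), "\n";
print pick_url($without, $DEFAULT_URL), "\n";
```

In a real script the chosen URL would then go to LWP::Simple's get(), which returns undef when the fetch fails, so it is worth checking the result before handing it to a parser.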
The stage is set ... time to parse. But parse for what? Well,
we could parse any HTML tag for a variety of reasons,
but let's consider a simple example: the title of this web
page. Of course, the HTML <TITLE> tag encloses a page's
title, so we can parse for this tag and extract its text:
Parse for the title of a web page.
sub parse_title{
#parse and output page title
$parser=HTML::TokeParser->new(\$webPage);
$parser->get_tag("title");
print "<p><h2>Page title</h2> ".
$parser->get_trimmed_text."</p>";
}
We begin by creating a new instance of a TokeParser object.
This sets the parser to begin parsing at the start
of the given document. Notice in the new() constructor
that we supply a reference to the variable that contains
the HTML document. A reference is indicated by the leading
backslash, and simply acts as a pointer to the named
variable. Had we instead specified the variable without
the reference marker (->new($webPage)), TokeParser
would have taken the contents of $webPage to be the name
of a local file on our hard drive to open and parse, which isn't
what we want in this case.
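To see the difference the backslash makes, here is a small sketch. The HTML string is made up for the example. The behavior shown relies on the documented rule that a reference to a scalar is parsed as the document itself, while a plain scalar is treated as a file name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# A made-up document held in a variable, standing in for a fetched page.
my $webPage = "<html><head><title>Sample Page</title></head><body></body></html>";

# With the backslash: the contents of $webPage are parsed as HTML.
my $by_ref = HTML::TokeParser->new(\$webPage);
print "reference: ", (defined $by_ref ? "parser created" : "no parser"), "\n";

# Without the backslash: the string is treated as a file name. No such file
# exists, so the constructor returns undef and $! explains why.
my $by_name = HTML::TokeParser->new($webPage);
print "plain scalar: ", (defined $by_name ? "parser created" : "no parser ($!)"), "\n";
```

Testing the constructor's return value this way is a cheap guard against forgetting the backslash.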
A call to TokeParser's get_tag() method specifies
which tag to snag, as they say. TokeParser will advance to
the first instance of the <TITLE> tag that it finds
and then hand control back to our script -- imagine
that TokeParser is standing there tapping its toes with
tag in hand saying "and what shall I do now, master?".
We simply spit out the text that follows the tag, wrapped
inside some output HTML. The get_trimmed_text() method
of TokeParser delivers that text (in this case,
the title of the web page) with leading and trailing whitespace
removed and any internal runs of whitespace collapsed to
single spaces. Alternatively, the get_text() method would
return the text untouched.
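A short sketch may make the difference between the two text methods concrete. The sample document and the helper name title_text are invented for illustration, with deliberately messy spacing inside the title. The helper also checks the return value of get_tag(), which is undef when no matching tag is found, a safeguard that parse_title leaves out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Invented sample page; note the stray spaces and newline inside <title>.
my $webPage = "<html><head><title>  Pulling   Tags\n Like Taffy </title></head></html>";

# Return the text following the first <title> tag, using the named method
# ("get_text" or "get_trimmed_text"); return undef if there is no title.
sub title_text {
    my ($html, $method) = @_;
    my $parser = HTML::TokeParser->new(\$html);
    $parser->get_tag("title") or return;
    return $parser->$method();
}

my $raw     = title_text($webPage, "get_text");
my $trimmed = title_text($webPage, "get_trimmed_text");

# Brackets make the leading and trailing whitespace visible.
print "raw:     [$raw]\n";
print "trimmed: [$trimmed]\n";
```

The trimmed line prints as [Pulling Tags Like Taffy], while the raw line keeps every space and the embedded newline exactly as they appear in the source.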