Web Developer's Virtual Library: Encyclopedia of Web Design Tutorials, Articles and Discussions



Pulling Tags Like Taffy: TokeParser

August 9, 1999

Reading the current temperature from the Weather Underground web page was fairly easy, partly because there was only one piece of data to grab and because it was easily located within an HTML comment tag. By now our confidence is swelling and ambition rising ...

The act of analyzing an HTML document (or any document, for that matter) and trying to make some sense out of it, programmatically speaking, is known as parsing. In simple terms, we can say that we parsed the Weather Underground page for the current temperature comment tag. Parsing, though, can be a very complex business. For one, parsing a document requires an understanding of that document's structure -- for instance, HTML documents are built using markup rules inherent in the HTML standard.

But this is The Perl You Need to Know and, thankfully, you don't really need to know any of that. All you do need to know, and befriend, is the Perl module HTML::TokeParser ("token parser", with a mysteriously missing "n"). Using this parsing module we can write ginsu knife Perl scripts that slice and dice through a web page, carving out just the data we're looking for. In doing so, we'll cook up a sample web page analyzer that produces a summary of a given page's code.

TokeParser is not childproof, but it is about as simple as a complex parser can be, and that's a good thing. As usual, we must first include the HTML::TokeParser module at the top of our Perl script.

#!/usr/bin/perl
use CGI;
use LWP::Simple;
use HTML::TokeParser;

Now that TokeParser is chomping at the bit for some HTML to parse, we must feed it. The easiest way to feed some source to TokeParser is to retrieve a web page into a local variable, as we've seen in each earlier example.

Construct a URL and retrieve page into local variable.
#retrieve web page
$fetchURL=$cgiobject->param("url");
unless ($fetchURL) 
 {$fetchURL=""}
$webPage=get($fetchURL);

This snippet of code looks for a parameter named "url" supplied by a form submission (we'll see why later); if none is supplied, it defaults to the URL of this very fine publication. Then, leveraging the LWP::Simple module that we've come to cherish, the specified page is retrieved into the variable $webPage.
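One detail worth guarding against: LWP::Simple's get() returns undef when a page cannot be retrieved, and an undefined $webPage gives the parser nothing to chew on. Here is a minimal sketch of the same "use the parameter, else a default" logic with a failure check added. The helpers pick_url and fetch_page are hypothetical names invented for this illustration, and the example.com address is only a placeholder, not the default the original script used.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Hypothetical helper: use the requested URL if one was supplied,
# otherwise fall back to the default.
sub pick_url {
    my ($requested, $default) = @_;
    return (defined $requested && length $requested) ? $requested : $default;
}

# Hypothetical helper: fetch a page, dying with a message on failure.
# LWP::Simple's get() returns undef when the request fails.
sub fetch_page {
    my ($url) = @_;
    my $page = get($url);
    die "Could not retrieve $url\n" unless defined $page;
    return $page;
}

# Placeholder default (an assumption); substitute the page you want to analyze.
my $default_url = "http://www.example.com/";

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    my $webPage = fetch_page(pick_url($ARGV[0], $default_url));
    print length($webPage), " bytes retrieved\n";
}
```

In a CGI script, the first argument to pick_url would simply be the value of $cgiobject->param("url").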

The stage is set ... time to parse. But parse for what? Well, we could parse any HTML tag for a variety of reasons, but let's consider a simple example: the title of this web page. Of course, the HTML <TITLE> tag encloses a page's title, so we can parse for this tag and extract its text:

Parse for the title of a web page.
sub parse_title {
    # parse and output page title
    $parser = HTML::TokeParser->new(\$webPage);
    $parser->get_tag("title");
    print "<p><h2>Page title</h2> " .
          $parser->get_trimmed_text . "</p>";
}

We begin by creating a new instance of a TokeParser object. This sets the parser to begin parsing at the start of the given document. Notice in the new() construct that we supply a reference to the variable that contains the HTML document. A reference is indicated by the leading backslash, and simply acts as a pointer to the named variable. Alternatively, had we specified the variable without the reference marker, as in ->new($webPage), TokeParser would have taken the contents of $webPage to be the name of a local file on our hard drive to open and parse (a filehandle for an already open file is also accepted), which isn't what we want in this case.
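To see the difference the backslash makes, here is a small sketch that builds one parser from a scalar reference and attempts another from a plain scalar. The sample HTML and the file name are made up for this illustration, and the file is assumed not to exist. When a file cannot be opened, the constructor returns undef and leaves the reason in $!.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $webPage = "<html><head><title>Sample page</title></head></html>";

# A reference to a scalar: the string itself is parsed as HTML.
my $from_string = HTML::TokeParser->new(\$webPage);

# A plain scalar: treated as the name of a file to open. This file is
# assumed not to exist, so the constructor returns undef.
my $from_file = HTML::TokeParser->new("no-such-file-for-this-demo.html");

print "from string: ", (defined $from_string ? "parser created" : "failed"), "\n";
print "from file:   ", (defined $from_file   ? "parser created" : "failed: $!"), "\n";
```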

A call to TokeParser's get_tag() method specifies which tag to snag, as they say. TokeParser will grab the first instance of the <TITLE> tag that it finds and then hand control back to our Perl script -- imagine that TokeParser is standing there tapping its toes with tag in hand saying "and what shall I do now, master?".
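Not every page has a title, of course. get_tag() returns an array reference describing the tag it found, or undef if it ran off the end of the document without finding one, so the return value can be tested before going any further. A minimal sketch, using made-up sample documents and a hypothetical helper named has_title:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample documents invented for this illustration.
my $with_title    = "<HTML><HEAD><TITLE>Sample page</TITLE></HEAD></HTML>";
my $without_title = "<html><body><p>No title here</p></body></html>";

# Hypothetical helper: report whether a document contains a <title> tag.
sub has_title {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html);
    # Tag names are matched in lowercase, so "title" also finds <TITLE>.
    # get_tag() returns undef if the document ends before a match.
    my $tag = $parser->get_tag("title");
    return defined $tag ? 1 : 0;
}

print "with title:    ", has_title($with_title), "\n";
print "without title: ", has_title($without_title), "\n";
```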

Simply, we spit out the text of the tag wrapped inside some output HTML. The get_trimmed_text() method of TokeParser delivers the text that follows the tag, up to the next tag (in this case, the title of the web page), with leading and trailing whitespace removed and any run of whitespace inside it collapsed to a single space. Alternatively, the get_text() method would return the same text with its whitespace untouched.
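The difference between the two methods is easiest to see on a title padded with stray spaces and line breaks. The sketch below runs both against the same made-up document; the sample HTML and the variable names are assumptions for illustration only.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# A made-up page whose title contains extra spaces and newlines.
my $html = "<html><head><title>\n   My   Sample\n   Page  </title></head></html>";

# get_text(): the text up to the next tag, whitespace left alone.
my $p1 = HTML::TokeParser->new(\$html);
$p1->get_tag("title");
my $raw = $p1->get_text;

# get_trimmed_text(): same text, trimmed at both ends and with
# internal runs of whitespace collapsed to single spaces.
my $p2 = HTML::TokeParser->new(\$html);
$p2->get_tag("title");
my $trimmed = $p2->get_trimmed_text;

print "raw:     [$raw]\n";
print "trimmed: [$trimmed]\n";   # trimmed: [My Sample Page]
```

A fresh parser is created for each call because a parser moves forward through the document as it reads; once the title text has been consumed, it cannot be read a second time from the same object.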





