The Proof is in the Parsing: A Web Page Summarizer
August 9, 1999
Cobbling together the TokeParser techniques we've seen, we
now offer a moderately cute little Perl program which,
given a valid URL, fetches the page, parses it for a handful
of tags (the title, the meta description and keywords,
images, and hyperlinks), and outputs a summary. This program
relies on most of the primary technologies we've encountered
in The Perl You Need to Know series: CGI, on-the-fly HTML
output (forms included), the LWP::Simple module, and the
newly enthralling HTML::TokeParser module.
This program is by no means super-intelligent, robust, or
even easy on the eyes -- but it illustrates the core
techniques of this month's focus and provides plenty of
reusable code for your own parsing excavations, much of
which we've seen earlier in this article. You can, of
course, view a live demo of parsepage.cgi in action.
parsepage.cgi: Analyzes a given URL and produces a page
summary, thanks to TokeParser.
#!/usr/bin/perl
use CGI;
use LWP::Simple;
use HTML::TokeParser;
$cgiobject=new CGI;
$cgiobject->use_named_parameters;
print $cgiobject->header;
print $cgiobject->start_html
  (-title=>'Page Parser',
   -bgcolor=>'white');
#start_form/end_form are the supported spellings;
#newer CGI.pm releases dropped startform/endform
print $cgiobject->start_form
  (-method=>'get',
   -action=>'parsepage.cgi');
print "URL to Analyze:".$cgiobject->textfield
  (-name=>'url',
   -size=>'40');
print "<br>".$cgiobject->submit(-value=>'Analyze');
print $cgiobject->end_form;
print "<hr>";
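#retrieve web page; if no URL was submitted, show only the form
$fetchURL=$cgiobject->param("url");
unless ($fetchURL)
  { print $cgiobject->end_html; exit }
$webPage=get($fetchURL);
#the URL is user input, so escape it before echoing it back
$safeURL=$cgiobject->escapeHTML($fetchURL);
#get() returns undef when the page cannot be fetched
unless (defined $webPage)
  { print "<p>Could not retrieve $safeURL</p>";
    print $cgiobject->end_html; exit }
print <<ENDHTML;
<center><h2>$safeURL<br>
has been sliced and diced,
thus revealing:</h2></center>
ENDHTML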
&parse_title;
&parse_meta_description;
&parse_meta_keywords;
&parse_images;
&parse_hyperlinks;
print $cgiobject->end_html;
sub parse_title{
  #parse and output page title
  $parser=HTML::TokeParser->new(\$webPage);
  $parser->get_tag("title");
  print "<p><h2>Page title</h2> ".
    $parser->get_trimmed_text."</p>";
}
sub parse_meta_keywords{
  #parse and output the meta keywords, if present
  $parser=HTML::TokeParser->new(\$webPage);
  while (my $token=$parser->get_tag("meta"))
    { if ($token->[1]{name}=~/keywords/i)
        { print "<p><h2>Meta Keywords</h2> ".
            $token->[1]{content}."</p>" }
    }
}
sub parse_meta_description{
  #parse and output the meta description, if present
  $parser=HTML::TokeParser->new(\$webPage);
  while (my $token=$parser->get_tag("meta"))
    { if ($token->[1]{name}=~/description/i)
        { print "<p><h2>Meta Description</h2> ".
            $token->[1]{content}."</p>" }
    }
}
sub parse_images{
  #parse and count images
  $parser=HTML::TokeParser->new(\$webPage);
  my $imageTotal=0;
  while ($parser->get_tag("img"))
    { $imageTotal++ }
  print "<p><h2>Image Count</h2> ".
    "Total = $imageTotal</p>";
}
sub parse_hyperlinks{
  #parse and output hyperlinks
  $parser=HTML::TokeParser->new(\$webPage);
  print "<p><h2>Hyperlink Summary</h2>";
  while (my $token = $parser->get_tag("a"))
    { my $linkURL = $token->[1]{href} || "-";
      my $linkText = $parser->get_trimmed_text("/a");
      #by default TokeParser renders an <img> that has no
      #alt text as the literal string "[IMG]"
      if ($linkText=~/\[IMG\]/) {$linkText="image"}
      print "<small>$linkText</small> ".
        "<b>links to</b> $linkURL<br>"
    }
}
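
The two meta subroutines above differ only in the name they
match, and each one walks the document from the top again.
As a minimal sketch of an alternative, a single pass can
gather every named meta tag into a hash. The helper name
collect_meta and the sample HTML are made up for
illustration and are not part of parsepage.cgi. Note that
this version compares the lowercased name exactly, rather
than with a pattern match as the listing does.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not in parsepage.cgi): walk the document once
# and return a hash of lowercased meta name => content.
sub collect_meta {
    my ($html) = @_;
    my %meta;
    my $parser = HTML::TokeParser->new(\$html);
    while (my $token = $parser->get_tag("meta")) {
        my $attr = $token->[1];    # hash ref of the tag's attributes
        # Skip tags such as <meta http-equiv=...> that carry no name
        next unless defined $attr->{name} && defined $attr->{content};
        $meta{ lc $attr->{name} } = $attr->{content};
    }
    return %meta;
}

# Invented sample page, used only to exercise the helper
my $sample = <<'HTML';
<html><head><title>Sample</title>
<meta name="Description" content="A short sample page">
<meta name="keywords" content="perl, parsing">
<meta http-equiv="refresh" content="30">
</head><body></body></html>
HTML

my %meta = collect_meta($sample);
for my $name (qw(description keywords)) {
    print "$name: $meta{$name}\n" if exists $meta{$name};
}
```

With the hash in hand, the description and keywords sections
could each be printed from a lookup instead of a fresh
parse.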
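
It is worth seeing what get_trimmed_text("/a") actually
hands back when a link wraps an image, since the hyperlink
summary depends on it. By default TokeParser substitutes an
image's alt text, or the string "[IMG]" when there is no
alt attribute. The sketch below shows this on a small
invented snippet. The helper name link_pairs is
hypothetical and not part of parsepage.cgi.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not in parsepage.cgi): return a list of
# [link text, href] pairs, one for every <a> tag in the document.
sub link_pairs {
    my ($html) = @_;
    my @pairs;
    my $parser = HTML::TokeParser->new(\$html);
    while (my $token = $parser->get_tag("a")) {
        my $href = $token->[1]{href};
        $href = "-" unless defined $href;    # e.g. <a name="top">
        # Text up to the closing </a>; an <img> inside the link is
        # replaced by its alt text, or by "[IMG]" if it has none
        my $text = $parser->get_trimmed_text("/a");
        push @pairs, [$text, $href];
    }
    return @pairs;
}

# Invented sample links, used only to exercise the helper
my $sample = <<'HTML';
<a href="/home.html">Home page</a>
<a href="/logo.html"><img src="logo.gif" alt="Company logo"></a>
<a href="/pic.html"><img src="pic.gif"></a>
HTML

for my $pair (link_pairs($sample)) {
    my ($text, $href) = @$pair;
    print "$text links to $href\n";
}
```

Here the second link reports its alt text and the third
reports "[IMG]", which is the case a summarizer may want to
relabel as a plain "image".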
Additional Resources
Parsing Attributes with Ease
The Perl You Need to Know