Web Developer's Virtual Library: Encyclopedia of Web Design Tutorials, Articles and Discussions


The Proof is in the Parsing: A Web Page Summarizer

August 9, 1999

Cobbling together the TokeParser techniques we've seen, we now offer a moderately cute little Perl program which, given a valid URL, fetches the page, parses it for a variety of tags, and outputs a summary. This program relies on most of the primary technologies we've encountered in The Perl You Need to Know series: CGI, on-the-fly HTML output (including forms), the LWP::Simple module, and the newly enthralling HTML::TokeParser module.

This program is by no means super-intelligent, robust, or even easy on the eyes -- but it illustrates the core techniques of this month's focus and provides plenty of reusable code for your own parsing excavations, much of which we've already seen earlier in this article. You can, of course, view a live demo of parsepage.cgi in action.
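
Before reading the full listing, it may help to recall what a get_tag call hands back. The sketch below assumes HTML::TokeParser is installed and works on a made-up HTML string rather than a fetched page; the sample markup and variable names are illustrative only and are not part of parsepage.cgi.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample page held in a string instead of fetched with LWP::Simple
my $html = '<html><head><title> Sample  Page </title>'
         . '<meta name="Keywords" content="perl, parsing"></head>'
         . '<body><a href="/next.html">Next <b>page</b></a></body></html>';

my $parser = HTML::TokeParser->new(\$html);

# get_tag skips ahead to the next <title> start tag
$parser->get_tag("title");
my $title = $parser->get_trimmed_text;   # whitespace collapsed and trimmed

# A tag token is an array ref:
# [tag name, attribute hash, attribute order, original text]
my $meta     = $parser->get_tag("meta");
my $metaName = $meta->[1]{name};         # attribute names are lowercased
my $keywords = $meta->[1]{content};

# Text up to the closing </a>, with inner markup such as <b> dropped
my $link     = $parser->get_tag("a");
my $linkURL  = $link->[1]{href};
my $linkText = $parser->get_trimmed_text("/a");

print "title: $title\n";
print "meta $metaName: $keywords\n";
print "link: $linkText -> $linkURL\n";
```

Every subroutine in the listing follows this same pattern: create a fresh parser over the page, call get_tag in a loop, and read attributes from element 1 of the returned token.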

parsepage.cgi: Analyzes a given URL and produces a page summary, thanks to TokeParser.
#!/usr/bin/perl
use CGI;
use LWP::Simple;
use HTML::TokeParser;

$cgiobject=CGI->new;
#named (-name=>value) arguments are recognized automatically;
#use_named_parameters is not available in current CGI.pm releases
print $cgiobject->header;

print $cgiobject->start_html
                  (-title=>'Page Parser',
                   -bgcolor=>'white');
                          

#start_form/end_form replace the older startform/endform names
print $cgiobject->start_form
                  (-method=>'get',
                   -action=>'parsepage.cgi');
print "URL to Analyze:".$cgiobject->textfield
                                    (-name=>'url',
                                     -size=>'40');
print "<br>".$cgiobject->submit(-value=>'Analyze');
print $cgiobject->end_form;
print "<hr>";


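#retrieve web page, but only when a URL was submitted
$fetchURL=$cgiobject->param("url") || "";
$webPage=get($fetchURL) if $fetchURL;
#escape the URL before echoing it back into the page
$safeURL=$cgiobject->escapeHTML($fetchURL);

if (defined $webPage) {
print <<ENDHTML;
<center><h2>$safeURL<br>
has been sliced and diced,
 thus revealing:</h2></center>
ENDHTML

&parse_title;
&parse_meta_description;
&parse_meta_keywords;
&parse_images;
&parse_hyperlinks;
}
elsif ($fetchURL) {
#get() returns undef when the fetch fails
print "<p>Could not retrieve $safeURL</p>";
}
print $cgiobject->end_html;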


sub parse_title{
#parse and output page title
$parser=HTML::TokeParser->new(\$webPage);
$parser->get_tag("title");
print "<p><h2>Page title</h2> ".
      $cgiobject->escapeHTML($parser->get_trimmed_text)."</p>";
}

sub parse_meta_keywords{
#parse and output the meta keywords tag
$parser=HTML::TokeParser->new(\$webPage);
while (my $token=$parser->get_tag("meta"))
 { #some meta tags (e.g. http-equiv) have no name attribute
   if (($token->[1]{name} || "")=~/keywords/i)
    { print "<p><h2>Meta Keywords</h2> ".
            $cgiobject->escapeHTML($token->[1]{content} || "")."</p>" }
 }
}

sub parse_meta_description{
#parse and output the meta description tag
$parser=HTML::TokeParser->new(\$webPage);
while (my $token=$parser->get_tag("meta"))
 { if (($token->[1]{name} || "")=~/description/i)
    { print "<p><h2>Meta Description</h2> ".
            $cgiobject->escapeHTML($token->[1]{content} || "")."</p>" }
 }
}


sub parse_images{
#parse and count images
$parser=HTML::TokeParser->new(\$webPage);
my $imageTotal=0;
while ($parser->get_tag("img"))
 { $imageTotal++ }
print "<p><h2>Image Count</h2> ".
      "Total = $imageTotal</p>";
} 

sub parse_hyperlinks{
#parse and output hyperlinks
$parser=HTML::TokeParser->new(\$webPage);
print "<h2>Hyperlink Summary</h2><p>";
while (my $token = $parser->get_tag("a")) 
 { my $linkURL = $token->[1]{href} || "-";
   my $linkText = $parser->get_trimmed_text("/a");
   #TokeParser renders an <img> without alt text as "[IMG]"
   if ($linkText=~/^\[IMG\]$/) {$linkText="image"}
   print "<small>".$cgiobject->escapeHTML($linkText)."</small> ".
         "<b>links to</b> ".$cgiobject->escapeHTML($linkURL)."<br>";
  }
print "</p>";
}
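
The two meta subroutines differ only in the name they look for, so one possible refactoring is a single helper that takes that name as an argument. The following is a sketch rather than part of the original program: meta_content is a hypothetical name, the sample HTML stands in for a fetched page, and the helper compares names exactly (ignoring case) where the listing above uses a looser pattern match.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the content of the first <meta> tag whose
# name attribute equals $wanted (case-insensitive), or undef if none does.
sub meta_content {
    my ($html, $wanted) = @_;
    my $parser = HTML::TokeParser->new(\$html);
    while (my $token = $parser->get_tag("meta")) {
        my $name = $token->[1]{name};
        next unless defined $name;          # e.g. <meta http-equiv=...>
        return $token->[1]{content} if lc $name eq lc $wanted;
    }
    return;
}

# Made-up sample page used in place of get($fetchURL)
my $page = '<head><meta http-equiv="refresh" content="30">'
         . '<meta name="Description" content="A tiny demo page">'
         . '<meta name="keywords" content="perl, html"></head>';

for my $field (qw(description keywords author)) {
    my $value = meta_content($page, $field);
    print "$field: ", (defined $value ? $value : "(none)"), "\n";
}
```

Because the helper takes the page as an argument instead of reading a global, it can be exercised from the command line without a web server or a network connection.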

Additional Resources

Parsing Attributes with Ease
The Perl You Need to Know

