Web Developer's Virtual Library: Encyclopedia of Web Design Tutorials, Articles and Discussions




Parsing Attributes with Ease

August 9, 1999

Parsing the <TITLE> tag is particularly easy because it is a simple tag with no attributes. Many tags, though, carry attributes that are crucial to parsing. Consider, for example, the <META> tag, which is commonly written with two attributes: NAME and CONTENT. In the wild, a <META> tag may look like this:

<META 
	 NAME="Keywords" 
	 CONTENT="food,cuisine,cooking,recipes">

Imagine that we're writing code to parse a document for its meta keywords. Since the <META> tag can contain information other than keywords (e.g. description, author, copyright, etc.), we can't simply grab the first <META> tag we find and call it a day. Rather, we need to analyze the NAME attribute of each <META> tag until we find the "Keywords"-specific tag, and then we can harvest the information contained in the CONTENT attribute.

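Parse only for the "Keywords" META tag.
sub parse_meta_keywords {
    # Parse $webPage and output the content of each Keywords META tag
    my $parser = HTML::TokeParser->new(\$webPage);
    while (my $token = $parser->get_tag("meta")) {
        # Skip META tags that have no NAME attribute (e.g. HTTP-EQUIV tags)
        next unless defined $token->[1]{name};
        if ($token->[1]{name} =~ /keywords/i) {
            print "<h2>Meta Keywords</h2><p>" .
                  $token->[1]{content} . "</p>";
        }
    }
}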

Once again, we reset TokeParser to the start of the document contained in $webPage by creating a new parser object. The while loop ensures that TokeParser finds each <META> tag in the document, not merely the first, so that we can check whether each one is a "Keywords" tag. There shouldn't be more than one such tag, and we could justifiably exit the loop once it has been found, but it is possible that somebody thoughtlessly placed multiple Keywords <META> tags in a single document.
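
As a rough, self-contained sketch of that loop, the script below gathers the CONTENT of every Keywords tag instead of printing each one, so a document with duplicate tags still yields all of its keywords. It assumes the HTML::TokeParser module (part of the HTML-Parser distribution) is installed. The sample page and the helper name collect_meta_keywords are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample page: two Keywords tags and one unrelated META tag.
my $webPage = <<'HTML';
<html><head>
<title>Sample</title>
<META NAME="Description" CONTENT="A page about food">
<META NAME="Keywords" CONTENT="food,cuisine">
<meta name="keywords" content="cooking, recipes">
</head><body></body></html>
HTML

# Return every keyword found in any Keywords META tag, in document order.
sub collect_meta_keywords {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create parser";
    my @keywords;
    while (my $token = $parser->get_tag("meta")) {
        my $attr = $token->[1];    # hash reference of the tag's attributes
        next unless defined $attr->{name} && $attr->{name} =~ /keywords/i;
        next unless defined $attr->{content};
        # CONTENT is a comma-separated list; trim whitespace around each item.
        for my $word (split /,/, $attr->{content}) {
            $word =~ s/^\s+|\s+$//g;
            push @keywords, $word if length $word;
        }
    }
    return @keywords;
}

print join(", ", collect_meta_keywords($webPage)), "\n";
```

Returning a list rather than printing inside the loop leaves the formatting decision to the caller, which is one way to cope with the duplicate-tag case.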

TokeParser's get_tag() method grabs the next <META> tag it sees and returns its various components as an array reference, which we assign to $token. Array references can be confusing, and some of the syntax that follows looks weird because we're dealing with one. It may be best not to worry about why in this case and to focus on how: you can simply replicate this syntax in your own code without too much heartache over why it looks as it does.

Suffice it to say, $token->[1] is a reference to a hash of the tag's attributes, keyed by attribute name in lowercase. Thus, $token->[1]{name} returns the value of the NAME attribute in this tag. Similarly, $token->[1]{content} returns the value of the CONTENT attribute, and you can extend this syntax to any attribute of whatever tag you are parsing.
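
To make the array-reference syntax a little less mysterious, here is a hand-built token with the same shape that get_tag() returns for a start tag: the tag name, a reference to a hash of attributes, a reference to an array of attribute names in their original order, and the original text of the tag. The values are invented for illustration and no parser is involved.

```perl
use strict;
use warnings;

# A hand-built stand-in for what get_tag("meta") returns for a start tag.
# The parser lowercases tag and attribute names by default.
my $token = [
    "meta",                                               # [0] tag name
    { name => "Keywords", content => "food,cuisine" },    # [1] attribute hash
    [ "name", "content" ],                                # [2] attribute order
    '<META NAME="Keywords" CONTENT="food,cuisine">',      # [3] original text
];

my $tag     = $token->[0];           # element 0 of the referenced array
my $attr    = $token->[1];           # a hash reference
my $name    = $attr->{name};         # same value as $token->[1]{name}
my $content = $token->[1]{content};  # the arrow between subscripts is optional

print "tag=$tag name=$name content=$content\n";

# Walk the attributes in the order they appeared in the tag.
for my $key (@{ $token->[2] }) {
    print "$key => $token->[1]{$key}\n";
}
```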

In our example, we check to see if the NAME attribute of the snagged tag contains "Keywords" (case insensitive). If so, the CONTENT attribute of this tag is output to the screen; otherwise, this is not a Keywords <META> tag and we move on to the next <META> tag, until parsing has completed.
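
One consequence of testing whether NAME contains "Keywords" is that a value such as "news_keywords" would also pass. If that matters for your documents, an exact, case-insensitive comparison is a small change. The sketch below compares the two checks; both helper names are hypothetical, and both guard against META tags that have no NAME attribute at all.

```perl
use strict;
use warnings;

# Substring test, as in the article: true if "keywords" appears anywhere.
sub name_contains_keywords {
    my ($name) = @_;
    return (defined $name && $name =~ /keywords/i) ? 1 : 0;
}

# Stricter alternative: true only if the whole value is "keywords".
sub name_is_keywords {
    my ($name) = @_;
    return (defined $name && lc($name) eq "keywords") ? 1 : 0;
}

for my $name ("Keywords", "KEYWORDS", "news_keywords", "Description", undef) {
    my $label = defined $name ? $name : "(no NAME attribute)";
    printf "%-20s contains=%d exact=%d\n",
        $label, name_contains_keywords($name), name_is_keywords($name);
}
```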



