Parsing Attributes with Ease
August 9, 1999
Parsing the <TITLE> tag is particularly easy because
it is a simple tag with no extra attributes. Many
tags, though, do possess modifying attributes which are
crucial to parsing. Consider, for example, the <META>
tag which possesses two attributes: NAME and CONTENT. Seen
in the wild, a <META> tag may look like this in
its native habitat:
<META
NAME="Keywords"
CONTENT="food,cuisine,cooking,recipes">
Imagine that we're writing code to parse a document for its
meta keywords. Since the <META> tag can contain
information other than keywords (e.g. description, author,
copyright, etc.), we can't simply grab the first <META>
tag we find and call it a day. Rather, we need to analyze the
NAME attribute of each <META> tag until we
find the "Keywords"-specific tag, and then
we can harvest the information contained in the CONTENT
attribute.
Parse only for the "Keywords" META tag.
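sub parse_meta_keywords {
    # Parse the page held in $webPage and output its meta keywords.
    # Assumes "use HTML::TokeParser;" appears earlier in the script.
    my $parser = HTML::TokeParser->new(\$webPage);
    while (my $token = $parser->get_tag("meta")) {
        # $token->[1] refers to a hash of the tag's attributes.
        # Skip <META> tags that carry no NAME (e.g. HTTP-EQUIV tags).
        my $name = $token->[1]{name};
        next unless defined $name && $name =~ /keywords/i;
        print "<p><h2>Meta Keywords</h2> " .
              $token->[1]{content} . "</p>";
    }
}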
Once again, we point a fresh TokeParser at the start of the document
contained in $webPage. The while
loop ensures that TokeParser will find each <META> tag
in the document, not merely the first. We want to
analyze each tag to see if it's a "Keywords" tag.
Although there shouldn't be more than one such tag,
and we could justifiably exit the loop once the tag has been
found, it is possible that somebody thoughtlessly
placed multiple <META> Keywords tags in a single document,
so we let the loop run to the end and output every one it finds.
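If you would rather stop at the first Keywords tag, the loop can be cut short with last. The following is only a sketch: the sub name first_meta_keywords and the sample page are made up for illustration, and it assumes the HTML::Parser distribution (which supplies HTML::TokeParser) is installed.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the CONTENT of the first Keywords
# <META> tag in $html, or undef if the page has none.
sub first_meta_keywords {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html);
    my $keywords;
    while (my $token = $parser->get_tag("meta")) {
        my $name = $token->[1]{name};
        next unless defined $name && $name =~ /keywords/i;
        $keywords = $token->[1]{content};
        last;    # stop at the first match; later tags are ignored
    }
    return $keywords;
}

# Sample page with a (thoughtlessly) duplicated Keywords tag.
my $page = '<META NAME="Keywords" CONTENT="food,cuisine">'
         . '<META NAME="Keywords" CONTENT="second,tag">';
print first_meta_keywords($page), "\n";
```

Here only the first tag's keywords are returned. Whether to stop early or keep looping depends on how much you trust the pages you are parsing.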
TokeParser's get_tag() method grabs the next
<META> tag it sees and returns its various components
as an array reference, which we assign to $token.
Array references can be confusing, and
some of the syntax that follows is weird because we're
dealing with an array reference. It may be best not to worry
about why in this case and focus on how -- you can simply
replicate this syntax in your own code without too much
heartache over why the syntax looks as it does.
Suffice it to say, the second element of the array, $token->[1],
is a reference to a hash of the tag's attributes, keyed by
attribute name in lowercase. Thus,
$token->[1]{name}
returns the value of the NAME attribute in this tag.
Similarly, $token->[1]{content} will return the
CONTENT attribute, and you can extend this syntax to any
attribute of whatever tag you are parsing.
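To make the shape of $token a little less mysterious, here is a small sketch that unpacks the array reference for the sample <META> tag shown earlier. The variable names are arbitrary, and the script assumes HTML::TokeParser is installed. For a start tag, get_tag() returns the tag name, a hash reference of attributes, and an array reference listing the attribute names in the order they appeared.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html   = '<META NAME="Keywords" CONTENT="food,cuisine,cooking,recipes">';
my $parser = HTML::TokeParser->new(\$html);
my $token  = $parser->get_tag("meta");

my $tag   = $token->[0];          # tag name, lowercased
my %attr  = %{ $token->[1] };     # copy of the attributes hash
my @order = @{ $token->[2] };     # attribute names in document order

print "tag: $tag\n";
print "$_ = $attr{$_}\n" for @order;
```

Note that the hash keys are name and content, not NAME and CONTENT, even though the tag was written in uppercase.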
In our example, we check to see if the NAME attribute for the
snagged tag contains "Keywords" (case
insensitive). If yes, the CONTENT attribute of this tag is
output to the screen; otherwise, this is not a Keywords
<META> tag and we move on to find the next
<META> tag, until parsing has completed.
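The same loop extends naturally to the other kinds of <META> information mentioned above (description, author, and so on). As a rough sketch, and not part of the summarizer itself, the hypothetical collect_meta sub below gathers every named <META> tag into a hash; it assumes HTML::TokeParser is installed, and if a NAME is repeated the last tag wins.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: map each META NAME (lowercased) to its CONTENT.
sub collect_meta {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html);
    my %meta;
    while (my $token = $parser->get_tag("meta")) {
        # Hash slice: pull both attributes out of the attribute hash.
        my ($name, $content) = @{ $token->[1] }{qw(name content)};
        next unless defined $name && defined $content;
        $meta{ lc $name } = $content;
    }
    return %meta;
}

my %meta = collect_meta(
    '<META NAME="Keywords" CONTENT="food,cuisine">'
  . '<META NAME="Author" CONTENT="A. Cook">'
);
print "$_: $meta{$_}\n" for sort keys %meta;
```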