Different Log File Formats - Page 6
December 11, 2001
It's fairly easy to modify this script to accept either the
common or the extended log format. We do that by adding a
configuration variable near the top of the script that looks like
this:
my $log_format = 'common'; # 'common' or 'extended'
Then we modify the part of the script where the regular
expression parsing occurs to include some logic to check that
$log_format variable, along with a second version of
the regular expression to be used on logs that are in the
extended format:
if ($log_format eq 'common') {
($host, $ident_user, $auth_user, $date, $time,
$time_zone, $method, $url, $protocol, $status, $bytes) =
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
(\S+)" (\S+) (\S+)$/
or next;
} elsif ($log_format eq 'extended') {
($host, $ident_user, $auth_user, $date, $time,
$time_zone, $method, $url, $protocol, $status, $bytes,
$referer, $agent) =
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\]
"(\S+) (.+?) (\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/
or next;
} else {
die "unrecognized log format '$log_format'";
}
I think this probably qualifies as the ugliest block of code in
this entire book. This is not the sort of code that anybody wants
to have to make sense of more than once, but fortunately, once we
get it right, we aren't likely to need to modify it. Anyway,
you'll notice that the new regular expression for extended-format
logs has a couple of new chunks at the end, both of which look
like "([^"]+)". By now that should be
an easy one for you: it means "match a literal double quote,
then capture one or more characters that are anything but a
double quote, then match another literal double quote."
These two new chunks capture into the new $referer
and $agent variables that we've added at the end of
the parenthetical list being assigned to. We've also added an
else block, which just does a quick sanity check,
dying with an error message if the $log_format
variable was inadvertently set to an unexpected value. You may
have noticed that there is no my declaration before
the list of variables in either branch of the if-
elsif construct. That's because declaring those variables
as my variables here, inside the curly braces
of the if-elsif block, would limit their visibility
later on in the while block, where they need to be
visible. As a result, we've moved the my
declaration for the variables above the if-elsif
construct, just after the while
(<>) line:
my ($host, $ident_user, $auth_user, $date, $time,
$time_zone, $method, $url, $protocol, $status, $bytes,
$referer, $agent);
We should also add the $referer and
$agent variables to the list of variables that the
debugging print statement should print out. This
will give us some extra blank lines in the output if our log file
is actually in the common format, but that print
statement is just a quick debugging tool anyway; the real output
that the script produces later will be implemented more
intelligently:
print join "\n", $host, $ident_user, $auth_user, $date, $time,
$time_zone, $method, $url, $protocol, $status,
$bytes, $referer, $agent, "\n";
The Mammoth Regular Expression (con't) - Page 5
Perl for Web Site Management
Different Log File Formats (con't) - Page 6
|