The Mammoth Regular Expression (con't) - Page 5
December 11, 2001
The second interesting thing about this chunk of code is how
we're taking advantage of the fact that a regular expression that
contains capturing parentheses, if you place it in a list
context, returns a list of all the elements captured by those
parentheses. This means you can stick a list of scalar variables
inside parentheses on the left side of the assignment operator,
and a regular expression containing capturing parentheses on the
right, and assign all the captured substrings in one fell swoop.
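Here's a minimal sketch of that list-context trick on its own. The sample string and the variable names ($line, $first, $second, $third) are made up for illustration, and it binds the match to a named variable with =~ rather than matching against $_ the way the log script does:

```perl
use strict;
use warnings;

# Hypothetical sample string, not taken from a real log.
my $line = 'alpha beta gamma';

# In list context the match returns the captured substrings,
# so they can all be assigned in a single statement.
my ($first, $second, $third) = $line =~ /^(\S+) (\S+) (\S+)$/;

print join("\n", $first, $second, $third), "\n";
```

Each variable on the left receives the substring captured by the parentheses in the same position on the right.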
Let's go through the regular expression search pattern one chunk
at a time:
/^(\S+) (\S+) (\S+)
The first thing in the expression is the beginning-of-string
anchor (^). Next comes one or more nonwhitespace
characters (\S+), which will match only if they
are the first thing on the line (thanks to that beginning-of-string
anchor). Because they are enclosed by parentheses,
whatever matches will be captured into the $host
variable in the corresponding list on the left side of the
assignment (and into $1, too, though we're not doing
anything with that). Next comes a literal space, then another
sequence of one or more nonwhitespace characters, which is
captured into $ident_user. Next comes another
literal space, and (again) one or more nonwhitespace characters,
which are captured into $auth_user. After that comes
another literal space, then the following interesting chunk:
\[([^:]+):(\d+:\d+:\d+) ([^\]]+)\]
This part of the pattern starts off by matching a literal left
bracket ([). Next it captures one or more characters
that are anything but colons (:). Then comes a
colon, and then it captures a string consisting of three sets of
one or more digits separated by colons. Next there is a space,
after which the pattern captures one or more characters that are
anything but a right bracket (]). Finally, the
pattern matches a right bracket. In other words, it matches a
date string that looks like:
[06/Jul/1999:00:09:12 -0700]
and in doing so captures the date, time, and time zone offset
into the $date, $time, and
$time_zone variables, respectively. After that part
of the pattern comes another literal space, and then this:
"(\S+) (.+?) (\S+)"
This part matches the request as sent by the web browser to the
server. As mentioned earlier, that request typically looks
something like this:
"GET / HTTP/1.0"
The tricky thing about this part of the pattern is the stuff
inside the middle set of capturing parentheses. That's where you
match the path of the actual page requested. At first glance you
would probably be tempted to use (\S+) to match
that part, on the theory that the requested path is unlikely to
contain spaces, but occasionally a space will creep into the
path, if only because a user accidentally typed one in when
manually specifying the URL. You could use something like
([^"]+) to match the URL part of the request,
which would match all the way out to the double quote, and count
on the fact that Perl would then backtrack to match the
protocol, which needs to have a double quote after it. The problem
with this, though, is that it would be relatively inefficient
because you'd be making Perl do a lot of backtracking on almost
every line. The solution given here is better. By using
.+? to match the URL, you say "match one or
more of anything, but don't be greedy." This means the
expression will match only as much as it has to in order to make
the rest of the expression match. Once it has matched all of the
requested URL, the rest of the expression should match, meaning
you'll get what you were looking for without a lot of
backtracking.

The last part of the regular expression is fairly
simple, capturing the next two one-or-more-nonwhitespace chunks
into $status and $bytes, with an end-of-string
anchor at the end. And that's it.

The only remaining
part of the script is the debugging print statement
that outputs each captured item on its own line, with an extra
newline at the end to put a blank line after each line's data has
been printed out:
print join "\n", $host, $ident_user, $auth_user, $date, $time,
$time_zone, $method, $url, $protocol, $status,
$bytes, "\n";
Notice, by the way, how we've just stuck a print
function in front of the join function. This
chaining together of two or more functions, where the function on
the right returns something that serves as the argument for the
function on the left, is a handy shortcut you'll see experienced
Perl users using all the time.

Before we test the script, we
should think for a minute about what will happen in the case
where a line from the log file doesn't match that monster regular
expression. There's a fairly good chance that all the lines will
fail to match the first time we try it because that expression is
big and complicated, and one typo will mess the whole thing up.
Even after we have it working properly, though, there's always
the chance that a screwy line will show up in the log and fail to
be parsed properly. What will happen in that case? What will
happen is that the match will fail, and nothing will be assigned
to all those variables. If they were global variables, and we
were counting on the successful match to replace whatever was
already in them from processing the previous line, our script
would now go on to print out the previous line's data all over
again, which would be a problem. Since we are using
my to give us a fresh batch of variables each time
through the loop, though, we don't have to worry about that. Even
so, we still have a problem. Since none of the variables were
successfully assigned for this trip through the loop, the -w
switch will cause our script to emit a bunch of
"Use of uninitialized value" warnings as soon as it comes to that
print statement.

Perhaps the best thing to do in
cases where a line doesn't match is to just bail out and go on to
the next line. This behavior turns out to be very easy to add to
the script: we just put or next at the
end of the line containing the regular expression assignment:
($host, $ident_user, $auth_user, $date, $time,
$time_zone, $method, $url, $protocol, $status, $bytes) =
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?) (\S+)" (\S+) (\S+)$/
or next;
How does this work? The next function tells Perl to
go on to the next iteration of the loop we're in. Putting
or something at the end of a line of a Perl
expression causes everything to the left of the or
to be evaluated in a Boolean context (that is, evaluated
to see if it yields a true or false value). If our regular
expression fails to match, it will return an empty list. That not
only means that all those variables will get undef
assigned to them; it also means the whole expression will be
false, which means the stuff to the right of the or
will be executed. In general, you can put or
(something) on the right side of the regular expression,
and whatever you put there will fire off only in cases where the
expression fails to match.
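To see the whole thing in action, here's a small self-contained sketch that runs the pattern against three made-up lines instead of a real log file. The sample lines, the @lines and @urls arrays, and the $matched and $skipped counters are all assumptions added for illustration. The second line has a space in the requested path, and the third is deliberately malformed. On the right side of the or it uses a do block so that it can count the skipped line before calling next:

```perl
use strict;
use warnings;

# Made-up sample data: one ordinary line, one whose path contains a
# space, and one that does not follow the log format at all.
my @lines = (
    'host.example.com - - [06/Jul/1999:00:09:12 -0700] "GET / HTTP/1.0" 200 1234',
    'host.example.com - - [06/Jul/1999:00:09:13 -0700] "GET /my page.html HTTP/1.0" 404 512',
    'this line is not a log entry',
);

my ($matched, $skipped) = (0, 0);
my @urls;

for (@lines) {
    # On a failed match the assignment is false, so the code to the
    # right of "or" runs and we move on to the next line.
    my ($host, $ident_user, $auth_user, $date, $time,
        $time_zone, $method, $url, $protocol, $status, $bytes) =
        /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?) (\S+)" (\S+) (\S+)$/
        or do { $skipped++; next };

    $matched++;
    push @urls, $url;
    print "$host requested $url at $time on $date (status $status)\n";
}

print "matched: $matched, skipped: $skipped\n";
```

Because the malformed line never reaches the print statement, the -w switch (or use warnings, as here) has nothing to complain about.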
Perl for Web Site Management