Web Developer's Virtual Library: Encyclopedia of Web Design Tutorials, Articles and Discussions



The Mammoth Regular Expression (cont'd) - Page 5

December 11, 2001

The second interesting thing about this chunk of code is how it takes advantage of the way a match behaves in list context: a regular expression that contains capturing parentheses, when evaluated in list context, returns a list of all the substrings captured by those parentheses. This means you can stick a list of scalar variables inside parentheses on the left side of the assignment operator, and a regular expression containing capturing parentheses on the right, and assign all the captured substrings in one fell swoop.
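As a minimal sketch of that list-context behavior, here is the same idea on a much smaller scale. The sample string and the variable names ($first, $second, $third) are made up for illustration and are not part of the log-analysis script:

```perl
use strict;
use warnings;

# A made-up string with three space-separated fields.
my $line = 'alpha beta gamma';

# In list context the match returns the three captured substrings,
# which are assigned to the three variables in order.
my ($first, $second, $third) = $line =~ /^(\S+) (\S+) (\S+)$/;

print "$first, $second, $third\n";
```

The big expression in the script does exactly the same thing, only with eleven sets of capturing parentheses instead of three.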

Let's go through the regular expression search pattern one chunk at a time:

/^(\S+) (\S+) (\S+)

The first thing in the expression is the beginning-of-string anchor (^). Next comes one or more nonwhitespace characters (\S+), which will match only if they are the first thing on the line (thanks to that beginning-of-string anchor). Because they are enclosed by parentheses, whatever matches will be captured into the $host variable in the corresponding list on the left side of the assignment (and into $1, too, though we're not doing anything with that). Next comes a literal space, then another sequence of one or more nonwhitespace characters, which is captured into $ident_user. Next comes another literal space, and (again) one or more nonwhitespace characters, which are captured into $auth_user. After that comes another literal space, then the following interesting chunk:

\[([^:]+):(\d+:\d+:\d+) ([^\]]+)\]

This part of the pattern starts off by matching a literal left bracket ([). Next it captures one or more characters that are anything but colons (:). Then comes a colon, and then it captures a string consisting of three sets of one or more digits separated by colons. Next there is a space, after which the pattern captures one or more characters that are anything but a right bracket (]). Finally, the pattern matches a right bracket. In other words, it matches a date string that looks like:

[06/Jul/1999:00:09:12 -0700]

and in doing so captures the date, time, and time zone offset into the $date, $time, and $time_zone variables, respectively. After that part of the pattern comes another literal space, and then this:

"(\S+) (.+?) (\S+)"

This part matches the request as sent by the web browser to the server. As mentioned earlier, that request typically looks something like this:

"GET / HTTP/1.0"

The tricky thing about this part of the pattern is the middle set of capturing parentheses, which matches the path of the page actually requested. At first glance you would probably be tempted to use (\S+) there, on the theory that the requested path is unlikely to contain spaces. Occasionally, though, a space will creep into the path, if only because a user accidentally typed one in when manually specifying the URL.

You could use something like ([^"]+) to match the URL part of the request, which would match all the way out to the double quote, and count on the fact that Perl would then backtrack to match the protocol, which needs to have a double quote after it. The problem with this, though, is that it would be relatively inefficient because you'd be making Perl do a lot of backtracking on almost every line. The solution given here is better. By using .+? to match the URL, you say "match one or more of anything, but don't be greedy." This means the expression will match only as much as it has to in order to make the rest of the expression match. Once it has matched all of the requested URL, the rest of the expression should match, meaning you'll get what you were looking for without a lot of backtracking.

The last part of the regular expression is fairly simple, capturing the next two one-or-more-nonwhitespace chunks into $status and $bytes, with an end-of-string anchor at the end. And that's it. The only remaining part of the script is the debugging print statement that outputs each captured item on its own line, with an extra newline at the end to put a blank line after each line's data has been printed out:

print join "\n", $host, $ident_user, $auth_user, $date, $time,
  $time_zone, $method, $url, $protocol, $status,
  $bytes, "\n";
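If you want to convince yourself that the date and request chunks described above behave as advertised, one way is to try each of them in isolation. The following is only a sketch: the ^ and $ anchors around each chunk, the request containing a stray space, and the array names are assumptions added for the test, not part of the original script.

```perl
use strict;
use warnings;

# Sample fragments; the second request is made up to include a stray space.
my $date_part     = '[06/Jul/1999:00:09:12 -0700]';
my $plain_request = '"GET / HTTP/1.0"';
my $spacy_request = '"GET /my page.html HTTP/1.0"';

# The date chunk: date, time, and time zone offset.
my ($date, $time, $time_zone) =
    $date_part =~ /^\[([^:]+):(\d+:\d+:\d+) ([^\]]+)\]$/;
print "$date | $time | $time_zone\n";

# The request chunk on an ordinary request.
my ($method, $url, $protocol) =
    $plain_request =~ /^"(\S+) (.+?) (\S+)"$/;
print "$method | $url | $protocol\n";

# With a space in the path, (\S+) in the middle fails to match at all,
# while (.+?) still captures the whole path.
my @with_nonspace  = $spacy_request =~ /^"(\S+) (\S+) (\S+)"$/;
my @with_nongreedy = $spacy_request =~ /^"(\S+) (.+?) (\S+)"$/;
print scalar(@with_nonspace), " fields captured with (\\S+)\n";
print join(' | ', @with_nongreedy), "\n";
```

The last two lines show why the middle parentheses use .+? rather than \S+: the \S+ version captures nothing for the request with a space in it, while the non-greedy version still yields the method, the full path, and the protocol.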

Notice, by the way, how the debugging print statement simply sticks a print function in front of the join function. This chaining together of two or more functions, where the function on the right returns something that serves as the argument for the function on the left, is a handy shortcut you'll see experienced Perl users using all the time.

Before we test the script, we should think for a minute about what will happen in the case where a line from the log file doesn't match that monster regular expression. There's a fairly good chance that all the lines will fail to match the first time we try it because that expression is big and complicated, and one typo will mess the whole thing up. Even after we have it working properly, though, there's always the chance that a screwy line will show up in the log and fail to be parsed properly. What will happen in that case?

What will happen is that the match will fail, and nothing will be assigned to all those variables. If they were global variables, and we were counting on the successful match to replace whatever was already in them from processing the previous line, our script would now go on to print out the previous line's data all over again, which would be a problem. Since we are using my to give us a fresh batch of variables each time through the loop, though, we don't have to worry about that.

Even so, we still have a problem. Since none of the variables were successfully assigned for this trip through the loop, the -w switch will cause our script to emit a bunch of "Use of uninitialized value" warnings as soon as it comes to that print statement. Perhaps the best thing to do in cases where a line doesn't match is to just bail out and go on to the next line. This behavior turns out to be very easy to add to the script: we just put or next at the end of the line containing the regular expression assignment:

my ($host, $ident_user, $auth_user, $date, $time,
    $time_zone, $method, $url, $protocol, $status, $bytes) =
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?) (\S+)" (\S+) (\S+)$/
    or next;

How does this work? The next function tells Perl to go on to the next iteration of the loop we're in. Putting or something at the end of a line of a Perl expression causes everything to the left of the or to be evaluated in a Boolean context (that is, evaluated to see if it yields a true or false value). If our regular expression fails to match, it will return an empty list. That not only means that all those variables will get undef assigned to them; it also means the whole expression will be false, which means the stuff to the right of the or will be executed. In general, you can put or (something) on the right side of the regular expression, and whatever you put there will fire off only in cases where the expression fails to match.
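To see the or next behavior in action without a real log file, here is a small self-contained sketch. The three sample lines in @lines are made up, and the @parsed array is a hypothetical addition that just keeps a few of the captured fields for inspection. The regular expression itself is the one discussed above.

```perl
use strict;
use warnings;

# Made-up sample data: two well-formed lines and one malformed line.
my @lines = (
    '10.0.0.1 - - [06/Jul/1999:00:09:12 -0700] "GET / HTTP/1.0" 200 1234',
    'this line is not in the expected format',
    '10.0.0.2 - jdoe [06/Jul/1999:00:09:15 -0700] "GET /a b.html HTTP/1.0" 404 -',
);

my @parsed;   # one array reference of selected fields per matching line

for (@lines) {
    my ($host, $ident_user, $auth_user, $date, $time,
        $time_zone, $method, $url, $protocol, $status, $bytes) =
        /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?) (\S+)" (\S+) (\S+)$/
        or next;   # skip any line that does not match

    # Only reached for lines that matched, so nothing here is undef.
    push @parsed, [$host, $url, $status, $bytes];
    print join("\t", $host, $url, $status, $bytes), "\n";
}
```

The malformed second line is skipped silently, so the loop body after the assignment runs only for the two lines that matched, and no uninitialized-value warnings are triggered.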

Perl for Web Site Management