Web Developer's Virtual Library: Encyclopedia of Web Design Tutorials, Articles and Discussions



Converting IP Addresses (cont'd) - Page 3

December 4, 2001

The next thing in clf_lookup.plx is a my variable declaration for the %hostname hash, which caches hostname lookups while the script is running. That way, each IP address has to be looked up only once rather than every time it appears in the log. It is important to declare %hostname out here, before the while loop that actually processes each line from the log file: putting my %hostname inside the loop block would create a new, empty hash each time through the loop, so nothing would ever stay cached.
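To see why the placement of that declaration matters, here is a minimal sketch that is separate from clf_lookup.plx. The sample addresses and the names %outer_cache, %inner_cache, $outer_misses and $inner_misses are made up for illustration; the counters simply record how often a lookup would have been needed.

```perl
use strict;
use warnings;

# Hypothetical sample data: one address appears three times.
my @ips = ('10.0.0.1', '10.0.0.2', '10.0.0.1', '10.0.0.1');

# Declared once, before the loop: entries survive between iterations.
my %outer_cache;
my $outer_misses = 0;
for my $ip (@ips) {
    unless (exists $outer_cache{$ip}) {
        $outer_misses++;                      # a real script would look up here
        $outer_cache{$ip} = "host-for-$ip";
    }
}

# Declared inside the loop: a fresh, empty hash on every iteration.
my $inner_misses = 0;
for my $ip (@ips) {
    my %inner_cache;
    unless (exists $inner_cache{$ip}) {
        $inner_misses++;
        $inner_cache{$ip} = "host-for-$ip";
    }
}

print "misses with the hash outside the loop: $outer_misses\n";
print "misses with the hash inside the loop:  $inner_misses\n";
```

With the hash outside the loop, only the two distinct addresses count as misses; with it inside, every one of the four lines does.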

Let's get to the loop now. The beginning of the loop takes the form:

while (<>) {

Here we're beginning a while loop, which you'll remember means we're going to run a block of code repeatedly as long as whatever is inside those parentheses evaluates to a true value. But what a weird thing we've got inside that logical test. It looks somewhat like the angle-input operator we use to read lines from a filehandle, but there's no filehandle inside it.

What the <> (which is sometimes called the diamond operator) is doing is this: it looks at the @ARGV array (which you'll remember from Chapter 4 is the special variable holding your script's command-line arguments) and assumes that those arguments represent the names of one or more text files. The <> operator then returns the text from those files, one line at a time, so you can work with those lines in the body of your while block. It keeps feeding you lines of text until it has exhausted all of the first file mentioned in @ARGV, then goes on to the second file, and so on, until it has exhausted all the files mentioned in @ARGV. After the last line from the last file has been delivered, it returns undef (the undefined value), which is false, ending the loop.

You get an interesting extra feature with the <> operator. If you don't give your script any command-line arguments, such that there are no files mentioned in @ARGV, <> instead will read from standard input (that is, from the STDIN filehandle your script gets by default when it is started up). This in turn lets you do cool things like use your script in a shell pipeline to process the input or output for another program. In fact, we'll be using that feature with this script a little later.
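If you would like to watch <> walk through several files, the following sketch may help. It is not part of clf_lookup.plx: it writes two throwaway files with File::Temp and then sets @ARGV by hand to imitate command-line arguments. The file contents and the @names and @seen variables are assumptions made for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write two small temporary files so the example is self-contained.
my @names;
for my $text ("first file, line 1\nfirst file, line 2\n",
              "second file, line 1\n") {
    my ($fh, $name) = tempfile(UNLINK => 1);
    print {$fh} $text;
    close $fh or die "Cannot close $name: $!";
    push @names, $name;
}

my @seen;
{
    # Pretend the script was started with the two file names as arguments.
    local @ARGV = @names;
    while (<>) {
        chomp;
        push @seen, $_;     # lines arrive in file order, one at a time
    }
}

print "$_\n" for @seen;
```

The loop body never mentions a filehandle or a file name; <> moves from the first file to the second on its own and ends the loop after the last line.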

Where does the <> operator put each line of text as it is working its way through the files mentioned in @ARGV? In the special scalar variable $_. As I mentioned previously, many of Perl's operators and functions are designed to work with $_ by default, and this ends up being really handy because it lets you write certain common operations very quickly.
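As a rough illustration of that $_ shorthand, the sketch below runs the same filter twice, once leaning on $_ and once with a named variable. The request strings and the array names are invented for the example.

```perl
use strict;
use warnings;

# Made-up request lines standing in for lines read from a file.
my @requests = ("GET /index.html\n", "GET /logo.gif\n", "GET /photo.gif\n");

# Implicit style: chomp and the pattern match both act on $_.
my @implicit;
my @copy = @requests;          # work on a copy, since chomp edits in place
for (@copy) {
    chomp;                     # same as chomp($_)
    next unless /\.gif$/;      # same as $_ =~ /\.gif$/
    push @implicit, $_;
}

# Explicit style: the same logic with a named variable.
my @explicit;
for my $request (@requests) {
    my $text = $request;
    chomp($text);
    next unless $text =~ /\.gif$/;
    push @explicit, $text;
}

print "implicit: @implicit\n";
print "explicit: @explicit\n";
```

Both loops keep the same two lines; the first is just shorter to type.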

In this case, though, we're going to go ahead and stick the contents of $_ into something a little more memorable. That happens in the next line:

my $line = $_;

Next comes the following:

my($host, $rest) = split / /, $line, 2;
if ($host =~ /^\d+\.\d+\.\d+\.\d+$/) {
    # looks vaguely like an IP address

Here you are using the split function to take the current line from the log file and separate it into everything before the first space character (which goes into the scalar variable $host) and everything after the first space character (which goes into $rest). This takes advantage of an optional third argument to the split function, a number telling split the maximum number of fields to split the string into (in this case, two, because we don't need to keep splitting once we've split off the first field).
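Here is a small sketch of what that third argument changes. The log line is a made-up example in Common Log Format, and @all_fields exists only to show the contrast with an unlimited split.

```perl
use strict;
use warnings;

# A made-up log line in Common Log Format.
my $line = qq{192.0.2.7 - - [04/Dec/2001:10:15:32 -0800] "GET /index.html HTTP/1.0" 200 1043\n};

my ($host, $rest) = split / /, $line, 2;   # stop after the first space
my @all_fields    = split / /, $line;      # no limit: split at every space

print "host: $host\n";
print "rest: $rest";                       # $rest still ends with the newline
print "fields without a limit: ", scalar(@all_fields), "\n";
```

Because $rest keeps the original trailing newline, the script can later print the rebuilt line without adding one.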

Next comes an if statement with a regular expression in the logical test. With your new understanding of regular expressions it should be pretty easy to decipher the meaning of /^\d+\.\d+\.\d+\.\d+$/: it matches a string consisting of four groups of one or more digits each, separated by periods. This is not exactly the same thing as an IP address (in which each component number must fall between 0 and 255); the pattern is naïve, in that it would accept as IP addresses things like 98765.1234.1.1, but it's close enough for our current purpose.
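If you ever need something stricter than the naïve pattern, one possible approach is to capture the four components and check each against the 0 to 255 range. The sub name looks_like_ipv4 is hypothetical, and clf_lookup.plx does not do this; it is only a sketch of the difference.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if all four components are 0..255.
sub looks_like_ipv4 {
    my ($candidate) = @_;
    my @octets = $candidate =~ /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\z/;
    return 0 unless @octets == 4;             # pattern did not match
    return 0 if grep { $_ > 255 } @octets;    # some component out of range
    return 1;
}

for my $candidate ('192.0.2.7', '98765.1234.1.1', '256.1.1.1') {
    my $naive  = $candidate =~ /^\d+\.\d+\.\d+\.\d+$/ ? 'yes' : 'no';
    my $strict = looks_like_ipv4($candidate)          ? 'yes' : 'no';
    printf "%-16s naive: %-3s strict: %s\n", $candidate, $naive, $strict;
}
```

For a log file written by a web server, the extra checking buys little, which is why the simpler pattern is a reasonable choice in the script.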

Next come these two lines:

unless (exists $hostname{$host}) {
    # no key, so haven't processed this IP before

As discussed earlier, we're going to use $hostname{$host} to keep track of the IP addresses we've already looked up. Sometimes, though, we will attempt to look up an IP address and find that it can't be resolved to a hostname. In such cases, we'll stick undef into the value corresponding to $hostname{$host}. The hash will still have a key corresponding to that IP address, but there won't be an associated value. By testing for the existence of a particular hash key (which is what the exists function lets us do), we can avoid entering this unless block if we come across those hosts again.
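The distinction between "has a key" and "has a value" is easy to check for yourself. In this sketch the hash contents are invented: one address that resolved, one that was tried and failed, and a third that has never been seen.

```perl
use strict;
use warnings;

my %hostname = (
    '192.0.2.7'  => 'www.example.com',   # lookup succeeded
    '192.0.2.99' => undef,               # lookup was tried and failed
);

for my $ip ('192.0.2.7', '192.0.2.99', '192.0.2.200') {
    my $tried    = (exists $hostname{$ip})  ? 'yes' : 'no';
    my $resolved = (defined $hostname{$ip}) ? 'yes' : 'no';
    print "$ip already tried? $tried, resolved? $resolved\n";
}
```

Only exists can tell the failed lookup apart from the address that has never been tried; a plain truth test or defined would treat them the same and trigger the slow lookup again.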

Next comes the actual looking up of the hostname, which is quite simple, thanks to the Socket.pm module:

    $hostname{$host} = gethostbyaddr(inet_aton($host), AF_INET);
}

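gethostbyaddr is a Perl function that provides an interface to the computer's underlying hostname lookup function. Its two arguments here, inet_aton($host) and AF_INET, are a little bit of magic provided courtesy of Socket.pm: inet_aton converts the dotted-quad string in $host into the packed binary address that gethostbyaddr expects, and AF_INET is a constant identifying it as an IPv4 Internet address.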
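To take some of the mystery out of that magic, this sketch shows what inet_aton produces for a sample address. The lookup_host wrapper is a hypothetical convenience, not something from clf_lookup.plx, and it is defined here without being called because its result depends on your machine's resolver.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa AF_INET);

my $packed = inet_aton('192.0.2.7');      # dotted quad -> 4 packed bytes
my @bytes  = unpack 'C4', $packed;        # look at the bytes as numbers

print "packed length: ", length($packed), " bytes\n";
print "bytes: @bytes\n";
print "round trip: ", inet_ntoa($packed), "\n";

# Hypothetical wrapper around the lookup step. Returns the hostname,
# or undef when the address is malformed or cannot be resolved.
sub lookup_host {
    my ($ip) = @_;
    my $addr = inet_aton($ip);
    return unless defined $addr;
    return scalar gethostbyaddr($addr, AF_INET);
}
```

On many systems lookup_host('127.0.0.1') returns localhost, but that depends on local configuration, so treat it as something to try rather than a guaranteed result.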

Next comes this:

    if ($hostname{$host}) {
        # only processes IPs with successful lookups
        $line = "$hostname{$host} $rest";
    }
}

If gethostbyaddr returned a false value for this particular $hostname{$host} (meaning this IP address couldn't be resolved to a hostname), the script will skip over this block. Otherwise, it will re-create the $line variable (which corresponds to the current line from the log file) by interpolating the looked-up hostname in $hostname{$host} into a string, along with the $rest variable (which you will recall holds the rest of the line).

The two closing curly braces end the two if blocks we were in, after which we print out just the current value of $line:

    print $line;
}
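One way to try out the logic walked through above without waiting on real DNS lookups is to gather the steps into a sub and hand it a stand-in resolver. Everything named here (rewrite_line, $resolver, %fake_dns, $calls, and the sample log lines) is an assumption made for the sketch; the real script calls gethostbyaddr directly inside its while loop.

```perl
use strict;
use warnings;

# Hypothetical helper gathering the steps above. $resolver is any code
# reference that maps an IP string to a hostname, or to undef on failure.
sub rewrite_line {
    my ($line, $cache, $resolver) = @_;
    my ($host, $rest) = split / /, $line, 2;
    if ($host =~ /^\d+\.\d+\.\d+\.\d+$/) {
        unless (exists $cache->{$host}) {
            # first time we have seen this address
            $cache->{$host} = $resolver->($host);
        }
        if ($cache->{$host}) {
            $line = "$cache->{$host} $rest";
        }
    }
    return $line;
}

# Stand-in for DNS so the sketch runs without a network.
my %fake_dns = ('192.0.2.7' => 'www.example.com');
my $calls    = 0;
my $resolver = sub { $calls++; return $fake_dns{ $_[0] } };

my %hostname;
my @out = map { rewrite_line($_, \%hostname, $resolver) } (
    qq{192.0.2.7 - - "GET / HTTP/1.0" 200 512\n},
    qq{192.0.2.99 - - "GET /a HTTP/1.0" 404 0\n},
    qq{192.0.2.7 - - "GET /b HTTP/1.0" 200 99\n},
    qq{192.0.2.99 - - "GET /c HTTP/1.0" 404 0\n},
);

print @out;
print "resolver calls: $calls\n";
```

Four lines go in but the resolver runs only twice, once per distinct address, and the lines for the unresolvable address pass through unchanged.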

This script jumps through a number of hoops in the interest of cutting down the actual work it does. Caching the lookups and rebuilding $line only when it has to make the script a little more involved, but it is worth taking that sort of care with a program like this because it may end up having to process some very big log files. Even with these tricks, the script will tend to take a long time to process large log files, because the gethostbyaddr function normally takes a certain amount of time to give up on an IP address that can't be resolved.

One use of clf_lookup.plx that is kind of fun is to put it in a pipeline to convert your log file's IP addresses into hostnames on the fly. For example, if your log file is called access.log, you could use the tail command with the -f switch to watch that log growing in real time, piping the output through clf_lookup.plx to convert the IP addresses, like this:

	[jbc@andros .logs]$ tail -f access.log | clf_lookup.plx

Perl for Web Site Management