Converting IP Addresses (cont'd) - Page 3
December 4, 2001
The next thing in clf_lookup.plx is a
my variable declaration for the
%hostname hash. This is going to be used to cache
hostname lookups while the script is running. That way, each IP
address will have to be looked up only once rather than every
time it appears in the log. It is important to declare the
%hostname hash out here, before the
while loop that actually processes each line from
the log file, because putting my %hostname inside
the loop block would create a new, empty hash each time
through the loop, and the cache would never be reused.
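To see why the position of the declaration matters, here is a minimal sketch. The slow_lookup sub and the sample addresses are hypothetical stand-ins for a real hostname lookup; the sub only counts how often it gets called.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real (slow) hostname lookup.
my $lookups = 0;
sub slow_lookup {
    my ($ip) = @_;
    $lookups++;
    return "host-for-$ip";
}

my @ips = ('10.0.0.1', '10.0.0.2', '10.0.0.1', '10.0.0.1');

# Declared once, outside the loop: the cache survives between iterations.
my %hostname;
for my $ip (@ips) {
    $hostname{$ip} = slow_lookup($ip) unless exists $hostname{$ip};
}
my $outside = $lookups;

# Declared inside the loop: a fresh, empty hash on every pass, so nothing is cached.
$lookups = 0;
for my $ip (@ips) {
    my %hostname;
    $hostname{$ip} = slow_lookup($ip) unless exists $hostname{$ip};
}
my $inside = $lookups;

print "lookups with hash outside the loop: $outside\n";
print "lookups with hash inside the loop:  $inside\n";
```

With the hash outside the loop, only the two distinct addresses are looked up; with it inside, every one of the four lines triggers a lookup.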
Let's get to the loop now. The beginning of the loop takes the
form:
while (<>) {
Here we're beginning a while loop, which you'll
remember means we're going to run a block of code repeatedly as
long as whatever is inside those parentheses evaluates to a true
value. But what a weird thing we've got inside that logical test.
It looks somewhat like the angle-input operator we use to read
lines from a filehandle, but there's no filehandle inside it.
What the <> (which is sometimes called the
diamond operator) is doing is this: it looks at the
@ARGV array (which you'll remember from Chapter 4 is
the special variable holding your script's command-line
arguments) and assumes that those arguments represent the names
of one or more text files. The <> operator
then returns the text from those files, one line at a time, so
you can work with those lines in the body of your
while block. It keeps feeding you lines of text
until it has exhausted all of the first file mentioned in
@ARGV, then goes on to the second file, and so on,
until it has exhausted all the files mentioned in
@ARGV. After the last line from the last file has
been delivered, it returns undef (the undefined
value), which is false, ending the loop.
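Here is a small sketch of that behavior. It creates two throwaway files with File::Temp and assigns their names to @ARGV by hand, which is only a stand-in for naming the files on the command line; the file contents are made up. Note that Perl treats while (<>) as shorthand for while (defined($_ = <>)), so the loop really ends when the operator returns undef, not merely on a false-looking line.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create two small throwaway files to stand in for log files.
my ($fh1, $file1) = tempfile(UNLINK => 1);
print {$fh1} "first file, line 1\nfirst file, line 2\n";
close $fh1;

my ($fh2, $file2) = tempfile(UNLINK => 1);
print {$fh2} "second file, line 1\n";
close $fh2;

# Pretend these names were given on the command line.
@ARGV = ($file1, $file2);

# <> walks through every file named in @ARGV, one line at a time.
my @seen;
while (<>) {
    chomp;
    push @seen, $_;
}

print "$_\n" for @seen;
```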
You get an interesting extra feature with the
<> operator. If you don't give your script any
command-line arguments, such that there are no files mentioned in
@ARGV, <> instead will read from
standard input (that is, from the STDIN
filehandle your script gets by default when it is started up).
This in turn lets you do cool things like use your script in a
shell pipeline to process the input or output for another
program. In fact, we'll be using that feature with this script a
little later.
Where does the <> operator put each line of
text as it is working its way through the files mentioned in
@ARGV? In the special scalar variable
$_. As I mentioned previously, many of Perl's
operators and functions are designed to work with $_
by default, and this ends up being really handy because it lets
you write certain common operations very quickly.
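As a quick illustration of those defaults, the following sketch (with made-up data) never mentions $_ by name, yet every operation in the loop body works on it.

```perl
use strict;
use warnings;

my @lines = ("alpha\n", "beta\n", "gamma\n");
my @matched;

for (@lines) {              # with no loop variable, each element is aliased to $_
    chomp;                  # same as chomp($_)
    next unless /^[ab]/;    # same as $_ =~ /^[ab]/
    push @matched, uc;      # same as uc($_)
}

print "@matched\n";
```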
In this case, though, we're going to go ahead and stick the
contents of $_ into something a little more
memorable. That happens in the next line:
    my $line = $_;
Next comes the following:
    my($host, $rest) = split / /, $line, 2;
    if ($host =~ /^\d+\.\d+\.\d+\.\d+$/) {
        # looks vaguely like an IP address
Here you are using the split function to take the
current line from the log file and separate it into everything
before the first space character (which goes into the scalar
variable $host) and everything after the first space
character (which goes into $rest). This takes
advantage of an optional third argument to the split
function, with that argument being a number telling
split how many fields to split the string into (in
this case, two, because we don't need to keep splitting once
we've split off the first field).
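The effect of that third argument is easy to see on a sample line. The log line below is invented for illustration (the address and request are not from any real log).

```perl
use strict;
use warnings;

# A made-up log line in common log format.
my $line = qq{192.0.2.10 - - [04/Dec/2001:10:15:32 -0800] "GET /index.html HTTP/1.0" 200 1043\n};

# With a limit of 2, split stops after the first space.
my ($host, $rest) = split / /, $line, 2;

# Without the limit, every space-separated piece becomes its own field.
my @all = split / /, $line;

print "host: $host\n";
print "rest: $rest";
print "fields without a limit: ", scalar(@all), "\n";
```

Notice that $rest keeps the line's trailing newline, so a line reassembled from a hostname and $rest can be printed without adding one.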
Next comes an if statement with a regular expression
in the logical test. With your new understanding of regular
expressions it should be pretty easy to decipher the meaning of
/^\d+\.\d+\.\d+\.\d+$/:
it matches a string consisting of four groups of one or more
digits each, separated by periods. This is not the exact same
thing as an IP address (in which the component numbers can fall
only within a certain range); this pattern is naïve, in that
it would accept as IP addresses things like
98765.1234.1.1, but it's close enough for our
current purpose.
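If you ever do need a tighter test, one possible approach is sketched below. The is_dotted_quad sub is a hypothetical addition, not part of clf_lookup.plx; it assumes that "valid" simply means four numbers, each between 0 and 255.

```perl
use strict;
use warnings;

# The naive test from the script: four runs of digits separated by dots.
sub looks_like_ip {
    my ($s) = @_;
    return $s =~ /^\d+\.\d+\.\d+\.\d+$/ ? 1 : 0;
}

# Hypothetical stricter variant: additionally require each number to be 0-255.
sub is_dotted_quad {
    my ($s) = @_;
    my @octets = $s =~ /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/
        or return 0;
    my $too_big = grep { $_ > 255 } @octets;
    return $too_big ? 0 : 1;
}

for my $candidate ('192.0.2.10', '98765.1234.1.1', 'example.com') {
    printf "%-16s naive=%d strict=%d\n",
        $candidate, looks_like_ip($candidate), is_dotted_quad($candidate);
}
```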
Next come these two lines:
        unless (exists $hostname{$host}) {
            # no key, so haven't processed this IP before
As discussed earlier, we're going to use
$hostname{$host} to keep track of the IP addresses
we've already looked up. Sometimes, though, we will attempt to
look up an IP address and find that it can't be resolved to a
hostname. In such cases, we'll store undef as the
value of $hostname{$host}. The hash
will still have a key corresponding to that IP address, but the
associated value will be undefined. By testing for the existence of a
particular hash key (which is what the exists
function lets us do), we can avoid entering this unless
block, and repeating the failed lookup, if we come across those hosts again.
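The difference between a key that exists and a value that is defined can be seen in a short sketch. The addresses and the hostname here are invented for illustration.

```perl
use strict;
use warnings;

my %hostname = (
    '192.0.2.10' => 'www.example.com',  # resolved earlier
    '192.0.2.99' => undef,              # looked up earlier, but no hostname found
);

for my $ip ('192.0.2.10', '192.0.2.99', '192.0.2.50') {
    my $has_key   = exists $hostname{$ip}  ? 'yes' : 'no';
    my $has_value = defined $hostname{$ip} ? 'yes' : 'no';
    print "$ip: key exists? $has_key; value defined? $has_value\n";
}
```

Only the third address, which has no key at all, would trigger a fresh lookup in the script.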
Next comes the actual looking up of the hostname, which is quite
simple, thanks to the Socket.pm module:
            $hostname{$host} = gethostbyaddr(inet_aton($host), AF_INET);
        }
gethostbyaddr is a Perl function that provides an
interface to the computer's underlying hostname lookup function.
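The two arguments are supplied courtesy of Socket.pm:
inet_aton($host) converts the dotted-quad string into the packed
binary address that gethostbyaddr expects, and
AF_INET is a constant identifying it as an Internet (IPv4) address.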
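To take some of the mystery out of those two arguments, here is a minimal sketch that looks up a single address. It uses 127.0.0.1 purely as a convenient example; whether that resolves to a name depends on how the machine is configured, so the sketch handles both outcomes.

```perl
use strict;
use warnings;
use Socket;   # exports inet_aton, inet_ntoa, and AF_INET by default

my $ip = '127.0.0.1';

# inet_aton packs the dotted-quad string into the 4-byte form gethostbyaddr expects.
my $packed = inet_aton($ip);
print "packed length: ", length($packed), " bytes\n";
print "round trip:    ", inet_ntoa($packed), "\n";

# In scalar context gethostbyaddr returns the hostname, or undef if the lookup fails.
my $name = gethostbyaddr($packed, AF_INET);
print "hostname:      ", (defined $name ? $name : '(could not be resolved)'), "\n";
```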
Next comes this:
        if ($hostname{$host}) {
            # only processes IPs with successful lookups
            $line = "$hostname{$host} $rest";
        }
    }
If gethostbyaddr returned a false value for this
particular $hostname{$host} (meaning this IP address
couldn't be resolved to a hostname), the script will skip over
this block. Otherwise, it will re-create the $line
variable (which corresponds to the current line from the log
file) by interpolating the looked-up hostname in
$hostname{$host} into a string, along with the
$rest variable (which you will recall holds the rest
of the line).
The two closing curly braces end the two if blocks
we were in, after which we print out just the current value of
$line:
    print $line;
}
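If you want to experiment with this logic without waiting on real DNS lookups, one option is to wrap the loop body in a sub and hand it a fake resolver. This is only a sketch: convert_line, the %fake_dns table, and the sample log lines are all hypothetical, and the code reference stands in for the script's gethostbyaddr(inet_aton($host), AF_INET) call.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrites one log line, using $resolve (a code ref that
# maps an IP string to a hostname or undef) and %$cache to remember results.
sub convert_line {
    my ($line, $cache, $resolve) = @_;
    my ($host, $rest) = split / /, $line, 2;
    if ($host =~ /^\d+\.\d+\.\d+\.\d+$/) {
        unless (exists $cache->{$host}) {
            # no key, so this IP has not been processed before
            $cache->{$host} = $resolve->($host);
        }
        if ($cache->{$host}) {
            # only rebuild the line after a successful lookup
            $line = "$cache->{$host} $rest";
        }
    }
    return $line;
}

# Fake resolver standing in for the real DNS lookup; it counts its calls.
my $calls = 0;
my %fake_dns = ('192.0.2.10' => 'www.example.com');
my $resolve = sub { $calls++; return $fake_dns{ $_[0] } };

my %cache;
my @out;
for my $line (
    "192.0.2.10 - - GET /a\n",
    "192.0.2.99 - - GET /b\n",
    "192.0.2.10 - - GET /c\n",
    "192.0.2.99 - - GET /d\n",
) {
    push @out, convert_line($line, \%cache, $resolve);
}

print @out;
print "resolver calls: $calls\n";
```

Four lines go in, but the resolver runs only twice: once for the address that resolves and once for the one that does not, because the failed lookup is cached as well.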
This script jumps through a number of hoops in the interest of
cutting down the actual work it does. Caching the lookups, and
rebuilding $line only when it has to, make the
script a little more involved, but it is worth taking that sort
of care with a program like this because it may end up having to
process some very big log files. Even with these tricks, because
the gethostbyaddr function normally takes a certain
amount of time to give up on an IP address that can't be
resolved, this script will tend to take a long time to process
large log files.
One use of clf_lookup.plx that is kind of fun is to
put it in a pipeline to convert your log file's IP addresses into
hostnames on the fly. For example, if your log file is called
access.log, you could use the tail
command with the -f switch to watch that log growing
in real time, piping the output through
clf_lookup.plx to do the
conversion, like this:
[jbc@andros .logs]$ tail -f access.log | clf_lookup.plx
Perl for Web Site Management