The "Visit" Data Structure - Page 8
December 14, 2001
Trying to track individual visitors via the entries in a web
server's access log is something of an exercise in futility. With
things like proxy servers and client-side caching getting in the
way, the series of accesses that show up in the log from a
particular hostname or IP address can give only an approximate
picture of what individual visitors are doing. Multiple users
sharing the same IP address can have their activity merged into
what looks like a single, very active visitor. Conversely, a
single visitor can show up in the logs via a different IP address
on each request, defying efforts to abstract those requests into
a meaningful "visit." A proxy server at a major ISP can cache the
site's pages, then satisfy hundreds of requests that never get
recorded in the server's logs.

Even so, it's hard not to wonder
what a log file would reveal if we could pluck out the requests
corresponding to specific hosts and string them together to see
what patterns emerge. Many users still browse from individual
host addresses without intervening proxy servers; for these
users, at least, the resulting "visit" tracking provides a
fascinating look at the paths being followed through the site.
It's also interesting to see how many incoming requests are
actually being generated by robot "spider" programs, and to study
the behavior of those programs as they interact with the server.
Finally, it's an interesting programming exercise to see how we
can assemble and present information on these "visits."

As with the data structure we used to create the SprocketExpo exhibitor
directory in earlier chapters, we could really benefit in this case
by taking advantage of Perl's support for multilevel data
structures. A hash of hashes (that is,
a hash whose values are themselves hash variables) would make the
task of storing and accessing information on these visits
significantly easier. As it is, though, we won't be learning how
to use multilevel data structures for several more chapters.
That's okay; we can fake it by using the conventional variables
we've been using already, just as we did for the SprocketExpo
example.

For the purposes of this script, we're going to define a
"visit" as a series of one or more requests received from the
same host, with no more than 15 minutes elapsing between one
request and the next. If we get another request from the same
host but more than 15 minutes has elapsed since the last one, we
will treat the new request as the start of a new "visit,"
counting it separately in our statistics. We may as well make
that 15-minute visit timeout a configuration variable up at the
top of the script and store it in seconds to make our
computations easier:
my $expire_time = 900;  # seconds of inactivity to consider a
                        # "visit" ended (0 = forever)
Notice how the comment tells us we can set the
$expire_time variable to 0 to make the
expiration time "forever." We'll see how this works in a minute.
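
In the meantime, the visit definition above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script's actual code: the starts_new_visit helper is hypothetical, and the timestamps are plain second counts of the kind a routine like &get_seconds might return.

```perl
use strict;
use warnings;

my $expire_time = 900;  # seconds of inactivity to consider a
                        # "visit" ended (0 = forever)

# Hypothetical helper (not from the original script): returns 1 when a
# request arriving at $now_seconds should begin a new visit, given the
# time of the same host's previous request (undef if the host is new).
sub starts_new_visit {
    my ($last_seconds, $now_seconds) = @_;
    return 1 unless defined $last_seconds;  # first request from this host
    return 0 if $expire_time == 0;          # 0 means a visit never expires
    return ($now_seconds - $last_seconds) > $expire_time ? 1 : 0;
}

print starts_new_visit(undef, 1000), "\n";  # host not seen before
print starts_new_visit(1000, 1600), "\n";   # 10 minutes after the last request
print starts_new_visit(1000, 2000), "\n";   # more than 15 minutes after it
```

Because the comparison is "greater than," a request arriving exactly 15 minutes after the previous one still counts as part of the same visit, which matches the "no more than 15 minutes" wording.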
A number of other variables, visible throughout the script and
declared with my near the beginning, will be used to
store the information on individual visits:
$total_visits
    This scalar will be incremented by one for each new visit
    processed. Besides being used in the script's report to tell us
    how many visits there were in all, this count will be used to
    generate a unique visit number for each visit.

%visit_num
    This hash will have keys consisting of hostnames or IP
    addresses, and values consisting of the currently "working"
    visit number corresponding to that host.

All of the following hash variables will have keys consisting of
the visit number described previously:

%host
    Key is visit number, value is the hostname or IP address
    corresponding to that visit number.

%first_time
    Key is visit number, value is the date and time of that visit's
    first access.

%last_time
    Key is visit number, value is the date and time of that visit's
    last (that is to say, most recent) access.

%last_seconds
    Key is visit number, value is the number of seconds returned by
    the &get_seconds subroutine for the date and time of that
    visit's last access.

%referer
    Key is visit number, value is the HTTP_REFERER environment
    variable supplied for that visit's first access.

%agent
    Key is visit number, value is the user-agent string supplied for
    that visit's first access.

We'll add all these new variables to the big my
declaration up at the top of the script:
my($begin_time, $end_time, $total_hits, $total_mb, $total_views,
   $total_visits, %visit_num, %host, %first_time, %last_time,
   %last_seconds, %page_sequence, %referer, %agent);
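
To make the "faked" structure concrete, here is a minimal sketch of how these parallel hashes might be filled in when a new visit begins. The begin_visit helper, the sample host address, and the sample field values are assumptions made for illustration; the script's real storage code may well look different. The last few lines show, purely for comparison, how the same record could be held in a hash of hashes.

```perl
use strict;
use warnings;

my ($total_visits, %visit_num, %host, %first_time, %last_time,
    %last_seconds, %referer, %agent);
$total_visits = 0;

# Hypothetical helper (not from the original script): record the start
# of a new visit for $hostname and return the new visit number.
sub begin_visit {
    my ($hostname, $time, $seconds, $ref, $ua) = @_;
    my $n = ++$total_visits;      # the running count doubles as a unique ID
    $visit_num{$hostname} = $n;   # now the "working" visit for this host
    $host{$n}         = $hostname;
    $first_time{$n}   = $time;
    $last_time{$n}    = $time;    # first access is also the most recent one
    $last_seconds{$n} = $seconds;
    $referer{$n}      = $ref;
    $agent{$n}        = $ua;
    return $n;
}

# Sample values are made up for the example.
my $n = begin_visit('192.0.2.10', '14/Dec/2001:10:00:00', 36000,
                    '-', 'ExampleBrowser/1.0');
print "Visit $n from $host{$n} began at $first_time{$n}\n";

# For comparison only: the same record as a hash of hashes, where one
# top-level hash keyed by visit number holds a hash of fields.
my %visit;
$visit{$n} = { host => $host{$n}, first_time => $first_time{$n} };
print "Same host via a hash of hashes: $visit{$n}{host}\n";
```

The visit number is what ties the separate hashes together: look up a host in %visit_num to get its current visit number, then use that number as the key into each of the other hashes.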
Perl for Web Site Management