Web Developer's Virtual Library: Encyclopedia of Web Design Tutorials, Articles and Discussions



Parsing Web Access Logs

December 4, 2001

Web server access logs are an excellent source of information about what your site's visitors are up to. The information on separate visitors is all mixed together, though, and for all but the smallest sites the raw access logs are too large to read directly. What you need is log analysis software to make the information in the log more easily accessible. You can buy commercial log analysis software to do this, but Perl makes it easy to write your own. The next three chapters describe how to build such a home-grown log-analysis tool.

This chapter focuses on the first part of the process: extracting and storing the information we're interested in. We talk about log file structure, converting IP addresses, and creating regular expressions capable of parsing web access logs. We also talk about creating a suitable data structure for storing the extracted data, so we can answer interesting questions about what our site's visitors have been doing. Along the way we discuss the difficulty of identifying those visitors in the web server's log entries and devise an approach for extracting at least an approximate version of that information.

The example continues in Chapter 9, which focuses on how to do computations involving dates and times, and finishes in Chapter 10, which covers the specifics of how we manipulate the "visit" information from our logs, as well as the actual output of the finished report.

Log File Structure

Most web servers store their access log in what is called the "common log format." Each time a user requests a file from the server, a line containing the following fields is added to the end of the log file:

  • host: This is either the IP address (like 207.71.222.231) or the corresponding hostname (like pm9-31.sba1.avtel.net) of the remote user requesting the page. For performance reasons, many web servers are configured not to do hostname lookups on the remote host. This means that all you end up with in the log file is a bunch of IP addresses. A bit later in this chapter, you'll develop a Perl script that you can use to convert those IP addresses into hostnames.

  • identd result: This is a field for logging the response returned by the remote user's identd server. Almost no one actually uses this; in every web log I've ever seen, this field is always just a dash (-).

  • authuser: If you are using basic HTTP authentication (which we'll be talking about in Chapter 19) to restrict access to some of your web documents, this is where the username of the authenticated user for this transaction will be recorded. Otherwise, it will be just a dash (-).

  • date and time: Next comes a date and time string inside square brackets, like: [06/Jul/1999:00:09:12 -0700]. That's the day of the month, the abbreviated month name, and the four-digit year, all separated by slashes. Next come the time (expressed in 24-hour format, so 11:30 P.M. would be 23:30:00) and a time-zone offset (in this example, -0700, because the web server this log was from was using Pacific Daylight Time, which is seven hours behind Universal Time/Greenwich Mean Time).

  • request: This is the actual request sent by the remote user, enclosed in double quotes. Normally it will look something like: "GET / HTTP/1.0". The GET part means it is a GET request (as opposed to a POST or a HEAD request). The next part is the path of the URL requested; in this case, the default page in the server's top-level directory, as indicated by a single slash (/). The last part of the request is the protocol being used, at the time of this writing typically HTTP/1.0 or HTTP/1.1.

  • status code: This is the status code returned by the server; by definition this will be a three-digit number. A status code of 200 means everything was handled okay, 304 means the document has not changed since the client last requested it, 404 means the document could not be found, and 500 indicates that there was some sort of server-side error. (More detail on the various status codes can be found in RFC 1945, which describes the HTTP/1.0 protocol. See http://www.w3.org/Protocols/rfc1945/rfc1945.)

  • bytes sent: The amount of data returned by the server, in bytes, not counting the response headers.
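
The date and time field described above can be turned into a single number that is easier to compare and sort. The following is a minimal sketch of one way to do that, using the standard Time::Local module; the helper name clf_time_to_epoch is made up for this illustration and is not the routine developed later in the book.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Map the abbreviated month names used in the log to 0-based month numbers.
my %month_num;
@month_num{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (0 .. 11);

# clf_time_to_epoch is a hypothetical helper name (an assumption of this
# sketch). It takes the contents of the bracketed field, without the
# brackets, and returns seconds since the epoch, or undef if the string
# does not look like a log timestamp.
sub clf_time_to_epoch {
    my ($stamp) = @_;
    my ($mday, $mon, $year, $hour, $min, $sec, $sign, $off_h, $off_m) =
        $stamp =~ m{^(\d{2})/([A-Za-z]{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2}) ([+-])(\d{2})(\d{2})$}
        or return;
    return unless exists $month_num{$mon};

    # Treat the wall-clock fields as if they were UTC, then remove the
    # time-zone offset. Note that timegm dies on impossible dates (day 32,
    # hour 25), so wrap the call in eval if your logs may be corrupt.
    my $epoch  = timegm($sec, $min, $hour, $mday, $month_num{$mon}, $year);
    my $offset = ($off_h * 60 + $off_m) * 60;
    $offset = -$offset if $sign eq '-';
    return $epoch - $offset;
}

my $epoch = clf_time_to_epoch('06/Jul/1999:00:09:12 -0700');
print scalar(gmtime $epoch), " UTC\n";
```

For the sample timestamp from the text, the sketch reports 07:09:12 on July 6, 1999, which is the logged local time with the seven-hour offset added back.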

An extended version of this log format, often referred to as the "combined" format, includes two additional fields at the end:

  • referer: The referring page, if any, as reported by the remote user's browser. Note that referer is consistently misspelled (with a single "r" in the middle) in the HTTP specification, and in the name of the corresponding environment variable.

  • user agent: The user agent reported by the remote user's browser. Typically, this is a string describing the type and version of browser software being used.

Assuming you have control over your web server's configuration, or can get your ISP to modify it for you, the combined format's extra fields can provide some very interesting information about the users visiting your site. The log analysis script described in this chapter will work with either format, however.
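
To make the field layout concrete, here is a rough sketch of a regular expression that pulls the fields out of a single log line in either format. It is an illustration under simplifying assumptions, not the script this chapter goes on to build: the helper name parse_log_line and the hash keys are invented here, the byte count in the sample line is made up, and the pattern assumes that quoted fields never contain an embedded double quote.

```perl
use strict;
use warnings;

# parse_log_line is a hypothetical helper name (an assumption of this
# sketch). It returns a hash reference of fields, or undef when the line
# does not look like a common or combined log entry.
sub parse_log_line {
    my ($line) = @_;
    my ($host, $ident, $authuser, $time, $request,
        $status, $bytes, $referer, $agent) =
        $line =~ m{
            ^(\S+)\s+                        # host
            (\S+)\s+                         # identd result
            (\S+)\s+                         # authuser
            \[([^\]]+)\]\s+                  # date and time, minus the brackets
            "([^"]*)"\s+                     # request
            (\d{3})\s+                       # status code
            (\d+|-)                          # bytes sent, or a dash
            (?:\s+"([^"]*)"\s+"([^"]*)")?    # referer and user agent, if present
            \s*\z
        }x or return;

    # Split the request into its three parts: method, path, and protocol.
    my ($method, $path, $protocol) = split ' ', $request, 3;

    return {
        host     => $host,
        ident    => $ident,
        authuser => $authuser,
        time     => $time,
        request  => $request,
        method   => $method,
        path     => $path,
        protocol => $protocol,
        status   => $status,
        bytes    => ($bytes eq '-' ? 0 : $bytes),
        referer  => $referer,    # undef for common-format lines
        agent    => $agent,      # undef for common-format lines
    };
}

# Sample line assembled from the examples in the text; the byte count
# (1234) is invented for illustration.
my $sample = '207.71.222.231 - - [06/Jul/1999:00:09:12 -0700] "GET / HTTP/1.0" 200 1234';
my $entry  = parse_log_line($sample) or die "could not parse line\n";
print "$entry->{host} requested $entry->{path} (status $entry->{status})\n";
```

Making the last two quoted fields optional is what lets one pattern accept both formats; a common-format line simply leaves the referer and agent entries undefined.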

Perl for Web Site Management