This is a hands-on guide for users who maintain their web pages in a UNIX shell account on a host running the popular NCSA httpd or Apache web servers. It explains how to extract personal records (that is, the accesses for one person's pages) from a server-wide access_log, and how to archive that information long-term for analysis by any popular log analyzer. Such extraction and archiving creates smaller, focused logs that can be retained far longer than a server-wide log.
Since you've got a UNIX shell account, you've probably got access to the Lynx web browser. Visit your homepage using Lynx and hit the # key to see the httpd headers, which will include a header identifying the server software. (If you don't have access to Lynx, visit the Netcraft Server Survey and enter your server's name.) If the server response doesn't include the words "NCSA" or "Apache", the rest of this tutorial is useless to you. Sorry.
NCSA and Apache servers record file requests in a file named access_log, usually in a subdirectory of the server daemon's home directory. You can find your server's access_log with the find(1) command, e.g.:

    find / -name 'access_log*' -print 2>/dev/null

(Quote the pattern so the shell doesn't expand the wildcard itself; redirecting stderr hides the inevitable "Permission denied" noise.)
find(1) may actually locate the server access_log more than once. That's OK. Either your server has symbolic links to the access_log (pick whichever one is easiest to type) or it's saving old logs for a short time after closing them. Neither case causes any problems, and might even make your work easier.
No ISP can afford the diskspace to save old access_log files forever, so you've got to learn to get your information out of the file before it's deleted and restarted. The restart interval varies widely from server to server, with busy servers having to reset more often than light-load machines. Some servers do, however, retain recent logs for a few days after closing them, in case the webmaster has to track down an old error.
If your ISP retains old access_logs, the modification times on saved files will probably tip you off right away -- they're the times those logs were closed. Otherwise, you'll have to monitor the access_log directory until you learn the interval. The first line of the access_log states when the log was opened. Set up a simple shell script (like the one below) and run it from crontab(1) fairly often (once an hour, at least) to build a log of restart times. Hopefully, you'll see a pattern form.
    #!/bin/sh
    mv ~/restart_log ~/restart_log.tmp
    head -1 /var/logs/www/access_log >> ~/restart_log.tmp
    uniq ~/restart_log.tmp > ~/restart_log
    rm ~/restart_log.tmp
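For reference, the line that head -1 returns is in the Common Log Format, and the opening timestamp is easy to pull out. A quick sketch, with entirely made-up hostname, path, and values:

```shell
# a sample Common Log Format line (all values hypothetical); the request
# timestamp is the fourth and fifth whitespace-separated fields
line='host.example.com - - [01/Mar/1998:00:00:07 -0500] "GET /~islander/ HTTP/1.0" 200 1043'
echo "$line" | awk '{print $4, $5}'
# prints: [01/Mar/1998:00:00:07 -0500]
```

Comparing that timestamp across successive runs of the monitor script is how the reset pattern shows itself.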
Once you've figured out when and how often your server's access_log is reset, you have to decide on the best time to extract your information. The best time depends on whether or not the server retains closed logs after a restart.
If your server is restarting logs on a regular basis, but isn't saving the logs afterwards, you'll have to run your extraction program just before every reset. I run mine about 10 minutes before reset, because grepping large logs can take a while. Of course, I potentially lose 10 minutes of stats every month, but that's life.
If your server is saving access_logs for a reasonable length of time, you can get complete access logging by waiting until just after the log restart and extracting your information from the just-closed log.
All you need to extract your personal information from an access_log is the grep(1) command. If the access_log is uncompressed, just use this command to extract your information to a new access_log in your home directory, substituting your account name for mine and the location of your server access_log where appropriate:

    grep islander access_log >> ~/access_log
Compressed access_log files require slightly more work. Uncompress the log to stdout and pipe it directly to grep(1). For a gzip'ed log, the command should resemble:

    gzip -dc access_log.gz | grep islander >> ~/access_log
Most of the time, grepping for your userid is sufficient, although you may pick up some bogus entries if your userid is a common word. A more complex regexp may be used, but be careful -- you can't just grep for ~userid, because some browsers escape ~ as %7E, and you'd miss those requests in the access_log. For directories using ~, try egrep "/(~|%7E)userid/".
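A quick way to convince yourself the pattern behaves as intended (the userid and request paths below are made up) is to feed it a few sample paths:

```shell
# matches the literal ~ form and the %7E escape, but not the
# similar-but-different userid "islanders"
printf '%s\n' \
  '/~islander/index.html' \
  '/%7Eislander/pics/me.jpg' \
  '/~islanders/index.html' |
  egrep "/(~|%7E)islander/"
# prints the first two paths only
```

Note the trailing slash in the pattern: it's what keeps "islanders" from matching "islander".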
Now that you've learned how to save your information, you need to archive it. An access_log can grow large quickly, but they compress very well. I recommend using gzip(1), because it can work in a command pipe and append to already-archived files. That reduces the number of large files kept on disk at one time, avoiding "disk quota exceeded" errors that can lose your log.
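The append trick works because a file of concatenated gzip streams decompresses as one continuous stream. A quick demonstration (the temp-file path is arbitrary):

```shell
# each append adds a complete, independent gzip member to the archive
printf 'January entries\n'  | gzip -9 >> /tmp/append_demo.gz
printf 'February entries\n' | gzip -9 >> /tmp/append_demo.gz
# gzip -dc reads the members back to back, as if they were one file
gzip -dc /tmp/append_demo.gz
# prints: January entries
#         February entries
rm /tmp/append_demo.gz
```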
Here are the extraction commands used above, altered to compress the personal access_log:

    grep islander /var/logs/www/access_log | gzip -9 >> ~/access_log.gz
and

    gzip -dc access_log.gz | grep islander | gzip -9 >> ~/access_log.gz
gzip(1) can provide 80-90% compression on a log file. In my case, 22 months of access_log entries compress to less than 900 kilobytes. Not bad, eh?
Now that you've got a personal access_log, reconfigure your log analyzer to use that log instead of the site-wide log. If the analyzer can't decompress logs on its own, it can probably read logs from standard input, allowing you to "feed" the ~/access_log to the analyzer in a pipe. For example:

    gzip -dc ~/access_log.gz | analyzer
(At various times, I've used personal access_logs (created using the techniques on this page) with Analog, Getstats, W3Olista, and wusage 3.2.)
You now know where your server's access_log is, how often it's reset, how to save the information for long-term use, and how to feed the personal log to your log analyzer. Now put it all together in a shell script and use crontab(1) to run it at the time you chose in Step 2.5, and your logging will be automated. An example shell script, using Analog to process the logs:
    #!/bin/sh
    SERVER_LOG=/var/www/logs/access_log
    MY_LOG=$HOME/logs/access_log.gz
    grep islander $SERVER_LOG | gzip -9 >> $MY_LOG
    gzip -dc $MY_LOG | analog
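To schedule it, a crontab entry along these lines would do; the script path and the timing are assumptions here -- match them to the reset schedule you worked out earlier (add the entry with crontab -e):

```shell
# hypothetical entry, assuming you found the log resets monthly around
# 00:10 on the 1st: extract ten minutes earlier, just before the reset
0 0 1 * * $HOME/bin/archive_log.sh
```

The five fields are minute, hour, day-of-month, month, and day-of-week; see crontab(5) for the details.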