Traffic tracking confusion

The main metric of the success of a website is its traffic. The number of people who visit your site per day, month, year, whatever. For some bizarre reason, I had always assumed that we (and by that I mean everybody connected with the web business) had decided on how we were going to count visitors to a site.

I guess I was wrong all along.

I mean, how hard could that be? All you need to do is read the log file and count your traffic. There are several stellar applications that do this. (AWSTATS is my current favourite.)

A log file is a simple thing. In theory, at least. Each time the server gets a request for a resource - a page, a picture, a CSS file, whatever - it writes a line in the log file. Like so:

59.x.y.z - - [25/Nov/2009:15:27:04 +0530] "GET / HTTP/1.1" 200 7463 "http://twitter.com/aadvaark" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.15) Gecko/2009102815 Ubuntu/9.04 (jaunty) Firefox/3.0.15"

This particular line is a typical line from Apache's access log. AWSTATS analyses each line and creates graphs and tables which should make you feel great or not so great depending on if the graph is snaking upwards or downward.
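
To make "read the log file and count" concrete, here is a rough Python sketch that picks the sample line above apart and counts hits. It assumes the log is in Apache's "combined" format, which is what the sample line looks like. The names LOG_PATTERN, parse_line and summarize are made up for this post, and this is not how AWSTATS itself does it.

```python
import re

# Assumed layout (Apache "combined" format): host, identity, user, [time],
# "request", status, bytes, "referer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def summarize(lines):
    """Return (hits, distinct hosts), skipping lines that do not parse."""
    hits = 0
    hosts = set()
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        hits += 1
        hosts.add(entry["host"])
    return hits, len(hosts)


sample = ('59.x.y.z - - [25/Nov/2009:15:27:04 +0530] "GET / HTTP/1.1" 200 7463 '
          '"http://twitter.com/aadvaark" "Mozilla/5.0 (X11; U; Linux i686; en-US; '
          'rv:1.9.0.15) Gecko/2009102815 Ubuntu/9.04 (jaunty) Firefox/3.0.15"')

entry = parse_line(sample)
print(entry["host"], entry["path"], entry["status"], entry["referer"])
```

Note that every requested resource gets its own line, so a raw count of lines is a count of hits, not of people.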

But then, there's page tagging. Page tagging involves adding a snippet of (usually) JavaScript to each page. Every time the page is rendered in a browser, the JavaScript is invoked, which sends a bunch of information to a web server. Usually not your own. You are then provided with pretty graphs and tables which should make you feel great or not so great depending on if the graph is snaking upwards or downward.

Common sense would suggest that you should get at least comparable results so that your feeling of greatness or not so greatness is not dependent on the tool you use. But we all know that the Intertubes has no place for that particular line of reasoning.

Last week, I took a peek at a client's traffic using Google Analytics and AWSTATS. The results were off. Way off. Off by a factor of 10 or more. Not good.

In the sample we analysed, the AWSTATS figure was way higher than the Google Analytics number.

Googling was very revealing - lots of people have noticed this, but very few of the explanations were even remotely convincing.

Most arguments centered around three points:

  1. Browsers with no JavaScript would not show up in Google Analytics: Accepted. But this should account for a 10% dip at the most, not the huge difference we were seeing.
  2. AWSTATS does not record a hit when the resource is served from the cache: For example, when the Back button is pressed. But that would make the AWSTATS number lower than the Google Analytics number, not higher.
  3. Google Analytics uses cookies to track visitors: So what? Shouldn't the results be comparable, at the very least?
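
The arithmetic behind the first point is worth spelling out. This is a back-of-the-envelope Python sketch, assuming the only difference between the two tools is visitors whose browsers never run the tag. The 10% share is my own guess from above, and both function names are made up for illustration.

```python
def log_to_tag_ratio(share_without_js):
    """Log-based count divided by tag-based count, if the tag misses
    only the given share of visitors."""
    return 1.0 / (1.0 - share_without_js)


def share_needed(ratio):
    """Share of visitors the tag would have to miss to explain a given ratio."""
    return 1.0 - 1.0 / ratio


print(f"{log_to_tag_ratio(0.10):.2f}")  # ratio if 10% of visitors have no JavaScript
print(f"{share_needed(10):.0%}")        # share needed to explain a factor of 10
```

A 10% share gives a ratio of about 1.11. To get a factor of 10 out of missing JavaScript alone, nine visitors in ten would have to be browsing without it.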

I guess the way to answer this question is to set up a controlled experiment and compare the two.