Prefetching Hints - Helping Firefox and Google speed up your site

The Prefetching Problem

Read. Click. Wait...

Read. Click. Wait...

Read. Click. Wait...

Why do we have to wait?

Wouldn't it be better to download the next page we'll want to click while we're reading the one before? That's the thinking behind prefetching, whether it's done by the Firefox browser or the Google Web Accelerator.

When someone accesses one of our web pages, their browser can download not only the web page they asked for, but also the pages they might visit next. Well-behaved clients like Firefox will only do this with pages that are specifically marked for prefetching by the author or system administrator. The exception is when the user has installed the Fasterfox extension, in which case Firefox will rampage around your site grabbing anything that looks like a static page, whether the user has asked for it or not.

There's been a lot of controversy about whether browsers should do this kind of thing. If a site is on fast enough hardware and has a lot of bandwidth to spare, it makes sense to let users download pages they're likely to want in advance. On the other hand, for a site with limited resources, a bunch of clients downloading pages they may not even look at will only slow things down for everybody.

Clearly the problem is not with prefetching itself but with deciding which pages to prefetch. The browser has no idea how busy the server is or how much spare bandwidth it has. Not only that, it also has no reliable way of telling a link the user is likely to click from a link that nobody cares about.

As web server administrators, on the other hand, we know about all these things. We have data about how much bandwidth our sites are allowed and how much they are using, which pages are cheap to deliver and which ones involve expensive database queries, how much memory we're using, how much strain the CPU is under - everything we need to judge whether prefetching our pages will make things better or worse for our readers. Not only that, we also have our web server logs, giving us real data from real people about which pages our users like to click, and where they are likely to go next.

I will suggest a couple of things we can do to take control of the prefetching process, discourage badly-behaved clients from prefetching too much, and give the browser the information it needs to make our users' experience better.

Getting prefetching to show up in our logs

First of all, how do we know a prefetch when we see one?

Firefox puts a header in each prefetching request, like this:

X-moz: prefetch

So we'll need to ask our web server to trap that information and log it somewhere useful. The options are:

  1. Make a separate log file, just for prefetching requests.
  2. Add an extra field to our log file format.
  3. Mush something about prefetching into an existing field in our log file.

I have enough log files as it is, and I don't want to confuse my log analysis software by adding a custom field, so I'm going to squidge the X-Moz header onto the end of the User-Agent field of my current logs. (They're in "combined" format, which includes a field for the referer.) Log analysis software will usually ignore crap tagged on the end of the User-Agent field, so this will tell me which hits have been prefetched without breaking anything else.

Let's tell Apache about the new log format we're inventing. We'll call the format "combined_with_prefetching_hack".

Somewhere in our Apache configuration file (httpd.conf or apache2.conf) we should have a line like this:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

Underneath that, we'll add another line like this:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i %{X-Moz}i\"" combined_with_prefetching_hack

Then we'll find the place where we are currently telling apache to use the "combined" format for our site, and tell it to use "combined_with_prefetching_hack" instead.

Comment out a line a bit like this:

CustomLog /var/log/apache2/access.log combined

...and replace it with something more like this:

CustomLog /var/log/apache2/access.log combined_with_prefetching_hack

...then restart apache or tell it to reload its configuration.

Now if we want a version of our log file without the prefetched stuff, we can do this:

grep -v prefetch access.log
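
One catch with the grep approach is that it also throws away any request whose URL or referer happens to contain the word "prefetch". A slightly more careful filter looks only at the end of the User-Agent field. Here is a minimal Python sketch of that idea, assuming the combined_with_prefetching_hack format defined above; the names is_prefetch and split_log are made up for illustration.

```python
import re

# In the modified format the last quoted field is "<User-Agent> <X-Moz>".
# Apache logs a missing request header as "-", so ordinary hits end in " -"
# and prefetch hits end in " prefetch".
LAST_FIELD = re.compile(r'"([^"]*)"\s*$')


def is_prefetch(line):
    """Return True if the final quoted field of a log line ends with 'prefetch'."""
    match = LAST_FIELD.search(line)
    return bool(match) and match.group(1).rstrip().lower().endswith("prefetch")


def split_log(lines):
    """Split log lines into two lists: (normal hits, prefetched hits)."""
    normal, prefetched = [], []
    for line in lines:
        (prefetched if is_prefetch(line) else normal).append(line)
    return normal, prefetched
```

To use it, pass an open log file straight to split_log, for example split_log(open("access.log")).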

Stopping Fasterfox Running Amok

Once we've had prefetch logging turned on for a couple of days, we may find a small number of over-excitable clients rampaging around our site prefetching anything they can get their greedy little hands on. These are probably Firefox users with the Fasterfox extension installed. We almost certainly don't want every link on every page to be prefetched, but luckily even Fasterfox will behave itself, if we talk to it sternly enough.

If you haven't already got one, create a robots.txt file in the top directory of your site. In it, add the following lines:

User-agent: Fasterfox
Disallow: /
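
If you want to double-check that these two lines say what we think they say, Python's standard urllib.robotparser module can read them back. This is only a sketch: it shows how a standard robots.txt parser interprets the rule, not how Fasterfox itself behaves, and the helper name allowed is invented for this example.

```python
from urllib.robotparser import RobotFileParser

# The same two lines we put in robots.txt above.
ROBOTS_TXT = """\
User-agent: Fasterfox
Disallow: /
"""


def allowed(robots_txt, user_agent, path):
    """Return True if user_agent may fetch path according to robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent in ("Fasterfox", "Mozilla/5.0"):
        print(agent, allowed(ROBOTS_TXT, agent, "/index.htm"))
```

The rule only names Fasterfox, so every other client should still be allowed everywhere.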

Providing prefetching hints

Now we've dealt with browsers trying to do things the wrong way, let's provide some hints to help the ones that are trying to do it right.

The conventional way to tell the browser to prefetch something is to put a <link> tag in the head of our page. For example, if I think there's a good chance someone reading this page will want to go and look at my top page as well, I can stick something like this in the head of my HTML document:

<link rel="prefetch" href="/index.htm">

That's fine if we know what we'll want people to prefetch when we make the page. But we probably don't. People won't necessarily click what we think they're going to click, and we want to be able to adjust how much is prefetched according to how much spare capacity we've got on our server.

So instead, let's keep our prefetching rules separate from our website content. Rather than putting <link> tags in every page on our site, we'll inject prefetching hints into the headers of the responses that our server sends to the browser. That way we can easily regenerate the rules to keep up with changes in usage patterns, and scale back or turn off prefetching altogether if our server gets too busy. (Many thanks to Darin Fisher for his help with this.)

If we haven't already done so, we'll need to turn on apache's mod_headers.

In apache2, we can do that like this:

a2enmod headers

...then get apache to reload itself with:

/etc/init.d/apache2 force-reload

...or similar.

Now let's try making a prefetching hint for Firefox. When we're done, the following will tell Firefox to prefetch the top page of my website when it's finished downloading this page:

<IfModule mod_headers.c>
   <Location /programming/pf.htm>
      Header append Link "</index.htm>; rel=prefetch"
   </Location>
</IfModule>

I've called this file prefetch.conf and stuck it in my apache2 configuration directory (/etc/apache2). To tell Apache to read it, we need an Include statement in the configuration file like this:

Include /etc/apache2/prefetch.conf

Once we've reloaded apache, we should be able to check our logs and find that requests for /programming/pf.htm are immediately followed by prefetch requests for /index.htm.

If this doesn't seem to be working, you may want to check whether the Link header is really being set. You can either use Firefox's Live HTTP Headers extension or do it the old-fashioned way with wget -S. When testing, bear in mind that the file we're prefetching may already have been cached by the browser. It might be easier to test this by telling Firefox to prefetch a non-existent file, then checking for the resulting 404 in the server logs.
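
If you'd rather script the check, something along these lines should do. It's a rough Python sketch using only the standard library; the function names are invented here, and the regular expression only understands the simple `<url>; rel=prefetch` form used in this article, not every variation of the Link header.

```python
import re
import urllib.request

# Matches one '<target>; rel=prefetch' entry, with or without quotes round
# the rel value. "Header append" joins several entries with commas.
LINK_PART = re.compile(r'<([^>]*)>\s*;\s*rel="?prefetch"?', re.IGNORECASE)


def prefetch_targets(link_header):
    """Return the prefetch targets named in a Link header value."""
    return LINK_PART.findall(link_header or "")


def fetch_prefetch_targets(url):
    """Request url and return the prefetch targets from its Link headers."""
    with urllib.request.urlopen(url) as response:
        values = response.headers.get_all("Link") or []
    return [target for value in values for target in prefetch_targets(value)]
```

Calling fetch_prefetch_targets with the address of a page that has a hint configured should give back a list containing the hinted path. An empty list means the header isn't getting through.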

Generating the hints from our access logs

OK, all we have to do now is go through all the pages on our site, figure out which pages our users will want to fetch in which order and put that information in our prefetch.conf file.

How you decide what you want prefetched will be up to you. I'm going to assume that we only want to prefetch static pages, and not bother with anything with an extension other than .htm(l) or anything with a query string in it. Other than that, I'll count any record of someone arriving at B.html from A.html as one vote for putting a prefetching hint for B.html in the A.html headers. I'll only prefetch a page if at least 2% of the people accessing the page click the link to go to the next one, and I'll limit the hints provided with each page to the five most popular subsequent clicks.
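
To make those rules concrete, here is a rough Python sketch of the vote counting. It illustrates the rules just described rather than reproducing any existing script, and it makes a few assumptions of its own: the logs are in the combined format (with or without the prefetch hack), only 200 and 304 responses count as hits, requests already marked as prefetches are ignored, and all the function names are invented for this example.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlsplit

# host ident user [time] "METHOD path proto" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]*\] "(?:GET|HEAD) (\S+)[^"]*" (\d{3}) \S+ '
    r'"([^"]*)" "([^"]*)"'
)
STATIC_PAGE = re.compile(r"\.html?$", re.IGNORECASE)


def is_static(path):
    """Static means: ends in .htm or .html and has no query string."""
    return "?" not in path and bool(STATIC_PAGE.search(path))


def count_votes(lines, domains):
    """Return (hits per page, votes[from_page][to_page]) for our own domains."""
    hits = Counter()
    votes = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        path, status, referer, agent = match.groups()
        if status not in ("200", "304") or not is_static(path):
            continue
        if agent.rstrip().endswith("prefetch"):
            continue  # a prefetch is not a click, so it gets no vote
        hits[path] += 1
        ref = urlsplit(referer)
        if (ref.hostname in domains and not ref.query
                and is_static(ref.path) and ref.path != path):
            votes[ref.path][path] += 1
    return hits, votes


def choose_hints(hits, votes, min_percent=2, max_hints=5):
    """Keep the most popular next pages that clear the percentage threshold."""
    hints = {}
    for page, nexts in votes.items():
        if not hits[page]:
            continue  # no recorded hits for the referring page
        chosen = [target for target, count in nexts.most_common()
                  if 100.0 * count / hits[page] >= min_percent][:max_hints]
        if chosen:
            hints[page] = chosen
    return hints


def render_conf(hints):
    """Render the hints as an Apache mod_headers configuration fragment."""
    out = ["<IfModule mod_headers.c>"]
    for page in sorted(hints):
        out.append("   <Location %s>" % page)
        for target in hints[page]:
            out.append('      Header append Link "<%s>; rel=prefetch"' % target)
        out.append("   </Location>")
    out.append("</IfModule>")
    return "\n".join(out)
```

The threshold and the cap are ordinary arguments, so tightening them on a busy server is just a matter of calling choose_hints with a bigger min_percent and a smaller max_hints.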

This will only work for links within my site. My server logs don't tell me anything about where people go once they leave my site, so I'll leave external links out of my prefetch.conf, and just hard-code the occasional <link> tag in my HTML for very compelling external links.

Here's a Perl script to generate the prefetch.conf file from combined-format server logs. You can download it here. You will probably want to tweak it according to your own personal prefetching preferences.

You need to pass it the name of your log file and a comma-delimited list of your domains. My defaults err on the "lots of spare bandwidth" side. On a busy server, you'll also want to add arguments for the minimum percentage of hits going to the next page you'll want before you prefetch it, and the maximum number of prefetching hints it should provide for each page. A fairly conservative incantation would be something like:

perl /var/log/apache2/access.log, 40 3 > prefetch.conf

This is pretty crude and there's plenty of room for improvement, but I've found that it's enough to make the static parts of my site feel noticeably snappier. Give it a try, and let me know how you get on.