Published Nov 26, 2003

Today, you can read this entry at http://wadearmstrong.com. In a few days (or, if I’m not being so productive, weeks), it’ll be pushed off the front page and into an archive page, where it will be found at an address like http://wadearmstrong.com/archives/000316.html. Links that went to the first address will start being, well, lies, since the content they referred to will no longer be there. And what use is that on the internet?

Apparently, not much. Every day, information is lost because links “rot” and either start pointing to different content or no content at all (the dreaded “error 404”). We’re not just talking broken links on My First Webpage; real professionals and researchers are using Web sites in footnotes and reference lists. Links are rotting at a rate of up to 15% in one year and 50% in four years.
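As a rough sanity check on those two figures, here is a small sketch that assumes links rot at a constant rate each year. That constant-rate model is my assumption, not something the numbers above claim; the function name is made up for illustration.

```python
def surviving_fraction(annual_rot_rate, years):
    """Fraction of links still working after `years`,
    assuming the same share of surviving links rots every year."""
    return (1.0 - annual_rot_rate) ** years


# Assumed model: 15% of the remaining links rot each year, for four years.
rotted_after_four = 1.0 - surviving_fraction(0.15, 4)
print(f"{rotted_after_four:.1%}")  # prints 47.8%
```

Under that assumption, 15% a year compounds to a little under half of all links gone after four years, which is in the same neighborhood as the 50% figure.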

Links rot because of how URLs point to resources. In http://www.wadearmstrong.com/archives/000316.html you have the protocol used to connect (http://), the domain name (wadearmstrong.com), the specific server to connect to (www), the folder on that server to look in (/archives/), the page to look for (000316), and the type of file that page is (.html). A lot of that information is redundant, or was only ever included because it was once fashionable, so a lot of it can be removed. For instance, most Web browsers will assume the http://. The www is (and this may come as a surprise to many) also redundant. In most cases the file type is unnecessary as well: is there any difference between a static .html page and a dynamically-generated .php or .asp one? Not to the Web browser, which must ultimately show the information on the page, but changing a page’s name from 000316.html to 000316.py will break any link to that page. Some folks suggest getting rid of the file type altogether, which is easy to do with most Web servers.
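To make the breakdown above concrete, here is a minimal sketch using Python's standard `urllib.parse` and `posixpath` modules. The function names `describe_url` and `strip_redundant` are made up for this example, and treating a leading "www." as "the server" is a simplification that fits this URL, not a general rule about hostnames.

```python
import posixpath
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the pieces named in the paragraph above."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Simplifying assumption: a leading "www." is the server label,
    # and whatever follows it is the domain name.
    if host.startswith("www."):
        server, domain = "www", host[4:]
    else:
        server, domain = "", host
    folder, filename = posixpath.split(parts.path)
    page, file_type = posixpath.splitext(filename)
    return {
        "protocol": parts.scheme,
        "server": server,
        "domain": domain,
        "folder": folder,
        "page": page,
        "file_type": file_type,
    }


def strip_redundant(url):
    """Rebuild the address without the "www" and without the file type."""
    p = describe_url(url)
    return f"{p['protocol']}://{p['domain']}{p['folder'].rstrip('/')}/{p['page']}"


print(describe_url("http://www.wadearmstrong.com/archives/000316.html"))
print(strip_redundant("http://www.wadearmstrong.com/archives/000316.html"))
```

The second call prints http://wadearmstrong.com/archives/000316, and it would print the same thing for 000316.py, which is the whole point of leaving the file type out of the link.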

So I could change the URL of this page to http://wadearmstrong.com/archives/000316 (or even the more useful http://wadearmstrong.com/archives/urls). But will that help the linkrot problem? Only to a moderate extent. Suppose I decide to buy a new domain name: this page might move to http://example.com/archives/urls simply because I like the new name better, and every link to the old address would break anyway.

There are a few ways to find a page that has totally disappeared. You can check out http://archive.org, a site that saves old pages. Google maintains a “cache” (http://www.google.com/help/features.html#cached) of how pages looked the last time they were visited. These are both convenient, but there’s no guarantee that any given page will be saved. URLs are not unique or guaranteed; “URNs” (http://www.ietf.org/rfc/rfc2141.txt) are. Not that anybody uses URNs. Domain names (like wadearmstrong.com) could conceivably be supplemented by ISSNs. But who would want to visit a Web site with the URL issn://231982331698763/archives/url? Or, even worse, URN:wadearmstrong:foo:47? Nope, I like my vanity domain.

Vanity domains are all well and good for scribblings like this, but what of the mission-critical links mentioned at the beginning? A lost reference in a scientific article can throw the conclusions of that paper into doubt. A central registry, using ugly-looking identifiers like the ones above, may be well-suited as a tool to maintain permanent links to important documents; authors could pay a small fee to enter the URL of their document in the registry and then keep it updated manually. Authors citing these documents could then use the registry address (for instance, registry://w/a/47/aurngeyw/123756) in their citations, and be assured that this identifier would forward to http://harvard.edu/mypaper/mypaper.htm or http://nsa.gov/research/4747.asp or to wherever the document travels.
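For the curious, the registry idea boils down to a lookup table that authors keep up to date. Here is a toy, in-memory sketch; the `LinkRegistry` class and its method names are hypothetical, and a real registry would also need storage, fees, and some way to prove who owns an entry, none of which is shown.

```python
class LinkRegistry:
    """Toy registry mapping permanent identifiers to a document's current URL."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, url):
        """Enter a document into the registry for the first time."""
        if identifier in self._locations:
            raise ValueError(f"{identifier} is already registered")
        self._locations[identifier] = url

    def update(self, identifier, new_url):
        """Record that the document moved; the identifier itself never changes."""
        if identifier not in self._locations:
            raise KeyError(identifier)
        self._locations[identifier] = new_url

    def resolve(self, identifier):
        """Return the URL the identifier currently forwards to."""
        return self._locations[identifier]


registry = LinkRegistry()
paper_id = "registry://w/a/47/aurngeyw/123756"
registry.register(paper_id, "http://harvard.edu/mypaper/mypaper.htm")
# The paper moves; the author updates the entry by hand.
registry.update(paper_id, "http://nsa.gov/research/4747.asp")
print(registry.resolve(paper_id))
```

Anyone who cited the registry address keeps a working link after the move, because only the table entry changed, not the identifier in the footnote.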

Somebody should take out a patent!