On archiving/preserving websites

SB and I have been chatting about the whys, whens and hows involved in archiving a website. Archiving is always an uphill battle. It’s hard to take care of things as they age no matter what the material, and ageing code comes with a specific set of worries.

Each part of the code ecosystem surrounding a website improves at a different rate over time. Many browsers now update automatically, and good hosting providers tend to keep their server environments reasonably up-to-date. On the other hand, the code used to serve up a website is frequently a little more stagnant. Perhaps it’s been a while since the content management system was updated, or the front end is looking a little iffy because it hasn’t been revisited since launch. Problems arise when the various parts of the code ecosystem fall vastly out of sync with one another. These problems may be innocuous, such as anomalies in the site’s appearance, or more severe: fatal errors when accessing the database, say, or serious vulnerabilities caused by deprecated code.

If a site is going to be left to its own devices for the foreseeable future as an untouched archive, ideally some steps should be taken preemptively to extend its longevity and allow the site owner to be more hands-off with maintenance.

SB raised a good point that hadn’t occurred to me initially: it may be best to preemptively convert a to-be-archived database-driven site into a static site to reduce potential future headaches. If the content will rarely if ever be updated, you don’t really need a database, and you certainly don’t need the responsibility of keeping another content management system up-to-date. Case in point: this static site is nearly a decade old and is famously well-preserved.

I suppose something like SiteSucker would be useful for generating a quick static version of most sites, though some manual content and URL grooming would probably be needed as well. Incidentally, I’ve used it pretty successfully to generate local versions of online-only docsets that I can’t find on Dash, for those rare-but-annoying times I want to get a lot of work done offline (long flights, mainly).
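For the curious, here’s a minimal sketch of the crawl-and-save approach that tools like SiteSucker automate, written in Python using only the standard library. The start URL and output directory are placeholders, and a real archiving job would also need to fetch assets (images, CSS, JavaScript) and rewrite internal links to relative paths – exactly the sort of grooming mentioned above:

```python
import os
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

START_URL = "https://example.com/"  # placeholder: the site to snapshot
OUT_DIR = "archive"                 # placeholder: where static files land

class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def local_path(url):
    """Map a URL to a file path under OUT_DIR, defaulting to index.html."""
    path = urlparse(url).path.lstrip("/") or "index.html"
    if path.endswith("/"):
        path += "index.html"
    if not os.path.splitext(path)[1]:
        path += ".html"
    return os.path.join(OUT_DIR, path)

def crawl(start_url):
    domain = urlparse(start_url).netloc
    queue, seen = [start_url], set()
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to load
        # Save the rendered HTML as a flat file
        dest = local_path(url)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        with open(dest, "w", encoding="utf-8") as f:
            f.write(html)
        # Queue up any same-domain links found on the page
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link).split("#")[0]
            if urlparse(absolute).netloc == domain:
                queue.append(absolute)

if __name__ == "__main__":
    crawl(START_URL)
```

Even a rough script like this makes the tradeoff clear: once the pages are flat files, the only thing left to maintain is the web server pointing at them.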

Flash-based sites pose a particular problem, as the author of this April ‘15 Motherboard article outlines pretty thoroughly. I’d love to know more about these “rare internet files” collected by interviewee Jason Scott… At any rate, Shumway and Swiffy look pretty useful for Flash-to-HTML conversion.

It’s unusual for a site and/or its content to remain relatively unscathed over time in the face of potential host billing lapses, hacked databases, poor backup practices, etc. If you find yourself the steward of a useful, valuable, and reasonably presentable archive of digital content – an enviable position – it’s probably worth preserving that archive. It seems there’s no single best way to go about it, but it’s a fun thing to consider.