Wednesday, April 29, 2009

Extending the lifetime of data

I happened to be over at my wife's office last week.

She works at a small law firm, and one of the attorneys was trying to locate some information on an old case that the company had worked on.

Well, not that old, really: the case was from 2003. Their law firm has been around for decades, as have many of their attorneys, so to them this was actually a "recent" case.

But not to the computer, it wasn't.

Consider just some of the things that have happened since they worked on that case:
  • The company has moved their offices.
  • The company has upgraded all their client machines from Windows 98 to Windows XP.
  • The company has upgraded all their server software from Novell to Windows 2000 Server to Windows 2003 Server.
  • The company has switched their out-sourced IT provider.
  • The company has upgraded their server hardware, twice.
  • The company has upgraded the legal case-annotation software that they use, at least twice, and has moved from a per-laptop configuration of that software to a server-based configuration.
  • The company has changed the vendor of their backup software, and has upgraded that backup software several times, and has changed their policies for how they handle backups.
  • Probably lots of other changes that I've forgotten.
The point is that all this change has occurred in just the last six years. That's an astonishing rate of change, and they aren't even a very aggressive shop about changing things; they're actually pleasantly conservative and careful in how they approach their internal IT infrastructure.

I'm sure this same sort of thing is happening around the world; every organization must be finding that their data is being blasted by an equivalent tornado of change. How do we, as the people in the world who care about IT, provide any sort of assurance that the important data of the world will survive, for not just 5 years, but for decades or centuries?

The current technique appears to be a continual copying forward of all known information from the previous generation of hardware/software to the current one. That's a righteous annoyance, and a strategy that requires ever-increasing time as the volume of data grows. Furthermore, it doesn't deal with the question of how to safely "archive" information that one no longer wishes to update or modify, but wants to preserve for a long time, as in my wife's example ("delete this from the active cases, but keep a backup that we can reliably recover data from for at least 15 years").
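To make the "copy forward" strategy concrete, here's a minimal sketch (in Python, with entirely hypothetical field names and migration steps; the post doesn't describe any specific format): each generation of software defines a migration from the previous record format, and an old archive survives only if every link in the chain is replayed, in order, forever.

```python
# Hypothetical sketch of "copying forward": every upgrade adds a
# migration step that all older data must eventually pass through.

def migrate_v1_to_v2(record):
    # Imagined change: newer software requires a "matter_id" field,
    # derived here from the old "case_no" field.
    return {**record, "matter_id": record.get("case_no", "unknown")}

def migrate_v2_to_v3(record):
    # Imagined change: the upgraded case-annotation software renames
    # "notes" to "annotations".
    out = dict(record)
    out["annotations"] = out.pop("notes", [])
    return out

# The full chain grows with every generation of hardware/software.
MIGRATIONS = [migrate_v1_to_v2, migrate_v2_to_v3]

def copy_forward(record, from_version):
    """Replay every migration after `from_version` to reach the current format."""
    for step in MIGRATIONS[from_version - 1:]:
        record = step(record)
    return record

# A "2003-era" record must traverse the entire chain to be readable today.
old_case = {"case_no": "2003-0117", "notes": ["deposition scheduled"]}
current = copy_forward(old_case, from_version=1)
```

Note that the cost of this scheme is proportional to (number of generations) × (volume of archived data), which is exactly why it demands ever-increasing time as the archive grows.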



  1. An interesting problem. If industry could agree on a small number of basic data storage protocols, then one could separate the data from the software that manipulates it. So the updates would affect only the software and not the underlying data.

    Or am I totally missing the point?


  2. Isn't that what XML is for? :)

    This is probably the lifeblood of the software industry. If you didn't have to constantly upgrade to keep your old data accessible and to share new data with others, software companies would probably make a fraction of the revenue they do.

    So from a business perspective, this is a feature, not a bug.