Friday, May 14, 2010

Studying failure

In my day job, we care a lot about server uptime, reliability, and availability. Any little disruption in the server is pored over and studied in great detail. You have to sweat the small stuff.

So I loved this essay on the USENIX blog comparing the IT attitude toward server availability with the efforts of the aviation profession. As the author notes, the aviation industry has a century's more experience with these topics, and so has much to teach us about what it takes to achieve the reliability levels that we desire:

I’m talking about the methods, the drive, and the sheer determination to discover, at all costs, the root cause of the issues that occur in the aviation profession.

Don't miss this link to the author's more detailed essay about redundancy and disaster planning. Beware the fiber backhoe!

