At my day job, we've been engaged in a fascinating discussion recently regarding the limits of software trust. Our product is used to store objects of incredible value, and our users depend completely on us to preserve and retain that value. And when you try to build software that approaches the level of reliability required to preserve tens of millions of files holding dozens of petabytes of content, you start encountering situations where simple measures of trust break down.
If your program writes a block of data to the disk, how confident are you that the block was actually written, with exactly the contents you provided, to exactly the location you specified?
More intriguingly, if your program reads a block of data from the disk, how confident are you that it is the same block of data that your program wrote there, some time ago?
Unfortunately, hardware fails, software fails, and users make mistakes. Bits rot, gamma rays and voltage spikes occur, cables fray and wiggle, subtle bugs afflict drivers, and well-meaning administrators mis-type commands while signed on with high privileges. So once you have fixed all the easy bugs, and closed all the simple design holes, and built all the obvious tests, and created all the routine infrastructure, a program which really, really cares about these issues finds itself facing these core issues of trust.
There's only one technique for addressing these problems, and it's two-part:
- Provide recoverability of your data via backups and/or replication,
- and have a way to detect damaged data.
The first part ensures that you are able to repair damaged data; the second part ensures that you know which data need to be repaired.
Most people know a lot about backups and replication, but by themselves they're not enough. If you can't detect damaged data, then your backups have no value.
Most organizations that I've dealt with depend, without really realizing it, on detecting damaged data at the time that it's next used. Unfortunately, that is far, far too late:
- By then, the damage is impacting your business; you've got downtime due to the damage.
- Worse, by then, the damaged data may have already made it into your backups; your backups themselves may be damaged. And then you're done for.
The sooner you detect damaged data, and the closer to the source of the damage that you detect it, the faster you can repair it, and the more likely you are to be able to track down the source of the damage (buggy software, user error, intermittent power fluctuation, overheating DIMM board, etc.).
Computer science has one basic technique for detecting damaged data, though it goes by several names: checksums, CRCs, hashes, digests, fingerprints, and signatures among others. Here's a nice summary from Wikipedia:
A cryptographic hash function is a deterministic procedure that takes an arbitrary block of data and returns a fixed-size bit string, the (cryptographic) hash value, such that an accidental or intentional change to the data will change the hash value. The data to be encoded is often called the "message", and the hash value is sometimes called the message digest or simply digest.
The basic idea is that:
- At some point, when you are confident that your data is valid, you compute the checksum and save it away.
- Later, when you want to know if your data has been damaged, you re-compute the checksum and compare it against the checksum that you saved before. If they don't match, your data has been damaged (or, perhaps, your checksum has been damaged).
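Those two steps can be sketched in a few lines of Python. This is only an illustration, not how any particular product does it: the choice of SHA-256, the helper names `file_digest` and `is_intact`, and the temporary file used in `demo` are all assumptions made for the example.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def is_intact(path, saved_digest):
    """Re-compute the digest and compare it with the one saved earlier."""
    return file_digest(path) == saved_digest


def demo():
    """Save a digest while the data is known good, then damage one bit."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "object.bin")
        with open(path, "wb") as f:
            f.write(b"precious data" * 1000)

        # Step 1: the data is known to be valid, so record its checksum.
        saved = file_digest(path)
        ok_before = is_intact(path, saved)

        # Simulate bit rot: flip a single bit somewhere in the file.
        with open(path, "r+b") as f:
            f.seek(5000)
            byte = f.read(1)
            f.seek(5000)
            f.write(bytes([byte[0] ^ 0x01]))

        # Step 2: re-compute and compare; the mismatch reveals the damage.
        ok_after = is_intact(path, saved)
    return ok_before, ok_after
```

In real use the saved digest has to live somewhere other than right next to the data it protects, since otherwise the same accident can take out both.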
Note that there is a natural symmetry to the use of checksums:
- Checksum the data when you write it out, then verify the checksum when you read it back in.
- Checksum the data when you send it over the net, then verify the checksum when you receive the data over the net.
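One way to get that symmetry, whether the bytes are headed for a disk or a socket, is to wrap each block in a small frame that carries its own checksum. The sketch below is a made-up format for illustration only: the names `seal`, `unseal`, and `CorruptDataError`, the 4-byte length header, and the use of SHA-256 are all assumptions, not a standard.

```python
import hashlib
import struct

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256
HEADER_SIZE = 4                             # big-endian payload length


class CorruptDataError(Exception):
    """Raised when a frame fails its integrity check."""


def seal(payload: bytes) -> bytes:
    """Writer/sender side: length header + payload + digest of both."""
    header = struct.pack(">I", len(payload))
    digest = hashlib.sha256(header + payload).digest()
    return header + payload + digest


def unseal(frame: bytes) -> bytes:
    """Reader/receiver side: verify the digest before trusting the payload."""
    if len(frame) < HEADER_SIZE + DIGEST_SIZE:
        raise CorruptDataError("frame too short")
    (length,) = struct.unpack(">I", frame[:HEADER_SIZE])
    if len(frame) != HEADER_SIZE + length + DIGEST_SIZE:
        raise CorruptDataError("length does not match frame size")
    body = frame[:HEADER_SIZE + length]
    digest = frame[HEADER_SIZE + length:]
    if hashlib.sha256(body).digest() != digest:
        raise CorruptDataError("checksum mismatch")
    return body[HEADER_SIZE:]
```

The point of the design is that `unseal` refuses to hand back data it cannot vouch for, so damage shows up as a loud error at the moment of reading rather than as quietly wrong bytes later on.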
Given that there are reliable, free, widely-available libraries of checksum code out there, this should be the end of the discussion. But of course life is never so simple. Programs fail to checksum data at all, or they checksum it but don't later verify the checksum, or they checksum the data after the damage has already happened. Or they use a weak checksum, which doesn't offer very extensive protection. Or they do all the checksumming properly, but don't tie the error checking into the backup-recovery system, so they don't have a way to prevent damaged data from "leaking" into the backups, leading to damaged and useless backups and permanent data loss. There is a world of details, and it takes years to work through them and get them all right.
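As a rough sketch of what tying the checking into the backup step might look like, here is a toy backup routine that refuses to copy any file whose current digest no longer matches a previously saved manifest. Everything here is hypothetical: the manifest is just a dict of file name to SHA-256 hex digest, the names `backup_verified` and `sha256_of` are invented for the example, and files are assumed to sit in one flat directory and be small enough to read into memory.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """SHA-256 hex digest of a (small) file, read in one go for brevity."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup_verified(manifest: dict, src_dir: Path, dst_dir: Path) -> list:
    """Copy only files whose digest matches the manifest.

    Returns the names of files that failed verification, so the caller
    can raise an alarm instead of silently backing up damaged data.
    """
    damaged = []
    dst_dir.mkdir(parents=True, exist_ok=True)
    for name, expected in manifest.items():
        src = src_dir / name
        if not src.is_file() or sha256_of(src) != expected:
            damaged.append(name)  # do not let this leak into the backup
            continue
        dst = dst_dir / name
        shutil.copy2(src, dst)
        # Verify the copy as well; the copy step itself can go wrong.
        if sha256_of(dst) != expected:
            dst.unlink()
            damaged.append(name)
    return damaged


def demo():
    """Build two files, damage one after the manifest is taken, back up."""
    with tempfile.TemporaryDirectory() as d:
        src, dst = Path(d) / "live", Path(d) / "backup"
        src.mkdir()
        (src / "a.bin").write_bytes(b"alpha" * 100)
        (src / "b.bin").write_bytes(b"bravo" * 100)
        manifest = {p.name: sha256_of(p) for p in sorted(src.iterdir())}
        (src / "b.bin").write_bytes(b"brav0" * 100)  # simulated damage
        damaged = backup_verified(manifest, src, dst)
        copied = sorted(p.name for p in dst.iterdir())
    return damaged, copied
```

A real system also has to decide what a mismatch means (damage, or a legitimate update that never refreshed the manifest), which is exactly the kind of detail that takes years to get right.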
If you've got this far, and you're still interested in learning more about this subject, here are two resources to explore next:
- Stone and Partridge's detailed analysis of error patterns in TCP traffic on the Internet: When The CRC and TCP Checksum Disagree. Here's an excerpt from the conclusion:
In the Internet, that means that we are sending large volumes of incorrect data without anyone noticing. ... The data suggests that, on average, between one packet in 10 billion and one packet in a few millions will have an error that goes undetected. ... Our conclusion is that vital applications should strongly consider augmenting the TCP checksum with an application sum.
- Valerie Aurora's position paper, An Analysis of Compare-by-Hash, on the use of cryptographic hashes as surrogates for data, particularly in places such as disk de-dup, file replication, and backup. Here's an excerpt from her paper:
On a more philosophical note, should software improve on hardware reliability or should programmers accept hardware reliability as an upper bound on total system reliability? What would we think of a file system that had a race condition that was triggered less often than disk I/O errors? What if it lost files only slightly less often than users accidentally deleted them? Once we start playing the game of error relativity, where do we stop? Current software practices suggest that most programmers believe software should improve reliability -- hence we have TCP checksums, asserts for impossible hardware conditions, and handling of I/O errors. For example, the empirically observed rate of undetected errors in TCP packets is about 0.0000005%. We could dramatically improve that rate by sending both the block and its SHA-1 hash, or we could slightly worsen that rate by sending only the hash.
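To make the weakness of a simple sum concrete, here is a small sketch that is not taken from either paper. It implements a 16-bit ones'-complement sum in the style of RFC 1071, the kind of arithmetic the TCP checksum is built on, and shows that swapping two 16-bit words leaves it unchanged while CRC-32 notices. The function name `internet_checksum` and the sample bytes are made up for the example, and the real TCP checksum also covers a pseudo-header that is ignored here.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement sum over 16-bit words (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"AABBCCDD"
swapped = b"BBAACCDD"  # the first two 16-bit words have traded places

# Addition is commutative, so reordering words cannot change the sum.
same_sum = internet_checksum(original) == internet_checksum(swapped)

# CRC-32 depends on position, so it catches this reordering.
same_crc = zlib.crc32(original) == zlib.crc32(swapped)
```

Here `same_sum` comes out true and `same_crc` comes out false, which is one small instance of the case for layering a stronger application-level check on top of the transport's own.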
Have a good time out there, and keep that precious data safe!