Journal of a Programmer: The rsync algorithm

Saturday, May 8, 2010

The rsync algorithm

Andrew Tridgell and Paul Mackerras's 1996 paper on the rsync algorithm is a masterpiece. The algorithm is a masterpiece, and the paper is a masterpiece. In 5 clear and readable pages, the authors describe a brilliant and elegant algorithm which combines difference computation, cryptography, and network communications to provide a remote file update service.

Here's how the paper starts:

Imagine you have two files, A and B, and you wish to update B to be the same as A. The obvious method is to copy A onto B.

Now imagine that the two files are on machines connected by a slow communications link, for example a dial up IP link. If A is large, copying A onto B will be slow. To make it faster you could compress A before sending it, but that will usually only gain a factor of 2 to 4.

Now assume that A and B are quite similar, perhaps both derived from the same original file. To really speed things up you would need to take advantage of this similarity. A common method is to send just the differences between A and B down the link and then use this list of differences to reconstruct the file.

The problem is that the normal methods for creating a set of differences between two files rely on being able to read both files. Thus they require that both files are available beforehand at one end of the link. If they are not both available on the same machine, these algorithms cannot be used (once you had copied the file over, you wouldn't need the differences). This is the problem that rsync addresses.

The essence of the rsync algorithm is its insight into the idea that cryptographic signatures of sections of the text are much smaller than the text itself, but contain enough information to locate sections of common text:

The basic strategy is to compute the 32-bit rolling checksum for a block of length S starting at each byte of A in turn, and for each checksum, search the list for a match.
...
If no match is found at a given offset in the file, the rolling checksum is updated to the next offset and the search proceeds. If a match is found, the search is restarted at the end of the matched block. This strategy saves a considerable amount of computation for the common case where the two files are nearly identical.

Andrew Tridgell's work on rsync eventually become the basis for his PhD thesis, Efficient Algorithms for Sorting and Searching. It's one of those rare PhD theses that is both readable and worth reading (though I haven't made it through all of it). Tridgell is most famous for his work on Samba, the open source implementation of the Microsoft remote-file-and-print protocols that is the basis for most data center inter-operability in almost every machine room I've ever been around. He's also somewhat famous for being the inspiration of one of the more famous breaking-up letters in computer software history, an event which resulted in the creation of Git.