Friday, May 7, 2010

The other Myers diff paper

The other Myers diff paper is An O(ND) Difference Algorithm and its Variations. It was published in the journal Algorithmica in 1986. These two papers, together, are really the core of understanding how all modern diff algorithms work. I happened to read A File Comparison Program first, though it probably would be better to read them in the opposite order. Well, who knows; read them both!

An O(ND) Difference Algorithm and its Variations appears to be the paper where the famous "snakes" are first discussed. The image that I get in my mind while reading the paper and looking at the edit graph, is of the diff algorithm examining the two files in parallel, searching for the common bits, and stitching together the different bits, "snaking" its way back and forth between the two files as it goes. I'm not sure if this is the image that Myers had in mind, but it works for me. A snake is a section of common content between the two files, represented diagrammatically as a diagonal portion of the edit graph, and snakes are important because, the longer the snakes are in your path, the more compact your diff is; the total length of all the snakes is the length of the common subsequence that the dynamic algorithm has located:

The problem of finding a shortest edit script reduces to finding a path from (0,0) to (N,M) with the fewest number of horizonal edges. Let a D-path be a path starting at (0,0) that has exactly D non-diagonal edges. A 0-path must consist solely of diagonal edges. By a simple induction, it follows that a D-path must consist of a (D-1)-path followed by a non-diagonal edge and then a possibly empty sequence of diagonal edges called a snake.

Snakes are all over the rest of the paper, as you end up reading sentences like:

The final snake must be maximal, as the D-path would not be furthest reaching if the snake could be extended.

After a little while, you start to get pretty comfortable with the snake notion, though my wife still doesn't want me extending any maximal snakes around the house!

Later in the paper, there is also a fantastic diagram explaining the "contours" notion that you find in other papers, in which the dynamic programming algorithm effectively explores the space of all possible edit scripts which have the same edit distance. The Myers paper calls these "edit graph boundaries".

Both the Myers papers have stood up very well over time. I've seen a number of more recent papers which attempt to re-present and re-analyze the problem, but I haven't seen anything which is superior to the two Myers papers, and most other presentations are in fact far weaker, so reading these two papers is clearly the place to start.

Now I'm off to go track down a reference to Bram Cohen's Patience Diff algorithm, which a poster on Stack Overflow claimed was the new diff algorithm in the bazaar version control system. (Bram Cohen is a famously confident programmer, but it still seems rather bold to title your essay "The diff problem has been solved.")

1 comment:

1. For the latest thoughts on patience diff, see here:

http://bramcohen.livejournal.com/73318.html

I suspect it's possible to implement patience diff more efficiently than LCS-based diff, but the current implementations are optimized for understandability and correctness, so one optimized for performance would be much appreciated.

And I'm not quite that cocky, I just like writing attention-grabbing headlines :-)