*An O(ND) Difference Algorithm and its Variations*. It was published in the journal *Algorithmica* in 1986. These two papers, together, are really the core of understanding how all modern diff algorithms work. I happened to read *A File Comparison Program* first, though it probably would be better to read them in the opposite order. Well, who knows; read them both!

*An O(ND) Difference Algorithm and its Variations* appears to be the paper where the famous "snakes" are first discussed. The image I get in my mind while reading the paper and looking at the edit graph is of the diff algorithm examining the two files in parallel, searching for the common bits and stitching together the different bits, "snaking" its way back and forth between the two files as it goes. I'm not sure if this is the image that Myers had in mind, but it works for me. A *snake* is a section of common content between the two files, represented diagrammatically as a diagonal portion of the edit graph. Snakes are important because the longer the snakes in your path, the more compact your diff is; the total length of all the snakes is the length of the common subsequence that the dynamic programming algorithm has located:

> The problem of finding a shortest edit script reduces to finding a path from (0,0) to (N,M) with the fewest number of horizontal edges. Let a *D-path* be a path starting at (0,0) that has exactly D non-diagonal edges. A *0-path* must consist solely of diagonal edges. By a simple induction, it follows that a *D-path* must consist of a *(D-1)-path* followed by a non-diagonal edge and then a possibly empty sequence of diagonal edges called a *snake*.
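That definition maps directly onto the paper's greedy forward search: take one non-diagonal step, then follow the snake as far as it will go. Here is a minimal Python sketch of that search, just to see the snakes in action (the function name is mine, and I'm using a dict where the paper uses an array V indexed by diagonal k = x - y):

```python
def myers_edit_distance(a, b):
    """Length D of a shortest edit script turning a into b."""
    n, m = len(a), len(b)
    max_d = n + m
    # v[k] holds the furthest-reaching x on diagonal k (k = x - y).
    v = {1: 0}
    for d in range(max_d + 1):
        for k in range(-d, d + 1, 2):
            # One non-diagonal edge: move down (insert) or right (delete),
            # extending whichever (d-1)-path reaches furthest.
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]        # step down from diagonal k+1
            else:
                x = v[k - 1] + 1    # step right from diagonal k-1
            y = x - k
            # Follow the snake: a maximal run of common content.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return max_d
```

Running it on the example from the paper, `myers_edit_distance("ABCABBA", "CBABAC")` gives 5, and the length of the common subsequence falls out as (N + M - D) / 2.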

Snakes are all over the rest of the paper, as you end up reading sentences like:

> The final snake must be maximal, as the *D-path* would not be furthest reaching if the snake could be extended.

After a little while, you start to get pretty comfortable with the snake notion, though my wife still doesn't want me extending any maximal snakes around the house!

Later in the paper, there is also a fantastic diagram explaining the "contours" notion that you find in other papers, in which the dynamic programming algorithm effectively explores the space of all possible edit scripts which have the same edit distance. The Myers paper calls these "edit graph boundaries".

Both the Myers papers have stood up very well over time. I've seen a number of more recent papers which attempt to re-present and re-analyze the problem, but I haven't seen anything which is superior to the two Myers papers, and most other presentations are in fact far weaker, so reading these two papers is clearly the place to start.

Now I'm off to go track down a reference to Bram Cohen's *Patience Diff* algorithm, which a poster on Stack Overflow claimed was the new diff algorithm in the *bazaar* version control system. (Bram Cohen is a famously confident programmer, but it still seems rather bold to title your essay "The diff problem has been solved.")
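From what I can gather so far, the heart of patience diff is: take the lines that occur exactly once in each file, pair them up, and keep the longest run of pairs whose order agrees in both files, found with patience sorting. A rough Python sketch of that core idea (the names are mine, and this skips the recursion between anchors that a real implementation would do):

```python
from bisect import bisect_left
from collections import Counter

def patience_lis(pairs):
    """Longest subsequence of (i, j) pairs with both coordinates
    increasing, via patience sorting. `pairs` arrives sorted by i."""
    tops = []   # pair currently on top of each pile
    keys = []   # j-coordinate of each pile top, for binary search
    prev = {}   # pair -> pair on the pile to its left when placed
    for p in pairs:
        k = bisect_left(keys, p[1])
        if k > 0:
            prev[p] = tops[k - 1]
        if k == len(tops):
            tops.append(p)
            keys.append(p[1])
        else:
            tops[k] = p
            keys[k] = p[1]
    # Walk the back-pointers from the top of the rightmost pile.
    chain = []
    p = tops[-1] if tops else None
    while p is not None:
        chain.append(p)
        p = prev.get(p)
    return chain[::-1]

def patience_anchors(a, b):
    """Match lines unique to both files, in an order both agree on.
    A full patience diff would recurse between consecutive anchors."""
    count_a, count_b = Counter(a), Counter(b)
    pos_b = {line: i for i, line in enumerate(b)}
    pairs = [(i, pos_b[line]) for i, line in enumerate(a)
             if count_a[line] == 1 and count_b[line] == 1]
    return patience_lis(pairs)
```

For example, `patience_anchors(["a", "b", "c", "d"], ["b", "a", "c", "d"])` yields `[(1, 0), (2, 2), (3, 3)]`: three unique lines both files agree on, with the rest left for an ordinary diff to fill in.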

For the latest thoughts on patience diff, see here:

http://bramcohen.livejournal.com/73318.html

I suspect it's possible to implement patience diff more efficiently than LCS-based diff, but the current implementations are optimized for understandability and correctness, so one optimized for performance would be much appreciated.

And I'm not quite that cocky, I just like writing attention-grabbing headlines :-)