Tuesday, May 11, 2010

Patience Diff

Bram Cohen's Patience Diff algorithm is really quite clever. The core insight of Patience Diff is that most LCS-based diff algorithms, such as the Hunt-McIlroy and Miller-Myers diff algorithms that I've been studying, are easily misled by certain high-frequency, low-content lines of text which occur in programming language texts.

For example, many program source files are filled with lines like:

  • blank line

  • {

  • }

  • return;

  • /*

  • */


and so forth.

LCS-based diff programs concentrate a lot of energy in finding long sequences of common lines like these, and then trying to match up those common sequences of uninteresting lines with a few content-heavy diff blocks.

Patience Diff, instead, focuses its energy on the low-frequency high-content lines which serve as markers or signatures of important content in the text. It is still an LCS-based diff at its core, but with an important difference, as it only considers the longest common subsequence of the signature lines:

Find all lines which occur exactly once on both sides, then do longest common subsequence on those lines, matching them up.


Here's a superb writeup, with side-by-side examples, that does a great job of illustrating why these signatures lines produce a much clearer diff.

It's not obviously clear whether Patience Diff is necessarily faster than the classic diff algorithms, or whether it has its own set of pathological cases; I suspect all diff algorithms are somewhat susceptible to such weaknesses. But it's a very clearly-presented idea, with a powerful intuitive basis, and it will be very interesting to see how it spreads as more people become familiar with it.

No comments:

Post a Comment