For example, many program source files are filled with lines like:
- blank line
- {
- }
- return;
- /*
- */
and so forth.
LCS-based diff programs concentrate a lot of energy in finding long sequences of common lines like these, and then trying to match up those common sequences of uninteresting lines with a few content-heavy diff blocks.
Patience Diff, instead, focuses its energy on the low-frequency high-content lines which serve as markers or signatures of important content in the text. It is still an LCS-based diff at its core, but with an important difference, as it only considers the longest common subsequence of the signature lines:
Find all lines which occur exactly once on both sides, then do longest common subsequence on those lines, matching them up.
Here's a superb writeup, with side-by-side examples, that does a great job of illustrating why these signatures lines produce a much clearer diff.
It's not obviously clear whether Patience Diff is necessarily faster than the classic diff algorithms, or whether it has its own set of pathological cases; I suspect all diff algorithms are somewhat susceptible to such weaknesses. But it's a very clearly-presented idea, with a powerful intuitive basis, and it will be very interesting to see how it spreads as more people become familiar with it.
No comments:
Post a Comment