Journal of a Programmer: Patience Diff

Tuesday, May 11, 2010

Patience Diff

Bram Cohen's Patience Diff algorithm is really quite clever. The core insight of Patience Diff is that most LCS-based diff algorithms, such as the Hunt-McIlroy and Miller-Myers diff algorithms that I've been studying, are easily misled by certain high-frequency, low-content lines of text which occur in programming language texts.

For example, many program source files are filled with lines like:

(a blank line)

{

}

return;

/* */

and so forth. LCS-based diff programs spend a lot of energy finding long sequences of common lines like these, and then trying to match up those common sequences of uninteresting lines with a few content-heavy diff blocks.

Patience Diff, instead, focuses its energy on the low-frequency, high-content lines which serve as markers or signatures of important content in the text. It is still an LCS-based diff at its core, but with an important difference: it only considers the longest common subsequence of the signature lines:

Find all lines which occur exactly once on both sides, then do longest common subsequence on those lines, matching them up.

Here's a superb writeup, with side-by-side examples, that does a great job of illustrating why these signature lines produce a much clearer diff.

It's not clear whether Patience Diff is necessarily faster than the classic diff algorithms, or whether it has its own set of pathological cases; I suspect all diff algorithms are somewhat susceptible to such weaknesses. But it's a very clearly presented idea, with a powerful intuitive basis, and it will be very interesting to see how it spreads as more people become familiar with it.
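To make the signature-line step concrete, here is a minimal Python sketch of just that part: collect the lines that occur exactly once on both sides, then take the longest common subsequence of them. It is my own illustration, not Bram Cohen's code; the function names `unique_common_lines` and `patience_anchors` are made up for this example. Because each signature line appears once per side, the LCS reduces to a longest increasing subsequence over the matched positions, which the sketch finds with a patience-sorting style binary search.

```python
from bisect import bisect_left
from collections import Counter


def unique_common_lines(a, b):
    """Pair up lines that occur exactly once in a AND exactly once in b.

    Returns (index_in_a, index_in_b) tuples, ordered by index_in_a.
    """
    count_a, count_b = Counter(a), Counter(b)
    index_in_b = {line: j for j, line in enumerate(b) if count_b[line] == 1}
    return [
        (i, index_in_b[line])
        for i, line in enumerate(a)
        if count_a[line] == 1 and line in index_in_b
    ]


def patience_anchors(a, b):
    """Longest common subsequence of the signature lines, as (i, j) pairs.

    The pairs are already sorted by i, so the LCS is the longest strictly
    increasing subsequence of the j values.
    """
    pairs = unique_common_lines(a, b)
    tails = []     # tails[k]: smallest j that ends an increasing run of length k + 1
    tail_idx = []  # tail_idx[k]: position in `pairs` of that j
    prev = [None] * len(pairs)  # back-pointers for rebuilding the run
    for n, (_, j) in enumerate(pairs):
        k = bisect_left(tails, j)
        if k == len(tails):
            tails.append(j)
            tail_idx.append(n)
        else:
            tails[k] = j
            tail_idx[k] = n
        prev[n] = tail_idx[k - 1] if k > 0 else None

    # Walk the back-pointers from the end of the longest run.
    anchors = []
    n = tail_idx[-1] if tail_idx else None
    while n is not None:
        anchors.append(pairs[n])
        n = prev[n]
    anchors.reverse()
    return anchors


old = ["void f() {", "    return;", "}", "", "void g() {", "    return;", "}"]
new = ["void h() {", "    return;", "}", "",
       "void f() {", "    return;", "}", "",
       "void g() {", "    return;", "}"]

print(patience_anchors(old, new))  # [(0, 4), (4, 8)]
```

Note that the repeated `return;` and `}` lines never take part in the matching, and neither does the blank line, which appears twice in `new`. Only the two function headers anchor the diff. A complete diff would still have to deal with the regions between the anchors; this sketch stops at finding them.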

Posted by Bryan Pendleton at 7:14 AM
