Recently, I've been reading two papers by Walter Tichy.
The first paper is almost 30 years old, and dates from Tichy's work at Purdue during the development of RCS. From the introduction:
The string-to-string correction problem is to find a minimal sequence of edit operations for changing a given string into another given string. The length of the edit sequence is a measure of the differences between the two strings.
At the time, the best-known diff algorithm was Doug McIlroy's Unix diff algorithm (more on that in a future post), which is based on the detection of the Longest Common Subsequence. As Tichy shows, the LCS-based algorithms, while computationally related to the edit sequence programs, are not necessarily the best for use in difference construction.
Tichy's basic algorithm is surprisingly simple to state:
Start at the left end of the target string T, and try to find prefixes of T in S. If no prefix of T occurs in S, remove the first symbol from T and start over. If there are prefixes, choose the longest one and record it as a block move. Then remove the matched prefix from T and try to match a longest prefix of the remaining tail of T, again starting at the beginning of S. This process continues until T is exhausted. The recorded block moves constitute a minimal covering set of block moves.
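The greedy procedure above can be sketched directly in code. This is a naive, unoptimized transcription for illustration, not Tichy's efficient construction (his paper develops linear-time versions using suffix trees and KMP-style machinery); the operation names are my own:

```python
def block_moves(source, target):
    """Greedy block-move construction, per Tichy's statement above.

    Returns ("copy", src_pos, length) for a block moved out of `source`,
    or ("add", symbol) for a symbol of `target` that never occurs in
    `source`. Brute-force scan of `source` at every step, so this runs
    in polynomial rather than linear time.
    """
    ops = []
    i = 0
    while i < len(target):
        # Find the longest prefix of target[i:] occurring anywhere in source.
        best_len, best_pos = 0, -1
        for j in range(len(source)):
            k = 0
            while (i + k < len(target) and j + k < len(source)
                   and source[j + k] == target[i + k]):
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len == 0:
            # No prefix of the remaining target occurs in source:
            # drop one symbol and start over.
            ops.append(("add", target[i]))
            i += 1
        else:
            ops.append(("copy", best_pos, best_len))
            i += best_len
    return ops
```

Running it on `source="abcdef"`, `target="defabc"` yields two block moves, `("copy", 3, 3)` and `("copy", 0, 3)` — the recorded moves cover the whole target.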
After working through a proof of the basic algorithm, Tichy briefly touches on two variations:
Program text and prose have the property of few repeated lines. ... To speed up comparisons, the program should use hashcodes for lines of text rather than performing character-by-character comparisons.
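The line-hashing idea is easy to picture: replace each line with a short checksum, and let the differ compare single integers instead of whole lines. A minimal sketch (using CRC-32 as the hash; the choice of hash function is my own, not the paper's):

```python
import zlib

def hash_lines(text):
    """One 32-bit checksum per line of `text`.

    Equal checksums almost certainly mean equal lines; a careful
    implementation re-verifies matches character-by-character to
    rule out hash collisions.
    """
    return [zlib.crc32(line.encode()) for line in text.splitlines()]

old = "def f():\n    return 1\n"
new = "def f():\n    return 2\n"
h_old, h_new = hash_lines(old), hash_lines(new)
assert h_old[0] == h_new[0]   # first line unchanged
assert h_old[1] != h_new[1]   # second line differs
```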
An important element in the Knuth-Morris-Pratt algorithm is an auxiliary array N which indicates how far to shift a partially matched pattern or block move after a mismatch. ... Fortunately, N can also be computed incrementally.
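For reference, the auxiliary array N is the standard KMP "failure function": N[i] is the length of the longest proper prefix of the pattern that is also a suffix of pattern[:i+1], which tells the matcher how much of the pattern is still known to match after a mismatch. A textbook computation (Tichy's contribution, not shown here, is maintaining it incrementally as the pattern grows):

```python
def failure_array(pattern):
    """KMP auxiliary array: N[i] = length of the longest proper prefix
    of pattern[:i+1] that is also a suffix of it."""
    N = [0] * len(pattern)
    k = 0  # length of the current matched prefix
    for i in range(1, len(pattern)):
        # On mismatch, fall back to the next-shorter border.
        while k > 0 and pattern[i] != pattern[k]:
            k = N[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        N[i] = k
    return N
```

For example, `failure_array("ababaca")` is `[0, 0, 1, 2, 3, 0, 1]`.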
The first variation finds an interesting expression 15 years later in the work of Andrew Tridgell on the rsync algorithm, which I'll discuss in a future post.
Delta Algorithms: An Empirical Analysis describes Tichy's work in benchmarking diff algorithms. The paper contains dozens of scatter-plot diagrams of the various benchmark tests, as well as a fine high-level discussion of the complexity of building a suitable benchmark for diff:
The first problem encountered when defining a benchmark is finding an appropriate data set that is both large enough for the results to be statistically significant and representative of real world applications. For delta algorithms, the most important quality of any benchmark is that it contain a wide spectrum of change examples. This means that both the size of the changes represented and the size of the files involved should vary considerably. Large changes on small files and small changes on large files should be included as well as small changes on small files and large changes on large files.
Furthermore, the benchmark should contain a variety of formats, in particular pure text, pure object code, and pseudo text.
The paper also describes a diff algorithm variation which they call vdelta:
Vdelta is a new technique that combines both data compression and data differencing. It is a refinement of W.F. Tichy's block-move algorithm, in that, instead of a suffix tree, vdelta uses a hash table approach inspired by the data parsing scheme in the 1978 Ziv-Lempel compression technique. Like block-move, the Ziv-Lempel technique is also based on a greedy approach in which the input string is parsed by longest matches to previously seen data. ... Vdelta generalizes Ziv-Lempel and block-move by allowing for string matching to be done both within the target data and between a source data and a target data. For efficiency, vdelta relaxes the greedy parsing rule so that matching prefixes are not always maximally long.
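The hash-table parsing scheme can be sketched as follows. This is my own simplified illustration of the idea — index k-grams of all data seen so far (source first, then the target as it is consumed), and parse the target by matching against those positions, so a match can reach back into the source (differencing) or into earlier target output (compression). It is not the real vdelta encoding, and it keeps the fully greedy rule that vdelta relaxes:

```python
def vdelta_like(source, target, k=3):
    """Parse `target` as copies from previously seen data plus literals.

    Copies may point into `source` or into already-emitted `target`,
    mirroring vdelta's unification of differencing and compression.
    """
    data = source + target
    index = {}  # k-gram -> most recent position where it was seen
    for j in range(len(source) - k + 1):
        index[data[j:j + k]] = j
    ops, i = [], len(source)
    while i < len(data):
        gram = data[i:i + k]
        j = index.get(gram)
        if j is None:
            ops.append(("add", data[i]))
            if len(gram) == k:
                index[gram] = i
            i += 1
        else:
            # Greedily extend the match as far as it goes.
            n = 0
            while i + n < len(data) and data[j + n] == data[i + n]:
                n += 1
            ops.append(("copy", j, n))
            # Index the k-grams of the data just consumed.
            for p in range(i, i + n):
                if p + k <= len(data):
                    index[data[p:p + k]] = p
            i += n
    return ops
```

Here `vdelta_like("abcabc", "abcx")` produces `[("copy", 3, 3), ("add", "x")]` — the copy is resolved against the source region of the combined buffer.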
Over the years, there have been a number of fundamental attempts to construct differencing algorithms. It would be an ideal world if the "best" algorithm always became the best known and most widely used. However, the discussion and analysis of algorithms is a complex intellectual activity, and many factors other than the qualities of the actual algorithm come into play. Perhaps most importantly, if an algorithm is not well presented and well described, it can be overlooked and underused, even if it is valuable and effective.
With diff algorithms, it is becoming clear that two things are true:
- There have been a variety of diff algorithms discovered and re-discovered over the years, but many of them are neither well described nor easy to find: the papers are scattered, hard to locate, and behind ACM or IEEE paywalls; and even when the papers are tracked down, they are confusing and hard to read.
- The two Myers papers ("A file comparison program" and "An O(ND) difference algorithm and its variations") are so well-written and so well-known that they have pretty much dominated the discussion.
I've still got a number of other papers to study, but for now I think I've learned what I can from Tichy's work.