Sunday, April 11, 2010

A close reading of "diff"

In literary studies, "close reading" is a technique in which the reader conducts an extremely thorough and detailed study of a relatively short passage. It's not an easy skill to learn, but, once learned, it can be very successful at revealing insights about the work you're studying.

Here are a few nice introductions to the notion of close reading. Lynch describes the purpose of close reading quite nicely:

That means reading every word: it's not enough to have a vague sense of the plot. Maybe that sounds obvious, but few people pay serious attention to the words that make up every work of literature. Remember, English papers aren't about the real world; they're about representations of the world in language. Words are all we have to work with, and you have to pay attention to them.

Although most software has centuries of development to go before it reaches the level of accomplishment that great works of literature have achieved, the serious study of software has many aspects in common with close reading.

Recently, I've been engaged in a close reading of "diff".

You know diff, of course: it's the program which computes the differences between a pair of files, and displays them to you.

Diff has been around a long while, and has been very closely studied. It's also a great example of a program where the proper choice of algorithm makes an enormous difference. So it's important to start, before reading the code, by understanding the algorithm.

In the case of diff, it turns out that there are some great papers that, over the years, have done a wonderful job of explaining how diff works. Here's three papers that tend to be most commonly credited as the most important:

  • The String-to-String Correction problem, Robert Wagner and Michael Fischer. This paper is 35 years old, and is certainly one of the great early efforts in algorithm description and analysis.

  • The String-to-String Correction Problem with Block Moves, Walter Tichy. Tichy is well-known not only for this work, but also for the development of RCS, one of the first widely-successful version control programs.

  • An O(ND) Difference Algorithm and its Variations, Eugene Myers. This is the paper which introduced the famous notion of "snakes". Here's a great pair of articles working through this algorithm in detail.

I'm planning to get to all of these papers, but for the last week, I've been painstakingly and minutely making my way, word by word, through

  • A File Comparison Program, Webb Miller and Eugene Myers

This last paper is in the journal Software, Practice and Experience, and was published in 1985. It's somewhat more recent than the other papers, and is not therefore "beginning at the beginning", but it's where I chose to start, and so here I am.

There are (at least) three aspects to a diff algorithm:

  1. How good is the result?

  2. How fast does the algorithm run?

  3. How hard is the algorithm to understand?

To the first point, there are many different ways to describe the differences between two files; some are short and precise, others are long and verbose. We speak about the differences between two files by talking about an "edit script", which is a sequence of commands (insert this line here, remove that line there) that can be applied to transform the one file into the other. There is such a thing as the shortest possible edit script, and any diff algorithm worth respecting should produce such a script. There are usually more than one shortest possible edit script, however, so diff programs may vary in their output while all still producing the "best" result.

To the second point, it is very important that we be able to diff two files efficiently. Modern systems spend stupendous amounts of time computing differences between pairs of files, so improvements in execution time can have breakthough results on larger systems.

To the last point, which is my current focus, some algorithms are simply harder to understand than others. So, I'm glad I started with the SP&E presentation, as it is very clear and readable. The algorithm described by Miller and Myers is a dynamic programming technique, inductively proved correct, that computes edit (sub-)scripts of length N + 1 by applying 3 basic rules to edit scripts of length N (greatly summarized here):

  1. Move right: Given a script that converts A[1:i] to B[1:j-1], adding the command 'Insert B[j]' produces the next step in the edit script

  2. Move down: Given a script that converts A[1:i-1] to B[1:j], adding the command 'Delete A[i]' produces the next step in the edit script

  3. Slide down the diagonal: Given a script that converts A[1:i-1] to B[1:j-1], and given that A[i] equals B[j], the edit script that we already have already produces the next step in the edit script, converting A[1:i] to B[1:j].

I haven't really done a very good job of describing the algorithm. Hopefully, though, I've motivated you to go track down the references and read the algorithm(s) for yourself.

Enjoy your close reading!

No comments:

Post a Comment