Friday, September 9, 2022

Progress in diff-land

Years ago, in an earlier stage of my career, I was totally and obsessively fascinated with the internals of 'diff'.

Then I wandered away from that extremely deep, but extremely narrow, field of study, and lost track of what was going on with 'diff'. The last thing I had really paid any attention to was Patience Diff, which is decades old at this point.

Browsing the net this week, I find out that, of course, progress never stops, and big steps continue to be taken:

  • About 5 years ago, Tristan Hume published an article on an interesting new approach to tree diffing: Designing a Tree Diff Algorithm Using Dynamic Programming and A*.
    After 2+ days of research, discussing ideas with my mentor, and sitting down in a chair staring out at the nice view of London while thinking and sketching out cases of the algorithm in a notebook, I had something. It was a recursive dynamic programming algorithm that checked every possible way of placing the :by-date blocks and chose the best, but used memoization (that’s the dynamic programming part) so that it re-used sub-problems and had polynomial instead of exponential complexity.
  • A year or so later, Russell McQueeney built on this tree-diffing algorithm to build a new structural diff algorithm for Clojure: Autochrome - Structural diffs for Clojure source code
    in order to frame tree diffing as a pathfinding problem, you need to extend the concepts of location, cost, and adjacency to tree diffs. Location is clearly needed to know where you are, but in addition locations need to be comparable, so you know not to bother when you already have a better path to the same place. Cost is what makes some paths preferred over others. For pathfinding on a road network, this would be the total distance traveled along the roads used. Adjacency is what states are reachable from a particular state. For roads you might say that intersections are the nodes and adjacency means there is a road connecting them.
  • And now, Wilfred Hughes takes a further step, abstracting out the language syntax to allow the algorithm to support a variety of programming languages: Difftastic, the Fantastic Diff
    Autochrome and difftastic represent diffing as a shortest path problem on a directed acyclic graph. A vertex represents a pair of positions: the position in the left-hand side s-expression (before), and the position in the right-hand side s-expression (after).

    The goal is to find the shortest route from the start vertex (where both positions are before the first item in the programs) to the end vertex (where both positions are after the last item in the program).

If I could just live another 50 years, what a marvelous 'diff' tool I could look forward to!

No comments:

Post a Comment