Tuesday, November 29, 2011

Apache, Subversion, and Git

Over the long weekend, a number of people seem to have picked up and commented on Mikeal Rogers's essay about Apache and its adoption of the source code control tool, Git. For example, Chris Aniszczyk pointed to the essay, and followed it up with some statistics and elaboration. Aniszczyk, in turn, points to a third essay (a year old), by Josh Berkus, describing the PostgreSQL community's migration to git, and a fourth web page describing the Eclipse community's migration to git. (Note: Both Eclipse and PostgreSQL migrated from CVS to git.)

I find the essays by Rogers and Aniszczyk quite puzzling, full of much heat and emotion, and I'm not sure what to take from them.

Rogers seems to start out on a solid footing:

For a moment, let's put the git part of GitHub on the back burner and talk about the hub.

On GitHub the language is not code, as it is often characterized, it is contribution. GitHub presents a person to person communication system for contributions. Documentation, issues, and of course code, travel between personal repositories.

The communication medium is the contribution itself. Its value, its merit, its intention, all laid naked for the world to see. There is no hierarchy or politic embedded in the system. The creator of a project has a clear first mover advantage but the possibility is always there for its position to be supplanted by a fork, creating a social imperative to manage contributions in a satisfactory manor [sic] to her community.

This is all well-written and clear, I think. But I don't understand how this is a critique of Apache. In my seven years of experience with the Derby project at Apache, this is exactly how an Apache software project works:

  • Issues are raised in the Apache issue-tracking system;
  • discussion is held in the issue comments and on mailing lists;
  • various contributors suggest ideas;
  • someone "with an itch to scratch" dives into the problem and constructs a patch;
  • the patch is proposed by attaching it to the issue-tracking system;
  • further discussion and testing occurs, now shaped by the concrete nature of the proposed patch;
  • a committer who becomes persuaded of the desirability of the patch commits it to the repository;
  • eventually a release occurs and the change becomes widely distributed.

This is the process as I have seen it and participated in it since 2004, and, I believe, it is how things were done for years before that.

So what, precisely, is it that Apache is failing at?

Here is where Rogers's essay seems to head into the wilderness, starting with this pronouncement:

Many of the social principles I described above are higher order manifestations of the design principles of git itself.

[ ... ]

The problem here is less about git and more about the chasm between Apache and the new culture of open source. There is a growing community of young new open source developers that Apache continues to distance itself from and as the ASF plants itself firmly in this position the growing community drifts farther away.

I don't understand this at all. What, precisely, is it that Apache is doing to distance itself from these developers, and what does this have to do with git?

Rogers offers as evidence this email thread (use the "next message by thread" links to read the thread), but from what I can tell, it seems like a very friendly, open, and productive discussion about the mechanics of using git to manage projects at Apache, with several commenters welcoming newcomers into the community and encouraging them to get involved.

This seems like the Apache way working successfully, from what I can tell.

Aniszczyk's follow-on essay, unfortunately, doesn't shed much additional light. He states that "what has been happening recently regarding the move to a distributed version control system is either pure politicking [sic] or negligence in my opinion."

So, again, what is it that he is specifically concerned about? Here, again, the essay appears to head into the wilderness. "Let's try to have some fun with statistics," says Aniszczyk, and he presents a series of charts and graphs showing that:

  1. git is very popular
  2. lots of job sites, such as LinkedIn, are advertising for developers who know git
  3. There is no 3.

At this point, Aniszczyk says "I knew it was time to stop digging for statistics."

But again, I am confused about what he finds upsetting. The core message of his essay appears to be:

The first is simple and deals with my day job of facilitating open source efforts at Twitter. If you’re going to open source a new project, the fact that you simply have to use SVN at Apache is a huge detterent [sic] from even going that route.

[ ... ]

All I’m saying is that it took a lot of work to start the transition and the eclipse community hasn’t even fully completed it yet. Just ask the PostgreSQL community how quick it was moving to Git. The key point here is that you have to start the transition soon as it’s going to take awhile for you to implement the move (especially since Apache hosts a lot of projects).

Once again, I'm lost. Why, exactly, is it a huge deterrent to use svn? And why, exactly, does Apache need to convert its existing projects from svn to git? Just because LinkedIn is advertising more jobs that use git as a keyword? That doesn't seem like a valid reason, to me.

Note that, as I mentioned at the start of this article, the PostgreSQL team migrated from CVS to git, not from Subversion to git. I can completely understand this. The last time I used CVS was in 2001, 10 full years ago; even at that time, CVS had some severe technical shortcomings and there was sufficient benefit to switching that it was worth the effort. So I'm not at all surprised by the PostgreSQL community's decision. The article by Berkus, by the way, is definitely worth reading, full of wisdom about platform coverage, tool and infrastructure support, workflow design, etc.

So, to summarize (as I understand it):

  • PostgreSQL and Eclipse are migrating from CVS to git, successfully (although it is taking a significant amount of time and resources)
  • Apache is working to integrate git into its policies and infrastructure, but still uses Subversion as its primary SCM system
  • Some people seem to feel that Apache is making the wrong decision about this

But what I don't understand, at the end of it all, is in what way this is opposed to "the Apache way". From everything I can see, the Apache way is alive and well in these discussions.

UPDATE: Thomas Koch, in the comments, provides a number of substantial, concrete examples in which git's powerful functionality can be very helpful. The most important one that Thomas provides, I think, is this:

It is much easier to make a proper integration between review systems, Jenkins and Jira, if the patch remains in the VCS as a branch instead of leaving it.

I completely agree. Working with patch files in isolation is substantially worse than referring to a branched change that is under SCM control. Certainly in my work with Derby I have seen many a contributor make minor technical errors while manipulating a patch file, which on the whole just adds friction to the overall process. Good point, Thomas!


  1. Some problems with SVN one sees after having worked a while with Git:

    - The whole process of getting any change of any size in an Apache process is very time consuming and complicated:
    - Open an issue (have a jira account)
    - export a patch file from your VCS
    - upload patch file (make sure you name it right so that Jenkins picks it up)
    - open a review request on review board (have an account there)
    - make sure you fill in all details correctly at reviewboard so that linking to the Jira and mailing list works
    - wait for Jenkins to approve your patch, repeat the process if anything goes wrong
    - If your patch contains binary file changes, you're lost, since Jenkins does not apply them correctly
    - highlight/announce your patch on the mailing list, because nobody can follow all the jira/reviewboard/jenkins spam on the list
    - If some other patches get committed to trunk, redo your patch to make it apply against current trunk

    There are a couple of things that become much, much easier with Git and a proper infrastructure around Git:

    - Git is much smarter than SVN at merging patches when trunk has changed
    - Managing dozens of patches as separate branches in Git is effortless
    - It is much easier to make a proper integration between review systems, Jenkins and Jira, if the patch remains in the VCS as a branch instead of leaving it. Gerrit is one example of a working, proven solution to reduce the above workflow to a one-liner: git push.
    - The fact that working with the current infrastructure is so cumbersome means that people don't like to do small clean up changes or even accept such contributions
    - People don't like to rework their patches for small clean ups, because it's that much work. In Git it doesn't even feel like work.
    - Somebody got angry at me, because a low priority change of mine got committed to trunk and forced him to redo his high priority change. Don't blame me! Blame your VCS!
    - SVN users don't use the log of their VCS, because it's so slow. With Git you really make use of the log to bisect, to understand the genesis of code or just to find out whom to blame. :-)
    - SVN users are afraid of trying something out because they don't use branches that they can throw away.
    - SVN users are so occupied dealing with the shortcomings of their VCS that they don't have time to learn Git...

  2. Thanks Thomas.

    Having concrete examples like that definitely helps me understand the discussion better.