Saturday, August 9, 2014

git rebase

I've been studying the heck out of git rebase recently.

I mean, I've really been making an effort. It's no exaggeration to say that for the last 3 months, the primary focus of my (professional) life has been to really, deeply, truly understand git rebase.

This is no trivial task. To start with, about all you can say about the documentation for git rebase is that it's a disaster. For a piece of software as powerful, flexible, and astonishingly amazing as git rebase, its official documentation is the exact opposite. It's more than just badly written and misleading; it's almost as though it actively defies your attempts to understand it.

Sigh.

With any piece of truly sophisticated software, I've found that there are always three levels of understanding:

  1. What does it do?
  2. How does it do that?
  3. WHY would you do that?

I think that software shares this with many other human creations, like cooking, or home repair, or jet aircraft design. At one level, there are recipes, and you can learn to follow the recipes, and make a nice Creme Brulee: now you know what the recipe does.

And then, you can study some more, and you can learn how the recipe works: that you cook the custard in a water bath to insulate the cream-egg mixture from the oven's heat and prevent the eggs from cooking too fast, and that you caramelize the sugar not just to change its taste, but also to provide a different dimension to the dessert by forming a hard shell over the smooth creamy custard below.

But you still don't know why you should use this recipe, when it is the right thing to do, and when it is not.

Well, anyway, I don't want to talk about cooking, because I'm an engineer at heart, not a chef.

Rather, I want to talk about git rebase.

And how hard it is to achieve that third level of understanding.

Because, even though the documentation is horrific (did I mention that already?), you can fairly quickly pick up the ideas about what you can do with git rebase:

  • Bring a branch up to date with its parent
  • Re-arrange the work in a branch, perhaps splitting it into several branches, or re-parenting it on a different parent branch
  • Revise the work in a branch, perhaps eliminating some of the work, excising some of the clutter left by mis-steps or dead ends or false summits, or re-ordering it, or collapsing and removing some of the intermediate steps
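To make those three uses concrete, here is a minimal sketch in a throwaway repository. Everything named here (the branches main, release and topic, the files, the commit messages, the little commit helper) is something I made up for illustration. The third step sets GIT_SEQUENCE_EDITOR to a sed one-liner only so that the interactive rebase can run unattended; at a real keyboard you would simply edit the todo list yourself.

```shell
#!/bin/sh
# Scratch-repository sketch: all branch, file, and commit names are hypothetical.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main    # name the initial branch explicitly
git config user.name "Example Dev"       # placeholder identity for this repo only
git config user.email "dev@example.com"

# Helper: append a line to a file and commit it with the same text as message.
commit() {
    echo "$2" >> "$1"
    git add "$1"
    git commit -q -m "$2"
}

commit base.txt "base"
git branch release                       # an older line of development
git checkout -q -b topic
commit topic.txt "topic: first step"
commit topic.txt "topic: second step"
git checkout -q main
commit main.txt "main: moves ahead"

# 1. Bring the branch up to date with its parent:
#    replay the topic commits on top of the current tip of main.
git checkout -q topic
git rebase -q main

# 2. Re-parent the branch: take the commits in main..topic
#    and replay them on top of release instead.
git rebase -q --onto release main topic

# 3. Revise the branch: fold the second commit into the first.
#    The sed command rewrites line 2 of the todo list from "pick" to "fixup".
GIT_SEQUENCE_EDITOR='sed -i.bak "2s/^pick/fixup/"' git rebase -q -i release

git log --oneline topic
```

After the last step, topic sits directly on release and carries a single commit containing both steps. Note that every one of these operations wrote new commits rather than moving the old ones.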

And, although the documentation doesn't help with this, you can, with a bit more study, understand how git rebase accomplishes these tasks.

To get a feel for how git rebase does what it does, let me recommend these resources:

  1. the Git Rebasing chapter in Scott Chacon's book
    In this section you’ll learn what rebasing is, how to do it, why it’s a pretty amazing tool, and in what cases you won’t want to use it.
  2. Git for Computer Scientists
    Quick introduction to git internals for people who are not scared by words like Directed Acyclic Graph.
  3. Git from the bottom up
    This state of affairs most directly represents what we’d like done: for our local, development branch Z to be based on the latest work in the main branch D. That’s why the command is called “rebase”, because it changes the base commit of the branch it’s run from.

Of course, there are many more resources for git, but these are among my favorites.

So after some amount of time reading, and experimenting, and thinking, you will find that you can understand what git rebase does, and how it does it.

But why? Why would you choose to forward-port commits? Why would you choose to re-order, re-word, squash, split, or fixup commits? What is the purpose of this incredibly powerful tool, and how are you ever going to understand how to use it properly, and not instead shy away from it in complete terror?

Unfortunately, I can no longer recall where I stumbled across this resource, but somehow I found Norman Yarvin's web site, where he has gathered assorted collections of material.

And, one of those collections is an amazing series of email messages from Linus Torvalds (and a few others): git rebase.

Now, this is not easy stuff to read. Don't plunge into it until you've gone through the other resources first.

But when you're ready, and you've done your homework, go and read what Linus has to say about when and why you should use rebase, and think deeply about passages like:

But if you do a true merge, the bug is clearly in the merge (automatedly clean or not), and the blame is there too. IOW, you can blame me for screwing up. Now, I will say "oh, me bad, I didn't realize how subtle the interaction was", so it's not like I'll be all that contrite, but at least it's obvious where the blame lies.

In contrast, when you rebase, the same problem happens, but now a totally innocent commit is blamed just because it happened to no longer work in the location it was not tested in. The person who wrote that commit, the people who tested it and said it works, all that work is now basically worthless: the testing was done with another version, the original patch is bad, and the history and _reason_ for it being bad has been lost.

And there's literally nothing left to indicate the fact that the patch and the testing _used_ to be perfectly valid.

That may not sound like such a big deal, but what does that make of code review and tested-by, and the like? It just makes a mockery of trying to do a good job testing any sub-trees, when you know that eventually it will all quite possibly be pointless, and the fact that maybe the networking tree was tested exhaustively is all totally moot, because in the end the stuff that hit the main tree is something else altogether?

and

Don't get me wrong at all. Rebasing is fine for stuff you have committed yourself (which I assume was the case here).

Rebasing is also a fine conflict resolution strategy when you try to basically turn a "big and complex one-time merge conflict" into "multiple much smaller ones by doing them one commit at a time".

But what rebasing is _not_ is a fine "default strategy", especially if other people are depending on you.

and

What I do try to encourage is for people to think publicising their git trees as "version announcements". They're obviously _development_ versions, but they're still real versions, and before you publicize them you should try to make sure that they make sense and are something you can stand behind.

And once you've publicized them, you don't know who has that tree, so just from a sanity and debugging standpoint, you should try to avoid mucking with already-public versions. If you made a mistake, add a patch on top to fix it (and announce the new state), but generally try to not "hide" the fact that the state has changed.

But it's not a hard rule. Sometimes simple cleanliness means that you can decide to go "oops, that was *really* wrong, let's just throw that away and do a whole new set of patches". But it should be something rare - not normal coding practice.

Because if it becomes normal coding practice, now people cannot work with you sanely any more (ie some random person pulls your tree for testing, and then I pull it at some other time, and the tester reports a problem, but now the commits he is talking about don't actually even exist in my tree any more, and it's all really messy!).

and

Rebasing branches is absolutely not a bad thing for individual developers.

But it *is* a bad thing for a subsystem maintainer.

So I would heartily recommend that if you're a "random developer" and you're never going to have anybody really pull from you and you *definitely* don't want to pull from other peoples (except the ones that you consider to be "strictly upstream" from you!), then you should often plan on keeping your own set of patches as a nice linear regression.

And the best way to do that is very much by rebasing them.

That is, for example, what I do myself with all my git patches, since in git I'm not the maintainer, but instead send out my changes as emails to the git mailing list and to Junio.

So for that end-point-developer situation "git rebase" is absolutely the right thing to do. You can keep your patches nicely up-to-date and always at the top of your history, and basically use git as an efficient patch-queue manager that remembers *your* patches, while at the same time making it possible to efficiently synchronize with a distributed up-stream maintainer.

So doing "git fetch + git rebase" is *wonderful* if all you keep track of is your own patches, and nobody else ever cares until they get merged into somebody elses tree (and quite often, sending the patches by email is a common situation for this kind of workflow, rather than actually doing git merges at all!)

So I think 'git rebase' has been a great tool, and is absolutely worth knowing and using.

*BUT*. And this is a pretty big 'but'.

BUT if you're a subsystem maintainer, and other people are supposed to be able to pull from you, and you're supposed to merge other peoples work, then rebasing is a *horrible* workflow.

Why?

It's horrible for multiple reasons. The primary one being because nobody else can depend on your work any more. It can change at any point in time, so nobody but a temporary tree (like your "linux-next release of the day" or "-mm of the week" thing) can really pull from you sanely. Because each time you do a rebase, you'll pull the rug from under them, and they have to re-do everything they did last time they tried to track your work.

But there's a secondary reason, which is more indirect, but despite that perhaps even more important, at least in the long run.

If you are a top-level maintainer or an active subsystem, like Ingo or Thomas are, you are a pretty central person. That means that you'd better be working on the *assumption* that you personally aren't actually going to do most of the actual coding (at least not in the long run), but that your work is to try to vet and merge other peoples patches rather than primarily to write them yourself.

And that in turn means that you're basically where I am, and where I was before BK, and that should tell you something. I think a lot of people are a lot happier with how I can take their work these days than they were six+ years ago.

So you can either try to drink from the firehose and inevitably be bitched about because you're holding something up or not giving something the attention it deserves, or you can try to make sure that you can let others help you. And you'd better select the "let other people help you", because otherwise you _will_ burn out. It's not a matter of "if", but of "when".

Now, this isn't a big issue for some subsystems. If you're working in a pretty isolated area, and you get perhaps one or two patches on average per day, you can happily basically work like a patch-queue, and then other peoples patches aren't actually all that different from your own patches, and you can basically just rebase and work everything by emailing patches around. Big deal.

But for something like the whole x86 architecture, that's not what the situation is. The x86 merge isn't "one or two patches per day". It easily gets a thousand commits or more per release. That's a LOT. It's not quite as much as the networking layer (counting drivers and general networking combined), but it's in that kind of ballpark.

And when you're in that kind of ballpark, you should at least think of yourself as being where I was six+ years ago before BK. You should really seriously try to make sure that you are *not* the single point of failure, and you should plan on doing git merges.

And that absolutely *requires* that you not rebase. If you rebase, the people down-stream from you cannot effectively work with your git tree directly, and you cannot merge their work and then rebase without SCREWING UP their work.

and

The PCI tree merged the suspend branch from the ACPI tree. You can see it by looking at the PCI merge in gitk:

 gitk dc7c65db^..dc7c65db

and roughly in the middle there you'll find Jesse's commit 53eb2fbe, in which he merges branch 'suspend' from Len's ACPI tree.

So Jesse got these three commits:


 0e6859d... ACPI PM: Remove obsolete Toshiba workaround
 8d2bdf4... PCI ACPI: Drop the second argument of platform_pci_choose_state
 0616678... ACPI PM: acpi_pm_device_sleep_state() cleanup

from Len's tree. Then look at these three commits that I got when I actually merged from you:


 741438b... ACPI PM: Remove obsolete Toshiba workaround
 a80a6da... PCI ACPI: Drop the second argument of platform_pci_choose_state
 2fe2de5... ACPI PM: acpi_pm_device_sleep_state() cleanup

Look familiar? It's the same patches - just different commit ID's. You rebased and moved them around, so they're not really the "same" at all, and they don't show the shared history any more, and the fact that they were pulled earlier into the PCI tree (and then into mine).

This is what rebasing causes.

and

So rebasing and cleanups may indeed result in a "simpler" history, but it only look that way if you then ignore all the _other_ "simpler" histories. So anybody who rebases basically creates not just one simple history, but a _many_ "simple" histories, and in doing so actually creates a potentially much bigger mess than he started out with!

As long as you never _ever_ expose your rewriting of history to anybody else, people won't notice or care, because you basically guarantee that nobody can ever see all those _other_ "simpler" histories, and they only see the one final result. That's why 'rebase' is useful for private histories.

But even then, any testing you did in your private tree is now suspect, because that testing was done with the old history that you threw away. So even if you delete all the old histories and never show them, they kind of do exist conceptually - they existed in the sense that you tested them, and you've just hidden the fact that what you release is different from what you tested.

Well.

That was a lot of quoting, and I'm sorry to do that.
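If you want to see the "same patches, just different commit ID's" effect from the PCI/ACPI example for yourself, it is easy to reproduce in a scratch repository. This is only a sketch under my own assumptions: the branch names, file names, and commit messages are invented, and I use git patch-id (which hashes the content of a diff) as one way to show that the patch is unchanged while the commit that carries it is not.

```shell
#!/bin/sh
# Scratch-repository sketch: all branch, file, and commit names are hypothetical.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main    # name the initial branch explicitly
git config user.name "Example Dev"       # placeholder identity for this repo only
git config user.email "dev@example.com"

echo base > base.txt
git add base.txt
git commit -q -m "base"

# One commit on a feature branch.
git checkout -q -b feature
echo fix > fix.txt
git add fix.txt
git commit -q -m "feature: add fix"
before=$(git rev-parse HEAD)
patch_before=$(git show "$before" | git patch-id | cut -d' ' -f1)

# Meanwhile, main gains an unrelated commit.
git checkout -q main
echo more > other.txt
git add other.txt
git commit -q -m "main: unrelated work"

# Rebase the feature branch onto the new main.
git checkout -q feature
git rebase -q main
after=$(git rev-parse HEAD)
patch_after=$(git show "$after" | git patch-id | cut -d' ' -f1)

echo "commit ID before rebase: $before"
echo "commit ID after rebase:  $after"
echo "patch ID before rebase:  $patch_before"
echo "patch ID after rebase:   $patch_after"
```

The two commit IDs differ, because the rebased commit has a different parent, while the two patch IDs match. Anybody who pulled the commit before the rebase now holds something git treats as unrelated to what you publish afterwards.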

But so many of the web pages out there only point to Linus's Final Word On The Subject.

You know, the one which reads:

I want clean history, but that really means (a) clean and (b) history.

Now, that last essay is indeed brilliant, and you should print it out, and post it on your wall, and read it every morning, and think about what it is he's trying to say.

But if you just can't figure it out, well, go digging in the source material.

And then, I believe, it will finally all make sense.

At least, it finally did, to me.
