I've been studying the heck out of git rebase recently.
I mean, I've really been making an effort. It's no exaggeration to say that for the last 3 months, the primary focus of my (professional) life has been to really, deeply, truly understand git rebase.
This is no trivial task. To start with, about all you can say about the documentation for git rebase is that it's a disaster. For a piece of software as powerful, flexible, and astonishingly amazing as git rebase, its official documentation is the opposite on every count. It's more than just badly written and misleading; it's almost as though it actively defies your attempts to understand it.
Sigh.
With any piece of truly sophisticated software, I've found that there are always three levels of understanding:
- What does it do?
- How does it do that?
- WHY would you do that?
I think that software shares this with many other human creations, like cooking, or home repair, or jet aircraft design. At one level, there are recipes, and you can learn to follow the recipes, and make a nice Creme Brulee: now you know what the recipe does.
And then, you can study some more, and you can learn how the recipe works: that you cook the custard in a water bath to insulate the cream-egg mixture from the oven's heat and prevent the eggs from cooking too fast, and that you caramelize the sugar not just to change its taste, but also to provide a different dimension to the dessert by forming a hard shell over the smooth creamy custard below.
But you still don't know why you should use this recipe, when it is the right thing to do, and when it is not.
Well, anyway, I don't want to talk about cooking, because I'm an engineer at heart, not a chef.
Rather, I want to talk about git rebase.
And how hard it is to achieve that third level of understanding.
Because, even though the documentation is horrific (did I mention that already?), you can fairly quickly pick up the ideas about what you can do with git rebase:
- Bring a branch up to date with its parent
- Re-arrange the work in a branch, perhaps splitting it into several branches, or re-parenting it on a different parent branch
- Revise the work in a branch, perhaps eliminating some of the work, excising some of the clutter left by mis-steps or dead ends or false summits, or re-ordering it, or collapsing and removing some of the intermediate steps
And, although the documentation doesn't help with this, you can, with a bit more study, understand how git rebase accomplishes these tasks.
To get a feel for how git rebase does what it does, let me recommend these resources:
- the Git Rebasing chapter in Scott Chacon's book
In this section you’ll learn what rebasing is, how to do it, why it’s a pretty amazing tool, and in what cases you won’t want to use it.
- Git for Computer Scientists
Quick introduction to git internals for people who are not scared by words like Directed Acyclic Graph.
- Git from the bottom up
This state of affairs most directly represents what we’d like done: for our local, development
branch Z to be based on the latest work in the main branch D. That’s why the command is called
“rebase”, because it changes the base commit of the branch it’s run from.
Of course, there are many more resources for git, but these are among my favorites.
So after some amount of time reading, and experimenting, and thinking, you will find that you can understand what git rebase does, and how it does it.
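If you want something to experiment with, here is a minimal sketch of two of the uses listed above: bringing a branch up to date with its parent, and collapsing an intermediate step. It builds a throwaway repository in a temporary directory, so nothing of yours is touched. The branch names (main, topic), the file names, and the trick of scripting the interactive todo list with sed through GIT_SEQUENCE_EDITOR are all my own assumptions for illustration; in real life you would edit the todo list by hand.

```shell
# Sketch only: everything happens in a throwaway repository.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main        # name the first branch "main" (assumed name)
git config user.name "Example"
git config user.email "example@example.com"

# A base commit on main.
echo base > base.txt
git add base.txt
git commit -q -m "base"

# Two small steps on a topic branch.
git checkout -q -b topic
echo one > topic.txt
git add topic.txt
git commit -q -m "topic: step one"
echo two >> topic.txt
git commit -q -am "topic: step two"

# Meanwhile, the parent branch moves on.
git checkout -q main
echo upstream > upstream.txt
git add upstream.txt
git commit -q -m "upstream work"

# Use 1: bring the branch up to date with its parent.
git checkout -q topic
git rebase -q main

# Use 2: collapse the second step into the first.
# The sed command rewrites line 2 of the todo list from "pick" to "fixup".
GIT_SEQUENCE_EDITOR="sed -i.bak '2s/^pick/fixup/'" git rebase -q -i main

# Show what is left on topic beyond main.
git log --oneline main..topic
```

After this runs, topic sits directly on top of the latest main and carries a single commit containing both steps.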
But why? Why would you choose to forward-port commits? Why would you choose to re-order, re-word, squash, split, or fixup commits? What is the purpose of this incredibly powerful tool, and how are you ever going to understand how to use it properly, and not instead shy away from it in complete terror?
Unfortunately, I can no longer recall where I stumbled across this resource, but somehow I found Norman Yarvin's web site, where he has gathered assorted collections of material.
And one of those collections is an amazing series of email messages from Linus Torvalds (and a few others): git rebase.
Now, this is not easy stuff to read. Don't plunge into it until you've gone through the other resources first.
But when you're ready, and you've done your homework, go and read what Linus has to say about when and why you should use rebase, and think deeply about passages like:
But if you do a true merge, the bug is clearly in the merge
(automatedly clean or not), and the blame is there too. IOW, you can blame
me for screwing up. Now, I will say "oh, me bad, I didn't realize how
subtle the interaction was", so it's not like I'll be all that contrite,
but at least it's obvious where the blame lies.
In contrast, when you rebase, the same problem happens, but now a totally
innocent commit is blamed just because it happened to no longer work in
the location it was not tested in. The person who wrote that commit, the
people who tested it and said it works, all that work is now basically
worthless: the testing was done with another version, the original patch
is bad, and the history and _reason_ for it being bad has been lost.
And there's literally nothing left to indicate the fact that the patch and
the testing _used_ to be perfectly valid.
That may not sound like such a big deal, but what does that make of code
review and tested-by, and the like? It just makes a mockery of trying to
do a good job testing any sub-trees, when you know that eventually it will
all quite possibly be pointless, and the fact that maybe the networking
tree was tested exhaustively is all totally moot, because in the end the
stuff that hit the main tree is something else altogether?
and
Don't get me wrong at all. Rebasing is fine for stuff you have committed
yourself (which I assume was the case here).
Rebasing is also a fine conflict resolution strategy when you try to
basically turn a "big and complex one-time merge conflict" into "multiple
much smaller ones by doing them one commit at a time".
But what rebasing is _not_ is a fine "default strategy", especially if
other people are depending on you.
and
What I do try to encourage is for people to think publicising their git
trees as "version announcements". They're obviously _development_
versions, but they're still real versions, and before you publicize them
you should try to make sure that they make sense and are something you can
stand behind.
And once you've publicized them, you don't know who has that tree, so just
from a sanity and debugging standpoint, you should try to avoid mucking
with already-public versions. If you made a mistake, add a patch on top to
fix it (and announce the new state), but generally try to not "hide" the
fact that the state has changed.
But it's not a hard rule. Sometimes simple cleanliness means that you can
decide to go "oops, that was *really* wrong, let's just throw that away
and do a whole new set of patches". But it should be something rare - not
normal coding practice.
Because if it becomes normal coding practice, now people cannot work with
you sanely any more (ie some random person pulls your tree for testing,
and then I pull it at some other time, and the tester reports a problem,
but now the commits he is talking about don't actually even exist in my
tree any more, and it's all really messy!).
and
Rebasing branches is absolutely not a bad thing for individual developers.
But it *is* a bad thing for a subsystem maintainer.
So I would heartily recommend that if you're a "random developer" and
you're never going to have anybody really pull from you and you
*definitely* don't want to pull from other peoples (except the ones that
you consider to be "strictly upstream" from you!), then you should often
plan on keeping your own set of patches as a nice linear regression.
And the best way to do that is very much by rebasing them.
That is, for example, what I do myself with all my git patches, since in
git I'm not the maintainer, but instead send out my changes as emails to
the git mailing list and to Junio.
So for that end-point-developer situation "git rebase" is absolutely the
right thing to do. You can keep your patches nicely up-to-date and always
at the top of your history, and basically use git as an efficient
patch-queue manager that remembers *your* patches, while at the same time
making it possible to efficiently synchronize with a distributed up-stream
maintainer.
So doing "git fetch + git rebase" is *wonderful* if all you keep track of
is your own patches, and nobody else ever cares until they get merged into
somebody elses tree (and quite often, sending the patches by email is a
common situation for this kind of workflow, rather than actually doing git
merges at all!)
So I think 'git rebase' has been a great tool, and is absolutely worth
knowing and using.
*BUT*. And this is a pretty big 'but'.
BUT if you're a subsystem maintainer, and other people are supposed to be
able to pull from you, and you're supposed to merge other peoples work,
then rebasing is a *horrible* workflow.
Why?
It's horrible for multiple reasons. The primary one being because nobody
else can depend on your work any more. It can change at any point in time,
so nobody but a temporary tree (like your "linux-next release of the day"
or "-mm of the week" thing) can really pull from you sanely. Because each
time you do a rebase, you'll pull the rug from under them, and they have
to re-do everything they did last time they tried to track your work.
But there's a secondary reason, which is more indirect, but despite that
perhaps even more important, at least in the long run.
If you are a top-level maintainer or an active subsystem, like Ingo or
Thomas are, you are a pretty central person. That means that you'd better
be working on the *assumption* that you personally aren't actually going
to do most of the actual coding (at least not in the long run), but that
your work is to try to vet and merge other peoples patches rather than
primarily to write them yourself.
And that in turn means that you're basically where I am, and where I was
before BK, and that should tell you something. I think a lot of people
are a lot happier with how I can take their work these days than they
were six+ years ago.
So you can either try to drink from the firehose and inevitably be bitched
about because you're holding something up or not giving something the
attention it deserves, or you can try to make sure that you can let others
help you. And you'd better select the "let other people help you", because
otherwise you _will_ burn out. It's not a matter of "if", but of "when".
Now, this isn't a big issue for some subsystems. If you're working in a
pretty isolated area, and you get perhaps one or two patches on average
per day, you can happily basically work like a patch-queue, and then other
peoples patches aren't actually all that different from your own patches,
and you can basically just rebase and work everything by emailing patches
around. Big deal.
But for something like the whole x86 architecture, that's not what te
situation is. The x86 merge isn't "one or two patches per day". It easily
gets a thousand commits or more per release. That's a LOT. It's not quite
as much as the networking layer (counting drivers and general networking
combined), but it's in that kind of ballpark.
And when you're in that kind of ballpark, you should at least think of
yourself as being where I was six+ years ago before BK. You should really
seriously try to make sure that you are *not* the single point of failure,
and you should plan on doing git merges.
And that absolutely *requires* that you not rebase. If you rebase, the
people down-stream from you cannot effectively work with your git tree
directly, and you cannot merge their work and then rebase without SCREWING
UP their work.
and
The PCI tree merged the suspend branch from the ACPI tree. You can see it
by looking at the PCI merge in gitk:
gitk dc7c65db^..dc7c65db
and roughly in the middle there you'll find Jesse's commit 53eb2fbe, in
which he merges branch 'suspend' from Len's ACPI tree.
So Jesse got these three commits:
0e6859d... ACPI PM: Remove obsolete Toshiba workaround
8d2bdf4... PCI ACPI: Drop the second argument of platform_pci_choose_state
0616678... ACPI PM: acpi_pm_device_sleep_state() cleanup
from Len's tree. Then look at these three commits that I got when I
actually merged from you:
741438b... ACPI PM: Remove obsolete Toshiba workaround
a80a6da... PCI ACPI: Drop the second argument of platform_pci_choose_state
2fe2de5... ACPI PM: acpi_pm_device_sleep_state() cleanup
Look familiar? It's the same patches - just different commit ID's. You
rebased and moved them around, so they're not really the "same" at all,
and they don't show the shared history any more, and the fact that they
were pulled earlier into the PCI tree (and then into mine).
This is what rebasing causes.
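(An aside from me, not from Linus: the "same patches, different commit IDs" effect is easy to reproduce for yourself. The sketch below is my own illustration in a throwaway repository, with assumed branch and file names. It rebases a single commit and compares its commit ID, which changes, with its git patch-id, which is computed from the diff alone and so stays the same.)

```shell
# Sketch only: everything happens in a throwaway repository.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main        # name the first branch "main" (assumed name)
git config user.name "Example"
git config user.email "example@example.com"

echo base > base.txt
git add base.txt
git commit -q -m "base"

# One commit on a topic branch.
git checkout -q -b topic
echo fix > fix.txt
git add fix.txt
git commit -q -m "a fix"
before=$(git rev-parse HEAD)
patch_before=$(git show "$before" | git patch-id | cut -d' ' -f1)

# The parent branch moves on, and topic is rebased onto it.
git checkout -q main
echo upstream > upstream.txt
git add upstream.txt
git commit -q -m "upstream work"
git checkout -q topic
git rebase -q main
after=$(git rev-parse HEAD)
patch_after=$(git show "$after" | git patch-id | cut -d' ' -f1)

echo "commit ID before rebase: $before"
echo "commit ID after rebase:  $after"
echo "patch-id before rebase:  $patch_before"
echo "patch-id after rebase:   $patch_after"
```

The two commit IDs differ because the rebased commit has a different parent, while the two patch-ids match because the change itself is the same. Anyone who pulled the old commit now holds something git treats as unrelated to the new one.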
and
So rebasing and cleanups may indeed result in a "simpler" history, but it
only look that way if you then ignore all the _other_ "simpler" histories.
So anybody who rebases basically creates not just one simple history, but
a _many_ "simple" histories, and in doing so actually creates a
potentially much bigger mess than he started out with!
As long as you never _ever_ expose your rewriting of history to anybody
else, people won't notice or care, because you basically guarantee that
nobody can ever see all those _other_ "simpler" histories, and they only
see the one final result. That's why 'rebase' is useful for private
histories.
But even then, any testing you did in your private tree is now suspect,
because that testing was done with the old history that you threw away.
So even if you delete all the old histories and never show them, they kind
of do exist conceptually - they existed in the sense that you tested them,
and you've just hidden the fact that what you release is different from
what you tested.
Well.
That was a lot of quoting, and I'm sorry to do that.
But so many of the web pages out there only point to Linus's Final Word On The Subject.
You know, the one which reads:
I want clean history, but that really means (a) clean and (b) history.
Now, that last essay is indeed brilliant, and you should print it out, and post it on your wall, and read it every morning, and think about what it is he's trying to say.
But if you just can't figure it out, well, go digging in the source material.
And then, I believe, it will finally all make sense.
At least, it finally did, to me.