Sunday, November 24, 2013

Anatomy of a bug fix

Some time recently, I happened to be re-reading the details about a really great bug fix, and I thought to myself: people don't write about bug fixes very much, and this one really deserves to be written about, so I'll do that.

It's a great bug fix, and it's great that it was found, carefully analyzed, and resolved.

It's even greater that this all happened on an Open Source project, so it's no problem to share the details and discuss it, since so often it is the case that Great Bug Fixes happen on closed source projects, where the details never see the light of day (except inside the relevant organization).

Anyway, with no further ado, let's have a look at DERBY-5358: SYSCS_COMPRESS_TABLE failed with conglomerate not found exception.

The first thing about a great bug fix is that you have to figure out that there is a bug. In this particular case, one of the Derby developers, Knut Anders Hatlen, came across the bug somewhat by accident:

When running the D4275.java repro attached to DERBY-4275 (with the patch invalidate-during-invalidation.diff as well as the fix for DERBY-5161 to prevent the select thread from failing) in four parallel processes on the same machine, one of the processes failed with the following stack trace: The conglomerate (4,294,967,295) requested does not exist. ...

Look how precise and compact that is, describing exactly what steps the reporter did to provoke the bug, the relevant environmental conditions ("four parallel processes on the same machine"), and what the symptoms of the bug were. This, as you will see, was no simple bug, so being able to crisply describe it was a crucial first step.

When he reported the bug, Knut Anders made a guess about the cause:

The conglomerate number 4,294,967,295 looks suspicious, as it's equal to 2^32-1. Perhaps it's hitting some internal limit on the number of conglomerates?

As it turned out, that initial guess was wrong, but it doesn't hurt to make such a guess. You have to have some sort of theory, and the only real clue was the conglomerate number in the error message.

Some time passed, and other Derby developers offered the occasional theories and suggestions ("Again with the in-memory back end?", " The message has the feel of a temporary timing issue.")

10 months later, Knut Anders returned to the issue, this time with a brilliant, insightful observation:

I instrumented this class and found that it never set heapConglomNumber to 4,294,967,295, but the method still returned that value some times.

Stop and think about that statement for a while.

This is the sort of thing that can happen with software: something that can't possibly happen actually happens, and when that occurs, the experienced engineer sits up straight and says "Whoa! What? Say that again? What does that mean?"

In this case, if the program never sets the variable to that value, but that value is in fact returned, then something "outside the language" must be happening. That is, the program must be breaking one of the rules of the Java language, in which case impossible things become possible.

Recalling that in the original description we had "four parallel processes on the same machine", Knut Anders took the brilliant inductive leap to realize that concurrent access to unprotected shared data was at work here:

The problem is that heapConglomNumber is a long, and the Java spec doesn't guarantee that reads/writes of long values are atomic.

He then proceeded to outline in detail how that violation of the language rules produced this behavior:

T2 reads heapConglomNumber in order to check if it's already cached. However, since T1's write was not atomic, it only sees half of it.

...

If T2 happens to see only the most significant half of the conglomerate number written by T1, that half will probably be all zeros (because it's not very likely that more than 4 billion conglomerates have been created). The bits in the least significant half will in that case be all ones (because the initial value is -1, which is all ones in two's complement). The returned value will therefore be 0x00000000ffffffff == 4,294,967,295, as seen in the error in the bug description.

I've also seen variants where the returned number is a negative one. That happens if T2 instead sees the least significant half of the correct column number, and the most significant half of the initial value -1. For example, if the conglomerate number is 344624, the error message will say: The conglomerate (-4 294 622 672) requested does not exist.

Beautiful, just beautiful: given just two clues:

  1. The error message consistently included the value 4,294,967,295 or an enormous negative number such as -4,294,622,672
  2. "four parallel processes on the same machine"
Knut Anders made the intuitive leap to understanding that we had a non-atomic read of a value undergoing update.

Now, it helps if you've seen behavior like that before, but even if you have, that sort of sudden, inspired vault from symptoms to cause is one of the most gratifying and enjoyable parts of computer programming; I often term this the "Ah hah! moment", because you will just be baffled by something and then, in the blink of an eye, you suddenly understand it perfectly.

My colleagues know that I often at this point clap my hands, shout out "Yes!", or otherwise let out a little burst of joy; it's just that pleasurable when it occurs.

Of course, I don't know if Knut Anders does that; he's in Norway, and I've only had the pleasure of meeting him once.

The next few days saw lots of activity on this issue. Mike Matrigali, one of the original authors of the Derby storage system, noted that this low-level error might explain a lot of similar symptoms:

I think in past I have seen user unreproducible issues with global temp tables and errors with strange large negative conglomerate numbers, so likley related to this issue being tracked here.
and encouraged the exploration of alternate scenarios that could give rise to this code path:
it would be good to update this issue with what user actions can cause this problem. Is it just SYSCS_COMPRESS_TABLE, or is it any operation that can result in conglomerate numbers changing?

Knut Anders agreed, saying

I think it could happen any time two threads concurrently ask for the conglomerate id when no other thread has done the same before.

I often refer to this as "widening the bug": when you realize that you've found a low-level error, it is important to stop and think about whether this particular code path is special, or whether there are other ways that the program might arrive at the same situation and encounter the same or similar problems.

Sometimes, that line of reasoning leads to a realization that you don't perfectly understand the bug: "wait; if that were true, then this other operation would also fail, but we know that works, so..."

However, in this case, the process of widening the bug held up very well: the bug could affect many code paths, all that was required was to have just the right pattern of concurrent access.

And soon, given that insight, Knut Anders was back with a simpler and more accurate reproduction script:

Attaching an alternative repro (MultiThreadedReadAfterDDL.java). In my environment (Solaris 11, Java SE 7u4) it typically fails after one to two minutes

Simplifying, clarifying, and refining the reproduction script is one of the most important steps in the bug fixing process. Sometimes, you are given a minimal reproduction script in the initial bug report, but for any real, complex, epic bug report, that almost never happens. Instead, you get a complicated, hard-to-reproduce problem description.

But once you truly and thoroughly understand the bug, it is so important to prove that to yourself by writing a tight, clear, minimal reproduction script.

  • First, it proves that you really do understand the code path(s) that provoke the bug
  • Secondly, it lets others reproduce the bug, which lets them help you
  • And perhaps most importantly, it allows you to have confidence that you've actually fixed the bug: run the script, see it fail, apply the fix, run the script, see it now succeed

Given all this great work, the next month is rather anticlimactic:

Attaching an updated patch (volatile-v2.diff), which only touches the TableDescriptor class. The new patch adds a javadoc comment that explains why the field is declared volatile. It also removes an unused variable from the getHeapConglomerateId() method.

...

Committed revision 1354015.

For many bugs, that's the end of it: careful problem tracking, persistence, a brilliant insight, brought to a close by solid, thorough engineering.

But with this bug, however, the realization that it's not just a good fix, but a Great Fix, is when, more than a year later, we continue to realize that this great fix is paying dividends. And, thus, witness DERBY-5532: Failure in XplainStatisticsTest.testSimpleQueryMultiWithInvalidation: ERROR XSAI2: The conglomerate (-4,294,965,312) requested does not exist..

This bug report was of particular interest to me, because it was in a test that I wrote, which exercised code that I was particularly familiar with, so it was frustrating that I had been unable to identify how my code could produce this failure.

Does that error message look familiar? Well, Mike and Knut Anders thought so, too, and when 18 months had passed since the original fix went in, we decided to chalk up another fix for DERBY-5358.

I often say that one of the differences between a good software engineer and a great one is in how they handle bug fixes. A good engineer will fix a bug, but they won't go that extra mile:

  • They won't narrow the reproduction script to the minimal case
  • They won't invest the time to clearly and crisply state the code flaw
  • They won't widen the bug, looking for other symptoms that the bug might have caused, and other code paths that might arrive at the problematic code
  • They won't search the problem database, looking for bug reports with different symptoms, but the same underlying cause.

Frankly, you could spend your entire career in software and not ever become as good an engineer as Knut Anders or Mike. But if you decide to become a software engineer, strive to be not just a good engineer, but a great one.

And to be a great engineer, strive, as much as you can, to make every bug fix a Great Bug Fix.

No comments:

Post a Comment