Tuesday, January 18, 2011

Sometimes, the best you can do is to add that test case to the test suites for next time

At my day job, several of our customers independently reported a bug to us over the weekend. After a bit of analysis, our support staff identified the commonality among the cases, and sure enough, there was a particular configuration, a "perfect storm" if you will, involving several different aspects of the server configuration which, if they were all just right, caused a small memory leak.

Well, I say a small memory leak, and it was in fact less than 15 bytes per TCP/IP connection that the server accepted. Unfortunately, since our server routinely accepts hundreds or even thousands of connections a minute, that can really add up.
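
To put rough numbers on "add up", here is a back-of-the-envelope sketch. The 15 bytes is the upper bound mentioned above; the connection rates and the name `leaked_bytes` are illustrative assumptions, not measurements from any real deployment.

```python
# Upper bound from the post: under 15 bytes leaked per accepted connection.
LEAK_BYTES_PER_CONNECTION = 15
MINUTES_PER_DAY = 24 * 60


def leaked_bytes(connections_per_minute: int, minutes: int) -> int:
    """Worst-case bytes leaked at a steady connection rate."""
    return LEAK_BYTES_PER_CONNECTION * connections_per_minute * minutes


if __name__ == "__main__":
    # Assumed rates spanning "hundreds or even thousands" per minute.
    for rate in (100, 1000, 5000):
        per_day = leaked_bytes(rate, MINUTES_PER_DAY)
        print(f"{rate} conn/min -> {per_day / 1e6:.1f} MB/day")
```

At an assumed 1,000 connections a minute, that is about 21.6 MB a day at worst, which is easy to miss in a short test run and hard to ignore on a server that stays up for weeks.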

It was my bad for introducing the bug in the first place, and for not catching it during the 8 months (!) of internal testing, but things like this happen. I'm pleased that our support team was able to isolate the configuration conditions so rapidly; it saved me an immense amount of time to be able to demonstrate the problem in just a few simple commands.

Unfortunately, you can't fix a bug until you find it. But now that we've found it, and fixed it, I've done what I think is the best I can do:

  • I searched the code for any similar mistakes that I might have made, and didn't find any.

  • I ran my fix past 3 separate code reviewers, who each found small areas where I could improve the fix.

  • I added a test case based on the reproduction script to our nightly regression suites, and I verified that the test case fails without the fix, and passes with the fix in place. The presence of this test case greatly increases my confidence that this bug won't slip back into the product in some future release. It's a substantial bit of effort to add a test for every single bug fix, but the alternative is worse, so I always try to add that test case if at all possible.

  • And I fixed the problem in a way that I hope will lay the infrastructure for future improvements in subsequent releases.
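
The "fails without the fix, passes with the fix" check from the list above can be sketched in miniature. This is only a toy illustration: `ToyServer`, its `fixed` flag, and `leaked_after` are hypothetical names invented for this example, and the real test is driven by the reproduction script against the actual server.

```python
class ToyServer:
    """Hypothetical stand-in for a server; not the real product."""

    def __init__(self, fixed: bool):
        self.fixed = fixed
        # connection id -> small per-connection buffer
        self.live_buffers = {}

    def accept(self, conn_id: int) -> None:
        # Allocate a few bytes of per-connection state.
        self.live_buffers[conn_id] = bytearray(15)

    def close(self, conn_id: int) -> None:
        if self.fixed:
            # The fix: release the per-connection state on close.
            self.live_buffers.pop(conn_id, None)
        # Buggy variant: the buffer is never released.


def leaked_after(server: ToyServer, connections: int) -> int:
    """Open and close many connections, then report bytes still held."""
    for conn_id in range(connections):
        server.accept(conn_id)
        server.close(conn_id)
    return sum(len(buf) for buf in server.live_buffers.values())


def check_no_leak(server: ToyServer) -> None:
    """The regression check: nothing may remain after all connections close."""
    assert leaked_after(server, 10_000) == 0, "per-connection state leaked"


if __name__ == "__main__":
    check_no_leak(ToyServer(fixed=True))       # passes with the fix
    try:
        check_no_leak(ToyServer(fixed=False))  # must fail without the fix
    except AssertionError:
        print("regression check catches the unfixed build")
```

Running the check against both variants is the important part: a test that has never been seen to fail gives no evidence that it can detect the bug.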

Since I know people will suggest this: yes, we do make use of a number of resource leak detection tools; getting a clean valgrind run is important to us, and we have a large collection of stress and load test suites. Unfortunately, there are many more operational configurations than you might think, and this particular configuration, although not that unusual, still happened to be a configuration that we don't have leak detection tools for.

No more crying over spilt milk. This bug is fixed, I believe, and fixed well, and there are more tasks lying ahead. That is the way of software, after all.
