
Wednesday, May 6, 2009

A great test is a pathway to a mystery

I've had some great testing experiences recently. It's a bit challenging to explain why they were great, but they put me in mind of this quote:

We absolutely must leave room for doubt or there is no progress and no learning. There is no learning without having to pose a question. And a question requires doubt. People search for certainty. But there is no certainty. People are terrified–how can you live and not know? It is not odd at all. You can think you know, as a matter of fact. And most of your actions are based on incomplete knowledge and you really don't know what it is all about, or what the purpose of the world is, or know a great deal of other things. It is possible to live and not know.

Richard Feynman, The Pleasure of Finding Things Out.

Recently I've been involved with an internal test at work, designed to examine the behavior of our system across DBMS outages. The idea is that:
  1. Things are OK,
  2. Then the database server goes down, or the network is cut, or the disk fills up, or...
  3. So we need to detect that, and recognize that, for the time being, the database is unavailable, and make sure that the user can easily tell that we are experiencing a database outage,
  4. Then, at some later point, access to the database is restored,
  5. And we need to recognize that, and clear our outage status, and resume normal processing.
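
To make the five steps concrete, here is a minimal Java sketch of the detect-and-recover cycle. It is an illustration only: `OutageMonitor`, `Status`, and the `probe` callback are hypothetical names, not taken from the system described here, and the sketch assumes a single thread calls `check()` periodically.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: tracks whether the database is currently reachable. */
class OutageMonitor {
    enum Status { AVAILABLE, OUTAGE }

    // Assumed callback: returns true when the database answers a trivial request.
    private final BooleanSupplier probe;
    private Status status = Status.AVAILABLE; // step 1: things are OK

    OutageMonitor(BooleanSupplier probe) {
        this.probe = probe;
    }

    /** Runs one probe, updates the status, and returns the status after the check. */
    Status check() {
        boolean reachable;
        try {
            reachable = probe.getAsBoolean();
        } catch (RuntimeException e) {
            reachable = false; // a probe that throws counts as an outage (step 2)
        }
        if (!reachable && status == Status.AVAILABLE) {
            status = Status.OUTAGE;    // step 3: detect and record the outage
        } else if (reachable && status == Status.OUTAGE) {
            status = Status.AVAILABLE; // steps 4 and 5: access restored, clear the status
        }
        return status;
    }

    /** A message the user interface could show so the outage is easy to see. */
    String describe() {
        return status == Status.OUTAGE
                ? "Database unavailable: outage in progress"
                : "Database available";
    }
}
```

A real implementation would also have to decide what counts as a probe and how often to run it; the sketch leaves both to the caller.
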
Once we had come to agreement on the design and felt that we had a reasonable implementation, we set about proving that our implementation worked, by writing a test (actually, a suite of tests, all quite similar). We've been trying to do this in a fairly automated fashion, by using some simulation tools:
  1. Our test simulates normal activity, both before and after the outage.
  2. Our test simulates the database outage, by using a configurable wrapping JDBC driver, similar to p6spy, but actually adapted from some ideas published many years ago in an O'Reilly OnJava article.
So we wrote this test, and we've been running this test, and running this test, and running this test, and running this test. For close to a year now this test has been finding interesting problems and showing us new insights into the behavior of the system.
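
One way to simulate an outage at the JDBC layer is to wrap each `Connection` in a dynamic proxy that can be switched into a failing mode. The sketch below is not the wrapping driver described above, nor p6spy; it is a much smaller stand-in, and `OutageSimulator`, `startOutage`, `endOutage`, and `wrap` are hypothetical names. It uses only `java.lang.reflect.Proxy` and the standard `java.sql` types.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch: makes wrapped connections fail while an outage is switched on. */
class OutageSimulator {
    private final AtomicBoolean outage = new AtomicBoolean(false);

    void startOutage() { outage.set(true); }
    void endOutage()   { outage.set(false); }

    /** Returns a Connection that delegates to 'real' unless an outage is being simulated. */
    Connection wrap(Connection real) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (outage.get() && declaresSqlException(method)) {
                // SQLSTATE class 08 denotes a connection exception.
                throw new SQLException("simulated database outage", "08006");
            }
            try {
                return method.invoke(real, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow what the real connection threw
            }
        };
        return (Connection) Proxy.newProxyInstance(
                OutageSimulator.class.getClassLoader(),
                new Class<?>[] { Connection.class },
                handler);
    }

    // Only fail methods that are allowed to throw a plain SQLException; otherwise the
    // proxy would surface an UndeclaredThrowableException instead.
    private static boolean declaresSqlException(Method method) {
        for (Class<?> type : method.getExceptionTypes()) {
            if (type.isAssignableFrom(SQLException.class)) {
                return true;
            }
        }
        return false;
    }
}
```

A test could call `startOutage()`, assert that the system reports the outage, then call `endOutage()` and assert that normal processing resumes. A fuller simulation would also wrap the `Statement` and `ResultSet` objects the connection hands out, which this sketch does not do.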

At first, many of the problems were with the test itself: it didn't do a very good job of simulating database outages, so sometimes the test would expect the database to be showing an outage, only it actually wasn't, and vice versa. And it didn't do a very good job of simulating normal system activity, so it wouldn't succeed in forcing the code down paths that we expected it to touch, or it would predict that the system would exhibit certain behaviors, but when we actually ran the real system, and provoked a real DBMS outage (e.g., shutting down the DBMS server process manually), we wouldn't see the behavior we expected.

But after a while we started to become confident in the test, and at about the same time we started to find various real problems in the system under test.
  • Our notion of status tracking wasn't very precise, and the test revealed that the code wasn't being very careful about tracking the status of the system, and the status of the database.
  • The system wasn't being very crisp about state transitions, so sometimes it would sort of lurch haphazardly between states.
Because the system runs as a server, with many background threads which are scheduled based on events or timers, its behavior is extremely complex. Moreover, since it is a multi-threaded system, accessing a shared database using a pool of available connections, the overall effect is that multiple different parts of the system may experience a database outage in different ways more or less simultaneously. It's quite challenging to understand the behavior of such a system; this test has been an extremely useful tool for doing that.
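
As an illustration of what "crisp" state transitions can look like when many threads notice the outage at nearly the same moment, here is a small sketch built on `AtomicReference.compareAndSet`. It is an assumption about one possible approach, not a description of how the system discussed here works, and `DatabaseStatus`, `reportFailure`, and `reportSuccess` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch: a shared database status that changes state exactly once per event. */
class DatabaseStatus {
    enum State { AVAILABLE, OUTAGE }

    private final AtomicReference<State> state = new AtomicReference<>(State.AVAILABLE);
    private final AtomicInteger transitions = new AtomicInteger();

    /** Called by any thread whose database call failed; true only for the thread that flipped the state. */
    boolean reportFailure() {
        return transition(State.AVAILABLE, State.OUTAGE);
    }

    /** Called by any thread whose database call succeeded; true only for the thread that flipped the state. */
    boolean reportSuccess() {
        return transition(State.OUTAGE, State.AVAILABLE);
    }

    private boolean transition(State from, State to) {
        // compareAndSet succeeds for exactly one caller when several race on the same change.
        boolean won = state.compareAndSet(from, to);
        if (won) {
            transitions.incrementAndGet();
        }
        return won;
    }

    State current() { return state.get(); }

    int transitionCount() { return transitions.get(); }
}
```

With this shape, the thread for which `reportFailure()` returns true is the natural place to log the outage or notify the user once, while the other threads simply observe that the state has already changed.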

We continue to run the test; it continues to fail in new and different ways; there continue to be new mysteries to contemplate and explore; life is good!
