We absolutely must leave room for doubt or there is no progress and no learning. There is no learning without having to pose a question. And a question requires doubt. People search for certainty. But there is no certainty. People are terrified–how can you live and not know? It is not odd at all. You can think you know, as a matter of fact. And most of your actions are based on incomplete knowledge and you really don't know what it is all about, or what the purpose of the world is, or know a great deal of other things. It is possible to live and not know.
Richard Feynman, The Pleasure of Finding Things Out.
Recently I've been involved with an internal test at work, designed to examine the behavior of our system across DBMS outages. The idea is that:
- Things are OK,
- Then the database server goes down, or the network is cut, or the disk fills up, or...
- So we need to detect that failure, recognize that the database is unavailable for the time being, and make sure that the user can easily tell that we are experiencing a database outage,
- Then, at some later point, access to the database is restored,
- And we need to recognize that, and clear our outage status, and resume normal processing.
- Our test simulates normal activity, both before and after the outage.
- Our test simulates the database outage by using a configurable wrapping JDBC driver, similar to p6spy, but actually adapted from some ideas published many years ago in an O'Reilly OnJava article.
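The general shape of such a wrapper is easy to sketch: a delegating `java.sql.Driver` that passes every call through to the real vendor driver until a flag flips, at which point each `connect` fails the way a network-level outage would. The sketch below is illustrative, not our actual code; the class names, the fake delegate, and the choice of SQLState `08S01` are all assumptions.

```java
import java.sql.*;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Logger;

public class OutageSimDemo {

    /** Delegating driver: passes calls through unless the outage flag is set. */
    static class OutageSimDriver implements Driver {
        private static final AtomicBoolean outage = new AtomicBoolean(false);
        private final Driver delegate;
        OutageSimDriver(Driver delegate) { this.delegate = delegate; }
        static void setOutage(boolean down) { outage.set(down); }

        @Override public Connection connect(String url, Properties info) throws SQLException {
            if (outage.get()) {
                // Mimic a connection-level failure; 08S01 is "communication link failure".
                throw new SQLException("simulated outage: database unreachable", "08S01");
            }
            return delegate.connect(url, info);
        }
        @Override public boolean acceptsURL(String url) throws SQLException { return delegate.acceptsURL(url); }
        @Override public DriverPropertyInfo[] getPropertyInfo(String u, Properties i) { return new DriverPropertyInfo[0]; }
        @Override public int getMajorVersion() { return delegate.getMajorVersion(); }
        @Override public int getMinorVersion() { return delegate.getMinorVersion(); }
        @Override public boolean jdbcCompliant() { return delegate.jdbcCompliant(); }
        @Override public Logger getParentLogger() { return Logger.getGlobal(); }
    }

    /** Stand-in for the real vendor driver, so this demo runs without a DBMS. */
    static class FakeVendorDriver implements Driver {
        @Override public Connection connect(String url, Properties info) {
            return null; // a real driver would return a live Connection here
        }
        @Override public boolean acceptsURL(String url) { return url.startsWith("jdbc:fake:"); }
        @Override public DriverPropertyInfo[] getPropertyInfo(String u, Properties i) { return new DriverPropertyInfo[0]; }
        @Override public int getMajorVersion() { return 1; }
        @Override public int getMinorVersion() { return 0; }
        @Override public boolean jdbcCompliant() { return false; }
        @Override public Logger getParentLogger() { return Logger.getGlobal(); }
    }

    public static void main(String[] args) {
        Driver d = new OutageSimDriver(new FakeVendorDriver());
        try {
            d.connect("jdbc:fake:db", new Properties());
            System.out.println("connect ok");
        } catch (SQLException e) {
            System.out.println("unexpected: " + e.getMessage());
        }

        OutageSimDriver.setOutage(true); // "cut the network"
        try {
            d.connect("jdbc:fake:db", new Properties());
            System.out.println("connect ok");
        } catch (SQLException e) {
            System.out.println("outage detected: " + e.getSQLState());
        }
    }
}
```

In the real test the flag would be driven by configuration rather than a direct method call, but the delegation pattern is the same one p6spy uses.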
At first, many of the problems were with the test itself. It didn't do a very good job of simulating database outages, so sometimes the test would expect the database to be showing an outage when it actually wasn't, and vice versa. And it didn't do a very good job of simulating normal system activity, so it wouldn't succeed in forcing the code down paths that we expected it to touch, or it would predict that the system would exhibit certain behaviors that we never saw when we ran the real system and provoked a real DBMS outage (e.g., by shutting down the DBMS server process manually).
But after a while we started to become confident in the test, and at about the same time we started to find various real problems in the system under test.
- Our notion of status tracking wasn't very precise: the test revealed that the code wasn't being very careful about tracking either the status of the system or the status of the database.
- The system wasn't being very crisp about state transitions, so sometimes it would sort of lurch haphazardly between states.
We continue to run the test; it continues to fail in new and different ways; there continue to be new mysteries to contemplate and explore; life is good!