Monday, May 18, 2009

TimerTask exception handling

We were running an enormous distributed multi-machine performance test, stressing the system heavily, and it wasn't working very well. Deep into the test, something was going wrong. Crawling through the logs, we discovered a strange error:


java.lang.IllegalStateException: Timer already cancelled


I was mystified: what did this mean?

Roger, my co-worker, however, had seen this message before, and knew: "it means that Java ran out of memory." Sure enough, we increased the -Xmx setting for the JVM, and the message disappeared.

I was happy that we resolved the problem, but unhappy with the manner of the resolution: why hadn't an out-of-memory problem displayed the symptom I expected, namely: java.lang.OutOfMemoryError?

So, with a few more hints from Roger, I studied this some more, and uncovered this bit of wisdom from the JavaDoc for java.util.Timer:


If the timer's task execution thread terminates unexpectedly, for example, because its stop method is invoked, any further attempt to schedule a task on the timer will result in an IllegalStateException, as if the timer's cancel method had been invoked.


Looking in our code, I saw that we had implemented a sub-class of java.util.TimerTask, and our sub-class has a run() method which catches Exception, but does not catch Throwable. Therefore, I think it is possible that exceptions may be terminating our run(), resulting in the timer being marked cancelled as described above. In particular, since it extends Error, not Exception, OutOfMemoryError would be such an exception.

Furthermore, I saw that none of our calls to timer.schedule() catch or check for IllegalStateException (it's not a checked exception).

So it seems quite plausible that we could have the following sequence of activity:

  1. The app server runs out of memory, and the thread that notices it is a TimerTask thread running our run method

  2. OutOfMemoryError is thrown, is not caught, and terminates the run method

  3. The timer is marked cancelled because run threw an exception

  4. Subsequently, uses of that timer result in IllegalStateException



I talked this over with several others and it seemed like it was holding water. Googling, it seemed like I wasn't the only holder of this theory. So I proceeded to the next step: I wrote a test.

Sure enough, when I threw an exception that extended Error out of the run method, subsequent calls to the schedule method of the corresponding timer threw IllegalStateException exceptions with the message "Timer already cancelled".

Confirmed!

My current plan is as follows:

  1. Catch Throwable, not just Exception in our run method.

  2. Log any such exception to our error log, together with a full stack trace.

  3. Don't allow exceptions to propagate out of our run method.

  4. Check for IllegalStateException when calling timer.schedule, and wrap any such exception that we get with a wrapper exception indicating that the JVM might have run out of memory before throwing it

And to hold it all together, I expanded my test case to contain three tests:
  1. A test which throws an Error-based exception in the TimerTask, to prove that the catch block now catches it.
  2. A test which tries to call schedule on a Timer after an Error-based exception has been thrown and caught in the TimerTask, to prove that the Timer can still be used.
  3. A test which tries to call schedule on a Timer after cancel of the Timer, to prove that the new chained exception is thrown properly and that its text clearly indicates a possible out-of-memory condition.

Hopefully, this new caution surrounding TimerTask exceptions will not only prevent our timers from getting into this damaged state, but will also improve the error messages to the log when resources start to get low.

And, there's now three more tests in the test suite!

4 comments:

  1. Hi Bryan,

    been there - it's not unchecked exceptions that you're after; in fact RuntimeException does extend Exception; however Error is a separate thing. So the test cases should probably throw Error-s not exceptions

    Cheers,
    Anton

    ReplyDelete
  2. TimerTask is kinda @deprecated. The new JDK concurrency API in Java 5 should be consider to use instead.

    Page 123 in Java Concurrency in Practice, book by Brian Goetz

    ReplyDelete
  3. Thanks Anton -- you were right, the test needed to throw Error. It's an even better test now!

    ReplyDelete
  4. Thanks Claus -- I'll have a look at the new API.

    ReplyDelete