One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
I love the idea of the Chaos Monkey!
Testing error recovery is hard, mostly because it's often hard to provoke the errors in the first place. Errors caused by bad input are the exception: they're fairly straightforward to provoke, so you should always have a thorough suite of tests that tries lots of invalid invocations of your code: syntax errors, parameters out of range, missing values for required fields, invalid combinations of requests, etc.
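As a rough illustration of that kind of bad-input suite, here's a minimal Python sketch. The function under test, `parse_port`, and the list of invalid values are made up for the example; the point is only the shape of the test, which walks a table of invalid invocations and checks that each one is rejected cleanly.

```python
import unittest


def parse_port(text):
    """Hypothetical code under test: parse a TCP port number from a string."""
    if text is None or text.strip() == "":
        raise ValueError("port is required")
    try:
        port = int(text)
    except ValueError:
        raise ValueError(f"port is not a number: {text!r}")
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


# Missing values, syntax errors, and out-of-range parameters.
INVALID_INPUTS = [None, "", "   ", "abc", "80x", "1.5", "0", "-1", "65536"]


class BadInputTests(unittest.TestCase):
    def test_invalid_inputs_are_rejected(self):
        for bad in INVALID_INPUTS:
            # subTest reports each failing input separately.
            with self.subTest(bad=bad):
                with self.assertRaises(ValueError):
                    parse_port(bad)

    def test_valid_input_is_accepted(self):
        self.assertEqual(parse_port("8080"), 8080)
```

Keeping the invalid cases in a table makes it cheap to add another one every time somebody finds a new way to break the parser.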
It's harder to provoke errors that are caused by other conditions: resource shortages, disk or network I/O errors, etc. In a number of my tests I provoke these errors using surrogate mechanisms:
- To simulate I/O-related problems I tamper with file or directory permissions, or I remove or rename files and directories that the program wants to access
- To simulate network problems I shut down one or the other end of a conversation, or I use invalid network addresses
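The file-renaming trick from the list above might look something like this in Python. This is just a sketch: `read_setting` is a hypothetical piece of code under test, and the file names are invented for the example.

```python
import os
import tempfile


def read_setting(path, default="fallback"):
    """Hypothetical code under test: fall back to a default if the file can't be read."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read().strip()
    except OSError:
        return default


def check_recovery_from_missing_file():
    """Read a file normally, then rename it away and read again."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "setting.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("42\n")

        normal = read_setting(path)

        # Surrogate for an I/O failure: the file the program wants is gone.
        os.rename(path, path + ".hidden")
        recovered = read_setting(path)

        return normal, recovered
```

I used a rename here rather than a permissions change because, on Unix-like systems, a test running as root can still read a file whose permissions have been taken away, so the "error" never happens.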
There are more sophisticated tools for doing this; for example, check out Holodeck.
And you can also modify your application so that it is easier to test; quite commonly this involves modifications which allow testers to force the software through error conditions. This is often called "testability"; here's a pointer to a recent testing conference -- note how many of the talks are focused on various aspects of testability. Currently, much of the focus in the testing world is on the notion of "mock" objects, and indeed they can be very powerful and worth building into your test harness. Here's an interesting recent example: Mocking the File System to Improve Testability.
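To make the mock idea concrete, here's a small sketch using Python's standard `unittest.mock` module to force a "disk full" error that would be awkward to provoke for real. The `save_report` function and the file name are hypothetical, and this is not the approach from the article mentioned above, just one simple way to do it.

```python
import errno
from unittest import mock


def save_report(path, text):
    """Hypothetical code under test: report failure instead of crashing."""
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        return True
    except OSError:
        return False


def check_disk_full_is_handled():
    """Replace the built-in open() with a mock that always fails with ENOSPC."""
    disk_full = OSError(errno.ENOSPC, "No space left on device")
    with mock.patch("builtins.open", side_effect=disk_full) as fake_open:
        ok = save_report("report.txt", "hello")
    # ok should be False, and the mock should have been called exactly once.
    return ok, fake_open.call_count
```

Nothing touches the real file system here, so the test is fast and leaves no mess behind, but it also only proves that the code copes with the error you thought to simulate.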
You can also use randomization, and stress: here at my day job we have something we call the Submitatron, which is a tiny little script that simply loops around, generating arbitrary data and sending it to the server. Similar techniques, which focus more on randomization, are often referred to as "fuzz testing".
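I can't share the Submitatron itself, but a toy version of the same idea might look like the sketch below. Everything in it is an assumption for illustration: `handle_request` stands in for the server, and the "key=number" request format is invented. The loop simply generates arbitrary bytes, submits them, and records anything that blows up.

```python
import random


def handle_request(payload: bytes) -> str:
    """Hypothetical stand-in for the server: should never raise on bad input."""
    try:
        text = payload.decode("utf-8")
        key, value = text.split("=", 1)
        return f"ok:{key.strip()}={int(value)}"
    except (UnicodeDecodeError, ValueError):
        return "error:bad request"


def fuzz(iterations=1000, seed=0):
    """Send random payloads to the handler and collect unexpected exceptions."""
    rng = random.Random(seed)  # seeded, so a failing run can be replayed
    crashes = []
    for _ in range(iterations):
        length = rng.randrange(0, 64)
        payload = bytes(rng.randrange(256) for _ in range(length))
        try:
            handle_request(payload)
        except Exception as exc:
            crashes.append((payload, exc))
    return crashes
```

Seeding the random generator is the one design choice worth calling out: random testing is much less useful if you can't reproduce the input that caused the failure.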
But the most important thing is to think about testing, think about errors, think about failures, and try it out!
So, big kudos to the Netflix team and their Chaos Monkey!