I quite enjoyed Richard Cook's presentation at the Velocity 2012 Conference: How Complex Systems Fail. The video is about 30 minutes long, but it moves right along and he is an excellent speaker.
Among the key concepts in the talk is this observation:
As systems developers, we design for reliability:
- stiff boundaries, layers, formalisms
- defence in depth
- interference protection

But what we actually want is resilience, which is different:
- withstand transients
- recover swiftly and smoothly from failures
- prioritize to serve high-level goals
- recognize and respond to abnormal situations
- adapt to change
If you consider yourself a serious systems software engineer, or if you want to become one, you should listen to Cook's talk and go read some of his papers at his web site. He is a clear speaker and writer, and his proposals are sensible and grounded in real experience. Start by reading this concise summary: How Complex Systems Fail, and then move on from there to explore Cook's ideas about how to make systems safer by making them more resilient, for example in Operating at the Sharp End: The Complexity of Human Error.
Thankfully, the systems that I work on are nowhere near as safety-critical as the ones Cook considers, but I'm quite grateful to him for sharing his experiences and observations, because even ordinary systems software can be made more stable, more tolerant of errors, and more adaptable by considering these issues.
Update: Fixed the link to the CTLab site (thanks Anton!)
Update 2: Fixed the video link (is Blogger eating my links? Or am I just getting old... never mind, don't answer that.)