I happened across two different "Internet scale" articles over the weekend that are related and worth considering together.
The first one is a few years old, but it had been a while since I'd read it, and I happened to re-read it: On Designing and Deploying Internet-Scale Services. This paper was written by James Hamilton when he was part of the MSN and Windows Live teams at Microsoft, and in the paper he discusses a series of "lessons learned" about building and operating "Internet scale" systems.
The paper is rich with examples and details, rules of thumb, techniques, and approaches. Near the front of the paper, Hamilton distills three rules that he says he learned from Bill Hoffman:
You can read more about Hoffman's approach to reliable Internet scale systems in this ACM Queue article.
- Expect failures. A component may crash or be stopped at any time. Dependent components might fail or be stopped at any time. There will be network failures. Disks will run out of space. Handle all failures gracefully.
- Keep things simple. Complexity breeds problems. Simple things are easier to get right. Avoid unnecessary dependencies. Installation should be simple. Failures on one server should have no impact on the rest of the data center.
- Automate everything. People make mistakes. People need sleep. People forget things. Automated processes are testable, fixable, and therefore ultimately much more reliable. Automate wherever possible.
These are three superb rules, even if they are hard to get right. Of course, the first step to getting them right is to get them out there, in front of everybody, to think about.
More recently, you don't want to miss this great article by Jay Kreps, an engineer on the LinkedIn team: Getting Real About Distributed System Reliability. Again, it's just a treasure trove of lessons learned and techniques for designing and building reliability into your systems from the beginning.
I have come around to the view that the real core difficulty of these systems is operations, not architecture or design. Both are important but good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations. This is quite different from the view of unbreakable, self-healing, self-operating systems that I see being pitched by the more enthusiastic NoSQL hypesters.
One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
As Hamilton's article points out, the emphasis on failure handling and on failure testing builds on the decades old work of Professors David Patterson and Armando Fox in their Recovery Oriented Computing and Crash-only Software efforts.
Crash-only programs crash safely and recover quickly. There is only one way to stop such software – by crashing it – and only one way to bring it up – by initiating recovery. Crash-only systems are built from crash-only components, and the use of transparent component-level retries hides intra-system component crashes from end users.
When these ideas were first being proposed they faced a certain amount of skepticism, but now, years later, they are accepted wisdom, and the stability and reliability of systems like Amazon, Netflix, and LinkedIn is testament to the fact that these techniques do in fact work.