So I wasn't much impacted by last week's Skype system outages, but I was still interested, because Skype is a big complex system and I love big complex systems :)
If, like me, you're fascinated by how these systems are built and maintained, and what we can learn from the problems of others, you'll want to dig into some of what's been written about the Skype outage:
- Start with this report from Lars Rabbe, Skype's CIO.
- Also check Dan York's blog; he's been publishing some very interesting information about the outage.
- Here's some background material from Skype about their basic architecture.
- And here's a nice, though somewhat old, paper from some researchers at Columbia with some great background information about Skype's system architecture.
Building immense, complicated distributed systems is incredibly hard; I've been working in the field for 15 years and I'm painfully aware of how little I really know about it.
It's wonderful that Skype is being so forthcoming about the problem, what caused it, what was done to fix it, and how it could be avoided in the future. I am always grateful when others take the time to write up information like this -- post-mortems are great, so thanks Skype!