Monday, November 12, 2012

AWS 10/22 outage

I've been looking through Amazon's information about the October 22 AWS outage: Summary of the October 22, 2012 AWS Service Event in the US-East Region.

As you might expect, given the incredible sophistication of AWS, the outages are becoming more sophisticated as well; this outage was no exception, as it involved a cascade of events:

  • A hardware failure caused an internal monitoring server to go down
  • The DNS information describing the replacement for that server was not successfully propagated to all the internal DNS servers (Amazon runs their own DNS implementation)
  • This meant that other servers continued to try to contact the no-longer-operational server
  • A memory leak in the error-handling code for failure-to-contact-internal-monitoring-server caused memory pressure in the production servers (sketched below)
  • This in turn caused those servers to run out of swap space and become non-responsive
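
To make the leak concrete, here is a minimal sketch (in Python, with entirely hypothetical names; this is my guess at the general shape of the bug, not Amazon's actual code) of how a memory leak can hide in an error path:

    import time
    import traceback

    class MonitoringClient:
        """Hypothetical client that reports metrics to an internal monitoring server."""

        def __init__(self, endpoint):
            self.endpoint = endpoint
            # The leak: every failed attempt is stashed here "for later analysis"
            # and nothing ever prunes the list. While the monitoring server is
            # reachable this list stays tiny; once the server goes away, every
            # call fails and the list grows without bound.
            self.failed_reports = []

        def send(self, metric):
            try:
                self._post(self.endpoint, metric)  # raises while the server is down
            except ConnectionError:
                self.failed_reports.append((time.time(), metric, traceback.format_exc()))

        def _post(self, endpoint, metric):
            # Stand-in for a real HTTP call to the monitoring server.
            raise ConnectionError("cannot reach " + endpoint)

As long as the monitoring server is healthy, the error path is almost never taken, so the leak is invisible in normal operation, and in most test suites.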

I can sympathize with the AWS developers: memory leaks in error-handling code are easy things to overlook, and it is quite challenging to write test suites thorough enough to detect them (a toy harness is sketched after this list):

  • First you have to provoke those errors, which can be quite challenging
  • Then you have to have a test harness which can observe memory leaks
  • And then you have to provoke the leak a sufficient number of times that the harness is able to detect it
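
A toy version of such a harness (assuming the hypothetical MonitoringClient sketched above, and using Python's tracemalloc) might provoke the failure a large number of times and then check that retained memory has stopped growing:

    import tracemalloc

    def test_error_path_does_not_leak():
        # Hypothetical endpoint; the server is unreachable, so every send()
        # exercises the error-handling path.
        client = MonitoringClient("http://monitoring.internal:8125")

        # Warm up so caches, buffers, and so on reach a steady state first.
        for _ in range(1_000):
            client.send({"metric": "heartbeat"})

        tracemalloc.start()
        before, _ = tracemalloc.get_traced_memory()

        # Provoke the error path many times; even a few hundred leaked bytes
        # per failure becomes obvious at this scale.
        for _ in range(100_000):
            client.send({"metric": "heartbeat"})

        after, _ = tracemalloc.get_traced_memory()
        tracemalloc.stop()

        # Steady-state churn is fine; unbounded growth is not.
        assert after - before < 1_000_000, "error path leaked %d bytes" % (after - before)

Run against the leaky client above, this test fails, which is exactly the point: the leak only shows up when the error is provoked tens of thousands of times under a harness that is watching memory.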

As Jeff Darcy observes, these kinds of testing (coping with distributed failures, and making sure your error-handling is up to snuff) are areas for which some good tools have already been developed:
Another possibility would be from the Recovery Oriented Computing project: periodically reboot apparently healthy subsystems to eliminate precisely the kind of accumulated degradation that something like a memory leak would cause. A related idea is Netflix’s Chaos Monkey: reboot components periodically to make sure the recovery paths get exercised.
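
As an illustration of the reboot-to-exercise-recovery-paths idea (my own toy sketch, not Chaos Monkey itself; the service names and the systemctl call are placeholders), even something as simple as a loop that periodically restarts one randomly chosen service keeps both accumulated state and the recovery code honest:

    import random
    import subprocess
    import time

    # Hypothetical service names; in a real deployment these would come from
    # service discovery rather than a hard-coded list.
    SERVICES = ["metrics-collector", "dns-propagator", "volume-manager"]

    def restart_one_random_service():
        victim = random.choice(SERVICES)
        # systemctl is just a stand-in for whatever restart mechanism you use.
        subprocess.run(["systemctl", "restart", victim], check=True)
        return victim

    if __name__ == "__main__":
        while True:
            print("restarted", restart_one_random_service())
            time.sleep(6 * 60 * 60)  # every six hours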

It's also interesting to observe that some of the human actions that the AWS operations team took while trying to deal with the problem caused problems of their own:

We use throttling to protect our services from being overwhelmed by internal and external callers that intentionally or unintentionally put excess load on our services. A simple example of the kind of issue throttling protects against is a runaway application that naively retries a request as fast as possible when it fails to get a positive result. Our systems are scaled to handle these sorts of client errors, but during a large operational event, it is not uncommon for many users to inadvertently increase load on the system. So, while we always have a base level of throttling in place, the team enabled a more aggressive throttling policy during this event to try to assure that the system remained stable during the period where customers and the system were trying to recover. Unfortunately, the throttling policy that was put in place was too aggressive.
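
For readers who haven't implemented throttling, a minimal token-bucket sketch (mine, not AWS's) shows how a "more aggressive policy" is really just a matter of turning the knobs down, and how easy it is to turn them down far enough to starve legitimate callers too:

    import time

    class TokenBucket:
        """Allow roughly `rate` requests per second, with bursts up to `burst`."""

        def __init__(self, rate, burst):
            self.rate = rate              # tokens added per second
            self.burst = burst            # maximum tokens the bucket can hold
            self.tokens = burst
            self.last = time.monotonic()

        def allow(self):
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False                  # caller is throttled and should back off

    # A "more aggressive" policy is just smaller numbers; set them too small and
    # well-behaved callers get rejected along with the runaway ones.
    normal_policy = TokenBucket(rate=100, burst=200)
    aggressive_policy = TokenBucket(rate=5, burst=5)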

Higher-level software systems also encountered problems in their attempts to handle and recover from the lower-level problems, such as this behavior in Amazon Relational Database Service:

The second group of Multi-AZ instances did not failover automatically because the master database instances were disconnected from their standby for a brief time interval immediately before these master database instances’ volumes became stuck. Normally these events are simultaneous. Between the period of time the masters were disconnected from their standbys and the point where volumes became stuck, the masters continued to process transactions without being able to replicate to their standbys. When these masters subsequently became stuck, the system blocked automatic failover to the out-of-date standbys.
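
The decision being described amounts to refusing to promote a standby that is known to be behind. A toy sketch of that check (my own illustration, not RDS internals; the log-sequence-number arguments are hypothetical):

    def should_fail_over(master_stuck, standby_applied_lsn, master_committed_lsn):
        """Promote the standby only if it has applied everything the master committed.

        If the master kept committing after replication broke (the window
        described above), the standby is behind, and automatic promotion would
        silently lose those transactions; block it and wait for a human.
        """
        if not master_stuck:
            return False
        return standby_applied_lsn >= master_committed_lsn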

Given the amount of complexity in the overall system, it's remarkable that Amazon were able to analyze all of these events, and deal with them, in only eight hours. It wasn't so long ago that outages such as these were measured in days or weeks.

Still, eight hours is a long time for AWS.

As the team at Netflix describe, businesses that build atop AWS need to consider these issues, and ensure that they have their own processes and tools to handle such situations: Post-mortem of October 22, 2012 AWS degradation:

We’ve developed a few patterns for improving the availability of our service. Past outages and a mindset for designing in resiliency at the start have taught us a few best-practices about building high availability systems.

The Netflix analysis is well worth reading; it contains lots of information about the tools and techniques that Netflix use for these purposes, many of which they've open-sourced to the world.

Thanks once again to the AWS team for providing such detailed information; each time I read these reports, I learn more, and think about more ways that I can improve my own testing and development.
