Wednesday, February 17, 2016

Blind injection

Obviously, the gravitational wave discovery was extremely, extremely cool.

But what I thought was in a lot of ways much cooler was the technique the team used to build in a process which ensured that they were extremely careful in their analysis, and weren't easily fooled: LIGO-Virgo Blind Injection

The LIGO Scientific Collaboration and the Virgo Collaboration conducted their latest joint observation run (using the LIGO Hanford, LIGO Livingston, Virgo and GEO 600 detectors) from July, 2009 through October 2010, and are jointly searching through the resulting data for gravitational wave signals standing above the detector noise levels. To make sure they get it right, they train and test their search procedures with many simulated signals that are injected into the detectors, or directly into the data streams. The data analysts agreed in advance to a "blind" test: a few carefully-selected members of the collaborations would secretly inject some (zero, one, or maybe more) signals into the data without telling anyone. The secret goes into a "Blind Injection Envelope", to be opened when the searches are complete. Such a "mock data challenge" has the potential to stress-test the full procedure and uncover problems that could not be found in other ways.

It must be really pleasing, in that what-makes-an-engineer-deeply-satisfied sort of way, to open up the Blind Injection Envelope and discover that your analysis was in fact correct.

It isn't a perfect comparison, but I am strongly reminded of the Netflix engineering team's approach to building in fault tolerance and reliability in their systems by intentionally provoking failures: The Netflix Simian Army

Imagine getting a flat tire. Even if you have a spare tire in your trunk, do you know if it is inflated? Do you have the tools to change it? And, most importantly, do you remember how to do it right? One way to make sure you can deal with a flat tire on the freeway, in the rain, in the middle of the night is to poke a hole in your tire once a week in your driveway on a Sunday afternoon and go through the drill of replacing it. This is expensive and time-consuming in the real world, but can be (almost) free and automated in the cloud.

This was our philosophy when we built Chaos Monkey, a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables -- all the while we continue serving our customers without interruption. By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won't even notice.

In computing circles, this sort of thing is often gathered under the term Recovery Oriented Computing, but I really like the term "blind injection."

I'm going to remember that, and keep my eye out for places to take advantage of that technique.

No comments:

Post a Comment