Wednesday, January 25, 2023

When your secondary system causes a problem in your primary system.

I found this very intriguing description of yesterday's NYSE malfunction and how it was due to incorrect operation of the Stock Exchange's Disaster Recovery system:

After the 9/11 disaster, the NYSE was obligated to maintain a primary trading site (at the NYSE) and a back-up site (which is in Chicago).

On Monday evening, routine maintenance was being performed on the software for the Chicago back-up site.

On Tuesday morning, the back-up system (Chicago) was mistakenly still running when the primary system (NYSE) came online.

Because the back-up was still running, when the primary site started up, some stocks behaved as if trading had already started.
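Just to make the failure mode concrete, here's a tiny, purely hypothetical Python sketch (my own invention; it has nothing to do with how the exchange's actual software works) of the kind of startup guard you'd want: before the primary opens the trading day, it checks that the backup really is standing by, and refuses to start if both sites would end up active at once.

```python
import enum


class SiteState(enum.Enum):
    STANDBY = "standby"
    ACTIVE = "active"


class TradingSite:
    """Toy model of one trading site; purely illustrative, not the NYSE's design."""

    def __init__(self, name: str):
        self.name = name
        self.state = SiteState.STANDBY

    def open_trading_day(self, peer: "TradingSite") -> None:
        # Dual-active guard: if the other site is still running as ACTIVE,
        # opening here would make instruments look as though trading had
        # already started, so refuse loudly instead.
        if peer.state is SiteState.ACTIVE:
            raise RuntimeError(
                f"{self.name}: peer {peer.name} is still active; "
                "refusing to open a second trading session"
            )
        self.state = SiteState.ACTIVE
        print(f"{self.name}: trading day opened")


if __name__ == "__main__":
    chicago = TradingSite("backup-chicago")
    nyse = TradingSite("primary-nyse")

    # The backup was mistakenly left running after the overnight maintenance...
    chicago.state = SiteState.ACTIVE

    # ...so the primary's startup should fail loudly rather than silently
    # opening a second, inconsistent session.
    try:
        nyse.open_trading_day(peer=chicago)
    except RuntimeError as err:
        print(err)
```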

This is very very tricky stuff. Disaster Recovery mechanisms in software systems are extremely hard to test, because in practice disasters are quite rare, and you can't really just make a (real) disaster happen in order to test your software.

So what people do, in general, is pretend that a disaster has happened: practice switching over to their secondary site, verify that everything in fact switched over properly, and then switch back.

On my team, these are called "site switch exercises", and we do them a lot, because we need the practice.
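Here's roughly what such an exercise looks like in spirit, as a minimal, made-up Python sketch (the Site class and its methods are invented for illustration, not taken from any real system): simulate the disaster, fail over to the secondary, verify it, and switch back.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Site:
    """A stand-in for one deployment site; everything here is illustrative."""
    name: str
    active: bool = False
    healthy: bool = True

    def activate(self) -> None:
        print(f"{self.name}: taking traffic")
        self.active = True

    def deactivate(self) -> None:
        print(f"{self.name}: draining and standing down")
        self.active = False


def site_switch_exercise(primary: Site, secondary: Site,
                         checks: List[Callable[[Site], bool]]) -> bool:
    """Pretend a disaster hit the primary, fail over, verify, then switch back."""
    # 1. Simulate the disaster: stop serving from the primary.
    primary.deactivate()

    # 2. Bring the secondary up as the serving site.
    secondary.activate()

    # 3. Verify that everything really did switch over.
    switched_ok = all(check(secondary) for check in checks)

    # 4. Switch back so the exercise ends in the normal configuration.
    secondary.deactivate()
    primary.activate()
    return switched_ok


if __name__ == "__main__":
    primary = Site("primary")
    secondary = Site("secondary")
    ok = site_switch_exercise(
        primary, secondary,
        checks=[lambda s: s.active, lambda s: s.healthy],
    )
    print("exercise passed:", ok)
```

The interesting part, of course, is step 3: deciding what "everything switched over properly" means, and checking it without disturbing the system you're trying to protect.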

But just doing one of these exercises can in fact cause a problem, as we see with the unfortunate incident at the NYSE.

It's a really hard problem: you don't want to say "don't run any tests", because then how do you know that your Disaster Recovery system would actually survive a disaster?

But if you do run them, sometimes the test itself causes a disaster, and you feel bad.
