Sunday, April 26, 2015

Fragility

Programmers have an expression for a particular type of flaw in program code.

That's fragile, they say.

When a programmer points to a segment of code and calls it fragile, they mean something very specific.

They mean that a minor, innocent-looking, apparently-unrelated change to the overall program at some later date would cause this bit of code to unexpectedly fail without warning.

It might be because this code is in fact linked to the behavior of some other part of the system, but the linkage is not made explicit. For example, you might have an array which needs to be the same size as some other data structure elsewhere in the system, but nothing checks that the two sizes actually match, so changing the other part of the system would silently break an assumption that this code depends on.

In programming, this particular mistake is usually described as a violation of the Don't Repeat Yourself principle, but there are lots of other ways to end up with "fragile code."
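Here's a minimal sketch of that kind of hidden linkage in C; the array names and sizes are invented for illustration. The first pair of arrays repeats the size in two places with nothing tying them together; the second states the linkage once, so a later change can't silently split them apart.

```c
#include <stddef.h>

/* Fragile: these two sizes must stay equal, but nothing enforces it.
   Someone enlarging sensor_readings later would quietly break any
   code that still assumes 16 slots in both arrays. */
double sensor_readings[16];
double smoothed_readings[16];

/* Less fragile: state the linkage once and derive everything else
   from it, so a single change updates both arrays and the loop
   bound together. */
#define READING_COUNT 16
double raw[READING_COUNT];
double smoothed[READING_COUNT];

void copy_readings(void) {
    for (size_t i = 0; i < READING_COUNT; i++) {
        smoothed[i] = raw[i];
    }
}
```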

I was thinking about this the other day when I read an intriguing essay on the Nautilus site: Why the Flash Crash Really Matters.

The essay talks about the financial system, and compares it to various other situations in which an otherwise minor event had catastrophic consequences:

This is why asking whether the Waddell & Reed sale, or the behavior of a manipulative trader, really caused the crash is a mistake. The disparate “causal” explanations of the crash can’t be reconciled with each other for a simple reason: They aren’t in conflict. The Flash Crash was an emergent phenomenon. Just as any grain of sand might cause the sandpile to collapse, and as Three Mile Island’s meltdown could be attributed to a failed pump, stuck valve, or operator error, the trigger for the Flash Crash could have been related to Sarao, Waddell & Reed, or something else entirely. The true roots were in the complexity of the system itself.

In large-scale systems programming (database systems, distributed systems, web servers, network file systems, etc.), there is a problem that arises when the system reaches a certain size and complexity: you can no longer hold the whole thing in your head. If you aren't careful, when your system reaches that size, you will find it completely breaking down: bugs crop up left and right, unexpectedly, faster than you can fix them; you feel like you're playing "Whack-a-Mole" with the breakages in your software.

The only way out of this is to graduate to a whole new type of system design and implementation. Your overall system must be componentized; individual modules must have clear responsibilities and clean interfaces; the boundaries between modules must be well-known to all the teams; extra care must be taken with code paths which cross module boundaries; system interfaces must check their parameters, assert their pre-constraints, and generally validate that the overall rules of the system are being obeyed.
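As a rough illustration of that last point, here is a hypothetical sketch of boundary checking at a module interface, again in C. The module, function name, error codes, and size limit are all invented; the point is only that the interface validates its own inputs instead of trusting its callers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_OK        0
#define CACHE_EBADARG  -1

/* Public entry point of a hypothetical "cache" module. */
int cache_put(const char *key, const void *value, size_t value_len) {
    /* Validate parameters coming across the module boundary. */
    if (key == NULL || value == NULL || value_len == 0) {
        return CACHE_EBADARG;
    }
    if (strlen(key) == 0) {
        return CACHE_EBADARG;
    }

    /* Assert preconditions that should hold if the rest of the
       system is obeying the rules; fail loudly in debug builds
       rather than corrupting state silently. */
    assert(value_len <= 4096 && "caller exceeded the documented value size");

    /* ... actual storage logic would go here ... */
    return CACHE_OK;
}
```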

Is the modern international financial system effectively a giant software system? The authors of the Nautilus essay clearly think so:

In the years since the Flash Crash, the SEC has implemented measures to reduce tight coupling in the markets, and exchanges now pause trading if there are drastic price moves in individual securities. These measures help, but are they enough? The fundamental interactive complexity of the market and the unpredictable and difficult-to-observe interactions between software components, trading models, and market participants remain in place.

"Tight coupling," eh? Yes, that's a code smell.

It won't be easy to re-architect the world's financial system, to remove its fragility and make it more scalable and less susceptible to catastrophic failures.

But, somehow, it needs to be done.
