Saturday, November 2, 2013

There are bugs, and then there are bugs

As everyone knows, I adore post mortems; there's so much to learn.

As David Wilson points out on his python sweetness blog, this is one of the greatest bug descriptions of all time, and even though it's contained in an official SEC finding, it's still fascinating to read.

The SEC doesn't bury the lede; they get right to it:

On August 1, 2012, Knight Capital Americas LLC (“Knight”) experienced a significant error in the operation of its automated routing system for equity orders, known as SMARS. While processing 212 small retail orders that Knight had received from its customers, SMARS routed millions of orders into the market over a 45-minute period, and obtained over 4 million executions in 154 stocks for more than 397 million shares. By the time that Knight stopped sending the orders, Knight had assumed a net long position in 80 stocks of approximately $3.5 billion and a net short position in 74 stocks of approximately $3.15 billion.

Yes, I'd call that a significant error.

If you're like me, questions just leap to mind when you read this:

  • What was the actual bug, precisely?
  • Why was it not caught during testing?
  • Why do they not have safeguards at other levels of the software?
  • Why do these sorts of bugs not appear more frequently?

Happily, the SEC document does not disappoint.

Let's start by getting a high-level understanding of RLP and SMARS:

To enable its customers’ participation in the Retail Liquidity Program (“RLP”) at the New York Stock Exchange, which was scheduled to commence on August 1, 2012, Knight made a number of changes to its systems and software code related to its order handling processes. These changes included developing and deploying new software code in SMARS. SMARS is an automated, high speed, algorithmic router that sends orders into the market for execution. A core function of SMARS is to receive orders passed from other components of Knight’s trading platform (“parent” orders) and then, as needed based on the available liquidity, send one or more representative (or “child”) orders to external venues for execution.
Well, that helps a little bit, but it doesn't hurt to learn a bit more. The New York Times takes a stab at explaining RLP: Regulators Approve N.Y.S.E. Plan for Its Own ‘Dark Pool’

Regulators have approved a controversial proposal from the New York Stock Exchange that will result in some stock trades being diverted away from the traditional exchange.

The program, expected to begin later this summer, will direct trades from retail investors onto a special platform where trading firms will bid to offer them the best price. The trading will not be visible to the public.

Although the details are obviously quite complex, the overall concept is straightforward: the exchange, with the government's permission, changed the regulations controlling how trades may be executed by the participant trading firms and brokers.

But note the timeframes here: the new rules were approved at the beginning of July, 2012, and went into effect on August 1, 2012. That is one compact period!

Anyway, back to the SEC finding. What, precisely, did the programmers at Knight do wrong?

Well, the first thing is that their codebase still contained some old code that was intended to be unused:

Upon deployment, the new RLP code in SMARS was intended to replace unused code in the relevant portion of the order router. This unused code previously had been used for functionality called “Power Peg,” which Knight had discontinued using many years earlier. Despite the lack of use, the Power Peg functionality remained present and callable at the time of the RLP deployment. The new RLP code also repurposed a flag that was formerly used to activate the Power Peg code. Knight intended to delete the Power Peg code so that when this flag was set to “yes,” the new RLP functionality—rather than Power Peg—would be engaged.

This is a common mistake in software: old code is disabled, not deleted. It's scary to delete code, but of course if you have a top-notch Version Control System, you can always get it back. Still, programmers can be superstitious, and I'm not surprised to hear that somebody left the old code in place.
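To make the hazard concrete, here is a minimal sketch (all names hypothetical, not Knight's actual code) of what repurposing a flag looks like when the old code it used to activate is still reachable:

```python
# Hypothetical sketch: a flag that once meant "use Power Peg" is
# repurposed to mean "use RLP", but the old handler was never deleted.

def route_order(flags, has_rlp_code):
    if flags.get("power_peg"):     # repurposed: now supposed to mean "RLP"
        if has_rlp_code:
            return "rlp"           # new behavior on an updated server
        return "power_peg"         # stale server: dead code springs to life
    return "default"

# On an updated server the flag selects the new path:
print(route_order({"power_peg": True}, has_rlp_code=True))
# On a server that missed the deploy, the very same flag
# resurrects the discontinued code:
print(route_order({"power_peg": True}, has_rlp_code=False))
```

The same input means two different things depending on which build of the program happens to be running, which is exactly the trap Knight fell into.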

Unfortunately, unused, unexecuted, untested code tends to decay, as the SEC observe:

When Knight used the Power Peg code previously, as child orders were executed, a cumulative quantity function counted the number of shares of the parent order that had been executed. This feature instructed the code to stop routing child orders after the parent order had been filled completely. In 2003, Knight ceased using the Power Peg functionality. In 2005, Knight moved the tracking of cumulative shares function in the Power Peg code to an earlier point in the SMARS code sequence. Knight did not retest the Power Peg code after moving the cumulative quantity function to determine whether Power Peg would still function correctly if called.

So, there was old, buggy code in the program, and new code written to replace it, and that new code "repurposed a flag that was formerly used to activate the Power Peg code." We've got almost all the elements in place, but there is one more significant event that the SEC highlight:
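Why does moving the cumulative-quantity function matter so much? Because, per the SEC's description, that check was the child-order loop's stop condition. A toy sketch (hypothetical names, with an artificial cap standing in for "millions of orders"):

```python
# Hypothetical illustration: the cumulative-quantity check is the only
# thing that tells the router "the parent order is filled, stop".

def send_child_orders(parent_qty, check_fill):
    filled = 0
    sent = 0
    while sent < 1000:                    # artificial cap for this demo only
        if check_fill and filled >= parent_qty:
            break                         # parent completely filled: stop
        filled += 100                     # each child order executes 100 shares
        sent += 1
    return sent

print(send_child_orders(500, check_fill=True))   # stops after 5 child orders
print(send_child_orders(500, check_fill=False))  # never stops on its own
```

With the check moved elsewhere and never retested, the stale Power Peg path behaved like the second call: it kept routing child orders long after the parent orders were filled.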

Knight deployed the new RLP code in SMARS in stages by placing it on a limited number of servers in SMARS on successive days. During the deployment of the new code, however, one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers.

In modern jargon, this sort of deployment work falls under the heading of "DevOps". Taking a program from its development phase and moving it into production seems like it should be simple, but it is fraught with peril. Techniques such as Continuous Delivery can help, but those approaches are sophisticated and take time to implement. Many organizations, as Knight apparently did, use the old-fashioned approach: "hey Joe, can you copy the updated executable onto each of our production servers?"
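Even without full Continuous Delivery, a very small amount of automation catches a partial deploy. A sketch of the idea (server names and the in-memory "fetched binaries" are stand-ins; a real version would fetch each server's artifact over the network):

```python
# Minimal sketch: after deploying, hash the artifact on every server
# and flag any machine whose binary differs from the majority.

import hashlib

def artifact_hash(code: bytes) -> str:
    return hashlib.sha256(code).hexdigest()

def find_stale_servers(deployed):
    """Return servers whose binary differs from the most common one."""
    hashes = {srv: artifact_hash(code) for srv, code in deployed.items()}
    values = list(hashes.values())
    expected = max(set(values), key=values.count)   # majority hash
    return [srv for srv, h in hashes.items() if h != expected]

servers = {f"smars-{i}": b"new RLP build" for i in range(1, 8)}
servers["smars-8"] = b"old Power Peg build"         # the missed server
print(find_stale_servers(servers))
```

A check like this, run automatically after every deploy, would have reported the eighth server before the market opened.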

The prologue has been written; the stage is set. What will happen next? The SEC finding continues:

The seven servers that received the new code processed these orders correctly. However, orders sent with the repurposed flag to the eighth server triggered the defective Power Peg code still present on that server.

I can barely imagine how challenging this must have been to diagnose under the best of conditions, but in real time, in production, with orders flowing and executives screaming, this must have been extremely hard to figure out. 88% of the work was being done correctly, with just the one computer acting in a rogue fashion.

It's interesting that, in the hours leading up to the go-live moment, there were hints and signals that something was wrong:

an internal system at Knight generated automated e-mail messages (called “BNET rejects”) that referenced SMARS and identified an error described as “Power Peg disabled.” Knight’s system sent 97 of these e-mail messages to a group of Knight personnel before the 9:30 a.m. market open.

But they didn't realize the implication of these special error messages until much later, unfortunately.

You might wonder why other fail-safes didn't kick in and abort the rogue runaway program. This, too, it turns out, was a bug, although in this case it was more of a policy bug than a programming bug.

Knight had an account—designated the 33 Account—that temporarily held multiple types of positions, including positions resulting from executions that Knight received back from the markets that its systems could not match to the unfilled quantity of a parent order. Knight assigned a $2 million gross position limit to the 33 Account, but it did not link this account to any automated controls concerning Knight’s overall financial exposure.

On the morning of August 1, the 33 Account began accumulating an unusually large position resulting from the millions of executions of the child orders that SMARS was sending to the market. Because Knight did not link the 33 Account to pre-set, firm-wide capital thresholds that would prevent the entry of orders, on an automated basis, that exceeded those thresholds, SMARS continued to send millions of child orders to the market despite the fact that the parent orders already had been completely filled.
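The missing control here is conceptually simple: a pre-trade check that refuses new orders once an account's exposure crosses its limit. A sketch (the $2 million figure comes from the SEC finding; the function and its behavior are otherwise my own illustration):

```python
# Sketch of the control Knight lacked: a hard, automated limit that
# blocks order entry once gross exposure would exceed the threshold.

GROSS_LIMIT = 2_000_000   # the 33 Account's stated $2 million limit

def try_send_order(gross_exposure, order_value):
    """Return the new exposure, or raise if the order would breach the limit."""
    if gross_exposure + order_value > GROSS_LIMIT:
        raise RuntimeError("kill switch: gross position limit exceeded")
    return gross_exposure + order_value

exposure = try_send_order(1_950_000, 40_000)   # fine: 1,990,000
try:
    try_send_order(exposure, 40_000)           # would breach the limit
except RuntimeError as err:
    print(err)
```

Because no such automated check was linked to the 33 Account, the limit existed only on paper, and SMARS kept sending orders for 45 minutes.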

When something goes wrong during a new deployment, a programmer's first instinct is to doubt the newest code. No matter what the bug is, almost always the first thing you do is look at the change you most recently made, since odds are that is what caused the problem. So this reaction was predictable, even if sadly it was exactly the worst thing to do:

In one of its attempts to address the problem, Knight uninstalled the new RLP code from the seven servers where it had been deployed correctly. This action worsened the problem, causing additional incoming parent orders to activate the Power Peg code that was present on those servers, similar to what had already occurred on the eighth server.

The SEC document contains a long series of suggestions about techniques that might have prevented these problems. To my mind, many of them ring hollow:

a written procedure requiring a simple double-check of the deployment of the RLP code could have identified that a server had been missed and averted the events of August 1.

This seems mad to me. The solution to a human mistake by a manual operator who was redundantly deploying new code to 8 production servers is not, Not, NOT to institute a "written procedure requiring a simple double-check." That is a step backwards: humans make mistakes, but the answer is not to add additional humans to the process.

Rather, the solution is that the entire deployment process should be automated, with automated deployment and automated acceptance tests.

Other observations that the SEC make are more germane, in my opinion:

in 2003, Knight elected to leave the Power Peg code on SMARS’s production servers, and, in 2005, accessed this code to use the cumulative quantity functionality in another application without taking measures to safeguard against malfunctions or inadvertent activation.

It is, in general, a Very Good Idea to reuse code. The code you don't have to write contains the fewest bugs. So I can see why the programmers at Knight wanted to keep their existing code in the system, and tried to reuse it.

The flaw, in my view, is that this code wasn't adequately tested. They used existing code, but didn't have automated regression tests and automated acceptance tests to verify that the code was behaving the way they expected.
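What would such a regression test look like? Something as small as this (a hypothetical fill-tracking function and one test, in the spirit of what was missing):

```python
# A sketch of the regression test that was never written: verify that
# routing stops exactly when the parent order is completely filled.

def fill_parent(parent_qty, child_size):
    """Route child orders until the parent order is completely filled."""
    filled = 0
    children = []
    while filled < parent_qty:
        qty = min(child_size, parent_qty - filled)
        children.append(qty)
        filled += qty
    return children

def test_stops_when_parent_filled():
    children = fill_parent(parent_qty=250, child_size=100)
    assert sum(children) == 250          # exactly the parent quantity
    assert children == [100, 100, 50]    # no runaway child orders

test_stops_when_parent_filled()
```

Run after every change that touches the routing code, including the 2005 move of the cumulative-quantity function, a test like this fails the moment the stop condition breaks.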

This is, frankly, hard. Testing is hard. It is hard to write tests, it is even harder to write good tests, and then you have to take the time to run the tests, and you have to review the results of the tests.

In my own development, it's not at all uncommon for the writing of the automated tests to take as long as the development of the original code. In fact, often it takes longer to write the tests than it does to write the code.

Moreover, the tests have to be maintained, just like the code does. And that's slow and expensive, too.

So I can understand why the programmers at Knight succumbed, apparently, to the temptation to skip the tests and deploy the code. For them, it was a tragic mistake, and Knight Capital is now part of Getco Group.

Automated trading, like many of the most sophisticated uses of computers today, is not without its risks. This is not easy stuff. The rewards are high, but the risks are high, too. It is useful to try to learn from those mistakes, sad though it may be for those who suffered from them, and so I'm glad that the SEC took the time to share their findings with the world.
