Wednesday, November 27, 2013

The Long Overdue Library Book: a very short review

I've been enjoying reading The Long Overdue Library Book.

This is a book that deserves the cliché "a labor of love". Written by two professional librarians reflecting on their decades of experience, and drawing on colleagues near and far, the book contains 50 short essays about every aspect of libraries.

My favorite story, for whatever reason, was Number 27, "Walnut Man", with its delightful ending:

"Thank you, all of you," he said, "for your gracious hospitality. We have all appreciated it very much."

My favorite excerpt from the overall book, though, is in Number 50, "Reflections", with its wonderful description of what a library is:

A safe place for the strange, a welcoming refuge warm in winter and cool in summer, the library accepts everybody who wants to come in. Unlike schools, libraries let you come and go as you want. The staff will work hard to answer your questions, and the best of them will help you tell them what you really want to know. They will protect your right to read what you wish, as long as you respect the ultimate Library Law: Return Your Books On Time.

May they survive. We need our libraries more than ever now, to protect the unpopular theories and the forgotten poets, to introduce our children to Mr. Toad and Narnia, and to remind our politicians that there is a need for civility that is as basic as sunlight, as necessary to the human spirit as music or truth or love.

I'm not sure if there will ever be a second edition; I'm sure this book was a monumental effort to produce as it was.

But, should there be, let me offer a (possibly apocryphal) Library Story of my own:

Nearly 75 years ago, the University of Chicago was a very different place. Back then, the school was an athletics powerhouse. Known as "the Monsters of the Midway", it was a founding member of the Big Ten Conference (then known as the Western Conference), and a University of Chicago athlete was the first recipient of the now-famous Heisman Trophy. With its great coach, Amos Alonzo Stagg, the University won trophy after trophy.

During the Second World War, the University underwent a monumental shift in focus, perhaps the greatest re-alignment that any human organization has ever undergone. Led by university president Robert Maynard Hutchins, the school undertook a complete re-dedication of purpose toward the life of the mind.

Against vast protest and discord, Hutchins dismantled the athletics program, withdrew the school from the Big Ten Conference, and re-focused the institution on the goals of knowledge and research, creating the world's foremost intellectual institution, a position it still holds today.

Dramatically, during the latter days of World War II, Hutchins gave the command to tear down the football stadium (beneath whose stands Enrico Fermi had demonstrated the first self-sustaining controlled nuclear chain reaction) and directed that a library be built in its place.

That library, the Joseph Regenstein Library, remains one of the greatest libraries on the planet, now holding an astonishing 11 million volumes.

At the time, though, Hutchins's decision was met with more than dismay: it provoked controversy, discord, and outright defiance. The alumni, justifiably proud of their heritage, struggled to cope with the transition.

In this, Hutchins was steadfast and sure, and, when questioned about the appropriateness of destroying one of the world's top athletics programs to build (gasp!) a library, responded with one of the greatest sentiments ever delivered by a man of letters:

Whenever I get the urge to exercise, I sit down and read a book until it passes.

There you go.

Stuff I'm reading, Thanksgiving edition

Gearing up for all that time I'm gonna spend on the couch watching football, I gotta find something good to read...

  • The Mature Optimization Handbook
    Knuth’s famous quote about premature optimization was never meant to be a stick to beat people over the head with. It’s a witty remark he tossed off in the middle of a keen observation about leverage, which itself is embedded in a nuanced, evenhanded passage about, of all things, using gotos for fast and readable code. The final irony is that the whole paper was an earnest attempt to caution against taking Edsger Dijkstra’s infamous remark about gotos too seriously. It’s a wonder we risk saying anything at all about this stuff.
  • UDT: UDP-based Data Transfer
    UDT is a reliable UDP based application level data transport protocol for distributed data intensive applications over wide area high-speed networks. UDT uses UDP to transfer bulk data with its own reliability control and congestion control mechanisms. The new protocol can transfer data at a much higher speed than TCP does. UDT is also a highly configurable framework that can accommodate various congestion control algorithms.
  • Solar at Scale: How Big is a Solar Array of 9MW Average Output?
    The real challenge for most people in trying to understand the practicality of solar to power datacenters is to get a reasonable feel for how big the land requirements actually would be. They sound big but data centers are big and everything associated with them is big. Large numbers aren’t remarkable. One approach to calibrating the “how big is it?” question is to go with a ratio. Each square foot of data center would require approximately 362 square feet of solar array, is one way to get calibration of the true size requirements.
  • DEFLATE performance improvements
    This patch series introduces a number of deflate performance improvements. These improvements include two new deflate strategies, quick and medium, as well as various improvements such as a faster hash function, PCLMULQDQ-optimized CRC folding, and SSE2 hash shifting.
  • How long do disk drives last?
    The chart below shows the failure rate of drives in each quarter of their life. For the first 18 months, the failure rate hovers around 5%, then it drops for a while, and then goes up substantially at about the 3-year mark. We are not seeing that much “infant mortality”, but it does look like 3 years is the point where drives start wearing out.
  • Farming hard drives: 2 years and $1M later
    In the last 30 years the cost for a gigabyte of storage has decreased from over $1 million in 1981 to less than $0.05 in 2011. This is evidenced by the work of Matthew Komorowski. In addition, the cost per gigabyte also declined in an amazingly predictable fashion over that time.

    Beginning in October 2011 those 30-years of history went out the window.

  • How to be a Programmer: A Short, Comprehensive, and Personal Summary
    To be a good programmer is difficult and noble. The hardest part of making real a collective vision of a software project is dealing with one's coworkers and customers. Writing computer programs is important and takes great intelligence and skill. But it is really child's play compared to everything else that a good programmer must do to make a software system that succeeds for both the customer and myriad colleagues for whom she is partially responsible. In this essay I attempt to summarize as concisely as possible those things that I wish someone had explained to me when I was twenty-one.
  • What It's Like to Fail
    During the nearly 18 months I spent homeless off and on, and during the ensuing years, I learned that I am more resourceful than I ever imagined, less respectable than I ever figured, and, ultimately, braver and more resilient than I ever dreamed. An important tool in my return to life has been Craigslist. It was through Craigslist that I found odd jobs -- gigs, they often are called -- doing everything from ghost-writing a memoir for a retired Caltech professor who had aphasia to web content writing jobs to actual real jobs with actual real startups.

Tuesday, November 26, 2013

Time to learn a new game

Uhm, yes, but learning this one may be a bit more complex than I had anticipated...

  • A Newbie's First Steps into Europa Universalis IV
    the first couple of hours were spent not so much playing the game as opening menus, hovering the cursor over icons and reading text, trying to figure out what the heck I was supposed to do.

    ...

    Toward the end of the second day, things in the game got pretty crazy as countries went bankrupt, monarchs were excommunicated left and right and people without the resources to do it started declaring war just to see what would happen. My army was humiliated by the stave-wielding natives of Sierra Leone, the market was dominated by fish, England fell, dogs and cats were living together—it was Armageddon.

  • Europa Universalis IV and the Border Between Complex and Complicated
    Europa Universalis games tend to be a little, shall we say, complex. It's a series that speaks in terms like casus belli and papal curia, featuring a map crammed full of long-forgotten nations where every last political maneuver is an opportunity to broker a deal that might someday come back to haunt you.

    ...

    "We want complex, but not complicated," says Johansson. "A complex feature has a lot of factors that influence it, and you will discover that when you thought you mastered it, there is a sudden a shift in dynamics that forces you to reevaluate your strategy. Complex features are what make games fun in the long run."

  • Europe Universalis IV Post Spanish Tutorial Map - Episode 1
    Oh, Europa Universalis IV, you saucy minx you. This is the most confusing game I have played recently, but I still have fun with it.
  • Re: Europa Universalis 4
    I'm pretty much King of Europe at this point thanks to abusing personal unions and the only other large Christian realms are Poland and Lithuania, who I've got royal marriages with so I can force personal unions on them too if they ever lack an heir. I'm trying to unify the Holy Roman Empire as I've been Emperor for about 60 years now and I just passed the fifth reform (the one that disables internal HRE wars) and I've kind of hit a brick wall with authority, since there are no heretics to convert and no one will declare war on HRE members.
  • Beginner's guide
    There are no specific victory conditions in Europa Universalis IV, although there is a score visible throughout and at the end of the game at the top right corner of the interface. The player is free to take history in whatever direction they desire. They may take a small nation with a single province and turn it into a powerhouse to rule the world, take control of a historically powerful nation and cause it to crumble, or anything in between.
  • Top 5 Tips to getting started in Europa Universalis IV
    Don't Play as Scotland.

    In fact, don’t even be friends with Scotland. At least not until you are familiar with the game and want a challenge. Nobody likes Scotland at this stage of history, and you are only a couple hundred years from the Glorious Revolution. So unless you think you can do better than the Jacobites, best to avoid Scotland for now. Every start I have made so far has seen Scotland either nearly or completely wiped off the map at an early stage by England.

Waterfall vs Agile

Clay Shirky, who is not a software developer but who is both a very smart guy and a very good writer, has written a quite-worth-reading essay about the healthcare.gov development process: Healthcare.gov and the Gulf Between Planning and Reality.

I'm not sure how plugged-in Shirky was to the actual healthcare.gov development effort, so his specific comments on that endeavor are perhaps inaccurate, but he has some fascinating observations about the overall software development process.

Shirky starts by describing the challenges that arise when senior management have to oversee a project whose technology they don't understand, and draws an analogy to the historical changes that occurred when technology changed the media industry:

In the early days of print, you had to understand the tech to run the organization. (Ben Franklin, the man who made America a media hothouse, called himself Printer.) But in the 19th century, the printing press became domesticated. Printers were no longer senior figures — they became blue-collar workers. And the executive suite no longer interacted with them much, except during contract negotiations.

It's certainly a problem when technology executives don't understand the technology in the projects they oversee. However, Shirky has another point to make, which is about the choice of development processes that can be used in a software development project.

The preferred method for implementing large technology projects in Washington is to write the plans up front, break them into increasingly detailed specifications, then build what the specifications call for. It’s often called the waterfall method, because on a timeline the project cascades from planning, at the top left of the chart, down to implementation, on the bottom right.

As Shirky observes, in a wonderfully pithy sound bite:

By putting the most serious planning at the beginning, with subsequent work derived from the plan, the waterfall method amounts to a pledge by all parties not to learn anything while doing the actual work. Instead, waterfall insists that the participants will understand best how things should work before accumulating any real-world experience, and that planners will always know more than workers.

This is just a brilliant point, so true, and so well stated. The great breakthrough of agile techniques is to realize that each step you take helps you comprehend what the next step should be, so allowing feedback and change into the overall cycle is critical.

Shirky then spends the remainder of his wonderful essay discussing policy-related matters such as the federal government's procurement policies, the implications of civil service bureaucracy, etc., which are all well and good, but not things I really feel I have an informed opinion about.

Where I wish to slightly object to Shirky's formulation, though, is in the black-and-white way that he portrays the role of planning in a software development project:

the tradeoff is likely to mean sacrificing quality by default. That just happened to this administration’s signature policy goal. It will happen again, as long as politicians can be allowed to imagine that if you just plan hard enough, you can ignore reality. It will happen again, as long as department heads imagine that complex technology can be procured like pencils. It will happen again as long as management regards listening to the people who understand the technology as a distasteful act.

This is, I think, a common mis-statement of the so-called "agile" approach to software development: Agile development processes do NOT eliminate planning! Shirky worsens the problem, in my opinion, by setting up a dichotomy, starting with the title of his essay and throughout its content, between "planning" and "reality".

To someone not deeply immersed in the world of software development process, Shirky's essay makes it sound like:

  • Waterfall processes involve complete up-front planning
  • That typically fails with software projects, because we're trying to do something new that's never been done before, and hence cannot be fully planned out ahead of time
  • Therefore we should replace all that futile planning with lots of testing ("reality")
It's that last step where I object.

Agile approaches, properly executed, solve this up-front planning problem by decomposing the overall project into smaller and smaller and smaller sub-projects, and decomposing the overall schedule into smaller and smaller and smaller incremental milestones. HOWEVER, we also decompose the planning into smaller and smaller and smaller plans (in one common formulation, captured on individual 3x5 index cards on a team bulletin board or wall), so that each little sub-project and each incremental milestone is still planned and described before it is executed.

That is, we're not just winging it.

Rather, we're endeavoring to make the units of work small enough so that:

  • Everyone on the team can understand the task being undertaken, and the result we expect it to have.
  • Regularly and frequently, everyone on the team can reflect on the work done so far, and incorporate lessons learned into the planning for the next steps
Shirky does a good job of conveying the value of the latter point, but I think he fails to understand the importance of the former point.

You can't settle for a situation in which management doesn't understand the work you're doing. Shirky is clearly aware of this, but perhaps he's never been close enough to a project run using agile approaches, to see the techniques they use to ensure that all members of the team are able to understand the work being done. (Or perhaps he just despairs of the possibility of changing the behavior of politicians and bureaucrats.)

Regardless, don't just listen to my rambling; go read Shirky's essay and keep it in mind the next time you're involved in a large software development project.

Sunday, November 24, 2013

Anatomy of a bug fix

Some time recently, I happened to be re-reading the details about a really great bug fix, and I thought to myself: people don't write about bug fixes very much, and this one really deserves to be written about, so I'll do that.

It's a great bug fix, and it's great that it was found, carefully analyzed, and resolved.

It's even greater that this all happened on an Open Source project, so it's no problem to share the details and discuss it, since so often it is the case that Great Bug Fixes happen on closed source projects, where the details never see the light of day (except inside the relevant organization).

Anyway, with no further ado, let's have a look at DERBY-5358: SYSCS_COMPRESS_TABLE failed with conglomerate not found exception.

The first thing about a great bug fix is that you have to figure out that there is a bug. In this particular case, one of the Derby developers, Knut Anders Hatlen, came across the bug somewhat by accident:

When running the D4275.java repro attached to DERBY-4275 (with the patch invalidate-during-invalidation.diff as well as the fix for DERBY-5161 to prevent the select thread from failing) in four parallel processes on the same machine, one of the processes failed with the following stack trace: The conglomerate (4,294,967,295) requested does not exist. ...

Look how precise and compact that is, describing exactly what steps the reporter did to provoke the bug, the relevant environmental conditions ("four parallel processes on the same machine"), and what the symptoms of the bug were. This, as you will see, was no simple bug, so being able to crisply describe it was a crucial first step.

When he reported the bug, Knut Anders made a guess about the cause:

The conglomerate number 4,294,967,295 looks suspicious, as it's equal to 2^32-1. Perhaps it's hitting some internal limit on the number of conglomerates?

As it turned out, that initial guess was wrong, but it doesn't hurt to make such a guess. You have to have some sort of theory, and the only real clue was the conglomerate number in the error message.

Some time passed, and other Derby developers offered the occasional theory and suggestion ("Again with the in-memory back end?", "The message has the feel of a temporary timing issue.")

10 months later, Knut Anders returned to the issue, this time with a brilliant, insightful observation:

I instrumented this class and found that it never set heapConglomNumber to 4,294,967,295, but the method still returned that value some times.

Stop and think about that statement for a while.

This is the sort of thing that can happen with software: something that can't possibly happen actually happens, and when that occurs, the experienced engineer sits up straight and says "Whoa! What? Say that again? What does that mean?"

In this case, if the program never sets the variable to that value, but that value is in fact returned, then something "outside the language" must be happening. That is, the program must be breaking one of the rules of the Java language, in which case impossible things become possible.

Recalling that in the original description we had "four parallel processes on the same machine", Knut Anders took the brilliant inductive leap to realize that concurrent access to unprotected shared data was at work here:

The problem is that heapConglomNumber is a long, and the Java spec doesn't guarantee that reads/writes of long values are atomic.

He then proceeded to outline in detail how that violation of the language rules produced this behavior:

T2 reads heapConglomNumber in order to check if it's already cached. However, since T1's write was not atomic, it only sees half of it.

...

If T2 happens to see only the most significant half of the conglomerate number written by T1, that half will probably be all zeros (because it's not very likely that more than 4 billion conglomerates have been created). The bits in the least significant half will in that case be all ones (because the initial value is -1, which is all ones in two's complement). The returned value will therefore be 0x00000000ffffffff == 4,294,967,295, as seen in the error in the bug description.

I've also seen variants where the returned number is a negative one. That happens if T2 instead sees the least significant half of the correct column number, and the most significant half of the initial value -1. For example, if the conglomerate number is 344624, the error message will say: The conglomerate (-4 294 622 672) requested does not exist.

Beautiful, just beautiful: given just two clues:

  1. The error message consistently included the value 4,294,967,295 or an enormous negative number such as -4,294,622,672
  2. "four parallel processes on the same machine"
Knut Anders made the intuitive leap to understanding that we had a non-atomic read of a value undergoing update.
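
To make the failure mode concrete, here is a minimal sketch of the hazard, using the field and method names from the bug report but otherwise invented (this is not Derby's actual TableDescriptor code). The Java Language Specification (section 17.7) permits a JVM to treat a non-volatile long as two separate 32-bit words, which is exactly the loophole Knut Anders identified:

    // Simplified sketch, NOT Derby's real code.
    // JLS 17.7 allows a non-volatile long write to be performed as two
    // 32-bit writes, so a concurrent reader may see half old, half new.
    class TableDescriptorSketch {
        private long heapConglomNumber = -1;   // -1 means "not cached yet"

        long getHeapConglomerateId() {
            long cached = heapConglomNumber;         // unsynchronized read: may be torn
            if (cached == -1) {
                cached = lookUpConglomerateNumber(); // expensive catalog lookup
                heapConglomNumber = cached;          // unsynchronized write
            }
            return cached;
        }

        // Stand-in for the real lookup; Derby reads the system catalogs here.
        private long lookUpConglomerateNumber() {
            return 344624L;
        }
    }

If one thread is halfway through writing 344,624 while another thread reads, the reader can observe 0x00000000FFFFFFFF (4,294,967,295) or a huge negative number, exactly the values reported in DERBY-5358. The eventual fix is a single keyword: declaring the field volatile, which JLS 17.7 guarantees makes long reads and writes atomic.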

Now, it helps if you've seen behavior like that before, but even if you have, that sort of sudden, inspired vault from symptoms to cause is one of the most gratifying and enjoyable parts of computer programming; I often term this the "Ah hah! moment", because you will just be baffled by something and then, in the blink of an eye, you suddenly understand it perfectly.

My colleagues know that I often at this point clap my hands, shout out "Yes!", or otherwise let out a little burst of joy; it's just that pleasurable when it occurs.

Of course, I don't know if Knut Anders does that; he's in Norway, and I've only had the pleasure of meeting him once.

The next few days saw lots of activity on this issue. Mike Matrigali, one of the original authors of the Derby storage system, noted that this low-level error might explain a lot of similar symptoms:

I think in past I have seen user unreproducible issues with global temp tables and errors with strange large negative conglomerate numbers, so likely related to this issue being tracked here.
and encouraged the exploration of alternate scenarios that could give rise to this code path:
it would be good to update this issue with what user actions can cause this problem. Is it just SYSCS_COMPRESS_TABLE, or is it any operation that can result in conglomerate numbers changing?

Knut Anders agreed, saying

I think it could happen any time two threads concurrently ask for the conglomerate id when no other thread has done the same before.

I often refer to this as "widening the bug": when you realize that you've found a low-level error, it is important to stop and think about whether this particular code path is special, or whether there are other ways that the program might arrive at the same situation and encounter the same or similar problems.

Sometimes, that line of reasoning leads to a realization that you don't perfectly understand the bug: "wait; if that were true, then this other operation would also fail, but we know that works, so..."

However, in this case, the process of widening the bug held up very well: the bug could affect many code paths, all that was required was to have just the right pattern of concurrent access.

And soon, given that insight, Knut Anders was back with a simpler and more accurate reproduction script:

Attaching an alternative repro (MultiThreadedReadAfterDDL.java). In my environment (Solaris 11, Java SE 7u4) it typically fails after one to two minutes

Simplifying, clarifying, and refining the reproduction script is one of the most important steps in the bug fixing process. Sometimes, you are given a minimal reproduction script in the initial bug report, but for any real, complex, epic bug report, that almost never happens. Instead, you get a complicated, hard-to-reproduce problem description.

But once you truly and thoroughly understand the bug, it is so important to prove that to yourself by writing a tight, clear, minimal reproduction script.

  • First, it proves that you really do understand the code path(s) that provoke the bug
  • Secondly, it lets others reproduce the bug, which lets them help you
  • And perhaps most importantly, it allows you to have confidence that you've actually fixed the bug: run the script, see it fail, apply the fix, run the script, see it now succeed
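
For a bug like this one, a tight repro can be sketched in a few dozen lines. The following is only a schematic in the spirit of the attached MultiThreadedReadAfterDDL.java (it is not that file, and it hammers the simplified sketch from earlier rather than real Derby tables): two threads race to read the lazily cached id, and the test fails the instant a torn value shows up.

    // Schematic repro harness, not the actual DERBY-5358 attachment.
    public class TornLongReadRepro {
        public static void main(String[] args) throws Exception {
            final long expected = 344624L;
            for (int i = 0; i < 1_000_000; i++) {
                final TableDescriptorSketch td = new TableDescriptorSketch();
                Runnable reader = () -> {
                    long id = td.getHeapConglomerateId();
                    if (id != expected) {
                        // A torn read: half of -1 and half of the real value.
                        throw new AssertionError("conglomerate id was " + id);
                    }
                };
                Thread t1 = new Thread(reader);
                Thread t2 = new Thread(reader);
                t1.start();
                t2.start();
                t1.join();
                t2.join();
            }
            System.out.println("no torn reads observed (likely a 64-bit JVM)");
        }
    }

On most 64-bit JVMs, longs happen to be read and written atomically anyway, so a harness like this is far more likely to fail on a 32-bit JVM, which is consistent with how rarely the bug showed up in the field.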

Given all this great work, the next month is rather anticlimactic:

Attaching an updated patch (volatile-v2.diff), which only touches the TableDescriptor class. The new patch adds a javadoc comment that explains why the field is declared volatile. It also removes an unused variable from the getHeapConglomerateId() method.

...

Committed revision 1354015.

For many bugs, that's the end of it: careful problem tracking, persistence, a brilliant insight, brought to a close by solid, thorough engineering.

With this bug, however, the realization that it was not just a good fix, but a Great Fix, came more than a year later, as the fix continued to pay dividends. And, thus, witness DERBY-5532: Failure in XplainStatisticsTest.testSimpleQueryMultiWithInvalidation: ERROR XSAI2: The conglomerate (-4,294,965,312) requested does not exist..

This bug report was of particular interest to me, because it was in a test that I wrote, which exercised code that I was particularly familiar with, so it was frustrating that I had been unable to identify how my code could produce this failure.

Does that error message look familiar? Well, Mike and Knut Anders thought so, too, and when 18 months had passed since the original fix went in, we decided to chalk up another fix for DERBY-5358.

I often say that one of the differences between a good software engineer and a great one is in how they handle bug fixes. A good engineer will fix a bug, but they won't go that extra mile:

  • They won't narrow the reproduction script to the minimal case
  • They won't invest the time to clearly and crisply state the code flaw
  • They won't widen the bug, looking for other symptoms that the bug might have caused, and other code paths that might arrive at the problematic code
  • They won't search the problem database, looking for bug reports with different symptoms, but the same underlying cause.

Frankly, you could spend your entire career in software and not ever become as good an engineer as Knut Anders or Mike. But if you decide to become a software engineer, strive to be not just a good engineer, but a great one.

And to be a great engineer, strive, as much as you can, to make every bug fix a Great Bug Fix.

Friday, November 22, 2013

Perforce 2013.3 server is now released!

Here's a nice way to end a busy week: the 2013.3 release of the Perforce server is now generally available for production use!

This release was all about performance: we did extensive work on memory usage, concurrency, and other performance features.

If you run a high-end Perforce server, you probably already know about the features that are available in this release, but if you've been a bit out of touch, here are the release notes; check it out!

World Chess Championship is over: game 10 is a draw

It's official: Magnus Carlsen is the new World Champion!

The tenth game was a fight to the finish, lasting 65 moves, advancing to the second time control, and containing lots of sharp, beautiful play.

But there can be no doubt about the chess champion of the world: it is Magnus Carlsen.

(Unless, of course, you go have a look at that other chess tournament that's underway right now, where the contestants are substantially stronger than Carlsen ...)

Thursday, November 21, 2013

World Chess Championship game 9: Carlsen wins again!

In the 2013 World Chess Championship, today was a dramatic day. With his back against the wall, and with the white pieces, Anand went for it all.

“There was not much of choice,” said Anand after having to play an aggressive line today. “I needed to change the course of the match drastically,” said Anand.

The game is wildly exciting; you should play through it.

The score is now: Carlsen 6 - Anand 3

They will play again tomorrow, though Anand has only the most theoretical of chances now...

Wednesday, November 20, 2013

The 2014 World Cup field is set!

Well, as of this moment it's almost set, since the final match (Uruguay - Jordan) has not yet been played.

But let's be bold, and assume that Jordan won't manage to overcome a 5 goal deficit on the road against Uruguay.

So we have determined the 32 qualifying teams.

Regionally, it breaks down as follows:

  • Europe: 13 teams
  • South America: 6 teams (assuming Uruguay qualifies)
  • Africa: 5 teams
  • North and Central America: 4 teams
  • Asia: 4 teams
Overall, that seems like a pretty good distribution, to me.

We now have 2 weeks until the group draws, on December 6th.

The final draw will be complicated. The top 8 teams will be based on the October Ranking, which means there will almost certainly be 4 teams from Europe and 4 teams from South America. As Wikipedia says

The other pots will be based on geographic and sports criteria.
These wacky criteria tend to be things like: Brazil and Argentina will be in groups that, assuming they both advance, won't meet until the semifinal. And Spain and Germany will be similarly distributed. And we'll distribute the teams from the various continents "evenly" across the groups, so that no group will contain more than 2 European teams, or more than 1 team from any other region.

Since it looks like the Netherlands will not be in the top 8, my pronouncement is: whichever group the Dutch are in is "the group of death". Of course, there are a few other extremely strong teams that are just outside the top 8, including: England, Chile, and Portugal.

Regardless, it looks like it will be a very strong lineup, overall, with many great teams. The top team that didn't qualify is Ukraine, which shockingly lost to France yesterday and is out.

And, of course, as ESPN points out, there are some other well-known teams that missed the boat: Sweden, Serbia, and Turkey, among others.

And, probably, many football fans will have only one question on their minds: will Leo Messi be healthy by next spring?

But I think it will be a fine selection of teams that make their way to Brazil next summer.

Start planning your viewing parties now! Where will you be on July 13, 2014?

Tuesday, November 19, 2013

This post brought to you by Windows 8.1 ...

... though you probably couldn't tell a lot of difference from those previous Windows 8 posts.

As 3.7 gigabyte downloads go, this one seems to have been trouble free.

World Chess Championship game 8: draw

Game 8 of the 2013 World Chess Championship ended with a draw. Carlsen had the white pieces, and used only 20 minutes of time for the entire game.

The match is two thirds complete, and the score is: Carlsen 5.0 - Anand 3.0

Time is running out for Anand; can he mount a challenge in the remaining games?

Monday, November 18, 2013

Whirlwind touring

A few short observations from a whirlwind trip down south:

  • The Santa Ynez Inn is beautiful. What a nice place to unwind.
  • California is dry, dry, dry. Parched land was everywhere, dust blowing, cows being fed on hay as there is no open range grass left anywhere. Please, somebody, bring us rain, and lots of it!
  • Central coast wineries have really hit the big time. There was lots and lots of delicious wine, from lots of wineries I'd never heard of, using all sorts of varieties of grapes I was totally unfamiliar with.
  • Central coast wineries know they are making lots of delicious wine. At most of the wineries we visited, the low-end bottles were going for $25 to $30, and we visited multiple wineries where the regular tasting menu included wines costing $70 or more
  • Fess Parker winery hasn't made sparkling wine in over a decade. Shows how long it's been since we were down that way.
  • On the other hand, the Flying Goat sparkling wines were quite nice; we particularly enjoyed their Cremant
  • The biggest foodie hit of the weekend was the San Marcos Farms honey, possibly available direct from San Marcos Farms but we found it at the Olive Barn in Los Olivos. In particular, the "Avocado Honey" is quite surprising. It's not avocado flavored (yuk), but rather is produced by bees who make their homes in an avocado orchard. Super!
  • Lompoc, Solvang, Santa Ynez, Buellton, Los Olivos: years may have passed, but these small towns barely seem to have changed at all.
  • My niece and nephew, however, are shooting up like sprouts! How quickly children grow up...

World Chess Championship game 7: draw

Game 7 of the 2013 World Chess Championship ended in a draw after 32 moves.

The score is now: Carlsen 4.5 - Anand 2.5, with 5 games remaining.

Sunday, November 17, 2013

World Chess Championship game 6: Carlsen wins again!

Game 6 of the 2013 World Chess Championship was another decisive result, as Magnus Carlsen won again, this time with the black pieces.

In game 6, Anand had white, and opened with 1. e4, and the game proceeded into classic lines of the Ruy Lopez defense.

The middle game was complex, but on move 30, as many of the pieces (though not the queens) were coming off the board, Anand was left with a weakened doubled e-pawn.

By move 40, Carlsen had won that weakened pawn, and again we had a rook + pawn endgame with Carlsen up by a pawn.

By move 62, after extensive maneuvering, the tables had turned, and Anand had the pawn advantage, but Carlsen's pawn was advanced and his king was perfectly positioned.

By move 67, it was all over.

The score is now: Carlsen 4.0 - Anand 2.0.

The match is halfway over, and Carlsen has a tremendous advantage.

Will Anand come back? Today was a rest day, so tomorrow is game 7.

World Chess Championship game 5: Carlsen wins!

The first decisive result of the 2013 World Chess Championship goes to Magnus Carlsen, who prevails with the white pieces in game 5.

By move 15, many of the pieces, including the queens, were off the board, and each side had an isolated center pawn. Anand had the two bishops, but Carlsen's pieces were active. The play through move 40 was tremendously complex, with action across the board, but Carlsen succeeded in winning a pawn and then transitioning to a rook+pawn endgame by move 50.

At move 58 the world champion resigned.

And so the score is now: Carlsen 3.0 - Anand 2.0

Thursday, November 14, 2013

Aghast at the bullpen

The other day, I was reading a perfectly nice article on the Wired website: This Company Believes You Should Never Hack Alone.

The article discusses the hot new startup Pivotal. Pivotal is indeed very trendy, whether it's because of Paul Maritz, or their big-name backers, or their hot, hot market space, or for some other reason.

At any rate, it's a perfectly nice article, with lots of interesting details about how they're trying to establish the company culture and build the team.

But, frankly, I never really made it to the article.

I simply couldn't get past the headline, and the picture.

Stop. Go look at that picture: "Inside Pivotal’s San Francisco offices, where software coders rarely work alone."

All of a sudden, the bullpen is trendy.

I'm not quite sure when this happened, but it really took off when Facebook lost their mind and converted the old Sun Microsystems office space in Menlo Park into the world's worst offices, ever.

It will be a large, one room building that somewhat resembles a warehouse. Just like we do now, everyone will sit out in the open with desks that can be quickly shuffled around as teams form and break apart around projects.

Let's belabor that point a little bit. As Zuck said

The idea is to make the perfect engineering space: one giant room that fits thousands of people, all close enough to collaborate together. It will be the largest open floor plan in the world, ...

I guess the time has come to admit that I Just Don't Fit In.

When I see the picture, the first words that come to my mind are not "the perfect engineering space: one giant room".

I've actually worked, briefly, in bullpen environments. They do have a few advantages:

  • They help people who haven't worked together very much get to know each other somewhat better
  • They make it possible for the manager to look out over the open space and visually itemize who is present, and where they are located
  • They save money on things like walls and doors and windows.
But they have, oh, so many disadvantages.

I'm not quite sure where this open floor plan mania arose from. Some say it comes from the famous stock exchange trading floors. As if a scene like this makes you think that people are being productive in that environment.

Others say that the idea came from the famous newspaper "city desk" rooms, most famously the Washington Post newsroom where Bob Woodward and Carl Bernstein took down the president. You can see the idea parroted in New York City mayor Bloomberg's horrific city government bullpen, which is gathering followers throughout government:

At the Wilson Building this week, workers will begin knocking down walls on the third floor to create a permanent bullpen for Fenty, who intends to shun the isolated sixth-floor mayor's suite after he is sworn in Jan. 2. By sitting among his deputies, keeping close tabs on them and engaging them in everyday decisions, Fenty said he will foster increased accountability and a spirit of openness that he believes has been missing from city government.

Maybe it works for city governments.

Maybe it works for stock exchange traders.

But I'll tell you something: it does not work for software development.

The reason why is, I think, best expressed by the great Rands, who wrote about the zone:

Let’s talk about the Zone once more. You’re either sitting down with your computer to futz around with something or you’re attempting to get in the Zone. This is that magical place where you’ve managed to fit the entire context of your current project in your head. With all this content in there, you can perform superhuman acts of productivity and creativity because you have the complete problem space at your mental disposal.

If you're more of a visual person, this is a brilliant web comic which makes the same point.

Look, Pivotal may be a quite nice place. I think they have some brilliant people there, and I think they are working hard to do what they think is right for them.

But I could never work there. I need to think. I need to concentrate, to focus, to enter the zone.

I guess it's good that I know this about myself, so that when I read about Pivotal, or read similar articles about companies like Square, where

Dorsey and Henderson applied that same sense of vision to their new offices, which take up parts of four floors in a dowdy former Bank of America data center. The theme is one of a miniature city, with conference rooms named for famous streets and a grand "boulevard" the length of two football fields.
I can simply say to myself: well, that's another company I'd never want to be part of.

I don't want an office "the length of two football fields".

I think that pair programming is interesting. In fact, I regularly practice it. But I call it "having a second pair of eyes." Here's how it works: my co-worker will stick his head in my cube and say:

Bryan, I'm a bit stuck on something. I've been through this code a million times, and I just can't see what's wrong. Can you take a look?
And I'll nod, and hit "save" in my editor, and step around the corner to his cube, where I'll pull up a chair and watch over his shoulder as he walks through the code.

After a minute or two, I'll say something like:

Oh, I see. You're expecting the variable "resetRequired" to be true, but in this particular case we're arriving here via a different code path, and it's going to be false.
He'll nod, and say
Of course! Nothing like another pair of eyes!
And I'll go back to my desk, and we'll both try to get back into The Zone.

I take some hope from reading that maybe it's not just me:

About 70 percent of U.S. employees now work in open offices, according to the International Management Facility Association. But the collaboration-friendly environment with minimal cubicle separations “proved ineffective if the ability to focus was not also considered,” according to a new study by the design firm Gensler. “When focus is compromised in pursuit of collaboration, neither works well.”
And I'll hope that some of these findings take hold:
People work less well when they move from a personal office to an open-plan layout, according to a longitudinal study carried out by Calgary University. Writing in the Journal of Environment and Behavior, Aoife Brennan, Jasdeep Chugh and Theresa Kline found that such workers reported more stress, less satisfaction with their environment and less productivity. Brennan et al went back to survey the participants six months after the move and found not only that they were still unhappy with their new office, but that their team relations had broken down even further.

Still, it's wicked trendy: More employers choosing 'open' offices

Experts say the use of open office design elements is now growing at a double-digit pace, heralding the death of the traditional corner office and the infamous high-walled cubicle

Sigh.

Not so long ago, I was lucky enough to work at a company which gave me a fantastic work environment: I had a (small) private office, which was quiet and calm. People could come by whenever they wanted to talk to me. If I needed to have a conversation, I could close the door to avoid bothering others. The noise of my co-workers was completely invisible to me. I was fantastically productive.

So I know such companies exist, clear-thinking places where they recognize that Programming Is Hard, and you need to have long, undisturbed periods of complete silence and deep thought; you need to get into The Zone.

I'll say a silent prayer that such companies survive, and that we aren't all forced into the bullpen.

Please.

Wednesday, November 13, 2013

World Chess Championship game 4: draw

Game 4 of the Anand-Carlsen match is now complete, and it was a draw.

The score now stands at: 2.0 - 2.0.

But this was no bloodless draw! Anand had white, the game was the Ruy Lopez Berlin, the position was wide open.

On move 18, Carlsen bravely grabbed the a2 pawn. Anand tried to trap the bishop, but could not. However, the time it took for Carlsen to extract the bishop allowed Anand to get his pieces very active, and for a while it was like Carlsen was playing an entire rook down.

If you look at the position around move 27, all of Carlsen's pieces are on the first or second rank, and Anand controls about 80% of the board, all for the price of that a2 pawn.

But Anand couldn't quite break through, and Carlsen was able to activate his pieces, and finally the pieces came off the board and Carlsen's one pawn advantage was not enough for the win.

The first week is now over; one third of the games have been played; all remains dead even.

Tuesday, November 12, 2013

World Chess Championship game 3: draw

The third game of the Anand-Carlsen match is complete, and again it is a draw.

This game was not a quick draw, though: it was played all the way through to the endgame, when both players had only a king and a bishop remaining.

The game was very exciting in the middle, with combinations across the board. Late in the middle of the game, Anand was ahead by a pawn, but Carlsen's pawns were connected, while Anand's extra pawn was disconnected and weaker, and Carlsen won the pawn back and then the game was over.

After three games, the score is: 1.5 - 1.5.

What will game 4 bring?

Monday, November 11, 2013

Dissecting the "once in a lifetime fish"

The folks at Southern Fried Science have a great post on their site this week: Fish out of water: the necropsy of the beached oarfish.

The easiest way to tell that this was an exciting scientific discovery was by the spectators in the necropsy lab. As fish dissections can be a little messy and smelly, people who are not actively working on a necropsy project normally avoid this lab, but on this particular morning the lab was standing room only.

Even if you didn't enjoy the dissection labs in your high school biology class, this article is worth your time, as the Oarfish discovery truly was a once in a lifetime opportunity:

When marine scientists talk about a “once in a lifetime fish,” we often mean a species that is so rarely seen that we feel lucky to have observed it, even after it has washed up on a beach somewhere. This month we in Southern California have been lucky enough to have one such “once in a lifetime fish” appear twice in a span of a week, as two oarfish washed ashore local beaches. The first, an 18-foot specimen was found on Catalina Island and the second, a 14-foot specimen (approximately 275 pounds), was found in Oceanside, CA.

The pictures of the teams of scientists and spectators lined up along the 20-foot lab table are great, and I really enjoyed the description of how each specialist from each different institution investigates a different set of details in a different sort of way.

Sunday, November 10, 2013

World Chess Championship game 2: draw

Game 2 of the Anand-Carlsen championship match is a draw, after 25 moves.

My son often plays the Caro-Kann against me, so I should know it, but I've never thought to play 6. h4. It's just an uncomfortable move for me to make.

The overall score is now 1-1.

Saturday, November 9, 2013

World Chess Championship game 1: draw

Game 1 of Anand-Carlsen was a draw.

Carlsen had white but could not find any advantage out of the opening, and the game ended with a draw after 16 moves.

Overall score:

Anand 0.5 - Carlsen 0.5

Friday, November 8, 2013

Friday afternoon reading

Various random stuff I'm reading; as usual, presented with little or no comment, except to note it seems like you might find it interesting too.

  • Toyota Acceleration Case
    NASA reported not running all the code in simulation due to a lack of tooling. Now, the new panel of experts appears to have actually managed to simulate the system, and found ways to make it crash in the interaction between multiple tasks. The fact that, as EETimes report, a certain task crashing can cause acceleration to continue without control, is pretty indicative of issues arising in integration rather than unit testing.
  • The Saddest Moment
    Watching a presentation on Byzantine fault tolerance is similar to watching a foreign film from a depressing nation that used to be controlled by the Soviets—the only difference is that computers and networks are constantly failing instead of young Kapruskin being unable to reunite with the girl he fell in love with while he was working in a coal mine beneath an orphanage that was atop a prison that was inside the abstract concept of World War II.
  • Why Does Windows Have Terrible Battery Life?
    The Windows light usage battery life situation has not improved at all since 2009. If anything the disparity between OS X and Windows light usage battery life has gotten worse.
  • Comments on the 2013 Gartner Magic Quadrant for Operational Database Management Systems
    There’s generally an excessive focus on Gartner’s perception of vendors’ business skills, and on vendors’ willingness to parrot all the buzzphrases Gartner wants to hear.
  • Drilling Network Stacks with packetdrill
    Testing and troubleshooting network protocols and stacks can be painstaking. To ease this process, our team built packetdrill, a tool that lets you write precise scripts to test entire network stacks, from the system call layer down to the NIC hardware. packetdrill scripts use a familiar syntax and run in seconds, making them easy to use during development, debugging, and regression testing, and for learning and investigation.
  • Distributed Systems Archaeology: Works Cited

    That's a very nice reading list, even if it is basically from the previous millennium. In particular, I had not seen the new textbook by Varela before.

  • I Failed a Twitter Interview
    I am upset that the interviewer didn't ask me the right questions to guide me towards the right train of thought. I don't know why Justin told me "this should work," when my solution in fact didn't . I know that this should have come up in the test cases he asked for, but since I missed the flaw when coming up with the algorithm, I didn't think of testing for it.
  • Chuck Moore's Creations; Programming the F18; The Beautiful Simplicity of colorForth
    He is endlessly willing to rethink things from a truly clean slate. It's astonishing how simple things become when you're willing to do that and design something bottom up with that perspective; the "bottom" being raw silicon. He said in his talk that the colorForth compiler is "a dozen or so lines of code." This is shocking to most people. It's because tokenization is done at edit-time and there is almost a one-to-one correspondence between the primitive words in the language and instructions on the chip. The compiler is left with not much to do other than maintain the dictionary and do very straightforward instruction and call/jump packing. This "brutal simplicity" is possible because every aspect has been rethought and carefully orchestrated to work perfectly together.
  • Dance Your Ph.D. Finalists Announced!
    This is the 6th year of the contest, which challenges scientists to explain their doctoral research through the medium of interpretive dance. The finalists were selected from 31 dance submissions by the winners from previous years of the contest. The production value has increased considerably from the live Ph.D. dance event that launched the contest in 2007. The goal is to do away with jargon -- indeed, to do away with spoken words altogether -- and use human bodies to convey the essence of scientific research.

Parbuckling, redux

Well, it's not the Costa Concordia, but closer to home we've had our own little parbuckling.

The local newspaper, the Alameda Sun, covers the action, all of which took place just 100 yards from my office: VIDEO: Old tug exhumed from watery grave

Thalhamer said the project may be the largest vessel removal effort to be undertaken on the West Coast since Pearl Harbor, at the start of World War II. The multi-agency team is expecting to remove 40 vessels settled along the length of the estuary, some of them visible to the rowers, sailors and Coasties who frequent the estuary, some of them obscured by water.

Apparently early reports about the tugboat's name and history were nothing more than urban legends, and some serious investigation must now be done to figure out what this wreck was and who abandoned it.

The construction area on the waterfront is amazing now, with a half dozen smaller abandoned boats scattered about, having been pulled from the estuary over the last week by the various cranes.

As the video shows, there are still at least three major vessel recoveries to go:

  • The tugboat "Respect"
  • Two large barges

The wrecks have to be extracted in the right order, since some are lying on top of others.

Keep up the great work, Cal Recycle!

Thursday, November 7, 2013

Scaring away the earthquakes

Somehow I never knew about the story at the time, but I was delighted to read about The Rescue Of The Troll: Bay Bridge's mysterious protector out of hiding

The troll, who has no name, was created and surreptitiously installed in 1989 on a quickly fabricated section of bridge deck that replaced the pieces that collapsed in the Loma Prieta earthquake. He remained out of sight - only bridge workers and boaters could see him on the north side of the span - and cast his magic to protect the bridge and its users. In early September, when the new eastern span opened, the troll was spirited away by ironworkers, who wanted to make sure he was free before demolition of the old span began.

I must have passed under that section of the bridge more than a dozen times over the years, but never saw the troll. Of course, he was small and I was far away and I would have needed a good pair of binoculars to see him.

In some respects, it's best not to see the troll, nor even to know about the troll, as some things are better off that way:

"There's a great deal of mystery surrounding any troll," Goodwin said, "but a special amount of mystery surrounding this troll."

Thank you troll, may you have a safe and happy retirement.

TCP ex Machina

I enjoyed this short paper by Keith Winstein and Hari Balakrishnan of MIT: TCP ex Machina: Computer-Generated Congestion Control .

This paper describes a new approach to end-to-end congestion control on a multi-user network. Rather than manually formulate each endpoint’s reaction to congestion signals, as in traditional protocols, we developed a program called Remy that generates congestion-control algorithms to run at the endpoints.

In this approach, the protocol designer specifies their prior knowledge or assumptions about the network and an objective that the algorithm will try to achieve, e.g., high throughput and low queueing delay. Remy then produces a distributed algorithm—the control rules for the independent endpoints—that tries to achieve this objective.

As long-time readers of my blog know, I consider the TCP Congestion Control algorithm to be one of the "Hilbert Questions" of Computer Science: deep, fundamental, of great practical importance yet still unsolved after decades of work.

Many fine approximate algorithms have been developed over the years, yet we still remain frustratingly far from any clearly "best" algorithm.

The congestion control problem is easy to state:

Each endpoint that has pending data must decide for itself at every instant: send a packet, or don’t send a packet.

If all nodes knew in advance the network topology and capacity, and the schedule of each node’s present and future offered load, such decisions could in principle be made perfectly, to achieve a desired allocation of throughput on shared links.

In practice, however, endpoints receive observations that only hint at this information. These include feedback from receivers concerning the timing of packets that arrived and detection of packets that didn’t, and sometimes signals, such as ECN marks, from within the network itself. Nodes then make sending decisions based on this partial information about the network.
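
For contrast with Remy's machine-generated rules, here is a minimal sketch of the kind of hand-written rule that human designers have traditionally used to make that send/don't-send decision: the additive-increase, multiplicative-decrease (AIMD) scheme at the heart of TCP Reno. This is only an illustration of the classic approach (the class and method names are my own), not code from the paper:

    // Toy AIMD congestion window, in the spirit of TCP Reno.
    // A real stack tracks this per connection inside the kernel; this sketch
    // only illustrates the hand-written rule that RemyCCs compete against.
    class AimdCongestionControl {
        private double cwnd = 1.0;        // congestion window, in segments
        private double ssthresh = 64.0;   // slow-start threshold, in segments

        // Called for each acknowledged segment.
        void onAck() {
            if (cwnd < ssthresh) {
                cwnd += 1.0;              // slow start: roughly doubles per RTT
            } else {
                cwnd += 1.0 / cwnd;       // congestion avoidance: +1 segment per RTT
            }
        }

        // Called when loss is detected (the congestion signal).
        void onLoss() {
            ssthresh = Math.max(cwnd / 2.0, 2.0);
            cwnd = ssthresh;              // multiplicative decrease
        }

        // The endpoint sends only while fewer than cwnd segments are in flight.
        boolean canSend(int segmentsInFlight) {
            return segmentsInFlight < (int) cwnd;
        }
    }

Decades of hand-tuning have gone into variations of exactly these update rules.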

In this recent work, the authors have developed an algorithm for generating congestion-control algorithms, to see if a powerful computer, searching many alternatives, can uncover a superior algorithm:

Using a few CPU-weeks of computation, Remy produced several computer-generated congestion-control algorithms, which we then evaluated on a variety of simulated network conditions of varying similarity to the prior assumptions supplied at design-time.

On networks whose parameters mostly obeyed the prior knowledge supplied at design range — such as the dumbbell network with the 15 Mbps link — Remy’s end-to-end algorithms outperformed all of the human-generated congestion-control algorithms, even algorithms that receive help from network infrastructure.

But there is a twist: the new algorithm is superior, but we don't understand how it works:

Although the RemyCCs appear to work well on networks whose parameters fall within or near the limits of what they were prepared for — even beating in-network schemes at their own game and even when the design range spans an order of magnitude variation in network parameters — we do not yet understand clearly why they work, other than the observation that they seem to optimize their intended objective well.

We have attempted to make algorithms ourselves that surpass the generated RemyCCs, without success. That suggests to us that Remy may have accomplished something substantive. But digging through the dozens of rules in a RemyCC and figuring out their purpose and function is a challenging job in reverse-engineering.

RemyCCs designed for broader classes of networks will likely be even more complex, compounding the problem.

I recently whiled away an entire afternoon fiddling with the several dozen knobs on a modern Linux implementation of Congestion Control.

After some false starts, I found a promising path, and was able to speed up one of my benchmarks by a factor of 26, from 3,272 seconds down to 124 seconds.
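
(For the curious: on Linux, most of those knobs live as small text files under /proc/sys/net/ipv4/. The little Python sketch below simply lists which congestion-control algorithms the kernel offers and which one is currently selected; the actual tuning behind my factor-of-26 speedup involved a good deal more trial and error than this.)

    from pathlib import Path

    SYSCTL = Path("/proc/sys/net/ipv4")

    def read_knob(name: str) -> str:
        # Each sysctl knob is just a small text file under /proc/sys.
        return (SYSCTL / name).read_text().strip()

    if __name__ == "__main__":
        print("current:  ", read_knob("tcp_congestion_control"))
        print("available:", read_knob("tcp_available_congestion_control"))
        # Switching algorithms (as root) is a one-line write, e.g.:
        #   echo cubic > /proc/sys/net/ipv4/tcp_congestion_control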

There is tremendous opportunity for a TCP Congestion Control algorithm that can work well, as the gains in performance and efficiency are immense.

But, for now, what we have are our complex approximations, and even the most powerful computers aren't (yet) able to design a superior algorithm.

At least, not one we can understand.

Wednesday, November 6, 2013

OpenStack news

The Register takes note of the annual OpenStack Summit and surveys the state of the OpenStack nation: Inside OpenStack: Gifted, troubled project that wants to clobber Amazon

With their typical snark, The Register tell it like it is:

Though many bill OpenStack as "the Linux of the cloud", the technology so far fails to meet the expectations of usability and compatibility that defines the Linux community, though it is improving rapidly.

Ouch! Now, that's faint praise indeed!

On a more technical level, it seems that the largest problems are in the networking arena:

Another issue that is endemic to both within-compute networking (Nova) and the standalone in-development networking module (Neutron), is that for single-host and flat networks, the IP allocation, IP routing, NAT, DHCP, and OpenStack metadata services are in a single chunk of code making them difficult to interface with, while in a multi-host format the services are distributed across hypervisors presenting a much larger attack surface.

At the OpenStack Summit website, it's revealing to browse through the talks, to get a sense for the current priorities of the community.

Many of the component technologies are maturing fast, but it's still early days for OpenStack. The degree of interest in the conference is promising, but it's awfully hard to see this sort of fractured coalition catching up to the giants of the cloud.

Still, this is the way Open Source software works: it's messy, it's chaotic, it can be hard to figure out what's happening, or where it's going. Yet nonetheless the process works, and the software improves.

Monday, November 4, 2013

Team Geek: A Very Short Review

This fall, I zipped right through the new book by Brian Fitzpatrick and Ben Collins-Sussman, Team Geek: A Software Developer's Guide to Working Well with Others.

The authors are well-known engineers, both in the Open Source community (for Subversion), and as Googlers, and certainly have the necessary qualifications to write about software engineering practices.

But Team Geek is not that book.

Team Geek is rather surprising: it is a manual of manners, a book of etiquette.

This is not as far-fetched as it might seem; programmers certainly have a reputation as arrogant, egotistical, prima donnas who can't be bothered to hold a civil conversation. (Of course, if this is really the way you see us programmers, may I suggest you take a bit of time to read Michael Lopp's superb The Nerd Handbook, to help you understand why your favorite programmer is behaving the way he/she does.)

As the authors explain in their "Mission Statement", Team Geek is all about the realization that, in almost every situation, the successful programmer has to work as part of a team:

The goal of this book is to help programmers become more effective and efficient at creating software by improving their ability to understand, communicate with, and collaborate with other people.

And just a few pages later they make the point even more explicitly:

The point we've been hammering is that in the realm of programming, lone craftsmen are extremely rare -- and even when they do exist, they don't perform superhuman achievements in a vacuum; their world-changing accomplishment is almost always the result of a spark of inspiration followed by a heroic team effort.

Team Geek is a fine book, easy to read, entertainingly written, with lots of anecdotes and lessons learned from hard experience. As I read it, I said to myself, over and over again: yes, yes, that's exactly how it is, that's indeed how it goes.

If you're a professional programmer, if you're a person who has to work extensively with professional programmers, if you're considering becoming a professional programmer, or if you're just interested in what it's like to be around professional programmers all the time, you will probably find Team Geek to be interesting, practical, and useful.

It won't teach you any new algorithms, or programming languages, and you won't learn anything about Subversion, but you will learn a lot about all the stuff they don't cover in your Computer Science classes in college.

You may not have thought that the world needed a manual of manners for programmers, but it does, and this is a pretty good one.

Saturday, November 2, 2013

There are bugs, and then there are bugs

As everyone knows, I adore post mortems; there's so much to learn.

As David Wilson points out on his python sweetness blog, this is one of the greatest bug descriptions of all time, and even though it's contained in an official SEC finding, it's still fascinating to read.

The SEC doesn't bury the lede; they get right to it:

On August 1, 2012, Knight Capital Americas LLC (“Knight”) experienced a significant error in the operation of its automated routing system for equity orders, known as SMARS. While processing 212 small retail orders that Knight had received from its customers, SMARS routed millions of orders into the market over a 45-minute period, and obtained over 4 million executions in 154 stocks for more than 397 million shares. By the time that Knight stopped sending the orders, Knight had assumed a net long position in 80 stocks of approximately $3.5 billion and a net short position in 74 stocks of approximately $3.15 billion.

Yes, I'd call that a significant error.

If you're like me, questions just leap to mind when you read this:

  • What was the actual bug, precisely?
  • Why was it not caught during testing?
  • Why do they not have safeguards at other levels of the software?
  • Why do these sorts of bugs not appear more frequently?

Happily, the SEC document does not disappoint.

Let's start by getting a high-level understanding of RLP and SMARS:

To enable its customers’ participation in the Retail Liquidity Program (“RLP”) at the New York Stock Exchange, which was scheduled to commence on August 1, 2012, Knight made a number of changes to its systems and software code related to its order handling processes. These changes included developing and deploying new software code in SMARS. SMARS is an automated, high speed, algorithmic router that sends orders into the market for execution. A core function of SMARS is to receive orders passed from other components of Knight’s trading platform (“parent” orders) and then, as needed based on the available liquidity, send one or more representative (or “child”) orders to external venues for execution.

Well, that helps a little bit, but it doesn't hurt to learn a bit more. The New York Times takes a stab at explaining RLP: Regulators Approve N.Y.S.E. Plan for Its Own ‘Dark Pool’

Regulators have approved a controversial proposal from the New York Stock Exchange that will result in some stock trades being diverted away from the traditional exchange.

The program, expected to begin later this summer, will direct trades from retail investors onto a special platform where trading firms will bid to offer them the best price. The trading will not be visible to the public.

Although the details are obviously quite complex, the overall concept is straightforward: the exchange, with the government's permission, changed the regulations controlling how trades may be executed by the participant trading firms and brokers.

But note the timeframes here: the new rules were approved at the beginning of July, 2012, and went into effect on August 1, 2012. That is one compact period!

Anyway, back to the SEC finding. What, precisely, did the programmers at Knight do wrong?

Well, the first thing is that their codebase contained some old code, still present in the program, but intended to be unused:

Upon deployment, the new RLP code in SMARS was intended to replace unused code in the relevant portion of the order router. This unused code previously had been used for functionality called “Power Peg,” which Knight had discontinued using many years earlier. Despite the lack of use, the Power Peg functionality remained present and callable at the time of the RLP deployment. The new RLP code also repurposed a flag that was formerly used to activate the Power Peg code. Knight intended to delete the Power Peg code so that when this flag was set to “yes,” the new RLP functionality—rather than Power Peg—would be engaged.

This is a common mistake in software: old code is disabled, not deleted. It's scary to delete code, but of course if you have a top-notch Version Control System, you can always get it back. Still, programmers can be superstitious, and I'm not surprised to hear that somebody left the old code in place.

Unfortunately, unused, unexecuted, untested code tends to decay, as the SEC observe:

When Knight used the Power Peg code previously, as child orders were executed, a cumulative quantity function counted the number of shares of the parent order that had been executed. This feature instructed the code to stop routing child orders after the parent order had been filled completely. In 2003, Knight ceased using the Power Peg functionality. In 2005, Knight moved the tracking of cumulative shares function in the Power Peg code to an earlier point in the SMARS code sequence. Knight did not retest the Power Peg code after moving the cumulative quantity function to determine whether Power Peg would still function correctly if called.

So, there was old, buggy code in the program, and new code written to replace it, and that new code "repurposed a flag that was formerly used to activate the Power Peg code." We've got almost all the elements in place, but there is one more significant event that the SEC highlight:

Knight deployed the new RLP code in SMARS in stages by placing it on a limited number of servers in SMARS on successive days. During the deployment of the new code, however, one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers.

In the modern jargon, this deployment work usually falls under the heading of "DevOps". The task of taking your program from its development phase and moving it into production seems like it should be simple, but unfortunately it is fraught with peril. Techniques such as Continuous Delivery can help, but these approaches are sophisticated and take time to implement. Many organizations, as Knight apparently did, use old-fashioned approaches: "hey Joe, can you copy the updated executable onto each of our production servers?"

The prologue has been written; the stage is set. What will happen next? The SEC finding continues:

The seven servers that received the new code processed these orders correctly. However, orders sent with the repurposed flag to the eighth server triggered the defective Power Peg code still present on that server.

I can barely imagine how challenging this must have been to diagnose under the best of conditions, but in real time, in production, with orders flowing and executives screaming, this must have been extremely hard to figure out. 88% of the work was being done correctly, with just the one computer acting in a rogue fashion.
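
To see why the symptom was an endless stream of child orders, here is a deliberately toy Python sketch of the failure mode as the SEC describes it: the fill-counting had been moved out of the old code path, so when the repurposed flag reached a server still running Power Peg, nothing ever told it to stop. Every name and detail here is my own invention for illustration; Knight's real system was vastly more complex.

    def route_parent_order(parent_qty: int, rlp_flag: bool, has_new_code: bool) -> int:
        """Toy model of one SMARS-like server handling a parent order.
        Returns the number of child orders it sent."""
        child_orders = 0
        filled = 0

        if rlp_flag and has_new_code:
            # New RLP path: counts fills and stops once the parent order is done.
            while filled < parent_qty:
                child_orders += 1
                filled += 1                   # execution comes back and is counted
        elif rlp_flag:
            # Stale Power Peg path on the server that missed the deploy: the
            # cumulative-quantity check was moved elsewhere years ago, so this
            # loop never learns that the parent order has been filled.
            while True:
                child_orders += 1
                if child_orders > 1_000_000:  # cap so this toy example halts
                    break
        return child_orders

    print(route_parent_order(100, rlp_flag=True, has_new_code=True))    # 100
    print(route_parent_order(100, rlp_flag=True, has_new_code=False))   # millions (capped here)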

It's interesting that, in the hours leading up to the go-live moment, there were hints and signals that something was wrong:

an internal system at Knight generated automated e-mail messages (called “BNET rejects”) that referenced SMARS and identified an error described as “Power Peg disabled.” Knight’s system sent 97 of these e-mail messages to a group of Knight personnel before the 9:30 a.m. market open.

But they didn't realize the implication of these special error messages until much later, unfortunately.

You might wonder why other fail-safes didn't kick in and abort the rogue runaway program. This, too, it turns out, was a bug, although in this case it was more of a policy bug than a programming bug.

Knight had an account—designated the 33 Account—that temporarily held multiple types of positions, including positions resulting from executions that Knight received back from the markets that its systems could not match to the unfilled quantity of a parent order. Knight assigned a $2 million gross position limit to the 33 Account, but it did not link this account to any automated controls concerning Knight’s overall financial exposure.

On the morning of August 1, the 33 Account began accumulating an unusually large position resulting from the millions of executions of the child orders that SMARS was sending to the market. Because Knight did not link the 33 Account to pre-set, firm-wide capital thresholds that would prevent the entry of orders, on an automated basis, that exceeded those thresholds, SMARS continued to send millions of child orders to the market despite the fact that the parent orders already had been completely filled.
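
What "linking the account to automated controls" might look like is not mysterious. Here is a hedged little sketch of a pre-send exposure check, entirely my own invention rather than anything Knight or the SEC describe in code:

    class ExposureGuard:
        """Toy pre-send risk check: refuse new child orders once a gross
        position limit is breached.  Names and thresholds are illustrative."""

        def __init__(self, gross_limit_usd: float):
            self.gross_limit_usd = gross_limit_usd
            self.gross_position_usd = 0.0

        def record_fill(self, notional_usd: float):
            self.gross_position_usd += abs(notional_usd)

        def may_send(self) -> bool:
            return self.gross_position_usd < self.gross_limit_usd

    guard = ExposureGuard(gross_limit_usd=2_000_000)
    guard.record_fill(1_500_000)
    print(guard.may_send())    # True: still under the limit
    guard.record_fill(700_000)
    print(guard.may_send())    # False: stop the router and wake up a human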

When something goes wrong during a new deployment, a programmer's first instinct is to doubt the newest code. No matter what the bug is, almost always the first thing you do is look at the change you most recently made, since odds are that is what caused the problem. So this reaction was predictable, even if sadly it was exactly the worst thing to do:

In one of its attempts to address the problem, Knight uninstalled the new RLP code from the seven servers where it had been deployed correctly. This action worsened the problem, causing additional incoming parent orders to activate the Power Peg code that was present on those servers, similar to what had already occurred on the eighth server.

The SEC document contains a long series of suggestions about techniques that might have prevented these problems. To my mind, many of these ring rather hollow:

a written procedure requiring a simple double-check of the deployment of the RLP code could have identified that a server had been missed and averted the events of August 1.

This seems mad to me. The solution to a human mistake by a manual operator who was deploying new code, by hand, to 8 production servers is not, Not, NOT to institute a "written procedure requiring a simple double-check." That is a step backwards: humans make mistakes, but the answer is not to add more humans to the process.

Rather, the solution is that the entire deployment process should be automated, with automated deployment and automated acceptance tests.
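
Even a small amount of automation would have caught the missed server. As a sketch (the host names, fingerprints, and reporting mechanism are all hypothetical), a post-deploy check that every production server reports the same build fingerprint:

    def check_fleet(expected: str, reported: dict) -> list:
        """Return the servers whose deployed build fingerprint does not match
        what we shipped.  How the fingerprints are collected (ssh, an agent, a
        /version endpoint) is deployment-specific; the point is that a machine
        does the comparison, not a person."""
        return [host for host, fingerprint in reported.items() if fingerprint != expected]

    # Hypothetical fleet report: seven servers have the new build, one was missed.
    expected = "sha256:abc123"    # fingerprint of the build we meant to ship
    reported = {f"smars-{i}": "sha256:abc123" for i in range(1, 8)}
    reported["smars-8"] = "sha256:stale999"

    stale = check_fleet(expected, reported)
    if stale:
        raise SystemExit(f"Deployment incomplete, do not go live: {stale}")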

Other observations that the SEC make are more germane, in my opinion:

in 2003, Knight elected to leave the Power Peg code on SMARS’s production servers, and, in 2005, accessed this code to use the cumulative quantity functionality in another application without taking measures to safeguard against malfunctions or inadvertent activation.

It is, in general, a Very Good Idea to reuse code. The code you don't have to write contains the fewest bugs. So I can see why the programmers at Knight wanted to keep their existing code in the system, and tried to reuse it.

The flaw, in my view, is that this code wasn't adequately tested. They used existing code, but didn't have automated regression tests and automated acceptance tests to verify that the code was behaving the way they expected.
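
Concretely, even a handful of automated regression tests around the reused cumulative-quantity logic, run on every build, would have had a chance of catching the decayed Power Peg path. A sketch, with a hypothetical stand-in for the real function:

    import unittest

    def remaining_quantity(parent_qty: int, fills: list) -> int:
        """Hypothetical stand-in for the reused cumulative-quantity function:
        how many shares of the parent order are still unfilled."""
        return max(0, parent_qty - sum(fills))

    class CumulativeQuantityTest(unittest.TestCase):
        def test_stops_at_zero_when_parent_is_filled(self):
            self.assertEqual(remaining_quantity(100, [60, 40]), 0)

        def test_never_goes_negative_on_overfill(self):
            self.assertEqual(remaining_quantity(100, [60, 60]), 0)

        def test_partial_fill_leaves_the_rest(self):
            self.assertEqual(remaining_quantity(100, [25]), 75)

    if __name__ == "__main__":
        unittest.main()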

This is, frankly, hard. Testing is hard. It is hard to write tests, it is even harder to write good tests, and then you have to take the time to run the tests, and you have to review the results of the tests.

In my own development, it's not at all uncommon for the writing of the automated tests to take as long as the development of the original code. In fact, often it takes longer to write the tests than it does to write the code.

Moreover, the tests have to be maintained, just like the code does. And that's slow and expensive, too.

So I can understand why the programmers at Knight succumbed, apparently, to the temptation to skip the tests and deploy the code. Sadly, for them, that was a tragic mistake, and Knight Capital is now part of Getco Group.

Automated trading, like many of the most sophisticated uses of computers today, is not without its risks. This is not easy stuff. The rewards are high, but the risks are high, too. It is useful to try to learn from those mistakes, sad though it may be for those who suffered from them, and so I'm glad that the SEC took the time to share their findings with the world.