Tuesday, June 30, 2009

Cat 2, Dell 0

I'm now on my third power supply for my old Dell laptop.

This laptop is great; it's reliable and plenty powerful for my needs. I run Ubuntu on it, and use it as an alternate machine for learning about various new technology.

However, the battery is essentially defunct, so I have to run it in an always-plugged-in configuration: the power supply is always connected both to the laptop and to the wall.

Unfortunately, one of the pets in my house (my wife's money is on the cat) apparently thinks that this particular power cord is tasty.

Twice now, in an 18-month period, the cat has chewed far enough through the power cord to short it out and ruin it.

Happily, replacing a Dell D610 power supply is not the end of the world.

And it's very nice to get my laptop back into operation (I suspect I have quite a lot of Ubuntu updates to apply).

But why does the cat like to chew on this cord?

Sunday, June 28, 2009

The greatest (U.S.) goal ever?

The setting: the 2009 Confederations Cup, the every-four-years warm-up tournament held in the country that will host the upcoming World Cup, so that it can practice and refine its processes and protocols in anticipation of the following year's World Cup matches. Next year's World Cup will be held in South Africa.

It's a small tournament, just eight teams, including the host country, but it has attracted the best players in the world: Brazil, Spain, and Italy are all participating. Early in the tournament, the US team lost badly to both Italy and Brazil, but following an improbable series of events, the US find themselves suddenly in a match with Spain, which they win in entirely unexpected fashion, and advance to the final, a re-match against Brazil.

Midway through the first half, the action is down by the US goal. The US are ahead by a goal, and Brazil are pressing their attack ferociously. Everyone has retreated far to the US side, and even the Brazilian defenders are three quarters of the way down the field.

Suddenly the ball pops loose in front of the box, and an alert midfielder sends the ball to US superstar Landon Donovan, who finds himself with the ball, but 80 yards from the Brazilian goal.

Donovan takes the ball and begins racing up the center of the field as fast as he can. Quickly and efficiently, the two Brazil defenders move toward him, poised to reclaim the ball. But just instants before they converge, Donovan sends the ball out to the wing, where Charlie Davies is streaking down the left side.

The pass is a bit strong, and is out ahead of Davies, and it looks for a second like the play is over. But Davies lunges with all his might, and is able to reach the ball and send it back into the center of the field, where Donovan, who hasn't paused at all, is still streaking forward toward the goal.

Donovan catches the ball with his right foot and shifts his weight to the left side, somehow keeping his feet as the Brazilian defender falls to the grass.

Then Donovan steps once to the left to reach the ball, and sends a left-footed blast just past the diving feet of the Brazilian goalie and into the far corner of the net.

It's all over in 9 or 10 seconds, but it bears watching a few times if you haven't seen it yet. There's a second clip which is shot from overhead which is a great alternate perspective; fast-forward to around 1:50 in the second video to see the Donovan goal.

The US still have a long way to go to compete at the very highest level of soccer, but I hope that this goal will be remembered for a long, long time: "do you remember Donovan's brilliant run at the 2009 Confederations Cup, and how he sliced through the Brazilian defense like a saber?"

Thursday, June 25, 2009

Benchmarking GROUP BY

I want to design a GROUP BY benchmark.

As part of DERBY-3002, I've been re-designing the implementation of the GROUP BY statement in Derby. Specifically, I've been working on implementing ROLLUP grouping, which is a feature that enables the computation of multiple levels of grouped aggregation with a single pass over the input data.

I believe that I've got this feature working, and in addition to adding support for ROLLUP grouping, I've also, as a side effect, added support for multiple distinct aggregates in the same result set.

However, my implementation is substantially different from the previous one: the previous implementation generally computed aggregates using sort observers, whereas my new implementation does not use sort observers at all. Instead, it sorts the data (unless it's already in sorted order) and then processes it in a single pass over the sorted values, computing the various levels of aggregates as it goes.
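For illustration, here's a toy sketch of the single-pass technique. This is not Derby's actual code, and the sample column names are made up; the point is just that, with the input sorted on the grouping columns, each change in a grouping column closes out the groups at and below that level:

```java
// Sketch of single-pass ROLLUP aggregation over pre-sorted input.
// Input rows are (a, b, value) triples, already sorted by (a, b).
// One running SUM is kept per grouping level: (a, b), (a), and ().
import java.util.*;

class RollupSketch {
    // Each output line is "a, b -> sum", with "(all)" meaning "all values".
    static List<String> rollupSum(String[][] rows) {
        List<String> out = new ArrayList<>();
        String curA = null, curB = null;
        long sumAB = 0, sumA = 0, grand = 0;
        for (String[] r : rows) {
            String a = r[0], b = r[1];
            long v = Long.parseLong(r[2]);
            if (curA != null && !a.equals(curA)) {
                // a changed: close out both the (a, b) and the (a) groups
                out.add(curA + ", " + curB + " -> " + sumAB);
                out.add(curA + ", (all) -> " + sumA);
                sumAB = 0; sumA = 0;
            } else if (curB != null && !b.equals(curB)) {
                // b changed within the same a: close out the (a, b) group
                out.add(curA + ", " + curB + " -> " + sumAB);
                sumAB = 0;
            }
            curA = a; curB = b;
            sumAB += v; sumA += v; grand += v;
        }
        if (curA != null) {
            out.add(curA + ", " + curB + " -> " + sumAB);
            out.add(curA + ", (all) -> " + sumA);
        }
        out.add("(all), (all) -> " + grand);
        return out;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"east", "x", "1"}, {"east", "x", "2"},
            {"east", "y", "3"}, {"west", "x", "4"},
        };
        rollupSum(rows).forEach(System.out::println);
    }
}
```

Running this over the four sample rows yields the (a, b) groups, the (a) subtotals, and the grand total, all from a single pass and without re-reading the input.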

During review of the changes, Knut Anders wondered what the overall effects on performance were. This is a good question, and I'm trying to figure out how to answer it. I think that the best thing to do is to write a benchmark, which we can then run against both the old and the new implementations, and measure the results.

Writing a GROUP BY benchmark is simpler than many other possible benchmarks, but it is still not an easy job. For the most part, the overall structure of a reasonable GROUP BY benchmark is probably something like:


for each interesting GROUP BY statement:
    note the current time
    execute the GROUP BY statement and
        read all the results
    note the current time again and
        compute the elapsed time

print a simple result stating
    the statements that were run and
    the elapsed time for each statement
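In plain JDBC, that loop might be sketched like this; the connection URL, table name, and statements here are just placeholders, not the real benchmark:

```java
// Minimal GROUP BY timing harness sketch (placeholder URL and queries).
import java.sql.*;

class GroupByBench {
    // Time one statement: run it and drain every row, per the outline above.
    static long timeQuery(Connection conn, String sql) throws SQLException {
        long start = System.nanoTime();                  // note the current time
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) { }                        // read all the results
        }
        return (System.nanoTime() - start) / 1_000_000;  // elapsed milliseconds
    }

    // Format one line of the simple result report.
    static String report(String sql, long millis) {
        return sql + " : " + millis + " ms";
    }

    public static void main(String[] args) {
        String[] queries = {                             // hypothetical statements
            "SELECT a, SUM(v) FROM t GROUP BY a",
            "SELECT a, b, SUM(v) FROM t GROUP BY a, b",
        };
        try (Connection conn =
                 DriverManager.getConnection("jdbc:derby:benchdb")) {
            for (String sql : queries)
                System.out.println(report(sql, timeQuery(conn, sql)));
        } catch (SQLException e) {
            System.out.println("no benchmark database available: " + e.getMessage());
        }
    }
}
```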

The interesting questions, then, are:
  • what statements should we run, and
  • what data should we run them against?
The statements of interest should include a mix of existing GROUP BY statements, some new GROUP BY statements utilizing the ROLLUP features, and possibly some new GROUP BY statements utilizing multiple DISTINCT aggregates. The presence of the new statements in the benchmark is a complicating factor, because of course these statements can't be run against older versions of Derby, so the benchmark will need to contain two sets of these statements:
  • the new statements, which can only be run against Derby with my proposed DERBY-3002 patch, and
  • an equivalent set of statements which can be run against older versions of Derby. For example, ROLLUP can be replaced by an equivalent set of UNION'ed GROUP BY statements. And multiple DISTINCT aggregates can be replaced by some sort of join of multiple subqueries, each with its own single DISTINCT aggregate.
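To make the rewriting concrete, here's a hypothetical example against a table t(a, b, v), assuming the SQL-standard ROLLUP syntax and VARCHAR grouping columns; the CASTs are needed so that the NULL columns have a definite type:

```sql
-- The ROLLUP form, which needs the DERBY-3002 patch:
SELECT a, b, SUM(v) FROM t GROUP BY ROLLUP (a, b)

-- An equivalent set of UNION'ed GROUP BY statements for older Derby versions:
SELECT a, b, SUM(v) FROM t GROUP BY a, b
UNION ALL
SELECT a, CAST(NULL AS VARCHAR(10)), SUM(v) FROM t GROUP BY a
UNION ALL
SELECT CAST(NULL AS VARCHAR(10)), CAST(NULL AS VARCHAR(10)), SUM(v) FROM t
```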
The data to use in the benchmark is also important:
  • There should be enough data to make the queries interesting, and require the database to do enough work so that differences in the implementation performance become important. If there isn't much data, almost any algorithm would be good enough.
  • The data needs to include the right data types; should we use strings? numbers? large objects? dates and times?
  • The distribution of the data is very important: how many groups are there in the data, and what is the ratio between the number of groups, and the total amount of data?
Both of these problems (choosing the workload of statements to run, and choosing the data to run them against) are tricky, and mistakes in either area could greatly reduce the value of the benchmark.

One place to start is to incorporate the GROUP BY examples from the Derby documentation, and from the ROLLUP specification.

Another possibility is to use statements and data from existing benchmarks; for example, there are the TPC-* benchmarks from tpc.org; TPC-D might be the best place to start looking for GROUP BY examples. Are there other such benchmarks?

Where else might I look for inspiration for building a GROUP BY benchmark?

Wednesday, June 24, 2009

More detective stories

Less than a month since Air France 447, another puzzle.

Something went wrong with the automated control systems on the Washington D.C. mass transit system, resulting in a horrible crash with multiple fatalities.

"That train was never supposed to get closer than 1,200 feet, period," said Jackie Jeter, president of a union that represents Metro workers.

I rode the Metro once, in the early 80's; at that time it was brand new, bright and gleaming.

This investigation is still very early, but it is already known that the train in question was not equipped with an event recorder, so there will be less information for the investigators to chew on.

The Metro is similar in a number of ways to San Francisco's BART, which I ride regularly. They are both roughly the same age, use roughly similar technologies, and have broadly similar purpose. There are some quite interesting articles comparing the two systems: here's one from transit planners in Seattle; here's another from a group in Houston.

As transit systems become more complex, the need for automation assists becomes crucial. BART was a major symbol of this in the 1970's: both as a symbol of the importance of automation in modern transit systems; and as a symbol of the complications that come with automation.

California is just beginning work on a massive state-wide high-speed train system, funded by some recent ballot initiatives. The level of automation required for such a system will be substantial; this is of course both an opportunity and a challenge.

Saturday, June 20, 2009

isAssignableFrom and instanceof

I had occasion recently to use Class.isAssignableFrom again.

I don't use this API routinely, so when I do use it I always seem to have to stop and think about it for a second, to make sure I know what I'm doing. Class.isAssignableFrom is closely related to instanceof, but they do slightly different things, and both are extremely powerful.

I think that one of the reasons that I get a bit confused with isAssignableFrom is that it sort-of reads backwards from instanceof. That is, suppose you have the following:

interface IFace { }

class A implements IFace { }

class B extends A { }

class CNum extends java.math.BigInteger implements IFace {
    public CNum(String s) { super(s); }
}

And then suppose that you create some instances of these classes:

A a = new A();
B b = new B();
CNum c = new CNum("1");
IFace i = new CNum("2");

Then the following uses of isAssignableFrom and instanceof are somewhat illuminating:

if ( Object.class.isAssignableFrom(B.class) ) { ... }

if ( b instanceof Object ) { ... }

These are both true, because Object is assignable from everything. Object is the root type of all Java classes, so Object.class is always assignable from every other Java class.

Looking at the two statements above, do you see why I say that one reads backwards from the other? With instanceof, you tend to see the subclass on the left, while with isAssignableFrom, you tend to see the subclass on the right. Or, at least, I think that's why I sometimes stumble over isAssignableFrom, because instanceof seems to read naturally to me, but I always have to stop a second and think about isAssignableFrom.

Any instance of B is an instance of IFace, because B is a subclass of A, and A implements IFace, so the following are always true:

if (b instanceof IFace) { ... }
if (b instanceof A) { ... }

But a particular instance of IFace might in actuality be an A, a B, or a CNum, so the following is true because we initialized the variable i to be a CNum instance:

if (i instanceof BigInteger) { ... }

And for that very same reason, this statement is false:

if ( i.getClass().isAssignableFrom(b.getClass())) { ... }

A class is assignable from itself:

if ( c.getClass().isAssignableFrom(CNum.class)) { ... }

An interface is assignable from an implementing class:

if ( IFace.class.isAssignableFrom(b.getClass())) { ... }
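Here are the examples collected into one small runnable program (the class declarations mirror the ones above):

```java
import java.math.BigInteger;

interface IFace { }
class A implements IFace { }
class B extends A { }
class CNum extends BigInteger implements IFace {
    public CNum(String s) { super(s); }
}

class AssignableDemo {
    public static void main(String[] args) {
        B b = new B();
        CNum c = new CNum("1");
        IFace i = new CNum("2");

        System.out.println(Object.class.isAssignableFrom(B.class));  // true
        System.out.println(b instanceof Object);                     // true
        System.out.println(b instanceof IFace);                      // true
        System.out.println(i instanceof BigInteger);                 // true
        // i is a CNum, and a B is not a CNum, so this is false:
        System.out.println(i.getClass().isAssignableFrom(b.getClass())); // false
        // a class is assignable from itself:
        System.out.println(c.getClass().isAssignableFrom(CNum.class));   // true
        // an interface is assignable from an implementing class:
        System.out.println(IFace.class.isAssignableFrom(b.getClass()));  // true
    }
}
```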

It's a wonderful API.

Thursday, June 18, 2009

Redacting with Confidence

Redacting is the fancy term for blacking out parts of a document that you don't want other people to read.

With a physical document, this is pretty straightforward: you get a thick black marker, and you smear black ink over the sensitive part of the document until nobody can read it anymore.

With a softcopy document, this is trickier, because modern word processing software is pretty sophisticated, and although it allows you to mask over the text, the original text is still present.

I read about this years ago on Bruce Schneier's wonderful CryptoGram newsletter, and somehow it stuck in my mind.

Then, the other day, my wife asked me about how to do this properly, because it turned out that her office actually needed to redact a PDF document in a secure fashion.

I was impressed to see that the National Security Agency actually has some pretty good advice about this, published as a simple and clear document describing the overall process, with advice about common mistakes, and checklists for how to avoid them.

The most recent version of the NSA guide, which covers Word 2007, can be found at: http://www.nsa.gov/ia/_files/support/I733-028R-2008.pdf

A previous version, which covers older versions of Microsoft Word, can be found at:

The documents are pretty good, in my opinion; your tax dollars at work!

Friday, June 12, 2009

Perforce listens to its customers

Perforce 2009.1 is entering testing.

There are enhancements to the server, to the command line clients, and to several of the specialized tools. In particular, I noticed:
  • There's a command-line version of the move/rename command: "p4 move". This feature, which is perhaps the #1 requested feature in the product (!) requires upgrading the server to 2009.1 also.
  • p4v has lots of new features, including "Reconcile Offline Work", which sounds quite intriguing for people who work disconnected routinely. We used to have to script this using the command-line tools; you can see how much the process has improved by comparing the P4V and command-line sections of the new Working Disconnected doc.
  • "p4 opened" now has a "-u" flag to look just for files opened by a particular user.
  • P4V now checks and prompts you if it notices that you are opening an older version of a file for edit. This is a nice feature, and may help people minimize their merging by helping them realize when they're not up-to-date with the server.
  • "p4 logtail" looks like it might be handy. Happily, our server is quite free from problems nowadays.
  • The server release doesn't appear to have any database changes. The database changes aren't common, but I always pay attention to them because they may require extra disk space and time during the server upgrade step.
I've always been impressed with Perforce's release notes. They are very, very thorough about describing exactly what they changed, and how it affects things, and their Technical Support staff is incomparable. You can always ask them questions if things aren't clear, and they'll answer straightaway. Perforce will even document internal changes that other companies would never tell their customers about, so you'll see entries in the release notes like:

#188212 (Bug #28268) *
Undocumented 'p4d -jds' crash fixed. This operation is
still not meant for regular production use.

Perforce are quite disciplined when it comes to rolling out releases; they almost always deliver a release every 6 months, with a short beta period open to all without any special restrictions. Because I don't generally need to live on the bleeding edge with our Perforce usage at work, I generally follow this simple strategy:
  • Read the release notes when the beta is announced, to get a feel for what's upcoming.
  • Wait until the release finishes its beta period.
  • Download the client software to my regular development machines (Linux and Windows) and start using it with the current server. I like to try to be the first person at my company to be using the latest Perforce client software, but I don't get so extreme as to run the beta clients, generally.
  • Wait another several months before upgrading the server software.
I usually find that I'm about 1-2 releases out of date on the server software, but that's OK.

Tuesday, June 9, 2009

Caching the non-existence of data

Generally, when you look an item up in a cache, you:
  • Check the cache for the item
  • If it's not found, go get the item from the backing store
  • Put the item into the cache, and return it to your caller
But what should the behavior be when the caller wants to look for a non-existent item?

A simple implementation is to use the same algorithm as above, except that the last step then becomes:
  • Return to your caller the fact that no such item was found.
This is a fine implementation, and probably works well for many cases.

However, if (a) it's expensive to go get the item from the backing store, and (b) requests for non-existent items are routine, then you may find this implementation unsatisfactory.

Instead, you might want to consider caching the non-existence of the item: give your cache a special kind of entry which holds a "no such item with this key exists" token, so that when a caller asks for an item which does not exist, you can satisfy that request from your cache.
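A minimal sketch of the idea, with a plain Map standing in for both the cache and the (hypothetical) expensive backing store:

```java
// Cache that also remembers misses, using a sentinel "no such item" token.
import java.util.*;

class NegativeCache<K, V> {
    // Sentinel stored in the cache to mean "known not to exist".
    private static final Object NO_SUCH_ITEM = new Object();
    private final Map<K, Object> cache = new HashMap<>();
    private final Map<K, V> backingStore;   // stand-in for the real store
    int storeLookups = 0;                   // visible so the demo can count trips

    NegativeCache(Map<K, V> backingStore) { this.backingStore = backingStore; }

    @SuppressWarnings("unchecked")
    V get(K key) {
        Object v = cache.get(key);
        if (v == NO_SUCH_ITEM) return null; // cached miss: no store access
        if (v != null) return (V) v;        // cached hit
        storeLookups++;                     // expensive backing-store fetch
        V fetched = backingStore.get(key);
        cache.put(key, fetched == null ? NO_SUCH_ITEM : fetched);
        return fetched;
    }

    public static void main(String[] args) {
        NegativeCache<String, String> c =
            new NegativeCache<>(Map.of("k1", "v1"));
        c.get("missing"); c.get("missing"); c.get("missing");
        System.out.println(c.storeLookups); // 1: later misses came from the cache
    }
}
```

In the demo, three requests for the missing key cost only one trip to the backing store; the second and third are answered by the sentinel entry.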

There are a variety of details that exist in this implementation, but I think that the basic idea is sound. For the cache that I maintain at work, I'm considering whether I need to pursue this idea, because I have some reason to believe that some of my performance issues are due to requests for non-existent data.

But first I'm going to try to figure out why my callers are asking for the non-existent data routinely, because I don't think that should be a common behavior, and might indicate a bug in the callers.

Yet it got me wondering: how do other caches behave? Is it fairly routine for a modern object cache implementation to cache information about non-existing objects? Or do most caches only cache information about existing objects?

Monday, June 8, 2009

Cache Oblivious Data Structures

Cache oblivious data structures are a new and quite interesting concept to me.

I first came across these ideas while browsing Wes Felter's blog, sometime last winter, but this is apparently quite an old idea which I've only learned about recently. Most of what I've read cites Prokop's 1999 Master's Thesis at MIT as the work which really got this area going, but that work in turn builds on a data structure called the van Emde Boas layout which dates from 1977, which is quite old (in Computer Science years that is). I'm particularly interested in cache-oblivious BTree algorithms, but I find all of the cache-oblivious research quite interesting; the B-Tree, of course, is an older-still data structure, from Rudolf Bayer's work at Boeing in 1970.

The basic notion of cache oblivious data structures is fairly simple: modern computers have multi-level storage hierarchies, in which data which is being manipulated by the CPU moves back and forth between various levels of storage:
  • in a register on the CPU
  • in specialized cache memory (often called Level-1 or Level-2 cache)
  • in the computer's main memory
  • on disk
The issue is that different computers have different-sized caches, and the size of the cache is important. Specifically, the block size of the cache is important; the block size is the unit of transfer and replacement (that is, an entire block is moved in and out of the cache as a single unit).

An algorithm is referred to as cache-aware if the algorithm itself contains explicit parameters that can be tuned to optimize the performance of the algorithm for a specific cache size. Most algorithms are cache-aware, meaning that, in practice, they either must be tuned specifically on a machine-by-machine basis, or will perform better on some machines than on others, because their default behavior works well for some cache sizes but not others.

This is undesirable; we'd prefer to write algorithms that work equally well with different cache sizes, without requiring explicit configuration and tuning. Such algorithms are called cache-oblivious, which doesn't mean that they don't know of the existence of the cache, but rather that they automatically work well with various different cache sizes without needing to be configured or tuned.

Caches, in general, take advantage of two principles:
  • The principle of temporal locality,
  • And the principle of spatial locality.
The principle of temporal locality is the observation that a program which uses a particular piece of data is likely to use that same data again soon. So a good caching implementation will try to keep data that the program uses around for a little while, hoping that it is used again soon, and will generally try to discard from the cache that data which the implementation guesses is least likely to be used again soon. Many cache implementations attempt to predict the program's future data accesses by observing the recent data access, leading to the implementation technique known as Least Recently Used.
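Java's standard library makes the Least Recently Used policy easy to sketch: LinkedHashMap can maintain its entries in access order and evict the eldest entry whenever the cache grows past its capacity:

```java
// A tiny LRU cache built on java.util.LinkedHashMap's access ordering.
import java.util.*;

class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true);     // accessOrder=true: reads refresh recency
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;   // evict the least recently used entry
    }

    public static void main(String[] args) {
        LruCache<Integer, String> c = new LruCache<>(2);
        c.put(1, "a");
        c.put(2, "b");
        c.get(1);                   // touch 1; 2 is now the least recently used
        c.put(3, "c");              // capacity exceeded: 2 is evicted
        System.out.println(c.keySet()); // prints [1, 3]
    }
}
```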

The principle of spatial locality is the observation that a program which uses a particular piece of data is likely to use some nearby data soon. A program which is stepping through an array is a good example of the principle of spatial locality, because an access to array element 1 is often followed by an access to array element 2, and then an access to array element 3, and so forth. So a good caching implementation will try to keep nearby data available in the cache.

The standard Bayer-McCreight BTree exhibits both temporal locality and spatial locality. However, the standard implementation uses a block size B which is generally chosen with respect to disk transfer sizes and tree heights, and which does not generally work very well for the memory cache hierarchy. This is where cache-oblivious BTree implementations and the van Emde Boas tree layout come into play. The van Emde Boas layout enables the BTree to arrange its index nodes in such a way that they have memory behavior similar to a single large array, which means that the in-memory behaviors of the cache-oblivious BTree have excellent memory cache performance while still preserving the BTree's excellent disk access performance.
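Here's a toy sketch of the recursive van Emde Boas layout for a complete binary tree, with nodes named by their level-order index (root = 1, so the children of node i are 2i and 2i+1). It's only meant to show the ordering, not to be a production index:

```java
// Van Emde Boas ordering of a complete binary tree's nodes.
import java.util.*;

class VebLayout {
    // Append the vEB ordering of the 'height'-level subtree rooted at 'root':
    // lay out the top half of the levels, then each bottom subtree, recursively.
    static void layout(int root, int height, List<Integer> out) {
        if (height == 1) {
            out.add(root);
            return;
        }
        int topH = height / 2;      // upper half of the levels
        int botH = height - topH;   // lower half of the levels
        layout(root, topH, out);    // the top tree first
        int k = 1 << topH;          // the top tree has 2^topH subtrees below it
        for (int j = 0; j < k; j++)
            layout(root * k + j, botH, out);
    }

    public static void main(String[] args) {
        List<Integer> out = new ArrayList<>();
        layout(1, 4, out);          // a 15-node tree, 4 levels
        System.out.println(out);
        // [1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]
    }
}
```

Notice how each small triangle of nodes (such as 4, 8, 9) ends up contiguous, so a root-to-leaf search touches a few contiguous runs of memory regardless of what the cache's block size happens to be.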

Apparently there is a new database startup company named Tokutek, which is trying to use these new types of BTree indexes in a complete database engine. The founders of Tokutek include a number of the researchers who developed cache-oblivious search tree algorithms, and they have some interesting postings on their blog discussing their ideas in more depth.

It also seems that other groups are working on similar techniques.

I'm still trying to wrap my head around the way that the van Emde Boas tree improves the in-memory spatial locality of the Btree search nodes, but it's been quite interesting to read the work so far, and I'm looking forward to learning more about these algorithms and techniques.

Saturday, June 6, 2009

Detective stories

I love detective stories, and I think that's why I love computer programming.

I was famously addicted to Encyclopedia Brown mysteries as a child, many many years before I discovered computers. I've always been one of those computer programmers for whom the most interesting part is when the computer does something unusual or unexpected, and the puzzle is: "why did it do that?" I love that sort of backward-reasoning where you are presented with a confusing behavior, and the goal is to reconstruct a series of steps by which the program ended up doing what it did.

Some people are wonderful at capturing this aspect of computer programming. Mark Russinovich is a great example; I lap up his essays the moment they arrive. I think that the format works well for describing the mental process of tracking down a complex problem; here are a few other recent examples from other writers in other fields.

So I've been very interested in the tragic mystery of what happened to Air France flight 447.
At this point, a week after the plane vanished, there is still much that is not known.

The interesting part, to me, is to follow the people who are trying to come up with theories about what could have happened and why. The effort has been focusing on the fragmentary and partial information that was sent by the plane's computers during its final minutes. What does this data mean? What can we conclude, and what is just speculation?

Modern airliners are incredibly complex. Sophisticated computer software is necessary even for small planes, and commercial passenger jets are just phenomenally complicated. But they are also amazingly capable. Captain Sullenberger's "Miracle on the Hudson" was aided by the amazing Airbus software, which helped him glide the plane gently to a safe outcome. But there are many mysteries about these vehicles too, and this is not the first time that an Airbus has behaved in a surprising and dangerous fashion, a fact which many others have noticed.

It takes patient, thorough investigation to figure these things out. The shining example of this is Richard Feynman's work on the Challenger investigation, which is a masterpiece of analysis.

Will we figure out the flight 447 mystery? I'm sure I'm not the only person who wants to see us make the best possible effort, and I'm pleased to see that the government has offered to help.

Update: Several thoughtful readers pointed me at some other interesting articles.

Update (2): Here's an analysis of the margin for error in the plane's likely circumstances, and here's another interesting reflection on the role of computer software in modern commercial airliners.

Update (3): Today's Christian Science Monitor reports on the progress of the investigation. Unfortunately, there are still many more questions than answers.

Thursday, June 4, 2009

Naming things

People who know me know that I'm rather a fanatic about naming things.

One of the fundamental lessons that I think I've learned over 30 years of writing software is that names matter. Spend a lot of time thinking about how you name variables, how you name methods and functions, how you name classes, how you name tables and columns, etc. Names are frustratingly hard to change later, but if chosen well they can make all the difference in writing clear and understandable code.

There are lots of good naming conventions out there. Go read some of them and try to stick to them.

I've found that, if names are well chosen, you will see a greatly reduced need for comments, because the code itself will be self-evident and natural to read:

if (shelvesToInventory.contains(shelf))

The flip side of the coin is that a poorly chosen name can simply demolish the understanding of code, and make it practically impossible to read. Often, when I find such code, I simply have to rewrite it on the spot, because otherwise I'm just constantly fighting with the code and I feel like my brain is thrashing.

Double-negatives in code often do this to me:

if (! o.isMissingParts() )
if (! o.isIncomplete())

emerges from my editor as the much more legible:

if (o.isComplete())

Anyway, the point of this post is that I nearly lost my lunch when I read this the other day:

public class BaseDependencyHandler
extends CustomDependencyHandler

Luckily there were no small children or church-goers around to be offended by my reaction...

Tuesday, June 2, 2009

set background=dark

Well, since I'm on the topic of Vim ...

I use Vim on lots of different computers: Windows machines, Solaris machines, Red Hat server machines, Ubuntu desktop machines. On my Windows machines, I use both Vim and gVim, kind of interchangeably. On the other machines, I generally use only Vim ("console Vim", that is), and I tend to be using either XTerms or Putty SSH sessions.

For some of my machines, I tend to have Putty configured so that its SSH sessions look like xterms, and I have my xterms configured to run with a black background. This seems to work quite well, but Vim's default syntax coloring options for console mode don't work very well with black-background terminals.

Well, just the other day, I discovered that Vim has a built-in workaround for this problem:

:set background=dark

Works like a charm!

Monday, June 1, 2009


I've been a vi user forever.

I think I probably first learned vi around 1985, in Boston, while working at CCA. CCA was a very interesting place for technology, as it was quite closely tied into the Cambridge tech world. For example, it was involved with computer networking in the 1960's, and was one of the 113 official nodes on the ARPANET. By the time I got there, the ARPANET had long been retired, and we were doing that weird UUCP thing, giving out our email addresses as things like decvax!cca!bryanp and the like. It is definitely strange to look back at how things were done then.

At any rate, although in my group we were doing most of our work on IBM mainframes, we had a number of community terminals scattered around which were connected to our Unix machines. I can't remember what version of Unix we were running on our machine then ... could it have been BSD Unix in 1986? I think so, maybe. Well, whatever it was, we each had an email account on that system, and could walk down the hall to sit at the machine and check our email and read Usenet, which was the 1986 version of surfing the web :)

So that was when I first used vi.

Nowadays, for the most part, I don't use vi; I use vim, which is the modern, open, wonderful version of vi.

Someday I'll go on and on about how wonderful vim is, but for today, what was wonderful about vim was its documentation. I had had a network glitch, which orphaned a number of my Putty SSH sessions, including several that were active in vim sessions. Usually this all cleans up automatically, but this time when I signed back on to the system and started some more edit sessions, I began to receive this message:

"E138: Can't write viminfo file"

I checked my viminfo file, and it seemed to have the right permissions, so I Googled the message, and found that the vim docs had exactly the info I needed, together with a detailed explanation of how the viminfo file works, what it does, and how to handle its care and feeding.

Most people that I know think that I'm nuts to still be using vim after 25 years, but: I know how to use it, it works, and I'm productive using it. I fear I'm unlikely to change.

Besides, the guy in the office next door (Roger) uses Emacs!