Monday, August 31, 2009

Benchmarking Derby GROUP BY

I've been building a simple GROUP BY benchmark for Derby.

I've been prototyping a new GROUP BY implementation (DERBY-3002) which provides support for the new ROLLUP keyword. As part of this work, it's important to be able to get some data about the relative performance of the current Derby GROUP BY implementation versus the new proposed implementation.
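
To make the ROLLUP idea concrete, here is a minimal sketch of what a query shaped like `SELECT region, product, SUM(amount) ... GROUP BY ROLLUP(region, product)` is expected to return. It is an in-memory illustration only: the `Sale` class, the column names, and the single-pass approach are all invented for this example, and it says nothing about how the current Derby code or the DERBY-3002 patch computes the result.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: SUM(amount) for GROUP BY ROLLUP(region, product) over
// in-memory rows. This is NOT Derby's algorithm, just the expected result shape.
class RollupSketch {

    // Hypothetical row shape, invented for this example.
    static final class Sale {
        final String region;
        final String product;
        final int amount;

        Sale(String region, String product, int amount) {
            this.region = region;
            this.product = product;
            this.amount = amount;
        }
    }

    // Grouping key; a null marks a column that has been rolled up, the same
    // way ROLLUP reports NULL for columns that were aggregated away.
    static List<String> key(String region, String product) {
        return Arrays.asList(region, product);
    }

    // Returns grouping key -> sum for all three ROLLUP levels:
    // (region, product), (region), and the grand total ().
    static Map<List<String>, Integer> rollup(List<Sale> rows) {
        Map<List<String>, Integer> sums = new LinkedHashMap<>();
        for (Sale s : rows) {
            sums.merge(key(s.region, s.product), s.amount, Integer::sum);
            sums.merge(key(s.region, null), s.amount, Integer::sum);
            sums.merge(key(null, null), s.amount, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Sale> rows = Arrays.asList(
            new Sale("East", "apples", 10),
            new Sale("East", "pears", 5),
            new Sale("West", "apples", 7));
        rollup(rows).forEach((k, sum) -> System.out.println(k + " -> " + sum));
    }
}
```

A plain GROUP BY would produce only the first of those three levels, which is why the two implementations are worth benchmarking against each other.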

So I've been working on building a simple GROUP BY benchmark.

Happily, Derby already has a quite sophisticated benchmarking infrastructure:
  • Two existing classes provide support for loading a scalable Wisconsin benchmark schema to an arbitrary size.
  • The perf.clients package provides general benchmarking capability, with generic classes to manage the overall benchmark.
  • A somewhat similar benchmark was written not too long ago to measure index join performance.
So, starting with that infrastructure, I've been writing a GROUP BY benchmark (DERBY-4363). My first implementation demonstrated that I could run GROUP BY statements in this simple harness. My next implementation needs to provide a richer set of statements, and also needs to provide command-line arguments to pick the specific statement to run.
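
The "pick the statement from the command line" part could look something like the sketch below. Everything in it is an assumption for illustration: the statement names are made up, the table and column names (`TENKTUP1`, `two`, `ten`, `unique1`) follow the usual Wisconsin benchmark naming rather than anything confirmed here, and the timing loop is a stand-in, not the perf.clients API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of choosing a benchmark statement by name. Statement names are
// hypothetical; the schema names are an assumption about the Wisconsin tables.
class GroupByStatementPicker {

    static final Map<String, String> STATEMENTS = new LinkedHashMap<>();
    static {
        STATEMENTS.put("count-by-two",
            "SELECT two, COUNT(*) FROM TENKTUP1 GROUP BY two");
        STATEMENTS.put("sum-by-ten",
            "SELECT ten, SUM(unique1) FROM TENKTUP1 GROUP BY ten");
    }

    // Looks up the SQL text for a statement name, failing loudly on typos.
    static String pick(String name) {
        String sql = STATEMENTS.get(name);
        if (sql == null) {
            throw new IllegalArgumentException(
                "Unknown statement '" + name + "', expected one of " + STATEMENTS.keySet());
        }
        return sql;
    }

    // Runs the work the given number of times and returns elapsed milliseconds.
    static long timeMillis(Runnable work, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            work.run();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        String sql = pick(args.length > 0 ? args[0] : "count-by-two");
        // In a real run the Runnable would execute sql over JDBC; it is left
        // empty here so the sketch runs without a database.
        long ms = timeMillis(() -> { }, 100);
        System.out.println(sql + ": " + ms + " ms for 100 iterations");
    }
}
```

Keeping the statements in a named table like this makes it easy to run the same named query against both trunk and the patched build.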

Once I get a reasonable benchmark, which I hope to be able to do this week, I'll then use it to collect a set of performance numbers against the current Derby trunk, and against the DERBY-3002 patch.

This will hopefully give us some hard data regarding the performance of the new GROUP BY algorithm.

Thursday, August 27, 2009

Labrador Retriever code reading

Every time I take my dog out for a walk, even on a path we've trod hundreds of times, she finds something new and interesting to investigate.

As you're stepping through the code, fixing bugs, reading new sections, try to keep your senses alert for things to explore and investigate.

That is, you want to be sensitive to code smells, a wonderful term that was coined by Martin Fowler and Kent Beck some years ago.

Over time, you'll find your ability to detect code smells will improve with practice, but there are also some easy tools which can help you "sniff" the code for possible problem areas:
  • Turn on the warnings in your compiler, and pay attention to what the warnings are saying. Modern compilers detect a large variety of easy-to-address problems.
  • Find a good static analysis tool and run it. For example, try FindBugs or PMD for Java code, or a classic tool like lint for C.
  • Better still, integrate these tools into your build system, so that it runs the warnings and static analysis reports routinely, and places the current results somewhere that you can view them whenever you want.
  • Augment your unit test runs with a code-coverage tool like Cobertura or Emma, and have a look at the reports to help understand what code is being tested, how thoroughly it is being tested, and what code isn't being tested at all.
Pretty soon, just like my puppy, you'll be finding fascinating new things to explore everyplace you look!

Wednesday, August 26, 2009

Bug-fixing code reading

To recap:
  • I've been digging into a new body of code
  • I started by reading through lots of the code in my editor
  • Then I started setting up some scenarios and stepping through them in my debugger
  • Then I tried to explain how things work, to a (bored, but kind) colleague or two
What's the next step? It's time to fix a bug!

This is a point where a little bit of luck can come in handy, because you don't want to pick the wrong bug. You want to pick a bug which is challenging enough to make you have to exercise some of the knowledge and mental models that you've been developing, but not so hard that you get stuck, and discouraged, and annoyed.

So if you happen to have somebody around who can suggest a good bug, that's great. Otherwise, you'll just have to take your best guess from the bugs that are known.

Update the bug-tracking system to indicate that you're working on the bug, and get to work:
  • Ensure that you can reproduce the bug
  • If the bug-tracking system doesn't already contain an automated test case for the bug, write one
  • See if you can modify the code, to make the test pass
  • If you can, run all the other tests, to see if your change had any other (observable) effects

Tuesday, August 25, 2009

Instructional code reading

As I mentioned in my last note, I'm immersing myself in a new library of code.

Slowly but surely, I start to understand the code, and why it behaves like it does. Or, at least I think I understand it, but I don't really trust myself to know whether I actually do understand it, so periodically I try to validate whether or not I am indeed understanding the code.

There's an old saying: "If you can't teach it, you don't know it".

So one of the exercises I use is to attempt to teach this new code to somebody else. This works best if you actually have a sucker^H^H^H^H^H^Hcolleague nearby and you can tediously bore that person with your attempts to explain the behavior of the code, while trying to answer their questions.

But even if you don't have such a resource, you can still practice this exercise: just go to a whiteboard or wiki page or blank PowerPoint preso or something similar, and try to type up some notes explaining how the new code works.

You'll find that as you try to explain it, you naturally start raising questions, and pursuing those follow-on questions leads you to improve your understanding.

Here's a nice short essay expressing this notion quite clearly; I like this summarization:

"The act of teaching excels at revealing those gaps in our knowledge of a subject. This is especially true when a student asks good questions."

Now I'm off to work on that wiki page...

Monday, August 24, 2009

Dynamic code reading

Recently, I've been digging into a new body of code.

As a result of a sequence of events, I've picked up responsibility for a large body of existing code, and the original author of that code is not available to me. This is not the first time this has happened to me; in fact, this is actually a fairly common circumstance in the software industry, and if anything it's a bit of a surprise that this doesn't happen to me more often.

At any rate, I'm faced with the prospect of learning how this code works, learning where its strengths and weaknesses are, and then constructing a proposal for where to take this code in the future.

My first milestone is to take the existing test suites, verify that I can run those suites, and then try to make those suites run faster and more reliably. I'm lucky: the project was left in a state where the code compiles, and the tests mostly run. Even better, there is a fairly substantial set of tests. So this is a good situation to start in.

My other first milestone is to fix a bug. I've even identified the bug (well, actually, somebody else picked the bug, which is even better, because they picked a bug which matters to a particular user, which is a good way to choose a bug).

So now I just have to learn the code, fix the bug, run the tests, and submit the fix to the various branches of the SCM system.

Of course, that's easier said than done. And the idea behind this post was to talk about how I go about learning a new body of code.

One thing you can do is to sit down with your editor and start reading the code. This is fine, and in fact I'm actually a pretty good code-reader. I've been reading code for 30 years, and in general I think I'm a fairly effective and rapid code reader.

But that's not my preferred way to learn code. The best way to learn code, I've found, is to step through it in the debugger. There are a bunch of advantages to this approach, such as being able to look at the values of the variables while the code is actually running, but the most important reason for using a debugger to step through code while you're learning it is that the debugger will help you figure out:
  • who calls this code
  • when do they call it
  • why do they call it? (that is, what does the caller do with the results?)
These sorts of questions are terribly hard to figure out when you are just statically reading code in your editor, particularly in an object-oriented language where overridden and overloaded methods, interface-implementation abstraction layers, and the like make it terribly hard to look at a method call in your editor and figure out where that call is actually going to take you.
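
Here is a tiny sketch of the problem, with all class and method names invented for illustration. The call in `render` gives no hint in the editor about which `describe()` will run, while the run-time stack (which is what the debugger shows you) answers "who calls this code" directly.

```java
// Sketch: dynamic dispatch hides the real call target from the call site,
// and the run-time stack answers "who calls this code?".
// All names here are invented for illustration.
class WhoCallsMe {

    interface Shape {
        String describe();
    }

    static class Circle implements Shape {
        public String describe() {
            return "circle, called from " + callerOf();
        }
    }

    static class Square implements Shape {
        public String describe() {
            return "square, called from " + callerOf();
        }
    }

    // Name of the method that called the method that called callerOf().
    static String callerOf() {
        StackTraceElement[] stack = new Throwable().getStackTrace();
        // stack[0] is callerOf, stack[1] is describe, stack[2] is describe's caller.
        return stack[2].getMethodName();
    }

    // Reading this line in an editor does not reveal which describe() runs.
    static String render(Shape s) {
        return s.describe();
    }

    public static void main(String[] args) {
        System.out.println(render(new Circle()));
        System.out.println(render(new Square()));
    }
}
```

In a debugger you get the same information for free: set a breakpoint inside `describe()` and read the call stack.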

So, my suggestion is: if you are trying to learn a new body of code:
  • set up your debugger
  • construct an interesting test case
  • and start stepping through code!

Wednesday, August 19, 2009

Google Summer of Code 2009 draws to a close

We've reached the end of the 2009 Google Summer of Code.

This was my first year being involved as a mentor, and it was quite enjoyable. My intern, Eranda Sooriyabandara from Sri Lanka, worked on a number of Derby projects, and completed them. Together we found and fixed a number of Derby bugs, and also made substantial progress on the great multi-year project to convert the Derby test base into JUnit assertion-based tests.

I think that Eranda was able to learn a lot about Derby, and about database systems in general, and was also exposed to a body of software that was significantly more complex and involved than he had worked with in the past. I think that Eranda also had the opportunity to meet and work with a variety of the other people in the Derby community, and had the opportunity to learn a lot about how the open source software process operates.

As always, we hope that the GSoC interns who worked with Derby will continue to stay active in the Derby community in the future. This doesn't always happen, but when it does, it's very rewarding. Even if they don't, they may get involved in some other open source project in the future, building upon their experiences in the Derby community.

It was a good experience for me; I hope I have the opportunity to be involved in the Google Summer of Code again in the future.

Thursday, August 13, 2009

Game programming and the GPU

Via Wes Felter's blog I found this fascinating presentation on graphics programming futures.

I've never been a graphics programmer; anybody who's seen my UIs knows that this is not one of my strong points. I can open windows, show buttons and menus, do the basic stuff, and I have a particular fondness for UIs that do mapping, but I've never even begun to write true 3D graphics code (though I have several books on my shelf that I really will read one of these days...)

So it's hard for me to know what to learn from this presentation, but it seemed like the essence of the presentation was:
  • Low-level GPU programming was necessary when general purpose CPUs were insufficiently powerful.
  • But modern multi-core CPUs are providing vast amounts of surplus general-purpose computing power.
  • And in principle it should be easier to program for the general-purpose CPU than for the special GPU hardware, which is important because programmer productivity is the limiting factor here.
  • So perhaps game developers can reasonably consider whether the near-term future will spell an end to custom video board GPU programming and a return to pure software rendering.
  • With at least one BIG question: can the memory subsystems in these modern multi-core machines provide both the bandwidth and the cache coherency needed to write such software rendering libraries?
I don't know what to make of any of this, but it certainly is a fascinating presentation!

Wednesday, August 12, 2009

USCF drama

Wow! High Drama in the USCF.

I hadn't been following this story at all; apparently it's been going on for some time.

It's been a while (10+ years, sadly) since I seriously followed chess, but I definitely remember the Polgar sisters, and when they first took the chess world by storm. As I recall, Judit Polgar was the most talented of the sisters, but they were all very strong players.

I was thinking about chess recently because I've been playing chess with my son when I see him, which isn't as often as I'd like. My son was asking about chess clubs and ways to play chess, and I remembered that at the time, there was something called the Internet Chess Club.

It seems like the Internet Chess Club is still around, but they are no longer free. Is that right?

What I remembered was a place where you could sign on, watch other people's games in real-time, and occasionally play a game yourself if you wanted to. Does such a place still exist? I see that there is something called the Free Internet Chess Server; is that worth investigating?

I don't have the time, unfortunately, to get seriously back into chess, but if I were looking for a hobbyist's web site, where I could follow chess news, view chess games (both stored ones and real-time games), and occasionally play a game of my own, where would I go?

Time-based partitioning

Here's a nice short article from Josh Berkus on partitioning your data by time.

We do a lot of this at work, and spend a lot of time discussing how to do it well.

I think Josh hits all the major points:
  • Choose your partition size to work nicely with your retention period, so that reclaiming data becomes simply dropping tables.
  • Avoid having too many tables, because query processing will slow down dramatically.
  • Beware of the overhead and complexity of dynamically creating and dropping tables. It is expensive, it is low-concurrency, it requires high privileges in the database, and it requires that you have a good scheme for naming the tables to avoid confusion.
  • Be alert for opportunities to have time-varying detail, since older data generally can be retained with less instant-by-instant detail, but can instead be aggregated into larger time units. Of course, this aggregation is expensive and complicated, too.
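
The first point can be sketched in a few lines. This assumes one table per day and a hypothetical `events_yyyyMMdd` naming scheme; neither comes from Josh's article, they are just a convenient way to show how retention turns into a list of DROP TABLE statements.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Sketch of daily partitions with a retention window. The "events_" prefix
// and the naming scheme are hypothetical.
class TimePartitions {

    private static final DateTimeFormatter SUFFIX = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Name of the table that holds rows for the given day.
    static String tableFor(LocalDate day) {
        return "events_" + day.format(SUFFIX);
    }

    // Partitions that have aged out: every existing day strictly older than
    // (today - retentionDays). Reclaiming each one is a single DROP TABLE.
    static List<String> dropStatements(List<LocalDate> existingDays,
                                       LocalDate today, int retentionDays) {
        LocalDate cutoff = today.minusDays(retentionDays);
        List<String> drops = new ArrayList<>();
        for (LocalDate day : existingDays) {
            if (day.isBefore(cutoff)) {
                drops.add("DROP TABLE " + tableFor(day));
            }
        }
        return drops;
    }

    public static void main(String[] args) {
        List<LocalDate> days = new ArrayList<>();
        days.add(LocalDate.of(2009, 8, 1));
        days.add(LocalDate.of(2009, 8, 5));
        days.add(LocalDate.of(2009, 8, 12));
        dropStatements(days, LocalDate.of(2009, 8, 12), 7).forEach(System.out::println);
    }
}
```

Because the partition size (one day) divides the retention period evenly, no row-by-row DELETE is ever needed.
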
An interesting point that Josh doesn't discuss too much is the question of how to design the primary and alternate keys (and their corresponding indexes) for the data. It is tempting to make the timestamp field be the primary key, and to use that for a so-called "clustered" index on those database systems which support it. However:
  • You need to have a way to ensure that your timestamp is unique, which may not match your application semantics depending on what sort of data you have, and
  • Not all database implementations recognize the special case of "ever-increasing primary key insertions"; if your particular database doesn't recognize this, you can encounter a well-known physical storage problem where each leaf page in your clustered index is exactly half full, meaning that your database is twice as big as it should be.
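
The half-full-pages effect is easy to reproduce with a toy model. The sketch below is not any real database's B-tree: it is just a list of fixed-capacity leaf pages with a naive 50/50 split, which is the assumption that produces the problem when keys arrive in ever-increasing order.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of index leaf pages with a naive 50/50 split. Not a real B-tree;
// it only shows what ascending inserts do to page fill.
class LeafSplitSketch {

    // Inserts keys one by one into leaf pages of the given capacity (>= 2).
    static List<List<Integer>> insertAll(int[] keys, int capacity) {
        List<List<Integer>> pages = new ArrayList<>();
        pages.add(new ArrayList<>());
        for (int key : keys) {
            // Target page: the first page whose largest key is >= key,
            // or the last page if there is none.
            int p = 0;
            while (p < pages.size() - 1
                    && pages.get(p).get(pages.get(p).size() - 1) < key) {
                p++;
            }
            List<Integer> page = pages.get(p);
            if (page.size() == capacity) {
                // Naive split: the upper half moves to a new right-hand page.
                List<Integer> right = new ArrayList<>(page.subList(capacity / 2, capacity));
                page.subList(capacity / 2, capacity).clear();
                pages.add(p + 1, right);
                if (key > page.get(page.size() - 1)) {
                    page = right;
                }
            }
            // Keep each page sorted.
            int i = 0;
            while (i < page.size() && page.get(i) < key) {
                i++;
            }
            page.add(i, key);
        }
        return pages;
    }

    public static void main(String[] args) {
        int[] keys = new int[100];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = i + 1; // ever-increasing keys, like timestamps
        }
        List<List<Integer>> pages = insertAll(keys, 10);
        for (List<Integer> page : pages) {
            System.out.println(page.size() + " of 10 slots used");
        }
    }
}
```

With ascending keys, every page left behind by a split never receives another insert, so it stays at half capacity forever; only the rightmost page keeps filling.
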
All in all, a great summary by Josh, who is a blogger worth reading in the database world (he's been involved in Postgres for many years).

Monday, August 10, 2009


I've always been a fan of software tools that stick to their knitting.

I've been using JSwat as my Java debugging tool for close to a decade. It works and I'm comfortable with it, and I don't see the need to change. There are plenty of other Java debugging tools available, so you should use whichever one you're most comfortable with.

But if you decide you want to try JSwat, here's a capsule summary:

  1. Install it from the project home page.
  2. When you start any Java program that you wish to debug, pass these flags on the command line:

    -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=5000

  3. In JSwat, use the "Session -> Settings" dialog to add your various source code paths to the debugger. For example, if you are working with Derby code and you want to debug it, add the trunk/java/engine folder. You should only have to do this once and then JSwat remembers it.

  4. Then use "Session -> Attach" to attach to your Java program to debug it. You will have to choose port 5000 in the attach dialog.

Then you should be able to set breakpoints, view the values of variables, etc.
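
If the attach in step 4 fails, one thing worth checking is whether the flags from step 2 actually reached the JVM. The sketch below (class and method names are my own, purely illustrative) asks the running JVM for its startup arguments and looks for a JDWP agent, in either the `-Xrunjdwp:` form shown above or the `-agentlib:jdwp` form that newer JVMs prefer.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Sketch: report whether this JVM was started with a JDWP debug agent.
// Class and method names are illustrative only.
class DebugFlagCheck {

    // True if any JVM argument requests JDWP, in either the older -Xrunjdwp
    // form or the -agentlib:jdwp form.
    static boolean hasJdwpAgent(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith("-Xrunjdwp:") || arg.startsWith("-agentlib:jdwp")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // getInputArguments() lists the options passed to the JVM itself,
        // not the arguments passed to main().
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JDWP agent requested: " + hasJdwpAgent(jvmArgs));
    }
}
```

Note that with suspend=y the program will not get as far as printing anything until a debugger has attached.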

It takes a little while to learn how to use the debugger, so it's good to practice.

And read the debugger's built-in help files, which are quite helpful and clear.

Thursday, August 6, 2009

C++ Templates and the Concepts feature

It's now been 10 years since I was a full-time C++ programmer, so I've been rather disconnected from the C++ world.

But I was recently drawn back in by all the excitement over the removal of the concepts feature from the upcoming new C++ standard.

When I was using C++ intensively, templates and the STL were just being invented, and while I thought they were fascinating, the compilers that we had access to at the time didn't support those features, so we didn't use them.

The feature of Java which is most similar to the templates feature of C++ is called generics, and has been part of Java for many years, introduced in JDK 1.5. I've used generics a fair amount, and whined about them somewhat, but overall I think that generics are a strong feature in Java.

C++ has always been a fascinating language. It is incredibly powerful, but it is also extremely hard to learn, and it is tricky to use it well. Languages like Java and C# seem to have been able to learn from C++, and to provide tremendous power with substantially less complexity.

However, I think it's good for the programming profession as a whole to have C++ out there exploring the nether regions, so we mere mortals can come along and learn from it. :)

Wednesday, August 5, 2009

Does Firefox 3.5 portend the end of heavy plug-ins?

For years, the browser experience has been built atop various substantial plug-ins, such as Adobe's Flash, Microsoft's Silverlight, or Sun's Java.

But as I learn more about Firefox 3.5, I'm wondering if this is finally the release where the venerable plug-in architecture starts to fade, as developers are finally able to build complete browser-based applications using solely the native browser features.

Consider these Firefox 3.5 (and earlier) features as a package:
  • WebWorkers
  • Canvas tag
  • XMLHttpRequest security
  • Audio and Video tags
  • The Storage API (SQLite)
I'm sure I've missed some features, but the point is that these features, as a package, address some large proportion of the reasons that developers had for developing "substantial" browser-side applications using a plug-in such as Flash/Silverlight/Java instead of in the base browser.

I think it's very exciting to see that Firefox 3.5 is now sufficiently capable that it's reasonable to consider building a serious complete application hosted directly in the base browser, without all the complexity of having to exit to an external plug-in.