Journal of a Programmer: May 2009

Saturday, May 30, 2009

JDK 1.5 adoption rates

This month, we passed the 5 year anniversary of JDK 1.5.

Java has had seven major releases in its 15 year history, and an 8th is soon to arrive (will there be significant JDK 1.7 news at Java One? There's been quite a lot of recent activity.). Here's a nice page that summarizes things. For much of this century, Java releases arrived reliably every two years, and for much of Java's lifetime, adoption of the new releases was quite rapid.

However, that adoption rate has slowed, and recent Java releases are taking much longer to achieve high penetration rates. In my experience, it's quite common to find that JDK 1.4 is still the deployed release in a certain environment.

I find this rather surprising, as 1.5 and 1.6 offer substantial benefits. In my own development work, I try to use Java 1.6 as much as possible, not only for its higher performance, but in large part because of the great work on supportability and diagnosis. Many of the problems that I study at work only arise in situations where I can't pre-configure a specially-enabled JVM, so the ability to attach to an existing JVM and get heap dumps, thread dumps, performance information, etc. is extremely valuable.

But there's no denying that JDK 1.4 remains highly common in deployment. I'm not the only one to notice this, for example see the commenters in this article. So why is this? I think that the contributing factors include at least the following:

In the enterprise software world, where I spend much of my time, application servers drive adoption. These are systems such as WebSphere, Weblogic, JBoss, Netweaver, Oracle AppSvr, etc. The vendors that provide these systems tend to be conservative and upgrade their base JDK level slowly, and the users of these systems tend to be even more conservative and upgrade their installed applications slowly. So even though IBM has introduced a new WebSphere version which supports JDK 1.6, most IBM customers are still in the process of upgrading from WebSphere version 5 (which supported JDK 1.3) to WebSphere version 6.0/6.1 (which support JDK 1.4 and JDK 1.5)
JDK 1.4 was "good enough" for many things. Although there continue to be major improvements in Java, JDK 1.4 was a very impressive release, and for an enormous class of applications, it has everything you need. So since it exists, and it's stable and reliable and widely available, it has seen wide and stable adoption, with little pressure to upgrade.
Slow delivery of 1.5 and 1.6 on Mac OS X, Linux, etc. Java releases on platforms other than Windows and Solaris have lagged, sometimes substantially. It will be quite interesting to see if the open source treatment of Java 7 changes this platform coverage experience, but in the past it was certainly true that platforms were rolled out unevenly. For example, I believe there is still no version of Java FX for Linux, well more than a year after it was released for Windows and Solaris.
Generics are intimidating. Moving past JDK 1.5 means dealing with the world of Generics, which, although extremely powerful, mean a lot of new syntax, behavior, error messages, and the like. I think Generics are a very good feature, but the "in your face" nature of generics means that many will prefer to procrastinate and stay with JDK 1.4.
Java update isn't pushy enough. Sun generally doesn't force Java updates on their users. They have been very good about sustaining older releases, and have not pushed very hard on their user base to upgrade.

So, given this, what should Java programmers do? The low-end solution is to stick with JDK 1.4 and avoid use of JDK 1.5 and above features. But this is a fairly severe restriction, so most people have moved to a more sophisticated implementation. In Derby, for example, there is infrastructure which enables Derby to run in JDK 1.4 environments with full functionality. But in JDK 1.5 and 1.6 configurations, Derby dynamically detects the newer JDK version and dynamically loads alternate implementations of various functionality. I'll try to provide a more complete description of how this works in a separate post.

So just because the world is stuck at JDK 1.4, don't feel like you have to stay there.

Thursday, May 28, 2009

Google I/O is rocking!

Wow! Google I/O looks like it is a really hot conference! I had sort-of thought that developer conferences had fallen by the wayside, but Google appears to have brought them back to life with some amazing content, including:

Lots of new information about Google Chrome.
Switching projects hosted at Google Code from Subversion to Mercurial.
Google App Engine for Java is now ready for real use.
Something called Google Web Elements.
Something called Google Wave.
A new Java library for Google visualization.

I'm sure there's a lot more even than this; this is just what I noticed in a quick skim over the Internet...

This is rather overwhelming and intimidating! I guess I know what I'll be reading about for the next few days.

Java 6 Update 14

Java 6 update 14 has apparently been released.

This is a very substantial patch, with a lot of fixes and changes.

Among the things that interested me in this list:

6u14 contains the 10.4.2.1 version of Java DB. Note that this version number is very similar, but not identical, to a version number you might see on an Apache Derby release. But, there is no 10.4.2.1 release of Apache Derby. This was the subject of a long and complex discussion in the Derby community back in February. The simple summary is that 10.4.2.1 is a JavaDB release which contains some additional fixes beyond the 10.4.2.0 release available from the Apache web site, and the modifies 4th digit in the release number helps us track that.
6u14 contains a substantially-updated version of the VisualVM tool for Java monitoring and profiling.
6u14 contains a number of new HotSpot changes, most significantly a version of the much-anticipated G1 garbage collector.

Although I'm constrained by circumstances to do much of my work with JDK 1.4, I try to use JDK 1.6 whenever possible in my development environment, because it has so many fixes and improvements.

I'll try to get this update soon.

Implementing the SQL Standard(s)

In the Derby project, we work very hard to adhere to the SQL Standard.

Faithfully implementing the SQL Standard makes it easy for users to use Derby for applications that have been developed on other DBMS implementations, and also makes it easy for users to develop applications on Derby, then move them to other systems. Both of these are good uses of Derby, and I'm pleased to see them.

Furthermore, implementing the SQL Standard is an effective way to build a high-quality piece of software, for the SQL Standard is the result of decades of work by thousands of very intelligent and hard-working people, who have tried to describe the SQL language as accurately and thoroughly as they can.

That is, in general, the standard specifies reasonable and desirable behavior.

However, there is a major frustration in attempting to implement the SQL standard: it's not freely available. In order to access the official SQL standard, in order to contribute to the development of the official SQL standard, in order to participate in direct discussions with the team of people who maintain and enhance the official SQL standard, you have to have access to the ANSI/ISO world, which is a membership group which costs a substantial amount of money to join. As far as I know, the Apache Software Foundation does not have access to the SQL Standards process, which is a shame.

This leads to problems in:

access to the official text of the standard
access to communities for discussions involving interpretation of the standard
access to tools and test beds for verification of the standard

It would sure be nice if the SQL standard were openly and freely available to all.

Tuesday, May 26, 2009

Some high-level thoughts on Java Performance Tuning

So you think you want to do some performance tuning. Here's some high-level thoughts on the subject.

First, construct a benchmark. This will have at least these benefits:

It will force you to get specific about the software under test. For any significant and complex piece of software, there are a forest of inter-related variables: which features are you interested in tuning, what configuration will be used to run them, what workload will be applied, what duration of test, etc.
It will give you some real running software to measure, tune, adjust, etc. Without real running software, you won't be able to run experiments, and without experiments, you won't know if you're making progress or not.

OK, so now you have a benchmark. The next step is to run it! You need to run it, in order to start getting some data points about how it performs. You also need to run it, in order to start developing some theories about how it is behaving. Ideally, save all your benchmark results, recording as much information as you can about the configurations that were used, the software that was tested, and the results of the benchmark.

While the benchmark is running, watch it run. At this point, you're just trying to get an overall sense of how the benchmark runs. In particular, pay attention to the four major components of overall system performance:

CPU cycles
Disk activity
Memory usage
Network activity

You may, or may not, be able to see that your benchmark is constrained by one of these four resources. For example, if your machine immediately goes 100% CPU busy at the start of your benchmark, and stays there until the end, you are probably CPU-constrained. However, try not to jump to such a conclusion. Perhaps you didn't give the system enough memory? Or perhaps your benchmark is using too small of a configuration, and so everything fit tidily into memory, while in real-life your usage never experiences such a small problem, so real usage is never CPU-bound, but is always hitting the disk. Try to keep an open mind, and get a feel for whether your configuration is reasonable and your workload is realistic, and only then come to a conclusion about whether you are CPU, Disk, Memory, or Network-bound.

Perhaps, while running your benchmark, you didn't seem to be bound by any resource? Your CPU was never fully busy, you always had spare memory, your disks and network were never very active? Here are two other possibilities that you don't want to overlook at this stage:

Perhaps your benchmark just isn't giving the system enough work to do? Many systems are designed to handle lots of concurrently-overlapping work, so make sure that you're giving your benchmark a large enough workload so that it is busy. Try adding more work (more clients, or more requests, or larger tasks...) and see if throughput goes up, indicating that there was actually spare capacity available.
Perhaps your benchmark is fine, but your system has some internal contention or concurrency bottlenecks which is preventing it from using all the resources you're giving it. If it seems like you're not using all your resources, but adding more work doesn't help, then you should start to think about whether some critical resource is experiencing contention.

OK, so you have a solid benchmark, you've run it a number of times and started to develop a performance history, and you have some initial theories about where your bottlenecks might reside. Now let's start talking about Java-specific stuff.

If you're going to be this serious about Java performance tuning, you should be reading, because there's 15 years of smart people thinking about this and you can (and should) benefit from all their hard work. Here's some places to start:

Read every issue in the archive of Heinz Kabutz's wonderful newsletter.
Read Brian Goetz's archived series in IBM Developer Works.
Spend some time on Jack Shirazi and Kirk Pepperdine's web site: javaperformancetuning.com

OK, back to coding and tuning. We're almost to the point where we have to start talking about serious tools, but first, here's a couple simple ideas that will get you a surprisingly long way with your performance studies:

First, while you are running your benchmark, trying pressing Control-Break a few times. If you are running on a Linux or Solaris system, try sending signal QUIT (kill -QUIT my-pid) to your java process. This will tell your JVM (at least, if it's the Sun JVM) to dump a stack trace of the current location of all running threads in the virtual machine. Do this a half-dozen times, spaced randomly through your benchmark run. Then read through the stack traces, and consider what they are telling you: are you surprised about which threads are doing what? Did you see any enormously deep or intricate stack traces? Did you see a lot of threads waiting for monitor entry on some master synchronization object? I call this the "poor man's profiler", and it's amazing what it can tell you. It's particularly nice because you don't have to set anything up ahead of time; you can just walk up to a misbehaving system, toss a couple stack trace requests into it, and see if anything jumps out at you in the output.

The next step is to instrument your program, at a fairly high level, using the built-in Java clock: System.currentTimeMillis. This handy little routine is quite inexpensive to call, and for many purposes can give you a very good first guess at where problems may be lurking. Just write some code like this:


long myStartTime = System.currentTimeMillis();
int numLoops = 0;

for (some-kinda-interesting-loop)
{
 .. lotsa code that might be revealing here ...
 numLoops++;
}

long myElapsedTime = System.currentTimeMillis() - myStartTime;

System.out.println("My loop ran " + numLoops +
 " times, and took " + myElapsedTime + " milliseconds.");

Usually it doesn't take too long to sprinkle some calls to currentTimeMillis around, and get some first-level reporting on where the time is going.

One thing to keep in mind when using this technique is that the currentTimeMillis clock ticks rather irregularly: it tends to "jump" by 15-16 milliseconds at a time. So it's hard to measure things more accurately than about 1/60th of a second. Starting with JDK 1.5, there is a System.nanoTime call which is much more precise, which gets around this problem.

If you've gotten this far, you should have:

A reliable benchmark, with a history of results
Some high-level numbers and theories about where the time is going
Some initial ideas about whether your system is CPU-bound, Disk-bound, memory-bound, etc.

This by itself may be enough that you can make an experimental change, to see if you can improve things (aren't you glad you kept those historical numbers, to compare with?).

But if you can't yet figure out what to change, but you at least know what code concerns you, it's time to put a serious profiler to work. You can get profilers from a number of places: there are commercial tools, such as JProfiler, or tools built into Eclipse and NetBeans, or standalone tools like VisualVM. At any rate, find one and get it installed.

When using a profiler, there are essentially 3 things you can do:

You can look at elapsed time
You can look at method invocation counts
You can look at memory allocation behaviors

Each of them has an appeal, but each also has pitfalls to beware of. Method/line invocation counts are great for figuring out how often code is called, and for finding deeply-nested methods that are being run a lot more than you anticipated, which can lead to some very insightful realizations about how your code is working. But mindlessly trying to reduce invocation counts can lead to trying to optimize away code that in reality is very efficient; furthermore, you may be simply repeating the work that the JIT is doing automatically on your behalf for free.

Analyzing memory allocation is also a great way for finding code which is run a lot, and removing unnecessary and wasteful memory allocation can be a good way to speed up Java code. However, modern garbage collectors are just spectacularly, phenomenally efficient, and in recent years I've often found the cutting back on the generation of garbage by itself often doesn't really help much. However, if the generation of garbage is a clue (a "code smell", as they say) which leads you to locate a bunch of unnecessary and wasteful work that can be eliminated, then it's a good way to quickly locate a problem.

Lastly, there is the profiler's elapsed-time analysis. Sometimes, this can work really well, and the profiler can show you exactly where your code is spending its time. However, in a complicated server process, with lots of background threads, I/O operations, etc., figuring out how to interpret the elapsed-time graphs can be very tricky. I've often thought that the profiler was pointing a giant Flying Fickle Finger of Fate directly at some section of my code, worked industriously to isolate and improve that code, only to find that, overall, it didn't make any difference in the actual benchmark results.

Saturday, May 23, 2009

The prevalence of caching

I've been thinking about caching this week.

At work, I'm neck deep in a complex re-implementation of an intricate caching algorithm, and I realized the other day that I've been fascinated by cache implementations for almost 25 years.

My first exposure to caches was at Computer Corporation of America in Boston in the mid 80's. The team was extending the Model 204 cache algorithms as part of the new BTree implementation that was being built (previously, Model 204 had supported only entry-order and hash access methods). I learned that caches are intricate and delicate, and have to be used properly.

At Ingres, in the early 90's, I was deeply involved in several projects in the area of the cache implementation. In one project, we were re-designing the recovery system from a page-oriented physical recovery system to an ARIES-style logical recovery system, which gave me a deep appreciation for the inter-connectedness of the page cache and the recovery system. In another project we were providing shared-memory multi-processor support. Since the page cache was going to be located in shared memory, we converted it to position-independent data structures. Later, I spent some time thinking about how to handle a variety of page sizes in the page cache in an efficient manner, without excessive memory fragmentation, and while being simultaneously efficient at caching all pages sizes in the range, but didn't get to the point of building working code.

At Sybase, I worked with the Store team on implementing a page cache for the ISS system; this was my first exposure to C++ and to object-oriented methods. We tried to build a flexible cache that would be extensible and re-usable as the code evolved. We implemented a working prototype but never reached production code. However, many of the ideas behind this system traveled with Siuling, Mike, and Nat to Cloudscape (and thus to Derby).

For most of this decade, I've been extending, enhancing, and maintaining a complex Object-Relational Mapping library at work. If you haven't seen an ORM before, Hibernate is a good example. Our system isn't as sophisticated as Hibernate, but it's powerful and intricate and carefully tailored to our needs.

In particular, our ORM library provides an object cache, and does its best to return cached items when possible. As with any cache, the primary issues are:

Using resources (mainly, memory) efficiently
Managing shared access to cached objects
Ensuring cache coherency

A lot of work has gone into each of these general areas. When it comes to using resources efficiently, the most important topic involves the replacement policy; that is, what do you do when the cache is full? Here, you can find a rich research literature. I was lucky enough to take a class from Professor Betty O'Neil at UMB, and think that the LRU-K algorithm is among the best available.

Testing a caching algorithm is also a tricky task, as there are several hidden traps:

The cache may malfunction, and return stale data rather than the most up-to-date data. Only a very carefully written test will be sensitive enough to catch this.
The cache may also malfunction, and return dirty (not-yet-committed) data. Again, the test has to be very aware to be able to detect this behavior.

The cache may also be returning correct, accurate data, but may still have a variety of performance problems:

The cache may be failing to cache data, and may be needlessly re-computing data from the underlying data source.
The cache may be caching data effectively, but may be preferring one sort of workload over another. For example, it may be working well for a workload of mostly read queries, but may not function well when updates are being performed. Or vice versa.
The cache may be violating some of its configuration. For example, it may not be implementing its replacement policy as designed. Or it may be exceeding the resource limits that have been assigned to it.
The cache may be doing all its processing correctly, but it still may be using too many resources, for example it may be experiencing high contention on thread synchronization, or it may be using inefficient algorithms for maintaining its data structures.

It's the last problem that I've been chasing this month.

Cache coherency is an complex problem, and this is the area where I've been spending a lot of time recently. In our system, one of the problems that is most challenging involves the situation where a transaction modifies an object, and that object is referenced by some other object that is resident in the cache. We have some complex graph traversal algorithms which trace the references and locate stale data which must be refreshed. In many ways it's similar to the type of code you see in garbage collection algorithms, although our problem is much simpler than the full GC problem.

In my particular case, the issue I've been struggling with is how to verify whether a modified data value happens to be currently referenced by an object which is referenceable from some object currently resident in the cache. That is, if object D is modified, and object A points to object B points to object C points to object D, and object A is currently resident in the cache, then I need to know this fact, because the modification of object D is relevant to object A. The problem is that, since objects are complex and have lots of inter-connectedness, there can be a variety of "pointer paths" that could possibly connect from A to D. The goal is to find any such object A instances as cheaply as possible, avoiding simple-to-implement-but-unusable-in-practice algorithms such viewing every object that is reachable from every cached object to see if object D is in that set. The implementation that we use is based on 2 building blocks:

We analyze the object schema to construct the possible reference paths of interest
We partition the cache by type, so that we can efficiently locate the cached instances of a desired type.

Then we can examine just the necessary cached instances, and traverse only the possible paths.

Friday, May 22, 2009

A small request for compiler writers

If you're writing a compiler, and you happen to issue an error message like:


 [javac] C:\src\xxx.java:230: ')' expected
 [javac]     }
 [javac]     ^

It would be really nice if you could take the time to make that error message also say something like

Unmatched left parenthesis was seen on line 184

That way I wouldn't have to fiddle around using hacky techniques like commenting out half of my program to try to figure out where I accidentally left out the parenthesis.

It seems that modern compilers do usually issue quite precise messages; maybe this particular one just fell through some crack in the parse techniques where the compiler just had no idea where it was that the parenthesis matching got out of sync.

Oh well, I eventually found it. Just wish I could have that 15 minutes of my life back.

Wednesday, May 20, 2009

Silly mistake with JUnit assertNotNull

Every so often, I make a fairly simple mistake using the JUnit Assert class:


SomeType myObject = SomeTypeFactory.createInstance();
assertNotNull("unexpectedly had a null myObject");
assertTrue("unexpected return from callSomething",
  myObject.callSomething());
assertEquals("unexpected result of callAnother",
  4, myObject.callAnother());

Do you see the error? Maybe it helps if I point you at the doc?

Well, it's a pretty simple error. I forgot to pass the object that I wanted to check into the assertNotNull method, so I unexpectedly called the other overload of assertNotNull; that is, the overload that just takes a single object. (My message string, after all, is an Object.)

And so my call to assertNotNull is checking to see if the message string is not null, which of course it always is not null, so that assertion never does any good.

And then one day I'm surprised by getting a null pointer exception on the next line (the call to myObject.callSomething) because the factory method failed and returned a null, and I think to myself: "Hmmm, that's weird, why didn't the assertNotNull catch that?"

And then I kick myself. Again. For making that same simple mistake another time.

Monday, May 18, 2009

TimerTask exception handling

We were running an enormous distributed multi-machine performance test, stressing the system heavily, and it wasn't working very well. Deep into the test, something was going wrong. Crawling through the logs, we discovered a strange error:

java.lang.IllegalStateException: Timer already cancelled

I was mystified: what did this mean?

Roger, my co-worker, however, had seen this message before, and knew: "it means that Java ran out of memory." Sure enough, we increased the -Xmx setting for the JVM, and the message disappeared.

I was happy that we resolved the problem, but unhappy with the manner of the resolution: why hadn't an out-of-memory problem displayed the symptom I expected, namely: java.lang.OutOfMemoryError?

So, with a few more hints from Roger, I studied this some more, and uncovered this bit of wisdom from the JavaDoc for java.util.Timer:

If the timer's task execution thread terminates unexpectedly, for example, because its stop method is invoked, any further attempt to schedule a task on the timer will result in an IllegalStateException, as if the timer's cancel method had been invoked.

Looking in our code, I saw that we had implemented a sub-class of java.util.TimerTask, and our sub-class has a run() method which catches Exception, but does not catch Throwable. Therefore, I think it is possible that exceptions may be terminating our run(), resulting in the timer being marked cancelled as described above. In particular, since it extends Error, not Exception, OutOfMemoryError would be such an exception.

Furthermore, I saw that none of our calls to timer.schedule() catch or check for IllegalStateException (it's not a checked exception).

So it seems quite plausible that we could have the following sequence of activity:

The app server runs out of memory, and the thread that notices it is a TimerTask thread running our run method

OutOfMemoryError is thrown, is not caught, and terminates the run method

The timer is marked cancelled because run threw an exception

Subsequently, uses of that timer result in IllegalStateException

I talked this over with several others and it seemed like it was holding water. Googling, it seemed like I wasn't the only holder of this theory. So I proceeded to the next step: I wrote a test.

Sure enough, when I threw an exception that extended Error out of the run method, subsequent calls to the schedule method of the corresponding timer threw IllegalStateException exceptions with the message "Timer already cancelled".

Confirmed!

My current plan is as follows:

Catch Throwable, not just Exception in our run method.

Log any such exception to our error log, together with a full stack trace.

Don't allow exceptions to propagate out of our run method.

Check for IllegalStateException when calling timer.schedule, and wrap any such exception that we get with a wrapper exception indicating that the JVM might have run out of memory before throwing it

And to hold it all together, I expanded my test case to contain three tests:

A test which throws an Error-based exception in the TimerTask, to prove that the catch block now catches it.
A test which tries to call schedule on a Timer after an Error-based exception has been thrown and caught in the TimerTask, to prove that the Timer can still be used.
A test which tries to call schedule on a Timer after cancel of the Timer, to prove that the new chained exception is thrown properly and that its text clearly indicates a possible out-of-memory condition.

Hopefully, this new caution surrounding TimerTask exceptions will not only prevent our timers from getting into this damaged state, but will also improve the error messages to the log when resources start to get low.

And, there's now three more tests in the test suite!

Build Farm Part 3: The core query

Returning to the topic of the Build Farm, recall that the core mission of the Build Farm is to determine the best job that a machine can run. We managed to distill that mission into a single query; this query is executed when a machine asks for a job to do, and the point of the query is to fetch the "best" available job. There may be no jobs currently available to run, or there may be several that the machine could run, and we want to find the "best".

Here's the query:


select wq.id, p.name as project_name, p.depot_root, p.label_name, 
       i.sync_at_start, wq.label_override,
       i.operation, wq.item_id, i.max_duration,
       wq.detail,i.status_token,i.working_dir,wq.group_inst_id,
       i.failure_token,i.min_reqd_tests
from apt_work_queue wq join apt_item_desc i on wq.item_id = i.id
     left outer join apt_projects p on wq.project_id = p.id
where wq.started_time is null and
     wq.item_id not in ( select distinct item_id from apt_item_cap_map
                          where capability_id not in
                                 (select capability_id from apt_machine_desc,
                                                            apt_machine_cap_map
                                   where apt_machine_desc.id =
                                                apt_machine_cap_map.machine_id and
                                         apt_machine_desc.name = 'MACHINE-NAME' ))
      and wq.project_id not in
                                 ( select distinct project_id from
                                                            apt_project_cap_map
                                   where capability_id not in
                                      (select capability_id from apt_machine_desc,
                                                                 apt_machine_cap_map
                                          where apt_machine_desc.id =
                                                 apt_machine_cap_map.machine_id and
                                          apt_machine_desc.name = 'MACHINE-NAME' ))
      and (wq.machine_id is null or 
                                 wq.machine_id=(select id from apt_machine_desc
                                          where name='MACHINE-NAME'))
order by wq.priority desc

Wow, that's a mouthful! Let's try to break it down into smaller parts.

Firstly, for understanding the query, it doesn't really matter what particular columns are retrieved, so let's just collapse the columns together, and call them "STUFF":


select STUFF 
from apt_work_queue wq join apt_item_desc i on wq.item_id = i.id
     left outer join apt_projects p on wq.project_id = p.id
where wq.started_time is null and
     wq.item_id not in ( select distinct item_id from apt_item_cap_map
                          where capability_id not in
                                 (select capability_id from apt_machine_desc,
                                                            apt_machine_cap_map
                                   where apt_machine_desc.id =
                                                apt_machine_cap_map.machine_id and
                                         apt_machine_desc.name = 'MACHINE-NAME' ))
      and wq.project_id not in ( select distinct project_id from
                                                            apt_project_cap_map
                                   where capability_id not in
                                      (select capability_id from apt_machine_desc,
                                                                 apt_machine_cap_map
                                          where apt_machine_desc.id =
                                                 apt_machine_cap_map.machine_id and
                                          apt_machine_desc.name = 'MACHINE-NAME' ))
      and (wq.machine_id is null or wq.machine_id=(select id from apt_machine_desc
                                          where name='MACHINE-NAME'))
order by wq.priority desc

Secondly, the condition wq.started_time is null means "job hasn't started yet", and the condition:


                            and (wq.machine_id is null or 
                                 wq.machine_id=(select id from apt_machine_desc
                                          where name='MACHINE-NAME'))

says "the job either doesn't care what machine it runs on, or it has specifically requested this machine". Let's abbreviate that to "and ok-to-run-on-this-machine".

Thirdly, there's a sub-query which occurs twice in this larger query. This query says: "get this machine's capabilities":


   select capability_id from apt_machine_desc, apt_machine_cap_map
   where apt_machine_desc.id = apt_machine_cap_map.machine_id and
         apt_machine_desc.name = 'MACHINE-NAME'

Each capability_id for a particular machine describes some bit of its configuration in a way which is important to the Build Farm scheduler. So for example, one our our build machines might have a capability for "OS=Windows", a capability for "Java=JDK 1.5", and a capability for "DBMS=Oracle 10g", meaning that this machine is running Windows, and has JDK 1.5 and Oracle 10g installed (and so can run jobs which require those resources). So we can replace this sub-query with "this machine's capabilities" and the query now looks like this:


select STUFF 
from apt_work_queue wq join apt_item_desc i on wq.item_id = i.id
     left outer join apt_projects p on wq.project_id = p.id
where job-hasn't-started-yet 
     and wq.item_id not in ( select distinct item_id from apt_item_cap_map
                          where capability_id not in (this-machine's-capabilities))
     and wq.project_id not in ( select distinct project_id from apt_project_cap_map
                          where capability_id not in (this-machine's-capabilities))
    and ok-to-run-on-this-machine
order by wq.priority desc

That's good, the query is starting to get a bit easier to read. Let's keep going.

The first remaining complex condition in the WHERE clause:


     and wq.item_id not in ( select distinct item_id from apt_item_cap_map
                          where capability_id not in (this-machine's-capabilities))

means "job is not one that requires a capability this machine doesn't have", and the other, similar, condition on the job's project means "job's project is not one that requires a capability this machine doesn't have."

So now we have:


select STUFF 
from apt_work_queue wq join apt_item_desc i on wq.item_id = i.id
     left outer join apt_projects p on wq.project_id = p.id
where job-hasn't-started-yet 
     and job-doesn't-require-a-capability-this-machine-doesn't-have 
     and job's-project-doesn't-require-a-capability-this-machine-doesn't-have
    and ok-to-run-on-this-machine
order by wq.priority desc

That's much better. Breaking down the "NOT IN ... NOT IN ..." double negatives into English hopefully makes them easier to understand, but they are still double negatives, unfortunately, which means you have to read the query forwards and backwards and let it bounce around in your head for a while.

Lastly, the ORDER BY clause arranges all the legitimate candidate jobs in order by priority, descending, so that the first row in the result is the best job for this machine to process.

Hopefully the query makes sense now!

Sunday, May 17, 2009

Group By Rollup

About two years ago, I did some investigation and prototyping of a new Derby feature: GROUP BY ROLLUP.

I got distracted by some other work, and time passed. But recently Knut noticed that I hadn't done anything on the project in a while, and encouraged me to return to it.

So I have.

In SQL, the GROUP BY statement instructs the DBMS to partition the data into separate groups based on the values of one or more columns; usually it is used with computations which evaluate a COUNT, SUM, AVG, or other function across the groups. So, for example, it is common to see:

SELECT D.NAME, COUNT(E.ID) FROM EMPLOYEE E, DEPARMENT D WHERE E.DEPT_ID = D.ID GROUP BY D.NAME

SELECT D.NAME, SUM(PROJ.BUDGET) FROM PROJECT PROJ, DEPARTMENT D WHERE PROJ.DEPT_ID = D.ID GROUP BY D.NAME

SELECT D.NAME, AVG(E.SALARY) AS AVG_SALARY FROM EMPLOYEE E, DEPARTMENT D WHERE E.DEPT_ID = D.ID GROUP BY D.NAME ORDER BY AVG_SALARY

The first statement finds the number of employees in each department; the second statement finds the total budget for each department's projects; the third statement finds the average salary for each department.

You can group by more than one column, so if you wanted to find the budget for each department, for each project type, for each month, you'd write something like:


  SELECT D.NAME, P.MONTH, P.TYPE, SUM(P.BUDGET)
         FROM DEPARTMENT D, PROJECT P
         WHERE D.ID = P.DEPT_ID
         GROUP BY D.NAME, P.MONTH, P.TYPE

This statement produces a detailed breakdown, but when producing a report, the user commonly wants to "roll up" these detailed groups so that they get an overall sum for:

All the project types for this department this month (GROUP BY D.NAME, P.MONTH), as well as

All the project types for the entire budget year for this department (GROUP BY D.NAME)

The GROUP BY ROLLUP feature is designed for precisely this purpose, as it allows the user to issue a single statement which produces groupings at all those levels of detail in a single result set:


  SELECT D.NAME, P.MONTH, P.TYPE, SUM(P.BUDGET)
         FROM DEPARTMENT D, PROJECT P
         WHERE D.ID = P.DEPT_ID
         GROUP BY ROLLUP ( D.NAME, P.MONTH, P.TYPE )

The prototype work seems to be still quite relevant, as that part of Derby hasn't changed substantially. Knut noticed a few things that need cleaning up, so I'll start with those.

I suspect that the major pieces to tidy this up and ready it for commit will be:

More testing. I wrote a couple trivial tests, but more are required.
Documentation. Some additional information is required to describe what the new feature is and how to use it.
Performance. The ROLLUP feature is essentially a shorthand for issuing N separate SQL statements, so as long as it doesn't perform any worse than the separate SQL statements, it's a useful feature because of its convenience. But in principle it should enable some optimizations which would cause it to perform substantially better in many cases, so I'll want to have a look at whether we can do any of those improvements.

If I can find a little bit of time, I should be able to accomplish this over the next few months, and see if I can get the project over the hurdle from prototype to operational.

Friday, May 15, 2009

Pruning File Servers

One of our main file servers filled up its disk last night.

This happens quite regularly to us. Our nightly build and test process produces something like 20-40 Gb of output each day, so we are always fighting a losing battle with disk space.

Modern file servers can easily hold several terabytes of disk space at a quite reasonable price, so we have about 5 terabytes of shared master file server space available.

But since we produce about a terabyte a month, that's only 4-5 months worth of data that we can keep online.

So, given that we can't keep everything online forever, how do we handle it?

Well, there's a few basic principles that we go by:

Don't clean until we need to, even to the point of allowing the disk to go 100% full. Modern file systems seem to handle this without going corrupt or doing other stupid things.
When cleaning, clean minimally and conservatively.
When cleaning, consider that not all files are alike. Some files are more important than others.

That is, we try to ensure that we keep more important data online in preference to less important data, and we try to ensure that we keep as much data online as possible.

At time in the past, we've used automatic disk pruning tools that I've written, and there are still a few places on our file servers where we do this. But writing and running automatic disk pruning tools is a very scary activity, because a bug in the tool could easily cause much more damage than I'm willing to risk. When writing those tools, I tried to be very careful:

Testing the tool extensively
Auditing the actions of the tool, and mailing myself a full report when the tool took action
Providing the tool with extensive rules so that it could decide which files to deleted based on age, size, type, name, and other characteristics
Controlling the tool so that it didn't just rummage over the entire hard disk, but instead was an "opt in" system, where we had to explicitly instruct the tool to access a certain area on the file system.
Never running the tool as a highly-privileged user.

Even so, I was always worried whenever the tool ran, nervous that it would do something terrible to the master file server.

Each time the file server fills up, there is a certain amount of inconvenience, and a certain amount of downtime, while we scramble to find and free up some space to keep running. And each time this happens, I think that I should return to my automated pruning tools, and get them back into operation, so that this part of the operations runs more smoothly.

But for the time being, this part of our production system remains a manual process.

Thursday, May 14, 2009

Derby's clean targets are quite complex

I was puzzled by a mystifying build error in one of my Derby clients.

The problem arose in the runmessagecheck target, which is an internal build tool that cross-checks the internationalized message files against the message codes in SQLState.java and MessageId.java.

In this particular Derby client, the cross-checks were failing.

I couldn't figure out anything different between this Derby client and my other Derby clients. It was, as I say, a mystifying build error.

Then, on a hunch, I noticed that I had CLASSPATH set when I ran the build, so I cleared CLASSPATH, and the problem went away. So I set CLASSPATH back again, and the problem came back.

OK, so I understood that the problem was due to my CLASSPATH, but I was still puzzled, because on my other machines, I can set CLASSPATH and I don't have this build error.

I struggled with the problem some more, until I finally looked in great detail at the CLASSPATH, and at the jars it was pointing to, and saw that the jars were ancient. On this particular machine, I hadn't built in a while, and so my jars were months old.

I rm'd the jars, and the problem went away.

Now I understood that the problem was:

the MessageBundleTest tool cross-checks the externalized message files against the SQLState and MessageId classes that it loads from the CLASSPATH
I had set CLASSPATH to point to some very old jars, so it was cross-checking some old binaries against the current up-to-date source, and failing the cross-check

But, there was still something I didn't understand: I had done "ant clobber". I have this mantra that whenever I re-visit an old Derby client, I can bring it up to date by:

"svn revert -R .". This removes any partly-made changes from the client.
"svn up". This gets the latest source code from the master repository.
"ant clobber". This removes all the build artifacts from the client. I always use "ant clobber", not "ant clean", since "ant clean" cleans much less than "ant clobber".
"ant all". This compiles the Derby code.
"ant buildjars". This packages the Derby code into jars.

So what was wrong with my mantra?

It turns out that the error is in the "ant clobber" step. I thought that "ant clobber" did a good job of cleaning the build artifacts from the client, but it doesn't. In particular, it doesn't remove the jars that I built. (It also doesn't remove junit test results, and perhaps some other things. But the jars is the part that was really messing me up.)

So my problem is that, since my CLASSPATH was pointing at my jars, and since "ant clobber" was not cleaning my jars, my "ant all" step was running the message check utility against the old version of the code. When I cleared my CLASSPATH, the message check utility checked the messages in the classes directory, and passed.

This behavior is documented in the build.xml file, where there is a comment saying that "ant clobber" does not remove the jars.

But while I was studying build.xml, I found the "cibuild" target, which does just what I want:


    cibuild - target suitable for continuous integration builds.
    Clobbers/cleans everything it can, then builds the jars
    and the javadoc.

 <target name="cibuild"
       depends="clobber,cleanjars,cleandocs,junit-clean,gump_all"/>

 <target name="gump_all" depends="all,buildjars,javadoc"/>

So now I know that, when I go to a relatively out-of-date client, and bring it up to date, I shouldn't just do "ant clobber; ant all; ant buildjars". Instead, to be thorough, I should do "ant cibuild".

I've been a Derby committer for 3+ years, and I just figured this out now!

Fuzzer farms

Dave Aitel made a great posting on his DailyDave mailing list the other day. You do read DailyDave regularly, don't you? Dave was in Shanghai for SyScan, which is a security conference with a somewhat unusual format, being held in 4 different locations serially over a period of several months.

At any rate, Dave was talking about some work done by Ben Nagy of COSEINC in the area of fuzzing, and specifically regarding fuzzing of Microsoft Word.

Fuzzing is a technique for searching for bugs in software, specifically searching for security bugs, but it can be used for finding various types of bugs. I first heard about fuzzing in the network protocol field, but I see that its use has expanded to include API fuzzing, data fuzzing, and more. Fuzzers, which are the tools that perform fuzzing, use randomization or exhaustive enumeration techniques to generate a succession of arbitrary semi-random input to a program, and then watch the results. Since the generated input is random, the fuzzers generally don't really care what the program does, so long as it doesn't crash or hang.

Fuzzers have been around for a long time (20+ years), but really took off when the security field began using them to find security vulnerabilities.

So, back to Dave's report on Ben's talk. The talk, which I think covers work Ben has been doing for years, involves ways to throw hardware at the fuzzing problem. The problem with fuzzing is that it can be slow, because you may have to try a lot of random data input before you find some input with a problem.

So Ben has apparently built a fuzzing farm, which allows for automated processing that can, in fact, process a lot of data. The reported results are that they can run 1.2 million fuzz cases per day against Microsoft Word. I'm guessing, but I think that a fuzz case is probably something like:

generate a random document that looks enough like a Microsoft Word document to be interesting.
Invoke Microsoft Word to open the document.
See if it came up properly, or if it hung or crashed. If it hung or crashed, record that fact somewhere, saving the random document which triggered the hang or crash.

They apparently have 72 machines in their fuzzer farm, with each machine able to execute 20 fuzz test cases per second, and have discovered that, amazingly, 10% of these fuzz test cases result in crashes or hangs.

That means that they are able to generate about 150 Microsoft Word crashes or hangs per second in their farm.

Now, each crash or hang then needs further investigation. Many of them may be completely boring and unusable as security vulnerabilities. But some of them represent new bugs that could be used by some virus author somewhere to build the next virus. So finding them faster than the bad guys can find them is a Good Thing.

However, as Dave notes, the problem then shifts from finding crashes to evaluating them.

Presumably, the next step will be that somebody will figure out how to automate the assessment of a Microsoft Word crash, to determine whether the crash represents a possibly exploitable security vulnerability.

In fact, such a thing apparently exists!

My friend Gil loves to write tools. I'm very much a tool writer myself, but Gil is a tool writer in the extreme. When I worked with him, I saw that he would always look for an opportunity to write/extend/enhance a tool.

When you get in that mindset, and stay there, you start building some great tools.

Tuesday, May 12, 2009

sp_spaceused and multiple JDBC result sets

I'm quite familiar with using JDBC's statement class to execute a query and return a result set:


ResultSet rs = conn.createStatement().executeQuery("select * from employee");
while (rs.next())
System.out.println("Employee: " + rs.getString("name"));

Recently, though, I encountered a variation of this that I hadn't seen before: a statement which may return multiple result sets.

I was trying to enhance an internal tool which I use for performance benchmarking. This tool connects to a database, and summarizes the database contents. The tool currently:

Enumerates each of the user tables in the database
executes select count * from table on the table, to get a count of the number of rows.
sorts the list of tables by the number of rows, and
prints out the top ten tables and the number of rows in each

So far, it's been a very useful tool. But I wanted to enhance it, to display the overall disk space used by each table. Unfortunately, this becomes vendor specific, as there doesn't appear to be any standard-to-all-databases way to ask "how much disk space is this table using?"

So, since I'm currently working with Microsoft SQL Server, I wanted to invoke the sp_spaceused system procedure, to get information about the space. So I needed to enhance my tool to do:


statement.execute("exec sp_spaceused 'my-table-name'");

This is where things get weird, because Statement.execute is a weird beast. Here's the description from the documentation:

Executes the given SQL statement, which may return multiple results. In some (uncommon) situations, a single SQL statement may return multiple result sets and/or update counts. Normally you can ignore this unless you are (1) executing a stored procedure that you know may return multiple results or (2) you are dynamically executing an unknown SQL string.
The execute method executes an SQL statement and indicates the form of the first result. You must then use the methods getResultSet or getUpdateCount to retrieve the result, and getMoreResults to move to any subsequent result(s).

Guess what? I found myself in one of those "uncommon" situations! sp_spaceused is a rather bizarre stored procedure that may return one result set, or it may return two result sets. It returns a single result set in the case I was using, where I want to get the space used by a single table. It returns two result sets in the case where you pass no arguments to the procedure, in which case it gives information about the overall space used by the entire database.

So what does this mean from a practical point of view? Well, the JDBC documentation sort of hints at the fun:

Because the method execute handles the cases that are out of the ordinary, it is no surprise that retrieving its results requires some special handling.

The documentation goes on for several pages to describe all the various intricacies of different things that a stored procedure might return, then offers a snippet of code to handle the general case:


stmt.execute(string-with-unknown-results);
while (true)
{
  int rowCount = stmt.getUpdateCount();
  if (rowCount > 0) // this is an update count
  {
      // handle the row count ...
      stmt.getMoreResults();
      continue;
  }
  if (rowCount == 0) // this is a DDL command or it performed 0 updates
  {
      // handle the DDL command case ...
      stmt.getMoreResults();
      continue;
  }
  // if we get here, we have either a result set, or no more results
  ResultSet rs = stmt.getResultSet();
  if (rs != null)
  {
      // call rs.getMetaData() to learn about the result set "shape"
      // handle the result set, if it's the one you want ...
      while (rs.next())
      {
          // process a row from this result set ...
      }
      rs.close();
      stmt.getMoreResults();
      continue;
  }
  // if we get here, we have fully exhausted the results from stmt.execute()
  break;
}
stmt.close();

Other sources have slight variations of this code. This is nightmarish, but having worked my way through it, I think I understand it, and even better, when I coded it up, my calls to sp_spaceused work! I haven't yet encountered a stored procedure which can return a combination of result sets and update counts, but when I do, I'm not going to be quite so confused about how to handle it.

Best of all, now I can start trying to figure out why my database is so big!

Monday, May 11, 2009

Design your Database

When we started designing the Build Farm implementation, one thing we agreed on almost immediately was that we wanted to explicitly and thoroughly design the database schema. We both had had frustrating encounters with the modern trend toward high-level Object-Relational Mapping systems, in which it seems that the typical approach is to ignore the underlying database, and spend all your energy designing the object layer.

I'm not trying to say that there is no point to designing the object layer; object modeling is quite important and necessary. But it seems as though, more and more, people don't want to talk about the database at all!

They don't want to talk about data types, or candidate keys, or normalization,
They don't want to talk about check constraints, or referential integrity, or update propagation,
They don't want to talk about index definitions, or access patterns, or query planning, or transaction design,
They don't want to talk about data lifetimes, or versioning, or auditing, or security control.

They don't seem to want to talk about the database at all, which I think is an unfortunate turn of events. Sure, modern database systems are complex and intricate, and have behaviors that take thorough and detailed study. But they are also extraordinarily powerful pieces of software, with amazing capabilities, and rich, mature functionality. They can be used well, or they can be used poorly.

In the Build Farm, we started our overall design by discussing some of the basic goals and requirements. Then, the next step we took was to talk about the data that we wanted to track. We discussed the data at a fairly abstract level, then we sat down at a keyboard and typed in a bunch of DDL: CREATE TABLE, CREATE INDEX, etc. This forced us to get specific about table names, and column names, and data types, and inter-table relationships, and so forth. Then we loaded up a little bit of sample data into those tables (by hand, using SQL Squirrel), and practiced writing some queries.

This is what people used to call "building the Data Dictionary". I don't know if anybody talks about a data dictionary anymore; I guess I'm old-fashioned.

We certainly made some mistakes when we implemented the database schema for the Build Farm. We've fixed some of those mistakes, and we'll need to fix more of them in the future.

But I firmly believe that the decision to design the database first was a major reason why the Build Farm's implementation was a success.

Sunday, May 10, 2009

The inverse relationship between build time and build errors

The other day, I needed to test a bugfix.

It was, I thought, a fairly simple fix, and I was pretty confident I knew what to change. I just needed to modify two classes in the same package in the same subsystem.

However, to test the fix, I needed to build the code.

And, in this particular project, performing a complete build of the code from scratch takes more than two hours on the fastest machine that I have available.

So, I tried to cheat. I tried to hack things, and do a partial build, and tried to wedge my changed code into an already built tree.

Of course, I messed it up, and didn't quite get the changed code into all the right places (I built the jar, but not the war; or maybe I built the war, but didn't clear the previous copy out of the cache; or something like that, it doesn't really matter). At any rate, it was a build error. So I double-checked my steps, and found that error, and then tried to hack the build again, and made another mistake.

Eventually I got it right, and tested my fix. After about 2 hours of trying to hack the build in order to avoid running the 2 hour build. Sigh. Another afternoon wasted.

On Derby, this never happens to me.

On the machine I generally use for working on Derby, I can build the complete Derby tree, "soup to nuts" (ant clobber; ant all; ant buildjars), in barely two minutes.

So I never, ever have an annoying build problem. I never carefully set up a test and then discover that I'm not running the code that I thought I was running. I never find that I've changed a method signature and not caught one of the places that uses that method because I failed to compile it.

Of course, things are never fast enough. I know that a number of Derby developers find that two minutes is too long, and so they routinely skip the jar-building step, which shaves 30 seconds off the build time. And I can understand their frustration; there are times when 30 seconds is a long time to wait.

But, this morning, I waited 142 minutes after making a change, before I tested it.

And then I noticed I'd made a simple mistake :(

Friday, May 8, 2009

Build Farm Part 2: The core mission

The core mission of the Build Farm is:

Accurately assign the right work to the right machines,
While providing access to historical results for analysis.

You can make a perfectly reasonable argument that both these sub-missions are equally important, but it's the first part of the mission that I want to talk about here: as jobs queue up on the Build Farm, and as machines become available to do work, which job should be assigned to which machine?

Efficiently executing work is an often-discussed topic; you can find it covered until topics such as "Workload Management", "Resource Management", etc. It tends to be the province of software such as DBMSs, Application Servers, and Operating Systems. That is, it's a really hard problem to do this well.

We did not want to write an operating system.

So, we made several simplifying assumptions:

We can afford to over-provision our Build Farm. Machine resources are cheap enough nowadays, and our overall problem is small enough, that we can "throw hardware at the problem".
We don't have to be perfect. We have to do a decent job of scheduling work, but we can afford to make the occasional sub-par decision, so long as in general all the necessary work is getting done in a timely fashion.

We decided upon a fairly simple design, based on two basic concepts: Capabilities, and Priorities.

Priorities are just what you think they are: each job in the Build Farm is assigned a priority, which is just an integer, and jobs with a higher priority are scheduled in preference to jobs with a lower priority.

Capabilities are an abstract concept which enable the Build Farm to perform match-making between waiting jobs, and available machines. Jobs require certain capabilities, and machines provide certain capabilities, and for a job to be scheduled on a machine, the job's capabilities must be a strict subset of the machine's capabilities.

So, for example, a particular job might specify that it requires the capabilities:

Windows
JDK 1.5
JBoss 4.3
DB2 9.5

Trying to run this job on a Linux machine, or on a machine which only has Oracle installed, would be pointless. So each machine specifies the software which it has available as a similar set of capabilities, and the Build Farm can use this information to decide if this machine is capable of handling this job.

The bottom line is that we re-defined the Build Farm's core job-scheduling requirement to be:

When a machine is available to do work, ask it to do the highest-priority job in the queue which this machine is capable of executing.

When a programming problem is too hard, re-define it so that you have a problem you can actually solve.

Wednesday, May 6, 2009

A great test is a pathway to a mystery

I've had some great testing experiences recently. It's a bit challenging to explain why they were great experiences, but it put me in mind of this quote:

We absolutely must leave room for doubt or there is no progress and no learning. There is no learning without having to pose a question. And a question requires doubt. People search for certainty. But there is no certainty. People are terrified–how can you live and not know? It is not odd at all. You can think you know, as a matter of fact. And most of your actions are based on incomplete knowledge and you really don't know what it is all about, or what the purpose of the world is, or know a great deal of other things. It is possible to live and not know.

Richard Feynman, The Pleasure of Finding Things Out.

Recently I've been involved with an internal test at work, designed to examine the behavior of our system across DBMS outages. The idea is that:

Things are OK,
Then the database server goes down, or the network is cut, or the disk fills up, or...
So we need to detect that, and recognize that, for the time being, the database is unavailable, and make sure that the user can easily tell that we are experiencing a database outage,
Then, at some later point, access to the database is restored,
And we need to recognize that, and clear our outage status, and resume normal processing.

Once we had come to agreement on the design and felt that we had a reasonable implementation, we set about proving that our implementation worked, by writing a test (actually, a suite of tests, all quite similar). We've been trying to do this in a fairly automated fashion, by using some simulation tools:

Our test simulates normal activity, both before and after the outage.
Our test simulates the database outage, by using a configurable wrappering JDBC driver, similar to p6spy, but actually adapted from some ideas published many years ago in a O'Reilly OnJava article.

So we wrote this test, and we've been running this test, and running this test, and running this test, and running this test. For close to a year now this test has been finding interesting problems and showing us new insights into the behavior of the system.

At first, many of the problems were with the test itself: it didn't do a very good job of simulating database outages, so sometimes the test would expect the database to be showing an outage, only it actually wasn't, and vice versa. And it didn't do a very good job of simulating normal system activity, so it wouldn't succeed in forcing the code down paths that we expected it to touch, or it would predict that the system would exhibit certain behaviors, but when we actually ran the real system, and provoked a real DBMS outage (e.g., shutting down the DBMS server process manually), we wouldn't see the behavior we expected.

But after a while we started to become confident in the test, and about at the same time we started to find various real problems in the system under test.

Our notion of status tracking wasn't very precise, and the test revealed that the code wasn't being very careful about tracking the status of the system, and the status of the database.
The system wasn't being very crisp about state transitions, so sometimes it would sort of lurch haphazardly between states.

Because the system runs as a server, with many background threads which are scheduled based on events or timers, its behavior is extremely complex. Moreover, since it is a multi-threaded system, accessing a shared database using a pool of available connections, the overall effect is that multiple different parts of the system may experience a database outage in different ways more or less simultaneously. It's quite challenging to understand the behavior of such a system; this test has been an extremely useful tool for doing that.

We continue to run the test; it continues to fail in new and different ways; there continue to be new mysteries to contemplate and explore; life is good!

Tuesday, May 5, 2009

File scheme URLs

I'm through with file scheme URLs.

An URL, as of course you know, is a Uniform Resource Locator. The notion is that, given an URL, client software (such as your web browser) can figure out how to retrieve that resource for your use. In the Build Farm, as is true in many modern web applications, the user interface is full of URLs, as there are many resources that we wish to display.

In the Build Farm, many of the resources to which we provide access are files:

build logs
test output
server logs

These files are recorded by individual build machines during job execution, and then are saved on the master file servers. The Build Farm UI then wants to make these files available, so I need to show links (URLs) to these files.

For a time, I tried to construct these links using file scheme URLs. URLs can use a variety of schemes. The "scheme" is the part of the URL at the start, which describes the underlying technique that is to be used to access the resource. Among the common schemes are:

http:// URLs, which are the most common, and which indicate that HyperTextTransportProtocol requests are to be used to access this resource.
https:// URLs, which are the secured variant of http URLs.
mailto: URLs, which are used for constructing links that initiate the sending of an email request.
file: URLs, which can be used to specify a file.

When you first have a look at file scheme URLs, they don't look too bad:

file://host/path

However, have a very close look at that middle paragraph in the description on Wikipedia:

On MS Windows systems, the normal colon (:) after a device letter has sometimes been replaced by a vertical bar (|) in file URLs. For example, to refer to file FOO.BAR in the top level directory of the C disk, the URL file:///C|/FOO.BAR was used. This reflected the original URL syntax, which made the colon a reserved character in a path part. For network shares, add an additional two slashes. For example, \\host\share\dir\file.txt, becomes file:////host/share/dir/file.txt.

Trying to figure out these details of how to handle device letters and network file shares is where the wheels start to come off the wagon. As I said, I tried, for a time, to make file scheme URLs work for network share references to build logs and other goodies stored on our file servers. But I could never get it to work reliably across different versions of different browsers. Firefox seemed to behave differently than Internet Explorer, and different versions of the various browsers seemed to behave differently. And the more I searched around on the net, the more confusing it got. There were seemingly authoritative articles suggesting use of 3 slashes, 4 slashes, even 5 slashes. In the bug reports, people seemed to be flailing, even using 20 or more slashes!

It was terribly frustrating, but in the end, I realized that I didn't need to care. Instead, I realized that I was trying to link to a bunch of files that were all on file servers that my web server already had access to. So, I simply changed my web server so that it was willing to serve static files via HTTP protocol (you have to fiddle with allowLinking=true in the Context definition), defined a simple set of symbolic links in my webapps/ROOT directory, thus creating a simple HTTP namespace for accessing the important files on the file server, and replaced all my nasty code which tried to build portable file scheme URLs with simple code that built simple HTTP scheme URLs.

And I was happy! So, like I said, I'm through with file scheme URLs.

Monday, May 4, 2009

Java assignment expression

Java has an assignment expression, not just an assignment statement. This is one of the things it inherited from C. Java's assignment expression is more restricted than C's, but it still can be a bit scary. Consider the following:


            boolean b = false;
            if (b = true)
                    System.out.println("B is true.");

This code will print


B is true.

which may surprise you, if you didn't look closely, because you probably understood that the almost identical bit of code:


            boolean b = false;
            if (b == true)
                    System.out.println("B is true.");

will print nothing.

Java's "if" statement only accepts boolean-valued expressions in its test, which means that some of the worst types of accidental mistakes from C will not occur in Java; the following code will not compile:


            int i = 2;
            if (i = 3)
                    System.out.println("I is 3.");

However, other rather frightening bits of Java code are legal, and do compile, and require a fair bit of thought to understand. Consider these snippets from the Java Language Specification itself:


int i = 2;
int j = (i=3) * i;
System.out.println(j);
int a = 9;
a += (a = 3);
System.out.println(a);
int b = 9;
b = b + (b = 3);
System.out.println(b);

This code prints 9, 12, and 12. But it takes a fair amount of thought to understand why, and therefore many people abhor the assignment expression, and wish that Java had only included an assignment statement. I'm generally sympathetic to this view; I certainly don't enjoy code like the above, and would never write such code myself (and would probably re-write such code if I came upon it while maintaining a system).

However, consider this example, from Derby's SQLInteger.java:


 public SQLInteger(Integer obj) {
  if (isnull = (obj == null))
   ;
  else
   value = obj.intValue();
 }

I think this is actually quite clean, and appealing. Let's break it down:

Check the passed-in argument, "obj", to see if it is null or not, and save the answer in the member field "isnull".
If "obj" was in fact null, do nothing more.
Otherwise, "obj" was not null, so call the intValue() method on it, and save the result in the member field "value".

The use of the assignment expression is clear, the code is simple and direct, and I approve of the style. I think this shows that there is room for good uses of the assignment expression after all.

Build Farm Part 1: Concepts and Goals

At work, I spend a lot of time building and maintaining a tool we call the Build Farm.

The Build Farm is internally designed and developed, but it bears a lot of resemblance to tools such as Tinderbox or BuildBot, together with some aspects of tools such as CruiseControl, and even some aspects of test case management systems. I looked at several such tools when we were starting out, but our system is wholly home-grown and coded from scratch by us; this is both good and bad. It's good because we're intimately familiar with how and why it behaves as it does; it's bad because we probably could have taken an existing system and re-used it.

Anyway, the Build Farm's primary goals are:

To manage and make effective use of a large (for us) pool of machines dedicated to building and testing our software. We currently have about 80 machines in the Build Farm; there are about 60 Windows machines, about 10 Linux machines, and about 10 Solaris machines. Work is scheduled both automatically and on-demand.
To preserve and assist with the analysis of the results of building and testing the software. Build problems must be identified and resolved; test problems must be analyzed and investigated.
To enable access to these tasks by users across the company, in all our worldwide offices, at any time 24/7.

Architecturally, the Build Farm consists of several primary pieces:

Persistent storage for information about builds, tests, work requests, machines, schedules, results, etc. Our storage includes a large file server which serves both NFS (Linux, Solaris) clients and SMB (Windows) clients (via Samba), and is used to store build logs, test output, and other build-related data such as application server trace logs, test databases, etc.. Our storage also includes a relational database for structured data; we use Apache Derby.
A web server which provides a simple UI for editing build configurations and for viewing build and test reports. The UI allows a variety of searching, as analyzing historical information and looking for trends and patterns is a very common use.
A client program which is installed on each build machine. We try to keep the client program fairly simple, and concentrate most of the logic on the server. The client knows how to request work, how to spawn subprocesses to perform work, and how to capture output and route it to the file server for the central UI to analyse.
The web server also provides a SOAP interface for use by the build machines. The most important and most frequently-issued SOAP request is: "get next job to perform".
A source code control system which the build machines can use to fetch the code to build and test. We use Perforce.

I'll talk more, and in more detail, about the Build Farm in future postings. I've been thrilled with the success of this tool, and I'm quite grateful to the other members of my team for their ideas, their support, and their willingness to tolerate the ups-and-downs of an internally-developed tool like this.

Saturday, May 2, 2009

Preparing for a rainy day

It's May, and it's raining!

Like most Californians, I am hopelessly pathetic when it comes to preparing for the rain:

My windshield wipers are worn
My tires are low on tread
I never have an umbrella or raincoat handy
I forget to take my canvas lawn furniture inside

I think we're always surprised by the rain, because it comes so rarely, even during the "rainy season", which this is definitely not. Perhaps it was a mistake for the governor to name his proposition on this month's statewide election the "Rainy Day Fund", since we have so few of those.

But it all got me thinking about preparation, and wondering about how well I do at making my software ready for the rain. It's a metaphor of course, but there are lots of things that engineers can do to improve their own rainy day fund when it comes to their software:

Go through your bug list (you do use a bug tracker, don't you?). Make sure each bug has a reproducible case, and is appropriately categorized and prioritized. Don't let your bugs linger. Try to fix them as fast as you can.
Look at all the warning messages that your compiler issues. If the compiler allows you to configure the level of warnings (such as javac's -Xlint), turn it up to the max. Figure out why you are getting those warnings, and remove them.
Find a good static analysis tool, and run it over your code. Figure out how to configure the level of warnings that it issues, and turn it up to the max. Figure out why you are getting those warnings, and remove them.
Get somebody to review your code. Review bug fixes, review modules, review architecture. Everyone's busy and none will want to do this, so you'll have to bribe them. Pressure your boss to fund such bribery.

But the most important thing to do is: work on your tests.

You can never have enough tests, and you never will. So you have to just steal bits of time and opportunities whenever possible to work on your tests:

Whenever a bug report comes in, ask for a test. If the reporter can't easily provide a test, write one yourself.
Whenever you report a bug yourself, always provide a test. It's the golden rule.
If you've got nothing else to do (you do have those times, right?), write tests. Test your own code, test other people's code, it doesn't matter. More tests is good, and practicing writing tests will improve your testing abilities.
Review your tests like you would review your code. Ensure that your tests are clear and simple, and that they describe what they intend to test, and (this is very important) ensure that when the test fails, that it's very clear what is wrong and why the test has failed.
Test your tests.

Hanging out in the Derby community, I've seen several patterns that have become so ingrained, that they are thoroughly part of the Derby DNA at this point:

Reported issues always have repro scripts or test programs attached. If they don't, we keep trying to get one.
Reviewers routinely review tests, and make comments or suggestions for improvements to make the test clearer, or simpler.
When reviewing a patch and readying it for commit, reviewers routinely test the test, as well as examining the fix.
We are extremely reluctant to accept a fix without a test, although we will often accept a test without a fix. That is, it's common practice in Derby-land to commit a new test to the trunk by itself. Sometimes, people will first check in a test, to make sure that the test itself is reliable and stable, then, later, they will check in the fix. The first version of the test that is checked in demonstrates the wrong behavior (with comments referencing the issue-tracking system); the later version of the test, which is submitted with the code change that resolves the problem, is adjusted to demonstrate the correct behavior.

For people looking for a body of test code to study, to learn more about how to write good tests, I recommend having a look through the Derby test suites.

Now, I'm off to find that umbrella in the garage...

Friday, May 1, 2009

Derby 10.5 released

Yay! Derby 10.5 has been released!

I didn't contribute any major fixes to this release, but I still feel like I was a part of the release. There are a lot of ways to be part of an open source community; the Derby web site talks about a number of different types of contribution to the community. For this particular release, I was primarily involved by answering questions on the various lists, reviewing and helping to test changes, contributing to the wiki, voting on proposals, etc.

As the announcement indicates, Derby 10.5 contained a number of intriguing features. Over on his weblog, Knut has been discussing some of them in greater detail. Although the Derby development community is somewhat smaller than it was a few years back, it is still quite active and there is a lot of interesting work occurring.

I'm already looking forward to Derby 10.6!

XPLAIN documentation

I made good progress on revising the XPLAIN documentation this week.

I had drafted some documentation (DERBY-4065) back in early February, and had received a number of helpful comments from reviewers. I kind of drifted away from the documentation when the overall DERBY-2487 change missed the 10.5 cut-off, so it was good to return to it. I'm somewhat eager to get the documentation into at least adequate shape to perform an initial commit, because right now we're in the awkward position that the code for the new XPLAIN feature is in the trunk, but the documentation is not.

Not all programmers enjoy technical writing, but I've always felt pretty comfortable with drafting my own documentation. It's taken me a while to learn how to make an acceptable attempt. Like pretty much all technical skills, as long as you respect the skill, work hard at understanding what people are telling you, and practice, practice, practice, you can learn to do this, too. Happily, the Derby project is gifted with some truly skilled technical writers who allow us to provide a complete and thorough documentation set.

Overall, I'm reasonably productive with the Derby documentation feature. We use a package called DITA, which is an awkward acronym for an awkwardly-named technology: the Darwin Information Typing Architecture. At its core, DITA is a domain-specific markup language for writing technical documentation, which of course is precisely what the Derby documentation is.

The Derby documentation survived a substantial conversion from some other format, which happened before I got involved with Derby. So DITA is all I know when it comes to Derby documentation.

My biggest gripe with DITA is that when I make a simple mistake, I often don't get a very good error message. There are two things that DITA is, in my opinion, particularly hard to deal with:

The first is when I make a mistake which results in one of my *.dita files becoming not well-formed XML. Since the DITA build is basically a bunch of XSLT stylesheets which get run across the DITA source, and since XSLT stylesheets are among the least friendly technologies in the world when it comes to error messages, I usually just get something pathetic out of the XSLT engine. At least those error messages are pretty obvious, and I am able to figure out that I have a basic XML structure problem, and I am pretty good at matching up tags in my editor and can find them (yay Vim and auto-indent!)
The more annoying one is when I don't quite match the DITA rules for XML schema. DITA has these rules which say things like: "the valid child elements of a 'section' tag are 'p' and 'codeblock' and 'blockquote'". But if I put some other sort of child element into my 'section' tag, I get some very strange error messages. At times I've had to resort to trying to read the DITA stylesheet source itself, but that's not often successful.

In practice, these issues tend to encourage me to keep my documentation simple, to build new documentation by copying existing documentation and modifying it, and to keep my changes as small as possible. Interestingly, these are not bad rules to live by, because small changes are easy to review, simple documentation is easy to read, and documentation which uses a consistent style and structure throughout is easy to use.

So I guess in a way it all works out in the end.