Wednesday, September 30, 2009

Low-level virtual machine subtleties

Virtual machines, such as the Java Virtual Machine, are astoundingly complex.

My friend Andrew sent me a link the other day to this great puzzle from the C# world. Similar things arise in Java, although it's not identical, since you have to explicitly invoke the equals() method to get logical equality (the == operator always performs reference equality).

Lippert's blog post is great, and very clearly written, even if I do have no idea whatsoever about the interview question that he parenthetically notes ("invent a performant algorithm for determining intransitivities in a simplified version of the 'better method' algorithm").

The discussion of string interning in C# reminded me, somewhat, of a fascinating recent performance optimization that Knut Anders Hatlen investigated in Derby: DERBY-3883. In this very specialized case, converting a logical equality test into a more sophisticated "check the hash equality first, then check the logical equality" implementation brought surprising benefits. It's worth reading the issue comments thoroughly for some clear explanations from Knut about why and when the performance optimization is useful.

The fact that Java performance behaviors can be unexpected and surprising is intriguing. A decade ago, optimizing and improving Java program performance seemed like it was substantially easier than it is nowadays. Partly this is because I find myself working to squeeze further levels of performance from code bases that have already been thoroughly and extensively examined, but partly it is also because modern VMs and modern hardware architectures have made performance analysis very tricky.

There is a wonderful set of notes that Joshua Bloch put together from the recent JVM languages summit which show some of the depths of this problem, and highlight the increasing frequency with which these problems are arising. As the paper by Lea, Bacon, and Grove observes:


The performance of two similar programs under varying mappings might differ by many orders of magnitude depending on issues such as branch prediction, cache misses, cache contention, the placement of memory fences, mappings of threads to cores or processors, OS scheduling policies, memory footprint, resolution of virtual dispatches, loop parallelization, SIMD or GPU parallelism, inlining, boxing of primitives, garbage collection policies, interpretation vs machine code compilation, reflection, opportunistic synchronization elimination, contention retry policies, remote messaging, load balancing, memory, file, and network layout, and so on.


I guess I should be pleased that (most of) the topics raised in this laundry list are familiar to me, but mostly I'm just reminded of how complicated modern systems software is, and how tricky it is to get it just right.

Speaking of GPU parallelism, I happened to see that NVidia have released their latest OpenCL libraries in the CUDA effort. I'm not sure how I feel about approaching the notion of a new "C-like language", since C itself is hard enough, but I may have a go at learning more about some of the Java-related CUDA apis.

No comments:

Post a Comment