Tuesday, May 26, 2009

Some high-level thoughts on Java Performance Tuning

So you think you want to do some performance tuning. Here are some high-level thoughts on the subject.

First, construct a benchmark. This will have at least these benefits:
  • It will force you to get specific about the software under test. For any significant and complex piece of software, there is a forest of inter-related variables: which features are you interested in tuning, what configuration will be used to run them, what workload will be applied, what duration of test, etc.
  • It will give you some real running software to measure, tune, adjust, etc. Without real running software, you won't be able to run experiments, and without experiments, you won't know if you're making progress or not.

OK, so now you have a benchmark. The next step is to run it! You need to run it in order to start getting some data points about how it performs. You also need to run it in order to start developing some theories about how it is behaving. Ideally, save all your benchmark results, recording as much information as you can about the configurations that were used, the software that was tested, and the results of the benchmark.

While the benchmark is running, watch it run. At this point, you're just trying to get an overall sense of how the benchmark runs. In particular, pay attention to the four major components of overall system performance:
  1. CPU cycles
  2. Disk activity
  3. Memory usage
  4. Network activity
You may, or may not, be able to see that your benchmark is constrained by one of these four resources. For example, if your machine immediately goes 100% CPU busy at the start of your benchmark, and stays there until the end, you are probably CPU-constrained. However, try not to jump to such a conclusion. Perhaps you didn't give the system enough memory? Or perhaps your benchmark is using too small a configuration, so everything fits tidily into memory, while real-life usage never sees such a small problem and is always hitting the disk rather than being CPU-bound. Try to keep an open mind, get a feel for whether your configuration is reasonable and your workload is realistic, and only then come to a conclusion about whether you are CPU, Disk, Memory, or Network-bound.
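
The operating system's own tools (top, vmstat, iostat, and friends) are the right way to watch these four resources, but if you want a quick peek at the memory side from inside the JVM itself, the Runtime class can report heap usage while the benchmark runs. Here's a minimal sketch; printHeapUsage is just a name made up for illustration:

// Quick look at heap usage from inside the JVM; call it at interesting
// points during the benchmark run.  Not a substitute for the real OS-level tools.
static void printHeapUsage(String label)
{
    Runtime rt = Runtime.getRuntime();
    long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    long maxMb = rt.maxMemory() / (1024 * 1024);
    System.out.println(label + ": " + usedMb + " MB of heap in use, " + maxMb + " MB max");
}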

Perhaps, while running your benchmark, you didn't seem to be bound by any resource? Your CPU was never fully busy, you always had spare memory, your disks and network were never very active? Here are two other possibilities that you don't want to overlook at this stage:
  • Perhaps your benchmark just isn't giving the system enough work to do? Many systems are designed to handle lots of concurrently-overlapping work, so make sure that you're giving your benchmark a large enough workload so that it is busy. Try adding more work (more clients, or more requests, or larger tasks...) and see if throughput goes up, indicating that there was actually spare capacity available (see the sketch after this list).
  • Perhaps your benchmark is fine, but your system has some internal contention or concurrency bottlenecks which are preventing it from using all the resources you're giving it. If it seems like you're not using all your resources, but adding more work doesn't help, then you should start to think about whether some critical resource is experiencing contention.
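
Here's a rough sketch of that "add more work" experiment: the same workload driven by an increasing number of concurrent clients, reporting throughput for each run. The client counts, the request count, and the doOneRequest placeholder are all made up for illustration; substitute your real workload.

import java.util.concurrent.*;

// Hypothetical sketch: drive the system with more and more concurrent clients,
// and see whether total throughput keeps rising (spare capacity) or flattens out.
public class ThroughputScalingCheck
{
    static final int REQUESTS_PER_CLIENT = 10000;   // made-up number

    public static void main(String[] args) throws InterruptedException
    {
        for (int numClients : new int[] { 1, 2, 4, 8, 16 })
        {
            ExecutorService pool = Executors.newFixedThreadPool(numClients);
            long start = System.currentTimeMillis();

            for (int i = 0; i < numClients; i++)
            {
                pool.submit(new Runnable() {
                    public void run() {
                        for (int r = 0; r < REQUESTS_PER_CLIENT; r++)
                            doOneRequest();   // your real workload goes here
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);

            long elapsedMillis = Math.max(1, System.currentTimeMillis() - start);
            long totalRequests = (long) numClients * REQUESTS_PER_CLIENT;
            System.out.println(numClients + " clients: "
                    + (totalRequests * 1000 / elapsedMillis) + " requests/second");
        }
    }

    static void doOneRequest()
    {
        // placeholder for a single unit of benchmark work
    }
}

If throughput keeps climbing as you add clients, the smaller runs were leaving capacity on the table; if it flattens out while the machine still looks idle, start suspecting contention.
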
OK, so you have a solid benchmark, you've run it a number of times and started to develop a performance history, and you have some initial theories about where your bottlenecks might reside. Now let's start talking about Java-specific stuff.

If you're going to be this serious about Java performance tuning, you should be reading, because there are 15 years of smart people thinking about this and you can (and should) benefit from all their hard work. Here are some places to start:
OK, back to coding and tuning. We're almost to the point where we have to start talking about serious tools, but first, here are a couple of simple ideas that will get you a surprisingly long way with your performance studies:

First, while you are running your benchmark, try pressing Control-Break a few times. If you are running on a Linux or Solaris system, try sending signal QUIT (kill -QUIT my-pid) to your java process. This will tell your JVM (at least, if it's the Sun JVM) to dump a stack trace of the current location of all running threads in the virtual machine. Do this a half-dozen times, spaced randomly through your benchmark run. Then read through the stack traces, and consider what they are telling you: are you surprised about which threads are doing what? Did you see any enormously deep or intricate stack traces? Did you see a lot of threads waiting for monitor entry on some master synchronization object? I call this the "poor man's profiler", and it's amazing what it can tell you. It's particularly nice because you don't have to set anything up ahead of time; you can just walk up to a misbehaving system, toss a couple of stack trace requests into it, and see if anything jumps out at you in the output.
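
If you'd rather gather the same information from inside the JVM itself (say, from a timer or an admin command in your own code), Thread.getAllStackTraces gives you every live thread's current stack. Here's a small sketch of that approach; the method name dumpAllThreadStacks is just something I made up for illustration:

// Print the current stack of every live thread -- roughly the same picture
// that kill -QUIT gives you, but gathered from inside the process.
static void dumpAllThreadStacks()
{
    java.util.Map<Thread, StackTraceElement[]> stacks = Thread.getAllStackTraces();
    for (java.util.Map.Entry<Thread, StackTraceElement[]> entry : stacks.entrySet())
    {
        Thread t = entry.getKey();
        System.out.println("Thread \"" + t.getName() + "\" (state " + t.getState() + ")");
        for (StackTraceElement frame : entry.getValue())
            System.out.println("    at " + frame);
        System.out.println();
    }
}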

The next step is to instrument your program, at a fairly high level, using the built-in Java clock: System.currentTimeMillis. This handy little routine is quite inexpensive to call, and for many purposes can give you a very good first guess at where problems may be lurking. Just write some code like this:


long myStartTime = System.currentTimeMillis();
int numLoops = 0;

for (WorkItem item : workItems)    // whatever interesting loop you want to measure
{
    // ... lots of code that might be revealing here ...
    numLoops++;
}

long myElapsedTime = System.currentTimeMillis() - myStartTime;

System.out.println("My loop ran " + numLoops +
        " times, and took " + myElapsedTime + " milliseconds.");


Usually it doesn't take too long to sprinkle some calls to currentTimeMillis around, and get some first-level reporting on where the time is going.

One thing to keep in mind when using this technique is that the currentTimeMillis clock ticks rather irregularly: it tends to "jump" by 15-16 milliseconds at a time, so it's hard to measure anything more accurately than about 1/60th of a second. Starting with JDK 1.5, there is a System.nanoTime call which is much more precise and gets around this problem.
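
The usage pattern is the same as above; the one wrinkle is that nanoTime is only meaningful for measuring elapsed time (its absolute value means nothing), so convert to milliseconds yourself when reporting. Something like:

long myStartTime = System.nanoTime();

// ... the code being measured ...

long elapsedNanos = System.nanoTime() - myStartTime;
System.out.println("Took " + (elapsedNanos / 1000000) + " ms ("
        + elapsedNanos + " ns)");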

If you've gotten this far, you should have:
  • A reliable benchmark, with a history of results
  • Some high-level numbers and theories about where the time is going
  • Some initial ideas about whether your system is CPU-bound, Disk-bound, memory-bound, etc.
This by itself may be enough that you can make an experimental change, to see if you can improve things (aren't you glad you kept those historical numbers, to compare with?).

If you can't yet figure out what to change, but you at least know which code concerns you, it's time to put a serious profiler to work. You can get profilers from a number of places: there are commercial tools, such as JProfiler, or tools built into Eclipse and NetBeans, or standalone tools like VisualVM. At any rate, find one and get it installed.

When using a profiler, there are essentially 3 things you can do:

  1. You can look at elapsed time
  2. You can look at method invocation counts
  3. You can look at memory allocation behaviors
Each of them has an appeal, but each also has pitfalls to beware of. Method/line invocation counts are great for figuring out how often code is called, and for finding deeply-nested methods that are being run a lot more than you anticipated, which can lead to some very insightful realizations about how your code is working. But mindlessly trying to reduce invocation counts can lead you to optimize away code that is, in reality, very efficient; furthermore, you may simply be repeating work that the JIT is already doing automatically on your behalf.

Analyzing memory allocation is also a great way to find code which is run a lot, and removing unnecessary and wasteful memory allocation can be a good way to speed up Java code. However, modern garbage collectors are just spectacularly, phenomenally efficient, and in recent years I've often found that cutting back on the generation of garbage, by itself, doesn't really help much. Still, if the generation of garbage is a clue (a "code smell", as they say) which leads you to locate a bunch of unnecessary and wasteful work that can be eliminated, then it's a good way to quickly locate a problem.
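
The classic example of allocation that a profiler will happily point out is building up a string with + inside a hot loop, where every iteration creates a fresh String (and a throwaway StringBuilder behind the scenes); reusing one StringBuilder does the same job with far less garbage. A small sketch, where the method names and the lines parameter are made up for illustration:

// Wasteful: every iteration allocates a new StringBuilder and a new String.
static String buildReportWastefully(java.util.List<String> lines)
{
    String report = "";
    for (String line : lines)
        report += line + "\n";
    return report;
}

// Better: one StringBuilder reused across the whole loop.
static String buildReportCheaply(java.util.List<String> lines)
{
    StringBuilder sb = new StringBuilder();
    for (String line : lines)
        sb.append(line).append('\n');
    return sb.toString();
}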

Lastly, there is the profiler's elapsed-time analysis. Sometimes this can work really well, and the profiler can show you exactly where your code is spending its time. However, in a complicated server process, with lots of background threads, I/O operations, etc., figuring out how to interpret the elapsed-time graphs can be very tricky. More than once I've seen the profiler point a giant Flying Fickle Finger of Fate directly at some section of my code, worked industriously to isolate and improve that code, only to find that, overall, it didn't make any difference in the actual benchmark results.

1 comment:

  1. Hey Bryan, thanks very much for this article. For a newbie like me, you gave some really valuable information that I can learn from about performance engineering.

    Please keep posting.. !
