Tuesday, July 14, 2009

To String.intern, or not to intern?

I don't have a lot of hands-on experience with String.intern.

This function has been around for a long time, but I recently started thinking about it as a possible tool for controlling memory usage. As you'll see, the Sun documentation describes the function as a tool for altering the behavior of String comparisons:

Returns a canonical representation for the string object.

A pool of strings, initially empty, is maintained privately by the class String.

When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the equals(Object) method, then the string from the pool is returned. Otherwise, this String object is added to the pool and a reference to this String object is returned.

It follows that for any two strings s and t, s.intern() == t.intern() is true if and only if s.equals(t) is true.

All literal strings and string-valued constant expressions are interned. String literals are defined in §3.10.5 of the Java Language Specification.
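To make the contract quoted above concrete, here is a minimal sketch. The class name InternDemo and the helper sameCanonical are made up for illustration; only String.intern, equals, and == come from the documentation. The new String(...) calls force separate objects so that the difference between identity and equality is visible.

```java
public class InternDemo {
    /** True when both strings map to the same canonical (pooled) instance. */
    static boolean sameCanonical(String s, String t) {
        return s.intern() == t.intern();
    }

    public static void main(String[] args) {
        // Two distinct String objects with equal contents.
        String a = new String("hello");
        String b = new String("hello");

        System.out.println(a == b);               // false: different objects
        System.out.println(a.equals(b));          // true: same characters
        System.out.println(sameCanonical(a, b));  // true: one pooled instance
        System.out.println(a.intern() == "hello"); // true: literals are interned
    }
}
```

Note that intern does not change a or b themselves; the caller has to keep the returned reference for any sharing to happen.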

In my particular case, we maintain a large cache of object graphs, where the object data is retrieved from a database. Furthermore, it so happens that these object graphs contain a large number of strings which are used and re-used quite frequently.

Recently, while pawing through an enormous memory dump and skimming the list of all the active String objects, I was struck by how much duplication there was. That made me wonder whether we were using String.intern appropriately.

So I did some research, and found several quite interesting essays on the topic.

My reaction so far is that:
  • Yes, it looks like String intern'ing could really help.
  • Unfortunately, the need to potentially configure PermGen space is a bummer.
  • And, it seems important to have a really good handle on which strings are worth interning. Too few, and I've just changed a bunch of code to no real effect. Too many, and I've exchanged a memory waste problem for a PermGen configuration problem, and possibly burdened the VM by making it do more work on allocations for little gain.
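As a rough illustration of what "being selective" might look like, here is a minimal sketch. The Customer class and its fields are hypothetical and are not taken from our actual cache; the assumption is simply that some fields hold a small set of frequently repeated values while others are mostly unique.

```java
import java.util.ArrayList;
import java.util.List;

public class SelectiveInterning {
    /** Hypothetical cached record; the field names are illustrative only. */
    static final class Customer {
        final String name;     // mostly unique values, so left alone
        final String country;  // few distinct values, so interned

        Customer(String name, String country) {
            this.name = name;
            this.country = country.intern();
        }
    }

    public static void main(String[] args) {
        List<Customer> cache = new ArrayList<Customer>();
        // new String(...) stands in for separately allocated strings,
        // such as values read from a database.
        cache.add(new Customer(new String("Alice"), new String("Canada")));
        cache.add(new Customer(new String("Bob"), new String("Canada")));

        // Both records now share a single "Canada" instance.
        System.out.println(cache.get(0).country == cache.get(1).country); // true
    }
}
```

Interning at the point where the object is built keeps the intern calls in one place, which would make it easier to back the change out if the savings turn out to be small.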
In general, given my vague understanding of the state of the art in JVMs nowadays, it seems like the JVM teams are working on making memory allocation fast and cheap.

And, as we've discussed previously in this blog, memory is becoming cheap and widely available.

So it isn't immediately obvious that intern'ing will be worth it. In general, it seems like a bad strategy to ask the CPU to do more work in order to conserve memory, unless we have a strong reason to believe that we have a lot of memory duplication and that the memory savings are either
  • so substantial that they will outweigh the extra expense and hassle of managing the intern pool, or
  • so substantial that the conservation of that much memory will open up a broad new range of applications for the code (e.g., we can now handle some problem sizes that were just way too large for us to handle without interning).
So I think that for now I will read some more, and think about this some more, but I'm not going to race to start planting a lot of intern calls in the code.

Are there profiling features that watch a benchmark while it runs and analyze whether or not interning would have been useful?
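Short of a real profiler feature, a crude estimate can be made in code if a sample of the cached strings can be collected. The sketch below is only that: the class and method names are invented for illustration, and gathering the sample from the object graph is assumed to happen elsewhere. It compares the number of distinct String objects (by identity) with the number of distinct values (by equals); the gap is the number of instances that interning could, in principle, eliminate.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.List;

public class DuplicateStringEstimate {
    /** Number of distinct String objects in the sample, compared by identity. */
    static int distinctInstances(List<String> sample) {
        IdentityHashMap<String, Boolean> seen = new IdentityHashMap<String, Boolean>();
        for (String s : sample) {
            seen.put(s, Boolean.TRUE);
        }
        return seen.size();
    }

    /** Number of distinct values in the sample, compared by equals. */
    static int distinctValues(List<String> sample) {
        return new HashSet<String>(sample).size();
    }

    public static void main(String[] args) {
        // new String(...) forces separate objects, as in the memory dump.
        List<String> sample = Arrays.asList(
                new String("Canada"), new String("Canada"), new String("Peru"));

        int instances = distinctInstances(sample);
        int values = distinctValues(sample);
        // prints "3 instances, 2 distinct values, 1 redundant"
        System.out.println(instances + " instances, " + values
                + " distinct values, " + (instances - values) + " redundant");
    }
}
```

This counts objects, not bytes, so it says nothing about how much memory the redundant instances actually occupy.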

1 comment:

  1. Yes, there are commands in the Eclipse Memory Analyzer to find duplicate Strings.
    See, for example, http://kohlerm.blogspot.com/2008/05/analyzing-memory-consumption-of-eclipse.html