Journal of a Programmer: The prevalence of caching

I've been thinking about caching this week.

At work, I'm neck deep in a complex re-implementation of an intricate caching algorithm, and I realized the other day that I've been fascinated by cache implementations for almost 25 years.

My first exposure to caches was at Computer Corporation of America in Boston in the mid 80's. The team was extending the Model 204 cache algorithms as part of the new BTree implementation that was being built (previously, Model 204 had supported only entry-order and hash access methods). I learned that caches are intricate and delicate, and have to be used properly.

At Ingres, in the early 90's, I was deeply involved in several projects in the area of the cache implementation. In one project, we were re-designing the recovery system from a page-oriented physical recovery system to an ARIES-style logical recovery system, which gave me a deep appreciation for the inter-connectedness of the page cache and the recovery system. In another project we were providing shared-memory multi-processor support. Since the page cache was going to be located in shared memory, we converted it to position-independent data structures. Later, I spent some time thinking about how to handle a variety of page sizes in the page cache in an efficient manner, without excessive memory fragmentation, and while being simultaneously efficient at caching all pages sizes in the range, but didn't get to the point of building working code.

At Sybase, I worked with the Store team on implementing a page cache for the ISS system; this was my first exposure to C++ and to object-oriented methods. We tried to build a flexible cache that would be extensible and re-usable as the code evolved. We implemented a working prototype but never reached production code. However, many of the ideas behind this system traveled with Siuling, Mike, and Nat to Cloudscape (and thus to Derby).

For most of this decade, I've been extending, enhancing, and maintaining a complex Object-Relational Mapping library at work. If you haven't seen an ORM before, Hibernate is a good example. Our system isn't as sophisticated as Hibernate, but it's powerful and intricate and carefully tailored to our needs.

In particular, our ORM library provides an object cache, and does its best to return cached items when possible. As with any cache, the primary issues are:

Using resources (mainly, memory) efficiently
Managing shared access to cached objects
Ensuring cache coherency

A lot of work has gone into each of these general areas. When it comes to using resources efficiently, the most important topic involves the replacement policy; that is, what do you do when the cache is full? Here, you can find a rich research literature. I was lucky enough to take a class from Professor Betty O'Neil at UMB, and think that the LRU-K algorithm is among the best available.

Testing a caching algorithm is also a tricky task, as there are several hidden traps:

The cache may malfunction, and return stale data rather than the most up-to-date data. Only a very carefully written test will be sensitive enough to catch this.
The cache may also malfunction, and return dirty (not-yet-committed) data. Again, the test has to be very aware to be able to detect this behavior.

The cache may also be returning correct, accurate data, but may still have a variety of performance problems:

The cache may be failing to cache data, and may be needlessly re-computing data from the underlying data source.
The cache may be caching data effectively, but may be preferring one sort of workload over another. For example, it may be working well for a workload of mostly read queries, but may not function well when updates are being performed. Or vice versa.
The cache may be violating some of its configuration. For example, it may not be implementing its replacement policy as designed. Or it may be exceeding the resource limits that have been assigned to it.
The cache may be doing all its processing correctly, but it still may be using too many resources, for example it may be experiencing high contention on thread synchronization, or it may be using inefficient algorithms for maintaining its data structures.

It's the last problem that I've been chasing this month.

Cache coherency is an complex problem, and this is the area where I've been spending a lot of time recently. In our system, one of the problems that is most challenging involves the situation where a transaction modifies an object, and that object is referenced by some other object that is resident in the cache. We have some complex graph traversal algorithms which trace the references and locate stale data which must be refreshed. In many ways it's similar to the type of code you see in garbage collection algorithms, although our problem is much simpler than the full GC problem.

In my particular case, the issue I've been struggling with is how to verify whether a modified data value happens to be currently referenced by an object which is referenceable from some object currently resident in the cache. That is, if object D is modified, and object A points to object B points to object C points to object D, and object A is currently resident in the cache, then I need to know this fact, because the modification of object D is relevant to object A. The problem is that, since objects are complex and have lots of inter-connectedness, there can be a variety of "pointer paths" that could possibly connect from A to D. The goal is to find any such object A instances as cheaply as possible, avoiding simple-to-implement-but-unusable-in-practice algorithms such viewing every object that is reachable from every cached object to see if object D is in that set. The implementation that we use is based on 2 building blocks:

We analyze the object schema to construct the possible reference paths of interest
We partition the cache by type, so that we can efficiently locate the cached instances of a desired type.

Then we can examine just the necessary cached instances, and traverse only the possible paths.

Journal of a Programmer

Saturday, May 23, 2009

The prevalence of caching

No comments:

Post a Comment

About Me

Blog Archive

Pages