Southwest Airlines is apparently trying to get people to use their new service from the San Francisco Airport (until recently, they served only Oakland and San Jose). So they are running a discount where it is much cheaper to fly on Southwest if you fly from SFO, than from Oakland or San Jose.
(Southwest does this sort of thing all the time, apparently. Recently my wife and daughter flew up north to visit the University of Oregon and the University of Washington. Southwest wanted $69 per person to fly to Portland, and $141 per person to fly to Seattle. So they flew to Portland.)
Anyway, my friend lives in Fremont, and his family happened to be making their college trip, to LA. So they noticed that Southwest wanted $119 to fly from either Oakland or San Jose to LA, but only $39 (!!) to fly from SFO. So they bought tickets from SFO.
Then, although he lives just 7 miles from either San Jose or Oakland, he needed to get the family to SFO. So, he simply dropped them off at Fremont BART.
This reminded me of our recent visit to the UCSC campus. During the tour, one of the prospective students asked how kids generally went to events up in the Bay Area. Apparently, there is a shuttle bus which runs between the Santa Cruz campus and Fremont BART.
"From there, you can get anywhere in the Bay Area", the guide chirped.
Short notes and essays about stuff that interests me (mostly technical stuff).
Pages
▼
Thursday, April 30, 2009
Recovery and JAD
Oh, how frustrating.
I was working on a minor bug fix; nothing big, nothing complex, just a little bit of work. I had been back-and-forth on the fix several times over a period of a few weeks, working with the system test team to get it just right.
I finally got the fix to a reasonable state.
Unfortunately, the codeline was closed at that point, so I couldn't submit the fix (it's targeted for a future release this summer).
And, the fix doesn't apply to any other releases, neither to older releases nor to the current trunk (it's highly release-specific, as it covers an edge case that only arises in upgrading to this particular release), so I didn't need to submit it to any other branches.
So, I just left the bug fix in my client, as a set of checked-out files.
If I had been smart, I would have run 'p4 diff' and saved the output somewhere, perhaps as an attachment in our bug-tracking system. But our internal bug-tracking system is really awkward and unpleasant to use, so I didn't do that. Dumb.
If I had been smart, I would have put the fix into a branch, for safekeeping. But creating a branch, although fairly lightweight, is still a bunch of work, and I was feeling lazy, so I didn't do that. Dumb.
If I had been smart, I would have set up a Git repository on my own machine, so I could check the fix into something somewhere, but Git is still very new to me, and I didn't do that. Dumb.
So I just left the bug fix in my client.
Unfortunately, I am often doing eight things at once, trying out various little bits of code on various projects in various ways. So it's quite common for me to make a change, evaluate it, think about it, and then revert it and move on to something else.
Unfortunately, I did that. And, accidentally, I reverted my bug fix, in the process. So I lost the changes. Dumb.
However, there is a glimmer of hope. Since I was working with the test team, I still have a copy of the binary release kit that I built for them to evaluate the fix. And, there is this marvelous tool called Jad, which can take a Java class file and de-compile it, to produce Java source code.
So, I think I can use the following technique:
Jad is a nifty tool; I've used it a lot. But it seems to be "abandonware" at this point. That's too bad. I wonder what people use nowadays. There appear to be a number of alternatives. Or does everyone just run javap?
I was working on a minor bug fix; nothing big, nothing complex, just a little bit of work. I had been back-and-forth on the fix several times over a period of a few weeks, working with the system test team to get it just right.
I finally got the fix to a reasonable state.
Unfortunately, the codeline was closed at that point, so I couldn't submit the fix (it's targeted for a future release this summer).
And, the fix doesn't apply to any other releases, neither to older releases nor to the current trunk (it's highly release-specific, as it covers an edge case that only arises in upgrading to this particular release), so I didn't need to submit it to any other branches.
So, I just left the bug fix in my client, as a set of checked-out files.
If I had been smart, I would have run 'p4 diff' and saved the output somewhere, perhaps as an attachment in our bug-tracking system. But our internal bug-tracking system is really awkward and unpleasant to use, so I didn't do that. Dumb.
If I had been smart, I would have put the fix into a branch, for safekeeping. But creating a branch, although fairly lightweight, is still a bunch of work, and I was feeling lazy, so I didn't do that. Dumb.
If I had been smart, I would have set up a Git repository on my own machine, so I could check the fix into something somewhere, but Git is still very new to me, and I didn't do that. Dumb.
So I just left the bug fix in my client.
Unfortunately, I am often doing eight things at once, trying out various little bits of code on various projects in various ways. So it's quite common for me to make a change, evaluate it, think about it, and then revert it and move on to something else.
Unfortunately, I did that. And, accidentally, I reverted my bug fix, in the process. So I lost the changes. Dumb.
However, there is a glimmer of hope. Since I was working with the test team, I still have a copy of the binary release kit that I built for them to evaluate the fix. And, there is this marvelous tool called Jad, which can take a Java class file and de-compile it, to produce Java source code.
So, I think I can use the following technique:
- Use Jad to decompile the relevant classes from my working kit
- Use Jad to decompile the same classes from the standard release kit
- diff the jad output, and attempt to recover my fix from that.
Jad is a nifty tool; I've used it a lot. But it seems to be "abandonware" at this point. That's too bad. I wonder what people use nowadays. There appear to be a number of alternatives. Or does everyone just run javap?
Wednesday, April 29, 2009
Extending the lifetime of data
I happened to be over at my wife's office last week.
She works at a small law firm, and one of the attorneys was trying to locate some information on an old case that the company had worked on.
Well, not that old, really: the case was from 2003. Their law firm has been around for decades, as have many of their attorneys, so to them this was actually a "recent" case.
But not to the computer, it wasn't.
Consider just some of the things that have happened since they worked on that case:
I'm sure this same sort of thing is happening around the world; every organization must be finding that their data is being blasted by an equivalent tornado of change. How do we, as the people in the world who care about IT, provide any sort of assurance that the important data of the world will survive, for not just 5 years, but for decades or centuries?
The current technique appears to basically involve a continual copying forward of all known information from the previous generation of hardware/software to the current generation of hardware/software, which is a righteous annoyance, as well as appearing to be a strategy which requires ever-increasing time as the volume of data grows. Furthermore, it doesn't deal with the issue of how to safely "archive" information that one no longer wishes to update or modify, but wants to preserve for a long time, as in my wife's example ("delete this from the active cases, but keep a backup that we can reliably recover data from for at least 15 years").
Ugh.
She works at a small law firm, and one of the attorneys was trying to locate some information on an old case that the company had worked on.
Well, not that old, really: the case was from 2003. Their law firm has been around for decades, as have many of their attorneys, so to them this was actually a "recent" case.
But not to the computer, it wasn't.
Consider just some of the things that have happened since they worked on that case:
- The company has moved their offices.
- The company has upgraded all their client machines from Windows 98 to Windows XP.
- The company has upgraded all their server software from Novell to Windows 2000 Server to Windows 2003 Server.
- The company has switched their out-sourced IT provider.
- The company has upgraded their server hardware, twice.
- The company has upgraded the legal case-annotation software that they use, at least twice, and has moved from a per-laptop configuration of that software to a server-based configuration.
- The company has changed the vendor of their backup software, and has upgraded that backup software several times, and has changed their policies for how they handle backups.
- Probably lots of other changes that I've forgotten.
I'm sure this same sort of thing is happening around the world; every organization must be finding that their data is being blasted by an equivalent tornado of change. How do we, as the people in the world who care about IT, provide any sort of assurance that the important data of the world will survive, for not just 5 years, but for decades or centuries?
The current technique appears to basically involve a continual copying forward of all known information from the previous generation of hardware/software to the current generation of hardware/software, which is a righteous annoyance, as well as appearing to be a strategy which requires ever-increasing time as the volume of data grows. Furthermore, it doesn't deal with the issue of how to safely "archive" information that one no longer wishes to update or modify, but wants to preserve for a long time, as in my wife's example ("delete this from the active cases, but keep a backup that we can reliably recover data from for at least 15 years").
Ugh.
Tuesday, April 28, 2009
Git and Mercurial
Source code control tools are just about the most important tool that a programmer uses, and over the years I've used many of them. Currently, I'm primarily using Perforce and Subversion, both of which are wonderful pieces of software that do many things very well.
The trendy thing nowadays, though, is what are called DVCS: Distributed Version Control Systems. Both Perforce and Subversion are centralized systems, meaning that there is a single master server which tracks the changes, and lots of clients, each of which interacts only with the server.
In the DVCS world, the two main systems are Git and Mercurial. Git is the "hottest" thing going in version control systems right now, and has all the attention. In fact, it was seeming to me that it was pretty much a done deal, and everybody was going to switch to Git, and Mercurial was doomed.
But then the other day I noticed that Google had decided to pick Mercurial. This is interesting for several reasons:
But now, I think I'll stop focusing only on Git, and try to pay some attention to Mercurial too.
The trendy thing nowadays, though, is what are called DVCS: Distributed Version Control Systems. Both Perforce and Subversion are centralized systems, meaning that there is a single master server which tracks the changes, and lots of clients, each of which interacts only with the server.
In the DVCS world, the two main systems are Git and Mercurial. Git is the "hottest" thing going in version control systems right now, and has all the attention. In fact, it was seeming to me that it was pretty much a done deal, and everybody was going to switch to Git, and Mercurial was doomed.
But then the other day I noticed that Google had decided to pick Mercurial. This is interesting for several reasons:
- First, Google give some details about why they made this choice, which are interesting to read.
- Second, it's Google, and so it automatically matters what they do. By themselves, this one decision may have just changed the world, and Mercurial may now be the system of choice.
But now, I think I'll stop focusing only on Git, and try to pay some attention to Mercurial too.
Monday, April 27, 2009
Upgrade done right
Today, I upgraded my Ubuntu machine to Jaunty Jackalope.
There was 1 network problem, which left me having fetched 1124 out of 1125 files. I hit cancel, restarted the upgraded, it downloaded the last file, and the rest was trouble-free. Given that there were probably tens or hundreds of thousands of 9.04 upgrades underway at the time, it's amazing that was the only problem I hit.
I don't think it's a surprise that Ubuntu is both (a) extremely successful, and (b) very easy to upgrade and keep upgrading. When I think about some of the software systems that I've been particularly impressed with over the last few years:
one thing that all these programs have in common is that they take upgrade very seriously. They work hard to make sure that upgrade is easy and reliable for the user, so that users both want to stay upgraded and find it easy to do so. They think about problems such as
There's a great post over in the Google Chrome world about how when they first started, they knew they had to absolutely nail the upgrade problem from the very beginning. Of course Google, being typically awesomely Google, have even open-sourced Google Update itself, together with some design documents which show their perspective on how an updater should behave.
Now, if only all the software in the world could do such a good job.
There was 1 network problem, which left me having fetched 1124 out of 1125 files. I hit cancel, restarted the upgraded, it downloaded the last file, and the rest was trouble-free. Given that there were probably tens or hundreds of thousands of 9.04 upgrades underway at the time, it's amazing that was the only problem I hit.
I don't think it's a surprise that Ubuntu is both (a) extremely successful, and (b) very easy to upgrade and keep upgrading. When I think about some of the software systems that I've been particularly impressed with over the last few years:
one thing that all these programs have in common is that they take upgrade very seriously. They work hard to make sure that upgrade is easy and reliable for the user, so that users both want to stay upgraded and find it easy to do so. They think about problems such as
- whether you can skip a version when upgrading several versions at a time,
- whether you can downgrade back to an old version if the upgrade caused problems,
- whether you can upgrade parts of your distributed system at different times,
- and so forth.
There's a great post over in the Google Chrome world about how when they first started, they knew they had to absolutely nail the upgrade problem from the very beginning. Of course Google, being typically awesomely Google, have even open-sourced Google Update itself, together with some design documents which show their perspective on how an updater should behave.
Now, if only all the software in the world could do such a good job.
The Sword of Data
My Dad was intrigued by the phrase "the sword of data" in Douglas Bowman's essay about leaving Google for Twitter. It's a much-remarked essay and quite interesting.
"Live by the sword, die by the sword" is a Biblical metaphor, from the Book of Matthew, where Jesus tells the Apostle Peter not to take arms against the Roman soldiers. In its broadest sense, which is how Bowman is using it, it describes the concept that the manner in which you choose to live your daily life affects the outcome of your actions.
It would seem that Bowman is being broadly critical of Google for being too analytical, too concrete, too scientific, too focused on hard facts, computer algorithms, and experimentally-verified data, and insufficiently accepting of style, creativity, intuition, subjectivity, aesthetics, instinct, and design. The heart of Bowman's critique is as follows:
Is it a fair criticism? Certainly it's not a new one; Google has been championing this approach from day one, from the search engine, to the ad engine, to the news aggregator, even in the IPO pricing scheme used when Google when public.
And, most famously, there's Google's approach to their home page.
However, minimalism is not the inverse of design. I think that Google does have a design approach; being Google, they even state their design approach as they see it:
I'm not sure I have a clear stand on this. I prefer Yahoo news to Google news, at least partly because I think that the human in the loop improves the quality of the information that is presented. And I am deeply fond of all the various iPods that have cluttered my life over the last 5 years. But I also find Google search amazingly effective, and I think that the auction engine's core algorithms are among the most fascinating ideas I've seen.
So put me down as an interested observer of a great debate!
"Live by the sword, die by the sword" is a Biblical metaphor, from the Book of Matthew, where Jesus tells the Apostle Peter not to take arms against the Roman soldiers. In its broadest sense, which is how Bowman is using it, it describes the concept that the manner in which you choose to live your daily life affects the outcome of your actions.
It would seem that Bowman is being broadly critical of Google for being too analytical, too concrete, too scientific, too focused on hard facts, computer algorithms, and experimentally-verified data, and insufficiently accepting of style, creativity, intuition, subjectivity, aesthetics, instinct, and design. The heart of Bowman's critique is as follows:
When a company is filled with engineers, it turns to engineering to solve problems. Reduce each decision to a simple logic problem. Remove all subjectivity and just look at the data.
Is it a fair criticism? Certainly it's not a new one; Google has been championing this approach from day one, from the search engine, to the ad engine, to the news aggregator, even in the IPO pricing scheme used when Google when public.
And, most famously, there's Google's approach to their home page.
However, minimalism is not the inverse of design. I think that Google does have a design approach; being Google, they even state their design approach as they see it:
A minimalist aesthetic makes sense for most Google products because a clean, clutter-free design loads quickly and doesn't distract users from their goals. Visually appealing images, color, and fonts are balanced against the needs for speed, scannable text, and easy navigation.
I'm not sure I have a clear stand on this. I prefer Yahoo news to Google news, at least partly because I think that the human in the loop improves the quality of the information that is presented. And I am deeply fond of all the various iPods that have cluttered my life over the last 5 years. But I also find Google search amazingly effective, and I think that the auction engine's core algorithms are among the most fascinating ideas I've seen.
So put me down as an interested observer of a great debate!
Sunday, April 26, 2009
Was I Confickered?
I spent most of last week trying to figure out if I had a virus infection on my primary machine. At the end of it all, I found myself still confused about exactly what had occurred. Here's what happened:
About a week ago, we started to get some indications that something was wrong in our internal network. Specifically:
It's possible that I disabled the system services myself. Since I routinely use this machine for complicated long-running performance tests, I occasionally do things like disable background system services for a while to avoid interference with the tests, then forget to re-enable them. Though I don't rememer doing that in this case.
At this point, the virus infection appears to have subsided, which is good.
But it's frustrating that I ended up from the experience not understanding some basic things such as:
Such is life in the modern world.
Update: Conficker is astonishingly sophisticated.
About a week ago, we started to get some indications that something was wrong in our internal network. Specifically:
- Various automated jobs and test runs started to fail, with error messages indicating that accounts were locked out
- Various machines running Symantec Anti Virus File System Auto Protect started popping up dialogs indicating that AutoProtect had detected and removed an instance of W32.Downadup.B.
- Using tools such as the Conficker Eyechart, the Microsoft MSRT, and Symantec's scanner, we tried to determine which machines were actively infected. These machines we shut down and disconnected from the network.
- For those machines, we then removed the virus, verified that all Windows Updates were applied, re-scanned the machines, and then restored them to the network.
- We monitored network and security activity, looking for machines that we had missed.
- The Symantec Auto-protect pop-ups on other machines claimed that they had received the virus from my machine. Multiple machines detected this, so it's hard to dismiss it as a single outlier.
- On my machine, when we examined it in detail, the Automatic Updates and Background Intelligent Transfer Service services were disabled. This is one of the symptoms of the Conficker virus; it shuts these services down to try to prevent the infected host from running Windows Update.
- Multiple virus scans of the machine, by multiple virus scanners, failed to detect the virus, although the various virus scanners detected the virus on other machines successfully.
- None of the special registry entries that the virus is supposed to create were present.
- None of the mystery files in the system directory were present.
- The in-memory DNS hooks that the virus uses to disable Windows Update (and which are checked by the eyechart) were not present, and the eyechart displayed without errors.
- Our network scanners did not detect the suspicious network traffic that was present with other infected machines (e.g., the traffic which was trying to test for machine users with weak passwords).
It's possible that I disabled the system services myself. Since I routinely use this machine for complicated long-running performance tests, I occasionally do things like disable background system services for a while to avoid interference with the tests, then forget to re-enable them. Though I don't rememer doing that in this case.
At this point, the virus infection appears to have subsided, which is good.
But it's frustrating that I ended up from the experience not understanding some basic things such as:
- which machines were infected, and why? Many of the infected machines were unpatched, and were not running the AutoProtect scanner, so they were not well defended. But at least two machines which were infected should have been protected.
- how did the virus originally enter the network? None of the infected machines appear to have been the source of the virus.
- Was my machine infected? If so, how, and is it still infected?
Such is life in the modern world.
Update: Conficker is astonishingly sophisticated.
Saturday, April 25, 2009
Derby XPLAIN is in the trunk
Today I committed the first set of changes for DERBY-2487, the new XPLAIN feature.
This is a new Derby feature. Here's how I described it in the commit comment:
Felix contributed the work to the Derby community in the spring of 2007, but then it sort of fell through the cracks. Nobody picked it up right away, although some people looked at it a bit (including me). To tell the truth, it was a bit intimidating, as it was a large patch.
I came across the work again this winter, when I was thinking about possible projects for the Google Summer of Code. I decided to have a look at the work, and see what could be done with it. I thought that the quality of the code was quite reasonable, and there was clearly a lot of value in the functionality. There were two problems that had to be solved:
There were several significant issues raised, but the most important issue was that the patch functioned by storing query statistics information into a fixed set of system catalogs, which caused us concern for several reasons:
The other major concern that was raised involved the details of the visitor pattern used for traversing the statistics tree and building the data to be stored in the tables. The implementation was visiting each node in a clean fashion, but the construction of the data was being done with a single complicated subroutine which used "if (instanceof)" tests to handle each node in a separate fashion. Rick Hillegas suggested a refactoring in which each node was responsible for capturing its own data to be stored in the xplain tables. This, too, ended up simplifying the patch considerably and appears to be working well, although I'm still concerned that there is too much coupling between the XPLAIN implementation and the ResultSetStatistics subclasses.
Overall, the commit seems to have gone well. The Derby community is amazing! Within a few hours of the commit, Knut noticed that I had overlooked a test which is only run in the JDK 6/JDBC 4.0 environment, and determined that a small change was needed to that test to allow for the presence of the new system procedures GET_XPLAIN_SCHEMA and GET_XPLAIN_MODE. Overnight while I was sleeping, he had already patched the test so that it passes.
This is a new Derby feature. Here's how I described it in the commit comment:
This feature introduces an alternate handling of runtime statistics information. Derby can now be configured so that it will capture the statistics in a machine-readable form and will store them into a set of simply-structured tables in a schema which is specified by the user. We call this behavior "XPLAIN style", and we call the tables which are used the "XPLAIN tables".The evolution of this feature is a bit interesting. The primary work on it was done by Felix Beyer, a student at the Technical University of Dresden, in Germany, in 2006-2007. You can find a good summary of the initial discussion in the mailing list archive.
Having captured statistics about statement execution, you can then analyze the statement behavior by querying these tables. For example, you can determine how much time was taken, what resources were used, what query plan was chosen, and so on.
Felix contributed the work to the Derby community in the spring of 2007, but then it sort of fell through the cracks. Nobody picked it up right away, although some people looked at it a bit (including me). To tell the truth, it was a bit intimidating, as it was a large patch.
I came across the work again this winter, when I was thinking about possible projects for the Google Summer of Code. I decided to have a look at the work, and see what could be done with it. I thought that the quality of the code was quite reasonable, and there was clearly a lot of value in the functionality. There were two problems that had to be solved:
- We had to ensure that we had the legal right to incorporate this work into Derby. Although it had been posted to Derby's bug-tracking system, with the appropriate "grant" flags checked, the Derby community was a bit concerned because of the size of the work, and wanted Felix to sign the Apache Individual Contributor License Agreement, which is the standard Apache licensing document.
- The patch had been sitting "on the shelf" for more than 2 years, and unfortunately it no longer cleanly applied to the trunk. Derby development had moved on in the interim.
There were several significant issues raised, but the most important issue was that the patch functioned by storing query statistics information into a fixed set of system catalogs, which caused us concern for several reasons:
- It meant that all Derby databases would incur the overhead of the new system catalogs, even if they never used this feature.
- It meant that it was hard to isolate different uses of this feature in the same database. Although the system catalogs included information about the user, session, and transaction which captured the data, it was still complex to interpret the data when multiple sessions were captured.
- It meant that it was hard to manage the data after it was captured. For example, it was hard to delete some, but not all, of the captured statistics.
The other major concern that was raised involved the details of the visitor pattern used for traversing the statistics tree and building the data to be stored in the tables. The implementation was visiting each node in a clean fashion, but the construction of the data was being done with a single complicated subroutine which used "if (instanceof)" tests to handle each node in a separate fashion. Rick Hillegas suggested a refactoring in which each node was responsible for capturing its own data to be stored in the xplain tables. This, too, ended up simplifying the patch considerably and appears to be working well, although I'm still concerned that there is too much coupling between the XPLAIN implementation and the ResultSetStatistics subclasses.
Overall, the commit seems to have gone well. The Derby community is amazing! Within a few hours of the commit, Knut noticed that I had overlooked a test which is only run in the JDK 6/JDBC 4.0 environment, and determined that a small change was needed to that test to allow for the presence of the new system procedures GET_XPLAIN_SCHEMA and GET_XPLAIN_MODE. Overnight while I was sleeping, he had already patched the test so that it passes.
Thursday, April 23, 2009
Negative zero in Java
Knut made a posting on the Derby list which caught my attention. He was investigating DERBY-2447, which is a lingering intermittent bug in the Derby regression tests that produces the signature: "expected [0.0] but was [-0.0]"
Knut noticed that Derby's NumberDataType has a very interesting bit of code which is designed to convert negative zero to positive zero:
That is certainly an interesting line of code!
The point of this code is that the "if" statement should evaluate to true in two situations:
Knut is wondering whether certain compilers might see that line and optimize it entirely away, because the compiler thinks that the statement would have no effect (not understanding that the statement could be true when v was negative zero).
It seems like the line of code could also be phrased as:
In this case, it might be more obvious that the statement's intent is to transform negative zero into positive zero (it would also presumably transform positive zero into positive zero, just as the current statement does, but that's fine).
Optimizing compilers are fascinating creatures. In my "work life", I maintain a small library of software which implements an object-relational API over various relational databases. Part of that library provides a small O/R query language. In the query language, we currently do NOT perform any sort of optimization; we crudely translate the user's query into a corresponding SQL query, and then we let the DBMS consider the query.
Our assumption is that the DBMS is much better at optimizing than we are, partly because I'm just one programmer and DBMS companies have huge staffs, but also partly because at the level of my library, I have only limited information available about the objects referred to by the query, while the DBMS has much more information about tables, columns, views, indexes, etc.
Java language compilers are a whole different world than DBMS query language compilers, of course, but some of the principles are similar:
Knut noticed that Derby's NumberDataType has a very interesting bit of code which is designed to convert negative zero to positive zero:
// Normalize negative floats to be "positive" (can't detect easily without using Float object because -0.0f = 0.0f)
if (v == 0.0f) v = 0.0f;
That is certainly an interesting line of code!
The point of this code is that the "if" statement should evaluate to true in two situations:
- When "v" is zero.
- When "v" is negative zero.
Knut is wondering whether certain compilers might see that line and optimize it entirely away, because the compiler thinks that the statement would have no effect (not understanding that the statement could be true when v was negative zero).
It seems like the line of code could also be phrased as:
if (v == -0.0f) v = 0.0f;
In this case, it might be more obvious that the statement's intent is to transform negative zero into positive zero (it would also presumably transform positive zero into positive zero, just as the current statement does, but that's fine).
Optimizing compilers are fascinating creatures. In my "work life", I maintain a small library of software which implements an object-relational API over various relational databases. Part of that library provides a small O/R query language. In the query language, we currently do NOT perform any sort of optimization; we crudely translate the user's query into a corresponding SQL query, and then we let the DBMS consider the query.
Our assumption is that the DBMS is much better at optimizing than we are, partly because I'm just one programmer and DBMS companies have huge staffs, but also partly because at the level of my library, I have only limited information available about the objects referred to by the query, while the DBMS has much more information about tables, columns, views, indexes, etc.
Java language compilers are a whole different world than DBMS query language compilers, of course, but some of the principles are similar:
- common sub-expression elimination
- short-circuiting intermediate results
- considering alternate implementations, and choosing the most efficient
Wednesday, April 22, 2009
Examining Java Memory Usage
Imagine that you have written some sort of a cache in your Java program.
Implementing a cache in Java is quite simple, although there are refinements and elaborations which can make this quite complicated, too. The simplest sort of Java cache typically consists of:
Note that this is only interesting when the objects in the cache are of differing sizes. If all of the items in your cache are identically sized, this is not really a very interesting problem. But many Java programs have objects of all sorts of differing sizes, so it's actually quite common that a cache will find itself holding objects which have a variety of sizes.
So imagine that you are trying to draw a circle around the cache, such that the cache's own data structures, and all of the cached objects, are inside the circle, and the rest of your Java program's memory usage is outside of the circle, and you want to know how much memory is inside that circle ("this cache is currently using 65Mb").
I've been unable to figure out a clean way to do this.
Ideally, I'd like to find a solution which the cache could implement itself, in its own Java code, but I'd also be glad if I could just find any solution to the problem, for now.
Here's some ideas that I've thought of so far:
Ideas? Suggestions? Am I overlooking something really simple and easy here?
Implementing a cache in Java is quite simple, although there are refinements and elaborations which can make this quite complicated, too. The simplest sort of Java cache typically consists of:
- A collection object, such as a java.util.Hashtable or java.util.HashMap, to hold the cached items and allow them to be easily searched.
- Usually there is a companion data structure which supports the cache's replacement policy.
- Typically the cache is packaged up in some manager class, which also holds configuration information, which is responsible for the creation and management of the cache, and which has apis to fetch objects from the cache, and (depending on the cache's policy) also has apis to add objects to the cache, evict objects from the cache, and manage updates to cached objects.
Note that this is only interesting when the objects in the cache are of differing sizes. If all of the items in your cache are identically sized, this is not really a very interesting problem. But many Java programs have objects of all sorts of differing sizes, so it's actually quite common that a cache will find itself holding objects which have a variety of sizes.
So imagine that you are trying to draw a circle around the cache, such that the cache's own data structures, and all of the cached objects, are inside the circle, and the rest of your Java program's memory usage is outside of the circle, and you want to know how much memory is inside that circle ("this cache is currently using 65Mb").
I've been unable to figure out a clean way to do this.
Ideally, I'd like to find a solution which the cache could implement itself, in its own Java code, but I'd also be glad if I could just find any solution to the problem, for now.
Here's some ideas that I've thought of so far:
- Use a tool which does this for you. Do such tools exist? I've tried messing about with things like jmap/jhat/hprof, and it looks like the OQL support ought to allow me to figure out how much memory my cache is using, but I haven't yet figured out how to get it to work.
- Run a series of experiments, using a workload which is capable of causing the cache to be bigger or smaller, and adjust the JVM's memory limit using the -Xmx argument, such that for each various workload, the -Xmx argument is set to the smallest possible value that enables the test to pass. Then correlate the size of the workload with the size of the cache: "with 1000 entries in the cache, we can just run with -Xmx128m, and with 5000 entries in the cache, we can just run with -Xmx160m, so 4000 cache entries must use 32M of space".
- Call Runtime.freeMemory before creating the cache. Then put some entries into the cache and call Runtime.freeMemory again. Then put some more entries into the cache and call it again. And so forth. Then correlate the changes that are seen in the return from Runtime.freeMemory with the objects that were put into the cache: "After adding 1000 entries to the cache, Runtime.freeMemory returned a value that was 10M smaller, so 1000 cache entries must use 10M of space".
- Enumerate the cache, and use reflection to examine the entries in the cache, as well as the cache itself. For each instance of each object "inside the circle", figure out what the actual runtime type of that object is, and thus what fields the object has. For each such field, figure out its runtime type, recursively. Eventually, you'll get down to the point where the fields that you're looking at are of Java built-in types, such as String, int, boolean, Long, etc. For each one of those instances, you can determine its actual size (though this is certainly implementation dependent). For example, a String generally takes 2 bytes per character in the String, plus a certain amount of additional overhead imposed by the implementation of the String class, whereas an int generally takes 4 bytes total. Add up all of these sizes, for the object and all of its sub-objects, and you have some sort of first-level estimate of the amount of memory consumed by the cache. (Aside: a truly complete implementation of this concept should allow for the fact that there may be object sharing among the items in the cache, so if two entries share a sub-object, for example they both point to the same instance of the same String, then we want to be careful not to count the shared sub-object's size more than once. But that's just a detail, and the algorithm is I think still tractable.)
Ideas? Suggestions? Am I overlooking something really simple and easy here?
Tuesday, April 21, 2009
GSoC 2009
I've been following Google Summer of Code for several years now, primarily through the lens of my work with Derby.
Two summers ago, I was an interested observer.
Last summer, I signed up to be a mentor, but in the end didn't get matched with a student intern.
This summer, I signed up again, and have been matched with a student, Eranda Sooriyabandara, of Sri Lanka. I will be actively mentoring in the Derby-test-and-fix project.
This will be my first summer doing this, so I'll have a lot to learn.
Two summers ago, I was an interested observer.
Last summer, I signed up to be a mentor, but in the end didn't get matched with a student intern.
This summer, I signed up again, and have been matched with a student, Eranda Sooriyabandara, of Sri Lanka. I will be actively mentoring in the Derby-test-and-fix project.
This will be my first summer doing this, so I'll have a lot to learn.
Monday, April 20, 2009
Derby
I started working with Derby in the summer of 2005. We use Derby internally as part of an internal tool called the Build Farm; I'll write more about that sometime. I was somewhat familiar with Derby, for I knew many of the Cloudscape team (Nat Wyatt, Mike Matrigali, Siuling Ku, Rick Hillegas, Martin Waldron, ...) from my time at Sybase. I had been searching for a way to gain more knowledge about Derby, and so when IBM open-sourced the Cloudscape software as Derby, I was eager to give it a try.
For 4 years, Derby has run nearly flawlessly in production use, and the more I learn about it, the more impressed I get. Among the reasons I enjoy working with Derby are:
For Derby 10.3, I worked on the ALTER TABLE support.
This winter and spring, I've been working on XPLAIN support.
For 4 years, Derby has run nearly flawlessly in production use, and the more I learn about it, the more impressed I get. Among the reasons I enjoy working with Derby are:
- It's a great community. People are respectful, pleasant, and dedicated.
- It's a great piece of software. Derby contains some of the most sophisticated and elegant Java code I've seen. Some extremely talented engineers have worked very carefully on this code.
- It's a database, and I'm fascinated by database software.
- It's open source, and I'm fascinated to learn about open source.
For Derby 10.3, I worked on the ALTER TABLE support.
This winter and spring, I've been working on XPLAIN support.
Sunday, April 19, 2009
About me (Professionally)
Since I've been around computers and software for more than 30 years, it's hard to sum things up in a tidy fashion. Some of my earliest experiences include messing about on my Dad's Cromemco Z80 and then later on his Apple IIe. The Apple IIe was well-loved in our house. After several years, it developed a bad habit where, as it heated up, the chips would start to pop out and separate from the motherboard. My dad figured out that we could simply open up the box, lean inside, and push down hard on the chips to re-seat them.
Later, in college, I spent a lot of time in Library Automation, at the Regenstein Library at the University of Chicago. This was my first exposure to serious software development, and also where I developed my fascination with databases. We weren't doing traditional OLTP-style database work, but lots and lots of bibliographic analysis, so it was mostly sorting and searching.
After college we spent 3 years in Boston, where I worked for Computer Corporation of America, on their Model 204 DBMS. This was a very specialized, non-traditional database system written in hand-coded IBM 370 Assembler Language. This was where I started to pick up lots of basic system software knowledge, such as how to write multi-threaded software, how to manage resources carefully, and how to handle errors thoroughly.
In the late 1980's we switched coasts, and I worked on several relational database systems: first Ingres, and then Sybase. At Ingres, I learned about the guts of the storage layer of modern DBMS systems: file structures, concurrency control, recovery, cache management, access methods, etc. At Sybase, I learned about object-oriented systems (I was involved in an experimental object-oriented database project there), and finally left C language programming and moved to C++.
In the mid-1990's I switched fields and got into Application Servers and Middleware, working first at Forte, then at Sun, and most recently at AmberPoint. I learned a lot about distributed systems, messaging protocols, CORBA, web services, etc. When I began working at Forte we were still working in C++, but by the time I left everything was Java, and I've been working primarily in Java all of this century.
That's a whirlwind tour of 30 years!
Later, in college, I spent a lot of time in Library Automation, at the Regenstein Library at the University of Chicago. This was my first exposure to serious software development, and also where I developed my fascination with databases. We weren't doing traditional OLTP-style database work, but lots and lots of bibliographic analysis, so it was mostly sorting and searching.
After college we spent 3 years in Boston, where I worked for Computer Corporation of America, on their Model 204 DBMS. This was a very specialized, non-traditional database system written in hand-coded IBM 370 Assembler Language. This was where I started to pick up lots of basic system software knowledge, such as how to write multi-threaded software, how to manage resources carefully, and how to handle errors thoroughly.
In the late 1980's we switched coasts, and I worked on several relational database systems: first Ingres, and then Sybase. At Ingres, I learned about the guts of the storage layer of modern DBMS systems: file structures, concurrency control, recovery, cache management, access methods, etc. At Sybase, I learned about object-oriented systems (I was involved in an experimental object-oriented database project there), and finally left C language programming and moved to C++.
In the mid-1990's I switched fields and got into Application Servers and Middleware, working first at Forte, then at Sun, and most recently at AmberPoint. I learned a lot about distributed systems, messaging protocols, CORBA, web services, etc. When I began working at Forte we were still working in C++, but by the time I left everything was Java, and I've been working primarily in Java all of this century.
That's a whirlwind tour of 30 years!
I've kept blogs internally, but haven't had much success with a public blog. But I want to have one, so it's time to give it another try. It will take me a while to learn how to use the blogging tools.
I want to write mostly about technical things that occupy my time, which means, primarily:
I'll post some more about myself later.
Here are some topics I want to start working on soon, so if I don't write more about them, remind me!
I want to write mostly about technical things that occupy my time, which means, primarily:
- Databases
- System software (servers, operating systems, networking, etc.)
- Development tools and the development process
- Programming languages
- Testing
I'll post some more about myself later.
Here are some topics I want to start working on soon, so if I don't write more about them, remind me!
- Derby
- GSoC
- Build Farm
- Performance Testing
- 1829/Javascript