As part of DERBY-3002, I've been re-designing the implementation of the GROUP BY statement in Derby. Specifically, I've been working on implementing ROLLUP grouping, which is a feature that enables the computation of multiple levels of grouped aggregation with a single pass over the input data.
I believe I've got this feature working, and as a side effect I've also added support for multiple DISTINCT aggregates in the same result set.
However, my implementation is substantially different from the previous one, which generally computed aggregates using sort observers. My new implementation does not use sort observers; instead, it sorts the data (unless it is already in sorted order) and then computes the various levels of aggregates in a single pass over the sorted values.
During review of the changes, Knut Anders wondered what the overall effects on performance were. That's a good question, and I'm trying to figure out how to answer it. I think the best approach is to write a benchmark that we can run against both the old and the new implementations, and measure the results.
Writing a GROUP BY benchmark is simpler than many other possible benchmarks, but it is still not an easy job. For the most part, the overall structure of a reasonable GROUP BY benchmark is probably something like:
for each interesting GROUP BY statement:
    note the current time
    execute the GROUP BY statement and read all the results
    note the current time again and compute the elapsed time
print a simple result stating the statements that were run and the elapsed time for each
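The loop above can be sketched concretely. A Derby benchmark would be driven through JDBC; as a stand-in, here is a minimal harness using Python's built-in sqlite3 module, just to illustrate the structure (the `sales` table and the queries are invented for illustration):

```python
import sqlite3
import time

def time_query(conn, sql):
    """Execute one GROUP BY statement, read all results, return (elapsed seconds, row count)."""
    start = time.perf_counter()
    cursor = conn.execute(sql)
    rows = cursor.fetchall()  # drain every row so the full query cost is measured
    elapsed = time.perf_counter() - start
    return elapsed, len(rows)

# Build a small in-memory table to run against (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("east", "widget", 10), ("east", "gadget", 20),
                  ("west", "widget", 30), ("west", "gadget", 40)])

queries = [
    "SELECT region, SUM(amount) FROM sales GROUP BY region",
    "SELECT region, product, SUM(amount) FROM sales GROUP BY region, product",
]
for sql in queries:
    elapsed, nrows = time_query(conn, sql)
    print(f"{elapsed * 1000:8.3f} ms  {nrows:4d} rows  {sql}")
```

One detail worth noting: the harness reads every row of each result set before stopping the clock, because a GROUP BY statement may do much of its work while rows are being fetched, and stopping the clock earlier would under-measure it.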
The interesting questions, then, are:
- what statements should we run, and
- what data should we run them against?
For the statements, I think we need two sets:
- the new statements, which can only be run against Derby with my proposed DERBY-3002 patch, and
- an equivalent set of statements which can be run against older versions of Derby. For example, ROLLUP can be replaced by an equivalent set of UNION'ed GROUP BY statements. And multiple DISTINCT aggregates can be replaced by some sort of join of multiple subqueries, each with its own single DISTINCT aggregate.
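To make those rewrites concrete, here is a sketch of both substitutions, again using sqlite3 (which has no ROLLUP support) on an invented `sales` table: a ROLLUP over (region, product) becomes three UNION'ed GROUP BY levels, and two DISTINCT aggregates become a join of two single-aggregate subqueries:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("east", "widget", 10), ("east", "gadget", 20),
                  ("west", "widget", 30)])

# GROUP BY ROLLUP (region, product) computes three levels of aggregation;
# without ROLLUP, each level is a separate GROUP BY, glued together with UNION ALL.
rollup_equivalent = """
    SELECT region, product, SUM(amount) FROM sales GROUP BY region, product
    UNION ALL
    SELECT region, NULL,    SUM(amount) FROM sales GROUP BY region
    UNION ALL
    SELECT NULL,   NULL,    SUM(amount) FROM sales
"""
for row in conn.execute(rollup_equivalent):
    print(row)

# Two DISTINCT aggregates in one result set, emulated as a join (here a cross
# join of two one-row subqueries), each with a single DISTINCT aggregate.
distinct_join = """
    SELECT r.n_regions, p.n_products
    FROM (SELECT COUNT(DISTINCT region)  AS n_regions  FROM sales) r,
         (SELECT COUNT(DISTINCT product) AS n_products FROM sales) p
"""
print(conn.execute(distinct_join).fetchone())  # → (2, 2)
```

The rewritten forms should return the same rows as the native features, which also makes them useful as a correctness cross-check while benchmarking.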
For the data, I think several things matter:
- There should be enough data to make the queries interesting: the database must do enough work that differences in implementation performance matter. With only a little data, almost any algorithm would be good enough.
- The data needs to include the right data types: should we use strings? numbers? large objects? dates and times?
- The distribution of the data is very important: how many groups are there in the data, and what is the ratio between the number of groups and the total number of rows?
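That ratio can be made an explicit benchmark parameter by generating the test data rather than hand-writing it. A hypothetical generator that fixes the total row count and the number of distinct group keys:

```python
import random

def generate_rows(total_rows, num_groups, seed=42):
    """Yield (group_key, value) pairs: `num_groups` distinct keys spread
    across `total_rows` rows, so the groups-to-rows ratio is controlled."""
    rng = random.Random(seed)  # fixed seed so every run sees the same data
    for _ in range(total_rows):
        yield (f"group{rng.randrange(num_groups)}", rng.randrange(1000))

# Same total data volume, very different group distributions:
# a few large groups versus many small groups.
few_large = list(generate_rows(10_000, 10))
many_small = list(generate_rows(10_000, 5_000))
print(len({k for k, _ in few_large}), "distinct groups in the first data set")
print(len({k for k, _ in many_small}), "distinct groups in the second")
```

Running the same statements against both data sets would show whether an implementation's performance is sensitive to the number of groups or only to the total data volume.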
One place to start is to incorporate the GROUP BY examples from the Derby documentation, and from the ROLLUP specification.
Another possibility is to use statements and data from existing benchmarks; for example, there are the TPC-* benchmarks from tpc.org; TPC-D might be the best place to start looking for GROUP BY examples. Are there other such benchmarks?
Where else might I look for inspiration for building a GROUP BY benchmark?