I got distracted by some other work, and time passed. But recently Knut noticed that I hadn't done anything on the project in a while, and encouraged me to return to it.
So I have.
In SQL, the GROUP BY statement instructs the DBMS to partition the data into separate groups based on the values of one or more columns; usually it is used with computations which evaluate a COUNT, SUM, AVG, or other function across the groups. So, for example, it is common to see:
SELECT D.NAME, COUNT(E.ID) FROM EMPLOYEE E, DEPARMENT D WHERE E.DEPT_ID = D.ID GROUP BY D.NAME
SELECT D.NAME, SUM(PROJ.BUDGET) FROM PROJECT PROJ, DEPARTMENT D WHERE PROJ.DEPT_ID = D.ID GROUP BY D.NAME
SELECT D.NAME, AVG(E.SALARY) AS AVG_SALARY FROM EMPLOYEE E, DEPARTMENT D WHERE E.DEPT_ID = D.ID GROUP BY D.NAME ORDER BY AVG_SALARY
You can group by more than one column, so if you wanted to find the budget for each department, for each project type, for each month, you'd write something like:
SELECT D.NAME, P.MONTH, P.TYPE, SUM(P.BUDGET)
FROM DEPARTMENT D, PROJECT P
WHERE D.ID = P.DEPT_ID
GROUP BY D.NAME, P.MONTH, P.TYPE
This statement produces a detailed breakdown, but when producing a report, the user commonly wants to "roll up" these detailed groups so that they get an overall sum for:
- All the project types for this department this month (GROUP BY D.NAME, P.MONTH), as well as
- All the project types for the entire budget year for this department (GROUP BY D.NAME)
The
GROUP BY ROLLUP
feature is designed for precisely this purpose, as it allows the user to issue a single statement which produces groupings at all those levels of detail in a single result set:
SELECT D.NAME, P.MONTH, P.TYPE, SUM(P.BUDGET)
FROM DEPARTMENT D, PROJECT P
WHERE D.ID = P.DEPT_ID
GROUP BY ROLLUP ( D.NAME, P.MONTH, P.TYPE )
The prototype work seems to be still quite relevant, as that part of Derby hasn't changed substantially. Knut noticed a few things that need cleaning up, so I'll start with those.
I suspect that the major pieces to tidy this up and ready it for commit will be:
- More testing. I wrote a couple trivial tests, but more are required.
- Documentation. Some additional information is required to describe what the new feature is and how to use it.
- Performance. The ROLLUP feature is essentially a shorthand for issuing N separate SQL statements, so as long as it doesn't perform any worse than the separate SQL statements, it's a useful feature because of its convenience. But in principle it should enable some optimizations which would cause it to perform substantially better in many cases, so I'll want to have a look at whether we can do any of those improvements.
If I can find a little bit of time, I should be able to accomplish this over the next few months, and see if I can get the project over the hurdle from prototype to operational.
No comments:
Post a Comment