Tuesday, February 9, 2010

FAT32 filesystem limits, DRDA CMDCHKRM and DERBY-3729

As part of DERBY-3729, I've been diving back into DRDA, an area I hadn't visited in several years.

DRDA is IBM's Distributed Relational Database Architecture protocol. It is the client-server protocol that the Apache Derby network server implements. DRDA is extremely rich and powerful, but unfortunately it is not simple. However, it is very thoroughly documented; the complete DRDA specification is available from The Open Group.

In the particular case of DERBY-3729, the issue involves a fairly simple question:

When the server runs out of disk space, or is otherwise unable to write any more data to the database, how should it notify the client?

In general, DRDA accommodates the return of error message information from the server to the client. However, in this case, the issue is complicated by the severity of the error. In Derby, as in (probably) all standard SQL implementations, there is a range of severity levels:

  • Warnings, which are simply returned to the client following the statement execution. For example, I think "value was truncated to fit" is a warning.

  • Errors which cause the current statement to be aborted. For example, I think "unique key constraint was violated" is a statement-level error.

  • Errors which cause the current transaction to be aborted. For example, I think "deadlock occurred and you were chosen as a victim" is a transaction-abort error.

  • Errors which cause the current connection/session to be closed. These are not very common in Derby except due to internal errors. One example, though, is when you try to connect, but give invalid arguments for the connection parameters. Then you get an error and your (never-really-created) connection is closed.

  • Errors which cause the current database to be closed. These usually involve I/O errors which are affecting Derby's storage engine. Note that since Derby can support multiple databases simultaneously, it is possible that one database is on a bad disk drive while other databases are still OK, so only the failing database is shut down.

  • Errors which cause the entire system to shut down. These are extremely rare, and mostly involve internal logic errors that are detected in critical pieces of the Derby engine. For example, if an unexpected exception occurs while aborting a transaction, Derby concludes that something is horribly wrong and shuts the entire system down rather than risking further damage.
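The severity levels above can be inspected from JDBC code, because Derby reports the severity through SQLException.getErrorCode(). Here is a minimal sketch of that idea. The class and method names are hypothetical, and the numeric thresholds are my recollection of the constants in Derby's ExceptionSeverity class, so treat them as assumptions and check them against your Derby version.

```java
import java.sql.SQLException;

/** Hypothetical helper that maps a Derby error code to one of the severity tiers. */
final class SeverityTiers {
    // Assumed values, mirroring Derby's ExceptionSeverity constants.
    static final int WARNING = 10000;
    static final int STATEMENT = 20000;
    static final int TRANSACTION = 30000;
    static final int SESSION = 40000;
    static final int DATABASE = 45000;
    static final int SYSTEM = 50000;

    /** Returns a short description of what the given severity does to the caller. */
    static String describe(int severity) {
        if (severity >= SYSTEM) return "system shut down";
        if (severity >= DATABASE) return "database closed";
        if (severity >= SESSION) return "session closed";
        if (severity >= TRANSACTION) return "transaction aborted";
        if (severity >= STATEMENT) return "statement aborted";
        if (severity >= WARNING) return "warning";
        return "no applicable severity";
    }

    /** Derby puts the severity in the vendor error code of the exception. */
    static String describe(SQLException e) {
        return describe(e.getErrorCode());
    }

    public static void main(String[] args) {
        // Simulated deadlock error; no database is needed to run this.
        SQLException deadlock = new SQLException("deadlock", "40001", TRANSACTION);
        System.out.println(describe(deadlock)); // prints "transaction aborted"
    }
}
```

An application could use a check like this to decide whether to retry a statement, restart a transaction, or reconnect.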

Back to the case at hand: DERBY-3729.

In this case, it wasn't that the disk was full; rather, the user had configured their system using the FAT32 filesystem format, which limits individual files to just under 4 gigabytes (2^32 - 1 bytes) in size. Once a file gets that large, attempts to make it larger are rejected, with an error that turns into an IOException in the Derby storage engine. The IOException is caught and treated as a database-severity error, which results in:

  • The statement and its transaction are rolled back (if possible)

  • The session and its connection are closed

  • The database is shut down

  • The client is informed that a "command check" has occurred
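To make the failure mode concrete, here is a toy model of the chain described above: a filesystem that refuses to grow a file past the FAT32 limit, and a storage layer that converts the resulting IOException into a database-severity error. This is only a sketch, not Derby's actual code. All names are hypothetical, the severity value 45000 is an assumption, and the SQLState is deliberately left null rather than guessed.

```java
import java.io.IOException;
import java.sql.SQLException;

/** Toy model of a file hitting the FAT32 size limit. Not Derby's real code. */
final class Fat32GrowthSketch {
    /** FAT32 records file sizes in 32 bits, so the largest file is 2^32 - 1 bytes. */
    static final long FAT32_MAX_FILE_BYTES = (1L << 32) - 1;

    /** Assumed value of Derby's database-level severity. */
    static final int DATABASE_SEVERITY = 45000;

    /** Toy filesystem: refuses to grow a file past the FAT32 limit. */
    static long grow(long currentBytes, long extraBytes) throws IOException {
        if (extraBytes < 0) {
            throw new IllegalArgumentException("extraBytes must be non-negative");
        }
        // Compare this way round so currentBytes + extraBytes cannot overflow.
        if (currentBytes > FAT32_MAX_FILE_BYTES - extraBytes) {
            throw new IOException("file would exceed the FAT32 size limit");
        }
        return currentBytes + extraBytes;
    }

    /** Toy storage layer: turns the I/O failure into a database-severity error. */
    static long growOrFail(long currentBytes, long extraBytes) throws SQLException {
        try {
            return grow(currentBytes, extraBytes);
        } catch (IOException ioe) {
            throw new SQLException(
                    "I/O error while extending the file: " + ioe.getMessage(),
                    null, DATABASE_SEVERITY, ioe);
        }
    }

    public static void main(String[] args) {
        long increment = 4096; // illustrative growth step, chosen arbitrarily
        try {
            growOrFail(FAT32_MAX_FILE_BYTES - 100, increment);
        } catch (SQLException e) {
            System.out.println(e.getErrorCode() + ": " + e.getMessage());
        }
    }
}
```

Note that the disk can have plenty of free space when this happens, which is why a "disk full" message is misleading here.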

A "command check" is the DRDA message used when an error is severe enough to close the connection. It is conveyed as a CMDCHKRM, which is DRDA jargon for "command check reply message".

Unfortunately, the normal Derby client-server error message communication mechanism requires that the connection remain open, because generally the Derby server just sends a message "code" to the client, which then requests the full error message details from the server by making additional protocol calls back-and-forth. In this case, since the connection is closed, we only get one shot to convey any error message information, which is via the CMDCHKRM message itself.

It turns out that the CMDCHKRM message always contains a SQLCARD, a SQL Communications Area Reply Data object, which allows a small amount of error information to be carried inline.

So, to try to resolve DERBY-3729, I:

  • Enhanced the server-side error message text to reflect that there are other possible causes of I/O errors besides the disk being full, such as a filesystem limit (FAT32) or a quota being reached.

  • Enhanced the client-side error message text to reflect that, when a CMDCHKRM is received, there may be additional information available in the server-side derby.log file.

  • Enhanced the client code which processes the CMDCHKRM message to look for the SQLCARD, and to fetch whatever summary error message text is present in that object, and include it in the client-side exception that is thrown.
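The client-side part of these changes can be pictured roughly as follows. This is a simplified sketch, not the actual Derby client code: the class and method names are hypothetical, the SQLCARD is reduced to an optional string of summary text, the session severity value 40000 is an assumption, and the message wording is invented for illustration.

```java
import java.sql.SQLException;
import java.util.Optional;

/** Hypothetical sketch of building a client exception from a command check. */
final class CommandCheckSketch {
    /** Assumed value of Derby's session-level severity. */
    static final int SESSION_SEVERITY = 40000;

    /**
     * Builds the exception thrown to the application. The connection is
     * already closed, so the inline SQLCARD text is the only detail available.
     */
    static SQLException toException(Optional<String> sqlcardText) {
        StringBuilder msg = new StringBuilder(
                "The server reported a command check and closed the connection.");
        // Include the server's summary text only if it is present and non-blank.
        sqlcardText.map(String::trim)
                .filter(text -> !text.isEmpty())
                .ifPresent(text -> msg.append(" Server message: [").append(text).append("]"));
        msg.append(" The server-side derby.log file may contain more information.");
        return new SQLException(msg.toString(), null, SESSION_SEVERITY);
    }

    public static void main(String[] args) {
        SQLException e = toException(
                Optional.of("I/O error: disk full, file size limit, or quota reached"));
        System.out.println(e.getMessage());
    }
}
```

The point of the design is that nothing here needs a second round trip to the server: everything the user sees comes from the one reply that arrived before the connection closed.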

Hopefully this will help the next user who runs into this problem.

I wonder how long the FAT32 filesystem format will still be in use?

1 comment:

  1. The correct answer is: handle a disk failure by leaving a message in the log. on the disk.

    this is compatible with NTFS disk fault handling.