And last month I was digging my way through Google's paper on their Megastore system.
Well, this month I've been continuing to work my way through the papers from this winter's Conference on Innovative Data Systems Research (CIDR), and this year the Best Paper award went to another logging systems implementation paper: Hyder - A Transactional Record Manager for Shared Flash, from a team at Microsoft Research, headed by Philip Bernstein, who literally wrote the book on transaction logging systems many years ago (I still have my copy!).
The three projects all look at similar problems, but all are pointed in different directions:
- Aether is concerned with single-system multi-core parallelism, and proposes the use of sophisticated high-concurrency non-blocking data structures and algorithms.
- Megastore is concerned with offering traditional transaction-processing ACID guarantees on top of a massively-replicated infrastructure, and proposes a Paxos algorithm using timestamping for implementing a single logical log across the many-node systems.
- Hyder, like Megastore is concerned with multi-system scale-out, but proposes the use of multi-version database concurrency control and a melding protocol to merge the individual system transaction histories into a totally ordered outcome.
The melding protocol proposed by Hyder is the heart of the work, and is described in great detail. Here's the overview of the idea, from the abstract:
Each transaction executes on a snapshot, logs its updates in one record, and broadcasts the log record to all servers. Each server rolls forward the log against its locally-cached partial-copy of the last committed state, using optimistic concurrency control to determine whether each transaction commits.
Another very interesting aspect of the Hyder work is the tradeoff between traditional BTree architecture and replication protocols, and in particular that there can be significant benefit in minimizing log record size:
An updated tree is logged in a transaction's intention. For good performance, it is important to minimize its size. For this reason, binary trees are a good choice. A binary tree over a billion keys has depth 30. A similar number of keys can be indexed by a four-layer B-tree with 200-key pages. But an update of that Btree creates a new version of four pages comprising the root-to-leaf path, which consumes much more space than a 30-node path.
The paper is somewhat indefinite with regards to this, observing that: "A comparative evaluation of these tradeoffs would be a worthwhile investigation." Sort of a details-are-left-as-an-exercise observation, unfortunately. However, this is active research and I'm pleased that they are sharing their thoughts and ideas; I'd always rather read about the road not taken, and why, then get everything all nicely wrapped up in a package, without the background information that helps understand how they arrived at these conclusions and what other ideas they considered.
All three papers are fascinating and worth reading, and I'm certainly enjoying this renaissance in transactional processing implementation of late. All of a sudden, the only thing I'm short of is time to read all these ideas!