Monday, October 15, 2012

Drilling down, spreading the load

A few related, but independent, postcards from the bleeding edge crossed my eyeballs over the last few days:

  • The column-oriented gang at Google just keeps cranking things out. There was the Dremel work of last summer, and an entirely unrelated (I think) project called Supersonic that claims to provide
    an ultra-fast, column oriented query engine library written in C++. It provides a set of data transformation primitives which make heavy use of cache-aware algorithms, SIMD instructions and vectorised execution, allowing it to exploit the capabilities and resources of modern, hyper pipelined CPUs

    And now we have PowerDrill, introduced with the wonderful title: Processing a trillion cells per mouse click

    we present the column-oriented datastore developed as one of the central components of PowerDrill. It combines the advantages of columnar data layout with other known techniques (such as using composite range partitions) and extensive algorithmic engineering on key data structures. The main goal of the latter being to reduce the main memory footprint and to increase the efficiency in processing typical user queries. In this combination we achieve large speed-ups. These enable a highly interactive Web UI where it is common that a single mouse click leads to processing a trillion values in the underlying dataset.
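
    (I've sketched both the column-at-a-time scan and the chunk-skipping idea in toy code right after this list.)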
  • Meanwhile, I'm quite enjoying the slide deck posted by the Tokutek team from their tutorial session at the XLDB conference: Data Structures and Algorithms for Big Databases
    The tutorial was organized as follows:
    • Module 0: Tutorial overview and introductions. We describe an observed (but not necessary) tradeoff among ingestion, querying, and freshness in traditional databases.
    • Module 1: I/O model and cache-oblivious analysis.
    • Module 2: Write-optimized data structures. We give the optimal tradeoff between inserts and point queries, and show how to build data structures that lie on this tradeoff curve.
    • Module 2, continued: Write-optimized data structures perform writes much faster than point queries; this asymmetry affects the design of an ACID-compliant database.
    • Module 3: Case study – TokuFS. How to design and build a write-optimized file system.
    • Module 4: Page-replacement algorithms. We give relevant theorems on the performance of page-replacement strategies such as LRU.
    • Module 5: Index design, including covering indexes.
    • Module 6: Log-structured merge trees and fractional cascading.
    • Module 7: Bloom filters.
    The slides are superb, but I bet it was even better to attend in person; they seem to have packed pretty much every important topic from the last 10 years of file-structure design into a single deck! A few of the recurring ideas were begging to be sketched in code, so I've roughed them out below.
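
To make the Supersonic pitch concrete, here's a back-of-the-envelope C++ sketch (my own toy code; it has nothing to do with Supersonic's actual API) of why column-at-a-time execution is fast: each operator is a tight loop over contiguous arrays, which stays cache-friendly and hands the compiler an easy target for SIMD auto-vectorization.

    // Toy columnar scan: SELECT SUM(revenue) WHERE country_id == 42.
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    int64_t filtered_sum(const std::vector<int32_t>& country_id,
                         const std::vector<int64_t>& revenue,
                         int32_t wanted) {
        int64_t sum = 0;
        // One branch-light pass over two dense arrays; a row store would
        // chase per-row pointers and drag unused columns through cache.
        for (std::size_t i = 0; i < country_id.size(); ++i) {
            sum += (country_id[i] == wanted) ? revenue[i] : 0;
        }
        return sum;
    }

    int main() {
        std::vector<int32_t> country = {42, 7, 42, 9};
        std::vector<int64_t> revenue = {100, 200, 300, 400};
        std::cout << filtered_sum(country, revenue, 42) << "\n";  // prints 400
    }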
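
A big part of the PowerDrill trick behind the trillion-cells-per-click number is never touching most of the data: composite range partitions carry metadata that rules whole chunks out up front. Here's a vastly simplified illustration of that flavor (mine, not the paper's data structures), using per-partition min/max keys:

    // Skip whole partitions whose key range cannot match the predicate.
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct Partition {
        int32_t min_key, max_key;      // covers every key in this chunk
        std::vector<int32_t> key;      // partitioning column
        std::vector<int64_t> value;    // a measure column
    };

    int64_t sum_for_key(const std::vector<Partition>& parts, int32_t k) {
        int64_t sum = 0;
        for (const Partition& p : parts) {
            if (k < p.min_key || k > p.max_key) continue;  // chunk ruled out
            for (std::size_t i = 0; i < p.key.size(); ++i)
                if (p.key[i] == k) sum += p.value[i];
        }
        return sum;
    }

    int main() {
        std::vector<Partition> parts = {
            {1, 10, {2, 5, 9}, {10, 20, 30}},
            {11, 20, {12, 15}, {40, 50}},   // never scanned for k = 5
        };
        std::cout << sum_for_key(parts, 5) << "\n";  // prints 20
    }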
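
On the tutorial side, Module 2's insert/point-query tradeoff (and Module 6's log-structured merge trees) can be felt in a toy write-optimized store. This is my own LSM-flavored sketch, not TokuDB's fractal tree: batching writes into sorted runs makes inserts cheap and sequential, but a point query now has to consult every run.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    class TieredStore {
        static const std::size_t kBufferMax = 4;      // tiny, for demonstration
        std::vector<int64_t> buffer_;                 // unsorted recent inserts
        std::vector<std::vector<int64_t>> runs_;      // sorted, immutable runs
    public:
        void insert(int64_t key) {
            buffer_.push_back(key);                   // cheap: no random I/O
            if (buffer_.size() >= kBufferMax) {
                std::sort(buffer_.begin(), buffer_.end());
                runs_.push_back(buffer_);             // one sequential "flush"
                buffer_.clear();
            }
        }
        bool contains(int64_t key) const {
            // The asymmetry: a point query checks the buffer and every run.
            if (std::find(buffer_.begin(), buffer_.end(), key) != buffer_.end())
                return true;
            for (const auto& run : runs_)
                if (std::binary_search(run.begin(), run.end(), key))
                    return true;
            return false;
        }
    };

    int main() {
        TieredStore s;
        for (int64_t k = 0; k < 10; ++k) s.insert(k);
        std::cout << s.contains(7) << s.contains(42) << "\n";  // prints 10
    }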
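
Module 4's headline policy, LRU, fits in a couple dozen lines. This is a standard textbook implementation (not anything from the slides): a doubly linked list keeps pages in recency order, and a hash map gives O(1) lookups.

    #include <cstddef>
    #include <iostream>
    #include <list>
    #include <unordered_map>

    class LruCache {
        std::size_t capacity_;
        std::list<int> pages_;  // front = most recently used
        std::unordered_map<int, std::list<int>::iterator> pos_;
    public:
        explicit LruCache(std::size_t capacity) : capacity_(capacity) {}
        bool access(int page) {  // returns true on a hit
            auto it = pos_.find(page);
            if (it != pos_.end()) {
                // Hit: move the page to the front; splice keeps iterators valid.
                pages_.splice(pages_.begin(), pages_, it->second);
                return true;
            }
            if (pages_.size() == capacity_) {   // miss with a full cache:
                pos_.erase(pages_.back());      // evict the least recently used
                pages_.pop_back();
            }
            pages_.push_front(page);
            pos_[page] = pages_.begin();
            return false;
        }
    };

    int main() {
        LruCache cache(2);
        std::cout << cache.access(1) << cache.access(2) << cache.access(1)
                  << cache.access(3)   // evicts page 2
                  << cache.access(2) << "\n";  // prints 00100
    }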
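
And Module 7's Bloom filter, again as a textbook toy (with the k hash functions faked by salting a single std::hash): it answers "definitely absent" or "possibly present", which is exactly what a multi-run store wants so it can skip runs without touching them. False positives happen; false negatives never do.

    #include <bitset>
    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <string>

    class BloomFilter {
        static const std::size_t kBits = 1024;
        static const std::size_t kHashes = 3;
        std::bitset<kBits> bits_;
        std::size_t slot(const std::string& key, std::size_t i) const {
            // Derive k hashes from one by salting the key; fine for a demo.
            return std::hash<std::string>{}(key + static_cast<char>('0' + i)) % kBits;
        }
    public:
        void add(const std::string& key) {
            for (std::size_t i = 0; i < kHashes; ++i) bits_.set(slot(key, i));
        }
        bool maybe_contains(const std::string& key) const {
            for (std::size_t i = 0; i < kHashes; ++i)
                if (!bits_.test(slot(key, i))) return false;  // definitely absent
            return true;                                      // possibly present
        }
    };

    int main() {
        BloomFilter f;
        f.add("key-abc");
        std::cout << f.maybe_contains("key-abc")           // prints 1
                  << f.maybe_contains("key-xyz") << "\n";  // almost surely 0
    }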
