Nowadays the so-called "NoSQL" techniques are all the rage. Everywhere you look it's Dynamo, Cassandra, BigTable, MongoDB, etc. Everybody seems to want to talk about relaxed consistency, eventual consistency, massive scalability, approximate query processing, and so on.
There's clearly a lot of value in these new paradigms, and it's indeed hard to see how Internet-scale systems could be built without them.
But, for an old-time DBMS grunt like me, raised on the work of people like Gray, Stonebraker, Mohan, Epstein, Putzolu, and so forth, it's a breath of extremely fresh air to come across a recent Google paper:
Large-scale Incremental Processing Using Distributed Transactions and Notifications.
Google, of course, are pioneers and leaders in Internet-scale data management, and their systems, such as BigTable, Map/Reduce, and GFS, are well known. But this paper is all about how traditional database techniques still have a role to play in Internet-scale data management.
The authors describe
Percolator and
Caffeine, systems for performing incremental consistent updates to the Google web indexes:
An ideal data processing system for the task of maintaining the web search index would be optimized for incremental processing; that is, it would allow us to maintain a very large repository of documents and update it efficiently as each new document was crawled. Given that the system will be processing many small updates concurrently, an ideal system would also provide mechanisms for maintaining invariants despite concurrent updates and for keeping track of which updates have been processed.
They describe how they use ideas from traditional DBMS implementations, such as transaction isolation and two-phase commit, to provide certain guarantees that make new approaches to maintaining Google's multi-petabyte indexes feasible:
By converting the indexing system to an incremental system, we are able to process individual documents as they are crawled. This reduced the average document processing latency by a factor of 100, and the average age of a document appearing in a search result dropped by nearly 50 percent.
Since Google have for many years been the poster child for Internet-scale data management, it's a significant event, in this age of
NoSQL architectures and
CAP-theorem analysis, to read a paragraph such as the following from Google's team:
The transaction management of Percolator builds on a long line of work on distributed transactions for database systems. Percolator implements snapshot isolation by extending multi-version timestamp ordering across a distributed system using two-phase commit.
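The paper gives detailed pseudocode for all of this, built on top of BigTable across many machines. Purely to give a flavor of the idea (and emphatically not Percolator's actual implementation), here's a single-process toy sketch in Python: a multi-versioned store, a timestamp oracle, and transactions that read from a snapshot and commit through a prewrite/commit pair standing in for two-phase commit. All the names here (Oracle, Store, Transaction) are mine, invented for the example.

```python
import itertools

class Conflict(Exception):
    pass

class Oracle:
    """Toy timestamp oracle: hands out strictly increasing timestamps."""
    _counter = itertools.count(1)

    @classmethod
    def next_ts(cls):
        return next(cls._counter)

class Store:
    """Multi-versioned store: each key maps to a list of (commit_ts, value)."""
    def __init__(self):
        self.versions = {}   # key -> [(commit_ts, value), ...]
        self.locks = {}      # key -> start_ts of the transaction holding the lock

    def read(self, key, start_ts):
        """Snapshot read: the latest version committed at or before start_ts."""
        visible = [(ts, v) for ts, v in self.versions.get(key, []) if ts <= start_ts]
        return max(visible)[1] if visible else None

class Transaction:
    def __init__(self, store):
        self.store = store
        self.start_ts = Oracle.next_ts()   # the transaction's snapshot timestamp
        self.writes = {}                   # writes are buffered until commit

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        return self.store.read(key, self.start_ts)

    def set(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Phase 1 ("prewrite"): lock every written key, aborting on conflict.
        locked = []
        try:
            for key in self.writes:
                if key in self.store.locks:
                    raise Conflict("%r is locked by a concurrent transaction" % key)
                if any(ts > self.start_ts for ts, _ in self.store.versions.get(key, [])):
                    raise Conflict("%r was written after our snapshot" % key)
                self.store.locks[key] = self.start_ts
                locked.append(key)
        except Conflict:
            for key in locked:             # release the locks we did manage to take
                del self.store.locks[key]
            raise
        # Phase 2 ("commit"): take a commit timestamp, install the new versions,
        # and release the locks.
        commit_ts = Oracle.next_ts()
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((commit_ts, value))
            del self.store.locks[key]

# A quick demonstration of snapshot isolation:
store = Store()
t1 = Transaction(store); t1.set("doc:42", "indexed v1"); t1.commit()
t2 = Transaction(store)                      # t2's snapshot is taken here
t3 = Transaction(store); t3.set("doc:42", "indexed v2"); t3.commit()
print(t2.get("doc:42"))                      # "indexed v1" -- t3's later commit is invisible to t2
# Had t2 also written doc:42, its prewrite would find t3's newer version and abort.
```

The interesting failure mode, of course, is a client that dies between the two phases; the paper explains how Percolator deals with that by designating one lock as the primary and cleaning up abandoned locks lazily.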
What goes around comes around. Reading the paper, I was reminded of the days when I first got interested in DBMS technology. In the late 1970s, data processing tended to be done using what was then called "batch" techniques: During the day, the system provided read-only access to the data, and accumulated change requests into a separate spooling area (typically written to 9-track tapes); overnight, the day's changes would be run through a gigantic sort-merge-apply algorithm, which would apply the changes to the master data, and make the system ready for the next day's use. Along came some new data processing techniques, and systems could provide "online updates": operators could change the data, and the system could incrementally perform the update while still making the database available for queries by other concurrent users.
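For anyone who never lived through that era, the overnight job amounted to something like the following toy sketch (in Python, with a made-up record layout and change format; the real thing involved tape drives and vastly more care):

```python
def sort_merge_apply(master, changes):
    """Apply a batch of changes to a sorted master file in one merge pass.

    master:  list of (key, value), sorted by key -- the existing data.
    changes: list of (key, op, value) with op in {"upsert", "delete"};
             assumes at most one change per key, for simplicity.
    """
    changes = sorted(changes, key=lambda c: c[0])    # the big overnight sort
    merged, i = [], 0
    for key, op, value in changes:
        # Copy across the master records that precede this change.
        while i < len(master) and master[i][0] < key:
            merged.append(master[i])
            i += 1
        # Skip the old record for this key, if there is one (replaced or deleted).
        if i < len(master) and master[i][0] == key:
            i += 1
        if op == "upsert":
            merged.append((key, value))
        # "delete" emits nothing for this key.
    merged.extend(master[i:])                        # copy the rest of the master file
    return merged

master = [("a", 1), ("c", 3), ("e", 5)]
changes = [("e", "delete", None), ("b", "upsert", 2), ("c", "upsert", 30)]
print(sort_merge_apply(master, changes))             # [('a', 1), ('b', 2), ('c', 30)]
```

The whole master file gets rewritten every night, whether one record changed or a billion did; the appeal of incremental, online updates is obvious.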
Now it's 30 years later, and the same sort of changes are still worth doing. The authors report that the introduction of Percolator and Caffeine provided a revolutionary improvement to the Google index:
In our previous system, each day we crawled several billion documents and fed them along with a repository of existing documents through a series of 100 MapReduces. Though not all 100 MapReduces were on the critical path for every document, the organization of the system as a series of MapReduces meant that each document spent 2-3 days being indexed before it could be returned as a search result.
The Percolator-based indexing system (known as Caffeine) crawls the same number of documents, but we feed each document through Percolator as it is crawled. The immediate advantage, and main design goal, of Caffeine is a reduction in latency: the median document moves through Caffeine over 100x faster than the previous system.
The paper is very well written, thorough, and complete. If you are even tangentially involved with the world of
"Big Data", you'll want to carve out an afternoon and spend it digging through the paper, chasing down the references, studying the pseudocode, and thinking about the implications. Thanks Google for publishing these results, I found them very instructive!