- Optimistic Replication Algorithms, by Yasushi Saito. This paper is more than a decade old, but I just came across it recently. It's a survey of overall replication strategies, with a categorization of the various approaches to help compare and contrast them.
Optimistic replication algorithms allow data presented to users to be stale but in a controlled way. A key feature that separates optimistic replication algorithms from pessimistic counterparts is the way object updates are handled: whereas pessimistic algorithms update all the replicas at once and possibly block read requests during the update application, optimistic algorithms propagate updates in background and allow any replica to be read directly most of the time.According to the paper's taxonomy, the implementation that I've been working on at my day job is a single-master log-transfer system with eventual consistency.
- Btrfs: The Swiss Army Knife of Storage by Josef Bacik. This is a filesystems paper:
Btrfs is a new file system for Linux that has been under development for four years now and is based on Ohad Rodeh’s copy-on-write B-tree . Its aim is to bring more efficient storage management and better data integrity features to Linux . It has been designed to offer advanced features such as built-in RAID support, snapshotting, compression, and encryption . Btrfs also checksums all metadata and will checksum data with the option to turn off data checksumming .This is a great example of the sort of paper that I often call a "principles and techniques" paper: rather than diving into low-level details, the paper describes the high-level principles that the btrfs implementation is using to build a production quality filesystem. It's extremely readable while also being very informative. One of the major differentiators of btrfs from other recent filesystems has to do with how they ensure consistency. For the past 10-20 years, the primary technique has been journaling, but btrfs has a new approach:
Traditional Linux file systems have used journals to ensure metadata consistency after crashes or power failures . In the case of ext this means all metadata is written twice, once to the journal and then to its final destination . In the case of XFS this usually means that a small record of what has changed is written to the journal, and eventually the changed block is written to disk . If the machine crashes or experiences a power failure, these journals have to be read on mount and re-run onto the file system to make sure nothing was lost . With Btrfs everything is copied on write . That means whenever we modify a block, we allocate a new location on disk for it, make our modification, write it to the new location, and then free the old location . You either get the change or you don’t, so you don’t have to log the change or replay anything the next time you mount the file system after a failure—the file system will always be consistent .
- RemusDB: Transparent High Availability for Database Systems by Minhas, Rajagopalan, Cully, Aboulnaga, Salem, and Warﬁeld. This paper presents a very interesting approach to providing a high-availability database, using the technology of modern cloud-computing-style server virtualization:
Two servers are used to provide HA for a DBMS. One server hosts the active VM, which handles all client requests during normal operation. As the active VM runs, its entire state including memory, disk, and active network connections are continuously checkpointed to a standby VM on a second physical server.
Remus’s checkpoints capture the entire state of the active VM, which includes disk, memory, CPU, and network device state. Thus, this captures both the state of the database and the internal execution state of the DBMS, e.g., the contents of the buffer pool, lock tables, and client connection state. After failover, the DBMS in the standby VM begins execution with a completely warmed up buffer pool, picking up exactly where the active VM was as of the most recent checkpoint, with all session state, TCP state, and transaction state intact. This fast failover to a warm backup and no loss of client connections is an important advantage of our approach.
- Fast Crash Recovery in RAMCloud, by Ongaro, Rumble, Stutsman, Ousterhout, and Rosenblum. This paper is also concerned with high availability and crash recovery, but this team takes a different approach:
RAMCloud keeps only a single copy of data in DRAM; redundant copies are kept on disk or ﬂash, which is both cheaper and more durable than DRAM. However, this means that a server crash will leave some of the system’s data unavailable until it can be reconstructed from secondary storage.In order to accomplish their recovery-time objective, they have a massively-parallel recovery algorithm:
RAMCloud’s solution to the availability problem is fast crash recovery: the system reconstructs the entire contents of a lost server’s memory (64 GB or more) from disk and resumes full service in 1-2 seconds. We believe this is fast enough to be considered “continuous availability” for most applications.
Each server scatters its backup data across all of the other servers, allowing thousands of disks to participate in recovery. Hundreds of recovery masters work together to avoid network and CPU bottlenecks while recovering data. RAMCloud uses both data parallelism and pipelining to speed up recovery.
- Modular Data Storage with Anvil, by Mammarella, Hovsepian, and Kohler. This paper discusses an experimental database system which is designed to allow researchers and system builders to experiment with various storage systems by showing how a database storage system can be constructed in a component-based fashion, based on a small set of carefully-designed storage system modules:
Two basic goals guided the design of Anvil. First, we want Anvil modules to be ﬁne-grained and easy to write. Implementing behaviors optimized for speciﬁc workloads should be a matter of rearranging existing modules (or possibly writing new ones). Second, we want to use storage media effectively by minimizing seeks, instead aiming for large contiguous accesses. Anvil achieves these goals by explicitly separating read-only and write-mostly components, using stacked data storage modules to combine them into read/write stores. Although the Anvil design accommodates monolithic read/write stores, separating these functions makes the individual parts easier to write and easier to extend through module layering.The basic Anvil component is an object called a dTable; the paper shows a number of details about dTables and also gives some good examples of how they can be composed, layered, and extended.
- Interval Tree Clocks: A Logical Clock for Dynamic Systems, by Almeida, Baquero, and Fonte. This paper discusses the nearly 40 year old problem of trying to reason about the ordering of events in distributed systems.
This paper addresses causality tracking in dynamic settings and introduces Interval Tree Clocks (ITC), a novel causality tracking mechanism that generalizes both Version Vectors and Vector Clocks. It does not require global ids but is able to create, retire and reuse them autonomously, with no need for global coordination; any entity can fork a new one and the number of participants can be reduced by joining arbitrary pairs of entities; stamps tend to grow or shrink adapting to the dynamic nature of the system. Contrary to some previous approaches, ITC is suitable for practical uses, as the space requirement scales well with the number of entities and grows modestly over time.The paper is very carefully and clearly written, and I enjoyed this paper the most out of the whole set. The best thing about the paper is its use of examples and illustrations: the examples are carefully chosen to be complex enough to capture the power of their approach, while still being small enough to fit in a terse paper.
But more importantly, the paper uses a simply brilliant graphical notation to allow you to visualize the tree-manipulation techniques. The essence of the ITC approach is to use tree data structures to encode event histories, but standard notation for describing the technique is very hard to follow. The diagrammatic approach that the paper uses is beautiful and very elegantly conveys the essence of the technique.
By the way, it appears that the authors have subsequently contributed a running implementation of their approach in multiple languages as open source. Very nice!
Who knows if any of these papers are of interest to you. I found them all interesting, well-written, and worth the time. If the subject matter of any of them appeals to you, I think you won't be disappointed by studying them.