I very much enjoyed reading three papers, all from the same session of the 11th USENIX Symposium on Operating Systems Design and Implementation.
The session that captivated me was called "Pest Control," and it included these papers:
- Torturing Databases for Fun and Profit
Here we propose a method to expose and diagnose violations of the ACID properties. We focus on an ostensibly easy case: power faults. Our framework includes workloads to exercise the ACID guarantees, a record/replay subsystem to allow the controlled injection of simulated power faults, a ranking algorithm to prioritize where to fault based on our experience, and a multi-layer tracer to diagnose root causes.
- All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications
We find that applications use complex update protocols to persist state, and that the correctness of these protocols is highly dependent on subtle behaviors of the underlying file system, which we term persistence properties. We develop a tool named BOB that empirically tests persistence properties, and use it to demonstrate that these properties vary widely among six popular Linux file systems. We build a framework named ALICE that analyzes application update protocols and finds crash vulnerabilities, i.e., update protocol code that requires specific persistence properties to hold for correctness.
- SKI: Exposing Kernel Concurrency Bugs through Systematic Schedule Exploration
In this paper, we propose SKI, the first tool for the systematic exploration of possible interleavings of kernel code. SKI finds kernel bugs in unmodified kernels, and is thus directly applicable to different kernels. To achieve control over kernel interleavings in a portable way, SKI uses an adapted virtual machine monitor that performs an efficient analysis of the kernel execution on a virtual multiprocessor platform. This enables SKI to determine which kernel execution flows are eligible to run, and also to selectively control which flows may proceed.
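The "crash vulnerabilities" that a tool like ALICE hunts for are easy to illustrate with the classic atomic-update protocol: write to a temporary file, fsync it, rename it over the original, and fsync the parent directory. Omit any of those steps and correctness quietly starts to depend on the file system's persistence properties. Here is a minimal sketch (not taken from the paper; the file names are illustrative):

```python
import os

def atomic_update(path, data):
    """Replace the contents of `path` so that a crash at any point
    leaves either the old contents or the new contents, never a mix.
    Illustrative sketch only; real applications vary this protocol."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # force the new data to stable storage
    os.rename(tmp, path)       # atomically swap in the new file
    # Many protocols also fsync the parent directory so the rename
    # itself survives a crash:
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)

atomic_update("example.txt", b"new contents")
```

Whether the rename is durable without that final directory fsync, or whether the data write can be reordered after the rename, is exactly the kind of file-system-specific behavior the ALICE paper shows varies widely in practice.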
In a lot of ways, these papers are all very similar. They all examine the very challenging problem of finding bugs in extremely complex system software.
And they all take the approach of building a tool to find such bugs.
And all of their tools use lower-level capabilities to examine and interact with the software under test:
- One tool captures the SCSI commands sent to the storage system, and can then manipulate those captures to simulate the various power failures and disk errors that can occur.
- Another captures the system calls made from the application to the operating system, and can manipulate those calls in various ways.
- The third uses virtualization infrastructure to capture and manipulate the interactions between the operating system and the (virtual) hardware.
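The common pattern behind all three tools is capture-then-replay: record the stream of operations at some boundary, then replay manipulated variants of it to explore failure scenarios. A toy sketch of the idea, under the (deliberately simplified) assumption that a power cut can leave behind any prefix of the issued writes:

```python
# A captured trace: each entry is (block_number, payload), in issue order.
trace = [(0, b"A"), (1, b"B"), (2, b"C")]

def crash_states(trace):
    """Enumerate the disk states a simulated power cut could leave behind,
    under the simple model that any prefix of the issued writes may have
    reached the disk. Real tools also model reordering and partial writes
    inside the device."""
    for i in range(len(trace) + 1):
        disk = {}
        for block, payload in trace[:i]:
            disk[block] = payload
        yield disk

states = list(crash_states(trace))
```

A testing framework would then mount or reload each of these candidate states and check whether the software under test still upholds its guarantees, e.g., the ACID properties in the database-torturing paper.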
This sort of tool-based testing is wonderful, I think.
I've seen conceptually similar tools that capture and replay network traces; that capture and replay transaction logs; that capture and replay web server logs; etc.
It's extremely hard to go "the last mile" when testing complex system software, so I'm always enthusiastic when I see people building powerful testing tools.