These recent publications would suggest that the time has finally come to deploy serious test tools for bullet-proofing large scale distributed systems.
- Challenges in Designing at Scale: Formal Methods in Building Robust Distributed Systems
PlusCal and TLA+ have proven very effective at establishing and maintaining the correctness through change of the fundamental components on which DynamoDB is based.
- Using TLA+ for teaching distributed systems
TLA is a tool for specifying distributed algorithms/protocols and model checking them.
- My experience with using TLA+ in distributed systems class
Integrating TLA+ to the class gave students a way to get a hands-on experience in algorithms design and correctness verification.
- Combining static model checking with dynamic enforcement using the Statecall Policy Language
The Statecall Policy Language (SPL) was designed to make model checking more accessible to regular programmers.
- SAMC: Semantic-aware model checking for fast discovery of deep bugs in cloud systems
It was used to test 10 versions(both old and current) of Zookeeper, Hadoop/YARN, and Cassandra across 7 different protocols (leader election, atomic broadcast, cluster management, speculative execution, read/write, hinted handoff, and gossiper).
- Lineage-driven Fault Injection
MOLLY doesn’t blindly explore a state space, instead MOLLY reasons backwards from a successful outcome (hopefully common!), to figure out what might have caused it to fail, and then probes paths that exercise those potential causes.
- Paper review: "Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems"
Table 5 shows that almost all (98%) of the failures are guaranteed to manifest on no more than 3 nodes, and 84% will manifest on no more than 2 nodes.