Thursday, June 6, 2013

Network failures and system resilience

On his Aphyr blog, Kyle Kingsbury has been doing some superb work.

First, there's his recent article, "The Network is Reliable", which surveys the ways that networks can fail and defeat our well-intentioned-but-inadequate attempts to survive such failures:

This post is meant as a reference point–to illustrate that, according to a wide range of accounts, partitions occur in many real-world environments. Processes, servers, NICs, switches, local and wide area networks can all fail, and the resulting economic consequences are real. Network outages can suddenly arise in systems that are stable for months at a time, during routine upgrades, or as a result of emergency maintenance. The consequences of these outages range from increased latency and temporary unavailability to inconsistency, corruption, and data loss. Split-brain is not an academic concern: it happens to all kinds of systems–sometimes for days on end. Partitions deserve serious consideration.

And don't stop there; make sure you read Kingsbury's series of articles on system architectures for handling network partitions:

  • Call me maybe: Carly Rae Jepsen and the perils of network partitions
    This article is part of Jepsen, a series on network partitions. We're going to learn about distributed consensus, discuss the CAP theorem's implications, and demonstrate how different databases behave under partition.
  • Call me maybe: Postgres
    Previously on Jepsen, we introduced the problem of network partitions. Here, we demonstrate that a few transactions which “fail” during the start of a partition may have actually succeeded.
  • Call me maybe: Redis
    Previously on Jepsen, we explored two-phase commit in Postgres. In this post, we demonstrate Redis losing 56% of writes during a partition.
  • Call me maybe: MongoDB
    Previously in Jepsen, we discussed Redis. In this post, we'll see MongoDB drop a phenomenal amount of data.
  • Call me maybe: Riak
    Previously in Jepsen, we discussed MongoDB. Today, we'll see how last-write-wins in Riak can lead to unbounded data loss.
  • Call me maybe: Final Thoughts
    Previously in Jepsen, we discussed Riak. Now we'll review and integrate our findings.
  • Asynchronous replication with failover
    In response to my earlier post on Redis inconsistency, Antirez was kind enough to help clarify some points about Redis Sentinel's design.
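To make the last-write-wins point from the Riak article a bit more concrete, here is a toy sketch in Python. It is my own illustration, not Riak's API and not Kingsbury's test code: the `Versioned` type, the `lww_merge` and `add_element` helpers, and the timestamps are all hypothetical. It assumes two replicas that each keep accepting writes to the same key during a partition and then reconcile by keeping only the version with the newest timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    """A stored value tagged with the timestamp of its last write (hypothetical type)."""
    value: frozenset
    timestamp: float


def add_element(current: Versioned, element: int, timestamp: float) -> Versioned:
    """Read-modify-write: add one element to the set and stamp the new version."""
    return Versioned(current.value | {element}, timestamp)


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Last-write-wins: keep the version with the later timestamp, discard the other."""
    return a if a.timestamp >= b.timestamp else b


# Before the partition, both replicas hold the same empty set.
base = Versioned(frozenset(), 0.0)
side_a = base
side_b = base

# During the partition, each side acknowledges five writes to the same key.
for t, x in enumerate(range(0, 5), start=1):
    side_a = add_element(side_a, x, float(t))
for t, x in enumerate(range(5, 10), start=1):
    side_b = add_element(side_b, x, t + 0.5)  # side B's clock runs slightly ahead

# When the partition heals, only one side's version survives the merge.
healed = lww_merge(side_a, side_b)
acknowledged = set(range(10))
lost = acknowledged - healed.value

print("surviving elements:", sorted(healed.value))
print("acknowledged but lost:", sorted(lost))
```

In this toy model every write accepted on the losing side disappears, however many there were, which is the sense in which the loss is unbounded. Real systems offer other conflict-resolution strategies; this sketch only models the simplest timestamp-based one.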

I've been really enjoying and learning from these articles; I hope Kingsbury continues to write and publish more great work!
