There's some sort of inflection point where a "long read" becomes an "epic article."
I'm not quite sure where that inflection point is, but I'm certain that Jay Kreps's article "The Log: What every software engineer should know about real-time data's unifying abstraction" is on the "epic article" side of that dividing line.
Kreps starts out with a history of how database implementations have used logs for decades to record the detailed changes performed by a database transaction, so that those changes can be undone in case of failure or application abort, redone in case of media crash, or simply analyzed and audited for additional information.
Then, in an insightful observation, he shows the duality between understanding logs, and the "distributed state machine" approach to distributed systems design:
You can reduce the problem of making multiple machines all do the same thing to the problem of implementing a distributed consistent log to feed these processes input. The purpose of the log here is to squeeze all the non-determinism out of the input stream to ensure that each replica processing this input stays in sync.
I love the phrase "squeeze all the non-determinism out of the input stream;" I may have to make up some T-shirts with that!
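To make the quoted idea concrete, here is a minimal sketch (my own, not from Kreps's article) of state-machine replication over a shared log: every replica applies the same deterministic operations in log order, so replicas that have consumed the same prefix of the log are guaranteed to hold identical state.

```python
# Sketch: replicated state machines fed from one shared log. The log fixes
# the order of operations, "squeezing out" the non-determinism, so every
# replica converges to the same state. (Operation names are invented.)

class Replica:
    def __init__(self):
        self.state = 0
        self.applied = 0  # index of the next log entry to apply

    def catch_up(self, log):
        # Apply every entry we have not yet seen, in log order.
        for op, arg in log[self.applied:]:
            if op == "add":
                self.state += arg
            elif op == "mul":
                self.state *= arg
            self.applied += 1

log = [("add", 5), ("mul", 3), ("add", 1)]

a, b = Replica(), Replica()
a.catch_up(log[:1])   # a consumes one entry now...
a.catch_up(log)       # ...and the rest later
b.catch_up(log)       # b consumes everything at once

assert a.state == b.state == 16  # both replicas converge
```

Note that the replicas may consume the log at different times and in different chunks; only the order matters.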
Later, ruminating on the wide variety of uses of historical information in systems, Kreps brings up a point dear to my heart:
This might remind you of source code version control. There is a close relationship between source control and databases. Version control solves a very similar problem to what distributed data systems have to solve—managing distributed, concurrent changes in state. A version control system usually models the sequence of patches, which is in effect a log. You interact directly with a checked out "snapshot" of the current code which is analogous to the table. You will note that in version control systems, as in other distributed stateful systems, replication happens via the log: when you update, you pull down just the patches and apply them to your current snapshot.
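The version-control analogy in that quote can be sketched in a few lines (this is an illustration of my own; the patch format is invented): a checkout is just the fold of a patch log, and an update pulls down only the patches you are missing.

```python
# Sketch: a version-control "snapshot" as the fold of a patch log.
# A patch is a dict of {filename: new_contents}; None deletes the file.

def apply_patch(snapshot, patch):
    snapshot = dict(snapshot)
    for name, contents in patch.items():
        if contents is None:
            snapshot.pop(name, None)
        else:
            snapshot[name] = contents
    return snapshot

patch_log = [
    {"README": "v1"},
    {"main.py": "print('hi')"},
    {"README": "v2"},
]

# A full checkout replays the whole log...
snapshot = {}
for p in patch_log:
    snapshot = apply_patch(snapshot, p)

# ...while an "update" applies only the patches a stale checkout is missing.
stale = apply_patch({}, patch_log[0])   # checked out at revision 1
for p in patch_log[1:]:
    stale = apply_patch(stale, p)

assert stale == snapshot == {"README": "v2", "main.py": "print('hi')"}
```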
Data about the history of a system is incredibly valuable, and it's so great to see an author focus on this point, because it's often overlooked.
One difference between log processing systems and version control systems, of course, is that log processing systems record only a single timeline, while version control systems allow you to have multiple lines of history. Version control systems provide this with their fundamental branching and merging features, which makes them simultaneously more complex and more powerful than database systems.
Also, of course, it makes them slower; no one is suggesting that you would actually use a version control system to replace your database. But it is very valuable, as Kreps observes, to contemplate the various ways that the history of changes provides you with great insight into the workings of your system and the behavior of your data. That is why many users of major version control systems (such as the one I build in my day job) put them to work on many types of objects other than source code: art, music, circuit layouts, clothing patterns, bridge designs, legal codes; the list goes on and on.
I'll return to this point later.
But back to Kreps's article. He suggests that there are three ways to use the basic concept of a log to build larger systems:
- Data Integration—Making all of an organization's data easily available in all its storage and processing systems.
- Real-time data processing—Computing derived data streams.
- Distributed system design—How practical systems can be simplified with a log-centric design.
Data Integration involves the common problem of getting more value out of the data that you manage by arranging to export it out of one system and import it into another, so that the data can be re-used by those other systems in the manner that best suits them. Says Kreps about his own experience at LinkedIn:
New computation was possible on the data that would have been hard to do before. Many new products and analysis just came from putting together multiple pieces of data that had previously been locked up in specialized systems.
As Kreps observes, modeling this process using an explicit log reduces the coupling between those systems, which is crucial in building a reliable and maintainable collection of such systems.
The log also acts as a buffer that makes data production asynchronous from data consumption. This is important for a lot of reasons, but particularly when there are multiple subscribers that may consume at different rates. This means a subscribing system can crash or go down for maintenance and catch up when it comes back: the subscriber consumes at a pace it controls. A batch system such as Hadoop or a data warehouse may consume only hourly or daily, whereas a real-time query system may need to be up-to-the-second. Neither the originating data source nor the log has knowledge of the various data destination systems, so consumer systems can be added and removed with no change in the pipeline.
This is just the producer-consumer pattern writ large, yet too often I see this basic idea missed in practice, so it's great to see Kreps reinforce it.
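That buffering behavior is easy to see in miniature. Here is a toy sketch (my own invented API, far simpler than a real system such as Kafka) of one log serving subscribers that consume at completely different paces, each owning its own cursor:

```python
# Sketch: one append-only log, many subscribers at independent offsets.
# The producer never knows who is reading; a consumer that goes down
# simply resumes from its last offset when it comes back.

class Log:
    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)

    def read_from(self, offset):
        # Return everything at or after `offset`; the reader owns its cursor.
        return self.entries[offset:]

log = Log()
for n in range(5):
    log.append({"event": n})

# A real-time consumer stays near the tail...
realtime_offset = 4
fresh = log.read_from(realtime_offset)

# ...while a batch consumer that was down for hours catches up on the
# whole backlog at its own pace.
batch_offset = 0
backlog = log.read_from(batch_offset)

assert len(fresh) == 1 and len(backlog) == 5
```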
Often, proposals like this bog down in the "it'll be too slow" paranoia, so it's also great to see Kreps address that concern:
A log, like a filesystem, is easy to optimize for linear read and write patterns. The log can group small reads and writes together into larger, high-throughput operations. Kafka pursues this optimization aggressively. Batching occurs from client to server when sending data, in writes to disk, in replication between servers, in data transfer to consumers, and in acknowledging committed data.
I can wholeheartedly support this view. The sequential write and sequential read behavior of logs is indeed astoundingly efficient, and a moderate amount of effort in implementing your log will provide you with a logging facility that can support a tremendous throughput. I've seen production systems handle terabytes of daily logging activity with ease.
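The batching Kreps describes can be sketched in a few lines. This is a simplified illustration of the general technique (the buffer size, flush policy, and record framing here are my own inventions, not Kafka's): many small logical appends are grouped into far fewer large sequential writes.

```python
# Sketch: grouping small log writes into large sequential writes.
import io

class BatchingLog:
    def __init__(self, sink, flush_bytes=4096):
        self.sink = sink            # any file-like object opened for append
        self.buffer = bytearray()
        self.flush_bytes = flush_bytes
        self.flushes = 0            # count of physical writes

    def append(self, record: bytes):
        # Length-prefix each record so the log can later be scanned linearly.
        self.buffer += len(record).to_bytes(4, "big") + record
        if len(self.buffer) >= self.flush_bytes:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink.write(bytes(self.buffer))  # one large sequential write
            self.buffer.clear()
            self.flushes += 1

sink = io.BytesIO()
log = BatchingLog(sink, flush_bytes=64)
for i in range(100):
    log.append(b"event-%d" % i)
log.flush()

# Far fewer physical writes than logical appends.
assert log.flushes < 100 and len(sink.getvalue()) > 0
```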
After discussing the traditional uses of logs, for system reliability and for system integration, Kreps moves into some less common areas with his discussion of "Real-time Stream Processing".
The key concept that Kreps focuses on here involves the structuring of the consumers of log data:
The real driver for the processing model is the method of data collection. Data which is collected in batch is naturally processed in batch. When data is collected continuously, it is naturally processed continuously.
This is actually not such a new idea as it might seem. When database systems were first being invented, in the 1970's, they were initially built with entirely separate update mechanisms: you had your "master file," which was the most recent complete and consistent set of data, against which queries were run by your users during the day, and your "transaction file," or "batch file," to which changes to that data were appended during the day. Each night, the updates accumulated in the transaction file were applied to the master file by the nightly batch processing jobs (written in Job Control Language, natch), so that the master file was available with the updated data when you opened for business the next morning.
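For readers who never lived through that era, the master-file model boils down to something like this sketch (record layout and names invented for illustration): queries run against yesterday's master all day, while updates accumulate in a transaction file that the nightly batch job folds in.

```python
# Sketch of the old master-file/transaction-file model: updates are
# deferred to a nightly batch run, so daytime queries see stale data.

master = {"acct-1": 100, "acct-2": 250}   # consistent as of this morning
transaction_file = []                     # appended to during the day

transaction_file.append(("acct-1", +40))  # daytime updates are deferred
transaction_file.append(("acct-2", -50))

# A daytime query sees yesterday's value, not today's changes.
assert master["acct-1"] == 100

# The nightly batch job: apply every queued change, then start fresh.
for acct, delta in transaction_file:
    master[acct] = master.get(acct, 0) + delta
transaction_file.clear()

assert master == {"acct-1": 140, "acct-2": 200}
```

Notice that the transaction file is itself just a log, which is exactly Kreps's point: the pattern never went away, only the processing cadence changed.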
During the 1980's, there was a worldwide transition to more powerful systems, which could process the updates in "real time," supporting both queries and updates simultaneously, so that the results of your change were visible to all as soon as you committed the change to the system.
This was a huge breakthrough, and so, at the time, you would frequently see references to "online" DBMS systems, to distinguish these newer systems from the older master file systems, but that terminology has been essentially jettisoned in the last 15 years, as all current Relational Database Management Systems are "online" in this sense.
Nowadays, those bad old days of stale data are mostly forgotten, as almost all database systems support a single view of accurate data, but you still hear echoes of that history in the natterings-on about NoSQL databases, and "eventual consistency", and the like.
Because, of course, building distributed systems is hard, and requires tradeoffs. But no application designer actually prefers to run queries that get the wrong answer; the only reason to accept that is "it's too expensive for us right now to build a system that always gives you the right answer." So I predict that "eventual consistency" will eventually go the way of "master file" in computing, and somebody 30 years from now will look at these times and wistfully remember what it was like to be a young programmer when we old greybeards were struggling with such issues.
Anyway, back to Kreps and his discussion of "Real-time Stream Processing". After describing the need for consuming data in a continuous fashion, and relating tales of woe involving a procession of commercial vendors who promised to solve his problems but actually only provided weaker systems, Kreps notes that LinkedIn now use their logging infrastructure to keep all their various systems up to date in, essentially, real time.
The only natural way to process a bulk dump is with a batch process. But as these processes are replaced with continuous feeds, one naturally starts to move towards continuous processing to smooth out the processing resources needed and reduce latency.
LinkedIn, for example, has almost no batch data collection at all. The majority of our data is either activity data or database changes, both of which occur continuously.
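The contrast in that quote is simple to demonstrate. In this sketch of mine (event names invented), a batch job recomputes a derived result from a bulk dump, while a continuous processor folds each event into the same derived state as it arrives; the answers agree, but the continuous version is fresh at every moment.

```python
# Sketch: batch vs. continuous computation of the same derived data.
from collections import Counter

events = ["click", "view", "click", "view", "view"]

# Batch: wait for the full dump, then compute.
def batch_counts(dump):
    return Counter(dump)

# Continuous: fold each event into the derived state as it arrives.
class ContinuousCounter:
    def __init__(self):
        self.counts = Counter()

    def consume(self, event):
        self.counts[event] += 1  # derived data stays up to date

stream = ContinuousCounter()
for e in events:
    stream.consume(e)          # state is queryable after every event

assert stream.counts == batch_counts(events)
```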
It's interesting to see that LinkedIn have open-sourced this part of their system: Apache Kafka is "publish-subscribe messaging rethought as a distributed commit log," and Apache Samza is a distributed stream processing framework.
They've also done a great job in providing lots of additional documentation and design information. For example, one of the basic complications about maintaining transaction logs is that you have to deal with the "garbage collection problem", so it's great to see that the Kafka team have documented their approach.
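The essence of that garbage-collection approach, stripped down to a sketch (this simplification is mine; Kafka's real log compaction also deals with tombstones, segment files, and much more): for a keyed changelog, only the latest record per key is needed to rebuild current state, so earlier records for the same key can be discarded.

```python
# Sketch: key-based log compaction. Replaying the compacted log yields
# the same final state as replaying the full log.

def compact(log):
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = offset          # later records shadow earlier ones
    keep = set(latest.values())
    return [rec for off, rec in enumerate(log) if off in keep]

log = [("user-1", "a@x"), ("user-2", "b@x"), ("user-1", "a@y")]
compacted = compact(log)

# The stale user-1 record is dropped; final state is unchanged.
assert compacted == [("user-2", "b@x"), ("user-1", "a@y")]
assert dict(log) == dict(compacted)
```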
Kreps closes his article with some speculation about systems of the future, noting that the trends are clear:
if you squint a bit, you can see the whole of your organization's systems and data flows as a single distributed database. You can view all the individual query-oriented systems (Redis, SOLR, Hive tables, and so on) as just particular indexes on your data. You can view the stream processing systems like Storm or Samza as just a very well-developed trigger and view materialization mechanism. Classical database people, I have noticed, like this view very much because it finally explains to them what on earth people are doing with all these different data systems—they are just different index types!
Yes, exactly so! The observation that many different data types can be handled by a single database system is indeed one of the key insights of DBMS technology. The related observation that many different kinds of data (finance data, order processing data, multimedia data, design documents) can benefit from change management, configuration tracking, and historical auditing is one of the key insights from the realm of version control systems. And the idea that both abstractions find their basis in the underlying transaction log is one of the key insights of Kreps's article.
Kreps suggests that this perspective offers the system architect a powerful tool for organizing their massive distributed systems:
Here is how this works. The system is divided into two logical pieces: the log and the serving layer. The log captures the state changes in sequential order. The serving nodes store whatever index is required to serve queries (for example a key-value store might have something like a btree or sstable, a search system would have an inverted index). Writes may either go directly to the log, though they may be proxied by the serving layer. Writing to the log yields a logical timestamp (say the index in the log). If the system is partitioned, and I assume it is, then the log and the serving nodes will have the same number of partitions, though they may have very different numbers of machines.
The serving nodes subscribe to the log and apply writes as quickly as possible to its local index in the order the log has stored them.
The client can get read-your-write semantics from any node by providing the timestamp of a write as part of its query—a serving node receiving such a query will compare the desired timestamp to its own index point and if necessary delay the request until it has indexed up to at least that time to avoid serving stale data.
The serving nodes may or may not need to have any notion of "mastership" or "leader election". For many simple use cases, the serving nodes can be completely without leaders, since the log is the source of truth.
You'll have to spend a lot of time thinking about this, as this short four-paragraph description covers an immense amount of ground. But I think it's a valid and important observation, and Kreps backs it up with the weight of his years of experience building such systems.
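As a study aid, here is a minimal sketch of the log/serving-layer split described in those four paragraphs (all class and method names are my own inventions): writes go to the log and receive a logical timestamp, a serving node builds its index by subscribing to the log, and read-your-writes is achieved by refusing to answer until the node has indexed past the timestamp the client saw for its own write.

```python
# Sketch: a log plus one serving node, with read-your-writes semantics
# keyed on the logical timestamp (the log index) of each write.

class LogNode:
    def __init__(self):
        self.entries = []

    def write(self, key, value):
        self.entries.append((key, value))
        return len(self.entries) - 1      # logical timestamp = log index

class ServingNode:
    def __init__(self, log):
        self.log = log
        self.index = {}                   # the "particular index": a k/v map
        self.applied = 0                  # how far into the log we've indexed

    def catch_up(self, up_to=None):
        target = len(self.log.entries) if up_to is None else up_to + 1
        while self.applied < target:
            key, value = self.log.entries[self.applied]
            self.index[key] = value
            self.applied += 1

    def read(self, key, min_timestamp):
        # Read-your-writes: don't serve until we've indexed past the
        # client's write. (Here we catch up synchronously; a real node
        # would delay the request, as Kreps describes.)
        if self.applied <= min_timestamp:
            self.catch_up(up_to=min_timestamp)
        return self.index.get(key)

log = LogNode()
node = ServingNode(log)

ts = log.write("k", "v1")          # client writes, remembers the timestamp
assert node.read("k", ts) == "v1"  # node catches up before answering
```

Note that the serving node needs no notion of leadership here: the log alone is the source of truth, just as the quoted passage says.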
I would suggest that Kreps's system architecture can be further distilled, to this basic precept:
- Focus on the data, not on the logic. The logic will emerge when you understand the data.
That, in the end, is what 50 years of building database systems, version control systems, object-oriented programming languages, and the like, teaches us.
Finally, Kreps closes his article with a really super set of "further reading" references. I wish more authors would do this, and I really thank Kreps for taking the time to put together these links, since there were several here I hadn't seen before.
Although I began my career in database systems, where I worked with traditional recovery logs in great detail, designing and implementing several such systems, I've spent the last third of my career in distributed systems, where I spend most of my time working with replicated services, kept consistent using asynchronous log shipping.
This means I've spent more than three decades working with, implementing, and using logs.
So I guess that means I was the perfect target audience for Kreps's article; Kreps and I could be a pair of those "separated at birth" twins, since it sounds like he spends all his time thinking about, designing, and building the same set of distributed system infrastructure facilities that I do.
But it also means that it's an article with real practical merit.
People devote their careers to this.
Entire organizations are formed to implement software like this.
If you're a software engineer, and you hear developers in your organization talking about logs, replication, log-shipping, or similar topics, and you want to learn more, you should spend some serious time with Kreps's meaty article: read and re-read it; follow all the references; dissect the diagrams; discuss it with your colleagues.
It's worth it; it's an epic software engineering article.