Saturday, February 4, 2017

Big news in the world of source control

Source control software is no longer the center of my professional life.

But you don't just relinquish something that occupied a decade of your mental attention overnight.

So it was that I spent much of the last week obsessed by two significant developments in the SCM industry.

Firstly, there was the widely-reported system outage at major Git hosting provider GitLab.

GitLab has been one of the biggest success stories of the past few years. The company is not even four years old, but already has several hundred full-time employees, thousands of customers, and millions of users. Their growth rate has been astonishing, and they have received quite a bit of attention for their open (and unusual) business organization.

(I know a number of GitLab employees; they're wonderful people.)

Anyway, as befits rather an unusual company, they had a rather unusual outage.

Well, the outage itself was not that interesting.

For reasons that I think are not yet well understood, the GitLab site came under attack by miscreants:

At 2017/01/31 6pm UTC, we detected that spammers were hammering the database by creating snippets, making it unstable. We then started troubleshooting to understand what the problem was and how to fight it.

Then, while trying to defend against and recover from the attack, a harried and over-tired systems administrator made a very simple fumble-fingered mistake, typing a command interactively at the keyboard which deleted their primary production database (instead of deleting what he thought was the damaged spare standby database):

YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com

Ah, yes: an interactive "rm -rf pgdata". Been there, done that.
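
One cheap defense against this class of mistake is to make the destructive step confirm which host it is running on before it does anything. What follows is only a minimal sketch, not anything GitLab uses: the function name, its parameters, and the dry-run default are all my own assumptions for illustration.

```python
import shutil
import socket
from pathlib import Path


def remove_data_dir(path, expected_host, current_host=None, dry_run=True):
    """Remove `path` only if we are on the host we believe we are on.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Default to this machine's hostname; the parameter exists so the
    # check can be exercised without touching a real server.
    current_host = current_host or socket.gethostname()
    if current_host != expected_host:
        raise RuntimeError(
            f"refusing to remove {path}: on {current_host!r}, "
            f"expected {expected_host!r}"
        )
    if dry_run:
        # Report what would happen instead of deleting anything.
        return f"would remove {path} on {current_host}"
    shutil.rmtree(Path(path))
    return f"removed {path} on {current_host}"
```

Run on db1 with `expected_host="db2.cluster.gitlab.com"`, this raises instead of deleting. It does nothing for an administrator who types the wrong expected host too, but it does force the question to be asked.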

What was interesting about the outage was the way that this (quite unusual) company responded to the problem in a (quite unusual) way.

Almost as soon as the incident began, they created a public Google Docs document, live-streamed the event on their Twitter account, and shared their attempts to control and recover from it in real time, inviting the community and the public at large to contribute, assist, and understand what was going wrong and what they were doing to recover. You can read the document; it's fascinating.

Moreover, the incident resulted in a number of thoughtful and considered essays from various people. Because the GitLab incident specifically involved their Postgres database, some of the best analyses came from the Postgres community, such as this one: PG Phriday: Getting Back Up

Unfortunately, scenarios beyond this point is where process breaks down. What happens if we have a replica that falls behind and needs to be rebuilt? For all of its benefits, pg_basebackup still cannot (currently) skip unchanged files, or make small patches where necessary. Relying on it in this case would require erasing the replica and starting from scratch. This is where GitLab really ran into trouble.

Yet we started with synchronized files, didn’t we? Could we use rsync to “catch up”? Yes, but it’s a somewhat convoluted procedure. We would first need to connect to the upstream server and issue a SELECT pg_start_backup('my_backup') command so Postgres knows to archive transaction logs produced during the sync. Then after the sync is completed, we would need to stop the backup with SELECT pg_stop_backup(). Then we would have to make our own recovery.conf file, obtain all of the WAL files the upstream server archived, and so on.

None of that is something a system administrator will know, and it’s fiddly even to an experienced Postgres DBA. A mistake during any of that procedure will result in a non-functional or otherwise unsafe replica. All of that is the exact reason software like Barman exists. Supplied utilities only get us so far. For larger or more critical installations, either our custom scripts must flawlessly account for every failure scenario and automate everything, or we defer to someone who already did all of that work.
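
To make the order of that rsync procedure concrete, here is a rough sketch that only builds the command lines and runs nothing. The SQL function names are the ones the quoted post gives (newer PostgreSQL releases have renamed them). The host name, the data directory path, and the helper itself are my own assumptions. A real script would also need rsync excludes, error handling, and the recovery.conf and WAL steps, all of which are left out here.

```python
def catch_up_plan(upstream, replica_dir, label="my_backup"):
    """Return the ordered commands for an rsync-based replica catch-up.

    Hypothetical helper: it only describes the steps and executes nothing.
    """
    # Assumed location of the upstream data directory; this varies by install.
    source = f"{upstream}:/var/lib/postgresql/data/"
    return [
        # 1. Tell the upstream to archive WAL produced during the sync.
        ["psql", "-h", upstream, "-c", f"SELECT pg_start_backup('{label}')"],
        # 2. Copy only what differs between upstream and replica.
        ["rsync", "-a", "--delete", source, replica_dir],
        # 3. Stop the backup once the sync has completed.
        ["psql", "-h", upstream, "-c", "SELECT pg_stop_backup()"],
    ]


for step in catch_up_plan("db1.example.internal", "/srv/pgdata"):
    print(" ".join(step))
```

Even in this toy form the ordering matters. Skip step 3, or run step 2 before step 1, and you have exactly the sort of unsafe replica the post warns about.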

In my new day job, I'm learning about the extraordinary complexities of the operational aspects of cloud computing. For a relatively detailed exploration of the topic, there is nothing better than Google's Site Reliability Engineering book, which I wrote about a few months ago. But the topic is deep, and you can spend your entire career working in this area.

Meanwhile, in another corner of the git universe, the annual Git Merge conference is underway, and there is a lot of news there as well, including major improvements to Git LFS, and a detailed report from Facebook (who continue to use their custom version of Mercurial in preference to git).

But the big announcement, truly the centerpiece of the entire conference, came from, of all places, Microsoft:

Over the last year, we have continued to invest in Git and have lots of exciting information to share about the work we’ve done to embrace Git across the company, for teams of any size. During this talk, we plan to discuss, in depth, how we are using git internally with a specific focus on large repositories. We’ll discuss the architecture of VSTS’s git server which is built on Azure and the customizations we’ve had to make to both it and git.exe in order to enable git to scale further and further. These customizations will cover changes that we’ve contributed back to the git core open source project as well as changes that we haven’t talked about externally yet. We’ll also lay out a roadmap for the next steps that we plan to take to deal with repositories that are significantly larger than git can scale to today.

Well, that was rather a tease. So what was that "exciting information" that Microsoft promised?

Here it is: Scaling Git (and some back story).

As Brian Harry observes, to tell this story properly, you have to back up in time a bit:

We had an internal source control system called Source Depot that virtually everyone used in the early 2000’s.

(I think it's no secret; heck, it's even posted on Wikipedia, so: Source Depot is a heavily-customized version of Perforce.)

And, as Harry notes, the Microsoft Source Depot instances are the single biggest source code repositories on the planet, significantly bigger than well-known repositories such as Google's:

There aren’t many companies with code bases the size of some of ours. Windows and Office, in particular (but there are others), are massive. Thousands of engineers, millions of files, thousands of build machines constantly building it, quite honestly, it’s mind boggling. To be clear, when I refer to Window in this post, I’m actually painting a very broad brush – it’s Windows for PC, Mobile, Server, HoloLens, Xbox, IOT, and more.

I happen to have been up-code and personal with this code base, and yes: it's absolutely gigantic, and it's certainly the most critical asset that Microsoft owns.

So making the decision to change the tool that they used for this was no easy task:

TFVC and Source Depot had both been carefully optimized for huge code bases and teams. Git had *never* been applied to a problem like this (or probably even within an order of magnitude of this) and many asserted it would *never* work.

The first big debate was – how many repos do you have – one for the whole company at one extreme or one for each small component? A big spectrum. Git is proven to work extremely well for a very large number of modest repos so we spent a bunch of time exploring what it would take to factor our large codebases into lots of tenable repos. Hmm. Ever worked in a huge code base for 20 years? Ever tried to go back afterwards and decompose it into small repos? You can guess what we discovered. The code is very hard to decompose. The cost would be very high. The risk from that level of churn would be enormous. And, we really do have scenarios where a single engineer needs to make sweeping changes across a very large swath of code. Trying to coordinate that across hundreds of repos would be very problematic.

In SCM circles, this is known as "the monorepo problem."

The monorepo problem is the biggest reason why most truly large software engineering organizations have struggled to move to git. Some organizations, such as Google's Android team, have built massive software toolchains around git, but even so the results are still unsatisfactory (and astonishingly expensive in both human and computer resources).

Microsoft, of course, were fully aware of this situation, so what did they do? Well, let's switch over to Saeed Noursalehi: Announcing GVFS (Git Virtual File System)

Today, we’re introducing GVFS (Git Virtual File System), which virtualizes the file system beneath your repo and makes it appear as though all the files in your repo are present, but in reality only downloads a file the first time it is opened. GVFS also actively manages how much of the repo Git has to consider in operations like checkout and status, since any file that has not been hydrated can be safely ignored. And because we do this all at the file system level, your IDEs and build tools don’t need to change at all!

In a repo that is this large, no developer builds the entire source tree. Instead, they typically download the build outputs from the most recent official build, and only build a small portion of the sources related to the area they are modifying. Therefore, even though there are over 3 million files in the repo, a typical developer will only need to download and use about 50-100K of those files.

With GVFS, this means that they now have a Git experience that is much more manageable: clone now takes a few minutes instead of 12+ hours, checkout takes 30 seconds instead of 2-3 hours, and status takes 4-5 seconds instead of 10 minutes.
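
Those numbers are worth turning into ratios. This is just back-of-the-envelope arithmetic on the figures quoted above. I've left clone out because "a few minutes" and "12+ hours" are too loose to divide, and since the repo has "over 3 million" files, the working-set fractions are upper bounds.

```python
def speedup(before_seconds, after_seconds):
    """How many times faster the operation became."""
    return before_seconds / after_seconds


# checkout: 2-3 hours before, 30 seconds after
checkout = (speedup(2 * 3600, 30), speedup(3 * 3600, 30))

# status: 10 minutes before, 4-5 seconds after
status = (speedup(10 * 60, 5), speedup(10 * 60, 4))

# working set: 50-100K files out of (at least) 3 million
working_set = (50_000 / 3_000_000, 100_000 / 3_000_000)

print(f"checkout: {checkout[0]:.0f}x to {checkout[1]:.0f}x faster")
print(f"status:   {status[0]:.0f}x to {status[1]:.0f}x faster")
print(f"files touched: {working_set[0]:.1%} to {working_set[1]:.1%} of the repo")
```

So checkout is a couple of hundred times faster, status more than a hundred times faster, and a typical developer touches only a few percent of the tree.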

As Harry points out, this is very, very complex engineering, and it required solving some very tricky problems:

The file system driver basically virtualizes 2 things:
  1. The .git folder – This is where all your pack files, history, etc. are stored. It’s the “whole thing” by default. We virtualized this to pull down only the files we needed when we needed them.
  2. The “working directory” – the place you go to actually edit your source, build it, etc. GVFS monitors the working directory and automatically “checks out” any file that you touch making it feel like all the files are there but not paying the cost unless you actually access them.

As we progressed, as you’d imagine, we learned a lot. Among them, we learned the Git server has to be smart. It has to pack the Git files in an optimal fashion so that it doesn’t have to send more to the client than absolutely necessary – think of it as optimizing locality of reference. So we made lots of enhancements to the Team Services/TFS Git server. We also discovered that Git has lots of scenarios where it touches stuff it really doesn’t need to. This never really mattered before because it was all local and used for modestly sized repos so it was fast – but when touching it means downloading it from the server or scanning 6,000,000 files, uh oh. So we’ve been investing heavily in is performance optimizations to Git. Many of them also benefit “normal” repos to some degree but they are critical for mega repos.
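
The core idea, download on first touch and ignore everything untouched, fits in a few lines as a toy model. To be clear, this is an in-memory illustration I made up, not how GVFS is implemented (the real thing is a file system driver). The class and method names are hypothetical.

```python
class LazyWorkingTree:
    """Toy model of a virtualized working directory."""

    def __init__(self, remote_files):
        # Stands in for the server: path -> contents.
        self._remote = dict(remote_files)
        # Files that have actually been fetched ("hydrated").
        self._hydrated = {}
        self.downloads = 0

    def listdir(self):
        # Every file appears to be present, hydrated or not.
        return sorted(self._remote)

    def open(self, path):
        # Fetch the contents the first time only; later opens are local.
        if path not in self._hydrated:
            self._hydrated[path] = self._remote[path]
            self.downloads += 1
        return self._hydrated[path]

    def status_candidates(self):
        # Only hydrated files can have local changes, so a status-like
        # scan can skip everything else.
        return sorted(self._hydrated)


tree = LazyWorkingTree({"a.c": "int a;", "b.c": "int b;", "c.c": "int c;"})
tree.open("b.c")
tree.open("b.c")
print(tree.listdir(), tree.downloads, tree.status_candidates())
```

Three files are visible, one was downloaded once, and only that one has to be considered by a status scan. Scale the dictionary up to millions of entries and you can see where the savings come from.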

But even more remarkably, Microsoft is GIVING THIS AWAY TO THE WORLD:

While GVFS is still in progress, we’re excited to announce that we are open sourcing the client code at https://github.com/Microsoft/gvfs. Feel free to give it a try, but please be aware that it still relies on a pre-release file system driver. The driver binaries are also available for preview as a NuGet package, and your best bet is to play with GVFS in a VM and not in any production environment.

In addition to the GVFS sources, we’ve also made some changes to Git to allow it to work well on a GVFS-backed repo, and those sources are available at https://github.com/Microsoft/git. And lastly, GVFS relies on a protocol extension that any service can implement; the protocol is available at https://github.com/Microsoft/gvfs/blob/master/Protocol.md.

Remember, this is Microsoft.

And, this is git.

But, together, it just all changed:

So, fast forward to today. It works! We have all the code from 40+ Windows Source Depot servers in a single Git repo hosted on VS Team Services – and it’s very usable. You can enlist in a few minutes and do all your normal Git operations in seconds. And, for all intents and purposes, it’s transparent. It’s just Git. Your devs keep working the way they work, using the tools they use. Your builds just work. Etc. It’s pretty frick’n amazing. Magic!

As a side effect, this approach also has some very nice characteristics for large binary files. It doesn’t extend Git with a new mechanism like LFS does, no turds, etc. It allows you to treat large binary files like any other file but it only downloads the blobs you actually ever touch.

To say that this is a sea change, a complete reversal of everything you might possibly have expected, is certainly understating the case.

Let's review:

  • Microsoft showed up at one of the largest open-source-free-software-community professional conferences
  • To talk about their work using the open source community's dearest-to-the-heart tool (git)
  • And not only did Microsoft not disparage the tool, they actively celebrated it
  • And added a massive, massive new feature to it
  • And revealed that, as of now and with that feature, they're actually using git themselves, for ALL of their own tens of thousands of users, on the LARGEST source base on the planet, in a SINGLE monolithic git repo
  • And gave that tool away, back to the open source community, on GitHub!

So, anyway, this is just a little corner of the industry (even if it did consume an entire third of my professional career, and spawn the largest IPO of the last 2 years).

But, for those of you who care, take notice: the entire SCM industry just changed today.
