Pages

Friday, February 28, 2014

Names and values

One of the greatest educators in the Computer Science field, Niklaus Wirth, has reached his eightieth birthday, which was reason for a recent celebration.

Among many other things, Professor Wirth is the subject of a very old, very dumb, Computer Science joke; when asked how to pronounce his name, he was reported to have replied:

If you call me by name, it is VEERT.

If you call me by value, it is WORTH.

The joke of course refers to one of the basic techniques used in programming languages, which is exhaustively explained by this tedious Wikipedia page.

I rather prefer the joke.

Anyway, I was thinking recently about a completely different aspect of names versus values, which arises in the work I do at my day job.

When Version Control Systems keep track of objects, they pay particular attention to two aspects of the object:

  1. The NAME of the object, e.g.
    derby/trunk/java/engine/org/apache/derby/iapi/util/Operator.java
  2. The CONTENT of the object, e.g.
    package org.apache.derby.iapi.util;
    
    /**
            Provides an interface for an operator that operates on a range of objects
            E.g in a cache.
    */
    public interface Operator {
    
            /**
                    Operate on an input object
            */
            public void operate(Object other);
    }
    

The name of the object is an obvious way to give you a handle to the object, and many of the operations that the Version Control System provides involve using the name:

  • What version of Operator.java is the most recent one?
  • What version of Operator.java was used in release 10.8?
  • Who has Operator.java checked out?
  • Which branches are there of Operator.java

But a really interesting aspect of Version Control Systems is that they are tools for helping you track and manage change.

And one of the things that can change about an object is its name.

That is, you can rename a file.

A lot of the original Version Control Systems, back in the 1970's and 1980's, didn't handle renaming a file very well. If you changed the name of a file, often the best you could do was to delete the file with the old name, and add the file back with the new name, leaving you with two different objects, and a gaping discontinuity in the history of the object.

And that's wrong, because there aren't two different objects; there is one object, and what you did was to change its name.

A high quality Version Control System takes care of these problems for you, and systems like git and Perforce have rich functionality in this area and have no problem understanding when a file is renamed.

But it's interesting that they do this in quite different ways.

Perforce maintains a databases of names. Each file name is a separate entry in its database, and it records things such as when two names refer to the same file (that is, when you have made a branch of a file), or when a file's name was changed (that is, when you have renamed a file). At the very lowest levels of the database, in fact, these things are the same, for branching a file and renaming a file are the same action as regards the NAME of the file; the difference is that when you branch a file, you continue creating new versions of both files (the file with the old name, and the file with the new name), whereas when you rename a file, that is the last version of the file with the old name, and all future versions of that file are versions of the file with the new name. And, of course, sometimes you rename a file and then you rename it back to its old name. Perforce can track any of those cases in its database equally well.

The point is that, in Perforce, the "primary key" for the object is its name. The name of the object is the primary way that Perforce refers to the objects, and it's the way it organizes the objects internally in its repository. (Try poking around in the server's filesystem for a while and you'll see what I mean.)

git takes a different approach, preferring to organize its database most naturally by creating a database of CONTENT. Quoting from Pro Git:

Git is a content-addressable filesystem. Great. What does that mean? It means that at the core of Git is a simple key-value data store. You can insert any kind of content into it, and it will give you back a key that you can use to retrieve the content again at any time.

...

You can see a file in the objects directory. This is how Git stores the content initially — as a single file per piece of content, named with the SHA-1 checksum of the content and its header.

...

The next type you’ll look at is the tree object, which solves the problem of storing the filename and also allows you to store a group of files together. Git stores content in a manner similar to a UNIX filesystem, but a bit simplified. All the content is stored as tree and blob objects, with trees corresponding to UNIX directory entries and blobs corresponding more or less to inodes or file contents. A single tree object contains one or more tree entries, each of which contains a SHA-1 pointer to a blob or subtree with its associated mode, type, and filename.

There's a fundamental duality here, which is at the heart of how both systems work. Objects have names, and objects have values, and sometimes you care about the name of the file, and sometimes you care about the content of the file.

Perforce tends to keep track of files by their name, which means that usually you start with the name of a file and then access its content, and occasionally Perforce has to go hunting through its database to find the other names of the file, such as when you are moving work from one branch to another and the file has been renamed in the one branch but not the other.

Git, on the other hand, tends to keep track of files by their content, which means that usually you start by knowing the SHA-1 of a file, which is how you access it, and if you want to know the name of the file Git will look that up for you. And if you want to rename a file, you don't have to tell Git that; it will just figure it out because it notices that the SHA-1 signature of the new object matches the SHA-1 signature of an object it already knows about.

This is part of the answer to some fundamental complaints about both systems. Since Perforce spends time maintaining a sophisticated database of file names, when you create a branch, that causes Perforce to create many new file names in its database (one new file name for each file in the branch, together with records which relate the old name to the new one), and thus in Perforce, "branches are expensive". For small branches, Perforce does this so fast that you never notice it, but for large branches you can detect the pause. (And if you add many thousands of branches with many millions of files in them, you may have to buy a bigger server.)

Meanwhile, since Git generally doesn't care about the name of the file, more about its content, creating a branch is very cheap, because Git just has a single operation to perform, no matter how many files are in the branch, so in Git, "branches are cheap."

But on the other hand, this mean that Git has to frequently figure out the SHA-1 of objects, so it spends a lot of time reading files and computing their SHA-1 signatures. For small files, Git does this so fast that you never notice it, but for large files the scanning and cryptography is significant, so in Git, "tracking large files is expensive."

Whereas in Perforce, since it only looks at the content once, when you submit the file, and then just tracks the file by its name, Perforce handles files of any size easily, and tracks gigantic files containing hundreds of gigabytes of content just as fast as it tracks tiny little Operator.java.

These sorts of dualities arise all the time in Computer Science, which, like any engineering field, is full of tradeoffs.

If you make one thing cheap to compute and fast to look up, you tend to have to make a tradeoff, such that other things are harder to compute and slower to look up.

You can't say that one decision is unambiguously "better"; you have to think about the tradeoffs, and why they were made, and what they mean about how to use the system effectively.

Or, you can just think about Professor Wirth, and whether you should call him by name, or call him by value.

No comments:

Post a Comment