Journal of a Programmer: 2011

Saturday, December 31, 2011

The Disneyland art of Claude Coats

I'm just old enough to remember growing up in Orange County (well, Whittier, but it was the same thing), in the early 1970's, just as Disneyland was building out and completing the most important of the signature rides: Pirates of the Caribbean, Submarine Voyage, and of course the Haunted Mansion.

So I loved this fabulous essay about the artist Claude Coats, who started his career in the movie-making side of Walt Disney Studios (Pinocchio, Fantasia), then made the transition to artist and set designer at Disneyland: Long-Forgotten: Claude Coats: The Art of Deception and the Deception of Art.

The essayist makes the point that the skill of the background painter is to create the world that the other artists will fill with music, animation, and story:

You the viewer are invited to imagine yourself on the other side of the frame (the opposite dynamic of Davis). You see yourself as the character in the landscape. Though never intended for public display, those sketches are among the most beautiful and seductive examples of Mansion art. Who wouldn't want go exploring in this?

I love the description of the Disneyland ride as a voyage through a painting:

As you look at some of those Coats backgrounds up above, like Gepetto's cottage and the Sorcerer's Apprentice interiors, you almost wish you could step into them and look around, so inviting are they. With Rainbow Caverns, Coats finally enabled you do just that: ride right through one of his moody, atmospheric paintings.

And of course, this is the basis of the oft-remarked "suspension of disbelief":

The whole drawing depicts a dissolve between there and here, inside and outside, human artifice and wild nature. This is not an exit point for characters stepping over into our presence; this is a place that invites you to enter.

The essay is filled with gorgeous sketches and paintings, so go have a look!

Wednesday, December 28, 2011

Prince of Persia: still alive after 25 years!

I've been having fun looking through the notes put together by "mrsid", a programmer who took up the challenge of re-implementing the classic Apple II game "Prince of Persia", by reverse-engineering a running copy of the game, while simultaneously reading Jordan Mechner's original diary and design notes:

In the meantime I found Jordan Mechner's blog. He had the courage and insight to post all of his old journals from the 1980s. He meticulously kept a log of his daily work. What a great read that was. Just a few days before I started looking for Prince of Persia information Jordan also posted this article on his blog. It contained a link to a PDF, which turned out to be the Prince of Persia source code documentation.
I was amazed. The source was lost on Apple II disks, but the document written just a few days after the release in 1989 was there, with all kinds of juicy little details about the graphics engine, the data structures, lists of images, and more. It was like someone had handed me the key to a long lost treasure.

Mrsid presents the notes as follows:

Hopefully there will be more essays posted in the future, as the complete notes are not yet available.

In the meantime, though, it's great fun to read Jordan's original notes, as well as mrsid's reverse engineering analysis, and it's also quite cool to see the discussion back-and-forth between the two of them in the comments on the blog.

I still vividly remember my eldest daughter playing Prince of Persia in the early 1990's on our fresh new Mac IIsi -- my how she loved that game, and how we loved that computer!

Monday, December 26, 2011

It's not just a game ...

... it's a new way to celebrate the holidays.

Saturday, December 24, 2011

Sometimes I think I understand computers ... sometimes not

I spent 4 hours trying to set things up so that my Windows 7 laptop could print to a USB-attached printer on my Ubuntu Linux desktop.

Most of it went pretty easily: ensure that CUPS and Samba were installed and configured on the Linux machine, and verify that the Samba configuration allowed printer sharing.

But then, no amount of fiddling with the Add Printer wizard on the Windows 7 machine succeeded.

Finally, this weird sequence worked:

Choose Add Printer
Choose Add a local printer. Ignore all the warnings about how you should only do this if you have a locally-attached, non-USB cabled printer. :)
Choose Create a new port.
Choose Local Port. Click Next.
When prompted to enter a port name, type in \\computername\printername

It is so weird that in order to print to a printer on another machine, you have to (a) tell Windows to define a locally-attached printer, and then (b) stuff a remote machine's network address into the 'local port' field.

But hey, it worked...

Great investigation of a Google synonym query

This in-depth exploration of an unexpected Google query result is fascinating.

But that’s the thing, what seems easy and straightforward to us is actually quite difficult for a machine.

Indeed.

DVCS and change authenticity

In the world of version control, distributed version control systems such as Git and Mercurial are all the rage.

These systems are indeed extremely powerful, but they all suffer from a fundamental issue, which is how the various nodes in the distributed system can establish the necessary trust to verify authenticity of push and pull requests.

(Disclosure: at my day job, we make a version control system, which has a centralized architecture and a wholly different trust and authentication mechanism. So I'm more than just an interested observer here.)

Now, this issue has been known and discussed for quite some time, but it has acquired greater urgency this fall after a fairly significant compromise of the main Linux kernel systems. As Jonathan Corbet notes in that article:

We are past the time where kernel developers are all able to identify each other. Locking down kernel.org to the inner core of the development community would not be a good thing; the site is there for the whole community. That means there needs to be a way to deal with mundane issues like lost credentials without actually knowing the people involved.

The emerging proposal to deal with this problem includes several new features in Git:

Signed commits
and Pulling Signed Tags,
both of which are now operational in the development mainline of the Git trunk.
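
To see why signing a single tag can vouch for everything behind it, here is a toy sketch in Python. It is not Git's real object format: the `commit_id`, `sign`, `verify`, and `build_history` helpers are hypothetical names of mine, SHA-256 stands in for Git's hashing, and an HMAC with a shared secret stands in for a GPG public-key signature. The only idea it tries to capture is that each commit id depends on its parents' ids, so a signature over the tip covers the whole history.

```python
import hashlib
import hmac


def commit_id(content, parent_ids):
    """Toy commit id: a hash over the parents' ids plus this commit's content."""
    h = hashlib.sha256()
    for parent in parent_ids:
        h.update(b"parent " + parent.encode() + b"\n")
    h.update(content.encode())
    return h.hexdigest()


def build_history(contents):
    """Build a linear history and return the id of its newest commit."""
    parent_ids = []
    tip = None
    for content in contents:
        tip = commit_id(content, parent_ids)
        parent_ids = [tip]
    return tip


def sign(key, tip_id):
    """Stand-in for a GPG signature (assumption: HMAC with a shared key)."""
    return hmac.new(key, tip_id.encode(), hashlib.sha256).hexdigest()


def verify(key, tip_id, signature):
    """Check a stand-in signature against a tip id."""
    return hmac.compare_digest(sign(key, tip_id), signature)


KEY = b"hypothetical-shared-secret"

# The contributor signs the tip of an honest history.
honest_tip = build_history(["add driver", "fix locking", "update docs"])
tag_signature = sign(KEY, honest_tip)

# Someone rewrites the oldest commit; every later id changes with it.
tampered_tip = build_history(["add backdoor", "fix locking", "update docs"])
```

Checking `verify(KEY, tampered_tip, tag_signature)` fails even though only the oldest commit was altered, which is the property the signed-tag workflow relies on. What the sketch cannot show is the hard part discussed below: deciding whose keys to trust in the first place.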

I suspect that this problem is a deep and hard and fundamental one. It seems to me that the DVCS infrastructure is building a fairly complex mechanism: here's how Linus will use this technology to ensure the integrity of the Linux kernel, as described by Junio Hamano (the lead Git developer):

To make the whole merge fabric more trustworthy, the integration made by his lieutenants by pulling from their sub-lieutenants need to be made verifyable the same way, which would (1) make the number of signed tags even larger and (2) make it more likely somebody in the foodchain gets lazy and refuses to push out the signed tags after he or she used them for their own verification.

But reading this description, I'm instantly reminded of a very relevant observation made by Moxie Marlinspike in the context of the near-complete collapse of the SSL Certificate Authority chain of trust this spring:

Unfortunately the DNSSEC trust relationships depend on sketchy organizations and governments, just like the current CA system.
Worse, far from providing increased trust agility, DNSSEC-based systems actually provide reduced trust agility. As unrealistic as it might be, I or a browser vendor do at least have the option of removing VeriSign from the trusted CA database, even if it would break authenticity with some large percentage of sites. With DNSSEC, there is no action that I or a browser vendor could take which would change the fact that VeriSign controls the .com TLD.
If we sign up to trust these people, we're expecting them to willfully behave forever, without any incentives at all to keep them from misbehaving. The closer you look at this process, the more reminiscent it becomes. Sites create certificates, those certificates are signed by some marginal third party, and then clients have to accept those signatures without ever having the option to choose or revise who we trust. Sound familiar?

I'm not saying I have the answer; indeed, the very smartest programmers on the planet are struggling intensely with this problem. It's a very hard problem. As the researchers at the EFF recently noted:

As currently implemented, the Web's security protocols may be good enough to protect against attackers with limited time and motivation, but they are inadequate for a world in which geopolitical and business contests are increasingly being played out through attacks against the security of computer systems.

Returning to the world of DVCS for a moment: I've felt all along that the fundamental weakness of these systems was going to turn out to be their weak authenticity guarantees; indeed, this is the core reason that organizations like Apache have been very reluctant to open their infrastructure up to DVCS-style source control, even given all its other advantages.

And it seems like the people who are trying to repair the Certificate Authority technology are also skeptical that a 100% distributed solution can be effective; as Adam Langley says:

We are also sacrificing decentralisation to make things easy on the server. As I've previously argued, decentralisation isn't all it's cracked up to be in most cases because 99.99% of people will never change any default settings, so we haven't given up much. Our design does imply a central set of trusted logs which is universally agreed. This saves the server from possibly having to fetch additional audit proofs at runtime, something which requires server code changes and possible network changes.

And the EFF's Sovereign Keys proposal has a similar semi-centralization aspect:

Master copies of the append-only data structure are kept on machines called "timeline servers". There is a small number, around 10-20, of these. The level of trust that must be placed in them is very low, because the Sovereign Key protocol is able to cryptographically verify the important functions they perform. Sovereign Keys are preserved so long as at least one server has remained good. For scalability, verification, and privacy purposes, lots of copies of the entire append-only timeline structure are stored on machines called "mirrors".

With the new Git technology, as I understand it, the user who accepts a pull request from a remote repository now faces a new challenge:

The integrator will see the following in the editor when recording such a merge:

The one-liner merge title (e.g 'Merge tag rusty-for-linus of git://.../rusty.git/');

The message in the tag object (either annotated or signed). This is where the contributor tells the integrator what the purpose of the work contained in the history is, and helps the integrator describe the merge better;

The output of GPG verification of the signed tag object being merged. This is primarily to help the integrator validate the tag before he or she concludes the pull by making a commit, and is prefixed by '#', so that it will be stripped away when the message is actually recorded; and

The usual "merge summary log", if 'merge.log' is enabled.
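
The detail about the '#' prefix is easy to picture with a small sketch. This is Python, not Git's own code, and `recorded_message` is a hypothetical helper; the editor text is a placeholder built from the quoted description, not real Git or GPG output. It only illustrates the idea that comment lines are visible to the integrator in the editor but dropped from the message that gets recorded.

```python
def recorded_message(editor_text):
    """Drop every line starting with '#', keeping the rest as the message.

    Simplified assumption: Git's real message cleanup does more than this.
    """
    kept = [line for line in editor_text.splitlines()
            if not line.startswith("#")]
    return "\n".join(kept).strip() + "\n"


# Placeholder editor contents, following the four items listed above.
EDITOR_TEXT = """\
Merge tag rusty-for-linus of git://.../rusty.git/

(message from the tag object would appear here)

# (output of GPG verification would appear here)
# (more GPG output)
"""

message = recorded_message(EDITOR_TEXT)
```

The integrator sees the verification output while deciding whether to conclude the merge, but the recorded history keeps only the title and the contributor's description.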

This will be a challenging task to require of all developers in this chain of trust. Is it feasible? One thing is for sure: the Git team are to be commended for facing this problem head on, for openly discussing it, and for trying to push the problem forward. It is exciting to watch them struggle with the issues, and I've learned an immense amount from reading their discussions.

So I think it will be very interesting to see how the Git team fares with this problem, as they, too, have some wonderfully talented people at work on the problems.

Friday, December 23, 2011

Holiday weekend link dump

As always, apologies in advance for the link dump. It's just been so busy recently that I haven't found the time to explore things in detail.

Still, if you're looking for a few holiday-weekend things to read, try these:

If you haven't been paying much attention to the Carrier IQ controversy, Rich Kulawiec over at TechDirt has a great summary of what's been going on, with an amazing number of links to chase and study. As Kulawiec puts it:
Debate continues about whether Carrier's IQ is a rootkit and/or spyware. Some have observed that if it's a rootkit, it's a rather poorly-concealed one. But it's been made unkillable, and it harvests keystrokes -- two properties most often associated with malicious software. And there's no question that Carrier IQ really did attempt to suppress Eckhart's publication of his findings.
But even if we grant, for the purpose of argument, that it's not a rootkit and not spyware, it still has an impact on the aggregate system security of the phone: it provides a good deal of pre-existing functionality that any attacker can leverage. In other words, intruding malware doesn't need to implement the vast array of functions that Carrier IQ already has; it just has to activate and tap into them.
Many of us may be taking some holiday break, but the busy beavers at CalTrans are embarking on the final major step of the Bay Bridge reconstruction: threading the suspension cable over the top of the main bridge tower.
The cable is 2.6 feet in diameter and nearly a mile long. It weighs 5,291 tons, or nearly 10.6 million pounds, and is made up of 137 steel strands, each one composed of 127 steel wires.
The strands will go up to the top of the center tower and down to the San Francisco side of the span, where they will be looped underneath the deck of the bridge, then threaded back up to the tower and back down to the Oakland side of the bridge. There, crews will anchor the other end of the strands.

I found it interesting to read about how they finish the cable installation:

Ney said it will take a few months to complete the installation. Once all the strands are installed, crews will bind them together and coat them with zinc paste.
I'm familiar with the notion of sacrificial zinc anodes in sailboats, where the zinc is used to avoid destruction of a more valuable metal part (such as your stainless steel propeller). Is the zinc paste on the cable used for the same purpose?
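
The cable figures quoted above are easy to sanity-check. A quick sketch, assuming the article's "tons" are US short tons of 2,000 pounds (my assumption; the article doesn't say):

```python
STRANDS = 137              # from the quoted article
WIRES_PER_STRAND = 127     # from the quoted article
CABLE_TONS = 5291          # from the quoted article
POUNDS_PER_SHORT_TON = 2000  # assumption: US short tons

total_wires = STRANDS * WIRES_PER_STRAND
total_pounds = CABLE_TONS * POUNDS_PER_SHORT_TON

print(total_wires)   # 17399 individual steel wires
print(total_pounds)  # 10582000 pounds, i.e. "nearly 10.6 million"
```

So the "nearly 10.6 million pounds" figure is consistent with short tons, and the cable bundles a bit over seventeen thousand wires.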
The bubble is back! Everywhere you look, there is article after article after article after article about the desperate competition for software engineers that's underway right now.
Of course, at least part of the problem is that it's still the case that, all too frequently, people who think they can program actually can't. Unlike many people, I'm not in a hurry to blame our education system for this. I think programming is very hard, and it doesn't surprise me that there's a high failure rate. Some would say that anybody can learn to program, but I think there's a real underlying talent at issue, and just like I would be a lousy lawyer or a lousy surgeon or a lousy soprano, some people will have more aptitude for writing software than others.
Anyway, I can attest to a fair amount of the insanity, though happily I'm pretty well insulated from it. But we have clearly entered into a very exciting new time in software, with a variety of hot technologies, such as cloud computing and mobile applications, providing the fuel. Though I think Marc Andreessen may be a bit too giddy about the prospects for it all, I have to agree with his assessment that:
Six decades into the computer revolution, four decades since the invention of the microprocessor, and two decades into the rise of the modern Internet, all of the technology required to transform industries through software finally works and can be widely delivered at global scale.

A large part of that transformational technology is, once again, being driven by Amazon. As they continue releasing new features at breakneck speed, Wired magazine takes a step back and wonders: what does it mean that Amazon Builds World’s Fastest Nonexistent Supercomputer. I remember my first introduction to system virtualization, when I got to use IBM's VM software back in the early 1980's; it definitely takes a while to get your head around what's really going on here!
Moving on to something entirely unrelated to software, in the world of professional football, the dispute between Uruguayan superstar Luis Suarez and French superstar Patrice Evra has been gathering a lot of attention. My friend Andrew has reprinted a well-phrased essay on the topic, which is well worth reading.
Professional sports is of course mostly entertainment, yet somehow it is more than that; it is undeniably one of the largest parts of modern life. On that note, a recent issue of The New Yorker carries a fine review of the life of Howard Cosell, who played a major part in the development of professional sports into one of America's major passions.
On Monday nights, Cosell called out players for their mistakes with orotund rhetoric and moral high dudgeon. Just as music fans used to go to the New York Philharmonic to watch Leonard Bernstein's gymnastics more than to hear, yet again, Beethoven's Fifth, people tuned in to hear -- and howl at -- Cosell. Even if you loathed him, his performance was what made Monday nights memorable.

And lest you think that this is just something minor, don't miss this wonderful article in The Economist: Little red card: Why China Fails at Football.
Solving the riddle of why Chinese football is so awful becomes, then, a subversive inquiry. It involves unravelling much of what might be wrong with China and its politics. Every Chinese citizen who cares about football participates in this subversion, each with some theory—blaming the schools, the scarcity of pitches, the state’s emphasis on individual over team sport, its ruthless treatment of athletes, the one-child policy, bribery and the corrosive influence of gambling. Most lead back to the same conclusion: the root cause is the system.
Is it sports, or is it life? It's definitely not "just" entertainment.
OK, getting back to things I understand better, it's been quite the year for Mozilla and Firefox. Ever since they changed their release process and their version numbering back in the spring, it's been a continual stream of Firefox releases, so it's no surprise that Firefox 11 is soon to be available, with yet more features and functionality. But the Firefox team are pushing beyond just making great browsers, branching into areas like web-centric operating systems, Internet identity management, and building entire applications in the browser, as David Ascher explains. Ascher's article notes that within the Mozilla Foundation, people are now thinking significantly "beyond the browser":
we’re now at a distinct point in the evolution of the web, and Mozilla has appropriately looked around, and broadened its reach. In particular, the browser isn’t the only strategic front in the struggle to promote and maintain people’s sovereignty over their online lives. There are now at least three other fronts where Mozilla is making significant investments of time, energy, passion, sweat & tears. They’re still in their infancy, but they’re important to understand if you want to understand Mozilla

Meanwhile, how is Mozilla handling this? Aren't they just a few open source hackers? How can they do all this? Well, as Kara Swisher points out, Mozilla has some pretty substantial financial backing:
Mozilla is set to announce that it has signed a new three-year agreement for Google to be the default search option in its Firefox browser.
It’s a critical renewal for the Silicon Valley software maker, since its earlier deal with the search giant has been a major source of revenue to date.

Meanwhile, what is all this doing to the life of the ordinary web developer? As Christian Heilmann observes, it brings not just excitement, but also stress and discomfort, but underlying this is the fact that the web is no longer just a place for experimentation, but has transitioned into being the production platform of our daily lives:
We thought we are on a good track there. Our jobs were much more defined, we got more respect in the market and were recognised as a profession. Before we started showing a structured approach and measurable successes with web technologies we were just “designers” or “HTML monkeys”.
Are you just feeling flat-out overwhelmed by all this new technology? Well, one wonderful thing is that the web also provides the technology to stay up to date:
MIT President Susan Hockfield said, “MIT has long believed that anyone in the world with the motivation and ability to engage MIT coursework should have the opportunity to attain the best MIT-based educational experience that Internet technology enables. OpenCourseWare’s great success signals high demand for MIT’s course content and propels us to advance beyond making content available. MIT now aspires to develop new approaches to online teaching.”
Now, if I can just find that free time that I misplaced...
Just because software is open source doesn't mean it can't fade away into the sunset. Which is a shame, because I was really hoping to get a distribution with Issue 46 fixed, because I hit it all the time! Yes, yes, I know, I should just download the source and build it. Or find a new debugger. Or something.
Forgive the breathless style, and read the well-written summary of the Buckshot Yankee incident at the Washington Post: Cyber-intruder sparks massive federal response — and debate over dealing with threats. As author Ellen Nakashima observes, we're still struggling with what we mean when we toss about terms like "cyber war" and the new Cyber Command unit:
“Cyber Command and [Strategic Command] were asking for way too much authority” by seeking permission to take “unilateral action . . . inside the United States,” said Gen. James E. Cartwright Jr., who retired as vice chairman of the Joint Chiefs in August.
Officials also debated how aggressive military commanders can be in defending their computer systems.
“You have the right of self-defense, but you don’t know how far you can carry it and under what circumstances, and in what places,” Cartwright said. “So for a commander who’s out there in a very ambiguous world looking for guidance, if somebody attacks them, are they supposed to run? Can they respond?”
Finally (since something has to go last), Brad Feld has ended the year by winding down the story of Dick and Jane's SayAhh startup in a most surprising fashion (well, to me, at least): SayAhh Has Shut Down.
If you weren't following the SayAhh series, Feld had been writing a series of articles about a hypothetical software startup, using them to illustrate many of the perils and complexities that can arise when trying to build a new company from scratch. I'm not sure if there's a clean index to all the articles he wrote, but you can start here for the first article, and then mostly follow along via his blog. I think it's great that he ended the series in such a realistic fashion, though I'm quite interested to see how his readership feels about that!

I hope your holidays are enjoyable, safe, and filled with family and friends.

Thursday, December 22, 2011

I found a new Derby bug!

It doesn't happen very often that I find a bug in Derby, so it's worth noting: https://issues.apache.org/jira/browse/DERBY-5554.

I'm just enough disconnected from day-to-day Derby development at this point to not immediately understand what the bug is.

Note that the crash is in a generated method:


Caused by: java.lang.NullPointerException
at org.apache.derby.exe.acf81e0010x0134x6972x0511x0000033820000.g0(Unknown Source)

Derby uses generated methods in its query execution for tasks such as projection and restriction of queries.

So, for example, the generated method above is probably implementing part of the "where" clause of the query that causes the crash.

Derby generated methods are constructed at runtime, by dynamically emitting Java bytecode in classfile format, and then dynamically loading that class into the Java runtime. It's quite clever, but quite tricky to diagnose, because it's hard to see the actual Java code that is being run.
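
To give a feel for why generated code is awkward to debug, here is a loose analogy in Python rather than Java. It is not how Derby works internally: Derby emits bytecode directly, while this sketch builds source text and compiles it, and `compile_restriction` is a hypothetical helper of mine. The shared idea is that the filtering code for a "where" clause doesn't exist until runtime, so a crash inside it points at a location with no source file behind it.

```python
import traceback


def compile_restriction(column, op, value):
    """Generate a row filter for a tiny 'WHERE column op value' clause."""
    python_ops = {"=": "==", "<": "<", ">": ">"}
    source = (
        "def g0(row):\n"
        f"    return row[{column!r}] {python_ops[op]} {value!r}\n"
    )
    namespace = {}
    # "<generated>" is the only file name the runtime knows for this code.
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["g0"]


restrict = compile_restriction("age", ">", 30)

# Normal use: the generated method filters rows.
kept = [row for row in [{"age": 25}, {"age": 41}] if restrict(row)]

# A NULL-like value makes the generated method blow up.
try:
    restrict({"age": None})
except TypeError:
    trace = traceback.format_exc()
```

The captured traceback names `g0` in `<generated>` but shows no source line for it, much like the "(Unknown Source)" frame in the Derby stack trace above.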

A long time ago, I tracked down some debugging tips for working on crashes in generated code, and collected them here: http://wiki.apache.org/db-derby/DumpClassFile.

It's late, and I'm tired (and suffering from a head cold), but if I get some time over the holiday weekend I'll try to look into this crash some more.

Of course, perhaps somebody like Knut Anders or Rick will have already figured the problem out by then :)

Wednesday, December 21, 2011

Teaching yourself to become a thinker

I finally got around to reading Bill Deresiewicz's fascinating lecture: Solitude and Leadership. It's over two years old now, but it has aged very well, so if you haven't yet seen it, I encourage you to wander over and give it a read.

Although Deresiewicz spends much of the lecture talking about leadership, that wasn't my favorite part of his talk. Rather, I particularly enjoyed his analysis of thinking. He proposes that modern civilization isn't doing enough to develop a culture of thinkers:

What we don’t have, in other words, are thinkers. People who can think for themselves. People who can formulate a new direction: for the country, for a corporation or a college, for the Army—a new way of doing things, a new way of looking at things.

What does Deresiewicz mean by a "thinker"? He gives as an example (he is speaking to a West Point audience in 2009, after all) General David Petraeus:

He has a Ph.D. from Princeton, but what makes him a thinker is not that he has a Ph.D. or that he went to Princeton or even that he taught at West Point. I can assure you from personal experience that there are a lot of highly educated people who don’t know how to think at all.
No, what makes him a thinker—and a leader—is precisely that he is able to think things through for himself. And because he can, he has the confidence, the courage, to argue for his ideas even when they aren’t popular. Even when they don’t please his superiors. Courage: there is physical courage, which you all possess in abundance, and then there is another kind of courage, moral courage, the courage to stand up for what you believe.

Admiring General Petraeus for his mental and moral courage is fair, but even more interesting to me is Deresiewicz's observation on what (I think) is the more important aspect of being a thinker: creativity and originality:

Thinking means concentrating on one thing long enough to develop an idea about it. Not learning other people’s ideas, or memorizing a body of information, however much those may sometimes be useful. Developing your own ideas. In short, thinking for yourself. You simply cannot do that in bursts of 20 seconds at a time, constantly interrupted by Facebook messages or Twitter tweets, or fiddling with your iPod, or watching something on YouTube.
I find for myself that my first thought is never my best thought. My first thought is always someone else’s; it’s always what I’ve already heard about the subject, always the conventional wisdom. It’s only by concentrating, sticking to the question, being patient, letting all the parts of my mind come into play, that I arrive at an original idea. By giving my brain a chance to make associations, draw connections, take me by surprise. And often even that idea doesn’t turn out to be very good. I need time to think about it, too, to make mistakes and recognize them, to make false starts and correct them, to outlast my impulses, to defeat my desire to declare the job done and move on to the next thing.

It's a wonderful observation, and it's so very, very true. Creativity, originality, and inspiration require patience, reflection, and concentration.

It brings to mind a wonderful lesson that my dear friend Neil Goodman taught me over twenty years ago, when I was still just learning to program and I was trying to understand the way that Neil approached a problem.

I was asking him how he knew when he was done with the design phase of his project, and ready to move on to the coding phase. Neil replied with an answer that was very evocative of Deresiewicz's advice to "outlast your impulses". Neil said (as best I remember):

Work on your design. At some point, you will think you are done, and you are ready, but you are not. You have to resist that feeling, and work on your design some more, and you will find more ways to improve it. Ask others to review it; re-read and re-consider it yourself. Again, you will think you are done, and you are ready, but you are not. You must resist, resist, resist! Force yourself to continue iterating on your design, paying attention to every part of it, over and over. Even if you feel that you can't possibly improve it any more, still you must return to it. Only then, will you reach the point when you are ready to write code.

Of course, for the full effect, you have to have Neil himself (in his earnest, impassioned, gentle-giant sort of way) deliver the message, but hopefully the point comes through.

Incidentally (perhaps Deresiewicz didn't get to pick his own title?), I don't think that solitude really captures the idea properly. Or maybe solitude is the right thing in a military context, but in the software engineering field, where I spend all my time, I don't think that solitude is either necessary or useful for original, creative thinking. You need to get feedback and reactions from others, and you can't do that without communicating, and without listening. But you do need to exercise a number of activities which are certainly related to solitude: contemplation, reflection, consideration, etc. So if I had the chance to re-title his essay, I might suggest that he have a title more like "Leadership and the ability to think for yourself."

But that's quite a bit wordier :)

I've wandered quite far afield, but hopefully I've intrigued you enough with Deresiewicz's essay that you'll wander over and give it a read, and maybe (hopefully) you will find it time well spent.

Monday, December 19, 2011

Two command line shell tutorials

I just wouldn't be me if I didn't notice things like these and want to immediately post them to my blog...

Now get out there and open those terminal sessions!

Friday, December 16, 2011

Did Iran capture the drone by hacking its GPS?

There's a fascinating report in today's Christian Science Monitor speculating that Iran used vulnerabilities in the drone's GPS technology to simply convince the drone to land in Iran:

"GPS signals are weak and can be easily outpunched [overridden] by poorly controlled signals from television towers, devices such as laptops and MP3 players, or even mobile satellite services," Andrew Dempster, a professor from the University of New South Wales School of Surveying and Spatial Information Systems, told a March conference on GPS vulnerability in Australia.
"This is not only a significant hazard for military, industrial, and civilian transport and communication systems, but criminals have worked out how they can jam GPS," he says.
The US military has sought for years to fortify or find alternatives to the GPS system of satellites, which are used for both military and civilian purposes. In 2003, a “Vulnerability Assessment Team” at Los Alamos National Laboratory published research explaining how weak GPS signals were easily overwhelmed with a stronger local signal.
“A more pernicious attack involves feeding the GPS receiver fake GPS signals so that it believes it is located somewhere in space and time that it is not,” reads the Los Alamos report. “In a sophisticated spoofing attack, the adversary would send a false signal reporting the moving target’s true position and then gradually walk the target to a false position.”
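
The "gradually walk the target" idea can be illustrated with a toy simulation. Everything here is my own assumption rather than anything from the Los Alamos report or the drone: positions are one-dimensional, the units are arbitrary, and the hypothetical receiver's only defence is to reject any fix that jumps too far from the last one it accepted.

```python
def accepted_fixes(reported, start, max_jump):
    """Return the fixes a naive receiver accepts.

    The receiver rejects any fix more than max_jump away from the
    last fix it accepted (a deliberately simple sanity check).
    """
    last = start
    accepted = []
    for fix in reported:
        if abs(fix - last) <= max_jump:
            last = fix
            accepted.append(fix)
    return accepted


TRUE_POSITION = 0.0  # the target's real position (arbitrary units)
MAX_JUMP = 5.0       # largest change the receiver will believe

# Abrupt spoof: claim a far-away position immediately.
abrupt = [1000.0] * 10

# Gradual spoof: start at the true position, then drift a little each step.
gradual = [TRUE_POSITION + 4.0 * step for step in range(1, 251)]

abrupt_result = accepted_fixes(abrupt, TRUE_POSITION, MAX_JUMP)
gradual_result = accepted_fixes(gradual, TRUE_POSITION, MAX_JUMP)
```

In this sketch the abrupt spoof is rejected outright, while the gradual one is accepted at every step and ends up 1000 units from the truth. Real receivers and real countermeasures are far more involved, but it shows why a plain jump check isn't enough.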

Here's the link to that 2003 Los Alamos National Laboratory report: GPS Spoofing Countermeasures, which in turn points to a number of other references to read.

Very interesting stuff.

Given that location-based devices have become so prevalent in our lives (smartphones, cars, etc.), it's interesting to contemplate how we might improve the reliability and trustworthiness of the location awareness of our automated assistants. The guys over at SpiderLabs Anterior had some great articles on this recently:

Thursday, December 15, 2011

Bowl Season is here!

College Football Bowl Game Season is upon us, filled with memorable events such as "The Famous Idaho Potato Bowl".

With 35 bowl games, you need to know what to watch, and what to miss.

So get prepared, and head on over to Yahoo! Sports's bowl game preview: Dashing through the bowls … and coaching moves.

You'll find useful advice such as this:

Fight Hunger Bowl (25)
Dec. 31
Illinois vs. UCLA

Who has momentum? This is the bowl where momentum goes to die. Bruins have lost three of their past four but still look great compared to an Illinois team that has lost six in a row.
Who has motivation? Tail-spinning teams playing for interim coaches make this a potential low-intensity debacle. But both have hired new coaches who presumably will be watching to see who wants to make an impression heading into 2012.
Who wins a mascot fight? Joe Bruin by default, since Chief Illiniwek was forcibly retired in 2007 and not replaced.
Dash fact: Illinois hasn’t scored more than 17 points in a game since Oct. 8.
Dash pick: UCLA 14, Illinois 10. It’s New Year’s Eve. Find something better to do with your time.

Intriguingly, the two best bowl games of the year look to involve Bay Area teams:

The Fiesta Bowl, on Jan 2, features Stanford and Oklahoma State
The Holiday Bowl, on Dec 28, features Texas vs. California

Both games should be well-matched, well-played, and quite fun to watch.

The bubble is back!

Wow! One billion dollars is a lot of money.

Will they use some of that to start addressing the widely-reported discord?

South America joins the Amazon cloud

Amazon Web Services is growing at an incredible rate. They just opened their eighth AWS region, in São Paulo, Brazil.

Here's a nifty interactive map of the AWS global infrastructure, so you can see the various regions and their information.

The U.S. West (Northern California) region is actually located just a few miles from my house. (I think that the city planners ran out of imagination, as it's at the corner of Investment Boulevard and Corporate Ave, not far from the Industrial Boulevard exit on the freeway.)

It's not much to look at, as a physical building; it's what's inside that counts!

Delightfully illustrated HTTP status codes

The HyperText Transfer Protocol (HTTP) indicates a lot of information in the "status code". Everyone is of course familiar with codes 404 ("Not found"), 200 ("OK"), and 500 ("Internal server error"), but there are dozens more codes, each with their own precise meaning.
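If you'd rather look the codes up from a terminal, here is a minimal sketch using Python's standard http.HTTPStatus enumeration. The describe helper is my own name for illustration, not part of any library:

```python
from http import HTTPStatus


def describe(code: int) -> str:
    """Return 'CODE Phrase' for a status code the standard library knows about."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define.
        return f"{code} (not a status code known to this Python version)"
    return f"{status.value} {status.phrase}"


# The three codes everyone knows, plus a couple of less famous ones.
for code in (200, 404, 500, 301, 503):
    print(describe(code))
```

Iterating over HTTPStatus itself lists every code your Python version defines, which is a quick way to see just how many there are.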

If you're having trouble remembering which code is which, or don't understand a particular code and would like a more "vivid" example, head over to Flickr, for this delightful set of HTTP Status Codes illustrated by cats.

This is definitely the geekiest humor of the year!

I particularly enjoy

but you should really view the entire set for the full effect (apologize to your co-workers in advance before laughing out loud!).

Wednesday, December 14, 2011

Raymond Chen skewers the executive re-org email

Delightful! Raymond Chen perfectly captures the classic executive reorg email.

The bit about auto-summarize at the end is pretty insightful, as well.

Tuesday, December 13, 2011

Looking for something to read?

It's winter time, you're stuck indoors, how about something to read?

Well, you're in luck: both Longform and GiveMeSomethingToRead are just out with their end-of-the-year lists:

I like both lists; there's lots of interesting material here.

There's a fair amount of overlap, naturally, but also each list has a few gems that the other list missed.

FWIW, my favorite out of both lists is My Summer at an Indian Call Center, from Mother Jones Magazine.

Update: Here's another set of lists!

Friday, December 9, 2011

The Popular Mechanics article on Flight 447 is enthralling

If you haven't yet had a chance to go read the Popular Mechanics article about Air France flight 447 then stop what you are doing right now, go sit down for 5 minutes, and read the article.

It is chock full of insight after insight after insight. Here's just one, picked almost at random:

While Bonin's behavior is irrational, it is not inexplicable. Intense psychological stress tends to shut down the part of the brain responsible for innovative, creative thought. Instead, we tend to revert to the familiar and the well-rehearsed. Though pilots are required to practice hand-flying their aircraft during all phases of flight as part of recurrent training, in their daily routine they do most of their hand-flying at low altitude—while taking off, landing, and maneuvering. It's not surprising, then, that amid the frightening disorientation of the thunderstorm, Bonin reverted to flying the plane as if it had been close to the ground, even though this response was totally ill-suited to the situation.

Although, as the article points out, there were multiple technical issues in play (weather, location, time of day, fatigue, etc.), in the end it boils down to human factors: training, experience, and, most of all, communication:

The men are utterly failing to engage in an important process known as crew resource management, or CRM. They are failing, essentially, to cooperate. It is not clear to either one of them who is responsible for what, and who is doing what.

Every sentence in this article is a nugget, with observations about human behavior, suggestions of areas for study and improvement, and, in the end, a realization that people are flawed and make mistakes, and what we need to do most of all is to think, talk, and help each other:

when trouble suddenly springs up and the computer decides that it can no longer cope—on a dark night, perhaps, in turbulence, far from land—the humans might find themselves with a very incomplete notion of what's going on. They'll wonder: What instruments are reliable, and which can't be trusted? What's the most pressing threat? What's going on?

Don't miss this incredible recap of the tragedy of Flight 447.

Thursday, December 8, 2011

Following up

I love follow-up. Follow-up is good: study something, then study the follow-ups, and you will learn more.

So, a few follow-ups:

I've been fascinated by the Air France Flight 447 investigation (background here and here), so if you were equally interested, don't miss this wonderful article in this month's Popular Mechanics: What Really Happened Aboard Air France 447:
Two years after the Airbus 330 plunged into the Atlantic Ocean, Air France 447's flight-data recorders finally turned up. The revelations from the pilot transcript paint a surprising picture of chaos in the cockpit, and confusion between the pilots that led to the crash.
I've been following Jim Gettys and his studies into the network queueing phenomenon known as "BufferBloat". If you've been paying attention to this, too, you won't want to miss this discussion in the ACM Queue column. Meanwhile, Patrick McManus, who is hard at work on the new SPDY networking protocol, has an essay on the topic in which he notes some recent published research, and worries that there is still more research needed:
A classic HTTP/1.x flow is pretty short - giving it a signal to backoff doesn't save you much - it has sent much of what it needs to send already anyhow. Unless you drop almost all of that flow from your buffers you haven't achieved much. Further, a loss event has a high chance of damaging the flow more seriously than you intended - dropping a SYN or the last packet of the data train is a packet that will have very slow retry timers, and short flows are comprised of high percentages of these kinds of packets.
Understanding TCP's behaviors is certainly complicated; I recently wrote about this at some length on the Perforce blog.

Problems like the complexity of modern systems (those controlling airplanes, nuclear reactors, etc.) or the unexpected interactions of networking equipment across the planet continue to be extremely hard. Only dedicated study over many years or decades is going to bring progress, so I'm pleased to note such progress when it occurs!

Wednesday, December 7, 2011

The astonishing sophistication of ATM skimmer criminals

On his superb blog, Brian Krebs has just posted the latest entry in his astonishing series of investigative articles reporting on modern ATM card skimming criminals, and the devices they use to capture ATM data from compromised ATM machines: Pro Grade (3D Printer-Made?) ATM Skimmer.

Looking at the backside of the device shows the true geek factor of this ATM skimmer. The fraudster who built this appears to have cannibalized parts from a video camera or perhaps a smartphone (possibly to enable the transmission of PIN entry video and stolen card data to the fraudster wirelessly via SMS or Bluetooth).

Everything Brian Krebs writes is worth reading, but I find these ATM articles to be just gripping. Modern life is so complex!

Tuesday, December 6, 2011

Really slow news day

One of the "AP Top Stories" this morning is headlined

Malia Obama, 13, is nearly as tall as her father.

Monday, December 5, 2011

Three practical articles, two practical books

Here's a nifty ad-hoc collection of some nice practical articles and books to keep you grounded and focused on what really matters:

First, a nice review of Robert C. ("Uncle Bob") Martin's The Clean Coder: 9 things I learned from reading The Clean Coder by Robert C. Martin, on how professional developers conduct themselves. I haven't read the book (yet), but the review makes me interested enough to keep this book on the list for when I'm next above water on my technical reading. An excerpt from Christoffer Pettersson's review:
As a professional developer, you should spend time caring for your profession. Just like in any other profession, practice gives performance, skill and experience.
It is your own responsibility to keep training yourself by reading, practicing and learning - actually anything that helps you grow as a software developer and helps you get on board with the constant industry changes.
Second, a nifty essay by Professor John Regehr about the ins and outs of "testcase reduction", one of those rarely-discussed but ultra-important skills that gets far too little recognition. At my day job, I'm lucky to have a colleague who is just astonishingly good at testcase reduction; he just has the knack. An excerpt from Regehr's essay:
testcase reduction is both an art and a science. The science part is about 98% of the problem and we should be able to automate all of it. Creating a test case from scratch that triggers a given compiler bug is, for now at least, not only an art, but an art that has only a handful of practitioners.
Third, a nice article on Ned Batchelder's blog about "Maintenance Hatches", those special visibility hooks that let humans observe the operation of complex software in some high-level and comprehensible fashion, for diagnosis and support purposes. In my world, these hatches are typically trace logs, which cause the software to emit detailed information about its activity, and can be turned on and off, and ratcheted up to higher or lower levels, as needed. I like Batchelder's terminology, which conjures up the image of opening a cover to the machinery to observe it at work:
On a physical machine, you need to be able to get at the inner workings of the thing to observe it, fiddle with it, and so on. The same is true for your software.
Fourth, an interesting online book about the practicalities of building a modern web application: The Twelve-Factor App. I guess I was sold when I saw the first factor was to ensure you are using a source code control system (may I recommend one?). From the introduction to the book:
This document synthesizes all of our experience and observations on a wide variety of software-as-a-service apps in the wild. It is a triangulation on ideal practices app development, paying particular attention to the dynamics of the organic growth of an app over time, the dynamics of collaboration between developers working on the app’s codebase, and avoiding the cost of software erosion.
Last, but far from least, check out this amazing online book about graphics programming: Learning Modern 3D Graphics Programming by Jason McKesson. From the introduction:
This book is intended to teach you how to be a graphics programmer. It is not aimed at any particular graphics field; it is designed to cover most of the basics of 3D rendering. So if you want to be a game developer, a CAD program designer, do some computer visualization, or any number of things, this book can still be an asset for you.
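As an aside on the "maintenance hatches" item above: the kind of trace log I described, one that can be switched on and off and ratcheted up or down, can be sketched with Python's standard logging module. This is only an illustration; the logger name "orderservice" and the functions process_order and set_trace_level are hypothetical names I made up, not anything from Batchelder's article:

```python
import logging

# Send log records to stderr with a simple format.
logging.basicConfig(format="%(levelname)s %(name)s: %(message)s")

# Hypothetical subsystem logger; the name is made up for this sketch.
log = logging.getLogger("orderservice")


def process_order(order_id: int) -> int:
    """Pretend to do some work, narrating it at different trace levels."""
    log.debug("looking up order %s", order_id)   # only visible with the hatch wide open
    log.info("processed order %s", order_id)     # visible at normal tracing levels
    return order_id


def set_trace_level(level_name: str) -> None:
    """Open or close the hatch at runtime, e.g. 'debug', 'info', 'warning'."""
    # Logger.setLevel accepts the standard level names as strings.
    log.setLevel(level_name.upper())


set_trace_level("warning")   # hatch closed: neither message above is emitted
process_order(1)
set_trace_level("debug")     # hatch fully open: both messages are emitted
process_order(2)
```

The point is that the level is changed while the program runs, so support staff can open the hatch only when they need to look inside.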

The Internet is a wonderful place; so much stuff to read and explore!

Saturday, December 3, 2011

Three papers on Bloom filters

If you should find yourself noticing that:

You're interested in these mysterious things called Bloom filters, that seem to be popping up over and over,
and you'd like to learn more about how Bloom filters work, but you're not sure where to start.

then you're roughly in the position I was in a little while ago, and maybe this article will be of interest to you.

After reading a dozen or so Bloom-filter-related papers spanning 40 years of interest in the subject, I've whittled down the list to three papers that I can recommend to anybody looking for a (fairly) quick and (somewhat) painless introduction to the world of Bloom filters:

The first paper is Burton Bloom's original description of the technique that he devised back in the late 1960's for building a new data structure for rapidly computing set membership. The paper is remarkably clear, even though in those days we had not yet settled on standard terminology for describing computer algorithms and their behaviors.

The most striking part of Bloom's work is this description of its behavior:

In addition to the two computational factors, reject time and space (i.e. hash area size), this paper considers a third computational factor, allowable fraction of errors. It will be shown that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing the reject time. In some practical applications, this reduction in hash area size may make the difference between maintaining the hash area in core, where it can be processed quickly, or having to put it on a slow access bulk storage device such as a disk.

This discussion of probabilistic, approximate answers to queries must have been truly startling and disturbing in the 1960's, when computer science was still very young: "falsely identified", "fraction of errors"? Horrors! Even nowadays, computer science has a hard time dealing with the notions of randomness and probabilistic computations, but we're getting better at reasoning about these things. Back then, Bloom felt the need to justify the approach by noting that:

In order to gain substantial reductions in hash area size, without introducing excessive reject times, the error-free performance associated with conventional methods is sacrificed. In application areas where error-free performance is a necessity, these new methods are not applicable.

In addition to suggesting the notion of a probabilistic solution to the set membership problem, the other important section of Bloom's original paper is his description of the Bloom filter's core algorithm:

Method 2 completely gets away from the conventional concept of organizing the hash area into cells. The hash area is considered as N individual addressable bits, with addresses 0 through N - 1. It is assumed that all bits in the hash area are first set to 0. Next, each message in the set to be stored is hash coded into a number of distinct bit addresses, say a1, a2, ..., ad. Finally, all d bits addressed by a1 through ad are set to 1.
To test a new message a sequence of d bit addresses, say a1', a2', ... ad', is generated in the same manner as for storing a message. If all d bits are 1, the new message is accepted. If any of these bits is zero, the message is rejected.
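To make Bloom's "Method 2" concrete, here is a minimal sketch in Python. The class and method names are mine, and so is the hashing scheme: Bloom's paper does not prescribe particular hash functions, so I derive the d bit addresses from a SHA-256 digest by double hashing, which is an assumption of this sketch. Unlike Bloom's description, the addresses it produces are not guaranteed to be distinct.

```python
import hashlib


class BloomFilter:
    """A hash area of N individually addressable bits, with d addresses per message."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits        # N in Bloom's description
        self.num_hashes = num_hashes    # d in Bloom's description
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _addresses(self, message: str) -> list:
        # Assumption: derive d addresses from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(message.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd, nonzero stride
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, message: str) -> None:
        """Store a message: set all d addressed bits to 1."""
        for addr in self._addresses(message):
            self.bits[addr // 8] |= 1 << (addr % 8)

    def might_contain(self, message: str) -> bool:
        """Accept if all d bits are 1; reject if any of them is 0."""
        return all(
            self.bits[addr // 8] & (1 << (addr % 8))
            for addr in self._addresses(message)
        )


words = BloomFilter(num_bits=1024, num_hashes=3)
for word in ("apple", "banana", "cherry"):
    words.add(word)

print(words.might_contain("banana"))  # a stored message is always accepted
print(words.might_contain("durian"))  # usually rejected, but a false positive is possible
```

Note the asymmetry: a "no" answer is always correct, while a "yes" answer only means "probably". That is exactly the "allowable fraction of errors" Bloom was talking about.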

Bloom's work was, rather quietly, adopted in various areas, particularly in database systems. I'll return to that subject in a future posting, but for now, let's spin the clock ahead 25 years, to the mid-1990's, when the second paper, by a team working on Web proxy implementations, brought significant popularity and a new audience to the world of Bloom filters.

The Summary Cache paper describes an intriguing problem: suppose you want to implement a set of independent proxy caches, each one physically separate from the others, and you'd like your implementation to arrange to have a cache miss on one proxy be able to quickly decide if the item is available in the cache of another proxy:

ICP discovers cache hits in other proxies by having the proxy multicast a query message to the neighboring caches whenever a cache miss occurs. Suppose that N proxies [are] configured in a cache mesh. The average cache hit ratio is H. The average number of requests received by one cache is R. Each cache needs to handle (N - 1) * (1 - H) * R inquiries from neighboring caches. There are a total [of] N * (N - 1) * (1 - H) * R ICP inquiries. Thus, as the number of proxies increases, both the total communication and the total CPU processing overhead increase quadratically.
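To get a feel for that quadratic growth, here is the arithmetic from the quote as a few lines of Python. The function name and the sample values for N, H, and R are made up for illustration; only the two formulas come from the paper:

```python
def icp_inquiries(num_proxies: int, hit_ratio: float, requests_per_cache: float):
    """Return (inquiries handled by each cache, total inquiries in the mesh)."""
    # Each cache handles (N - 1) * (1 - H) * R inquiries from its neighbors.
    per_cache = (num_proxies - 1) * (1 - hit_ratio) * requests_per_cache
    # Summed over all N caches: N * (N - 1) * (1 - H) * R.
    total = num_proxies * per_cache
    return per_cache, total


# Hypothetical mesh: 50% hit ratio, 1000 requests per cache.
for n in (4, 8, 16):
    per_cache, total = icp_inquiries(n, 0.5, 1000)
    print(f"N={n:2d}: {per_cache:.0f} inquiries per cache, {total:.0f} in total")
```

With these made-up numbers, going from 4 proxies to 8 takes the total from 6,000 inquiries to 28,000: doubling the mesh more than quadruples the chatter.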

How do they solve this problem? Well, it turns out that a Bloom filter is just the right tool for this:

We then propose a new cache sharing protocol called "summary cache." Under this protocol, each proxy keeps a compact summary of the cache directory of every other proxy. When a cache miss occurs, a proxy first probes all the summaries to see if the request might be a cache hit in other proxies, and sends a query messages [sic] only to those proxies whose summaries show promising results. The summaries do not need to be accurate at all times. If a request is not a cache hit when the summary indicates so (a false hit), the penalty is a wasted query message. If the request is a cache hit when the summary indicates otherwise (a false miss), the penalty is a higher miss ratio.
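Here is a rough sketch of that lookup logic in Python, just to show the shape of the idea. Everything here is a simplification of my own, not the paper's implementation: the Summary and Proxy classes are hypothetical, the summary is a tiny Bloom filter with a hashing scheme I picked arbitrarily, and peers read each other's summary object directly, whereas real proxies would hold periodically refreshed (and possibly stale) copies sent over the network.

```python
import hashlib


class Summary:
    """A compact, possibly inaccurate summary of a cache directory (a small Bloom filter)."""

    def __init__(self, num_bits: int = 4096, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _addresses(self, url: str) -> list:
        # Assumption of this sketch: double hashing over a SHA-256 digest.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, url: str) -> None:
        for addr in self._addresses(url):
            self.bits[addr // 8] |= 1 << (addr % 8)

    def might_contain(self, url: str) -> bool:
        return all(self.bits[a // 8] & (1 << (a % 8)) for a in self._addresses(url))


class Proxy:
    """A proxy cache that advertises its contents through a Summary."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cache = {}            # url -> cached body
        self.summary = Summary()   # what this proxy tells its peers it holds

    def store(self, url: str, body: str) -> None:
        self.cache[url] = body
        self.summary.add(url)


def lookup(local: Proxy, peers: list, url: str):
    """Return (body or None, number of query messages sent to peers)."""
    if url in local.cache:
        return local.cache[url], 0
    queries = 0
    for peer in peers:
        # Probe the summary first; only promising peers get a query message.
        if peer.summary.might_contain(url):
            queries += 1               # stands in for a network round trip
            if url in peer.cache:
                return peer.cache[url], queries
            # Otherwise this was a false hit: the only penalty is a wasted query.
    return None, queries


a, b, c = Proxy("a"), Proxy("b"), Proxy("c")
b.store("http://example.com/index.html", "<html>hello</html>")
print(lookup(a, [b, c], "http://example.com/index.html"))
```

Compare this with ICP, where every miss would have sent a query to both b and c regardless of what they hold.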

Bloom filters continued to spread in usage, and interesting varieties of Bloom filters started to emerge, such as Counting Bloom Filters, Compressed Bloom Filters, and Invertible Bloom Filters. In particular, Professor Michael Mitzenmacher of Harvard has been collecting, studying, and improving upon our understanding of Bloom filters and their usage for several decades.

About 10 years ago, Mitzenmacher collaborated with Andrei Broder, who is now at Yahoo! Research but was then with Digital Equipment Corporation Research, to write the third paper, which is the best of the three papers I mention in this article. (Note that Broder was a co-author of the second paper as well.)

The Network Applications of Bloom Filters paper accomplishes two major tasks:

First, it provides a clear, modern, and complete description and analysis of the core Bloom filter algorithm, including most importantly the necessary mathematics to understand the probabilistic behaviors of the algorithm and how to adjust its various parameters.
Second, it provides a wide-ranging survey of a variety of different areas in which Bloom filters can be used, and have successfully been used.
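To give a taste of the mathematics involved, here is the standard textbook approximation for a Bloom filter's false positive rate, written as a small Python sketch. With m bits, n stored items, and k hash functions, the rate is approximately (1 - e^(-kn/m))^k, and it is minimized when k is about (m/n) * ln 2. The function names are mine, and the formulas assume idealized, independent hash functions:

```python
import math


def false_positive_rate(num_bits: int, num_items: int, num_hashes: int) -> float:
    """Approximate false positive probability: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def optimal_num_hashes(num_bits: int, num_items: int) -> int:
    """Number of hash functions minimizing the rate: about (m/n) * ln 2."""
    return max(1, round((num_bits / num_items) * math.log(2)))


# Hypothetical sizing: 10 bits of filter per stored item.
m, n = 10_000, 1_000
k = optimal_num_hashes(m, n)
print(f"k = {k}, false positive rate = {false_positive_rate(m, n, k):.4f}")
```

The practical upshot is that the error rate depends on the ratio of bits to items, not on how big the items themselves are: at roughly 10 bits per item and 7 hash functions, the false positive rate comes out a little under 1%.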

Most importantly, for people considering future use of Bloom filters, the paper notes that:

The theme unifying these diverse applications is that a Bloom filter offers a succinct way to represent a set or a list of items. There are many places in a network where one might like to keep or send a list, but a complete list requires too much space. A Bloom filter offers a representation that can dramatically reduce space, at the cost of introducing false positives. If false positives do not cause significant problems, the Bloom filter may provide improved performance. We call this the Bloom filter principle, and we repeat it for emphasis below.
The Bloom filter principle: Whenever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated.

So there you have it: three papers on Bloom filters. There is a lot more to talk about regarding Bloom filters, and hopefully I'll have the time to say more about these fascinating objects in the future. But this should be plenty to get you started.

If you only have time to read one paper on Bloom filters, read Broder and Mitzenmacher's Network Applications of Bloom Filters. If you have more time, also read the Summary Cache paper, and if you have gone that far I'm sure you'll take the time to dig up Bloom's original paper just for completeness (it's only 5 pages long, and once you've read Broder and Mitzenmacher, the original paper is easy to swallow).

Thursday, December 1, 2011

The Foundations discussions

There continues to be an active discussion over Mikeal Rogers's essay about Apache and git, which I wrote about a few days ago.

Here's (some of) what's been going on:

Simon Phipps wrote a widely read essay in Computerworld UK about the notion of an open source foundation, as separate from the open source infrastructure, and relates the story of a project which suffered greatly because it hadn't established itself with the support of a larger entity:
the global library community embraced Koha and made it grow to significant richness. When the time came, the original developers were delighted to join a US company that was keen to invest in - and profit from - Koha. Everything was good until the point when that company decided that, to maximise their profit, they needed to exert more control over the activities of the community.
A detailed article at the Linux Weekly News website provides much more of the details of this story. Phipps's point is that part of these problems arose because the developers of the project didn't engage in the open discussion of the long term management of the project that would have occurred had they hosted their project at one of the established Open Source foundations such as Apache or the Software Freedom Conservancy.
Stephen O'Grady also wrote an essay on the difference between foundations and infrastructure acknowledging that "foundations who reject decentralized version control systems will fall behind", but further asserting that:
GitHub is a center of gravity with respect to development, but it is by design intensely non-prescriptive and inclusive, and thus home to projects of varying degrees of quality, maturity and seriousness.
[ ... ] GitHub, in other words, disavows responsibility for the projects hosted on the site. Foundations, conversely, explicitly assume it, hence their typically strict IP policies. These exclusive models offer a filter to volume inclusive models such as GitHub’s.
[ ... ] If you’re choosing between one project of indeterminate pedigree hosted at GitHub and an equivalent maintained by a foundation like Apache, the brand is likely to be a feature.
Mikeal Rogers, whose original essay kicked off the entire discussion, has since followed up with some subsequent thoughts about foundations and institutions:
Simon believes it is the job of an institution (in this case a foundation) to protect members from each other and from the outside world. In the case of legal liabilities this makes perfect sense. In the case of community participation this view has become detrimental.
If you believe, as I do, that we have undertaken a cultural shift in open source then you must re-examine the need for institutional governance of collaboration. If the values we once looked to institutions like Apache to enforce are now enforced within the culture by social contract then there is no need for an institution to be the arbiter of collaboration between members.
Ben Collins-Sussman, a longtime Apache member, chimes in with his thoughts on the value of the Apache Foundation, pointing to the explicit codification of "community":
the ASF requires that each community have a set of stewards (“committers”), which they call a “project management committee”; that communities use consensus-based discussions to resolve disputes; that they use a standardized voting system to resolve questions when discussion fails; that certain standards of humility and respect are used between members of a project, and so on. These cultural traditions are fantastic, and are the reason the ASF provides true long-term sustainability to open source projects.
Jim Jagielski, another longtime Apache member, adds his thoughts, observing that it is important to not get caught up in statistics about popularity, adoption rate, etc., but to concentrate on communities, culture, and communication aspects:
The ASF doesn't exist to be a "leader"; it doesn't exist to be a "voice of Open Source"; it doesn't exist to be cool, or hip, or the "place to be" or any of that.
[ ... ]
It exists to help build communities around those codebases, based on collaboration and consensus-based development, that are self-sustaining; communities that are a "success" measured by health and activity, not just mere numbers.
Ceki Gulcu (poorly transliterated by me, sorry), a longtime Open Source Java developer, observes that what one person sees as consensus and meritocratic collaboration, another might see as endless discussion and fruitless debate:
Apache projects cannot designate a formal project leader. Every committer has strictly equal rights independent of past or future contributions. This is intended to foster consensus building and collaboration ensuring projects' long term sustainability. Worthy goals indeed! However, one should not confuse intent with outcome. I should also observe that committer equality contradicts the notion of meritocracy which Apache misrepresents itself as.
As I have argued in the past, the lack of fair conflict resolution or timely decision making mechanisms constitute favorable terrain for endless arguments.

It seems to be a fairly fundamental debate: some believe that the open source foundations provide substantial benefit, others feel that they reflect a time that no longer exists, and are no longer necessary.

Overall, it's been a fascinating discussion, with lots of viewpoints from lots of different perspectives.

I'll continue to be interested to follow the debate.

Tools and Utilities for Windows

Scott Hanselman has posted a voluminous annotated list of the tools and utilities that he uses for developing software.

Most of these tools are specific to Windows 7, and more precisely to developing Web applications using Microsoft tools such as Visual Studio and C# and DotNet.

Still, it is a tremendous list, and if you're looking for a tool or utility for your personal development environment, there are a lot of references to chase here.

Wednesday, November 30, 2011

Wings

The following sounds like a description of an airplane, like something you might hear from somebody like Burt Rutan:

when we addressed the wing, we started with a complicated rule, to limit what a designer could do. We added more and more pieces as we thought of more and more outcomes, and we came to a point where it was so complicated—and it was still going to be hard to control, because the more rules you write the more loopholes you create – that we reverted to a simple principle. Limit the area very accurately, and make it a game of efficiency.

But it's not from Rutan at all; it's an excerpt from Wings, the Next Generation, an article discussing the sailboats to be used in next summer's America's Cup qualification matches.

Now, everybody knows that sails, and airplane wings, actually have very much in common, so it really isn't surprising that this sounds like aerospace design. However, as Paul Cayard notes in the article, the wings on a competition sailboat have a few special constraints:

the America’s Cup rules don’t allow stored power, so two of our eleven guys—we think, two—will be grinding a primary winch all the race long. Not to trim, but to maintain pressure in the hydraulic tank so that any time someone wants to open a hydraulic valve to trim the wing, there will be pressure to make that happen.

It will be fascinating to see these boats in person, racing on the bay, but I'm glad I won't have to be one of those grinders!

Tuesday, November 29, 2011

Apache, Subversion, and Git

Over the long weekend, a number of people seem to have picked up and commented on Mikeal Rogers's essay about Apache and its adoption of the source code control tool, Git. For example, Chris Aniszczyk pointed to the essay, and followed it up with some statistics and elaboration. Aniszczyk, in turn, points to a third essay (a year old), by Josh Berkus, describing the PostgreSQL community's migration to git, and a fourth web page describing the Eclipse community's migration to git. (Note: Both Eclipse and PostgreSQL migrated from CVS to git.)

I find the essays by Rogers and Aniszczyk quite puzzling, full of much heat and emotion, and I'm not sure what to take from them.

Rogers seems to start out on a solid footing:

For a moment, let's put the git part of GitHub on the back burner and talk about the hub.
On GitHub the language is not code, as it is often characterized, it is contribution. GitHub presents a person to person communication system for contributions. Documentation, issues, and of course code, travel between personal repositories.
The communication medium is the contribution itself. Its value, its merit, its intention, all laid naked for the world to see. There is no hierarchy or politic embedded in the system. The creator of a project has a clear first mover advantage but the possibility is always there for its position to be supplanted by a fork, creating a social imperative to manage contributions in a satisfactory manor [sic] to her community.

This is all well-written and clear, I think. But I don't understand how this is a critique of Apache. In my seven years of experience with the Derby project at Apache, this is exactly how an Apache software project works:

Issues are raised in the Apache issue-tracking system;
discussion is held in the issue comments and on mailing lists;
various contributors suggest ideas;
someone "with an itch to scratch" dives into the problem and constructs a patch;
the patch is proposed by attaching it to the issue-tracking system;
further discussion and testing occurs, now shaped by the concrete nature of the proposed patch;
a committer who becomes persuaded of the desirability of the patch commits it to the repository;
eventually a release occurs and the change becomes widely distributed.

This is the process as I have seen it and participated in it, since back in 2004, and, I believe, was how it was done for years before that.

So what, precisely, is it that Apache is failing at?

Here is where Rogers's essay seems to head into the wilderness, starting with this pronouncement:

Many of the social principles I described above are higher order manifestations of the design principles of git itself.
[ ... ]
The problem here is less about git and more about the chasm between Apache and the new culture of open source. There is a growing community of young new open source developers that Apache continues to distance itself from and as the ASF plants itself firmly in this position the growing community drifts farther away.

I don't understand this at all. What, precisely, is it that Apache is doing to distance itself from these developers, and what does this have to do with git?

Rogers offers as evidence this email thread (use the "next message by thread" links to read the thread), but from what I can tell, it seems like a very friendly, open, and productive discussion about the mechanics of using git to manage projects at Apache, with several commenters welcoming newcomers into the community and encouraging them to get involved.

This seems like the Apache way working successfully, from what I can tell.

Aniszczyk's follow-on essay, unfortunately, doesn't shed much additional light. He states that "what has been happening recently regarding the move to a distributed version control system is either pure politicking [sic] or negligence in my opinion."

So, again, what is it that he is specifically concerned about? Here, again, the essay appears to head into the wilderness. "Let's try to have some fun with statistics," says Aniszczyk, and he presents a series of charts and graphs showing that:

git is very popular
lots of job sites, such as LinkedIn, are advertising for developers who know git
There is no 3.

At this point, Aniszczyk says "I knew it was time to stop digging for statistics."

But again, I am confused about what he finds upsetting. The core message of his essay appears to be:

The first is simple and deals with my day job of facilitating open source efforts at Twitter. If you’re going to open source a new project, the fact that you simply have to use SVN at Apache is a huge detterent [sic] from even going that route.
[ ... ]
All I’m saying is that it took a lot of work to start the transition and the eclipse community hasn’t even fully completed it yet. Just ask the PostgreSQL community how quick it was moving to Git. The key point here is that you have to start the transition soon as it’s going to take awhile for you to implement the move (especially since Apache hosts a lot of projects).

Once again, I'm lost. Why, exactly, is it a huge deterrent to use svn? And why, exactly, does Apache need to convert its existing projects from svn to git? Just because LinkedIn is advertising more jobs that use git as a keyword? That doesn't seem like a valid reason, to me.

Note that, as I mentioned at the start of this article, the PostgreSQL team migrated from CVS to git, not from Subversion to git. I can completely understand this. The last time I used CVS was in 2001, 10 full years ago; even at that time, CVS had some severe technical shortcomings and there was sufficient benefit to switching that it was worth the effort. So I'm not at all surprised by the PostgreSQL community's decision. The article by Berkus, by the way, is definitely worth reading, full of wisdom about platform coverage, tool and infrastructure support, workflow design, etc.

So, to summarize (as I understand it):

PostgreSQL and Eclipse are migrating from CVS to git, successfully (although it is taking a significant amount of time and resources)
Apache is working to integrate git into its policies and infrastructure, but still uses Subversion as its primary scm system
Some people seem to feel like Apache is making the wrong decision about this

But what I don't understand, at the end of it all, is in what way any of this is opposed to "the Apache way". From everything I can see, the Apache way is alive and well in these discussions.

UPDATE: Thomas Koch, in the comments, provides a number of substantial, concrete examples in which git's powerful functionality can be very helpful. The most important one that Thomas provides, I think, is this:

It is much easier to make a proper integration between review systems, Jenkins and Jira, if the patch remains in the VCS as a branch instead of leaving it.

I completely agree. Working with patch files in isolation is substantially worse than making reference to a branched change that is under SCM control. Certainly in my work with Derby I have seen many a contributor make minor technical errors while manipulating a patch file, which on the whole just adds friction to the overall process. Good point, Thomas!

Monday, November 28, 2011

Burton Bloom and the now-forgotten Computer Usage Company

Burton Bloom's original paper on Bloom Filters is entitled Space/Time Trade-offs in Hash Coding with Allowable Errors, and his by-line is given as

Burton H. Bloom
Computer Usage Company, Newton Upper Falls, Massachusetts

with the additional parenthetical note that

Work on this paper was in part performed while the author was affiliated with Computer Corporation of America, Cambridge, Massachusetts.

Now, I'm quite familiar with Computer Corporation of America; I was an employee of theirs from 1985-1988, and I vividly remember my days working in the 4 Cambridge Center building.

But that was 15 years after Bloom's paper was published, and when I was there, I don't recall anything about "Computer Usage Company".

What was Computer Usage Company?

Unfortunately, a web search reveals only the slightest details:

There is, of course, a Wikipedia page
There is a Facebook page
and there is a short entry at the Computer History Museum website

But there is very little else. George Trimble's homepage no longer exists, and most of the links from the existing summary pages at Wikipedia and elsewhere point to articles in the IEEE Annals of the History of Computing, which (like Bloom's original paper at the ACM site) is protected behind a paywall and can't be read by commoners.

Computer Usage Company is credited with being "the world's first computer software company", but it seems on the verge of disappearing into dust. It's a shame; you'd think the software industry would work harder to keep information about these early pioneers alive.

I wonder if the IEEE keeps any statistics regarding how many people have actually paid the $30 to purchase this 20-year-old, five-page memoir? I would have been intrigued to read it; I might even have paid, say, $0.99 or something like that to get it on my Kindle. But thirty dollars?

Saturday, November 26, 2011

Ho-hum, just an 11-1 season

It's amazing to me that Stanford are, at this point, clinging to hopes for a BCS at-large bid. Should it really be this hard to get two Pac-12 teams into the BCS? I guess that the SEC are still hoping they will field 3 teams in the 10 team BCS schedule...

Quote-un-Quote: 50 interviews of indie game developers

Here's a great body of work: "Fifty Independent Videogame Developers; Fifty Interviews; Fifty Weeks".

The interviewer, who goes by the handle "moshboy", describes the intent of the project here:

all I wanted to do was get some words of insight out of a few independent videogame developers that weren’t known to put many of their own words ‘out there’. In the beginning, the idea was to interview those that had rarely or never been interviewed before.

His project succeeded, and produced a fascinating body of work:

Sometimes the quotes are a snapshot of a developer’s mindset from a certain time period, while most lean toward quoting some insight from their thoughts regarding videogame development.

The complete set of interviews is here.

Since I'm unfortunately not familiar with most of these developers, I found that a fun way to approach the work was just to scroll around in the list and randomly pick an interview.

Great job, moshboy, and thanks not only for embarking on the project and carrying it through, but for publishing the results for us all!

Friday, November 25, 2011

Distributed set difference computation using invertible Bloom filters

Recently I've been slowly but steadily working my way through a meaty but rewarding recent paper entitled: What's the Difference? Efficient Set Reconciliation without Prior Context.

The subject of the paper is straightforwardly expressed:

Both reconciliation and deduplication can be abstracted as the problem of efficiently computing the set difference between two sets stored at two nodes across a communication link. The set difference is the set of keys that are in one set but not the other. In reconciliation, the difference is used to compute the set union; in deduplication, it is used to compute the intersection. Efficiency is measured primarily by the bandwidth used (important when the two nodes are connected by a wide-area or mobile link), the latency in round-trip delays, and the computation used at the two hosts. We are particularly interested in optimizing the case when the set difference is small (e.g., the two nodes have almost the same set of routing updates to reconcile, or the two nodes have a large amount of duplicate data blocks) and when there is no prior communication or context between the two nodes.
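To make the abstraction in that passage concrete, here is a minimal sketch using ordinary Python sets. It ignores the whole point of the paper (doing this cheaply across a network link) and only illustrates the definitions; the node contents and key names are made up for illustration.

```python
def set_difference(a, b):
    """Return (only_in_a, only_in_b) for two sets of keys."""
    return a - b, b - a


# Hypothetical key sets held at two nodes.
node_a = {"k1", "k2", "k3", "k4"}
node_b = {"k2", "k3", "k4", "k5"}

only_a, only_b = set_difference(node_a, node_b)

# Reconciliation: node B adds the keys it is missing, which yields the union.
reconciled = node_b | only_a

# Deduplication: whatever is not in the difference is shared by both nodes.
shared = node_a - only_a
```

When the two sets are nearly identical, the difference is tiny compared with either set, which is exactly the case the paper wants to exploit.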

The paper itself is well-written and clear, and certainly worth your time. It's been particularly rewarding for me because it's taken me down a path of investigating a lot of new algorithms that I hadn't previously been studying. My head is swimming with

Invertible Bloom Filters (a variation on counting Bloom filters, which in turn are a variation on basic Bloom filters, an algorithm that is now 40 years old!)
Tornado codes
Min-wise sketches
Characteristic Polynomial Interpolation
Approximate Reconciliation Trees

and many other related topics.

I hope to return to discussing a number of these sub-topics in later posts, whenever I find the time (heh heh). One of the things that's challenging about a lot of this work is that it's based on probabilistic algorithms, which take some time getting used to. I first studied these sorts of algorithms as an undergraduate in the early 1980's, but they still throw me when I encounter them. When studying probabilistic algorithms, you often encounter sections like the following (from the current paper):

The corollary implies that in order to decode an IBF that uses 4 independent hash functions with high probability, then one needs an overhead of k + 1 = 5. In other words, one has to use 5d cells, where d is the set difference. Our experiments later, however, show that an overhead that is somewhat less than 2 suffices.
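The arithmetic in that passage is simple enough to sketch. The helper below is my own illustration, not something from the paper: it just multiplies an estimated difference d by an overhead factor, so you can compare the proven bound (k + 1 = 5 for 4 hash functions) with the factor of roughly 2 that the authors report from experiments. The value d = 100 is an arbitrary example.

```python
import math


def ibf_cells(d, overhead):
    """Number of IBF cells needed for an estimated set difference d."""
    return math.ceil(overhead * d)


d = 100  # hypothetical size of the set difference

bound_cells = ibf_cells(d, overhead=4 + 1)   # the corollary's bound: 5d
empirical_cells = ibf_cells(d, overhead=2)   # the paper's experiments: under 2d
```

For a difference of 100 keys, that is 500 cells by the bound versus at most 200 in practice.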

The question always arises: what happens to the algorithm in those cases where the probabilities fail, and the algorithm gives the wrong answer (a false positive, say)? I believe that, in general, you can often structure the overall computation so that in these cases the algorithm still gives the correct answer, but does more work. For example, in the deduplication scenario, you could perhaps structure things so that the set difference code (which is trying to compute the blocks that are identical in both datasets, so that they can be eliminated from one set as redundant and stored only in the other set) fails gracefully on a false positive. Here, when two blocks are in fact distinct, but collide in the data structure and hence appear to be identical, the overall algorithm would need to end up treating them as distinct and retaining them in both datasets.

That is, the algorithm could be designed so that it errs on the side of safety when the probabilities cause a false positive to be returned.
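Here is a small sketch of what erring on the side of safety might look like. It is not a Bloom filter and not anything from the paper: it uses a deliberately weak, made-up fingerprint so that collisions are easy to trigger, and it treats a fingerprint match only as a probable duplicate that must be confirmed byte for byte before a block is dropped.

```python
def weak_fingerprint(block: bytes) -> int:
    # Deliberately weak (hypothetical) fingerprint so collisions are easy to show.
    return sum(block) % 251


def deduplicate(blocks):
    """Return the unique blocks, verifying every fingerprint hit."""
    seen = {}  # fingerprint -> blocks already stored under that fingerprint
    kept = []
    for block in blocks:
        candidates = seen.setdefault(weak_fingerprint(block), [])
        # A fingerprint hit is only a probable duplicate; confirm it exactly.
        if any(block == stored for stored in candidates):
            continue  # true duplicate: safe to drop
        candidates.append(block)  # new block, or a false positive: keep it
        kept.append(block)
    return kept


# b"ab" and b"ba" collide under the weak fingerprint but are distinct blocks.
blocks = [b"ab", b"ba", b"ab", b"zz"]
unique = deduplicate(blocks)
```

The false positive costs an extra comparison, but both colliding blocks survive, so the answer is still correct.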

Alternatively, some probabilistic algorithms instead fail entirely with very low probability, but fail in such a way as to allow the higher-level code to either simply re-try the computation (if it involves random behaviors, then with high probability it will work the next time), or to vary the computation in some crucial aspect, to ensure that it will succeed (which is the case in this particular implementation).

Most treatments of probabilistic algorithms describe these details, but I still find it important to always keep them in my head, in order to satisfy myself that such a probabilistic algorithm is safe to deploy in practice.

Often, the issue in using probabilistic algorithms is to figure out how to set the parameters so that the algorithm performs well. In this particular case, the issue involves estimating the size of the set difference:

To efficiently size our IBF, the Strata Estimator provides an estimate for d. If the Strata Estimator over-estimates, the subsequent IBF will be unnecessarily large and waste bandwidth. However if the Strata Estimator under-estimates, then the subsequent IBF may not decode and cost an expensive transmission of a larger IBF. To prevent this, the values returned by the estimator should be scaled up so that under-estimation rarely occurs.

That is, in this particular usage of the probabilistic algorithms, the data structure itself (the Invertible Bloom Filter) is powerful enough that the code can detect when it fails to be decoded. Using a larger IBF solves that problem, but we don't want to use a wastefully-large IBF, so the main effort of the paper involves techniques to compute the smallest IBF that is needed for a particular pair of sets to be diff'd.
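To convince myself of the "detects its own failure" property, I find it helps to see a toy version. What follows is a minimal sketch of an invertible Bloom filter, not the paper's implementation: the choice of 3 hash functions, the SHA-256 based hashing, the restriction to non-negative integer keys, and the "double the size and try again" policy are all my assumptions, and the Strata Estimator is left out entirely (the caller just passes in an estimate).

```python
import hashlib

K = 3  # number of hash functions; an assumption for this sketch


def _h(key: int, salt: int) -> int:
    """Salted 64-bit hash of an integer key (SHA-256 is an arbitrary choice)."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


class IBF:
    def __init__(self, cells: int):
        self.per = max(1, -(-cells // K))  # cells per partition, rounded up
        n = self.per * K
        self.count = [0] * n
        self.key_sum = [0] * n
        self.hash_sum = [0] * n

    def _apply(self, key: int, delta: int) -> None:
        # One cell per partition, so a key never lands in the same cell twice.
        for i in range(K):
            idx = i * self.per + _h(key, i) % self.per
            self.count[idx] += delta
            self.key_sum[idx] ^= key
            self.hash_sum[idx] ^= _h(key, K)  # checksum uses its own salt

    def insert(self, key: int) -> None:
        self._apply(key, +1)

    def subtract(self, other: "IBF") -> "IBF":
        # Both filters are assumed to have been built with the same size.
        out = IBF(len(self.count))
        for i in range(len(self.count)):
            out.count[i] = self.count[i] - other.count[i]
            out.key_sum[i] = self.key_sum[i] ^ other.key_sum[i]
            out.hash_sum[i] = self.hash_sum[i] ^ other.hash_sum[i]
        return out

    def decode(self):
        """Peel pure cells; return (ok, only_in_self, only_in_other).

        Consumes the filter. ok is False when peeling stalls before every
        cell is empty, which is how a too-small IBF reveals itself.
        """
        mine, theirs = set(), set()
        progress = True
        while progress:
            progress = False
            for idx in range(len(self.count)):
                c, key = self.count[idx], self.key_sum[idx]
                if c in (1, -1) and self.hash_sum[idx] == _h(key, K):
                    (mine if c == 1 else theirs).add(key)
                    self._apply(key, -c)  # remove the key from all its cells
                    progress = True
        ok = not (any(self.count) or any(self.key_sum) or any(self.hash_sum))
        return ok, mine, theirs


def reconcile(a, b, estimate, max_cells=1 << 16):
    """Diff two sets of ints, doubling the IBF until it decodes."""
    cells = max(K, 2 * estimate)
    while cells <= max_cells:
        fa, fb = IBF(cells), IBF(cells)
        for key in a:
            fa.insert(key)
        for key in b:
            fb.insert(key)
        ok, only_a, only_b = fa.subtract(fb).decode()
        if ok:
            return only_a, only_b
        cells *= 2  # under-estimated: pay for a larger IBF and try again
    raise RuntimeError("IBF did not decode within max_cells")


a = set(range(1000))
b = set(range(5, 1005))
only_a, only_b = reconcile(a, b, estimate=1)  # deliberately too small
```

The two sets share 995 keys, and those cancel out completely when the filters are subtracted; only the 10 differing keys remain to be peeled out. Because the initial estimate is far too low, the first attempts fail to decode and the loop has to rebuild with larger filters, which is precisely the wasted transmission that a good estimator is meant to avoid.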

If you're interested in studying these sorts of algorithms, the paper is well-written and straightforward to follow, and contains an excellent reference section with plenty of information on the underlying work on which it is based.

Meanwhile, while wandering through Professor Eppstein's web site, I came across this nifty Wikipedia book on data structures that he put together as course material for a class. Great stuff!

Thursday, November 24, 2011

Stanford crypto class

I'm not sure how this will turn out, but I've signed up for Professor Dan Boneh's online Cryptography class, which starts this winter.

Wednesday, November 23, 2011

The recent events at U C Davis and U C Berkeley

I mostly avoid political topics on my blog, but the current events on the University of California campuses are very important and need more attention. Here is a superb essay by Professor Bob Ostertag of U.C. Davis about the events of the last week, and a follow-up essay discussing ongoing events.

Meanwhile, it's interesting that some of the most compelling and insightful commentary is being published outside the U.S., for example this column and this column in the Guardian.

I don't know what the answers are. But I do know that the debate is important, and I salute the Davis and Berkeley communities for not backing down from the questions, and for opening their minds to the need to hold that debate, now. Our universities, and our children, are our future.

Monday, November 21, 2011

Danah Boyd on privacy in an online world

It's somewhat of a shock to realize that it's been more than a decade since Scott McNealy made his famous pronouncement on online privacy:

You have zero privacy anyway. Get over it.

Well, people haven't actually just got over it. It's an important, complex, and intricate issue, and happily it is getting the sort of attention it needs.

So you should set aside a bit of time, and dig into some of the fascinating work that danah boyd has published recently, including:

A detailed analysis of the impact of the Children's Online Privacy Protection Act: “Why Parents Help Their Children Lie to Facebook About Age: Unintended Consequences of the ‘Children’s Online Privacy Protection Act’” in the online journal First Monday,
and her remarks prepared for the Wall Street Journal: Debating Privacy in a Networked World for the WSJ

Both articles are extremely interesting, well-written, and deeply and carefully considered. Here's an excerpt from the WSJ discussion:

The strategies that people use to assert privacy in social media are diverse and complex, but the most notable approach involves limiting access to meaning while making content publicly accessible. I’m in awe of the countless teens I’ve met who use song lyrics, pronouns, and community references to encode meaning into publicly accessible content. If you don’t know who the Lions are or don’t know what happened Friday night or don’t know why a reference to Rihanna’s latest hit might be funny, you can’t interpret the meaning of the message. This is privacy in action.

And here's an excerpt from the First Monday article:

Furthermore, many parents reported that they helped their children create their accounts. Among the 84 percent of parents who were aware when their child first created the account, 64 percent helped create the account. Among those who knew that their child joined below the age of 13 — even if the child is now older than 13 — over two–thirds (68 percent) indicated that they helped their child create the account. Of those with children who are currently under 13 and on Facebook, an even greater percentage of parents were aware at the time of account creation. In other words, the vast majority of parents whose children signed up underage were involved in the process and would have been notified that the minimum age was 13 during the account creation process.

As Joshua Gans notes in a great essay on Digitopoly, this is not an easy situation for a parent to be in, and the stakes are actually quite high:

And there are actually many reasons why I would want to allow her to do that. First and foremost, this is the opportunity for me to monitor her interactions on Facebook — requiring she be a friend at least for a few years. That allows me some access and the ability to educate. Second, all of her friends were on Facebook. This is where tween interactions occur. Finally, I actually think that it is the evolving means of communication between people. To cut off a child from that seems like cutting them off from the future.

I can entirely sympathise; my wife and I had similar deep discussions about these questions with our children (although at the time it was MySpace and AOL, not Facebook).

They are your kids; you know them best. In so many ways, Facebook is just another part of life that you can help them with, like all those other temptations of life (drugs, sex, etc.). Talk to them, tell them honestly and openly what the issues are, and why it matters. Keep an eye on what they are doing, and let them know you'll always be there for them.

There are no simple answers, but it's great that people like boyd and Gans are pressing the debate, raising awareness, and making us all think about what we want our modern online world to be like. Here's boyd again:

We must also switch the conversation from being about one of data collection to being one about data usage. This involves drawing on the language of abuse, violence, and victimization to think about what happens when people’s willingness to share is twisted to do them harm. Just as we have models for differentiating sex between consenting partners and rape, so too must we construct models that that separate usage that’s empowering and that which strips people of their freedoms and opportunities.

This isn't going to be easy, but it's hard to think of anything that is more important than the way in which people talk with each other.

So don't just "get over it". Think about it, research it, talk about it, and help ensure that the future turns out the way it should.

Following up on Jonathan's Card

This morning, the O'Reilly web site is running a condensed interview with Jonathan Stark, discussing, with the benefit of several months of hindsight, the intriguing "Jonathan's Card" events of the summer.

If you didn't pay much attention to Jonathan's Card as it was unfolding in real time, this is a good short introduction, with a summary of the events and some links to follow-up material.

Friday, November 18, 2011

The science of Maverick's

Here's a wonderful multi-media piece diving deep into the earth science behind the surfing marvel that is Maverick's. Enjoy!

Thursday, November 17, 2011

The Lewis Chessmen at NYC's Met

Here's a nice story in the New York Times about the Lewis Chessmen, a roughly 800-year-old set of carved walrus tusk chess pieces, on exhibit at the Metropolitan Museum of Art in New York City.

Too bad I'm on the wrong side of the country; I'd love to see these. Unfortunately, according to the Met website,

After the showing in New York, they will return to London.

I guess I'll just have to figure out a way to travel to see them in their permanent home at the British Museum!

Tuesday, November 15, 2011

Software Patents, Microsoft, Android, and Barnes & Noble

If you have any interest at all in the software industry, you'll be absolutely fascinated to read this detailed article at the GrokLaw website about the legal dispute between Microsoft and Barnes & Noble over Android-related patents.

It is well-known that Microsoft claims that Android infringes on Microsoft's patents; Microsoft themselves explain this on their website, saying they "simply cannot ignore infringement of this scope and scale", and that:

The Microsoft-created features protected by the patents infringed by the Nook and Nook Color tablet are core to the user experience.

and

Our agreements ensure respect and reasonable compensation for Microsoft's inventions and patent portfolio. Equally important, they enable licensees to make use of our patented innovations on a long-term and stable basis.

However, what has never been known (until now) is precisely what those patented innovations are. As Mary-Jo Foley observed more than 6 months ago, Microsoft refuses to identify the patents, or to explain why it believes Android infringes upon them, unless a Non Disclosure Agreement is signed agreeing not to reveal that information.

Barnes & Noble apparently refused to sign that agreement, and instead found counsel to represent them, and now the information about the patents in question is no longer a secret.

According to the Barnes & Noble filings, the primary Microsoft patent which Android is alleged to infringe is a 16-year-old patent (U.S. Patent 5,778,372), which claims:

A method of remotely browsing an electronic document residing at a remote site on a computer network and specifying a background image which is to be displayed with the electronic document superimposed thereon comprising in response to a user's request to browse to the electronic document.

Apparently, changing the background on your screen when a document is displayed is patented.

I understand software quite well.

I don't understand law at all, and specifically I don't understand intellectual property law.

However, I find the GrokLaw analysis of the Barnes & Noble v. Microsoft dispute absolutely fascinating.