Friday, May 28, 2010

Search Engine Optimization is a deep subject

Happily, I've never had to worry too much about Search Engine Optimization, but I've watched a few people try. It's a deep subject. SEO, of course, is the problem of figuring out how to tune and adjust your website so that search engines, such as Google or Bing, will find your web site more easily, and will direct more traffic your way.

If you've never thought about SEO, a reasonable place to start is with Google's starter guide. It'll help you understand the basic ways in which your site can influence the behaviors of the search engines.

And you'll want to learn about the various tools that are available to you to help you measure the effectiveness of your tuning.

Once you get past the basics, things get much more complex, and you can put a lot of effort into analyzing and studying your site to try to figure out whether it's complying with all the various best practices and policies.

But even if you become an expert, you can still be surprised and confused by the changes that occur as the search engines continually adjust their own behaviors. For people who are trying to swim in these rivers, Vanessa Fox's advice seems quite sound:

If you tie construction of your site to any one perceived algorithm signal, you’re at the mercy of Google’s constant tweaks. These frequent changes are one reason Google itself downplays algorithm updates. Focus on what Google is trying to accomplish as it refines things (the most relevant, useful results possible for searchers) and you’ll generally avoid too much turbulence in your organic search traffic.

Well put, Ms Fox!

Tuesday, May 25, 2010

Professional awards

I happened to be wandering around the Google "recent publications" page and saw that today's entry is a paper written by my old friend John Black.

I thought for an instant this meant he'd joined Google, but no, he's still teaching in Colorado. I guess it's one of his former students who is now a Googler.

Anyway, while bouncing around the net I remembered that I particularly liked John's mention of a professional award he received in 1996:

A check for $2.56 from Don Knuth, 1996

I can't remember exactly the circumstances, but of course John was a recipient of one of the famous Knuth Reward Checks.

Though I see that Knuth is no longer issuing those checks.

VoltDB general availability

The VoltDB team have announced the general availability of Volt. Volt is supposed to be an in-memory relational DBMS using standard SQL as its language and standard transactional integrity as its data model, which is rather refreshing amongst all the hubbub nowadays about NoSQL, Key/Value stores, and so forth.

Volt can't really seem to decide what sort of organization they are, however. Part of the company seems to want to be an open-source company, and they are making the source available via Subversion and licensing it under GPLv3. But another part of the company seems to want to be a traditional enterprise software startup: the website trumpets their "serial entrepreneur" leadership and their "technical overview" links lead to a marketing page that asks you to register as a sales lead to get the information. The best information about the VoltDB architecture appears to be this paper from VLDB 2008.

It's very interesting that their stored procedure language is Java. In the Derby community, we have fairly extensive experience with using Java as a server-side stored procedure language. I wonder if the Volt community is interested in sharing their experiences with the Derby community?

Monday, May 24, 2010

Draw odds armageddon

The sporting world is filled with tie-breaking techniques.

Many sports simply play extra time, with or without the "golden goal" (also known by the grimmer "sudden death"), and sometimes with the added fillip of a minimum victory differential (e.g., in tennis).

In baseball, at the highest level, the extra time is without bounds, and may stretch for hours, while other sports tend to limit the extra time and then either admit a draw or switch to some other mechanism, like a shootout.

But here's a most unusual tie-breaking technique, somtimes known by the colorful phrase "draw odds armageddon", that is being practiced this week in the United States Chess Championship, in St Louis, Missouri:

The base time for the game is 60 minutes+ 5 second increment. It will be a draw odds game (Black wins on a draw.) The Players will both bid on the amount of time (minutes and seconds, a number equal or less to 60:00) that they are willing to play with in order to choose their color. The Player who bid the lower number of time chooses his or her color and is allowed the amount of time they bid; the other side shall receive 60:00 time. If both Players pick exactly the same number, the chief arbiter will flip a coin to determine who shall choose their color.

At the top levels of chess, having White is a significant advantage, but if you bid too low in order to gain the first move advantage, the combination of the time shortage and the draw odds scoring could be your downfall.

We'll find out tomorrow if it's Shulman or Kamsky who takes White, and what the outcome is.

Meanwhile, don't miss this chance to replay Yuri Shulman's gorgeous and spectacular queen sacrifice victory over up-and-coming Hikaru Nakamura. The winning move will take your breath away.

Google I/O seemed kind of underwhelming this year

Why do I feel like Google I/O was sort of a disappointment this year? Maybe I was expecting too much. Certainly there is some interesting stuff, including:

But nothing really grabbed me this year.

What do you think? Am I missing something? What was the major focus of Google I/O from your perspective?

Sunday, May 23, 2010

Country gone lurid

Is this the most beautifully vivid description of a rock music song that you have ever read? It's from today's New York Times review of the Rolling Stones's new reissue of "Exile on Main Street":

It's country gospel gone lurid, and it seems to rise up out of a nap. Nicky Hopkins's piano chords circle around a G at slow tempo in an echoey room. Charlie Watts starts pumping a bass drum at the third beat of the second bar; he's either late or early, but finding his way. Piano and drums roll up to the A chord at the beginning of the first verse, and Mick Taylor bends two guitar strings under Mick Jagger's opening line: "I'm the man on the mountain -- yes, come on up." Onward, Mr. Watts weaves around the beat, smashing down on his high-hat, forming weird and clattering snare-drum fills. He both shapes and follows the group's euphoria and the music's subtle acceleration. The Stones gather around the song like pickpockets, jostling and interfering with it. Keith Richards, playing rhythm guitar and singing backup, quits harmonizing and starts to shout.

"gather around the song like pickpockets" -- wonderful!

Saturday, May 22, 2010

13 days to go!

OK, I'm ready. I've filled out my brackets, hung my poster-size chart on the wall, and watched Write the Future.

Go Ivory Coast! Go Ghana! Go Slovakia! Go Chile!

Want to know what the play will be like? Warm up today by watching the Champions League final, held on a weekend for what? the first time ever? It's being carried by Fox, so ratings in the U.S. could actually be quite respectable.

Thursday, May 20, 2010

Hard Core

... and this, my friends, is how you know you are in the presence of someone who loves C programming ...

v += 0xff & *++*sourcestart;

Wednesday, May 19, 2010

Derby 10.6 is released!

See Knut Anders's weblog for the scoop! I was part of several major elements of the 10.6 release, and helped out with a number of smaller pieces, so I'm very pleased to see this work released. Rick, thanks a million for putting out the release!

Sadly, after I left my previous job, the new owners of the system almost immediately dumped Derby and transitioned to a Industrial Strength DBMS. Sigh. Derby was in 24x7 production use for almost 4 years at AmberPoint, and I think it did a wonderful job. But I've moved on, and they've moved on.

I still think Derby is a wonderful piece of software, and contains some of the best Java code I've seen anywhere: clearly written, efficient, well-maintained, thoroughly tested, elegant.

I'm not finding much time to spend on Derby right now; working on my mentoring projects for the Google Summer of Code is about all I have time for.

But Derby is a good project and I hope to stay connected to it, and I hope that it finds the success it deserves.

A compendium of all knowledge

Not only does the Internet contain everything that can be known, but just occasionally, it has something useful.

Well, to be honest, I think that qualifies more precisely as "something my son will find quite useful". :)

Friday, May 14, 2010

Studying failure

In my day job, we care a lot about server uptime, reliability, and availability. Any little disruption in the server is pored over and studied in great detail. You have to sweat the small stuff.

So I loved this essay in the Usenix blog comparing the IT attitude toward server availability to the efforts of the aviation profession. As the author notes, the aviation industry has a century's more experience with these topics, and so has much to teach us about what is needed in order to achieve the reliability levels that we desire:

I’m talking about the methods, the drive, and the sheer determination to discover, at all costs, the root cause of the issues that occur in the aviation profession.

Don't miss this link to the author's more detailed essay about redundancy and disaster planning. Beware the fiber backhoe!

Tuesday, May 11, 2010

Stock market mysteries

It sounds like we still have no idea what exactly happened in the stock markets for 15 minutes last Thursday.

I continue to be surprised that certain orders are being cancelled, even though there is no evidence yet that those orders were fraudulent or were improperly executed as the result of a mechanical failure of some sort.

This is a pretty frightening admission by SEC Chairman Mary Schapiro:

One of the challenges we face in recreating the events of last Thursday is the reality that the technologies used for market oversight and surveillance have not kept pace with the technology and trading patterns of the rapidly evolving and expanding securities markets.

Ms. Schapiro's testimony, as well as the testimony of Commodity Futures Trading Commission Gary Gensler, is available as links from this New York Times article.

The CFTC would like more data:

Independent from Thursday's events, the CFTC currently is considering the implications of co-location and high-frequency trading. We also are considering a rule related to account identification so that the CFTC can collect better and more-detailed information on each trader in the futures markets.

Sounds like our financial regulators need more data and better analysis tools. Hopefully they can figure out a better answer for "what happened on Thursday" than they've been able to provide so far.

Patience Diff

Bram Cohen's Patience Diff algorithm is really quite clever. The core insight of Patience Diff is that most LCS-based diff algorithms, such as the Hunt-McIlroy and Miller-Myers diff algorithms that I've been studying, are easily misled by certain high-frequency, low-content lines of text which occur in programming language texts.

For example, many program source files are filled with lines like:

  • blank line

  • {

  • }

  • return;

  • /*

  • */

and so forth.

LCS-based diff programs concentrate a lot of energy in finding long sequences of common lines like these, and then trying to match up those common sequences of uninteresting lines with a few content-heavy diff blocks.

Patience Diff, instead, focuses its energy on the low-frequency high-content lines which serve as markers or signatures of important content in the text. It is still an LCS-based diff at its core, but with an important difference, as it only considers the longest common subsequence of the signature lines:

Find all lines which occur exactly once on both sides, then do longest common subsequence on those lines, matching them up.

Here's a superb writeup, with side-by-side examples, that does a great job of illustrating why these signatures lines produce a much clearer diff.

It's not obviously clear whether Patience Diff is necessarily faster than the classic diff algorithms, or whether it has its own set of pathological cases; I suspect all diff algorithms are somewhat susceptible to such weaknesses. But it's a very clearly-presented idea, with a powerful intuitive basis, and it will be very interesting to see how it spreads as more people become familiar with it.

Tie-breakers in Chess? Not necessary this time...

The world championship match between Viswanathan Anand and Veselin Topalov is nearing the end of its scheduled series of games, and the score is currently tied with one game left.

In previous world championships that I am aware of, in this situation, the match would have been declared a tie and the current champion (Anand) would simply retain his title.

But in this match, there is a tie-breaker format defined, in which the players will play a series of faster games, with much stricter time controls, culminating in an "Armageddon game" if the match remains tied after those short games are played.

I clearly hadn't been paying enough attention before, for I wasn't aware of these tie-breaker rules. It's nice to have a clear victor, but as in many sporting situations, the tie breakers are unsatisfying in many respects.

Perhaps there will be a clear victor in the final match today.

Update: The game is a fierce fight! Anand has conducted a tremendous attack and has won Topalov's queen. The game has been underway for many hours and still is not resolved.

Update 2: It's over, Topalov resigned. Mate was imminent (the immediate thread was Qh1 mate).

Saturday, May 8, 2010

The rsync algorithm

Andrew Tridgell and Paul Mackerras's 1996 paper on the rsync algorithm is a masterpiece. The algorithm is a masterpiece, and the paper is a masterpiece. In 5 clear and readable pages, the authors describe a brilliant and elegant algorithm which combines difference computation, cryptography, and network communications to provide a remote file update service.

Here's how the paper starts:

Imagine you have two files, A and B, and you wish to update B to be the same as A. The obvious method is to copy A onto B.

Now imagine that the two files are on machines connected by a slow communications link, for example a dial up IP link. If A is large, copying A onto B will be slow. To make it faster you could compress A before sending it, but that will usually only gain a factor of 2 to 4.

Now assume that A and B are quite similar, perhaps both derived from the same original file. To really speed things up you would need to take advantage of this similarity. A common method is to send just the differences between A and B down the link and then use this list of differences to reconstruct the file.

The problem is that the normal methods for creating a set of differences between two files rely on being able to read both files. Thus they require that both files are available beforehand at one end of the link. If they are not both available on the same machine, these algorithms cannot be used (once you had copied the file over, you wouldn't need the differences). This is the problem that rsync addresses.

The essence of the rsync algorithm is its insight into the idea that cryptographic signatures of sections of the text are much smaller than the text itself, but contain enough information to locate sections of common text:

The basic strategy is to compute the 32-bit rolling checksum for a block of length S starting at each byte of A in turn, and for each checksum, search the list for a match.
If no match is found at a given offset in the file, the rolling checksum is updated to the next offset and the search proceeds. If a match is found, the search is restarted at the end of the matched block. This strategy saves a considerable amount of computation for the common case where the two files are nearly identical.

Andrew Tridgell's work on rsync eventually become the basis for his PhD thesis, Efficient Algorithms for Sorting and Searching. It's one of those rare PhD theses that is both readable and worth reading (though I haven't made it through all of it). Tridgell is most famous for his work on Samba, the open source implementation of the Microsoft remote-file-and-print protocols that is the basis for most data center inter-operability in almost every machine room I've ever been around. He's also somewhat famous for being the inspiration of one of the more famous breaking-up letters in computer software history, an event which resulted in the creation of Git.

Friday, May 7, 2010

Modern mysteries confound

Our world is filled with technology that we strain to understand. Our systems have become so complex and so sophisticated, that we observe their behavior but cannot always explain it.

My wife and I were discussing the stock market behaviors, idly. She noted that all the information should be available: the stock exchanges know which parties made which trades at which times for what prices. This is true, but we don't know why those trades were made. Even the exchanges aren't sure about some of these trades, and apparently have cancelled some trades? And, in many (the majority?) of the trades, the parties were not humans, making decisions for human reasons, but computerized trading algorithms, making decisions based on their programming, and doing so at the speed of machines.

You can't sub-poena a trading algorithm and demand that it explain its reasoning.

The other Myers diff paper

The other Myers diff paper is An O(ND) Difference Algorithm and its Variations. It was published in the journal Algorithmica in 1986. These two papers, together, are really the core of understanding how all modern diff algorithms work. I happened to read A File Comparison Program first, though it probably would be better to read them in the opposite order. Well, who knows; read them both!

An O(ND) Difference Algorithm and its Variations appears to be the paper where the famous "snakes" are first discussed. The image that I get in my mind while reading the paper and looking at the edit graph, is of the diff algorithm examining the two files in parallel, searching for the common bits, and stitching together the different bits, "snaking" its way back and forth between the two files as it goes. I'm not sure if this is the image that Myers had in mind, but it works for me. A snake is a section of common content between the two files, represented diagrammatically as a diagonal portion of the edit graph, and snakes are important because, the longer the snakes are in your path, the more compact your diff is; the total length of all the snakes is the length of the common subsequence that the dynamic algorithm has located:

The problem of finding a shortest edit script reduces to finding a path from (0,0) to (N,M) with the fewest number of horizonal edges. Let a D-path be a path starting at (0,0) that has exactly D non-diagonal edges. A 0-path must consist solely of diagonal edges. By a simple induction, it follows that a D-path must consist of a (D-1)-path followed by a non-diagonal edge and then a possibly empty sequence of diagonal edges called a snake.

Snakes are all over the rest of the paper, as you end up reading sentences like:

The final snake must be maximal, as the D-path would not be furthest reaching if the snake could be extended.

After a little while, you start to get pretty comfortable with the snake notion, though my wife still doesn't want me extending any maximal snakes around the house!

Later in the paper, there is also a fantastic diagram explaining the "contours" notion that you find in other papers, in which the dynamic programming algorithm effectively explores the space of all possible edit scripts which have the same edit distance. The Myers paper calls these "edit graph boundaries".

Both the Myers papers have stood up very well over time. I've seen a number of more recent papers which attempt to re-present and re-analyze the problem, but I haven't seen anything which is superior to the two Myers papers, and most other presentations are in fact far weaker, so reading these two papers is clearly the place to start.

Now I'm off to go track down a reference to Bram Cohen's Patience Diff algorithm, which a poster on Stack Overflow claimed was the new diff algorithm in the bazaar version control system. (Bram Cohen is a famously confident programmer, but it still seems rather bold to title your essay "The diff problem has been solved.")

Thursday, May 6, 2010

The future of web browsers.

Here's a terse set of notes from a panel Q-and-A regarding the future of web browsers.

Naturally, each of the browser vendors was somewhat pushing their own agenda, but still there is lots of interesting stuff to read here. Unfortunately, the comments are not verbatim, but instead are the notes from an attendee, so they are substantially edited and condensed. It would be interesting to see a verbatim transcript of the panel.

It's a shame that Adobe wasn't on the panel, nor was Apple, and so the big hot question of the spring (iPhones vs Flash) didn't get discussed at all. However, there is this short set of notes from a separate interview with Adobe's Kevin Lynch, including:

We’re facing a time where there are some who want to wall off parts of the web and need to have approval. I don’t think that’s the role of a company. Apple is playing this strategy where they want to create a walled garden.

It's interesting to see the role reversal between Microsoft and Apple nowadays.

Tuesday, May 4, 2010

Deep deep dive into multi-threading and concurrency

I thought I knew a lot about multi-threading and concurrency; I've been building multi-threaded software systems since 1985. I'm quite comfortable with the various threading primitives and common approaches to thread safety, and I can crudely describe the difference between a LoadLocked/StoreConditional architecture and a CompareAndSwap architecture. I've been trying to keep up with the action in thread-aware inflatable thin locks, concurrent garbage collection, and other recent advances.

If the above paragraph more-or-less describes you, you'll find this edited summary of a discussion between Cliff Click and Doug Lea fascinating...

And, now I have a whole long list of new things to go learn about:

  • cardmarks

  • fenceless weak CAS

  • workstealing array spaces

  • park/unpark