So, of course, I've now pressed the big upgrade button on my second Ubuntu machine, and Lucid Lynx is now winging its way to another new home.
Meanwhile, what do you read while downloading a few hundred megabytes of fine new Linux software? Well, interesting articles on ongoing development in the Linux world, of course:
Found, as interesting things often are, via Wes Felter's fine blog.
Short notes and essays about stuff that interests me (mostly technical stuff).
Pages
▼
Friday, April 30, 2010
Upgraded one machine anyway
I can't believe I said I was going to wait 4 or 5 days before upgrading my Ubuntu machine(s).
Of course, I went ahead and upgraded one of the machines today. You're not surprised, are you?
So far, everything seems fine, although I did run into this bug: https://bugs.launchpad.net/ubuntu/+source/xorg-server/+bug/548891
It seems very specific to having a Ubuntu guest under a VMWare host.
Happily, the workaround in the bug report worked fine.
And my other machines aren't VMWare guests so I don't anticipate having this problem with them.
Of course, I went ahead and upgraded one of the machines today. You're not surprised, are you?
So far, everything seems fine, although I did run into this bug: https://bugs.launchpad.net/ubuntu/+source/xorg-server/+bug/548891
It seems very specific to having a Ubuntu guest under a VMWare host.
Happily, the workaround in the bug report worked fine.
And my other machines aren't VMWare guests so I don't anticipate having this problem with them.
Thursday, April 29, 2010
Lucid Lynx is here!
Right on schedule, Canonical have placed Lucid Lynx on the download servers today.
The servers appear to be completely swamped, of course, as thousands (millions?) of people are eagerly fetching the new release.
I'll probably press the big "Upgrade" button in 4 or 5 days, assuming no obvious show stoppers get reported between now and then.
The servers appear to be completely swamped, of course, as thousands (millions?) of people are eagerly fetching the new release.
I'll probably press the big "Upgrade" button in 4 or 5 days, assuming no obvious show stoppers get reported between now and then.
Peter Norvig and the Game of Set
Peter Norvig digs deeply into the game of Set, which I've never played, but enjoy reading about. In my group of gamers, there is one other besides myself who is this sort of gamer, and 4 who are not, so by majority rule, this is not our sort of game. :)
Still, it's a beautiful game, and the mathematics are fun.
Make sure you find your way to Gregory Quenell's presentation. Money quote:
Still, it's a beautiful game, and the mathematics are fun.
Make sure you find your way to Gregory Quenell's presentation. Money quote:
We can't draw a hypercube in a 4-torus.
Deepwater Horizon
Interesting short slideshow with some information about the Deepwater Horizon incident.
In my youth (1979), I was a Transocean employee. I went through rig training and worked for 3 months as a roustabout on various rigs. This rig didn't exist then, and I no longer recall what rigs I worked on. But it was a very exciting job to have, at the time. It took me only 3 months to realize that I was much better cut out for mathematics than for rig work :)
In my youth (1979), I was a Transocean employee. I went through rig training and worked for 3 months as a roustabout on various rigs. This rig didn't exist then, and I no longer recall what rigs I worked on. But it was a very exciting job to have, at the time. It took me only 3 months to realize that I was much better cut out for mathematics than for rig work :)
Pinnacles
Nice article about Pinnacles National Monument, one of my favorite outdoor destinations in the greater Bay Area.
It's about a 2.5 hour drive, each way, from my house, so it's a long day trip, but if the weather is accomodating, it's one of my favorites.
I've always hoped to see a California Condor there, but I've visited 5 times now without success.
Still, there's always next time!
It's about a 2.5 hour drive, each way, from my house, so it's a long day trip, but if the weather is accomodating, it's one of my favorites.
I've always hoped to see a California Condor there, but I've visited 5 times now without success.
Still, there's always next time!
Tuesday, April 27, 2010
More study of diff: Walter Tichy's papers
I'm continuing to slowly work my way though the published work on diff algorithms.
Recently, I've been reading two papers by Walter Tichy:
The first paper is almost 30 years old, and dates from Tichy's work at Purdue during the development of RCS. From the introduction:
At the time, the best-known diff algorithm was Doug McIlroy's Unix diff algorithm (more on that in a future post), which is based on the detection of the Longest Common Subsequence. As Tichy shows, the LCS-based algorithms, while computationally related to the edit sequence programs, are not necessarily the best for use in difference construction.
Tichy's basic algorithm is surprisingly simple to state:
After working through a proof of the basic algorithm, Tichy briefly touches on two variations:
The first variation finds an interesting expression 15 years later in the work of Andrew Tridgell on the rsync algorithm, which I'll discuss in a future post.
Delta Algorithms: An Empirical Analysis describes Tichy's work in benchmarking diff algorithms. The paper contains dozens of scatter-plot diagrams of the various benchmark tests, as well as a fine high-level discussion of the complexity of building a suitable benchmark for diff:
The paper also describes a diff algorithm variation which they call vdelta:
Over the years, there have been a number of fundamental attempts to construct differencing algorithms. It would be an ideal world if the "best" algorithm always became the best known and most widely-used. However, the discussion and analysis of algorithms is a complex intellectual activity and many factors other than the qualities of the actual algorithm come in to play. Perhaps most importantly, if an algorithm is not well-presented and well-described, it can be over-looked and under-used, even if it is valuable and effective.
With diff algorithms, it is becoming clear that two things are true:
I've still got a number of other papers to study, but for now I think I've learned what I can from Tichy's work.
Recently, I've been reading two papers by Walter Tichy:
The first paper is almost 30 years old, and dates from Tichy's work at Purdue during the development of RCS. From the introduction:
The string-to-string correction problem is to find a minimal sequence of edit operations for changing a given string into another given string. The length of the edit sequence is a measure of the differences between the two strings.
At the time, the best-known diff algorithm was Doug McIlroy's Unix diff algorithm (more on that in a future post), which is based on the detection of the Longest Common Subsequence. As Tichy shows, the LCS-based algorithms, while computationally related to the edit sequence programs, are not necessarily the best for use in difference construction.
Tichy's basic algorithm is surprisingly simple to state:
Start at the left end of the target string T, and try to find prefixes of T in S. If no prefix of T occurs in S, remove the first symbol from T and start over. If there are prefixes, choose the longest one and record it as a block move. Then remove the matched prefix from T and try to match a longest prefix of the remaining tail of T, again starting at the beginning of S. This process continues until T is exhausted. The recorded block moves constitute a minimal covering set of block moves.
After working through a proof of the basic algorithm, Tichy briefly touches on two variations:
Program text and prose have the property of few repeated lines. ... To speed up comparisons, the program should use hashcodes for lines of text rather than performing character-by-character comparisons.
An important element in the Knuth-Morris-Pratt algorithm is an auxiliary array N which indicates how far to shift a partially matched pattern or block move after a mismatch. ... Fortunately, N can also be computed incrementally.
The first variation finds an interesting expression 15 years later in the work of Andrew Tridgell on the rsync algorithm, which I'll discuss in a future post.
Delta Algorithms: An Empirical Analysis describes Tichy's work in benchmarking diff algorithms. The paper contains dozens of scatter-plot diagrams of the various benchmark tests, as well as a fine high-level discussion of the complexity of building a suitable benchmark for diff:
The first problem encountered when defining a benchmark is finding an appropriate data set that is both large enough for the results to be statistically significant and representative of real world applications. For delta algorithms, the most important quality of any benchmark is that it contain a wide spectrum of change examples. This means that both the size of the changes represented and the size of the files involved should vary considerably. Large changes on small files and small changes on large files should be included as well as small changes on small files and large changes on large files.
Furthermore, the benchmark should contain a variety of formats, in particular pure text, pure object code, and pseudo text.
The paper also describes a diff algorithm variation which they call vdelta:
Vdelta is a new technique that combines both data compression and data differencing. It is a refinement of W.F. Tichy's block-move algorithm, in that, instead of a suffix tree, vdelta uses a hash table approach inspired by the data parsing scheme in the 1978 Ziv-Lempel compression technique. Like block-move, the Ziv-Lempel technique is also based on a greedy approach in which the input string is parsed by longest matches to previously seen data. ... Vdelta generalizes Ziv-Lempel and block-move by allowing for string matching to be done both within the target data and between a source data and a target data. For efficiency, vdelta relaxes the greedy parsing rule so that matching prefixes are not always maximally long.
Over the years, there have been a number of fundamental attempts to construct differencing algorithms. It would be an ideal world if the "best" algorithm always became the best known and most widely-used. However, the discussion and analysis of algorithms is a complex intellectual activity and many factors other than the qualities of the actual algorithm come in to play. Perhaps most importantly, if an algorithm is not well-presented and well-described, it can be over-looked and under-used, even if it is valuable and effective.
With diff algorithms, it is becoming clear that two things are true:
- There have been a variety of diff algorithms discovered and re-discovered over the years, but many of them are not well-described nor easy to find: the papers are scattered, hard to locate, and behind ACM or IEEE paywalls; and when the papers are tracked down, they are confusing and hard to read.
- The two Myers papers ("A file comparison program" and "An O(ND) difference algorithm and its variations") are so well-written and so well-known that they have pretty much dominated the discussion.
I've still got a number of other papers to study, but for now I think I've learned what I can from Tichy's work.
Monday, April 26, 2010
How do these computers work, anyway?
We recently bought a Wii, and have been learning how to play.
Our grand-daughter, too, has been learning how to play.
She is frightfully good, and soon will be the best.
But, for now, I can still beat her reliably in bowling and golf.
She gets rather frustrated when she loses, so I tried to explain to her that it's my game, and I've had more practice, and so I'm better at the game (right now) and hence I win.
The other day, she and my wife were playing at bowling. Somewhat accidentally, the game was configured so that the two "Miis" (virtual characters) were mis-set, and my grand-daughter was actually playing as me.
Well, she proceeded to lose to her grand-mother.
At which point she stomped her feet in frustration and exclaimed:
Our grand-daughter, too, has been learning how to play.
She is frightfully good, and soon will be the best.
But, for now, I can still beat her reliably in bowling and golf.
She gets rather frustrated when she loses, so I tried to explain to her that it's my game, and I've had more practice, and so I'm better at the game (right now) and hence I win.
The other day, she and my wife were playing at bowling. Somewhat accidentally, the game was configured so that the two "Miis" (virtual characters) were mis-set, and my grand-daughter was actually playing as me.
Well, she proceeded to lose to her grand-mother.
At which point she stomped her feet in frustration and exclaimed:
I don't get it. Grandpa said that he always wins, so why didn't he win this time?
Google Summer of Code 2010
The Google Summer of Code project has announced the list of projects that have been accepted for the summer of 2010.
I'm pleased that the Derby project at Apache was awarded three projects:
We had a number of very good proposals this year. Overall, Google reported that they received nearly 6,000 proposals, and awarded 1,025 of them. As you can see, it is a very competitive program and it is not easy to qualify among so many excellent students.
Congratulations to the students who were accepted!
This is my third year being involved with the Google Summer of Code program, and my second year as a mentor with an accepted project. Last year's project was a very rewarding experience; I hope that I have another good experience this summer.
I'm pleased that the Derby project at Apache was awarded three projects:
- UTF-8 character support in the DRDA client-server networking protocol.
- JUnit test conversion and bug-fixing
- Query plan and execution statistics visualization tools
We had a number of very good proposals this year. Overall, Google reported that they received nearly 6,000 proposals, and awarded 1,025 of them. As you can see, it is a very competitive program and it is not easy to qualify among so many excellent students.
Congratulations to the students who were accepted!
This is my third year being involved with the Google Summer of Code program, and my second year as a mentor with an accepted project. Last year's project was a very rewarding experience; I hope that I have another good experience this summer.
This blog is one year old
I've reached my one year blog-i-versary!
As with most aspects of my life, I never have enough time to tend to my blog, but I'm pleased that it's lasted this long, and I hope to keep it going.
Happy Blog-i-versary to me!
As with most aspects of my life, I never have enough time to tend to my blog, but I'm pleased that it's lasted this long, and I hope to keep it going.
Happy Blog-i-versary to me!
Saturday, April 24, 2010
Open source and bond rating agencies
It's interesting to read this story in the New York Times with my open source hat on.
The article discusses the role that open-ness played in the CDO/CDS ratings disaster:
That paragraph beginning "But..." is a common knee-jerk reaction to the concept of open source, particularly for sensitive topics such as financial ratings, and it's wrong, so wrong, 100% wrong.
At first, it's hard to understand how sharing and opening-up a complex and sensitive process can make it more secure. So it's time for a bit of history.
A similar process has occurred with computer security software over the years. Initially, the prevailing notion was that, the less known about the security software and its implementation, the better. At one company I worked for in the early 80's, the subroutine package which handled security went by the internal acronym CTTC; this string of letters was used to name the library, the APIs and variables, etc. When I asked what it meant, somebody finally told me: "Confusing To The Curious".
But, over time, the computer software security community has come to hold a different view, best summarized in Eric Raymond's memorable phrase:
As Raymond observes:
It's an argument that takes time to comprehend and grow comfortable with. If you've never done so, I encourage you to read Raymond's entire paper.
So, back in the world of finance, how does this argument apply? Well, I think that, if handled properly, the release of rating model details should actually make the ratings models better, not worse. When computer software security teams first started releasing the details of security bugs, people were horrified, as they thought that this was tantamount to giving the bad guys the keys to the storehouse.
But in fact, it was exactly the opposite: the bad guys already have the keys to the storehouse (and we see exactly this in the NYT article that started this discussion); what the open-ness process does is to give the good guys a chance to improve and refine their models.
Back to the New York Times article:
Do you see the comparison now? The agencies were trying to make the models better, but they needed more eyeballs.
Some years ago, the Secure Hash Algorithm was broken. If you have no idea what I'm talking about, here's a good place to start. That discovery, as it percolated through the computer security community, could have had a variety of results: it could have been covered up, it could have been "classified", it could have been played down.
Instead, the computer security world made it big news, as big as they possibly could. They told everybody. They blogged about it, wrote articles for the trade press, called up journalists and tried to explain to them what it meant.
The end result was this: The National Institute of Standards and Technology is holding a world-wide competition to design and implement a new hash function, the best one that the entire planet can possibly construct.
This is the process that we need for the ratings agencies and their models: instead of having a closed, secret process for developing ratings, we need an open one. We need eager graduate students pursuing their research projects in ratings models. We need conferences where people from different organizations and communities meet and discuss how to comprehend the financial models and their complexity. We need sunshine. We need a mindset in which anyone who reads this paragraph recoils in horror at how wrong the second sentence is, especially after the first sentence seems to show that they were starting to 'get it':
A commitee??!?
Make no mistake about it: modern structured financial instruments are extraordinarily complex. As this paper shows, they are equivalent in complexity to the hardest problems that computer scientists tackle. So hopefully the task forces and politicians investigating these problems will realize that they need to be handled using techniques that we have evolved for tackling these problems.
The article discusses the role that open-ness played in the CDO/CDS ratings disaster:
The rating agencies made public computer models that were used to devise ratings to make the process less secretive. That way, banks and others issuing bonds — companies and states, for instance — wouldn’t be surprised by a weak rating that could make it harder to sell the bonds or that would require them to offer a higher interest rate.
But by routinely sharing their models, the agencies in effect gave bankers the tools to tinker with their complicated mortgage deals until the models produced the desired ratings.
That paragraph beginning "But..." is a common knee-jerk reaction to the concept of open source, particularly for sensitive topics such as financial ratings, and it's wrong, so wrong, 100% wrong.
At first, it's hard to understand how sharing and opening-up a complex and sensitive process can make it more secure. So it's time for a bit of history.
A similar process has occurred with computer security software over the years. Initially, the prevailing notion was that, the less known about the security software and its implementation, the better. At one company I worked for in the early 80's, the subroutine package which handled security went by the internal acronym CTTC; this string of letters was used to name the library, the APIs and variables, etc. When I asked what it meant, somebody finally told me: "Confusing To The Curious".
But, over time, the computer software security community has come to hold a different view, best summarized in Eric Raymond's memorable phrase:
Given enough eyeballs, all bugs are shallow
As Raymond observes:
Complex multi-symptom errors also tend to have multiple trace paths from surface symptoms back to the actual bug. Which of the trace paths a given developer or tester can chase may depend on subtleties of that person's environment, and may well change in a not obviously deterministic way over time. In effect, each developer and tester samples a semi-random set of the program's state space when looking for the etiology of a symptom. The more subtle and complex the bug, the less likely that skill will be able to guarantee the relevance of that sample.
It's an argument that takes time to comprehend and grow comfortable with. If you've never done so, I encourage you to read Raymond's entire paper.
So, back in the world of finance, how does this argument apply? Well, I think that, if handled properly, the release of rating model details should actually make the ratings models better, not worse. When computer software security teams first started releasing the details of security bugs, people were horrified, as they thought that this was tantamount to giving the bad guys the keys to the storehouse.
But in fact, it was exactly the opposite: the bad guys already have the keys to the storehouse (and we see exactly this in the NYT article that started this discussion); what the open-ness process does is to give the good guys a chance to improve and refine their models.
Back to the New York Times article:
Sometimes agency employees caught and corrected such entries. Checking them all was difficult, however.
“If you dug into it, if you had the time, you would see errors that magically favored the banker,” said one former ratings executive, who like other former employees, asked not to be identified, given the controversy surrounding the industry. “If they had the time, they would fix it, but we were so overwhelmed.”
Do you see the comparison now? The agencies were trying to make the models better, but they needed more eyeballs.
Some years ago, the Secure Hash Algorithm was broken. If you have no idea what I'm talking about, here's a good place to start. That discovery, as it percolated through the computer security community, could have had a variety of results: it could have been covered up, it could have been "classified", it could have been played down.
Instead, the computer security world made it big news, as big as they possibly could. They told everybody. They blogged about it, wrote articles for the trade press, called up journalists and tried to explain to them what it meant.
The end result was this: The National Institute of Standards and Technology is holding a world-wide competition to design and implement a new hash function, the best one that the entire planet can possibly construct.
This is the process that we need for the ratings agencies and their models: instead of having a closed, secret process for developing ratings, we need an open one. We need eager graduate students pursuing their research projects in ratings models. We need conferences where people from different organizations and communities meet and discuss how to comprehend the financial models and their complexity. We need sunshine. We need a mindset in which anyone who reads this paragraph recoils in horror at how wrong the second sentence is, especially after the first sentence seems to show that they were starting to 'get it':
David Weinfurter, a spokesman for Fitch, said via e-mail that rating agencies had once been criticized as opaque, and that Fitch responded by making its models public. He stressed that ratings were ultimately assigned by a committee, not the models.
A commitee??!?
Make no mistake about it: modern structured financial instruments are extraordinarily complex. As this paper shows, they are equivalent in complexity to the hardest problems that computer scientists tackle. So hopefully the task forces and politicians investigating these problems will realize that they need to be handled using techniques that we have evolved for tackling these problems.
Thursday, April 22, 2010
My Chair
This is my chair at work.
I've been here for just about exactly one month.
Today, Katie came by (she's part of our Facilities team), and asked me whether I liked my chair.
"Why yes, I do," I said.
"Well," she said, "I generally come around after the first month and ask people how they feel about their chair. If you don't like that chair, we'll get you another. Some people like this one better."
"This one is just fine," I said, "but thanks very much for asking!"
"No problem!" she said.
I've been here for just about exactly one month.
Today, Katie came by (she's part of our Facilities team), and asked me whether I liked my chair.
"Why yes, I do," I said.
"Well," she said, "I generally come around after the first month and ask people how they feel about their chair. If you don't like that chair, we'll get you another. Some people like this one better."
"This one is just fine," I said, "but thanks very much for asking!"
"No problem!" she said.
Wednesday, April 21, 2010
The Internet has an answer for everything!
Simple, yet surprisingly useful: http://islostarepeat.com/
However, since that's true, this should keep you alive until May 4th.
However, since that's true, this should keep you alive until May 4th.
One tack to India!
Fascinating story about an Omani group who have faithfully recreated a 9th-century trading vessel and are attempting to sail it from Oman to Singapore.
The past is so often a dusty book; it's wonderful to see an effort which truly brings history to life.
The past is so often a dusty book; it's wonderful to see an effort which truly brings history to life.
Tuesday, April 20, 2010
MapReduce for single-system programming
If you're familiar with MapReduce, it's probably in the context of giant data centers like Google's, where enormous tasks are processed by decomposing them into smaller jobs which are then distributed over thousands of individual servers, with the results re-assembled at the end.
Well, here's a nice body of work by a team at Stanford, showing how they are using the basic concepts of MapReduce (functional programming, problem decomposition, parallelism, resource management) to accomplish parallel programming tasks on single-system configurations.
Most parallel and concurrent programming APIs are too hard: to complex to understand, and to easy to use incorrectly. MapReduce has been successful over the last 15 years because it nicely balances the power of parallelism with a clear and simple programming abstraction.
Well, here's a nice body of work by a team at Stanford, showing how they are using the basic concepts of MapReduce (functional programming, problem decomposition, parallelism, resource management) to accomplish parallel programming tasks on single-system configurations.
Google's MapReduce implementation facilitates processing of terabytes on clusters with thousands of nodes. The Phoenix implementation is based on the same principles but targets shared-memory systems such as multi-core chips and symmetric multiprocessors.
Most parallel and concurrent programming APIs are too hard: to complex to understand, and to easy to use incorrectly. MapReduce has been successful over the last 15 years because it nicely balances the power of parallelism with a clear and simple programming abstraction.
Monday, April 19, 2010
Google's Innovation Factory
Here's an interesting paper and the accompanying slides from a Keynote presentation by Patrick Copeland of Google about some of the build/test/scm/etc infrastructure that Google's software development efforts depend on.
This, as they say, is the "money quote":
Oh, what the heck, how about two money quotes:
Everybody "knows" that investment in development tools, infrastructure, and productivity support is important. But Google not only talk the talk, they back it up with real systems and processes. Very impressive!
This, as they say, is the "money quote":
Code is expected to have high reliability as it is written and we adhere to a socially reinforced code review and check-in practices. Development teams write good tests because they care about the products, but also because they want more time to spend writing features and less on debugging. Teams with good testing hygiene upstream have more time to innovate, and are thus more adaptable and competitive. In addition, there is one source tree and poorly written code is quickly identified because it breaks other people's tests and projects. Aggressive rolling back is employed to keep the tree building "green".
Oh, what the heck, how about two money quotes:
Within milliseconds of a code check-in, our build process will automatically select the appropriate tests to run based on dependency analysis, run those tests and report the results. By reducing the window of opportunity for bad code to go unnoticed, overall debugging and bug isolation time is radically reduced. The net result is that the engineering teams no longer sink hours into debugging build problems and test failures.
Everybody "knows" that investment in development tools, infrastructure, and productivity support is important. But Google not only talk the talk, they back it up with real systems and processes. Very impressive!
Wednesday, April 14, 2010
Windows fun with LastModified time
Here's an interesting thing I noticed yesterday. I only tried it on Windows 7, so perhaps the behavior is different on other versions of Windows. Also, this behavior seems to be specific to situations where Daylight Saving Time applies, so if your computer isn't observing Daylight Saving Time, you probably won't notice this.
Anyway, without further ado:
On the systems that I tried it on, the DIR command displays a last-modified time that is 1 hour later than the time showed in Windows Explorer (which, I think, is the correct time).
If you happen to have a system running a version of Windows other than Windows 7, I'd appreciate it if you tried this experiment and let me know what you found.
Anyway, without further ado:
- Open Windows Explorer and find a file somewhere on your machine which you worked on back in February. It can be any file: a document, a picture, etc.
- In Windows Explorer, right-click on that file and choose Properties.
- Now, run the Command Prompt. (Start -> Run, and type "cmd"). Change directory to the folder where you found the file from February, and type "dir" to display the directory information.
- Now, tell me:
Does the last-modified time displayed by the "dir" command match the last-modified time displayed in the "Properties" window?
On the systems that I tried it on, the DIR command displays a last-modified time that is 1 hour later than the time showed in Windows Explorer (which, I think, is the correct time).
If you happen to have a system running a version of Windows other than Windows 7, I'd appreciate it if you tried this experiment and let me know what you found.
Sunday, April 11, 2010
Gata Kamsky is still competitive
Why, look! It's a sports story, and it's not about the Masters and what's-his-name!
This weekend's NYT has a story about about last weekend's Philadelphia Open, and about Gata Kamsky, one of the top chess players in the United States. When I was paying a lot of attention to chess back in the early 1990's, Gata Kamsky was all the rage: a 17 year old phenom who looked like he was unstoppable.
Well, fast-forward 18 years:
But, as the article goes on to discuss, Kamsky ain't done for yet, and his strong showing in the Philadelphia Open, which is one of the strongest tournaments held in this country, is a promising one:
The column goes on to dissect an exciting 7th round match between Kamsky and Vladimir Romanenko of Belarus. Plenty of errors on both sides, but somehow Kamsky hung on and found a way to win.
OK, I've got a few years on Kamsky, but still, let's celebrate: way to go, old guys!
This weekend's NYT has a story about about last weekend's Philadelphia Open, and about Gata Kamsky, one of the top chess players in the United States. When I was paying a lot of attention to chess back in the early 1990's, Gata Kamsky was all the rage: a 17 year old phenom who looked like he was unstoppable.
Well, fast-forward 18 years:
Most of the top players are European, and most of the top tournaments are in Europe. So it is not surprising that there are few spots in those competitions for non-Europeans.
For many years, if an American player was included in an elite event, the invitation went to Gata Kamsky. But Kamsky's world ranking has slipped to No. 34, while Hikaru Nakamura, the reigning United States champion, has risen to No. 17. Nakamura, at age 22, is 13 years younger than Kamsky, and he plays an exciting chess that is popular with fans. So Nakamura now seems to be claiming most of the choice tournament spots.
But, as the article goes on to discuss, Kamsky ain't done for yet, and his strong showing in the Philadelphia Open, which is one of the strongest tournaments held in this country, is a promising one:
In the Philadelphia Open last weekend, he tied for first with three other grandmasters, and he took the title on a tie-breaker.
The column goes on to dissect an exciting 7th round match between Kamsky and Vladimir Romanenko of Belarus. Plenty of errors on both sides, but somehow Kamsky hung on and found a way to win.
OK, I've got a few years on Kamsky, but still, let's celebrate: way to go, old guys!
A close reading of "diff"
In literary studies, "close reading" is a technique in which the reader conducts an extremely thorough and detailed study of a relatively short passage. It's not an easy skill to learn, but, once learned, it can be very successful at revealing insights about the work you're studying.
Here are a few nice introductions to the notion of close reading. Lynch describes the purpose of close reading quite nicely:
Although most software has centuries of development to go before it reaches the level of accomplishment that great works of literature have achieved, the serious study of software has many aspects in common with close reading.
Recently, I've been engaged in a close reading of "diff".
You know diff, of course: it's the program which computes the differences between a pair of files, and displays them to you.
Diff has been around a long while, and has been very closely studied. It's also a great example of a program where the proper choice of algorithm makes an enormous difference. So it's important to start, before reading the code, by understanding the algorithm.
In the case of diff, it turns out that there are some great papers that, over the years, have done a wonderful job of explaining how diff works. Here's three papers that tend to be most commonly credited as the most important:
I'm planning to get to all of these papers, but for the last week, I've been painstakingly and minutely making my way, word by word, through
This last paper is in the journal Software, Practice and Experience, and was published in 1985. It's somewhat more recent than the other papers, and is not therefore "beginning at the beginning", but it's where I chose to start, and so here I am.
There are (at least) three aspects to a diff algorithm:
To the first point, there are many different ways to describe the differences between two files; some are short and precise, others are long and verbose. We speak about the differences between two files by talking about an "edit script", which is a sequence of commands (insert this line here, remove that line there) that can be applied to transform the one file into the other. There is such a thing as the shortest possible edit script, and any diff algorithm worth respecting should produce such a script. There are usually more than one shortest possible edit script, however, so diff programs may vary in their output while all still producing the "best" result.
To the second point, it is very important that we be able to diff two files efficiently. Modern systems spend stupendous amounts of time computing differences between pairs of files, so improvements in execution time can have breakthough results on larger systems.
To the last point, which is my current focus, some algorithms are simply harder to understand than others. So, I'm glad I started with the SP&E presentation, as it is very clear and readable. The algorithm described by Miller and Myers is a dynamic programming technique, inductively proved correct, that computes edit (sub-)scripts of length N + 1 by applying 3 basic rules to edit scripts of length N (greatly summarized here):
I haven't really done a very good job of describing the algorithm. Hopefully, though, I've motivated you to go track down the references and read the algorithm(s) for yourself.
Enjoy your close reading!
Here are a few nice introductions to the notion of close reading. Lynch describes the purpose of close reading quite nicely:
That means reading every word: it's not enough to have a vague sense of the plot. Maybe that sounds obvious, but few people pay serious attention to the words that make up every work of literature. Remember, English papers aren't about the real world; they're about representations of the world in language. Words are all we have to work with, and you have to pay attention to them.
Although most software has centuries of development to go before it reaches the level of accomplishment that great works of literature have achieved, the serious study of software has many aspects in common with close reading.
Recently, I've been engaged in a close reading of "diff".
You know diff, of course: it's the program which computes the differences between a pair of files, and displays them to you.
Diff has been around a long while, and has been very closely studied. It's also a great example of a program where the proper choice of algorithm makes an enormous difference. So it's important to start, before reading the code, by understanding the algorithm.
In the case of diff, it turns out that there are some great papers that, over the years, have done a wonderful job of explaining how diff works. Here's three papers that tend to be most commonly credited as the most important:
- The String-to-String Correction problem, Robert Wagner and Michael Fischer. This paper is 35 years old, and is certainly one of the great early efforts in algorithm description and analysis.
- The String-to-String Correction Problem with Block Moves, Walter Tichy. Tichy is well-known not only for this work, but also for the development of RCS, one of the first widely-successful version control programs.
- An O(ND) Difference Algorithm and its Variations, Eugene Myers. This is the paper which introduced the famous notion of "snakes". Here's a great pair of articles working through this algorithm in detail.
I'm planning to get to all of these papers, but for the last week, I've been painstakingly and minutely making my way, word by word, through
- A File Comparison Program, Webb Miller and Eugene Myers
This last paper is in the journal Software, Practice and Experience, and was published in 1985. It's somewhat more recent than the other papers, and is not therefore "beginning at the beginning", but it's where I chose to start, and so here I am.
There are (at least) three aspects to a diff algorithm:
- How good is the result?
- How fast does the algorithm run?
- How hard is the algorithm to understand?
To the first point, there are many different ways to describe the differences between two files; some are short and precise, others are long and verbose. We speak about the differences between two files by talking about an "edit script", which is a sequence of commands (insert this line here, remove that line there) that can be applied to transform the one file into the other. There is such a thing as the shortest possible edit script, and any diff algorithm worth respecting should produce such a script. There are usually more than one shortest possible edit script, however, so diff programs may vary in their output while all still producing the "best" result.
To the second point, it is very important that we be able to diff two files efficiently. Modern systems spend stupendous amounts of time computing differences between pairs of files, so improvements in execution time can have breakthough results on larger systems.
To the last point, which is my current focus, some algorithms are simply harder to understand than others. So, I'm glad I started with the SP&E presentation, as it is very clear and readable. The algorithm described by Miller and Myers is a dynamic programming technique, inductively proved correct, that computes edit (sub-)scripts of length N + 1 by applying 3 basic rules to edit scripts of length N (greatly summarized here):
- Move right: Given a script that converts A[1:i] to B[1:j-1], adding the command 'Insert B[j]' produces the next step in the edit script
- Move down: Given a script that converts A[1:i-1] to B[1:j], adding the command 'Delete A[i]' produces the next step in the edit script
- Slide down the diagonal: Given a script that converts A[1:i-1] to B[1:j-1], and given that A[i] equals B[j], the edit script that we already have already produces the next step in the edit script, converting A[1:i] to B[1:j].
I haven't really done a very good job of describing the algorithm. Hopefully, though, I've motivated you to go track down the references and read the algorithm(s) for yourself.
Enjoy your close reading!
Tuesday, April 6, 2010
Sunday NYT Magazine
Perhaps I'm spending a bit too much time with the Sunday NYT Magazine nowadays.
But, on the positive side, I've recently been able to finish both the crossword and the seven-by-seven KenKen by Thursday, or often even by Wednesday!
This week's Puns and Anagrams puzzle, however, I'm leaving for my mom.
But, on the positive side, I've recently been able to finish both the crossword and the seven-by-seven KenKen by Thursday, or often even by Wednesday!
This week's Puns and Anagrams puzzle, however, I'm leaving for my mom.
Monday, April 5, 2010
RhythmBox's iPod interface has me baffled
I think I want to do something fairly simple with my little iPod shuffle and my Ubuntu desktop:
I take the iPod over to the fitness center, so I have something to listen to on the treadmill.
I used to use iTunes on Windows, and I found the interface pretty easy:
But with RhythmBox, I keep tying myself in knots:
Is there an easier-to-use Ubuntu application for fetching recent podcast episodes, keeping my podcast episode collection trimmed to a manageable size, and copying recent episodes to my iPod, overwriting the previous ones as I do so?
- Subscribe to a fairly small (5-6) set of podcasts
- About twice a week, connect my iPod shuffle to my laptop
- Get the latest podcasts from my feeds, copy them onto my shuffle, deleting the podcasts that were previously there, and disconnect the iPod
- Repeat the process again, a few days later
I take the iPod over to the fitness center, so I have something to listen to on the treadmill.
I used to use iTunes on Windows, and I found the interface pretty easy:
- I could tell what podcasts I was subscribed to,
- I could fetch the latest episodes, after scanning their descriptions to see if they sounded interesting.
- I could copy those episodes to the iPod, and could select and delete old episodes off the iPod
But with RhythmBox, I keep tying myself in knots:
- I think I've deleted episodes, but they are still present on the iPod even though they no longer show up in the RhythmBox display! By still present I mean: when I press the "play" button on the iPod, the podcast episode plays, even though I deleted it via RhythmBox (and responded to a confirmation prompt!)
- I can't easily see a list of which new podcast episodes are available for download
- RhythmBox seems to have two displays: "Music" and "Podcasts" which seem to mostly show the same thing, but not quite exactly, and neither seems quite accurate
Is there an easier-to-use Ubuntu application for fetching recent podcast episodes, keeping my podcast episode collection trimmed to a manageable size, and copying recent episodes to my iPod, overwriting the previous ones as I do so?
Saturday, April 3, 2010
Modern computers, VMWare, and OS joy
My development machine at my day job is quite amply provisioned with the two resources that no programmer ever gets enough of: memory (8Gb) and disk (2Tb). I've been periodically checking the display that shows my overall usage of disk space, waiting for the gauge to think that I've used enough disk space to be worthy of mention. With 2Tb, that may take a while...
But one wonderful thing that can be done with a computer like this is to run VMWare. Specifically, I have VMWare Fusion 3 on my Mac. It's hard to overstate what an amazing piece of software VMWare is.
So I've been building additional virtual machines, since programmers (especially programmers like me) love to have as many operating systems available to them as possible.
So far, I've got a fairly short list:
I'm confused by OpenSolaris version numbering: am I running OpenSolaris 9? Or 10? VMWare Fusion considers OpenSolaris 9 to be "experimental" while it thinks version 10 is "stable". However, when I actually boot the OpenSolaris VM, it seems to think I'm running version 5. Hmmm... Much more to learn about OpenSolaris.
Meanwhile, the question is, what to install next?
I'm intending to install FreeBSD version 8 next week, partly at the suggestion of my co-worker, who feels that there is a lot to learn from the FreeBSD implementations of various OS-level behaviors (networking, memory management, etc.).
Here's the current list of supported guest operating systems. It seems kind of old to me, but there are definitely some ideas here.
What about Chrome OS? What is involved in getting it running under VMWare Fusion? I see a few intriguing blog posts, but it all still looks a bit sketchy.
Of course, you could say that 5 operating systems ought to be enough for now, and I think you're right. But if you know of other operating systems I should take out for a spin, drop me a line!
But one wonderful thing that can be done with a computer like this is to run VMWare. Specifically, I have VMWare Fusion 3 on my Mac. It's hard to overstate what an amazing piece of software VMWare is.
So I've been building additional virtual machines, since programmers (especially programmers like me) love to have as many operating systems available to them as possible.
So far, I've got a fairly short list:
- Mac OS X (Snow Leopard)
- Windows 7 (64 bit)
- Ubuntu 9.1 (32 bit)
- Open Solaris 2009.6
I'm confused by OpenSolaris version numbering: am I running OpenSolaris 9? Or 10? VMWare Fusion considers OpenSolaris 9 to be "experimental" while it thinks version 10 is "stable". However, when I actually boot the OpenSolaris VM, it seems to think I'm running version 5. Hmmm... Much more to learn about OpenSolaris.
Meanwhile, the question is, what to install next?
I'm intending to install FreeBSD version 8 next week, partly at the suggestion of my co-worker, who feels that there is a lot to learn from the FreeBSD implementations of various OS-level behaviors (networking, memory management, etc.).
Here's the current list of supported guest operating systems. It seems kind of old to me, but there are definitely some ideas here.
What about Chrome OS? What is involved in getting it running under VMWare Fusion? I see a few intriguing blog posts, but it all still looks a bit sketchy.
Of course, you could say that 5 operating systems ought to be enough for now, and I think you're right. But if you know of other operating systems I should take out for a spin, drop me a line!
Thursday, April 1, 2010
When fan fiction meets the real world
I loved this story about how a marketing/PR exercise has turned into a great example of fan fiction.
"TripAdvisor is the lifeblood of agrotourism", says Dwight!
TripAdvisor added a caveat explaining that Schrute Farms was fictional, Ms. Petersen said. “We had a complaint from someone who had wanted to go there.”
"TripAdvisor is the lifeblood of agrotourism", says Dwight!