Journal of a Programmer: June 2012

Friday, June 22, 2012

Offline for a while...

... see you all in July, hopefully I'll remember to take a few pictures...

Thursday, June 21, 2012

It's not just a movie...

... it's a breakthrough vehicle for computer rendering of realistic human skin, among other significant steps forward in computerized visual effects: Prometheus: rebuilding hallowed vfx space:

The specular reflection on the skin provides an immediate surface property. It gives one the details of the skin, diffused light is inherently less ‘detailed’ as the light has been scattered under the skin. In reality skin is a complex combination of scattered and surface properties. “Traditionally what we would do back at King Kong era is we would have a sub-surface model and then we’d add some sort of general Lambertian back in to maintain that detail,” explains Hill. It’s something we’ve always tried to fake until now, but the new model allows us not to have to do that, in a far superior way. It contains the detail right at the surface without us having to de-convolve our texture maps to accommodate it.”

And not just skin algorithms; the orrery scene was quite a step forward, as well:

A significant amount of work was required in preparing the plates for the star map effects, by removing the distortion from the plates and also solving for stereo. “Say your camera’s a totally non-VA (vertically-aligned),” explains Cole, “then you’re getting parallax between these very fine star fields and your pixel you can touch on the screen is the sum of all the pixels behind it – and you have a vertical alignment matte you can rubber sheet that around. You can push that one pixel around but you’ve got baked parallax.”

Wednesday, June 20, 2012

Version Control for database stored procedures

At my day job, we spend a lot of time thinking and talking about the advantages that version control brings to all sorts of digital data, so much so that the company's new tagline is: Version Everything!

Here's a nice essay about exactly why this is so important, and how simple it can be to include Version Everything in all your work: Deploying PostgreSQL Stored Procedures.

Months later, the function needs to be modified again, this time by someone else, who makes the change to the function source code file in the VCS, commits, tests and deploys the function.
Whops! The local modification made by the evil consultant is gone! Alarms goes off, people are screaming “the system is down!”, people are running around in circles, you revert your change, back to the previous version in the VCS, but people are screaming even higher “the system is down!”.

Nice work, Joel!

2012 USENIX Federated Conferences Week

The 2012 USENIX Federated Conferences Week was last week, June 12-15.

The main event in this collection of co-located, co-ordinated conferences is the USENIX Annual Technical Conference, but there are several other major conferences that occur at the same time, including Hot Cloud '12, Hot Storage '12, TaPP '12, and the CMS Summit.

In addition to their always-stellar work of organizing and selecting talks, USENIX have once again made all the conference materials freely available, to benefit the rest of the world and help enlarge the knowledge of the human race. Thank you, USENIX!

There's an immense amount to learn about in these conferences; here are a few of the talks that caught my eye:

Finding Soon-to-Fail Disks in a Haystack (Moises Goldszmidt, Microsoft Research)
Delta Compressed and Deduplicated Storage Using Stream-Informed Locality (Philip Shilane, Grant Wallace, Mark Huang, and Windsor Hsu, EMC Corporation)
Non-Linear Compression: Gzip Me Not! (Michael F. Nowlan, Bryan Ford, and Ramakrishna Gummadi, Yale University)
The TokuFS Streaming File System (John Esmet, Tokutek & Rutgers; Michael A. Bender, Tokutek & Stony Brook; Martin Farach-Colton, Tokutek & Rutgers; Bradley C. Kuszmaul, Tokutek & MIT)
The Seven Deadly Sins of Cloud Computing Research (Malte Schwarzkopf, University of Cambridge Computer Laboratory; Derek G. Murray, Microsoft Research Silicon Valley; Steven Hand, University of Cambridge Computer Laboratory)
Saving Cash by Using Less Cache (Timothy Zhu, Anshul Gandhi, and Mor Harchol-Balter, Carnegie Mellon University; Michael A. Kozuch, Intel Labs)
Automated Diagnosis Without Predictability Is a Recipe for Failure (Raja R. Sambasivan and Gregory R. Ganger, Carnegie Mellon University)
Erasure Coding in Windows Azure Storage (Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin, Microsoft Corporation)
netmap: A Novel Framework for Fast Packet I/O (Luigi Rizzo, Università di Pisa, Italy)
Body Armor for Binaries: Preventing Buffer Overflows Without Recompilation (Asia Slowinska, Vrije Universiteit Amsterdam; Traian Stancescu, Google, Inc.; Herbert Bos, Vrije Universiteit Amsterdam)
Granola: Low-Overhead Distributed Transaction Coordination (James Cowling and Barbara Liskov, MIT CSAIL)
Generating Realistic Datasets for Deduplication Analysis (Vasily Tarasov and Amar Mudrankit, Stony Brook University; Will Buik, Harvey Mudd College; Philip Shilane, EMC Corporation; Geoff Kuenning, Harvey Mudd College; Erez Zadok, Stony Brook University)
AddressSanitizer: A Fast Address Sanity Checker (Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitriy Vyukov, Google)
Practical Hardening of Crash-Tolerant Systems (Miguel Correia, IST-UTL/INESC-ID; Daniel Gómez Ferro, Flavio P. Junqueira, and Marco Serafini, Yahoo! Research)
Dynamic Reconfiguration of Primary/Backup Clusters (Alexander Shraer and Benjamin Reed, Yahoo! Research; Dahlia Malkhi, Microsoft Research; Flavio Junqueira, Yahoo! Research)

Any particular favorites of yours that I missed? Let me know!

That's a long one, all right!


Date    Sunrise  Sunset  Day length  Dawn   Dusk   Dawn-to-dusk length
Today   05:47    20:34   14:47       05:16  21:06  15:50

What a great day the world picked out of a hat for me!

Tuesday, June 19, 2012

Lots of interesting things to read about Flame

Over the last few days, the discussion about the "Flame" virus has been fascinating.

Here are a few of the interesting tidbits I've noticed:

Mikko Hypponen of F-Secure looks at the role of AntiVirus software vendors in the (non-)detection of Flame in his article at Wired: Why Antivirus Companies Like Mine Failed to Catch Flame and Stuxnet
As far as we can tell, before releasing their malicious codes to attack victims, the attackers tested them against all of the relevant antivirus products on the market to make sure that the malware wouldn’t be detected.
Kurt Wismer reflects on Chris Soghoian's observations about national intelligence agencies and their role in cyberwar, and talks about some of the implications of Flame's use of the Windows Update vector for transmission, in his article:
we placed trust in microsoft's code, in the automaton they designed, not because it was trustworthy, but because it was more convenient than being forced to make the equivalent decisions ourselves. furthermore, we relied on it for protecting consumers because it's easier than educating them (in fact many still don't believe this can or should be done).
Richard Bejtlich follows up on the points made by Soghoian and Wismer, in his article Flame Hypocrisy, and links to an article by David Gilbert in the International Business Times: US Government Behind Flame Virus According to Expert, in which Mikko Hypponen is quoted as saying:
If the US government did direct one of its intelligence agencies to attack an American company of the reputation and size of Microsoft, it would mark a major turning point in cyber espionage activity.
Hypponen told IBTimes UK that he was planning on writing an open letter to Barack Obama this week to say: "Stop taking away the trust from the most important system we have, which is Microsoft Windows Updates."
And, today, Kim Zetter of Wired has an article: Report: US and Israel Behind Flame Espionage Tool, following up on last week's article: Researchers Connect Flame to US-Israel Stuxnet Attack. Zetter links to this article published in the Washington Post: U.S., Israel developed Flame computer virus to slow Iranian nuclear efforts, officials say, which is unfortunately behind a paywall, but is said to confirm the role of the U.S. Government in developing the Flame and Stuxnet malware.

Really, did you go watch Soghoian's speech? It's not that long (12 minutes), and very interesting. Go. Watch. It.

Medical professionals don't like the idea that the CIA will pretend to be them, for the simple reason that many of these NGO health roles require the trust of individuals, and if people think you are a spook, they aren't going to let you poke needles in them. ... But, we want horrible diseases to be eradicated. That's what's important for our security. ... We need people in these parts of the world to trust medical professionals.

Scary discussions, scary thoughts.

Sunday, June 17, 2012

It's summer in California

School is out, the days are long, and summer has finally arrived in California.

Yesterday was nice and warm, with triple-digit temperatures in many locations. We found ourselves down in beautiful Santa Cruz, where the ocean air kept things delightful, warm but not too hot.

Lest you think that yesterday's heat was unusual, read these wonderful passages from William Brewer's logbook of his travels in California's Central Valley during the summer of 1862, 150 years ago, re-created at Tom Hilton's marvelous Whitney Survey website:

June 14, 1862: Orestimba Canyon
You cannot imagine how tedious it is to ride on this plain. The soil and herbage is dry and brown, few green things cheer the eye, no trees (save in the distance) vary that great expanse. Tens of thousands of cattle are feeding, but they are but specks save when they cluster in great herds near the water—often for miles we see nothing living but ourselves, except birds or insects, reptiles, and ground squirrels.
June 16, 1862: Orestimba Canyon
we climbed some hills about 2,200 feet high, a few miles from camp. It was intensely hot. I know not how hot, but over 90° all day in camp where there was wind, and vastly hotter in some of the canyons we had to cross. It was 81° long after sunset.

Orestimba Canyon, I believe, is west of the town of Newman, California, and east of Henry Coe State Park. Here's a nice picture of the area. And here's a nice article about the Orestimba Indian Rocks at the foot of the canyon: Orestimba Indian Rocks Reveal a Way of Life.

Saturday, June 16, 2012

Russia plummets to earth

Well, I suppose it's my fault, as I predicted they'd take it all, but perhaps I can't take too much credit.

Anyway, Russia are out, just one of the many Euro 2012 surprises.

Kid Dynamite tries to explain how it all came to pass:

So Poland and Czech Republic, once Greece had the lead, had to go for broke – must win to advance. And yet, at the same time, Czech Republic couldn’t just go crazy, because if Russia managed to pull back and draw with Greece, then Czech Republic would be good with a draw of their own! This is why the games are played simultaneously on the last day, by the way!

The core problem is that Breaking Ties Is Complicated

If two teams that have the same number of points, the same number of goals scored and conceded, play their last group match against each other, and are still equal at the end of that match, the ranking of the two teams in question is determined by kicks from the penalty mark (Article 16), provided no other teams within the group have the same number of points on completion of all group matches.

I suppose it's a good thing it didn't come to that (and I thought software patent law was complicated!); perhaps things will be better once the singularity arrives.

Are the new TLDs a land grab?

Dave Winer with some very interesting observations: Tech press misses Google/Amazon name grab

Amazon and Google have made an audacious grab of namespace on the Internet. As far as I can see there's been no mention of this in the tech press.

Friday, June 15, 2012

Stuff I'm reading on a Friday afternoon

A variety of stuff, in a variety of areas (what else did you expect?)

It's been a few weeks now since the great LinkedIn password disaster. You might be sick of reading about it, but these particular essays are worth your time:
- Brian Krebs talks to Thomas Ptacek about password security: How Companies Can Beef Up Password Security
  The difference between a cryptographic hash and a password storage hash is that a cryptographic hash is designed to be very, very fast. And it has to be because it’s designed to be used in things like IP-sec. On a packet-by-packet basis, every time a packet hits an Ethernet card, these are things that have to run fast enough to add no discernible latencies to traffic going through Internet routers and things like that. And so the core design goal for cryptographic hashes is to make them lightning fast.
  Well, that’s the opposite of what you want with a password hash. You want a password hash to be very slow. The reason for that is a normal user logs in once or twice a day if that — maybe they mistype their password, and have to log in twice or whatever. But in most cases, there are very few interactions the normal user has with a web site with a password hash. Very little of the overhead in running a Web application comes from your password hashing. But if you think about what an attacker has to do, they have a file full of hashes, and they have to try zillions of password combinations against every one of those hashes. For them, if you make a password hash take longer, that’s murder on them.
- Steven Bellovin points out some counter-intuitive aspects of the LinkedIn compromise: Password Leaks
  There's another ironic point here. Once you log in to a web site, it typically notes that fact in a cookie which will serve as the authenticator for future visits to that site. Using cookies in that way has often been criticized as opening people up to all sorts of attacks, including cross-site scripting and session hijacking. But if you do have a valid login cookie and hence don't have to reenter your password, you're safer when visiting a compromised site.
- Patrick Nielsen has a great two-part series of posts:
  - The History of Password Security
  - Storing Passwords Securely
  Nielsen again discusses some of the differences between cryptographic hashes and password hashes:
  If you create a digest of a password, then create a digest of the digest, and a digest of that digest, and a digest of that digest, you've made a digest that is the result of four iterations of the hash function. You can no longer create a digest from the password and compare it to the iterated digest, since that is the digest of the third digest, and the third digest is the digest of the second digest. To compare passwords, you have to run the same number of iterations, then compare against the fourth digest. This is called stretching.
- Francois Pesce of Qualys took some time to share some statistical observations about the leaked hashes: Lessons Learned from Cracking 2 Million LinkedIn Passwords
  The hashes in the 120MB file sometimes had their five first characters rewritten with 0. If we look at the 6th to 40th characters, we can even find duplicates of these substrings in the file meaning the first five characters have been used for some unknown purpose: is it LinkedIn that stores user information here? is it the initial attacker that tagged a set of account to compromise? This is unknown.
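The "stretching" idea that Ptacek and Nielsen describe is easy to try out in Python. What follows is only a minimal sketch, not anyone's production scheme: the function names, the salt size, and the iteration counts are my own assumptions. The first function is the naive digest-of-a-digest loop from Nielsen's description; the other two use the standard library's hashlib.pbkdf2_hmac, which is the sort of vetted construction (along with bcrypt or scrypt) you would want in practice rather than a hand-rolled loop.

```python
import hashlib
import hmac
import os


def stretched_digest(password: bytes, salt: bytes, iterations: int) -> bytes:
    """Naive stretching: hash once, then hash the digest, `iterations` times in total."""
    digest = hashlib.sha256(salt + password).digest()
    for _ in range(iterations - 1):
        digest = hashlib.sha256(digest).digest()
    return digest


def make_record(password: bytes, iterations: int = 200_000) -> tuple:
    """Return (salt, iterations, key) derived with PBKDF2-HMAC-SHA256."""
    salt = os.urandom(16)  # a fresh random salt per password (size is an assumption)
    key = hashlib.pbkdf2_hmac("sha256", password, salt, iterations)
    return salt, iterations, key


def check_password(candidate: bytes, record: tuple) -> bool:
    """Re-run the same number of iterations and compare in constant time."""
    salt, iterations, key = record
    attempt = hashlib.pbkdf2_hmac("sha256", candidate, salt, iterations)
    return hmac.compare_digest(attempt, key)
```

The point of the iteration count is the asymmetry Ptacek mentions: a legitimate login pays the cost once, while an attacker with a file of hashes pays it for every guess against every hash.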
I must sadly admit that somehow I had never heard of the brilliant young computer scientist Mihai Pătraşcu before his tragic death last month. If, like me, you were ignorant of this young man and his work, read on:
- Lance Fortnow and Bill Gasarch have a great retrospective on his life and work at their blog
- Dick Lipton and Ken Regan review his work on their blog
- Rasmus Pagh, Rina Panigrahy, Kunal Talwar and Udi Wieder explore some of Mihai's beautiful work in the field of reductions
- His blog is full of fascinating articles
- Michael Mitzenmacher's blog has a variety of interesting comments from people who knew Mihai
Nat Torkington does something unusual. Instead of his normal "Four Short Links", he chooses to supply just One Short Link, as he is so impressed by the results of Marc Hedlund and the engineering team at Etsy with their summer research program.
With help from all of you, Hacker School received applications from 661 women, nearly a 100-times increase from the previous session.
Interesting tidbits about the new FLAME virus are starting to emerge: FLAME – The Story of Leaked Data Carried by Human Vector
So, how is the memory stick carried between the two systems? Well, here is where the human factor kicks in. So it’s amazing how two instances of Flame communicate with one another using a memory stick and a human as a channel. A private channel is created between two machines and the person carrying the memory stick has no idea that he/she is actually contributing to the data leak.
File systems and storage systems continue to evolve:
- Introduction to Data Deduplication in Windows Server 2012
  Deduplication creates fragmentation for the files that are on your disk as chunks may end up being spread apart and this causes increases in seek time as the disk heads must move around more to gather all the required data. As each file is processed, the filter driver works to keep the sequence of unique chunks together, preserving on-disk locality, so it isn’t a completely random distribution. Deduplication also has a cache to avoid going to disk for repeat chunks. The file-system has another layer of caching that is leveraged for file access. If multiple users are accessing similar files at the same time, the access pattern will enable deduplication to speed things up for all of the users.
- VMFS Locking Uncovered
  In order to deal with possible crashes of hosts, the distributed locks are implemented as lease-based. A host that holds a lock must renew a lease on the lock (by changing a "pulse field" in the on-disk lock data structure) to indicate that it still holds the lock and has not crashed. Another host can break the lock if the lock has not been renewed by the current holder for a certain period of time.
- NFS on vSphere – A Few Misconceptions
  There are only two viable ways to attempt load balancing NFS traffic in my mind. The decision boils down to your network switching infrastructure, the skill level of the person doing the networking, and your knowledge of the traffic patterns of virtual machines.
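The lease-based locking described in the VMFS article can be modeled in a few lines. This is a toy, in-memory sketch under my own assumptions: the class and method names and the lease length are hypothetical, the caller passes in the current time instead of reading a clock, and nothing here reflects the actual VMFS on-disk lock format.

```python
class LeaseLock:
    """Toy lease-based lock: the holder must keep renewing or risk losing the lock."""

    def __init__(self, lease_seconds: float):
        self.lease_seconds = lease_seconds
        self.holder = None       # host currently holding the lock, if any
        self.renewed_at = None   # time of the holder's last renewal ("pulse")

    def _expired(self, now: float) -> bool:
        return self.holder is not None and now - self.renewed_at > self.lease_seconds

    def acquire(self, host: str, now: float) -> bool:
        """Take the lock if it is free, or break it if the lease has lapsed."""
        if self.holder is None or self._expired(now):
            self.holder, self.renewed_at = host, now
            return True
        # Lock is held and the lease is still live: only the holder succeeds.
        return self.holder == host

    def renew(self, host: str, now: float) -> bool:
        """Refresh the lease; fails if the host is not the holder or let the lease lapse."""
        if self.holder != host or self._expired(now):
            return False
        self.renewed_at = now
        return True
```

A crashed host simply stops calling renew, so after lease_seconds have passed any other host's acquire will succeed, which is the "break the lock" case from the quote.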
This game looks like fun: Ninja: Legend of the Scorpion Clan. Here's the BGG page
Ryan Carlson with an intriguing, infuriating post about How I manage 40 people remotely. Why do I say infuriating? I guess it's just that I believe that a setup like this is doomed:
I’m in the UK with one other person on the Support Team, our main office is in Orlando and the rest of the Team is spread out all around the States.
Carlson seems to have come to, at least partly, the same conclusion:
I’ve decided it’s no longer viable to manage the team from another country. We’re still going to operate remotely as a company with everyone spread out around the US, but as the CEO I really need to be on US time.
So I’m moving my family to Portland Oregon where we’re going to setup another office for Treehouse. A lot of the team will still be remote but being closer will really help. My goal is to slowly gather Team Members in our Portland office
Cliff Mass wonders how the success of SpaceX will play out for other space-related government activities, such as weather forecasting. Will there be a Weather-X?
The National Weather Service prediction efforts are crippled by inadequate computer infrastructure, lack of funds for research and development, an awkward and ineffective research lab structure out of control of NWS leaders, and government personnel rules that don't allow the NWS to replace ineffective research and development staff. Lately there has been talk of furloughs for NWS personnel and a number of the NWS leadership are leaving. The NWS has fallen seriously behind its competitors (e.g., the European Center for Medium Range Weather Forecasting, UKMET office, Canadian Meteorological Center) even though the U.S. has a huge advantage in intellectual capital (U.S. universities and the National Center for Atmospheric Research are world leaders in field, as are several U.S. government research labs--e.g, NRL Monterey).
Here's a nice short page at the IETF summarizing the current state of the various HTTP 2.x proposals.
People always ask me what I use for my Integrated Development Environment. It's often hard to explain to them that the operating system itself is my IDE. Now I can just point them to this great description: Using Unix as your IDE
I don’t think IDEs are bad; I think they’re brilliant, which is why I’m trying to convince you that Unix can be used as one, or at least thought of as one. I’m also not going to say that Unix is always the best tool for any programming task; it is arguably much better suited for C, C++, Python, Perl, or Shell development than it is for more “industry” languages like Java or C#, especially if writing GUI-heavy applications. In particular, I’m not going to try to convince you to scrap your hard-won Eclipse or Microsoft Visual Studio knowledge for the sometimes esoteric world of the command line. All I want to do is show you what we’re doing on the other side of the fence.
Lastly, what a great conference this must have been: Turing’s Tiger Birthday Party
Alan Turing earned his Ph.D. at Princeton in 1938 under the great Alonzo Church. This alone gives Princeton a unique claim to Turing. But there are many other connections between Turing and Princeton. Two of the other great “fathers of computation,” John von Neumann and Kurt Gödel, were also at Princeton and promoted his transfer in 1936 from Cambridge University.
...
The meetings were held in McCosh 50. I (Dick) taught freshman Pascal CS101 there years ago there with Andrea LaPaugh, while Ken remembers taking Econ 102: Microeconomics there. This is the same hall where Andrew Wiles gave his “general” talk to a standing-room audience on his famous solution to Fermat’s Last Theorem, after he had repaired it.

Must be a heck of a rainstorm in Donetsk

During today's Euro 2012 match between Ukraine and France, the game was suspended due to weather conditions, which is quite a rare occurrence for a soccer game.

Here's a snip from the ESPN matchcast blog:

4'
Scores of fans are leaving their seats at the front near the pitch to avoid getting soaking wet in the rain. Man up!
5'
Wow - play is suspended because of the torrential rain. Brave decision but probably the right call.
6'
There is lightning right above the stadium and the referee has decided to take the players off, give it five minutes and see if this brutal storm passes.

Sounds like quite the thunderstorm!

Thursday, June 14, 2012

It's not just a game ...

It's a great way to spend an afternoon with friends, just hanging out at home, checking out the new gadgets you've been working on: Terminal Velocity

Great work, Jason Craft and friends!

Wednesday, June 13, 2012

Contented by life's subtleties

Here's a nice article discussing the particular pleasure of Train Simulator (nee Railworks).

Your in-game character lacks any super powers, or a sword, or even a keychain container of mace. You don't sport body armor, or even a leather jacket.

As the author points out, though, the game is well-loved by many (my father is quite the Railworks fan).

My respect for the Railworks community began to grow as it occurred to me that their passion does not require thrills, instead they are contented by life's subtleties. Their fantasies don't rely upon adrenaline or destruction, they just wish to peacefully command a Class 47 Triple Grey all the way from Oxford to Paddington.

Well, to each their own; it's fine if it's not for everybody, and I'm glad the Train Simulator team have found their audience.

Some things take a long time; some things go very quickly

Two unrelated items that caught my eye today:

Kevin Kelly: The One Minute Vacation
I took a one-second clip each day on a two-month trip in Asia during April & May 2012. On a few days I just had to do an extra second, so this video is actually 90 seconds long. I was inspired by Ceasar Kuriyama's one-second-per-day life summary.
An unending, ten-year-long game of Civ II that has degraded to eternal war.
The only governments left are two theocracies and myself, a communist state. I wanted to stay a democracy, but the Senate would always over-rule me when I wanted to declare war before the Vikings did. This would delay my attack and render my turn and often my plans useless. And of course the Vikings would then break the cease fire like clockwork the very next turn.

Juxtaposed for your reading pleasure!

Tuesday, June 12, 2012

Dig deep for knowledge

There's nothing better than finding articles that look deep into the complex details that underlie modern software. I love it when somebody takes the time to really study and then describe something in sufficient detail.

Your reward for writing such an article? I praise you here! To wit:

Martin Nilsson and the team at Opera Software contributed this great review of the SPDY protocol, critiquing and analyzing it in detail, including the observation that, on very small devices, the 32K window that deflate compression needs can be problematic:
Just putting the HTTP request in a SPDY stream (after removing disallowed headers) only differs by 20 bytes to SPDY binary header format with dictionary based zlib compression. The benefit of using a dictionary basically goes away entirely on the subsequent request.
For very constrained devices the major issue with using deflate is the 32K LZ window. The actual decompression is very light weight and doesn't require much code. When sending data, using the no compression mode of deflate effectively disables compression with almost no code. Using fixed huffman codes is almost as easy.
Their most significant proposal is to re-work the flow control:
Mike Belshe writes an example in his blog that stalling recipient of data as a situation where SPDY flow control is helpful, to avoid buffering potentially unbound amount of data. While the concerns is valid, flow control looks like overkill to something where a per-channel pause control frame could do the same job with less implementation and protocol overhead.
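The dictionary effect mentioned in the first of the quotes above is easy to poke at with Python's zlib module, which accepts a preset dictionary through the zdict argument. This is just a rough sketch under my own assumptions: the dictionary and the request below are made up for illustration, and are not the real SPDY dictionary or header format.

```python
import zlib

# Hypothetical preset dictionary of strings we expect to see in request headers.
DICTIONARY = (b"GET POST HTTP/1.1 Host: User-Agent: Accept: text/html "
              b"Accept-Encoding: gzip, deflate")

# A made-up request, standing in for the header block of a first request.
REQUEST = (b"GET /index.html HTTP/1.1\r\n"
           b"Host: www.example.com\r\n"
           b"User-Agent: ExampleBrowser/1.0\r\n"
           b"Accept: text/html\r\n"
           b"Accept-Encoding: gzip, deflate\r\n\r\n")


def compress(data: bytes, zdict: bytes = b"") -> bytes:
    """Deflate `data` in one shot, optionally priming the window with `zdict`."""
    comp = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return comp.compress(data) + comp.flush()


def decompress(blob: bytes, zdict: bytes = b"") -> bytes:
    """Inflate `blob`; the same dictionary must be supplied if one was used."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


plain = compress(REQUEST)
primed = compress(REQUEST, DICTIONARY)
print(len(REQUEST), len(plain), len(primed))
```

On a long-lived compression stream, the first request itself ends up in the window, so later requests can refer back to it whether or not a dictionary was preloaded; that is the reason the quote gives for the benefit fading on subsequent requests.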
Hubert Lubaczewski pens this detailed explanation of the "upsert" problem and its potential solutions: Why is Upsert so Complicated?
Of course the chances for such case are very low. And the timing would have to be perfect. But it is technically possible, and if it is technically possible, it should be at least mentioned, and at best – solved.
This is, of course, another case of race condition. And this is exactly the reason why docs version of the upsert function has a loop.
If you’ll excuse me – I will skip showing the error happening – as it requires either changing the code by adding artificial slowdowns, or a lot of luck, or a lot of time. But I hope you understand why the DELETEs can cause problems. And why loop is needed to solve the problem.
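The shape of the retry loop that Lubaczewski is talking about can be sketched like this. To keep it self-contained I'm using Python's built-in sqlite3 module with an in-memory database as a stand-in for PostgreSQL, and the table name, function name, and attempt limit are all my own assumptions. A single connection can't actually reproduce the race, so this only shows the control flow.

```python
import sqlite3


def upsert(conn: sqlite3.Connection, key: str, value: int, max_attempts: int = 5) -> None:
    """Update the row if it exists, otherwise insert it, retrying if we lose a race."""
    for _ in range(max_attempts):
        cur = conn.execute("UPDATE counters SET value = ? WHERE key = ?", (value, key))
        if cur.rowcount > 0:
            conn.commit()
            return
        try:
            conn.execute("INSERT INTO counters (key, value) VALUES (?, ?)", (key, value))
            conn.commit()
            return
        except sqlite3.IntegrityError:
            # Another writer inserted the key between our UPDATE and our INSERT.
            # Go around again so that the UPDATE can find the row (or, if it has
            # been deleted in the meantime, so that the INSERT can succeed).
            conn.rollback()
    raise RuntimeError("upsert did not converge after %d attempts" % max_attempts)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (key TEXT PRIMARY KEY, value INTEGER)")
upsert(conn, "hits", 1)  # no row yet, so this takes the INSERT path
upsert(conn, "hits", 2)  # the row exists now, so this takes the UPDATE path
```

Without the loop, a concurrent DELETE or INSERT landing between the two statements leaves the caller with either an error or a missing row, which is the case the article says should at least be mentioned.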
On the Chromium blog, don't miss A Tale of Two Pwnies (Part 1) and A Tale Of Two Pwnies (Part 2), in which the Chrome security team walk through the intricate details of how a modern browser vulnerability may require the complex interactions of multiple independent browser weaknesses:
The exploit was still far from done. It was now running JavaScript inside an iframe inside a process with limited WebUI permissions. It then popped up an about:blank window and abused another bug -- this time in the JavaScript bindings -- to confuse the top-level chrome://net-internals page into believing that the new blank window was a direct child. The blank window could then navigate its new “parent” without losing privileges
Lastly, again on the topic of Chrome, Ilya Grigorik offers this wonderful description of Chrome's Predictor architecture for overlapping networking operations such as DNS lookup and TCP/IP connection establishment with other work, to dramatically enhance perceived browser speed: Chrome Networking: DNS Prefetch & TCP Preconnect
If it does its job right, then it can speculatively pre-resolve the hostnames (DNS prefetching), as well as open the connections (TCP preconnect) ahead of time.
To do so, the Predictor needs to optimize against a large number of constraints: speculative prefetching and preconnect should not impact the current loading performance, being too aggressive may fetch unnecessary resources, and we must also guard against overloading the actual network. To manage this process, the predictor relies on historical browsing data, heuristics, and many other hints from the browser to anticipate the requests.

Wonderful, wonderful, wonderful all around. The world of software is so complex and beautiful these days! Enjoy!

Monday, June 11, 2012

Unbroken: a very short review

Laura Hillenbrand's Seabiscuit: Three Men and a Racehorse was a massive worldwide hit, and became a movie hit as well.

It took her seven years to write her next book: Unbroken: A World War II Story of Survival, Resilience, and Redemption. As she describes the experience:

When I finished writing my first book, Seabiscuit: An American Legend, I felt certain that I would never again find a subject that fascinated me as did the Depression-era racehorse and the team of men who campaigned him. When I had my first conversation with the infectiously effervescent and apparently immortal Louie Zamperini, I changed my mind.
That conversation began my seven-year journal through Louie's unlikely life.

I didn't read Hillenbrand's first book, but now I think I might.

Unbroken is definitely a fascinating story. Zamperini, who at age 95 is one of the last living members of the group that Tom Brokaw so perfectly dubbed "our greatest generation", has one of the most astonishing life stories that you will ever hear.

The majority of the book is concerned with Zamperini's wartime survival story. There were, of course, many, many such stories, but Zamperini's story is more dramatic than most, and Hillenbrand tells it well.

To be sure, this is not an easy book to read. The tales of horror in the South Pacific are dreadful, and there were several chapters of the book that I could barely bring myself to read. Hillenbrand pulls no punches, laying it all out there for you.

As Hillenbrand tells it, Zamperini's story is also the story of a nation, and of the world; there are many others who pass through his life, both friend and enemy, and the events that he was part of changed the entire history of the world. Zamperini was of course just one man, but Hillenbrand makes sure, at appropriate points, to tie Zamperini's personal experiences to the broader picture of what was happening in the world. It's definitely a unique and vivid way to experience the horror and tragedy of World War II.

I'm glad I read the book, though I'm not sure it's for everyone. But if you think it might be for you, give Unbroken a try!

Friday, June 8, 2012

Your Friday afternoon reading list

Here you go, just what you were waiting for :)

A nice explanation of why it was rather challenging to build index-only scans into the Postgres MVCC engine: http://michael.otacoo.com/postgresql-2/postgresql-9-2-highlight-index-only-scans/
The commit message talks about “visibility map”, which is a feature implemented since PostgreSQL 8.4, which allows to keep tracking of which pages contains only tuples that are visible to all the transactions (no data modified since latest vacuum cleanup for example). What this commit simply does is to check if the page that needs to be consulted is older than the transaction running.
A simple introduction to Postgres's query timing features: http://momjian.us/main/blogs/pgblog/2012.html#June_8_2012
Each data manipulation language (dml) command (select, insert, update, delete) goes through three stages:
1. parser
2. planner
3. executor
You can actually time how long each stage takes.
A super-awesome discussion of the intricacies of Postgres's 9.2 Group Commit algorithm, and a change that is in the hopper for Postgres 9.3: http://pgeoghegan.blogspot.com/2012/06/towards-14000-write-transactions-on-my.html
Oftentimes, they will find that this has happened, and will be able to simply fastpath out of the function that ensures that WAL is flushed (a call to that function is required to honour transactional semantics). In fact, it is expected that only a small minority of backends (one at a time, dubbed “the leader”) will actually ever go through with flushing WAL.
A great essay on why your storage system really needs to be sensitive to whether it uses SSD hardware underneath: http://blog.empathybox.com/post/24415262152/ssds-and-distributed-data-systems
Traditional B+Trees or hashes are no longer the most appropriate persistent data structure. This is not due to the drop in latency but due the the write endurance problem. Moving a database with a traditional storage engine to commodity SSDs will likely be quite fast but the SSDs may stop working after a few months!
One of the best sharding presentations I've seen yet, from the Tumblr team: https://github.com/tumblr/jetpants/blob/master/doc/VelocityEurope2011Presentation.pdf?raw=true
Sharding is the implementation of horizontal partitioning outside of MySQL (at the application level or service level). Each partition is a separate table. They may be located in different database schemas and/or different instances of MySQL.
Also from the Tumblr gang, a nifty short note about using parallel gzip and netcat to get the fastest possible transfer of immense data sets between nodes: http://engineering.tumblr.com/post/7658008285/efficiently-copying-files-to-multiple-destinations
By adding tee and a FIFO to the mix, you can create a fast copy chain: each node in the chain saves the data locally while simultaneously sending it to the next server in the chain.
In the most recent issue of IEEE Computer, Prof. Eric Brewer revisits his famous CAP theorem, 12 years later: http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
The challenging case for designers is to mitigate a partition’s effects on consistency and availability. The key idea is to manage partitions very explicitly, including not only detection, but also a specific recovery process and a plan for all of the invariants that might be violated during a partition.
Twitter have open-sourced their distributed trace infrastructure, designed for real-time tracing of service-oriented architectures under actual load: http://engineering.twitter.com/2012/06/distributed-systems-tracing-with-zipkin.html
Zipkin started out as a project during our first Hack Week. During that week we implemented a basic version of the Google Dapper paper for Thrift. Today it has grown to include support for tracing Http, Thrift, Memcache, SQL and Redis requests.
Here's Google Dapper, by the way, if you haven't seen it before: http://research.google.com/pubs/pub36356.html
A great checklist if you find yourself considering the problem of, say, loading 500 million rows into your database, all at once: http://derwiki.tumblr.com/post/24490758395/loading-half-a-billion-rows-into-mysql
Use LOAD DATA INFILE
This is the most optimized path toward bulk loading structured data into MySQL. 8.2.2.1. Speed of INSERT Statements predicts a ~20x speedup over a bulk INSERT (i.e. an INSERT with thousands of rows in a single statement).
LOAD DATA INFILE is the MySQL equivalent of Postgres's COPY FROM, I believe.
A somewhat-marketing-slanted post about why you shouldn't expect any shared-disk database system to ever perform very well: http://database-scalability.blogspot.com/
With a shared disk - there is a single shared copy of my big data on the shared disk, the database engine still have to maintain "buffer management, locking, thread locks/semaphores, and recovery tasks". But now - with a twist! Now all of the above need to be done "globally" between all participating servers, thru network adapters and cables, introducing latency. Every database server in the cluster needs to update all other nodes for every "buffer management, locking, thread locks/semaphores, and recovery tasks" it is doing on a block of data.
A well-written plea to never poll, and never spin, and always use a real lock manager or real synchronization system instead: http://randomascii.wordpress.com/2012/06/05/in-praise-of-idleness/
The comment block says “Note: these locks are for use when you aren’t likely to contend on the critical section”. Famous last words. Caveat lockor. Sorry Facebook, too risky, not recommended.
A fascinating and detailed article about trying to use the Linux "perf" toolkit to profile and measure complicated system software (in this case, the Postgres engine): http://rhaas.blogspot.com/2012/06/perf-good-bad-ugly.html
perf seems to be the best of the Linux profiling tools that are currently available by a considerable margin, and I think that they've made good decisions about the functionality and user interface. The lack of proper documentation is probably the tool's biggest weakness right now, but hopefully that is something that will be addressed over time.
An intriguing introduction to some of the searching issues that come up in genome sequencing, which, at its core, bears a lot of resemblance to the problem of computing diffs between two versions of an object. http://williamedwardscoder.tumblr.com/post/24071805525/searching-for-substrings-in-a-massive-string
How might you search a really big string - a few billion symbols - for the best alignment of a large number of short substrings - perhaps just a few hundred symbols long or less - with some allowed edit-distance fuzzy-matching?
Includes some interesting pointers to the Bowtie project, which I had not previously encountered: http://bowtie-bio.sourceforge.net/index.shtml
And, lastly, since no list should end with 13 items, from the University of Ulm, an entire diploma thesis entitled "Concurrent Programming for Scalable Web Architectures": http://berb.github.com/diploma-thesis/original/index.html
Chapter 6 is, naturally, of particular interest to me, being a storage systems kinda guy:
In this chapter, we consider the impact of concurrency and scalability to storage backends. We illustrate the challenge of guaranteeing consistency in distributed database systems and point to different consistency models. We then outline some internal concepts of distributed database systems and describe how they handle replication and partitioning.
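
To make the substring-search item above a bit more concrete, here is a minimal sketch in Python of the textbook dynamic-programming approach to fuzzy matching (often credited to Sellers). It is my own illustration, not code from the linked article or from Bowtie, and the function name best_fuzzy_match is made up for this example. It is the ordinary edit-distance table, the same one that underlies many diff tools, with one twist: a match is allowed to start anywhere in the big string at no cost.

```python
def best_fuzzy_match(text: str, pattern: str) -> tuple[int, int]:
    """Find the best approximate occurrence of pattern inside text.

    Returns (edit_distance, end_index), where end_index is the position
    in text just past the last character of the best match.
    """
    m = len(pattern)
    # prev[i] = fewest edits needed to align pattern[:i] with some
    # substring of text ending at the previous text position.
    # Before any text is consumed, pattern[:i] costs i edits.
    prev = list(range(m + 1))
    best = (prev[m], 0)
    for j, ch in enumerate(text, start=1):
        # cur[0] stays 0: a match may begin at any text position for free.
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(
                prev[i - 1] + cost,  # match or substitute
                prev[i] + 1,         # extra character in text
                cur[i - 1] + 1,      # pattern character missing from text
            )
        if cur[m] < best[0]:
            best = (cur[m], j)
        prev = cur
    return best


print(best_fuzzy_match("xxhelloxx", "hello"))  # (0, 7)
```

This takes time proportional to len(text) * len(pattern), which is hopeless for a few billion symbols and a large number of short reads; that gap is exactly why tools like Bowtie build an index over the big string first.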

Enjoy!

The extra cookie

The wonderful writer Michael Lewis gave the commencement speech at Princeton this year.

All of you have been faced with the extra cookie. All of you will be faced with many more of them. In time you will find it easy to assume that you deserve the extra cookie. For all I know, you may. But you'll be happier, and the world will be better off, if you at least pretend that you don't.

Well put.

Thursday, June 7, 2012

Whales and Martingales

Here's an interesting and easy-to-read article by Jonathan Adler about whales and martingales and other interesting behaviors of the high rollers in our modern world.

As Adler observes, though, what Iksil was doing was more like poker than it was like Geismar's blackjack activity:

Unlike at a blackjack table where the dealer has a fixed set of actions she has to follow, on Wall Street there are other investors looking to exploit other people’s mistakes. Once other investors saw that the Whale left a chance for his investment to go sour, they were able to take actions to exploit this, and caused the event that seemed unlikely to come to pass.

Does it seem that watching-the-ultra-rich is displacing watching-the-movie-stars as the paparazzi's preferred choice nowadays?

Wednesday, June 6, 2012

Wrapping up the Anand-Gelfand match

While you wait, patiently, for Friday's launch of the Euro 2012 tournament, take a few minutes to reflect on the results of the Anand-Gelfand World Championship.

Ken Regan's observation is that, while most people who aren't familiar with the issues think that computer chess is boring, the actual reality turns out to be quite interesting: when computers play computers, there are relatively few draws, while when humans (at the highest level) play humans, there are many more draws.

Regan's own computer-assisted analysis concludes that, rather than be disappointed that so many games were draws, we should be astonished at the high level of play that was demonstrated:

According to my statistical model of player move choice mentioned here, this match had the highest standard in chess history. Based on computer analysis of the twelve regulation games, my model computes an “Intrinsic Performance Rating” (IPR) for Anand of 3002, and 2920 for Gelfand. Each is about 200 points higher than their current Elo ratings of 2791 and 2727, respectively. My analysis eliminates moves 1–8, moves in repeating sequences, and moves where one side is judged to have a clearly winning advantage, the equivalent of being over three pawns ahead.

Hartosh Singh Bal wonders whether the time has come to merge computers and humans, and allow the direct use of computers during matches such as these:

So far, experiments with advanced chess suggest that the powers of man and machine combined don’t just make for a stronger game than a man’s alone; they also seem to make for a stronger game than a machine’s alone. Allowing chess players the assistance of the best computer chess engine available during top tournaments would ensure that the contests really do showcase the very best chess being played on earth.

Dylan McClain surveys the controversy over the format changes at the conclusion of the match, noting that, at this level, chess has been plagued with this problem for many years:

One notorious method gave the title to the first player to win a set number of games. But problems became apparent during the first championship showdown between Garry Kasparov and Anatoly Karpov, when the first player to take six games would be named champion.
The match started in September 1984, but after five months and 48 games there was still no champion.

Michael Aigner points out that the match was extremely close, and that few people seem to be paying enough attention to how much better Gelfand played than he had been expected to play (at least, according to "the ratings").

All told, the combatants played 16 total games, but only three proved decisive to determine a champion.

Dana MacKenzie notes that Anand's final victory, the second game of the four-game rapid series, was scintillating:

At one point Shipov said that Anand had a position you wouldn’t wish on your best friend. But in spite of all that, he did have an extra pawn, and he managed somehow to get to a bare-bones endgame with rook, knight and pawn against Gelfand’s rook and bishop. I’m sure that the analysts will tell us that it should have been a draw, but with Gelfand’s flag hanging Anand handled his knight like a virtuoso, like the wand of a conductor leading an orchestra, and he finally got to a winning rook and pawn endgame. What a masterful performance!

And, looking more closely at the actual chess that was played, Vinay Bhat observes that one exciting outcome will be the return of Anand to major tournament play after being isolated for nearly a year:

Anand plays again later this month in Bazna. I’m curious to see what he’ll show there – will all his opponents purposefully play the Grunfeld and Sveshnikov? Will he play with a new fire after having won an event for the first time in a while? Will the criticism fuel him?

For my part, I'm wondering what will happen with the World Championship process; this year's matches were terribly controversial, with Carlsen not even taking part, and a highly disputed process for determining the challenger. Will there be reforms in the process, so that the best possible match can be held, the next time around? I hope so.

Tuesday, June 5, 2012

72 hours until Euro 2012 begins!

And if you have no idea what I'm talking about, go visit Zonal Marking's wonderful site and start digging through their great set of preview articles.

It's no fun if you don't make a prediction, so I'll predict that this year's victory will go to the Russian team.

I'm still trying to figure out exactly when the matches are being played, and which ones will be televised here in the States, and so forth.

Are you planning to watch any matches? Have any favorites to recommend? Let me know!

Signed and unsigned integers in the C programming language

Nice quiz and accompanying article on John Regehr's blog about the details and intricacies of signed and unsigned integers in the C programming language.

This topic has been with us ever since I started programming in the early 80's; I suspect it will be one of the details that you have to commit to memory (and frequently re-study) for many years to come.
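
As a reminder of the kind of thing that trips people up, here is a small sketch, written in Python rather than C, that mimics one of the C rules: when a signed int meets an unsigned int in a comparison, the signed operand is first converted to unsigned, which means reducing it modulo 2 to the power of the type's width. The 32-bit width and both function names are my own assumptions for illustration; they are not taken from Regehr's quiz.

```python
UINT_BITS = 32  # assumption: a platform where unsigned int is 32 bits wide


def to_unsigned(value: int, bits: int = UINT_BITS) -> int:
    """Mimic C's conversion to an unsigned type: reduce modulo 2**bits."""
    return value % (1 << bits)


def c_less_than_mixed(signed_value: int, unsigned_value: int) -> bool:
    """Mimic `signed_value < unsigned_value` in C for int and unsigned int.

    The signed operand is converted to unsigned before the comparison.
    """
    return to_unsigned(signed_value) < to_unsigned(unsigned_value)


print(to_unsigned(-1))           # 4294967295
print(c_less_than_mixed(-1, 1))  # False
```

So, under those assumptions, the C expression -1 < 1u comes out false, because -1 has quietly become the largest unsigned value before the comparison happens.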

Monday, June 4, 2012

Facebook Folly

Facebook's new Folly library looks interesting, but neither their announcement nor the README seems to mention the licensing terms.

Am I just overlooking it?

If you don't include any license information, is there some sort of default license?

Sunday, June 3, 2012

Bryan is 50!

That is, Bryan the Nord of Skyrim has now advanced to Level 50.

Bryan is also: Thane of Whiterun, Leader of the Companions, Thane of Windhelm, Archmage of the College of Winterhold, Thane of Markarth, Leader of the Thieves Guild, Thane of Solitude, Member of the Blades, and Member of the Bard's College.

And probably many other titles that I've lost track of.

And there's still probably 30% of the world of Skyrim that I haven't visited.

This game is big.

Back to Blackreach...

Friday, June 1, 2012

Trying to digest Oracle v. Google

So, in the end, what are we to make of Oracle v Google?

As Judge Alsup pointed out, "This action was the first of the so-called “smartphone war” cases tried to a jury". It was a major case, litigated by top counsel on both sides, heard by a judge and jury in the country's most tech-savvy area. It clearly means something. But what?

There is no shortage of opinions about what it might mean.

First, we should note that the two parties to the case have their own opinions about what the ruling represents:

Oracle distributed a widely-quoted statement that stated:
Oracle is committed to the protection of Java as both a valuable development platform and a valuable intellectual property asset. It will vigorously pursue an appeal of this decision in order to maintain that protection and to continue to support the broader Java community of over 9 million developers and countless law abiding enterprises. Google's implementation of the accused APIs is not a free pass, since a license has always been required for an implementation of the Java Specification. And the court's reliance on "interoperability" ignores the undisputed fact that Google deliberately eliminated interoperability between Android and all other Java platforms. Google's implementation intentionally fragmented Java and broke the "write once, run anywhere" promise. This ruling, if permitted to stand, would undermine the protection for innovation and invention in the United States and make it far more difficult to defend intellectual property rights against companies anywhere in the world that simply takes them as their own.
For their part, Google simply said:
The court's decision upholds the principle that open and interoperable computer languages form an essential basis for software development. It's a good day for collaboration and innovation.

Perhaps oddly, I think that both Oracle and Google are right about this. Google's implementation did fragment Java and break its core promise. But I am no fan of using the courts to limit what we can do with our computers, either, so I agree that it is indeed a good day for innovation.

Meanwhile, many other interested observers have weighed in with their own views:

Wired's Caleb Garling offers the assessment that: Judge Frees Google’s Android From Oracle Copyrights
Alsup said that in cloning the 37 Java APIs, Google wrote 97 percent of the code from scratch and that the remaining three percent was lifted in accordance with the law. He also said that out of the 166 Java software packages controlled by Oracle, 129 were in no way infringed upon by Google. Oracle cannot legally claim, he argued, that it owns all possible implementations and pieces of the command structures of all 166 APIs.
Alsup added, however, that his order does not mean that the Java API packages are free for all to use without license or that the structure, sequence, and organization of all computer programs may be “stolen.” Google, he said, had simply acted appropriately under the U.S. Copyright Act.
CNet's Rachel King feels that the ruling is narrow, not broad:
it's a narrow ruling that only covers the APIs at question in the copyright phase of this case.
The folks at Groklaw, however, see it as a fairly broad ruling: Judge Alsup Rules: Oracle's Java APIs are Not Copyrightable, noting that the key paragraph in the ruling states that:
Contrary to Oracle, copyright law does not confer ownership over any and all ways to implement a function or specification, no matter how creative the copyrighted implementation or specification may be. The Act confers ownership only over the specific way in which the author wrote out his version. Others are free to write their own implementation to accomplish the identical function, for, importantly, ideas, concepts and functions cannot be monopolized by copyright.
Florian Mueller, who has devoted an enormous number of hours to studying the case, cautions that there may still be much more to come:
Judge Alsup's decision is unprecedented in the sense that no comparable amount of software code (400 class definitions including many thousands of methods and other definitions) has previously been held uncopyrightable despite being deemed to satisfy the originality requirement. Both sides of the argument had reasons to voice and defend their respective positions -- that's why this had to go to trial, and that's why it will go on appeal.
In a follow-on posting, Mueller recalls a fascinating observation from one of the members of the Google legal team:
More than seven years ago, one of the intellectual property lawyers on Google's defense team against Oracle, Greenberg Traurig's Heather Meeker, wrote an opinion piece for Linux Insider in which she argued that copyright protection of software is "tricky" because copyright focuses on expression while the value of software is in function, the very thing that copyright law wasn't designed to protect. With the exception of "wholesale copying" of entire products (which is what most software copyright cases are about) by "shameless counterfeiters", Mrs. Meeker says that "actually identifying software copyright infringement is like reading tea leaves" because "people using software to develop products rarely copy software without modification". She goes on to say:
"The serious copyright battles are over the copying of bits and pieces, structures, design elements and so forth -- and applying copyright law to those cases is difficult, expensive and unpredictable."
The Heather Meeker piece, although written nearly a decade ago, is certainly worth re-reading in this context:
in the 1990s, when software patents exploded in the U.S., the PTO lacked sufficient institutional knowledge of prior art to weed out the obvious claims. This institutional knowledge exists mostly in the form of previously issued patents, so a sudden and dramatic increase in filings causes a prior art vacuum. This has been borne out empirically, in the sense that many U.S. software patents -- after being issued by the PTO -- are later invalidated due to obviousness based on prior art, in lawsuits where the patent holders try to assert them against infringers.
The Electronic Frontier Foundation, who I think are relatively clear thinkers about most topics of policy in the high tech world, came out strongly opposed to the notion of copyrighting an API:
Treating APIs as copyrightable would have a profound negative impact on interoperability, and, therefore, innovation. APIs are ubiquitous and fundamental to all kinds of program development. It is safe to say that all software developers use APIs to make their software work with other software. For example, the developers of an application like Firefox use APIs to make their application work with various OSes by asking the OS to do things like make network connections, open files, and display windows on the screen. Allowing a party to assert control over APIs means that a party can determine who can make compatible and interoperable software, an idea that is anathema to those who create the software we rely on everyday.

So, after all that, where are we? Is it just that, in the end, there was "no steak, only parsley"?

Or is it more that, as Simon Phipps noted in his column at InfoWorld, we simply have a situation where deep pockets found the best justice money can buy:

While a company with the resources of Google can attempt to challenge each patent in turn at the Patent Office and get it invalidated, most smaller companies simply have to cut their losses and settle with the legally sanctioned extortioner.

I find myself rather ambivalent and certainly not much smarter than I was, two years ago, when this whole mess started.

On the one hand, I think that the whole field of Intellectual Property law, at least in the area of computer software, is a complete disaster. I think that Mike Loukides describes it well:

As I've frequently said, I invented linked lists and hash tables when I was in high school, but fortunately for the industry, software patents didn't exist back then. I don't claim to be unique; I'm sure many clever 17-year-old programmers have invented hash tables. Just about everything in this industry has been invented many times, by almost everyone who is vaguely clueful. Once you understand the way computing works, you fairly quickly understand why software patents should be extremely rare, if they exist at all. There's prior art everywhere, and almost every invention is "obvious" when looked at from the right perspective.

On the other hand, I'm furious with Google for splitting and splintering Java. It used to mean something to state that you had implemented Java. The Sun Microsystems team worked hard, for decades, to ensure that Java was Java was Java, and that in order to provide Java, you had to provide all of Java, and only Java. Even though it was ugly to see Sun suing Microsoft, I understood why they did it, and I think that, in the end, that suit made Java more valuable to everyone.

But Google's Android did the wrong thing by providing something that is like Java, but isn't Java. There are many completely valid Java programs, important pieces of software, that don't run on Android, and I think it was wrong of Google to twist Java in this fashion.

So, there I am: the court case is over, the judge and jury have rendered their decisions, and life goes on.

My own personal opinion is that in the end it is mattering less and less. Java's time has gone. Oracle are taking Java in a very different direction from where Sun took it. Java is being turned into the giant company enterprise back office language, where it will be used extensively in corporate application development, by giant companies like Oracle and IBM, but Java as a language for small teams, for hobbyists, for students, is no more.

Which is too bad.

Update: My father, reading about the same issues, had a tl;dr moment, and offers this nice summation, which I think captures the entire lesson elegantly and briefly:

It turns out patent and copyright are really all about power, and not about ideas.

Who says you can't learn anything useful on the Internet?

Bottle Cap Blues.

Sanger on Stuxnet in the NYT

The New York Times is running quite the bombshell article today: Obama Order Sped Up Wave of Cyberattacks Against Iran

Mr. Obama decided to accelerate the attacks — begun in the Bush administration and code-named Olympic Games — even after an element of the program accidentally became public in the summer of 2010 because of a programming error that allowed it to escape Iran’s Natanz plant and sent it around the world on the Internet. Computer security experts who began studying the worm, which had been developed by the United States and Israel, gave it a name: Stuxnet.

The article itself is massive, and apparently draws from Sanger's book on the subject: Confront and Conceal, scheduled to be published next Tuesday.

As with so much of modern "national security" journalism, Sanger presents a huge amount of information, with almost no concrete sources for any of it, essentially asking his readers to trust him on the accuracy:

This account of the American and Israeli effort to undermine the Iranian nuclear program is based on interviews over the past 18 months with current and former American, European and Israeli officials involved in the program, as well as a range of outside experts. None would allow their names to be used because the effort remains highly classified, and parts of it continue to this day.

It's explosive stuff, and fascinating. Keep your eyes on this story; I suspect it will be developing rapidly.