At my day job, our product is used across an amazing range of problem sizes:
- At the low end, many hundreds of thousands of individual engineers quite happily use their own personal server to manage their individual digital work, in complete isolation.
- At the high end, a single server can manage immense quantities of data under simultaneous use from tens of thousands of users.
Recently, I had the opportunity to investigate a confusing behavior of the server, which only presented its symptoms under certain circumstances. Unfortunately, these circumstances weren't just "run these 3 steps", or "look at this one-page display of data", but rather, "it happens on our production server".
Happily, our user was able to arrange to share their dataset with us, so we embarked on an effort to connect me, the developer, with the symptoms of interest:
- Our user was able to collect the data and compress it to a small enough size (33 GB) that they could transfer it 9,000 miles across the planet using their corporate network.
- A colleague in a more accessible timezone was then able to find a spare IDE hard drive, load the data onto it, and send it to us (remember, as Andy Tanenbaum says, "Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway").
When the dataset arrived, I was rather astonished to discover that a 33 GB compressed file expanded to almost 1 TB in size! This was highly textual data with a fair amount of redundancy, so I had been expecting a high degree of compression, on the order of 12x. Working from that assumption, I had confidently selected a 550 GB filesystem and issued the uncompress command there.
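The back-of-the-envelope arithmetic here is simple enough to sketch. The helper below is purely illustrative (the function name and the 20% headroom factor are my own assumptions, not anything from a real tool): at the guessed 12x ratio a 550 GB filesystem looks comfortable, but at the roughly 30x ratio the data actually achieved, you need well over a terabyte.

```python
def required_space_gb(compressed_gb, ratio, headroom=1.2):
    """Estimate the filesystem space needed to hold uncompressed data.

    compressed_gb: size of the compressed archive in GB
    ratio:         expected compression ratio (uncompressed / compressed)
    headroom:      safety margin so the filesystem isn't filled to the brim
    """
    return compressed_gb * ratio * headroom

# At the guessed 12x ratio: roughly 475 GB, so 550 GB seems safe.
print(required_space_gb(33, 12))

# At the actual ~30x ratio: roughly 1.2 TB, well past the 550 GB filesystem.
print(required_space_gb(33, 30))
```

The moral is to size for the observed ratio of a sample, not the ratio you hope for; highly redundant text can compress far better than the usual rule of thumb.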
Imagine my surprise when, 2 hours later, the filesystem filled up and the uncompress aborted!
After a quick conference call, we realized that we needed something much closer to a terabyte in available space.
Happily, hardware vendors are working hard to keep up with this demand, and our benchmarking team happened to have a machine with a fresh new 4 TB hard drive, so I slipped a few promises their way and the deal was done :) A few hours later, the uncompress was once again running, and a mere 10 hours (!!) later I had my terabyte dataset to play with.
Yes, indeed, the age of big data is upon us. I know I'm still small-potatoes in this world of really big data (check the link out, those guys are amazing!), but I've crossed the 1 TB threshold, so, for me, I think that counts as starting to play with the big boys.
Plus, it's the first problem report I've encountered that I couldn't just drop on my awesome Mac Pro :)