Thursday, August 11, 2011

Big data, Bryan style

I had my first encounter with a terabyte dataset today.

At my day job, our product is used across an amazing range of problem sizes:

  • At the low end, many hundreds of thousands of individual engineers quite happily use their own personal server to manage their individual digital work, in complete isolation.

  • At the high end, a single server can manage immense quantities of data under simultaneous use from tens of thousands of users.



Recently, I had the opportunity to investigate a confusing behavior of the server, which only presented its symptoms under certain circumstances. Unfortunately, these circumstances weren't just "run these 3 steps", or "look at this one-page display of data", but rather, "it happens on our production server".

Happily, our user was able to arrange to share their dataset with us, so we embarked on an effort to connect me, the developer, with the symptoms of interest.


When the dataset arrived, I was rather astonished to discover that a 33 GB compressed file expanded to almost 1 TB in size! This was highly textual data with a fair amount of redundancy, so I had been expecting a high degree of compression. But my expectation was something around 12x, so I had confidently selected a 550 GB filesystem and issued the uncompress command there.

Imagine my surprise when, after 2 hours, the filesystem filled up and the uncompress was aborted!

After a quick conference call, we realized that we needed something much closer to a terabyte in available space.
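The arithmetic behind that misstep can be sketched in a few lines of Python. This is only an illustration, not anything our product or tooling does: the helper names, the 20% headroom factor, and treating "almost 1 TB" as 1000 GB are all assumptions made for the example.

```python
import shutil

GB = 10**9  # decimal gigabytes, an assumption for this sketch


def estimated_expanded_bytes(compressed_bytes, ratio):
    """Estimate the uncompressed size from a guessed compression ratio."""
    return compressed_bytes * ratio


def implied_ratio(expanded_bytes, compressed_bytes):
    """Work out the ratio the data actually achieved."""
    return expanded_bytes / compressed_bytes


def fits(free_bytes, needed_bytes, headroom=1.2):
    """True if the free space covers the estimate plus some headroom.

    The 1.2 headroom factor is a hypothetical safety margin.
    """
    return free_bytes >= needed_bytes * headroom


def fits_on_filesystem(path, needed_bytes, headroom=1.2):
    """Same check, but against the real free space of the filesystem at path."""
    return fits(shutil.disk_usage(path).free, needed_bytes, headroom)


if __name__ == "__main__":
    compressed = 33 * GB
    guess = estimated_expanded_bytes(compressed, 12)  # 396 GB at a 12x guess
    print("12x guess fits in 550 GB:", fits(550 * GB, guess))
    print("1 TB fits in 550 GB:", fits(550 * GB, 1000 * GB))
    print("implied ratio:", round(implied_ratio(1000 * GB, compressed), 1))
    print("room here for 1 TB:", fits_on_filesystem(".", 1000 * GB))
```

With a 12x guess the 550 GB filesystem looks comfortable; it is only the real ratio, roughly 30x, that breaks it. A check like this cannot predict the ratio, but it does make the guess explicit before a multi-hour uncompress starts.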

Happily, hardware vendors are working hard to keep up with this demand, and our benchmarking team happened to have a machine with a fresh new 4 TB hard drive, so I slipped a few promises their way and the deal was done :) A few hours later, the uncompress was once again running, and a mere 10 hours (!!) later I had my terabyte dataset to play with.

Yes, indeed, the age of big data is upon us. I know I'm still small potatoes in this world of really big data (check the link out, those guys are amazing!), but I've crossed the 1 TB threshold, so, for me, I think that counts as starting to play with the big boys.

Plus, it's the first problem report I've encountered that I couldn't just drop on my awesome Mac Pro :)
