Journal of a Programmer: Data Deduplication

Thursday, July 9, 2009

Data Deduplication

Apparently "Data Deduplication" is a New Hot Thing.

I find "deduplication" a very awkward word, but from what I can tell, it refers to software systems which can automatically detect redundant information and consolidate it.

That is, it's a form of compression.
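To make "detect redundant information and consolidate it" a bit more concrete, here is a minimal sketch in Python. It is my own toy illustration, not a description of how Data Domain or any other product works. The class name `DedupStore`, the fixed-size chunking, and the choice of SHA-256 as the fingerprint are all assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: each unique chunk is kept only once."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # fingerprint -> chunk bytes
        self.files = {}   # file name -> ordered list of fingerprints

    def put(self, name, data):
        """Split data into fixed-size chunks and store only the new ones."""
        fingerprints = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            # A chunk seen before is not stored again; we just reference it.
            self.chunks.setdefault(fp, chunk)
            fingerprints.append(fp)
        self.files[name] = fingerprints

    def get(self, name):
        """Rebuild a file from its list of chunk fingerprints."""
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def stored_bytes(self):
        """Bytes actually held, after duplicates are consolidated."""
        return sum(len(chunk) for chunk in self.chunks.values())


store = DedupStore(chunk_size=4)
store.put("a", b"ABCDABCDABCD")  # three identical 4-byte chunks
store.put("b", b"ABCDXYZ!")      # one chunk shared with "a", one new
```

In this sketch the two files total 20 bytes, but only two distinct chunks are kept, and each file can still be rebuilt from its list of fingerprints. Real systems presumably make very different choices about chunk boundaries and indexing.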

I started paying attention to it when EMC and NetApp engaged in a takeover battle to purchase Data Domain, which is apparently a big player in this field. EMC won the takeover battle this week.

Here's a Wikipedia page about Data Deduplication.

I guess I'm kind of surprised that this technique is more successful than simple compression; NTFS has been able to compress files automatically for well over a decade, I believe. And Microsoft recently added a very sophisticated automatic compression feature to SQL Server 2008; it supports both row compression and page compression.

But apparently this new Data Deduplication technology is attracting a lot of interest; here's a recent USENIX article, from a team at VMware, about extending the technology to clustered file systems. The article points to work done about 10 years ago, in the context of backup and archival storage, under the project named Venti.

Robin Harris at StorageMojo speculates that NetApp might decide to go after Quantum, now that it has lost out on Data Domain. Apparently Quantum was an early leader in Data Deduplication technology. I wonder if my old friend Nick Burke still works at Quantum?

There are many new things under the sun, and now I'll start paying more attention to Data Deduplication (once I train my fingers to start spelling it properly -- ugh what a mouthful!)
