Journal of a Programmer: The Data Deluge

Tuesday, March 16, 2010

The Data Deluge

The February 27th, 2010 issue of the Economist has an interesting special section:

Data, data everywhere: A special report on managing information

The special section contains about ten articles, looking at the subject from various angles. From the first article:

The world contains an unimaginably vast amount of digital information which is getting ever vaster ever more rapidly. This makes it possible to do many things that previously could not be done: spot business trends, prevent diseases, combat crime and so on. Managed well, the data can be used to unlock new sources of economic value, provide fresh insights into science and hold governments to account.

But they are also creating a host of new problems.

The companies and organizations working with this "big data" include the obvious ones: Google, IBM, Microsoft, Oracle, etc. But retailers are doing it too. Here's an example:

Wal-Mart's inventory-management system, called RetailLink, enables suppliers to see the exact number of their products on every shelf of every store at that precise moment. The system shows the rate of sales by the hour, by the day, over the past year, and more. Begun in the 1990s, RetailLink gives suppliers a complete overview of when and how their products are selling, and with what other products in the shopping cart. This lets suppliers manage their stocks better.

The article gives a clever use of such data mining:

In 2004 Wal-Mart peered into its mammoth databases and noticed that before a hurricane struck, there was a run on flashlights and batteries, as might be expected; but also on Pop-Tarts, a sugary American breakfast snack. On reflection it is clear that the snack would be a handy thing to eat in a black-out, but the retailer would not have thought to stock up on it before a storm.
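To make the Pop-Tarts idea concrete, here is a minimal sketch of the kind of comparison that could surface such a pattern: for each product, divide average daily sales on storm-warning days by average daily sales on normal days. The article does not describe how Wal-Mart actually did this. The `SALES` records, the product names, and the `pre_storm_lift` function are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical daily unit sales: (product, units_sold, storm_warning_active).
# These numbers are invented; they are not from the article.
SALES = [
    ("flashlights", 40, False), ("flashlights", 44, False), ("flashlights", 120, True),
    ("pop-tarts", 100, False), ("pop-tarts", 100, False), ("pop-tarts", 700, True),
    ("shampoo", 50, False), ("shampoo", 50, False), ("shampoo", 50, True),
]


def pre_storm_lift(sales):
    """Return {product: mean units on warning days / mean units on normal days}."""
    # For each product, keep [total_units, day_count] for warning and normal days.
    totals = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for product, units, warning in sales:
        bucket = totals[product][warning]
        bucket[0] += units
        bucket[1] += 1

    lift = {}
    for product, by_warning in totals.items():
        storm_units, storm_days = by_warning[True]
        normal_units, normal_days = by_warning[False]
        if storm_days == 0 or normal_days == 0 or normal_units == 0:
            continue  # not enough data to compare the two periods
        lift[product] = (storm_units / storm_days) / (normal_units / normal_days)
    return lift


if __name__ == "__main__":
    ranked = sorted(pre_storm_lift(SALES).items(), key=lambda kv: -kv[1])
    for product, ratio in ranked:
        print(f"{product}: {ratio:.2f}x normal sales")
```

Sorting by the ratio puts the surprising products at the top, which is the point: nobody has to guess in advance that a breakfast snack belongs on the list.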

The articles touch on many of the problems in dealing with big data:

Size/cost/time to process such enormous amounts of data

False conclusions, confusing interpretation issues

Security and privacy concerns

Ownership issues

Fair access to data

Environmental issues

You might be surprised to see "Environmental issues" in the list. The report explains:

Another concern is energy consumption. Processing huge amounts of data takes a lot of power. "In two to three years we will saturate the electric cables running into the building," says Alex Szalay at Johns Hopkins University. "The next challenge is how to do the same things as today, but with ten to 100 times less power."

Both Google and Microsoft have had to put some of their huge data centers next to hydroelectric plants to ensure access to enough energy at a reasonable price.

The articles also talk about many of the fascinating areas of technology being driven or created by these new big-data efforts:

Cloud computing

Query processing

Statistical algorithms, such as
- collaborative filtering
- statistical spelling correction
- statistical translation
- Bayesian spam filtering
- predictive analytics
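For a taste of what one of these looks like in code, here is a rough sketch of Bayesian spam filtering as a tiny naive Bayes classifier with add-one smoothing. It is not taken from the report, and it is nowhere near what a production filter does. The `TRAINING` messages and the function names `train` and `spam_probability` are my own, and the sketch assumes the training data contains at least one spam and one non-spam message.

```python
import math
from collections import Counter

# Hypothetical training messages: (text, is_spam). Invented for illustration.
TRAINING = [
    ("win cash now", True),
    ("cheap pills win big", True),
    ("meeting notes attached", False),
    ("lunch meeting tomorrow", False),
]


def train(messages):
    """Count words per class and messages per class."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, is_spam in messages:
        class_counts[is_spam] += 1
        word_counts[is_spam].update(text.lower().split())
    return word_counts, class_counts


def spam_probability(text, word_counts, class_counts):
    """Estimate P(spam | text), assuming words are independent given the class."""
    vocab = set(word_counts[True]) | set(word_counts[False])
    total_messages = sum(class_counts.values())
    log_score = {}
    for label in (True, False):
        n_words = sum(word_counts[label].values())
        # Start from the log prior: how common this class is overall.
        score = math.log(class_counts[label] / total_messages)
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing keeps a zero count from zeroing the whole product.
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        log_score[label] = score
    # Turn the two log scores into a probability without underflow.
    top = max(log_score.values())
    weights = {label: math.exp(s - top) for label, s in log_score.items()}
    return weights[True] / (weights[True] + weights[False])


if __name__ == "__main__":
    word_counts, class_counts = train(TRAINING)
    for message in ("win cash", "meeting tomorrow"):
        p = spam_probability(message, word_counts, class_counts)
        print(f"{message!r}: P(spam) = {p:.2f}")
```

The more labeled mail the filter sees, the better its word counts get, which is why this technique shows up in a report about having lots of data.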

Storage and networking advancements

Data visualization

Flash trading

It's a long list, full of exciting areas.

As is often true of Economist special reports, the writing is fairly dry, and the presentation is a high-level overview of many related areas, with little in the way of resources for digging into any of them more deeply.

But overall it was intriguing and quite worth reading.
