Saturday, August 27, 2011

Microsoft does look at the WER data

Microsoft Windows systems provide a feature known as WER, which stands for Windows Error Reporting. This is a collection of technologies that detects crashes on your PC, attempts to gather information about what appears to have caused each crash, and sends that information to a central Microsoft server.

When WER intercepts a crash, it asks you for permission to send that crash data to Microsoft. You may have said yes when this happened; you may have said no. Regardless, you may have wondered where this information went, and what happened to it.

A recent paper presented by three Microsoft researchers discusses some of the things that Microsoft does with the data it gets: Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs.

In this particular case, the researchers were trying to understand what they could learn about the frequency and distribution of hardware problems on commodity machines by analyzing the WER data.

While doing the study, the researchers learned a number of things. For example:

our study has found a number of interesting results. For instance, even small degrees of overclocking significantly degrade machine reliability, and small degrees of underclocking improve reliability over running at rated speed. We also find that faster CPUs tend to become faulty more quickly than slower CPUs, and laptops have lower failure rates than desktops. Beyond our results, this study serves to inform the community that hardware faults on consumer machines are not rare, independent, or always transient.

I found the observed higher reliability of laptops over desktops surprising, and the researchers did, too:

Although one might expect the typically abusive environment of laptops to make them less reliable, Figure 7 shows the opposite. Laptops are between 25% and 60% less likely than desktop machines to crash from a hardware fault over the first 30 days of observed TACT. We hypothesize that the durability features built into laptops (such as motion-robust hard drives) make these machines more robust to failures in general.

Another conclusion is that while disks wear out, memory and CPUs generally don't, or at least nowhere near as quickly. However, once memory fails, it tends to keep failing:

Almost 80% of machines that crashed more than once from a 1-bit DRAM failure had a recurrence at the same physical address.
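
To make that statistic concrete, here is a minimal sketch of how such a recurrence figure could be computed from crash records. This is not the paper's actual method or data format: the `(machine_id, physical_address)` record shape, the `recurrence_fraction` name, and the sample values are all made up for illustration.

```python
from collections import defaultdict


def recurrence_fraction(crashes):
    """Fraction of repeat-crash machines that saw the same address twice.

    `crashes` is an iterable of (machine_id, physical_address) pairs.
    This record shape is hypothetical, not the real WER format.
    """
    addresses_by_machine = defaultdict(list)
    for machine_id, address in crashes:
        addresses_by_machine[machine_id].append(address)

    repeat_machines = 0        # machines that crashed more than once
    same_address_machines = 0  # ...and had at least one address recur
    for addresses in addresses_by_machine.values():
        if len(addresses) < 2:
            continue
        repeat_machines += 1
        if len(set(addresses)) < len(addresses):
            same_address_machines += 1

    if repeat_machines == 0:
        return 0.0
    return same_address_machines / repeat_machines


# Invented sample: pc-a and pc-d each repeat an address, pc-b does not,
# and pc-c crashed only once, so it is excluded from the denominator.
sample = [
    ("pc-a", 0x1F40), ("pc-a", 0x1F40),
    ("pc-b", 0x2000), ("pc-b", 0x3000),
    ("pc-c", 0x4000),
    ("pc-d", 0x5000), ("pc-d", 0x6000), ("pc-d", 0x5000),
]
print(recurrence_fraction(sample))
```

Note that machines with a single crash are left out of the denominator, which matches the quote's wording ("machines that crashed more than once").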

It's an interesting and quite readable paper. It's nice to see that the WER data is actually doing some good, and thanks to Microsoft for sharing the results of their analysis.

1 comment:

  1. -good followup to the Google and CERN papers. Surprised they blame neutrinos for some DRAM errors; I'd have thought they were too weakly interacting. Everyone needs ECC, motherboards that log it reliably, and OSs that push up the events.