Sunday, August 16, 2015

Let's read about things that failed.

There's lots more to learn from failure than from success.

  • A Large-Scale Study of Flash Memory Failures in the Field
    Based on our field analysis of how flash memory errors manifest when running modern workloads on modern SSDs, this paper is the first to make several major observations: (1) SSD failure rates do not increase monotonically with flash chip wear; instead they go through several distinct periods corresponding to how failures emerge and are subsequently detected, (2) the effects of read disturbance errors are not prevalent in the field, (3) sparse logical data layout across an SSD’s physical address space (e.g., non-contiguous data), as measured by the amount of metadata required to track logical address translations stored in an SSD-internal DRAM buffer, can greatly affect SSD failure rate, (4) higher temperatures lead to higher failure rates, but techniques that throttle SSD operation appear to greatly reduce the negative reliability impact of higher temperatures, and (5) data written by the operating system to flash-based SSDs does not always accurately indicate the amount of wear induced on flash cells due to optimizations in the SSD controller and buffering employed in the system software.
  • Postmortem for July 27 outage of the Manta service
    It appears that PostgreSQL blocks new attempts to take a shared lock while an exclusive lock is wanted. (This sounds bad, but it's necessary in order to avoid writer starvation.) However, the exclusive lock was itself blocked on a different shared lock held by the autovacuum operation. In short: the autovacuum itself wasn't blocking all the data path queries, but it was holding a shared lock that conflicted with the exclusive lock wanted by the "DROP TRIGGER" query, and the presence of that "DROP TRIGGER" query blocked others from taking shared locks. This explanation was corroborated by the fact that during the outage, the oldest active query in the database was the "DROP TRIGGER". Everything before that query had acquired the shared lock and completed, while queries after that one blocked behind it.
  • Behind the Scenes of a long EVE Online downtime
    First we ran step 2a on all nodes in parallel. The command completed instantly, and we saw a spike of 125,000 lines (250*500) on the Splunk graph. That might seem like a lot of logging, but it isn't anything the system can't handle, especially in small bursts like this. Next we ran step 2b in the same way. This was where something curious happened. The correct number of log lines did show up in Splunk (The logs do show something!), but the command did not appear to return as immediately as it did for step 2a. In fact it took a few minutes before the console became responsive again, and the returned data indicated that several nodes did not respond in time. Looking over the status of the cluster, those nodes were now showing as dead. Somehow this innocent log line had managed to cause these nodes to time-out and drop out of the cluster.
  • Black Hat USA 2015: The full story of how that Jeep was hacked
    Recently we wrote about the now-famous hack of a Jeep Cherokee. At Black Hat USA 2015, a large security conference, researchers Charlie Miller and Chris Valasek finally explained in detail how exactly that hack happened.
  • Remote Exploitation of an Unaltered Passenger Vehicle
    Automotive security research, for the most part, began in 2010 when researchers from the University of Washington and the University of California San Diego showed that if they could inject messages into the CAN bus of a vehicle (believed to be a 2009 Chevy Malibu) they could make physical changes to the car, such as controlling the display on the speedometer, killing the engine, as well as affecting braking. This research was very interesting but received widespread criticism because people claimed there was not a way for an attacker to inject these types of messages without close physical access to the vehicle, and with that type of access, they could just cut a cable or perform some other physical attack.
  • Not Even Close: The State of Computer Security (with slides) - James Mickens
    Did you hear what I just said? You can use WireShark to analyze the network traffic being exchanged by the light bulbs in the house!
  • Welcome to The Internet of Compromised Things
    It's becoming more and more common to see malware installed not at the server, desktop, laptop, or smartphone level, but at the router level. Routers have become quite capable, powerful little computers in their own right over the last 5 years, and that means they can, unfortunately, be harnessed to work against you.
  • Oracle security chief to customers: Stop checking our code for vulnerabilities [Updated]
    Davidson scolded customers who performed their own security analyses of code, calling it reverse engineering and a violation of Oracle's software licensing. She said, "Even if you want to have reasonable certainty that suppliers take reasonable care in how they build their products—and there is so much more to assurance than running a scanning tool—there are a lot of things a customer can do like, gosh, actually talking to suppliers about their assurance programs or checking certifications for products for which there are Good Housekeeping seals for (or “good code” seals) like Common Criteria certifications or FIPS-140 certifications."
  • Don’t Bug Me
    The claim that Oracle can, on its own, find all the vulnerabilities in its products is nonsense. No tech company in the world is equal to the task of shipping bug-free code. The idea that no one outside of Oracle could have the expertise or ability to find relevant exploitable coding errors in the company’s products is similarly ridiculous: Independent security researchers routinely find important vulnerabilities in commercial products made by companies they don’t work for. And while it is no doubt true that some of the reports Davidson and her team receive are false alarms, the notion that assessing and responding to these concerns is a waste of her time demonstrates a fundamental misunderstanding of the value provided by people who devote their time and energy to finding and reporting software vulnerabilities.
  • Metacritic Matters: How Review Scores Hurt Video Games
    But people find it hard to trust what they don’t understand. And nobody understands how Metascores are computed.

    One of Doyle’s other big policies has also been in the news recently: Metacritic’s refusal to change an outlet’s first review score, no matter what happens. It’s a policy they’ve had for a while now, Doyle told me. He enacted it because during the first few years of Metacritic, which launched in 2001, reviewers kept changing their scores for vague reasons that Doyle believes were caused by publisher pressure.

  • Bad comments are a system failure
    For years this has been discussed in more academic circles as “context collapse.” You have an identity and a set of ideas about the world that exists and is understood in one social context. You want to bring it to another place and not have to do a five minute introduction about who you are and what you value every time you say anything. Other people don’t share the same preset understandings and may read more into what you are saying than you think you put there. Your jokes fall flat, or cause offense. Conversation devolves into side discussions and arguments about first principles and word definitions. People start citing the dictionary and Wikipedia and angrily talking past each other.
  • Man fights off bear near Yosemite National Park, drives himself to hospital
    The attack occurred about 4 a.m. when the man walked onto his porch and was ambushed by the bear in Midpines, a community on the edge of Yosemite National Park.

    The bear was feeding on a bag of trash left 20 feet from the man’s front door, Stoots said. The bear tackled the man and attacked him. But the man fought back, using his legs and arms, and eventually escaped back into his house.

A bag of trash 20 feet from his door? Sheesh. Some of these failures are easier to fix than others...
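
Going back to the Manta postmortem for a moment: the lock pile-up it describes is easier to see in a toy model. What follows is only a sketch under my own assumptions, not PostgreSQL's actual lock manager. The FairLock class and its methods are made up for illustration, and it collapses PostgreSQL's many lock modes into just "shared" and "exclusive", as the excerpt does. The one rule it models is that a new request is granted only if it is compatible with the current holders and nothing is already waiting ahead of it.

```python
from collections import deque


class FairLock:
    """Toy FIFO lock (hypothetical, for illustration only).

    A request is granted immediately only if it is compatible with every
    current holder AND no earlier request is still waiting in the queue.
    """

    def __init__(self):
        self.holders = []        # (name, mode) pairs currently holding the lock
        self.waiters = deque()   # FIFO queue of (name, mode) pairs

    @staticmethod
    def _compatible(mode, holders):
        # Shared is compatible only with other shared holders;
        # exclusive is compatible only with an empty holder list.
        return all(mode == "shared" and held == "shared" for _, held in holders)

    def request(self, name, mode):
        if not self.waiters and self._compatible(mode, self.holders):
            self.holders.append((name, mode))
            return "granted"
        # Queue behind whatever is already waiting, even if this request
        # would be compatible with the current holders.
        self.waiters.append((name, mode))
        return "waiting"

    def release(self, name):
        self.holders = [(n, m) for n, m in self.holders if n != name]
        # Grant from the head of the queue for as long as the head is compatible.
        while self.waiters and self._compatible(self.waiters[0][1], self.holders):
            self.holders.append(self.waiters.popleft())


lock = FairLock()
first = lock.request("autovacuum", "shared")        # nothing held yet
second = lock.request("DROP TRIGGER", "exclusive")  # conflicts with autovacuum
third = lock.request("data path query", "shared")   # stuck behind DROP TRIGGER

lock.release("autovacuum")
holders_after_autovacuum = list(lock.holders)

lock.release("DROP TRIGGER")
holders_after_drop = list(lock.holders)
```

In this model the data path query never conflicts with the autovacuum directly. It waits only because the exclusive request is queued ahead of it, which is the chain of events the postmortem describes.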
