Sunday, February 21, 2016

The story of CVE-2015-7547

I don't believe there is a Pulitzer Prize for software.

But if there were such a prize, it should go to the teams from Red Hat and Google who worked on CVE-2015-7547.

Let's start the roundup by looking a bit at Dan Kaminsky's essay: A Skeleton Key of Unknown Strength

The glibc DNS bug (CVE-2015-7547) is unusually bad. Even Shellshock and Heartbleed tended to affect things we knew were on the network and knew we had to defend. This affects a universally used library (glibc) at a universally used protocol (DNS). Generic tools that we didn’t even know had network surface (sudo) are thus exposed, as is software written in programming languages designed explicitly to be safe.

Kaminsky goes on to give a high-level summary of how the bug allows attacks:

Somewhat simplified, the attacks depend on:
  • A buffer being filled with about 2048 bytes of data from a DNS response
  • The stub retrying, for whatever reason
  • Two responses ultimately getting stacked into the same buffer, with over 2048 bytes from the wire
The flaw is linked to the fact that the stack has two outstanding requests at the same time – one for IPv4 addresses, and one for IPv6 addresses. Furthermore DNS can operate over both UDP and TCP, with the ability to upgrade from the former to the latter. There is error handling in DNS, but most errors and retries are handled by the caching resolver, not the stub. That means any weird errors just cause the (safer, more properly written) middlebox to handle the complexity, reducing degrees of freedom for hitting glibc.
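
Part of what makes the exposure so broad is how easy the vulnerable path is to reach: any program that resolves a hostname through getaddrinfo() with AF_UNSPEC triggers exactly the parallel A/AAAA lookup Kaminsky describes. Here is a minimal sketch of such an ordinary caller; the hostname is just a placeholder, and any name an attacker's server answers for would do:

```c
/* A minimal sketch of an ordinary caller on the vulnerable path:
 * getaddrinfo() with AF_UNSPEC makes glibc's DNS NSS module issue
 * parallel A and AAAA queries for the name. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

int main(int argc, char **argv)
{
    const char *name = (argc > 1) ? argv[1] : "example.com";
    struct addrinfo hints = { 0 }, *res = NULL;

    hints.ai_family = AF_UNSPEC;      /* ask for both IPv4 and IPv6 */
    hints.ai_socktype = SOCK_STREAM;

    int rc = getaddrinfo(name, NULL, &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return EXIT_FAILURE;
    }
    for (struct addrinfo *ai = res; ai != NULL; ai = ai->ai_next)
        printf("resolved: family=%d\n", ai->ai_family);
    freeaddrinfo(res);
    return 0;
}
```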

An interesting thing about this bug is that it was more-or-less concurrently studied by two separate security analysis teams. Here's how the Google team summarizes the issue in their article: CVE-2015-7547: glibc getaddrinfo stack-based buffer overflow

glibc reserves 2048 bytes in the stack through alloca() for the DNS answer at _nss_dns_gethostbyname4_r() for hosting responses to a DNS query.

Later on, at send_dg() and send_vc(), if the response is larger than 2048 bytes, a new buffer is allocated from the heap and all the information (buffer pointer, new buffer size and response size) is updated.

Under certain conditions a mismatch between the stack buffer and the new heap allocation will happen. The final effect is that the stack buffer will be used to store the DNS response, even though the response is larger than the stack buffer and a heap buffer was allocated. This behavior leads to the stack buffer overflow.

The vectors to trigger this buffer overflow are very common and can include ssh, sudo, and curl. We are confident that the exploitation vectors are diverse and widespread; we have not attempted to enumerate these vectors further.

That last paragraph is a doozy.
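
Here is the mismatch in miniature. This is not glibc's actual code, just a hypothetical reduction (the function and variable names are mine) of the pattern the Google team describes: the resolver grows the answer buffer on the heap and records the new size, but never hands the new pointer back to the caller:

```c
#include <stdio.h>
#include <stdlib.h>

enum { ANSWER_BUF = 2048, MAXPACKET = 65535 };

/* The response won't fit, so a bigger heap buffer is allocated and the
 * recorded size is updated, but (the bug) the new pointer is never
 * stored through *ansp, so the caller keeps the small stack buffer. */
static void grow_answer_buffer_buggy(unsigned char **ansp, int *anssizp)
{
    unsigned char *newp = malloc(MAXPACKET);
    if (newp == NULL)
        return;
    *anssizp = MAXPACKET;   /* size is updated...                 */
    /* BUG: missing "*ansp = newp;" so the pointer is not.         */
    free(newp);             /* freed only to keep this demo leak-free */
}

int main(void)
{
    unsigned char stack_buf[ANSWER_BUF];
    unsigned char *ans = stack_buf;
    int anssiz = sizeof stack_buf;

    grow_answer_buffer_buggy(&ans, &anssiz);

    /* The caller now believes it may receive anssiz (65535) bytes
     * into ans, which still points at 2048 bytes of stack. */
    printf("buffer is %d bytes, recorded size is %d\n",
           (int) sizeof stack_buf, anssiz);
    return 0;
}
```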

Still, both of the above articles, although fascinating and informative, pale beside the epic, encyclopedic, and exhaustive treatise written by Carlos O'Donell of Red Hat and posted to the GNU C Library mailing list: [PATCH] CVE-2015-7547 --- glibc getaddrinfo() stack-based buffer overflow.

O'Donell's explication is perhaps the greatest debugging/diagnosis/post-mortem write-up of a bug I've ever read.

If you've ever tried to precisely describe a bug, and how it can cause a security vulnerability, you'll know how hard it is to do that both exactly and clearly. Here's how O'Donell does it:

The defect is located in the glibc sources in the following file:

- resolv/res_send.c

as part of the send_dg and send_vc functions which are part of the
__libc_res_nsend (res_nsend) interface which is used by many of the
higher level interfaces including getaddrinfo (indirectly via the DNS
NSS module).

One way to trigger the buffer mismanagement is like this:

* Have the target attempt a DNS resolution for a domain you control.
  - Need to get A and AAAA queries.
* First response is 2048 bytes.
  - Fills the alloca buffer entirely with 0 left over.
  - send_dg attempts to reuse the user buffer but can't.
  - New buffer created but due to bug old alloca buffer is used with new
    size of 65535 (size of the malloc'd buffer).
  - Response should be valid.
* Send second response.
  - This response should be flawed in such a way that it forces
    __libc_res_nsend to retry the query. It is sufficient for example to
    pick any of the listed failure modes in the code which return zero.
* Send third response.
  - The third response can contain 2048 bytes of valid response.
  - The remaining 63487 bytes of the response are the attack payload and
    the recvfrom smashes the stack with it.

The flaw happens because when send_dg is retried it restarts the query,
but the second time around the answer buffer points to the alloca'd
buffer but with the wrong size.
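
In code terms, the three-response recipe looks roughly like this. It is a condensed sketch with hypothetical helper names, not glibc's real control flow; the point it illustrates is that the same answer pointer, with its now-inflated size, is reused on every retry:

```c
#include <stdio.h>

enum { ANSWER_BUF = 2048, MAXPACKET = 65535 };

/* Stub standing in for the receive path: reports how many bytes each
 * of the attacker's three responses would deposit into buf. */
static int receive_response(int attempt, unsigned char *buf, int bufsiz)
{
    (void) buf;
    switch (attempt) {
    case 0:  return ANSWER_BUF; /* response 1: fills the buffer exactly */
    case 1:  return 0;          /* response 2: malformed, forces retry  */
    default: return bufsiz;     /* response 3: payload up to bufsiz     */
    }
}

int main(void)
{
    unsigned char stack_buf[ANSWER_BUF];
    unsigned char *ans = stack_buf;  /* alloca'd in the real code */
    int anssiz = sizeof stack_buf;

    for (int attempt = 0; attempt < 3; attempt++) {
        int n = receive_response(attempt, ans, anssiz);
        if (attempt == 0 && n == ANSWER_BUF)
            anssiz = MAXPACKET;  /* the buggy "grow": size inflated,
                                    ans still points at the stack   */
        if (n == 0)
            continue;            /* retry reuses the same (ans, anssiz) */
        printf("attempt %d: up to %d bytes into %d bytes of stack\n",
               attempt, n, (int) sizeof stack_buf);
    }
    return 0;
}
```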

O'Donell then proceeds to walk you through the bug, line by line, showing how the code in question proceeds, inexorably, down the path to destruction, until it commits the fatal mistake:

So we allocate a new buffer, set *anssizp to MAXPACKET, but fail to set *ansp to the new buffer, and fail to update *thisanssizp to the new size.

And, therefore:

So now in __libc_res_nsend the first answer buffer has a recorded size of MAXPACKET bytes, but is still the same alloca'd space that is only 2048 bytes long.

The send_dg function exits, and we loop in __libc_res_nsend looking for an answer with the next resolver. The buffers are reused and send_dg is called again and this time it results in `MAXPACKET - 2048` bytes being overflowed from the response directly onto the stack.
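
The shape of the fix follows directly from that description. What follows is a sketch of the corrected bookkeeping, with a hypothetical name and simplified to a single size field (the real code also tracks a per-query size), not the actual patch text: when the buffer grows, the pointer and the size must be published together, and on failure neither changes, so the pair always stays consistent:

```c
#include <stdlib.h>

enum { MAXPACKET = 65535 };

/* Corrected pattern: publish the new pointer and new size together,
 * or leave both untouched if the allocation fails. */
static int grow_answer_buffer_fixed(unsigned char **ansp, int *anssizp)
{
    unsigned char *newp = malloc(MAXPACKET);
    if (newp == NULL)
        return -1;          /* keep the old buffer and the old size */
    *ansp = newp;
    *anssizp = MAXPACKET;
    return 0;
}

int main(void)
{
    unsigned char small[2048];
    unsigned char *ans = small;
    int anssiz = sizeof small;

    if (grow_answer_buffer_fixed(&ans, &anssiz) == 0)
        free(ans);          /* ans and anssiz now describe the same
                               65535-byte heap buffer */
    return 0;
}
```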

There's more, too, and O'Donell takes you through all of it, including several other, much less severe bugs that the team uncovered while tracking this one down and studying it with tools like valgrind.

O'Donell's patch is very precise, very clearly explained, very thoroughly studied.

But, as Kaminsky points out in today's follow-up, it's still not clear that we understand the extent of the danger of this bug: I Might Be Afraid Of This Ghost

A few people have privately asked me how this particular flaw compares to last year’s issue, dubbed “Ghost” by its finders at Qualys.

...

the constraints on CVE-2015-7547 are “IPv6 compatible getaddrinfo”. That ain’t much. The bug doesn’t even care about the payload, only how much is delivered and if it had to retry.

It’s also a much larger malicious payload we get to work with. Ghost was four bytes (not that that’s not enough, but still).

In Ghost’s defense, we know that flaw can traverse caches, requiring far less access for attackers. CVE-2015-7547 is weird enough that we’re just not sure.

It's fascinating that, by sheer coincidence, the teams at Google and at Red Hat uncovered this behavior independently. Better, they figured out a way to coordinate their work:

In the course of our investigation, and to our surprise, we learned that the glibc maintainers had previously been alerted of the issue via their bug tracker in July, 2015. (bug). We couldn't immediately tell whether the bug fix was underway, so we worked hard to make sure we understood the issue and then reached out to the glibc maintainers. To our delight, Florian Weimer and Carlos O’Donell of Red Hat had also been studying the bug’s impact, albeit completely independently! Due to the sensitive nature of the issue, the investigation, patch creation, and regression tests performed primarily by Florian and Carlos had continued “off-bug.”

This was an amazing coincidence, and thanks to their hard work and cooperation, we were able to translate both teams’ knowledge into a comprehensive patch and regression test to protect glibc users.

It was very interesting to read these articles, and I'm glad that the various teams took the time to share them. I'm even more glad that companies like Red Hat and Google continue to fund work like this, because, in the end, this is how software becomes better, painful though that process might be.
