Blackhat 2010 video on remote timing attacks

November 9, 2010November 9, 2010 ~ Nate Lawson ~ 2 Comments

The Blackhat staff recently posted the video and slides from our talk on remote timing attacks. You can watch the talk via the playlist below. While there wasn’t enough time to go into detail in every aspect, I think this talk gives a good explanation why developers should take these attacks seriously.

Our conclusions can be summarized as:

Surprisingly small timing differences are visible from remote (< 40 ns LAN, < 25 μs Internet in this talk). This is an improvement over previously published results.
Many factors thought to prevent timing attacks, including geographical distance and competing server load, do not have as big an effect as tradition suggests.
Many common crypto libraries have timing leaks and some are exploitable.
If you deliver software as a product, you can’t make assumptions about your customer’s threat model for side channel leaks. They may just be on Slicehost or EC2 or use a slow embedded processor.
Since some routines that leak timing information (such as “terminate early compare”) are easy to fix, it’s better to be conservative than hope your customer’s environment prevents exploitation.

Since giving this talk at Blackhat, we have an updated version with new results. I had hoped to post it now, but it turns out that perfect is still the enemy of “good enough”. We’ll have more on that soon. In the meantime, I hope this talk is helpful to you.

Clench is inferior to TLS+SRP

September 8, 2010September 8, 2010 ~ Nate Lawson ~ 21 Comments

There’s a new proposed client authentication method called Clench from web security super-group Goatse Security. While their deep magic may work for revealing iPad email addresses, it’s not so great when it comes to designing crypto protocols.

This protocol is intended to provide a way for clients to authenticate themselves to servers and vice versa without revealing useful information to a man-in-the-middle. While most of the rhetoric in the post is anti-PKI (and rightly so), their protocol doesn’t really replace PKI.

First, it only works for sessions where the user is logging in. There is no way to authenticate just the server (anonymous client). This means the protocol does not replace PKI since it assumes a priori that the server knows the client’s password. There has to be an existing relationship established with this particular server before Clench can be used. This is the chicken-and-egg problem that SSL+certs solves by including a root cert in the browser.

Second, the server has to store plaintext passwords. The authors attempt to explain this away by suggesting the password will be stored in some other secure device (like an HSM). But this is definitely a step down in security from appropriate password hashing methods like bcrypt.

Finally, the authors don’t emphasize that Clench will require changes to all client software (e.g., your web browser). This has been a non-starter for many protocols.

Luckily, all of this has already been solved. SRP is a family of password-based authentication schemes. SRP provides the following:

Provably secure against dictionary attacks
Provably secure against an active man-in-the-middle attacker
No trusted third-party required
Passwords stored on the server are salted and hashed

SRP has been around since about 1998. RFC 5054 added a TLS extension for SRP and was approved in 2007. OpenSSL will support it in version 1.0.1. Various patches are out for OpenSSH support.

All the discussion about Clench will hopefully raise awareness of SRP instead. It’s been around for 12 years, solves more problems, and has none of the drawbacks Clench has. It’s definitely time to adopt SRP.

Optimized memcmp leaks useful timing differences

August 5, 2010August 5, 2010 ~ Nate Lawson ~ 9 Comments

One of the main questions raised after our Blackhat talk on exploiting remote timing attacks was how memcmp() optimizations might affect our results.

Our talk focused on a particular timing vulnerability. If an HMAC comparison exits early when a mismatch is found, it reveals information to an attacker about the correct HMAC for the forged message. This early-exit behavior is desirable in general but can leak timing information about secret data, allowing an attacker to iteratively guess the secret.

In C, memcmp() is almost always used for comparisons of binary data. Its API specifies that it compares two fixed-length buffers and returns the difference between them or zero if they are identical. In most implementations, memcmp() exits as soon as a difference is found in the two buffers.

Even if memcmp() exits early, the size of comparisons might affect how exploitable this timing delta is. For example, an optimized memcmp() may work in units of 64-bit quad words. Such an implementation might not be exploitable even if the timing delta of a mismatch was 1 second since you would have to brute-force 64-bits before seeing the timing delta increase due to a match. However, things aren’t so simple and using an optimized memcmp() does not always save you.

To prepare for our talk, we disassembled a few memcmp() implementations and found a surprising number were byte-based. This gives an attacker only 256 possibilities per guess, which is quite tractable. We decided to follow up with more detailed results for various operating systems.

We found two things:

Even an optimized memcmp() usually leaks bytewise timing differences. In fact, it is often required by its API to do so.
OS and compiler settings often select a bytewise compare even when a more-optimized version is available.

Anyone who is assuming particular compiler behavior is advised to review their disassembly. It’s almost always best to use a constant-time comparison instead of memcmp() when working with secret data.

Windows XP SP3 (32-bit)

We disassembled all .exes and DLLs in the system32 directory. All of them import memcmp() from either ntdll.dll or msvcrt.dll. All of them were compiled with some version of MSVC.

The msvcrt.dll memcmp() is 32-bit optimized. It compares 32-bit words if both pointers are 32-bit aligned (rep cmpsd). The remainder is compared with up to a 3-byte series of individual byte compares (cmp/jnz). If either of the input pointers is not 32-bit aligned, it goes into an unrolled loop that compares single bytes and exits early if any mismatch. ntdll.dll has the same memcmp() implementation.

Even with this optimization, the Windows memcmp() leaks bytewise timing information. When a difference is found in the 32-bit compares, it goes through each byte of the 32-bit word to find which byte differs. This means there is a 2-instruction timing delta per byte the attacker got correct, even though the original compare was in 32-bit chunks.

Windows memcmp() doesn’t actually have to do this set of bytewise comparisons. MSDN states the following behavior:

Return value Relationship of first count bytes of buf1 and buf2

< 0 buf1 less than buf2

0 buf1 identical to buf2

> 0 buf1 greater than buf2

Return value	Relationship of first count bytes of buf1 and buf2
< 0	buf1 less than buf2
0	buf1 identical to buf2
> 0	buf1 greater than buf2

The cygwin man page for memcmp() states:

The function returns an integer greater than, equal to or less than zero according to whether the object pointed to by S1 is greater than, equal to or less than the object pointed to by S2.

Since neither implementation requires identifying the bytewise difference, the implementation could just subtract the two 32-bit words. Only the sign of the return value matters, not the magnitude. This is analogous to treating the two buffers as multi-precision integers, just like in an RSA implementation.

We also compiled our own code using the cygwin and mingw gcc compilers. With no optimization flags, the cygwin gcc links to cygwin1.dll, which has a 32-bit optimized memcmp() similar to msvcrt.dll. It returns the bytewise difference and checks the input pointers for 32-bit alignment. The mingw compiler links to the msvcrt.dll version of memcmp() when no optimization flags are specified.

However, if an optimization level higher than default is used (-O1, -O2, -O3), both cygwin and mingw gcc use a 1-byte compare loop (rep cmpsb). This is surprising. If the -fno-builtin flag is specified, both cygwin and mingw gcc revert to the previous behavior, calling the external implementation in their respective DLLs.

We disassembled the default cygwin system utilities and libraries and found they all used a 1-byte compare loop. This is likely because they were compiled with optimization enabled but without the -fno-builtin flag.

Summary

The default Windows memcmp() (msvcrt.dll or ntdll.dll) leaks bytewise timing information even for chunks of 32-bits due to its search for the differing byte once a mismatch is found. It also leaks bytewise information if either input pointer is not 32-bit aligned. MSVC always uses an external DLL memcmp().

Both cygwin and mingw gcc use a 1-byte compare loop if optimization is enabled (-O1 or higher). If no optimization is enabled, gcc links to a DLL (cygwin1.dll or msvcrt.dll) that has a 32-bit optimized memcmp(). This memcmp() leaks the bytewise difference just like msvcrt.dll.

Windows 7 (64-bit)

This memcmp() implementation in msvcrt.dll compares buffers longer than 8 bytes in 64-bit chunks. It does not return the bytewise difference if it finds a mismatch. It returns the difference of the trailing 32-bits. The only bytewise timing leak occurs if the input length is not a multiple of 8. In this case, the remainder of the bytes are compared one at a time.

Linux (32 and 64-bit, Ubuntu 10.04 with glibc 2.12.1)

The x86 32-bit memcmp() is a long unrolled set of 32-bit compares. When it finds some difference, it jumps to find_diff which, like Windows, identifies exactly which byte differs. This leaks a bytewise timing delta.

Like Windows, the Linux memcmp() does not require finding the differing byte:

The memcmp() function returns an integer less than, equal to, or greater than zero if the first n bytes of s1 is found, respectively, to be less than, to match, or be greater than the first n bytes of s2.

The 64-bit memcmp() uses a tricky combination of a bit-set instruction (bsf) and shifts to identify the first byte difference without branching. So if the input is a multiple of 4, 8, or 16, it should not reveal a timing delta for any smaller units. If the input length is not a multiple of these values, smaller differences are visible.

Linux has optimized versions for other architectures as well, but we did not examine them.

Summary

Only the Linux 64-bit memcmp() does not reveal bytewise timing deltas. The 32-bit version behaves the same as Windows.

FreeBSD 8.0 (32 and 64-bit, gcc 4.2.1)

The system libc.so and libc.a have an optimized memcmp() for the most popular CPU architectures (i386, amd64, and arm). For i386, it consists of a 32-bit compare for the length in chunks of 4 (rep cmpsd), and a single-byte rep cmpsb for the remaining up to 3 bytes. There is no check for alignment first, so behavior with unaligned pointers is identical. The amd64 version has the same behavior as i386 but the libc memcmp() works in units of 8-byte quadwords (rep cmpsq).

If a difference is found in the word comparison, all implementations of memcmp() back up and find the first different byte using rep cmpsb. This required by the man page:

The memcmp() function returns zero if the two strings are identical, otherwise returns the difference between the first two differing bytes (treated as unsigned char values, so that ‘\200’ is greater than ‘\0’, for example). Zero-length strings are always identical.

For all other platforms (sparc, mips, ia64), the generic implementation of memcmp() is used. It is a loop of 1-byte compares, implemented in C.

GCC does not use the libc version of memcmp() by default. It uses a built-in version that consists of a rep cmpsb for the entire input. This occurred no matter what the optimization level (-O1, -O2, -O3). If we added the -fno-builtin flag, gcc generated calls to the libc memcmp(). Nothing changed based on optimization level.

We disassembled the default system utilities on FreeBSD 8.0 and found none of them were compiled with -fno-builtin, and thus all used the 1-byte memcmp.

Summary

By default, gcc on FreeBSD uses a completely timeable memcmp() that works in 1-byte units. This is regardless of optimization level. If the -fno-builtin flag is used, the libc memcmp() is linked in instead. The libc memcmp() is not more resistant to timing attacks because it finds the first differing byte. The sparc, mips, and ia64 platforms always use a 1-byte loop.

Exploiting remote timing attacks

July 19, 2010 ~ Nate Lawson ~ 9 Comments

We’ll be giving a Blackhat talk on exploiting remote timing attacks. Our goal is to convince developers that this class of attack is exploitable and worth fixing. This article in Computer World gives a decent background.

The attack is very simple. You repeatedly send guesses about a secret value to the server, which rejects them as incorrect. However, if your first byte of the guess is correct, it takes a very slightly longer time to return the error. With many measurements and some filtering, you can distinguish this difference.

While any comparison against a secret is a potential target, we chose HMAC comparison for a few reasons. An HMAC is a message authenticator, similar to a digital signature. The primary difference is that the the verifying party must keep the true HMAC secret since it gives the attacker the correct authenticator for their forged message. HMACs are used in many protocols, including most web authentication frameworks such as OAuth and OpenID and HTTP session cookies.

Guessing the correct HMAC for an arbitrary message is game over for these authentication frameworks. The token grants access to resources or allows the attacker to assume a user’s identity on various websites.

This is not a new attack. Remote timing attacks on OpenSSL were shown to be practical in 2003. Further research in 2007 showed that differences as small as 20 microseconds over the Internet and 100 nanoseconds over the LAN could be distinguished with about 1000 samples.

We (and others) have been reporting these flaws for over a year and raising developer awareness. In 2009, we found a timing leak in Google Keyczar‘s HMAC verification that was quickly fixed. Coda Hale found a similar flaw in Java’s MessageDigest implementation. The OAuth group discussed his bug back then and some maintainers decided to fix it in their code too. But many didn’t.

A quick review of OAuth and OpenID implementations showed many had timing leaks that were potentially exploitable. Either developers knew about the bug and gave it a low priority or they weren’t aware of it. Either way, we thought some concrete research was needed to show exactly how easy or hard it was to exploit these flaws in various environments.

Exploiting timing attacks depends on extracting a timing difference from many samples by filtering out the effect of noise. If there is too much noise (the difference is too small), this attack may take too long to be practical. But an attacker who can control the environment to decrease noise (say, by blocking competing users of the server), accurately model the noise and thus filter it better, or just wait longer because their target is so valuable might be successful.

Our talk builds most closely on the Crosby 2007 paper mentioned above. We have tested many configurations to find how different variables influence an attacker. The most obvious analysis is how small a time delta can be distinguished for a given number of samples. This was performed from various vantage points (guest-to-guest VM, LAN, Internet) and for various languages (C, Python, Ruby, Java, etc.)

We applied various filtering methods to the samples to see how much unfiltered jitter remained. This would determine how small a difference could be distinguished. We added in other variables such as competing load, power management, and other factors.

The talk will have the full results. Both the proponents and skeptics should be surprised in some way. We have found some configurations that are almost certainly not exploitable and others that certainly are. If you’re the maintainer of a software package, don’t count on your users being safe from timing attacks because of your assumptions. Cryptographic software, especially open-source, is deployed in everything from slow embedded servers on up to multi-Ghz clusters.

Likewise, attackers often have a better vantage point than you’d first assume. With shared hosting providers and cloud computing, you have to assume attackers can locate themselves on the same host as their target. Even in a shared datacenter, you may assume the attacker has a LAN-equivalent vantage point.

Given that it is difficult to rule out timing attacks and the fix is so simple, we think it’s best to fix them proactively. The approach that is easiest to gain assurance is to use a constant-time compare function. (That post also gives reasons why other approaches don’t work or are too complicated to verify).

We hope our talk will give some concrete results so that more developers will take this flaw seriously. See you in Vegas!

SSL gives point-to-point, not end-to-end security

June 28, 2010June 28, 2010 ~ Nate Lawson

At Summercon, Jon Oberheide gave a talk on Android security. He described a trojan app called RootStrap that checks for native code updates on a remote server and executes them. This is possible because Android doesn’t place any restrictions on native code that is installed with or later downloaded by an app. The only limitation is that the code runs as the app’s unprivileged UID, but there are no additional restrictions.

The more interesting part gives an overview how the app installation process interacts with GTalkService. In a followup post today, Jon gave more analysis of this installation mechanism. Unlike other parts of Android, this service is not open source so you have to disassemble the DEX file to see how it works.

When you select “install” on the Market app, the phone doesn’t download and install the app immediately. Instead, Google’s server sends an INSTALL_APP message to GTalkService on the phone, which downloads and installs the app. The message is delivered over SSL and includes a SHA-1 hash of the app’s installer (.apk). While this is better than no authentication, the link between the user’s action and the actual code installed is tenuous.

SSL provides good point-to-point privacy and integrity protection. However, there is no guarantee to upper layers that SSL is indeed in use. Few, if any, programs query the SSL layer to check the state of the connection, do additional cert validation, etc. Even when active, SSL provides point-to-point, not end-to-end security.

In today’s computing environment, there are seldom only two systems involved in a transaction. Even if the apps were stored on a single Google server, they are still compiled and signed on other systems. Anywhere along that production chain, a compromise could lead to apps being trojaned and surreptitiously pushed to many Android phones.

Android does provide some security in its code signing model. The developer’s signature on the .apk is basically a JAR signature. The hash of the APK cert is used to determine if a new app can access the same data as the previous app since it determines which UID an app gets. However, this only protects data created by existing apps from being accessed by other apps that are not signed with the same key. It also doesn’t say anything about the legitimacy of the code since the developer signs it themselves, often with a self-signed cert.

Since it appears that the INSTALL_APP message does not have any signature on itself, this message is not protected other than with SSL. Could an attacker who could inject some messages into the Google server replay this message, causing phones everywhere to install their malware? Will phones install apps without the Market service requesting it?

We’ll have to see what happens as more info is found out about GTalkService. The installation process should include a challenge/response value for liveness (perhaps this is the “tickle_id” field?) The installed APK should be linked to the phone’s install challenge with a Google signature. After all, Android ships with a list of CAs. Why can’t Google include some limited CA for their own domains to enable this signing?

This is a good example of how SSL only provides point-to-point, not end-to-end security. While SSL is great for transactions, additional protection is needed for application-level functions such as updates, especially in today’s multi-server environment.

Why buffer overflow exploitation took so long to mature (part 2)

May 5, 2010May 1, 2010 ~ Nate Lawson ~ 5 Comments

Last time, I asked the question, “why did it take 24 years for buffer overflow exploits to mature?” The relevant factors to answering this question are particular to three eras: academic computing (early 1970’s), rise of the Internet (1988), and x86 unification (1996 to present).

In the first era of academia, few systems were networked so buffer overflow flaws could only be used for privilege escalation attacks. (The main type of networking was dialup terminal access, which was a simpler interface and usually not vulnerable to buffer overflows.) Gaining control of an application was relevant to the designers of the trusted computing criteria (Rainbow Books), where objects were tagged with a persistent security level enforced by the TCB, not the program that accesses them. Except for the most secure military systems, computers were mainly used by academia and businesses, which focused more on password security and preventing access by unsophisticated attackers. Why write a complex exploit when there are weak or no passwords on system accounts?

In 1988, the wide availability of Unix source code and a common CPU architecture (VAX) led to the first malicious buffer overflow exploit. Before the rise of the minicomputer, there were many different CPU architectures on the Arpanet. The software stacks were also changing rapidly (TOPS-20, RSX-11, VMS). But by the late 1980’s, the popularity of DEC, Unix, and BSD in particular had led to a significant number of BSD VAX systems on the Internet, probably the most homogeneous it had been up to that point.

I think the widespread knowledge of the VAX architecture along with the number of BSD systems on the Internet led to it being the target of the first malicious buffer overflow exploit. Building a worm to target a handful of systems wouldn’t have been nearly as much fun, and BSD and the VAX had hit a critical mass on the Internet. More general Unix flaws (e.g., trust relationships by the “r” utilities) were also exploited by the worm but the VAX buffer overflow exploit was the only such flaw the author chose to target or had enough familiarity to create.

With this lone exception, any further exploitation of buffer overflows remained quite secret. Sun workstations and servers running 68k and then SPARC processors became the Internet host of choice by the 1990’s. However, there was significant fragmentation with many hosts running IRIX on SGI MIPS, IBM AIX on RT and RS, and HP-UX on PA-RISC. Even though Linux and the various BSD distributions had been launched, they were mostly used for hobbyists as client PCs, not servers.

Meanwhile, a network security community had developed around mailing lists such as Bugtraq and Zardoz. Most members were sysadmins and hackers, and few vendors actively participated (Sun was a notable exception). Exploits and fixes/workarounds were regularly published, most of them targeting logic flaws such as race conditions or file descriptor leakage across trust domains. (On a side note, I find it amusing that any debate about full disclosure back then usually involved how many days someone should wait between notifying the vendor and publishing an exploit, not advisory. Perhaps we should be talking in those same terms today.)

In 1995, the various buffer overflow exploits published were only for non-x86 systems (SunOS and HP-UX, in particular). Such systems were still the most prevalent servers on the Internet. The network security community was focused on this kind of “big iron” with few exploits published for open-source Unix systems or Windows, which was almost exclusively a client OS. This changed with the splitvt exploit for Linux on x86, although it took a year until the real impact of this change appeared.

One of the reasons buffer overflows were so rarely exploited on Unix workstations in the early 1990’s was the difficulty of getting CPU and OS architecture information. Vendors kept a lot of this information private or only provided it in costly books. Sure you could read the binutils source code, but without even an understanding of the assembly behavior, it was difficult to gain this kind of low-level knowledge. Also, common logic flaws were easier to exploit for the Unix-centric sysadmin community and that populated Bugtraq.

In 1996, several worlds collided and buffer overflow exploitation grew rapidly. First, Linux and BSD x86 systems had finally become common enough on the Internet that people cared about privilege escalation exploits specific to that CPU architecture. Second, open-source Linux and BSD code made it easy to understand stack layout, trap frames, and other system details. Finally, the virus community had become interested in network security and brought years of low-level x86 experience.

The last factor is the one I haven’t seen discussed elsewhere. Virus enthusiasts (virus authors, defenders, and various interested onlookers) had been engaged in a cat-and-mouse game since the 1980’s. The game was played out almost exclusively on x86 DOS systems. During this period, attackers created polymorphic code generators (MtE) and defenders created a VM-based system for automated unpacking (TBAV). Even today, the impact of all this innovation is still slowly trickling into mainstream security products.

Once all three of these components were in place, buffer overflow exploits became common and the field advanced quickly through the present day. CPU architecture information is now freely available and shellcode and sample exploits exist for all major systems. New techniques for mitigating buffer overflows and subverting mitigations are also constantly appearing with no end in sight.

The advent of the Code Red and Slammer worms in the early 2000’s can be seen as a second-coming of the 1988 worm. At that point, Windows servers were common enough on the Internet to exploit but did not yet contain valuable enough data to keep the flaw secret. That period quickly ended as banking and virtual goods increased the value of compromising Internet hosts, and client-side exploits also became valuable. While some researchers still publish buffer overflow exploits, the difficulty of producing a reliable exploit and associated commercial value means that many are also sold. The race started in the 1970’s continues to this day.