Why buffer overflow exploitation took so long to mature

I think the history of buffer overflow exploits is interesting because of how long it took for techniques to mature. About 16 years passed from awareness to the first public exploitation, and then 8 more years from that until they were commonly exploited. Programmers were aware of this class of flaw but did little to avoid it for 24 years. But why?

The technique of executing code via a buffer overflow was published at least as early as 1972 (pdf). It’s quite likely this was common knowledge before then, as overlay techniques (loading one segment over another during execution) were often used to save RAM. The first public malicious exploitation of a buffer overflow was by the 1988 Internet worm.

The Morris worm exploited a flaw in the finger daemon, which used gets() to read the username from the client into a buffer on the stack. Its shellcode was only targeted at 4.3 BSD Unix on the VAX CPU, and it used a return address overwrite to gain control. Interestingly enough, the worm did not have shellcode for SunOS even though the Motorola 68k CPUs would have been easy to exploit, and Sun’s fingerd was vulnerable to the same flaw.

Public exploitation of overflows took a 7-year hiatus, even with the publication of a seminal 1990 paper on fuzz testing. Everyone seemed to know overflows were present and theoretically exploitable, but no one seemed to be exploiting them or worried about fixing them.

Then in February 1995, Thomas Lopatic published a stack overflow exploit for NCSA httpd on HP-UX. This was an excellent piece of work, but on an obscure OS and CPU. (By the way, Thomas was also known for his work on a loader for Amiga games and later for analyzing the Windows XP activation scheme).

However, buffer overflow exploits were still few and far between. In August, 8lgm published an advisory for syslog() on SunOS SPARC, but no public exploit. In December, the splitvt exploit for Linux x86 was published. However, a month later, some people were still wondering how it worked.

In November 1996, Aleph1 published “Smashing the Stack for Fun and Profit”. This important article described in detail the evolution of a stack overflow exploit. Though I don’t know the exact history, Aleph1’s x86 shellcode is nearly identical to the splitvt exploit, so perhaps his method was derived from it. The article also provided shellcode for SunOS and Solaris on SPARC, which was an important advance since such machines were still more “interesting” than x86 systems.

After this paper, numerous stack overflows were published. Research on both sides advanced rapidly, with new techniques such as heap overflow exploitation and stack integrity protection. So why had research in this area taken so long to reach this inflection point of rapid growth?

In the next article, I discuss various factors that collided in 1996, creating the buffer overflow arms race that continues to this day.

Attacking RSA exponentiation with fault injection

A new paper, “Fault-Based Attack of RSA Authentication” (pdf) by Pellegrini et al, is making the rounds. The general idea is that an attacker can disrupt an RSA private key operation to cause an invalid signature to be returned, then use that result to extract the private key. If you’re new to fault injection attacks on RSA, I previously wrote an intro that should help.

The main concept to grasp is that public key crypto is brittle. In the case of RSA’s CRT operation, a single bit error in one multiplication result is enough to fully compromise your private key. We’ve known this since 1997. The solution is simple: validate every signature with the public key before returning it to the caller.
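To see just how brittle, here is a rough sketch of the classic CRT fault attack using toy numbers (the parameters and helper function below are purely illustrative, not from any real implementation): a single corrupted CRT half lets anyone factor the modulus with one gcd.

from math import gcd

# Toy RSA key (illustrative only; real keys use 1024-bit or larger primes).
p, q = 61, 53
n = p * q
e = 17
d = pow(e, -1, (p - 1) * (q - 1))   # modular inverse via pow() needs Python 3.8+

m = 42                              # message representative to be signed

# CRT signing: compute the signature mod p and mod q, then recombine.
sp = pow(m, d % (p - 1), p)
sq = pow(m, d % (q - 1), q)
q_inv = pow(q, -1, p)

def crt_combine(sp, sq):
    h = (q_inv * (sp - sq)) % p
    return sq + h * q

good_sig = crt_combine(sp, sq)
assert pow(good_sig, e, n) == m     # a valid signature verifies

# A single error in the mod-p half (e.g., one bad multiply) yields a signature
# that is still correct mod q but wrong mod p...
faulty_sig = crt_combine((sp + 1) % p, sq)

# ...so gcd(faulty_sig^e - m, n) reveals the secret factor q.
recovered = gcd((pow(faulty_sig, e, n) - m) % n, n)
assert recovered == q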

The authors noticed something curious. OpenSSL does verify signatures it generates before returning them, but if it detects a problem, it does not just return an error. Instead, it tries again using a different exponentiation process and returns that signature without validating it.

Think about this for a moment. What conditions could cause an RSA private key operation to compute an invalid answer? An innocent possibility is cosmic radiation, bad RAM, etc. In this case, all computations should be considered unreliable and any retried operation should be checked very carefully. The other and more likely possibility is that the system is under attack by someone with physical proximity. In this case, OpenSSL should generate a very obvious log message and the operation should not be retried. If it is, the result should be checked very carefully.

For whatever reason, the OpenSSL programmers decided to retry with fixed-window exponentiation and trust that since there were no published fault attacks for it, they didn’t have to validate its result. This is a foolhardy attitude — not something you want to see in your crypto library. There had been many other fault injection attacks against various components or implementation approaches for RSA, including right-to-left exponentiation. There was no reason to consider left-to-right exponentiation invulnerable to this kind of attack.

Fixed-window exponentiation is a form of sliding window exponentiation. This is just a table-based optimization, where a window (say, 3 bits wide) is moved across the exponent, computing the final result incrementally. While this may be resistant to some timing attacks (but not cache timing or branch prediction attacks), there is nothing that would prevent fault injection attacks.
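As a rough illustration of the table-plus-window idea (this sketch is mine, not OpenSSL’s code), a fixed-window modular exponentiation looks something like this in Python:

def fixed_window_modexp(base, exponent, modulus, w=3):
    # Precompute base^0 .. base^(2^w - 1) mod modulus.
    table = [1] * (1 << w)
    for i in range(1, 1 << w):
        table[i] = (table[i - 1] * base) % modulus

    # Process the exponent w bits at a time, most significant window first.
    nwindows = (exponent.bit_length() + w - 1) // w
    result = 1
    for i in reversed(range(nwindows)):
        for _ in range(w):
            result = (result * result) % modulus      # w squarings per window
        window = (exponent >> (i * w)) & ((1 << w) - 1)
        result = (result * table[window]) % modulus   # one table multiply per window
    return result

assert fixed_window_modexp(42, 12345, 99991) == pow(42, 12345, 99991)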

Indeed, it turns out to be vulnerable. The authors generate a few thousand signatures with single bit-flips in some window of the signature. Then they compare the faulty signatures to a correct signature over the same message. They compute the value for that portion of the private exponent, since there are only 2^w possibilities for that location if w is the window size in bits. This is repeated until enough of the private key is known.

The method they used to create the faulty signatures was a bit artificial. They built a SPARC system on an FPGA running Linux and OpenSSL. They then decreased the power supply voltage until multiplies started to fail. Since multiplication logic is a relatively long chain, it is often one of the first things to fail. However, a more interesting hardware result would be to attempt this kind of attack on an actual server because FPGAs work differently than ASICs. It might require careful targeting of the right power pins on the CPU. Since power pins are numerous in modern systems, this may be more effective than only modulating the system power supply.

This was a nice attack but nothing earth-shattering. The only thing I was floored by (yet again), was the willingness of crypto implementers to perform unsafe operations in the face of an almost certain attack. Shame on OpenSSL.

Timing-independent array comparison

I find it interesting that people tend to follow the same stages of awareness when first hearing about timing attacks. A recent discussion about Django redesigning their cookie scheme shows people at the various stages.

1. The timing difference is too small to attack

This is the result of not understanding statistics or how jitter can be modeled and filtered. Usually, people are thinking that a single measurement for a given input is probably obscured by noise. This is correct. That’s why you take many measurements for the same input and average them. The more noise, the more measurements you need. But computers are fast, so it doesn’t take long to do thousands or even millions of runs per guessed byte.

The basic statistical method is simple. You apply a hypothesis test comparing the mean of all runs with a given guess for a byte (say, 0) against the mean of the runs for all the other guesses (1, 2, …, 255). If there is a significant difference in the means, then you have guessed that byte correctly. If no difference, repeat for the other 255 values. If still no difference, take more measurements.
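As a rough sketch of what such a test might look like, assuming timing samples have already been collected for each guessed byte value (the helper names and the use of Welch’s t statistic are just one illustrative choice):

from statistics import mean, stdev

def welch_t(a, b):
    # Welch's t statistic: difference of means scaled by the combined variance.
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)

def best_guess(samples_by_guess):
    # samples_by_guess maps each guessed byte (0..255) to its list of timings.
    # The correct guess should differ significantly from all the others combined.
    scores = {}
    for guess, samples in samples_by_guess.items():
        others = [t for g, ts in samples_by_guess.items() if g != guess for t in ts]
        scores[guess] = abs(welch_t(samples, others))
    return max(scores, key=scores.get)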

The other misconception is that jitter is too great to get anything useful out of the measurements, especially in a network. There is an excellent paper by Crosby and Wallach that debunks this myth by carefully analyzing noise and its causes as well as how to filter it. They conclude that an attacker can reliably detect processing differences as low as 200 nanoseconds on a LAN or 30 microseconds on a WAN given only 1000 measurements. If your server is hosted somewhere an attacker can also buy rackspace, then you are vulnerable to LAN timing attacks.

2. I’ll just add a random delay to my function

Then I’ll take more measurements. See above.

3. I’ll add a deadline to my function and delay returning until it is reached

This usually has higher overhead and is more prone to errors than implementing your function so that its timing does not vary based on the input data. If you make a mistake in your interval calculation, or an attacker is able to “stun” your function somewhere in the middle such that all target computations occur after the interval timer has expired, then you’re vulnerable again. This is similar to avoiding buffer overflow attacks by lengthening the buffer; it is better to fix the actual problem instead.

4. Ok, ok, I developed a constant-time function

How carefully did you analyze it? Here is one attempt from the Reddit thread, posted after someone cited my Google talk on crypto flaws:

def full_string_cmp(s1, s2):
    total = 0
    for a, b in zip(s1, s2):
        total += (a != b)
    return not total

The first bug is that this opens a worse hole if the two strings are not the same length. In fact, an attacker can send in a zero-length string, and the result of zip() is then empty. This means the for loop never executes and any input is accepted as valid! The fix is to check the two lengths first to make sure they’re equal, or in C, to compare two fixed-length arrays that are guaranteed to be the same size.
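You can demonstrate the hole directly with the function above:

# The empty guess "matches" any value because the loop body never runs.
print(full_string_cmp("", "any secret value"))   # True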

The next bug is smaller, but still valid. There’s a timing leak in the comparison of a and b. When you write:

total += (a != b)

Both Python and a C compiler actually generate low-level instructions equivalent to:

if (a != b)
    tmp = 1
else
    tmp = 0
total += tmp

Thus, there is still a small timing leak. If they are equal, an additional branch instruction is executed. If they are not equal, this is skipped. A common intuitive mistake among programmers is that single lines of code are atomic. This is why you might find a C version of this such as “total += (a != b) ? 1 : 0”, which generates the same code. Just because it fits on a single line does not mean it is atomic or constant-time.

5. Ok, I fixed those bugs. Are we done yet?

Let’s see what we have now (adapted from my keyczar posting).

def constant_time_compare(userMsg, correctValue):
    # Refuse to compare values of different lengths (this leaks only the length).
    if len(userMsg) != len(correctValue):
        return False
    # OR together the XOR of every byte pair; the result is 0 only if all bytes match.
    result = 0
    for x, y in zip(userMsg, correctValue):
        result |= ord(x) ^ ord(y)
    return result == 0

This now uses a bitwise operator (XOR) instead of a logical compare (!=), and thus should be constant-time. The += operator could possibly have a leak if carries take longer than an add without carry, so we went with |= instead. We check the lengths and abort if they don’t match. This does leak timing information about the length of the correct value, so we have to make sure that is not a problem. Usually it’s not, but if these were passwords instead of cryptographic values, it would be better to hash them with PBKDF2 or bcrypt instead of working with them directly. But are we done?

Maybe. But first we need to figure out if our high-level language has any behavior that affects the timing of this function. For example, what if there is sign-bit handling code in the Python interpreter that behaves differently if x or y is negative? What if the zip() function has an optimization that compares the two sequences first to see if they’re identical before pairing up their elements?

The answer is you can never be sure. In C, you have more assurance but still not enough. For example, what if the compiler optimization settings change? Is this still safe? What about the microarchitecture of the CPU? What if Intel came out with a new byte-wise cache alignment scheme in the Pentium 1000?

I hope this has convinced some people that side channel attacks are not easily solved. Important code should be carefully reviewed to have higher assurance this class of attack has been mitigated.

Stop using unsafe keyed hashes, use HMAC

The HMAC construction turns a cryptographic hash algorithm into a keyed hash. It is commonly used for integrity protection when the sender and recipient share a secret key. It was developed to address various problems with arbitrary keyed hash constructions. So why are developers still rolling their own?

One of the original papers on keyed hash constructions describes the motivations for developing a standard for HMAC. In 1995, there was no standardization and cryptographers only worked from hunches as to what constituted a secure keyed hash. This paper summarized two known attacks on some common schemes that had evolved in the absence of a standard.

The first construction the paper attacks is H(k || m), aka “secret prefix”. The key and the message to be authenticated were concatenated and hashed. The authenticator was the resulting hash. This was fatally flawed, as I mentioned in my previous talks on web crypto. Standard hash algorithms that use the Merkle-Damgard construction (like SHA-1) are subject to a length-extension attack. An attacker can trivially create an authenticator for m’ where m’ = m1 || pad || m2 if they have seen the authenticator for m1. (The “pad” value makes the input a multiple of the compression function block size and includes the total length hashed). This flaw was most recently found in the Flickr API.

The second construction was H(m || k), aka “secret suffix”. While the length-extension attack no longer applies because k is unknown to the attacker, this still maximally exposes you to weaknesses in the hash algorithm. Preneel et al described two attacks on this approach.

The first attack is that secret suffix is weaker against offline second-preimage attacks. That is, an attacker can take an authenticator for a known plaintext m and search for their own plaintext m’ that produces the same intermediate hash value just before k is processed. If the input to the hash function just before k is identical, then the output is also the same. This means the attacker can just send m’ with the previously seen authenticator for m and the two will match.

For a secure cryptographic hash function, a second-preimage attack takes 2^n tries, where n is the hash size in bits[1]. However, the secret suffix approach is marginally weaker against this kind of attack. If an attacker has seen t text and authenticator pairs, then the effort is only 2^n / t, since they can attempt a second-preimage match against any of the authenticators they have seen. This is usually not a problem since second-preimage attacks are much harder than finding collisions. As they have aged, all widely-used hash algorithms have fallen to collisions before second-preimage attacks.

The other attack is much more powerful. If the attacker can submit a chosen message to be authenticated, she can attempt an offline collision search. In this case, an attacker searches for two messages, m and m’, that hash to the same value. Once they are found, she requests an authenticator for the innocuous message m. Since a collision means the intermediate hash state before k is mixed in is identical (an “internal collision”), the final authenticator for both will be identical. The attacker then sends the evil message m’ with the authenticator for m, the two match, and the message is accepted as authentic.

This means the secret suffix construction is insecure if collisions can be found in the underlying hash function. Due to the birthday paradox, this takes 2^(n/2) work even for a secure hash function (e.g., 2^64 operations for a 128-bit hash). But it gets worse if the hash is weak against collisions.

MD5 has multiple demonstrated collisions. Many systems continue to use HMAC-MD5 because a collision alone is not enough to compromise it. Because of the way the key is applied in HMAC, an attacker would have to generate an internal collision with the secret key, which is much harder than colliding with a chosen message[2]. Although this may provide some immediate comfort, it is still important to move to HMAC-SHA256 soon if you are using HMAC-MD5.

In contrast, MD5 with secret suffix is completely compromised due to collisions, especially with the recent advance in chosen-prefix collisions. Currently, this takes about 30 seconds on a laptop. To repeat, under no circumstances should you use an arbitrary hash construction instead of HMAC, and MD5 with secret suffix is completely broken. If you were putting off moving away from MD5(m || k), now would be an excellent time to move to HMAC-SHA256.
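For reference, doing it right takes only a few lines with Python’s standard hmac module (the key and message below are placeholders):

import hashlib
import hmac

key = b'shared secret key'
message = b'message to authenticate'

# Sender: compute the authenticator with HMAC-SHA256.
tag = hmac.new(key, message, hashlib.sha256).digest()

# Recipient: recompute and compare with a timing-independent comparison.
def verify(key, message, tag):
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

assert verify(key, message, tag)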

Thanks go to Trevor Perrin and Peter Gutmann for comments on this article.

[1] This is not true for longer messages. Multicollisions can be used against each block of a longer message. See the work by Kelsey and Schneier and Joux for more details.

[2] This is a very broad statement about HMAC. A more detailed analysis of its security will have to wait for another post.

Google Tech Talk on common crypto flaws

Yesterday, I gave a talk at Google on common crypto flaws, with a focus on web-based applications. As with my previous talk, the goal was to convince developers that the most prevalent crypto libraries are too low-level, so they should reexamine their design if it requires a custom crypto protocol.

Designing a crypto protocol is extremely costly due to the careful review that is required and the huge potential damage if a flaw is found. Instead of incurring that cost, it’s often better to keep state on the server or use well-reviewed libraries that provide SSL or GPG interfaces. There are no flaws in the crypto you avoid implementing.

I hope you enjoy the video below. Slides are also posted here.

XMLDsig welcomes all signatures

There’s a new flaw in the W3C XMLDsig standard, found by Thomas Roessler. This standard specifies the procedure for signing and validating XML data, such as for SOAP. When a document is validated with an HMAC, an additional parameter can be supplied: <HMACOutputLength>. This allows the signer to use a truncated HMAC.

As you may remember, an HMAC is a way of validating data by using a secret key and a cryptographic hash algorithm. I avoid using the term “keyed hash” as that leads to bad implementations like “HASH(key || data)”. The output of an HMAC is the size of the hash algorithm, say 160 bits for SHA-1. This value is sometimes truncated in space-constrained designs. For example, only the first 128 bits might be sent and verified. This is sometimes acceptable because circumventing an HMAC requires a second-preimage attack (2^n), unlike forging a signature which only requires a collision (2^(n/2)).

The problem is that there is no specified lower bound on <HMACOutputLength>. So this is a perfectly valid signature field in XMLDsig:

<SignatureMethod Algorithm="http://www.w3.org/2000/09/xmldsig#hmac-sha1">
      <HMACOutputLength>0</HMACOutputLength>
</SignatureMethod>

Yes, an attacker can send any request and any signature and it will be accepted if the server follows the field blindly and validates only 0 bits. Even a server that checked for 0 might allow a truncated length of 1 bit or 8 bits or whatever.
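Here is an illustrative sketch of the problem (not the XMLDsig API, just the underlying idea): truncating both sides to zero bytes compares two empty strings, which always match.

import hashlib
import hmac

def verify_truncated(key, message, tag, output_length_bits):
    full = hmac.new(key, message, hashlib.sha1).digest()
    nbytes = output_length_bits // 8
    # Compare only the first nbytes of each value, as <HMACOutputLength> allows.
    return hmac.compare_digest(full[:nbytes], tag[:nbytes])

# With an output length of 0, any forged message and an empty tag "verify".
print(verify_truncated(b'server key', b'forged request', b'', 0))   # True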

According to the advisory, it took 6 months to release this simple spec erratum because it affected so many vendors. This is a great example of how a little leniency in crypto can have a huge impact on security.

Also, I’m not certain this bug is actually fixed. For example, I see that the algorithm can be specified by the same field. So could an attacker specify a broken algorithm like MD4, which has a second-preimage attack? If the server’s crypto library just maps the given Algorithm field to a hash function, this might be possible.

The spec says:

Requirements are specified over implementation, not over requirements for signature use. Furthermore, the mechanism is extensible; alternative algorithms may be used by signature applications.

So it’s not clear whether or not this is a problem, although ambiguity in crypto is one of the biggest sources of flaws.

The most annoying thing about this entire vulnerability is that truncated HMACs were included in the standard at all. This is XML we’re talking about, not some packed binary protocol. The difference between a full HMAC-SHA1 and the new minimum allowed truncated value is 10 bytes (plus base64 encoding expansion). You’re telling me that it’s worth exposing all your implementers to this kind of crypto risk to save 10 bytes? Really?