Timing attack in Google Keyczar library

May 28, 2009May 29, 2009 ~ Nate Lawson

I recently found a security flaw in the Google Keyczar crypto library. The impact was that an attacker could forge signatures for data that was “signed” with the SHA-1 HMAC algorithm (the default algorithm).

Firstly, I’m really glad to see more high-level libraries being developed so that programmers don’t have to work directly with algorithms. Keyczar is definitely a step in the right direction. Thanks to all the people who developed it. Also, thanks to Stephen Weis for responding quickly to address this issue after I notified him (Python fix and Java fix).

The problem was that the HMAC verify function (Python src/keyczar/keys.py, Java src/org/keyczar/HmacKey.java) leaked timing information based on how long a verify operation took to fail. The function was defined as follows for the HMAC mode:

Python

    return self.Sign(msg) == sig_bytes

Java

    return Arrays.equals(hmac.doFinal(), sigBytes);

Since the return value is a SHA-1 hash string, the operation devolves to a byte-by-byte compare against sig_bytes. In both Python and Java, this is a classic sequence comparison that terminates early once an incorrect match is found. This allows an attacker to iteratively try various HMAC values and see how long it takes the server to respond. The longer it takes, the more characters he has correct.

It may be non-intuitive, but the symmetric nature of MACs means the correct MAC value for an arbitrary message is a secret on-par with key material. If the attacker knows the correct MAC for a message of his choosing, he can then send that value to forge authentication of the message to the server.

I’ve implemented a simple test server using the Python version of Keyczar. It verifies an HMAC and sends back “yes” or “no” if the value is correct. I then wrote a client in C that connects to the server and tries various values for the HMAC. It tries each value multiple times and records a set of TSC differences for each. These can be fed to a program like ministat to decide when a significant difference has been confirmed (based on mean and standard deviation).

I can confirm that localhost tests have a discernible difference, depending on whether each subsequent byte is correct. I have not optimized the attack to work over a LAN or the Internet yet. However, this does not mean remote attacks are infeasible. Where jitter and other noise is present in the samples, an attacker just needs to collect more data to average it out. Remote timing attacks on SSL have been demonstrated where the timing difference was only a few native multiplies.

I recommended changing the verify function to use a timing-independent compare, such as the following.

    correctMac = self.Sign(msg)
    if len(correctMac) != len(sig_bytes):
        return False
    result = 0
    for x, y in zip(correctMac, sig_bytes):
        result |= ord(x) ^ ord(y)
    return result == 0

This function is data-independent, except for revealing the total length of the correctMac string. Since this is not considered important to security, it is acceptable. Of course, this might not be true for another use of this same code, so it cannot be blindly used in other applications.

The lesson from this is that crypto flaws can be very subtle, especially when it comes to transitioning from an abstract concept (“compare”) to a concrete implementation (“loop while bytes are equal”). Keyczar was implemented by some smart people. If you’re a programmer, you should be using a high-level library like Keyczar or GPGME to take advantage of this knowledge. If you ignore this and develop your own design, it’s likely it would have many worse problems than this one. For those that have to build crypto, please get a third-party review of your design.

I consider it a failing of the crypto community that these libraries are still so new, while the past 20 years we’ve focused on providing raw algorithm APIs. But at least now we have a chance to build out a few important high-level libraries, review them carefully, and encourage application developers to use them. It’s not too late.

22 thoughts on “Timing attack in Google Keyczar library”

noktest says:

May 29, 2009 at 2:20 pm

It is not a “crypto flaw” since the problem does not lie in the actual crypto algorithm, but rather at a point in the flow where a comparison for the authentication _out_ of the actual crypto algorithm is made.

This is the equivalent of using sha512 hashes for your password while you have it written on a postit in plain text. Nobody is perfect, bravo for picking this one up.
1. Nate Lawson says:
  
  May 29, 2009 at 2:43 pm
  
  Crypto is not just algorithms but protocols and implementations of them. This is an implementation error, but my point is that it’s dangerous to apply abstract concepts to crypto implementations.
  
  The compare operation that was previously used was functionally correct — it compared two values and returned whether they matched or not. It was perfect in the abstract concept sense. But we run code, not concepts, and code can leak information via side channels. It’s exactly this breakdown between abstract concept and concrete implementation that makes side channels so common.
  
  Based on your input, I’ve edited that sentence to try to make this more clear. Thanks.
Dave says:

June 8, 2009 at 3:32 am

Interesting. How would you generalize this though to handle something like checking of PKCS #1 or OAEP padding? This mixes up a bunch of checks in a manner that’s hard to make timing-independent…
1. Nate Lawson says:
  
  June 10, 2009 at 10:55 am
  
  I assume you’re referring to the Bleichenbacher attack, where there was a distinguishable error based on whether part of the padding was correct or not. Yes, the same comparison loop you can see in my proposed fix works for this too. You just generate the whole block and compare against it directly (don’t check components one at a time).
  
  Previously, I identified timing channels as another way to exploit the Bleichenbacher attack in this post.
  1. Dave says:
    
    June 10, 2009 at 10:36 pm
    
    Ah, my question was a bit ambiguous, for PKCS #1 v1.5 it’s easy enough to generate the required padding (and payload data) and do a constant-time memcmp() between it and the actual recovered padding and payload, but for OAEP the complex multi-stage processing means there’s no obvious way to do it in a single, constant-time operation.
    
    (The same applies for PSS, but in that case you’re processing the result of a public-key operation so it’s not an issue).
  2. Nate Lawson says:
    
    June 15, 2009 at 11:27 am
    
    Dave, good point about OAEP. Eliminating timing channels is a very difficult problem, especially when working with a high-level language. You have no guarantee about timing operations, architecture, cache line format, etc.
    
    I have no easy solution to this, which is why it costs so much to do a design with high assurance requirements. Careful review and analysis of each operation mixed with some creativity is what it takes.
Dave says:

June 13, 2009 at 5:45 am

Incidentally, this was the same attack that was used on older versions of the Xbox :-).
1. Nate Lawson says:
  
  June 15, 2009 at 11:20 am
  
  Yes, I found this privately in a product before the Xbox360 announcement. I mention that attack in my slides (link).
Dave says:

June 14, 2009 at 4:17 am

Replying to a comment you made at the Hacker News site (which requires that you sign up just to reply, grrr):

>I have found this exact bug (MAC compare timing attack) in three different
>products over the past few years.

As a corollary to this, have you ever found a product that *doesn’t* have this problem? I don’t think I’ve ever seen something that specifically does a constant-time memcmp() in situations like this. Now in most cases you’d never encounter this problem situation (for example in SSL and SSH there’s such a vast amount of other stuff going on that you’ll never see a difference of a few clocks in checking the HMAC result), so it’s really a case of finding something that’s potentially vulnerable to being (ab)used as an oracle, and then if you do I’d guess the chances of it actually being vulnerable would be close to 100%.
1. Nate Lawson says:
  
  June 15, 2009 at 11:23 am
  
  Yes, this kind of problem is widespread. Side channels are difficult to eliminate and yet a very powerful attack.
  
  I think people underestimate the observability of “a few clocks in checking the HMAC result”. My favorite reference is Remote Timing Attacks are Practical (Boneh and Brumley). Read it and the branch prediction timing attack papers before making judgements about what is exploitable or not.
  1. Dave says:
    
    June 15, 2009 at 6:48 pm
    
    >My favorite reference is Remote Timing Attacks are Practical (Boneh and
    >Brumley). Read it and the branch prediction timing attack papers before
    >making judgements about what is exploitable or not.
    
    I read it when it came out, but it’s a bit of a special case, you’ve got a complex heavyweight operation (CRT RSA decrypt taking advantage of issues with Montgomery reduction and Karatsuba vs. normal multiplies) which is a bit more timeable than a memcmp(). In addition they rely on the fact that the same key is applied over and over, while with the SSH and SSL HMAC you’ve got exactly one chance per key used (since a failed MAC will result in a fatal protocol error). Now I’m definitely not trying to say that timing attacks aren’t an issue, more that you don’t immediately need to throw out your SSH and SSL code because of a new timing attack on HMAC checks.
    
    (I once got asked to review a monitoring system that used a 16-bit truncated MAC on its messages (the protocol PDUs were pretty constrained). When I commented on this the person involved pointed out that even if you could intercept a message and forge the MAC on it, you then had to also intercept and modify the MAC on each succeeding message – it used chaining across messages – at the rate of one every 50ms, for the rest of eternity. This extra detail effectively “fixed” an otherwise insecure mechanism)
    
    As a general comment for other readers, the slides are interesting reading. Some gems:
    
    – Arguments why non-standard usage is OK, based on… a blog post.
    – Warning signs: bad crypto ahead – Spec diagrams from Wikipedia
    
    Oh, and in response to the “Questions” bit at the end:
    
    It’d be good to get these as PDFs or something that allows offline reading, having to read text slides using Flash is just… wrong.
  2. Nate Lawson says:
    
    June 16, 2009 at 10:52 am
    
    Dave, thanks for another interesting comment. I agree exploitability is highly variable. Web cookies are a particularly vulnerable target because they often use the same secret key for the HMAC and don’t log bad validations. So you get millions of tries per minute and no log of the failures.
    
    You can download the slides as a PDF with the “Get File” button at the top of the Slideshare presentation. I also post PDFs on my personal website.
Dave says:

June 17, 2009 at 2:28 am

Some further comments after playing with Visual C++ and gcc’s memcmp() a bit…

gcc (for x86), despite claims on mailing lists to the contrary, appears to implement its memcmp() with a repe cmpsb even with optimisation enabled (seen via gcc -S -O2), so this should be vulnerable provided you can get precise timings.

VC++, going all the way back to 6.0 from 1998, uses an optimised compare in which it compares 32-bit quantities as much as possible using repe cmps and then fixes up the leftovers at the end of the loop manually (with all sorts of special-case handling for alignment and odd-length quantities and the like, the code is quite convoluted and since all I have is a disassembly I’m having to guess at the reasons for some of this, which looks like a lot of hand-tuned P5 pipelining optimisation). For multiple-of-32-bit data values like an HMAC compare you’re getting 32-bit compares in a tight loop, so any non-prehistoric Windows code at least would appear to be immune to the guess-a-byte timing channel because the granularity level is 32 bits and not 8.

Thanks for the PDF links.
1. Nate Lawson says:
  
  June 17, 2009 at 10:16 am
  
  Dave, you’re right that 32-bit compares are harder to attack. On average, you have to do 2^31 guesses before you hit the right one. And then, if your timing channel is at all noisy, you have to repeat them all again. As you note, it’s hard to be sure what you’ve got until you really look at things from a low level.
  
  However, some situations like local attacks or slow embedded processors might still be vulnerable. Side channels really require careful threat modeling and low-level awareness.
  
  P.S. Care to include your email address?
  1. Yuhong Bao says:
    
    January 24, 2010 at 11:40 pm
    
    What about fast string ops?
  2. Nate Lawson says:
    
    January 26, 2010 at 9:41 pm
    
    I assume you’re referring to the x86 string instructions such as “rep cmpsd”? They still terminate early so there is a timing difference. It will be very small, but you have to consider the attacker’s vantage point. It still may be exploitable and is highly dependent on the security model.
Steve Hurcombe says:

January 5, 2010 at 6:28 am

Hi Nate,
This is very interesting. Now there may be a simpler way to avoid these attacks.

If you added in a random delay you would surely add enough artificial noise to make the measurement impractical. This would then mask the inner workings of most algorithms regardless of how many separate paths there were.

Best regards
Steve
1. Nate Lawson says:
  
  January 6, 2010 at 11:34 am
  
  Steve, adding noise does not make the attack impractical. Since the very nature of random noise is that it is equally distributed, it averages out if you take more measurements.
  
  The only solution is to use operations that do not have data-dependent variable timing.
Aahz says:

April 5, 2010 at 2:32 pm

I’m curious why this loop isn’t used, it would be a lot faster because
of avoiding the function calls:

for x, y in zip(correctMac, sig_bytes):
result += (x == y)

Assuming I understand correctly, this should not result in a timing
difference until you have a digest that’s bigger than two gigabytes.
1. Nate Lawson says:
  
  April 8, 2010 at 2:36 pm
  
  Every call to “==” is a branch. So you will reveal the timing difference of whether or not the branch is taken on each character. This is actually easier to attack than “terminate early” behavior since it allows you to test a number of possibilities in parallel.
Travis H. says:

August 11, 2010 at 11:06 am

Hey Nate,

Regarding Steve Hurcombe’s comment, I wonder if you had heard of anyone injecting a delay, not random, but based on a function of the user-selected guess and a secret value. It seems that the function computation could be potentially expensive, but it would be simpler to implement in some cases than constant-time algorithms. Example, much simplified for didactic purposes:

…do comparison…
sleep(HMAC(user_supplied_guess, secret_delay_key))

Described among possible counter-measures here:
http://www.subspacefield.org/security/security_concepts/index.html#tth_sEc31.2.4

Another idea, off the top of my head, would be to hash the two values before comparison, so knowing how many octets were identical doesn’t help you much – you find out about the hash of the correct value, but preimage resistance prevents you from learning about the correct value.

I throw these out there as ideas because DJB makes me skeptical of my ability to write truly constant-time code. :-)
1. Nate Lawson says:
  
  August 11, 2010 at 1:03 pm
  
  Unlike local timing attacks where you have to worry more about cache, branch prediction cache, etc., remote timing attacks are easiest to avoid by using a simple constant time loop. More complicated countermeasures are more likely to fail.