Optimized memcmp leaks useful timing differences

One of the main questions raised after our Blackhat talk on exploiting remote timing attacks was how memcmp() optimizations might affect our results.

Our talk focused on a particular timing vulnerability. If an HMAC comparison exits early when a mismatch is found, it reveals information to an attacker about the correct HMAC for the forged message. This early-exit behavior is desirable in general but can leak timing information about secret data, allowing an attacker to iteratively guess the secret.

In C, memcmp() is almost always used for comparisons of binary data. Its API specifies that it compares two fixed-length buffers and returns the difference between them or zero if they are identical. In most implementations, memcmp() exits as soon as a difference is found in the two buffers.

Even if memcmp() exits early, the size of comparisons might affect how exploitable this timing delta is. For example, an optimized memcmp() may work in units of 64-bit quad words. Such an implementation might not be exploitable even if the timing delta of a mismatch was 1 second since you would have to brute-force 64-bits before seeing the timing delta increase due to a match. However, things aren’t so simple and using an optimized memcmp() does not always save you.

To prepare for our talk, we disassembled a few memcmp() implementations and found a surprising number were byte-based. This gives an attacker only 256 possibilities per guess, which is quite tractable. We decided to follow up with more detailed results for various operating systems.

We found two things:

  1. Even an optimized memcmp() usually leaks bytewise timing differences. In fact, it is often required by its API to do so.
  2. OS and compiler settings often select a bytewise compare even when a more-optimized version is available.

Anyone who is assuming particular compiler behavior is advised to review their disassembly. It’s almost always best to use a constant-time comparison instead of memcmp() when working with secret data.

Windows XP SP3 (32-bit)

We disassembled all .exes and DLLs in the system32 directory. All of them import memcmp() from either ntdll.dll or msvcrt.dll. All of them were compiled with some version of MSVC.

The msvcrt.dll memcmp() is 32-bit optimized. It compares 32-bit words if both pointers are 32-bit aligned (rep cmpsd). The remainder is compared with up to a 3-byte series of individual byte compares (cmp/jnz). If either of the input pointers is not 32-bit aligned, it goes into an unrolled loop that compares single bytes and exits early if any mismatch. ntdll.dll has the same memcmp() implementation.

Even with this optimization, the Windows memcmp() leaks bytewise timing information. When a difference is found in the 32-bit compares, it goes through each byte of the 32-bit word to find which byte differs. This means there is a 2-instruction timing delta per byte the attacker got correct, even though the original compare was in 32-bit chunks.

Windows memcmp() doesn’t actually have to do this set of bytewise comparisons. MSDN states the following behavior:

Return value Relationship of first count bytes of buf1 and buf2
< 0 buf1 less than buf2
0 buf1 identical to buf2
> 0 buf1 greater than buf2

The cygwin man page for memcmp() states:

The function returns an integer greater than, equal to or less than zero according to whether the object pointed to by S1 is greater than, equal to or less than the object pointed to by S2.

Since neither implementation requires identifying the bytewise difference, the implementation could just subtract the two 32-bit words. Only the sign of the return value matters, not the magnitude. This is analogous to treating the two buffers as multi-precision integers, just like in an RSA implementation.

We also compiled our own code using the cygwin and mingw gcc compilers. With no optimization flags, the cygwin gcc links to cygwin1.dll, which has a 32-bit optimized memcmp() similar to msvcrt.dll. It returns the bytewise difference and checks the input pointers for 32-bit alignment. The mingw compiler links to the msvcrt.dll version of memcmp() when no optimization flags are specified.

However, if an optimization level higher than default is used (-O1, -O2, -O3), both cygwin and mingw gcc use a 1-byte compare loop (rep cmpsb). This is surprising. If the -fno-builtin flag is specified, both cygwin and mingw gcc revert to the previous behavior, calling the external implementation in their respective DLLs.

We disassembled the default cygwin system utilities and libraries and found they all used a 1-byte compare loop. This is likely because they were compiled with optimization enabled but without the -fno-builtin flag.

Summary

The default Windows memcmp() (msvcrt.dll or ntdll.dll) leaks bytewise timing information even for chunks of 32-bits due to its search for the differing byte once a mismatch is found. It also leaks bytewise information if either input pointer is not 32-bit aligned. MSVC always uses an external DLL memcmp().

Both cygwin and mingw gcc use a 1-byte compare loop if optimization is enabled (-O1 or higher). If no optimization is enabled, gcc links to a DLL (cygwin1.dll or msvcrt.dll) that has a 32-bit optimized memcmp(). This memcmp() leaks the bytewise difference just like msvcrt.dll.

Windows 7 (64-bit)

This memcmp() implementation in msvcrt.dll compares buffers longer than 8 bytes in 64-bit chunks. It does not return the bytewise difference if it finds a mismatch. It returns the difference of the trailing 32-bits. The only bytewise timing leak occurs if the input length is not a multiple of 8. In this case, the remainder of the bytes are compared one at a time.

Linux (32 and 64-bit, Ubuntu 10.04 with glibc 2.12.1)

The x86 32-bit memcmp() is a long unrolled set of 32-bit compares. When it finds some difference, it jumps to find_diff which, like Windows, identifies exactly which byte differs. This leaks a bytewise timing delta.

Like Windows, the Linux memcmp() does not require finding the differing byte:

The memcmp() function returns an integer less than, equal to, or greater than zero if the first n bytes of s1 is found, respectively, to be less than, to match, or be greater than the first n bytes of s2.

The 64-bit memcmp() uses a tricky combination of a bit-set instruction (bsf) and shifts to identify the first byte difference without branching. So if the input is a multiple of 4, 8, or 16, it should not reveal a timing delta for any smaller units. If the input length is not a multiple of these values, smaller differences are visible.

Linux has optimized versions for other architectures as well, but we did not examine them.

Summary

Only the Linux 64-bit memcmp() does not reveal bytewise timing deltas. The 32-bit version behaves the same as Windows.

FreeBSD 8.0 (32 and 64-bit, gcc 4.2.1)

The system libc.so and libc.a have an optimized memcmp() for the most popular CPU architectures (i386, amd64, and arm). For i386, it consists of a 32-bit compare for the length in chunks of 4 (rep cmpsd), and a single-byte rep cmpsb for the remaining up to 3 bytes. There is no check for alignment first, so behavior with unaligned pointers is identical. The amd64 version has the same behavior as i386 but the libc memcmp() works in units of 8-byte quadwords (rep cmpsq).

If a difference is found in the word comparison, all implementations of memcmp() back up and find the first different byte using rep cmpsb. This required by the man page:

The memcmp() function returns zero if the two strings are identical, otherwise returns the difference between the first two differing bytes (treated as unsigned char values, so that ‘\200’ is greater than ‘\0’, for example). Zero-length strings are always identical.

For all other platforms (sparc, mips, ia64), the generic implementation of memcmp() is used. It is a loop of 1-byte compares, implemented in C.

GCC does not use the libc version of memcmp() by default. It uses a built-in version that consists of a rep cmpsb for the entire input. This occurred no matter what the optimization level (-O1, -O2, -O3). If we added the -fno-builtin flag, gcc generated calls to the libc memcmp(). Nothing changed based on optimization level.

We disassembled the default system utilities on FreeBSD 8.0 and found none of them were compiled with -fno-builtin, and thus all used the 1-byte memcmp.

Summary

By default, gcc on FreeBSD uses a completely timeable memcmp() that works in 1-byte units. This is regardless of optimization level. If the -fno-builtin flag is used, the libc memcmp() is linked in instead. The libc memcmp() is not more resistant to timing attacks because it finds the first differing byte. The sparc, mips, and ia64 platforms always use a 1-byte loop.

9 thoughts on “Optimized memcmp leaks useful timing differences

  1. When you say that: “Since neither implementation requires identifying the bytewise difference, the implementation could just subtract the two 32-bit words.”, you are forgetting that we are typically running on a little-endian machine. A naive subtraction will then give the wrong result sometimes: For example, consider these buffers:

    buf1 = { 0x45, 0x10, 0x00, 0x0 }
    buf2 = { 0x45, 0x00, 0x55, 0x55 }

    memcmp() should give a result > 0, because at the first differing byte 0x10 > 0x00. However, if you use an x86 or x86-64 32 bit subtraction on these values, you will compute 0x00001045 – 0x55550045, which is obviously < 0 and hence wrong.

    1. caf, thanks for your comment. You can use the bswap instruction before subtracting once you find a difference in a 32-bit value. It’s quite fast on registers.

      One of the implementations (Linux 64 bit) used a clever approach with bsf/sar/sal, along with subtractions. No conditional branches needed. Still, you usually want to implement your own constant time loop and not make assumptions about the library. Maybe some day libc will include a constant_memcmp()?

      1. True enough.

        Even if just a library like OpenSSL’s libcrypto included constant_memcmp() – having had a quick poke around the source I’m not even convinced it uses a constant-time comparison internally where it might be necessary.

  2. Here is a practical example.

    The xbox 360 was hacked (in 2007) by using a leaking memcmp function which used byte-wise comparision:

    http://beta.ivancover.com/wiki/index.php/Xbox_360_Timing_Attack

    Original thread (you need an account):

    http://www.xboxhacker.org/index.php?topic=8221.0

    This “broke” the HMAC and allowed downgrading to an exploitable kernel again. In just 1 hour ;)

    Video of Felix explaining (from around 48:40 and further) the timing attack in some detail: http://www.youtube.com/watch?v=XtDTNnEvlf8

  3. I have question now… if openssl and most of crypto libs are not protected for side channels, which is the right choice for a critical embedded system? For example an ARM linux. It’s better using gpgme than openssl to avoid silly mistakes but you will be always owneable with this attacks.

    Is there any free lib designed to resist side channel and fault injections? I guess C/C++ would be the choice for an embedded device. You have mentioned Mozilla’s NSS but probably it’s not well suited for embedded since it’s based on the Netscape Runtime.

    If not, which ones are the comercial ones? Is there any comparison between them?

    1. There is no free library designed to prevent side channels. Dan Bernstein’s NaCl is the closest I’ve seen.

      I mentioned NSS only to say they fixed this one memcmp() timing attack while OpenSSL chose not to. I did not mean that NSS is safe against all side channel attacks.

      The problem you will find for critical systems is that the best side channel countermeasures will be very architecture-dependent. This often means assembly or custom hardware, as C is too high-level to express features like cache alignment, redundant parallel computations, etc.

      Also, most of the effort in building a secure library is spent in validation. It has to be validated carefully on the target architecture. Finally, there are many patents for side channel countermeasures, including from my former employer CRI.

      All these factors mean any side channel-resistant library will certainly be commercial and require custom analysis on your particular device to get high assurance it works properly.

Comments are closed.