root labs rdist

October 29, 2009

Stop using unsafe keyed hashes, use HMAC

Filed under: Crypto,Network,Security — Nate Lawson @ 7:00 am

The HMAC construction turns a cryptographic hash algorithm into a keyed hash. It is commonly used for integrity protection when the sender and recipient share a secret key. It was developed to address various problems with arbitrary keyed hash constructions. So why are developers still rolling their own?

One of the original papers on keyed hash constructions describes the motivations for developing a standard for HMAC. In 1995, there was no standardization and cryptographers only worked from hunches as to what constituted a secure keyed hash. This paper summarized two known attacks on some common schemes that had evolved in the absence of a standard.

The first construction the paper attacks is H(k || m), aka “secret prefix”. The key and the message to be authenticated were concatenated and hashed. The authenticator was the resulting hash. This was fatally flawed, as I mentioned in my previous talks on web crypto. Standard hash algorithms that use the Merkle-Damgard construction (like SHA-1) are subject to a length-extension attack. An attacker can trivially create an authenticator for m’ where m’ = m1 || pad || m2 if they have seen the authenticator for m1. (The “pad” value makes the input a multiple of the compression function block size and includes the total length hashed). This flaw was most recently found in the Flickr API.
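To make the structure of the forgery concrete, here is a small Python sketch (not from the original post) that computes the glue padding SHA-1 would have appended to key || m1. The key length is only a guess, and actually producing a valid authenticator for m' also requires resuming SHA-1 from the observed hash state, which the standard hashlib module does not expose; this only illustrates the layout of m'.

def sha1_glue_padding(msg_len_bytes, block_size=64):
    # Merkle-Damgard padding for SHA-1: a 0x80 byte, zero bytes up to 8 bytes
    # short of a block boundary, then the message length in bits as a 64-bit
    # big-endian integer.
    zeros = (block_size - 9 - msg_len_bytes) % block_size
    return b"\x80" + b"\x00" * zeros + (msg_len_bytes * 8).to_bytes(8, "big")

# The attacker never sees the key, only guesses its length. The forged message
# m' = m1 || pad || m2 is authenticated by a value the attacker can compute
# from the authenticator for m1 alone.
key_len_guess = 16                       # hypothetical key length
m1 = b"user=alice"                       # message with a known authenticator
m2 = b"&role=admin"                      # attacker-chosen suffix
forged = m1 + sha1_glue_padding(key_len_guess + len(m1)) + m2
print(forged)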

The second construction was H(m || k), aka “secret suffix”. While the length-extension attack no longer applies because k is unknown to the attacker, this still maximally exposes you to weaknesses in the hash algorithm. Preneel et al described two attacks on this approach.

The first attack is that secret suffix is weaker against offline second-preimage attacks. That is, an attacker can take an authenticator for a known plaintext m and calculate their own plaintext m’ that hashes to the same value as the block just before k. If the input to the hash function just before k is identical, then the output is also the same. This means the attacker can just send m’ and the previously seen authenticator for m and the two will match.

For a secure cryptographic hash function, a second-preimage attack takes 2^n tries where n is the hash size in bits[1]. However, the secret suffix approach is marginally weaker against this kind of attack. If an attacker has seen t text and authenticator pairs, then the effort is only 2^n / t since they can attempt a second-preimage match against any of the authenticators they have seen. This is usually not a problem since second-preimage attacks are much harder than finding collisions. As they have aged, all widely-used hash algorithms have fallen to collisions before second-preimage attacks.

The other attack is much more powerful. If the attacker can submit a chosen message to be authenticated, she can attempt an offline collision search. In this case, an attacker searches for two messages, m and m’, that hash to the same value. Once they are found, she requests an authenticator for the innocuous message m. Since a collision means the intermediate hash state before k is mixed in is identical (an “internal collision”), the final authenticator for both will be identical. The attacker then sends the evil message m’ with the authenticator for m, the two match, and the message is accepted as authentic.

This means the secret suffix construction is insecure if collisions can be found in the underlying hash function. Due to the birthday paradox, this takes 2^(n/2) work even for a secure hash function (e.g., 2^64 operations for a 128-bit hash). But it gets worse if the hash itself is weak against collisions.
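As a sketch of why this matters (illustrative code, not from the original article), the secret-suffix construction and the property an internal collision hands the attacker look like this in Python:

import hashlib

def suffix_mac(key: bytes, msg: bytes) -> str:
    # Insecure "secret suffix" construction: H(m || k).
    return hashlib.md5(msg + key).hexdigest()

# Suppose the attacker has found an equal-length collision:
#     md5(m) == md5(m_evil)  with  m != m_evil
# The colliding prefixes leave MD5 in the same internal state, so appending
# the unknown key preserves the equality:
#     suffix_mac(key, m) == suffix_mac(key, m_evil)   for every key
# The authenticator the victim issues for the innocuous m therefore also
# authenticates m_evil. (Published MD5 collision pairs exhibit exactly this
# property; they are omitted here.)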

MD5 has multiple demonstrated collisions. Many systems continue to use HMAC-MD5 because a collision alone is not enough to compromise it. Because of the way the key is applied in HMAC, an attacker would have to generate an internal collision with the secret key, which is much harder than colliding with a chosen message[2]. Although this may provide some immediate comfort, it is still important to move to HMAC-SHA256 soon if you are using HMAC-MD5.

In contrast, MD5 with secret suffix is completely compromised due to collisions, especially with the recent advance in chosen-prefix collisions. Currently, this takes about 30 seconds on a laptop. To repeat, under no circumstances should you use an arbitrary hash construction instead of HMAC, and MD5 with secret suffix is completely broken. If you were putting off moving away from MD5(m || k), now would be an excellent time to move to HMAC-SHA256.
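For completeness, here is a minimal sketch of the right construction using Python's standard hmac module; the key and message are placeholders.

import hashlib
import hmac

key = b"shared-secret-key"                 # placeholder; use a random key
message = b"message to authenticate"

# Sender: HMAC-SHA256 authenticator.
tag = hmac.new(key, message, hashlib.sha256).digest()

# Recipient: recompute and compare in constant time to avoid timing leaks.
expected = hmac.new(key, message, hashlib.sha256).digest()
assert hmac.compare_digest(expected, tag)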

Thanks go to Trevor Perrin and Peter Gutmann for comments on this article.

[1] This is not true for longer messages. Multicollisions can be used against each block of a longer message. See the work by Kelsey and Schneier and Joux for more details.

[2] This is a very broad statement about HMAC. A more detailed analysis of its security will have to wait for another post.

October 23, 2009

Just another day at the office

Filed under: Embedded,Hacking,Hardware,RFID,Security,Software protection — Nate Lawson @ 6:00 am

The following does not take place over a 24 hour period. But any one of these situations is a good example of a typical day at Root Labs.

Attack Windows device driver protection scheme

Certain drivers in Windows must implement software protection (PMP) in order to prevent audio/video ripping attacks. Since drivers run in ring 0, they can do a lot more than just the standard SEH tricks. It’s 1992 all over again as we dig into the IDT and attempt to find how they are trying to protect their decryption process.

Whitebox cryptography is a method of combining a key and cipher implementation to create a keyed cipher. It can only do whatever operation it was initialized with and uses a hard-coded key. The entire operation becomes a series of table lookups. The intermediate values are obscured by randomness merged into the tables. But it’s not impossible to defeat.

To get at the cipher though, we’re going to need a way to bypass some of these anti-debugging traps. Since ring 0 code has direct access to all the CPU’s registers (including the debug registers, MSRs, and page tables), it is free to wreak havoc with our attempts to circumvent it. A common approach is to disable any breakpoints in the registers by overwriting them. A less common method is to use the debug registers as a local procedure call gate so that an attacker that writes to them breaks the main loop.

We write our own hook and patch into the int 1 handler at a point that isn’t integrity-checked and set the GD bit. This causes our hook to be called whenever anyone writes or reads the debug registers. The trap springs! There’s the heart of the protection code right there. Now to figure out what it’s doing.

Design self-reinforcing integrity checks

Preventing attackers from patching your code is very difficult. One approach is to insert small hash functions that verify a region of code or data has not been changed. Of course, there’s the obvious problem of “who watches the watcher?” Since it’s easy for attackers to NOP out the checksum routine or modify their patch to compensate if you use CRC, our mission today is to design and implement a more robust approach.

First, we analyze the general problem of mutually-reinforcing checks. The code and data for the check itself both need to be covered. But if two checks exactly mirrored each other, you’d have a chicken-and-egg problem in building them since a change in one would require changes in the other, and so on. It’s like standing between two mirrors, infinite recursion.

A data structure that describes the best we can do is a directed acyclic graph. There are no cycles so the checks (and checksums) can be generated in reverse order. But the multiple paths through the graph provide overlapping coverage so we can be certain that a single patch cannot bypass the protection. Of course, the roots of each of these paths need to be hidden throughout the code. Putting them all in one big loop run by a single watchman thread would be a mistake. Now we have to come up with a way of automatically generating, randomizing, and inserting them into the code while also hiding the root nodes in various places.
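Here is a rough Python sketch of the generation step (a toy model, not our actual tooling): the regions and coverage DAG are hypothetical, and checksums are computed dependency-first so each check covers finalized bytes, including the checksums stored in the nodes it covers.

import hashlib

# Toy model: each node is a code/data region; an edge means "this node's check
# also covers that node, including that node's stored checksum".
regions = {
    "init":    b"...machine code for startup...",
    "auth":    b"...machine code for the auth check...",
    "license": b"...machine code for license validation...",
}
covers = {"init": ["auth", "license"], "auth": ["license"], "license": []}

checksums = {}

def generate(node):
    # Post-order (reverse) traversal of the DAG: children first, so their
    # checksums are final before anything that covers them is hashed.
    if node in checksums:
        return
    for child in covers[node]:
        generate(child)
    h = hashlib.sha256(regions[node])
    for child in covers[node]:
        h.update(checksums[child])      # overlapping coverage of child checks
    checksums[node] = h.digest()

for node in covers:
    generate(node)

# At runtime, each inserted check recomputes its digest and compares it to the
# stored value; a patch to any region invalidates every check that reaches it.
print({name: digest.hex()[:16] for name, digest in checksums.items()})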

Reverse-engineer RFID transponder

What exactly is in those FasTrak transponders? Do they securely identify themselves via cryptographic challenge/response? Do they preserve your privacy? What about this rumor that they are tracked by antennas all over Bay Area freeways?

Not being an existing user, we bought one at a local supermarket and took it apart. Inside was a microcontroller, some passive electronics, and not a whole lot else. An older one turned out to still have JTAG enabled, so it was simple to dump its firmware. The newer one did have the lock bit set, and Flylogic Engineering was kind enough to decap it and zap it for us. The firmware was identical. Does IDA have an MSP430 plugin? Well, there was one but it was on Geocities and then vaporized. Time to dig around a bit. Find the source code and hack it to work with the newer SDK.

Then it’s off to drop in the code and analyze it for switch statements (protocol handlers usually). Trace everything that starts from the IO pins that map to the receive side of the antenna. Then manually walk up the stack to see where it goes. Hey, there’s an over-the-air update function in here. Just send the magic handshake, then the next packet gets written to flash. There are some checks to try to maintain the writes within the area that stores the ID. But there are multiple paths to this function and one of them is obviously not tested because it pushes the wrong size argument on the stack (a 2-byte instead of 1-byte length argument). Time to notify the agency that 1 million transponders are at risk of a permanent DoS at best and code execution at worst.

Coda

If you’ve made it this far, you might be interested to know we’re hiring. We tackle difficult security problems in environments with clever, persistent adversaries. We’re just as likely to design a system as attack it (and often do both for the same customer). If this sounds like your kind of job, please see here for more details.

October 14, 2009

Update on xum1541 development

Filed under: C64,Hardware,Retrocomputing — Nate Lawson @ 7:30 am

Earlier this year, I announced the xum1541 project. This is a microcontroller board that connects a C64 floppy drive via USB to a PC. It is intended to run at a high enough speed to support copying protected disks. Here’s an update on the project’s recent progress.

When I first started this work, I examined the xu1541 adapter developed by Till Harbaum in 2007. It used an AVR microcontroller and a software USB stack. It was a neat project but had a few limitations. The goal for the xu1541 was to be as cheap as possible and use only through-hole parts. Since it used software USB, the microcontroller spent a lot of time bit-banging the USB port and so could not transfer data as fast as the 1541 could (especially with a parallel cable). Also, it required JTAG support to program the microcontroller the first time, something not all users would have. Still, it was a very neat project and is now available for purchase.

I started over with the AT90USB microcontroller. This device has a hardware USB engine that can run at full speed while the main CPU core is running your firmware. It also comes with a bootloader pre-programmed at the factory so users can install the firmware simply by plugging it into a USB port. There is a very nice open-source interface layer for this USB hardware by Dean Camera called LUFA. There are also many pre-built development kits so adding the IEC connectors is all the soldering that is needed.

The first version of the xum1541 was backwards-compatible with the xu1541. You could use it with the stock OpenCBM software from CVS. However, it had some limitations that made this approach a dead-end. The xu1541 works entirely via USB control transfers, which are not intended for high throughput. The AT90USB does not support double-buffering on the control endpoint. Even with a hardware USB engine, the control transfers hit a limit. There was no way I could get the latency down into the 25 microsecond/byte range needed to match the nibbler protocol. However, I did see a good speed increase over the xu1541 simply due to the hardware USB engine.

Thus, I decided to change to using two bulk endpoints, similar to the mass storage IO model that is implemented in USB flash drives. The AT90USB supports double-buffering for two endpoints of 64 bytes each in this configuration. This means that the hardware USB engine will clock data out the bus while the CPU is filling the other buffer. Then the pages are flipped and the process continues. With this approach, I could decrease the latency for nibbler support and get a performance boost for regular transfers as well.

This took a while to implement since it involved rewriting both the firmware and the host plugin component of OpenCBM to work together. Even though I made no effort to optimize the code, the results are already impressive.

Command                                      Before       Now
d64copy -t p 8 output.d64                    48.016 sec   35.429 sec
cbmctrl download 8 0xc000 0x2000 rom1.bin    25.813 sec   18.988 sec

There are several beta testers who have built their own copy of the hardware and are testing this version. Once we have ironed out any remaining bugs, I will release the pinouts and first version of the code. One notable feature that will be missing for a little while is the nibbler support. It will require more tuning to work reliably. However, it can be supported simply with a software upgrade so there’s no reason to delay the xum1541 release once the basic feature set is stable, which should be soon. It’s already useful as a fast option for transferring unprotected floppy images. I have no plans to produce a commercial version of this product, but I expect someone will take the design and build a cost-reduced model with a nice enclosure.

Thanks for all the words of support and patience as this hobby project nears completion.

October 12, 2009

Next Baysec: October 20 at Kate O’Briens

Filed under: Security — Nate Lawson @ 9:00 am

The next Baysec meeting is October 20 at Kate O’Briens. Come out and meet fellow security people from all over the Bay Area. As always, this is not a sponsored meeting, there is no agenda or speakers, and no RSVP is needed.

See you Tuesday, October 20, 7-11 pm. We’ll be towards the back.

Kate O’Briens
579 Howard St. @ 2nd, San Francisco
(415) 882-7240

October 6, 2009

Why RSA encryption padding is critical

Filed under: Crypto,Hacking,Security — Nate Lawson @ 8:00 am

Public key cryptosystems like RSA are often taught by describing how they work at a high level. Even an algorithm textbook written by one of the inventors of RSA focuses more on exponentiation and how the decryption operation inverts ciphertext than on the underlying security details. However, all public key algorithms have various requirements that are not part of the basic primitive but are vital for security.

RSA requires the plaintext to be armored during encryption/signing and the result to be verified during decryption/verification. Unfortunately, this armoring is commonly called “padding”, which means some implementers think it functions like ordinary protocol padding. The interoperability principle (“be strict in what you send and lenient in what you accept”) is exactly the opposite of how public key crypto must be implemented. Padding cannot be ignored and if even one bit is out of place, the message is invalid. Failure to implement all the steps correctly could allow attackers to forge signatures, decrypt ciphertext, or even recover the private key.

I’ve written before about how failure to verify a signature result properly could result in forged signatures or recovery of the private key. This time, I’d like to explain an attack that shows why encryption armoring works differently than signature armoring.

Most RSA implementations use an optimization called the Chinese Remainder Theorem or CRT. With CRT, the implementation performs the exponentiation m^e mod n as two separate operations, m^e mod p and m^e mod q. The two parts are combined at the end. Since p and q are about half the size of n (which is p * q), this speeds up the exponentiation a lot because the multi-word numbers are smaller.
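As a toy illustration of the split-and-recombine step (textbook-sized numbers, nothing like a real key):

# Toy CRT example with tiny primes; never use numbers this small in practice.
p, q = 61, 53
n = p * q
e, m = 17, 65

full = pow(m, e, n)                      # the full-size exponentiation

# The same exponentiation done modulo p and q separately, then recombined.
mp = pow(m, e, p)
mq = pow(m, e, q)
q_inv = pow(q, -1, p)                    # q^-1 mod p (Python 3.8+)
h = (q_inv * (mp - mq)) % p
combined = (mq + h * q) % n              # Garner's recombination

assert combined == full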

CRT can also be used to attack RSA. Consider an implementation that does not use any armoring. The encryption operation is simply the RSA primitive itself. A sender wants to send the same message m to three separate recipients, whose public moduli are A, B, and C. Thus, the sender calculates:

m^e mod A
m^e mod B
m^e mod C

The attacker can’t calculate the inverse of one of these encryptions directly because the e-th root problem in each ring is difficult. However, because the message is the same for each recipient (but different ciphertexts), she can convert these operations into a group where the inverse operation is easy. To do this, she uses CRT to combine the three ciphertexts to get:

m^e mod A*B*C

Since m is smaller than each of A, B, and C, m^e is smaller than A*B*C when e = 3. This means the attacker just has to calculate the cube root of the result, an operation that is easy in the monoid of integers modulo A*B*C. Because m^3 never wraps around the combined modulus, this is essentially an integer cube root, ordinary arithmetic, and it recovers m.

This shows why PKCS #1 armoring for encryption has always been randomized. It can be fatal to encrypt the same message multiple times, even to the same recipient. For signatures, it is more secure to randomize the padding as in RSASSA-PSS, but it is not yet fatal for legacy systems to continue to use PKCS #1 v1.5 signature padding, which is not randomized.
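Here is a toy demonstration of the broadcast attack described above (illustrative only: the moduli are tiny products of Mersenne primes chosen so they are pairwise coprime, and integer_cuberoot and crt are simple helper functions, not library calls):

def integer_cuberoot(x):
    # Exact integer cube root by binary search (x is a perfect cube here).
    lo, hi = 0, 1 << ((x.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi) // 2
        if mid ** 3 < x:
            lo = mid + 1
        else:
            hi = mid
    return lo

def crt(residues, moduli):
    # Chinese Remainder Theorem for pairwise coprime moduli.
    total = 1
    for n in moduli:
        total *= n
    x = 0
    for r, n in zip(residues, moduli):
        partial = total // n
        x += r * partial * pow(partial, -1, n)
    return x % total

e = 3
m = int.from_bytes(b"dawn", "big")        # a small, unarmored message

# The three recipients' moduli (toy sizes; pairwise coprime by construction).
A = (2**31 - 1) * (2**61 - 1)
B = (2**89 - 1) * (2**19 - 1)
C = (2**107 - 1) * (2**13 - 1)

# What the eavesdropper observes: the same m encrypted to each recipient.
c1, c2, c3 = pow(m, e, A), pow(m, e, B), pow(m, e, C)

# Combine into m^3 mod A*B*C; since m^3 < A*B*C, this is m^3 exactly,
# so an ordinary integer cube root recovers the plaintext.
combined = crt([c1, c2, c3], [A, B, C])
recovered = integer_cuberoot(combined)
assert recovered == m
print(recovered.to_bytes(4, "big"))       # b'dawn'

With any randomized encryption armoring, the three ciphertexts no longer correspond to the same integer m, and the combination step falls apart.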

In public key crypto, padding is not an optional feature. It is a critical part of the cryptosystem's security. The latest version of PKCS #1 (v2.1 as of this writing) should be used for both encryption/signing and decryption/verification. For new implementations, use the approaches that first appeared in v2.0 (RSAES-OAEP for encryption and RSASSA-PSS for signing). Failure to properly manage RSA armoring could allow attackers to forge signatures, decrypt ciphertext, or even recover your private key.
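For a concrete example of the modern approach, here is RSAES-OAEP via the third-party Python cryptography package (a sketch assuming a recent version of that library; adapt to whatever RSA implementation you actually use):

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# RSAES-OAEP: randomized armoring applied during encryption and verified
# during decryption.
oaep = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)

ct1 = public_key.encrypt(b"same message, encrypted twice", oaep)
ct2 = public_key.encrypt(b"same message, encrypted twice", oaep)
assert ct1 != ct2                 # randomized padding: no repeated ciphertexts

assert private_key.decrypt(ct1, oaep) == b"same message, encrypted twice"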
