Advances in RSA fault attacks

A few months ago, there was an article on attacking an RSA private key by choosing specific messages to exercise multiplier bugs that may exist in modern CPUs. Adi Shamir (the “S” in “RSA”) announced the basics of this ongoing research, and it will be interesting to review the paper once it appears. My analysis is that this is a neat extension to an existing attack and another good reason not to implement your own public key crypto, but if you use a mainstream library, you’re already protected.

The attack depends on the target using a naively implemented crypto library on a machine that has a bug in the multiplier section of the CPU. Luckily, all crypto libraries I know of (OpenSSL, crypto++, etc.) guard against this kind of error by checking the signature before outputting it. Also, hardware multipliers probably have fewer bugs than dividers (ahem, FDIV) due to the greater logic complexity of the latter. An integer multiplier is usually implemented as a set of adders with additional control logic to perform an occasional shift, while a divider actually performs successive approximation (aka “guess-and-check”). The design of floating point divider logic is clever, and I recommend that linked paper for an overview.
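To make the contrast concrete, here is a toy Python model of a shift-and-add integer multiplier: one conditional add and one shift per multiplier bit. This is only an illustration of the structure; real hardware uses parallel adder trees.

```python
def shift_and_add_multiply(a, b, width=8):
    """Toy model of a hardware multiplier: for each set bit of the
    multiplier b, add a copy of the multiplicand a shifted left by
    that bit position."""
    result = 0
    for i in range(width):
        if (b >> i) & 1:          # if this multiplier bit is set...
            result += a << i      # ...add the shifted multiplicand
    return result

assert shift_and_add_multiply(13, 11) == 143
```

A divider, in contrast, cannot be unrolled this simply because each trial subtraction depends on the result of the previous one.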

The basic attack was first discovered in 1996 by Boneh et al. and applied to smart cards. I even spoke about this at the RSA 2004 conference, see page 11 of the linked slides. (Shameless plug: come see my talk, “Designing and Attacking DRM” at RSA 2008.) Shamir has provided a clever extension of that attack, applied to a general purpose CPU where the attacker doesn’t have physical access to the device to cause a fault in the computation.

It may be counterintuitive, but a multiplication error in any one of the many multiplies that occur during an RSA private key operation is enough for an attacker who sees the erroneous result to quickly recover the private key. He doesn’t need to know which multiply failed or how far off it is from the correct result. This is an astounding conclusion, so be sure to read the original paper.

The standard RSA private key operation is:

m^d mod n

It is typically implemented in two stages that are later recombined with the Chinese Remainder Theorem (CRT). This is done for performance, since p and q are about half the size of n.

s1 = m^d mod p
s2 = m^d mod q
S = CRT(s1, s2)

The power analysis graph in my talk clearly shows these two phases of exponentiation with a short pause in between. Remember that obtaining either p or q is sufficient to recover the private key since the other can be found by dividing, e.g. n/p = q. Lastly, d can be obtained once you know p and q.
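The two-stage computation can be sketched in a few lines of Python. The parameters below are toy values for illustration only (real keys are 1024 bits or more), and the three-argument pow for modular inverse requires Python 3.8+:

```python
# Toy RSA parameters (far too small for real use)
p, q = 61, 53
n = p * q                            # 3233
e = 17
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent

def rsa_crt_sign(m):
    # In practice d is first reduced: dp = d mod (p-1), dq = d mod (q-1),
    # which is why each half-size exponentiation is so much faster.
    s1 = pow(m, d % (p - 1), p)
    s2 = pow(m, d % (q - 1), q)
    # Garner recombination: S = s2 + q * (qinv * (s1 - s2) mod p)
    qinv = pow(q, -1, p)
    h = (qinv * (s1 - s2)) % p
    return s2 + h * q

m = 65
S = rsa_crt_sign(m)
assert S == pow(m, d, n)    # matches the direct m^d mod n
```

Each exponentiation works on numbers half the size of n with exponents half the length, which is roughly a 4x speedup per stage.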

The way to obtain the key from a faulty signature, assuming a glitch appeared during the exponentiation mod p, is:

q = GCD((m − S'^e) mod n, n)

Remember that the bad signature S' is the CRT combination of the correct value s2 = m^d mod q and some garbage of approximately the same size in place of s1. GCD (which is fast) can be used to calculate q since the difference m − S'^e is divisible by q but almost certainly not by p.
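This recovery can be simulated end-to-end. The sketch below uses toy parameters, flips one bit during the mod-p half to stand in for a multiplier fault, and recovers q with a single GCD:

```python
from math import gcd

# Toy RSA parameters (illustrative only)
p, q = 61, 53
n, e = p * q, 17
d = pow(e, -1, (p - 1) * (q - 1))
qinv = pow(q, -1, p)

def crt_combine(s1, s2):
    # Garner recombination of the two half-signatures
    return s2 + q * ((qinv * (s1 - s2)) % p)

m = 42
s1 = pow(m, d, p) ^ 1        # fault: flip a bit during the mod-p half
s2 = pow(m, d, q)            # mod-q half computed correctly
S_bad = crt_combine(s1, s2)

# S_bad^e == m (mod q) but != m (mod p), so q divides m - S_bad^e
# while p does not, and the GCD with n isolates q:
recovered = gcd((m - pow(S_bad, e, n)) % n, n)
assert recovered == q
```

Note the attacker needs only m, the faulty S', and the public key (n, e); nothing about the nature or location of the fault.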

The advance Shamir describes involves implementing this attack. In the past, it required physical possession of the device so glitches could be induced via voltage or clock transients. Such glitch attacks were once used successfully against pay television smart cards.

Shamir may be suggesting he has found some way to search the vast space (2^128) of possible values for A * B for a given device and find some combination that is calculated incorrectly. If an attacker can use such values in a message that is to be signed or decrypted with the private key, he can recover the private key via the Boneh et al. attack. This would indeed be a great advance since the attack could be mounted remotely.

There are two ways to prevent these attacks. The easiest is just to verify each signature after generating it:

S = m^d mod n
m' = S^e mod n
if m' != m:
    fatal error, discard S

Also, randomized padding schemes like RSASSA-PSS can help.

All crypto libraries I know of implement at least the former approach, and RSASSA-PSS is also available nearly everywhere. So the moral is, use an off-the-shelf library but also make sure it has countermeasures to this kind of attack.

Preventing RSA cache timing attacks

It has been known for a while that side-channel information about crypto operations (e.g., timing) can give enough information to recover secret keys. The original paper by Kocher even indicated RAM cache behavior as a potential source of timing variance.

In 2005, Osvik and Percival separately published papers on cache timing attacks. The latter resulted in a PR incident for Intel since Hyperthreading, which AMD doesn’t have, was singled out as the key vantage point for observing the cache timing of crypto running on the other logical thread of the same core. Since cache-related side channels are exploitable even in single-CPU systems, crypto libraries needed to be updated to fix this issue whether Hyperthreading was present or not.

For RSA, it’s necessary to understand modular exponentiation in order to understand the attack. When performing a private key operation with exponent d, the server computes m^d mod n. To do this efficiently with big numbers, a number of techniques are used.

The basic technique, square-and-multiply, takes advantage of the fact that squaring the accumulator corresponds to shifting the exponent processed so far one bit to the left. This method walks through each bit of the exponent, most significant first. If the bit is 0, the accumulating result is squared. If it is 1, it is squared and then multiplied by m. In nearly all implementations, it takes longer to square and then multiply than it does to merely square, giving a very distinguishable timing difference. An attacker can figure out which bits are 0 or 1 by this difference.
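A minimal (and deliberately non-constant-time) Python version of square-and-multiply makes the leak obvious: the extra multiply happens exactly when the exponent bit is 1.

```python
def square_and_multiply(m, d, n):
    """Left-to-right binary exponentiation: m^d mod n."""
    acc = 1
    for bit in bin(d)[2:]:          # walk exponent bits, MSB first
        acc = (acc * acc) % n       # always square
        if bit == '1':
            acc = (acc * m) % n     # extra multiply only for 1 bits
    return acc

assert square_and_multiply(7, 2753, 3233) == pow(7, 2753, 3233)
```

Every iteration that takes the `if` branch does two big-number operations instead of one, which is the timing difference the attacker observes.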

Since basic square-and-multiply is still too slow, a method called “windowed exponentiation” was invented. Instead of examining each individual bit of the exponent, it walks through groups of bits of size w. There are both fixed and sliding window variants, but I’ll stick to fixed-window here for the sake of simplicity. Square-and-multiply is simply fixed-window exponentiation with w = 1.

For the common size w = 3, these values are precomputed mod n: m^2, m^3, m^4, m^5, m^6, and m^7. Obviously, m^0 is 1 and m^1 is just m, so they are already known.

Just like in square-and-multiply, the accumulator is always first “squared”. However, since we are operating on multiple bits at once, it’s actually raised to the power 2^w, or 2^3 = 8 in our example. Then, the value of the bits in the window is checked. If 000b, nothing more happens. If 001b, the accumulator is multiplied by m. If 010b, it’s multiplied by the precomputed m^2, and so forth. This speeds up the processing by handling batches of exponent bits, at the expense of some small precomputation.
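A sketch of fixed-window exponentiation in Python (illustrative and, again, not constant-time, since table lookups depend on secret exponent bits):

```python
def fixed_window_exp(m, d, n, w=3):
    """Fixed-window (2^w-ary) exponentiation: m^d mod n."""
    # Precompute m^0 .. m^(2^w - 1) mod n
    table = [pow(m, i, n) for i in range(1 << w)]
    # Pad the exponent bits to a multiple of w
    bits = bin(d)[2:]
    bits = '0' * (-len(bits) % w) + bits
    acc = 1
    for i in range(0, len(bits), w):
        acc = pow(acc, 1 << w, n)       # the "square": raise to 2^w
        window = int(bits[i:i + w], 2)  # next w exponent bits
        if window:
            acc = (acc * table[window]) % n
    return acc

assert fixed_window_exp(7, 2753, 3233) == pow(7, 2753, 3233)
```

The line `acc = (acc * table[window]) % n` is the memory access the cache timing attack targets: which table entry is touched reveals the window’s value.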

Those precomputed values were typically stored in an array in memory, each the size of the modulus n (e.g., 256 bytes each for a 2048-bit n). Since the cache line size on modern x86 CPUs is 64 bytes, the multiply operation would take a little longer if the pre-computed value wasn’t already in the cache.

An attacker could repeatedly cause these values to be evicted from the cache and then, based on the computation time, see which ones were used. Or, he could “touch” an array of garbage that was aligned along cache lines, access the SSL server, then time how long it took to read each value from his own array. If a value was quickly accessible, the SSL server’s RSA implementation was likely to have used the associated pre-computed value. For example, if the 2nd element in the array was “hot”, then the server probably accessed its precomputed m^3 and thus the corresponding bits of the private exponent were 011b.

This attack could be repeated many times until the entire key was extracted. Hyperthreading provided a good vantage point since the spy process could be accessing its array over and over while running at the same time as the SSL implementation on the other logical thread of the same core. However, this same attack could work on a single CPU system as well, albeit with more samples required.

To address these attacks for OpenSSL, Matthew Wood (Intel) submitted a patch that first appeared in version 0.9.7h. The key change is to the modular exponentiation function in openssl/crypto/bn/bn_exp.c.

This patch stripes the pre-computed values m^2 … m^(2^w − 1) across cache lines instead of storing them sequentially in the array. That is, the first byte of m^2 would be at address 0, the first byte of m^3 at 1, etc. A memory dump of this region with w = 3 would look like this:

0: m^2[0], m^3[0], … m^7[0]
64: m^2[1], m^3[1], … m^7[1]
128: m^2[2], m^3[2], … m^7[2]

Thus, the access pattern for reading any pre-computed value is exactly the same as for any other: 256 sequential cache line reads. This is a clever way of removing the timing leak. I think it is commendable that Intel spent the time developing this code and contributing it to OpenSSL, especially since the widespread criticism surrounding this problem was directed primarily at them.
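The striping idea can be sketched in Python. This is a simplified model of the layout only, not the actual bn_exp.c code (the value sizes and names here are made up for illustration):

```python
CACHE_LINE = 64

def scatter(table, buf):
    """Stripe each precomputed value across cache lines: byte j of
    value i is stored in line j at offset i."""
    for i, value in enumerate(table):
        for j, byte in enumerate(value):
            buf[j * CACHE_LINE + i] = byte

def gather(buf, i, value_len):
    """Read back value i: one byte from each of value_len lines.
    The set of lines touched is identical for every i, so the cache
    footprint no longer depends on which value was used."""
    return bytes(buf[j * CACHE_LINE + i] for j in range(value_len))

# Six dummy 8-byte "precomputed values" standing in for m^2 .. m^7
table = [bytes([v] * 8) for v in range(6)]
buf = bytearray(8 * CACHE_LINE)
scatter(table, buf)
assert gather(buf, 3, 8) == table[3]
```

Reading any value walks the same sequence of lines; only the intra-line byte offset differs, which is invisible at cache-line granularity.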

There are two problems regarding this approach. The first is the limitation that it can’t fix problems with AES cache timing (see also this detailed analysis of Bernstein’s work and this proposed cache architecture).

The second may only affect certain hardware. The patch is configurable for machines with a 32 or 64-byte cache line size. The default is set to 64. If an older machine or a non-x86 CPU with a smaller cache line size runs the default compile of OpenSSL, it will still have a small leak whenever the table of precomputed values is larger than the actual cache line size. For example, consider a machine with an 8-byte cache line size and a window with 12 possible values (multipliers up to m^11). OpenSSL would store m^2[0] … m^9[0] in the first cache line and m^10[0], m^11[0] in the second, allowing an attacker to determine whether a given set of exponent bits encoded a value less than 10 or not. This might be exploitable in certain situations.

To be truly safe against timing attacks, you need to carefully examine the behavior of the underlying hardware. Be especially careful if adapting an existing crypto implementation to a new platform and get an expert to review it.

Memory remanence attack analysis

You have probably heard by now of the memory remanence attack by Halderman et al. They show that it is easy to recover cryptographic keys from RAM after a reset or moving the DIMM to another system. This is important to any software that stores keys in RAM, and they targeted disk encryption. It’s a nice paper with a very polished but responsible publicity campaign, including a video.

Like most good papers, some parts of the attack were known for a long time and others were creative improvements. Memory remanence has been a known issue ever since the first key had to be zeroed after use. In the PC environment, the trusted computing efforts have been aware of this as well. (See “Hardware Attacks”, chapter 13 — S3 is suspend-to-ram and SCLEAN is a module that must be run during power-on to clear RAM). However, the Halderman team is publishing the first concrete results in this area and it should shake things up.

One outcome I do not want to see from this is a blind movement to closed hardware crypto (e.g., hard disk drives with onboard encryption). Such systems are ok in principle, but in practice often compromise security in more obvious ways than a warm reboot. For example, a hard drive that stores encryption keys in a special “lock sector” that the drive firmware won’t access without a valid password can be easily circumvented by patching the firmware. Such a system would be less secure in a cold power-on scenario than well-implemented software. The solution here is to ask vendors for documentation on their security implementation before making a purchase or only buy hardware that has been reviewed by a third-party with a report that matches your expectations. (Full disclosure: I perform this kind of review at Root Labs.)

Another observation is that this attack underlines the need to apply software protection techniques to other security applications besides DRM. If an attacker can dump your RAM, you need effective ways to hide the key in memory like white-box crypto, obfuscate and tamper-protect software that uses it, and randomize each install to prevent “break once, run everywhere” attacks. Yes, this is the exact same threat model DRM has faced for years but this time you care because you’re the target.

It will be interesting to see how vendors respond to this. Zeroing memory on reboot is an obvious change that addresses some of their methods. A more subtle hack is to set up page mapping and cache configuration such that the key is loaded into a cache line and never evicted (as done for fast IP routing table lookup in this awesome paper). However, none of this stops attacks that move the DIMM to another system. On standard x86 hardware, there’s no place other than RAM to put keys. However, the VIA C7 processors have hardware AES built into the CPU, and it’s possible more vendors will take this approach to providing secure key storage and crypto acceleration.

Whatever the changes, it will probably take a long time before this attack is effectively addressed. Set your encrypted volumes to auto-detach during suspend or a reasonable timeout and keep an eye on your laptop.

BitTorrent peer list encryption

BitTorrent has added a new method for encrypting peer lists. This is an attempt to avoid ISPs (such as Comcast) blocking connections that seem to be P2P traffic, as I previously wrote. This extension’s advantages and limitations are a good example for illustrating the fundamental leverage both sides have in this battle.

The actual extension is pretty straightforward.  Trackers manage files by the SHA-1 hash of their metainfo (aka the infohash).  The extension specifies that a tracker RC4-encrypt the peer list with a key of SHA-1(infohash).  Thus, a peer must know the infohash of the file it is requesting in order to decrypt the peer list.  Obviously, it has the infohash since it had to know it to look up the file in the first place.
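A sketch of the scheme as described, using textbook RC4 and SHA-1 keying. The compact peer encoding and exact wire details are simplified here; this only illustrates the keying idea:

```python
import hashlib

def rc4(key, data):
    # Key-scheduling algorithm (KSA)
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    # Pseudo-random generation (PRGA), keystream XORed with the data
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(byte ^ S[(S[i] + S[j]) % 256])
    return bytes(out)

infohash = hashlib.sha1(b"example .torrent metainfo").digest()
peer_list = b"\x0a\x00\x00\x05\x1a\xe1"   # compact peer: 10.0.0.5:6881
encrypted = rc4(infohash, peer_list)
# anyone who knows the infohash can decrypt (RC4 is symmetric):
assert rc4(infohash, encrypted) == peer_list
```

Note that keying RC4 with the same infohash for every response is exactly the keystream-reuse problem mentioned below: two encrypted peer lists for the same file XOR to the XOR of the plaintexts.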

There are a couple weaknesses in this design.  If an ISP can read the infohash from the peer’s tracker connection, then they can also decrypt the peer list.  This is mitigated by some trackers supporting SSL connections.  Also, the specification allows for reuse of the RC4 keystream, a definite no-no.

Encryption is only a small skirmish in this battle.  Eventually, all traffic will be encrypted and probably encapsulated in SSL-like headers to avoid signature detection.  That will leave ISPs with traffic analysis as their only tool.  Since it appears that at least Comcast is already doing this and not relying on data in the stream, it’s unclear how effective this additional layer of obfuscation will be.

The BitTorrent developers have the advantage in that they control all endpoints (peers and trackers). Their software can take any measures it wants to obfuscate its data signatures (i.e., encryption) and traffic characteristics (i.e., timing or message sizes).  However, the biggest disadvantage is that they don’t control the behavior of non-BitTorrent hosts.

The ISPs have an advantage in that they see all the traffic between the hosts they care about and the Internet.  They can forge packets or even add filters to block particular IPs.  They know the mapping between IP or MAC address and subscriber account.   However, they can’t be sure what software is running on an endpoint and could lose subscribers if their response is too drastic or the wrong application is affected.

Even though traditional wisdom says BitTorrent has the advantage, I think the ISPs have the technical edge in this battle.  Since the BitTorrent developers can’t control hosts that don’t run their software, they will be forced to conform to the traffic profile of web browsing.  Because this is asymmetrical (uplink is only used for small ACKs), the performance advantages of BitTorrent would be eliminated.

However, it’s likely political/judicial moves will have a longer term impact on which side wins.  I think this would be a good thing.  Since there are only two broadband circuit providers in the US (telco or cable), competition won’t eliminate a progression of more onerous requirements for endpoints.  Without intervention, I could see a slow creep towards requiring virus scanning of endpoints or prohibitions on specific applications.  I already have to lie and claim to be running Windows to get tech support to run a line test when my DSL goes down (“oh yeah, it rebooted fine but the line is still down”).

Assuming a bad law doesn’t get added, I think regulation of ISPs would be a good thing to prevent further interference with our traffic.  I refuse to use the Internet as a TV.

TLS/SSL predictable IV flaw

Another attack that was addressed in TLS 1.1 results from a predictable initialization vector for encryption. This allows an attacker to verify guesses about previous plaintext and is an interesting example of how slight implementation variations in well-known cryptographic constructions can introduce exploitable flaws. Phil Rogaway and Bodo Moeller have some more detailed notes on such problems.

Remember that CBC encryption is a way of chaining multiple blocks together into a longer message. It first requires an IV to kick things off, then the encryption of subsequent blocks is made unique via each previous block’s ciphertext. Compare this to ECB, where each ciphertext block is independent and thus reveals information about its contents if plaintext is repeated.

An IV must have the following properties:

  • Unique: must not be repeated for any message encrypted with a given key
  • Unpredictable: an attacker who observes any number of messages and their IVs should have no information to predict the next one with probability of success greater than 50% per bit (i.e., indistinguishable from random)

Uniqueness is necessary because CBC devolves to ECB without it. It’s critically necessary for other modes of operation like OFB or stream ciphers where a repeated seed produces a repeated keystream, which is totally insecure.

Unpredictability is more subtle. The attack on TLS’s CBC IV is based on it being predictable, even though it was unique. More on that later.

Note that an IV does not have to be truly random, only computationally indistinguishable from random. Since you want some assurance that each IV is unique, it’s theoretically better to load an initial seed into a secure PRNG once and then generate only 2^(n/2) output bits before re-seeding it. If the PRNG is based on a secure permutation (say, a block cipher), you are guaranteed the sequence will not repeat if you limit the number of output bits before re-seeding. However, in practice, it’s also effective to continue feeding the PRNG entropy as it becomes available since a short cycle is extremely unlikely.

TLS’s record layer provides message boundaries for the application. Each message is typically encrypted in CBC mode if a block cipher like AES is being used. Each time a new message is sent, the last block of the previous message’s ciphertext is used as the IV. This means that an attacker observing the encrypted traffic knows what the next IV will be, even though it is unique/non-repeating.

The attack is simple. After observing a message, the attacker knows the IV for the next message will be ciphertext block C_(n-1). Using this knowledge, the attacker can try to verify a guess at any previous plaintext block P_x. He does this by constructing a plaintext block with the following format:

P_guess = C_(n-1) XOR P_x XOR C_(x-1)

Let’s break this down. The first item, C_(n-1), is the known IV for the next message. P_x is the guess for some previous block of plaintext; any will do. Finally, C_(x-1) is the ciphertext block immediately preceding the guessed plaintext block. We know based on the operation of CBC that P_x was XORed with this value before being encrypted.

When P_guess is encrypted, it is first XORed with the IV, so the C_(n-1) terms cancel out (A XOR A = 0), leaving:

C_guess = ENCRYPT(P_x XOR C_(x-1))

As you can see, if the guess for P_x was correct, the ciphertext C_guess will be identical to the original C_x. If the guess is wrong, the ciphertext will be different. This attack may be unrealistic in scenarios where the attacker cannot submit plaintext to the same TLS session as the target. However, it is feasible in shared connections such as a TLS/SSL VPN.
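Here is a small Python simulation of the attack. The block cipher is a toy hash-based Feistel stand-in (a real TLS stack would use AES or 3DES), but the IV cancellation works the same way:

```python
import hashlib

def toy_encrypt_block(key, block):
    """Stand-in 16-byte block cipher: a 4-round keyed-hash Feistel.
    Any deterministic permutation works for this demonstration."""
    l, r = block[:8], block[8:]
    for rnd in range(4):
        f = hashlib.sha256(key + bytes([rnd]) + r).digest()[:8]
        l, r = r, bytes(a ^ b for a, b in zip(l, f))
    return l + r

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_encrypt(key, iv, blocks):
    out, prev = [], iv
    for p in blocks:
        prev = toy_encrypt_block(key, xor(p, prev))
        out.append(prev)
    return out

key, iv = b"k" * 16, b"\x00" * 16
secret = b"password=hunter2"            # the target block P_x
ct = cbc_encrypt(key, iv, [secret])
C_x, C_x_prev = ct[0], iv
next_iv = ct[-1]                        # TLS 1.0: last block = next IV

# Attacker submits P_guess = next_IV XOR guess XOR C_(x-1)
guess = b"password=hunter2"
p_guess = xor(xor(next_iv, guess), C_x_prev)
ct2 = cbc_encrypt(key, next_iv, [p_guess])
assert ct2[0] == C_x                    # match confirms the guess
```

TLS 1.1 fixed this by giving each record its own fresh, unpredictable IV, which removes the attacker’s knowledge of C_(n-1) before submitting a guess.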

The important lesson here is that both uniqueness and unpredictability are vital when using IVs.

TLS/SSL MAC security flaw

Following my recent posts on TLS/SSL security, I gave a talk (slides are here) on a security flaw in the record layer that was fixed in TLS 1.1. The last page of my slides gives some interesting links if you’re interested in understanding SSL security better.

This flaw (found by Bodo Moeller) is in the use of padding as part of the integrity protection of the actual data being exchanged. Padding is needed because block ciphers encrypt data in chunks and something has to go in the remainder of the last block. This attack is particularly interesting because it allows an attacker to iteratively decrypt part of the message using side-channel leakage.

Side channel attacks are still often neglected, despite proof that they can be performed over the Internet. System designers always seem to have the same initial response when learning about timing attacks: make the computation time constant by adding a calibrated delay. When problems in this strategy are pointed out, their next move is to add a random delay after the computation (not blinding).

This usually repeats with each approach getting shot down until they eventually admit this is a hard problem and that appropriate measures need to be integrated with the actual process (not bolted on) and carefully evaluated for unforeseen problems. For example, one fix for this attack is to always compute the MAC even if the padding is incorrect. However, the logic path of noting that the padding is incorrect but continuing anyway still requires a conditional branch, which creates a small but observable timing difference that can be used in a successful attack.
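As a sketch of the integrated approach in Python: compute the MAC unconditionally and fold the padding result into a single constant-time check. This illustrates the principle only; the function name and record layout are hypothetical, and a production TLS stack needs further care (e.g., MAC timing that depends on message length).

```python
import hashlib
import hmac

def check_record(key, data, received_mac, padding_ok):
    """Hypothetical record check. The MAC is always computed, even
    when the padding was malformed, and the padding result is folded
    into one constant-time comparison instead of an early branch."""
    expected = hmac.new(key, data, hashlib.sha256).digest()
    mac_ok = hmac.compare_digest(expected, received_mac)
    return mac_ok & padding_ok   # bitwise, not short-circuiting 'and'

key = b"server mac key.."
record = b"application data"
tag = hmac.new(key, record, hashlib.sha256).digest()
assert check_record(key, record, tag, True)
assert not check_record(key, record, tag, False)
```

The point is that there is no `if not padding_ok: return early` path: bad padding and a bad MAC take the same code path and roughly the same time.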

Preventing side channel attacks is a difficult problem. If confronted with them, take the time to get your countermeasures carefully evaluated.