Advances in RSA fault attacks

A few months ago, there was an article on attacking an RSA private key by choosing specific messages to exercise multiplier bugs that may exist in modern CPUs. Adi Shamir (the “S” in “RSA”) announced the basics of this ongoing research, and it will be interesting to review the paper once it appears. My analysis is that this is a neat extension to an existing attack and another good reason not to implement your own public key crypto, but if you use a mainstream library, you’re already protected.

The attack depends on the target using a naively-implemented crypto library on a machine that has a bug in the multiplier section of the CPU. Luckily, all crypto libraries I know of (OpenSSL, crypto++, etc.) guard against this kind of error by checking the signature before outputting it. Also, hardware multipliers probably have fewer bugs than dividers (ahem, FDIV) due to the greater logic complexity of the latter. An integer multiplier is usually implemented as a set of adders with additional control logic to perform an occasional shift, while a divider actually performs successive approximation (aka “guess-and-check”). The design of floating point divider logic is clever, and I recommend that linked paper for an overview.

The basic attack was first discovered in 1996 by Boneh et al. and applied to smart cards. I even spoke about this at the RSA 2004 conference; see page 11 of the linked slides. (Shameless plug: come see my talk, “Designing and Attacking DRM” at RSA 2008.) Shamir has provided a clever extension of that attack, applied to a general-purpose CPU where the attacker doesn’t have physical access to the device to cause a fault in the computation.

It may be counterintuitive, but a multiplication error in any one of the many multiplies that occur during an RSA private key operation is enough for an attacker who sees the erroneous result to quickly recover the private key. He doesn’t need to know which multiply failed or how far off it is from the correct result. This is an astounding conclusion, so be sure to read the original paper.

The standard RSA private key operation is:

m^d mod n

It is typically implemented in two stages that are later recombined using the Chinese Remainder Theorem (CRT).
This is done for performance since p and q are about half the size of n.

s1 = m^d mod p
s2 = m^d mod q
S = CRT(s1, s2)

The power analysis graph in my talk clearly shows these two phases of exponentiation with a short pause in between. Remember that obtaining either p or q is sufficient to recover the private key since the other can be found by dividing, e.g. n/p = q. Lastly, d can be obtained once you know p and q.
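
To make the two exponentiation phases concrete, here is a minimal Python sketch of CRT-based signing (raw, unpadded RSA with a made-up function name, not any particular library's API):

# RSA-CRT signing sketch. Real implementations precompute d mod (p-1),
# d mod (q-1), and the CRT coefficient q^-1 mod p.
def rsa_crt_sign(m, d, p, q):
    n = p * q
    s1 = pow(m, d % (p - 1), p)     # first exponentiation phase (mod p)
    s2 = pow(m, d % (q - 1), q)     # second exponentiation phase (mod q)
    q_inv = pow(q, -1, p)           # modular inverse (Python 3.8+)
    h = (q_inv * (s1 - s2)) % p     # Garner's recombination
    return (s2 + h * q) % n         # S = CRT(s1, s2)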

The way to obtain the key from a faulty signature, assuming a glitch appeared during the exponentiation mod p, is:

q = GCD((m - S’^e) mod n, n)

Remember that the bad signature S’ is a combination of the correct value s2 = m^d mod q and some garbage G of approximately the same size. GCD (which is fast) can be used to calculate q since the difference m - S’^e is divisible by q but almost certainly not by p.
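
A short Python sketch shows how little work this is for the attacker (the function name is illustrative only):

from math import gcd

# Given the message m, a faulty signature S_bad (correct mod q, wrong mod p),
# and the public key (n, e), recover the secret factor q with a single gcd.
def recover_factor(m, S_bad, n, e):
    return gcd((m - pow(S_bad, e, n)) % n, n)

One bad signature is enough; the attacker never needs to know which multiply failed.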

The advance Shamir describes is in how this attack is mounted. In the past, it required physical possession of the device so glitches could be induced via voltage or clock transients. Such glitch attacks were once used successfully against pay television smart cards.

Shamir may be suggesting he has found some way to search the vast space (2^128) of possible values for A * B for a given device and find some combination that is calculated incorrectly. If an attacker can get such values into a message that is to be signed or decrypted with the private key, he can recover the private key via the Boneh et al. attack. This would indeed be a great advance since it could be carried out remotely.

There are two ways to prevent these attacks. The easiest is just to verify each signature after generating it:

S = m^d mod n
m’ = S^e mod n
if m’ != m:
    Fatal, discard S
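
In Python, a hedged sketch of this check, wrapping the toy rsa_crt_sign() from the earlier sketch:

# Never release a signature that fails to verify under the public exponent.
def sign_with_check(m, d, e, p, q):
    n = p * q
    S = rsa_crt_sign(m, d, p, q)
    if pow(S, e, n) != m:
        raise RuntimeError("fault detected, discarding signature")
    return S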

Also, randomized padding schemes like RSASSA-PSS can help.

All crypto libraries I know of implement at least the former approach, and RSASSA-PSS is also available nearly everywhere. So the moral is, use an off-the-shelf library but also make sure it has countermeasures to this kind of attack.

Trapping access to debug registers

If you’re designing or attacking a software protection scheme, the debug registers are a great resource. Their use is mostly described in the Intel SDM Volume 3B, chapter 18. They can only be accessed by ring 0 software, but their breakpoints can be triggered by execution of unprivileged code.

The debug registers provide hardware support for setting up to four different breakpoints. They have been around since the 386, as this fascinating history describes. Each breakpoint can be set to occur on an execute, write, read/write, or IO read/write (i.e., in/out instructions). Each monitored address can cover a range of 1, 2, 4, or 8 bytes.

DR0-3 store the addresses to be monitored. DR6 provides status bits that describe which event occurred. DR7 configures the type of event to monitor for each address. DR4-5 are aliases for DR6-7 if the CR4.DE bit is clear. Otherwise, accessing these registers yields an undefined opcode exception (#UD). This behavior might be useful for obfuscation.

When a condition is met for one of the four breakpoints, INT1 is triggered. This is the same exception as for a single-step trap (EFLAGS.TF = 1). INT3 is for software breakpoints and is useful when setting more than four breakpoints. However, software breakpoints require modifying the code to insert an int3 instruction and can’t monitor reads/writes to memory.

One very useful feature of the debug registers is DR7.GD (bit 13). Setting this bit causes reads or writes to any of the debug registers to generate an INT1. This was originally intended to support ICE (In-Circuit Emulation) since some x86 processors implemented test mode by executing normal instructions. This mode was the same as SMM (System Management Mode), the feature that makes your laptop power management work. SMM has been around since the 386SL and is the original x86 hypervisor.

To analyze a protection scheme that accesses the debug registers, hook INT1 and set DR7.GD. When your handler is called, check DR6.BD (also bit 13). If it is set, the instruction at the faulting EIP was about to read or write to a debug register. You’re probably somewhere near the protection code.  Since this is a faulting exception, the MOV DRx instruction has not executed yet and can be skipped by updating the EIP on the stack before executing IRET.
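
To make the bit positions concrete, here is a rough Python sketch (the constants follow the SDM, the helper names are mine, and of course real code for this lives in a ring 0 driver, not Python):

# Sketch of the DR6/DR7 bit layout only; actually reading or writing DRx
# and handling INT1 requires ring 0 code.
DR7_GD = 1 << 13   # general detect: any MOV to/from DR0-7 raises a debug exception
DR6_BD = 1 << 13   # set in DR6 when the exception was caused by general detect

def dr7_breakpoint0(write=True, length=4, local=True):
    # Build a DR7 value that arms breakpoint 0 on the address held in DR0.
    rw = 0b01 if write else 0b00                  # 00=execute, 01=write, 11=read/write
    ln = {1: 0b00, 2: 0b01, 4: 0b11, 8: 0b10}[length]
    val = 1 if local else 2                       # L0 (bit 0) or G0 (bit 1)
    val |= rw << 16                               # R/W0 field
    val |= ln << 18                               # LEN0 field
    return val | DR7_GD                           # also trap debug register access

def caused_by_dr_access(dr6):
    # In an INT1 handler: did the faulting instruction touch a debug register?
    return bool(dr6 & DR6_BD)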

If you’re designing software protection, there are some interesting ways to use this feature to prevent attackers from having easy access to the debug registers. I’ll have to leave that for another day.

Mesh design pattern: error correction

Our previous mesh design pattern, hash-and-decrypt, requires the attacker either to run the system to completion or reverse-engineer enough of it to limit the search space. If any bit of the input to the hash function is incorrect, the decryption key is completely wrong. This could be used, for example, in a game to unlock a subsequent level after the user has passed a number of objectives on the previous level. It could also be used with software protection to be sure a set of integrity checks or anti-debugger countermeasures have been running continuously.
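
For reference, here is a minimal Python sketch of hash-and-decrypt; the toy SHA-256 keystream and the names are mine, chosen only to keep it self-contained, and a real design would use a proper cipher:

import hashlib

# The decryption key is the hash of every checkpoint value gathered at runtime.
# If any input bit is wrong, the derived key (and thus the plaintext) is garbage.
def derive_key(checkpoints):
    h = hashlib.sha256()
    for value in checkpoints:
        h.update(value)
    return h.digest()

def decrypt(key, ciphertext):
    # Toy stream cipher: SHA-256 in counter mode, for illustration only.
    stream = b""
    counter = 0
    while len(stream) < len(ciphertext):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(ciphertext, stream))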

Another pattern that is somewhat rare is error correction. An error correcting code uses compressed redundancy to allow data that has been modified to be repaired. It is commonly used in network protocols or hard disks to handle unintentional errors but can also be useful for software protection. In this case, an attacker who patches the software or modifies its data would find that the changes have no effect as they are silently repaired. This can be combined with other techniques (e.g., anti-debugging) to require an attacker to locate all points in the mesh and disable them simultaneously. Turbo codes, LDPC, and Reed-Solomon are some commonly used algorithms.

Hashing and error correction are very similar. A cryptographic hash is analogous to a compressed form of the original data, since by design it is extremely difficult to generate a collision (two sets of data that have the same fingerprint). Instead of comparing every byte of two data sets, many systems just compare the hash. You can build a crude form of error correction by storing multiple copies of the original data and throwing out any that have an incorrect hash due to a patching attempt or other error. However, this results in bloat, and it’s relatively easy for the reverse engineer to find all copies of the identical data in memory, even if the hash is somewhat hidden.
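
The crude approach might look like the following sketch (hypothetical names, and the identical copies are exactly what makes it easy to find):

import hashlib

# Keep several identical copies of the critical data plus a known-good hash,
# and return the first copy whose hash still matches.
GOOD_HASH = hashlib.sha256(b"critical table or code").digest()
copies = [b"critical table or code"] * 3

def get_data():
    for c in copies:
        if hashlib.sha256(c).digest() == GOOD_HASH:
            return c
    raise RuntimeError("all copies corrupted")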

Turbo codes are an efficient form of error correction. To put it simply, three different chunks of data are stored: the message itself (m bits) and two parity blocks (n/2 bits each). The total storage required is m + n bits, coding data at a rate of m / (m + n). You can think of this as a sort of crossword puzzle where one parity block stores the clues for “across” and the other stores the clues for “down”. Two decoders process the parity blocks and vote on their confidence in the output bits. If the vote is inconclusive, the process iterates.

[Figure: turbo code diagram (turbocode.png)]

To use error correction for software protection, take a block of data or instructions that are important to security. Generate an encoded block for it using a turbo code. Now, insert a decoder in the code which calls into or processes this block of data. If an attacker patches the encoded data (say, to insert a breakpoint), the decoder will generate a repaired version of that data before using it.
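
Here is a minimal sketch of that decode-before-use idea. To keep it short it uses Hamming(7,4), a much simpler code than the turbo codes described above, and the helper names are made up:

# Hamming(7,4): each 4-bit nibble is stored as a 7-bit codeword that can
# repair any single flipped bit. Codeword positions 1..7 hold p1 p2 d1 p3 d2 d3 d4.
def encode_nibble(d):
    d1, d2, d3, d4 = (d >> 0) & 1, (d >> 1) & 1, (d >> 2) & 1, (d >> 3) & 1
    p1, p2, p3 = d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d2 ^ d3 ^ d4
    bits = [p1, p2, d1, p3, d2, d3, d4]
    return sum(b << i for i, b in enumerate(bits))

def decode_nibble(cw):
    bits = [(cw >> i) & 1 for i in range(7)]
    p1, p2, d1, p3, d2, d3, d4 = bits
    s1 = p1 ^ d1 ^ d2 ^ d4               # parity check over positions 1,3,5,7
    s2 = p2 ^ d1 ^ d3 ^ d4               # parity check over positions 2,3,6,7
    s3 = p3 ^ d2 ^ d3 ^ d4               # parity check over positions 4,5,6,7
    err = s1 | (s2 << 1) | (s3 << 2)     # position of a single-bit error, or 0
    if err:
        bits[err - 1] ^= 1               # silently repair the patched bit
    _, _, d1, _, d2, d3, d4 = bits
    return d1 | (d2 << 1) | (d3 << 2) | (d4 << 3)

# Decode-before-use: the protected routine only ever sees the repaired output,
# so a small patch to the encoded block simply has no effect.
def use_protected(encoded_nibbles):
    return [decode_nibble(cw) for cw in encoded_nibbles]

A real implementation would use a stronger code and decode into a separate buffer so the attacker never sees the repair happen.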

This has a number of advantages. If the decoding is not done in-place, the attacker will not see the data being repaired, just that the patch had no effect. The parity blocks look nothing like the original data itself so it looks like there is only one copy of the data in memory. The decoder can be obfuscated in various ways and inlined to prevent it from being a single point of failure. The calling code can hash the state of the decoder as part of hash-and-decrypt so that errors are detected as well, allowing the software protection to later degrade the experience rather than immediately failing. This hides the location of the protection check (temporal distance).

Like all mesh techniques, error correction is best used in ways that are mutually reinforcing. The linker can be adapted to automatically encode data and insert the decoding logic throughout the program, based on control flow analysis. Continually-running integrity checking routines can be encoded with this approach. The more intertwined the software protection, the harder it is to bypass.

Functional languages and reverse engineering

In a previous discussion, Tim Newsham said:

“I would like to see someone reverse engineer some small Haskell programs. The compilation techniques are totally foreign to anyone familiar with standard imperative languages and there are no tools designed specifically for the task.”

He then provided a link to some examples to analyze. Another commenter brought up Standard ML, another functional language. (I assume he means the NJ Standard ML implementation, but it could also be OCaml or Moscow ML as Dan Moniz pointed out.) Tim responded:

“I don’t know entirely. I’m not familiar with ML compiler implementation. They could use similar compilation techniques, but might not. ML is not ‘pure’ (and additionally is strict) so the compilation techniques might be different.”

He also provided links to a couple of papers on implementing compilers for functional languages. One commenter took a brief look at Tim’s examples:

“I took a look. The compiled Haskell is definitely different from the compiled ML I looked at. Roughly the same order of magnitude as to how terrible it was, though. Mine actually used Peano arithmetic on lists for simple arithmetic operations. What was funny was the authors of that program bragging about how algorithmically fast their technology was. I couldn’t help but think, after examining some entire functions and finding that all of the code was dead except for a tiny fraction of the instructions, how much a decent back-end (something with constant propagation and dead-code elimination) could have improved the runtime performance.”

Since one common obfuscation technique is to implement a VM and then write your protection code in that environment, how obfuscated is compiled object code from standard functional programming languages?