Hardware design and glitch attacks

The first step in understanding glitch attacks is to look at how hardware actually works. Each chip is made up of transistors that are combined to produce gates and then high-level features like RAM, logic, lookup tables, state machines, etc. In turn, those features are combined to produce a CPU, video decoder, coprocessor, etc. We’re most interested in secure CPUs and the computations they perform.

Each of the feature blocks on a chip is coordinated by a global clock signal. Each time it “ticks”, signals propagate from one stage to the next and among the various blocks. The speed of propagation is based on the chip’s architecture and physical silicon process, which together determine how quickly the main clock can run. This is why every CPU has a maximum (but not minimum) megahertz rating.

In hardware design, the logic blocks are made up of multiple stages, very much like CPU pipelining. Each clock cycle, data is read from an internal register, passes through some combinational logic, and is stored in a register to wait for the next clock cycle. The register can be the same one (i.e. for repeated processing as in multiple rounds of a block cipher) or a different one.

chippipe1.png

The maximum clock rate for the chip is constrained by the slowest block. If it takes one block 10 ns to propagate a signal from register to register, your maximum clock rate is 100 MHz, even if most of the other blocks are faster. If this is the case, the designer can either slice that function up into smaller blocks (but increase the total latency) or try to redesign it to take less time.
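As a back-of-the-envelope illustration (the stage names and delays below are made up, not from any real chip), the arithmetic looks like this:

    # Hypothetical register-to-register delays for a few blocks, in ns.
    stage_delays_ns = {"decode": 4.2, "alu": 6.5, "multiplier": 10.0}

    # The slowest path sets the maximum clock rate: 1 / 10 ns = 100 MHz.
    f_max_mhz = 1000.0 / max(stage_delays_ns.values())
    print(f_max_mhz)  # 100.0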

A CPU is made up of multiple blocks. There is logic like the ALU or branch prediction, RAM for named registers and cache, and state machines coordinating the whole show. If you examine an instruction set datasheet, you’ll find that each instruction takes one or more clock cycles, and sometimes a variable number. For example, branch instructions often take more cycles if the branch is taken, or if there is a branch predictor and it guessed wrong. As signals propagate between each block, the CPU is in an intermediate state. At a higher level, it is also in an intermediate state during execution of multi-cycle instructions.

As you can see from all this, a CPU is very sensitive to all the signals that pass through it and their timing. These signals can be influenced by the voltage at external pins, especially the clock signal since it is distributed to every block on a chip. When signals that have out-of-spec timing or voltage are applied to the pins, computation can be corrupted in surprisingly useful ways.

Glitch attacks revealed

(First in a series of articles on attacking hardware and software by inducing faults)

One of the common assumptions software authors make is that the underlying hardware works reliably. Very few operating systems add their own parity bits or CRC to memory accesses. Even fewer applications check the results of a computation. Yet when it comes to cryptography and software protection, the attacker controls the platform in some manner and thus faulty operation has to be considered.

Fault induction is often used to test hardware during production or simulation runs. It was probably first observed when mildly radioactive material that is a natural part of chip packaging led to random memory bit flips.

When informed that an attacker in possession of a device can induce faults, most engineers respond that nothing useful could come of that. This is similar to the response when buffer overflows were first discovered in software (“so what, the software crashes?”). I often find this “engineering mentality” gets in the way of improving security, even to the point of insisting that exploitability be proven before a problem is fixed.

A good overview paper is “The Sorcerer’s Apprentice Guide to Fault Attacks” by Bar-El et al. In their 1997 paper “Low Cost Attacks on Tamper Resistant Devices,” Anderson and Kuhn conclude:

“We have improved on Differential Fault Analysis. Rather than needing about 200 faulty ciphertexts to recover a DES key, we need between one and ten. We can factor RSA moduli with a single faulty ciphertext. We can also reverse engineer completely unknown algorithms; this appears to be faster than Biham and Shamir’s approach in the case of DES, and is particularly easy with algorithms that have a compact software implementation such as RC5.”

This is quite a powerful class of attacks, and is sometimes applicable to software-only systems as well. For instance, a signal handler can often be triggered remotely, inducing faults in execution if the programmer wasn’t careful.

Of course, glitch attacks are most applicable to smart cards, HSMs, and other tamper-resistant hardware. Given the movement to DRM and trusted computing, we can expect to see this category of attack and its defenses become more sophisticated.  Why rob banks? Because that’s where the money is.

RSA public keys are not private (implementation)

Previously, I described a proposed system that will both sign and encrypt code updates using an RSA private key. The goal is to derive the corresponding public key even though it is kept securely within the device.

If you delve into the details of RSA, the public key is the exponent and modulus pair (e, n) and the private key is (d, n). The modulus is shared between the two keys, and e is almost always 3, 17, or 65537, so it can simply be guessed. That means only d and n are actually secret in the vendor’s system. If we could figure out d, we’d have broken RSA outright, so that’s likely a dead end. But if we can figure out e and n, we can decrypt the updates.

Let’s see how to efficiently figure out n using guesses of e and a pair of encrypted updates. The signature verification process, RSA-Verify, can be represented by the standard equation shown below in various forms.

rsa-sig2.png

The first step is the standard RSA-Pub-Op, and the result is m, the padded hash of the message. The remaining steps in RSA-Verify (i.e., validating the padding and hash) are not relevant to this attack. In the second form, m is moved to the left side. Finally, 0 mod n is rewritten in an alternate form as k * n, where k is simply the number of times the value wrapped around the modulus during exponentiation, like an odometer rolling over. The particular k doesn’t matter to us, only the modulus n.
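The original figure isn’t reproduced here, but based on the description above, the three forms are:

    m = sig^e mod n
    sig^e - m = 0 mod n
    sig^e - m = k * n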

Now, assume we’ve seen two updates and their signatures, sig1 and sig2. We can convert them to the above form by raising each signature to a guess for e and subtracting the associated message m. Note that m is still encrypted and we don’t know its contents.

rsa-sig3.png

This gives us two values with n as the common factor. The GCD algorithm gives us the common factors of two values very quickly. The reason there is no “=” sign above is that we actually get n times some small prime factors, not just n. Those can easily be removed by trial division by primes up to 1000 since the real factors of n (p and q) are at least 100-digit numbers. Once we have n, we can take each encrypted message and decrypt it with m^e mod n as usual, giving us the plaintext code updates.

I implemented this attack in python for small values of e. Try it out and let me know what you think.
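That script isn’t reproduced here, but the core of the attack fits in a few lines. Here is a rough sketch of the same idea, using a toy key with made-up 7-digit primes and e = 17 (my own example, not the original code; requires Python 3.8+ for the modular inverse via pow()):

    from math import gcd

    def recover_modulus(sig1, m1, sig2, m2, e):
        # sig^e - m is a multiple of n for each pair, so their GCD is n
        # times (possibly) a few small stray factors.
        candidate = gcd(pow(sig1, e) - m1, pow(sig2, e) - m2)
        # Strip small cofactors; the real factors of n are huge primes.
        for p in range(2, 1000):
            while candidate % p == 0:
                candidate //= p
        return candidate

    # Toy key; a real modulus would be 1024 bits or more.
    p, q = 1000003, 1000033
    n, e = p * q, 17
    d = pow(e, -1, (p - 1) * (q - 1))

    # Two observed messages and their textbook-RSA signatures.
    m1, m2 = 123456789, 987654321
    sig1, sig2 = pow(m1, d, n), pow(m2, d, n)

    # Recovers n (with overwhelming probability the stray factors divide out).
    print(recover_modulus(sig1, m1, sig2, m2, e) == n)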

Doing this attack with e = 65537 is difficult because it would have to work with values millions of bits long. However, it is feasible and was implemented by Jaffe and Kocher in 1997 using an interesting FFT-multiply method. Perhaps they will publish it at some point. Thanks go to them for explaining this attack to me a few years ago.

Finally, I’d like to reiterate what I’ve said before. Public key cryptosystems, and RSA in particular, are extremely fragile. Do not use them differently than they were designed. Better yet, seek expert review or assistance for any design involving crypto.

RSA public keys are not private

RSA is an extremely malleable primitive and can hardly be called “encryption” in the traditional sense of block ciphers. (If you disagree, try to imagine how AES could be vulnerable to full plaintext exposure by anyone who can submit a message and get back only “ok” or “decryption incorrect” based on the first couple of bits.) When I am reviewing a system and see the signing operation described as “RSA-encrypt the signature with the private key,” I know I’ll be finding some kind of flaw in that area of the implementation.

The entire PKCS#1 standard exists to add structure to the payload of RSA to eliminate as much malleability as possible. To distinguish the RSA primitive operation from higher-level constructs that use the RSA primitive, I use the following terminology.

RSA-Pub-Op / RSA-Priv-Op: perform the raw RSA operation with the appropriate key (exponent and modulus): data^exp mod n
RSA-Encrypt: add random padding, then transform with RSA-Pub-Op
RSA-Decrypt: transform with RSA-Priv-Op and check the padding in the result
RSA-Sign: add padding, then transform with RSA-Priv-Op
RSA-Verify: transform with RSA-Pub-Op and check the padding in the result; hash the data, compare with the hash in the payload, and return True if they match, else False

With that out of the way, a common need in embedded systems is secure code updates. A good approach is to RSA-Sign the code updates at the manufacturer. The fielded device then loads the update into RAM, performs RSA-Verify to be sure it is valid, checks version and other info, then applies the update to flash.
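As an illustration only (toy textbook-RSA key, a stand-in padding scheme, and hypothetical function names, not the vendor’s actual code), the flow looks roughly like this:

    import hashlib

    # Toy key; a real deployment would use a 2048-bit modulus and PKCS#1 padding.
    p, q = 1000003, 1000033
    n, e = p * q, 17
    d = pow(e, -1, (p - 1) * (q - 1))

    def rsa_pub_op(x):
        return pow(x, e, n)

    def rsa_priv_op(x):
        return pow(x, d, n)

    def pad(digest):
        # Stand-in for real signature padding, truncated to fit the tiny modulus.
        return int.from_bytes(digest[:4], "big")

    def rsa_sign(update):            # done at the manufacturer
        return rsa_priv_op(pad(hashlib.sha256(update).digest()))

    def rsa_verify(update, sig):     # done on the fielded device
        return rsa_pub_op(sig) == pad(hashlib.sha256(update).digest())

    update = b"version=2;code=..."
    sig = rsa_sign(update)
    if rsa_verify(update, sig):
        pass  # check version and other info, then apply the update to flash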

One vendor was signing their updates, but decided they also wanted privacy for the update contents. Normally, this would involve encrypting the update with AES-CBC before signing. However, they didn’t want to add another key to the secure processor since the design was nearly complete.

The vendor decided that since they already had an RSA public key in the device for signature verification, they would also encrypt the update to that key: the manufacturer encrypts with the corresponding private key, and the device decrypts with the public key. That “public” key was never revealed outside their secure processor, so it was considered secret.

Let’s call the public/private key pair P1 and S1, respectively. Here’s their new update process:

rsa-sig1.png

On receiving the update, the secure processor would do the inverse, verifying the signature and then decrypting the code. (Astute reader: yes, there would be a chaining problem if encrypting more than one RSA block’s worth of data but let’s assume the code update is small.)

What’s wrong? Nothing, if you’re thinking about RSA in terms of conventional crypto. The vendor is encrypting something to a key that is not provided to an attacker. Shouldn’t this be as secure as a secret AES key? Next post, I’ll reveal how the malleability of the RSA primitive allows n to be easily calculated.

Functional languages and reverse engineering

In a previous discussion, Tim Newsham said

“I would like to see someone reverse engineer some small Haskell programs. The compilation techniques are totally foreign to anyone familiar with standard imperative languages and there are no tools designed specifically for the task.”

He then provided a link to some examples to analyze. Another commenter brought up Standard ML, another functional language. (I assume he means the NJ Standard ML implementation, but it could also be OCaml or Moscow ML as Dan Moniz pointed out.) Tim responded:

“I don’t know entirely. I’m not familiar with ML compiler implementation. They could use similar compilation techniques, but might not. ML is not ‘pure’ (and additionally is strict) so the compilation techniques might be different.”

He also provided links to a couple of papers on implementing compilers for functional languages. One commenter took a brief look at Tim’s examples:

“I took a look. The compiled Haskell is definitely different from the compiled ML I looked at. Roughly the same order of magnitude as to how terrible it was, though. Mine actually used Peano arithmetic on lists for simple arithmetic operations. What was funny was the authors of that program bragging about how algorithmically fast their technology was. I couldn’t help but think, after examining some entire functions and finding that all of the code was dead except for a tiny fraction of the instructions, how much a decent back-end (something with constant propagation and dead-code elimination) could have improved the runtime performance.”

Since one common obfuscation technique is to implement a VM and then write your protection code in that environment, how obfuscated is compiled object code from standard functional programming languages?

WOOT = Usenix + Blackhat

The call for papers is now up for a new Usenix workshop, WOOT (Workshop On Offensive Technologies, but don’t think the name came before the acronym). The workshop will be co-hosted with Usenix Security and will focus on new practical attacks.

I was recently saying that vulnerability research could use more Peer Review instead of the other kind of PR (i.e., vague news stories, user-scaring Month of X Bugs). So help the community out here by submitting quality papers, especially if you’ve never submitted one before. I think the goal of bridging the gap between slideware (e.g., Blackhat) and 15th generation theoretical overlay network designs (e.g. Usenix Security) is a great one.

Also, I’m on the program committee but don’t hold that against them.