History of TEMPEST and side channel attacks

A very interesting paper, “TEMPEST: A Signal Problem“, was recently declassified by the NSA. It gives a short history of TEMPEST and other side channel attacks, that is, ways to discover a secret cryptographic key by monitoring indirect channels like RF emissions, sound, or power. Besides the revelation that TEMPEST was discovered by Bell Labs in 1943, there are a number of lessons to learn from this paper.

It’s interesting that RF emissions, first noticed as interference on a nearby oscilloscope, were the first side channel found. After that, the paper covers acoustic, power line, seismic, and flooding attacks. The last two are uncertain since the body of the text is still classified. In modern terminology, these attacks were SPA (simple side-channel analysis) since the plaintext was read directly from distinct “fingerprints” that appeared on the scope. DPA (differential side-channel analysis) involves more complex acquisition and statistical correlation of multiple samples; nothing in this paper indicates that technique was known at the time.

The history of attempted countermeasures is a good case study. First, Bell Labs tried to produce a heavily-shielded cipher machine. The Army would not purchase it because it wasn’t a simple retrofit to the existing model. The final recommendation was that field personnel control a zone 200 feet in diameter around all cipher machines. No study showed this was the ideal range, only that most field offices could enforce it without difficulty and it probably helped. This is very similar to countermeasures today, where cost and deployment effort are often more important than achieving the best possible security.

The categories of countermeasure they identified were:

  • Shielding/filtering: reducing the signal strength of emissions
  • Masking: adding random noise to the environment

If you think of a side channel attack as a communications problem, these two approaches map directly onto the classic signal-to-noise ratio: shielding lowers the signal, masking raises the noise. The paper states that they had trouble effectively shielding devices and that masking wasn’t effective either. This fits with the situation today, where addressing side-channel leakage in a modern embedded system is extremely difficult.

As hardware shrinks, power consumption naturally decreases, which helps; but the sensitivity of monitoring equipment improves at a similar pace. Also, processors and FPGAs get faster every day, allowing more complicated signal processing. As the paper concluded in 1972, side-channel attacks still tend to lead countermeasure sophistication. If you’re concerned about such attacks, be sure to get your design reviewed.

Wii hacking and the Freeloader

tmbinc wrote a great post describing the history of hacking the Wii and why certain holes were not publicized. This comes on the heels of Datel releasing a loader that can be used to play copied games by exploiting an RSA signature verification bug. I last heard of Datel when they made the Action Replay debug cartridge for the C64, and it looks like they’ve stayed around, building the same kind of thing for newer platforms.

First, the hole itself is amazingly bad for a widely-deployed commercial product. I wrote a long series of articles with Thomas Ptacek a year ago on how RSA signature verification requires careful padding checks. You might want to re-read that to understand the background. However, the Wii bug is much worse. The list of flaws includes:

  1. Using strncmp() instead of memcmp() to compare the SHA hash
  2. The padding is not checked at all

The first bug is fatal by itself. strncmp() stops comparing as soon as it reaches a terminating nul byte; as long as the hashes matched up to that point, the result is success. If the first byte of both hashes is nul, only that single byte is compared and the check passes.
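To make the failure mode concrete, here is a small Python model of strncmp()’s early exit (the real check is in C, of course; this sketch just mirrors its semantics):

def strncmp_model(s1: bytes, s2: bytes, n: int) -> int:
    # Models C's strncmp(): stop at the first differing byte, or at a
    # terminating nul byte while the strings still match.
    for x, y in zip(s1[:n], s2[:n]):
        if x != y:
            return x - y
        if x == 0:       # nul terminator reached: comparison ends "equal"
            return 0
    return 0

If both hashes begin with 0x00, the loop returns 0 after a single byte, so at most one byte of the 20-byte SHA-1 digest is ever compared.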

It’s easy to create a chunk of data whose SHA-1 hash begins with a 0x00 byte. Here’s some sample code:

a = "rdist security blog"
import binascii, hashlib
for i in range(256):
    h = hashlib.sha1(chr(i)+a).digest()
    if ord(h[0]) == 0:
        print 'Found match with pad byte', i
        print 'SHA1:', "".join([binascii.b2a_hex(x) for x in h])
        break
else:
    print 'No pre-image found, try increasing the range.'

I got the following for my choice of string:

Found match with pad byte 80
SHA1: 00d50719c58e45c485e7d497e4021b48d814df33

The second bug is more subtle to exploit, but it would remain open even if the strncmp() were fixed. It is well-known that forgeries can be generated if only 1/3 of the modulus length is validated, and that existential forgeries can be found if only 2/3 is validated. It would take another series of articles to explain all this, so see the citations of the original article for more detail.

tmbinc questions Datel’s motive in releasing an exploit for this bug. He and his team had kept it secret in order to keep using it to explore the system for deeper flaws; since it was easily patchable in software, disclosure meant it would be quickly closed. Indeed, Nintendo fixed it two weeks after the Datel product became available.

I am still amazed how bad this hole was. Since such an important component failed open, it’s clear higher assurance development techniques are needed for software protection and crypto. I continue to do research in this area and hope to be able to publish more about it this year.

Apple iPhone bootloader attack

News and details of the first iPhone bootloader hack appeared today. My analysis of the publicly-available details released by the iPhone Dev Team is that this has nothing to do with a possible new version of the iPhone, contrary to Slashdot. It involves exploiting low-level software, not the new SDK. It is a good example of how systems built from a series of links in a chain are brittle. Small modifications (in this case, a patch of a few bytes) can compromise such a design, and it’s hard to verify that no link has such a flaw.

A brief disclaimer: I don’t own an iPhone nor have I seen all the details of the attack. So my summary may be incomplete, although the basic principles should be applicable. My analysis is also completely based on published details, so I apologize for any inaccuracies.

For those who are new to the iPhone architecture, here’s a brief recap of what hackers have found and published. The iPhone has two CPUs of interest, the main ARM11 applications processor and an Infineon GSM processor. Most hacks up until now have involved compromising applications running on the main CPU to load code (aka “jailbreak”). Then, using that vantage point, the attacker will run a flash utility (say, “bbupdater”) to patch the GSM CPU to ignore the type of SIM installed and unlock the phone to run on other networks.

As holes have been found in the usermode application software, Apple has released firmware updates that patch them. This latest attack is a pretty big advance in that now a software attack can fully compromise the bootloader, which provides lower-level control and may be harder to patch.

The iPhone boot sequence, according to public docs, is as follows. The ARM CPU begins executing a secure bootloader (probably in ROM) on power-up. It then starts a low-level bootloader (“LLB”), which then runs the main bootloader, “iBoot”. The iBoot loader starts the OSX kernel, which then launches the familiar Unix usermode environment. This appears to be a traditional chain-of-trust model, where each element verifies the next element is trusted and then launches it.
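To make the structure concrete, here is a minimal sketch of that model in Python. The stage names come from the public docs; verify() and launch() are invented stand-ins, not Apple’s actual interfaces:

BOOT_CHAIN = ["secure bootloader", "LLB", "iBoot", "kernel"]

def boot(chain, verify, launch):
    for stage, successor in zip(chain, chain[1:]):
        # Each link checks the next image's signature, then jumps to it.
        # On real hardware, launch() transfers control and never returns,
        # and nothing ever re-checks a link once it has run.
        if not verify(successor):
            raise RuntimeError("untrusted image: " + successor)
        launch(successor)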

Once one link of this chain is compromised, it can fully control all the links that follow it. Additionally, since developers may assume all links of the chain are trusted, they may not protect upstream elements from potentially malicious downstream ones. For example, the secure bootloader might not protect against malicious input from iBoot if part or all of it remains active after iBoot is launched.

This new attack takes advantage of two properties of the bootloader system. The first is that NOR flash is trusted implicitly. The other is that there appears to be an unauthenticated system for patching the secure bootloader.

There are two kinds of flash in the iPhone: NOR and NAND. Each has different properties useful to embedded designers. NOR flash is byte-addressable and thus can be directly executed. However, it is more costly and so usually much smaller than NAND. NAND flash must be accessed via a complicated series of steps and only in page-size chunks. However, it is much cheaper to manufacture in bulk, and so is used as the 4 or 8 GB main storage in the iPhone. The NOR flash is apparently used as a kind of cache for applications.

The first problem is that software in the NOR flash is apparently unsigned. In fact, the associated signature is discarded as verified software is written to flash. So if an attacker can get access to the flash chip pins, he can just store unsigned applications there directly. However, this requires opening up the iPhone and so a software-only attack is more desirable. If there is some way to get an unsigned application copied to NOR flash, then it is indistinguishable from a properly verified app and will be run by trusted software.

The second problem is that there is a way to patch parts of the secure bootloader before iBoot uses them. It seems that the secure bootloader acts as a library for iBoot, providing an API for verifying signatures on applications. During initialization, iBoot copies the secure bootloader to RAM and then performs a series of fix-ups, redirecting function pointers back into iBoot itself. This is a standard procedure for embedded systems that work with different versions of software. Much as the Windows loader rebases a PE file and fixes up its imports, iBoot has a table of offsets and byte patches it applies to the secure bootloader before calling it. This allows a single version of the secure bootloader in ROM to be used with ever-changing iBoot revisions, since iBoot has the intelligence to “fix up” the library before using it.
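Based only on the published description, the fix-up mechanism might look something like the following sketch. Every offset and byte value here is invented for illustration:

FIXUPS = [
    # (offset into the RAM copy of the secure bootloader, replacement bytes)
    (0x0040, b"\x12\x34\x56\x78"),   # e.g. a function pointer back into iBoot
    (0x0104, b"\xde\xad\xbe\xef"),
]

def apply_fixups(image: bytearray, fixups=FIXUPS) -> bytearray:
    for offset, patch in fixups:
        # Patches are applied blindly; nothing authenticates the table itself.
        image[offset:offset + len(patch)] = patch
    return image

As described next, the attack amounts to appending one more entry to this table.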

The hackers have taken advantage of this table to add their own patches. In this case, the patch is to disable the “is RSA signature correct?” portion of the code in the bootloader library after it’s been copied to RAM. This means that the function will now always return OK, no matter what the signature actually is.

There are a number of ways this attack could have been prevented. The first is to use a mesh-based design instead of a chain with a long series of links. This would be a major paradigm shift, but additional upstream and downstream integrity checks could have found that the secure bootloader had been modified and was thus untrustworthy. This would also catch attackers if they used other means to modify the bootloader execution, say by glitching the CPU as it executed.

A simpler patch would be to include self-tests to be sure everything is working. For example, checking a random, known-bad signature at various times during execution would reveal that the signature verification routine had been modified. This would create multiple points that would need to be found and patched out by an attacker, reducing the likelihood that a single, well-located glitch is sufficient to bypass signature checking. This is another concrete example of applying mesh principles to security design.
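Here is a minimal sketch of that self-test idea, assuming some verify_signature() routine exists; the names and call sites are illustrative, not Apple’s actual API:

import os

def signature_self_test(verify_signature, pubkey):
    # A random "signature" is invalid with overwhelming probability.
    bogus = os.urandom(256)
    if verify_signature(b"self-test message", bogus, pubkey):
        # The verifier accepted garbage: it was patched out or glitched.
        raise RuntimeError("signature check compromised; halting boot")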

Hackers are claiming there’s little or nothing Apple can do to counter this attack. It will be interesting to watch this as it develops and see if Apple comes up with a clever response.

Finally, if you find this kind of thing fascinating, be sure to come to my talk “Designing and Attacking DRM” at RSA 2008. I’ll be there all week, so be sure to say “hi” if you will be too.

Advances in RSA fault attacks

A few months ago, there was an article on attacking an RSA private key by choosing specific messages to exercise multiplier bugs that may exist in modern CPUs. Adi Shamir (the “S” in “RSA”) announced the basics of this ongoing research, and it will be interesting to review the paper once it appears. My analysis is that this is a neat extension to an existing attack and another good reason not to implement your own public key crypto, but if you use a mainstream library, you’re already protected.

The attack depends on the target using a naively-implemented crypto library on a machine that has a bug in the multiplier section of the CPU. Luckily, all crypto libraries I know of (OpenSSL, crypto++, etc.) guard against this kind of error by checking the signature before outputting it. Also, hardware multipliers probably have fewer bugs than dividers (ahem, FDIV) due to the greater logic complexity of the latter. An integer multiplier is usually implemented as a set of adders with additional control logic to perform an occasional shift, while a divider actually performs successive approximation (aka “guess-and-check”). The design of floating point divider logic is clever, and I recommend that linked paper for an overview.
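For the curious, here is a toy model of that multiplier structure in Python; real hardware uses parallel adder arrays rather than a loop, but the add-and-shift principle is the same:

def shift_add_mul(a: int, b: int) -> int:
    # Works for non-negative integers; mirrors the adders-plus-shift logic.
    acc = 0
    while b:
        if b & 1:
            acc += a    # add the multiplicand when the current bit is set
        a <<= 1         # shift for the next bit position
        b >>= 1
    return acc

assert shift_add_mul(1234, 5678) == 1234 * 5678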

The basic attack was first discovered in 1996 by Boneh et al and applied to smart cards; I even spoke about it at the RSA 2004 conference (see page 11 of the linked slides). (Shameless plug: come see my talk, “Designing and Attacking DRM”, at RSA 2008.) Shamir has provided a clever extension of that attack, applying it to a general-purpose CPU where the attacker doesn’t have physical access to the device to cause a fault in the computation.

It may be counterintuitive, but a multiplication error in any one of the many multiplies that occur during an RSA private key operation is enough for an attacker who sees the erroneous result to quickly recover the private key. He doesn’t need to know which multiply failed or how far off it is from the correct result. This is an astounding conclusion, so be sure to read the original paper.

The standard RSA private key operation is:

m^d mod n

It is typically implemented in two stages that are later recombined with the Chinese Remainder Theorem (CRT). This is done for performance, since p and q are about half the size of n.

s1 = m^d mod p
s2 = m^d mod q
S = CRT(s1, s2)

The power analysis graph in my talk clearly shows these two phases of exponentiation with a short pause in between. Remember that obtaining either p or q is sufficient to recover the private key since the other can be found by dividing, e.g. n/p = q. Lastly, d can be obtained once you know p and q.

The way to obtain the key from a faulty signature S’, assuming the glitch appeared during the exponentiation mod p, is:

q = GCD((m – S’^e) mod n, n)

Remember that the bad signature S’ is a combination of the correct value s2 = m^d mod q and some garbage of approximately the same size mod p. Writing m’ = S’^e mod n, the difference m – m’ is divisible by q but almost certainly not by p, so a GCD (which is fast to compute) yields q.
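Here is a runnable demonstration of that recovery with toy parameters (real moduli are 1024+ bits, and pow(e, -1, phi) requires Python 3.8 or later):

from math import gcd

p, q = 61, 53                  # toy primes; never use sizes like this
n, e = p * q, 17
d = pow(e, -1, (p - 1) * (q - 1))
m = 42

s1 = pow(m, d, p)              # exponentiation mod p
s2 = pow(m, d, q)              # exponentiation mod q
s1 ^= 1                        # simulate a one-bit glitch in the mod-p half

# CRT recombination produces the faulty signature S'
S_bad = (s2 + q * (((s1 - s2) * pow(q, -1, p)) % p)) % n

# m - S'^e is divisible by q but not by p, so the GCD recovers q
print(gcd((m - pow(S_bad, e, n)) % n, n) == q)   # True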

The advance Shamir describes involves implementing this attack. In the past, it required physical possession of the device so glitches could be induced via voltage or clock transients. Such glitch attacks were once used successfully against pay television smart cards.

Shamir may be suggesting he has found some way to search the vast space (2^128) of possible A * B input pairs for a given device and find some combination that is calculated incorrectly. If an attacker can embed such values in a message that is to be signed or decrypted with the private key, he can recover the private key via the Boneh et al attack. This would indeed be a great advance, since it could be carried out remotely.

There are two solutions to preventing these attacks. The easiest is just to verify each signature after generating it:

S = m^d mod n
m’ = S^e mod n
if m’ != m:
    Fatal, discard S

Also, randomized padding schemes like RSASSA-PSS can help.

All crypto libraries I know of implement at least the former approach, and RSASSA-PSS is also available nearly everywhere. So the moral is, use an off-the-shelf library but also make sure it has countermeasures to this kind of attack.

Taxonomy of glitch and side channel attacks

There are a number of things to try when developing such attacks, depending on the device and countermeasures present. We’ll assume that the attacker has possession of several instances of the device and a moderate budget. This limits an attacker to non-invasive and slightly invasive methods.

Timing attacks work at the granularity of entire device operations (request through result) and don’t require any hardware tools. However, hardware may be used to acquire timing information, for example, by using an oscilloscope to count the clock cycles an operation takes. I call this observation point external since only information about the entire operation (not its intermediate steps) is available. All software, including commonly-used applications and operating systems, needs to be aware of timing attacks when working with secrets. The first published timing attack was against RSA, but any kind of CPU access to secret data can reveal information about that data (e.g., cache misses).
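The canonical mistake is an early-exit comparison on secret data. A minimal sketch:

def leaky_compare(secret: bytes, guess: bytes) -> bool:
    # Runtime grows with the number of leading bytes that match, which
    # is exactly the signal a timing attack measures.
    for x, y in zip(secret, guess):
        if x != y:
            return False
    return len(secret) == len(guess)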

A common misconception is that noise alone can prevent timing attacks. Boneh et al disproved this handily when they mounted timing attacks against OpenSSL over a WAN. If there is noise, just take more measurements. Since noise is random but the key is constant, the noise averages out as your sample size grows.
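A quick simulation (with invented numbers) shows why. The key-dependent signal is constant while the noise is not, so the sample mean converges on the signal:

import random

SIGNAL = 5e-6    # a 5 microsecond key-dependent timing difference
NOISE = 1e-3     # jitter 200x larger than the signal

for samples in (100, 10_000, 1_000_000):
    avg = sum(SIGNAL + random.gauss(0, NOISE) for _ in range(samples)) / samples
    print(samples, avg)   # the mean approaches SIGNAL as samples grow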

Power, EM, thermal, and audio side channel attacks measure more detailed internal behavior throughout an operation. If the intermediate state of an operation is visible in a timing attack, I classify it as an internal side channel attack as well (e.g., Percival’s cache timing attack). The granularity of measurement is important: thermal and audio attacks are less powerful given the slow response of the signal compared to the speed of the computation. In other words, they have built-in averaging.

Simple side channel attacks (i.e., SPA) involve observing differences in behavior within a single sample. For example, the difference in height of the power consumption peaks during a DES operation might indicate the number of 1 bits in the key for that particular round. Since most crypto is based on an iterative model, similarities and differences between iterations directly reflect the secret data being processed.

Differential side channel attacks (i.e. DPA) are quite a bit different. Instead of requiring an observable, repeatable difference in behavior, any slight variation in behavior can be leveraged using statistics and knowledge of cipher structure. It would take an entire series of articles to explain the various forms of DPA, but I’ll summarize by saying that DPA can automatically extract keys from traces that individually appear completely random.

Glitch attacks (aka fault induction) involve deliberately inducing an error in hardware behavior. They are usually non-invasive but occasionally partially invasive. If power lines are accessible, the power supply can be subjected to a momentary excessive voltage or a brown-out. Removing decoupling capacitors can magnify this effect. If IO lines are accessible, they can be subjected to high-frequency analog signals in an attempt to corrupt the logic behind the IO buffer. But usually these approaches can be prevented by careful engineering.

Most glitch attacks use the clock line since it is especially critical to chip operation. In addition to over-voltage, complex high-frequency waveforms can induce interesting behavior. Flip-flops and latches have a timing parameter called “setup and hold” which indicates how long a 0 or 1 bit needs to be applied before the hardware can remember the bit. High frequency waveforms at the edge of this limit cause some flip-flops to register a new value (possibly random) and others to keep their old value. Natural manufacturing variances mean this is impossible to prevent. Pulse, triangle, and sawtooth waveforms provide more possibilities for variation.

Optical and EM glitch attacks induce faults using radiation. Optical attacks are partially invasive in that the chip has to be partially removed from its package (decapping). EM attacks can usually penetrate the housing. The nice thing about this glitching approach is that individual areas of the chip can be targeted, like RAM, which is particularly vulnerable to bit flips. Optical attacks can be done using a flash bulb or laser pointer.

With more resources, tools like FIB workstations become available. These allow for fully invasive attacks, where the silicon is modified or tapped at various places to extract information or induce insecure behavior. Such tools are available (Ross Anderson’s group has been using one since the mid-90s) but are generally not used by the hobbyist hacker community.

Hardware design and glitch attacks

The first step in understanding glitch attacks is to look at how hardware actually works. Each chip is made up of transistors that are combined to produce gates and then high-level features like RAM, logic, lookup tables, state machines, etc. In turn, those features are combined to produce a CPU, video decoder, coprocessor, etc. We’re most interested in secure CPUs and the computations they perform.

Each of the feature blocks on a chip is coordinated by a global clock signal. Each time it “ticks”, signals propagate from one step to the next and among the various blocks. The speed of propagation is based on the chip’s architecture and physical silicon process, which together determine how quickly the main clock can run. This is why every CPU has a maximum (but not minimum) megahertz rating.

In hardware design, the logic blocks are made up of multiple stages, very much like CPU pipelining. Each clock cycle, data is read from an internal register, passes through some combinational logic, and is stored in a register to wait for the next clock cycle. The register can be the same one (i.e. for repeated processing as in multiple rounds of a block cipher) or a different one.

[Figure: data flows from one register, through combinational logic, into the next register on each clock cycle]

The maximum clock rate for the chip is constrained by the slowest block. If it takes one block 10 ns to propagate a signal from register to register, your maximum clock rate is 100 MHz, even if most of the other blocks are faster. If this is the case, the designer can either slice that function up into smaller blocks (but increase the total latency) or try to redesign it to take less time.
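A toy calculation of that constraint, with invented stage delays:

stage_delays_ns = {"fetch": 4.0, "decode": 6.5, "round_function": 10.0}

# The slowest register-to-register path sets the maximum clock rate.
f_max_mhz = 1000.0 / max(stage_delays_ns.values())
print(f_max_mhz)   # 100.0 MHz: the 10 ns block limits the whole chip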

A CPU is made up of multiple blocks. There is logic like the ALU or branch prediction, RAM for named registers and cache, and state machines for coordinating the whole show. If you examine an assembly instruction datasheet, you’ll find that each instruction takes one or more clocks, sometimes a variable number. For example, branch instructions often take more clock cycles if the branch is taken or if the branch predictor guessed wrong. As signals propagate between each block, the CPU is in an intermediate state. At a higher level, it is also in an intermediate state during execution of multi-cycle instructions.

As you can see from all this, a CPU is very sensitive to all the signals that pass through it and their timing. These signals can be influenced by the voltage at external pins, especially the clock signal since it is distributed to every block on a chip. When signals that have out-of-spec timing or voltage are applied to the pins, computation can be corrupted in surprisingly useful ways.