Advances in RSA fault attacks

A few months ago, there was an article on attacking an RSA private key by choosing specific messages to exercise multiplier bugs that may exist in modern CPUs. Adi Shamir (the “S” in “RSA”) announced the basics of this ongoing research, and it will be interesting to review the paper once it appears. My analysis is that this is a neat extension to an existing attack and another good reason not to implement your own public key crypto, but if you use a mainstream library, you’re already protected.

The attack depends on the target using a naively implemented crypto library on a machine that has a bug in the multiplier section of the CPU. Luckily, all crypto libraries I know of (OpenSSL, crypto++, etc.) guard against this kind of error by checking the signature before outputting it. Also, hardware multipliers probably have fewer bugs than dividers (ahem, FDIV), since the latter requires much more complex logic. An integer multiplier is usually implemented as a set of adders with additional control logic to perform an occasional shift, while a divider actually performs successive approximation (aka “guess-and-check”). The design of floating point divider logic is clever, and I recommend that linked paper for an overview.

The basic attack was first discovered in 1996 by Boneh et al. and applied to smart cards. I even spoke about this at the RSA 2004 conference; see page 11 of the linked slides. (Shameless plug: come see my talk, “Designing and Attacking DRM”, at RSA 2008.) Shamir has provided a clever extension of that attack, applying it to a general-purpose CPU where the attacker doesn’t have physical access to the device to cause a fault in the computation.

It may be counterintuitive, but a multiplication error in any one of the many multiplies that occur during an RSA private key operation is enough for an attacker who sees the erroneous result to quickly recover the private key. He doesn’t need to know which multiply failed or how far off it is from the correct result. This is an astounding conclusion, so be sure to read the original paper.

The standard RSA private key operation is:

m^d mod n

It is typically implemented in two stages that are later recombined with the CRT (Chinese Remainder Theorem). This is done for performance, since p and q are about half the size of n.

s1 = m^d mod p
s2 = m^d mod q
S = CRT(s1, s2)
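
As a concrete illustration, here is a minimal Python sketch of the two half-size exponentiations and the recombination (via Garner’s formula; the function and variable names are mine, not from any particular library):

def rsa_sign_crt(m, d, p, q):
    # Two half-size exponentiations, one mod p and one mod q
    s1 = pow(m, d % (p - 1), p)
    s2 = pow(m, d % (q - 1), q)
    # Garner recombination: S = s2 + q * ((s1 - s2) * q^-1 mod p)
    qinv = pow(q, -1, p)  # normally precomputed; pow(x, -1, m) needs Python 3.8+
    return s2 + q * (((s1 - s2) * qinv) % p)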

The power analysis graph in my talk clearly shows these two phases of exponentiation with a short pause in between. Remember that obtaining either p or q is sufficient to recover the private key since the other can be found by dividing, e.g. n/p = q. Lastly, d can be obtained once you know p and q.

The way to obtain the key from a faulty signature, assuming a glitch appeared during the exponentiation mod p, is:

q = GCD((m - S’^e) mod n, n)

Remember that the bad signature S’ is a combination of the correct value s2 = m^d mod q and some garbage G of approximately the same size. Since S’^e ≡ m mod q but not mod p, the difference m - S’^e is divisible by q but almost certainly not by p, so a single GCD (which is fast) recovers q.
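
To see this in action, here is a toy Python demonstration (deliberately tiny primes for readability; the arithmetic is identical at real key sizes). A single bit flip is injected into s1 to simulate a multiplier fault, and one GCD recovers q:

from math import gcd

p, q = 999983, 1000003             # toy primes, far too small for real use
n, e = p * q, 65537
d = pow(e, -1, (p - 1) * (q - 1))  # modular inverse needs Python 3.8+
m = 123456789

s1 = pow(m, d % (p - 1), p) ^ 1    # bit flip: simulated multiplier fault
s2 = pow(m, d % (q - 1), q)        # correct half
S_bad = s2 + q * (((s1 - s2) * pow(q, -1, p)) % p)

# S_bad^e matches m mod q but not mod p, so the GCD isolates q
print(gcd((m - pow(S_bad, e, n)) % n, n) == q)   # True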

The advance Shamir describes involves implementing this attack. In the past, it required physical possession of the device so glitches could be induced via voltage or clock transients. Such glitch attacks were once used successfully against pay television smart cards.

Shamir may be suggesting he has found some way to search the vast space (2^128 for 64-bit operands) of possible values for A * B for a given device and find some combination that is calculated incorrectly. If an attacker can embed such values in a message that is to be signed or decrypted with the private key, he can recover the private key via the Boneh et al. attack. This would indeed be a great advance since it could be carried out remotely.

There are two ways to prevent these attacks. The easiest is just to verify each signature after generating it:

S = m^d mod n
m’ = S^e mod n
if m’ != m:
    Fatal, discard S
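
In Python, the same countermeasure is only a few lines. This is a minimal sketch assuming raw RSA values; real libraries apply the check to the padded message, and since e is small the extra verification is cheap:

def sign_checked(m, d, e, n):
    S = pow(m, d, n)       # private key operation, possibly done via CRT
    if pow(S, e, n) != m:  # cheap public-key check catches any fault
        raise RuntimeError("fault detected, withholding signature")
    return S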

Also, randomized padding schemes like RSASSA-PSS can help.

All crypto libraries I know of implement at least the former approach, and RSASSA-PSS is also available nearly everywhere. So the moral is, use an off-the-shelf library but also make sure it has countermeasures to this kind of attack.

Panasonic CF-Y4 laptop disassembly

I’m a big fan of the lightweight Panasonic ultraportable laptops. The R-series is small but still usable. The Y-series offers a full 1400×1050 screen, built-in DVD-RW drive, and long battery life in a 3 pound package. As a FreeBSD developer, I also find that the BIOSes in the Panasonic and Lenovo/IBM laptops are mostly ACPI-compliant, meaning suspend/resume and power management work fine.

Recently, I upgraded the hard drive on my CF-Y4.  I found that these disassembly instructions (another good source) for the CF-Y2 are mostly accurate.  However, there are a few caveats I wanted to note for others with the R/W/T/Y series laptops.

First, all the notes about 3.3 volt versus 5 volt logic for the hard drive no longer apply. The Toshiba hard drive that came in my Y4 uses 5 volt logic along with a 5 volt motor supply. In fact, the pins are tied together internally. It was straightforward to swap in a WD 250 GB drive with no pin clipping necessary. This may apply to the newer R-series as well, though I haven’t verified it. If in doubt, use an ohmmeter to verify there is no resistance between pins 41 and 42 on the stock hard drive.

Next, heed the warnings about stripping the top two large hinge screws.  They screw directly into plastic, while the other two hinge screws have a steel sleeve.  Use a good jeweler’s screwdriver for the small screws.  You don’t need to remove the two screws that hold the VGA connector to the case.

When removing the keyboard, pry smoothly in multiple places, but don’t be afraid to put a little effort into it. The glue used to hold it down is surprisingly strong. Be sure you’ve removed all the small screws from the bottom first, of course, otherwise it won’t pop out.

Be sure to clean the CPU’s heat sink connection carefully and use good thermal paste when reassembling. These laptops have no fan (awesome!), but that means it’s critical to make a good connection between the CPU and the keyboard heat sink area. Also, don’t forget the GPU, which sinks heat through the bottom of the motherboard. I cut a small piece of plastic to use as a spreader to eliminate any bubbles. I also spread a thin layer of paste along other parts of the internal skeleton where it touches the keyboard. Once you reassemble the case, monitor the system temperature for a while to be sure you didn’t make a mistake. I found my temperature actually dropped compared to the factory thermal paste.

Vintage Computer Festival 2007

This past weekend I attended the Vintage Computer Festival at the Computer History Museum (article). There were numerous highlights at the exhibits. I saw a demo of the Minskytron and Spacewar! on an original PDP-1 by Steve Russell. The Magic-1 was a complete homebrew computer made of discrete 74xx logic chips running Minix. The differential analyzer showed how analog computers worked. I also met Wesley Clark and watched team members type demo code into the LINC, similar to ed on a very small terminal.

One question I asked other attendees was what recent or modern laptop I could get for outdoor use. I am looking for a low-power device with a high-contrast screen for typing notes or coding while camping. Older LCD devices like the eMate met these criteria but a more modern version is preferable. Most recommended the OLPC XO-1, and in monochrome mode, it sounds like what I want. But I think I’ll wait for the second version to be sure the bugs are worked out.

After looking around at attendees, I was concerned for our future. Other than a few dads with their kids, most people were 40+ years old. While I missed out on the golden era of computer diversity (I got my first C64 in 1987), I was always fascinated with how computers were invented. I checked out books from the library and read old copies of Byte magazine found in a dumpster. Once I got on the Internet, I browsed the Lions commentary on the Unix source code and studied the Rainbow Books to understand supervisor design.

So where was the under-30 crowd? Shouldn’t computer history be of interest to most computer science/electrical engineering students, and especially to security folks? Many auto mechanics enjoy viewing and maintaining old hotrods. Architectural history is important to civil engineers. I appreciate the work bunnie is doing to educate people on semiconductor design, including old chips. Is this having an effect?

If you’re under 30, I’m interested in hearing your response.

IOMMU – virtualization or DRM?

Before deciding how to enable DMA protection, it’s important to figure out what current and future threats you’re trying to prevent. Since there are performance trade-offs with various approaches to adding an IOMMU, it’s important to figure out if you need one, and if so, how it will be used.

Current threats using DMA have centered around the easiest interface to use, Firewire (IEEE 1394). Besides being a peripheral interconnect method, Firewire provides a message type that allows a device to directly DMA into a host’s memory. Some of the first talks on this include “0wned by an iPod” and “Hit by a Bus”. I especially like the latter method, where the status registers of an iPod are spoofed to convince the Windows host to disable Firewire’s built-in address restrictions.

Yes, Firewire already has DMA protection built in (see the OHCI spec). There is a set of registers that the host-side 1394 device driver can program to specify which addresses are allowed. This allows legitimate data transfer to a buffer allocated by the OS while preventing devices from overwriting anything else. Matasano previously wrote about how those registers can be accessed from the host side to disable protection.

There’s another threat that is quite scary once it appears but is probably still a long way off. Researchers, including myself, have long talked about rootkits persisting by storing themselves in a flash-updateable device and then taking over the OS on each boot by patching it via DMA. This threat has not emerged yet for a number of reasons. It’s by nature a targeted attack since you need to write a different rootkit for each model of device you want to backdoor. Patching the OS reliably becomes an issue if the user reinstalls it, so it would be a lot of work to maintain an OS-specific table of offsets. Mostly, there are just so many easier ways to backdoor systems that it’s not necessary to go this route.  So no one even pretends this is the reason for adding an IOMMU.

If you remember what happened with virtualization, I think there’s some interesting insight into what is driving the deployment of these features. Hardware VM support (Intel VT, AMD SVM) was being developed around the same time as trusted-computing chipsets (Intel SMX, AMD skinit). Likewise, DMA blocking (Intel NoDMA, AMD DEV) appeared before IOMMUs, which only started shipping in late 2007.

My theory about all this is that virtualization is something everyone wants.  Servers, desktops, and even laptops can now fully virtualize the OS.  Add an IOMMU and each OS can run native drivers on bare hardware.  When new virtualization features appear, software developers rush to support them.

DRM is a bit more of a mess.  Features like Intel SMX/AMD skinit go unused.  Where can I download one of these signed code segments all the manuals mention?  I predict you won’t see DMA protection being used to implement a protected path for DRM for a while, yet direct device access (i.e., faster virtualized IO) is already shipping in Xen.

The fundamental problem is one of misaligned interests. The people who have an interest in DRM (content owners) do not make hardware or software. Thus, new capabilities that are useful for both virtualization and DRM will always support virtualization first. We haven’t yet seen any mainstream DRM application support TPMs, and those have been out for four years. So when is the sky going to fall?

Protecting memory from DMA

Previously, we discussed how DMA works in the PC architecture. The northbridge is only aware of physical addresses and directs transactions to the appropriate devices or RAM based solely on that address.

Unlike within the CPU where there is virtual memory translation and protection, the chipset previously did not perform any translation of physical addresses or place restrictions on which addresses can be accessed. From the RAM’s perspective, a memory access that originated from the CPU or from the integrated Ethernet is exactly the same. Stability and security depended on the device driver properly programming the device to DMA only to physical addresses that were within the buffer assigned by the OS. This was fine except for device driver or hardware bugs, since ring 0 code was trusted.

With system-wide virtualization and DRM becoming more common, ring 0 code is no longer trusted. To avoid DMA corruption between guests or the host, the hypervisor previously would create a fake device for each OS instance. The guest talked to the fake device, and the host would multiplex the transactions over the real device. This has a lot of overhead, so it would be preferable to let the guest talk directly to the real device.

An IOMMU provides translation and protection for physical addresses. The hypervisor sets up a page table for the northbridge that groups entries by device ID. Then, when a DMA request arrives at the northbridge from a device, the request is looked up by device ID, translated into the actual destination physical address, and allowed or denied based on the protection settings. If a write is denied, no data is transferred to RAM. If it’s a read, all bits in the response are set to 1. Either way, an abort error is returned to the device as well.
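
As a rough illustration of that flow, here is a toy Python model of the lookup (purely a sketch of the logic described above; a real IOMMU walks multi-level tables in RAM and caches translations in an IOTLB):

PAGE_SHIFT = 12  # 4 KB pages

def iommu_dma(tables, device_id, dma_addr, is_write):
    # tables: per-device map of device-visible page -> (real page, writable)
    page, offset = dma_addr >> PAGE_SHIFT, dma_addr & 0xFFF
    entry = tables.get(device_id, {}).get(page)
    if entry is None or (is_write and not entry[1]):
        # Denied: nothing reaches RAM, reads see all 1s, device gets an abort
        return ("abort", None)
    real_page, _writable = entry
    return ("ok", (real_page << PAGE_SHIFT) | offset)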

DMA protection (AMD: DEV, Intel: NoDMA table) is currently available in shipping products and physical address translation (AMD: IOMMU, Intel: VT-d) is coming very soon. While these features were implemented separately, it is expected that they will usually be used together.

There have been a few surprising studies of IOMMU performance. The first paper, by IBM researchers, shows that the overhead of setting up and tearing down mappings consumed up to 60% more CPU than running without an IOMMU. They discuss various mapping allocation strategies to address this, but each has its disadvantages. One of the strategies, setting up the mappings at guest startup and never changing them, interferes with the hypervisor technique called “ballooning”, where resources are only allocated to a guest as it uses them. This is what allows VMware to run guests with more RAM available to them than the host actually has. Read the paper for more analysis of their other strategies.

Another paper, by Rice University researchers, proposes virtualization support built into the devices themselves (“CDNA”). They build a NIC that maintains a unique set of registers for each guest. Each guest believes it has direct access to the NIC, although requests to set up DMA go through the hypervisor. The NIC hardware manages the fair scheduling of DMA among all the register contexts, so actual packets going out on the wire will be balanced between the various guests sending them. This approach requires no IOMMU, but each device needs to be capable of maintaining multiple register contexts. Again, read this paper for a different take on device virtualization.

This research shows that an IOMMU is not the only way to achieve DMA protection, and it’s important to carefully design how a hypervisor uses an IOMMU to prevent a loss of performance. Next time, we’ll examine some usage scenarios for IOMMUs, both in virtualization and DRM.

PC memory architecture overview

The topics of DMA protection and a new Intel/AMD feature called an IOMMU (or VT-d) are becoming more prevalent. I believe this is due to two trends: increased use of virtualization and hardware protection for DRM. It’s important to first understand how memory works in a traditional PC before discussing the benefits and issues with using an IOMMU.

DMA (direct memory access) is a general term for architectures where devices can talk directly to RAM without the CPU being involved. In PCs, the CPU is not even notified when DMA is in progress, although some chipsets do report a little information (e.g., the bus master status bit, BM_STS). DMA was conceived to provide higher performance than the alternative, which is for the CPU to copy each byte of data from the device to memory (aka programmed IO). To write data to a hard drive controller via DMA, the driver running on the CPU writes the memory address of the data to the hardware and then goes on to other tasks. The drive controller finishes reading the data via DMA and generates an interrupt to notify the CPU that the write is complete.

DMA can actually be slower than programmed IO if the overhead in talking to the DMA controller to initiate the transaction takes longer than the transaction itself. This may be true for very short data. That’s why the original PC parallel port (LPT) doesn’t support DMA. When there are only 8 bits of data per transaction, it doesn’t make sense to spend time telling the hardware where to put the data, just read it yourself.

Beyond this concept of DMA, which is common to nearly all modern architectures, the PC has a particular breakdown of responsibilities among its various chips. The CPU executes code and talks to the northbridge (Intel MCH). Integrated devices like USB and Ethernet are all located in the southbridge (Intel ICH), with the exception of on-board video, which is located in the northbridge. Between each of these chips is an Intel or AMD proprietary bus, which is why your Intel CPU won’t work with your AMD chipset, even if you were to rework the socket to fit it. Your RAM is accessed only via the northbridge (Intel) or via a bus shared with the northbridge (AMD).

Interfacing with the CPU is very simple. All complexities (privilege level, paging, task management, segmentation, MSRs) are handled completely internally. On the external bus shared with the northbridge, a CPU has a set of address and data lines and a few control/status lines. Besides power supply, the address and data pins are the most numerous. In the Intel quad-core spec, there are only about 60 types of pins. Only three pins (LINT[0:1], SMI#) are used to signal all interrupts, even on systems with dozens of devices.

Remember, these addresses are purely physical addresses as all virtual memory translation is internal to the CPU. There are two types of addresses known to the northbridge: memory and IO space. The latter are generated by the in/out asm instructions and merely result in a special value being written to the address lines on the next clock cycle after the address is sent. IO space addresses are typically used for device configuration or legacy devices.

The northbridge is relatively dumb compared to the CPU. It is like a traffic cop, directing the CPU’s accesses to devices or RAM. Likewise, when a device on the southbridge wants to access RAM via DMA, the northbridge merely routes the request to the correct location. It maintains a map, set during PCI configuration, which says something like “these address ranges go to the southbridge, these others go to the integrated video”.

With integrated peripherals, PCI is no longer a bus; it’s merely a protocol. There is no set of PCI bus lines within your southbridge hooked to the USB and Ethernet components of the chip. Instead, only PCI configuration remains in common with external devices on a PCI bus. PCI configuration is merely a set of IO port reads/writes to walk the logical device hierarchy, programming the northbridge with which regions it decodes to which device. It’s setting up the table for the traffic cop.
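
As a concrete example, the classic “mechanism #1” configuration access encodes bus/device/function/register into a single dword written to IO port 0xCF8, then reads the data from port 0xCFC. Here is a sketch of that encoding in Python, plus a read of the same config space via sysfs (assuming Linux; the sysfs path format is domain:bus:device.function):

import struct

def config_address(bus, dev, func, reg):
    # CONFIG_ADDRESS dword for IO port 0xCF8; data then appears at 0xCFC
    return 0x80000000 | (bus << 16) | (dev << 11) | (func << 8) | (reg & 0xFC)

# The kernel exposes the same config space through sysfs, an easier way to
# read it from a script. Bus 0, device 0, function 0 is typically the host
# bridge, i.e. part of the northbridge:
with open("/sys/bus/pci/devices/0000:00:00.0/config", "rb") as f:
    vendor, device = struct.unpack("<HH", f.read(4))
print(hex(vendor), hex(device))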

Next time, we’ll examine the advent of IOMMUs and DEVs/NoDMA tables.