rdist

February 28, 2008

Preventing RSA cache timing attacks

Filed under: Crypto,Network,PC Architecture,Security — Nate Lawson @ 1:24 pm

It has been known for a while that side-channel information about crypto operations (i.e., timing) can give enough information to recover secret keys. The original paper by Kocher even indicated RAM cache behavior as a potential source of timing variance.

In 2005, Osvik and Percival separately published papers on cache timing attacks. The latter resulted in a PR incident for Intel since Hyperthreading, which AMD doesn’t have, was announced as the key vantage point for observing the timing of crypto running on the other logical processor of the same core. Since cache-related side channels are exploitable even in single-CPU systems, crypto libraries needed to be updated to fix this issue whether Hyperthreading was present or not.

For RSA, it’s necessary to understand modular exponentiation in order to understand the attack. When decrypting or signing a message using the private exponent d, the server computes m^d mod n. To do this efficiently with big numbers, a number of techniques are used.

The basic technique, square-and-multiply, takes advantage of the binary representation of the exponent: shifting the exponent left by one bit corresponds to squaring the accumulated result. This method walks through each bit of the exponent. If the bit is 0, the accumulating result (initially set to m) is squared. If it is 1, it is squared and then multiplied by m. In nearly all implementations, it takes longer to square and then multiply than to merely square, giving a very distinguishable timing difference. An attacker can figure out which bits are 0 or 1 from this difference.
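
To make the timing leak concrete, here is a minimal sketch of left-to-right square-and-multiply in C. It uses 64-bit integers and toy-sized moduli (real RSA code uses multi-precision bignum arithmetic), and the function name is just for illustration; the point is that the extra multiply happens only for 1 bits of the secret exponent.

#include <stdint.h>
#include <stdio.h>

/* Left-to-right square-and-multiply with 64-bit integers.  Moduli must fit
 * in 32 bits so the products below don't overflow; real RSA implementations
 * use bignum arithmetic.  The control flow is the leak: the extra
 * "multiply by m" only happens for 1 bits of the exponent. */
static uint64_t modexp_sqr_mul(uint64_t m, uint64_t d, uint64_t n)
{
    m %= n;
    if (d == 0)
        return 1 % n;

    int bit = 63;
    while (!((d >> bit) & 1))        /* find the most significant set bit */
        bit--;

    uint64_t acc = m;                /* top bit of d handled by starting at m */
    for (bit--; bit >= 0; bit--) {
        acc = (acc * acc) % n;       /* always square */
        if ((d >> bit) & 1)
            acc = (acc * m) % n;     /* slower path, only for 1 bits */
    }
    return acc;
}

int main(void)
{
    /* 2^10 mod 1000 = 24 */
    printf("%llu\n", (unsigned long long)modexp_sqr_mul(2, 10, 1000));
    return 0;
}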

Since basic square-and-multiply is still too slow, a method called “windowed exponentiation” was invented. Instead of examining each individual bit of the exponent, it walks through groups of bits of size w. There are both fixed and sliding window variants, but I’ll stick to fixed-window here for the sake of simplicity. Square-and-multiply is simply fixed-window exponentiation with w = 1.

For the common size w = 3, these values are precomputed mod n: m^2, m^3, m^4, m^5, m^6, and m^7. Obviously, m^0 is 1 and m^1 is just m, so they are already known.

Just like in square-and-multiply, the accumulator is always first “squared”. However, since we are operating on multiple bits at once, it’s actually raised to the 2^w power, or 2^3 = 8 in our example. Then, the value of the bits in the window is checked. If 000b, nothing more happens. If 001b, the accumulator is multiplied by m. If 010b, it’s multiplied by the precomputed m^2, and so forth. This speeds up the processing by handling batches of exponent bits, at the expense of some small precomputation.
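
Continuing the toy example above, here is a sketch of fixed-window exponentiation with w = 3, again with 64-bit integers rather than bignums. Note that the table lookup (table[bits]) is exactly the data-dependent memory access that a cache attack can observe.

#include <stdint.h>
#include <stdio.h>

#define W 3                          /* window size in bits */
#define TABLE_SIZE (1u << W)         /* 2^w precomputed powers m^0 .. m^7 */

/* Fixed-window modular exponentiation, same 32-bit modulus restriction as
 * the previous sketch.  table[bits] is the secret-dependent lookup. */
static uint64_t modexp_fixed_window(uint64_t m, uint64_t d, uint64_t n)
{
    uint64_t table[TABLE_SIZE];
    m %= n;
    table[0] = 1 % n;
    for (unsigned i = 1; i < TABLE_SIZE; i++)
        table[i] = (table[i - 1] * m) % n;   /* table[i] = m^i mod n */

    uint64_t acc = 1 % n;
    /* Walk the exponent in w-bit windows, most significant window first. */
    for (int shift = ((64 + W - 1) / W - 1) * W; shift >= 0; shift -= W) {
        for (int i = 0; i < W; i++)
            acc = (acc * acc) % n;           /* raise accumulator to the 2^w */
        unsigned bits = (unsigned)((d >> shift) & (TABLE_SIZE - 1));
        acc = (acc * table[bits]) % n;       /* multiply by precomputed m^bits */
    }
    return acc;
}

int main(void)
{
    /* 2^10 mod 1000 = 24, matching the square-and-multiply sketch */
    printf("%llu\n", (unsigned long long)modexp_fixed_window(2, 10, 1000));
    return 0;
}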

Those precomputed values were typically stored sequentially in an array in memory, each the size of the modulus n (e.g., 256 bytes each for a 2048-bit n). Since the cache line size on modern x86 CPUs is 64 bytes, the multiply operation would take a little longer if the pre-computed value wasn’t already in the cache.

An attacker could repeatedly cause these values to be evicted from the cache and then, based on the computation time, see which ones were used. Or, he could “touch” an array of garbage that was aligned along cache lines, access the SSL server, then time how long it took to read each value from his own array. If reading a value was slow, it had been evicted, meaning the server’s RSA implementation was likely to have touched the cache set holding the associated pre-computed value. For example, if the 2nd element in the array had been evicted, then the server probably accessed its precomputed m^3 and thus the corresponding bits of the private exponent were 011b.
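
The spy’s measurement loop might look something like the following sketch. It is only a skeleton of the prime-and-probe idea: the array addresses, the rdtsc-based timer, and the single-pass measurement are simplifications, and a real attack has to pick addresses that map to the same cache sets as the victim’s table, take many samples, and filter out noise.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>               /* __rdtsc (GCC/Clang, x86) */

#define LINE   64
#define NLINES 8                     /* one probe line per precomputed m^i */

/* Cache-line-aligned array the attacker owns. */
static uint8_t probe[NLINES * LINE] __attribute__((aligned(LINE)));

static uint64_t time_read(volatile uint8_t *p)
{
    uint64_t start = __rdtsc();
    (void)*p;                        /* load the line */
    return __rdtsc() - start;
}

int main(void)
{
    volatile uint8_t sink = 0;

    /* Prime: bring every probe line into the cache. */
    for (int i = 0; i < NLINES; i++)
        sink ^= probe[i * LINE];

    /* ... here the victim (e.g. the SSL server) would run ... */

    /* Probe: time each line; slow lines were evicted by the victim. */
    for (int i = 0; i < NLINES; i++)
        printf("line %d: %llu cycles\n", i,
               (unsigned long long)time_read(&probe[i * LINE]));
    return 0;
}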

This attack could be repeated many times until the entire key was extracted. Hyperthreading provided a good vantage point since the spy process could be accessing its array over and over while the SSL implementation ran simultaneously on the other logical processor of the same physical core, sharing its cache. However, this same attack could work on a single-CPU system as well, albeit with more samples required.

To address these attacks for OpenSSL, Matthew Wood (Intel) submitted a patch that first appeared in version 0.9.7h. The key change is to the modular exponentiation function in openssl/crypto/bn/bn_exp.c.

This patch stripes the pre-computed values m^2, m^3, … across cache lines instead of storing them sequentially in the array. That is, the first byte of m^2 would be at address 0, the first byte of m^3 at 1, etc. A memory dump of this region with w = 3 would look like this:

0: m^2[0], m^3[0], … m^7[0]
64: m^2[1], m^3[1], … m^7[1]
128: m^2[2], m^3[2], … m^7[2]

Thus, the access pattern for reading any pre-computed value is exactly the same as for any other: 256 sequential cache line reads. This is a clever way of removing the timing leak. I think it is commendable that Intel spent the time developing this code and contributing it to OpenSSL, especially since the widespread criticism surrounding this problem was directed primarily at them.
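
The idea is easier to see in code. Below is a simplified illustration of the striped layout (not the actual OpenSSL code): byte i of table entry j lives at offset i * 64 + j, so gathering any entry walks the same sequence of cache lines regardless of which power is being read.

#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define CACHE_LINE 64
#define VAL_BYTES  256               /* size of one value for a 2048-bit modulus */

/* Simplified illustration of cache-line striping: byte i of table entry j
 * lives at buf[i * CACHE_LINE + j], so reading any entry touches the same
 * 256 cache lines in the same order. */
static void scatter(uint8_t *buf, const uint8_t *val, unsigned j)
{
    for (unsigned i = 0; i < VAL_BYTES; i++)
        buf[i * CACHE_LINE + j] = val[i];
}

static void gather(uint8_t *val, const uint8_t *buf, unsigned j)
{
    for (unsigned i = 0; i < VAL_BYTES; i++)
        val[i] = buf[i * CACHE_LINE + j];    /* same line sequence for every j */
}

int main(void)
{
    static uint8_t buf[VAL_BYTES * CACHE_LINE];
    uint8_t in[VAL_BYTES], out[VAL_BYTES];

    for (unsigned i = 0; i < VAL_BYTES; i++)
        in[i] = (uint8_t)i;
    scatter(buf, in, 3);             /* store a value in column 3 */
    gather(out, buf, 3);
    printf("%s\n", memcmp(in, out, VAL_BYTES) == 0 ? "ok" : "mismatch");
    return 0;
}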

There are two problems with this approach. The first is that it can’t fix problems with AES cache timing (see also this detailed analysis of Bernstein’s work and this proposed cache architecture).

The second may only affect certain hardware. The patch is configurable for machines with a 32- or 64-byte cache line size; the default is 64. If an older machine or a non-x86 CPU with a smaller cache line size runs the default compile of OpenSSL, it will still have a small leak if a large window is used (2^w >= the actual cache line size in bytes). For example, consider a machine with an 8-byte cache line size and a 12-bit window. OpenSSL would store m^2…m^9[0] in the first cache line and m^10…m^17[0] in the second, allowing an attacker to determine if a given set of exponent bits was less than 2^10 or not. This might be exploitable in certain situations.

To be truly safe against timing attacks, you need to carefully examine the behavior of the underlying hardware. Be especially careful if adapting an existing crypto implementation to a new platform and get an expert to review it.

February 24, 2008

Memory remanence attack analysis

Filed under: Crypto,Hacking,PC Architecture,Security,Software protection — Nate Lawson @ 6:00 am

You have probably heard by now of the memory remanence attack by Halderman et al. They show that it is easy to recover cryptographic keys from RAM after a reset or moving the DIMM to another system. This is important to any software that stores keys in RAM, and they targeted disk encryption. It’s a nice paper with a very polished but responsible publicity campaign, including a video.

Like most good papers, some parts of the attack were known for a long time and others were creative improvements. Memory remanence has been a known issue ever since the first key had to be zeroed after use. In the PC environment, the trusted computing efforts have been aware of this as well. (See “Hardware Attacks”, chapter 13 — S3 is suspend-to-ram and SCLEAN is a module that must be run during power-on to clear RAM). However, the Halderman team is publishing the first concrete results in this area and it should shake things up.

One outcome I do not want to see from this is a blind movement to closed hardware crypto (e.g., hard disk drives with onboard encryption). Such systems are ok in principle, but in practice often compromise security in more obvious ways than a warm reboot. For example, a hard drive that stores encryption keys in a special “lock sector” that the drive firmware won’t access without a valid password can be easily circumvented by patching the firmware. Such a system would be less secure in a cold power-on scenario than well-implemented software. The solution here is to ask vendors for documentation on their security implementation before making a purchase or only buy hardware that has been reviewed by a third-party with a report that matches your expectations. (Full disclosure: I perform this kind of review at Root Labs.)

Another observation is that this attack underlines the need to apply software protection techniques to other security applications besides DRM. If an attacker can dump your RAM, you need effective ways to hide the key in memory like white-box crypto, obfuscate and tamper-protect software that uses it, and randomize each install to prevent “break once, run everywhere” attacks. Yes, this is the exact same threat model DRM has faced for years but this time you care because you’re the target.

It will be interesting to see how vendors respond to this. Zeroing memory on reboot is an obvious change that addresses some of their methods. A more subtle hack is to set up page mapping and cache configuration such that the key is loaded into a cache line and never evicted (as done for fast IP routing table lookup in this awesome paper). However, none of this stops attacks that move the DIMM to another system. On standard x86 hardware, there’s no place other than RAM to put keys. However, the VIA C7 processors have hardware AES built into the CPU, and it’s possible more vendors will take this approach to providing secure key storage and crypto acceleration.

Whatever the changes, it will probably take a long time before this attack is effectively addressed. Set your encrypted volumes to auto-detach during suspend or a reasonable timeout and keep an eye on your laptop.

October 15, 2007

Trapping access to debug registers

Filed under: PC Architecture,Reverse engineering,Security,Software protection — Nate Lawson @ 5:00 am

If you’re designing or attacking a software protection scheme, the debug registers are a great resource. Their use is mostly described in the Intel SDM Volume 3B, chapter 18. They can only be accessed by ring 0 software, but their breakpoints can be triggered by execution of unprivileged code.

The debug registers provide hardware support for setting up to four different breakpoints. They have been around since the 386, as this fascinating history describes. Each breakpoint can be set to occur on an execute, write, read/write, or IO read/write (i.e., in/out instructions). Each monitored address can cover a range of 1, 2, 4, or 8 bytes.

DR0-3 store the addresses to be monitored. DR6 provides status bits that describe which event occurred. DR7 configures the type of event to monitor for each address. DR4-5 are aliases for DR6-7 if the CR4.DE bit is clear. Otherwise, accessing these registers yields an invalid opcode exception (#UD). This behavior might be useful for obfuscation.
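
For reference, here is a small example that builds a DR7 value from the bit layout given in the SDM. The encoding itself is standard; actually loading the result into DR7 requires ring 0 code, which is not shown.

#include <stdint.h>
#include <stdio.h>

/* DR7 layout per Intel SDM Vol. 3B: L0-L3/G0-G3 enable bits in the low
 * byte, one 4-bit R/Wn + LENn field per breakpoint starting at bit 16. */
#define DR7_TYPE_EXEC   0x0          /* break on instruction execution */
#define DR7_TYPE_WRITE  0x1          /* break on data writes */
#define DR7_TYPE_IO     0x2          /* break on IO reads/writes (needs CR4.DE) */
#define DR7_TYPE_RW     0x3          /* break on data reads or writes */

#define DR7_LEN_1       0x0
#define DR7_LEN_2       0x1
#define DR7_LEN_4       0x3
#define DR7_LEN_8       0x2          /* 64-bit capable processors only */

#define DR7_GD          (1u << 13)   /* general detect: trap on DR accesses */

static uint32_t dr7_encode(int slot, uint32_t type, uint32_t len, int global)
{
    uint32_t dr7 = 0;
    dr7 |= 1u << (slot * 2 + (global ? 1 : 0));      /* Ln or Gn enable bit */
    dr7 |= (type | (len << 2)) << (16 + slot * 4);   /* R/Wn and LENn fields */
    return dr7;
}

int main(void)
{
    /* Example: slot 0, break on a 4-byte write, globally enabled. */
    printf("DR7 = 0x%08x\n", dr7_encode(0, DR7_TYPE_WRITE, DR7_LEN_4, 1));
    return 0;
}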

When a condition is met for one of the four breakpoints, INT1 is triggered. This is the same exception as for a single-step trap (EFLAGS.TF = 1). INT3 is for software breakpoints and is useful when setting more than four breakpoints. However, software breakpoints require modifying the code to insert an int3 instruction and can’t monitor reads/writes to memory.

One very useful feature of the debug registers is DR7.GD (bit 13). Setting this bit causes reads or writes to any of the debug registers to generate an INT1. This was originally intended to support ICE (In-Circuit Emulation) since some x86 processors implemented test mode by executing normal instructions. This mode was the same as SMM (System Management Mode), the feature that makes your laptop power management work. SMM has been around since the 386SL and is the original x86 hypervisor.

To analyze a protection scheme that accesses the debug registers, hook INT1 and set DR7.GD. When your handler is called, check DR6.BD (also bit 13). If it is set, the instruction at the faulting EIP was about to read or write to a debug register. You’re probably somewhere near the protection code.  Since this is a faulting exception, the MOV DRx instruction has not executed yet and can be skipped by updating the EIP on the stack before executing IRET.
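
A skeleton of such a handler might look like the following. The trap frame layout and the surrounding entry stub (which would read DR6 before calling this and write the returned value back) are assumptions, since a real handler is OS-specific and partly assembly, but it shows the DR6.BD check and the EIP adjustment described above. Note the CPU clears DR7.GD when delivering the exception so the handler can touch the debug registers; re-set it before returning if you want to keep trapping.

#include <stdint.h>

/* Hypothetical sketch of INT1 (debug exception) handling for 32-bit x86. */
struct trap_frame {
    uint32_t eip;                    /* saved return address, pushed by the CPU */
    uint32_t cs;
    uint32_t eflags;
};

#define DR6_BD (1u << 13)            /* set when DR7.GD caught a MOV to/from DRx */

/* Called by a (hypothetical) low-level entry stub with the current DR6;
 * returns the value the stub should write back to DR6. */
uint32_t handle_int1(struct trap_frame *tf, uint32_t dr6)
{
    if (dr6 & DR6_BD) {
        /* The faulting instruction is a MOV to/from a debug register and
         * has not executed yet.  MOV DRx,r32 / MOV r32,DRx is 3 bytes
         * (0F 21 /r or 0F 23 /r), so skip it before the IRET. */
        tf->eip += 3;
        dr6 &= ~DR6_BD;              /* clear the status bit for the next event */
        /* ...log that something just tried to touch the debug registers,
         * and re-arm DR7.GD before returning... */
    }
    /* otherwise: normal single-step / hardware breakpoint handling */
    return dr6;
}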

If you’re designing software protection, there are some interesting ways to use this feature to prevent attackers from having easy access to the debug registers. I’ll have to leave that for another day.

October 3, 2007

IOMMU – virtualization or DRM?

Filed under: Hardware,PC Architecture,Security,VM — Nate Lawson @ 5:00 am

Before deciding how to enable DMA protection, it’s important to figure out what current and future threats you’re trying to prevent. Since there are performance trade-offs with the various approaches to adding an IOMMU, you need to decide whether you need one at all and, if so, how it will be used.

Current threats using DMA have centered around the easiest-to-use interface, Firewire (IEEE 1394). Besides being a peripheral interconnect method, Firewire provides a message type that allows a device to directly DMA into a host’s memory. Some of the first talks on this include “0wned by an iPod” and “Hit by a Bus”. I especially like the latter method, where the status registers of an iPod are spoofed to convince the Windows host to disable Firewire’s built-in address restrictions.

Yes, Firewire already has DMA protection built in (see the OHCI spec). There is a set of registers that the host-side 1394 device driver can program to specify what addresses are allowed. This allows legitimate data transfer to a buffer allocated by the OS while preventing devices from overwriting anything else. Matasano previously wrote about how those registers can be accessed from the host side to disable protection.

There’s another threat that is quite scary once it appears but is probably still a long way off. Researchers, including myself, have long talked about rootkits persisting by storing themselves in a flash-updateable device and then taking over the OS on each boot by patching it via DMA. This threat has not emerged yet for a number of reasons. It’s by nature a targeted attack since you need to write a different rootkit for each model of device you want to backdoor. Patching the OS reliably becomes an issue if the user reinstalls it, so it would be a lot of work to maintain an OS-specific table of offsets. Mostly, there are just so many easier ways to backdoor systems that it’s not necessary to go this route.  So no one even pretends this is the reason for adding an IOMMU.

If you remember what happened with virtualization, I think there’s some interesting insight into what is driving the deployment of these features.  Hardware VM support (Intel VT, AMD SVM) was being developed around the same time as trusted-computing chipsets (Intel SMX, AMD skinit).  Likewise, DMA blocking (Intel NoDMA, AMD DEV) appeared before IOMMUs, which only start shipping in late 2007.

My theory about all this is that virtualization is something everyone wants.  Servers, desktops, and even laptops can now fully virtualize the OS.  Add an IOMMU and each OS can run native drivers on bare hardware.  When new virtualization features appear, software developers rush to support them.

DRM is a bit more of a mess.  Features like Intel SMX/AMD skinit go unused.  Where can I download one of these signed code segments all the manuals mention?  I predict you won’t see DMA protection being used to implement a protected path for DRM for a while, yet direct device access (i.e., faster virtualized IO) is already shipping in Xen.

The fundamental problem is one of misaligned interests.  The people that have an interest in DRM (content owners) do not make hardware or software.  Thus new capabilities that are useful for both virtualization and DRM, for example, will always first support virtualization.  We haven’t yet seen any mainstream DRM application support TPMs, and those have been out for four years.  So when is the sky going to fall?

September 28, 2007

Protecting memory from DMA

Filed under: Hardware,PC Architecture,Security,VM — Nate Lawson @ 5:00 am

Previously, we discussed how DMA works in the PC architecture. The northbridge is only aware of physical addresses and directs transactions to the appropriate devices or RAM based solely on that address.

Unlike within the CPU where there is virtual memory translation and protection, the chipset previously did not perform any translation of physical addresses or place restrictions on which addresses can be accessed. From the RAM’s perspective, a memory access that originated from the CPU or from the integrated Ethernet is exactly the same. Stability and security depended on the device driver properly programming the device to DMA only to physical addresses that were within the buffer assigned by the OS. This was fine except for device driver or hardware bugs, since ring 0 code was trusted.

With system-wide virtualization and DRM becoming more common, ring 0 code is no longer trusted. To avoid DMA corruption between guests or the host, the hypervisor previously would create a fake device for each OS instance. The guest talked to the fake device, and the host would multiplex the transactions over the real device. This has a lot of overhead, so it would be preferable to let the guest talk directly to the real device.

An IOMMU provides translation and protection for physical addresses. The hypervisor sets up a page table within the northbridge that groups page table entries by their device IDs. Then, when a DMA request arrives at the northbridge from a device, it is looked up by its ID, translated into the actual destination physical address, and allowed or denied based on the protection settings. If a write is denied, no data is transferred to RAM. If it’s a read, all bits are set to 1 in the response. Either way, an abort error is returned to the device as well.
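
As a purely conceptual model (not the actual VT-d or AMD IOMMU table formats), the translation step amounts to something like this: each device ID selects a page table, and every DMA address is either remapped or rejected based on its entry. All names and structures below are invented for illustration.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define NPAGES     16                /* tiny per-device table for illustration */

struct iopte {
    uint64_t host_pfn;               /* physical page the device may really touch */
    bool     readable;
    bool     writable;
    bool     present;
};

struct io_pagetable {
    uint16_t     device_id;          /* e.g. PCI bus/device/function */
    struct iopte entries[NPAGES];
};

/* Returns true and fills *host_addr if the access is allowed. */
static bool iommu_translate(const struct io_pagetable *pt, uint64_t dma_addr,
                            bool is_write, uint64_t *host_addr)
{
    uint64_t pfn = dma_addr >> PAGE_SHIFT;
    if (pfn >= NPAGES || !pt->entries[pfn].present)
        return false;                /* abort returned to the device */
    if (is_write ? !pt->entries[pfn].writable : !pt->entries[pfn].readable)
        return false;                /* denied: write dropped, read returns all 1s */
    *host_addr = (pt->entries[pfn].host_pfn << PAGE_SHIFT)
               | (dma_addr & ((1u << PAGE_SHIFT) - 1));
    return true;
}

int main(void)
{
    struct io_pagetable nic = { .device_id = 0x0300 };
    nic.entries[2] = (struct iopte){ .host_pfn = 0x12345, .readable = true,
                                     .writable = true, .present = true };

    uint64_t host;
    if (iommu_translate(&nic, 0x2a80, true, &host))
        printf("DMA to 0x2a80 remapped to host physical 0x%llx\n",
               (unsigned long long)host);
    if (!iommu_translate(&nic, 0x5000, true, &host))
        printf("DMA to 0x5000 blocked\n");
    return 0;
}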

DMA protection (AMD: DEV, Intel: NoDMA table) is currently available in shipping products and physical address translation (AMD: IOMMU, Intel: VT-d) is coming very soon. While these features were implemented separately, it is expected that they will usually be used together.

There have been a few surprising studies of IOMMU performance. The first paper, by IBM researchers, shows that the overhead of setting up and tearing down mappings consumed up to 60% more CPU than running without an IOMMU. They discuss various mapping allocation strategies to address this. However, they all have their disadvantages. One of the strategies, setting up the mappings at guest startup and never changing them, interferes with the hypervisor strategy called “ballooning”, where resources are only allocated to a guest as it uses them. This is what allows VMware to run guests with more RAM available to them than the host actually has. Read the paper for more analysis of their other strategies.

Another paper, by Rice University researchers, proposes virtualization support built into the devices themselves (“CDNA”). They build a NIC that maintains a unique set of registers for each guest. Each guest believes it has direct access to the NIC, although requests to set up DMA go through the hypervisor. The NIC hardware manages the fair scheduling of DMA among all the register contexts, so actual packets going out on the wire will be balanced between the various guests sending them. This approach requires no IOMMU, but each device needs to be capable of maintaining multiple register contexts. Again, read this paper for a different take on device virtualization.

This research shows that an IOMMU is not the only way to achieve DMA protection, and it’s important to carefully design how a hypervisor uses an IOMMU to prevent a loss of performance. Next time, we’ll examine some usage scenarios for IOMMUs, both in virtualization and DRM.

September 27, 2007

PC memory architecture overview

Filed under: Hardware,PC Architecture,Security,VM — Nate Lawson @ 5:00 am

The topics of DMA protection and a new Intel/AMD feature called an IOMMU (or VT-d) are becoming more prevalent. I believe this is due to two trends: increased use of virtualization and hardware protection for DRM. It’s important to first understand how memory works in a traditional PC before discussing the benefits and issues with using an IOMMU.

DMA (direct memory access) is a general term for architectures where devices can talk directly to RAM, without the CPU being involved. In PCs, the CPU is not even notified when DMA is in progress, although some chipsets do report a little information (i.e., bus mastering status bit or BM_STS). DMA was conceived to provide higher performance than the alternative, which is for the CPU to copy each byte of data from the device to memory (aka programmed IO). To write data to a hard drive controller via DMA, the driver running on the CPU writes the memory address of the data to the hardware and then goes on to other tasks. The drive controller finishes reading the data via DMA and generates an interrupt to notify the CPU that the write is complete.

DMA can actually be slower than programmed IO if the overhead in talking to the DMA controller to initiate the transaction takes longer than the transaction itself. This may be true for very short data. That’s why the original PC parallel port (LPT) doesn’t support DMA. When there are only 8 bits of data per transaction, it doesn’t make sense to spend time telling the hardware where to put the data, just read it yourself.

Behind this concept of DMA, common to nearly all modern architectures, the PC has a particular breakdown of responsibilities between the various chips. The CPU executes code and talks to the northbridge (Intel MCH). Integrated devices like USB and Ethernet are all located in the southbridge (Intel ICH), with the exception of on-board video, which is located in the northbridge. Between each of these chips is an Intel or AMD proprietary bus, which is why your Intel CPU won’t work with your AMD chipset, even if you were to rework the socket to fit it. Your RAM is accessed only via the northbridge (Intel) or via a bus shared with the northbridge (AMD).

Interfacing with the CPU is very simple. All complexities (privilege level, paging, task management, segmentation, MSRs) are handled completely internally. On the external bus shared with the northbridge, a CPU has a set of address and data lines and a few control/status lines. Besides power supply, the address and data pins are the most numerous. In the Intel quad-core spec, there are only about 60 types of pins. Only three pins (LINT[0:1], SMI#) are used to signal all interrupts, even on systems with dozens of devices.

Remember, these addresses are purely physical addresses as all virtual memory translation is internal to the CPU. There are two types of addresses known to the northbridge: memory and IO space. The latter are generated by the in/out asm instructions and merely result in a special value being written to the address lines on the next clock cycle after the address is sent. IO space addresses are typically used for device configuration or legacy devices.

The northbridge is relatively dumb compared to the CPU. It is like a traffic cop, directing the CPU’s accesses to devices or RAM. Likewise, when a device on the southbridge wants to access RAM via DMA, the northbridge merely routes the request to the correct location. It maintains a map, set during PCI configuration, which says something like “these address ranges go to the southbridge, these others go to the integrated video”.

With integrated peripherals, PCI is no longer a bus; it’s merely a protocol. There is no set of PCI bus lines within your southbridge that are hooked to the USB and Ethernet components of the chip. Instead, only PCI configuration remains in common with external devices on a PCI bus. PCI configuration is merely a set of IO port reads/writes to walk the logical device hierarchy, programming the northbridge with which regions it decodes to which device. It’s setting up the table for the traffic cop.
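
For the curious, the legacy configuration mechanism is just two IO ports: write a bus/device/function/register selector to 0xCF8, then read the data back from 0xCFC. The sketch below shows it on Linux/x86; it needs root for iopl, and in practice you would use libpci or /proc/bus/pci instead.

#include <stdio.h>
#include <stdint.h>
#include <sys/io.h>                  /* iopl, inl, outl (Linux, x86) */

#define PCI_CONFIG_ADDR 0xCF8
#define PCI_CONFIG_DATA 0xCFC

/* Read one 32-bit PCI configuration register via the legacy port mechanism. */
static uint32_t pci_config_read(unsigned bus, unsigned dev, unsigned fn,
                                unsigned reg)
{
    uint32_t addr = 0x80000000u | (bus << 16) | (dev << 11) | (fn << 8)
                  | (reg & 0xFC);
    outl(addr, PCI_CONFIG_ADDR);     /* select bus/device/function/register */
    return inl(PCI_CONFIG_DATA);     /* read the selected dword */
}

int main(void)
{
    if (iopl(3) < 0) {
        perror("iopl");              /* must be root to access IO ports */
        return 1;
    }
    /* Register 0 of bus 0, device 0, function 0 is the vendor/device ID,
     * typically the host bridge in the northbridge. */
    uint32_t id = pci_config_read(0, 0, 0, 0);
    printf("00:00.0 vendor %04x device %04x\n", id & 0xFFFF, id >> 16);
    return 0;
}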

Next time, we’ll examine the advent of IOMMUs and DEVs/NoDMA tables.
