PC memory architecture overview

The topics of DMA protection and a new Intel/AMD feature called an IOMMU (or VT-d) are becoming more prevalent. I believe this is due to two trends: increased use of virtualization and hardware protection for DRM. It’s important to first understand how memory works in a traditional PC before discussing the benefits and issues with using an IOMMU.

DMA (direct memory access) is a general term for architectures where devices can talk directly to RAM, without the CPU being involved. In PCs, the CPU is not even notified when DMA is in progress, although some chipsets do report a little information (e.g., the bus mastering status bit, BM_STS). DMA was conceived to provide higher performance than the alternative, which is for the CPU to copy each byte of data from the device to memory (aka programmed IO). To write data to a hard drive controller via DMA, the driver running on the CPU writes the memory address of the data to the hardware and then goes on to other tasks. The drive controller finishes reading the data via DMA and generates an interrupt to notify the CPU that the write is complete.
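As a rough sketch of that flow (the register layout and names here are hypothetical, not any real controller’s interface), the driver side might look like this:

    #include <stdint.h>

    /* Hypothetical MMIO registers for an imaginary disk controller; real
     * hardware defines its own layout.  volatile keeps the compiler from
     * reordering or dropping the device accesses. */
    struct fake_dma_regs {
        volatile uint64_t buf_addr;   /* physical address of the data buffer */
        volatile uint32_t buf_len;    /* transfer length in bytes            */
        volatile uint32_t command;    /* writing 1 starts the transfer       */
    };

    void start_dma_write(struct fake_dma_regs *regs, uint64_t buf_phys,
                         uint32_t len)
    {
        regs->buf_addr = buf_phys;  /* tell the device where the data lives  */
        regs->buf_len  = len;
        regs->command  = 1;         /* kick off the DMA and return; the CPU  */
                                    /* is free until the completion interrupt */
    }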

DMA can actually be slower than programmed IO if the overhead of talking to the DMA controller to initiate the transaction takes longer than the transaction itself. This may be true for very short transfers. That’s why the original PC parallel port (LPT) doesn’t support DMA. When there are only 8 bits of data per transaction, it doesn’t make sense to spend time telling the hardware where to put the data; it’s faster to just read it yourself.

Beyond this concept of DMA, which is common to nearly all modern architectures, the PC has a particular breakdown of responsibilities among its various chips. The CPU executes code and talks to the northbridge (Intel MCH). Integrated devices like USB and Ethernet are all located in the southbridge (Intel ICH), with the exception of on-board video, which is located in the northbridge. Between each of these chips is an Intel or AMD proprietary bus, which is why your Intel CPU won’t work with your AMD chipset, even if you were to rework the socket to fit it. Your RAM is accessed only via the northbridge (Intel) or via a bus shared with the northbridge (AMD).

Interfacing with the CPU is very simple. All complexities (privilege level, paging, task management, segmentation, MSRs) are handled completely internally. On the external bus shared with the northbridge, a CPU has a set of address and data lines and a few control/status lines. Besides the power supply pins, the address and data pins are the most numerous. In the Intel quad-core spec, there are only about 60 types of pins. Only three pins (LINT[0:1], SMI#) are used to signal all interrupts, even on systems with dozens of devices.

Remember, these addresses are purely physical addresses as all virtual memory translation is internal to the CPU. There are two types of addresses known to the northbridge: memory and IO space. The latter are generated by the in/out asm instructions and merely result in a special value being written to the address lines on the next clock cycle after the address is sent. IO space addresses are typically used for device configuration or legacy devices.
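For example, the usual GCC inline-assembly wrappers for the x86 in/out instructions look like this; each call generates an IO-space cycle on the bus rather than a memory cycle:

    #include <stdint.h>

    /* 8-bit port IO on x86.  The "Nd" constraint lets GCC use an immediate
     * port number when it fits, otherwise the dx register, matching the
     * in/out instruction encodings. */
    static inline void outb(uint16_t port, uint8_t val)
    {
        __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
    }

    static inline uint8_t inb(uint16_t port)
    {
        uint8_t val;
        __asm__ volatile ("inb %1, %0" : "=a"(val) : "Nd"(port));
        return val;
    }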

The northbridge is relatively dumb compared to the CPU. It is like a traffic cop, directing the CPU’s accesses to devices or RAM. Likewise, when a device on the southbridge wants to access RAM via DMA, the northbridge merely routes the request to the correct location. It maintains a map, set during PCI configuration, which says something like “these address ranges go to the southbridge, these others go to the integrated video”.
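Conceptually (this is purely an illustration of the idea; real chipsets implement the map in decode registers and logic, not a structure like this), the traffic cop’s map is little more than a range-to-destination lookup:

    #include <stdint.h>
    #include <stddef.h>

    /* Toy model of the northbridge decode map set up during PCI configuration.
     * The example ranges are arbitrary. */
    enum target { TO_RAM, TO_SOUTHBRIDGE, TO_INTEGRATED_VIDEO };

    struct decode_range {
        uint64_t    base;
        uint64_t    limit;
        enum target dest;
    };

    static const struct decode_range decode_map[] = {
        { 0x00000000, 0x7FFFFFFF, TO_RAM },              /* main memory    */
        { 0xD0000000, 0xDFFFFFFF, TO_INTEGRATED_VIDEO }, /* video aperture */
        { 0xE0000000, 0xFFFFFFFF, TO_SOUTHBRIDGE },      /* device MMIO    */
    };

    static enum target route(uint64_t addr)
    {
        for (size_t i = 0; i < sizeof(decode_map) / sizeof(decode_map[0]); i++)
            if (addr >= decode_map[i].base && addr <= decode_map[i].limit)
                return decode_map[i].dest;
        return TO_RAM;   /* default target for anything not claimed */
    }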

With integrated peripherals, PCI is no longer a bus; it’s merely a protocol. There is no set of PCI bus lines within your southbridge that are hooked to the USB and Ethernet components of the chip. Instead, only PCI configuration remains in common with external devices on a PCI bus. PCI configuration is merely a set of IO port reads/writes to walk the logical device hierarchy, programming the northbridge with which regions it decodes to which device. It’s setting up the table for the traffic cop.
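For example, the legacy configuration mechanism is just two IO ports: write a bus/device/function/register selector to 0xCF8, then read or write the data at 0xCFC. A sketch of a config read, assuming 32-bit outl/inl wrappers in the same style as the inb/outb ones above:

    /* Read a 32-bit PCI configuration register via the legacy 0xCF8/0xCFC
     * mechanism.  outl/inl are 32-bit versions of the port-IO wrappers above. */
    uint32_t pci_config_read(uint8_t bus, uint8_t dev, uint8_t func, uint8_t reg)
    {
        uint32_t addr = (1u << 31)              /* enable bit              */
                      | ((uint32_t)bus  << 16)
                      | ((uint32_t)dev  << 11)
                      | ((uint32_t)func << 8)
                      | (reg & 0xFC);           /* dword-aligned offset    */

        outl(0xCF8, addr);                      /* select the register     */
        return inl(0xCFC);                      /* read its current value  */
    }

Walking the hierarchy is just looping over bus/device/function numbers and skipping anything whose vendor ID reads back as 0xFFFF (no device present); the base address registers programmed this way are what populate the northbridge’s decode map.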

Next time, we’ll examine the advent of IOMMUs and DEVs/NoDMA tables.

Mesh design pattern: error correction

Our previous mesh design pattern, hash-and-decrypt, requires the attacker either to run the system to completion or to reverse-engineer enough of it to limit the search space. If any bit of the input to the hash function is incorrect, the decryption key is completely wrong. This could be used, for example, in a game to unlock a subsequent level after the user has passed a number of objectives on the previous level. It could also be used with software protection to be sure a set of integrity checks or anti-debugger countermeasures has been running continuously.
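A minimal sketch of that pattern, with sha256() and aes256_decrypt() as hypothetical stand-ins for whatever primitives a real implementation would use:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical primitives. */
    void sha256(const uint8_t *data, size_t len, uint8_t out[32]);
    void aes256_decrypt(uint8_t *buf, size_t len, const uint8_t key[32]);

    /* Hash-and-decrypt: the key is never stored anywhere, only derived from
     * state accumulated while the program runs.  If any input bit is wrong,
     * the derived key (and thus the plaintext) is completely wrong. */
    void unlock_next_level(const uint8_t *progress_log, size_t log_len,
                           uint8_t *next_level, size_t level_len)
    {
        uint8_t key[32];

        sha256(progress_log, log_len, key);          /* key from progress   */
        aes256_decrypt(next_level, level_len, key);  /* garbage if any bit  */
                                                     /* of the input is off */
    }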

Another pattern that is somewhat rare is error correction. An error correcting code uses compressed redundancy to allow data that has been modified to be repaired. It is commonly used in network protocols or hard disks to handle unintentional errors but can also be useful for software protection. In this case, an attacker who patches the software or modifies its data would find that the changes have no effect as they are silently repaired. This can be combined with other techniques (e.g., anti-debugging) to require an attacker to locate all points in the mesh and disable them simultaneously. Turbo codes, LDPC, and Reed-Solomon are some commonly used algorithms.

Hashing and error correction are very similar. A cryptographic hash is analogous to a compressed form of the original data, since by design it is extremely difficult to generate a collision (two sets of data that have the same fingerprint). Instead of comparing every byte of two data sets, many systems just compare the hash. You can build a crude form of error correction by storing multiple copies of the original data and throwing out any that have an incorrect hash due to a patching attempt or other error. However, this results in bloat, and it’s relatively easy for the reverse engineer to find all copies of the identical data in memory, even if the hash is somewhat hidden.
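The crude approach might look like the following, with hash_ok() standing in for recomputing a hash of a copy and comparing it against a stored value:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical: recompute the hash of 'data' and compare to a stored value. */
    int hash_ok(const uint8_t *data, size_t len);

    /* Naive redundancy: keep several identical copies and use the first one
     * whose hash still checks out.  Simple, but bloated and easy to spot. */
    const uint8_t *pick_intact_copy(const uint8_t *copies[], size_t ncopies,
                                    size_t len)
    {
        for (size_t i = 0; i < ncopies; i++)
            if (hash_ok(copies[i], len))
                return copies[i];   /* first unpatched copy         */
        return NULL;                /* every copy was tampered with */
    }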

Turbo codes are an efficient form of error correction. To put it simply, three different chunks of data are stored: the message itself (m bits) and two parity blocks (n/2 bits each). The total storage required is m + n bits, coding data at a rate of m / (m + n). You can think of this as a sort of crossword puzzle where one parity block stores the clues for “across” and the other stores the clues for “down”. Two decoders process the parity blocks and vote on their confidence in the output bits. If the vote is inconclusive, the process iterates.
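For example, protecting a 4096-bit block (m = 4096) with two 2048-bit parity blocks (n = 4096) stores 8192 bits in total, for a rate of 4096 / 8192 = 1/2.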

[Figure: turbocode.png]

To use error correction for software protection, take a block of data or instructions that are important to security. Generate an encoded block for it using a turbo code. Now, insert a decoder in the code which calls into or processes this block of data. If an attacker patches the encoded data (say, to insert a breakpoint), the decoder will generate a repaired version of that data before using it.
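Structurally, the call site looks something like the sketch below. A real implementation would run a turbo or LDPC decoder here; a bitwise majority vote over three encoded copies is used only to keep the example short, and the buffer size and downstream function are hypothetical.

    #include <stdint.h>
    #include <stddef.h>

    #define MAX_BLOCK 4096                    /* hypothetical maximum size */

    void process_security_critical_data(const uint8_t *data, size_t len);

    /* Decode into a scratch buffer and use the repaired copy, never the
     * (possibly patched) encoded data directly.  With a per-bit majority
     * vote, a patch to any single copy is silently repaired. */
    void use_protected_block(const uint8_t *a, const uint8_t *b,
                             const uint8_t *c, size_t len)
    {
        uint8_t repaired[MAX_BLOCK];

        for (size_t i = 0; i < len; i++)
            repaired[i] = (a[i] & b[i]) | (b[i] & c[i]) | (a[i] & c[i]);

        process_security_critical_data(repaired, len);
    }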

This has a number of advantages. If the decoding is not done in-place, the attacker will not see the data being repaired, just that the patch had no effect. The parity blocks look nothing like the original data itself, so it appears there is only one copy of the data in memory. The decoder can be obfuscated in various ways and inlined to prevent it from being a single point of failure. The calling code can hash the state of the decoder as part of hash-and-decrypt so that errors are detected as well, allowing the software protection to later degrade the experience rather than fail immediately. This hides the location of the protection check (temporal distance).

Like all mesh techniques, error correction is best used in ways that are mutually reinforcing. The linker can be adapted to automatically encode data and insert the decoding logic throughout the program, based on control flow analysis. Continually-running integrity checking routines can be encoded with this approach. The more intertwined the software protection, the harder it is to bypass.

Blackhat next week

I’m headed for the Blackhat conference next week. We’ll be giving our talk on why a 100% undetectable hypervisor is impossible.

We’ll also be releasing our toolkit (“samsara”, an ongoing cycle of rebirth). This is the same code we will use for the Blue Pill challenge whenever Joanna and crew are ready. My hope is that it provides a nice implementation of the tests we’ll describe in our talk and a useful framework for other researchers to add new tests. We expect this will end the irrational fear of hypervisor rootkits and show attackers why spending their time developing one would be futile.

If you run into me, be sure to say hello.

TPM hardware attacks (part 2)

Previously, I described a recent attack on TPMs that only requires a short piece of wire. Dartmouth researchers used it to reset the TPM and then insert known-good hashes in the TPM’s PCRs. The TPM version 1.2 spec has changes to address such simple hardware attacks.
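To see why a reset is so powerful, recall that PCRs can only be extended, not set directly: each extend chains the new measurement onto the old value, so a given PCR value normally can only be reached by replaying the same sequence of measurements from power-on. A sketch of the TPM 1.1 extend operation, with sha1() as a stand-in for a real SHA-1 implementation:

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical SHA-1: hash 'len' bytes of 'data' into 'digest'. */
    void sha1(const uint8_t *data, size_t len, uint8_t digest[20]);

    /* PCR extend: new PCR = SHA-1(old PCR || measurement). */
    void pcr_extend(uint8_t pcr[20], const uint8_t measurement[20])
    {
        uint8_t buf[40];

        memcpy(buf, pcr, 20);               /* current PCR value */
        memcpy(buf + 20, measurement, 20);  /* new measurement   */
        sha1(buf, sizeof(buf), pcr);        /* chain them        */
    }

Grounding the reset line puts every PCR back to its initial value, at which point malicious software can simply replay the extends for a known-good boot without ever running that software.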

It takes a bit of work to piece together the 1.2 changes since they aren’t all in one spec. The TPM 1.2 changes spec introduces the concept of “locality”, the LPC 1.1 spec describes new firmware messages, and other information available from Google shows how it all fits together.

In the TPM 1.1 spec, the PCRs were reset when the TPM was reset, and software could write to them on a “first come, first served” basis. However, in the 1.2 spec, setting certain PCRs requires a new locality message. Locality 4 is only active in a special hardware mode, which in the PC architecture corresponds to the SENTER instruction.

Intel SMX (now “TXT”, formerly “LT”) adds a new instruction called SENTER. AMD has a similar instruction called SKINIT. This instruction performs the following steps:

  1. Load a module into RAM (usually stored in the BIOS)
  2. Lock it into cache
  3. Verify its signature
  4. Hash the module into a PCR at locality 4
  5. Enable certain new chipset registers
  6. Begin executing it

This authenticated code (AC) module then hashes the OS boot loader into a PCR at locality 3, disables the special chipset registers, and continues the boot sequence. Each time the locality level is lowered, it can’t be raised again. This means the AC module can’t overwrite the locality 4 hash and the boot loader can’t overwrite the locality 3 hash.

Locality is implemented in hardware by the chipset, using the new LPC firmware commands to encapsulate messages to the TPM. Version 1.1 chipsets will not send those commands. However, a man-in-the-middle device can be built with a simple microcontroller attached to the LPC bus. While more complex than a single wire, it’s well within the reach of modchip manufacturers.

This microcontroller would be attached to the clock, frame, and 4-bit address/data bus, 6 lines in total. While the LPC bus is idle, this device could drive the frame and A/D lines to insert a locality 4 “reset PCR” message. Malicious software could then load whatever value it wanted into the PCRs. No one has implemented this attack as far as I know, but it has been discussed numerous times.
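A very rough sketch of the injection loop on such a microcontroller follows. The pin helpers and the contents of the field sequence are hypothetical; the actual field encodings for a locality 4 message come from the LPC and TPM interface specs, and details such as turn-around and sync fields are glossed over here.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical GPIO helpers for a microcontroller wired to LCLK, LFRAME#,
     * and LAD[3:0] (6 wires total). */
    void wait_for_bus_idle(void);    /* LFRAME# high, no cycle in flight */
    void wait_for_clock_edge(void);  /* one LCLK period                  */
    void drive_lframe(int level);    /* assert/deassert LFRAME#          */
    void drive_lad(uint8_t nibble);  /* drive LAD[3:0]                   */
    void release_lines(void);        /* tri-state LFRAME# and LAD        */

    /* Inject one LPC cycle, one 4-bit field per clock.  The 'fields' array
     * would carry the start, cycle type, address, and data fields of a
     * locality 4 "reset PCR" message. */
    void inject_lpc_cycle(const uint8_t *fields, size_t count)
    {
        wait_for_bus_idle();

        drive_lframe(0);                     /* claim the bus for our cycle   */
        for (size_t i = 0; i < count; i++) {
            drive_lad(fields[i] & 0xF);
            wait_for_clock_edge();
            if (i == 0)
                drive_lframe(1);             /* LFRAME# frames only the start */
        }
        release_lines();
    }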

What is the TCG going to do about this? Probably nothing. Hardware attacks are outside their scope, at least according to their documents.

“The commands that the trusted process sends to the TPM are the normal TPM commands with a modifier that indicates that the trusted process initiated the command… The assumption is that spoofing the modifier to the TPM requires more than just a simple hardware attack, but would require expertise and possibly special hardware.”

— Proof of Locality (section 16)

This shows why drawing an arbitrary attack profile and excluding anything outside it often fails. Too often, the list of excluded attacks does not realistically match the value of the protected data, or it overestimates the cost of those attacks to attackers.

In the designers’ defense, any effort to add tamper-resistance to a PC is likely to fall short. There are too many interfaces, chips, manufacturers, and use cases involved. In a closed environment like a set-top box, security can be designed to match the only intended use for the hardware. With a PC, legacy support is very important and no single party owns the platform, despite the desires of some companies.

It will be interesting to see how TCPA companies respond to the inevitable modchips, if at all.