Next Baysec: March 2 at Kate O’Briens

February 22, 2010 ~ Nate Lawson

The next Baysec meeting is Tuesday March 2nd at Kate O’Briens. Come out and meet fellow security people from all over the Bay Area. As always, this is not a sponsored meeting, there is no agenda or speakers, and no RSVP is needed.

See you Tuesday, March 2, 7-11 pm. We’ll be towards the back.

Kate O’Briens
579 Howard St. @ 2nd, San Francisco
(415) 882-7240

Reverse-engineering a smart meter

February 15, 2010February 12, 2010 ~ Nate Lawson ~ 40 Comments

In 2008, a nice man from PG&E came out to work on my house. He installed a new body for the gas meter and said someone would come by later to install the electronics module to make it a “smart meter“. Since I work with security for embedded systems, this didn’t sound very exciting. I read up on smart meters and found they not only broadcast billing information (something I consider only a small privacy risk) but also provide remote control. A software bug, typo at the control center, or hacker could potentially turn off my power and gas. But how vulnerable was I actually?

I decided to look into how smart meters work. Since the electronics module never was installed, I called up various parts supply houses to try to buy one. They were quite suspicious, requesting company background info and letterhead before deciding if they could send an evaluation sample. Even though this was long before IOActive outed smart meter flaws to CNN, they had obviously gotten the message that these weren’t just ordinary valves or pipes.

Power, gas, and water meters have a long history of tampering attacks. People have drilled into them, shorted them out, slowed them down, and rewired them to run backwards. I don’t think I need to mention that doing those kinds of things is extremely dangerous and illegal. This history is probably why the parts supplier wasn’t eager to sell any smart meter boards to the public.

There’s always an easier way. By analyzing the vendor’s website, I guessed that they use the same radio module across product lines and other markets wouldn’t be so paranoid. Sure enough, the radio module for a water meter made by the same vendor was available on Ebay for $30. It arrived a few days later.

The case was hard plastic to prevent water damage. I used a bright light and careful tapping to be sure I wasn’t going to cut into anything with the Dremel. I cut a small window to see inside and identified where else to cut. I could see some of the radio circuitry and the battery connector.

After more cutting, it appeared that the battery was held against the board by the case and had spring-loaded contacts (see above). This would probably zeroize the device’s memory if it was cut open by someone trying to cheat the system. I applied hot glue to hold the contacts to the board and then cut away the rest of the enclosure.

Inside, the board had a standard MSP430F148 microcontroller and a metal cage with the radio circuitry underneath. I was in luck. I had previously obtained all the tools for working with the MSP430 in the Fastrak transponder. These CPUs are popular in the RFID world because they are very low power. I used the datasheet to identify the JTAG pinouts on this particular model and found the vendor even provided handy pads for them.

Since the pads matched the standard 0.1″ header spacing, I soldered a section of header directly to the board. For the ground pin, I ran a small wire to an appropriate location found with my multimeter. Then I added more hot glue to stabilize the header. I connected the JTAG cable to my programmer. The moment of truth was at hand — was the lock bit set?

Not surprisingly (if you read about the Fastrak project), the lock bit was not set and I was able to dump the firmware. I loaded it into the IDA Pro disassembler via the MSP430 CPU plugin. The remainder of the work would be to trace the board’s IO pins to identify how the microcontroller interfaced with the radio and look for protocol handling routines in the firmware to find crypto or other security flaws.

I haven’t had time to complete the firmware analysis yet. Given the basic crypto flaws in other smart meter firmware (such as Travis Goodspeed finding a PRNG whose design was probably drawn in crayon), I expect there would be other stomach-churning findings in this one. Not even taking rudimentary measures such as setting the lock bit does not bode well for its security.

I am not against the concept of smart meters. The remote reading feature could save a lot of money and dog bites with relatively minimal privacy exposure, even if the crypto was weak. I would be fine if power companies offered an opt-in remote control feature in exchange for lower rates. Perhaps this feature could be limited to cutting a house’s power to 2000 watts or something.

However, something as important as turning off power completely should require a truck roll. A person driving a truck will not turn off the mayor’s power or hundreds of houses at once without asking questions. A computer will. Remote control should not be a mandatory feature bundled with remote reading.

PS3 hypervisor exploit reproduced

February 8, 2010 ~ Nate Lawson ~ 4 Comments

There’s a nice series of articles by xorloser on reproducing the recent PS3 hypervisor hack. He used a microcontroller to send the glitch and improved the software exploit to work on multiple firmware revisions. Here’s a picture of his final setup.

It remains to be seen what security measures Sony has taken to address a hypervisor compromise. One countermeasure would be to lock down the OtherOS environment, since the attack depends on the ability to manipulate low-level OS memory structures. They could be using a simpler hypervisor than the GameOS side (say, one that just prevents access to the GPU). Perhaps the SPEs have a disable bit that turns off the hardware decryption unit, and the hypervisor does this before booting OtherOS.

Beyond this, they may not be using a single global key that is shared amongst all SPEs. Broadcast encryption schemes have long been used in the pay TV industry to allow fine-grained revocation of keys that have leaked. They work by embedding a subset of keys from a matrix or tree in each device. If the keys leak, they can be excluded from subsequent software releases. This requires attackers to keep extracting keys and discarding the devices as they are revoked.

Also, it’s possible there are software protection measures in place. For example, the SPE could request hashes of regions of the calling hypervisor and use this to detect patching. This results in a cat-and-mouse game where firmware updates (or even individual games) use different methods of detecting attackers. Meanwhile, attackers would try to come up with new ways to avoid these countermeasures. This has already been happening in the Xbox 360 world, as well as with nearly every other game console before now.

We’ll have to wait and see if Sony used this kind of defense-in-depth and planned for this eventuality or built a really tall wall with nothing more behind it.

How the PS3 hypervisor was hacked

January 27, 2010January 27, 2010 ~ Nate Lawson ~ 107 Comments

George Hotz, previously known as an iPhone hacker, announced that he hacked the Playstation 3 and then provided exploit details. Various articles have been written about this but none of them appear to have analyzed the actual code. Because of the various conflicting reports, here is some more analysis to help understand the exploit.

The PS3, like the Xbox360, depends on a hypervisor for security enforcement. Unlike the 360, the PS3 allows users to run ordinary Linux if they wish, but it still runs under management by the hypervisor. The hypervisor does not allow the Linux kernel to access various devices, such as the GPU. If a way was found to compromise the hypervisor, direct access to the hardware is possible, and other less privileged code could be monitored and controlled by the attacker.

Hacking the hypervisor is not the only step required to run pirated games. Each game has an encryption key stored in an area of the disc called ROM Mark. The drive firmware reads this key and supplies it to the hypervisor to use to decrypt the game during loading. The hypervisor would need to be subverted to reveal this key for each game. Another approach would be to compromise the Blu-ray drive firmware or skip extracting the keys and just slave the decryption code in order to decrypt each game. After this, any software protection measures in the game would need to be disabled. It is unknown what self-protection measures might be lurking beneath the encryption of a given game. Some authors might trust in the encryption alone, others might implement something like SecuROM.

The hypervisor code runs on both the main CPU (PPE) and one of its seven Cell coprocessors (SPE). The SPE thread seems to be launched in isolation mode, where access to its private code and data memory is blocked, even from the hypervisor. The root hardware keys used to decrypt the bootloader and then hypervisor are present only in the hardware, possibly through the use of eFUSEs. This could also mean that each Cell processor has some unique keys, and decryption does not depend on a single global root key (unlike some articles that claim there is a single, global root key).

George’s hack compromises the hypervisor after booting Linux via the “OtherOS” feature. He has used the exploit to add arbitrary read/write RAM access functions and dump the hypervisor. Access to lv1 is a necessary first step in order to mount other attacks against the drive firmware or games.

His approach is clever and is known as a “glitching attack“. This kind of hardware attack involves sending a carefully-timed voltage pulse in order to cause the hardware to misbehave in some useful way. It has long been used by smart card hackers to unlock cards. Typically, hackers would time the pulse to target a loop termination condition, causing a loop to continue forever and dump contents of the secret ROM to an accessible bus. The clock line is often glitched but some data lines are also a useful target. The pulse timing does not always have to be precise since hardware is designed to tolerate some out-of-spec conditions and the attack can usually be repeated many times until it succeeds.

George connected an FPGA to a single line on his PS3’s memory bus. He programmed the chip with very simple logic: send a 40 ns pulse via the output pin when triggered by a pushbutton. This can be done with a few lines of Verilog. While the length of the pulse is relatively short (but still about 100 memory clock cycles of the PS3), the triggering is extremely imprecise. However, he used software to setup the RAM to give a higher likelihood of success than it would first appear.

His goal was to compromise the hashed page table (HTAB) in order to get read/write access to the main segment, which maps all memory including the hypervisor. The exploit is a Linux kernel module that calls various system calls in the hypervisor dealing with memory management. It allocates, deallocates, and then tries to use the deallocated memory as the HTAB for a virtual segment. If the glitch successfully desynchronizes the hypervisor from the actual state of the RAM, it will allow the attacker to overwrite the active HTAB and thus control access to any memory region. Let’s break this down some more.

The first step is to allocate a buffer. The exploit then requests that the hypervisor create lots of duplicate HTAB mappings pointing to this buffer. Any one of these mappings can be used to read or write to the buffer, which is fine since the kernel owns it. In Unix terms, think of these as multiple file handles to a single temporary file. Any file handle can be closed, but as long as one open file handle remains, the file’s data can still be accessed.

The next step is to deallocate the buffer without first releasing all the mappings to it. This is ok since the hypervisor will go through and destroy each mapping before it returns. Immediately after calling lv1_release_memory(), the exploit prints a message for the user to press the glitching trigger button. Because there are so many HTAB mappings to this buffer, the user has a decent chance of triggering the glitch while the hypervisor is deallocating a mapping. The glitch probably prevents one or more of the hypervisor’s write cycles from hitting memory. These writes were intended to deallocate each mapping, but if they fail, the mapping remains intact.

At this point, the hypervisor has an HTAB with one or more read/write mappings pointing to a buffer it has deallocated. Thus, the kernel no longer owns that buffer and supposedly cannot write to it. However, the kernel still has one or more valid mappings pointing to the buffer and can actually modify its contents. But this is not yet useful since it’s just empty memory.

The exploit then creates a virtual segment and checks to see if the associated HTAB is located in a region spanning the freed buffer’s address. If not, it keeps creating virtual segments until one does. Now, the user has the ability to write directly to this HTAB instead of the hypervisor having exclusive control of it. The exploit writes some HTAB entries that will give it full access to the main segment, which maps all of memory. Once the hypervisor switches to this virtual segment, the attacker now controls all of memory and thus the hypervisor itself. The exploit installs two syscalls that give direct read/write access to any memory address, then returns back to the kernel.

It is quite possible someone will package this attack into a modchip since the glitch, while somewhat narrow, does not need to be very precisely timed. With a microcontroller and a little analog circuitry for the pulse, this could be quite reliable. However, it is more likely that a software bug will be found after reverse-engineering the dumped hypervisor and that is what will be deployed for use by the masses.

Sony appears to have done a great job with the security of the PS3. It all hangs together well, with no obvious weak points. However, the low level access given to guest OS kernels means that any bug in the hypervisor is likely to be accessible to attacker code due to the broad API it offers. One simple fix would be to read back the state of each mapping after changing it. If the write failed for some reason, the hypervisor would see this and halt.

It will be interesting to see how Sony responds with future updates to prevent this kind of attack.

[Edit: corrected the description of virtual segment allocation based on a comment by geohot.]

Smart meter crypto flaw worse than thought

January 11, 2010 ~ Nate Lawson ~ 10 Comments

Travis Goodspeed has continued finding flaws in TI microcontrollers, branching out from the MSP430 to ZigBee radio chipsets. A few days ago, he posted a flaw in the random number generator. Why is this important? Because the MSP430 and ZigBee are found in many wireless sensor systems, including most Smart Meters.

Travis describes two flaws: the PRNG is a 16-bit LFSR and it is not seeded with very much entropy. However, the datasheet recommends this random number generator be used to create cryptographic keys. It’s extremely scary to find such a poor understanding of crypto in a device capable of forging billing records or turning off the power to your house.

The first flaw is that the PRNG is not cryptographically secure. The entropy pool is extremely small (16 bits), which can be attacked with a brute-force search in a fraction of a second, even if used with a secure PRNG such as Yarrow. Also, the PRNG is never re-seeded, which could have helped if implemented properly.

Even if the entropy pool was much larger, it would still be vulnerable because an LFSR is not a cryptographically-secure PRNG. An attacker who has seen some subset of the output can recreate the LFSR taps (even if they’re secret) and then generate any future sequence from it.

The second problem is that it is seeded from a random source that has very little entropy. Travis produced a frequency count graph for the range of values returned by the random source, ADCTSTL, a radio register. As you can see from that graph, a few 8-bit values are returned many times (clustered around 0 and 100) and some are not returned at all. This bias could be exploited even if it was used with a cryptographically-secure PRNG.

These problems are each enough to make the system trivially insecure to a simple brute-force attack, as Travis points out. However, it gets worse because the insecure PRNG is used with public-key crypto. The Z-Stack library includes ECC code written by Certicom. I have not reviewed that code, but it seems reasonable to use a library from a company that employs cryptographers. But the ECC code makes the critical mistake of leaving implementation of primitives such as the PRNG up to the developer. Other libraries (such as OpenSSL, Mozilla’s NSS, and Microsoft’s Crypto API) all have their own PRNG, even if seeding it has to be left up to the developer. That at least reduces the risk of PRNG flaws.

ECC, like other public key crypto, falls on its face when the design spec is violated. In particular, ECDSA keys are completely exposed if even a few bits of the random nonce are predictable. Even if the keys were securely generated in the factory during the manufacturing process, a predictable PRNG completely exposes them in the field. Since this kind of attack is based on poor entropy, it would still be possible even if TI replaced their PRNG with one that is cryptographically secure.

Given that these chips are used in critical infrastructure such as smart meters and this attack can be mounted from remote, it is important that it be fixed carefully. This will be difficult to fix since it will require hardware changes to the random source of entropy, and there is already an unknown number of devices in the field. Once again, crypto proves fragile and thorough review is vital.

Timing-independent array comparison

January 7, 2010August 31, 2010 ~ Nate Lawson ~ 17 Comments

I find it interesting that people tend to follow the same stages of awareness when first hearing about timing attacks. A recent discussion about Django redesigning their cookie scheme shows people at the various stages.

1. The timing difference is too small to attack

This is the result of not understanding statistics or how jitter can be modeled and filtered. Usually, people are thinking that a single measurement for a given input is probably obscured by noise. This is correct. That’s why you take many measurements for the same input and average them. The more noise, the more measurements you need. But computers are fast, so it doesn’t take long to do thousands or even millions of runs per guessed byte.

The basic statistical method is simple. You apply a hypothesis test of the mean of all runs with a given guess for a byte (say, 0) against the mean for all the others (1, 2, …, 255) If there is a significant difference in the means, then you have guessed that byte correctly. If no difference, repeat for the other 255 values. If still no difference, take more measurements.

The other misconception is that jitter is too great to get anything useful out of the measurements, especially in a network. There is an excellent paper by Crosby and Wallach that debunks this myth by carefully analyzing noise and its causes as well as how to filter it. They conclude that an attacker can reliably detect processing differences as low as 200 nanoseconds on a LAN or 30 microseconds on a WAN given only 1000 measurements. If your server is hosted somewhere an attacker can also buy rackspace, then you are vulnerable to LAN timing attacks.

2. I’ll just add a random delay to my function

Then I’ll take more measurements. See above.

3. I’ll add a deadline to my function and delay returning until it is reached

This usually has a higher overhead and is more prone to errors than implementing your function such that its timing does not vary, based on the input data. If you make a mistake in your interval calculation or an attacker is able to “stun” your function somewhere in the middle such that all target computations occur after the interval timer has expired, then you’re vulnerable again. This is similar to avoiding buffer overflow attacks by lengthening the buffer — better to fix the actual problem instead.

4. Ok, ok, I developed a constant-time function

How carefully did you analyze it? After citing my Google talk on crypto flaws, here was one attempt from the Reddit thread:

def full_string_cmp(s1, s2):
    total = 0
    for a, b in zip(s1, s2):
        total += (a != b)
    return not total

The first bug is that this opens a worse hole if the two strings are not the same length. In fact, an attacker can send in a zero-length string and the result of zip() is an empty array. This results in the for() loop never executing and any input being accepted as valid! The fix is to check the two lengths first to make sure they’re equal or in C, compare two fixed-length arrays that are guaranteed to be the same size.

The next bug is smaller, but still valid. There’s a timing leak in the comparison of a and b. When you write:

total += (a != b)

Both Python and a C compiler actually generate low-level instructions equivalent to:

if (a != b)
    tmp = 1
else
    tmp = 0
total += tmp

Thus, there is still a small timing leak. If they are equal, an additional branch instruction is executed. If they are not equal, this is skipped. A common intuitive mistake among programmers is that single lines of code are atomic. This is why you might find a C version of this such as “total += (a != b) ? 1 : 0”, which generates the same code. Just because it fits on a single line does not mean it is atomic or constant-time.

5. Ok, I fixed those bugs. Are we done yet?

Let’s see what we have now (adapted from my keyczar posting).

if len(userMsg) != len(correctValue):
    return False
result = 0
for x, y in zip(userMsg, correctValue):
    result |= ord(x) ^ ord(y)
return result == 0

This now uses an arithmetic operator (XOR) instead of a logical compare (!=), and thus should be constant-time. The += operator could possibly have a leak if carries take longer than an add without carry, so we went with |= instead. We check the lengths and abort if they don’t match. This does leak timing information about the length of the correct value, so we have to make sure that is not a problem. Usually it’s not, but if these were passwords instead of cryptographic values, it would be better to hash them with PBKDF2 or bcrypt instead of working with them directly. But are we done?

Maybe. But first we need to figure out if our high-level language has any behavior that affects the timing of this function. For example, what if there is sign-bit handling code in the Python interpreter that behaves differently if x or y is negative? What if the zip() operator has an optimization that compares two arrays first to see if they’re identical before returning their union?

The answer is you can never be sure. In C, you have more assurance but still not enough. For example, what if the compiler optimization settings change? Is this still safe? What about the microarchitecture of the CPU? What if Intel came out with a new byte-wise cache alignment scheme in the Pentium 1000?

I hope this has convinced some people that side channel attacks are not easily solved. Important code should be carefully reviewed to have higher assurance this class of attack has been mitigated.