Smart meter crypto flaw worse than thought

Travis Goodspeed has continued finding flaws in TI microcontrollers, branching out from the MSP430 to ZigBee radio chipsets. A few days ago, he posted a flaw in the random number generator. Why is this important? Because the MSP430 and ZigBee are found in many wireless sensor systems, including most Smart Meters.

Travis describes two flaws: the PRNG is a 16-bit LFSR and it is not seeded with very much entropy. However, the datasheet recommends this random number generator be used to create cryptographic keys. It’s extremely scary to find such a poor understanding of crypto in a device capable of forging billing records or turning off the power to your house.

The first flaw is that the PRNG is not cryptographically secure. The entropy pool is extremely small (16 bits), which can be attacked with a brute-force search in a fraction of a second, even if used with a secure PRNG such as Yarrow. Also, the PRNG is never re-seeded, which could have helped if implemented properly.

Even if the entropy pool was much larger, it would still be vulnerable because an LFSR is not a cryptographically-secure PRNG. An attacker who has seen some subset of the output can recreate the LFSR taps (even if they’re secret) and then generate any future sequence from it.
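
To make that concrete, here is a minimal sketch of the brute-force attack against a generic 16-bit LFSR. The tap positions and output format are assumptions chosen for illustration, not the actual TI/Z-Stack generator; the point is only how little work 16 bits of state requires.

# Minimal sketch: predicting a 16-bit LFSR after seeing a little output.
# Taps and output format are illustrative assumptions, not TI's actual PRNG.

def lfsr16(state):
    """One step of a Fibonacci LFSR with taps at bits 15, 13, 12, 10."""
    bit = ((state >> 15) ^ (state >> 13) ^ (state >> 12) ^ (state >> 10)) & 1
    return ((state << 1) | bit) & 0xFFFF

def output_bits(state, n):
    """Return n low bits of successive states as the 'random' output."""
    bits = []
    for _ in range(n):
        state = lfsr16(state)
        bits.append(state & 1)
    return bits

# Victim generates a "random" stream from a secret 16-bit seed.
secret_seed = 0xACE1
observed = output_bits(secret_seed, 32)   # attacker sees only these 32 bits

# Attacker brute-forces all 2^16 seeds -- a few seconds even in pure Python.
for guess in range(1, 1 << 16):
    if output_bits(guess, 32) == observed:
        print("Recovered seed: %#06x" % guess)
        # Now every future output is predictable:
        print("Next 16 bits:", output_bits(guess, 48)[32:])
        break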

The second problem is that it is seeded from a random source that has very little entropy. Travis produced a frequency count graph for the range of values returned by the random source, ADCTSTL, a radio register. As you can see from that graph, a few 8-bit values are returned many times (clustered around 0 and 100) and some are not returned at all. This bias could be exploited even if it was used with a cryptographically-secure PRNG.

These problems are each enough to make the system trivially insecure to a simple brute-force attack, as Travis points out. However, it gets worse because the insecure PRNG is used with public-key crypto. The Z-Stack library includes ECC code written by Certicom. I have not reviewed that code, but it seems reasonable to use a library from a company that employs cryptographers. But the ECC code makes the critical mistake of leaving implementation of primitives such as the PRNG up to the developer. Other libraries (such as OpenSSL, Mozilla’s NSS, and Microsoft’s Crypto API) all have their own PRNG, even if seeding it has to be left up to the developer. That at least reduces the risk of PRNG flaws.

ECC, like other public key crypto, falls on its face when the design spec is violated. In particular, ECDSA keys are completely exposed if even a few bits of the random nonce are predictable. Even if the keys were securely generated at the factory during manufacturing, a predictable PRNG completely exposes them in the field. Since this kind of attack is based on poor entropy, it would still be possible even if TI replaced their PRNG with one that is cryptographically secure.
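
To see why, consider the simplest case, where one per-signature nonce k is fully known. An ECDSA signature (r, s) satisfies s = k^-1 * (h + r*d) mod n, so the private key d falls straight out of the algebra. The toy sketch below only demonstrates that arithmetic; the numbers are made up, and r is treated as a given value rather than derived from the curve point k*G. Partial knowledge of k across several signatures can be exploited with lattice techniques, which are not shown here.

# Toy demonstration of ECDSA key recovery from a known nonce.
# All values are small illustrative numbers; n would really be the curve
# group order and r the x-coordinate of k*G. Requires Python 3.8+ for
# modular inverses via pow(x, -1, n).

n = 0xFFFFFFFB      # a prime standing in for the group order
d = 0x12345         # the signer's private key (secret)
h = 0xCAFE          # hash of the signed message
k = 0xBEEF          # the per-signature nonce the attacker predicts
r = 0x54321         # normally derived from k*G; treated as given here

# Signer computes s = k^-1 * (h + r*d) mod n
s = pow(k, -1, n) * (h + r * d) % n

# Attacker who knows (r, s), h, and k recovers the private key:
recovered = (s * k - h) * pow(r, -1, n) % n
assert recovered == d
print(hex(recovered))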

Given that these chips are used in critical infrastructure such as smart meters and this attack can be mounted remotely, it is important that it be fixed carefully. The fix will be difficult since it requires hardware changes to the entropy source, and there is already an unknown number of devices in the field. Once again, crypto proves fragile and thorough review is vital.

Timing-independent array comparison

I find it interesting that people tend to follow the same stages of awareness when first hearing about timing attacks. A recent discussion about Django redesigning their cookie scheme shows people at the various stages.

1. The timing difference is too small to attack

This is the result of not understanding statistics or how jitter can be modeled and filtered. Usually, people are thinking that a single measurement for a given input is probably obscured by noise. This is correct. That’s why you take many measurements for the same input and average them. The more noise, the more measurements you need. But computers are fast, so it doesn’t take long to do thousands or even millions of runs per guessed byte.

The basic statistical method is simple. You apply a hypothesis test comparing the mean of all runs with a given guess for a byte (say, 0) against the mean of all runs with the other guesses (1, 2, …, 255). If there is a significant difference in the means, then you have guessed that byte correctly. If there's no difference, repeat for the other 255 values. If there's still no difference, take more measurements.
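
Here is a rough sketch of that procedure using simulated measurements and only the standard library. Welch's t-statistic is one simple way to test for a difference in means; the leak size, jitter, and run count below are made-up numbers for illustration.

# Sketch of the hypothesis test over simulated timing measurements.
# Leak size and noise level are made-up numbers for illustration.
import random
import statistics

LEAK_NS = 200.0       # assumed extra time when the guessed byte is right
NOISE_NS = 2000.0     # assumed jitter (standard deviation)
SECRET_BYTE = 0x42
RUNS = 5000           # more noise -> take more measurements

def measure(guess):
    """Simulated timing samples for one guessed byte value."""
    base = 100000.0 + (LEAK_NS if guess == SECRET_BYTE else 0.0)
    return [random.gauss(base, NOISE_NS) for _ in range(RUNS)]

def t_stat(a, b):
    """Welch's t-statistic for the difference in means of a and b."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / (va / len(a) + vb / len(b)) ** 0.5

samples = {g: measure(g) for g in range(256)}

# Pick the guess with the largest mean, then test it against all the rest.
candidate = max(samples, key=lambda g: statistics.mean(samples[g]))
rest = [x for g in samples if g != candidate for x in samples[g]]
t = t_stat(samples[candidate], rest)
print("candidate %#04x, t = %.1f" % (candidate, t))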

The other misconception is that jitter is too great to get anything useful out of the measurements, especially in a network. There is an excellent paper by Crosby and Wallach that debunks this myth by carefully analyzing noise and its causes as well as how to filter it. They conclude that an attacker can reliably detect processing differences as low as 200 nanoseconds on a LAN or 30 microseconds on a WAN given only 1000 measurements. If your server is hosted somewhere an attacker can also buy rackspace, then you are vulnerable to LAN timing attacks.

2. I’ll just add a random delay to my function

Then I’ll take more measurements. See above.

3. I’ll add a deadline to my function and delay returning until it is reached

This usually has higher overhead and is more prone to error than implementing your function so that its timing does not vary based on the input data. If you make a mistake in your interval calculation, or an attacker is able to “stun” your function somewhere in the middle so that all the target computations occur after the interval timer has expired, then you’re vulnerable again. This is similar to avoiding buffer overflow attacks by lengthening the buffer; better to fix the actual problem instead.

4. Ok, ok, I developed a constant-time function

How carefully did you analyze it? After citing my Google talk on crypto flaws, here was one attempt from the Reddit thread:

def full_string_cmp(s1, s2):
    total = 0
    for a, b in zip(s1, s2):
        total += (a != b)
    return not total

The first bug is that this opens a worse hole if the two strings are not the same length. In fact, an attacker can send in a zero-length string, and the result of zip() is then empty. This means the for loop never executes and any input is accepted as valid! The fix is to check the two lengths first to make sure they’re equal, or in C, to compare two fixed-length arrays that are guaranteed to be the same size.

The next bug is smaller, but still valid. There’s a timing leak in the comparison of a and b. When you write:

total += (a != b)

Both Python and a C compiler actually generate low-level instructions equivalent to:

if (a != b)
    tmp = 1
else
    tmp = 0
total += tmp

Thus, there is still a small timing leak. If they are equal, an additional branch instruction is executed. If they are not equal, this is skipped. A common intuitive mistake among programmers is that single lines of code are atomic. This is why you might find a C version of this such as “total += (a != b) ? 1 : 0”, which generates the same code. Just because it fits on a single line does not mean it is atomic or constant-time.

5. Ok, I fixed those bugs. Are we done yet?

Let’s see what we have now (adapted from my keyczar posting).

if len(userMsg) != len(correctValue):
    return False
result = 0
for x, y in zip(userMsg, correctValue):
    result |= ord(x) ^ ord(y)
return result == 0

This now uses a bitwise operator (XOR) instead of a logical compare (!=), and thus should be constant-time. The += operator could possibly have a leak if carries take longer than an add without carry, so we went with |= instead. We check the lengths and abort if they don’t match. This does leak timing information about the length of the correct value, so we have to make sure that is not a problem. Usually it’s not, but if these were passwords instead of cryptographic values, it would be better to hash them with PBKDF2 or bcrypt instead of working with them directly. But are we done?

Maybe. But first we need to figure out if our high-level language has any behavior that affects the timing of this function. For example, what if there is sign-bit handling code in the Python interpreter that behaves differently if x or y is negative? What if the zip() operator has an optimization that compares two arrays first to see if they’re identical before returning their union?

The answer is you can never be sure. In C, you have more assurance but still not enough. For example, what if the compiler optimization settings change? Is this still safe? What about the microarchitecture of the CPU? What if Intel came out with a new byte-wise cache alignment scheme in the Pentium 1000?

I hope this has convinced some people that side channel attacks are not easily solved. Important code should be carefully reviewed to have higher assurance this class of attack has been mitigated.

Interesting talks at 26c3

I hope to attend a CCC event some day. While there are many great talks in the 26c3 schedule, here are some talks that look particularly interesting.

Others that may be interesting but haven’t posted slides or papers yet:

Hope everyone at 26c3 has a great time. Best wishes for a safe and secure 2010.

Just another day at the office

The following does not take place over a single 24-hour period. But any one of these situations is a good example of a typical day at Root Labs.

Attack Windows device driver protection scheme

Certain drivers in Windows must implement software protection (PMP) in order to prevent audio/video ripping attacks. Since drivers run in ring 0, they can do a lot more than just the standard SEH tricks. It’s 1992 all over again as we dig into the IDT and attempt to find how they are trying to protect their decryption process.

Whitebox cryptography is a method of combining a key and cipher implementation to create a keyed cipher. It can only do whatever operation it was initialized with and uses a hard-coded key. The entire operation becomes a series of table lookups. The intermediate values are obscured by randomness merged into the tables. But it’s not impossible to defeat.
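
As a toy illustration of the table-lookup idea (nothing like the real implementation under attack), a keyed step such as S[x XOR k] can be precomputed into a single table so the key byte never appears in the running code. Real whitebox designs also wrap every table in random input/output encodings, which this sketch omits.

# Toy sketch of whitebox-style table merging (not a real whitebox design).
# A keyed lookup S[x ^ k] is baked into one table T so the key byte k
# never appears in the running code.
import random

random.seed(1)
SBOX = list(range(256))
random.shuffle(SBOX)               # stand-in for a cipher's S-box

def build_table(k):
    """Done once, offline, by whoever knows the key."""
    return [SBOX[x ^ k] for x in range(256)]

SECRET_KEY_BYTE = 0x5A
T = build_table(SECRET_KEY_BYTE)   # only T ships with the program

def whitebox_step(x):
    """At runtime the keyed operation is just a table lookup."""
    return T[x]

# Same result as computing with the key directly:
assert all(whitebox_step(x) == SBOX[x ^ SECRET_KEY_BYTE] for x in range(256))
print(hex(whitebox_step(0x12)))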

To get at the cipher though, we’re going to need a way to bypass some of these anti-debugging traps. Since ring 0 code has direct access to all the CPU’s registers (including the debug registers, MSRs, and page tables), it is free to wreak havoc with our attempts to circumvent it. A common approach is to disable any breakpoints in the registers by overwriting them. A less common method is to use the debug registers as a local procedure call gate so that an attacker that writes to them breaks the main loop.

We write our own hook and patch into the int 1 handler at a point that isn’t integrity-checked and set the GD bit. This causes our hook to be called whenever anyone writes or reads the debug registers. The trap springs! There’s the heart of the protection code right there. Now to figure out what it’s doing.

Design self-reinforcing integrity checks

Preventing attackers from patching your code is very difficult. One approach is to insert small hash functions that verify a region of code or data has not been changed. Of course, there’s the obvious problem of “who watches the watcher?” Since it’s easy for attackers to NOP out the checksum routine or modify their patch to compensate if you use CRC, our mission today is to design and implement a more robust approach.

First, we analyze the general problem of mutually-reinforcing checks. The code and data for the check itself both need to be covered. But if two checks exactly mirrored each other, you’d have a chicken-and-egg problem in building them since a change in one would require changes in the other, and so on. It’s like standing between two mirrors, infinite recursion.

A data structure that describes the best we can do is a directed acyclic graph. There are no cycles, so the checks (and checksums) can be generated in reverse order. But the multiple paths through the graph provide overlapping coverage, so we can be certain that a single patch cannot bypass the protection. Of course, the roots of each of these paths need to be hidden throughout the code. Putting them all in one big loop run by a single watchman thread would be a mistake. Now we have to come up with a way of automatically generating, randomizing, and inserting them into the code while also hiding the root nodes in various places.
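
Here is a minimal sketch of the reverse-order generation idea; the regions, edges, and hash choice are illustrative stand-ins, not our production tooling. Each guarded region is a node, the check edges form a DAG, and the expected digests are filled in leaves-first so every check can cover the code and stored digests of the checks downstream of it.

# Sketch: generating mutually-reinforcing checks over a DAG in reverse order.
# Regions and edges are toy stand-ins; real checks would store their digests
# inside the guarded code itself rather than in a separate dict.
import hashlib

# region -> bytes it contains
regions = {
    "A": b"entry-point code",
    "B": b"license check",
    "C": b"decryption loop",
    "D": b"output routine",
}

# check edges: verifier -> regions it covers (must form a DAG)
covers = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["D"],
    "D": [],
}

def digest(name, stored):
    """Hash a region together with the digests it stores for its targets."""
    h = hashlib.sha256(regions[name])
    for target in covers[name]:
        h.update(stored[target])
    return h.digest()

# Generate expected digests in reverse topological order (leaves first),
# so each verifier's stored values exist before anything that covers it.
stored = {}
for name in ["D", "C", "B", "A"]:
    stored[name] = digest(name, stored)

# At runtime, scattered watchers recompute and compare; a patch to any
# region changes its digest and every digest on a path that covers it.
assert all(digest(n, stored) == stored[n] for n in regions)
print("all checks pass")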

Reverse-engineer RFID transponder

What exactly is in those FasTrak transponders? Do they securely identify themselves via cryptographic challenge/response? Do they preserve your privacy? What about this rumor that they are tracked by antennas all over Bay Area freeways?

Not being an existing user, we bought one at a local supermarket and took it apart. Inside was a microcontroller, some passive electronics, and not a whole lot else. An older one turned out to still have JTAG enabled, so it was simple to dump its firmware. The newer one did have the lock bit set, and Flylogic Engineering was kind enough to decap it and zap it for us. The firmware was identical. Does IDA have an MSP430 plugin? Well, there was one but it was on Geocities and then vaporized. Time to dig around a bit. Find the source code and hack it to work with the newer SDK.

Then it’s off to drop in the code and analyze it for switch statements (protocol handlers usually). Trace everything that starts from the IO pins that map to the receive side of the antenna. Then manually walk up the stack to see where it goes. Hey, there’s an over-the-air update function in here. Just send the magic handshake, then the next packet gets written to flash. There are some checks to try to maintain the writes within the area that stores the ID. But there are multiple paths to this function and one of them is obviously not tested because it pushes the wrong size argument on the stack (a 2-byte instead of 1-byte length argument). Time to notify the agency that 1 million transponders are at risk of a permanent DoS at best and code execution at worst.

Coda

If you’ve made it this far, you might be interested to know we’re hiring. We tackle difficult security problems in environments with clever, persistent adversaries. We’re just as likely to design a system as attack it (and often do both for the same customer). If this sounds like your kind of job, please see here for more details.

Why RSA encryption padding is critical

Public key cryptosystems like RSA are often taught by describing how they work at a high level. Even an algorithm textbook written by one of the inventors of RSA focuses more on exponentiation and how the decryption operation inverts ciphertext than on the underlying security details. However, all public key algorithms have various requirements that are not part of the basic primitive but are vital for security.

RSA requires the plaintext to be armored during encryption/signing and the result to be verified during decryption/verification. Unfortunately, this armoring is commonly called “padding”, which means some implementers think it functions like ordinary protocol padding. The interoperability principle (“be strict in what you send and lenient in what you accept”) is exactly opposite how public key crypto must be implemented. Padding cannot be ignored and if even one bit is out of place, the message is invalid. Failure to implement all the steps correctly could allow attackers to forge signatures, decrypt ciphertext, or even recover the private key.

I’ve written before about how failure to verify a signature result properly could result in forged signatures or recovery of the private key. This time, I’d like to explain an attack that shows why encryption armoring works differently than signature armoring.

Most RSA implementations use an optimization called the Chinese Remainder Theorem or CRT. With CRT, the implementation performs the exponentiation m^e mod n as two separate operations, m^e mod p and m^e mod q. The two parts are combined at the end. Since p and q are each about half the size of n (which is p * q), this speeds up the exponentiation a lot because the multi-word numbers are smaller.
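
For the curious, here is a small numeric sketch of the split-and-recombine step with toy primes and no armoring: do the exponentiation mod p and mod q separately, then rebuild the result mod p*q with the CRT (Garner's formula is one common way to recombine).

# Toy sketch of RSA's CRT split/recombine (tiny primes, no armoring).
# Requires Python 3.8+ for pow(x, -1, m) modular inverses.
p, q = 1009, 1013
n = p * q
e = 3
m = 123456 % n

# Full-size exponentiation:
full = pow(m, e, n)

# CRT: do the work in the two half-size rings...
mp = pow(m, e, p)
mq = pow(m, e, q)

# ...then recombine with Garner's formula: x = mq + q * ((mp - mq) * q^-1 mod p)
q_inv = pow(q, -1, p)
h = ((mp - mq) * q_inv) % p
combined = mq + q * h

assert combined == full
print(combined)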

CRT can also be used to attack RSA. Consider an implementation that does not use any armoring: the encryption operation is simply the RSA primitive itself. A sender wants to send the same message m to three separate recipients, whose public moduli are A, B, and C. Thus, the sender calculates:

m^e mod A
m^e mod B
m^e mod C

The attacker can’t calculate the inverse of one of these encryptions directly because the e-th root problem in each ring is difficult. However, because the message is the same for each recipient (even though the ciphertexts differ), she can move these operations into a setting where the inverse operation is easy. To do this, she uses CRT to combine the three ciphertexts to get:

m^e mod A*B*C

Since m is smaller than each of A, B, and C, m^e is smaller than A*B*C when e = 3. The combined value is therefore just m^3, with no modular reduction ever taking place, so the attacker only has to compute an ordinary integer cube root to recover m. This shows why PKCS #1 armoring for encryption has always been randomized. It can be fatal to encrypt the same message multiple times, even to the same recipient. For signatures, it is more secure to randomize the padding as in RSASSA-PSS, but it is not yet fatal for legacy systems to continue using PKCS #1 v1.5 signature padding, which is not randomized.
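
Here is a toy sketch of that broadcast attack (often attributed to Håstad) with small made-up moduli; real moduli are of course enormous, and randomized padding exists precisely to prevent this. Combine the three ciphertexts with the CRT, then take an ordinary integer cube root.

# Toy sketch of the e=3 broadcast attack on unpadded RSA.
# Requires Python 3.8+ for pow(x, -1, m) modular inverses.
from functools import reduce

e = 3
# Three recipients' public moduli (toy values, pairwise coprime; real
# RSA moduli are 1024+ bits).
A = 9967 * 9973
B = 9931 * 9949
C = 9887 * 9901

def crt(residues, moduli):
    """Chinese Remainder Theorem for pairwise-coprime moduli."""
    M = reduce(lambda x, y: x * y, moduli)
    x = 0
    for r, mod in zip(residues, moduli):
        Mi = M // mod
        x += r * Mi * pow(Mi, -1, mod)
    return x % M

def icbrt(x):
    """Exact integer cube root by binary search."""
    lo, hi = 0, 1 << (x.bit_length() // 3 + 2)
    while lo < hi:
        mid = (lo + hi) // 2
        if mid ** 3 < x:
            lo = mid + 1
        else:
            hi = mid
    return lo

m = 42424242                      # the same unpadded message to everyone
c1, c2, c3 = pow(m, e, A), pow(m, e, B), pow(m, e, C)

combined = crt([c1, c2, c3], [A, B, C])
# m^3 < A*B*C, so "combined" is just m^3 over the integers.
recovered = icbrt(combined)
assert recovered == m
print(recovered)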

In public key crypto, padding is not an optional feature; it is a critical part of the cryptosystem’s security. The latest version of PKCS #1 (v2.1 as of this writing) should be used for both encryption/signing and decryption/verification. For new implementations, use the approaches that first appeared in v2.0 (RSAES-OAEP for encryption and RSASSA-PSS for signing). Failure to properly manage RSA armoring could allow attackers to forge signatures, decrypt ciphertext, or even recover your private key.

Awesome C64 visual debugger

I recently ran across a new visual debugger for C64 emulators called ICU64, written by Mathfigure. It will be released later this year and provides some amazing visualizations of both memory access and memory-mapped devices like the video and sound chip. See the video below for how it works.

There is also another video showing how the classic game Boulder Dash can be expanded to span multiple screens. He implemented this by grabbing graphics from the same RAM areas the game uses and displaying them on a custom screen. On this forum thread, the author discusses his goal of allowing more graphical expandability for classic games.

Learning to program on a small machine with a full view of every memory location is something I’ve always seen as a great way to start. I got a lot of insight into how my computer worked by running a background task that continually copied pages of RAM to screen, so you could “scroll through” memory. For example, the tick counter showed up as a blur.

I’ve always said that my kids would have to learn to program a C64 before they got a more modern system. Wouldn’t this be a great tool for teaching anyone new to low-level computing how assembly language and memory-mapped devices work? What about adapting some of these techniques to fuzzing or modern debuggers?

[Edit: added a link to the project now that it has been released]