root labs rdist

January 11, 2010

Smart meter crypto flaw worse than thought

Filed under: Crypto, Embedded, Hacking, Hardware, RFID, Security — Nate Lawson @ 1:08 pm

Travis Goodspeed has continued finding flaws in TI microcontrollers, branching out from the MSP430 to ZigBee radio chipsets. A few days ago, he posted a flaw in the random number generator. Why is this important? Because the MSP430 and ZigBee are found in many wireless sensor systems, including most Smart Meters.

Travis describes two flaws: the PRNG is a 16-bit LFSR and it is not seeded with very much entropy. However, the datasheet recommends this random number generator be used to create cryptographic keys. It’s extremely scary to find such a poor understanding of crypto in a device capable of forging billing records or turning off the power to your house.

The first flaw is that the PRNG is not cryptographically secure. The entropy pool is extremely small (16 bits), which can be attacked with a brute-force search in a fraction of a second, even if used with a secure PRNG such as Yarrow. Also, the PRNG is never re-seeded, which could have helped if implemented properly.

Even if the entropy pool was much larger, it would still be vulnerable because an LFSR is not a cryptographically-secure PRNG. An attacker who has seen some subset of the output can recreate the LFSR taps (even if they’re secret) and then generate any future sequence from it.

The second problem is that it is seeded from a random source that has very little entropy. Travis produced a frequency count graph for the range of values returned by the random source, ADCTSTL, a radio register. As you can see from that graph, a few 8-bit values are returned many times (clustered around 0 and 100) and some are not returned at all. This bias could be exploited even if it was used with a cryptographically-secure PRNG.

These problems are each enough to make the system trivially insecure to a simple brute-force attack, as Travis points out. However, it gets worse because the insecure PRNG is used with public-key crypto. The Z-Stack library includes ECC code written by Certicom. I have not reviewed that code, but it seems reasonable to use a library from a company that employs cryptographers. But the ECC code makes the critical mistake of leaving implementation of primitives such as the PRNG up to the developer. Other libraries (such as OpenSSL, Mozilla’s NSS, and Microsoft’s Crypto API) all have their own PRNG, even if seeding it has to be left up to the developer. That at least reduces the risk of PRNG flaws.

ECC, like other public key crypto, falls on its face when the design spec is violated. In particular, ECDSA keys are completely exposed if even a few bits of the random nonce are predictable. Even if the keys were securely generated in the factory during the manufacturing process, a predictable PRNG completely exposes them in the field. Since this kind of attack is based on poor entropy, it would still be possible even if TI replaced their PRNG with one that is cryptographically secure.

Given that these chips are used in critical infrastructure such as smart meters and this attack can be mounted from remote, it is important that it be fixed carefully. This will be difficult to fix since it will require hardware changes to the random source of entropy, and there is already an unknown number of devices in the field. Once again, crypto proves fragile and thorough review is vital.

January 7, 2010

Timing-independent array comparison

Filed under: Crypto, Hacking, Network, Security, python — Nate Lawson @ 3:54 pm

I find it interesting that people tend to follow the same stages of awareness when first hearing about timing attacks. A recent discussion about Django redesigning their cookie scheme shows people at the various stages.

1. The timing difference is too small to attack

This is the result of not understanding statistics or how jitter can be modeled and filtered. Usually, people are thinking that a single measurement for a given input is probably obscured by noise. This is correct. That’s why you take many measurements for the same input and average them. The more noise, the more measurements you need. But computers are fast, so it doesn’t take long to do thousands or even millions of runs per guessed byte.

The basic statistical method is simple. You apply a hypothesis test of the mean of all runs with a given guess for a byte (say, 0) against the mean for all the others (1, 2, …, 255) If there is a significant difference in the means, then you have guessed that byte correctly. If no difference, repeat for the other 255 values. If still no difference, take more measurements.

The other misconception is that jitter is too great to get anything useful out of the measurements, especially in a network. There is an excellent paper by Crosby and Wallach that debunks this myth by carefully analyzing noise and its causes as well as how to filter it. They conclude that an attacker can reliably detect processing differences as low as 200 nanoseconds on a LAN or 30 microseconds on a WAN given only 1000 measurements. If your server is hosted somewhere an attacker can also buy rackspace, then you are vulnerable to LAN timing attacks.

2. I’ll just add a random delay to my function

Then I’ll take more measurements. See above.

3. I’ll add a deadline to my function and delay returning until it is reached

This usually has a higher overhead and is more prone to errors than implementing your function such that its timing does not vary, based on the input data. If you make a mistake in your interval calculation or an attacker is able to “stun” your function somewhere in the middle such that all target computations occur after the interval timer has expired, then you’re vulnerable again. This is similar to avoiding buffer overflow attacks by lengthening the buffer — better to fix the actual problem instead.

4. Ok, ok, I developed a constant-time function

How carefully did you analyze it? After citing my Google talk on crypto flaws, here was one attempt from the Reddit thread:

def full_string_cmp(s1, s2):
    total = 0
    for a, b in zip(s1, s2):
        total += (a != b)
    return not total

The first bug is that this opens a worse hole if the two strings are not the same length. In fact, an attacker can send in a zero-length string and the result of zip() is an empty array. This results in the for() loop never executing and any input being accepted as valid! The fix is to check the two lengths first to make sure they’re equal or in C, compare two fixed-length arrays that are guaranteed to be the same size.

The next bug is smaller, but still valid. There’s a timing leak in the comparison of a and b. When you write:

total += (a != b)

Both Python and a C compiler actually generate low-level instructions equivalent to:

if (a != b)
    tmp = 1
else
    tmp = 0
total += tmp

Thus, there is still a small timing leak. If they are equal, an additional branch instruction is executed. If they are not equal, this is skipped. A common intuitive mistake among programmers is that single lines of code are atomic. This is why you might find a C version of this such asĀ  “total += (a != b) ? 1 : 0″, which generates the same code. Just because it fits on a single line does not mean it is atomic or constant-time.

5. Ok, I fixed those bugs. Are we done yet?

Let’s see what we have now (adapted from my keyczar posting).

if len(userMsg) != len(correctValue):
    return False
result = 0
for x, y in zip(userMsg, correctValue):
    result |= ord(x) ^ ord(y)
return result

This now uses an arithmetic operator (XOR) instead of a logical compare (!=), and thus should be constant-time. The += operator could possibly have a leak if carries take longer than an add without carry, so we went with |= instead. We check the lengths and abort if they don’t match. This does leak timing information about the length of the correct value, so we have to make sure that is not a problem. Usually it’s not, but if these were passwords instead of cryptographic values, it would be better to hash them with PBKDF2 or bcrypt instead of working with them directly. But are we done?

Maybe. But first we need to figure out if our high-level language has any behavior that affects the timing of this function. For example, what if there is sign-bit handling code in the Python interpreter that behaves differently if x or y is negative? What if the zip() operator has an optimization that compares two arrays first to see if they’re identical before returning their union?

The answer is you can never be sure. In C, you have more assurance but still not enough. For example, what if the compiler optimization settings change? Is this still safe? What about the microarchitecture of the CPU? What if Intel came out with a new byte-wise cache alignment scheme in the Pentium 1000?

I hope this has convinced some people that side channel attacks are not easily solved. Important code should be carefully reviewed to have higher assurance this class of attack has been mitigated.

January 4, 2010

Next Baysec: Jan 19 at Irish Bank

Filed under: Security — Nate Lawson @ 4:14 pm

The next Baysec meeting is Tuesday, January 19, 7 pm at the Irish Bank (note: new location). Come out and meet fellow security people from all over the Bay Area. As always, this is not a sponsored meeting, there is no agenda or speakers, and no RSVP is needed.

This is our first meeting at the new location so it would be great if we could get a good turnout. We have the entire back room reserved for our group. They will keep some tables on the edges for dining and clear out
the center.

See you there!

The Irish Bank
10 Mark Lane
San Francisco, CA
415.788.7152
theirishbank.com

December 30, 2009

Side channel attacks on cryptographic software

Filed under: Crypto, PC Architecture, Security — Nate Lawson @ 4:00 pm

Below is a recent article I wrote for IEEE Security and Privacy magazine titled “Side Channel Attacks on Cryptographic Software” (pdf). It covers simple timing attacks against HMAC validation, AES cache timing, and RSA branch prediction attacks. It’s a survey article, covering excellent research in side channel attacks over the past few years.

I think the attacker position is gaining an advantage recently. As CPU microarchitecture gets more complicated, more covert channels appear. Also, the move to virtualization and high-resolution timers gives better quality measurements and more opportunities to exploit even the tiniest of leaks. We’ll need to come up with more clever ways of modeling and eliminating covert channels, moving crypto operations into hardware, and giving software more control over microarchitectural state.

Let me know what you think.

December 28, 2009

Interesting talks at 26c3

Filed under: Crypto, Embedded, Hacking, Reverse engineering, Security — Nate Lawson @ 1:00 am

I hope to attend a CCC event some day. While there are many great talks in the 26c3 schedule, here are some talks that look particularly interesting.

Others that may be interesting but haven’t posted slides or papers yet:

Hope everyone at 26c3 has a great time. Best wishes for a safe and secure 2010.

December 23, 2009

xum1541 now supports nibbler

Filed under: C64, Hardware, Retrocomputing — Nate Lawson @ 11:46 pm

One thing I like about the holidays is the chance to finish off hobby projects. Earlier this month, I released the first beta of the xum1541, which is a C64 USB floppy adapter. The first release supported basic functions to read and write disks via the OpenCBM utilities.

Today, I finished testing for parallel nibbler support. The code is available in the OpenCBM cvs repository, and directions are on my xum1541 page. When used with nibtools, it can now copy protected disks and transfer data much faster than before. I’ve successfully tested both read and write support on Windows and Mac OS X. This is quite a milestone as it is the first USB interface to support the parallel nibbler protocol.

A bit of explanation is in order. The built-in interface for a 1541 floppy drive is serial and has CLK, DATA, and ATN signals. It is a serial version of the parallel IEEE-488 bus with conditions such as EOI signalled in-band instead of requiring separate wires. Commodore originally used IEEE-488 on the PET, but moved to the IEC serial protocol to cost-reduce the cables and avoid shortages in Belkin’s supply. The serial protocol is slower than parallel, but the legendary slowness of Commodore drives had more to do with attempting to maintain backwards compatibility with older drives, not the serial protocol itself. Third-party speeder cartridges fixed this in software by repurposing the serial signals for higher-speed signalling.

To get the full bitrate the drive mechanism is capable of though, hardware modifications were required. Copiers such as Burst Nibbler added an 8-bit parallel cable in addition to the serial lines. This was relatively easy since there are two 6522 IO chips in the 1541 drive. Each has two 8-bit IO ports, and one of them is not normally used. So the parallel cable can be connected to the unused lines. Since the drive ROM does not use these lines, the copier has to load a custom routine into the drive’s RAM while initializing. It is then activated to manage the data transfer.

When Commodore hardware died out, users still needed to transfer data to and from floppies. The X-series of cables was invented, using the PC printer port for interfacing. That worked for a while until Windows NT and above made it harder to get accurate inb/outb timing, and then the DB25 printer port disappeared completely. USB established itself as the next great thing.

USB is high bandwidth but also high latency. The bit-banging approach to interfacing via the printer port would no longer work. It takes around 1 ms to get data to a USB device, no matter how small. Since the 1541 drive mechanism transfers data at 40 KB/sec, that is about 25 microseconds per byte, much less than the latency. The xum1541 does all the handshaking with the drive in an AT90USB microcontroller running at 8 MHz, giving great accuracy. The data transfers to the host are done via a double-buffered hardware USB engine. It has a state machine that handles the actual USB signalling, so we can flip buffers while it is clocking data out to the host. This gives us the cycles we need for the drive.

The protocol is actually pretty simple. The setup routines, such as which track to select, signal a byte is ready for the drive by toggling ATN, while the drive toggles DATA to acknowledge it has seen it. The custom drive code reads these bytes and then jumps to the appropriate handler. When it is done, it sends back a status byte via the same protocol.

For the high-speed transfer, something even lighter weight is needed. The drive CPU is a 6502 running at 1 Mhz, which gives about 12 instructions per byte. The transfer protocol is started with a handshaked read or write as above. Then the drive begins to transfer data one byte at a time, toggling the DATA line each time a byte is ready. The microcontroller stays in sync by waiting for each transition and then reading a byte from the parallel cable. Thus, the path from the initial handshake to the data transfer loop must be very quick and then continue without interruption.

The parallel transfer gets you something else besides high speed. Many protection schemes were built on the fact that the 1541 only has 2 KB of RAM, not enough to store a full track, which is up to 8 KB. If a mastering machine wrote a track pattern that had many similarities, ordinary copying software that read the track in pieces could not be sure it lined up properly when reassembling the pattern on the backup copy. The protection scheme, which could read and analyze the entire track in one pass, would detect this difference and refuse to run the game. To duplicate this kind of disk, users either added 8 KB of RAM to the drive or added a parallel cable. Both allow an entire track to be read in a single pass.

It was fun implementing this protocol because microcontrollers are a dedicated platform. You can count clock cycles for your instructions and be guaranteed latency. Compared to desktop PCs, where you’re running concurrently with questionable software written by people who definitely don’t count cycles, this is a dream. If you make a mistake, it is your fault. There is nothing like an SMI handler that could lock the CPU for seconds while it handles a volume button press.

Happy Holidays from all of us at Root Labs!

C64 Christmas demo

« Previous PageNext Page »

Blog at WordPress.com.