rdist

January 6, 2012

Mixed voltage interfacing for design or hacking

Filed under: Embedded,Hacking,Hardware,Security — Nate Lawson @ 5:32 am

Modern digital systems involve a wide array of voltages. Instead of just the classic 5V TTL, they now use components and buses ranging from 3.3V down to 1.0V. Interfacing with these systems is tricky, especially when you have multiple power sources, capacitive loads, and inrush current from devices being powered on. It’s important to understand the internal design of all drivers, both in your board and the target, to be sure your final product is cheap but reliable.

When doing a security audit of an embedded device, interfacing is one of the main tasks. You have to figure out what signals you want to tap or manipulate. You pick a speed range and signal characteristics. For example, you typically want to sample at 4x the target data rate to ensure a clean signal or, barring that, lock onto a clock to synchronize at 1x. For active attacks such as glitching, you often have to do some analog design for the pulse generator and trigger it from your digital logic. Most importantly, you have to make sure your monitoring board does not disrupt the system. The Xbox tap board (pdf) built by bunnie is a great case study for this, as is his recent NeTV device.

Building a mass-market board is different from a quick hack for a one-off review, and cost becomes a big concern. Sure, you can use an external level shifter or transceiver chip. However, these come with a lot of trade-offs. They add latency to switching times. They often require dual power supplies, which may not be available for your particular application. They usually require extra control pins (e.g., to select signal direction). They force you to switch a group of pins all at once, not individually. Most importantly, they add cost and take up board space. With a little more thought, you can often come up with a simpler solution.

There was a great article in Circuit Cellar on mixed-voltage digital interfacing. It goes into a lot of the different design approaches, from current-limiting resistors all the way up to BJTs and MOSFETs. The first step is to understand the internals of I/O pins on most digital devices. Below is a diagram I reproduced from the article.

CMOS I/O pin, showing internal ESD diodes

Whether you are using a microcontroller or logic chip, usually the I/O pins have a similar design. The input to the gate is protected by diodes that will conduct current to ground if the input voltage goes more negative than the GND pin or to Vcc if the input is more positive than Vcc. These diodes are normally used to protect against static discharge, but can also be a useful part of your design if you are careful.

The current limiting resistor can be calculated by figuring out the maximum and minimum voltages your input will see and the maximum current flow you want into your device. You also have to take into account the diode’s voltage drop. Here’s an example calculation:

Max input level: 6V
Vcc: 5V
Diode drop: 0.7V (typical)
Max current: 100 microamps

R = V / I = (6 – 5 – 0.7) / 0.0001 = 3K ohms
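As a sanity check, here’s a minimal C sketch of that same calculation; the voltages and current limit are just the example numbers above, not values from any particular datasheet.

#include <stdio.h>

/* Series resistor to limit current into an I/O pin's clamp diode.
 * All values are the example numbers from the text, not datasheet specs. */
int main(void)
{
    double v_in    = 6.0;     /* worst-case input level (V) */
    double v_cc    = 5.0;     /* supply rail the upper diode clamps to (V) */
    double v_diode = 0.7;     /* typical forward drop of the clamp diode (V) */
    double i_max   = 100e-6;  /* maximum current allowed into the pin (A) */

    double r = (v_in - v_cc - v_diode) / i_max;
    printf("Minimum series resistance: %.0f ohms\n", r);  /* prints 3000 */
    return 0;
}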

While microcontrollers and 74xx logic devices are often tolerant of some moderate current, FPGAs are extremely sensitive. If you’re using an FPGA, use a level shifter chip or you’ll be sorry. Also, most devices are more sensitive to negative voltage than positive. Even if you’re within the specs for max current, a negative voltage may quickly lead to latch-up.

Level shifters usually take in dual voltages and safely isolate both sides. Some are one-way and can be used as line drivers or line receivers. Others are bidirectional transceivers, with the direction selected by an extra pin. If you can’t afford the extra pins, you can often combine an open-collector transmitter with a line receiver. However, if both sides happen to transmit at the same time, you can fry your chips.

To summarize, mixed voltage interfacing is a skill you’ll need when building your own devices or hacking existing ones. You can get by with just a current-limiting resistor in many cases, but be sure you understand the I/O pin design to avoid costly failures, especially in the field over time. For more assurance or with sensitive parts like FPGAs, you’ll have to use a level shifter.

June 6, 2011

Improving ASLR with internal randomization

Filed under: Hacking,Network,Security,Software protection — Nate Lawson @ 4:24 am

Most security engineers are familiar with address space layout randomization (ASLR). In the classic implementation, the runtime linker or image loader chooses a random base offset for the program, its dynamic libraries, heap, stack, and mmap() regions.

At a higher level, these can all be seen as obfuscation. The software protection field has led with many of these improvements because cracking programs is a superset of exploiting them. That is, an attacker with full access to a program’s entire runtime state is much more advantaged than one with only remote access to the process, filtered through an arbitrary protocol. Thus, I predict that exploit countermeasures will continue to recapitulate the historical progress of software protection.

The particular set of obfuscations used in ASLR was chosen for ease of retrofitting existing programs. The runtime linker/loader is a convenient location for randomizing various memory offsets, and its API is respected by most programs, with the most notable exceptions being malware and some software protection schemes. Other obfuscation mechanisms, like heap metadata checksumming, are hidden in the internals of system libraries. Standard libraries are a good, but less reliable, location than the runtime linker. For example, many programs have their own internal allocator, reducing the obfuscation gains of adding protection to the system allocator.

A good implementation of ASLR can require attackers to use a memory disclosure vulnerability to discover, or heap feng shui to create, a known memory layout for reliable exploitation. While randomizing chunks returned from the standard library allocator can make it harder for attackers to create a known state, memory disclosure vulnerabilities will always allow a determined attacker to subvert obfuscation. I expect we’ll see more creativity in exercising partial memory disclosure vulnerabilities as the more flexible bugs are fixed.

ASLR has already forced researchers to package multiple bugs into a single exploit, and we should soon see attackers follow suit. However, once the base offsets of various libraries are known, the rest of the exploit can be applied unmodified. For example, a ROP exploit may need addresses of gadgets changed, but the relative offsets within libraries and the code gadgets available are consistent across systems.

The next logical step in obfuscation would be to randomize the internals of libraries and code generation. In other words, you re-link the internal functions and data offsets within libraries or programs so that code and data are at different locations in DLLs from different systems. At the same time, code generation can also be randomized so that different instruction sequences are used for the same operations. Since all this requires deep introspection, it calls for a larger change in how software is delivered.
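As a rough illustration of the idea (and not any real linker’s interface), an install-time step could use a seeded Fisher-Yates shuffle to reorder the functions it is about to lay out, so relative offsets differ per installation while a logged seed keeps the layout reproducible:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical record for a function the installer is about to place. */
struct section {
    const char *name;
    size_t size;
};

/* Seeded Fisher-Yates shuffle: the same seed reproduces the same layout
 * (useful for debugging), a different seed gives different relative offsets. */
static void shuffle_sections(struct section *s, size_t n, unsigned seed)
{
    srand(seed);                 /* illustration only; use a real PRNG in practice */
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        struct section tmp = s[i];
        s[i] = s[j];
        s[j] = tmp;
    }
}

int main(void)
{
    struct section funcs[] = {
        {"parse_input", 0x120}, {"auth_check", 0x80},
        {"log_event",   0x200}, {"do_work",    0x340},
    };
    size_t n = sizeof(funcs) / sizeof(funcs[0]);

    shuffle_sections(funcs, n, 0xC0FFEE);   /* log this seed per installation */

    size_t offset = 0;
    for (size_t i = 0; i < n; i++) {        /* assign the randomized relative offsets */
        printf("%-12s at +0x%zx\n", funcs[i].name, offset);
        offset += funcs[i].size;
    }
    return 0;
}

A real implementation would randomize during code generation rather than shuffling records after the fact, which is exactly why it needs a change in how software is delivered.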

Fortunately, that change is on the horizon for other reasons. LLVM and Google NaCl are working on link-time optimization and runtime code generation, respectively. What this could mean for NaCl is that a single native executable in LLVM bitcode format would be delivered to the browser. Then, it would be translated to the appropriate native instruction set and executed.

Of course, we already have a form of this today with the various JIT environments (Java JVM, Adobe ActionScript, JavaScript V8, etc.). But these environments typically cover only a small portion of the attack surface and don’t affect the browser platform itself. Still, randomized JIT is likely to become more common this year.

One way to implement randomized code delivery is to add this to the installer. Each program could be delivered as LLVM IR and then native code generation and link addresses could be randomized as it was installed. This would not slow down the installation process significantly but would make each installation unique. Or, if the translation process was fast enough, this could be done on each program launch.

Assuming this was successfully deployed, it would push exploit development to be an online process. That is, an exploit would include a built-in ROP gadget generator and SMT solver to generate a process/system-specific exploit. Depending on the limitations of available memory disclosure vulnerabilities and specific process state, it might not be possible to automatically exploit a particular instance. Targeted attacks would have to be much more targeted and each system compromised would require the individual attention of a highly-skilled attacker.

I’m not certain software vendors will accept the nondeterminism of this approach. Obviously, it makes debugging production systems more difficult and installation-specific. However, logging the random seed used to initiate the obfuscation process could be used to recreate a consistent memory layout for testing.

For now, other obfuscation measures such as randomizing the allocator may provide more return on investment. As ROP-specific countermeasures are deployed, it will become easier to exploit a program’s specific internal logic (flags, offsets, etc.) than to try to get full code execution. It seems that, for now, exploit countermeasures will stay focused on randomizing and adding checksums to data structures, especially those in standard libraries.

But is this level of obfuscation where exploit countermeasures are headed? How long until randomized linking and code generation are part of a mainline OS?

February 21, 2011

Memory address layout vulnerabilities

Filed under: Embedded,Hacking,Security — Nate Lawson @ 7:23 am

This post is about a programming mistake we have seen a few times in the field. If you live by the TAOSSA, you probably already avoid this, but it’s a surprisingly tricky and persistent bug.

Assume you’d like to exploit the function below on a 32-bit system. You control len and the contents of src, and they can be up to about 1 MB in size before malloc() or prior input checks start to error out early without calling this function.

int target_fn(char *src, int len)
{
    char buf[32];
    char *end;

    if (len < 0) return -1;
    end = buf + len;
    if (end > buf + sizeof(buf)) return -1;
    memcpy(buf, src, len);
    return 0;
}

Is there a flaw? If so, what conditions are required to exploit it? Hint: the obvious integer overflow using len is caught by the first if statement.

The bug is an ordinary integer overflow, but it is only exploitable in certain conditions. It depends entirely on the address of buf in memory. If the stack is located at the bottom of address space on your system or if buf was located on the heap, it is probably not exploitable. If it is near the top of address space as with most stacks, it may be exploitable. But it completely depends on the runtime location of buf, so exploitability depends on the containing program itself and how it uses other memory.

The issue is that buf + len may wrap the end pointer to memory below buf. This may happen even for small values of len, if buf is close enough to the top of memory. For example, if buf is at 0xffff0000, a len as small as 64KB can be enough to wrap the end pointer. This allows the memcpy() to become unbounded, up to the end of RAM. If you’re on a microcontroller or other system that allows accesses to low memory, memcpy() could wrap internally after hitting the top of memory and continue storing data at low addresses.
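For reference, here is a hedged sketch of how the check could be written so that no pointer arithmetic can wrap: bound len against the buffer size directly, using an unsigned comparison, before ever forming an end pointer.

#include <string.h>

/* Corrected version: bound len itself instead of computing buf + len,
 * so no pointer arithmetic can wrap past the top of address space. */
int target_fn_fixed(char *src, int len)
{
    char buf[32];

    if (len < 0 || (size_t)len > sizeof(buf))
        return -1;
    memcpy(buf, src, (size_t)len);
    return 0;
}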

Of course, these kinds of functions are never neatly packaged in a small wrapper and easy to find. There’s usually a sea of them and the copy happens many function calls later, based on stored values. In this kind of situation, all of us (maybe even Mark Dowd) need some help sometimes.

There has been a lot of recent work on using SMT solvers to find boundary condition bugs. They are useful, but often limited. Every time you hit a branch, you have to add a constraint (or potentially double your terms, depending on the structure). Also, inferring the runtime contents of RAM is a separate and difficult problem.

We think the best approach for now is to use manual code review to identify potentially problematic sections, and then restrict the search space to that set of functions for automated verification. Despite some promising results, we’re still a long way from automated detection and exploitation of vulnerabilities. As the program verification field advances, additional constraints from ASLR, DEP, and even software protection measures reduce the ease of exploitation.

Over the next few years, it will be interesting to see if attackers can maintain their early lead by using program verification techniques. Microsoft has applied the same approach to defense, and it would be good to see this become general practice elsewhere.

January 17, 2011

Stuxnet is embarrassing, not amazing

Filed under: Crypto,Hacking,Network,Reverse engineering,Security,Software protection — Nate Lawson @ 8:05 am

As the New York Times posts yet another breathless story about Stuxnet, I’m surprised that no one has pointed out its obvious deficiencies. Everyone seems to be hyperventilating about its purported target (control systems, ostensibly for nuclear material production) and not the actual malware itself.

There’s a good reason for this. Rather than being proud of its stealth and targeting, the authors should be embarrassed at their amateur approach to hiding the payload. I really hope it wasn’t written by the USA because I’d like to think our elite cyberweapon developers at least know what Bulgarian teenagers did back in the early 90’s.

First, there appears to be no special obfuscation. Sure, there are your standard routines for hiding from AV tools, XOR masking, and installing a rootkit. But Stuxnet does no better at this than any other malware discovered last year. It does not use virtual machine-based obfuscation, novel techniques for anti-debugging, or anything else to make it different from the hundreds of malware samples found every day.

Second, the Stuxnet developers seem to be unaware of more advanced techniques for hiding their target. They use simple “if/then” range checks to identify Step 7 systems and their peripheral controllers. If this were some high-level government operation, I would hope they would know to use things like hash-and-decrypt or homomorphic encryption to hide the controller configuration the code is targeting and its exact behavior once it infected those systems.

Core Labs published a piracy protection scheme including “secure triggers”, which are code paths that can only be executed given a particular configuration in the environment. One such approach is to encrypt your payload with a key that can only be derived on systems that have a particular configuration. Typically, you’d concatenate all the desired input parameters and hash them to derive the key for encrypting your payload. Then, you’d do the same thing on every system the code runs on. If any of the parameters is off, even by one, the resulting key is useless and the code cannot be decrypted and executed.
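To make that concrete, here is a toy sketch of the hash-and-decrypt flow. It uses FNV-1a and XOR only to stay self-contained; a real trigger would use a cryptographic hash such as SHA-256 and a real cipher, and the configuration strings below are made-up placeholders.

#include <stdint.h>
#include <stdio.h>

/* Toy key derivation: FNV-1a over the concatenated environment parameters.
 * A real secure trigger would use a cryptographic hash such as SHA-256. */
static uint64_t derive_key(const char *params[], size_t n)
{
    uint64_t h = 1469598103934665603ULL;     /* FNV-1a 64-bit offset basis */
    for (size_t i = 0; i < n; i++) {
        for (const char *p = params[i]; *p; p++) {
            h ^= (uint8_t)*p;
            h *= 1099511628211ULL;           /* FNV-1a 64-bit prime */
        }
    }
    return h;
}

/* Toy "decryption": XOR the payload with bytes of the derived key.
 * Wrong environment -> wrong key -> garbage that never runs. */
static void xor_payload(uint8_t *buf, size_t len, uint64_t key)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= (uint8_t)(key >> ((i % 8) * 8));
}

int main(void)
{
    /* Hypothetical configuration fingerprint gathered from the target system. */
    const char *fingerprint[] = { "controller=EXAMPLE", "firmware=1.2", "peripherals=6" };
    uint8_t payload[] = { 0x90, 0x90, 0xc3 };   /* stand-in for encrypted payload bytes */

    uint64_t key = derive_key(fingerprint, 3);
    xor_payload(payload, sizeof(payload), key); /* only yields real code on a match */

    printf("derived key: %016llx\n", (unsigned long long)key);
    return 0;
}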

This is secure except against a chosen-plaintext attack. In such an attack, the analyst can repeatedly run the payload on every possible combination of inputs, halting once the right configuration is found to trigger the payload. However, if enough inputs are combined and their ranges are not too limited, you can make such a brute-force attack infeasible. If this was the case, malware analysts could only say “here’s a worm that propagates to various systems, and we have not yet found out how to unlock its payload.”

Stuxnet doesn’t use any of these advanced features. Either the authors did not care if their payload was discovered by the general public, they weren’t aware of these techniques, or they had other limitations, such as time. The longer they remained undetected, the more systems could be attacked and the longer Stuxnet could continue evolving as a deployment platform for follow-on worms. So disregard for detection seems unlikely.

We’re left with the authors being run-of-the-mill or in a hurry. If the former, then it was likely this code was produced by a “Team B”. Such a group would be second-tier in their country, perhaps a military agency as opposed to NSA (or the equivalent in other countries). It could be a contractor or loosely-organized group of hackers.

However, I think the final explanation is most likely. Whoever developed the code was probably in a hurry and decided using more advanced hiding techniques wasn’t worth the development/testing cost. For future efforts, I’d like to suggest the authors invest in a few copies of Christian Collberg’s book. It’s excellent and could have bought them a few more months of obscurity.

January 7, 2011

An obvious solution to the password problem

Filed under: Hacking,Misc,Security — Nate Lawson @ 12:26 pm

Many organizations try to solve problems by making rules. For example, they want to prevent accounts from being compromised due to weak passwords, so they institute a password policy. But any policy with specific rules gets in the way of legitimate choices and is vulnerable to being gamed by the lazy. This isn’t because people are bad; it’s because you didn’t properly align incentives.

For example, a bank might require passwords with at least one capital letter and a number. However, things like “Password1” are barely more secure than “password”. (You get them on the second phase of running Crack, not the first phase. Big deal.) A user who chose that password was just trying to get around the rule, not choose something secure. Meanwhile, a much more secure password like “jnizwbier uvnqera” would fail the rule.

The solution is not more rules. It is twofold: give users a “why” and a “how”. You put the “why” up in great big red letters and then refer to the “how”. If users ignore this step, your “why” is not compelling enough. Come up with a bigger carrot or stick.

The “why” is a benefit or penalty. You could give accountholders a free coffee if their account goes 1 year without being compromised or requiring a password reset. Or, you can make them responsible for any money spent on their account if an investigation shows it was compromised via a password.

The “how” in this case is a short tutorial on how to choose a good passphrase, access to a good random password generator program, and enough characters (256?) to prevent arbitrary limits on choices.
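As an example of such a generator, here is a minimal C sketch that reads bytes from /dev/urandom and maps them onto a character set; the length and alphabet are arbitrary choices for illustration, not a recommendation.

#include <stdio.h>
#include <stdlib.h>

/* Minimal random password generator: reads bytes from /dev/urandom and
 * maps them onto a printable alphabet. Length and alphabet are arbitrary. */
int main(void)
{
    const char alphabet[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789";
    const size_t alpha_len = sizeof(alphabet) - 1;
    const size_t pw_len = 20;

    FILE *rnd = fopen("/dev/urandom", "rb");
    if (!rnd) {
        perror("/dev/urandom");
        return 1;
    }

    for (size_t i = 0; i < pw_len; i++) {
        unsigned char b;
        /* Rejection sampling keeps the distribution over the alphabet uniform. */
        do {
            if (fread(&b, 1, 1, rnd) != 1) { fclose(rnd); return 1; }
        } while (b >= (256 / alpha_len) * alpha_len);
        putchar(alphabet[b % alpha_len]);
    }
    putchar('\n');
    fclose(rnd);
    return 0;
}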

That’s it. Once you align incentives and provide the means to succeed, rules are irrelevant. This goes for any system, not just passwords.

December 27, 2010

Building a USB protocol analyzer

Filed under: Hacking,Hardware,Protocols — Nate Lawson @ 1:58 pm

The recent effort by bushing’s team to develop an open-source USB protocol analyzer reminded me of a quick hack I did previously. I was debugging a tricky USB problem but only had an oscilloscope.

If you’ve been following this blog, you know one of my hobby projects has been designing a USB interface for old Commodore floppy drives. The goal is to archive old data, including the copy-protection bits, before the media fails. Back in January 2009, I was debugging the first prototype board. Most of the commands succeeded but one would fail immediately every time I sent it.

I tried a software USB analyzer, but it didn’t show any more information. The command was returning almost immediately with no data. Debugging output on the device’s UART didn’t show anything abnormal, except it was never receiving the problem command. So the problem had to be between the host and target’s USB stacks, and possibly was in the AVR’s hardware USB state machine. Only a bus analyzer could reveal what was going on.

Like other hobby developers, I couldn’t justify the cost of a dedicated USB analyzer just to troubleshoot this one problem, especially in a design I would be releasing for free. Since I did have an oscilloscope at work, I decided to build a USB decoding stack on top of it.

USB, like Ethernet and TCP/IP, is a combination of protocols. The lowest layer is the physical cabling and bit signalling. On top of this is packet framing and device addressing. Next, each device has a set of endpoints. These are analogous to TCP/UDP ports and support control, bulk, or interrupt message types. The standard control endpoint (address 0) handles a set of common configuration messages. Other endpoints are device-specific.

High-speed signalling (480 Mbit/s) is a bit different from full/low-speed, so I won’t describe it here. Suffice it to say, you can just put a USB 1.1 hub between your device and host to force it to downgrade speeds. Unless you’re trying to debug a problem with high-speed signalling itself, this is sufficient to debug protocol-level issues.

The USB physical layer uses differential current flow to signal bits. This balances the charge, decreasing the latency for line transitions and increasing noise rejection.  I hooked up probes to the D+ and D- lines and saw a trace like this:

Each zero bit in USB is signalled by a transition, low to high or high to low. A one bit is signalled by no transition for the clock period. (This is called NRZI encoding.) Obviously, there’s a chance for sender and receiver clocks to drift out of sync if there are too many one bits in a row, so a zero bit is stuffed into the frame after every 6 one bits. It is discarded by the receiver. An end-of-packet is signalled by a single-ended zero (SE0), which is both lines held low. You can see this at the beginning of the trace above.

To start each packet, USB sends an 0x80 byte, least-significant bit first. This is 7 transitions followed by a one bit, allowing the receiver to synchronize their clock on it. You can see this in the trace above, just after the end-of-packet from the previous frame. After the sync bits, the rest of the frame is byte-oriented.

The host initiates every transaction. In a control transfer, it sends the command packet, generates an optional data phase (in/out from device), and ends with a status phase. If the transaction failed, the device returns an error byte.

My decoding script implemented all the layers in the quickest way possible. After taking a scope trace, I’d dump the samples to a file. The script would then run through them, looking for the first edge. If this edge was part of a sync byte, it would begin byte-aligned decoding of a frame to pass up to higher-level functions. At the end of the packet, it would go back to scanning for the next edge. Using Python’s generators made this quite easy, since it was just a series of nested loops instead of a complicated state machine.
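The original script was in Python, but to show the shape of that inner loop, here is a minimal C sketch of just the bit-level step: NRZI decoding with bit unstuffing over already-thresholded samples, simplified to one sample per bit period. The sample values are made up, and the sync detection and framing layers are omitted.

#include <stdio.h>

/* NRZI decode with bit unstuffing over a pre-thresholded D- capture,
 * simplified to one sample per bit period (i.e., after clock recovery).
 * A transition between periods is a 0 bit, no transition is a 1 bit,
 * and the stuffed 0 after six consecutive 1s is dropped. */
static void nrzi_decode(const int *levels, size_t n)
{
    int prev = levels[0];
    int ones = 0;

    for (size_t i = 1; i < n; i++) {
        int bit = (levels[i] == prev) ? 1 : 0;
        prev = levels[i];

        if (bit == 1) {
            ones++;
            putchar('1');
        } else {
            if (ones == 6) {       /* stuffed bit: discard, don't emit */
                ones = 0;
                continue;
            }
            ones = 0;
            putchar('0');
        }
    }
    putchar('\n');
}

int main(void)
{
    /* Made-up line levels encoding the sync pattern 0x80 (LSB first):
     * seven transitions (0 bits) followed by no transition (a 1 bit). */
    int levels[] = { 1, 0, 1, 0, 1, 0, 1, 0, 0 };
    nrzi_decode(levels, sizeof(levels) / sizeof(levels[0]));
    return 0;
}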

Since this was a quick hack, I cut corners. To detect the SE0 end-of-packet, you really need to monitor both D+ and D-. At higher speeds, the peaks get lower since less current is exchanged. However, at lower speeds, you can ignore this and just put a scope probe on the D- line. Instead of proper decoding of the SE0, I’d just decode each frame until no more data was expected and then yield a fake EOP symbol to the upper layers.

After a few days of debugging, I found the problem. The LUFA USB stack I was using in my firmware had a bug. It had a filter for standard control messages (such as endpoint configuration) that it handled for you. Class-specific transactions were passed up to a handler in my firmware. The bug was that the filter was too permissive — all control transfers of type 6, even if they were class-specific, were captured by LUFA. This ended up returning an error without ever passing the message to my firmware. (By the way, the LUFA stack is excellent, and this bug has long since been fixed).

Back in the present, I’m glad to see the OpenVizsla project creating a cheaper USB analyzer. It should be a great product. Based on my experience, I have some questions about their approach I hope are helpful.

It seems kind of strange that they are going for high-speed support. Since the higher-level protocol messages you might want to reverse-engineer are the same regardless of speed, it would be cheaper to just handle low/full speed and use a hub to force devices to downgrade. I guess they might be dealing with proprietary devices, such as the Kinect, that refuse to operate at lower speeds. But if that isn’t the case, their namesake, the Beagle 12, is a great product for only $400.

I have used the Total Phase Beagle USB analyzers, and they’re really nice. As with most products these days, the software makes the difference. They support Windows, Mac, and Linux and have a useful API. They can output data in CSV or binary formats. They will be supporting USB 3.0 (5 Gbps) soon.

I am glad OpenVizsla will be driving down the price for USB analyzers and providing an option for hobbyists. At the same time, I have some concern that it will drive away business from a company that provides open APIs and well-supported software. Hopefully, Total Phase’s move upstream to USB 3.0 will keep them competitive for people doing commercial development and the OpenVizsla will fill an underserved niche.
