root labs rdist

June 28, 2007

Undetectable hypervisor rootkit challenge

Filed under: Hacking,Hardware,Rootkit,Security,VM — Nate Lawson @ 10:51 am

I’m starting to get some queries about the challenge Tom, Peter, and I issued to Joanna. In summary, we’ll be giving a talk at Blackhat showing how hypervisor-based rootkits are not invisible and the detector always has the fundamental advantage. Joanna’s work is very nice, but her claim that hypervisor rootkits are “100% undetectable” is simply not true. We want to prove that with code, not just words.

Joanna recently responded. In summary, she agrees to the challenge with the following caveats:

  • We can’t intentionally crash or halt the machine while scanning
  • We can’t consume more than 90% of the CPU for more than a second
  • We need to supply five new laptops, not two
  • We both provide source code to escrow before the challenge and it is released afterwards
  • We pay her $416,000

The first two requirements are easy to agree to. Of course, the rootkit also shouldn’t do either of those or it is trivially detectable by the user.

Five laptops? Sure, ok. The concern is that even a random guess could be right with 50% probability. She is right that we can make guessing a bad strategy by adding more laptops. But we can also do the same by just repeating the test several times. Each time we repeat the challenge, the probability that we’re just getting lucky goes down significantly. After five runs, the chance that we guessed correctly via random guesses is only 3%, the baseline she established for acceptability. But if she wants five laptops instead, that’s fine too.
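The arithmetic behind repeating the test, assuming a random guess is right with even (50%) odds on any single two-laptop run:

```latex
P(\text{win all 5 runs by guessing}) = \left(\frac{1}{2}\right)^{5} = \frac{1}{32} \approx 3.1\%
```

That lands right at the roughly 3% threshold, which is why five repetitions of a two-laptop test and a single five-laptop test are comparable defenses against lucky guessing.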

I don’t have a problem open-sourcing the code afterwards. However, I don’t see why it’s necessary either. We can detect her software without knowing exactly how it’s implemented. That’s the point.

The final requirement is not surprising. She claims she has put four person-months of work into the current Blue Pill and that it would require twelve more person-months for her to be confident she could win the challenge. On top of that, she has the experience of developing Blue Pill over the entire previous year.

We’ve put about one person-month into our detector software and have not been paid a cent to work on it. However, we’re confident even this minimal detector can succeed, hence the challenge. Our Blackhat talk will describe the fundamental principles that give the detector the advantage.

If Joanna’s time estimate is correct, it’s about 16 times harder to build a hypervisor rootkit than to detect it. I’d say that supports our findings.

[Edit: corrected the cost calculation from $384,000]

June 18, 2007

IBM Thinkpad overheating chipset fix

Filed under: Hardware,Misc — Nate Lawson @ 10:19 am

I’ve used Thinkpads for a long time for their relative build and BIOS quality, although I recently switched to the Panasonic Y series. I still have an IBM R32 for guests to use, but recently it began blue-screening randomly.

The symptoms were that it would run ok for a while, anywhere from a couple of hours to a few days, and then would get a STOP error in ati2dvag.dll. Often, it would do this when beginning to play a video. I had tried various combinations of video drivers with little change, and RAM tests showed no problems.

I found on various forums that the T42 and possibly other systems had a build problem where picking up the laptop by one corner would cause a heatsink to momentarily detach from the chipset, resulting in a similar blue screen. This sounded familiar, so I disassembled the laptop. (Disclaimer: doing so voids your warranty; I take no responsibility for your actions.)

Getting a copy of the hardware maintenance manual helps you find all the screws and tabs. On most laptops, the first step is to remove the keyboard; the other screws then become accessible.

As you can see, the heatsink for the graphics chip is merely a piece of aluminum tied to the drive bay. It had a bit of sticky tape between it and the chip, but it didn’t make full contact with the chip below.

r32-4.png

After removing the CPU and graphics chip heat sinks, you can see the CPU on the upper left and the graphics chip in the multi-chip module near the center. Since the chips on this small module are in an L-shape and the heatsink was centered, it only contacted the very edges of each of the chips.

r32-1.png

I decided this was likely the problem, so made a new heat spreader for the graphics chip while avoiding adding much weight. I took an old punch-out panel and cut a rectangle using my Dremel tool.

r32-2.png

After cutting, it looked like this:

r32-3.png

I put some heatsink compound on it and attached it to the graphics chips. To keep the corner from shorting out against the solder pads, I put a small square of double-stick tape in the empty space of the “L”. This helped secure the heat spreader, although I’m mostly depending on pressure from the heat sink to keep it in place. I then reattached the heat sink over it and reassembled the laptop.

When reassembling everything, be careful! The screws that hold the CPU heatsink snap extremely easily. Once they stop moving under even gentle force, don’t push them any further. I ended up fabricating replacements out of old parts. Don’t make the same mistake.

I’ve run various stress tests, including accidentally leaving it in the sun for a few hours, and it hasn’t crashed since. The keyboard above the chip also feels noticeably cooler.

June 8, 2007

Fault-tolerant system design

Filed under: Security,Software engineering — Nate Lawson @ 6:59 am

When designing a system, I have a hierarchy of properties. It is somewhat fluid depending on the application but roughly breaks down as follows.

  1. Correct: must perform the basic required function
  2. Robust: … under less-than-perfect operating conditions
  3. Reliable: … handling and recovering from internal errors
  4. Secure: … in the face of a malicious adversary
  5. Performance: … as fast as possible

I’d like to talk about how to achieve reliability. Reliability is the property that the system recovers from out-of-spec conditions and implementation errors with minimal impact on the other properties (i.e., correctness, security). While most system designers have an idea of the desired outcome, surprisingly few have a strategy for getting there. Designing for reliability also produces a system that is easier to secure (ask Dan Bernstein).

Break down your design into logical components

This should go without saying, but if you have a monolithic design, fault recovery becomes very difficult. Any implicit linkage between components will, over time, grow explicit. Corollary: in a fielded system, const usage only decreases over time, never increases.

Each component of the system should be able to reset independently or in groups

As you break down the system into components, consider what dependencies a reset of each component triggers. A simple module with fewer dependencies is best. I use the metric of “cross-section” to describe the complexity of inter-module dependencies.

Implement reset in terms of destructors/constructors

Right from the beginning, implement reset. Since you’re already coding the constructors and should have been coding destructors (right?), reset should be easy. If possible, use reset as part of the normal initialization process to be sure it doesn’t become dead code.

Increase the speed of each component’s recovery and make them restart independently if possible

If components take a long time to recover, there may be pressure from customers or management to ditch the feature. A component should never take longer to recover than a full system reset; otherwise, rethink its design. Independence means that resets can proceed in parallel, which also increases performance.

Add a rollback feature to components

In cases where a full reset results in loss of data or takes too long, rollback may be another option. Are there intermediate states that can be safely reverted while keeping the component running?

Add error-injection features to test fault recovery

Every component should have a maintenance interface to inject faults. Or, design a standard test framework that does this. At the very least, it should be possible to externally trigger a reset of every component. It’s even better to allow the tester to inject errors in components to see if they detect and recover properly.

Instrument components to get a good report of where a fault appeared

A system is only as debuggable as its visibility allows. A lightweight trace generation feature (e.g., FreeBSD KTR) can give engineers the information needed to diagnose faults. It should be fast enough to always be available, not just in debug builds.
