Old programming habits die hard

While programming, it’s enlightening to be aware of your many influences. Decisions such as how you name internal functions, coding style, code organization, and threading versus asynchronous IO all stem from your background. I think you could almost look at someone’s code and tell how old they are, even if they keep up with new languages and patterns.

When I think of my own programming background, I remember a famous quote:

“It is practically impossible to teach good programming to students that have had a prior exposure to BASIC: as potential programmers they are mentally mutilated beyond hope of regeneration.”
— Edsger W. Dijkstra, June 18, 1975

Large memory allocations are a problem

A common mistake is carrying around a fixed mental model of memory allocation. Since our machines are still changing exponentially, even a linearly-updated model quickly falls behind.

Back in the 90’s, I would make an effort to keep stack frames within 4K or 8K total to avoid hitting a page fault to grow the stack. Deep recursion or copying from stack buffer to buffer were bad because they could trigger a fault into the kernel, which would grow the process and slow down execution. It was better to reuse data in-place and pass around pointers.

Nowadays, you can malloc() gigabytes and servers have purely in-memory databases. While memory use is still important, the scale that we’re dealing with now is truly amazing (unless your brain treats performance as a log plot).

Never jump out of a for loop

The BASIC interpreter on early machines had limited garbage collection capability. If you used GOTO in order to exit a loop early, the stack frame was left around, unless you followed some guidelines. Eventually you’d run out of memory if you did this repeatedly.

Because of this, it still feels a little awkward in C to break out of a for loop, even though break is just a GOTO (a jump) at the assembly level. Fortunately, C does a better job of stack management than BASIC did.

Low memory addresses are faster

On the 6502, instructions that access zero page addresses ($00–$ff) use a more compact instruction encoding than other addresses and also execute one cycle faster. In DOS, you may have spent a lot of time trying to swap things below the 1 MB barrier. On an Amiga, it was chip and fast RAM.

Thus, it still feels a bit faster to me to use the first few elements of an array, or to see an address with a lot of leading zeros. The former rule of thumb has morphed into cache line access patterns, so it remains valid in a slightly different form. With virtualized addressing, the latter no longer applies.

Pointer storage is insignificant

In the distant past, programmers would attempt to fold multiple pointers into a single storage unit (the famous XOR trick). Memory became a little less scarce and this practice was denounced, due to its impact on debugging and garbage collection. Meanwhile, on the PC, segmented memory made the 16-bit pointer size insignificant. As developers moved to 32-bit protected mode machines in the 90’s, RAM size was still not an issue because it had grown accordingly.

However, we’re at a peculiar juncture with RAM now. Increasing pointers from 32 to 64 bits uses 66% more RAM for a doubly-linked list implementation with each node storing a 32-bit integer. If your list took 2 GB of RAM, now it takes 3.3 GB for no good reason. With virtual addressing, it often makes sense to return to a flat model where every process in the system has non-overlapping address space. A data structure such as a sparse hash table might be better than a linked list.

Where the working set is less than 4 GB, it may make sense to stay with a 32-bit OS and use PAE to access physical RAM beyond that limit. You get to keep 32-bit pointers, but each process can only address 4 GB of RAM. However, you can just run multiple processes to take advantage of the extra RAM. Today’s web architectures and horizontal scaling mean this may be a better choice than 64-bit for some applications.

The world of computing changes rapidly. What kind of programming practices have you evolved over the years? How are they still relevant or not? In what ways can today’s new generation of programmers learn from the past?

Side channel attacks on cryptographic software

Below is a recent article I wrote for IEEE Security and Privacy magazine titled “Side Channel Attacks on Cryptographic Software” (pdf). It covers simple timing attacks against HMAC validation, AES cache timing, and RSA branch prediction attacks. It’s a survey article, covering excellent research in side channel attacks over the past few years.

I think attackers have been gaining the advantage recently. As CPU microarchitecture gets more complicated, more covert channels appear. Also, the move to virtualization and high-resolution timers gives attackers better-quality measurements and more opportunities to exploit even the tiniest of leaks. We’ll need to come up with more clever ways of modeling and eliminating covert channels, moving crypto operations into hardware, and giving software more control over microarchitectural state.

Let me know what you think.

All about ACPI

Up until recently, I was the maintainer for the FreeBSD ACPI implementation.  ACPI is a standard for making features like power management, which were originally in the BIOS, available to the OS to enumerate and control.  I implemented the FreeBSD CPU frequency control framework and drivers for individual CPUs, as well as numerous bugfixes and updates to support new BIOS versions.  I wanted to share my perspective on ACPI, as well as provide resources for other open source developers interested in getting involved in this arcane area of PCs.

ACPI is a complex, 600-page standard.  It assumes the reader has a lot of low-level knowledge about PC hardware and the BIOS.  In the past, the BIOS would be the first thing that started on the PC.  First, it would configure peripherals like PCI slots and boards.  To retain control, it would hook the system management interrupt (SMI) before booting the OS.  The SMI was tied to the actual hardware by the OEM, say by linking a laptop’s lid switch to a microcontroller.  The BIOS had hard-coded event types and logic to process them (“if lid switch pressed, turn off backlight”).  The nice thing was that the OS was completely unaware of the BIOS since the SMI is mostly invisible to software, so it could run on all PCs without changes.  The downside was that the user had to go into the BIOS settings to change things and the BIOS had to be relatively complex in order to coexist with the OS.

In the ACPI model, the BIOS (or EFI now) still performs basic initialization.  After the hardware is set up, it hooks the SMI.  The difference is that it now provides a set of tables in RAM for the OS to consume.  While most of the static tables describe the hardware, the most important one (the DSDT) provides bytecode called AML.  The OS sets up its interpreter very early in the boot process and loads the BIOS’s AML.  The AML is composed of methods and data types and is arranged in a tree that matches the logical arrangement of devices.  For example, methods related to a video device would be located under the AML Device object for its parent PCI host interface.

The OS walks the AML tree, initializing the various devices and assigning resources to them using AML and device-specific methods.  Usually, the OS keeps a mapping between its internal device driver tree and the handle to the AML node that shadows it.  This allows it to later call methods under that node that change the power level or configure its wake capabilities.  For example, you want the OS to power down an internal modem if it’s not in use.  This is done by calling AML methods for the Device node that represents the modem.

The OS interoperates with ACPI this way constantly.  The OS calls into AML (or vice versa via an interrupt) to power up/down devices, change backlight settings, change CPU frequency/voltage, or suspend/resume the system.  While this level of control is nice to have as an OS developer, there have been a lot of bumps along the way.

ACPI could be seen as an attempt by Microsoft and Intel to take control over areas formerly owned by Phoenix/AMI and the numerous Taiwan integrators that actually build systems for companies like Dell.  They had arguably good reasons for doing so, including the support cost of working around so many BIOS bugs.  The conspiracy theorists may think that this was an attempt by Bill Gates to undermine Linux, but the pain that ACPI caused has also been borne by Microsoft.  Gates never really got his proprietary extensions to ACPI, but Windows did enjoy the privileged position of OEMs using it as the validation suite for their BIOS implementations.  Because there was no open interoperability testing, the spec was so complex, and information about the low-level BIOS and hardware was tied up in NDAs, open source kernels suffered many compatibility problems as new PCs appeared.  The process definitely should have been more open and seems to be getting better, largely due to the efforts of Intel with Linux.

If you’re interested in learning more about ACPI, I recommend the following resources.  The papers and talks are the best introduction, and I haven’t read the book yet.  I hope this helps the open-source community improve support for ACPI.

Papers and talks



Interview about DRM on Security Focus

Security Focus just posted this interview of me, talking about DRM. Here are a few choice quotes.

On authoring software protection for Vista:

The rules of the game are changing recently with Microsoft Vista kernel patch protection. If you’re a rootkit author, you just bypass it. If you’re a software protection designer, you have to play by its rules. For the first time in the PC’s history, it’s not a level playing field any more. Virus scanner authors were the first to complain about this, and it will be interesting to see how this fundamental change affects the balance of power in the future.

On using custom hardware for protection:

Custom hardware often gives you a longer period until the first break since it requires an attacker’s time and effort to get up to speed on it. However, it often fails more permanently once cracked since the designers put all their faith in the hardware protection.

Anti-debugging: using up a resource versus checking it

After my last post claiming anti-debugging techniques are overly depended on, various people asked how I would improve them. My first recommendation is to go back to the drawing board and make sure all the components of your software protection design hang together in a mesh. In other words, each has the property that the system fails closed if that component is circumvented in isolation. However, anti-debugging does have a place (alongside other components such as obfuscation, integrity checks, etc.), so let’s consider how to implement it better.

The first principle of implementing anti-debugging is to use up a resource instead of checking it. The initial idea most novices have is to check a resource to see if a debugger is present. This might mean calling IsDebuggerPresent() or, if more advanced, using GetThreadContext() to read the values of the hardware debug registers DR0-7. The implementation is then often as simple as “if (debugged): FailNoisily()”, or a few inline copies of that check scattered throughout the code.

The problem with this approach is that it provides a choke point for the attacker: a single bottleneck that, once patched, defeats every instance of the check, no matter where it appears in the code or how obfuscated it is. For example, IsDebuggerPresent() reads a flag from the PEB struct in local memory, so attaching a debugger and then setting that flag to zero circumvents all uses of IsDebuggerPresent(). Looking for suspicious values in DR0-7 means the attacker merely has to hook GetThreadContext() and return 0 for those registers. In both cases, the attacker could also skip over the check so it always returns “OK”.

Besides the API bottleneck, anti-debugging via checking is fundamentally flawed because it suffers from “time of check, time of use” (TOCTOU) race conditions. An attacker could quickly swap out the values or virtualize accesses to the particular resources in order to lie about their state. Meanwhile, they could still be using the debug registers for their own purposes while stubbing out GetThreadContext(), for example.

Contrast this with anti-debugging by using up a resource. For example, a timer callback could be set up to vector through PEB!BeingDebugged. (Astute readers will notice this is a byte-wide field, so it would actually have to be an offset into a jump table). This timer would fire normally until a debugger was attached. Then the OS would store a non-zero value in that location, overwriting the timer callback vector. The next time the timer fired, the process would crash. Even if an attacker poked the value to zero, it would still crash. They would have to find another route to read the magic value in that location to know what to store there after attaching the debugger. To prevent merely disabling the timer, the callback should perform some essential operation to keep the process operating.

Now this is a simple example and there are numerous ways around it, but it illustrates my point. If an application uses a resource that the attacker also wants to use, it’s harder to circumvent than a simple check of the resource state.

An old example of this form of anti-debugging in the C64 was storing loader code in screen memory. As soon as the attacker broke into a debugger, the prompt would overwrite the loader code. Resetting the machine would also overwrite screen memory, leaving no trace of the loader. Attackers had to realize this and jump to a separate routine elsewhere in memory that made a backup copy of screen memory first.

For x86 hardware debug registers, a program could keep an internal table of functions but do no compile-time linkage. Each site of a function call would be replaced with code to store an execute watchpoint for EIP+4 in a debug register and a “JMP SlowlyCrash” following that. Before executing the JMP, the CPU would trigger a debug exception (INT1) since the EIP matched the watchpoint. The exception handler would look up the calling EIP in a table, identify the real target function, set up the stack, and return directly there. The target could then return back to the caller. All four debug registers should be utilized to avoid leaving room for the attacker’s breakpoints.

There are a number of desirable properties for this approach. If the attacker executes the JMP, the program goes off into the weeds but not immediately. As soon as they set a hardware breakpoint, this occurs since the attacker’s breakpoint overwrites the function’s watchpoint and the JMP is executed. The attacker would have to write code that virtualized calls to get and set DR0-7 and performed virtual breakpoints based on what the program thought should be the correct values. Such an approach would work, but would slow down the program enough that timing checks could detect it.

I hope these examples make the case for anti-debugging via using up a resource versus checking it. This is one implementation approach that can make anti-debugging more effective.

[Followup: Russ Osterlund has pointed out in the comments section why my first example is deficient.  There’s a window of time between CreateProcess(DEBUG_PROCESS) and the first instruction execution where the kernel has already set PEB!BeingDebugged but the process hasn’t been able to store its jump table offset there yet.  So this example is only valid for detecting a debugger attach after startup and another countermeasure is needed to respond to attaching beforehand.   Thanks, Russ!]

Is the emulation singularity near?

While reading the book Computer Power and Human Reason, I began to wonder about emulation. Of course, any Turing machine can emulate any other (but possibly very slowly).  However, I am wondering if we’re near a singularity in whole-system emulation.

Whole-system emulation is where an entire computer, including CPU and peripherals, is emulated. This is different from a language VM (e.g., the JVM) or an OS VM (e.g., Linux UML). I gave a talk about VMs a few years ago that may be useful as a refresher. A whole system can be emulated by full instruction interpretation if the host CPU is a different architecture (e.g., Bochs or VICE) or by native execution with traps (e.g., VMware). The only difference is speed.

With the ever-growing prevalence of the x86 architecture, I am wondering whether we are near an emulation singularity. That is, a point in time where every machine ever produced in decent quantity, including every machine currently being sold, can be emulated at full speed on any widely-available desktop.

There are two classes of system that matter in this scenario: old and new machines. Old machines can be emulated via pure horsepower. When I run VICE, my virtual 6502 CPU, video chip, and sound chip are fully interpreted with cycle accuracy.  My desktop PC can be a Nintendo DS, numerous arcade games, or even a Playstation 2.  The fact that not only the main CPU but various custom graphics chips are being emulated is amazing.

New machines can be emulated at a decent speed if they share the same CPU architecture.  I think the spread of x86 is bringing us to that point in the near future.  Apple desktops and laptops, Sun servers, and the original Xbox are all based on x86 and can be emulated at full speed.  The advent of hardware virtualization makes it even easier to run multiple operating systems on the same platform.

There will continue to be new non-x86 systems (Xbox 360, Playstation 3, cellphones) but the gap is narrowing between their release and the appearance of the first emulator.  Since mainstream PCs are used as development platforms, the manufacturers themselves are designing the architecture to be emulated with a minimal amount of peripheral hardware (e.g., a chipset attached via USB).

Will we reach a singularity where every newly-released piece of mainstream hardware is readily emulated on current systems?  I think so — what do you think?