Why buffer overflow exploitation took so long to mature (part 2)

Last time, I asked the question, “why did it take 24 years for buffer overflow exploits to mature?” The factors relevant to answering this question map onto three eras: academic computing (early 1970’s), the rise of the Internet (1988), and x86 unification (1996 to present).

In the first era of academia, few systems were networked so buffer overflow flaws could only be used for privilege escalation attacks. (The main type of networking was dialup terminal access, which was a simpler interface and usually not vulnerable to buffer overflows.) Gaining control of an application was relevant to the designers of the trusted computing criteria (Rainbow Books), where objects were tagged with a persistent security level enforced by the TCB, not the program that accesses them. Except for the most secure military systems, computers were mainly used by academia and businesses, which focused more on password security and preventing access by unsophisticated attackers. Why write a complex exploit when there are weak or no passwords on system accounts?

In 1988, the wide availability of Unix source code and a common CPU architecture (VAX) led to the first malicious buffer overflow exploit. Before the rise of the minicomputer, there were many different CPU architectures on the Arpanet. The software stacks were also changing rapidly (TOPS-20, RSX-11, VMS). But by the late 1980’s, the popularity of DEC, Unix, and BSD in particular had led to a significant number of BSD VAX systems on the Internet, probably the most homogeneous it had been up to that point.

I think the widespread knowledge of the VAX architecture, along with the number of BSD systems on the Internet, led to it being the target of the first malicious buffer overflow exploit. Building a worm to target a handful of systems wouldn’t have been nearly as much fun, and BSD and the VAX had hit a critical mass on the Internet. More general Unix flaws (e.g., the trust relationships used by the “r” utilities) were also exploited by the worm, but the VAX buffer overflow was the only flaw of its kind the author chose to target, or perhaps the only one he was familiar enough with to create an exploit for.

With this lone exception, any further exploitation of buffer overflows remained quite secret. Sun workstations and servers running 68k and then SPARC processors became the Internet hosts of choice by the 1990’s. However, there was significant fragmentation, with many hosts running IRIX on SGI MIPS, IBM AIX on RT and RS, and HP-UX on PA-RISC. Even though Linux and the various BSD distributions had been launched, they were mostly used by hobbyists as client PCs, not servers.

Meanwhile, a network security community had developed around mailing lists such as Bugtraq and Zardoz. Most members were sysadmins and hackers, and few vendors actively participated (Sun was a notable exception). Exploits and fixes/workarounds were regularly published, most of them targeting logic flaws such as race conditions or file descriptor leakage across trust domains. (On a side note, I find it amusing that any debate about full disclosure back then usually involved how many days someone should wait between notifying the vendor and publishing an exploit, not an advisory. Perhaps we should be talking in those same terms today.)

In 1995, the few buffer overflow exploits that were published targeted only non-x86 systems (SunOS and HP-UX, in particular). Such systems were still the most prevalent servers on the Internet. The network security community was focused on this kind of “big iron”, with few exploits published for open-source Unix systems or Windows, which was almost exclusively a client OS. This changed with the splitvt exploit for Linux on x86, although it took a year before the real impact of this change appeared.

One of the reasons buffer overflows were so rarely exploited on Unix workstations in the early 1990’s was the difficulty of getting CPU and OS architecture information. Vendors kept a lot of this information private or only provided it in costly books. Sure, you could read the binutils source code, but without a working understanding of the assembly language, it was difficult to gain this kind of low-level knowledge. Also, common logic flaws were easier to exploit for the Unix-centric sysadmin community that populated Bugtraq.

In 1996, several worlds collided and buffer overflow exploitation grew rapidly. First, Linux and BSD x86 systems had finally become common enough on the Internet that people cared about privilege escalation exploits specific to that CPU architecture. Second, open-source Linux and BSD code made it easy to understand stack layout, trap frames, and other system details. Finally, the virus community had become interested in network security and brought years of low-level x86 experience.

The last factor is the one I haven’t seen discussed elsewhere. Virus enthusiasts (virus authors, defenders, and various interested onlookers) had been engaged in a cat-and-mouse game since the 1980’s. The game was played out almost exclusively on x86 DOS systems. During this period, attackers created polymorphic code generators (MtE) and defenders created a VM-based system for automated unpacking (TBAV). Even today, the impact of all this innovation is still slowly trickling into mainstream security products.

Once all three of these components were in place, buffer overflow exploits became common and the field advanced quickly through the present day. CPU architecture information is now freely available and shellcode and sample exploits exist for all major systems. New techniques for mitigating buffer overflows and subverting mitigations are also constantly appearing with no end in sight.

The advent of the Code Red and Slammer worms in the early 2000’s can be seen as a second coming of the 1988 worm. At that point, Windows servers were common enough on the Internet to exploit but did not yet contain valuable enough data to keep the flaw secret. That period quickly ended as banking and virtual goods increased the value of compromising Internet hosts, and client-side exploits also became valuable. While some researchers still publish buffer overflow exploits, the difficulty of producing a reliable exploit and the associated commercial value mean that many are now sold instead. The race that started in the 1970’s continues to this day.

Why buffer overflow exploitation took so long to mature

I think the history of buffer overflow exploits is interesting because of how long it took for techniques to mature. About 16 years passed from awareness to the first public exploitation, and then 8 more years until buffer overflows were commonly exploited. Programmers were aware of this class of flaw but did little to avoid it for 24 years. But why?

The technique of executing code via a buffer overflow was described at least as early as 1972 (pdf). It’s quite likely this was common knowledge even before then, as overlay techniques (loading one code segment over another during execution) were often used to save RAM. The first public malicious exploitation of a buffer overflow was by the 1988 Internet worm.

The Morris worm exploited a flaw in the finger daemon, which used gets() to read the username from the client into a buffer on the stack. Its shellcode was only targeted at 4.3 BSD Unix on the VAX CPU, and it used a return address overwrite to gain control. Interestingly enough, the worm did not have shellcode for SunOS even though the Motorola 68k CPUs would have been easy to exploit, and Sun’s fingerd was vulnerable to the same flaw.
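To make the mechanism concrete, here is a minimal sketch of the vulnerable pattern, written from the description above rather than from the actual fingerd source (the buffer size and names are arbitrary). gets() copies input with no length check, so a request longer than the buffer overwrites the saved return address, and a carefully constructed request redirects execution into attacker-supplied bytes. gets() was eventually removed from the C standard for exactly this reason.

```c
/* Minimal sketch of the fingerd-style bug class (illustrative only, not the
 * 1988 source). gets() copies until newline/EOF with no bounds check, so a
 * long request overruns "line" and clobbers the saved return address. */
#include <stdio.h>

static void handle_request(void)
{
    char line[512];               /* fixed-size buffer on the stack */

    gets(line);                   /* unbounded copy: the vulnerable call */
    printf("finger request for: %s\n", line);
}   /* on return, a corrupted return address transfers control to whatever
     * address the attacker's input wrote over it */

int main(void)
{
    handle_request();
    return 0;
}
```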

Public exploitation of overflows took a 7-year hiatus, even with the publication of a seminal 1990 paper on fuzz testing. Everyone seemed to know overflows were present and theoretically exploitable, but no one seemed to be exploiting them or worried about fixing them.

Then in February 1995, Thomas Lopatic published a stack overflow exploit for NCSA httpd on HP-UX. This was an excellent piece of work, but on an obscure OS and CPU. (By the way, Thomas was also known for his work on a loader for Amiga games and later for analyzing the Windows XP activation scheme.)

However, buffer overflow exploits were still few and far between. In August, 8lgm published an advisory for syslog() on SunOS SPARC, but no public exploit. In December, the splitvt exploit for Linux x86 was published. However, a month later, some people were still wondering how it worked.

In November 1996, Aleph1 published “Smashing the Stack for Fun and Profit”. This important article described in detail the evolution of a stack overflow exploit. Though I don’t know the exact history, Aleph1’s x86 shellcode is nearly identical to the shellcode in the splitvt exploit, so perhaps his method was derived from it. The article also provided shellcode for SunOS and Solaris on SPARC, which was an important advance since such machines were still more “interesting” than x86 systems.

After this paper, numerous stack overflows were published. Research on both sides advanced rapidly, with new techniques such as heap overflow exploitation and stack integrity protection. So why had research in this area taken so long to reach this inflection point of rapid growth?

In the next article, I discuss various factors that collided in 1996, creating the buffer overflow arms race that continues to this day.

Why digital logic is different than analog (part 2)

Last time, we asked the question “What is the difference between analog and digital logic?” We identified two areas: noise and dynamic range. The latter limitation is that you can’t have an infinite (or even very high) voltage, so analog logic will always have some maximum value it can represent. With digital logic, you just add more bits to your representation and you can instantly increase the range. This time, we’ll consider noise and analog computers.

The first computers were analog. One of the initial applications was to calculate artillery trajectories under different conditions. The old way to fire a gun was to measure the wind (strength and direction), distance, and elevation to the target, and then look up a number in a table to get the appropriate angle. Originally these tables were generated by test-firing guns and then deriving the variations with calculations done by hand. People who did this work were known as “computers”.

Analog electronic computers were applied to this task in World War 2. This kind of computer worked with a continuously varying voltage, using op-amps to calculate sums and differences of values. The advantages are that the results are highly accurate and computation is very fast. An analog computer does not suffer from the quantization problem or aliasing because the signal is continuous, not discrete like digital sampling.

We’d still be using analog computers today if it weren’t for noise. The same properties that make a continuous analog signal attractive also make it vulnerable to cascading noise. On every pass through an op-amp, a small amount of noise is added. At some point, the signal can no longer be recovered from the noise and the calculation can’t proceed. Much of the more recent work on analog computers seems to be focused on filtering noise.
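To see how quickly this bites, here is a toy numerical model of my own (the numbers are assumptions, not measurements): a 1 V signal passes through a chain of unity-gain stages, each adding independent noise of 10 mV RMS. Independent noise powers add, so the accumulated noise grows with the square root of the number of stages and the signal-to-noise ratio steadily collapses.

```c
/* Toy model: a 1.0 V signal passes through a chain of unity-gain stages,
 * each adding independent noise of 10 mV RMS. Independent noise powers add,
 * so the accumulated noise RMS grows as sqrt(stages) and SNR falls. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double signal_v  = 1.0;     /* signal amplitude, volts (assumed) */
    const double noise_rms = 0.01;    /* noise added per stage, volts (assumed) */

    for (long stages = 1; stages <= 100000; stages *= 10) {
        double noise_v = noise_rms * sqrt((double)stages);
        double snr_db  = 20.0 * log10(signal_v / noise_v);
        printf("%7ld stages: noise %8.3f V  SNR %6.1f dB\n",
               stages, noise_v, snr_db);
    }
    return 0;
}
```

With these assumed numbers the accumulated noise equals the signal after about ten thousand stages; real circuits differ, but the trend is the point.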

As we saw last time, there is no such thing as a purely digital computer. “Digital” is a point of view, a way of interpreting an analog signal of some kind. This is both simple and sublime. It means that we can overcome the limitations inherent in our analog world by moving up a level of abstraction, manipulating symbols and abstract logic instead of concrete and continuous quantities such as a voltage.

Digital logic was designed to have the regenerative property. What this means is that each stage of processing starts over from scratch, generating its own output signal that depends only on the logical meaning of the previous stage’s result, not on its exact voltage. The key thing to understand here is that a gate does not amplify its input signal directly; it regenerates the output by switching a set of transistors. This means that digital logic stages can be chained arbitrarily deep and the signal at the end is as free of noise as it was at the start.

To maintain the regenerative property, several rules have to be followed. Since transistors work with continuous voltages and can be considered analog components, digital logic rules state how they must be used. Each logic family (CMOS, TTL, LVTTL, ECL, GTL) has its own particular rules. One set of rules deals with timing: how fast the input voltages can change and how long they have to remain at a given level to be registered as a logical 1 or 0. Another set gives the voltage levels for outputting or receiving a logical 1 or 0.

The rules for input and output voltage levels differ, and that difference is the key to noise rejection. You want the output levels to be as far apart as possible, with the input thresholds sitting well inside them. This absorbs noise created by behavior such as ground bounce. If the input thresholds are too close together, though, you’ll get false readings due to noise or intermediate input voltages (low slew rate). The gap between a guaranteed output level and the corresponding input threshold is called the noise margin and is measured in volts.

Here are the specified voltage levels for TTL. Output is on the left:

Output and input TTL levels

The diagram shows that outputs must have at least a 4.5V difference between logic levels, while the input thresholds are only 1.2V apart. For example, if the line is noisy while the sender is driving a logical 0, but the noise peaks stay below the 0.8V input threshold, the receiving gate still produces a clean logical 0 at its output (0.2V or less).
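Turning the diagram into numbers makes the margins explicit. The values below are the ones implied by the description above (output low of at most 0.2V, output high of at least 4.7V, input thresholds of 0.8V and 2.0V); treat them as illustrative and check a datasheet for a real part.

```c
/* Noise margins computed from the levels described above (illustrative
 * values; consult a datasheet for a real logic family). The margin is how
 * much noise a worst-case output can pick up before the next input
 * misreads it. */
#include <stdio.h>

int main(void)
{
    const double v_ol = 0.2;   /* worst-case output low  (V) */
    const double v_oh = 4.7;   /* worst-case output high (V) */
    const double v_il = 0.8;   /* highest input voltage still read as 0 (V) */
    const double v_ih = 2.0;   /* lowest input voltage still read as 1  (V) */

    printf("low-level noise margin:  %.1f V\n", v_il - v_ol);   /* 0.6 V */
    printf("high-level noise margin: %.1f V\n", v_oh - v_ih);   /* 2.7 V */
    return 0;
}
```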

Manufacturers design these tolerances into their components. If you also follow the timing rules, your digital logic will have all the desired properties: noise rejection, the regenerative property, and arbitrarily large dynamic range. This is what people actually mean when they talk about the reliability of digital data and processing.

By the way, different logic families have different thresholds. The output response over a continuously varying input voltage is called the voltage transfer characteristic. It indicates the S-curve of output voltage for a gate as the input varies. Another way of interpreting the above diagram is that by staying to the far left or right of this curve, we avoid the transition region and thus can reject more noise.

It’s often enlightening to step back and consider the assumptions underlying our everyday work. Given that we live in an analog world, I find the ideas and unique guarantees digital logic provides fascinating.

Why digital logic is different than analog

I occasionally come across a concept that, while obvious to those who know it, isn’t as widely spread as it should be. One such question is, “What is the difference between analog and digital logic?” There are a lot of people that know the answer in an abstract way, but the real distinction is both simple and sublime.

The abstract answer is that an analog signal is continuous while a digital signal is discrete, ones and zeros. When people think of analog technology, a common example is cassette tape, hissing and popping, compared to the relative clarity of a CD or MP3. The common-sense explanation for why the digital version sounds better is that the ones and zeros are conceptually perfect, matching the original recording. Copies are also perfect because the ones and zeros can be exactly duplicated.

However, this explanation begins to break down when you consider it closely. Due to the quantization problem, each digital representation (sample) of the waveform at a moment in time is inexact, because the true value can always be divided into parts smaller than the sample resolution captures. So an analog signal at one point in time is more exact than its digital sample. Also, the lower the sampling rate, the greater the error due to aliasing, because a discrete sampling method cannot capture changes in the waveform that occur between samples. Thus, an ideal analog signal is always more accurate than its digital representation.
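Here is a small numerical illustration of aliasing (my own example, with frequencies and sample rate chosen for convenience): sampled at 8 samples per second, a 1 Hz sine and a 9 Hz sine produce the same sample values, so no amount of later processing can tell which tone was actually recorded.

```c
/* Two sine waves, 1 Hz and 9 Hz, sampled at 8 samples per second: every
 * sample agrees (up to rounding), so the 9 Hz tone aliases down to 1 Hz. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double pi = 3.14159265358979323846;
    const double fs = 8.0;                        /* sampling rate, Hz */

    for (int n = 0; n < 8; n++) {
        double t = n / fs;                        /* sample instant */
        printf("n=%d   1 Hz: %+.4f   9 Hz: %+.4f\n",
               n,
               sin(2.0 * pi * 1.0 * t),
               sin(2.0 * pi * 9.0 * t));
    }
    return 0;
}
```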

Going even deeper, there is no such thing as a purely digital signal. When expressed in terms of voltage, a one might be 5V and a zero, 0V. But no actual circuit can make an instantaneous transition from 0 to 5V or back. There’s always some small amount of time where the voltage is rising or falling. In really high-speed circuits or over longer distances, a signal can be both a one and a zero at different points on the wire. This is what engineers mean when they say a circuit has to be modeled as a transmission line. So even digital circuits are actually analog underneath. Digital is really just another way of imposing meaning on an analog circuit.

If analog is better, why do we even have digital? The answer is twofold: noise and dynamic range. Noise is always present in a signal. If restricted to a narrow frequency or time band, noise can often be filtered out. However, there is always a danger of throwing out useful data along with the noise, especially if both are in a similar frequency range. Here is an example of a signal with easily-filtered noise; the signal and the noise differ greatly in frequency and amplitude.


Dynamic range is the difference between the lowest and highest signal level. A system can’t have an infinite voltage, so there is always some limit to the highest value an analog signal can represent. In contrast, a digital signal can represent arbitrary ranges just by adding more bits (32 bits not enough range? Try 64!).


In the next post, we’ll examine noise in more detail to understand the main difference between digital and analog logic.

Attacking RSA exponentiation with fault injection

A new paper, “Fault-Based Attack of RSA Authentication” (pdf) by Pellegrini et al, is making the rounds. The general idea is that an attacker can disrupt an RSA private key operation to cause an invalid signature to be returned, then use that result to extract the private key. If you’re new to fault injection attacks on RSA, I previously wrote an intro that should help.

The main concept to grasp is that public key crypto is brittle. In the case of RSA’s CRT operation, a single bit error in one multiplication result is enough to fully compromise your private key. We’ve known this since 1997. The solution is simple: validate every signature with the public key before returning it to the caller.
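For readers who haven’t seen why a single fault is fatal here: with CRT, the signature is assembled from two half-size exponentiations, one mod p and one mod q. If exactly one half is corrupted, the faulty signature s satisfies s^e = m modulo one prime but not the other, so gcd(s^e - m, n) hands the attacker a factor of n. The sketch below is my own toy illustration of this well-known attack, using tiny primes and no padding; it shows both the factor recovery and why the cheap verify-before-returning check stops it.

```c
/* Toy Bellcore-style fault attack on RSA-CRT (tiny primes, no padding;
 * illustrative only). One bit of the mod-p half of a CRT signature is
 * flipped, and gcd(s^e - m, n) then recovers a factor of n. This is why a
 * signature must be verified against the public key before it is released. */
#include <stdint.h>
#include <stdio.h>

static uint64_t modpow(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t r = 1;
    for (b %= m; e; e >>= 1) {
        if (e & 1) r = r * b % m;
        b = b * b % m;
    }
    return r;
}

static int64_t modinv(int64_t a, int64_t m)      /* extended Euclid */
{
    int64_t old_r = a, r = m, old_s = 1, s = 0;
    while (r) {
        int64_t q = old_r / r, t;
        t = r; r = old_r - q * r; old_r = t;
        t = s; s = old_s - q * s; old_s = t;
    }
    return ((old_s % m) + m) % m;
}

static uint64_t gcd(uint64_t a, uint64_t b)
{
    while (b) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

int main(void)
{
    const uint64_t p = 61, q = 53, n = p * q, e = 17;        /* toy key */
    const uint64_t d = (uint64_t)modinv((int64_t)e, (int64_t)((p - 1) * (q - 1)));
    const uint64_t m = 42;                                   /* "message" */

    /* CRT signing: two half-size exponentiations, then recombine. */
    uint64_t sp   = modpow(m, d % (p - 1), p);
    uint64_t sq   = modpow(m, d % (q - 1), q);
    uint64_t qinv = (uint64_t)modinv((int64_t)q, (int64_t)p);

    sp ^= 1;                       /* FAULT: flip one bit of the mod-p half */

    uint64_t h = qinv * ((sp + p - sq) % p) % p;             /* Garner */
    uint64_t s = sq + q * h;                                 /* faulty sig */

    if (modpow(s, e, n) != m)                  /* the cheap sanity check */
        printf("verify-before-returning would have caught this\n");

    /* The attack: s^e - m is divisible by q but not by p, so gcd leaks q. */
    uint64_t leak = gcd((modpow(s, e, n) + n - m) % n, n);
    printf("recovered factor: %llu (q = %llu)\n",
           (unsigned long long)leak, (unsigned long long)q);
    return 0;
}
```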

The authors noticed something curious. OpenSSL does verify the signatures it generates before returning them, but if it detects a problem, it does not just return an error. It retries the private key operation using a different exponentiation method and then returns that second signature without validating it.

Think about this for a moment. What conditions could cause an RSA private key operation to compute an invalid answer? An innocent possibility is cosmic radiation, bad RAM, etc. In this case, all computations should be considered unreliable and any retried operation should be checked very carefully. The other and more likely possibility is that the system is under attack by someone with physical proximity. In this case, OpenSSL should generate a very obvious log message and the operation should not be retried. If it is, the result should be checked very carefully.

For whatever reason, the OpenSSL programmers decided to retry with fixed-window exponentiation and trust that since there were no published fault attacks for it, they didn’t have to validate its result. This is a foolhardy attitude — not something you want to see in your crypto library. There had been many other fault injection attacks against various components or implementation approaches for RSA, including right-to-left exponentiation. There was no reason to consider left-to-right exponentiation invulnerable to this kind of attack.

Fixed-window exponentiation is a form of sliding window exponentiation. This is just a table-based optimization, where a window (say, 3 bits wide) is moved across the exponent, computing the final result incrementally. While this may be resistant to some timing attacks (but not cache timing or branch prediction attacks), there is nothing that would prevent fault injection attacks.
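For readers who haven’t seen it, here is a rough sketch of what fixed-window exponentiation looks like. This is the generic textbook technique, not OpenSSL’s implementation, and the 3-bit window width is an arbitrary choice: the exponent is consumed w bits at a time using a small table of precomputed powers, so every window costs the same w squarings and one multiply regardless of its value.

```c
/* Sketch of fixed-window (2^w-ary) modular exponentiation, the textbook
 * technique referred to above, not OpenSSL's actual implementation.
 * Works for small moduli only (intermediate products must fit in 64 bits). */
#include <stdint.h>
#include <stdio.h>

#define WINDOW 3                                /* window width w, in bits */

static uint64_t fixed_window_modexp(uint64_t base, uint64_t exp, uint64_t mod)
{
    /* Precompute table[i] = base^i mod mod for every possible w-bit digit. */
    uint64_t table[1u << WINDOW];
    table[0] = 1 % mod;
    for (unsigned i = 1; i < (1u << WINDOW); i++)
        table[i] = table[i - 1] * base % mod;

    /* Find the highest w-bit window that contains a set bit. */
    int top = 0;
    while (top + WINDOW < 64 && (exp >> (top + WINDOW)) != 0)
        top += WINDOW;

    /* Consume the exponent one window at a time, most significant first:
     * w squarings, then exactly one table multiply per window. */
    uint64_t result = 1;
    for (int shift = top; shift >= 0; shift -= WINDOW) {
        for (int i = 0; i < WINDOW; i++)
            result = result * result % mod;
        result = result * table[(exp >> shift) & ((1u << WINDOW) - 1)] % mod;
    }
    return result;
}

int main(void)
{
    /* Self-check: 3^10 mod 1000 = 59049 mod 1000 = 49 */
    printf("%llu\n", (unsigned long long)fixed_window_modexp(3, 10, 1000));
    return 0;
}
```

That per-window table multiply is also what makes each w-bit chunk of the private exponent a small, independent target: as described below, an attacker only has to test 2^w candidate values for a window against an observed faulty signature.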

Indeed, it turns out to be vulnerable. The authors generate a few thousand signatures with single bit flips affecting some window of the computation. Then they compare the faulty signatures to a correct signature over the same message. From that they compute the value of that portion of the private exponent, since there are only 2^w possibilities for that location, where w is the window size in bits. This is repeated until enough of the private key is known.

The method they used to create the faulty signatures was a bit artificial. They built a SPARC system on an FPGA running Linux and OpenSSL. They then decreased the power supply voltage until multiplies started to fail. Since multiplication logic is a relatively long chain, it is often one of the first things to fail. However, a more interesting hardware result would be to attempt this kind of attack on an actual server because FPGAs work differently than ASICs. It might require careful targeting of the right power pins on the CPU. Since power pins are numerous in modern systems, this may be more effective than only modulating the system power supply.

This was a nice attack but nothing earth-shattering. The only thing I was floored by (yet again) was the willingness of crypto implementers to perform unsafe operations in the face of an almost certain attack. Shame on OpenSSL.