History of memory corruption vulnerabilities and exploits

I came across a great paper, “Memory Errors: The Past, the Present, and the Future” by van der Veen et al. The authors cover the history of memory corruption errors as well as exploitation and countermeasures. I think there are a number of interesting conclusions to draw from it.

It seems that the number of flaws in common software is still much too high. Consider what’s required to compromise today’s most hardened consumer platforms, iOS and Chrome. You need a flaw in the default install that is useful and remotely accessible, memory disclosure bug, sandbox bypass (or multiple ones), and often a kernel or other privilege escalation flaw.

Given a sufficiently small trusted computing base, it should be impossible to find this confluence of flaws. We clearly have too large a TCB today since this combination of flaws has been found not once, but multiple times in these hardened products. Other products that haven’t been hardened require even less flaws to compromise, making them more vulnerable even if they have the same rate of bug occurrence.

The paper’s conclusion shows that if you want to prevent exploitation, your priority should be preventing stack, heap, and integer overflows (in that order). Stack overflows are by far still the most commonly exploited class of memory corruption flaws, out of proportion to their prevalence.

We’re clearly not smart enough as a species to stop creating software bugs. It takes a Dan Bernstein to reason accurately about software in bite-sized chunks such as in qmail. It’s important to face this fact and make fundamental changes to process and architecture that will make the next 18 years better than the last.

Old programming habits die hard

While programming, it’s enlightening to be aware of the many influences you have. Decisions such as naming internal functions, coding style, organization, threading vs. asynchronous IO, etc. all happen because of your background. I think you could almost look at someone’s code and tell how old they are, even if they keep up with new languages and patterns.

When I think of my own programming background, I remember a famous quote:

“It is practically impossible to teach good programming to students that have had a prior exposure to BASIC: as potential programmers they are mentally mutilated beyond hope of regeneration.”
— Edsger W.Dijkstra, June 18, 1975

Large memory allocations are a problem

A common mistake is keeping a fixed memory allocation pattern in mind. Since our machines are still changing exponentially, even a linear approach would quickly fall behind.

Back in the 90’s, I would make an effort to keep frames within 4K or 8K total to avoid hitting a page fault to resize the stack. Deep recursion or copying from stack buffer to buffer were bad because they could trigger a fault to the kernel, which would resize the process and slow down execution. It was better to reuse data in-place and pass around pointers.

Nowadays, you can malloc() gigabytes and servers have purely in-memory databases. While memory use is still important, the scale that we’re dealing with now is truly amazing (unless your brain treats performance as a log plot).

Never jump out of a for loop

The BASIC interpreter on early machines had limited garbage collection capability. If you used GOTO in order to exit a loop early, the stack frame was left around, unless you followed some guidelines. Eventually you’d run out of memory if you did this repeatedly.

Because of this, it always feels a little awkward in C to call break from a for loop, which is GOTO at the assembly level. Fortunately, C does a better job at stack management than BASIC.

Low memory addresses are faster

On the 6502, instructions that access zero page addresses (00 – ff) use a more compact instruction encoding than other addresses and also execute one cycle faster. In DOS, you may have spent a lot of time trying to swap things below the 1 MB barrier. On an Amiga, it was chip and fast RAM.

Thus, it always feels a bit faster to me to use the first few elements of an array or when an address has a lot of leading zeros. The former rule of thumb has morphed into cache line access patterns, so it is still valid in a slightly different form. With virtualized addressing, the latter no longer applies.

Pointer storage is insignificant

In the distant past, programmers would make attempts to fold multiple pointers into a single storage unit (the famous XOR trick). Memory became a little less scarce and this practice was denounced, due to its impact on debugging and garbage collection. Meanwhile, on the PC, segmented memory made the 16-bit pointer size insignificant. As developers moved to 32-bit protected mode machines in the 90’s, RAM size was still not an issue because it had grown accordingly.

However, we’re at a peculiar juncture with RAM now. Increasing pointers from 32 to 64 bits uses 66% more RAM for a doubly-linked list implementation with each node storing a 32-bit integer. If your list took 2 GB of RAM, now it takes 3.3 GB for no good reason. With virtual addressing, it often makes sense to return to a flat model where every process in the system has non-overlapping address space. A data structure such as a sparse hash table might be better than a linked list.

Where working set size is less than 4 GB, it may make sense to stay with a 32-bit OS and use PAE to access physical RAM beyond that limit. You get to keep 32-bit pointers but each process can only address 4 GB of RAM. However, you can just run multiple processes to take advantage of the extra RAM. Today’s web architectures and horizontal scaling means this may be a better choice than 64-bit for some applications.

The world of computing changes rapidly. What kind of programming practices have you evolved over the years? How are they still relevant or not? In what ways can today’s new generation of programmers learn from the past?

Magic numbers in Excel waste my time

One of the tools I created recently output its data in CSV format. The Python CSV library is quite nice. However, opening the file in Excel gave the error “SYLK: file format is not valid” or “Excel has detected that ‘test.csv’ is a SYLK file, but cannot load it.” OpenOffice handled the file just fine.

It turns out that a CSV file with the first two bytes set to “ID” (case-sensitive) is detected as a different file format by Excel. And this is why I hate software.

Excel has detected that 'test.csv' is a SYLK file, but
cannot load it.

XMLDsig welcomes all signatures

There’s a new flaw in the W3C XMLDsig standard, found by Thomas Roessler. This standard specifies the procedure for signing and validating XML data, such as for SOAP. The document is validated with HMAC; however, an additional parameter can be supplied, <HMACOutputLength>. This allows the signer to use a truncated HMAC.

As you may remember, an HMAC is a way of validating data by using a secret key and a cryptographic hash algorithm. I avoid using the term “keyed hash” as that leads to bad implementations like “HASH(key || data)”. The output of an HMAC is the size of the hash algorithm, say 160 bits for SHA-1. This value is sometimes truncated in space-constrained designs. For example, only the first 128 bits might be sent and verified. This is sometimes acceptable because circumventing an HMAC requires a second-preimage attack (2n), unlike forging a signature which only requires a collision (2n/2).

The problem is that there is no specified lower bound on <HMACOutputLength>. So this is a perfectly valid signature field in XMLDsig:

<SignatureMethod Algorithm="http://www.w3.org/2000/09/xmldsig#hmac-sha1">

Yes, an attacker can send any request and any signature and it will be accepted if the server follows the field blindly and validates only 0 bits. Even a server that checked for 0 might allow a truncated length of 1 bit or 8 bits or whatever.

According to the advisory, it has taken 6 months to release this simple spec erratum. This is because it affected so many vendors. This is a great example how a little leniency in crypto can have a huge impact on security.

Also, I’m not certain this bug is actually fixed. For example, I see that the algorithm can be specified by the same field. So could an attacker specify a broken algorithm like MD4, which has a second-preimage attack? If the server’s crypto library just maps the given Algorithm field to a hash function, this might be possible.

The spec says:

Requirements are specified over implementation, not over requirements for signature use. Furthermore, the mechanism is extensible; alternative algorithms may be used by signature applications.

So it’s not clear whether or not this is a problem, although ambiguity in crypto is one of the biggest sources of flaws.

The most annoying thing about this entire vulnerability is why were truncated HMACs even included in the standard at all? This is XML we’re talking about, not some packed binary protocol. The difference between full HMAC-SHA1 and the new minimum allowed truncated value is 10 bytes (plus base64 encoding expansion). You’re telling me that it’s worth exposing all your implementers to this kind of crypto risk to save 10 bytes? Really?

When Crypto Attacks! slides posted

I have now posted slides for the talk I gave yesterday at Yahoo Security Week (see below). I also took this opportunity to upload the previous talks I have given since 2004 to Slideshare.

The talk was mostly an in-depth list of attacks against various crypto implementations. The good news is that developers seem to have gotten the message not to design their own ciphers. Now, we’re trying to get the message out that you shouldn’t be implementing your own crypto protocols or constructions, using low-level crypto libraries.

Instead, developers should work at a higher level, using libraries like GPGME, Keyczar, or cryptlib. You wouldn’t write a web application in assembly language. Why take the risk of implementing your own crypto constructions?

If you do end up designing/implementing your own construction, it is really important to get it reviewed by a third party. Since it can be expensive and time-consuming to gain assurance, it’s better in nearly all cases to use a high-level library. The alternative is a potential root key compromise. Are you willing to take that chance?

Fault-tolerant system design

When designing a system, I have a hierarchy of properties. It is somewhat fluid depending on the application but roughly breaks down as follows.

  1. Correct: must perform the basic required function
  2. Robust: … under less-than-perfect operating conditions
  3. Reliable: … handling and recovering from internal errors
  4. Secure: … in the face of a malicious adversary
  5. Performance: … as fast as possible

I’d like to talk about how to achieve reliability. Reliability is the property where the system recovers from out-of-spec conditions and implementation errors with minimal impact on the other properties (i.e., correctness, security). While most system designers have an idea of the desired outcome, surprisingly few have a strategy for getting there. Designing for reliability also produces a system that is easier to secure (ask Dan Bernstein.)

Break down your design into logical components

This should go without saying, but if you have a monolithic design, fault recovery becomes very difficult. If there is an implicit linkage between components, over time it will grow explicit. Corollary: const usage only decreases over time in a fielded system, never increases.

Each component of the system should be able to reset independently or in groups

As you break down the system into components, consider what dependencies a reset of each component triggers. A simple module with less dependencies is best. I use the metric of “cross-section” to describe the complexity of inter-module dependencies.

Implement reset in terms of destructors/constructors

Right from the beginning, implement reset. Since you’re already coding the constructors and should have been coding destructors (right?), reset should be easy. If possible, use reset as part of the normal initialization process to be sure it doesn’t become dead code.

Increase the speed of each component’s recovery and make them
restart independently if possible

If components take a long time to recover, it may result in pressure from customers or management to ditch this feature. A component should never take longer to recover than a full system reset, otherwise rethink its design. Independence means that reset can proceed in parallel, which also increases performance.

Add a rollback feature to components

In cases where a full reset results in loss of data or takes too long, rollback may be another option. Are there intermediate states that can be safely reverted while keeping the component running?

Add error-injection features to test fault recovery

Every component should have a maintenance interface to inject faults. Or, design a standard test framework that does this. At the very least, it should be possible to externally trigger a reset of every component. It’s even better to allow the tester to inject errors in components to see if they detect and recover properly.

Instrument components to get a good report of where fault appeared

A system is only as debuggable as its visibility allows. A lightweight trace generation feature (e.g., FreeBSD KTR) can give engineers the information needed to diagnose faults. It should be fast enough to always be available, not just in debug builds.