Preventing branch prediction timing attacks

A few years ago, Aciicmez et al. published a new timing attack (1, 2, 3) exploiting the branch prediction feature of x86 processors. They explained how to extract the RSA private key from one process by monitoring the behavior of the branch prediction unit from a concurrent spy process. This extends previous work on cache timing attacks (see also Percival and Bernstein).

These kinds of attacks all target microarchitectural features of the processor where a shared resource (I- or D-cache, branch target buffer) is maintained by the processor across context switches. Any resource whose state is instruction- or data-dependent and can be measured from outside the security boundary can potentially be used for a side-channel attack.

Most developers don’t get the security boundary right. Usually, they will consider the whole machine as the boundary. But of course, even this simple concept can be more complicated in practice (say, host vs. guest VM environments). Others consider the OS process boundary as their security boundary. But consider a web server process with built-in SSL. After the SSL decryption, any changes in the performance of the general webserver code due to the cache state of the SSL decryption may reveal information about the private key. Thus, it is important to first draw the security boundary as narrowly as possible.

In a branch prediction attack, the attacker typically runs a spy process concurrently with the cipher. It repeatedly does the following:

  1. Execute a set of branches
  2. Measure the overall execution time of all its branches
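The structure of that loop can be sketched in Python. This is illustrative only: Python cannot observe BTB state, and the branch pattern, iteration counts, and timer here are stand-ins for what a real spy written in assembly would use (carefully laid-out jumps that alias with the victim's branch predictor entries, timed with rdtsc).

```python
import time

def run_branches(pattern):
    """Execute a set of conditional branches. In a real spy, these would be
    jumps placed at addresses that collide with the victim's BTB entries."""
    acc = 0
    for bit in pattern:
        if bit:          # each taken/not-taken decision exercises the predictor
            acc += 1
        else:
            acc -= 1
    return acc

def measure_once(pattern):
    """One spy iteration: execute the branch set and time it."""
    start = time.perf_counter()
    run_branches(pattern)
    return time.perf_counter() - start

# Repeat the measurement. In a real attack, spikes in these samples would
# correlate with the victim's key-dependent branches retraining the predictor.
samples = [measure_once([1, 0] * 512) for _ in range(100)]
```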

Since an RSA decryption happens over a relatively long period of time, the cache and BTB state can be measured multiple times by the spy process during a single decryption. Public key algorithms take longer because they are doing a multi-word computation, and there are often conditional branches based on the private key. Also, the spy process can do various things to try to force the crypto process to context-switch more often.

Last year, I wrote about the patch Intel submitted to mitigate cache timing attacks. The OpenSSL patch they submitted for dealing with branch prediction attacks (“Smooth CRT-RSA”) is also interesting. They created new big number functions BN_mod_inverse_no_branch() and BN_div_no_branch(), which remove any conditional branches based on information related to the private key. See section 5.3 of their paper for details or just read the comments in the diff above.

The key insight behind these functions is that the extra reduction step can be eliminated if the range of the value returned by Montgomery-Multiply is extended. The modulus n in RSA is the product of two primes, p and q, and CRT-based decryption computes separately mod p and mod q. The MMUL algorithm returns a value in [0, 2p), which means the naive implementation performs an optional extra reduction to fit the value into the range [0, p). The patch avoids this conditional step by adding an extra word to the parameters passed to MMUL, effectively zero-padding them. Additionally, the Montgomery parameters are increased by n/2 to account for this modified range.
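The extra reduction is easiest to see in code. Below is a generic textbook sketch of Montgomery reduction (REDC) in Python, not the OpenSSL implementation; the key-dependent step is the final conditional subtraction, which is exactly what the patch's widened-operand trick eliminates.

```python
def montgomery_params(n, R):
    """Precompute n' = -n^-1 mod R for an odd modulus n and R = 2^k > n."""
    return (-pow(n, -1, R)) % R

def redc(T, n, R, n_prime):
    """Montgomery reduction: returns T * R^-1 mod n for T < R*n.
    The intermediate t lies in [0, 2n); the 'if t >= n' branch is the
    optional extra reduction whose timing the attacks observe."""
    m = (T * n_prime) % R
    t = (T + m * n) // R
    if t >= n:        # data-dependent branch: taken only sometimes
        t -= n
    return t

# Multiply a*b mod n via Montgomery form (toy-sized modulus).
n, R = 2003, 2048                  # n odd, R = 2^11 > n
n_prime = montgomery_params(n, R)
a, b = 1234, 567
aR, bR = (a * R) % n, (b * R) % n  # convert into Montgomery form
abR = redc(aR * bR, n, R, n_prime) # Montgomery product = a*b*R mod n
result = redc(abR, n, R, n_prime)  # convert back out of Montgomery form
```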

The lesson from all this is that it’s very hard to eliminate side channels, and you have to be careful when defining the security boundary. Even if you used this branchless CRT implementation, you might still be vulnerable to timing attacks if you used a naive algorithm for loading the private key from a file and converting it to binary format in memory (e.g., for loop with atoi()).

Note to WordPress on SSL

Dear WordPress/Automattic:

Your servers do not offer SSL session resumption. This means every connection requires a full handshake: each one includes your server certificate (3807 bytes) and forces your server to perform a 2048-bit RSA decryption. This happens for every piece of data fetched over SSL, even the tiny button images that are smaller than the certificate itself.

WP SSL Server Hello message

You should really enable SSL session resumption. It will save a lot of money in server cost and bandwidth, and your users will be happier too.
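For an Apache/mod_ssl front end, for example, enabling resumption is just a matter of configuring a shared session cache so abbreviated handshakes can skip the RSA exchange (directive names are from mod_ssl; the path and sizes here are illustrative):

```apache
# Shared-memory session cache: returning clients resume a prior session
# instead of repeating the certificate exchange and RSA decryption.
SSLSessionCache        "shmcb:/var/run/ssl_scache(512000)"
SSLSessionCacheTimeout 300
```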

Thanks,
Nate

[Edit: WordPress staff replied that this was a mistake in their configuration and now this is fixed.]

SSL is not table salt

While I haven’t written an article in a while, I’m still alive. I just got buried with work, tax prep, and using every spare moment to try to finish up the xum1541. Last week, I attended the iSec Forum and saw a talk about cookie forcing based on work by Chris Evans and Michal Zalewski. This attack involves overwriting SSL-only cookies with a cookie injected into a non-SSL connection. In other words, browsers prevent disclosure of SSL-only cookies, but not deletion or replacement by cookies from an insecure session.

I don’t follow network security closely, so this may be an older attack. However, it reminds me how web application and browser designers treat SSL like table salt — sprinkle a little bit here and there, but be careful not to overuse it. That’s completely the wrong mentality.

WordPress recently notified their users how to enable SSL for the admin interface. While it’s admirable that they are providing more security, the attitude behind the post is a great example of this dangerous mentality. They claim SSL is only recommended when blogging from a public network, even going so far as to suggest it be disabled again when back on a “secure network”. It’s hard to believe performance is the issue, given the CPU gains in the past 13 years.

Attention: if you’re using a web application on a shared network (say, the Internet), you’re not on a secure network. This whole idea that users should pick and choose SSL based on some ephemeral security assessment of the local network is insane. How can you expect anyone, let alone regular users, to perform a security assessment before disabling SSL and then remember to re-enable it before traveling to an insecure network? (You can’t log into your blog and re-enable SSL from the insecure network because you would get compromised doing so.)

Likewise, sites such as Yahoo Mail use SSL for submitting the login password, but then provide a session cookie over plain HTTP. A session cookie is almost as good as a password. As long as the attacker refreshes their own session periodically, the cookie stays valid. (Do any web services implement an absolute session limit?) Even if the user clicks “log out”, the attacker can feed a fake logout page to them and keep the cookie active.

All cookies should definitely have their own cryptographic integrity protection and encryption, independent of SSL. But it is clear that the entire attitude toward SSL is wrong, and we will all eventually have to change it. Open wireless networks have helped session hijacking proliferate, no ARP spoofing needed. Soon, malware may contain a MITM kit to compromise any user accessing a website who shares an access point with a rooted system. As this attack becomes more common, perhaps we’ll see the end of SSL as an accessory, and it will be mandated for the entirety of every authenticated session. The prevailing attitude will have to change first.
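A minimal sketch of such cookie integrity protection, assuming a server-side secret key (the helper names are hypothetical; a real design would also bind the cookie to an expiry and a user identity, and add encryption if the contents are sensitive):

```python
import hmac
import hashlib

SECRET = b"server-side secret key"  # hypothetical key; never sent to clients

def sign_cookie(value: str) -> str:
    """Append an HMAC-SHA256 tag so tampering or injection is detectable."""
    tag = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    return f"{value}.{tag}"

def verify_cookie(cookie: str) -> bool:
    """Recompute the tag and compare in constant time."""
    value, _, tag = cookie.rpartition(".")
    expected = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)

cookie = sign_cookie("session=abc123")
```

An injected or overwritten cookie fails verification because the attacker cannot compute a valid tag without the server-side key.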

Forged CA cert talk at 25C3

A talk entitled “MD5 considered harmful today” (slides) is being presented at 25C3. The authors describe forging a CA cert that will be accepted by all browsers by exploiting the fact that several trusted root CAs sign certs with MD5. This allows them to spoof their identity as any SSL-enabled website on the Internet, and it will look perfectly valid to the user.

The growth of trusted root CAs included in standard browsers has been an issue for a while. Every root CA is equal in the eyes of the browser, thus the end-user’s security is equivalent to the security of the weakest root CA. The default Firefox install will accept a Yahoo cert signed by “TurkTrust”, or any other of more than 100 root certs. I don’t know how good each of those companies is at securing its keys, implementing strict cert chain validation, and checking the identity of every submitter. So it’s a good bet that putting crypto authority in the hands of that many parties will result in repeated failures.

The attack is interesting since they take advantage of more than one flaw in a CA. First, they find a CA that still uses MD5 for signing certs. MD5 has been broken for years, and no CA should have been doing this. Next, they prepared an innocent-looking cert request containing the “magic values” necessary to cause an MD5 collision. They were able to do this because of a second flaw: the CA in question used an incrementing serial number instead of a random one. Since the serial is part of the signed data, randomizing it is a cheap way to add unpredictability, and it would have thwarted this particular attack — at least until a pre-image vulnerability was found in MD5. Don’t count on this for security! MD4 fell to a second pre-image attack a few years after the first collision attacks, and attacks only get better over time.

This talk definitely points out that crypto attacks are not being addressed quickly enough in the real world. While it is difficult to roll out a new root CA cert, it’s better to do so over the years we have known MD5 to be insecure than in a rush after an attack has already occurred. Another excellent talk at 25C3 on the iPhone described how the baseband processor was compromised via the lack of validation of RSA signature padding. What’s intriguing is that Apple’s own RSA implementation in their CDSA code was not vulnerable to this flaw, but apparently a different vendor supplied the baseband code.

To paraphrase Gibson, “Crypto security is available already, it just isn’t equally distributed.”

Comcast to give up BitTorrent blocking?

While discussion about how to implement traffic shaping was occurring, Comcast and BitTorrent.com made a joint announcement last week.  Comcast will apply bulk rate-limiting to customer connections and not target BitTorrent uploads.  And BitTorrent (the company) will cooperate with Comcast to make the protocol more amenable to ISP control in some way to be determined.

My claim was that this was all about centralized control of applications, not about congestion management.  After Comcast began to get heat from the FCC for not announcing their ad-hoc filtering, they decided it would be best to exit this experiment for now.  Comcast’s move seems to confirm my analysis.

Note what was absent in this announcement.  No complaints that aggregate rate-limiting is too hard or expensive to deploy.  No argument that only they knew what was good for their network and that seeding needed to be stopped at all costs.  They simply announced they would move to rate-limiting, which was a simpler and more appropriate response all along.

The strangest part of this whole affair was the MPAA announcement praising BitTorrent for cooperating with Comcast.  However, if you accept the theory that this is all about control, it makes more sense.  The *AA organizations have been stymied by protocols that don’t have a single responsible entity that can be petitioned or shut down.

They may assume this BitTorrent announcement is the start of a new era where ISPs can be used as a lever to bring application and protocol authors back to the negotiating table.  Under cover of “network management”, the ISPs will provide the necessary pressure to get changes made to allow more centralized control.  Only time will tell if this is just a historical footnote or the start of something new.

Preventing RSA cache timing attacks

It has been known for a while that side-channel information about crypto operations (e.g., timing) can give enough information to recover secret keys. The original paper by Kocher even identified RAM cache behavior as a potential source of timing variance.

In 2005, Osvik and Percival separately published papers on cache timing attacks. The latter resulted in a PR incident for Intel since Hyperthreading, which AMD doesn’t have, was singled out as the key vantage point for observing the timing of crypto running on the sibling logical processor of the same physical core. But since cache-related side channels are exploitable even in single-CPU systems, crypto libraries needed to be updated to fix this issue whether Hyperthreading was present or not.

For RSA, it’s necessary to understand modular exponentiation in order to understand the attack. When decrypting or signing a message using the private exponent d, the server computes m^d mod n. To do this efficiently with big numbers, a number of techniques are used.

The basic technique, square-and-multiply, exploits the binary representation of the exponent: squaring the accumulator corresponds to shifting the exponent bits processed so far one position to the left. The method walks through each bit of the exponent. If the bit is 0, the accumulating result (initially set to m) is squared. If it is 1, it is squared and then multiplied by m. In nearly all implementations, it takes measurably longer to square and then multiply than to merely square, giving a very distinguishable timing difference. An attacker can figure out which bits are 0 or 1 from this difference.
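The loop described above can be sketched in a few lines of Python (structurally faithful, though real implementations operate on multi-word bignums rather than Python integers):

```python
def square_and_multiply(m, d, n):
    """Compute m^d mod n, scanning exponent bits most-significant first.
    The per-bit 'multiply or not' decision is exactly the timing leak."""
    bits = bin(d)[2:]            # MSB first; the leading bit is always 1
    acc = m % n                  # accumulator initially set to m
    for bit in bits[1:]:
        acc = (acc * acc) % n    # always square
        if bit == "1":
            acc = (acc * m) % n  # extra multiply only for 1 bits -> slower
    return acc
```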

Since basic square-and-multiply is still too slow, a method called “windowed exponentiation” was invented. Instead of examining each individual bit of the exponent, it walks through groups of bits of size w. There are both fixed and sliding window variants, but I’ll stick to fixed-window here for the sake of simplicity. Square-and-multiply is simply fixed-window exponentiation with w = 1.

For the common size w = 3, these values are precomputed mod n: m^2, m^3, m^4, m^5, m^6, and m^7. Obviously, m^0 is 1 and m^1 is just m, so they are already known.

Just like in square-and-multiply, the accumulator is always first “squared”. However, since we are operating on multiple bits at once, it’s actually raised to 2^w, or 2^3 = 8 in our example. Then, the value of the bits in the window is checked. If 000b, nothing more happens. If 001b, the accumulator is multiplied by m. If 010b, it’s multiplied by the precomputed m^2, and so forth. This speeds up the processing by handling batches of exponent bits, at the expense of some small precomputation.
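Here is the same idea as a fixed-window sketch in Python (a simplified model of the algorithm, not OpenSSL's multi-precision bn_exp.c):

```python
def fixed_window_exp(m, d, n, w=3):
    """Compute m^d mod n, consuming w exponent bits per iteration."""
    # Precompute m^0 .. m^(2^w - 1) mod n; the table index looked up on
    # each iteration is what the cache attack below recovers.
    table = [pow(m, i, n) for i in range(2 ** w)]
    bits = bin(d)[2:]
    bits = bits.zfill(-(-len(bits) // w) * w)  # pad to a multiple of w
    acc = 1
    for i in range(0, len(bits), w):
        acc = pow(acc, 2 ** w, n)              # "square": raise to 2^w
        window = int(bits[i:i + w], 2)         # value of this w-bit group
        if window:
            acc = (acc * table[window]) % n    # multiply by precomputed m^window
    return acc
```

With w = 1 this degenerates to square-and-multiply, as noted above.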

Those precomputed values were typically stored sequentially in an array in memory, each entry the size of the modulus n (e.g., 256 bytes each for a 2048-bit n). Since the cache line size on modern x86 CPUs is 64 bytes, the multiply operation would take a little longer if the precomputed value wasn’t already in the cache.

An attacker could repeatedly cause these values to be evicted from the cache and then, based on the computation time, see which ones were used. Or, he could “touch” an array of garbage that was aligned along cache lines, access the SSL server, then time how long it took to read each value from his own array. If one of his values was slow to read, the server’s RSA implementation had likely evicted it by loading the associated precomputed value into that cache line. For example, if the 2nd element in the array had been evicted, the server probably accessed its precomputed m^3 and thus the corresponding bits of the private exponent were 011b.

This attack could be repeated many times until the entire key was extracted. Hyperthreading provided a good vantage point since the spy process could be accessing its array over and over while running at the same time as the SSL implementation on another core. However, this same attack could work on a single CPU system as well, albeit with more samples required.

To address these attacks for OpenSSL, Matthew Wood (Intel) submitted a patch that first appeared in version 0.9.7h. The key change is to the modular exponentiation function in openssl/crypto/bn/bn_exp.c.

This patch stripes the precomputed m^2…m^x values across cache lines instead of storing them sequentially in the array. That is, the first byte of m^2 would be at address 0, the first byte of m^3 at address 1, etc. A memory dump of this region with w = 3 would look like this:

0: m^2[0], m^3[0], … m^7[0]
64: m^2[1], m^3[1], … m^7[1]
128: m^2[2], m^3[2], … m^7[2]

Thus, the access pattern for reading any precomputed value is exactly the same as for any other: 256 sequential cache line reads (for a 2048-bit n). This is a clever way of removing the timing leak. I think it is commendable that Intel spent the time developing this code and contributing it to OpenSSL, especially since the widespread criticism surrounding this problem was directed primarily at them.
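A byte-level sketch of that striping in Python, with lists standing in for cache lines (the actual patch does this with pointer arithmetic in bn_exp.c; the helper names here are mine):

```python
def stripe(values):
    """Interleave the precomputed values so that byte k of every value
    shares one cache line: line k holds [m^2[k], m^3[k], ..., m^7[k]]."""
    length = len(values[0])  # all values are modulus-sized
    return [bytes(v[k] for v in values) for k in range(length)]

def gather(lines, j):
    """Read value j by taking one byte from every line. The access pattern
    (every line touched, in order) is identical no matter which j is read."""
    return bytes(line[j] for line in lines)

# Toy example: six 8-byte "precomputed values" instead of 256-byte bignums.
values = [bytes([i] * 8) for i in range(2, 8)]
lines = stripe(values)
recovered = gather(lines, 0)
```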

There are two caveats to this approach. The first is that it can’t fix AES cache timing problems (see also this detailed analysis of Bernstein’s work and this proposed cache architecture).

The second may only affect certain hardware. The patch is configurable for machines with a 32- or 64-byte cache line size; the default is 64. If an older machine or a non-x86 CPU with a smaller cache line size runs the default build of OpenSSL, it will still have a small leak whenever the striped row of first bytes spans more than one actual cache line. For example, consider a machine with an 8-byte cache line and a window table of 12 precomputed values, m^0 through m^11. OpenSSL would store m^2…m^9[0] in the first cache line and m^10…m^11[0] in the second, allowing an attacker to determine whether a given window of exponent bits was less than 10. This might be exploitable in certain situations.

To be truly safe against timing attacks, you need to carefully examine the behavior of the underlying hardware. Be especially careful if adapting an existing crypto implementation to a new platform and get an expert to review it.