It has been known for a while that side-channel information about crypto operations (i.e., timing) can give enough information to recover secret keys. The original paper by Kocher even indicated RAM cache behavior as a potential source of timing variance.
In 2005, Osvik and Percival separately published papers on cache timing attacks. The latter resulted in a PR incident for Intel since Hyperthreading, which AMD doesn’t have, was announced as the key vantage point for observing the timing information of crypto running on another core. Since cache-related side channels are exploitable even in single-CPU systems, crypto libraries needed to be updated to fix this issue whether Hyperthreading was present or not.
For RSA, it’s necessary to understand modular exponentiation in order to understand the attack. When encrypting a message using the private exponent d, the server computes md mod n. To do this efficiently with big numbers, a number of techniques are used.
The basic technique, square-and-multiply, takes advantage of the fact that squaring is fast in binary (just a shift left). This method starts by walking through each bit of the exponent. If the bit is 0, the accumulating result (initially set to m) is squared. If it is 1, it is squared and then multiplied by m. In nearly all implementations, it takes longer to square and then multiply than it does to merely square, giving a very distinguishable timing difference. An attacker can figure out which bits are 0 or 1 by this difference.
Since basic square-and-multiply is still too slow, a method called “windowed exponentiation” was invented. Instead of examining each individual bit of the exponent, it walks through groups of bits of size w. There are both fixed and sliding window variants, but I’ll stick to fixed-window here for the sake of simplicity. Square-and-multiply is simply fixed-window exponentiation with w = 1.
For the common size w = 3, these values are precomputed mod n: m2, m3, m4, m5, m6, and m7. Obviously, m0 is 1 and m1 is just m so they are already known.
Just like in square-and-multiply, the accumulator is always first “squared”. However, since we are operating on multiple bits at once, it’s actually raised to 2w or 23 = 8 in our example. Then, the value of the bits in the window are checked. If 000b, nothing more happens. If 001b, the accumulator is multiplied by m. If 010b, it’s multiplied by the precomputed m2 and so forth. This speeds up the processing by handling batches of exponent bits, at the expense of some small precomputation.
Those precomputed values were typically stored in an array in memory, each of size of the modulus n (e.g., 256 bytes each for a 2048-bit n). Since the cache line size on modern x86 CPUs is 64 bytes, the multiply operation would take a little longer if the pre-computed value wasn’t already in the cache.
An attacker could repeatedly cause these values to be evicted from the cache and then based on the computation time, see which ones were used. Or, he could “touch” an array of garbage that was aligned along cache lines, access the SSL server, then time how long it took to read each value from his own array. If a value was quickly accessible, the SSL’s RSA implementation was likely to have used the associated pre-computed value. For example, if the 2nd element in the array was “hot”, then the server probably accessed its precomputed m3 and thus the corresponding bits of the private exponent were 011b.
This attack could be repeated many times until the entire key was extracted. Hyperthreading provided a good vantage point since the spy process could be accessing its array over and over while running at the same time as the SSL implementation on another core. However, this same attack could work on a single CPU system as well, albeit with more samples required.
To address these attacks for OpenSSL, Matthew Wood (Intel) submitted a patch that first appeared in version 0.9.7h. The key change is to the modular exponentiation function in openssl/crypto/bn/bn_exp.c.
This patch stripes the pre-computed m2…x values across cache lines instead of storing them sequentially in the array. That is, the first byte of m2 would be at address 0, the first byte of m3 at 1, etc. A memory dump of this region with w = 3 would look like this:
0: m2, m3, … m7
64: m2, m3, … m7
128: m2, m3, … m7
Thus, the access pattern for reading any pre-computed value is exactly the same as any other: 256 sequential cache line reads. This is a clever way of removing the timing leak. I think it is commendable that Intel spent the time developing this code and contributing it to OpenSSL, especially after the widespread criticism that surrounded this problem was directed primarily at them.
There are two problems regarding this approach. The first is the limitation that it can’t fix problems with AES cache timing (see also this detailed analysis of Bernstein’s work and this proposed cache architecture).
The second may only affect certain hardware. The patch is configurable for machines with a 32 or 64-byte cache line size. The default is set to 64. If an older machine or a non-x86 CPU with a smaller cache line size runs the default compile of OpenSSL, it will still have a small leak if a large window is used (>= actual cache line size). For example, consider a machine with an 8-byte cache line size and a 12-bit window. OpenSSL would store m2…9 in the first cache line and m10…11 in the second, allowing an attacker to determine if a given set of exponent bits was less than 210 or not. This might be exploitable in certain situations.
To be truly safe against timing attacks, you need to carefully examine the behavior of the underlying hardware. Be especially careful if adapting an existing crypto implementation to a new platform and get an expert to review it.