A few years ago, Aciicmez et al published a new timing attack (1, 2, 3) exploiting the branch prediction feature of x86 processors. They explained how to extract the RSA private key in one process by monitoring the behavior of the branch prediction unit in a concurrent spy process. This is an extension of the previous work on improving cache timing attacks (see also Percival and Bernstein).
These kinds of attacks all target microarchitectural features of the processor where a shared resource (I or D-cache, branch prediction buffer) is maintained by the processor across context switches. Any resource that is instruction or data-dependent and can be measured outside the security boundary can potentially be used for side-channel attacks.
Most developers don’t get the security boundary right. Usually, they will consider the whole machine as the boundary. But of course, even this simple concept can be more complicated in practice (say, host vs. guest VM environments). Others consider the OS process boundary as their security boundary. But consider a web server process with built-in SSL. After the SSL decryption, any changes in the performance of the general webserver code due to the cache state of the SSL decryption may reveal information about the private key. Thus, it is important to first draw the security boundary as narrowly as possible.
In a branch prediction attack, the attacker typically runs a spy process concurrently with the cipher. It repeatedly does the following:
- Execute a set of branches
- Measure the overall execution time of all its branches
Since an RSA decryption happens over a relatively long period of time, the cache and BTB state can be measured multiple times by the spy process during a single decryption. Public key algorithms take longer because they are doing a multi-word computation, and there are often conditional branches based on the private key. Also, the spy process can do various things to try to force the crypto process to context-switch more often.
Last year, I wrote about the patch Intel submitted to mitigate cache timing attacks. The OpenSSL patch they submitted for dealing with branch prediction attacks (“Smooth CRT-RSA”) is also interesting. They created new big number functions BN_mod_inverse_no_branch() and BN_div_no_branch(), which remove any conditional branches based on information related to the private key. See section 5.3 of their paper for details or just read the comments in the diff above.
The key insight behind these functions is that the extra reduction step can be eliminated if the domain for the value returned by Montgomery-Multiply can be extended. The modulus n in RSA is composed of two primes, p and q. The MMUL algorithm returns a value from [p, 2p). This means the naive implementation performs an optional extra reduction to fit the value into the range [0, p). The patch avoids this reduction by adding an extra word to the parameters passed to MMUL, effectively zero-padding them. Additionally, the Montgomery parameters are increased by n/2 to account for this modified range.
The lesson from all this is that it’s very hard to eliminate side channels, and you have to be careful when defining the security boundary. Even if you used this branchless CRT implementation, you might still be vulnerable to timing attacks if you used a naive algorithm for loading the private key from a file and converting it to binary format in memory (e.g., for loop with atoi()).