History of memory corruption vulnerabilities and exploits

I came across a great paper, “Memory Errors: The Past, the Present, and the Future” by van der Veen et al. The authors cover the history of memory corruption errors as well as exploitation and countermeasures. I think there are a number of interesting conclusions to draw from it.

It seems that the number of flaws in common software is still much too high. Consider what’s required to compromise today’s most hardened consumer platforms, iOS and Chrome. You need a useful, remotely accessible flaw in the default install, a memory disclosure bug, a sandbox bypass (or several), and often a kernel or other privilege escalation flaw.

Given a sufficiently small trusted computing base, it should be impossible to find this confluence of flaws. We clearly have too large a TCB today since this combination of flaws has been found not once, but multiple times in these hardened products. Other products that haven’t been hardened require even fewer flaws to compromise, making them more vulnerable even if they have the same rate of bug occurrence.

The paper’s conclusion shows that if you want to prevent exploitation, your priority should be preventing stack, heap, and integer overflows (in that order). Stack overflows are by far still the most commonly exploited class of memory corruption flaws, out of proportion to their prevalence.

We’re clearly not smart enough as a species to stop creating software bugs. It takes a Dan Bernstein to reason accurately about software in bite-sized chunks, as he did with qmail. It’s important to face this fact and make fundamental changes to process and architecture that will make the next 18 years better than the last.

Has HTML5 made us more secure?

Brad Hill recently wrote an article claiming that HTML5 has made us more secure, not less. His essential claim is that over the last 10 years, browsers have become more secure. He compares IE6, ActiveX, and Flash in 2002 (when he started in infosec) with HTML5 in order to make this point. While I think his analysis is true for general consumers, it doesn’t apply to more valuable targets, who are indeed less secure with the spread of HTML5.

HTML5 is a broad grouping of features, and there are two parts that I think are important to increasing vulnerability. First, there is the growing flexibility in parsing elements for JavaScript, CSS, SVG, etc., including the interpretation of relationships between them. Second, there’s the exposure of complex decoders for images, video, audio, storage, 3D graphics, etc. to untrusted sources.

If you look at the vulnerability history for these two groups, both are common culprits in flaws that lead to untrusted code execution. They still regularly produce “game over” vulnerabilities in both Firefox and Chrome. Displaying a PNG was exploitable as recently as 2012. Selecting a font via CSS was exploitable in 2010. In many cases, these types of bugs are interrelated: a flaw in a codec could require heap grooming via JavaScript to be reliably exploitable. By standardizing more parsing and more complex decoders, HTML5 gives remote, untrusted content access to the components that are still the biggest source of code execution vulnerabilities in the browser, despite attempts to audit and harden them.

Additionally, it exposes elements that have not had this kind of attention. WebGL hands over access to your 3D graphics stack, something which even CERT thinks is worth disabling. If you want to know the future of exploitation, you need to keep an eye on the console and iPhone/Android hacking groups. 3D shaders were the first software exploit of the Xbox 360, a platform that is much more secure than any browser. And Windows GDI was remotely exploitable in 2009. Firefox WebGL is built on top of Mesa, which is software from the bad old days of 1993. How is it going to do any better than Microsoft’s most secure platform?

As an aside, a rather poor PR battle about WebGL is worth addressing here. A 2011 article by a group called Context raised some of these same issues, but their exploit was only a DoS. Mozilla devs jumped on this right away, and their solution was a whitelist and blacklist for graphics drivers. A blacklist is great for everyone after a 0-day has been discovered, fixed, and deployed, but not so good before then.

Call me a luddite, but I measure security by what I can easily disable or route around and ignore. Flash is easily blocked and can be uninstalled. JavaScript can be disabled with a browser setting or filtered. But HTML5? Well, that’s knit into pretty much every area of the browser. You want to disable WebGL? No checkbox, but at least there’s about:config. Just make sure no one set “webgl.force-enabled” or whatever the next software update adds to your settings. Want to disable parts of CSS but not page layout? Want a no-codec browser? Get out the compiler.
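
To be concrete about the WebGL case, the relevant about:config preferences (to the best of my knowledge, for current Firefox releases; names may change in a future update) are:

    webgl.disabled = true            (refuse to run WebGL content at all)
    webgl.force-enabled = false      (do not override the driver blacklist)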

Browser vendors don’t care about the individual target getting compromised; they care about the masses. The cost/benefit tradeoffs for these two groups are completely opposite. Otherwise, we’d see vendors competing to remove as many features as possible and produce the qmail of browsers.

Security happens in waves. If you’re an ordinary user, the work of Microsoft and Google in particular has paid off for you over the past 10 years. But woe to you if you manage high-value targets. The game of whack-a-mole with the browser vendors has been getting worse, not better. The more confident they get from their bug bounties and hardening, the more likely they are to add complex, deeply intertwined features. And so the pendulum begins swinging back the other way for everyone.

Toggl time-tracking service failures

A while ago, we investigated using various time-tracking services. Making this quick and easy for employees is helpful in a consulting company. Our experience with one service should serve as a cautionary note for web 2.0 companies that want to sell to businesses.

Time tracking is a service that seems both boring and easy to implement but is quite critical to most companies. Everyone from freelance writers to international manufacturers needs some system, ranging from paper and spreadsheets to sophisticated ERP systems. For a consulting company, lost time entries mean lost money.

As we added employees last year, we began evaluating various web-based time tracking services in comparison to our home-grown system (flat text files).

We considered using Toggl. Our experience with it started out fine but got worse as the company grew. We found that test entries would occasionally disappear. The desktop widget was really an HTML app wrapped in a full browser (they started with Qt and moved to Chrome).

Then one day, we found that their servers had screwed up access control and everyone was being logged into a single account. You can see the messages in the attached screenshots. This was at once hilarious and horrifying, and we terminated our account immediately. No customer or sensitive data was lost, but there was no way we were ever going to use this service. There were various complaints about all this on the Toggl forums, but those posts were deleted when they updated their support website.

Dealing with distributed systems and the CAP theorem is particularly hard. Startups ignore this at their own (or their customers’) peril. Time tracking seems simple enough, but you can never afford to lose data. For customers that don’t use code names for clients/projects, a data compromise could be devastating.

I debated whether to write this post since I don’t have a grudge against Toggl. However, I am hoping our experience will be informative as to what can happen when you outsource ownership of data that is directly tied to your revenue.

Cyber-weapon authors catch up on blog reading

One of the more popular posts on this blog was the one pointing out how Stuxnet was unsophisticated. Its use of traditional malware methods and lack of protection for the payload indicated that the authors were either “Team B” or in a big hurry. The post was intended to counteract the breathless praise in the press for the advent of sophisticated “cyber-weapons”.

This year, the New York Times published more information that supports both theories. The authors may not have had a lot of time due to political pressure and concern about Iran’s progress. The uneasy partnership between the US and Israel may have led to both parties keeping their best tricks in their back pockets.

A lot of people seemed skeptical about the software protection method I described called “secure triggers”. (I had written about this before also, calling it “hash-and-decrypt”.) The general idea is to gather information about the environment in order to generate a cryptographic key, which is used to decrypt the payload. If even one bit of info is incorrect, the payload can’t be decrypted. The analyst has to brute-force the proper environment, which can be made infeasible if there’s enough entropy and/or the validation method is too slow.
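
To make the idea concrete, here is a minimal hash-and-decrypt sketch in Python. The environment attributes, the fixed salt, and the toy SHA-256 counter-mode keystream are illustrative stand-ins, not anyone’s actual implementation; a real payload would use a proper authenticated cipher and whatever attributes uniquely identify the target.

    import hashlib
    import socket
    import sys

    def environment_fingerprint():
        # Gather attributes of the host. A real trigger would use whatever
        # uniquely identifies the intended target (usernames, installed
        # software, locale, domain membership, etc.).
        parts = [socket.gethostname(), sys.platform]
        return "|".join(parts).encode()

    def derive_key(fingerprint, rounds=200_000):
        # Slow, iterated hashing raises the cost of brute-forcing candidate
        # environments.
        return hashlib.pbkdf2_hmac("sha256", fingerprint, b"trigger-salt", rounds)

    def keystream(key, length):
        # Toy keystream: SHA-256 in counter mode, standing in for a real cipher.
        out = b""
        counter = 0
        while len(out) < length:
            out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return out[:length]

    def try_decrypt(blob, expected_digest):
        key = derive_key(environment_fingerprint())
        payload = bytes(a ^ b for a, b in zip(blob, keystream(key, len(blob))))
        # Only the correct environment yields a payload with the right digest;
        # on any other host the blob stays opaque.
        if hashlib.sha256(payload).digest() != expected_digest:
            return None
        return payload

An analyst who only has the blob and the expected digest is left guessing which combination of environment values produces the right key, which is exactly the brute-force problem described above; the iterated hashing makes each guess slower.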

The critics claimed that secure triggers were too complicated or unable to withstand malware analyst scrutiny. However, this approach had been used successfully in everything from Core Impact to Blu-ray to Team Twiizers exploits, so it was feasible. Either the malware developers were not aware of this technique or there were other constraints, such as time, preventing it from being used.

Now we’ve got Gauss, which uses (surprise!) this exact technique. And it turns out to be somewhat effective at preventing Kaspersky from analyzing the payload. We either predicted or caused the future; take your pick.

Is this the endgame? Not even close, but it does mean we’re ready for the next stage.

The malware industry has had a stable environment for a while. Targeted attacks were rare, and most malware authors didn’t spend much effort building custom protection for their payloads. Honeypots and local analysis methods assume that the code and its behavior stay the same in the malware analyst’s environment as on the intended target.

In the next stage, proper use of mechanisms like secure triggers will divide malware analysis into two phases: infection and payload. The infection stage can be analyzed with traditional techniques in order to find the security flaws exploited, propagation method, etc. The payload stage will change drastically, with more effort being spent on in situ analysis.

When the payload only decrypts and runs on a single target system, the malware analyst will need direct access to the compromised host. There are several forms this might take. The obvious one is providing a remote shell to the analyst to log in, attach a debugger, try to get a memory dump of the process, etc. This is dangerous because it involves giving an outsider access to a trusted system, and one that might be critical to other operations. Even if a whole-system memory dump is generated, say by physical access or a cold-boot attack, there is still going to be a lot of sensitive information there.

Another approach is emulation. The analyst uses a VM that turns all local syscalls into remote ones. This is connected to the compromised target host (or a clone of it), which runs a daemon to answer the API queries. The malware sample or relevant portions of it (such as the hash-and-decrypt routine) are run in the analyst’s VM, but the information the routine gathers comes directly from the compromised host. This allows the analyst to gather the relevant information while not having full access to the compromised machine.
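
Here is a rough sketch of that plumbing, assuming a trivial JSON-over-TCP protocol and a small whitelist of queries; real tooling would hook the relevant syscalls or library calls inside the emulator rather than call a helper function directly.

    import json
    import os
    import socket
    import socketserver

    # Daemon running on the compromised host: answers environment queries.
    QUERIES = {
        "hostname": socket.gethostname,
        "path": lambda: os.environ.get("PATH", ""),
        # ... whatever other attributes the trigger routine reads
    }

    class QueryHandler(socketserver.StreamRequestHandler):
        def handle(self):
            request = json.loads(self.rfile.readline())
            value = QUERIES[request["name"]]()
            self.wfile.write((json.dumps({"value": value}) + "\n").encode())

    def run_daemon(port=9000):
        with socketserver.TCPServer(("0.0.0.0", port), QueryHandler) as server:
            server.serve_forever()

    # Stub used inside the analyst's VM: environment lookups made by the
    # sample are redirected here instead of reading the local (wrong) machine.
    def remote_query(target_host, name, port=9000):
        with socket.create_connection((target_host, port)) as conn:
            conn.sendall((json.dumps({"name": name}) + "\n").encode())
            return json.loads(conn.makefile().readline())["value"]

The emulator patches the sample’s hostname, registry, or file lookups to call remote_query() against the compromised host (or its clone), so the hash-and-decrypt routine sees the real target’s values while the analyst never needs an interactive shell on it.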

In the next phase after this, malware authors add more anti-emulation checks to their payload decryption routine. They try to prevent this routine from being run in isolation, in an emulator. Eventually, you end up in a cat-and-mouse game of Core Wars on the live hardware. Malware keeps a closely-synchronized global heartbeat so that any attempt to dump and restart it on a single host corrupts its state irrecoverably. The payload, its triggers, and encryption keys evolve in coordination with the other hosts on the network and are tied closely to each machine’s identity.

Is this where we’re headed? I’m not sure, but I do know that software protection measures are becoming more, not less, relevant.

RSA repeats earlier claims, but louder

Sam Curry of RSA was nice enough to respond to my post. Here are a few points that jumped out at me from what he wrote:

  • RSA is in the process of fixing the downgrade attack that allows an attacker to choose PKCS #1 v1.5, even if the key was generated by a user who selected v2.0.
  • They think they also addressed the general attack via their RAC 3.5.4 middleware update. More info is needed on what that fix actually is. I haven’t seen the words “firmware update” or “product recall” in any of their responses, so there’s no evidence that they decided to fix the flaw in the token itself.
  • We shouldn’t call it “SecurID” even though the product name is “RSA SecurID 800”. Or to put it another way, “When we want brand recognition, call it ‘SecurID’. When it’s flawed, call it ‘PKCS #1 v1.5.'”

However, his main point is that, since this is a privilege escalation attack, any gain RSA has given the attacker is not worth mentioning. In his words:

“Any situation where the attacker has access to your smartcard device and has your PIN, essentially compromises your security. RSA maintains that if an attacker already has this level of access, the additional risk of the Bleichenbacher attack does not substantially change the already totally compromised environment.”

Note the careful use of “substantially change” and “totally compromised environment”. They go further on this tack, recommending the following mitigation approaches.

  • (Tokens) should not be left parked in the USB port any longer than necessary
  • The owner needs to maintain control of their PIN
  • The system which the device is being used on should be running anti-malware.

Their security best practices boil down to recommending that users limit access to the token while it is able to perform crypto operations, whether for the user or an attacker. This is good general advice, but it is not directly relevant to this attack for two reasons:

  1. The attack allows recovery of keys protected by the token, and then no further access to it is required
  2. It takes only a short amount of time and can be performed in stages

First, the attack allows key recovery (but not of the private key, as RSA points out over and over). There are three levels of potential compromise of a token like this one:

  1. Temporary online access: attacker can decrypt messages by sending them to the token until it’s disconnected
  2. Exposure of wrapped keys: attacker can decrypt past or future messages offline, until the wrapped keys are changed
  3. Exposure of the master private key: attacker can recover future wrapped keys until the private key is changed

RSA is claiming there’s no important difference between #1 and #2. But the whole point of a physical token is to drive a wedge between these exact cases. Otherwise, you could store your keys on your hard drive and get the same effect — compromise of your computer leads to offline ability to decrypt messages. To RSA, that difference isn’t a “substantial change”.
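
To see why that wedge matters, here is a sketch of the first two cases using the pyca/cryptography package (an illustration of the model, not the token’s actual implementation; the Fernet key stands in for a wrapped session key):

    from cryptography.fernet import Fernet
    from cryptography.hazmat.primitives.asymmetric import rsa, padding

    # The token's master key pair; the private key never leaves real hardware.
    token_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    token_public = token_private.public_key()

    # A symmetric data key, stored on the host only in wrapped (encrypted) form.
    data_key = Fernet.generate_key()
    wrapped_key = token_public.encrypt(data_key, padding.PKCS1v15())

    message = Fernet(data_key).encrypt(b"quarterly numbers")

    # Case #1: temporary online access. The attacker must ask the token to
    # unwrap the key for each session; unplugging the token ends the party.
    unwrapped = token_private.decrypt(wrapped_key, padding.PKCS1v15())
    print(Fernet(unwrapped).decrypt(message))

    # Case #2: the Bleichenbacher attack recovers data_key itself from
    # wrapped_key via a few thousand padding-oracle queries (not shown).
    # From then on, the attacker decrypts offline with no token in sight.
    print(Fernet(data_key).decrypt(message))

Case #1 requires the token to stay reachable; case #2 does not, which is precisely the difference RSA is waving away.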

By screwing up the implementation of their namesake algorithm, RSA turned temporary access to a token into full access to any wrapped keys protected by it. But sure, the private key itself (case #3) is still safe.

Second, they continue to insist that end-user behavior can be important to mitigating this attack. The research paper shows that it takes only a few thousand automated queries (i.e., a matter of minutes) to recover a wrapped key. Even if you’re lightning fast at unplugging your token, the attack can be performed in stages. There’s no need for continuous access to the token.

After the wrapped keys are recovered, they can be used for offline decryption until they are changed; no further access to the token is needed.

The conclusion is really simple: the RSA SecurID 800 token fails to protect its secrets. An attacker with software-only access (even remote) to the token can recover its wrapped keys in only a few minutes each. A token whose security depends on how fast you unplug it isn’t much of a token.

Why RSA is misleading about SecurID vulnerability

There’s an extensive rebuttal RSA wrote in response to a paper showing that their SecurID 800 token has a crypto vulnerability. It’s interesting how RSA’s response walks around the research without directly addressing it. A perfectly accurate (but inflammatory) headline could also have been “RSA’s RSA Implementation Contained Security Flaw Known Since 1998”.

The research is great and easy to summarize:

  • We optimized Bleichenbacher’s PKCS #1 v1.5 attack by about 5-10x
  • There are a number of different oracles that give varying attacker advantage
  • Here are a bunch of tokens vulnerable to this improvement of the 1998 attack

Additional interesting points from the paper:

  • Aladdin eTokenPro is vulnerable to a simple Vaudenay CBC padding attack as well. Even worse!
  • RSA implemented the worst oracle of the set the authors enumerate, giving the most attacker advantage.
  • If you use PKCS #1 v2.0, you should be safe against the Bleichenbacher attack. Unless you use RSA’s implementation, which always sets a flag in generated keys that allows selecting v1.5 and performing a slight variant of this attack.

The real conclusion is that none of the manufacturers seemed to take implementation robustness seriously. Even the two implementations that were safe from these attacks were only safe because implementation flaws caused them to not provide useful information back to the attacker.
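
To illustrate what “useful information back to the attacker” means, here is a schematic sketch of the padding check behind a Bleichenbacher oracle. PKCS #1 v1.5 encryption padding is 0x00 0x02, at least eight nonzero padding bytes, a 0x00 delimiter, then the message; the lax variant below, which only checks the header bytes, is the kind of worst-case oracle the paper describes, since it declares far more attacker-chosen ciphertexts valid and therefore leaks more per query. This is schematic, not any vendor’s actual code.

    def strict_unpad(em: bytes):
        # Full PKCS #1 v1.5 check: 0x00 0x02, at least 8 nonzero padding
        # bytes, a 0x00 delimiter, then the message.
        if len(em) < 11 or em[0] != 0x00 or em[1] != 0x02:
            return None
        try:
            delim = em.index(0x00, 2)   # first zero after the 0x00 0x02 header
        except ValueError:
            return None
        if delim < 10:                  # fewer than 8 padding bytes
            return None
        return em[delim + 1:]

    def lax_unpad(em: bytes):
        # Worst-case oracle: only the header is checked, so far more
        # attacker-crafted ciphertexts come back as "valid padding".
        if len(em) < 3 or em[0] != 0x00 or em[1] != 0x02:
            return None
        return em[2:]

An attacker submits modified ciphertexts and watches which ones the device accepts as correctly padded; Bleichenbacher showed in 1998 how those accept/reject answers progressively narrow down the plaintext, and a laxer check means fewer queries per recovered key.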

The first counterclaim RSA makes is that this research does not compromise the private key stored on the token. This is true. However, it allows an attacker to decrypt and recover other “wrapped” keys encrypted by the token’s key pair. This is like saying an attacker is running a process with root access but doesn’t know the root password. She can effectively do all the same things as if she did have the password, at least until the process is killed.

RSA is ignoring the point that even a legitimate user should not be able to recover these encrypted “wrapped” keys. The user can only cause the token to unwrap and use them on her behalf, not extract the keys themselves. So this attack definitely qualifies as privilege escalation, even when performed by the authorized user herself.

The second claim is that this attack requires local access and a PIN. This is also correct, although it depends on some assumptions. PKCS #11 is an API, so RSA really has no firm knowledge of how all their customers are using it. Some applications may proxy access to the token via a web frontend or other network access. An application may cache the PIN. As with other arguments that privilege escalation attacks don’t matter, this one assumes a lot about customer and attacker profiles that RSA has no way of knowing.

The final claim is that OAEP (PKCS #1 v2.0) is not subject to this vulnerability. This is true. But this doesn’t address the issue raised in the paper where RSA’s implementation sets flags in the key to allow the user to choose v2.0 or v1.5. Hopefully, they’ll be fixing this despite not mentioning it here.

RSA has taken a lot of heat due to the previous disclosure of all the SecurID seeds, so perhaps the press has focused on them unfairly. After all, the research paper shows that many other major vendors had the same problem. My conclusion is that we have a long way to go in getting robust crypto implementations in this token market.