Thought experiment on protocols and noise

I hesitate to call this an interview question because I don’t think on-the-spot puzzle solving equates to a good engineering hire. On the other hand, I try to explore some simple thought experiments with candidates who have a security background.

One of these involves a protocol that has messages authenticated by an HMAC. There’s a message (with appropriate internal format) and a SHA-256 HMAC that covers it. As the implementer, you receive a message that doesn’t verify. In other words, your calculated MAC isn’t the same as the one in the message. What do you do?

“Throw an error” is a common response. But is there something more clever you could do? What if you could tell whether the message had been tampered with or if this was an innocent network error due to noise? How could you do that?

Some come up with the idea of calculating the Hamming distance or some other comparison between the computed and received HMACs. If the two are close (differing in only a few bits), the MAC itself was probably corrupted in transit and the message is likely intact; the avalanche property of a secure hash function means any change to the message would produce a completely different MAC. If they differ in roughly half their bits, the message itself was altered, whether by noise or by an attacker.
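
For concreteness, here is a minimal sketch of that comparison in Python, assuming HMAC-SHA256 as in the scenario above (the function name and thresholds are illustrative, not from any real protocol):

    import hmac, hashlib

    def mac_bit_distance(key: bytes, message: bytes, received_mac: bytes) -> int:
        """Hamming distance in bits between the computed and received MACs."""
        computed = hmac.new(key, message, hashlib.sha256).digest()
        return sum(bin(a ^ b).count("1") for a, b in zip(computed, received_mac))

    # A distance of a few bits suggests line noise flipped bits in the MAC field
    # itself; a distance near 128 (half of 256) means the message or the entire
    # MAC changed -- noise or tampering, and this check alone can't tell which.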

OK, so you can distinguish whether the errors were in the MAC or in the message itself. Is this helpful, and is it secure? Consider:

  • If you return an error, which one do you return? At what point in the processing?
  • Does knowing which type of error occurred help an attacker? Which kind of attacker?
  • If you choose to allow up to 8 flipped bits in the MAC while still accepting the message, is that secure? If so, at what number of bits would it become insecure? Does the position of the bits matter?

There comes a moment when every engineer hits on some “clever” idea like the one above. If she’s had experience attacking crypto, the next thought is one of intense unease. The unschooled engineer has no such qualms, and thus provides full employment for Root Labs.

Bypassing Sonos Updates Check

Sonos has screwed up their update process a few times. What happens is that they publish an XML file showing that a new update is available but forget to publish the update itself on their servers. Since the update check is a mandatory part of the installation process, they effectively lock out new customers from using their software. This most recently happened with the 4.2.1 update for Mac.

There’s a way to bypass this: redirect the site “update.sonos.com” to 127.0.0.1 (localhost) using your system hosts file. When you launch the Sonos Controller software for the first time, it will try to connect to your own computer to get the XML file. Instead, it will fail to connect and then move forward in the setup process. After that, you can re-enable access to the updates server.

The exact entry you want to add is:

127.0.0.1    update.sonos.com

Be sure to remove this line after you’ve gotten your system set up so you can get updates in the future.
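
If you want to confirm the redirect took effect before launching the setup, a quick check from Python (which resolves names through the same hosts file on typical systems) should return the loopback address:

    import socket

    # Expect '127.0.0.1' while the hosts entry above is in place.
    print(socket.gethostbyname("update.sonos.com"))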

20 Years of Internet

This month marks my 20th anniversary of first connecting to the Internet. It seems like a good time to look back on the changes and where we can go from here.

I grew up in a rural area, suspecting but never fully realizing the isolation from the rest of the world, technology or otherwise. Computers and robots of the future lived in the ephemeral world of Sears catalogs and Byte magazines rescued from a dumpster. However, the amateur radio and remote-controlled plane hobbies of my father’s friends brought the world of computing and electronics to our house.

Still, communications were highly local. The VIC-20 could connect to a few BBS systems and to my father’s industrial controls for warehouse refrigeration systems (way before SCADA). However, anything beyond that incurred long-distance charges and thus was irrelevant. Only the strange messages and terminology in cracked games, distributed from faraway places like Sweden, hinted at a much broader world out there.

Towards the end of high school, our local BBS finally got a FidoNet connection. Text files started trickling in about hacking COSMOS to change your “friend’s” phone service and building colored boxes to get free calls. One of those articles described how to use the Internet. I’d spend hours trying to remember all the protocol acronyms, TCP port numbers, etc. The Internet of my imagination was a strange amalgamation of X.25, ARPA protocols, TCP/IP, and the futuristic OSI protocols that were going to replace TCP/IP.

Once I arrived at college, I was one of the first in line to register for an Internet account. Our dorm room had an always-on serial connection to the campus terminal server and Ethernet was coming in a few weeks. It took some encouraging from my friends to make the jump to Ethernet (expensive, and 10BASE-T was barely standardized so it was hard to figure out if a given NIC would even work). Along with free cable TV, you’ve got to wonder, “what were they thinking?”

The dorm Ethernet experiment soon became a glorious free-for-all. There was a lot of Windows 3.1 and Linux, but also a few NeXTSTEP and Sun systems. Campus network admin had its hands full, bungling rushed policy changes intended to stop the flood of warez servers, IPX broadcast storms from Doom games, IRC battles, sniffing, hacking, and even a student running a commercial ISP on the side. Life on the dorm network was like a 24/7 Defcon CTF, but if you failed, you were reinstalling your OS from 25 floppies before you could do your homework.

There were three eras I got to see: Usenet (ending in 1994), early Web (1994-1997), and commercial Web (1998 to present). The Usenet era involved major changes in distributed protocols and operating systems, including the advent of Linux and other free Unixes. The early Web era transitioned to centralized servers with HTTP, with much experimentation in how to standardize access to information (remember image maps? AltaVista vs. Lycos?). The commercial Web finally gave the non-technical world a reason to get online, to buy and sell stuff. It continues to be characterized by experimentation in business models, starting with companies like eBay.

One of my constant annoyances with technological progress is when we don’t benefit from history. Oftentimes, what comes along later is not better than what came before. This leads to gaps in progress, where you spend time recapitulating the past before you can truly move on to the predicted future.

Today, I mourn the abandonment of the end-to-end principle. I don’t mean networking equipment has gotten too smart for its own good (though it has). I mean that we’re neglecting a wealth of intelligence at the endpoints and restricting them to a star-topology, client/server communication model.

Multicast is one example of how things could be different. Much of the Internet data today is video streams or OS updates. Multicast allows a single transmission to be received by multiple listeners, building a dynamic tree of routes so that it traverses a minimal set of networks. Now, add in forward error-correction (allows you to tune in to a rotating transmission at any point in time and reconstruct the data) and distributed hash tables (allows you to look up information without a central directory) and you have something very powerful.
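
As a rough illustration of how little machinery a multicast listener needs, here is a minimal sketch using Python’s standard socket module (the group address and port are arbitrary placeholders):

    import socket
    import struct

    GROUP, PORT = "239.1.2.3", 5007   # placeholder group address and port

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))

    # Ask the kernel to join the group; the network builds the delivery tree.
    mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    while True:
        data, addr = sock.recvfrom(65535)
        print(f"received {len(data)} bytes from {addr}")

The catch, of course, is that this only works if every network between sender and receiver forwards multicast, which is exactly what hasn’t happened.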

BitTorrent is a hack to leverage an oversight in the ISP pricing model. Since upload bandwidth from home broadband was underutilized but paid for, BitTorrent could reduce the load on centralized servers by augmenting them with users’ connections. This was a clever way to improve the existing star topology of HTTP downloads, but it would have been unnecessary if proper distributed systems using multicast were available.

We have had the technology for 20 years but a number of players have kept it from being widely deployed. Rapid growth in backbone bandwidth meant there wasn’t enough pricing pressure to reduce wastefulness. The domination of Windows and its closed TCP/IP stack meant it was difficult to innovate in a meaningful way. (I had invented a TCP NAT traversal protocol in 1999 that employed TCP simultaneous connect, but Windows had a bug that caused such connections to fail so I had to scrap it.) There have been bugs in core router stacks, and so multicast is mostly disabled there.

Firewalls are another symptom of the problem. If you had a standardized way to control endpoint communications, there would be no need for firewalls. You’d simply set policies for the group of computers you controlled and the OS on each would figure out how to apply them. However, closed platforms and a lack of standardization mean that not only do we still have network firewalls, but numerous variants of host-based firewalls as well.

Since the late ’90s, money has driven an intense focus on web-based businesses. In this latest round of tech froth, San Francisco is the epicenter instead of San Jose. Nobody cares what router they’re using, and there’s a race to be the most “meta”. Not only did EC2 mean you don’t touch the servers, but now Heroku means you don’t touch the software. But as you build higher, the architectures get narrower. There is no HTTP multicast, and the same-origin policy means you can’t even implement BitTorrent in browser JavaScript.

It seems like decentralized protocols only appear in the presence of external pressure. Financial pressure doesn’t seem to be enough so far, but legal pressure led to Tor, magnet links, etc. Apple has done the most of anyone commercially in building distributed systems into their products (Bonjour service discovery, AirDrop direct file sharing), but these capabilities are not employed by many applications. Instead, we get simulated distributed systems like Dropbox, which are still based on the star topology.

I hope that the prevailing trend changes, and that we see more innovations in smart endpoints, chatting with each other in a diversity of decentralized, standardized, and secure protocols. Make this kind of software stack available on every popular platform, and we could see much more innovation in the next 20 years.

RSA repeats earlier claims, but louder

Sam Curry of RSA was nice enough to respond to my post. Here are a few points that jumped out at me from what he wrote:

  • RSA is in the process of fixing the downgrade attack that allows an attacker to choose PKCS #1 v1.5, even if the key was generated by a user who selected v2.0.
  • They think they also addressed the general attack via their RAC 3.5.4 middleware update. More info is needed on what that fix actually is. I haven’t seen the words “firmware update” or “product recall” in any of their responses, so there’s no evidence they decided to fix the flaw in the token itself.
  • We shouldn’t call it “SecurID” even though the product name is “RSA SecurID 800”. Or to put it another way, “When we want brand recognition, call it ‘SecurID’. When it’s flawed, call it ‘PKCS #1 v1.5’.”

However, his main point is that, since this is a privilege escalation attack, any gain RSA has given the attacker is not worth mentioning. In his words:

“Any situation where the attacker has access to your smartcard device and has your PIN, essentially compromises your security. RSA maintains that if an attacker already has this level of access, the additional risk of the Bleichenbacher attack does not substantially change the already totally compromised environment.”

Note the careful use of “substantially change” and “totally compromised environment”. They go further on this tack, recommending the following mitigation approaches:

  • (Tokens) should not be left parked in the USB port any longer than necessary.
  • The owner needs to maintain control of their PIN.
  • The system the device is being used on should be running anti-malware.

Their security best practices boil down to recommending that users limit access to the token while it is unlocked and able to perform crypto operations, whether for the user or an attacker. This is good general advice, but it is not directly relevant to this attack for two reasons:

  1. The attack allows recovery of keys protected by the token, and then no further access to it is required
  2. It takes only a short amount of time and can be performed in stages

First, the attack allows key recovery (but not of the private key, as RSA points out over and over). There are three levels of potential compromise of a token like this one:

  1. Temporary online access: attacker can decrypt messages by sending them to the token until it’s disconnected
  2. Exposure of wrapped keys: attacker can decrypt past or future messages offline, until the wrapped keys are changed
  3. Exposure of the master private key: attacker can recover future wrapped keys until the private key is changed

RSA is claiming there’s no important difference between #1 and #2. But the whole point of a physical token is to drive a wedge between these exact cases. Otherwise, you could store your keys on your hard drive and get the same effect — compromise of your computer leads to offline ability to decrypt messages. To RSA, that difference isn’t a “substantial change”.

By screwing up the implementation of their namesake algorithm, RSA turned temporary access to a token into full access to any wrapped keys protected by it. But sure, the private key itself (case #3) is still safe.

Second, they continue to insist that end-user behavior can be important to mitigating this attack. The research paper shows that it takes only a few thousand automated queries (i.e., minutes of access) to recover a wrapped key. Even if you’re lightning fast in unplugging your token, the attack can be performed in stages. There’s no need for continuous access to the token.

After the wrapped keys are recovered, they can be used for offline decryption with no further access to the token, until the keys are changed.

The conclusion is really simple: the RSA SecurID 800 token fails to protect its secrets. An attacker with software-only access (even remote) to the token can recover its wrapped keys in only a few minutes each. A token whose security depends on how fast you unplug it isn’t much of a token.

Why RSA is misleading about SecurID vulnerability

There’s an extensive rebuttal RSA wrote in response to a paper showing that their SecurID 800 token has a crypto vulnerability. It’s interesting how RSA’s response walks around the research without directly addressing it. A perfectly accurate (but inflammatory) headline could also have been “RSA’s RSA Implementation Contained Security Flaw Known Since 1998”.

The research is great and easy to summarize:

  • We optimized Bleichenbacher’s PKCS #1 v1.5 attack by about 5-10x
  • There are a number of different oracles that give varying attacker advantage
  • Here are a bunch of tokens vulnerable to this improvement of the 1998 attack

Additional interesting points from the paper:

  • Aladdin eTokenPro is vulnerable to a simple Vaudenay CBC padding attack as well. Even worse!
  • RSA implemented the worst oracle of the set the authors enumerate, giving the most attacker advantage.
  • If you use PKCS #1 v2.0, you should be safe against the Bleichenbacher attack. Unless you use RSA’s implementation, which always sets a flag in generated keys that allows selecting v1.5 and performing a slight variant of this attack.

The real conclusion is that none of the manufacturers seemed to take implementation robustness seriously. Even the two implementations that were safe from these attacks were only safe because implementation flaws caused them to not provide useful information back to the attacker.
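
To make “oracle” concrete, here is a minimal sketch in Python of a leaky PKCS #1 v1.5 unpadding routine (hypothetical code, not taken from the paper or any vendor). Returning a distinct error for each failure mode is exactly the kind of signal Bleichenbacher’s attack feeds on:

    def unpad_pkcs1_v15(em: bytes, key_len: int) -> bytes:
        if len(em) != key_len or em[0] != 0x00:
            raise ValueError("bad leading byte")        # oracle signal 1
        if em[1] != 0x02:
            raise ValueError("bad block type")          # oracle signal 2
        sep = em.find(0x00, 2)
        if sep == -1:
            raise ValueError("no padding terminator")   # oracle signal 3
        if sep < 10:                                    # at least 8 padding bytes
            raise ValueError("padding too short")       # oracle signal 4
        return em[sep + 1:]

    # A robust implementation collapses every failure into one indistinguishable
    # error (ideally in constant time), so the attacker learns nothing extra.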

The first counterclaim RSA makes is that this research does not compromise the private key stored on the token. This is true. However, it allows an attacker to decrypt and recover other “wrapped” keys encrypted by the token’s key pair. This is like saying an attacker is running a process with root access but doesn’t know the root password. She can effectively do all the same things as if she did have the password, at least until the process is killed.

RSA is ignoring the point that even a legitimate user should not be able to recover these encrypted “wrapped” keys. They can only cause the token to unwrap and use them on the operator’s behalf, not recover the keys themselves. So this attack definitely qualifies as privilege escalation, even if performed by the authorized user herself.

The second claim is that this attack requires local access and a PIN. This is also correct, although it depends on some assumptions. PKCS #11 is an API, so RSA really has no firm knowledge how all their customers are using it. Some applications may proxy access to the token via a web frontend or other network access. An application may cache the PIN. As with other arguments that privilege escalation attacks don’t matter, it assumes a lot about the customer and attacker profile that RSA has no way of knowing.

The final claim is that OAEP (PKCS #1 v2.0) is not subject to this vulnerability. This is true. But this doesn’t address the issue raised in the paper where RSA’s implementation sets flags in the key to allow the user to choose v2.0 or v1.5. Hopefully, they’ll be fixing this despite not mentioning it here.

RSA has taken a lot of heat due to the previous disclosure of all the SecurID seeds, so perhaps the press has focused on them unfairly. After all, the research paper shows that many other major vendors had the same problem. My conclusion is that we have a long way to go in getting robust crypto implementations in this token market.

SSL optimization and security talk

I gave a talk at Cal Poly on recently proposed changes to SSL. I covered False Start and Snap Start, both designed by Google engineer Adam Langley. Snap Start has been withdrawn, but there are some interesting design tradeoffs in these proposals that merit attention.

False Start provides a minor improvement over stock SSL, which takes two round trips in the initial handshake before application data can be sent. It saves one round trip on the initial handshake, at the cost of sending data before checking whether someone has modified the server’s handshake messages. It doesn’t provide any benefit on subsequent connections, since stock SSL session resumption also takes only one round trip.

The False Start designers were aware of this risk, so they suggested the client whitelist ciphersuites for use with False Start. The assumption is that an attacker could get the client to provide ciphertext but wouldn’t be able to decrypt it if the encryption was secure. This is true most of the time, but is not sufficient.

The BEAST attack is a good example where ciphersuite whitelists are not enough. If a client used False Start as described in the standard, it couldn’t detect an attacker spoofing the server version in a downgrade attack. Thus, even if both the client and server supported TLS 1.1, which is secure against BEAST, False Start would have made the client insecure. Stock SSL would detect the version downgrade attack before sending any data and thus be safe.

The False Start standard (or at least implementations) could be modified to only allow False Start if the TLS version is 1.1 or higher. But this wouldn’t prevent downgrade attacks against TLS 1.1 or newer versions. You can’t both be proactively secure against the next protocol attack and use False Start. This may be a reasonable tradeoff, but it does make me a bit uncomfortable.
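
In rough terms, the client-side policy looks something like the sketch below (the function name and ciphersuite whitelist are illustrative, not from the draft or any TLS library). The problem is that both inputs come from a ServerHello that has not been authenticated yet when the False Start data goes out:

    # Illustrative False Start gating logic, not an actual TLS stack API.
    SAFE_CIPHERSUITES = {"TLS_RSA_WITH_AES_128_CBC_SHA"}   # placeholder whitelist

    def may_false_start(negotiated_version: tuple, ciphersuite: str) -> bool:
        # (3, 2) is TLS 1.1. Both checks run before the server's Finished message
        # is verified, so an attacker can still downgrade within the allowed set
        # (say, 1.2 to 1.1) and receive data before the tampering is detected.
        return negotiated_version >= (3, 2) and ciphersuite in SAFE_CIPHERSUITES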

Snap Start removes both round trips for subsequent connections to the same server. This is one better than stock SSL session resumption. Additionally, it allows rekeying whereas session resumption uses the same shared key. The security cost is that Snap Start removes the server’s random contribution.

SSL is designed to fail safe. For example, neither party solely determines the nonce. Instead, the nonce is derived from both client and server randomness. This way, poor PRNG seeding by one of the participants doesn’t affect the final output.
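
A toy sketch of that fail-safe property (hashing the two values together here is just an illustration, not the actual TLS construction):

    import hashlib, os

    def session_nonce(client_random: bytes, server_random: bytes) -> bytes:
        # Neither side alone determines the result, so one weak PRNG isn't fatal.
        return hashlib.sha256(client_random + server_random).digest()

    nonce = session_nonce(os.urandom(32), b"\x00" * 32)  # a totally broken server PRNG
    # The combined value is still unpredictable to an outsider because the
    # client's half contributed real entropy.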

Snap Start lets the client determine the entire nonce, and the server is expected to check it against a cache to prevent replay. There are measures to limit the size of the cache, but a cache can’t tell you how good the entropy is. Therefore, the nonce may be unique but still predictable. Is this a problem? Probably not, but I haven’t analyzed how a predictable nonce affects all the various operating modes of SSL (e.g., ECDH, client cert auth, SRP auth, etc.)

The key insight behind both of these proposed changes to SSL is that latency is an important barrier to SSL adoption, even with session resumption built in from the beginning. Also, Google is willing to shift the responsibility for SSL security toward the client in order to save on latency. This makes sense when you own a client and your security deployment model is to ship frequent client updates. It’s less clear that this tradeoff is worth it for SSL applications besides HTTP or for other security models.

I appreciate the work people like Adam have been doing to improve SSL performance and security. Obviously, unprotected HTTP is worse than some reduction in SSL security. However, these kinds of protocol changes need careful study across SSL’s many users before their full impact is known. I remain cautious about adopting them.