I recently saw a blog post that discussed using RC4 as an ad-hoc hash in order to show why CBC mode is better than ECB. While the author’s example is merely an attempt to create a graphic, it reminded me to explain why a stream cipher shouldn’t be used as as a cryptographic hash.
A stream cipher like RC4 only has one input (the key) and one output, a variable-length keystream. During initialization, the key is expanded and stored in an internal buffer. When the user wants to encrypt or decrypt (both are the same operation), the buffer is updated in some way and keystream bits are output. It’s up to the caller to take that keystream data and XOR it with the plaintext to get the ciphertext (or vice versa). Very simple, right? You just initialize the stream cipher’s state with a key and then turn the crank whenever you want keystream bits.
A cryptographic hash algorithm like SHA-1 also has one input (the data) and one output, the digest. A variable-length stream of input data is crunched in blocks, giving a final output digest that should be difficult to invert, among other properties.
At first glance, it seems that a stream cipher can be used as a cryptographic hash by setting the data to hash as the key, turning the crank, and using some of the keystream as the digest. The reasoning goes, “since it should be difficult to recover the original stream cipher key merely by seeing some of the keystream, the output is usable as a hash”. While this may sound reasonable, it is often wrong, leading to various security problems.
There are numerous, vital design distinctions between stream ciphers and hashes. First, a stream cipher is designed to output an extremely long keystream sequence while a hash digest is a relatively small, fixed-length output. There are design differences that arise from expanding a key vs. compressing input. Also, resistance against a chosen input attack is a requirement for a cryptographic hash, while it may not have been considered for a stream cipher. What could an attacker gain if they can choose the input keys? By definition, they already know the secret key in this case.
The RC4 weakness that led to WEP being broken was a related-key attack. Even though an attacker could not choose WEP keys, the RC4 key was the concatenation of a counter and the secret key. Thus, subsequent outputs of the keystream are derived from closely related input keys.
But to use RC4 for hashing, it would have to be resistant not only to related key attacks, but to a chosen key attack. In this case, the attacker can target weaknesses in your key schedule algorithm by maliciously choosing many keys versus merely knowing that some relation exists between unknown keys that the attacker can’t choose. While chosen-IV attacks are part of the consideration for stream ciphers, I haven’t heard of full chosen-key resistance being an important design criteria. (Please correct me if I’m out of date on this, especially with eStream).
In contrast, resistance to a chosen-input attack is the very definition of a cryptographic hash algorithm. This resistance comes at a performance cost. Turning a hash algorithm into a stream cipher can be done (say, an HMAC using a key and counter), but it’s slower than stream ciphers that were designed as such. Stream cipher designs are optimized for performance and are usually not focused on preventing chosen-key attacks. An interesting corrolary is that analyzing a stream cipher’s key scheduling algorithm as a hash function (e.g., collision resistance) is often a good way to understand its possible weaknesses.
To summarize, don’t use cryptographic primitives for non-standard purposes. There are often built-in assumptions based on the original intended application that could compromise your modified design.