When reviewing source code, I often see a developer making claims like “this can’t happen here” or “to reach this point, X condition must have occurred”. These kinds of statements make a lot of assumptions about the execution environment. In some cases, the assumptions may be justified but often they are not, especially in the face of a motivated attacker.
One recent gain for disk reliability has been ZFS. This filesystem has a mode that generates SHA-1 hashes for all the data it stores on disk. This is an excellent way to make sure a read error hasn’t been silently corrected by the disk to the wrong value or to catch firmware that lies about the status of the physical media. SHA-1 support in filesystems is still a new feature due to the performance loss and the lingering doubt whether it’s needed since it ostensibly duplicates the builtin ECC in the drive.
Mirroring (RAID1) is great because if a disk error occurred, you can often fetch a copy from the other disk. However, it does not address the problem of being certain that an error had occurred and that the data from the other drive was still valid. ZFS gives this assurance. Mirroring could actually be worse than no RAID if you were silently switched to a corrupted disk due to a transient timeout of the other drive. This could happen if its driver had bugs that caused it to timeout on read requests under heavy load, even if the data was perfectly fine. This is not a hypothetical example. It actually happened to me with an old ATA drive.
Still, engineers resist adding any kind of redundant checks, especially of computation results. For example, you rarerly see code adding the result of a subtraction to compare to the original input value. Yet in a critical application, like calculating Bill Gates’ bank account statement, perhaps such paranoia would be warranted. In security-critical code, the developer needs even more paranoia since the threat is active manipulation of the computing environment, not accidental memory flips due to cosmic radiation.
Software protection is one area where redundancy is crucial. A common way to bypass various checks in a protection scheme is to patch them out. For example, the conditional branch that checks for a copied disc to determine if the code should continue to load a game could be inverted or bypassed. By repeating this check in various places, implemented in different ways, it can become more difficult for an attacker to determine if they have removed them all.
As another example, many protection schemes aren’t very introspective. If there is a function that decrypts some data, often you can just grab a common set of arguments to that function and repeatedly call it from your own code, mutating the arguments each time. But if that function had a redundant check that walked the stack to make sure its caller was the one it expected, this could be harder to do. Now, if the whole program was sprinkled with various checks that validated the stack in many different ways, it becomes even harder to hack out any one piece to embed in the attacker’s code. To an ordinary developer, such redundant checks seem wasteful and useless.
Besides software protection, crypto is another area that needs this kind of redundant checking. I previously wrote about how you can recover an RSA private key merely by disrupting any part of the signing operation (e.g., by injecting a fault). The solution to this is for the signer to verify the signature they just made before revealing the signature to an attacker. That is, a sign operation now performs both a sign and verify. If you weren’t concerned about this attack, this redundant operation would seem useless. But without it, an attacker can recover a private key after observing only a single faulty signature.
We need developers to be more paranoid about the validity of stored or computed values, especially in an environment where there could be an adversary. Both software protection and crypto are areas where a very powerful attacker is always assumed to be present. However, other areas of software development could also benefit from this attitude, increasing security and reliability by “wasting” a couple cycles.