Fault-tolerant system design

When designing a system, I have a hierarchy of properties. It is somewhat fluid depending on the application but roughly breaks down as follows.

  1. Correct: must perform the basic required function
  2. Robust: … under less-than-perfect operating conditions
  3. Reliable: … handling and recovering from internal errors
  4. Secure: … in the face of a malicious adversary
  5. Performant: … as fast as possible

I’d like to talk about how to achieve reliability. Reliability is the property where the system recovers from out-of-spec conditions and implementation errors with minimal impact on the other properties (e.g., correctness, security). While most system designers have an idea of the desired outcome, surprisingly few have a strategy for getting there. Designing for reliability also produces a system that is easier to secure (ask Dan Bernstein).

Break down your design into logical components

This should go without saying, but if you have a monolithic design, fault recovery becomes very difficult. If there is an implicit linkage between components, over time it will grow explicit. Corollary: const usage only decreases over time in a fielded system, never increases.

Each component of the system should be able to reset independently or in groups

As you break down the system into components, consider what dependencies a reset of each component triggers. A simple module with fewer dependencies is best. I use the metric of “cross-section” to describe the complexity of inter-module dependencies.
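
One way to make reset dependencies concrete is to write them down as a graph and ask which components a given reset drags along. The sketch below is only an illustration: the component names are made up, and counting dependency edges is just one assumed reading of “cross-section”, not a formal definition.

```python
from collections import deque

# Hypothetical components: each key depends on the components in its value.
DEPENDS_ON = {
    "storage": [],
    "cache": ["storage"],
    "api": ["cache", "storage"],
    "metrics": [],
}


def reset_group(component, depends_on=DEPENDS_ON):
    """Return every component that must reset when `component` resets."""
    # Invert the graph so we can look up who depends on a component.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(name)

    # Breadth-first walk over direct and indirect dependents.
    group, queue = {component}, deque([component])
    while queue:
        for user in dependents[queue.popleft()]:
            if user not in group:
                group.add(user)
                queue.append(user)
    return group


def cross_section(depends_on=DEPENDS_ON):
    """Assumed metric: total number of inter-component dependency edges."""
    return sum(len(deps) for deps in depends_on.values())
```

Here a reset of "storage" pulls in "cache" and "api", while "metrics" can reset alone. A lower edge count generally means smaller reset groups.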

Implement reset in terms of destructors/constructors

Right from the beginning, implement reset. Since you’re already coding the constructors and should have been coding destructors (right?), reset should be easy. If possible, use reset as part of the normal initialization process to be sure it doesn’t become dead code.
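
A minimal sketch of this idea, assuming a hypothetical `Connection` component. Python has no deterministic destructors, so an explicit `_destruct` method stands in for one here. The point is that startup goes through `reset()`, so the reset path runs on every initialization.

```python
class Connection:
    """Hypothetical component; the fields are illustrative only."""

    def __init__(self, host):
        self.host = host       # configuration survives a reset
        self.generation = 0    # counts how many times state was rebuilt
        self.buffer = None
        self.open = False
        self.reset()           # normal startup uses the reset path

    def _construct(self):
        # Everything that can go stale is rebuilt here.
        self.buffer = []
        self.open = True
        self.generation += 1

    def _destruct(self):
        # Release resources; safe to call on a never-constructed component.
        self.buffer = None
        self.open = False

    def reset(self):
        self._destruct()
        self._construct()
```

Because `__init__` calls `reset()`, a broken reset shows up the first time the component starts rather than the first time a fault occurs in the field.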

Increase the speed of each component’s recovery and make them restart independently if possible

If components take a long time to recover, customers or management may push to ditch this feature. A component should never take longer to recover than a full system reset; if it does, rethink its design. Independence means that resets can proceed in parallel, which also increases performance.
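
To illustrate the parallelism point, here is a rough sketch that resets independent components concurrently using a thread pool. `SlowComponent` and its `delay` are hypothetical stand-ins for real recovery work, and the sketch assumes the components truly share no state.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class SlowComponent:
    """Hypothetical component whose reset takes `delay` seconds."""

    def __init__(self, name, delay):
        self.name = name
        self.delay = delay
        self.resets = 0

    def reset(self):
        time.sleep(self.delay)  # stands in for real recovery work
        self.resets += 1


def reset_all(components):
    """Reset independent components concurrently; return elapsed seconds."""
    start = time.monotonic()
    workers = max(1, len(components))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for completion and re-raises any reset failure.
        list(pool.map(lambda c: c.reset(), components))
    return time.monotonic() - start
```

With independent components, the elapsed time approaches that of the slowest reset rather than the sum of all of them.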

Add a rollback feature to components

In cases where a full reset results in loss of data or takes too long, rollback may be another option. Are there intermediate states that can be safely reverted while keeping the component running?
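
As a sketch of what rollback might look like, the hypothetical `Ledger` below keeps a copy of its last known-good state and reverts to it when an operation fails partway through. Deep-copying the whole state is an assumption that only suits small components; larger ones would need an undo log or similar.

```python
import copy


class Ledger:
    """Hypothetical component with checkpoint/rollback support."""

    def __init__(self):
        self.balances = {}
        self._checkpoint = {}

    def commit(self):
        # Record the current state as the last known-good one.
        self._checkpoint = copy.deepcopy(self.balances)

    def rollback(self):
        # Discard changes since the last commit; the object stays live.
        self.balances = copy.deepcopy(self._checkpoint)

    def transfer(self, src, dst, amount):
        try:
            self.balances[src] -= amount
            if self.balances[src] < 0:
                raise ValueError("insufficient funds")
            self.balances[dst] = self.balances.get(dst, 0) + amount
            self.commit()
        except Exception:
            self.rollback()  # undo the partial update, then report it
            raise
```

A failed transfer leaves the balances exactly as they were after the last successful one, and the component keeps running without a reset.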

Add error-injection features to test fault recovery

Every component should have a maintenance interface to inject faults. Or, design a standard test framework that does this. At the very least, it should be possible to externally trigger a reset of every component. It’s even better to allow the tester to inject errors in components to see if they detect and recover properly.
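
Here is one possible shape for such a maintenance interface, using a hypothetical `Sensor` component. The tester arms a fault with `inject_fault`, and the next `read()` has to detect it and recover. The names and the placeholder return value are assumptions for illustration.

```python
class Sensor:
    """Hypothetical component with a maintenance interface for testing."""

    def __init__(self):
        self.resets = 0
        self._injected = None
        self._start()

    def _start(self):
        self.healthy = True

    # Maintenance interface: used by testers, not by normal callers.
    def inject_fault(self, exc):
        """Arm a fault that fires on the next read()."""
        self._injected = exc

    def reset(self):
        """Externally triggerable reset."""
        self.resets += 1
        self._start()

    # Normal interface.
    def read(self):
        try:
            if self._injected is not None:
                exc, self._injected = self._injected, None
                raise exc
            return 42  # placeholder measurement
        except Exception:
            self.healthy = False
            self.reset()  # detect the fault, then recover
            return None
```

A test can then call `inject_fault(RuntimeError("bus error"))`, check that the next `read()` reports a failure, and confirm that the one after it succeeds.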

Instrument components to get a good report of where the fault appeared

A system is only as debuggable as its visibility allows. A lightweight trace generation feature (e.g., FreeBSD KTR) can give engineers the information needed to diagnose faults. It should be fast enough to always be available, not just in debug builds.
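
A minimal sketch of a lightweight trace facility, built on a fixed-size ring buffer. This is not KTR’s interface, just an assumed illustration of the general idea: recording an event is cheap, old entries are overwritten, and formatting is deferred until someone asks for a dump.

```python
import time
from collections import deque


class TraceBuffer:
    """Fixed-size in-memory trace; the oldest entries are overwritten."""

    def __init__(self, capacity=1024):
        self._entries = deque(maxlen=capacity)

    def trace(self, component, event, **fields):
        # Appending a tuple is cheap; formatting is deferred to dump().
        self._entries.append((time.monotonic(), component, event, fields))

    def dump(self):
        """Return the retained entries, oldest first, as formatted lines."""
        return [
            f"{ts:.6f} {component}: {event} {fields}"
            for ts, component, event, fields in self._entries
        ]
```

Because the buffer has a fixed size, it can stay enabled in production builds without growing memory use, and a fault handler can call `dump()` to capture the events leading up to the failure.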
