A fault-tolerant system is one that can sustain a reasonable number of process or communication failures, both intermittent and permanent. The ability to solve the consensus problem is at the heart of any fault-tolerant system, because it is through consensus that non-faulty processes agree on which faults have occured, and how to circumvent them.
The consensus problem is defined as follows. All processes must agree on a binary value, based on the votes of each process. They must all agree on the same value, and that value must be the vote of at least one process (e.g., they cannot all decide on 1 when they all voted for 0). Note that it is not part of the definition that the decided value must be the majority vote; for example, in a distributed database application, each process may vote on whether to commit on a particular transaction, and if a single process votes no, then the decided value must be no and all processes must abort the transaction. There are variants of this definition, such as majority requirements or approximate agreement (agreement on numerical values where each process must decide on a value and all decided values must be within of each other).
Consensus in the presence of faults is difficult, especially when few assumptions can be made about the underlying system or the kinds of faults that can occur. Systems with different levels of synchrony or different kinds of failures require different algorithms. At one extreme, a system can be totally asynchronous, in that no assumptions can be made about the relative speeds of the processes or the communication medium. At the other, a system can be totally synchronous where we can assume upper bounds on processing and communication delays.
Usually, two kinds of failures are considered. Fail-stop failures cause a process to die at any time and stop participating in the algorithm. Byzantine failures are those where a process sends incorrect information, possibly according to a malevolent plan. Section 2 of this paper will consider fail-stop failures in asynchronous systems, and Section 3 will consider Byzantine failures in synchronous systems.
The other two combinations are not usually considered. Fail-stop failures in synchronous systems are fairly uninteresting: the failure of a process is immediately detected by the lack of messages from it, and all other processes can change their behaviors accordingly (see early stopping algorithms below). Byzantine failures in asynchronous systems are either equivalent to those in synchronous systems if the malevolent process sends messages, or equivalent to fail-stop failures if it does not.
This survey does not deal with tolerance of systemic failures (i.e., self-stabilization), but only with tolerance of process failures. Only message-passing systems are considered; shared memory algorithms fall in the category of wait-free constructions, which is outside the scope of this paper.