Impossibility of Distributed Consensus with One Faulty Process

Next: Consensus with Probability Up: Fail-stop Failures in Previous: Fail-stop Failures in

Impossibility of Distributed Consensus with One Faulty Process

The title of this sub-section is that of the paper that proved this fundamental result in 1985 [9] by Fisher et al. Let us consider a system of completely asynchronous processes. That is, we can make no assumptions about the relative speeds of the processes or the speed of communication. We cannot check to see if a process has failed, and we have no reference to a clock to implement some kind of time-out mechanism. We do assume reliable communication (although messages may arrive in another order than they were sent) and that at most one process may fail (stop permanently) at any time. Despite these good conditions and benign failures, there is no algorithm which can guarantee consensus on a binary value in finite time.

This disturbing result is based on the fact that we cannot tell if a process has died or if it is just very slow in sending its message; we cannot distinguish between a process which is arbitrarily delayed and one which is indefinitely delayed. If this delayed process's input is necessary, say, to break an even vote, then the algorithm may be delayed indefinitely.

The most significant result of this paper is that totally asynchronous systems can never have any kind of fault-tolerance, since they cannot even handle the most benign of faults under the best of conditions. Fault-tolerance in asynchronous systems requires making some assumptions about the system or about the kinds of faults which can be handled. In real systems, this is usually done by assuming an upper bound in communication and processor speed, and considering a process faultly if it doesn't respond within a certain time. This allows the development of an algorithm in which, at every time-step, a process comes closer to making a decision (based on a received message or lack thereof).

The outline of the impossibility proof is as follows. If we can prove that there exists an initial state for which the final decision is undecided, and that starting at any undecided state can lead to another undecided state, then by induction we can show that the system can stay forever undecided. Even if we only consider fair runs (i.e., runs in which all processes have a chance to execute an action), we can show that there exists the possibility of a fair run staying undecided. This confirms the popularly held belief that any algorithm on asynchronous systems has a ``window of vulnerability'' during which a single process death could cause the algorithm to never terminate.

Since we do want fault-tolerance in asynchronous systems, then we must change our model in some way. The next three sub-sections discuss different ways that the model can be changed and the results of such changes.

Next: Consensus with Probability Up: Fail-stop Failures in Previous: Fail-stop Failures in

Lawrence Kesteloot
Fri Jan 20 16:38:32 EST 1995