Consensus is a mechanism by which a distributed system automatically chooses a leader node in replicated schemes (primary-backup, state machine replication). Consensus algorithms are helpful because they provide fault tolerance without hardcoding a manual takeover scheme or relying on an external view server (which is itself prone to failure).

In consensus algorithms, a set of nodes have to agree on a single data value (in practice, usually a leader node) in the presence of failures. This requires a few key properties:

  • For correctness (safety of the system):
    • Agreement: no two correct (i.e., non-faulty) nodes decide differently.
    • Integrity: no node decides twice.
    • Validity: any value decided was proposed by some node.
  • For liveness (i.e., so that the system progresses), we require termination, that each correct node eventually decides on a value.
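As a toy illustration, the safety properties above can be expressed as checks over a record of proposals and decision events (all names here are invented for this sketch, not part of any standard library):

```python
# Toy checker for the consensus safety properties (illustrative only).

def check_safety(proposed, decisions):
    """proposed: set of values any node proposed.
    decisions: list of (node, value) decision events, in order."""
    decided_by = {}
    for node, value in decisions:
        # Integrity: no node decides twice.
        assert node not in decided_by, f"{node} decided twice"
        # Validity: any value decided was proposed by some node.
        assert value in proposed, f"{value} was never proposed"
        decided_by[node] = value
    # Agreement: no two correct nodes decide differently.
    assert len(set(decided_by.values())) <= 1, "nodes disagree"
    return True

# Example: three nodes decide the same proposed value -- all checks pass.
check_safety({"a", "b"}, [(1, "a"), (2, "a"), (3, "a")])
```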

Consensus is equivalent to total order broadcast. When nodes need to broadcast messages in a total order, a consensus round decides on the first message to deliver, and all nodes deliver that message first. Repeating this round by round delivers every message to every node in the same order.
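A minimal single-process sketch of this reduction (the `consensus` stand-in here deterministically picks the smallest proposal in place of a real algorithm; all names are mine):

```python
# Sketch: total order broadcast built from repeated consensus rounds.
# Each round, every node proposes the next message it wants to broadcast;
# consensus picks one value, and every node delivers it.

def consensus(proposals):
    # Stand-in for a real consensus algorithm: deterministically
    # agree on the smallest proposed value.
    return min(proposals)

def total_order_broadcast(pending):
    """pending: dict node -> list of messages that node wants to send.
    Returns the delivery order, which is identical at every node."""
    delivered = []
    while any(pending.values()):
        proposals = {msgs[0] for msgs in pending.values() if msgs}
        winner = consensus(proposals)      # this round decides one message
        delivered.append(winner)           # every node delivers it next
        for msgs in pending.values():      # remove it from senders' queues
            if msgs and msgs[0] == winner:
                msgs.pop(0)
    return delivered

order = total_order_broadcast({"n1": ["m1", "m3"], "n2": ["m2"]})
# Since every round decides a single message, every node computes
# the same delivery order.
```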

In practice

Important widespread consensus algorithms include:

  • Paxos (best-effort, partially synchronous, crash-recovery)
  • Raft (best-effort, partially synchronous, crash-recovery)
  • PBFT (closed membership, Byzantine fault tolerant)
  • Bitcoin (open membership, Byzantine fault tolerant)

All consensus algorithms assume a particular system model. In particular, no algorithm can assume an asynchronous system model, because in that model there is no inherently reliable way to detect failures (a timeout cannot distinguish a crashed node from a merely slow one).
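To make the timeout problem concrete, here is a sketch (all names and values are invented for illustration) of why a timeout-based failure detector cannot tell a crashed node from a slow one:

```python
# Sketch: timeouts as an unreliable failure detector. A slow-but-alive
# node and a crashed node look identical to a timeout-based detector.

TIMEOUT = 5  # detector waits this many time units for a heartbeat

def suspect(last_heartbeat_delay):
    """Timeout-based failure detector: suspect the node if its last
    heartbeat took longer than TIMEOUT (None = no heartbeat at all)."""
    return last_heartbeat_delay is None or last_heartbeat_delay > TIMEOUT

crashed_node = None   # never sends a heartbeat
slow_node = 9         # alive, but its heartbeat was delayed 9 units

suspect(crashed_node)  # True: correctly suspected
suspect(slow_node)     # True: falsely suspected, indistinguishable
```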

In fact, consensus in a fully asynchronous model is essentially impossible. The FLP result tells us that no deterministic consensus algorithm is guaranteed to terminate in an asynchronous crash-stop system model. A non-deterministic (probabilistic) algorithm can work around this, but its dependence on randomness makes correctness/termination harder to prove.

So consensus algorithms need to ensure correctness without depending on timeouts. They also need to ensure progress even when failure detection is potentially incorrect.
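Raft-style algorithms achieve this by tying safety to majority votes within numbered terms rather than to timeout accuracy. A heavily simplified sketch of that idea (names are mine, and real elections involve log comparisons this omits):

```python
# Sketch of majority-vote-per-term (Raft-like, heavily simplified):
# a spurious timeout may start an election, but each node grants at
# most one vote per term, so at most one leader can win that term.

from collections import defaultdict

NODES = ["n1", "n2", "n3", "n4", "n5"]
voted_in_term = defaultdict(dict)  # term -> {voter: candidate}

def request_votes(candidate, term):
    votes = 0
    for voter in NODES:
        # Each node grants at most one vote per term.
        if voter not in voted_in_term[term]:
            voted_in_term[term][voter] = candidate
            votes += 1
    return votes > len(NODES) // 2  # leader needs a strict majority

# Two candidates' timeouts fire in the same term (failure detection was
# wrong for at least one of them) -- still, at most one can win.
a_wins = request_votes("n1", term=1)
b_wins = request_votes("n2", term=1)
```

The point of the sketch: safety (at most one leader per term) comes from the vote-once rule and majority quorums, so a mistaken timeout costs only an extra election round, never correctness.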

  • Total order broadcast and fault tolerance: how do we replace a failed leader?
    • Manual option: an operator promotes a new leader, but the problem is we need to wait for the operator to figure out the leader has crashed.
    • Automatic option triggered by a dropped connection: a dropped connection could just as well be a network partition, so this is basically the same problem as (unreliable) failure detection.