One system model for distributed systems relates to Byzantine failures. We say that a node is faulty if it deviates from the algorithm. Under this scheme, nodes may execute incorrectly (including with malicious intent).

Byzantine failures can occur due to:

  • Hardware faults — like firmware bugs, bit flips in memory, corrupted network packets, etc.
  • Software bugs — like logic errors, memory corruption, concurrency bugs.
  • Or malicious behaviour altogether — like if a node is running modified/arbitrary code, message changes (alterations, inserting, delays, drops), sending different messages to different nodes (send action A to node 1, action B to node 2), spoofing messages (sending with another identity), or by colluding with other malicious nodes.
  • Error propagation through the network, such as from a single logically faulty node (like in the Bellman-Ford algorithm).

A system is Byzantine fault tolerant if it is able to recover or survive from a Byzantine failure. In general, Byzantine algorithms tend to be very different from other distributed algorithms (like those that assume crash-stop or crash-recovery), because it’s not possible to convert from one node behaviour into another. We cannot automatically detect/shutdown faulty replicas, because some replicas might present different outputs to other replicas, so it’s not always possible to know if a replica is faulty just from its output.