One system model for distributed systems relates to Byzantine failures. We say that a node is faulty if it deviates from the algorithm. Under this scheme, nodes may execute incorrectly (including with malicious intent).
Byzantine failures can occur due to:
- Hardware faults — like firmware bugs, bit flips in memory, corrupted network packets, etc.
- Software bugs — like logic errors, memory corruption, concurrency bugs.
- Or malicious behaviour altogether — like if a node is running modified/arbitrary code, message changes (alterations, inserting, delays, drops), sending different messages to different nodes (send action A to node 1, action B to node 2), spoofing messages (sending with another identity), or by colluding with other malicious nodes.
- Error propagation through the network, such as from a single logically faulty node (like in the Bellman-Ford algorithm).
A system is Byzantine fault tolerant if it is able to recover or survive from a Byzantine failure. In general, Byzantine algorithms tend to be very different from other distributed algorithms (like those that assume crash-stop or crash-recovery), because it’s not possible to convert from one node behaviour into another. We cannot automatically detect/shutdown faulty replicas, because some replicas might present different outputs to other replicas, so it’s not always possible to know if a replica is faulty just from its output.