Fault tolerance is an important principle of software systems, especially distributed systems. A fault-tolerant system remains reliable despite faults, i.e., users aren’t aware when individual nodes in a system fail.
“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” - Leslie Lamport
A failure is when the system as a whole stops working, which causes unavailability. A fault is when part of the system fails (e.g., a node crash or malicious behaviour, or a network fault such as packet loss, delay, or a network partition), which may cause unavailability.
We define reliability in terms of two properties:
- Durability — the service doesn’t lose data even if nodes fail (often achieved with persistent storage).
- Availability — the service continues to make progress even if some nodes have failed.
The main metrics we use:
- MTBF (mean time between failures) — higher is more reliable.
- Availability — fraction of time that a service functions correctly.
  - A service that is up for 364 days of the year has an availability of 364/365 ≈ 99.7%.
- We want both metrics to be high. A system that’s down for 1 second every hour may have high availability but a low MTBF, and we wouldn’t consider it a very reliable system (see the worked example after this list).
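
As a quick worked check of these two metrics (a minimal sketch; the numbers come straight from the examples above, not from any real service):

```python
# Availability / MTBF arithmetic for the two examples above.

SECONDS_PER_HOUR = 3600

# A service that is down for 1 second every hour:
availability = (SECONDS_PER_HOUR - 1) / SECONDS_PER_HOUR
print(f"availability = {availability:.5f}")  # 0.99972 -- high availability

# ...but it fails once per hour, so MTBF is roughly one hour: low,
# which is why we wouldn't call this system reliable.

# A service that is up 364 days of the year:
print(f"availability = {364 / 365:.5f}")     # 0.99726
```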
Main challenges:
- Nodes may fail at any time: temporarily or forever.
- Network links may delay messages or drop them altogether (like UDP).
- It’s hard to tell whether a node has failed or the network is down (a network partition).
- Nodes may lose data or return corrupt data after a crash.
- Networks may corrupt messages.
How we enable fault tolerance:
- Retry operations on failure (a minimal retry sketch follows this list).
- Store data on persistent storage for durability.
- Replicate data for high availability, multiple times and at multiple locations.
- Run redundant computations for high availability.
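
As an illustration of the first point, here is a minimal sketch of retrying with exponential backoff and jitter (the names `operation`, `max_attempts`, and `base_delay` are illustrative, not from any particular library):

```python
import random
import time

def retry(operation, max_attempts=5, base_delay=0.1):
    """Retry a failing callable with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Back off exponentially, with random jitter so that many
            # failing clients don't all retry at the same instant.
            delay = base_delay * (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay))
```

Note that blind retries are only safe when the operation is idempotent; otherwise a request whose reply was lost may end up being applied twice.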
Fault detection:
- A service-level objective (SLO) is often described in terms of tail latency, e.g., the 99.9th-percentile response time.
- A service-level agreement (SLA) is the contract between a service provider and its customers, typically built on SLOs.
- In large systems, hardware failures become more common, so fault detection matters more.
- We can’t assume accurate failure detection in real systems, because timeouts aren’t a correct failure detector under the asynchronous system model: a slow node or a slow network is indistinguishable from a crashed node.
- A good detector may still temporarily mislabel a correct node as crashed (or a crashed node as correct), but it will eventually label a crashed node as crashed.
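
To make the last point concrete, here is a minimal sketch of a heartbeat/timeout-based failure detector (the class and method names are illustrative, not from any particular framework):

```python
import time

class TimeoutFailureDetector:
    """Suspect nodes whose heartbeats stop arriving."""

    def __init__(self, timeout_seconds=5.0):
        self.timeout = timeout_seconds
        self.last_heartbeat = {}  # node id -> monotonic time of last heartbeat

    def record_heartbeat(self, node):
        """Call whenever a heartbeat message arrives from `node`."""
        self.last_heartbeat[node] = time.monotonic()

    def suspected(self, node):
        """Suspect `node` if no heartbeat arrived within the timeout."""
        last = self.last_heartbeat.get(node)
        return last is None or time.monotonic() - last > self.timeout
```

Because it relies on a timeout, this detector can temporarily suspect a slow-but-correct node (and clear it once the next heartbeat arrives), while a crashed node stops sending heartbeats entirely and so eventually stays suspected forever — matching the eventual behaviour described above.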