Fault tolerance is an important principle of software systems, especially distributed systems. A fault-tolerant system remains reliable despite failures, i.e., users aren’t aware when individual nodes in the system fail.

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” - Leslie Lamport

A failure is when the system as a whole stops working, which causes unavailability. A fault is when part of the system fails (e.g., a node crash or malicious behaviour, or a network fault such as packet loss, delay, or a network partition), which may cause unavailability.

We define reliability in terms of two properties:

  • Durability — the service doesn’t lose data if it fails (often achieved with persistent storage).
  • Availability — the service continues to make progress even if some nodes have failed.

The main metrics we use:

  • MTBF (mean time between failures) — higher is more reliable.
  • Availability — fraction of time that a service functions correctly.
    • A service that is up for 364 days of the year has an availability of 364/365.
    • We want both metrics to be high. A system that’s down for 1 second every hour may have high availability but a low MTBF, and we wouldn’t consider it a very reliable system (a small worked example follows this list).
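As a quick sanity check on the two metrics, here is a minimal sketch (Python, with purely illustrative numbers) contrasting a single yearly outage with a once-an-hour blip:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def availability(uptime: float, total: float) -> float:
    """Fraction of time the service functions correctly."""
    return uptime / total

# Up 364 of 365 days: availability = 364/365 ≈ 0.99726, with a single
# long outage (MTBF on the order of a year).
print(availability(364 * 24 * 3600, SECONDS_PER_YEAR))  # ~0.99726

# Down 1 second every hour: availability = 3599/3600 ≈ 0.99972 (higher!),
# but MTBF is only ~1 hour, so we wouldn't call this system very reliable.
print(availability(3599, 3600))  # ~0.99972
```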

Main challenges:

  • Nodes may fail at any time: temporarily or forever.
  • Network links may delay messages or drop them altogether (like UDP).
  • It’s hard to know if a node has failed, or the network is down (a network partition).
  • Nodes may lose data or return corrupt data after a crash.
  • Networks may corrupt messages.

How we enable fault tolerance:

  • Retry operations on failure (see the sketch after this list).
  • Store data on persistent storage for durability.
  • Replicate data for high availability, multiple times and at multiple locations.
  • Run redundant computations for high availability.
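A minimal sketch of the first point: retrying an operation on failure with exponential backoff and jitter. The operation, exception handling, and limits here are hypothetical placeholders, not a prescribed implementation:

```python
import random
import time

def retry(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Call `operation` until it succeeds, retrying on failure with
    exponential backoff plus jitter. Re-raises the last error if all
    attempts fail. `operation` and the limits are placeholders."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff: 0.1s, 0.2s, 0.4s, ... capped at max_delay,
            # with jitter so clients don't all retry in lockstep.
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay + random.uniform(0, delay))

# Example use with a hypothetical flaky RPC:
# result = retry(lambda: rpc_client.get("/status"))
```

Note that blind retries are only safe when the operation is idempotent; otherwise a retry after a timeout can apply the same update twice.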

Fault detection

  • SLO (service-level objective) — often described in terms of tail latency, e.g., the 99th-percentile response time.
  • SLA (service-level agreement) — the contract between a customer and a service provider, typically built from SLOs.

In large systems, hardware failures become more common. We can’t assume accurate failure detection in real systems, because timeouts aren’t a correct failure detector under the asynchronous system model: a slow node or a slow network is indistinguishable from a crashed node.

A good failure detector may still temporarily mislabel a correct node as crashed (or a crashed node as correct), but it will eventually label a crashed node as crashed.
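A minimal sketch of such a timeout-based detector, assuming each node sends periodic heartbeats; the interval, timeout value, and node names are hypothetical:

```python
import time

class TimeoutFailureDetector:
    """Suspects a node as crashed if no heartbeat arrives within `timeout`
    seconds. Because the network is asynchronous, a suspicion may be wrong:
    a slow node is un-suspected as soon as its next heartbeat arrives."""

    def __init__(self, timeout: float = 3.0):
        self.timeout = timeout
        self.last_heartbeat: dict[str, float] = {}

    def heartbeat(self, node_id: str) -> None:
        """Record that a heartbeat was received from `node_id`."""
        self.last_heartbeat[node_id] = time.monotonic()

    def suspected(self, node_id: str) -> bool:
        """True if `node_id` is currently suspected to have crashed."""
        last = self.last_heartbeat.get(node_id)
        return last is None or time.monotonic() - last > self.timeout

# Usage (hypothetical node names):
# detector = TimeoutFailureDetector(timeout=3.0)
# detector.heartbeat("node-a")          # called whenever a heartbeat arrives
# if detector.suspected("node-a"): ...  # may be a false (temporary) suspicion
```

A crashed node stops sending heartbeats, so it will eventually be (and stay) suspected; a slow-but-alive node may be suspected temporarily and then cleared, matching the behaviour described above.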