Fault tolerance is an important principle of software systems, especially distributed systems. A fault-tolerant system remains reliable despite failures, i.e., users aren’t aware when individual nodes in the system fail.

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” - Leslie Lamport

A failure is when the system as a whole stops working, which causes unavailability. A fault is when part of the system fails (e.g., a node crash or malicious behaviour, or a network fault such as packet loss, delay, or a network partition), which may cause unavailability.

We define reliability in terms of two properties:

  • Durability — the service doesn’t lose data if it fails (often achieved with persistent storage).
  • Availability — the service continues to make progress even if some nodes have failed.

The main metrics we use:

  • MTBF (mean time between failures) — higher is more reliable.
  • Availability — fraction of time that a service functions correctly.
    • A service that is up for 364 days of the year has an availability of 364/365.
    • We want both metrics to be high. A system that’s down for 1 second every hour may have high availability but a low MTBF, and we wouldn’t consider it a very reliable system (a small worked example follows this list).
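As a quick sanity check on the two metrics, here is a minimal sketch (Python, with purely illustrative numbers) contrasting a single yearly outage with a once-an-hour blip:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def availability(uptime: float, total: float) -> float:
    """Fraction of time the service functions correctly."""
    return uptime / total

# Up 364 of 365 days: availability = 364/365 ≈ 0.99726, with a single
# long outage (MTBF on the order of a year).
print(availability(364 * 24 * 3600, SECONDS_PER_YEAR))  # ~0.99726

# Down 1 second every hour: availability = 3599/3600 ≈ 0.99972 (higher!),
# but MTBF is only ~1 hour, so we wouldn't call this system very reliable.
print(availability(3599, 3600))  # ~0.99972
```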

Main challenges:

  • Nodes may fail at any time: temporarily or forever.
  • Network links may delay messages or drop them altogether (like UDP).
  • It’s hard to know if a node has failed, or the network is down (a network partition).
  • Nodes may lose data or return corrupt data after a crash.
  • Networks may corrupt messages.

How we enable fault tolerance:

  • Retry operations on failure (see the sketch after this list).
  • Store data on persistent storage for durability.
  • Replicate data for high availability, multiple times and at multiple locations.
  • Run redundant computations for high availability.
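A minimal sketch of the first point: retrying an operation on failure with exponential backoff and jitter. The operation, exception handling, and limits here are hypothetical placeholders, not a prescribed implementation:

```python
import random
import time

def retry(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Call `operation` until it succeeds, retrying on failure with
    exponential backoff plus jitter. Re-raises the last error if all
    attempts fail. `operation` and the limits are placeholders."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff: 0.1s, 0.2s, 0.4s, ... capped at max_delay,
            # with jitter so clients don't all retry in lockstep.
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay + random.uniform(0, delay))

# Example use with a hypothetical flaky RPC:
# result = retry(lambda: rpc_client.get("/status"))
```

Note that blind retries are only safe when the operation is idempotent; otherwise a retry after a timeout can apply the same update twice.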

Fault detection

  • SLO (service-level objective) — often described in terms of tail latency, e.g., the 99th-percentile response time.
  • SLA (service-level agreement) — the contract between a customer and a service provider, typically built from SLOs.

In large systems, hardware failures become more common. We can’t assume accurate failure detection in real systems, because timeouts aren’t a correct failure detector under the asynchronous system model: a slow node or a slow network is indistinguishable from a crashed node.

A good failure detector may still temporarily mislabel a correct node as crashed (or a crashed node as correct), but it will eventually label a crashed node as crashed.
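A minimal sketch of such a timeout-based detector, assuming each node sends periodic heartbeats; the interval, timeout value, and node names are hypothetical:

```python
import time

class TimeoutFailureDetector:
    """Suspects a node as crashed if no heartbeat arrives within `timeout`
    seconds. Because the network is asynchronous, a suspicion may be wrong:
    a slow node is un-suspected as soon as its next heartbeat arrives."""

    def __init__(self, timeout: float = 3.0):
        self.timeout = timeout
        self.last_heartbeat: dict[str, float] = {}

    def heartbeat(self, node_id: str) -> None:
        """Record that a heartbeat was received from `node_id`."""
        self.last_heartbeat[node_id] = time.monotonic()

    def suspected(self, node_id: str) -> bool:
        """True if `node_id` is currently suspected to have crashed."""
        last = self.last_heartbeat.get(node_id)
        return last is None or time.monotonic() - last > self.timeout

# Usage (hypothetical node names):
# detector = TimeoutFailureDetector(timeout=3.0)
# detector.heartbeat("node-a")          # called whenever a heartbeat arrives
# if detector.suspected("node-a"): ...  # may be a false (temporary) suspicion
```

A crashed node stops sending heartbeats, so it will eventually be (and stay) suspected; a slow-but-alive node may be suspected temporarily and then cleared, matching the behaviour described above.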