In software engineering, a distributed system is a set of computers (that don’t share a common memory or clock) connected by a network that work together as a single system, often to provide a common set of services to users. They’re highly concurrent and are an important software paradigm, because they’re designed to be scalable and reliable.
Designing effective distributed systems is sometimes difficult, because there are many moving parts. Software is executed concurrently, communication networks may be unreliable, and there may not be a single node in charge of the system.
Sub-pages
- Key principles
- Technologies and tools
- Relevant fields
Basics
We define a single machine (server or client) as a node. A service is defined as a set of servers of a particular type. A binding occurs when a client (which needs to access a service) becomes associated with a particular server (which can provide it).
Consider the reliability of a single machine:
- There’s a finite limit to the physical resources we can use (e.g., RAM).
- Single point of failure.
- Downtime in case we need to upgrade the machine.
- One physical location, but users likely aren’t.
Broadly, distributed systems try to achieve a few key principles:
- Abstraction — to provide high-level interfaces for services.
- Scalability — to achieve higher performance and capacity with more nodes, i.e., to the user, the system behaves the same even if there’s higher load.
- Fault tolerance — to provide reliable service despite failures, i.e., so users aren’t aware when nodes in the system fail.
- Data consistency — to behave like a single-node, centralised system, i.e., so that users aren’t aware of the difference.
- Security — to behave like a single, trusted system.
Distributed systems make trade-offs over which principles they prioritise in their implementation. For example, typical trade-offs include:
- Performance > fault tolerance
- So replication is avoided, because it increases overhead.
- Performance > consistency
- Like choosing sequential consistency over linearisability.
- Fault tolerance > consistency
- Update one replica synchronously, others asynchronously.
- Read from any replica.
- Security > performance
- Encrypt and authenticate data.
- Communicate with many replicas to build trust.
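The "fault tolerance > consistency" trade-off above (update one replica synchronously, the others asynchronously) can be sketched in a few lines. This is a minimal illustration, not a real replication protocol; the class names `Primary` and `Replica` and the use of a background thread as the async channel are assumptions made for the example.

```python
import threading
import queue

class Replica:
    """A single key-value replica; reads from an async replica may be stale."""
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Primary:
    """Hypothetical primary that updates one replica synchronously and
    the rest asynchronously: writes survive a crash of the primary,
    but a read from an async replica may briefly return stale data."""
    def __init__(self, sync_replica, async_replicas):
        self.sync_replica = sync_replica
        self.async_replicas = async_replicas
        self.log = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, key, value):
        # Acknowledge only once the synchronous replica has the write...
        self.sync_replica.apply(key, value)
        # ...and let the other replicas catch up in the background.
        self.log.put((key, value))

    def _drain(self):
        while True:
            key, value = self.log.get()
            for r in self.async_replicas:
                r.apply(key, value)
```

Reading "from any replica" is what makes this fault tolerant but not strongly consistent: between the write and the background drain, the async replicas disagree with the sync one.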
In fact, there are 8 classic fallacies of distributed computing. They often arise from making assumptions that are valid for local programming:3
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite — can result in bottlenecks.
- The network is secure.
- Topology doesn’t change.
- There is one administrator.
- Transport cost is zero.
- The network is homogeneous.
System models
The model we use to describe a distributed system specifies our assumptions about what kinds of faults/failures can occur. For each of the behaviours below, we pick one assumption. This forms the basis of the algorithms we write, but causes problems if our assumptions turn out to be wrong.
Network behaviour
The network behaviour of a system defines how it communicates. We assume bidirectional, point-to-point communication between two nodes. The system will adopt one type of behaviour:
- Reliable links — we assume that a message is received if and only if it is sent (like TCP). Messages may be reordered (unlike TCP).
- We achieve this from a best-effort link with retries and de-duplication.
- Best-effort links — messages may be lost, duplicated, or reordered (much like UDP). If we keep retrying, a message will eventually get through.
- We achieve this from an insecure link with a secure channel (like adding SSL).
- Insecure links — a malicious adversary may interfere with messages, i.e., it may eavesdrop, modify, drop, spoof, or replay messages.
We define a network partition as the case where some links drop/delay all messages for an extended period of time. Functionally, this is as if the connection were broken.
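The claim that reliable links can be built from best-effort links with retries and de-duplication can be sketched directly. This is a simplified illustration under assumed names (`BestEffortLink`, `ReliableReceiver`); a real implementation would wait for acknowledgements rather than inspecting the receiver's state, as TCP does.

```python
import uuid

class BestEffortLink:
    """Simulated lossy link: drops the first n messages, then delivers."""
    def __init__(self, drop_first_n=0):
        self.drops_left = drop_first_n
        self.inbox = []

    def send(self, msg):
        if self.drops_left > 0:
            self.drops_left -= 1
            return  # message lost
        self.inbox.append(msg)

class ReliableReceiver:
    """De-duplicates by message id, so blind retries are safe."""
    def __init__(self):
        self.seen = set()
        self.delivered = []

    def receive(self, msg_id, payload):
        if msg_id in self.seen:
            return  # duplicate from a retry; ignore it
        self.seen.add(msg_id)
        self.delivered.append(payload)

def reliable_send(link, receiver, payload, max_retries=5):
    """Retry the same message (same id!) until it gets through."""
    msg_id = uuid.uuid4()
    for _ in range(max_retries):
        link.send((msg_id, payload))
        while link.inbox:  # deliver whatever arrived
            receiver.receive(*link.inbox.pop(0))
        if msg_id in receiver.seen:
            return True
    return False
```

The key detail is that retries reuse the same message id: the sender may transmit a message several times, but de-duplication means the receiver delivers it at most once, giving the "received if and only if sent" behaviour.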
Node behaviour
The node behaviour determines how it behaves on failures while executing an algorithm:
- Crash-stop (or fail-stop) failure — a node may crash at any time. After a crash, it stops executing forever.
- Generally these are hardware or power failures.
- If these nodes do return, they return as new nodes to the system. This means that in replicated systems, it takes a long time for these new nodes to get back up to date.
- Crash-recovery (or fail-recovery) failure — again assumes a node can crash at any time. It may resume executing later, but it loses its in-memory state. Persistent data survives a crash.
- Main question: what state should be stored in persistent storage to support faster recovery?
- Byzantine (or fail-arbitrary) failure — a node is faulty if it deviates from the algorithm. Faulty nodes may execute incorrectly, including maliciously.
- These are software failures, and are generally more difficult to reason about.
- Errors can also propagate (from a single logically faulty node).
A node that is not faulty is called correct.
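The main question raised for crash-recovery failures above (what state to put in persistent storage) can be illustrated with a sketch. The class name `CrashRecoveryNode` and the choice of a JSON file as the persistent store are assumptions for the example; a real system would use a write-ahead log with fsync.

```python
import json
import os
import tempfile

class CrashRecoveryNode:
    """Sketch of a crash-recovery node: in-memory state is lost on a
    crash, but anything written to the file at `path` survives and is
    reloaded on restart."""
    def __init__(self, path):
        self.path = path
        self.state = {}
        if os.path.exists(path):
            # Recovering from a crash: rebuild state from disk.
            with open(path) as f:
                self.state = json.load(f)

    def update(self, key, value):
        self.state[key] = value
        # Persist before acknowledging, so a post-crash recovery
        # is guaranteed to see this update.
        with open(self.path, "w") as f:
            json.dump(self.state, f)
```

Persisting on every update makes recovery fast but writes slow; persisting less often (e.g., periodic snapshots plus a log) is the usual trade-off hinted at by the question above.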
Timing behaviour
We assume one of the following for network and nodes:
- Synchronous
- Message latency is no greater than a known upper bound.
- Nodes execute algorithm at a known speed.
- Really, it assumes a shared clock.
- Result: algorithms are easier to design, but this assumption doesn’t reflect reality.
- Asynchronous
- Messages can be delayed arbitrarily.
- Causes may include message loss (which requires a retry), congestion causing queueing, or a network reconfiguration.
- Nodes can pause execution arbitrarily.
- Slowdowns can occur due to IO accesses, OS scheduling, garbage collection, page faults (or swapping or thrashing).
- No timing guarantees at all.
- Result: algorithms are more robust, but hard to design. Sometimes impossible!
- The reason is that when a node stops responding, it’s impossible to determine whether it has actually failed. Timeouts aren’t robust because message delays can be arbitrary, so ping/heartbeat messages aren’t guaranteed to detect failures correctly.
- i.e., we cannot reliably tell the difference between crashed/delayed nodes and lost/delayed messages.
- Partial synchrony is a middle-ground that is a common assumption.
- The system is asynchronous for finite (but unknown) periods of time and synchronous otherwise.
- Algorithms designed with this in mind make a good trade-off between being practical to design but still being realistic.
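The timeout problem above can be made concrete with a heartbeat-based failure detector sketch. The class name `TimeoutFailureDetector` is an assumption for the example; the point is that under asynchrony it can only *suspect* a node, never know it has crashed, and under partial synchrony wrong suspicions are eventually corrected.

```python
import time

class TimeoutFailureDetector:
    """A node is *suspected* (not known) to have crashed if no
    heartbeat has arrived within `timeout` seconds. A slow node and a
    dead node look identical, so suspicions can be wrong; when
    heartbeats get through again, the suspicion is lifted."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heartbeat = {}

    def heartbeat(self, node, now=None):
        self.last_heartbeat[node] = time.time() if now is None else now

    def suspected(self, node, now=None):
        now = time.time() if now is None else now
        last = self.last_heartbeat.get(node)
        return last is None or now - last > self.timeout
```

Passing `now` explicitly keeps the sketch deterministic; a real detector would also adapt the timeout to observed latencies rather than fixing it up front.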
Resources
- A Distributed Systems Reading List, by Dan Creswell
- Textbooks
- Designing Data-Intensive Applications, by Martin Kleppmann
- Building Resilient Distributed Systems, by Sam Newman
- Thinking in Distributed Systems, by Dominik Tornow
- Distributed Systems, by Tim Harris
- Distributed Systems, by Maarten van Steen and Andrew Tanenbaum
- Distributed Systems for Fun and Profit
- Courses
- https://brooker.co.za/blog/2024/06/04/scale.html
- https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
Unsorted notes
lec 1
- kv distributed cache
- concurrency — what happens with multiple updates at the same time
- what order do we: invalidate cache, update database
- option 1: replicated cache
- if many accesses for single value (like b), it’s stored on multiple caches and we can distribute load
- partitioned (sharded) caches
- store by range of keys — range partitioning — tricks to get locality: name keys to lie next to each other
- hash partitioning
- load increase — elastic scaling
- option 1: drop cache data, restart caching (b/c new hash function) — too slow, too much load on database
- option 2: repartition cache data — done in practice. performed in background
- moving to the cloud facilitates this - avoids needing to buy more machines
- load decrease — opposite
- generally: try not to add load on database, so tradeoff betw multiple caches and accessing database favours multiple caches
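The hash-partitioning notes above can be sketched with a few lines. This is a minimal illustration: the function name `partition` is an assumption, and a stable hash (rather than Python's per-process salted `hash()`) is used so the mapping is consistent across processes.

```python
import hashlib

def partition(key, num_caches):
    """Hash partitioning: map a key to one of `num_caches` cache nodes."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_caches
```

The downside of plain modulo is exactly the elastic-scaling problem in the notes: changing `num_caches` remaps almost every key, which is why repartitioning is done gradually in the background (or avoided with consistent hashing, which moves only a small fraction of keys).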
lec 4
- generating unique IDs: random number ⇒ still chance of collision
- if client crashes and restarts, sequence number resets ⇒ server will resend old responses
- timestamps ⇒ clock issues?
- client informing server to delete xid entry → what if client doesn’t send?
- non-responsive clients
- usual async link problem — we don’t know if client is slow or dead
- use timeout, but upon timeout the connection is closed
- some guarantees
- track num of server restarts
- error: bc restart, we may (or may not have) executed it before. we don’t know. takes a conservative approach
- at most once
- if same epoch, we already do dup detection
- if diff epoch, we may have already sent. or not. don’t execute
- issues
- suppose client gets an error, now it knows new epoch number.
- client doesn’t know if it executes or not
- client has to do a read query to figure out state of server before it issues the request
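The at-most-once behaviour in the notes above (duplicate detection by xid within an epoch, conservative rejection across epochs) can be sketched as follows. The class name `AtMostOnceServer` and the shapes of the return values are assumptions for the example.

```python
class AtMostOnceServer:
    """Sketch: cache responses by (client, xid) so retransmissions get
    the old response instead of re-executing. `epoch` counts server
    restarts; a request from an older epoch is rejected because the
    duplicate table was lost in the crash, so we can no longer tell
    whether that request already executed."""
    def __init__(self, epoch):
        self.epoch = epoch
        self.completed = {}  # (client, xid) -> cached response

    def handle(self, client, xid, epoch, op):
        if epoch != self.epoch:
            # Conservative: don't execute; the client learns the new
            # epoch and must read the server's state to find out what
            # happened to its old request.
            return ("error", self.epoch)
        key = (client, xid)
        if key in self.completed:
            return self.completed[key]  # duplicate: resend, don't re-execute
        response = ("ok", op())
        self.completed[key] = response
        return response
```

This matches the notes: within an epoch, duplicates are detected and the op runs at most once; across epochs, the server refuses to execute and the client must issue a read query first.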
clocks
- leased out with timeout
- renews, releases
- logical clock: no relation with physical time
- total ordering:
- d doesn’t impact a, b, c — so even if d’s physical clock is quite late, d still has an earlier logical ordering
- monotonically increasing
- causality assumes all msgs are visible to system
- why question — call by phone is outside system
- breaks logical ordering
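The logical-clock notes above can be sketched with a Lamport clock. This is a standard construction, shown here as a minimal illustration; the class name `LamportClock` is an assumption.

```python
class LamportClock:
    """Logical clock: no relation to physical time, but if event a
    happens-before event b, then L(a) < L(b). It is monotonically
    increasing at each node."""
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time  # timestamp attached to the outgoing message

    def receive(self, msg_time):
        # Jump ahead of the sender's timestamp, then tick.
        self.time = max(self.time, msg_time) + 1
        return self.time
```

The `max(...) + 1` on receive is what preserves causality across nodes; the phone-call example in the notes breaks this precisely because that communication bypasses the system, so no `receive` ever runs for it.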
broadcast communication
- causal broadcast
- vector comparison
- `deps <= D` — for the strictly-less-than case, consider non-FIFO ordering
- for FIFO ordering, trivially `deps == D` on comparison
- vector comparison
- total order
- single leader — premise is: total order built on top of FIFO broadcast
- multiple leaders — can’t have total order
- can pull off causal broadcast by adding vector deps to leader + from leader→nodes
- logical clocks
- wait for one message from every node
- assume messages received in FIFO order
- crashed nodes — protocol fails bc we keep waiting
- ack should send to itself
- ack will be in the buffer - won’t be delivered - just helps progress the algorithm
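The causal-broadcast notes above (vector dependencies, buffering until deliverable) can be sketched as follows. The names `deliverable` and `CausalReceiver` and the exact message format are assumptions for the example.

```python
def deliverable(deps, delivered):
    """A message with dependency vector `deps` is deliverable once we
    have delivered at least deps[i] messages from each node i."""
    return all(deps[i] <= delivered[i] for i in range(len(deps)))

class CausalReceiver:
    """Buffers incoming messages until their vector dependencies are met."""
    def __init__(self, num_nodes):
        self.delivered = [0] * num_nodes  # messages delivered per sender
        self.buffer = []
        self.log = []  # delivery order observed by the application

    def receive(self, sender, deps, payload):
        self.buffer.append((sender, deps, payload))
        self._flush()

    def _flush(self):
        # Keep sweeping the buffer: delivering one message may
        # unblock others that depended on it.
        progress = True
        while progress:
            progress = False
            for msg in list(self.buffer):
                sender, deps, payload = msg
                if deliverable(deps, self.delivered):
                    self.buffer.remove(msg)
                    self.delivered[sender] += 1
                    self.log.append(payload)
                    progress = True
```

This shows why a crashed node stalls the protocol, as the notes say: a message whose dependencies name the crashed node's undelivered messages sits in the buffer forever.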
Footnotes
1. From Introduction to Distributed System Design at Google Code University. ↩
2. “The problem was that, as Google grew, its computing infrastructure also expanded. Computer hardware rarely failed, until you had enough of it—then it failed all the time. Wires wore down, hard drives fell apart, motherboards overheated. Many machines never worked in the first place; some would unaccountably grow slower.” - from this article talking about Google in the early 2000s ↩
3. The fallacies of distributed computing article at Wikipedia. ↩