In software engineering, a distributed system is a set of computers (that don’t share a common memory or clock) connected by a network that work together as a single system, often to provide a common set of services to users. They’re highly concurrent and are an important software paradigm, because they’re designed to be scalable and reliable.
Designing effective distributed systems is sometimes difficult, because there are many moving parts. Software is executed concurrently, communication networks may be unreliable, and there may not be a single node in charge of the system.
Sub-pages
- Key principles
- Technologies and tools
- Relevant fields
Basics
We define a single machine (server or client) as a node. A service is defined as a set of servers of a particular type. A binding occurs when a client (which needs to access a service) becomes associated with a particular server (which can provide it).
Consider the reliability of a single machine:
- There’s a finite limit to the physical resources we can use (e.g., RAM).
- Single point of failure.
- Downtime in case we need to upgrade the machine.
- One physical location, but users likely aren’t.
Broadly, distributed systems try to achieve a few key principles:
- Abstraction — to provide high-level interfaces for services.
- Scalability — to achieve higher performance and capacity with more nodes, i.e., to the user, the system behaves the same even if there’s higher load.
- Fault tolerance — to provide reliable service despite failures, i.e., so users aren’t aware when nodes in the system fail.
- Data consistency — to behave like a single-node, centralised system, i.e., so that users aren’t aware of the difference.
- Security — to behave like a single, trusted system.
Distributed systems make trade-offs over which principles they prioritise in their implementation. For example, typical trade-offs include:
- Performance > fault tolerance
- So replication is avoided, because it increases overhead.
- Performance > consistency
- Like choosing sequential consistency over linearisability.
- Fault tolerance > consistency
- Update one replica synchronously, others asynchronously.
- Read from any replica.
- Security > performance
- Encrypt and authenticate data.
- Communicate with many replicas to build trust.
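The "fault tolerance > consistency" trade-off above (update one replica synchronously, the others asynchronously) can be sketched in a few lines. This is a minimal illustration, not a real replication protocol; the class names `Primary` and `Replica` and the use of a background thread as the async channel are assumptions made for the example.

```python
import threading
import queue

class Replica:
    """A single key-value replica; reads from an async replica may be stale."""
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Primary:
    """Hypothetical primary that updates one replica synchronously and
    the rest asynchronously: writes survive a crash of the primary,
    but a read from an async replica may briefly return stale data."""
    def __init__(self, sync_replica, async_replicas):
        self.sync_replica = sync_replica
        self.async_replicas = async_replicas
        self.log = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, key, value):
        # Acknowledge only once the synchronous replica has the write...
        self.sync_replica.apply(key, value)
        # ...and let the other replicas catch up in the background.
        self.log.put((key, value))

    def _drain(self):
        while True:
            key, value = self.log.get()
            for r in self.async_replicas:
                r.apply(key, value)
```

Reading "from any replica" is what makes this fault tolerant but not strongly consistent: between the write and the background drain, the async replicas disagree with the sync one.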
In fact, there are 8 classic fallacies of distributed computing. They often arise from making assumptions that are valid for local programming:3
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite — can result in bottlenecks.
- The network is secure.
- Topology doesn’t change.
- There is one administrator.
- Transport cost is zero.
- The network is homogeneous.
System models
The model we use to describe a distributed system specifies our assumptions about what kinds of faults/failures can occur. For each of the behaviours below, we pick one assumption. This forms the basis of the algorithms we write, but causes problems if our assumptions turn out to be wrong.
Network behaviour
The network behaviour of a system defines how it communicates. We assume bidirectional, point-to-point communication between two nodes. The system will adopt one type of behaviour:
- Reliable links — we assume that a message is received if and only if it is sent (like TCP). Messages may be reordered (unlike TCP).
- We achieve this from a best-effort link with retries and de-duplication.
- Best-effort links — messages may be lost, duplicated, or reordered (much like UDP). If we keep retrying, a message will eventually get through.
- We achieve this from an insecure link with a secure channel (like adding SSL).
- Insecure links — a malicious adversary may interfere with messages, i.e., it may eavesdrop, modify, drop, spoof, or replay messages.
We define a network partition as the case where some links drop/delay all messages for an extended period of time. Functionally, this is as if the connection were broken.
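The claim that reliable links can be built from best-effort links with retries and de-duplication can be sketched directly. This is a simplified illustration under assumed names (`BestEffortLink`, `ReliableReceiver`); a real implementation would wait for acknowledgements rather than inspecting the receiver's state, as TCP does.

```python
import uuid

class BestEffortLink:
    """Simulated lossy link: drops the first n messages, then delivers."""
    def __init__(self, drop_first_n=0):
        self.drops_left = drop_first_n
        self.inbox = []

    def send(self, msg):
        if self.drops_left > 0:
            self.drops_left -= 1
            return  # message lost
        self.inbox.append(msg)

class ReliableReceiver:
    """De-duplicates by message id, so blind retries are safe."""
    def __init__(self):
        self.seen = set()
        self.delivered = []

    def receive(self, msg_id, payload):
        if msg_id in self.seen:
            return  # duplicate from a retry; ignore it
        self.seen.add(msg_id)
        self.delivered.append(payload)

def reliable_send(link, receiver, payload, max_retries=5):
    """Retry the same message (same id!) until it gets through."""
    msg_id = uuid.uuid4()
    for _ in range(max_retries):
        link.send((msg_id, payload))
        while link.inbox:  # deliver whatever arrived
            receiver.receive(*link.inbox.pop(0))
        if msg_id in receiver.seen:
            return True
    return False
```

The key detail is that retries reuse the same message id: the sender may transmit a message several times, but de-duplication means the receiver delivers it at most once, giving the "received if and only if sent" behaviour.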
Node behaviour
The node behaviour determines how it behaves on failures while executing an algorithm:
- Crash-stop (or fail-stop) failure — a node may crash at any time. After a crash, it stops executing forever.
- Generally these are hardware or power failures.
- If these nodes do return, they return as new nodes to the system. This means that in replicated systems, it takes a long time for these new nodes to get back up to date.
- Crash-recovery (or fail-recovery) failure — again assumes a node can crash at any time. It may resume executing later, but it loses its in-memory state. Persistent data survives a crash.
- Main question: what state should be stored in persistent storage to support faster recovery?
- Byzantine (or fail-arbitrary) failure — a node is faulty if it deviates from the algorithm. Faulty nodes may execute incorrectly, including maliciously.
- These are software failures, and are generally more difficult to reason about.
- Errors can also propagate (from a single logically faulty node).
A node that is not faulty is called correct.
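The main question raised for crash-recovery failures above (what state to put in persistent storage) can be illustrated with a sketch. The class name `CrashRecoveryNode` and the choice of a JSON file as the persistent store are assumptions for the example; a real system would use a write-ahead log with fsync.

```python
import json
import os
import tempfile

class CrashRecoveryNode:
    """Sketch of a crash-recovery node: in-memory state is lost on a
    crash, but anything written to the file at `path` survives and is
    reloaded on restart."""
    def __init__(self, path):
        self.path = path
        self.state = {}
        if os.path.exists(path):
            # Recovering from a crash: rebuild state from disk.
            with open(path) as f:
                self.state = json.load(f)

    def update(self, key, value):
        self.state[key] = value
        # Persist before acknowledging, so a post-crash recovery
        # is guaranteed to see this update.
        with open(self.path, "w") as f:
            json.dump(self.state, f)
```

Persisting on every update makes recovery fast but writes slow; persisting less often (e.g., periodic snapshots plus a log) is the usual trade-off hinted at by the question above.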
Timing behaviour
We assume one of the following for network and nodes:
- Synchronous
- Message latency is no greater than a known upper bound.
- Nodes execute algorithm at a known speed.
- Really, it assumes a shared clock.
- Result: algorithms are easier to design, but this assumption doesn’t reflect reality.
- Asynchronous
- Messages can be delayed arbitrarily.
- Causes may include message loss (which requires a retry), congestion causing queueing, or a network reconfiguration.
- Nodes can pause execution arbitrarily.
- Slowdowns can occur due to IO accesses, OS scheduling, garbage collection, page faults (or swapping or thrashing).
- No timing guarantees at all.
- Result: algorithms are more robust, but hard to design. Sometimes impossible!
- The reason is that when a node stops responding, it’s impossible to determine whether it has actually failed. Timeouts aren’t robust because message delays can be arbitrary, so ping/heartbeat messages aren’t guaranteed to detect failures correctly.
- i.e., we cannot reliably tell the difference between crashed/delayed nodes and lost/delayed messages.
- Partial synchrony is a middle-ground that is a common assumption.
- The system is asynchronous for finite (but unknown) periods of time and synchronous otherwise.
- Algorithms designed with this in mind make a good trade-off between being practical to design but still being realistic.
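The timeout problem above can be made concrete with a heartbeat-based failure detector sketch. The class name `TimeoutFailureDetector` is an assumption for the example; the point is that under asynchrony it can only *suspect* a node, never know it has crashed, and under partial synchrony wrong suspicions are eventually corrected.

```python
import time

class TimeoutFailureDetector:
    """A node is *suspected* (not known) to have crashed if no
    heartbeat has arrived within `timeout` seconds. A slow node and a
    dead node look identical, so suspicions can be wrong; when
    heartbeats get through again, the suspicion is lifted."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heartbeat = {}

    def heartbeat(self, node, now=None):
        self.last_heartbeat[node] = time.time() if now is None else now

    def suspected(self, node, now=None):
        now = time.time() if now is None else now
        last = self.last_heartbeat.get(node)
        return last is None or now - last > self.timeout
```

Passing `now` explicitly keeps the sketch deterministic; a real detector would also adapt the timeout to observed latencies rather than fixing it up front.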
Resources
- A Distributed Systems Reading List, by Dan Creswell
- Textbooks
- Designing Data-Intensive Applications, by Martin Kleppmann
- Building Resilient Distributed Systems, by Sam Newman
- Thinking in Distributed Systems, by Dominik Tornow
- Distributed Systems, by Tim Harris
- Distributed Systems, by Maarten van Steen and Andrew Tanenbaum
- Distributed Systems for Fun and Profit
- Courses
- https://brooker.co.za/blog/2024/06/04/scale.html
- https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
Unsorted notes
lec 1
- kv distributed cache
- concurrency — what happens with multiple updates at the same time
- what order do we: invalidate cache, update database
- option 1: replicated cache
- if many accesses for single value (like b), it’s stored on multiple caches and we can distribute load
- partitioned (sharded) caches
- store by range of keys — range partitioning — tricks to get locality: name keys to lie next to each other
- hash partitioning
- load increase — elastic scaling
- option 1: drop cache data, restart caching (b/c new hash function) — too slow, too much load on database
- option 2: repartition cache data — done in practice. performed in background
- moving to the cloud facilitates this - avoids needing to buy more machines
- load decrease — opposite
- generally: try not to add load on database, so tradeoff betw multiple caches and accessing database favours multiple caches
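The hash-partitioning notes above can be sketched with a few lines. This is a minimal illustration: the function name `partition` is an assumption, and a stable hash (rather than Python's per-process salted `hash()`) is used so the mapping is consistent across processes.

```python
import hashlib

def partition(key, num_caches):
    """Hash partitioning: map a key to one of `num_caches` cache nodes."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_caches
```

The downside of plain modulo is exactly the elastic-scaling problem in the notes: changing `num_caches` remaps almost every key, which is why repartitioning is done gradually in the background (or avoided with consistent hashing, which moves only a small fraction of keys).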
lec 4
- generating unique IDs: random number ⇒ still chance of collision
- if client crashes and restarts, sequence number resets ⇒ server will resend old responses
- timestamps ⇒ clock issues?
- client informing server to delete xid entry → what if client doesn’t send?
- non-responsive clients
- usual async link problem — we don’t know if client is slow or dead
- use timeout, but upon timeout the connection is closed
- some guarantees
- track num of server restarts
- error: bc restart, we may (or may not have) executed it before. we don’t know. takes a conservative approach
- at most once
- if same epoch, we already do dup detection
- if diff epoch, we may have already sent. or not. don’t execute
- issues
- suppose client gets an error, now it knows new epoch number.
- client doesn’t know if it executes or not
- client has to do a read query to figure out state of server before it issues the request
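The at-most-once behaviour in the notes above (duplicate detection by xid within an epoch, conservative rejection across epochs) can be sketched as follows. The class name `AtMostOnceServer` and the shapes of the return values are assumptions for the example.

```python
class AtMostOnceServer:
    """Sketch: cache responses by (client, xid) so retransmissions get
    the old response instead of re-executing. `epoch` counts server
    restarts; a request from an older epoch is rejected because the
    duplicate table was lost in the crash, so we can no longer tell
    whether that request already executed."""
    def __init__(self, epoch):
        self.epoch = epoch
        self.completed = {}  # (client, xid) -> cached response

    def handle(self, client, xid, epoch, op):
        if epoch != self.epoch:
            # Conservative: don't execute; the client learns the new
            # epoch and must read the server's state to find out what
            # happened to its old request.
            return ("error", self.epoch)
        key = (client, xid)
        if key in self.completed:
            return self.completed[key]  # duplicate: resend, don't re-execute
        response = ("ok", op())
        self.completed[key] = response
        return response
```

This matches the notes: within an epoch, duplicates are detected and the op runs at most once; across epochs, the server refuses to execute and the client must issue a read query first.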
clocks
- leased out with timeout
- renews, releases
- logical clock: no relation with physical time
- total ordering:
- d doesn’t impact a, b, c — so even if d’s physical clock is quite late, d still has an earlier logical ordering
- monotonically increasing
- causality assumes all msgs are visible to system
- why question — call by phone is outside system
- breaks logical ordering
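The logical-clock notes above can be sketched with a Lamport clock. This is a standard construction, shown here as a minimal illustration; the class name `LamportClock` is an assumption.

```python
class LamportClock:
    """Logical clock: no relation to physical time, but if event a
    happens-before event b, then L(a) < L(b). It is monotonically
    increasing at each node."""
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time  # timestamp attached to the outgoing message

    def receive(self, msg_time):
        # Jump ahead of the sender's timestamp, then tick.
        self.time = max(self.time, msg_time) + 1
        return self.time
```

The `max(...) + 1` on receive is what preserves causality across nodes; the phone-call example in the notes breaks this precisely because that communication bypasses the system, so no `receive` ever runs for it.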
broadcast communication
- causal broadcast
- vector comparison
- `deps <= D` — for the strictly-less-than case, consider non-FIFO ordering
- for FIFO ordering, trivially `deps == D` on comparison
- vector comparison
- total order
- single leader — premise is: total order built on top of FIFO broadcast
- multiple leaders — can’t have total order
- can pull off causal broadcast by adding vector deps to leader + from leader→nodes
- logical clocks
- wait for one message from every node
- assume messages received in FIFO order
- crashed nodes — protocol fails bc we keep waiting
- ack should send to itself
- ack will be in the buffer - won’t be delivered - just helps progress the algorithm
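The causal-broadcast notes above (vector dependencies, buffering until deliverable) can be sketched as follows. The names `deliverable` and `CausalReceiver` and the exact message format are assumptions for the example.

```python
def deliverable(deps, delivered):
    """A message with dependency vector `deps` is deliverable once we
    have delivered at least deps[i] messages from each node i."""
    return all(deps[i] <= delivered[i] for i in range(len(deps)))

class CausalReceiver:
    """Buffers incoming messages until their vector dependencies are met."""
    def __init__(self, num_nodes):
        self.delivered = [0] * num_nodes  # messages delivered per sender
        self.buffer = []
        self.log = []  # delivery order observed by the application

    def receive(self, sender, deps, payload):
        self.buffer.append((sender, deps, payload))
        self._flush()

    def _flush(self):
        # Keep sweeping the buffer: delivering one message may
        # unblock others that depended on it.
        progress = True
        while progress:
            progress = False
            for msg in list(self.buffer):
                sender, deps, payload = msg
                if deliverable(deps, self.delivered):
                    self.buffer.remove(msg)
                    self.delivered[sender] += 1
                    self.log.append(payload)
                    progress = True
```

This shows why a crashed node stalls the protocol, as the notes say: a message whose dependencies name the crashed node's undelivered messages sits in the buffer forever.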
Footnotes
1. From Introduction to Distributed System Design at Google Code University. ↩
2. “The problem was that, as Google grew, its computing infrastructure also expanded. Computer hardware rarely failed, until you had enough of it—then it failed all the time. Wires wore down, hard drives fell apart, motherboards overheated. Many machines never worked in the first place; some would unaccountably grow slower.” - from this article talking about Google in the early 2000s ↩
3. The fallacies of distributed computing article at Wikipedia. ↩