Replication is a major technique for backing up data in software systems (including distributed systems, databases, and filesystems). Replicated storage services keep multiple copies of the data on different nodes/data centres/countries/continents. We define a node that holds a copy of the data as a replica.

Doing the same thing in multiple places.

The core goal of replication is to provide scalability and availability.

  • For scalability:
    • It improves throughput (since clients can access different replicas, which spreads load).
    • It lowers latency (since clients can access a geographically closer replica).
    • Caching is an example of replication. Web caches can serve content closer/locally for faster access. Geo-replicated caches can reduce latency by bringing the content to clients.
  • For availability:
    • If a replica fails, clients can access another replica, i.e., it enables fault tolerance.
    • This allows the application to avoid downtime in case of a server failure.
    • Replication also protects against loss of data when storage fails.

A good replicated system keeps clients unaware of replication, i.e., clients should be able to treat the system as a single machine that provides scalability (high throughput, low latency) and availability (it appears to never fail). In this sense, replicated systems must ensure linearisability. This is an inherently challenging task because writes (the ordering of concurrent requests) and failures (has the data actually been saved?) complicate our ability to achieve linearisability.

Action ordering

We use timestamps to order concurrent writes. There are two main methods:

  • A total order (logical) timestamp. Clients attach a timestamp to their request, and whichever write carries the earlier timestamp loses (last-writer-wins). Data can be lost for concurrent operations, since only one of the conflicting writes is kept. The upside is that it’s simple!
  • A partial order (vector) timestamp. For causally ordered events, we still replace the older value. But for concurrent events, we preserve the results of both actions. This is more complicated and can result in large vector timestamps. Both approaches are sketched in code after this list.
    • This is like a Git merge. Both copies are kept.
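As a rough sketch of both approaches (in Python; the timestamp representation and function names are illustrative assumptions, not from any particular system):

```python
# Total-order (last-writer-wins) merge: the write with the earlier
# timestamp simply loses, so one of two concurrent writes is discarded.
def lww_merge(old_value, old_ts: int, new_value, new_ts: int):
    return new_value if new_ts > old_ts else old_value


# Partial-order comparison: vector timestamps are assumed to be
# {replica_id: counter} dicts.
def compare(a: dict, b: dict) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent'."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # a happened before b
    if b_le_a:
        return "after"        # a happened after b
    return "concurrent"       # neither dominates the other


def vector_merge(old_value, old_ts: dict, new_value, new_ts: dict) -> list:
    """Replace for ordered writes; keep both results for concurrent ones
    (like a Git merge that keeps both copies)."""
    order = compare(old_ts, new_ts)
    if order in ("before", "equal"):
        return [new_value]             # the new write supersedes the old
    if order == "after":
        return [old_value]             # stale write, ignore it
    return [old_value, new_value]      # concurrent: preserve both
```

Note how the total-order version silently drops one of two concurrent values, while the partial-order version keeps both at the cost of a larger timestamp.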

A visible flag is used to track whether actions have been fully applied across all replicas. Given a put, the replica will create a record and set visible = true. Given a del, the replica will instead keep the record but set visible = false. This allows replicas to determine when client actions haven’t been fully executed across all replicas (i.e., have only succeeded on a subset of replicas).
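A minimal sketch of how a replica might store this flag, assuming last-writer-wins (logical) timestamps and an in-memory dict of records; the Record layout and function names are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Record:
    value: Any
    timestamp: int    # logical timestamp of the last put/del for this key
    visible: bool     # False = deleted ("tombstone"), kept for reconciliation

def apply_put(store: dict, key, value, ts: int) -> None:
    # put: create/overwrite the record and mark it visible.
    rec = store.get(key)
    if rec is None or ts > rec.timestamp:
        store[key] = Record(value, ts, visible=True)

def apply_del(store: dict, key, ts: int) -> None:
    # del: keep a record but mark it invisible, so other replicas can
    # still learn about the deletion later.
    rec = store.get(key)
    if rec is None or ts > rec.timestamp:
        store[key] = Record(None, ts, visible=False)
```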

A reconciliation process (also called gossiping or anti-entropy, because it decreases the disorder of the system) is important for detecting differences between replicas and reconciling them (including when replicas are added, or when they crash and recover). This ensures that replicas eventually all hold the same data. The process is usually done pairwise between replicas and runs in the background, largely regardless of which replication scheme is being used.
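Continuing the sketch above, one pairwise anti-entropy pass could look like this (again assuming last-writer-wins timestamps on each Record; the function name is illustrative):

```python
def anti_entropy(store_a: dict, store_b: dict) -> None:
    """One pairwise reconciliation pass between two replicas' stores.
    Afterwards both stores hold the newest record for every key,
    including tombstones (visible = False), so deletions propagate too."""
    for key in set(store_a) | set(store_b):
        rec_a, rec_b = store_a.get(key), store_b.get(key)
        if rec_a is None or (rec_b is not None and rec_b.timestamp > rec_a.timestamp):
            store_a[key] = rec_b          # A learns B's newer record
        elif rec_b is None or rec_a.timestamp > rec_b.timestamp:
            store_b[key] = rec_a          # B learns A's newer record
```

In practice this would run periodically in the background between pairs of replicas, which is why it is described as gossiping.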

Techniques

There are multiple types of replication schemes: