Scalability is a key principle of software systems, especially distributed ones. The core idea: computation and data storage costs grow as the number of users grows, so the goal is to gain higher performance and capacity by adding more nodes. There are two types of scaling to distinguish when we double the number of nodes (quantified in the sketch after the list below):

  • Weak scaling — the job size increases proportionally, and the job takes the same time to complete.
    • For example, web traffic: we want to maintain consistent latency for each user even as the number of users increases, so we add proportionally more resources to keep latency the same.
  • Strong scaling — the same job takes half the time to complete.
    • This is mainly relevant in the context of HPC, where system designers throw more machines at a fixed problem to get more performance.
    • More difficult to achieve than weak scaling!
    • Typical strong scaling problems involve a long-running, complex calculation, where we want additional machines to reduce the time the calculation takes.
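
A minimal sketch of how the two regimes are usually quantified, assuming a simple timing model; the function names and timings below are illustrative, not measured:

```python
def strong_scaling_speedup(t1: float, tn: float) -> float:
    """Fixed problem size: speedup = T(1) / T(n); the ideal value is n."""
    return t1 / tn

def weak_scaling_efficiency(t1: float, tn: float) -> float:
    """Problem size grows with n: efficiency = T(1) / T(n); the ideal value is 1.0."""
    return t1 / tn

# Illustrative (made-up) timings for 1 node vs 8 nodes.
print(strong_scaling_speedup(t1=100.0, tn=16.0))    # 6.25x, vs the ideal 8x
print(weak_scaling_efficiency(t1=100.0, tn=110.0))  # ~0.91, vs the ideal 1.0
```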

The challenges of scalability:

  • How should data be partitioned across nodes?
    • Data accesses may be skewed towards a particular partition, e.g., a hot key (see the sketch after this list).
  • How should a job be split and run in parallel across nodes?
    • When a job is parallelised, the slowest node determines overall performance, i.e., the bottleneck task limits the scale of the system.
    • Recall Amdahl’s law (stated after this list).
  • How can we reduce network communication?
    • Networks are slow relative to memory: a network round trip costs orders of magnitude more than a local memory access.
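
A minimal sketch of hash partitioning and why skew defeats it; the key names and request counts are illustrative assumptions:

```python
import hashlib
from collections import Counter

def partition(key: str, n_partitions: int) -> int:
    """Map a key onto one of n partitions by hashing it."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % n_partitions

# A skewed workload: one hot key dominates, so even a uniform hash
# cannot balance load -- whichever partition owns the hot key
# becomes the bottleneck.
requests = ["hot_key"] * 900 + [f"key_{i}" for i in range(100)]
load = Counter(partition(k, 4) for k in requests)
print(load)  # one partition receives ~900+ requests, the others ~25 each
```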
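
Amdahl’s law bounds the strong-scaling speedup of a job whose parallelisable fraction is $p$, run on $n$ nodes:

$$ S(n) = \frac{1}{(1 - p) + \frac{p}{n}} $$

Even as $n \to \infty$, the speedup is capped at $1/(1-p)$: with $p = 0.95$, no number of nodes yields more than a 20× speedup, which is why the serial (bottleneck) portion limits the scale of the system.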

Metrics we use to measure scalability:

  • Throughput — jobs per second, higher is better.
  • Latency — time to complete a job, lower is better.
    • The tail latency (e.g., the worst case, or the 99th percentile) is also an important metric. It is generally difficult to optimise for, because latencies add up across every service a request touches.
    • Modern systems use microservices — many small, individually scalable services, such that a single request might go through hundreds of them. If even one is slow, the whole request is slow, giving bad tail latency (see the calculation after this list).
  • Cost — resource usage, lower is better.
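
A back-of-the-envelope calculation, in the spirit of “The Tail at Scale”, of why fan-out amplifies the tail; the 1% slow-call rate is an illustrative assumption:

```python
# If each of k services is slow on 1% of calls, the fraction of requests
# that hit at least one slow service is 1 - 0.99**k.
for k in (1, 10, 100):
    print(f"{k:>3} services -> {1 - 0.99 ** k:.0%} of requests see a slow call")
# 1 -> 1%, 10 -> ~10%, 100 -> ~63%
```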

Methods to improve scalability:

  • All are essentially partitioning problems aimed at better locality:
    • Place data that is accessed together physically close together (on the same node or rack).
    • Place a job’s tasks on the same nodes that store their input data (sketched after this list).
    • Co-locate tasks that communicate heavily.
  • Conversely, spread data and tasks across nodes for better load balance; note that this pulls against locality.
  • Most methods are workload-dependent.
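
A minimal sketch of locality-aware task placement with a load-balance fallback; the node names, block names, and data layout are illustrative assumptions:

```python
from collections import defaultdict

nodes = ["node1", "node2"]
data_location = {"block_a": "node1", "block_b": "node2", "block_c": "node1"}
tasks = [("t1", "block_a"), ("t2", "block_b"), ("t3", "block_c"), ("t4", "block_x")]

load = defaultdict(int)
for task, block in tasks:
    # Prefer the node that already stores the task's input (locality);
    # otherwise fall back to the least-loaded node (load balance).
    node = data_location.get(block) or min(nodes, key=lambda n: load[n])
    load[node] += 1
    print(f"{task} -> {node}")
# t1 -> node1, t2 -> node2, t3 -> node1, t4 -> node2
```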