Memcached is an in-memory key-value cache used in highly scalable distributed systems; it sees significant use at major platforms such as Facebook.
Deployment
The design requirements for a practical Memcached deployment are mainly about scale: the system must be able to serve millions of user requests per second. This means supporting:
- Heavy read load (>1 billion reads/sec)
- Tight latency requirements (near-real-time communication)
- Content hot spots (popular shared content)
- Poor locality for storage accesses (requests may come from many places around the world)
- Petabytes of storage
- Geographically distributed users served from multiple data centres
Multiple geographically distributed clusters have one master and many replicas. All client writes are sent to the master database and are eventually propagated to the replicas and to the caches. The system is designed to maintain a very high cache hit rate, because misses are so expensive relative to cache access times.
The short story is that maintaining linearisability incurs an extreme performance cost: because you must never read stale data, every operation needs agreement among nodes. By allowing an eventually consistent model instead, you remove this bottleneck and vastly improve performance. The core idea of Facebook's deployment is simply to make sure that stale data in the caches doesn't stay stale forever, and to optimise for a read-heavy workload.
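The caches are demand-filled in the classic look-aside pattern: the webserver asks Memcached first, falls back to the database on a miss, and invalidates the cached key on a write. Here is a minimal Python sketch of that pattern, using toy in-process stand-ins rather than the real Memcached or database clients:

```python
# Look-aside (demand-filled) caching in front of a database.
# `Cache` and `DB` are toy stand-ins for Memcached and the storage tier.

class Cache:
    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value

    def delete(self, key):
        self.data.pop(key, None)

class DB:
    def __init__(self):
        self.rows = {"user:42": "Alice"}

    def query(self, key):
        return self.rows.get(key)    # expensive in the real system

    def update(self, key, value):
        self.rows[key] = value

cache, db = Cache(), DB()

def read(key):
    value = cache.get(key)
    if value is None:                # cache miss: fall back to the DB...
        value = db.query(key)
        cache.set(key, value)        # ...and demand-fill the cache
    return value

def write(key, value):
    db.update(key, value)            # the DB is the source of truth
    cache.delete(key)                # invalidate rather than update the cache

print(read("user:42"))   # miss: fetched from the DB, then cached
print(read("user:42"))   # hit: served from the cache
write("user:42", "Bob")  # DB updated, cached copy invalidated
print(read("user:42"))   # miss again: fetches the new value
```

Invalidating (rather than updating) the cached key on a write keeps writers simple, but the demand-fill on the read path is what opens the door to the races described next.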
Design problems
Memcached deployment requires addressing a few problems:
- The stale set consistency problem happens when Memcached and the database become inconsistent.
- This can happen with concurrent operations and is essentially a race condition. Suppose a key has just been invalidated. Client A misses, reads the old value from the database, but stalls before setting the cache. Meanwhile client B writes a new value to the database and invalidates the cache. Client A's delayed set then lands, leaving the stale value in the cache while the database holds the new one (see the sketch after this list).
- The thundering herd problem happens when a key is 1) read heavily, then 2) updated, so that 3) all subsequent reads miss and trigger very expensive database accesses (lots of load!) until 4) the key is cached again.
- Incast congestion happens in clusters with many Memcached servers. With sharding, webservers may have to fetch 500+ keys from hundreds of servers in parallel, which can overwhelm the network switch and cause responses to be dropped.
- From a latency POV, packet drops are quite expensive.
- With many servers in a single cluster, some caches can become hotspots; one slow Memcached server can slow down requests across the entire cluster.
- With multiple clusters, we have yet another cache consistency problem when data is updated, since the same data might be cached across different clusters.
- With multiple geo-distributed clusters, all writes are sent to the master database (the other clusters are replicas). This means writes must eventually propagate to the replicas and their caches; until then, a webserver might read stale data, i.e., read-your-writes consistency is violated.
- Writing only to the master DB ensures some total ordering of writes that is maintained by the DB.
- The only real guarantee Memcached provides in this system is read-your-writes consistency; you can still read stale data if you weren't the writer.
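To make the stale set race concrete, here is a toy interleaving in Python (plain dicts stand in for the cache and the database; purely illustrative):

```python
# Stale set race: a slow reader clobbers a newer write.
cache, db = {}, {"k": "old"}

# Client A misses on "k" and reads the database...
a_value = db["k"]                  # A now holds "old"

# ...but before A can fill the cache, client B completes a write:
db["k"] = "new"                    # B updates the database
cache.pop("k", None)               # B invalidates the cached key

# A's delayed set finally lands, installing the value it read earlier:
cache["k"] = a_value               # cache says "old", the DB says "new"

assert cache["k"] != db["k"]       # inconsistent, and nothing repairs it
                                   # until "k" is written or evicted again
```

Without an extra mechanism, nothing stops A's delayed set, which is exactly what the leases described below are for.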
The solutions:
- Leases fix the stale set consistency problem. On a read miss, the cache returns a `lease_id` to the webserver. On an update, Memcached invalidates the `lease_id`. On a set, the cache checks the `lease_id` provided by the webserver; if the lease ID is out of date, the set is rejected (see the sketch after this list).
- The thundering herd problem is fixed by limiting the rate at which leases are returned on a read miss. Once a new lease ID has been sent to a webserver, for the next 10 seconds all `get`s tell the webserver to retry in a few milliseconds.
- This basically allows a more generous timing window to resolve the problem: the webserver holding the lease reads the DB and sets the value in the cache before the retries come back.
- This is entirely for performance, since we want to minimise expensive DB accesses.
- Incast congestion is fixed by limiting the number of outstanding requests using a sliding window mechanism (much like what TCP uses for congestion control). Larger windows result in more congestion; smaller windows result in more network round trips.
- i.e., the webserver keeps growing its window of outstanding requests until the first loss event, then backs off, much as TCP does.
- Cache hotspots are resolved by using multiple clusters. This splits the servers across clusters and allows hot keys to be cached in every cluster, spreading their read load.
- However, fewer unique keys can be cached across all clusters: with one big cluster, each key is sharded to exactly one server, so the unique-key capacity is the sum of all servers; with multiple clusters, the same keys are duplicated in each, so the total number of unique keys that fit drops.
- This wastes space. Suppose a key isn't very hot: if a webserver in each of two clusters requests it, the not-hot key ends up taking space in both clusters' caches when one copy would do.
- To fix the multiple-cluster consistency problem, we use cache invalidation techniques (much like how CPUs ensure cache coherency). Whenever the storage cluster receives a write, it sends an invalidation message to each frontend cluster.
- Invalidation messages are logged by the storage cluster. If a frontend cluster fails, invalidation daemons (processes) resend invalidations from the log to resynchronise the caches.
- Note that updates don't wait for the invalidations to complete. In theory, other `get`s can return stale cached values. This is okay for social media: most people won't care about an exact "like" count.
- This results in an eventually consistent model. Leases and reliable invalidations ensure that the caches don't serve stale data forever.
- We ensure read-your-writes consistency with remote markers. Before a webserver in a replica region writes, it sets a remote marker for the key in the regional cache, then writes to the master DB. While the marker is present, a cache miss on that key is directed to the master region rather than the local replica. The master's change eventually propagates back to the replica region, at which point the remote marker is deleted.
- The marker is only deleted once the write has replicated; i.e., you replicate first, then invalidate. This ordering ensures cache coherency is preserved.
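Below is a minimal sketch of how leases could tie the stale-set and thundering-herd fixes together. It is an illustrative toy under the assumptions in the notes above (the 10-second rate limit, retry hints), not Memcached's actual implementation; the class and method names are made up:

```python
import itertools
import time

class LeaseCache:
    """Toy lease-granting cache (illustrative only)."""

    RATE_LIMIT = 10.0  # seconds between lease grants per key

    def __init__(self):
        self.data = {}         # key -> cached value
        self.leases = {}       # key -> currently valid lease ID
        self.last_grant = {}   # key -> time the last lease was issued
        self._ids = itertools.count(1)

    def get(self, key):
        """Return ('hit', value), ('miss', lease_id), or ('retry', None)."""
        if key in self.data:
            return ("hit", self.data[key])
        now = time.monotonic()
        if now - self.last_grant.get(key, float("-inf")) < self.RATE_LIMIT:
            # A lease went out recently, so another webserver is already
            # refilling this key: tell the caller to retry in a few
            # milliseconds (thundering herd fix).
            return ("retry", None)
        lease = next(self._ids)
        self.leases[key] = lease
        self.last_grant[key] = now
        return ("miss", lease)

    def invalidate(self, key):
        # On an update: drop the value AND void any outstanding lease, so a
        # delayed set from a stale reader will be rejected.
        self.data.pop(key, None)
        self.leases.pop(key, None)

    def set(self, key, value, lease):
        # Stale set fix: the set succeeds only if the lease is still valid.
        if self.leases.get(key) != lease:
            return False
        del self.leases[key]
        self.data[key] = value
        return True

cache = LeaseCache()
status, lease = cache.get("k")                 # miss: caller receives a lease
cache.invalidate("k")                          # a concurrent write lands first
assert cache.set("k", "old", lease) is False   # the stale set is rejected
status, _ = cache.get("k")
assert status == "retry"                       # herd is held off for 10 seconds
```

The per-key rate limit turns a miss storm into a single DB read plus cheap retries, while the lease check keeps a slow reader from installing a stale value.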
Resources
- Resources from Facebook
- Facebook and memcached, tech talk by Mark Zuckerberg
- Scaling Memcache at Facebook (lecture, paper)