Earlier in this series we discussed what Distributed Systems are and why we use them, and we controversially defined a DistSys as any system divided over more than one physical location and using decoupling and copying to improve performance. Then we realised that’s most systems!
However, when we explicitly talk about building DistSys we usually mean creating a system at big enough scale that we need to start worrying about the problems with the architecture. In this post we’re going to take a very high-level look at some of the issues involved in maintaining copies.
Resilience, Performance & Availability
In our first post we talked about replicas, caches and backups. What do we use these for?
- Both replicas and caching are used for operational performance - to get information into a location where it can be accessed quickly.
- We also mentioned resilience. Replicas and backups help ensure we don’t lose data.
- Finally, we use replicas for availability. Availability is subtly different from resilience or performance but it has similarities and it is easily confused with either. Availability means we can always provide a predictable service to our clients rapidly even if there are transient access issues in our system like network problems or contention. For availability, we may keep replicas of our data in multiple locations so if one copy is temporarily inaccessible there is always another to use.
Let’s talk more about performance, availability, and resilience.
Performance
Realistically, almost all modern systems and their clients are physically distributed, and the components are connected together by some form of network.
I would have gotten away with it if it weren’t for you pesky laws of physics
Networks are great but in computer terms they are relatively slow and unreliable. Despite the strenuous efforts of network engineers, getting data packets between endpoints by bouncing them around the internet or even down a straight piece of wire takes time. In a distributed system we therefore have to deal with chronic delays (latency) in communicating data to remote clients or downstream services.
One of the DistSys techniques we use to improve speed is replication. We often hold local replicas of our data, which can be read or written, near to clients so the data has less far to travel to be used.
If we only need read access to some of our data we might keep it cached locally (i.e. in a fast-access, read-only copy). Services for location-based caching include Content Delivery Networks (CDNs) like CloudFlare, AWS CloudFront or Akamai.
Copies for performance can be kept in several different ways with the same overall principle: keeping data close to its consumer makes it faster to access.
Availability
Availability is related to both performance and resilience. In any large system there will be some parts that are currently slow or impossible to access due to hardware or network errors. Hopefully this will eventually be resolved as more capacity is added to handle traffic, or network problems are resolved. In the meantime, however, you still want the overall system’s response times to stay within its SLAs.
What’s really going to impact your system response times is trying to transport large quantities of data through those slow areas. Things would be much faster if full versions of the data supporting both read and write were available in locations that clients could still access rapidly. These copies could be synced up with the rest of the system at some future point when the network had recovered. That is the principle of using active data replicas for availability.
Resilience
Enemy of the State?
Right now in tech we talk a lot about stateful and stateless services. What is state? Very simply, “state” is anything you need to save when your application or service closes down. It’s data that cannot be re-generated and therefore must persist. The trouble with persistent data is at low scale it has to be saved to a medium that’s slow to write to (disk) and saved multiple times to avoid single points of failure. As a result, low scale stateful services are slow to shut down.
Fortunately, at high scale stateful services can rely on replication in memory to provide persistence. As long as there are are enough copies in memory in multiple locations data can be considered statistically persistent and we don’t have to do slow disk writes to keep it safe. Any individual instance can be shut down (or fall over) and rely on data replicas elsewhere to ensure data isn’t lost. Great! What could go wrong? Actually quite a lot, but we’ll discuss that later as part of the problems with replication.
Stateless services on the other hand can always be shut down immediately with a guarantee of no loss of data, no matter the scale. Services might maintain complex transient data but if that data doesn’t need to be saved on closedown they are still stateless.
Statelessness
For a stateless service we don’t need to worry about data replication to permanent or statistically persistent storage for resilience because stateless services don’t persist data. Maintaining data persistently is something we only care about for stateful services and it is really hard. That’s why everyone is so keen on stateless services at the moment. They are easier to write and operate, they work much better with dynamic management tools like orchestrators (e.g. Kubernetes), they scale well, and they are relatively straightforward to make resilient. The more we can use stateless services the easier our lives become.
Stateful Services and Resilience
So, now we’ve got the easy stuff out the way (stateless services) we can start worrying about any remaining services that have data we can’t afford to lose - our difficult stateful services.
Historically, there have been several approaches to persisting data securely. For now, let’s call them disk backups, passive replicas & active replicas. I know this is a very high level overview but bear with me because it demonstrates some of the issues with providing resilience in a simple case.
Very Cold
A backup is a regular copy of your data that you make to reliable permanent storage (usually files on disk). This is the classic, all-else-fails recovery plan. It’s simple and everyone understands it (even the CEO) and it’s cheap because long term storage like disk is inexpensive. You’ll probably always have a file backup of some form no matter how big you are. However, the downsides of these backups are significant.
The main negative is it’s slow to update a file backup. Writing to disk isn’t quick, neither is getting your data to that remote storage. Your backup will therefore always be out of date. If you ever need to restore from it that leads to four problems.
- You will always lose some data because the backup won’t perfectly reflect the latest changes to the system, which have not yet been copied to the backup.
- It will take some time to restore your system from the disk backup, during which time your service will be unavailable.
- The restore is an occasional process that may have bugs or encounter problems of its own.
- After the restore, because you have a discontinuity in your service’s data (that unavoidable data loss), it may be difficult to reconcile your service’s data with that of other services (basically, they’ll be out of whack).
This is stuff you probably know, so why are we telling you? Because the same problems occur with more active replicas. They are just less obvious because the issues occur on a smaller, more inhuman timescale. Your system is inhuman though so it notices. But let’s not get ahead of ourselves. Let’s instead discuss mitigation strategies for backup issues.
Getting Warmer - Passive Replicas
One popular way to reduce the size of the backup problem is to build what is sometimes called a passive replica. The principle of a passive replica is you keep a running copy of your data in a database, not readable or writable (i.e. it’s passive) but in a synchronized state, ready to take over if your active primary fails or becomes unavailable. This is better than a backup:
- A passive replica is usually more closely synchronized with live data so you’ll lose fewer updates switching to a passive replica than a backup. However, you’ll usually still lose data. It is physically impossible to keep two data copies that are remote from one another 100% in sync if there are lots of updates taking place, because data takes time to travel from one machine to another. If the updates stop for a while the passive replica might catch up with the original but we can’t rely on outages conveniently only happening in such a quiet period.
- A switchover to a passive replica is faster than a restore from disk because the data is already where it needs to be and ready to go.
- Less data loss means fewer hard-to-reconcile data inconsistencies with other services.
However, by moving from a backup to a passive replica all we have done is spend a lot of money to make some data-loss-inducing race conditions smaller (admittedly, much smaller). We haven’t completely removed them. Our replica switchover process is also still essentially one-off therefore prone to error.
As our service becomes larger scale with more traffic and updates these race conditions become increasingly troublesome. More data gets lost in the event of failure and things become difficult to reconcile once more. Passive replicas are better than disk backups but they are still not good enough at scale.
Hot - Active/Active Replicas
Active data replicas are sometimes just called “replicas”. An active replica is a constantly-synchronised, would-be-identical data copy that acts as both a “backup” for resilience and another live copy of the data that can be read and updated. They therefore help with performance, reliability, and availability.
At scale, disk backups and passive replicas are too lossy, too slow, and too error-prone. Essentially, they are designed to only be used in a failure scenario. Failover to either is an occasional event. Because of that, the failover process itself often contains bugs and errors. Active replicas, on the other hand, make the “failure” case (i.e. use of the data replica) mainline. it’s no longer an error condition - it’s just business as usual.
Active-active replication can be provided by distributed databases like Cassandra or relational databases like Postgres, MySQL or SQL Server.
Sounds Great! Where’s the Catch?
The downside is managing active replicas is difficult and costly. All the problems of maintaining copies don’t go away. There will be windows where data loss can occur, and all the constant reconciliation can be very tough. Managing those problems, however, is now part of the mainline operation of our system so we can’t just stick our fingers in our ears, sing “lalala” and ignore them until a problem actually occurs. We are forced to invest in designing, testing and operating our systems to cope with these issues. Alternatively, we can get someone else to handle the problems for us (always my preferred option in life) by using a cloud service.
You Can’t Handle The Truth
In summary, the main difficulty with using copies for performance, resilience and availability is synchronisation both in failure and in non-failure cases. How do we keep all our copies effectively in sync (or acceptably consistent) when they are independently changing all the time and they’re physically separated or even in different forms on different media? To answer that question we need to consider physics, reconciliation and the nature of truth and that’s what we’ll be talking about in the next post.
Art by jdhancock.com