This short series of posts comes out of a project I did with Diptanu Choudary, lots of discussions with other expert DistSys practitioners here at CS and elsewhere and my own background as an early engineer in the field.
Everyone has an opinion. This is mine. You are very much allowed to have a different one, as long as you think about it.
What is a Distributed System Anyway?
The simplest application to write and operate is one that runs in one thread on a single processor. If that’s so easy, why on earth do we ever build anything else? Usually because we also want operational and developmental performance. A single server is a physical and organisational limitation on what you can achieve with an application.
Machine Performance (AKA Things)
There are three main operational performance issues introduced by running on a single machine.
- Scale. You might want more CPU, memory or storage than is available on one server no matter how big, or it might be more efficient (cost/server utilization) to use machines with different properties for different functions. For example: CPUs vs GPUs.
- Resilience. Any software or piece of physical hardware will crash (even mainframes die eventually). A single server is a single point of failure.
- Location. “Propinquity” means useful proximity. Unless the only user of your application will be sitting at a keyboard plugged into your server, eventually your single-machine application will need to talk to something else. Where will that client be? Compared to internal communications on a single computer, network connections are disappointingly slow and unreliable. If your server is distant from its “upstream” client then exchanging data introduces delay, is error-prone, is more hardware-failure prone, and costs you money. You might also want to talk to “downstream” services like databases. Those connections will be over unreliable networks too. You ideally want to physically locate your service and data close enough to both downstream services and upstream clients to minimize network-related issues.
Organisational Performance (AKA People)
There are also organisational limitations imposed by a single-server system.
- Team Dynamics. It can be inefficient to manage a single team working on one increasingly large and complex application. Devs get in each other’s way. It’s a short step from merge conflict to physical conflict and blood’s hard to get out of a nice conference t-shirt, so we’d rather avoid that. Large, complex apps are hard to maintain, hard to replace, and it’s easy to end up with a human SPOF like a key manager or architect. They tend to fail even quicker than servers.
The purpose of distributing a system, or dividing it over more than one physical location, is to fix the machine and organisational performance problems above using two powerful operational and architectural concepts: data copying and decoupling.
At this point you might be thinking, “Hang on, if having more than one location=distributed system then my monolith is distributed! It has load balancers, cache servers, front end servers and a backend database with replicas!”
You’d be right. A monolith is a simple distributed system. You are already encountering some of the problems of operating any system with distant clients (that’s why you have that cache server) and single points of failure (that’s why you have those DB replicas). I’ll talk more about how we might consider pretty much everything to be distributed, even a single threaded application, in the next post.
I mentioned that distributed systems leverage two deceptively powerful and wide-ranging operational concepts: decoupling and taking copies.
Decoupling means different parts of your system can operate independently without one part being affected by the internal operations of another. This is an extraordinarily useful idea for several reasons
- Decoupled services can operate independently of other services but also, crucially, of other copies of themselves. That enables scalability and resilience through running multiple instances of each service. It also means you can potentially locate copies of your service closer to clients.
- To be effectively decoupled, services must communicate through defined and maintained application programming interfaces (APIs) that are not tied to their internal operations. That implies the internals of a service can change without changing the external behaviour.
Note that decoupled services cannot share a database unless the same rule of clearly defined and maintained APIs applies. Otherwise the database effectively ties the services together.
We’ve always used decoupling for scalability, speed and resilience. Recently, we have realised the well-defined-API part of decoupling can be extremely useful at improving developer productivity by allowing smaller teams to independently work on tinier services that can be developed in parallel - provided the API is maintained. Every team can safely change the internal behaviour of their service without breaking every other service that talks to it, allowing teams to innovate at different paces at the same time.
Copying is another foundational concept in computing. In software design we use the concept of taking data copies to achieve speed and resilience in several different ways.
A data replica is an exact copy of a database. Replicas are constantly, iteratively synced with one another so their contents are kept as identical as possible. They generally come in two flavours, active and passive. Active replicas support read and write access and play an (unsurprisingly) active role in serving clients. Active replicas can help with scale, resilience and location-base performance. Passive replicas are generally maintained for failover purposes; they don’t support read or write and are only synced in one direction (from the master data). The job of a passive replica is to be ready to take over from the main replica if it fails.
Unlike a replica, a cache is a not-necessarily-identical, read-only, copy of your data, usually maintained as a way to serve read requests faster and more cheaply than querying your main database. Caches may take advantage of
- Location (keeping information physically close to the reading client).
- Medium (keeping information in faster-to-access physical media like memory rather than disk).
- Design (maintaining information in data structures optimised for fast read like key/value stores or graphs).
A backup is a copy of your data for emergency use in the case of catastrophic data loss. Usually a backup is written to a permanent storage medium like disk in multiple physical locations. Backups are slow to write to and restore from and usually they are not kept constantly in sync with the master data. Instead, they are periodically updated. They are useful but have severe limitations and can easily lull folk into a false sense of security. This false security becomes especially dangerous as the scale and complexity of a system increases.
Denormalisation for Decoupling
Data duplication goes in and out of fashion as an architectural technique. It is often frowned upon because it’s a common cause of bugs. On the other hand, it is very useful for performance. Basically, like everything in life it has pros and cons and we need to make a judgment on whether to use it based on the current circumstances.
A decade ago we spent ages “normalising” our databases, or removing data duplication. However in a distributed system we really need to have data available in more than one location. Every service cannot be calling back to the same database all the time or we lose all the benefits of distribution. Microservices usually maintain local equivalents of certain pieces of data that they can write and read without worrying about anyone else.
BUT the potential bugs have not mysteriously gone away. We have to somehow keep all those data copies in alignment (that doesn’t necessarily mean identical). By using data duplication we have to accept that we’ve just included an architectural technique that can be very bug prone in our design. C’est la vie. It’s a tradeoff. It just means that we have to use development techniques that help resolve those problems such as Domain Driven Design, or operational techniques like managed stateful services that handle some of these issues for us.
What Are We Talking About Again?
So, I’ve just controversially defined a distributed system as any system that has components in more than one physical location and I've said that DistSys often use the architectural concepts of data copying and decoupling to improve operational efficiency (speed, resilience) and, more recently, developer efficiency (team productivity).
In my next post on this subject I’m going to go all philosophical and consider whether, under my wide definition, a single threaded application running on a single cpu is also a distributed system? Does that mean my definition is wrong? (Hint, obviously I don’t think so, that’s a rhetorical question. I’m just faking humility there).