This short series of posts comes out of a project I did with Diptanu Choudary, many discussions with other expert DistSys practitioners here at CS and elsewhere, and my own background as an early engineer in the field.
Everyone has an opinion. This is mine. You are very much allowed to have a different one, as long as you think about it.
The simplest application to write and operate is one that runs in one thread on a single processor. If that’s so easy, why on earth do we ever build anything else? Usually because we also want operational and development performance, and a single server is a physical and organisational limit on what an application can achieve.
There are three main operational performance issues introduced by running on a single machine: scale (one box can only get so big), resilience (one box is a single point of failure) and location (one box lives in one place, so it is inevitably slow for distant clients).
There are also organisational limitations imposed by a single-server system: with everything in one codebase on one machine, teams cannot easily work in parallel without treading on each other’s toes.
The purpose of distributing a system, or dividing it over more than one physical location, is to fix the machine and organisational performance problems above using two powerful operational and architectural concepts: data copying and decoupling.
At this point you might be thinking, “Hang on, if having more than one location makes a distributed system, then my monolith is distributed! It has load balancers, cache servers, front-end servers and a backend database with replicas!”
You’d be right. A monolith is a simple distributed system. You are already encountering some of the problems of operating any system with distant clients (that’s why you have that cache server) and single points of failure (that’s why you have those DB replicas). I’ll talk more about how we might consider pretty much everything to be distributed, even a single-threaded application, in the next post.
I mentioned that distributed systems leverage two deceptively powerful and wide-ranging operational concepts: decoupling and taking copies.
Decoupling means different parts of your system can operate independently, without one part being affected by the internal operations of another. This is an extraordinarily useful idea for several reasons: decoupled components can be scaled independently, can fail independently, and, as long as they keep to a clearly defined API, can be changed independently.
Note that decoupled services cannot share a database unless the same rule of clearly defined and maintained APIs applies to it. Otherwise the database effectively ties the services together: its schema becomes a hidden, shared API that nobody is maintaining.
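To make that concrete, here is a minimal sketch, with entirely made-up service names, of two services coupled only through a small, deliberately public API while the data store stays private:

```python
# A minimal decoupling sketch. OrderService and ShippingService are
# hypothetical names, not from any real framework.

class OrderService:
    """The public API other teams are allowed to call. Keep it stable."""

    def __init__(self):
        # Private storage. If another service reached in here directly,
        # the two services would be coupled through the data layer.
        self._orders = {}

    def place_order(self, order_id: str, item: str) -> None:
        self._orders[order_id] = {"item": item, "status": "placed"}

    def order_status(self, order_id: str) -> str:
        return self._orders[order_id]["status"]


class ShippingService:
    """Depends only on OrderService's public API, never its storage."""

    def __init__(self, orders: OrderService):
        self._orders = orders

    def can_ship(self, order_id: str) -> bool:
        # Uses only the contract, so OrderService can swap its database,
        # schema or even its implementation language without breaking us.
        return self._orders.order_status(order_id) == "placed"


orders = OrderService()
orders.place_order("o-1", "teapot")
print(ShippingService(orders).can_ship("o-1"))  # True
```

In real life the contract would be an HTTP or RPC interface rather than a Python class, but the principle is identical: the moment ShippingService peeks into OrderService’s tables, the API stops protecting either team.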
We’ve always used decoupling for scalability, speed and resilience. More recently, we’ve realised that the well-defined-API part of decoupling is also extremely useful for improving developer productivity, because it allows small teams to work independently, and in parallel, on smaller services, provided the API is maintained. Every team can safely change the internal behaviour of its service without breaking the other services that talk to it, allowing teams to innovate at different paces at the same time.
Copying is another foundational concept in computing. In software design we use the concept of taking data copies to achieve speed and resilience in several different ways.
A data replica is an exact copy of a database. Replicas are constantly, iteratively synced with one another so their contents are kept as close to identical as possible. They generally come in two flavours: active and passive. Active replicas support read and write access and play an (unsurprisingly) active role in serving clients; they can help with scale, resilience and location-based performance. Passive replicas are generally maintained for failover purposes: they don’t support reads or writes and are only synced in one direction (from the master data). The job of a passive replica is to be ready to take over from an active replica if it fails.
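Here is a toy model of those two roles, wildly simplified and with hypothetical names: reads are spread round-robin across the active replicas, while a passive standby serves no traffic until it is promoted:

```python
# A toy model of active and passive replicas. Real replication is an
# asynchronous stream of changes, not the instant copy shown here.

class Replica:
    def __init__(self, name: str, active: bool = True):
        self.name = name
        self.active = active  # passive replicas serve no client traffic
        self.data = {}

class ReplicatedStore:
    def __init__(self, replicas):
        self.replicas = replicas
        self._next_reader = 0

    def write(self, key, value):
        # Every replica, active or passive, receives the write.
        for replica in self.replicas:
            replica.data[key] = value

    def read(self, key):
        # Spread reads across the active replicas only (scale + location).
        actives = [r for r in self.replicas if r.active]
        replica = actives[self._next_reader % len(actives)]
        self._next_reader += 1
        return replica.data[key]

    def fail_over(self, failed, standby):
        # Resilience: promote the passive standby when an active dies.
        failed.active = False
        standby.active = True

store = ReplicatedStore([Replica("a1"), Replica("a2"), Replica("standby", active=False)])
store.write("answer", 42)
print(store.read("answer"))  # 42, served by one of the active replicas
```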
Unlike a replica, a cache is a not-necessarily-identical, read-only copy of your data, usually maintained as a way to serve read requests faster and more cheaply than querying your main database. Caches may take advantage of the fact that reads generally far outnumber writes, and that a small, frequently requested subset of your data tends to serve the majority of read traffic.
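A minimal read-through cache sketch along those lines; `slow_lookup` is a stand-in for a query against your main database, and the TTL is an arbitrary choice:

```python
import time

# A read-through cache with expiry. Entries are allowed to go slightly
# stale: a cache is a copy, not a replica.

CACHE = {}            # key -> (stored_at, value)
TTL_SECONDS = 60.0    # arbitrary freshness window

def slow_lookup(key):
    time.sleep(0.1)   # pretend this is a round trip to the real database
    return f"value-for-{key}"

def cached_get(key):
    hit = CACHE.get(key)
    if hit is not None:
        stored_at, value = hit
        if time.time() - stored_at < TTL_SECONDS:
            return value              # fast path: serve the local copy
    value = slow_lookup(key)          # slow path: go to the source of truth
    CACHE[key] = (time.time(), value)
    return value

print(cached_get("user:42"))  # miss: pays the full database round trip
print(cached_get("user:42"))  # hit: served cheaply from the cache
```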
A backup is a copy of your data kept for emergency use in the case of catastrophic data loss. Usually a backup is written to a permanent storage medium, like disk, in multiple physical locations. Backups are slow to write to and restore from, and they are not usually kept constantly in sync with the master data; instead, they are periodically updated. They are useful, but they have severe limitations and can easily lull folk into a false sense of security. That false security becomes especially dangerous as the scale and complexity of a system increase.
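Purely for illustration, here is a toy periodic-snapshot backup; the directory, format and schedule are all made up. The gap it exposes is the point: anything written after the last snapshot is exactly what a restore cannot give you back.

```python
import json
import pathlib
import time

BACKUP_DIR = pathlib.Path("backups")  # hypothetical location

def snapshot(data: dict) -> pathlib.Path:
    """Write a full, timestamped copy of the data. Slow for big datasets."""
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    path = BACKUP_DIR / f"snapshot-{int(time.time())}.json"
    path.write_text(json.dumps(data))
    return path

def restore(path: pathlib.Path) -> dict:
    """Recover the data exactly as it was at snapshot time, and no later."""
    return json.loads(path.read_text())

db = {"orders": 10}
saved = snapshot(db)
db["orders"] = 99          # this change happens after the snapshot...
print(restore(saved))      # ...so a restore silently loses it: {'orders': 10}
```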
Data duplication goes in and out of fashion as an architectural technique. It is often frowned upon because it is a common cause of bugs; on the other hand, it is very useful for performance. Like everything in life, it has pros and cons, and we have to judge whether to use it based on the circumstances.
A decade ago we spent ages “normalising” our databases, i.e. removing duplicated data. In a distributed system, however, we really do need data available in more than one location. If every service has to call back to the same database all the time, we lose all the benefits of distribution. Microservices therefore usually maintain local copies of the data they care about, which they can read and write without worrying about anyone else.
BUT the potential bugs have not mysteriously gone away. We still have to keep all those data copies in alignment (which doesn’t necessarily mean identical). By duplicating data we accept that we have designed a notoriously bug-prone technique into our system. C’est la vie; it’s a tradeoff. It just means we have to use development techniques that help manage the problem, such as Domain-Driven Design, or operational techniques, like managed stateful services, that handle some of these issues for us.
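One common way to keep duplicated data aligned, sketched here with hypothetical services and an in-memory stand-in for a real message broker, is for the owning service to publish change events that each consumer applies to its own local copy. The copies converge; they are eventually consistent rather than instantly identical:

```python
from collections import defaultdict

# An in-memory stand-in for a message broker such as a queue or event log.
class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(event)

class CustomerService:
    """Owns the master copy of customer data and announces every change."""
    def __init__(self, bus):
        self.bus = bus
        self.customers = {}

    def rename(self, customer_id, name):
        self.customers[customer_id] = name
        self.bus.publish("customer.renamed", {"id": customer_id, "name": name})

class InvoiceService:
    """Keeps its own duplicated copy of customer names for fast local reads."""
    def __init__(self, bus):
        self.names = {}
        bus.subscribe("customer.renamed", self._on_renamed)

    def _on_renamed(self, event):
        self.names[event["id"]] = event["name"]

bus = EventBus()
customers = CustomerService(bus)
invoices = InvoiceService(bus)
customers.rename("c-1", "Ada")
print(invoices.names["c-1"])  # "Ada": the duplicated copy has caught up
```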
So, I’ve just controversially defined a distributed system as any system that has components in more than one physical location and I've said that DistSys often use the architectural concepts of data copying and decoupling to improve operational efficiency (speed, resilience) and, more recently, developer efficiency (team productivity).
In my next post on this subject I’m going to go all philosophical and consider whether, under my wide definition, a single-threaded application running on a single CPU is also a distributed system. Does that mean my definition is wrong? (Hint: obviously I don’t think so; that’s a rhetorical question. I’m just faking humility there.)