No, this isn't the worst Christmas cracker joke ever.
In the first blog in this series I talked about what is a distributed system and controversially defined it (IMO) as a system that spans more than one physical location and uses the related concepts of copying and decoupling to improve operational efficiency (speed, resilience) and, more recently, developer efficiency (team productivity). In this post I'm going to ask if everything is really a distributed system if you look closely enough, it's just that for some systems the distributed-like parts are so hardened (commodified) that we no longer even think about them.
Thanks to Diptanu Choudary, Jon Berger & the Distributed Systems experts at Container Solutions, whose huge brains I picked for this work. They do not always agree with me on every philosophical point. However, we do enjoying thinking and discussing, which is the purpose of the exercise. We're all allowed to do that.
Is Copying Allowed?
As we noted in the last post, using copies of data is a foundational concept in computing, but you don’t have to be running on multiple machines or even running a particularly complicated system to benefit from information duplication. It’s fundamental to the performance of a single threaded application on a single server. We’ll talk about that because it’s a good example of the effect on performance of using copies.
Show Me the Cache
Let’s consider the very simple example of a compiled executable that runs directly on a single host server without a framework (say it’s written in C++ or Go for example) and with no VM. Even in this trivial case the executable (machine) code exists in at least 3 physical locations on that machine:
- The compiled machine code is generally held in a file on the local disk (the executable binary).
- At application initialisation, all the machine code is copied from the file into user space memory by the OS.
- At runtime, code operations that are about to be executed are temporarily copied from the user space memory into CPU cache (a physical storage location on the CPU chip). From here, the operations are read and then run by the CPU. (In fact, it’s slightly more complex even than this, there are multiple levels of CPU cache but you get the gist).
This multi-stage copying is complicated and even potentially error prone (it could fail or become corrupted, although on a modern machine this cache-style, one-way copying is extremely reliable <Anne says: ahem... see comment at end>. However, there has always been a trade-off between complexity, speed, cost and persistence when running applications. Without this caching, our applications would be very slow indeed, as we are about to hear.
The Medium and the Message
If you haven’t seen Peter Norvig’s table of the comparative access times for different common media take a look now, it’s very eye opening: http://norvig.com/21-days.html#answers
Let’s use these numbers to see why a simple system like that described above uses so many complicated duplicates.
Cache Me If You Can
What if our single server running a compiled binary didn’t use all the complex duplication we describe above?
- A read from memory takes maybe 10x longer than a read from a CPU cache, so if the OS didn’t pre-copy each line of machine code from RAM into a CPU cache before executing it your application could run at 1/10th the speed!
- Let’s go further. What if the CPU attempted to retrieve each line of machine code to execute directly from the executable file on disk? Well, your application would run between 1000 and 1,000,000 times slower and I’m guessing your users would probably complain.
Disks and RAM must be rubbish compared to CPU caches! Obviously, that’s not true. They just have different properties and use cases.
Disk is a cheap, permanent storage mechanism but it’s comparatively slow to read from and write to. Solid State Drives (SSDs) are much faster but more expensive. Memory is faster but even more expensive and also not persistent (the data is lost when you switch the power off). CPU cache is super-speedy but yet more expensive, limited in size, and similarly power-dependent.
So, even in our simplified example of a single-threaded application running on a single server we use the concept of copying to balance speediness and data persistence with cost. The OS uses data copies to compromise between going faster, keeping costs down, and safely maintaining data (the binary) in a location where it won’t disappear if the server crashes.
Blatant Plagiarism?
For running an executable, operational duplication is vital. It’s a good reminder that even a simple single-threaded application uses data copies for one of the same reasons as a distributed system (performance) with the same trade-offs (the difficulty of keeping everything in sync, and knowing what to cache).
But despite the similarities, I’d say this is only a very rudimentary distributed system. The single server system runs fast but it’s not resilient, it’s not that scaleable, it won’t perform well for a client a long way away and it doesn’t handle changing data (if you update the executable file on disk you generally have to restart the whole application for that to take effect).
To do better, we have to introduce more machines, more replication, and perhaps some decoupling. However, that will also introduce further complexity and more failure modes. Never mind. That’s why we get paid the big bucks.
The purpose of this post is to illustrate there’s really nowhere you can go in computing that doesn’t use copying as a performance tool. The OS is now just very good at this basic caching (keeping everything we need loaded and synced). We very rarely see any problems with it because it’s hardened, so we seldom think about it. We completely rely on the OS to handle it for us.
Ideally for our larger scale distributed systems we’d like a data synchronisation platform as stable as an OS cache system <Anne says: "ouch" see new end comment>, but Is that a realistic dream? In my next post I'm going to talk a little bit about why we might want to try; because we want performance, resilience and availability.
(*) The recent events of Meltdown and Spectre demonstrate that CPU caching was not quite the "hardened tech" we thought it was! The dangers of the trade-off between speed and complexity have been rather nastily demonstrated! Maybe that shows that our greatest risks in security lie where we are most over-confident...
If you want to read more by me, try my free mini book on the basics of containers, orchestrators and microservices: "The Cloud Native Attitude".