Since March I have been interviewing enterprises, vendors and pundits about Cloud Native for my ebook, “The Cloud Native Attitude”, which came out last week and is currently available as a free download.
In August, The New Stack very kindly ran an excerpt from my book on distributed systems, aka microservices, in which I describe how, at scale, microservices are very hard to operate.
While doing all this research, however, I noticed something surprising. Although historically distributed systems were used to improve operational scale and resilience, this time most companies were not using them for that. They did care about machine productivity and availability, but that wasn’t their primary concern. Their strongest motivation was improving the productivity of their development teams and getting new features to production faster.
We realised that most companies are adopting Cloud Native to optimise developer productivity, not machine productivity.
In the olden days we used distributed architectures because Moore’s Law hadn’t delivered enough yet. Networks were slow, unreliable and variable, particularly over the “last mile” to browsers. For availability, that often meant that components separated by network connections cached data locally and became tiny stateful services (e.g. in the client/server model).
Servers were flaky - you had to assume frequent failures and exhausted resources. They were also small and weak - you needed good utilisation, multi-threading, and many machines. Distribution was thus key to getting performance and scale. DistSys was hard because it had a lot of failure modes and was slow to develop and debug, but there was no choice. We had to optimise for machine productivity, and distribution was how we achieved that goal.
Then Moore’s Law came to the rescue. Machines and networks became bigger and better, and we stopped torturing ourselves with distributed systems and embraced something less efficient but faster and easier to develop on and operate: the vertically scalable monolith. We had the option to prioritise developer productivity over machine productivity, and we took it.
So fast-forward to today and everyone wants distributed systems again. Are we mad?
As an old-school developer, it’s easy to assume that folk want distributed systems out of a desire for scale. That isn’t a crazy thing to think: it’s why we used them before, and the people leading the field here (Netflix, Google, Facebook) are hyperscale, need distribution for scale, and talk at a lot of conferences.
If you’re pushing your systems to their limits in terms of traffic volume, you’ll constantly grapple with the failure modes I described in my article and book, and life will be really hard for your dev and ops teams. Murphy’s law will rule: everything that can go wrong will go wrong.
But my colleagues at Container Solutions and I have just spent the past six months interviewing not-yet-hyperscale microservice users, and we noticed two unexpected things:
First, they weren’t using distributed systems for scale. Second, they were using microservices for developer productivity.
What are the implications of that? There are several:
Fundamentally, most of the businesses we spoke to were using microservices for the same trade-off we’ve all been making for 30 years: developer productivity over operational efficiency. In our book we describe three case studies where this was true.
If a distributed system is not under load, many of the coincidences that are hard to debug happen seldom enough that engineers have time to resolve them as they go, and so can evolve their services to be distribution-safe.
The classic comment on distributed systems is that they wouldn’t be that hard if it weren’t for the errors. If you run the whole system under low effective load and low resource utilisation, you reduce errors. If you make heavy use of reliable, stateful services like queues or, even better, queues-as-a-service, you can reduce errors even further, although it will slow your system down and cost you money. That’s a developer-productivity versus hosting-cost trade-off.
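To make that concrete, here is a minimal sketch of what leaning on a queue-as-a-service can look like. It assumes AWS SQS and the boto3 client purely for illustration; the queue URL, message shape and handle_order function are invented for the example, and any managed queue would do.

```python
# Sketch: decoupling a producer and a consumer with a managed queue.
# Assumes AWS SQS via boto3; the queue URL and message format are
# hypothetical examples.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/orders"  # hypothetical


def handle_order(order: dict) -> None:
    # Stand-in for your real business logic.
    print("processing", order)


def submit_order(order: dict) -> None:
    # The producer hands work to the queue and returns immediately.
    # The queue, not the caller, worries about the consumer being slow,
    # restarting or temporarily down.
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(order))


def process_orders() -> None:
    # The consumer pulls work at its own pace. A message is only deleted
    # after it has been handled, so a crash mid-processing simply means
    # the message reappears on the queue and is retried later.
    while True:
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for message in response.get("Messages", []):
            handle_order(json.loads(message["Body"]))
            sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"]
            )
```

The extra hop is slower and the managed queue costs money, but lost requests, slow consumers and crash-and-retry scenarios become the queue’s problem rather than your developers’, which is exactly the trade-off above.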
Distributed systems are really hard if you are using them to deliver scale and efficiency. However, if that isn’t your primary aim, some of the difficulties can be actively reduced, giving you more time to feel your way to a system-wide solution.
Ironically, the biggest danger I can see in this approach of running distributed systems over-provisioned to manage errors (other than the unpleasant environmental implications) is one we didn’t face much in the 90s: security. Back then stuff didn’t get hacked as much as it does today.
Evolving a distributed system and fixing the crashes and bugs along the way is OK because crashes are obvious. Hacks are less obvious and harder to recover from, so distributed system security still needs to be planned up front; you can’t safely evolve it. Sam Newman speaks well on this subject.
My conclusion is that distributed systems are always going to be hard, but you can make them easier by choosing less efficient communication options like REST and by slowing things down with copious reliable, stateful message-handling services like queues-as-a-service. These are decisions you wouldn’t make if your primary motivation were efficiency at scale, but if your motive is actually feature velocity, they make sense.
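As an illustration of the kind of “less efficient but easier” communication I mean, here is a minimal sketch of a plain synchronous REST call with a timeout and a naive retry. It assumes the Python requests library; the URL, payload and retry policy are invented, and it also assumes the endpoint is idempotent, i.e. safe to call more than once.

```python
# Sketch: a deliberately simple, blocking REST call with a timeout and
# naive retries. The service URL and payload are hypothetical.
import time
import requests

INVENTORY_URL = "https://inventory.internal.example.com/reserve"  # hypothetical


def reserve_stock(item_id: str, quantity: int, retries: int = 3) -> dict:
    # Blocking HTTP is chattier and slower than a carefully tuned
    # asynchronous protocol, but it is easy to write, read, trace and
    # debug, which is the point when feature velocity is the goal.
    for attempt in range(retries):
        try:
            response = requests.post(
                INVENTORY_URL,
                json={"item_id": item_id, "quantity": quantity},
                timeout=5,  # always bound how long you are willing to wait
            )
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # crude exponential backoff
```

None of this is clever, and that is the point: the cost is paid in network chatter and idle waiting rather than in developer time.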
One of the key components of effective strategy is keeping your goal in mind. If you are using microservices for scale, you’ll make one set of architectural decisions. If you are using them for developer productivity, you’ll make different decisions. It’s important that you and your tech teams understand your primary goal and that you don’t attempt to achieve both at once. Don’t fight a tough war on more than one front.
There’s nothing wrong with this architecturally, but I do want to point out that it’s not environmentally friendly. We are using up resources here (burning fuel) to develop faster, but data centres already use 2% of the world’s electricity and that share is rising fast. If you are concerned about this, you can choose a cloud provider with a strong commitment to powering your systems with renewables. Google are leading the field here.
Read more about our work in The Cloud Native Attitude.