Since March I have been interviewing enterprises, vendors and pundits about Cloud Native for my ebook, “The Cloud Native Attitude”, which came out last week and is currently available as a free download.
In August, The New Stack very kindly ran an excerpt from my book, which was about distributed systems, aka microservices. In it I describe how, at scale, microservices are very hard to operate.
While doing all this research, however, I noticed something surprising. Although historically distributed systems were used to improve operational scale and resilience, this time most companies were not using them for that. They did care about machine productivity and availability, but that wasn't their primary concern. Their strongest motivation was improving the productivity of their development teams and getting new features to production faster.
We realised that most companies are adopting Cloud Native to optimise developer productivity, not machine productivity.
Microservices Are About Scale. Aren’t They?
In the olden days we used distributed architectures because Moore's Law hadn't delivered enough yet. Networks were slow, unreliable and variable, particularly over the "last mile" to browsers. For availability, that often meant components separated by network connections cached data locally and became tiny stateful services (e.g. in the client/server model).
Servers were flaky - you had to assume frequent failures and exhausted resources. They were also small and weak - you needed good utilisation, multi-threading, and many machines. Distribution was thus key to getting performance and scale. Distributed systems were hard because they had a lot of failure modes and were slow to develop and debug, but there was no choice. We had to optimise for machine productivity, and distribution was how we achieved it.
Then Moore's Law came to the rescue. Machines and networks became bigger and better, and we stopped torturing ourselves with distributed systems and embraced something less efficient but faster and easier to develop on and operate: the vertically scalable monolith. We had the option to prioritise developer productivity over machine productivity, and we took it.
Have We Lost The Plot?
So fast-forward to today and everyone wants distributed systems again. Are we mad?
As an old-school developer it’s easy to imagine that folk want distributed systems because of a desire for scale. That isn’t a crazy thing to think. It’s why we used them before and the people who are leading the field here (Netflix, Google, Facebook) are hyperscale, need to use distribution for scale, and talk at a lot of conferences.
If you're pushing your systems to their limits in terms of traffic volumes, you'll constantly grapple with the failure modes I described in my article and book, and life will be really hard for your dev/ops teams. Murphy's Law will rule: everything that can go wrong will go wrong.
But my colleagues at Container Solutions and I have just spent the past six months interviewing not-yet-hyperscale microservice users, and we noticed two unexpected things:
- They all stated their primary motivation was not scale at all but developer productivity and feature velocity. They wanted to ship product updates more often.
- They seemed to run a long time in production before they hit common high-scale distributed systems problems. They thus seemed to have time to work out how to handle the issues.
Eventually we understood: they weren't using distributed systems for scale. They were using microservices for developer productivity.
What are the implications of that? There are several:
- The drive behind these architectural decisions wasn't necessarily availability or efficiency (scale). Instead, the motivation was to split up functionality to avoid team clashes and to let developers develop and deploy safely in parallel (productivity). It wasn't Moore's Law in battle with Murphy's Law. It was Conway's Law. It was extreme modularity.
- Tech teams deliberately took decisions that reduced operational efficiency but were easier for devs, like using RESTful HTTP messages rather than more efficient binary RPC (there's a minimal sketch of this trade-off after this list).
- The systems weren't optimised to maximise traffic throughput per server. They were more likely to be over-provisioned, which is a good way to reduce errors. Without extreme load, they might not immediately hit the traditional, statistically driven errors of distributed systems.
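To make that REST-over-RPC point concrete, here is a minimal Go sketch (my illustration, not taken from any of the companies we interviewed) of the kind of plain HTTP/JSON endpoint those teams favoured. The /orders path, the Order type and the port are made up for the example.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Order is a hypothetical resource used only for this sketch.
type Order struct {
	ID    string `json:"id"`
	Total int    `json:"total_cents"`
}

func main() {
	// Plain HTTP + JSON: verbose on the wire, but any team can read it,
	// call it with curl and debug it without IDL files, code generation
	// or shared client stubs.
	http.HandleFunc("/orders/42", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(Order{ID: "42", Total: 1999})
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A binary RPC framework would be tighter and faster on the wire, but it needs interface definitions, code generation and shared stubs - exactly the kind of coordination overhead these teams were trading away for developer speed.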
Fundamentally, most of the businesses we spoke to were using microservices for the same trade-off we’ve all been making for 30 years: developer productivity over operational efficiency. In our book we describe three case studies where this was true.
That’s Why Distributed Systems Mean Something Different Now
If a distributed system is not under load, many of the hard-to-debug coincidences will happen seldom enough that engineers have time to resolve them as they go, and so they can evolve their service to be distribution-safe.
The classic comment on distributed systems is that they wouldn't be that hard if it weren't for the errors. If you run the whole system under low effective load and low resource utilisation, you reduce errors. If you make heavy use of reliable stateful services like queues - or, even better, queues-as-a-service - you can reduce errors even further, although it will slow your system down and cost you money. That's a developer productivity vs hosting cost trade-off.
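As a rough illustration of that trade-off, here is a small Go sketch of handing work off through a queue instead of calling a downstream service synchronously. The buffered channel stands in for a managed queue service, and the names orderQueue and processOrder are invented for the example.

```go
package main

import (
	"fmt"
	"time"
)

// processOrder stands in for a slow or occasionally failing downstream call.
func processOrder(id int) {
	time.Sleep(50 * time.Millisecond) // simulate work
	fmt.Println("processed order", id)
}

func main() {
	// The buffered channel plays the role of a reliable queue: producers
	// don't wait on the consumer, and bursts are absorbed rather than
	// turning into timeouts and cascading retries.
	orderQueue := make(chan int, 100)

	// A single worker drains the queue at its own pace.
	done := make(chan struct{})
	go func() {
		for id := range orderQueue {
			processOrder(id)
		}
		close(done)
	}()

	// Producers just enqueue and move on.
	for id := 1; id <= 5; id++ {
		orderQueue <- id
	}
	close(orderQueue)
	<-done // wait for the worker to finish before exiting
}
```

The extra hop adds latency, and a real queues-as-a-service product adds hosting cost - the developer productivity vs hosting cost trade-off described above - but producers no longer fail or retry just because a consumer is slow.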
Distributed systems are really hard if you are using them to deliver scale and efficiency. However, if that isn't your primary aim, some of the difficulties can be actively reduced, giving you more time to feel your way to a system-wide solution.
The Danger
Ironically, the biggest danger I can see in running distributed systems over-provisioned to manage errors (other than the unpleasant environmental implications) is one we didn't face much in the 90s: security. Back then, stuff didn't get hacked as much as it does today.
Evolving a distributed system and fixing the crashes and bugs along the way is OK because crashes are obvious. Hacks are less obvious and harder to recover from, so distributed system security still needs to be planned up front; you can't safely evolve it. Sam Newman speaks well on this subject.
Conclusion
My conclusion is that distributed systems are always going to be hard, but you can make them easier by choosing less efficient communication options like REST and by slowing things down with copious reliable, stateful message-handling services like queues-as-a-service. These are decisions you wouldn't make if your primary motivation was efficiency at scale, but if your motive is actually feature velocity they make sense.
Strategy
One of the key components of effective strategy is keeping your goal in mind. If you are using microservices for scale, you'll make one set of architectural decisions. If you are using them for developer productivity, you'll make different decisions. It's important that you and your tech teams understand your primary goal and don't attempt to achieve both at once. Don't fight a tough war on more than one front.
Anything Wrong?
There's nothing wrong with this architecturally, but I do want to point out that it's not environmentally friendly. We are using up resources (burning fuel) to develop faster, but data centres already use 2% of the world's electricity and that figure is rising fast. If you feel concerned about this, you can choose a cloud provider with a strong commitment to powering your systems with renewables. Google are leading the field here.
Read more about our work in The Cloud Native Attitude.