For me, one of the most fun and useful parts of going to conferences is talking to other attendees. I learn a lot that way about what folk are currently trying and learning.
At a recent Velocity conference on distributed systems, I had three extremely interesting conversations about three different problems attendees had experienced with over-engineering, which is a big risk when building complex DistSys.
It Would Be Easy If It Wasn't For the Errors
I built and supported very early distributed systems in production for over a decade and I remain in shock & awe at the power of the key concepts of replication and decoupling. Awe because they can achieve such great performance and modularity. Shock because there’s so much that can go wrong if you use them too fearlessly. The power of replication and decoupling is deceptive. A little goes a long way.
It has been said about DistSys that they would be easy if it wasn’t for the errors. One difficulty with designing a microservices architecture is that in dev or test or low-traffic production it’s hard to envisage the errors that will appear as scale increases. A great design in theory can often be too complex or slow in practice.
War Story 1
A common microservice mistake is to underestimate the performance impact of passing lots of messages from machine to machine. Sending a message over a network is at least 100 times slower than passing it within a single process. Add in a RESTful framework and a message queue for reliability and everything can get very slow.
You generally don't want to send too many small messages, but the designer needs to think more deeply than that. For each message, does latency matter? Some messages are not holding up mainline execution, or only need to move at human speed. Those messages can happily sit on asynchronous queues for milliseconds or even seconds! But if you have messages that need to be passed at computer speed, you probably need to do that within a single microservice.
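As a minimal sketch of that split, here is one way to keep a latency-sensitive calculation as a plain in-process call while pushing a non-urgent message onto an asynchronous queue. The function names and the audit-event example are assumptions for illustration, not from any real system:

```python
import queue
import threading

# Hypothetical sketch: the price calculation is latency-sensitive, so it
# stays a synchronous in-process call; the audit event is not, so it can
# sit on an async queue and be drained by a background worker at its own pace.

audit_queue = queue.Queue()
audit_log = []

def audit_worker():
    # Drains the queue in the background; a delay here never blocks callers.
    while True:
        event = audit_queue.get()
        if event is None:  # sentinel: shut the worker down
            break
        audit_log.append(event)

def calculate_price(base, quantity):
    # Latency-sensitive path: plain function call, no network hop.
    total = base * quantity
    # Not latency-sensitive: fire-and-forget onto the queue.
    audit_queue.put(f"priced {quantity} units at {base}")
    return total

worker = threading.Thread(target=audit_worker)
worker.start()

price = calculate_price(10, 3)

audit_queue.put(None)  # tell the worker to finish
worker.join()
print(price, audit_log)
```

The same shape applies across services: the queue becomes a message broker, and the critical path becomes whatever must stay inside one microservice.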
One of the attendee war stories I heard at Velocity London this year was of over-engineering by an over-keen contractor. He built an extensive multi-microservice system that ran like treacle. Why so slow? Too many services, too many communications of the wrong (critical) sort. The contractor soon left, and the attendee's company ended up replacing the system with a far simpler one with fewer services to achieve acceptable throughput. Distributed systems are useful but also have downsides. In this case the contractor traded a nice-looking, highly decoupled design in theory for terrible execution speed in practice.
I suspect there are several lessons here.
- The first lesson is that this contractor learned no lesson. He built a system that looked good to him but he left before seeing it fail in production. If an engineer has built microservice systems but never operated and fixed them at realistic scale, he hasn’t actually learned how to build microservices. He’ll probably go on and build that bad system again somewhere else.
- The second lesson is that decoupling is a great architectural concept, but it has to be used with care. One of the key factors when deciding on microservice size is how many messages the service will exchange with other services, and how they will be sent. Are those messages latency-sensitive? Smaller microservices are more decoupled, but if they require more passing of latency-sensitive messages they could throttle your system.
War Story 2
The second Velocity attendee’s war story was a different problem. This time the performance of their microservices was fine. The issue was interoperability. Every service had a different interface and it became difficult to get them to talk to one another successfully.
Although in theory a microservice architecture is polyglot and every team can do their own thing, in reality many companies get into trouble with complexity and incompatibility. Just because microservices can be highly individual doesn't mean they have to be, or even should be. In practice it's often better to temper the freedom of microservices with some standardised constraints. System-wide conventions and shared tools for things like message passing and message formats can make life much easier, if slightly less creative.
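One lightweight version of such a convention is a shared message envelope that every service wraps its payload in, so the routing and debugging fields always sit in the same place. This is a hypothetical sketch; the field names are assumptions, not a real standard:

```python
import json
import uuid
from dataclasses import dataclass, field, asdict

# Hypothetical shared envelope. The only agreement between teams is this
# shape; the payload inside stays service-specific.
@dataclass
class Envelope:
    source: str    # which service sent this
    kind: str      # message type, e.g. "invoice.paid"
    payload: dict  # service-specific body
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_json(self):
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(raw):
        return Envelope(**json.loads(raw))

# Any two services that share this helper can parse each other's messages.
msg = Envelope(source="billing", kind="invoice.paid", payload={"invoice": 42})
round_tripped = Envelope.from_json(msg.to_json())
print(round_tripped.payload["invoice"])  # 42
```

A tiny shared library like this costs each team a little freedom but removes a whole class of "service A can't read service B" incidents.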
War Story 3
The third war story I heard was not really about microservices, but it was pretty funny, so I'm including it. A conference attendee was asked to help a company out with a data processing Hadoop cluster. It had been set up by a previous contractor and was running poorly. When our protagonist asked for a sample of the data the cluster processed, he was handed a 2GB file. That was it: the cluster never handled anything bigger! He replaced the entire Hadoop cluster with a simple Python script that would run on anyone's laptop.
So here was a great example of inappropriately using complex and hard-to-maintain technology. The 2GB of data could easily be handled by a script or even a spreadsheet! Both of those solutions would be far cheaper and easier to support for the client company, which clearly lacked Hadoop expertise. This tale is a good reminder that a new system is forever, not just for the day you put it in, when you can still handhold it.
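We don't know what the real cluster computed, but as a stand-in, here is the canonical Hadoop example, word count, done in plain Python. Because the file streams through line by line, even a 2GB input fits comfortably on a laptop:

```python
from collections import Counter

# A sketch of "replace the cluster with a script": a streaming word count.
# Memory use stays proportional to the number of distinct words, not the
# size of the input file.
def word_count(lines):
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# In real use: with open("data.txt") as f: counts = word_count(f)
sample = ["to be or not to be", "be quick"]
counts = word_count(sample)
print(counts["be"])  # 3
```

Twenty lines, no cluster, no JVM tuning, and nothing for the client to maintain beyond a cron job.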
WTF!
The interesting thing is, these three microservice war stories were just from the first three strangers I randomly chatted to at Velocity. They were all recent problems, which is unsurprising as the technology is new. But is there any other pattern here?
Their problems all seemed to be caused by someone enthusiastically trying a complex new thing (microservices) without much experience of the tech in production. That’s bad if the designer is a contractor who then disappears, learning nothing and leaving a “legacy” mess behind.
Be careful about your contractors. Are they using you to make their CVs look better on paper? Do they understand the trade-offs of a microservice architecture? Have they actually operated this stuff in production? It's fantastic to learn and we should encourage it. It's brilliant to invest in the expertise and experience of your staff, or of folk with whom you have a long-term relationship. However, if you are investing to build misleading CV points for a contractor who is quickly going to vanish, be careful.
We often help organise conferences and meetups and we would love to hear more war stories like this. I wish people would talk about them in public - let's not only hear the Panglossian stuff. I love a disaster movie (as long as I'm not in it). Do submit some tales of horror to conference CFPs! Especially if you recovered from them!
If you want to read more about the fundamental ideas behind Cloud Native try our free eBook The Cloud Native Attitude. This includes several case studies from folk who have learned this stuff the hard way. It took time!