No, Really, Cloud Native Is About Culture, Not Containers

For most of 2019 and 2020, I felt uneasy and slightly guilty about the Cloud Native Computing Foundation’s definition of ‘Cloud Native’. The CNCF definition had a strong implication that you weren’t Cloud Native if you weren’t using microservices.

I thought I was writing Cloud Native applications, but I very rarely used microservices. I didn’t need them, so I didn’t bother. Did that mean my applications weren’t actually Cloud Native? Was I doing Cloud Native wrong? After all, who was I to say I knew more about Cloud Native than the Cloud Native Computing Foundation?

The good news, for my peace of mind if nothing else, is that the CNCF definition has softened the language around microservices. This is good timing, since the pendulum seems to be swinging back towards monoliths, as more organisations discover the tradeoffs of a microservices architecture; but that’s another topic.

So what is Cloud Native? It’s one of those terms whose meaning seems to morph to meet the needs of the speaker. Cloud Native undoubtedly has a technological component (the hint is in the ‘cloud’ part of the name), but it’s about a lot more than technology.

When we evaluate a technology or an architectural pattern, we should be asking, ‘What does it do?’, but also, ‘What does it enable? ... What else do we need to do to get this benefit?’ Being Cloud Native is about enabling the benefits of the cloud: the cost savings and elasticity, the resiliency, but especially the speed-to-market.

To access those, the first step is to be on the cloud, but that’s not enough. It’s not enough even if you’re using microservices. Getting the most out of the cloud means having the right, Cloud Native, culture. So what does that culture look like?

Automate All the Things

Automation is nothing new. Software engineers love automating repetitive tasks. (Many of us love it too much, and catch ourselves going down automation rabbit holes for one-off tasks, just in case they might one day be repeated.)

What’s changed with Cloud Native is how much we can automate; declarative infrastructure allows us to automate much more deeply. We can configure networking, create servers, provision storage, and release applications, all in code. Cloud Native technologies allow us to automate, and taking advantage of Cloud Native requires us to automate. If releases are slow and manual, it doesn’t matter how much Kubernetes is under the hood: it’s not Cloud Native.
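To make ‘declarative’ a little more concrete, here is a minimal sketch of the idea: we describe the state we want, and code works out the steps needed to get there. Everything in it (the `plan` and `apply` functions, the resource names, the replica counts) is hypothetical and invented for illustration; it is not the API of any real provisioning tool.

```python
from typing import Dict, List, Tuple

# Hypothetical model: resource name -> number of instances we want running.
State = Dict[str, int]
Action = Tuple[str, str, int]  # (verb, resource name, target count)


def plan(desired: State, actual: State) -> List[Action]:
    """Work out the actions that would move `actual` to match `desired`."""
    actions: List[Action] = []
    for name, count in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name, count))
        elif actual[name] != count:
            actions.append(("scale", name, count))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name, 0))
    return actions


def apply(actions: List[Action], actual: State) -> State:
    """Return the state that results from carrying out `actions`."""
    new_state = dict(actual)
    for verb, name, count in actions:
        if verb == "delete":
            new_state.pop(name)
        else:
            new_state[name] = count
    return new_state


desired = {"web": 3, "db": 1}
actual = {"web": 2, "cache": 1}
steps = plan(desired, actual)
converged = apply(steps, actual)
```

Because the description is the source of truth, running the plan a second time against the converged state produces no actions at all, which is what makes this style of automation safe to repeat.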

Many organisations struggle with automation, because it takes time and skill. Automation is sometimes pushed off to ‘later’, in favour of new features, and then the automation is never actually done. Aim to automate at the beginning of a new undertaking. The thing being automated will be smaller and more tractable, and the return on the automation investment will be maximised. Cultivate good habits like TDD (at all levels of the test pyramid, including unit, contract, and integration) and setting up a build pipeline on day one.

What impact does automation have on people? In a healthy organisation, automation should be great for people; it frees them from low-value repetitive tasks and allows them to move up the value chain to more fulfilling, creative work. Things can go wrong, though. If automation is perceived solely as a cost-saving mechanism to reduce headcount, employees won’t be rushing to automate themselves out of their jobs.

Even where employees feel secure, change is hard. I recently worked with a small, friendly company on a project to automate a decision tree of regulations knowledge. I was surprised to discover that some of the staff were intensely uncomfortable with the idea of transferring knowledge from their heads to a computer, because they worried they’d be less valuable to the company, or that their job would be going away. We had to convince them that only the boring parts of their role would be going away.

Psychological Safety Supports Learning

To make the switch from someone who does a task to someone who writes the automation, individuals need psychological safety. People need to believe it’s OK to not know everything, it’s OK to learn on the job, and it’s OK to make mistakes.

On one of my microservices projects, all of the integration testing was done manually, by a QA team. Each release was gated by an extended QA period, and there was often a long delay between the developers introducing integration bugs and the QA team finding them. My IBM Garage colleagues and I were mystified: why weren’t the QA team members teaching themselves to automate these tests?

With the benefit of hindsight, I think it was probably a combination of factors. Nobody felt able to invest the time in reskilling, because there was too much (repetitive) work and too much pressure on the deadlines. The QA team themselves felt more secure being QA experts rather than automation novices.

However, an organisation that doesn’t invest in learning and promote a culture of curiosity is going to be in trouble. Tech platforms just change too fast for a static skills profile to be sustainable, especially in the cloud.

Adapt to Automation

Getting automation in place is only half the battle. If we automate something, we must also update the processes that assumed the thing we just automated was expensive and unreliable. This is the bit that many organisations miss.

Before releasing a ‘Cloud Native’ application, I’ve had to fill in 64-tab release approval spreadsheets, with lots of questions about what proportion of the tests failed and how many tests were run.

These checklists make some sense when there’s a manual QA phase in which tests can fail without blocking the build, or testers might forget to run some tests. They don’t make sense in a context where test execution is automated and the bar for success is 100%.
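When the bar is 100%, the questions on the spreadsheet collapse into a single yes-or-no check that the pipeline can answer for itself. A rough sketch, assuming a hypothetical `SuiteResult` summary produced by the build (the field names are mine, not those of any particular test framework):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteResult:
    """Hypothetical summary of one automated test run."""
    total: int
    failed: int
    skipped: int = 0


def can_release(result: SuiteResult) -> bool:
    """Gate the release: tests must exist, all must run, all must pass."""
    return result.total > 0 and result.failed == 0 and result.skipped == 0


green = SuiteResult(total=412, failed=0)
flaky = SuiteResult(total=412, failed=1)
partial = SuiteResult(total=412, failed=0, skipped=5)
```

With a gate like this in the pipeline, ‘what proportion of the tests failed?’ has only one acceptable answer, so there is nothing left for a human to fill in.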

Release Early, Release Often

The process of releasing software is a good demonstration of the interplay between culture and technology. Before the cloud, software was released by pressing it onto discs and physically sending it around the world. That sounds expensive, and it was. It wasn’t the sort of thing you wanted to do often, and you had to get it right. Even users didn’t want to receive updates often, because installing an update was both complicated and tedious.

The cloud allows us to release software much more often. Sometimes users don’t even notice our releases, because they’re consuming a service we manage. Even if users install an update, it’s often a cheap and easy pull by an automated system.

Microservices allow this release cadence to be increased even further, because instead of coordinating the release of the whole application, we can release only a small part of the system. That means we don’t need to do prolonged stabilisation cycles or software freezes that span many teams.

This ability to release small components independently is one of the main benefits of microservices, because it means—if the culture has kept up—releases happen much more often and the system as a whole is more nimble.

If you’re trying to keep count, the other benefits of microservices are related to heterogeneous stacks, resilience, and performance. Different parts of the application can be scaled independently, and if one part of the application is consuming excessive CPU, it won’t starve other processes of resources. More accurately, the resource starvation will depend on the deployment topology and it will be a ‘noisy neighbour’ type of problem at worst. Because each service runs in its own process, crashes in one microservice won’t instantly take out all the other services, but unavailable services must still be handled with care to avoid awful cascading failures.
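One common way of handling an unavailable service with care is a circuit breaker: after repeated failures, callers stop hammering the struggling service and fail fast instead, then cautiously try again after a cool-down. The sketch below is a deliberately small, single-threaded illustration; the class name, thresholds, and injectable clock are my own assumptions, and a production system would more likely rely on a well-tested library or a service mesh.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the downstream service."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self._clock = clock             # injectable so tests can fake time
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Still cooling down: fail fast without touching the service.
                raise CircuitOpenError("downstream service marked unavailable")
            # Cool-down elapsed: allow one trial call; one failure reopens.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result
```

The caller still has to decide what to do when the breaker is open (serve a cached value, degrade a feature, show an error), but at least one slow dependency no longer drags every upstream service down with it.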

It sounds kind of obvious, but the benefit of small independent releases will only be achieved if an organisation actually does small independent releases. My team had some calls with a veteran Asian bank that was struggling to keep pace with more nimble competitors. The bank decided the solution was to modernise its ageing architecture and reimplement a large COBOL estate as a set of microservices.

Sadly, their release board met twice a year. No matter how fine-grained and architecturally elegant their microservices were, releases would be happening (at most) twice a year.

Microservices are designed to let teams go faster, but at a cost: distributed systems introduce significant operational and runtime complexity. What’s the point of taking the hit for a distributed system, if applications can’t actually get to market faster?

This example shows the importance of ‘if you automate something, fix the surrounding processes that assume it’s hard’. In this case, releases of individual microservices should be (with appropriate monitoring and automation) easy, but the release board process still assumes they’re slow and hard. The surrounding processes almost eliminate the benefits of the microservices architecture, but they don’t do anything to fix the downsides.

Getting code into production is so essential to Cloud Native success that it should be done as often as possible. Organisations that do this well actually make a distinction between ‘releasing’ and ‘deploying’. Releasing can be done on whatever schedule the business has an appetite for, but deployment is a technical issue and should be done as often as possible.

Releasing (that is, making functionality available to users) is totally decoupled from deployment. Visibility of a new piece of functionality can be controlled using A/B tests or feature flags, or by not even wiring new modules in. Deploy a walking skeleton with no functionality early in the project to get the hang of deploying, and then stay in practice with regular updates.

As well as making getting code to production delightfully easy, separating releases from deployments gives ‘free’ validation of the rollback mechanism; to roll back a bad feature, just ‘un-release’ an existing deployment.
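A feature flag is the simplest way to see the split between deploying and releasing. In this sketch, the new code path is deployed but dormant until someone flips the flag, and flipping it back is the rollback. The `FeatureFlags` class, the flag name, and the discount rule are all made up for illustration; real systems would usually keep flags in a shared store or a dedicated flag service rather than in memory.

```python
from typing import Dict, List


class FeatureFlags:
    """Hypothetical in-memory flag store; unknown flags default to off."""

    def __init__(self) -> None:
        self._enabled: Dict[str, bool] = {}

    def release(self, name: str) -> None:
        self._enabled[name] = True

    def unrelease(self, name: str) -> None:
        self._enabled[name] = False

    def is_released(self, name: str) -> bool:
        return self._enabled.get(name, False)


def checkout_total(prices_in_pence: List[int], flags: FeatureFlags) -> int:
    """Total a basket; the discount code is deployed but only runs if released."""
    subtotal = sum(prices_in_pence)
    if flags.is_released("loyalty-discount"):
        return subtotal * 90 // 100  # hypothetical 10% discount
    return subtotal


flags = FeatureFlags()
basket = [1000, 2000]

before = checkout_total(basket, flags)       # deployed, not yet released
flags.release("loyalty-discount")
during = checkout_total(basket, flags)       # released
flags.unrelease("loyalty-discount")
after = checkout_total(basket, flags)        # rolled back, no redeploy
```

Nothing was redeployed between those three calls; only the release decision changed.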

Risk Management

The emphasis on rollback and recoverability is another important part of Cloud Native culture. If software is packed up and released in cardboard boxes, there is no rollback, and even patches are horrible and expensive.

Because of this, many organisations still have time-consuming, up-front QA processes, and almost no organisational capacity for failure management. Instead, prioritise recoverability; ideally, organisations should be able to recover from self-inflicted or externally-triggered outages within milliseconds, with no data loss and no manual intervention. That may be too ambitious for mere mortals, but strive to minimise manual steps and avoid approval handoffs in disaster situations.

Autonomy

The dream of microservices architecture is that each team manages its own service, with complete autonomy and without handoffs. Autonomy has huge benefits; localised decision-making means the people with the most information are the ones making the decisions. Decisions can be made quickly without cumbersome committee meetings. And because autonomy feels good, autonomous teams are likely to be engaged and energised.

In practice, many organisations feel autonomy is best in moderation. To manage risk, they impose centralised governance over cloud provisioning, software stacks, and releasing. Some of our clients tell us that they use a single, centrally controlled CI/CD pipeline so that all of their microservices are released together.

It should be obvious that this is seriously missing the point. Enforcing lockstep releases undoes many of the benefits of cloud technology, and the reduction in speed saps morale.

Nonetheless, even with properly autonomous teams, system-level considerations don’t go away. How do we ensure (non-coercively) alignment with the organisational goals? How do we ensure changes to one microservice don’t break others?

Many of us have experienced microservices spaghetti, which is like normal code spaghetti, except distributed and more frightening. It is quite possible for every component in a system to be behaving as intended, and for the system as a whole to be totally broken. Automated end-to-end tests can help with this, but they’re expensive.

Consumer-driven contract tests, in which the consumer and provider share a description of the service behaviour and test against it, are more scalable. As well as helping to ensure system-level integrity, the contract framework acts as a handy mock for the consumer’s unit tests, and as a handy test suite for the provider.
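Here is a toy version of the idea, to show how one shared description can serve both sides. The contract format, the `/accounts/42` endpoint, and both helper functions are hypothetical and far simpler than what a real contract-testing framework provides; the point is only that the consumer’s mock and the provider’s check are derived from the same document.

```python
from typing import Any, Callable, Dict, List, Tuple

Handler = Callable[[str, str], Tuple[int, Dict[str, Any]]]

# Hypothetical shared contract: the request the consumer will make, and the
# response fields (with types) that it relies on.
CONTRACT: Dict[str, Any] = {
    "request": {"method": "GET", "path": "/accounts/42"},
    "response": {"status": 200, "body": {"id": int, "balance": int}},
}


def mock_provider(contract: Dict[str, Any]) -> Handler:
    """Consumer side: build a stub that answers as the contract promises."""
    samples = {int: 0, str: "", float: 0.0, bool: False}
    fields = contract["response"]["body"]
    body = {field: samples[kind] for field, kind in fields.items()}

    def handler(method: str, path: str) -> Tuple[int, Dict[str, Any]]:
        expected = contract["request"]
        if (method, path) != (expected["method"], expected["path"]):
            return 404, {}
        return contract["response"]["status"], dict(body)

    return handler


def verify_provider(contract: Dict[str, Any], handler: Handler) -> List[str]:
    """Provider side: replay the contract's request and list any mismatches."""
    request = contract["request"]
    expected_status = contract["response"]["status"]
    status, body = handler(request["method"], request["path"])
    problems: List[str] = []
    if status != expected_status:
        problems.append(f"status {status} != {expected_status}")
    for field, kind in contract["response"]["body"].items():
        if field not in body:
            problems.append(f"missing field {field!r}")
        elif not isinstance(body[field], kind):
            problems.append(f"field {field!r} is not {kind.__name__}")
    return problems
```

In this sketch the provider is free to return extra fields the consumer never asked about; only the parts the consumer depends on are checked, which is what keeps the two teams loosely coupled.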

Despite their obvious value in a microservices context, contract tests are not widely used. Implementations sometimes fail because the teams involved can’t coordinate updates to the shared contracts. If the consumer updates the contract, it can break the provider’s build and cause cross-team annoyance. This is a legitimate problem and shouldn’t be trivialised.

Nonetheless, if it’s too hard to manage updates to a contract and communication between the CI/CD pipelines, is it likely communications between the actual services and updates to the implementation will go smoothly? Deferring integration concerns from the development phase to production is unlikely to be a winning strategy.

Coupling Is Still a Thing

The problem contract tests are trying to address, and the reason they’re hard, is that coupling between services still exists, no matter how distributed the architecture. Services have to interact to create value. Coupling can be maximised or minimised, but it will never go away. The key to a successful architecture is to manage coupling.

Wishful Mimicry and Incomplete Adoption

The reason many Cloud Native transformations run into trouble is that they suffer from ‘incomplete adoption’. The back-end code is split into microservices, but the front end remains monolithic, or the database doesn’t get split up, or the middleware hairball remains centralised and hairy.

Perhaps even more commonly, new technologies get adopted but the culture stays exactly the same. It’s a natural human tendency to try to learn by imitating others who are successful. This isn’t a bad thing, but before transplanting practices or architectures into our own organisation, we need to spend the time fully understanding all of the ingredients that go into that success.

If we pick up just some of the more inviting practices in the hope that that’s enough to create the full transformation, things won’t end well.

Illustrations by Holly Cummins.

Related Cloud Native Patterns

Automated Infrastructure

The vast majority of operational tasks need to be automated. Automation reduces inter-team dependencies, which allows faster experimentation and leads in turn to faster development.

Automated Testing

Shift responsibility for testing from humans (manual) to automated testing frameworks, so the quality of the released products is consistent and continuously improving, allowing developers to deliver faster while spending more of their time improving features to meet customer needs.

Continuous Delivery

Keeping a short build/test/deliver cycle means code is always ready for production and features can be immediately released to customers—and their feedback quickly returned to developers.

Psychological Safety

When team members feel they can speak up, express concern, and make mistakes without facing punishment or ridicule, they can think freely and creatively, and are open to taking risks.