Earlier this year, as part of the preparation for my new book "The Cloud Native Attitude", I interviewed a lot of expert practitioners to find out what they had tried, struggled with and achieved. Over the next two weeks I'm going to publish those interview write-ups. They are also in the book, which is currently available as a free eBook download.
Case Study: The FT
Based in London, The Financial Times has an average worldwide daily readership of 2.2 million. Its paid circulation, including both print and digital, is 856K. Three quarters of its subscribers are digital.
The FT was a pioneer of content paywalls and was the first mainstream UK newspaper to report earning more from digital subscriptions than print sales. They are also unusual in earning more from content than from advertising.
The FT have been gradually adopting microservices, continuous delivery, containers and orchestrators for three years. Like Skyscanner (who I’ll talk about next), their original motivation was to be able to move faster and respond more quickly to changes in the marketplace.
As Sarah Wells, the high-profile tech lead of the content platform, points out, “our goal of becoming a technologically agile company was a major success - the teams moved from deploys taking 120 days to only 15 minutes”. In the process, according to senior project manager Victoria Morgan-Smith, “the teams were completely liberated”. So how did they achieve all this? Broadly speaking, they made incremental but constant improvements.
The FT have moved an increasing share of their infrastructure into the cloud (IaaS). Six years ago, the FT started with their own virtualized infrastructure but then adopted AWS as Amazon solved issues with funding, monitoring, networking and OS choice. As Sarah Wells described it, “custom infrastructure was not a business differentiator for us”. They now have a target of 100% cloud infrastructure and they use off-the-shelf, cloud-based services like databases-as-a-service (including AWS Aurora) and queues-as-a-service wherever possible. Again this is because operating this functionality in house is “not a differentiator” for the company.
Within the FT as a whole there was a strong inclination to move to a microservices-oriented architecture, but different parts of the company took different approaches. The FT have three big programmes of work in which they implemented a new system as a set of microservices. One of those (subscription services) incrementally migrated its monolithic server to a microservice architecture by slowly carving off key components. The remaining two projects (the new content platform and the new website), however, each essentially built a duplicate of their respective monolith from the start using microservices. Interestingly, both approaches worked for the FT, suggesting that there is no one correct way to do a monolith-to-microservice migration.
After nearly three years the content platform has moved from a monolith to having around 150 microservices each of which broadly “does one thing”. However, they have not followed the popular “Conway’s law” approach where one or more microservices represent the responsibilities of each team (many services to one team). Instead multiple teams support each microservice (many to many). This helps maximize parallelism but is mostly because teams work end-to-end on the delivery of features (such as “publish videos”) and these features usually span multiple microservices. They then monitor for deploy conflicts between teams. If clashes regularly occur then the service in contention is split further.
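As a rough illustration of what monitoring for deploy conflicts could look like, here is a minimal Python sketch. It is not the FT's tooling: the function name `count_clashes`, the deploy-log format and the definition of a clash (two consecutive deploys of the same service by different teams within an hour) are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import timedelta


def count_clashes(deploys, window=timedelta(hours=1)):
    """Count deploy clashes per service.

    deploys: iterable of (service, team, when) tuples, where `when` is a datetime.
    A "clash" is assumed here to mean two consecutive deploys of the same
    service, made by different teams, no more than `window` apart.
    Returns a dict {service: clash_count} containing only services with clashes.
    """
    by_service = defaultdict(list)
    for service, team, when in deploys:
        by_service[service].append((when, team))

    clashes = {}
    for service, events in by_service.items():
        events.sort()  # order each service's deploys by time
        count = sum(
            1
            for (t1, team1), (t2, team2) in zip(events, events[1:])
            if team1 != team2 and t2 - t1 <= window
        )
        if count:
            clashes[service] = count
    return clashes
```

A service that keeps showing up in the result would be a candidate for splitting further, which is the rule of thumb described above.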
They found that, in Wells’ words, “infrastructure-as-code was necessary for microservices”, and they evolved a strong culture of automation and CD. According to Wells:
“There is a fair amount of diversity within the FT, with some teams running a home-grown continuous delivery system based on Puppet while others wrap and deploy their services in Docker containers on the container-friendly Linux operating system CoreOS, with yet others deploying to Heroku.
Basically, we have at least:
1) A home-grown, Puppet-based platform, currently hosted on AWS without containers
2) A Heroku-hosted PaaS
3) A Docker container-based environment using CoreOS, hosted on AWS”
All of these environments work well. Each is still evolving, and each was chosen by the relevant tech team to meet its own needs at the time. Again, the FT’s experience suggests there is more than one way to successfully implement an architectural vision that is microservice-oriented and runs in a cloud-based environment with continuous delivery.
Finally, the FT’s content platform team found that containers were the gateway to orchestration. The content folk have been orchestrating their Docker-containerised processes in production for several years with the original motivation being server density - more efficient resource utilisation. By using large AWS instances to host multiple containerised processes, controlled with an orchestrator, they reduced their hosting costs by around 75%. As very early users of orchestration they created their own orchestrator from several open source tools but are now evaluating the latest off-the-shelf products, in particular Kubernetes.
So what unexpected results came out of this Cloud Native evolution for the FT? They anticipated that the shift to faster deployments would increase risk. In fact, they have moved from a 20% deployment rollback rate to ~0.1%, i.e. a reduction of more than two orders of magnitude (roughly 200-fold) in their error rate. They ascribe this to the ability to release small changes more often with microservices. They have invested heavily in monitoring and A/B testing, again building their own tools for the latter, and they replaced traditional pre-deployment acceptance tests with automated monitoring in production of key functionality.
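The size of that improvement is easy to check. The short Python calculation below uses only the two rollback rates quoted above (20% and roughly 0.1%); the variable names are just for illustration.

```python
import math

rollback_rate_before = 0.20   # 20% of deployments rolled back
rollback_rate_after = 0.001   # ~0.1% of deployments rolled back

# How many times smaller the rollback rate became (about 200x).
improvement_factor = rollback_rate_before / rollback_rate_after

# The same improvement expressed in orders of magnitude (about 2.3).
orders_of_magnitude = math.log10(improvement_factor)

print(f"about {improvement_factor:.0f}x fewer rollbacks, "
      f"or {orders_of_magnitude:.1f} orders of magnitude")
```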
How have they handled the complexity of distributed systems? They chose to make heavy use of asynchronous queues-as-a-service, which simplified their distributed architecture by limiting the knock-on effects of a single microservice outage (although this does increase system latency, a tradeoff they accepted). They also limit the use of chained synchronous calls to avoid cascading failures, in which one failed service holds up a whole chain of services waiting on outstanding synchronous requests. They struggled, too, with issues around the order of microservice instantiation and are contemplating a rule that microservices should exit if their prerequisite services are not yet available, allowing the orchestrator to restart them automatically (by which point the prerequisite services should hopefully have appeared). Basically, it was difficult but they learned and improved as they went.
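The "exit if prerequisites are missing" rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the FT's implementation: the host names and ports are made up, availability is approximated by a plain TCP connection check, and the orchestrator is assumed to restart any process that exits with a non-zero status.

```python
import socket

# Hypothetical prerequisite services; these hosts and ports are illustrative only.
PREREQUISITES = [("queue.internal.example", 5672), ("db.internal.example", 5432)]


def is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def missing_prerequisites(prerequisites, probe=is_reachable):
    """Return the (host, port) pairs that the probe reports as unavailable."""
    return [(host, port) for host, port in prerequisites if not probe(host, port)]


def startup_exit_code(prerequisites, probe=is_reachable):
    """Return 0 if every prerequisite is available, 1 otherwise.

    A non-zero code signals the (assumed) orchestrator to restart the service
    later, by which time the prerequisites may have come up.
    """
    return 1 if missing_prerequisites(prerequisites, probe) else 0


# At the top of a real service's entry point one might write:
#     code = startup_exit_code(PREREQUISITES)
#     if code != 0:
#         sys.exit(code)
```

Failing fast like this keeps startup-ordering logic out of each service: the service only states what it needs, and the retry loop is left to the orchestrator.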
According to project manager Victoria Morgan-Smith, “our goal throughout was to de-risk experimentation”, but that involved “training, tools and trust”. The FT invested heavily in internal on-the-job training, with an explicit remit for their devops teams to disseminate the new operational knowledge to developers and operations. They learned that their teams could be trusted to make good judgments if they were informed, given responsibility and had the right tools. For example, their IaaS bills were initially very high, but once developers were given training, access to billing tools and guidance on budgets, the bills came down.
In common with many other early adopters, the FT experimented and built in-house, and were prepared to accept a level of uncertainty and risk. Sometimes their tech teams needed to re-assess as the world changed, as with their move from private to public cloud, but they were persistent and were trusted to make the occasional readjustment in a rapidly changing environment. Trust was a key factor in their progress.
Read more about our work in The Cloud Native Attitude.