The Apache Mesos cluster manager emerged from work at Twitter and UC Berkeley around 2009. Over the last six years the Mesos project has kept a relatively low profile in the mainstream technology press, but it has recently burst onto the scene. The reality, however, is that many large organisations have been successfully using Mesos for production (critical) workloads for quite some time, including Twitter, Airbnb, eBay and Apple.
A number of companies are now building rich application platforms around Mesos: Mesosphere’s DCOS, Yelp’s PaaSTA, Capgemini’s Apollo and Cisco’s microservice infrastructure. I’ve been fortunate enough to report on many of these for InfoQ, and so I was keen to summarise the reasons that I think organisations are investing so heavily in Mesos…
Avoiding cloud/datacenter vendor lock-in
One of the primary benefits cited when discussing Mesos is the level of abstraction (or anti-corruption layer) it provides over proprietary cloud or datacenter vendor infrastructure. Mesos acts very much like the Linux kernel, but at the abstraction level of the datacenter - this effectively homogenises infrastructure (as far as applications are concerned), and pools compute resource for use by components within your software systems. In an article I wrote for InfoQ that explored the Mesos-based Capgemini Apollo project, Graham Taylor explained why Capgemini based their microservice and big data platform on Mesos:
“As engineers, we understand that other developers don’t particularly like proprietary / closed solutions or those that have complete lock in to a particular cloud provider. Since we were already leveraging other open-source components, this decision made complete sense.”
Support for mixed workloads
Mesos was built to support mixed workloads of long-running (application) and short-running (batch processing) processes and jobs. For example, Twitter utilises Apache Aurora to run their microservice applications alongside Hadoop for crunching large amounts of data. Mesos also supports Storm and Spark, and so data crunching can be undertaken in real time as well (which fits the lambda architecture pattern). In a recent InfoQ article, Keith Chambers of Mesosphere explained that part of the reason Cisco is embracing Mesos is the ability to ‘democratise the primitives’ of modern software systems:
“Cisco is betting big on IoT and this is part of our bet. We want to democratize the primitives (Kafka, Spark, Cassandra, Elasticsearch, etc) for building distributed IoT applications by making them available to all.
[...]
We know of two production deployments outside of Cisco and we have begun offering ‘Marathon as a Service’ for Cisco product teams.”
As discussed in an article by Wired magazine, the fact that all of these workloads can be run in the same cluster is revolutionary, and reduces the need to design, run and maintain multiple job-specific clusters. As John Wilkes often says in relation to Google’s Borg, running multiple workloads with uncorrelated demand (e.g. priority one and priority two jobs) allows you to maximise the utilisation of your cluster, provided you can allocate capacity dynamically. This can be a game-changer within cluster computing.
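The utilisation argument can be made concrete with a little arithmetic. The following is a toy sketch only: the hourly demand figures are invented for illustration, and the helper names are hypothetical rather than anything provided by Mesos or Borg. It compares the capacity needed when two workloads each get their own cluster with the capacity needed when they share one.

```python
# Toy illustration only: the demand figures below are invented, not measured.
service_demand = [20, 30, 80, 90, 85, 40]  # cores needed by a long-running service, per hour
batch_demand = [90, 80, 20, 10, 15, 70]    # cores needed by batch jobs in the same hours


def capacity_needed(*workloads):
    """Return the peak of the combined hourly demand of the given workloads."""
    return max(sum(hour) for hour in zip(*workloads))


# Separate clusters must each be sized for their own peak.
separate = capacity_needed(service_demand) + capacity_needed(batch_demand)
# A shared cluster only needs to cover the peak of the combined demand.
shared = capacity_needed(service_demand, batch_demand)

total_work = sum(service_demand) + sum(batch_demand)
hours = len(service_demand)


def utilisation(capacity):
    """Fraction of provisioned core-hours that is actually used."""
    return total_work / (capacity * hours)


print(f"separate clusters: {separate} cores, {utilisation(separate):.0%} utilised")
print(f"shared cluster:    {shared} cores, {utilisation(shared):.0%} utilised")
```

With these made-up numbers the shared cluster needs 110 cores rather than 180, and utilisation rises from roughly 58% to 95%. Real demand is rarely this neatly complementary, so the actual gain depends on how the workloads' peaks overlap.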
A perfect partner for Docker
There is no denying the buzz around Docker, and this is often for good reason - the technology provides a great way to package and run (micro)services and applications. With frameworks like Mesosphere’s Marathon and Apache Aurora providing first-class support for running Docker containers, this is another reason for utilising Mesos. Marathon also provides machinery for dynamically allocating capacity (for example, reacting to application/infrastructure metrics), enforcing resource constraints, enabling blue/green deploys, and more, and we have utilised this with several clients already.
Many of the organisations building PaaS and ‘big data’ frameworks on top of Mesos are choosing to use Docker as their unit of deployment, which I believe adds credibility to the marriage of the two technologies. Docker Inc. also announced at the recent DockerCon that the experimental version of Docker Swarm has native support for Mesos via a custom Swarm cluster driver. It would appear that Docker Inc. is seeing the potential for Mesos too...
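To give a feel for what running a Docker container on Marathon involves, here is a minimal sketch that builds an application definition as JSON. The field names follow Marathon's app definition format as I understand it at the time of writing, so please check them against the documentation for your Marathon version; the application id, image name and port are hypothetical, and the `docker_app` helper is my own invention rather than part of any Marathon client.

```python
import json


def docker_app(app_id, image, instances=2, cpus=0.5, mem=256):
    """Build a Marathon-style app definition for a Docker container.

    Assumption: field names mirror Marathon's app JSON; verify for your version.
    """
    return {
        "id": app_id,
        "instances": instances,  # how many copies to keep running
        "cpus": cpus,            # CPU share per instance (resource constraint)
        "mem": mem,              # memory per instance, in MB
        "container": {
            "type": "DOCKER",
            "docker": {
                "image": image,
                "network": "BRIDGE",
                # hostPort 0 asks for a dynamically assigned host port
                "portMappings": [{"containerPort": 8080, "hostPort": 0}],
            },
        },
    }


# Hypothetical service name and image, for illustration only.
app = docker_app("/example-service", "example/service:1.0")
payload = json.dumps(app, indent=2)
print(payload)
```

A definition like this would typically be submitted to Marathon's REST API; the sketch deliberately stops at producing the JSON so that it makes no assumptions about your cluster's address or authentication.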
Provides a clear definition of the 'platform'
With the success of the DevOps movement we are seeing the integration of developer, QA and operator roles. DevOps encourages shared understanding, responsibility and accountability, but doesn't remove the need for specialist roles, particularly when running complex software systems at scale. At times there still needs to be a clear definition (dare I say, boundary) of the applications and the platform, so that when something fails the right people are notified and can begin diagnosing the issue. I learnt a lot about this from a presentation at QCon NY by Nori Heikkinen on how Google deals with failure.
Some examples of modern specialist roles include the 'platform' engineer or site reliability engineer (SRE). These people focus on the platform and infrastructure to which software systems are deployed. For teams that utilise Mesos, the definition of the platform is obvious - the development teams package applications (for example within Docker images, combined with metadata), and even though developers may be responsible for running applications in production, the SREs can focus on the surrounding/underlying infrastructure, i.e. everything managed by Mesos. There is an interesting case study on the Mesosphere website that explores in more detail the creation of a platform based on Mesos at HolidayCheck, a popular German travel site.
Flexible APIs
As mentioned above, Mesos provides a ‘kernel for the datacenter’, and accordingly schedulers (or frameworks in Mesos parlance) can run on top of this in order to utilise the exposed compute resource. The APIs available to schedulers are simple but powerful, which allows framework developers to create their own clean and cohesive APIs. Developers often comment that the APIs exposed by Mesos and frameworks like Mesosphere’s Marathon or Apache Aurora are easier to use and less opinionated than those provided by a typical commercial PaaS, but sit at a higher level of abstraction than IaaS, and are hence more useful for developers.
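The core of the scheduler-facing API is the resource offer: Mesos offers a framework the spare resources on a node, and the framework's scheduler decides whether to launch tasks on them or decline. The following is a toy model of that interaction in plain Python. It does not use the real Mesos bindings, and every class, method and node name in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    node: str
    cpus: float
    mem: int  # MB


@dataclass
class Task:
    name: str
    cpus: float
    mem: int  # MB


class ToyScheduler:
    """Hypothetical framework scheduler: uses offers that fit its pending tasks."""

    def __init__(self, pending):
        self.pending = list(pending)
        self.launched = []  # (task name, node) pairs

    def resource_offers(self, offers):
        """Launch tasks on offers that fit; return the nodes whose offers were declined."""
        declined = []
        for offer in offers:
            cpus, mem = offer.cpus, offer.mem
            used = False
            for task in list(self.pending):  # iterate over a copy while removing
                if task.cpus <= cpus and task.mem <= mem:
                    cpus -= task.cpus
                    mem -= task.mem
                    self.pending.remove(task)
                    self.launched.append((task.name, offer.node))
                    used = True
            if not used:
                declined.append(offer.node)
        return declined


scheduler = ToyScheduler([Task("web", 1.0, 512), Task("worker", 2.0, 1024)])
declined = scheduler.resource_offers([
    Offer("node-1", 0.5, 256),   # too small for either task
    Offer("node-2", 4.0, 4096),  # large enough for both
])
print(scheduler.launched, declined)
```

The point of the sketch is the division of labour: the platform decides what to offer, and each framework keeps its own placement logic, which is what lets framework authors build their own cohesive APIs on top.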
Amazon Web Services have also released an experimental EC2 Container Service (ECS) scheduler driver that demonstrates how Mesos could be integrated with ECS at the API/framework level.
Elasticity
The ability to scale to meet demand is almost mandatory for any popular software system that is accessible via the web. Mesos enables this on two levels: frameworks such as Marathon and Aurora can dynamically scale the supply of application services (see the comments above in the mixed workload section), and the underlying infrastructure that is exposed as compute resource can itself be scaled. Implementing the triggers (and associated metrics and calculations) for scaling with Mesos isn’t always trivial, and this is where additional work is required, but Google have also talked about the difficulty of doing this with Kubernetes at a ‘utility computing’ scale (i.e. providing Kubernetes as a utility outside of Google, where demand and use cases are unknown).
Elasticity also allows for fault-tolerance. For example, if a machine within a Mesos cluster fails, the schedulers can move the applications and services that were utilising this resource to somewhere else within the cluster (provided, of course, that there is spare capacity).
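The spare-capacity caveat is easy to see in a small simulation. This is a toy sketch, not how any particular Mesos framework implements failover: the `reschedule` function, the node names and the CPU figures are all hypothetical, and placement is a naive first fit on CPU alone.

```python
def reschedule(placements, capacity, failed):
    """Move tasks off a failed node onto nodes with spare CPU (first fit).

    placements: task name -> (node, cpus); capacity: node -> total cpus.
    Returns (new placements, tasks that could not be placed).
    """
    # Work out how much CPU is still free on each surviving node.
    free = {node: cpus for node, cpus in capacity.items() if node != failed}
    for node, cpus in placements.values():
        if node != failed:
            free[node] -= cpus

    new, stranded = {}, []
    for task, (node, cpus) in placements.items():
        if node != failed:
            new[task] = (node, cpus)  # unaffected task stays where it is
            continue
        target = next((n for n, spare in free.items() if spare >= cpus), None)
        if target is None:
            stranded.append(task)     # no spare capacity anywhere
        else:
            free[target] -= cpus
            new[task] = (target, cpus)
    return new, stranded


capacity = {"node-1": 4, "node-2": 4, "node-3": 4}
placements = {
    "web": ("node-1", 2),
    "api": ("node-1", 2),
    "db": ("node-2", 3),
    "batch": ("node-3", 1),
}
new, stranded = reschedule(placements, capacity, failed="node-1")
print(new, stranded)
```

In this run "web" finds a new home on node-3, but "api" is left stranded because no surviving node has two CPUs free, which is exactly why headroom needs to be planned into the cluster.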
There are many other reasons (so get in touch!)...
We could wax lyrical about Mesos for quite some time, but hopefully you are starting to see the benefits. The fact that the core technology is open source (allowing you to examine and extend the code) and has been battle-hardened by running at scale in production at many of the organisations mentioned above only adds to the attraction. Although configuring Mesos can be complex at times, companies such as Mesosphere are striving to make Mesos more accessible for the everyday developer and operator, and their recently launched Datacenter Operating System (DCOS) goes a long way towards this.
If you’re interested in experimenting with Mesos and the DCOS, I highly recommend visiting Mesosphere’s website. At Container Solutions we’ve been using Mesos for quite some time, and so we are always happy to share notes on the use of the technology, or to help organisations explore how Mesos can be put to good use for application deployment, continuous delivery and operating at scale.