We use Mesos a lot. Developing with Mesos is a full-time job for a large proportion of Container Solutions. From the perspective of a production user, the abstraction of physical hardware into resources is a fantastic idea. But in this post, I want to discuss how usability can affect Mesos development, and how some ongoing issues are causing frustration and are potentially preventing user adoption.
Developing Mesos Applications
A large number of production projects could be accomplished using off-the-shelf components and standard microservices. But for some applications, it makes sense to take advantage of the functionality of Mesos. For example, in the Elasticsearch (ES) Mesos project we have developed a sophisticated scheduler that manages the lifecycle of the ES nodes. In a previous blog post, I wrote about why you might want to develop a Mesos Framework.
Writing a Framework
We write most of our frameworks using Java. But it is quite feasible to use other languages that Mesos has interfaces for (e.g. Go, Python or C++). The first thing that you will want to do is to learn the APIs. The implementation of the scheduler interface is refreshingly simple. There are really only a handful of methods that need to be implemented in order to have a working framework. Up to this point, development is very fast. You will find that you can have a working framework within a day.
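To give a feel for how little logic a first framework needs, here is a minimal sketch of the decision a scheduler makes when it receives resource offers (in the real Java bindings this happens in the `resourceOffers` callback). It deliberately avoids the Mesos library: `SimpleOffer` and `OfferMatcher` are hypothetical stand-ins invented for this illustration, not Mesos types, and real offers carry more than CPU and memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Mesos resource offer; the real type is a protobuf message.
final class SimpleOffer {
    final String agentHost;
    final double cpus;
    final double memMb;

    SimpleOffer(String agentHost, double cpus, double memMb) {
        this.agentHost = agentHost;
        this.cpus = cpus;
        this.memMb = memMb;
    }
}

// Hypothetical helper: decides which offers are large enough to run one task.
final class OfferMatcher {
    private final double requiredCpus;
    private final double requiredMemMb;

    OfferMatcher(double requiredCpus, double requiredMemMb) {
        this.requiredCpus = requiredCpus;
        this.requiredMemMb = requiredMemMb;
    }

    // An offer fits when it has at least the required CPU and memory.
    boolean fits(SimpleOffer offer) {
        return offer.cpus >= requiredCpus && offer.memMb >= requiredMemMb;
    }

    // Returns the hosts whose offers fit; a real scheduler would decline the rest.
    List<String> acceptableHosts(List<SimpleOffer> offers) {
        List<String> hosts = new ArrayList<>();
        for (SimpleOffer offer : offers) {
            if (fits(offer)) {
                hosts.add(offer.agentHost);
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        OfferMatcher matcher = new OfferMatcher(1.0, 1024.0);
        List<SimpleOffer> offers = Arrays.asList(
                new SimpleOffer("agent-1", 0.5, 512.0),
                new SimpleOffer("agent-2", 2.0, 2048.0));
        System.out.println(matcher.acceptableHosts(offers)); // prints [agent-2]
    }
}
```

In a real framework the accepted offers would be turned into task launches and the others declined; that part depends on the Mesos bindings and is left out here.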
Testing your framework
However, this is where things start to slow down. The next problem is testing your framework. Mesos is designed in such a way that Master and Agent binaries should be running on their respective nodes. The binaries are written in C++, which means they need to be compiled and packaged for each architecture. If you are not running Ubuntu, simply getting Mesos to start can be incredibly difficult for a new user (*). Docker images make this easier, but this now places the requirement of using Docker in your application. We do this in our minimesos project to make it easier for users to start a Mesos cluster on their machine.
So we now have a working cluster, either in Docker or Vagrant or on some real hosts. Depending on what your framework does, your next problem will be with networking. Mesos doesn't have any network capabilities built in. This is the user's/developer's responsibility. At first, it seems like this isn't a problem, because Mesos has a concept of Framework messages (*). It is possible (but not advisable, see later) to do all your custom communication through the framework messages. But to do this, you must serialise your data to be compatible with the Protocol Buffer format. Also, to receive any custom data, your framework has to implement a custom Executor (the interface that represents the task that runs on the Agents).
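As an illustration of the serialisation step, here is a minimal sketch of turning a small message into bytes and back. It assumes the payload is handed to the bindings as a byte array, and it uses plain `DataOutputStream` rather than generated Protocol Buffer classes purely to stay dependency-free. `NodeStatusMessage` and its fields are hypothetical names invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical message type, invented for illustration only.
final class NodeStatusMessage {
    final String nodeName;
    final int httpPort;

    NodeStatusMessage(String nodeName, int httpPort) {
        this.nodeName = nodeName;
        this.httpPort = httpPort;
    }

    // Encodes the fields in a fixed order: name first, then port.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(nodeName);
            out.writeInt(httpPort);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException("In-memory write should not fail", e);
        }
    }

    // Decodes the fields in the same order they were written.
    static NodeStatusMessage fromBytes(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            String name = in.readUTF();
            int port = in.readInt();
            return new NodeStatusMessage(name, port);
        } catch (IOException e) {
            throw new IllegalArgumentException("Malformed payload", e);
        }
    }
}
```

Both sides have to agree on the encoding, which is one more thing that couples the scheduler to the executor.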
Executors are useless
In my opinion, a custom Executor is never required. Except when it is (*). But for the vast majority of projects, using the default implementation of an Executor is more than enough. If you can run your application as a binary or as a Docker image, then you can use the default Executor (*). The recent book by David Greenberg, "Building Applications on Mesos", echoes my sentiment when introducing the chapter on Executors:
"You probably don't want to do this"!
By removing the executor you can save yourself approximately 1000-2000 lines of code (Java) and more importantly, decouple your framework from the application that you are trying to run.
A Development Example
Once we have decided we don't want to use a custom Executor, this means we can't use the framework messages. So we need to communicate with our application (e.g. health checks) over the application's standard interface (e.g. REST). Sometimes there are elements within the Protocol Buffer format that can achieve what you want to do (*). For example, there is an element within the TaskInfo (*) to specify a CommandInfo (*) to perform a HealthCheck (*). After looking through the code, it becomes apparent that there are two different health checks depending on whether you are starting a Docker image, or a shell command. Looking at the mesos.proto file, which defines the Protocol Buffer interface, it is possible to see that the HealthCheck can accept an HTTP endpoint. After you have compiled (and built the Docker image if using Docker) and tested this on your cluster, the first error message you will see is that you haven't specified a Port. The second time you compile and test you will find a message saying words to the effect of "Sorry, this hasn't been implemented yet". If you go back to the code, you will find that the only method implemented is the command method (*). If you repeat this process again, you'll find that if you are using any version from 0.23-0.25, it will cause a cracking C stack trace.
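For comparison, checking health over the application's own REST interface needs nothing from Mesos at all. The following is a minimal sketch using only the JDK. `HttpHealthProbe` is a hypothetical helper, treating any 2xx status as healthy is an assumption, and the URL in `main` is only an example (9200 is the default Elasticsearch HTTP port).

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper: probes an HTTP endpoint and reports whether it looks healthy.
final class HttpHealthProbe {

    // Assumption for this sketch: any 2xx status counts as healthy.
    static boolean isHealthyStatus(int status) {
        return status >= 200 && status < 300;
    }

    // Returns false on a non-2xx status, a malformed URL, a timeout or a refused connection.
    // Expects an http or https URL.
    static boolean probe(String url, int timeoutMillis) {
        HttpURLConnection connection = null;
        try {
            connection = (HttpURLConnection) URI.create(url).toURL().openConnection();
            connection.setRequestMethod("GET");
            connection.setConnectTimeout(timeoutMillis);
            connection.setReadTimeout(timeoutMillis);
            return isHealthyStatus(connection.getResponseCode());
        } catch (IOException | IllegalArgumentException e) {
            return false;
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        String url = args.length > 0 ? args[0] : "http://localhost:9200/";
        System.out.println(url + " healthy: " + probe(url, 2000));
    }
}
```

A scheduler could call something like this on a timer for each task it has launched, at the cost of having to know each task's host and port.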
Time is Money
The previous walkthrough was intended to provide an opinionated demonstration of development with Mesos. It took the example of writing a health check for a task. But it could equally be any number of what most would consider a simple use case.
As developers, our service is exchanged for money/recognition/philanthropic happiness. Whatever or whoever is paying, the most valuable resource we have is time. When we spend more time than expected on a task, everyone in the chain suffers. Developers suffer by reducing their output. Clients suffer because they have to pay more to do the same thing. Applications and services suffer because they receive fewer features. When developing with Mesos, what eats the time?
Documentation and usability
Throughout this post, you may have noticed the (*)'s everywhere. Every time you see one of these, it means there is little or no documentation about that particular subject. Not in the code, nor in user manuals or in wikis. Developers have to infer usage from the code.
Some developers abhor writing documentation. But documentation is absolutely necessary for a public project to survive. And documentation doesn't have to mean reams of wiki pages or thousands of comments. Examples, demos, little snippets of information all save the developer time. Take the health check example above. An example of a few lines of Protocol Buffer builder construction that is known to be working would have saved me days of time. I would have realised straight away that the "working" example doesn't work, which probably means there is a bug.
But documentation is more than this. There is a visual aesthetic too. The way the documentation is currently presented gives the impression that the project isn't professional. It seems like it is the result of a cohort of OSS warriors, not ready for production. Compare this to other projects of comparable scope: Kubernetes, Docker, etc.
The use of C++
I'm unsure of the original reasons for choosing C++ but I suspect it causes more problems than it solves. The first problem is the build system. The code must be built for the architecture that it is to be run on. This means if you can't use a prepackaged binary (Mesosphere produce binaries for Ubuntu, Debian and CentOS), then you will have to build it yourself. The build scripts themselves require a PhD to investigate, and if you ever need to build anything against the code (e.g. a Module), I would suggest you don't. It takes a long time to get anything building.
There is a subsequent difficulty, in that the code relies on a number of nonstandard libraries. Libprocess adds an Actor-like model to C++. Which is a great idea, but I'm a little worried about its lack of use. And because the code is asynchronous, it becomes very hard to follow. Since the code is in effect the documentation, it makes the task even more difficult.
Mesos is released often. Which is great. New features are added all the time. But it seems that version compatibility is not a priority. How coupled an application is to Mesos will differ between software. But we have not had an occasion where our software works on more than one version of Mesos. This is somewhat embarrassing and results in a number of confusing bug reports (remember the C stack traces?) that turn out to be a difference in version. And even though Protocol Buffers can handle version changes reasonably well, on many occasions version compatibility has been prevented due to a non-existent field in the application or Mesos (depending on whether you are going up or down in version).
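One way to turn confusing version-mismatch bug reports into a clear message is for the framework to check the Mesos version at startup and fail fast. Below is a minimal sketch of that check. `MesosVersion` is a hypothetical helper, the supported range is just an example, and how the version string is obtained from the cluster is left out. It only handles plain numeric versions such as "0.25.0"; a suffix like "-rc1" would be rejected.

```java
// Hypothetical helper for comparing Mesos version strings such as "0.25.0".
final class MesosVersion implements Comparable<MesosVersion> {
    final int major;
    final int minor;
    final int patch;

    MesosVersion(int major, int minor, int patch) {
        this.major = major;
        this.minor = minor;
        this.patch = patch;
    }

    // Parses "major.minor" or "major.minor.patch"; a missing patch counts as 0.
    static MesosVersion parse(String text) {
        String[] parts = text.trim().split("\\.");
        if (parts.length < 2 || parts.length > 3) {
            throw new IllegalArgumentException("Unexpected version format: " + text);
        }
        int patch = parts.length == 3 ? Integer.parseInt(parts[2]) : 0;
        return new MesosVersion(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]), patch);
    }

    // Compares numerically, so 0.9.0 sorts before 0.10.0.
    @Override
    public int compareTo(MesosVersion other) {
        if (major != other.major) {
            return Integer.compare(major, other.major);
        }
        if (minor != other.minor) {
            return Integer.compare(minor, other.minor);
        }
        return Integer.compare(patch, other.patch);
    }

    // True when this version lies in the inclusive range [lowest, highest].
    boolean isWithin(MesosVersion lowest, MesosVersion highest) {
        return compareTo(lowest) >= 0 && compareTo(highest) <= 0;
    }

    public static void main(String[] args) {
        MesosVersion lowest = parse("0.23.0");
        MesosVersion highest = parse("0.25.0");
        String reported = args.length > 0 ? args[0] : "0.26.0";
        if (!parse(reported).isWithin(lowest, highest)) {
            System.err.println("Mesos " + reported + " is outside the tested range 0.23.0 to 0.25.0");
        }
    }
}
```

This does not make the software compatible with more versions, but it does tell the user what went wrong before a stack trace does.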
Related to this is the issue of coupling. Your application's coupling to Mesos will depend on the interface chosen. Take frameworks as an example. Despite a very small interface to produce a Scheduler and an Executor, it requires a developer to import the entire Mesos codebase (mainly due to the protocol buffers). This means that the full Mesos library is required on every machine that could be running Scheduler/Executor code. This is a particular problem for developers because it means they have to have the Mesos libraries on their development system. For example, the process of building Mesos for OSX is tedious. Mesos have begun to address this problem with an HTTP interface to the scheduler. Because our code is so coupled to the Mesos binaries, we haven't had the chance to try this yet. But if starting a framework for the first time, I would definitely recommend using this instead of the Java/Python/C++ interfaces.
I suspect this is an issue with any distributed system, but logs will be distributed to several different physical locations and on several different levels. A new Mesos developer really struggles to get to grips with the logs. Even in the HealthCheck example above, I was looking for logs in the wrong place for a significant amount of time.
There are two kinds of Mesos log, the Master log and the Slave (Agent) log. Depending on where your application fails, it could produce a log in either of these two. For the slave logs, there is one for each of the Agents in the cluster. In HA mode, there may be more than one Master log. Then there are application logs, which go to the Mesos sandbox. One for stdout and one for stderr. Then there are application specific logs like Docker logs, or system logs, etc.
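To make the sandbox part concrete, here is a small sketch that builds the path to a task's stdout and stderr files on an Agent. It assumes the sandbox layout described in the Mesos sandbox documentation (under the Agent's work directory, with a `latest` link to the most recent run); the layout may differ between versions, and `SandboxPaths` and the IDs in `main` are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: builds sandbox paths under an Agent's work directory.
final class SandboxPaths {

    // Assumed layout: <workDir>/slaves/<agent>/frameworks/<framework>/executors/<executor>/runs/latest
    static Path sandboxDir(String workDir, String agentId, String frameworkId, String executorId) {
        return Paths.get(workDir, "slaves", agentId, "frameworks", frameworkId,
                "executors", executorId, "runs", "latest");
    }

    static Path stdout(String workDir, String agentId, String frameworkId, String executorId) {
        return sandboxDir(workDir, agentId, frameworkId, executorId).resolve("stdout");
    }

    static Path stderr(String workDir, String agentId, String frameworkId, String executorId) {
        return sandboxDir(workDir, agentId, frameworkId, executorId).resolve("stderr");
    }

    public static void main(String[] args) {
        // All four values are placeholders; real IDs come from the cluster.
        System.out.println(stdout("/tmp/mesos", "agent-id", "framework-id", "executor-id"));
        System.out.println(stderr("/tmp/mesos", "agent-id", "framework-id", "executor-id"));
    }
}
```

The files live on whichever Agent ran the task, so the path is only half of the problem; you still need to know which machine to look on.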
When debugging a user's problem, or when starting development, finding where an error appears or where a log message has gone is a significant task. Potentially frameworks like Mesos Logstash can help here.
I have heard a number of people raise the issue of how hard it is to contribute to Mesos. Open source projects rely on the community. If there are any barriers, however small, to this process the number of contributors will drop. From an outsider's perspective, the obfuscated commit process and the Apache processes (many would say "bureaucracy") are a huge barrier for commits. Any commit process that requires 1500 words to describe it is too long.
This post has been rather scathing, but it is intended to be constructive: to highlight the importance of usability and to inspire Mesos to re-focus on it.
Right now, there are a number of major issues that are preventing users from choosing Mesos, preventing developers from producing feature-rich software and reducing contributions. Many of these issues are fixable. Most issues can be fixed with a concerted effort to improve the look, feel and content of the documentation.
Version compatibility is an easy fix. It simply takes development effort. The Mesos developers should always remember that there are others who rely on them. If they make breaking changes, then it should be clearly communicated. If they make breaking changes often, then they risk users using old versions and ultimately leaving for a competitor.
Some of the other issues may not be fixable. For example, there was probably a justifiable reason for using C++ and the effort required to rewrite everything is probably too great. But whilst this is a reason, it does not remove the problems of trying to implement and debug software on a complex C++ based system.
To repeat the comment at the start, we use Mesos a lot. We use it because it has a brilliant ability. To think of a cluster of machines as one single set of "resources" is a very powerful abstraction. It enables a large cluster of machines to handle a workload that is as varied as it is resilient. I believe that Mesos is a system capable of providing up-times that are very close to 100%. Add to this the flexibility of being able to request and assign work for any imaginable job. This is the future.
(*): This is not a footnote. See the "Documentation and usability" section.