In our effort to simplify and speed up the MiniMesos testing framework, we decided to move away from a Docker-in-Docker implementation to one where each node runs in its own container. We thought this was the right way to run a local Mesos cluster, and we assumed it would make the unit tests more transparent, simpler and faster. In theory it should have. In practice, however, it proved quite difficult and made us scratch our heads more often than I'd like to admit.
Why run Mesos cluster nodes in containers?
When developing the ElasticSearch and Logstash frameworks, we were frustrated by the length of the development cycle and by the time it took to spin up the cluster when we ran our automated tests.
We used the internal Docker registry to store the executor and scheduler Docker images. We also used a proxy container to allow communication between the nodes.
We could have also used a VM to run the cluster, but for unit testing this was not feasible. (And running unit tests regularly is an aim of ours.)
In the end, this is the picture we were striving towards:
docker ps
CONTAINER ID   IMAGE                                             COMMAND                 CREATED          STATUS          PORTS                          NAMES
fe9f4e4ad55d   containersol/mesos-agent:0.22.1-1.0.ubuntu1404    "mesos-slave"           10 seconds ago   Up 9 seconds    5051/tcp                       mini-mesos-slave-fa05fae3-2b4b-4ef8-b196-a11783252888
8bcba897710d   containersol/mesos-agent:0.22.1-1.0.ubuntu1404    "mesos-slave"           12 seconds ago   Up 11 seconds   5051/tcp                       mini-mesos-slave-05c2ae7b-503b-4f7a-954a-80b4bdb919dc
95507434ff71   containersol/mesos-agent:0.22.1-1.0.ubuntu1404    "mesos-slave"           14 seconds ago   Up 13 seconds   5051/tcp                       mini-mesos-slave-45fffd6d-9611-4334-b43d-fd38370eb226
9cad000d912b   containersol/mesos-master:0.22.1-1.0.ubuntu1404   "mesos-master --regi"   15 seconds ago   Up 14 seconds   5050/tcp                       mini_mesos_cluster-753556771
539b396db040   jplock/zookeeper:3.4.5                            "/opt/zookeeper-3.4."   21 seconds ago   Up 16 seconds   2181/tcp, 2888/tcp, 3888/tcp   zookeeper-330863558
Mesos slave (now agent) and master Docker images
We are using the Docker Containerizer, which means that the agent nodes have access to the host's Docker client binary, Docker socket and cgroups hierarchy.
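As a rough illustration of what that access implies, the sketch below assembles a `docker run` argument list for an agent container. The `agent_run_args` helper, the host paths and the use of `--privileged` are assumptions made for illustration and are not taken from the MiniMesos code; only the image name comes from the `docker ps` output above.

```python
def agent_run_args(name, image="containersol/mesos-agent:0.22.1-1.0.ubuntu1404"):
    """Build a `docker run` argument list for a Mesos agent (hypothetical helper)."""
    mounts = [
        "/var/run/docker.sock:/var/run/docker.sock",  # host Docker socket
        "/usr/bin/docker:/usr/bin/docker",            # host Docker client (path assumed)
        "/sys/fs/cgroup:/sys/fs/cgroup",              # host cgroups hierarchy
    ]
    # --privileged is an assumption; a real setup may need fewer privileges.
    args = ["docker", "run", "-d", "--name", name, "--privileged"]
    for mount in mounts:
        args += ["-v", mount]
    # MESOS_CONTAINERIZERS is the environment form of the agent's --containerizers flag.
    args += ["-e", "MESOS_CONTAINERIZERS=docker,mesos", image]
    return args


if __name__ == "__main__":
    print(" ".join(agent_run_args("mini-mesos-slave-1")))
```

Building the command as a list rather than a single string keeps each flag easy to inspect in a test before anything is actually started.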
Shared PID namespace
The first problem arises because Mesos tracks executor containers by PID. When an agent starts a task, it asks Docker for the PID of the container and then looks for that process in /proc. Because the agent runs in its own container, with its own PID namespace, it does not see the process there, assumes it is dead, kills the container straight away and marks the task as TASK_LOST. See the relevant Mesos JIRA issue for details.
The issue confused us even more because most of the time the executor task logs were empty, but sometimes the application managed to emit a few log statements before inevitably being killed by Mesos.
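The failure can be pictured as a liveness check against /proc. The sketch below is not Mesos code (Mesos is written in C++); `pid_visible` and PID 4242 are hypothetical, and two temporary directories stand in for the host's and the agent container's views of /proc.

```python
import os
import tempfile


def pid_visible(pid, proc_root="/proc"):
    """Return True if `pid` has an entry under `proc_root`."""
    return os.path.isdir(os.path.join(proc_root, str(pid)))


# Simulate two separate PID namespaces with two directories.
with tempfile.TemporaryDirectory() as host_proc, tempfile.TemporaryDirectory() as agent_proc:
    os.mkdir(os.path.join(host_proc, "4242"))    # the executor exists on the host
    on_host = pid_visible(4242, host_proc)       # the host sees the process
    in_agent = pid_visible(4242, agent_proc)     # the agent's own /proc has no such entry
```

The PID Docker reports is valid on the host, but a process looking only at its own namespace finds nothing under that number and concludes the executor is gone.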
We tried mounting /proc as a volume, but Docker does not allow it. Luckily, since Docker 1.5 there is an option (--pid=host) to share the host's PID namespace with a container, which resolved the immediate issue.
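A minimal sketch of applying that option, assuming the agent is started from a `docker run` argument list; `with_host_pid_namespace` is a hypothetical helper, while `--pid=host` is the Docker flag for sharing the host's PID namespace.

```python
def with_host_pid_namespace(run_args):
    """Return a copy of a `docker run` argument list with `--pid=host` added."""
    if run_args[:2] != ["docker", "run"]:
        raise ValueError("expected an argument list starting with 'docker run'")
    if "--pid=host" in run_args:
        return list(run_args)  # already present, nothing to add
    # Options must come before the image name, so insert right after `docker run`.
    return run_args[:2] + ["--pid=host"] + run_args[2:]


if __name__ == "__main__":
    print(with_host_pid_namespace(["docker", "run", "-d", "some-agent-image"]))
```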
Now Mesos can properly keep track of executor containers. Great.
Prefixed Mesos containers
By an unfortunate accident, we used the "mesos-" prefix when naming our containers. It turns out that Mesos itself uses this prefix to identify the containers it manages, so it wrongfully killed all of ours. The solution was simple: change the container name prefix.
Now that the agents and master know about each other through the ZooKeeper node, the agents can create executor tasks and keep track of them. Awesome.
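One way to guard against this kind of clash is to check generated names before starting a container. The sketch below is an assumption about how such a guard could look, not the MiniMesos implementation; `safe_container_name` is hypothetical, and the "mini-mesos-" default mirrors the names in the `docker ps` output above.

```python
MESOS_PREFIX = "mesos-"  # prefix Mesos uses for the containers it manages


def safe_container_name(role, unique_id, prefix="mini-mesos-"):
    """Build a container name and refuse any that Mesos would treat as its own."""
    name = f"{prefix}{role}-{unique_id}"
    if name.startswith(MESOS_PREFIX):
        raise ValueError(f"{name!r} starts with the reserved prefix {MESOS_PREFIX!r}")
    return name


if __name__ == "__main__":
    print(safe_container_name("slave", "fa05fae3"))
```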
libprocess and communication between Mesos master and the executors
The next problem we faced was communication between the executor tasks and the Mesos master.
By default, the executor task is started in host networking mode and libprocess binds to a randomly chosen port. Because the task shares its network with the host, however, libprocess tries to bind to a loopback interface, which the master cannot reach from its own container.
There are two options to resolve this:
1) Detect the IP address of the mesos-agent container and bind libprocess on that address.
2) Ask Mesos to run the executor tasks in bridge networking mode instead of host, thus giving the executor task its own IP. This is described in more detail in the Apache Mesos JIRA.
We went for the former by setting the LIBPROCESS_IP environment variable before starting an executor. Note that this choice is up to the framework itself, which makes it in the task configuration that starts the executor.
Researching and solving these issues allowed us to run the full Mesos cluster in Docker containers. It gave us the performance boost we hoped for, removed a lot of complexity and gave us invaluable insight into how the Mesos Containerizer works.
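The first option can be sketched roughly as follows. `detect_container_ip` and `executor_env` are hypothetical helpers, and resolving the container's own hostname to find its address is an assumption that holds for typical Docker setups but is not guaranteed; LIBPROCESS_IP is the environment variable named above.

```python
import ipaddress
import socket


def detect_container_ip():
    """Best-effort lookup of this container's address via its hostname (assumption)."""
    return socket.gethostbyname(socket.gethostname())


def executor_env(agent_ip):
    """Return the environment entries that pin libprocess to `agent_ip`."""
    addr = ipaddress.ip_address(agent_ip)  # raises ValueError on malformed input
    if addr.is_loopback:
        # A loopback address is exactly what we are trying to avoid.
        raise ValueError("LIBPROCESS_IP must not be a loopback address")
    return {"LIBPROCESS_IP": str(addr)}


if __name__ == "__main__":
    # 172.17.0.5 is an example address on Docker's default bridge network.
    print(executor_env("172.17.0.5"))
```

A framework would merge the returned dictionary into the environment of the task that starts the executor. Rejecting loopback addresses up front turns a silent connectivity problem into an immediate error.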