Say that you have a set of services up & running in your Swarm cluster. Sooner rather than later there will come a time where you want to upgrade the version of your services. This most likely means that you will want to deploy a new set of containers with the upgraded version of your software.
A common approach in the industry is to set your website in Maintenance mode. This implies several things:
If you do this manually, you must follow these steps by the book, otherwise your scheduled downtime would be way longer.
If you automate this with some sort of orchestration tool, or even with bash scripts, you reduce the downtime dramatically, yet you still have downtime.
In the past, this strategy made sense because creating new servers with the new version of your software was a rather slow process. Not today.
A better strategy is to spin up new servers, provision them with the new version of your software, and confirm with some sort of smoke tests that the application is running again. Once the tests pass, commission the new servers into your load balancers and decommission the old servers (and probably destroy them too). This way you avoid downtime altogether.
Today, we can achieve zero-downtime deployments in just a matter of seconds thanks to the fact that containers are cheap to both produce and move around. Let's see how we can do this using Docker Swarm.
For the sake of simplicity we're going to use the containersol/hello-world container image, which serves a web application that renders the hostname of the machine where it's running. Since it runs in a container, it shows the hostname of the container itself.
I also assume that you have already created a Swarm cluster.
docker service create -d --name backend -p 8080:80 containersol/hello-world:6f045da67c52ed22f745211612fa90462c4f5e38
(out) rcm7klgspe2dgwgf1n8pxghur
To verify that the service has been deployed:
docker service ls
(out) ID NAME MODE REPLICAS IMAGE PORTS
(out) rcm7klgspe2d backend replicated 1/1 containersol/hello-world:6f045da67c52ed22f745211612fa90462c4f5e38 *:8080->80/tcp
You can also see the actual container that was deployed:
docker ps
(out) CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
(out) c150f0e84a92 containersol/hello-world:6f045da67c52ed22f745211612fa90462c4f5e38 "./main" 40 seconds ago Up 39 seconds 80/tcp backend.1.ei6195x8xovza2msbfb8s3v31
To verify that the service is actually running, visit http://localhost:8080
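If you would rather script this check than open a browser, a minimal sketch with curl could look like the following. The helper name wait_for_http and its defaults are made up for illustration; it only assumes that curl is installed and that the service answers on the published port.

```shell
# Hypothetical helper: succeeds as soon as the URL stops answering with an
# error (curl -f treats HTTP status >= 400 as a failure).
# Arguments: URL, number of attempts (default 10), pause in seconds (default 1).
wait_for_http() {
  wfh_url="$1"
  wfh_attempts="${2:-10}"
  wfh_pause="${3:-1}"
  wfh_i=1
  while [ "$wfh_i" -le "$wfh_attempts" ]; do
    # -s: silent, -o /dev/null: discard the body, --max-time: don't hang
    if curl -fs -o /dev/null --max-time 2 "$wfh_url"; then
      echo "up after $wfh_i attempt(s)"
      return 0
    fi
    wfh_i=$((wfh_i + 1))
    sleep "$wfh_pause"
  done
  echo "still down after $wfh_attempts attempt(s)" >&2
  return 1
}

# Usage: wait_for_http http://localhost:8080
```

The non-zero exit status on failure makes it easy to plug into a deployment script as a crude smoke test.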
A new version of the application is ready to be shipped: 18101e645ee3d9b1de302164bb31f907a8282349.
To release this new version:
docker service update -d --image containersol/hello-world:18101e645ee3d9b1de302164bb31f907a8282349 backend
(out) backend
Let's verify that Swarm actually picked up the update command and is now (hopefully) running the desired version of the container image:
docker service ls
(out) ID NAME MODE REPLICAS IMAGE PORTS
(out) bkcv1ztj8vir backend replicated 1/1 containersol/hello-world:18101e645ee3d9b1de302164bb31f907a8282349 *:8080->80/tcp
If you were monitoring the service list in another window you would have noticed early in the update process an output similar to this:
docker service ls
(out) ID NAME MODE REPLICAS IMAGE PORTS
(out) rcm7klgspe2d backend replicated 0/1 containersol/hello-world:18101e645ee3d9b1de302164bb31f907a8282349 *:8080->80/tcp
Notice how the REPLICAS field says 0/1. This is because at that point Swarm was still busy getting a hold of the desired container image in order to replace the old one with the new one.
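Instead of re-running docker service ls by hand, you could poll the REPLICAS column from a script. The sketch below is one way to do it; the function names are hypothetical, and it assumes the column has the plain running/desired shape shown above (for example 0/1 or 1/1). Note that the name filter matches by prefix, so a service called backend2 would also show up.

```shell
# Hypothetical helper: prints the REPLICAS column (e.g. "0/1") of a service.
service_replicas() {
  docker service ls --filter "name=$1" --format '{{.Replicas}}'
}

# Hypothetical helper: polls once per second until the running and desired
# replica counts match, or gives up after the given number of tries.
# Arguments: service name, number of tries (default 30).
wait_for_replicas() {
  wfr_service="$1"
  wfr_tries="${2:-30}"
  while [ "$wfr_tries" -gt 0 ]; do
    wfr_now=$(service_replicas "$wfr_service")
    # "1/1" -> text before the slash is running, text after it is desired
    if [ -n "$wfr_now" ] && [ "${wfr_now%%/*}" = "${wfr_now##*/}" ]; then
      echo "$wfr_now"
      return 0
    fi
    wfr_tries=$((wfr_tries - 1))
    sleep 1
  done
  return 1
}

# Usage: wait_for_replicas backend
```

This only tells you that the replica count has converged; it says nothing about which image those replicas are running.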
Let's take a deeper look to see what is going on behind the scenes. Roughly, this is what Swarm will do for you:
- Pick up the docker service update command.
- Send a SIGTERM signal to the container and give it a grace period of 10 seconds to exit gracefully. If the container hasn't exited after 10 seconds, Docker terminates the process with a SIGKILL signal. Both the signal and the grace period can be changed with the --stop-signal and the --stop-grace-period flags when creating/updating the service.
- Bring up a container from the new image.
As you can see, this default behaviour features downtime: there can be a long delay between the old container being stopped and the new one being ready, especially since production-ready applications nowadays tend to take more than just a few milliseconds to come up.
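As an illustration of the --stop-signal and --stop-grace-period flags mentioned above, here is a small sketch. The wrapper name set_stop_policy and the 30s value are assumptions chosen for the example; pick whatever your application actually needs to shut down cleanly.

```shell
# Hypothetical wrapper: changes how long Swarm waits before sending SIGKILL,
# and which signal it sends first.
# Arguments: service name, grace period (e.g. 30s), signal (default SIGTERM).
set_stop_policy() {
  docker service update -d \
    --stop-grace-period "$2" \
    --stop-signal "${3:-SIGTERM}" \
    "$1"
}

# Usage: set_stop_policy backend 30s
```

A longer grace period only helps if the process inside the container actually handles the signal and exits on its own.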
If we inspect the service, we can see that the Update order field is set to stop-first:
docker service inspect --pretty backend
(out) ID: rcm7klgspe2dgwgf1n8pxghur
(out) Name: backend
(out) Service Mode: Replicated
(out) Replicas: 1
(out) UpdateStatus:
(out) State: completed
(out) Started: 57 seconds ago
(out) Completed: 48 seconds ago
(out) Message: update completed
(out) Placement:
(out) UpdateConfig:
(out) Parallelism: 1
(out) On failure: pause
(out) Monitoring Period: 5s
(out) Max failure ratio: 0
(out) Update order: stop-first
(out) RollbackConfig:
(out) Parallelism: 1
(out) On failure: pause
(out) Monitoring Period: 5s
(out) Max failure ratio: 0
(out) Rollback order: stop-first
(out) ContainerSpec:
(out) Image: containersol/hello-world:18101e645ee3d9b1de302164bb31f907a8282349@sha256:57b73ae1110ffab17cce2824f2416dc5e96122035b083f282f8a6b009905adee
(out) Resources:
(out) Endpoint Mode: vip
(out) Ports:
(out) PublishedPort = 8080
(out) Protocol = tcp
(out) TargetPort = 80
(out) PublishMode = ingress
It's a pity that at the time of writing this post, Swarm's default behaviour is to kill your container and then bring one up (stop-first). By now one would think that starting a new container and ensuring it is healthy and ready first would be the way to go...
The good news is that this behaviour can be tweaked in Swarm so that it does exactly that (start-first). To go about it, use the --update-order flag either at the time of creating your service or when updating it.
Let's update our service, this time making sure that the update-order is set to start-first:
docker service update -d --update-order start-first backend
(out) backend
The next time that the container image is updated, Swarm will first bring the new container up and only commission it once it's ready.
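If you want to script this check-and-set step, something along these lines might do. Both function names are made up for the example, and the template path .Spec.UpdateConfig.Order is an assumption about how the service spec exposes the Update order field from the pretty output; verify it against the JSON that docker service inspect backend prints on your Docker version.

```shell
# Hypothetical helper: prints the configured update order of a service,
# e.g. "stop-first" or "start-first".
update_order() {
  docker service inspect --format '{{.Spec.UpdateConfig.Order}}' "$1"
}

# Hypothetical helper: switches the service to start-first only if needed.
ensure_start_first() {
  if [ "$(update_order "$1")" != "start-first" ]; then
    docker service update -d --update-order start-first "$1"
  fi
}

# Usage: ensure_start_first backend
```

Running it a second time should be a no-op, since the order is already start-first by then.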
To verify this behaviour, open a new terminal window and watch for the containers being run:
watch -n 0.5 docker ps
# or the poor man's watch if you can't afford it:
while true; do
  docker ps
  sleep 0.5
  clear
done
and on the other window go ahead and release a new version of the container image:
docker service update -d --image containersol/hello-world:latest backend
(out) backend
You can see how the new container is being created and then the old one is being thrown away.
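Watching docker ps shows the containers being swapped, but it doesn't prove that requests kept being served. One rough way to check that is to hammer the published port from another terminal while the update runs and count the failures. The function below is only a sketch: its name and defaults are hypothetical, and it assumes curl is available and that your sleep accepts fractional seconds, as the loop above already does.

```shell
# Hypothetical probe: requests the URL a fixed number of times and reports
# how many requests failed (connection errors or HTTP status >= 400).
# Arguments: URL, number of requests (default 50), pause in seconds (default 0.2).
count_failed_requests() {
  cfr_url="$1"
  cfr_total="${2:-50}"
  cfr_pause="${3:-0.2}"
  cfr_failed=0
  cfr_i=0
  while [ "$cfr_i" -lt "$cfr_total" ]; do
    curl -fs -o /dev/null --max-time 1 "$cfr_url" || cfr_failed=$((cfr_failed + 1))
    cfr_i=$((cfr_i + 1))
    sleep "$cfr_pause"
  done
  echo "$cfr_failed/$cfr_total requests failed"
}

# Usage: start this first, then trigger the update in the other window:
#   count_failed_requests http://localhost:8080 100
```

Comparing the failure count of a stop-first update with that of a start-first update on the same service gives a quick feel for the difference.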
Other important features for rolling updates in Swarm are the --update-parallelism and the --rollback-parallelism flags. These flags tell Swarm how many tasks to update (or roll back) in parallel. Tweak them according to your own needs, but you almost certainly want this number to be lower than the total number of tasks/replicas that you're running for a specific service.
- 1 by default
- 0 will update/roll back all at once
It is very possible that the new container image doesn't come up during the upgrade process, either because of system failures or because of a faulty image. In this case, Swarm can roll back to the previous version that you were running, provided the service's failure action is set to rollback (--update-failure-action rollback). With the default shown in the inspect output above (On failure: pause), the update is paused instead. For these scenarios you also need to decide whether to go for a start-first or a stop-first strategy. There are also some networking issues with Swarm that are still being figured out.
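To tie the parallelism and rollback settings together, here is a sketch of a single update call that sets them all at once. The wrapper name configure_rollout is hypothetical, and the choice of start-first for both directions plus an automatic rollback on failure is just one possible policy, not a recommendation for every service.

```shell
# Hypothetical wrapper: updates/rolls back N tasks at a time (default 1),
# starts new tasks before stopping old ones, and rolls back automatically
# when an update fails.
# Arguments: service name, parallelism (default 1).
configure_rollout() {
  docker service update -d \
    --update-parallelism "${2:-1}" \
    --update-order start-first \
    --update-failure-action rollback \
    --rollback-parallelism "${2:-1}" \
    --rollback-order start-first \
    "$1"
}

# Usage: configure_rollout backend 2
```

With a single replica, as in this walkthrough, a parallelism of 1 is the only value that makes sense; the second argument becomes interesting once the service is scaled out.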
We have done this exercise from the command line for the sake of simplicity. However, more sophisticated, production-ready setups define their services in Docker Compose or Stack files. We were sad to find out that the update-order feature, even though merged into upstream, hasn't yet made it to the latest Docker version. This means that, when deploying from such files, you are for now stranded with the stop-first strategy until (hopefully) the next release of Docker.