Say that you have a set of services up & running in your Swarm cluster. Sooner rather than later there will come a time where you want to upgrade the version of your services. This most likely means that you will want to deploy a new set of containers with the upgraded version of your software.
A common approach in the industry is to put your website into maintenance mode while you deploy the new version. This implies scheduled downtime:

If you do this manually, you must follow the steps by the book, otherwise your scheduled downtime will end up being much longer than planned.

If you do this in an automated way, by means of some sort of orchestration tool or even with bash scripts, you reduce the downtime dramatically, yet you still have downtime.
In the past, this strategy made sense because creating new servers with the new version of your software was a rather slow process. Not today.
A better strategy is to spin up new servers, provision them with the new version of your software, ensure that the application is running by means of some smoke tests, and, once the tests pass, commission the new servers into your load balancers, then decommission (and probably also destroy) the old servers. This way you avoid having downtime at all.
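Sketched as a script, the idea looks roughly like this. Note that `provision`, `smoke-test`, `lb-add`, `lb-remove` and `destroy` are hypothetical placeholders for whatever provisioning, testing and load-balancer tooling you use:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Spin up servers provisioned with the new version (hypothetical tooling).
NEW_SERVERS=$(provision --version "$NEW_VERSION")

# Run smoke tests against the new servers before they take any traffic.
smoke-test "$NEW_SERVERS"

# Commission the new servers into the load balancer, then
# decommission and destroy the old ones. No downtime in between.
lb-add "$NEW_SERVERS"
lb-remove "$OLD_SERVERS"
destroy "$OLD_SERVERS"
```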
Today, we can achieve zero-downtime deployments in just a matter of seconds thanks to the fact that containers are cheap to both produce and move around. Let's see how we can do this using Docker Swarm.
For the sake of simplicity we're going to use the `containersol/hello-world` container image, which serves a web application that renders the hostname of the machine where it's running. Since it runs in a container, it shows the hostname of the container itself.
I also assume that you have already created a Swarm cluster. Let's deploy the first version of the service:
```
$ docker service create -d --name backend -p 8080:80 containersol/hello-world:6f045da67c52ed22f745211612fa90462c4f5e38
rcm7klgspe2dgwgf1n8pxghur
```
To verify that the service has been deployed:
```
$ docker service ls
ID            NAME     MODE        REPLICAS  IMAGE                                                              PORTS
rcm7klgspe2d  backend  replicated  1/1       containersol/hello-world:6f045da67c52ed22f745211612fa90462c4f5e38  *:8080->80/tcp
```
You can also see the actual container that was deployed:
```
$ docker ps
CONTAINER ID  IMAGE                                                              COMMAND   CREATED         STATUS         PORTS   NAMES
c150f0e84a92  containersol/hello-world:6f045da67c52ed22f745211612fa90462c4f5e38  "./main"  40 seconds ago  Up 39 seconds  80/tcp  backend.1.ei6195x8xovza2msbfb8s3v31
```
To verify that the service is actually running, visit http://localhost:8080
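If you prefer the command line, a quick check with curl should return a page containing the container's hostname (the exact markup depends on the image):

```shell
$ curl -s http://localhost:8080/
```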
A new version of the application is ready to be shipped.
To release this new version:
```
$ docker service update -d --image containersol/hello-world:18101e645ee3d9b1de302164bb31f907a8282349 backend
backend
```
Let's verify that Swarm actually picked up the update command and is now (hopefully) running the desired version of the container image:
```
$ docker service ls
ID            NAME     MODE        REPLICAS  IMAGE                                                              PORTS
bkcv1ztj8vir  backend  replicated  1/1       containersol/hello-world:18101e645ee3d9b1de302164bb31f907a8282349  *:8080->80/tcp
```
If you were monitoring the service list in another window, you would have noticed, early in the update process, output similar to this:
```
$ docker service ls
ID            NAME     MODE        REPLICAS  IMAGE                                                              PORTS
rcm7klgspe2d  backend  replicated  0/1       containersol/hello-world:18101e645ee3d9b1de302164bb31f907a8282349  *:8080->80/tcp
```
Notice how the `REPLICAS` field says `0/1`. This is because at that point Swarm was still busy getting hold of the desired container image in order to replace the old container with the new one.
Let's take a deeper look at what is going on behind the scenes. Roughly, this is what Swarm does for you:

1. It receives your `docker service update` command.
2. It sends a `SIGTERM` signal to the running container and gives it a grace period of 10 seconds to exit gracefully. If after 10 seconds the process hasn't yet given up, Docker terminates it with a `SIGKILL` signal. The grace period can be tweaked via the `--stop-grace-period` flag when creating/updating the service.
3. It starts a new container from the updated image.
4. Once the new container is up, it starts serving traffic.
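For example, to give a service that shuts down slowly (say, one draining long-lived connections) a 30-second grace period instead of the default 10 seconds — `backend` being the service from this walkthrough:

```shell
$ docker service update -d --stop-grace-period 30s backend
backend
```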
As you can see, this default behaviour involves downtime, since between steps 2 and 4 there can be a significant delay, especially since production-ready applications nowadays tend to take more than just a few milliseconds to come up.
If we inspect the service, we can see that the `Update order` field is set to `stop-first`:
```
$ docker service inspect --pretty backend
ID:             rcm7klgspe2dgwgf1n8pxghur
Name:           backend
Service Mode:   Replicated
 Replicas:      1
UpdateStatus:
 State:         completed
 Started:       57 seconds ago
 Completed:     48 seconds ago
 Message:       update completed
Placement:
UpdateConfig:
 Parallelism:   1
 On failure:    pause
 Monitoring Period: 5s
 Max failure ratio: 0
 Update order:      stop-first
RollbackConfig:
 Parallelism:   1
 On failure:    pause
 Monitoring Period: 5s
 Max failure ratio: 0
 Rollback order:    stop-first
ContainerSpec:
 Image:         containersol/hello-world:18101e645ee3d9b1de302164bb31f907a8282349@sha256:57b73ae1110ffab17cce2824f2416dc5e96122035b083f282f8a6b009905adee
Resources:
Endpoint Mode:  vip
Ports:
 PublishedPort = 8080
  Protocol = tcp
  TargetPort = 80
  PublishMode = ingress
```
It's a pity that, at the time of writing this post, Swarm's default behaviour is to kill your container and only then bring a new one up (`stop-first`). By now one would think that starting a new container first and ensuring its health and readiness would be the way to go...
The good news is that this behaviour can be tweaked in Swarm so that it does exactly that (`start-first`). To go about it, use the `--update-order` flag either at the time of creating your service or when updating it.
Let's update our service, this time making sure that the `update-order` is set to `start-first`:
```
$ docker service update -d --update-order start-first backend
backend
```
The next time that the container image is updated, Swarm will first bring the new container up and only commission it once it's ready.
To verify this behaviour, open a new terminal window and watch the containers being run:
```
$ watch -n 0.5 docker ps

# or the poor man's watch if you can't afford it:
$ while true; do
    docker ps
    sleep 0.5
    clear
  done
```
and on the other window go ahead and release a new version of the container image:
```
$ docker service update -d --image containersol/hello-world:latest backend
backend
```
You can see how the new container is created first, and only then is the old one thrown away.
Another important feature for rolling updates in Swarm is the pair of `--update-parallelism` and `--rollback-parallelism` flags. These tell Swarm how many tasks it will update in parallel. Tweak them according to your own needs, but you almost certainly want this number to be lower than the total number of tasks/replicas that you're running for a given service; a value of `0` will update/rollback all tasks at once.
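As an illustration, suppose the `backend` service were scaled to 4 replicas; updating 2 at a time with a small delay between batches keeps half the tasks serving traffic throughout (the flag values here are just examples):

```shell
$ docker service scale backend=4
$ docker service update -d \
    --update-parallelism 2 \
    --update-delay 10s \
    --rollback-parallelism 1 \
    backend
```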
It is quite possible that the new container doesn't come up during the upgrade process, either because of system failures or because of a faulty image. In that case, Swarm will make a best effort to roll back to the previous version you were running. For these scenarios you also need to decide whether you want a `start-first` or a `stop-first` strategy. There are also some networking issues with Swarm in this area that are still being ironed out.
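In recent Docker versions you can also tell Swarm explicitly what to do when an update fails, and trigger a rollback by hand; `--update-failure-action` accepts `pause` (the default), `continue` or `rollback`:

```shell
# Roll back automatically if the update fails:
$ docker service update -d --update-failure-action rollback backend

# Or roll back manually to the previously deployed version:
$ docker service rollback backend
```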
We have done this exercise from the command line for the sake of simplicity. However, more sophisticated, production-ready setups define services in Docker Compose or Stack files. We were sad to find out that the `update-order` feature, even though merged upstream, hasn't yet made it into the latest Docker release. This means that, in Compose/Stack files, you are for now stranded with the `stop-first` strategy until (hopefully) the next release of Docker.
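For reference, once the feature ships, the equivalent configuration in a Compose/Stack file should look something like this (it requires Compose file format 3.4+; the service and image names are the ones from this post):

```yaml
version: "3.4"
services:
  backend:
    image: containersol/hello-world:latest
    ports:
      - "8080:80"
    deploy:
      replicas: 1
      update_config:
        order: start-first     # start the new task before stopping the old one
        parallelism: 1
        failure_action: rollback
```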