Terraform-Mesos module explained

In a previous post, we showed you how to set up your own Mesos cluster with Terraform by using our Terraform-Mesos module. In this post we'll be shedding some light on how the module is structured.

There's been a lot of activity in Terraform lately: 0.4.0 was released just before we posted the howto two weeks ago. It introduced a lot of cool new features and enhancements, but sadly also a number of bugs. The nice folks at HashiCorp did not waste any time: they released 0.4.1 a week later, and 0.4.2 a day after that, which fixed the bugs that made our module throw errors. As of writing, the Changelog already shows the preparations for the 0.5.0 release.
We wrote our module when 0.3.7 was current, which did not include resources for forwarding rules and target pools for the Google Cloud provider. As a result, we did not configure public access to the cluster, something we are certainly going to include later on.

Back to the things we already did. First, let's find out what it takes to set up a Mesos cluster on the Google Cloud. A Mesos cluster has a master-slave structure, where the masters keep track of the slaves and offer their resources to the frameworks registered with the masters. The slaves do the actual work of running the tasks for the frameworks. So, in Terraform terms, we need to define a number of google_compute_instance resources that are able to talk to each other. We will create a dedicated network for our cluster, to keep it separate from other machines in our infrastructure. We will need to set up a number of firewall rules to let the instances talk to each other, and to provide external access to our network.
The instances need to be provisioned after creation. We mainly use the remote-exec provisioner to provision the machines with bash scripts. These could easily be swapped for a proper tool like Puppet or Ansible.

The module sources are available on GitHub. We divided the module into separate files to improve maintenance and readability. We'll explain each of the files in turn.

Variables

As you could see in our previous post, we use a couple of variables to configure our cluster. We define all these variables in a separate file and give them reasonable defaults where we can. Variables without a default only get their values when we call the module; see the previous article on how to do that.


variable "account_file" {}
variable "gce_ssh_user" {}
variable "gce_ssh_private_key_file" {}
 
variable "region" {}
variable "zone" {}
variable "project" {}
variable "image" {
    default = "ubuntu-os-cloud/ubuntu-1404-trusty-v20150128"
}
variable "master_machine_type" {
    default = "n1-standard-2"
}
variable "slave_machine_type" {
    default = "n1-standard-4"
}
 
variable "network" {
    default = "10.20.30.0/24"
}
variable "localaddress" {}
variable "domain" {}
 
variable "name" {}
variable "masters" {
    default = "1"
}
variable "slaves" {
    default = "3"
}

We can refer to these variables in our resources by prefixing them with var. and enclosing them in ${}, so the variable account_file can be referenced as ${var.account_file}.

Provider

To enable Terraform to set up our infrastructure on the Google Cloud, we need to define our provider. We need to tell it how to access the Google Cloud by declaring the account_file, the id of the project in which we will create our machines, and the region where the machines will be created. You can generate an account file by visiting the Google Developer Console and going to Credentials under APIs & auth. You will find a button there saying Generate new JSON key.


provider "google" {
    account_file = "${var.account_file}"
    project = "${var.project}"
    region = "${var.region}"
}

Address

We added a resource for a fixed address, but we are not using it currently because we couldn't set up forwarding rules and target pools in 0.3.7. This resource takes just one parameter, a name for the address.


resource "google_compute_address" "external-address" {
    name = "${var.name}-address"
}

Network

Our network has a name and an ip range. We use 10.20.30.0/24 by default, so we have 253 addresses available for our machines.


resource "google_compute_network" "mesos-net" {
    name = "${var.name}-net"
    ipv4_range = "${var.network}"
}

Firewall

The firewall rules are where it gets a bit more interesting. First, we define a rule that lets all hosts in the network talk to each other on any tcp or udp port, and over icmp. We also add our local address to the source_ranges, so we will be able to reach our hosts for debugging if necessary.


resource "google_compute_firewall" "mesos-internal" {
    name = "${var.name}-mesos-internal"
    network = "${google_compute_network.mesos-net.name}"
 
    allow {
        protocol = "tcp"
        ports = ["1-65535"]
    }
    allow {
        protocol = "udp"
        ports = ["1-65535"]
    }
    allow {
        protocol = "icmp"
    }
 
    source_ranges = ["${google_compute_network.mesos-net.ipv4_range}","${var.localaddress}"]
 
}

Then we set up separate rules for http, https, ssh and vpn. We will tag our instances with the target_tags we provide here, so connections to ports on these instances will be allowed. We're only listing the http rule here because the others are very similar. See the full source on GitHub.


resource "google_compute_firewall" "mesos-http" {
    name = "${var.name}-mesos-http"
    network = "${google_compute_network.mesos-net.name}"
 
    allow {
        protocol = "tcp"
        ports = ["80"]
    }
 
    target_tags = ["http"]
    source_ranges = ["0.0.0.0/0"]
}
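
To illustrate how similar the remaining rules are, an ssh rule might look like the sketch below. It only assumes that the rule mirrors the http one with port 22 and the ssh tag; the resource name mesos-ssh is illustrative, and the exact definitions are in the repository.

```hcl
# Sketch: assumes the ssh rule mirrors the http rule above.
# The resource name "mesos-ssh" is illustrative, not taken from the module.
resource "google_compute_firewall" "mesos-ssh" {
    name = "${var.name}-mesos-ssh"
    network = "${google_compute_network.mesos-net.name}"

    allow {
        protocol = "tcp"
        ports = ["22"]
    }

    # only instances tagged "ssh" accept these connections
    target_tags = ["ssh"]
    source_ranges = ["0.0.0.0/0"]
}
```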

Master

Our master instances make use of the count meta-parameter, so when we later apply our plan, Terraform will create the corresponding number of identical instances. Each machine needs a unique name, so we interpolate the index of the count into the value of the name parameter. We define a machine_type and a zone, and provide the tags we declared earlier in the firewall rules. Then we declare a disk with an Ubuntu image. We use 14.04 because Mesos provides packages for this version. In the future, we will move away from pre-built packages and provide a way to use a specific version built from source; there's already an issue for this. Next, we set some metadata which we need for configuring our nodes. We'll use the metadata in our scripts later on.
Our instances need a network_interface with an address in our network. The empty access_config block makes Google add a public, ephemeral address automatically. Then we provide the credentials for accessing our instances through ssh. Since we've logged in using the Google SDK, our credentials are saved as Project Metadata, which is transferred to every instance we create. Lastly, we use the remote-exec provisioner to call a bunch of scripts to do the actual work.


resource "google_compute_instance" "mesos-master" {
    count = "${var.masters}"
    name = "${var.name}-mesos-master-${count.index}"
    machine_type = "${var.master_machine_type}"
    zone = "${var.zone}"
    tags = ["mesos-master","http","https","ssh","vpn"]
    
    disk {
      image = "${var.image}"
      type = "pd-ssd"
    }
 
    # declare metadata for configuration of the node
    metadata {
      mastercount = "${var.masters}"
      clustername = "${var.name}"
      myid = "${count.index}"
      domain = "${var.domain}"
    }
    
    # network interface
    network_interface {
      network = "${google_compute_network.mesos-net.name}"
      access_config {}
    }
    
    # define default connection for remote provisioners
    connection {
      user = "${var.gce_ssh_user}"
      key_file = "${var.gce_ssh_private_key_file}"
    }
    
    # install mesos, haproxy, docker, openvpn, and configure the node
    provisioner "remote-exec" {
      scripts = [
        "${path.module}/scripts/master_install.sh",
        "${path.module}/scripts/docker_install.sh",
        "${path.module}/scripts/openvpn_install.sh",
        "${path.module}/scripts/haproxy_marathon_bridge_install.sh",
        "${path.module}/scripts/common_config.sh",
        "${path.module}/scripts/master_config.sh"
      ]
    }
}

Slave

Our slave instances look much like our masters. We use a different count variable, and a slightly larger machine type with 4 CPUs and 15 GB of memory. And of course we need different scripts to install and configure packages on the slaves. You can see the full source on GitHub.
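
As a rough sketch of what that looks like, the block below assumes the slave resource simply mirrors the master with the slave variables swapped in. The tags and the slave_install.sh script name are assumptions made for illustration; the real definition is in the repository.

```hcl
# Sketch: assumes the slave resource mirrors the master resource.
resource "google_compute_instance" "mesos-slave" {
    count = "${var.slaves}"
    name = "${var.name}-mesos-slave-${count.index}"
    machine_type = "${var.slave_machine_type}"
    zone = "${var.zone}"
    # tags are assumed; they must match the target_tags of the firewall rules
    tags = ["mesos-slave","http","https","ssh","vpn"]

    disk {
      image = "${var.image}"
      type = "pd-ssd"
    }

    # metadata the scripts can query on the node
    metadata {
      mastercount = "${var.masters}"
      clustername = "${var.name}"
      domain = "${var.domain}"
    }

    network_interface {
      network = "${google_compute_network.mesos-net.name}"
      access_config {}
    }

    connection {
      user = "${var.gce_ssh_user}"
      key_file = "${var.gce_ssh_private_key_file}"
    }

    # slave_install.sh is a hypothetical name for the slave install script
    provisioner "remote-exec" {
      scripts = [
        "${path.module}/scripts/slave_install.sh",
        "${path.module}/scripts/docker_install.sh",
        "${path.module}/scripts/common_config.sh",
        "${path.module}/scripts/slave_config.sh"
      ]
    }
}
```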

Output

We define the outputs like this:


output "master_address" {
  value = "${join(",", google_compute_instance.mesos-master.*.network_interface.0.address)}"
}
 
output "slave_address" {
  value = "${join(",", google_compute_instance.mesos-slave.*.network_interface.0.address)}"
}

The outputs currently don't work. We're trying to output some relevant information like the public IP addresses of our instances, but Terraform doesn't seem to detect them. When we run terraform output master_address, the answer is:


The state file has no outputs defined. Define an output
in your configuration with the `output` directive and re-run
`terraform apply` for it to become available.

Probably the outputs from a module need some special handling. We're looking into that; any suggestions are definitely appreciated!
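
One thing worth trying, which we have not verified against this setup: terraform output reads the outputs of the root configuration, and a module's outputs can be referenced there as module.NAME.OUTPUT. Re-exporting them from the file that calls the module might therefore be enough. The sketch below assumes the module was declared as module "mesos"; that name is just an example.

```hcl
# Goes in the .tf file that calls the module, not inside the module itself.
# "mesos" is an assumed module name; use the name from your own module block.
output "master_address" {
  value = "${module.mesos.master_address}"
}

output "slave_address" {
  value = "${module.mesos.slave_address}"
}
```

After another terraform apply, terraform output master_address should then have something to report, provided the module-level outputs themselves resolve.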

The Scripts

The scripts we use for provisioning our instances are bash scripts that take care of installing the necessary packages and configuring the setup. We used the excellent How To from Digital Ocean as a base for the scripts. We will not list all the scripts here; you can look them up on GitHub.
The master_install script takes care of installing the mesosphere and haproxy packages by adding the proper key and repository.
The docker_install just contains the oneliner from the Docker install instructions.
The openvpn_install script is still a work in progress. We want to use a vpn so that the web interface can display information about the Mesos slaves, because you can't reach the slaves on their local addresses directly. There are still some things we have to figure out, like how to handle environment variables with sudo, which we need in order to generate the keypairs. We also have no means of showing the generated keypairs yet, so you will need to scp them from the machine. Come to think of it, we could use a local-exec to copy them over. And then there is a routing issue which prevents us from connecting to the other nodes through the vpn.
The haproxy_marathon_bridge_install script uses our Go version of the bridge script supplied with Marathon. We adjusted it so every container that exposes port 80 will be available through its application name and the supplied domain name. This allows us to deploy a container called webservice and have it instantly available via webservice.ourdomain.com. This install script also shows how to query the metadata we put in our instance configuration.

  
DOMAIN=`curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/attributes/domain"`

As you can see, you need to set a header called Metadata-Flavor with the value Google, and then query the server at metadata.google.internal to get to the metadata for your instance.
The common_config script composes the Zookeeper string we need in our Mesos and Marathon configurations.
The master_config does the main body of the work configuring Zookeeper, Mesos and Marathon and restarting the services, while the slave_config only configures Mesos as a slave.

And that's all there is to it. To use the module, create a simple .tf file referencing the module and define the necessary variables (again, see the howto). Then sit back as Terraform applies your settings to create a Mesos cluster on your Google Cloud!
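
For reference, a calling configuration could look roughly like the sketch below. Every value is a placeholder: the source path, credentials, region, zone, project, addresses and counts are assumptions you need to replace with your own, and the howto remains the authoritative description.

```hcl
# Sketch of a calling configuration. Every value below is a placeholder.
module "mesos" {
    # hypothetical local path; point this at wherever the module sources live
    source = "./terraform-mesos"

    account_file = "/path/to/account.json"
    gce_ssh_user = "user"
    gce_ssh_private_key_file = "/path/to/private_key"

    region = "europe-west1"
    zone = "europe-west1-b"
    project = "my-project-id"

    # your own address, added to source_ranges of the internal firewall rule
    localaddress = "198.51.100.7/32"
    domain = "example.com"

    name = "mycluster"
    # override the defaults of 1 master and 3 slaves
    masters = "3"
    slaves = "5"
}
```

Variables that have a default in the module, such as image, network and the machine types, can be left out unless you want to override them.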