Recently a colleague and I had the opportunity to take part in a hackathon organized by Microsoft. The format paired us with developers from Microsoft, in this case Alessandro and David, who were a fantastic help throughout. The purpose of the hackathon was to help developers familiarize themselves with Microsoft Azure while giving Microsoft a chance to connect with them and collect valuable feedback. Neither my colleague Adam nor I had much experience with Azure, as we are both more familiar with Google Cloud and AWS, so we jumped at the chance to gain insight into another major cloud provider. We were required to submit a proposal outlining what we were going to work on and what we intended to achieve. Adam proposed using the Sock Shop, a cloud native reference app developed in conjunction with Weaveworks to demonstrate different cloud native technologies coming together. In this blog post I’ll go through what we tried to accomplish and the issues we encountered along the way. For reference, below is an architectural diagram of the Sock Shop.
Microsoft has adopted a unique strategy for supporting containers on the Azure cloud. Where Google has Google Container Engine (GKE) and AWS has EC2 Container Service (ECS), Azure has Azure Container Service (ACS). Under the hood it is a very different beast from the other two (which are themselves quite different from each other). First, it doesn’t focus on a single orchestrator: you have a choice of Kubernetes, DC/OS and Docker Swarm. Second, it doesn’t manage the orchestrator after creation. GKE and ECS are both services where the master component orchestrating the cluster is managed by Google and AWS respectively. ACS is currently more of a turn-key installer, creating all the necessary components but leaving management to the user. This is not necessarily a problem for many customers, but the folks we worked with from Microsoft told us that the aim is to provide Kubernetes with managed master nodes. What’s very interesting about ACS is that it’s open source: you can find all the code running under the hood on GitHub, which provides transparency and access to future releases. The current version of the ACS Resource Manager only supports Kubernetes 1.5.3, but you can grab the acs-engine source code to get the newest features and Kubernetes 1.6.4 support. We wanted to use Kubernetes 1.6.4 and also create hybrid Linux-Windows clusters to showcase the best of Azure and ACS.
The steps needed to provision a Kubernetes cluster using acs-engine, instead of the production ACS that is integrated into the Azure UI, are documented here. acs-engine provides a Docker image which serves as a convenient developer environment. Mount the acs-engine source code into this container, run a make command to build it, and you’re ready to go on any platform.
The acs-engine binary in the container can then generate deployment JSON files which can be consumed by the Azure CLI.
The generate command takes a template JSON file as its parameter:
./acs-engine generate examples/kubernetes.json
{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes"
    },
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "cs-k8s-hybrid",
      "vmSize": "Standard_D2_v2"
    },
    "agentPoolProfiles": [
      {
        "name": "linuxpool1",
        "count": 2,
        "vmSize": "Standard_D2_v2",
        "availabilityProfile": "AvailabilitySet"
      },
      {
        "name": "windowspool2",
        "count": 2,
        "vmSize": "Standard_DS3_v2",
        "availabilityProfile": "AvailabilitySet",
        "osType": "Windows"
      }
    ],
    "windowsProfile": {
      "adminUsername": "azureuser",
      "adminPassword": "aAvc$DF!sP5ybs%fYcJ3qW9Z&C7Z&g"
    },
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "ssh-rsa..."
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "servicePrincipalClientID": "74dde826-a81c-40bf-aa73-0f33b00b0cf7",
      "servicePrincipalClientSecret": "..."
    }
  }
}
This template file defines a cluster composed of a Linux pool of Standard_D2_v2 machines and a Windows pool of Standard_DS3_v2 machines. More information on VM sizes can be found here. From this simple template file acs-engine can create all the resources needed to set up a hybrid Linux-Windows Kubernetes cluster.
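To make the shape of the template a little more concrete, here is a minimal Python sketch that reads a cluster definition and counts the agent nodes per operating system. The helper name nodes_per_os is ours, not part of acs-engine, and the sketch assumes that a pool without an osType field is a Linux pool, which is how the template above is written.

```python
import json
from collections import Counter


def nodes_per_os(cluster_definition: dict) -> Counter:
    """Count agent nodes per operating system in a cluster definition."""
    counts = Counter()
    for pool in cluster_definition["properties"]["agentPoolProfiles"]:
        # Assumption: a pool without "osType" is a Linux pool, as in the template.
        os_type = pool.get("osType", "Linux")
        counts[os_type] += pool["count"]
    return counts


# Only the fields this sketch reads; the rest of the template is omitted.
definition = json.loads("""
{
  "properties": {
    "agentPoolProfiles": [
      {"name": "linuxpool1", "count": 2, "vmSize": "Standard_D2_v2"},
      {"name": "windowspool2", "count": 2, "vmSize": "Standard_DS3_v2", "osType": "Windows"}
    ]
  }
}
""")

for os_type, count in sorted(nodes_per_os(definition).items()):
    print(f"{os_type}: {count} nodes")
```

A quick check like this is handy before generating the deployment files, since the split between Linux and Windows nodes determines where workloads can be scheduled later on.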
Our next order of business was deploying the Sock Shop. The interesting bit here is that we opted to run a hybrid cluster, with Windows and Linux split across 4 nodes. The first time we tried to deploy the Sock Shop, certain services failed to deploy. It took us a while to realize that this was because Kubernetes was trying to schedule Linux services on Windows nodes. Because of this we needed to make a few amendments to the existing Sock Shop deployment: we added a node selector on the beta.kubernetes.io/os label to each deployment so that Kubernetes knows not to place services on the wrong OS. Once we did that everything deployed properly. One problem is that none of the existing services are native Windows apps, so we ended up with overloaded Linux nodes and empty Windows nodes. To mitigate this we needed to run a Windows service. Luckily we had one available from a previous experiment of ours in which we rewrote the orders service as a .NET app. We quickly deleted the original orders service and deployment, made the edits, made sure the OS selector was set to Windows this time, and redeployed.
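As a rough sketch of that amendment, the snippet below pins a Deployment manifest to one operating system. It assumes the change is expressed as a nodeSelector on the beta.kubernetes.io/os node label, which is the usual Kubernetes mechanism for this. The helper pin_to_os, the trimmed manifest, and the image name are hypothetical and only there for illustration.

```python
import copy
import json

OS_LABEL = "beta.kubernetes.io/os"  # node label named in the post


def pin_to_os(deployment: dict, os_name: str) -> dict:
    """Return a copy of a Deployment whose pods only land on nodes with the given OS label."""
    pinned = copy.deepcopy(deployment)
    pod_spec = pinned["spec"]["template"]["spec"]
    # nodeSelector restricts scheduling to nodes carrying this label value.
    pod_spec.setdefault("nodeSelector", {})[OS_LABEL] = os_name
    return pinned


# Hypothetical, heavily trimmed manifest; the real Sock Shop manifests contain more fields.
orders = {
    "apiVersion": "extensions/v1beta1",
    "kind": "Deployment",
    "metadata": {"name": "orders"},
    "spec": {
        "template": {
            "metadata": {"labels": {"name": "orders"}},
            "spec": {
                "containers": [{"name": "orders", "image": "example/orders-dotnet"}]
            },
        }
    },
}

windows_orders = pin_to_os(orders, "windows")
print(json.dumps(windows_orders["spec"]["template"]["spec"]["nodeSelector"]))
```

The same helper with "linux" would cover the remaining services. Working on a copy keeps the original manifest untouched, which makes it easy to compare the two before redeploying.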
While everything deployed successfully, we still needed to confirm that the Sock Shop was operating properly. This is where we encountered another issue: our .NET orders service was having trouble communicating with the MongoDB backend. After a bit of investigation we discovered that this was because DNS resolution hadn’t been configured properly on the Windows node. Thankfully this is a known issue and a PR that fixes it has been merged; you can find more detail here.
One of the cool things we discovered whilst playing around with load balancers was that Microsoft Azure creates a load balancer when one is specified in the service YAML, a behaviour similar to Google Container Engine and AWS.
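For illustration, this is roughly what "specifying a load balancer" means in a Service manifest: the spec.type field is set to LoadBalancer. The sketch builds the manifest as a Python dict and prints it as JSON, which kubectl accepts alongside YAML. The helper name, the service name, the selector label, and the port numbers are assumptions for the example, not values taken from our deployment.

```python
import json


def expose_with_load_balancer(name: str, port: int, target_port: int) -> dict:
    """Build a Service manifest that asks the cloud provider for a load balancer."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            # This field is what triggers provisioning of a cloud load balancer.
            "type": "LoadBalancer",
            # Assumption: pods are labelled name=<service name>.
            "selector": {"name": name},
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


# Hypothetical values for the shop's web front end.
service = expose_with_load_balancer("front-end", port=80, target_port=8079)
print(json.dumps(service, indent=2))
```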
We decided to introduce autoscaling into the mix, which was relatively simple as there was already an autoscaler available for ACS. It is a fork of the autoscaler for EC2 written by OpenAI; you can find more information here. One issue we encountered was with the way we had named our deployment: the autoscaler code originally hard-coded a deployment name of ‘azuredeploy’, and we had chosen a name that differed from that convention. Fortunately Adam and David were able to write a PR, which has since been merged, so the autoscaler should now work with deployments of any name by accepting a configuration parameter. You can find that PR and more information here. We also found that the current node autoscaler in ACS doesn’t support hybrid clusters: having Windows nodes in the cluster breaks the algorithm. This is not going to be fixed until node management is migrated to Virtual Machine Scale Sets (VMSS). VMSS will make node management simpler, so autoscaling is frozen until the migration happens.
The last thing we attempted was to upgrade the underlying infrastructure by increasing the size of the VMs. This is probably where we encountered the most issues. In theory this should have been relatively simple, since all it really required was modifying our original cluster definition by adding a new definition to the agentPoolProfiles section, something like this:
  {
    "name": "linuxpool2",
    "count": 2,
    "vmSize": "Standard_DS2_v2",
    "availabilityProfile": "AvailabilitySet"
  },
We gave it another name and changed the vmSize. Thankfully the provisioner is smart enough to recognize the definitions that already existed and didn’t try to recreate them. Once the new VMs are provisioned, we can drain the smaller nodes, delete them, and move on. The issue we ran into was that we tried to provision the new VMs from a machine other than the one we had used to provision the initial VMs. Unfortunately we were never able to solve this and had to start again from scratch. A key learning was the need to save the files generated during the initial process to avoid any discrepancies.
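The edit to the cluster definition described above can be sketched in a few lines of Python. The helper add_agent_pool is hypothetical, and the duplicate-name check is our own guard rather than anything acs-engine does. The sketch only changes the definition in memory; regenerating and deploying the templates is a separate step.

```python
import copy


def add_agent_pool(cluster_definition: dict, new_pool: dict) -> dict:
    """Return a copy of the cluster definition with one more agent pool appended."""
    updated = copy.deepcopy(cluster_definition)
    pools = updated["properties"]["agentPoolProfiles"]
    # Our own guard: refuse to add a pool whose name is already taken.
    if any(pool["name"] == new_pool["name"] for pool in pools):
        raise ValueError(f"agent pool {new_pool['name']!r} already exists")
    pools.append(new_pool)
    return updated


# Trimmed to the single field this sketch touches.
original = {
    "properties": {
        "agentPoolProfiles": [
            {
                "name": "linuxpool1",
                "count": 2,
                "vmSize": "Standard_D2_v2",
                "availabilityProfile": "AvailabilitySet",
            }
        ]
    }
}

upgraded = add_agent_pool(
    original,
    {
        "name": "linuxpool2",
        "count": 2,
        "vmSize": "Standard_DS2_v2",
        "availabilityProfile": "AvailabilitySet",
    },
)
print([pool["name"] for pool in upgraded["properties"]["agentPoolProfiles"]])
```

Because the function returns a copy, the original definition stays available for comparison, which fits the lesson above about keeping the initial files around.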
In conclusion, while several issues came up during the course of the hackathon, none of them was significant enough to sour our experience of Microsoft Azure. It would be nice to see ACS reach a point where it’s as up to date as Google when it comes to Kubernetes versions, and we understand Microsoft is actively working towards this. Luckily the DNS issue we encountered has already been resolved, so it should not affect us the next time we use Azure. And lastly, the issue we encountered during the upgrade could easily have been avoided, which makes it difficult to call it a true issue. Overall we were happy with what we saw and wouldn’t hesitate to use Azure as an option for hosted Kubernetes.