
Istio and Kubernetes: Reducing Risk Through Chaos Engineering

When designing your microservice architecture in a Cloud Native system, setting up the Istio service mesh on your Kubernetes cluster(s) can give you more control and observability over network traffic. But it can also help you break things, and that's the focus of this blog post.

Chaos engineering is a term coined at Netflix. It boils down to deliberately breaking your systems in production, and designing solutions to remediate the side effects, before things have a chance to break unexpectedly.


Do you know what happens if half of your backend infrastructure becomes unreachable? How about if one of your frontend web servers goes down? What if traffic takes a few extra seconds to reach a critical component of your backend? If you can’t answer these kinds of questions with the utmost confidence, you need to start chaos engineering.

Chaos engineering experiments are part of how we Container Solutions engineers and our clients test the Cloud Native systems we build together. Such resiliency tests are part of our four-step Cloud Native transformation process—Think, Design, Build, Run—and they help our clients find out what works and what doesn’t early, before they waste time, money, and staff resources.

Now that you know why you should embrace this mindset, let’s talk about some ways you can put thoughts into action. Besides, breaking stuff can be fun!


Simulating Service Outages

Let’s start by simulating a partial outage on one of your web services. Here’s a typical Istio Virtual Service for a highly available frontend service.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: frontend
spec:
  hosts:
  - frontend
  http:
  - route:
    - destination:
        host: frontend
        subset: v1


As you can see, we're routing all incoming traffic for the 'frontend' host to the pods in its v1 subset. You could just start killing pods to see what happens (and you should). But we're going to focus on what you can do with Istio to simulate some of your requests being handled by a misconfigured or unreachable microservice.
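
One thing to note: the v1 subset used throughout these examples has to be defined in a DestinationRule. If you don't already have one, a minimal sketch could look like the following. The version: v1 label is an assumption here; match it to whatever labels your frontend pods actually carry.

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: frontend
spec:
  host: frontend
  subsets:
  # maps the v1 subset name used in the VirtualService to pod labels
  - name: v1
    labels:
      version: v1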

Here is the same YAML, plus a 'fault' section, which we'll use to make half of our requests fail with a 503 Service Unavailable error.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: frontend
spec:
  hosts:
  - frontend
  http:
  - route:
    - destination:
        host: frontend
        subset: v1
    fault:
      abort:
        httpStatus: 503
        percent: 50
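
Depending on your Istio version, the integer percent field shown above may be deprecated in favour of the fractional percentage field. If that's the case for you, the abort block would look roughly like this instead:

    fault:
      abort:
        httpStatus: 503
        # fractional percentage instead of the deprecated integer percent
        percentage:
          value: 50.0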


Here’s another example, but instead of a fault, we’re setting a timeout to give our service a limited amount of time to respond before returning a 504 Gateway Timeout error. This should be tested against some of your backend REST APIs to see how the services that depend on them handle it.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: frontend
spec:
  hosts:
  - frontend
  http:
  - route:
    - destination:
        host: frontend
        subset: v1
    timeout: 1s


Finally, we’ll try introducing some latency into our network with delay fault injection. Through this we can learn how our frontend application deals with a delay in the expected response. Hopefully it just returns what you expect a few seconds late, but it’s better to find out now then it is to get paged at 4 a.m. on a Saturday.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: frontend
spec:
  hosts:
  - frontend
  http:
  - route:
    - destination:
        host: frontend
        subset: v1
    fault:
      delay:
        fixedDelay: 5s
        percent: 100


Let Istio Retry Failed Requests

So, now that you’ve tested some common failure modes, it’s time to work on remediation. In most cases, you’ll need to update your application code to deal with faults gracefully, but Istio has a built-in retry option that can buy you some time while you work out the kinks.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: frontend
spec:
  hosts:
  - frontend
  http:
  - route:
    - destination:
        host: frontend
        subset: v1
    retries:
      attempts: 3
      perTryTimeout: 2s

HTTP retries do roughly what you'd expect: if a request fails with a retriable error, Istio retries it up to the number of attempts you specify. As long as a healthy pod is still available somewhere, your request will eventually be served by one. Assuming your infrastructure itself is highly available (think multi-region/multi-cluster), there should be very little, if any, perceivable degradation in performance.
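
Out of the box, Istio only retries certain classes of failure. If you want to be explicit about what counts as retriable, the retryOn field accepts a comma-separated list of Envoy retry conditions. A sketch, assuming a reasonably recent release of the v1alpha3 API:

    retries:
      attempts: 3
      perTryTimeout: 2s
      # retry on server errors, failed connection attempts, and refused streams
      retryOn: 5xx,connect-failure,refused-stream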

Doesn’t it feel good to know what happens if things go wrong? I know I sleep easier at night with the knowledge that comes from chaos engineering. It might seem scary to potentially cause outages in production systems, but even the best disaster recovery plans will be incomplete until they’re tested. Either you test them, or they test you—and if you’ve read this far, then you already know which is preferable.

Photo by Park Troopers on Unsplash.


