
SLO Strategy: Balancing Strategic Vulnerability with Uptime and Engagement

We can get overly obsessed with uptime. We can set service level objectives (SLOs) too high. If we zero in on five-nines across the board, we compromise our teammates’ ability to innovate, and we risk burning them out. As Marc Alvidrez puts it in the original Google SRE book:

“You might expect Google to try to build 100% reliable services—ones that never fail. It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better…users typically don’t notice the difference between high reliability and extreme reliability in a service, because the user experience is dominated by less reliable components like the cellular network or the device they are working with. Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability! With this in mind, rather than simply maximising uptime, Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness—with features, service, and performance—is optimised.”

But what does that look like in reality? How do you find the right balance between experimentation and always-on? Fire drills, after all, can only prepare you so much.

At this year’s Chaos Carnival, Honeycomb’s Principal Developer Advocate Liz Fong-Jones shared three heart-stopping times when the team blew, or came close to blowing, its SLO. It’s a lesson in considering when maybe you should blow it. Or at least update it. And in how you can let your error budget be the yang to your SLO yin.

We don’t want things to go wrong often, but as Fong-Jones showed, outages and failed experiments can become unscheduled learning opportunities.

An SLO is what matters to your users

Part of an SLO strategy is reducing your “pager load”—you don’t want to be waking up teammates in the middle of the night for things your users don’t care that much about.

Fong-Jones kicked off her talk by explaining what an SLO is—a target metric that measures something meaningful to your users. “You should aim to achieve just that level of reliability and not much more. So you're not spending money chasing nines that don't matter.”

It all starts with your understanding of what is meaningful to your users. “The idea is that you should not plan for 100% uptime. You should have some idea about the level of reliability that actually is meaningful to your users,” she explained. Then you can plan into your SLO a level of vulnerability that allows for chaos engineering experimentation and some wiggle room for inevitable, unpredictable system behaviour.

Honeycomb is a DevOps-oriented observability tool that ingests telemetry data from users to help them gain insight into their increasingly unpredictable, distributed systems. Observability differs from traditional monitoring in that the goal is to collect as much raw data as possible and then use it to explore how a system behaves in real time, as opposed to having predefined alerts and data collection for specific scenarios. As Jessica Kerr, principal developer evangelist at Honeycomb, put it at Strange Loop last year, “Observability turns monitoring on its head”.

To achieve this, Fong-Jones said, “We allow people to search and query on every single column of that data, regardless of how much cardinality there is, regardless of how many columns there are”. Honeycomb’s SLOs map to real user journeys and goals.

For example, they want to make sure the Honeycomb product’s homepage loads relatively quickly, within 250 milliseconds, lest customers get frustrated while getting their bearings. If a user is trying to run a query, it should take less than ten seconds, but the Honeycomb team knows it doesn’t dramatically affect user experience if they have to click refresh and reload. Their bread and butter, though, is telemetry ingestion, which they cannot risk.
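As a rough illustration, those user journeys could be written down as explicit targets. The thresholds below come from the examples above, and the 99.99% ingest objective comes up below; the homepage and query percentages are placeholders, not Honeycomb’s published numbers:

package slo

import "time"

// Target ties an objective to a user journey: a request counts as "good" if it
// completes within Threshold, and Objective is the fraction of requests that
// must be good over the trailing Window.
type Target struct {
	Journey   string
	Threshold time.Duration
	Objective float64
	Window    time.Duration
}

var Targets = []Target{
	{Journey: "load the homepage", Threshold: 250 * time.Millisecond, Objective: 0.99, Window: 30 * 24 * time.Hour},
	{Journey: "run a query", Threshold: 10 * time.Second, Objective: 0.99, Window: 30 * 24 * time.Hour},
	{Journey: "ingest an event in a batch", Threshold: 5 * time.Millisecond, Objective: 0.9999, Window: 30 * 24 * time.Hour},
}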

"Most importantly to us, we only get one chance to ingest your data most of the time,” Fong-Jones said. “We don't want to drop your customer data because, if we drop your customer data, especially for enough customers, those customers will wind up with a giant crater in their data for the end of time.”

So this process is where Honeycomb focuses on hitting four-nines—i.e., 99.99% of the time, the events that users send to Honeycomb are ingested without error in less than five milliseconds per event in the batch, over a trailing period of 30 days.

That translates to less than 4.5 minutes of allowed downtime per month. That’s not a lot of wiggle room.
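To spell out the arithmetic: a 30-day window contains 30 × 24 × 60 = 43,200 minutes, and a 99.99% objective leaves an error budget of 0.01% of that, or 43,200 × 0.0001 ≈ 4.3 minutes of total downtime.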

Except that most failures aren’t 100%. Honeycomb has designed its systems so they only experience partial degradation.

How to ship within an SLO in normal circumstances

“In reality, if we have a 1% outage, we've got 400 minutes to fix it before we’ve burnt through the entire [monthly] error budget. We still kind of keep track of this idea of how much error budget we have left and what are we achieving versus what our goal is,” Fong-Jones explained.
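Those 400 minutes follow from the same budget: when only 1% of requests are failing, the error budget burns at 1% of the full-outage rate, so the roughly 4.3-minute budget stretches to around 430 minutes. A minimal sketch of that burn-rate arithmetic, with illustrative numbers rather than Honeycomb’s actual tooling:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Error budget for a 99.99% SLO over 30 days: (1 - 0.9999) * 30 days ≈ 4.3 minutes.
	budget := time.Duration((1 - 0.9999) * float64(30*24*time.Hour))

	// A partial outage burns the budget proportionally: if only 1% of requests
	// fail, the budget lasts 100x longer than a total outage would.
	failureRate := 0.01
	timeToExhaustion := time.Duration(float64(budget) / failureRate)

	fmt.Println(budget)           // ≈ 4m19s
	fmt.Println(timeToExhaustion) // ≈ 7h12m, i.e. roughly 430 minutes
}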

Keeping within that error budget isn’t so easy, though. That’s where progressive delivery comes in. Instead of rolling out a new feature to 100% of users, Honeycomb rolls out new features gradually, building them behind feature flags. This minimises the risk of things going 100% wrong and allows the team to test along the way for expected behaviour. Every release involves a mix of automated continuous integration tests and human reviews, all in rapid deployment cycles.
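A common way to implement that kind of gradual rollout, shown here as a generic sketch rather than Honeycomb’s actual flagging code (their flags live in LaunchDarkly, as comes up later), is to hash a stable identifier into a bucket and compare it with the flag’s rollout percentage:

package rollout

import "hash/fnv"

// InRollout reports whether key (say, a team or dataset ID) falls inside the
// current rollout percentage for a flag. Hashing keeps the decision stable, so
// the same user stays in or out of the cohort across requests.
func InRollout(flag, key string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(flag + ":" + key))
	return h.Sum32()%100 < percent
}

// A caller might then guard the new code path (handleGRPC and handleJSON are
// hypothetical names):
//
//	if InRollout("grpc-ingest", teamID, 1) { // start with 1% of traffic
//	    handleGRPC(w, r)
//	} else {
//	    handleJSON(w, r)
//	}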

The Honeycomb release cycle also runs on a short lead time: a green-button merge automatically reaches production within an hour. This acknowledges that tech workers are true knowledge workers, which means they can forget details if they wait too long between deployments. A faster deployment cycle, Fong-Jones said, “maximises the amount of context that is in the head of the person who wrote the change right before it rolls out”. Each release passes through progressively more rigorous environments before moving to production.

But things don’t always work out that way. As she put it, “Not every experiment is going to succeed. There are always going to be some experiments that fail catastrophically and therefore it's important to know what you do if you either break your SLO or are in danger of breaking your SLO.”

The rest of Fong-Jones’s talk focused on how, when things failed, the Honeycomb team was able to keep—relatively—calm and just keep shipping.

Almost Epic Fail #1: Failure to ingest

First, let’s head back to December 2020. Honeycomb’s ingest API service ran on stateless “shepherd” workers that ingested data in the OpenTelemetry format via a custom JSON API. Since OpenTelemetry has native support for gRPC as its remote procedure call protocol, the team wanted to move from JSON to gRPC, which in turn required binding a privileged port by default. Everything worked smoothly in staging and on local developer workstations.

But then, in production, the service attempted to bind to the privileged port and crashed. The developer was able to stop his deployment, but, because Honeycomb wasn’t on Kubernetes yet, there was no way to roll back. That meant any broken builds would remain broken. And it was still the default new build, so any new hosts were going to try to pull down that broken build.
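The failure mode is easy to reproduce: on Linux, binding a port below 1024 requires elevated privileges, so a listener that defaults to, say, :443 for gRPC over TLS starts fine on a developer workstation with the right permissions and then dies in production under an unprivileged user. A minimal, hypothetical sketch of that startup path, not Honeycomb’s code:

package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	// Ports below 1024 need root or CAP_NET_BIND_SERVICE on Linux. This works
	// locally or in a permissive staging setup, then fails for the unprivileged
	// user the production service runs as.
	lis, err := net.Listen("tcp", ":443") // illustrative privileged port
	if err != nil {
		// The process exits immediately; with no rollback path, every replacement
		// host pulls the same broken build and repeats the crash.
		log.Fatalf("cannot bind listener: %v", err)
	}

	srv := grpc.NewServer()
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("gRPC server exited: %v", err)
	}
}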

“We were unable to scale up quickly enough because every single worker that Amazon stood up would simultaneously try to pull the new binaries and then it would start the binary, the binary would crash, and then it would show 0% CPU usage.” Amazon Web Services, Fong-Jones continued, interpreted this 0% CPU as a sign the service had scaled up too far and was wasting CPU.

AWS “would scale in some other workers, even from the busy workers. So then we would get this cycle where there's traffic coming off of our workers that were working correctly and going on to new workers that were broken. So we were in this yoyo cycle in which latency was just going up and up and up, as our users were unable to get their data.”

Instead of impacting only the users the update had been rolled out to before the deployment was halted, it ended up impacting all Honeycomb users at the same time.

If they were following the advice in Google’s SRE book, since they blew their SLO, the team should have responded by freezing deployments. However, doing so goes against Honeycomb’s belief that the deploy machinery should keep working under any circumstances because it should always be possible to ship reliability improvements.

“Instead, what needs to change is you need to shift where you're spending your time as an engineering team. So you don't just pile up a bunch of things that haven't shipped, but instead you're focused on using your existing running deploy pipeline to deploy improvements to reliability,” Fong-Jones said.

They still wanted to meet that launch deadline with AWS as a partner, so they chose to keep on shipping, but to a dedicated feature branch. They stood up a separate set of isolated workers just for gRPC, while keeping 99.99% of traffic on the original JSON endpoint going to the original workers.

Almost Epic Fail #2: Failure to buffer

In February 2021, the Honeycomb team was looking to tackle another reliability concern. The product uses Apache Kafka to decouple the shepherd ingest workers from the query workers. This decoupling adds complexity through added dependencies: if Kafka breaks, they can’t ingest data. So they were in the process of deploying Kafka with a new architecture, along with a new set of technologies from Confluent, and moving storage from local disks to Amazon Elastic Block Store (EBS). All three at the same time.
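The decoupling itself is a standard produce-and-consume split: ingest workers append events to a Kafka topic and the query side reads them at its own pace, which is also why a Kafka outage stops ingestion entirely. A minimal sketch using the segmentio/kafka-go client, with illustrative broker and topic names rather than Honeycomb’s actual pipeline:

package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	ctx := context.Background()

	// Ingest side: shepherd-style workers append incoming events to a topic
	// instead of handing them straight to the query/storage tier.
	w := &kafka.Writer{
		Addr:  kafka.TCP("kafka:9092"), // illustrative broker address
		Topic: "ingest-events",         // illustrative topic name
	}
	defer w.Close()
	if err := w.WriteMessages(ctx, kafka.Message{Value: []byte(`{"event":"example"}`)}); err != nil {
		log.Fatalf("produce failed: %v", err) // if Kafka is down, ingest is down
	}

	// Query/storage side: a separate consumer group reads at its own pace.
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka:9092"},
		GroupID: "query-workers",
		Topic:   "ingest-events",
	})
	defer r.Close()
	msg, err := r.ReadMessage(ctx)
	if err != nil {
		log.Fatalf("consume failed: %v", err)
	}
	log.Printf("consumed %d bytes", len(msg.Value))
}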

Juggling all the newly introduced variables, they spent two months smoke-testing everything out in their dog-food Docker environment—which wasn’t a true replica of the stresses of a production workload, but rather about a third the scale of production. They knew that this presented a risk of introducing chaos once in production, but a lot more chaos ensued than they had anticipated.

“We knew that there was going to be some risk of failure as this happened, but we underestimated how bad it was going to be. So we wound up running into three different dimensions that we had not anticipated. Like we knew that there might be some scaling problems, but we kept on encountering issue after issue after issue.” First, Fong-Jones said, they blew through the disk bandwidth available to EBS from each individual AWS instance. Then they hit the ceiling on network capacity in and out of each instance.

They realised they had deployed the wrong instance size, so they fixed that, but things still went awry. They’d be running fine, but when they started to replicate data, the instance would crash, with Amazon reporting that someone had terminated the instance—but no one had. So they defaulted back to a safe configuration, remaining on local SSDs and on Intel instances. The one change they did deploy was moving older data onto Amazon S3.

In the end, they would wait another ten months for the release of AWS Graviton2 storage instances to feel safe enough to make those moves. Because they realised at the time that “it was not worth it to us to burn our people chasing just a small amount of money”.

That wasn’t the only lesson from the quagmire, Fong-Jones continued, and it was one that really drove Honeycomb into 2022: “Even if you anticipate there may be some chaos making a change, you should always have a mechanism and safety cord to pull to say, ‘You know what? This is not going according to plan. We need to back up from the experiment and try something else.’ We need to take care of our people first because, even though we didn't blow our SLO in terms of the service—we had, through heroics, kept our SLO working correctly—the cost for humans was not worth it.”

The Honeycomb team realised that some teammates were up at all hours trying to fix it. So they made sure everyone knew they should expense food deliveries for their households, to have one less concern. And then they reiterated the expectation that everyone should speak up if something doesn’t seem right or if they need to hand things off to someone else in order to take a well-deserved break. After all, Honeycomb’s core values are transparency, autonomy, experimentation, and kind, direct feedback, all of which were emphasised in this trial.

Almost Epic Fail #3: Failure to query

Most recently, the team realised that Honeycomb was incurring “extreme cost” from AWS Lambda charging by the millisecond, while performance wasn’t where they expected it to be and they were hitting limits on how wide they could scale up. AWS had just announced that it would allow the use of Graviton2, with 64-bit ARM processors, across 11 functions.

The team started out by running a 50/50 A/B test of the current versus new. As Fong-Jones described it: “It turns out doing so cold on a giant workload that is streaming the capacity [is] not a good idea. So we went straight to 50% and it just blew up in our faces right away—latency went up by a factor of three.”

But, this time, all was not lost, because the change was behind a feature flag in LaunchDarkly. They didn’t risk their SLO because they were able to roll back quickly and remediate the impact within about 20 seconds, dropping the problematic traffic down to 1%.

“So this kind of experiment is something that becomes normal and routine, if you have feature flags, because it really enables you to do this process of iterating on something, knowing that you have an out in case it goes wrong, and knowing that you have an upper bound on the number of users that you can impact,” she explained.

They inched the experiment forward by a few percentage points and, by the time of the talk, they were at 30% of users. Every step was about gaining more feedback on how they could move forward safely.

Fong-Jones used the analogy of learning to ride a bike. The slower and more hesitantly you go, the more likely you are to lose your balance. “But if you keep on moving and making forward progress and carrying on momentum, that means that you're going to be in a more stable state than if you kind of stop cold and then have to start over from scratch every single time.”

By leveraging feature flags or segregating out more experimental traffic, she said, you can spread out risk while improving the customer experience.

In the end, an SLO is just a number

Fong-Jones warned that black swan events can and will happen—but that shouldn’t keep you from experimenting.

“Your SLO is, at the end of the day, just a number. Customer happiness and customer satisfaction is what matters. And customers will be understanding that they would rather tolerate a little bit of unreliability than to have all their customer data leaked all over the internet,” she said.

That satisfaction starts with a discussion between the customers, engineers and key stakeholders to understand what really matters. An SLO is simply a reflection of the broader socio-technical system.

And remember: risking blowing your SLO may end up making your systems, processes and teams stronger. These experiments and tests can uncover unknown sources of risk that you wouldn’t otherwise discover until it’s too late. By continuously validating your infrastructure, you’re better prepared to meet your SLO in the future.

In the meantime, just keep looking for those unscheduled learning opportunities.

