
Podcast: Jennifer Mace on How Google Does SRE

Charles Humble talks to Jennifer Mace, aka Macey. She gives us her definitions of "Site Reliability Engineer" and "toil", discusses how Google recruits SREs, explores how to manage risk and speed, and considers the pros and cons of having a centralised SRE function: why that model, rather than the "you build it, you run it" model preferred by Netflix, works well at Google. She also offers advice for junior SREs and tells us what it is like when the pager goes off.

Subscribe: Amazon Music | Apple Podcasts | Google Podcasts | Spotify

About the interviewee

Jennifer Mace (Macey) is a Staff Site Reliability Engineer at Google Seattle, where she works to make Google's cloud a more hospitable place for the company's corporate infrastructure. Previously the tech lead of Google Kubernetes Engine SRE, she contributed to The Site Reliability Workbook on topics from incident management to the interplay between load balancing and autoscaling systems.

Resources mentioned

Generic mitigations
SRE book chapter 3 on embracing risk
SRE book chapter on Toil
SRE Workbook chapter on Incident Response

Full Transcript

Introductions

Charles Humble: Hello and welcome to the 12th episode of "Hacking the Org", the last one of our first year of making this podcast from the "WTF is Cloud Native" team here at Container Solutions. I'm Charles Humble. I'm Container Solutions' Editor in Chief. Before we get into the podcast, I wanted to mention that our conference, WTF is SRE, is back for a third year and will be held in person for the first time. It takes place in London, here in the UK, from the 4th to the 5th of May 2023. We have four tracks covering observability, DevSecOps, reliability and, for the first time this year, DevEx. If you want to find out more, point your browser of choice to www.cloud-native-sre.wtf. Tickets are on sale now. Spaces are limited, so do get along and snap one up as quickly as you can.

My guest on the podcast this month is Jennifer Mace, known as Macey. She is an author of speculative fiction but, perhaps of more relevance for this particular show, she's also a Staff Site Reliability Engineer at Google in Seattle, where she works to make Google's cloud a more hospitable place for the company's corporate infrastructure. She contributed to The Site Reliability Workbook on topics from incident management to the interplay between load balancing and autoscaling systems, and I'm thrilled to have her on the show. Macey, welcome. Lovely to have you on.

Jennifer Mace: Yes, thank you for the welcome. I'm happy to be here.

What is your definition of what an SRE actually is?

Charles Humble: My pleasure. So maybe we could start by what is your definition of what an SRE actually is?

Jennifer Mace: Well, I think Ben Treynor's off-the-cuff definition is always "what happens when you make a developer do reliability". So SRE means a lot of different things depending on the context, but essentially an SRE is an engineer, a software engineer or a systems engineer, whose focus is making things run well rather than delivering features.

How did you get into the SRE field in the first place?

Charles Humble: You've worked in the SRE space for a long time, but how did you get into the field in the first place?

Jennifer Mace: So I was fresh-faced and all of 22 just graduated from college and I got a Google offer as a SWE because I am a mathematician and mathematicians don't apply directly to ops roles. And I had three champion calls with different teams at Google and two of them were database teams and one of them was an SRE team. And I asked around my friends who worked at Google and I said, "What is an SRE?" And I got various answers—"they're scary people", "they're really cool", "they wear leather jackets." And I said, "You know what? That is not helpful, but I like it, so I'm going to do that."

What is it about being an SRE that you find interesting?

Charles Humble: Love it. As I said, you've worked in the field for quite a long time, so what is it about it that you find interesting?

Jennifer Mace: I think that the SRE role allows you to ask a lot more questions and kind of stick to your convictions. Stubbornness is a virtue in SRE in a way that it's not necessarily in a normal development role. We get to dig into mysteries and keep digging until we've figured out how they happened, and insist on putting in place methods to make sure they don't happen again. And so there's a combination of detective work and then the design and implementation part as well. So you get to dig down into the roots of how things really function.

How does Google recruit SREs?

Charles Humble: Right, yes. How does Google find SREs? You said you had your kind of initial conversation, but is there a program to allow developers to try out doing the job? Because it's one of those things where, if you haven't done it, you don't really know whether it will work for you or not, I wouldn't have thought.

Jennifer Mace: Well, I would say SRE is a fascinatingly diverse population, particularly in how people come to this role. I have a colleague, one of the first staff IC SREs among the women in SRE, who was a physicist with a PhD, working at Google as a business manager; she did basically an SRE bootcamp and transferred over. So she wasn't an engineer at all. She'd done a little coding in her physics PhD. A lot of the time we bring in new grad SREs as pure software engineers who've had no experience in anything, just as we would any other new engineer who is learning how to do the job of software development instead of solving problem sets for their professor, right? Because nobody comes out of college knowing how to be a software developer or a Site Reliability Engineer. And you do see with that population that some people love it and some people hate it and transfer to a development organisation.

And Google also has separately a program called Mission Control where any developer in the company with their manager's permission can spend six months embedded in an SRE team, can go on call, can do the job, and at the end of that they go back to their developer team or they decide to transfer either to that SRE team or a different one.

It's really great, I think, for our developer organisations as well because it injects that little bit of reliability knowledge and expertise back into them. So even when they're doing the early stages of feature development, they have that knowledge and can kind of build it in from the ground up.

And of course we hire people also who are already SREs, who have been working on platform teams in other companies or who've been network engineers for a long period of time, who've been systems engineers working on kernels, all sorts of backgrounds in that way. But I mean, not everyone in SRE has a degree. Not everyone in SRE came through software. It's very varied.

Charles Humble: I think that's fantastic, actually. I mean, obviously software development in general has this problem now, it wasn't always the case, but now it has this problem that it tends to be white middle class blokes like me who went to university, studied computer science, then got a job as a programmer. So it's fantastic that you've been able to have routes in and ways into the SRE space that promote more diversity.

Jennifer Mace: We've had some folks come in from our Atlanta data centre teams, for example, who've trained up and transferred across. We had a program for a while where we would bring in people who were maybe more junior than we would usually hire full-time; they were sort of training on the job, rotating through different teams to see which one was the best fit for them. So there's all sorts of ways.

How does the Google SRE model compare to DevOps?

Charles Humble: I wanted to talk a little bit about the SRE model at Google because having read the original SRE book-

Jennifer Mace: The Tome.

Charles Humble: Indeed the Tome, the original book. The way the SRE function is described in that book is basically as a centralised function. And I presume that the goal of that was to try and get some level of operational efficiency, which I imagine you do, but it kind of feels like it slightly flies in the face of DevOps, which always seemed to me to be more about making engineers directly responsible for running the code that they developed: you build it, you run it. So how does that play out in practice?

Jennifer Mace: There are several ways to respond to that, and one of which is if the axiom of DevOps is make developers do operations, then SRE is DevOps. We are developers; we are the developers. It's just that we're in separate teams, not the same teams. So that's one argument. And there are definitely places in Google where the SREs are in fact implementing and running the systems that they implemented. And those are usually more platform and tooling type situations. So teams who might have an expertise in load shedding and might build a platform that does filtering and query dropping across the whole company might also write the software and run it simultaneously.

But the more traditional SRE team that you are asking about is, for example, Search SRE, or my first team at Google, which was Ads SRE, right? Where you definitely have a team of SREs here and 60 developers over there who are writing the code, and you just have to make it run. And the intent behind SRE is to put in someone who does not want to do runbook operations, who does not want to have a checklist that they have to run through every Monday, and have them make that problem go away. So SRE, a little bit, is built on frustration. We don't want to do the manual labour and so we get annoyed and we write a robot to do it for us.

Charles Humble: But presumably there is always a kind of bit of tension there.

Jennifer Mace: There is definitely always that tension: when you are in separate organisations, the developer org, who are writing the features and building the system from the ground up, may not feel the impetus to make it stable when they are designing it, to build in the monitoring, to build in the redundancies, because they know that they aren't the ones who will have to deal with the consequences. And I think that's kind of what you're getting at, right?

Charles Humble: Yeah, absolutely. Because I remember from my own programming days, which admittedly is a little while ago now, but if you were working on a developer team, you were typically being measured on the number of features you were shipping. And if those features were causing trouble in production, that wasn't really a thing that mattered that much to you because you're being assessed on how many features are you putting through, and meanwhile you're handing it off to the ops people and the ops people are pulling their hair out going, it doesn't work.

Jennifer Mace: It doesn't work. Yep.

Charles Humble: Exactly. So are you saying that at Google that doesn't happen?

Jennifer Mace: I'm not saying that that doesn't happen, right? Sometimes it happens. Part of this is that being a separate organisation, with our own director and VP chain who can go and yell at their bosses if our partners are not holding up their side of the deal, really helps. Whereas if we reported up to the same directors and VPs that our developers did, whose entire motivation, like you're saying, is landing features, then we wouldn't have leverage. So that definitely helps.

And I'm not necessarily convinced that making the developers also go on call improves that if there are no SREs. You end up with a team whose entire incentive and reward structure is still around landing features, and they just have to eat their broccoli and they hate it and they don't want to do a good job at it. So they do the minimum that they need to and move on with their life. They have no incentive to become experts in reliability, to learn across the whole company what has worked in other teams. They have the incentive to do the minimum they need to where they are.

And I think this is why, even in many companies who do the DevOps model (and I won't say that the SRE model works at every company, by the way, because it's a question of scale: at the Google scale it's important; if you're smaller, it may not be worth the investment), what I see is that the company almost always has a platform team somewhere as well, who are really serving the function of the SRE teams that I described earlier: the ones who would, for example, develop the server throttler, the "how do we do load shedding across the whole company?" problem. There is a central team who's answering those questions, because if you just leave it to "I am the team who is delivering the new banner ad that goes across the search webpage", they are never going to come up with load shedding.

What is the relationship like between SWEs and SREs at Google?

Charles Humble: Right. Yes, of course. So thinking about this a bit more, you have a centralised function and you have developers who are receptive to taking feedback. How does that work out? What's that relationship like? Is that like a sort of partnership type relationship, a bit like maybe a consultancy type relationship? What's the feel of that?

Jennifer Mace: I think that as I've grown as an SRE, one of the increasingly important aspects of the job is relationship management. It's massive. You need to be in a partnership where you trust one another, where you give feedback rather than talking around things, and where you have ways to have that feedback be heard. And some of it is that, at Google, developer organisations have SREs because they asked for them. We do not force them to have SREs. There are teams with no SREs. So to a degree they are incentivised to keep us happy, because they chose to have us there. They want us to do our jobs; they asked us to.

And there's definitely a lot of back and forth. We try across the company to have written agreements for how much reliability is appropriate and we tell our developers, "you have asked us to deliver you four nines of reliability, like 99.99% of your queries succeed. In order to achieve that, we will do our best, but we need you to be responsive on bugs that we send you. If you feel that we are asking you too much, that's fine. We can downgrade to three nines." So there's kind of this balance, this negotiation going on. I will never try to force my partners to do something. I will just tell them what I cannot do if they choose not to.

As a Google SRE, do you ever modify business logic?

Charles Humble: And given that the SREs at Google are also programmers as distinct from sort of purely ops people, does that mean that you will actually get into the code itself, the sort of feature code and make modifications directly?

Jennifer Mace: Yeah. And I think that I've been trying to think about it less as ops and more as reliability. The phrase "SRE" has become just a phrase; we don't think about what it says, it's just an acronym, it's just a word. So I try to say instead: we are reliability engineers. We are not software engineers, we are reliability engineers. One of the things that we do is engineer software for reliability, but we will also engineer systems, change the architectures, make this traffic go to that traffic, which isn't per se software. But we definitely do go into the code. We will more frequently go into the code at the platform level, or at the underlying frameworks that the binaries are built on. We will less frequently go into the business logic and the features, but we will definitely sometimes go into the binaries, and we will implement entire services on our own to do our jobs. Like I said, to automate the ops load. "Toil" is the thing that we are always looking for.

SRE at Google has a toil budget. No team is allowed to spend more than a certain percentage of their time doing toil and various people define toil in various ways. I personally define it as "boring, repetitive work that does not gain you permanent benefits." So if something is not all three of those, it isn't toil, right? Debugging an incident for the first time is not toil, it's operations, but it's not toil, because you're learning something. Debugging the same incident for the third time is toil because you're not learning anything. Why is this happening again, this is nonsense.

Charles Humble: I really like that actually, because toil has become one of those words, again, that's got rather overloaded. But just because it's boring doesn't mean it's toil.

Jennifer Mace: Right. I'm thinking like refactoring your code so that the tests work well is boring and repetitive, but it's giving you a permanent benefit. So you do it.

Is the way you work with a legacy service different from how you work with something that's sort of brand new and greenfield?

Charles Humble: Is the way you work with a legacy service different from how you work with something that's sort of brand new and greenfield?

Jennifer Mace: I would say that it's pretty dang rare that SRE would be working on a new service from scratch, for a couple of reasons. One is we have a slight preference to start engagements after a developer team has launched something and run it for a period of time to show that it's stable. So we like developers to be their own ops for the first little bit and that helps sidestep some of the worst of the throwing it over the fence problem, because they've done it for six months.

So normally, if an SRE team is engaged with a developer team who is developing something from scratch, something brand new, we'll be engaged at a review level, at a design review level, on a consultancy level, but not necessarily in the code. But we may be asked to step in: "Can you come help us put the monitoring right? We're not sure we're doing it correctly." Things like that.

The cutting edge stuff is frequently more featureful: a lot of the frameworks and underlying systems already have monitoring, already have automatic scaling. If you are building something in Kubernetes, you're not going to have to manually scale how many CPUs or how many pods you have unless you have a specific reason to do so. The technology understands how to do that for you. Whereas in old school stuff, you may have to be running a script that says, "are you running at 90% utilisation? Well then add another copy please before you die." So that's definitely a difference. With the old stuff, you have to do a lot of things by hand, but it can be more stable, because you've been doing them by hand for 10 years.

Charles Humble: By hand?

Jennifer Mace: Sorry, when I say by hand, I do not actually mean by hand. Macey, what do you mean? I mean you may have to implement a bespoke function to do it automatically, as opposed to the platform having it built in, right?
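To make that contrast concrete, here is a minimal sketch, in Python, of the kind of "old school" scaling script Macey is describing: poll utilisation and add another copy when it crosses a threshold. The metric and deployment helpers are hypothetical placeholders rather than any real API; a platform such as Kubernetes runs this loop for you with its built-in autoscaler.

```python
import time

UTILISATION_LIMIT = 0.9   # "are you running at 90% utilisation?"
POLL_INTERVAL_SECS = 60

def get_cpu_utilisation(service: str) -> float:
    """Hypothetical: read average CPU utilisation from your monitoring system."""
    raise NotImplementedError

def get_replica_count(service: str) -> int:
    """Hypothetical: how many copies of the service are currently running."""
    raise NotImplementedError

def set_replica_count(service: str, count: int) -> None:
    """Hypothetical: ask your deployment system to run `count` copies."""
    raise NotImplementedError

def scaling_loop(service: str) -> None:
    # The hand-rolled version of what a platform autoscaler gives you for free:
    # watch utilisation and "add another copy please before you die".
    while True:
        if get_cpu_utilisation(service) >= UTILISATION_LIMIT:
            set_replica_count(service, get_replica_count(service) + 1)
        time.sleep(POLL_INTERVAL_SECS)
```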

Charles Humble: Right. Yes. Do you ever get to a situation where it's just like, do you know what, we're done here? And you hand the service back to the developer team?

Jennifer Mace: There's a few situations where that can happen. It is explicitly part of the charter of what it means to be an SRE at Google that handing back the pager has to be on the table. And there are a few ways this can happen. One is "we had an agreement and you aren't holding up your side of the bargain", and you get three warnings, but this relationship is not working. And we very much try to avoid that. I was on a team for a little while who had what we called the assisted tier model of on-call, which I think is a brilliant tier. And basically what it said was, "hey, for the entirety of our product area, if you are a small team who has a service that is basically boilerplate with some business logic on it, it fits all of these rules and it looks like everybody else's, we will take your on-call for free. And you have to fix your bugs enough that you stay over this line. If you don't stay over this line, we will off-board you."

And in that case, because it wasn't a particularly intimate relationship (we weren't working in their code, they weren't particularly changing what they were doing), off-boarding was very casual. So you'd get off-boarded for a month, you'd fix yourself, you'd get back up to the SLO, you'd pop back on, and we'd take your pager again. And that was fine. In other cases, it can be an apocalyptic scenario with directors yelling at each other.

And I have also been in a situation where the Ads team that I was on had this little tiny binary that Geo or somebody was running, and we were like, "We need this to be at four nines." And they were like, "We don't, it's running at two nines and that's fine." And Ads is like, "But we need it." And Geo's like, "We're not going to do that. That's silly." So Ads are like, "Well fine, give it here then." So Ads took it, Ads put in the effort because they were the ones who needed it. They invested and they got it up to four nines and they gave it back to Geo. "This is one of your binaries, here you go."

Can you talk about the relationship between reliability and risk?

Charles Humble: That's getting us into the relationship between reliability and risk, which is another of those things that I thought was really clearly articulated in the SRE book. And actually I found that fascinating. So can you talk a little bit about that relationship? The relationship between reliability and risk?

Jennifer Mace: Yeah. Because I know I've been standing around saying things like four nines all the time, which is a little bit jargon. So one of the first things I got hammered into my head as a baby software engineer in this job role was there's no such thing as a hundred percent reliability. And if you think you want it, you're wrong.

Charles Humble: Indeed. So how does an SRE think about reliability for a given service?

Jennifer Mace: The way that an SRE thinks about reliability for a given service is something I like to talk about as appropriate reliability, which is to say: how important is this? What is the impact on the user if it fails? Okay, then we are going to say, this is the application that displays the menus for the cafes at Google. It should probably work 90% of the time. It's not a big deal if it fails. You can walk to the cafe, that's fine. The cooks are not going to refuse to feed you. You will live. And then at the other end you have the servers that register the clicks on advertisements for Google Search, and that needs to work. So maybe we say that is four nines. Once you have that number, which we call an SLO, a service level objective, that's the target, the amount of reliability you think is appropriate, we then say, okay, well, that's your risk budget now: in the cafes' case you can fail 10% of your queries and it's fine.
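To put rough numbers on those targets (the arithmetic here is ours; the examples are Macey's), the risk budget implied by an availability SLO is simply its complement, spread over a year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(slo: float) -> float:
    """Allowed minutes of full outage per year for a given availability SLO."""
    return (1.0 - slo) * MINUTES_PER_YEAR

for name, slo in [("cafe menus", 0.90), ("three nines", 0.999), ("four nines", 0.9999)]:
    print(f"{name:12s} {slo:.2%} -> ~{downtime_budget_minutes(slo):,.0f} minutes/year")

# cafe menus   90.00% -> ~52,560 minutes/year  (roughly 36 days)
# three nines  99.90% -> ~526 minutes/year     (under 9 hours)
# four nines   99.99% -> ~53 minutes/year      (under an hour)
```

That gap between 10% of queries and 0.01% of queries is exactly why the cafe menu app can experiment freely while the ad click servers cannot.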

Jennifer Mace: So you want to redeploy everything onto Ruby on Rails because you feel like that will be fun this week, and maybe you take it down for an hour? That's fine, go for it. You want to try doing this all in Haskell? I mean, it's your funeral, but have fun. Whereas if you are, say, the web traffic servers for the whole of Google, you can have one minute of downtime a year, or a second of downtime a year, and that's your risk budget. And so you behave very differently towards your service. You start over-provisioning, you start thinking about N plus one or N plus two redundancy, having extra capacity in case an entire region goes down, so the whole region's traffic can fit in your spare capacity, and this is expensive. And the other aspect of risk is also how fast do you move: how shoddy can you be versus how careful do you have to be in validating things before you even release them?

Because there's a lot of things that we can do to edge closer to a hundred percent reliability. Things like full scale reproduction of the production environment in a load test, right? In a test bed situation. That's expensive, takes a lot of people to work on it. It will always be going out of date. You only do that if you have a very low risk budget. Whereas if you have a medium risk budget, what you maybe do instead is you do the rollout on production, but to 1% of the instances and you canary it in production because if those 1% fail, you can fix that inside of 10 minutes and that doesn't blow your risk budget.
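A back-of-the-envelope sketch of why that 1% canary is such a cheap way to spend risk budget (illustrative numbers only, not Google's tooling):

```python
def worst_case_budget_fraction(canary_share: float,
                               detection_minutes: float,
                               budget_minutes: float) -> float:
    """Share of the annual error budget burned if the canary is completely broken.

    Assumes the damage scales with the fraction of traffic on the canary and with
    how long the breakage lasts before you detect it and roll it back.
    """
    return (canary_share * detection_minutes) / budget_minutes

# 1% of instances, detected and rolled back within 10 minutes,
# against a four-nines budget of ~53 minutes per year:
print(f"{worst_case_budget_fraction(0.01, 10, 53):.2%}")  # ~0.19% of the annual budget
```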

How does the trade-off between innovation speed and reliability play out in practice?

Charles Humble: There's a related thing, and again, this does come up in the book, which is that there is effectively a trade off between, I guess we might call it innovation speed and reliability. How do you find that that plays out in practice?

Jennifer Mace: Well, I think that there's a question about what you mean by innovation speed, right? So it is generally true that most outages happen due to some change that the operators have performed. This is not universally true. We've definitely had, you know, Flappy Bird cause many outages in the ads stack back in the day, because it was sending much more load than we expected, because it was a viral sensation, which is the sort of silly thing you can't predict, right? Those are the success disasters. Setting those aside for a sec.

Charles Humble: Yeah, yeah, totally.

Jennifer Mace: Yes. So success is, you know what I mean?

Charles Humble: Absolutely, yes. How could you possibly have predicted that?

Jennifer Mace: Oh dear, we totally didn't plan for this. Oh dear.

Charles Humble: Yeah, yeah.

Jennifer Mace: Yes that. Setting that aside for a second. When we're talking about innovation speed, a lot of the time we're talking about how often do you let people release? You are the kernel developer team, you will release once every six months because if you mess up, it's really, really bad. And you will spend five of those six months doing intensive testing to get as close to perfect as you possibly can.

I have a friend at a different company who works in chip design. If they mess up the molds, that is a multi-million dollar failure, completely unrecoverable. And so they have a very low risk tolerance and they can only innovate very slowly.

But when we are talking about binary releases, the normal case of services, there is research (you can look this up; I think DORA is the body who put it out) that shows the opposite of what your instinct might be, which is that services which release more frequently are more reliable, not less. And there's a number of reasons for this, right?

Charles Humble: Yeah. It's fascinating though, isn't it? Because it's such a counterintuitive observation, it's like there's this very ingrained thing in us that if something feels risky, we want to slow down. It's like a very deep natural sort of human response. The observation doesn't feel right instinctively, and unless you've actually kind of experienced it, I think it's just really hard to wrap your head around. But it does seem to check out over and over again.

Jennifer Mace: It feels like if you want to get something right, you should do it slower and more cautiously. But this has been proven in other contexts too. There's an experiment where a pottery professor split a group of students into two halves, and told half the students, "you will be graded on making as many pots as you possibly can", and told the other half, "you will be graded on making the best pot you possibly can." And so the ones who were told to make the best pot dithered, and were slow, and were super, super careful, and made their one pot. And all of the best pots were in the group who focused on quantity.

Charles Humble: That's really interesting.

Jennifer Mace: It's very funny. But as to how this matters to reliability: there are a few things that you improve significantly when you have velocity. Velocity isn't quite what I mean. Agility, let's call it agility in your service. When you have the ability to make changes and you are doing it frequently, you are exercising not just your paths for releasing new changes, but generally the mitigations that you have for fixing things if you mess it up. If you are upgrading your Kubernetes cluster, again, only at every major release and something is broken in it, there are 15 gajillion different things that it could possibly be, because you have a whole quarter's worth of change, a whole six months' worth of change, whatever it is. It's something.

If you are keeping up and doing the dot releases on a regular basis and something broke, you have a much smaller delta to try to identify the breakage within. And you probably have a much better chance of rolling it back that far, or of cherry-picking a solution on top, because you are used to updating your cluster on a regular basis. You have procedures, your nodes aren't all going to crash, your pods are still going to run. You have practised.

Charles Humble: Right. And it's that practice and procedures that makes the difference.

Jennifer Mace: So yes, I think there is also, pet peeve corner: people think that time is the knob that you turn to get more validation of a release. When we're talking about canarying something, when we're talking about running it to see if it's healthy, time is not the knob you turn. Opportunities to detect failure is the knob you turn. If your system is not collecting data on failure over time, then testing it for one day or testing it for 10 days is exactly the same. So we see this with many people who are like, "I only ever find out about outages when my customers tell me about them. But what I'm going to do is a four-day release, where I do one region a day for four days," and I'm looking at them and I'm like, "Your customer's not going to tell you until a week later. So what exactly have you gained by doing that?"

Charles Humble: Right. Yes, totally.

Jennifer Mace: You are not turning the right knob.

Charles Humble: Absolutely. But again, it's one of those things that goes against our natural instincts.

Jennifer Mace: It goes against instinct. And I will say, to go back to our earlier point, this is why you have a separable SRE organisation: because I have seen this so many times, because it's the entirety of my job.

Charles Humble: Right. Yes, of course.

Jennifer Mace: Whereas if I was a developer who was working on one project, I would've only seen how my thing fails, and I couldn't draw the sorts of generalised insights that I can because I've seen dozens and dozens of them.

What's the first thing you do as an SRE if your pager goes off?

Charles Humble: Yeah. Again, that's a really interesting kind of justification for this approach, I think. For someone who hasn't experienced it, what's the first thing you do as an SRE if your pager goes off?

Jennifer Mace: Well, I mean, the first thing you do is log in. But sophistry aside, the first question that you want to answer with any incident is: what is the impact? What is my signal for the impact? Which graph is accurately reflecting the breakage? Because until you have that, you can't answer questions like "when did it start?", "how bad is it?", "is this all hands on deck, or is this 'God, it's that annoying thing, let me hit the server with a hammer and it'll be fine'?" So the first thing you do is find the graph. The second thing that you should do is ask, "how can I make the problem no longer affect the users?" If you've found the place where you're measuring the impact and the impact is ongoing and current, you're like, "Okay, how do I make that stop?" And only then, once you have mitigated it, do you start asking: exactly how did this happen? How do I fix it properly? How do I go into the code and fix the bug, whatever. So a code bug is a great example here.

You should always roll back to the point where your graph says the service was healthy, even before you know what piece of code was the problem. Because you can work that out later, when your users aren't being hurt by it. And sometimes this is impossible, right? Sometimes you cannot figure out how to mitigate until you've debugged the whole thing.

I think that one of my favourite things to write in the new SRE workbook was the chapter on incident analysis, because they let me write down an entire Kubernetes outage, completely true, and publish it. That one was very tricky, and we couldn't do anything until we knew what had failed. And that is not a situation you want to be in. To the point where, if your service's reliability matters to you, you should be architecting from the ground up to make sure that never happens. You should always have a way to fix things for your users before you understand exactly what broke.
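As a compact restatement of that order of operations, here is a minimal sketch; every helper is a hypothetical stand-in for real monitoring and release tooling, not an actual incident-response API.

```python
from dataclasses import dataclass

@dataclass
class Impact:
    started_at: str
    severity: str
    is_ongoing: bool

def find_impact_signal(service: str) -> str:
    """Which graph accurately reflects user-visible breakage? (hypothetical)"""
    raise NotImplementedError

def assess_impact(signal: str) -> Impact:
    """When did it start, how bad is it, is it all hands on deck? (hypothetical)"""
    raise NotImplementedError

def apply_generic_mitigation(service: str, impact: Impact) -> None:
    """Roll back, drain traffic, or add capacity: stop the user pain first. (hypothetical)"""
    raise NotImplementedError

def investigate_and_fix(service: str, signal: str) -> None:
    """Root cause and the proper code fix come last. (hypothetical)"""
    raise NotImplementedError

def handle_page(service: str) -> None:
    signal = find_impact_signal(service)            # 1. find the graph
    impact = assess_impact(signal)                  # 2. size the problem
    if impact.is_ongoing:
        apply_generic_mitigation(service, impact)   # 3. mitigate before you fully understand it
    investigate_and_fix(service, signal)            # 4. only then dig into how it happened
```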

What would your advice be to a junior SRE dealing with their first major incident?

Charles Humble: What would your advice be to a junior SRE, I don't know, maybe a new hire who is dealing with a major incident that might have contractual implications as well, so there might be an associated SLA with it. What would your advice be to someone like that?

Jennifer Mace: Well, there's a bit of a saying, or there certainly was when SRE was smaller, that on-call in an outage has director authority. And if I was talking to a new hire who was in this situation, I would also tell them, "and it is completely okay if you want to bring in someone more senior to take that from you; you don't have to, but we will support you." But yes, you can requisition other people in other teams. You can tell people, no, we are rolling back this big launch. You can tell people, yes, we are going to double our capacity for the next 12 hours because the users need it. Yes, I know that's expensive. Right? The important thing, and this is one of the funny things that you don't necessarily expect when you hear reliability engineering described, is that during an outage, on-call is the advocate for the user.

Charles Humble: Interesting. So why is that?

Jennifer Mace: Because we frequently find that our developer partners are the advocate for the system.

Charles Humble: Right. Yes. And I guess that's valid because that's their area of expertise, right?

Jennifer Mace: They are the ones who understand how the cogs fit together. But if you are not specifically trained to do so, it's very easy to get laser focused on the wrong thing and trying to fix the code and forgetting that dozens of people are being told by Google Maps to drive into the middle of a lake or something terrible.

Charles Humble: Right. Yes. Yeah, absolutely. We should probably, I don't know, fix that first.

Jennifer Mace: We need that to stop, right now. That has to stop. That's more important than figuring out why.

Charles Humble: Right. Yeah. And in that situation, that's when you need those kind of quick ways to just please make this stop.

Can you tell us about generic mitigations?

Jennifer Mace: And this gets back to, I guess, the phrase that I coined in that writeup of the GKE incident, which is the idea of generic mitigations.

Charles Humble: Yeah, which is really interesting, actually. You wrote an O'Reilly blog on generic mitigations as well.

Jennifer Mace: Yes, I was very happy because they commissioned me some great art from Emily Dudak. So that was great fun.

Charles Humble: Oh, that's really lovely. Actually, one of the great privileges of my job at Container Solutions now is that we have fantastic in-house designers and artists, and it's just so lovely. It's such a privilege to work with people who are good at that. So with this blog post, let's just unpack this idea of generic mitigations a bit more. What does the word generic there apply to?

Jennifer Mace: So the idea behind a generic mitigation is that the word generic is not saying "it's generic across multiple systems". What it is saying is that it's generic across multiple outages for your system. So you sit down and you say: I am a web server that runs weekly releases. The types of outages that I normally see are, a lot of the time, about binary changes and code changes, so for 50% of my outages, if I had rolled back, it would've been better. So that mitigation, the rollback, is generic in that you don't have to understand your outage to use it.

Charles Humble: Right. Can you give other examples?

Jennifer Mace: So another great one is I will deploy twice as many copies of my service because for some weird reason it's overloaded. I don't know why. It could be a bug, it could be genuine user traffic, something weird is happening, but my queries are failing, so I'm going to double the deployment until I can figure it out. And that's a generic mitigation. I mean, a mitigation is just any action you take to not necessarily resolve, but to improve a problem. And generic just means you don't need to understand the whole problem to use it.

Other ones, we have the traffic drain, which is, this region seems to be weird, let's move all our traffic to the region that seems healthy. There's data rollbacks and data pushes. What are some of the other ones? Oh yes, isolating things. So this one user, every time their query hits a server, the server dies. So I'm going to block that one user. I'm so sorry buddy, you don't get to play anymore, but everybody else gets to be served. Or this one row on the database, every time it gets queried, somebody put Little Bobby Tables in it and everything goes terrible. So either you can do this with block lists or you can do it with what we've called quarantine servers where you have an actual copy of your instance running that you can move that user onto and no one else is there. So put them in the naughty corner and they can think about what they've done. So those are some of the ones off the top of my head from that article.
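As a closing illustration (ours, not Google's tooling), keeping those generic mitigations "on the shelf" could look something like a small registry of named, pre-tested actions that an on-caller can reach for without first understanding the outage. Every function below is a hypothetical placeholder.

```python
from typing import Callable, Dict

# Hypothetical hooks into your release, traffic and data systems.
def roll_back_release(service: str) -> None: ...
def double_replicas(service: str) -> None: ...
def drain_region(service: str, region: str = "") -> None: ...
def roll_back_data_push(service: str) -> None: ...
def quarantine_user(service: str, user_id: str = "") -> None: ...

# "Generic" means none of these require understanding the outage before use.
GENERIC_MITIGATIONS: Dict[str, Callable[..., None]] = {
    "rollback": roll_back_release,         # code/binary change outages
    "upsize": double_replicas,             # overloaded for reasons unknown
    "traffic drain": drain_region,         # this region seems weird, move traffic away
    "data rollback": roll_back_data_push,  # bad data push
    "quarantine": quarantine_user,         # one poison user/query kills servers
}

def mitigate(name: str, service: str, **kwargs) -> None:
    """Run a pre-built mitigation while the investigation continues in parallel."""
    GENERIC_MITIGATIONS[name](service, **kwargs)

# e.g. mitigate("rollback", "web-frontend")
#      mitigate("quarantine", "menu-db", user_id="little-bobby-tables")
```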

What is it that you are currently interested in?

Charles Humble: As we get towards the end of our time, could you maybe talk a little bit about what it is that you are currently working on? What is it you are currently interested in?

Jennifer Mace: Sure. So at the moment, I'm on a team whose focus is assisting Google's own usage of GCP. And so a lot of the stuff that I'm thinking about these days is fleet management, and ways to validate changes across heterogeneous users. So if we roll out a policy change across every Googler who is using cloud, and they're all using it in totally different ways from one another, how do we detect if that broke them? How do we detect what the impact was? Right? Which is a really fascinating problem, because the lack of uniformity means you need cooperation from all of the users to a degree, because you can't automate that, right? "What does broken mean?" is a very tricky question. So that's what I've been having fun with of late.

Charles Humble: Wonderful. Macey, thank you so much for joining me on this, the 12th episode of Hacking the Org from the WTF is Cloud Native team here at Container Solutions.

