WTF Is Cloud Native, Hacking the Org

Podcast: Honeycomb CTO Charity Majors on Code Rewrites, Acquisitions, Observability, and Team Performance

Charles Humble talks to Honeycomb CTO and co-founder Charity Majors. They talk about her experience at Parse, rewriting Parse in Go, and her lessons from Parse's acquisition and shutdown by Facebook. They also discuss Scuba, why Charity went from CEO to CTO at Honeycomb, places where dashboards make sense, and how to think and talk about technical team performance.

Subscribe: Amazon Music | Apple Podcasts | Google Podcasts | Spotify

About the interviewee

Charity Majors is the cofounder and CTO at honeycomb.io, which pioneered observability. She has worked at companies like Facebook, Parse, and Linden Lab as an engineer and manager, but always seems to end up responsible for the databases. She loves free speech, free software and single malts.

Resources mentioned

Accelerate by Nicole Forsgren, Jez Humble, Gene Kim
Good Strategy Bad Strategy: The Difference and Why It Matters by Richard Rumelt

Full Transcript

Introductions

Charles Humble: Hello and welcome to the 13th episode of Hacking the Org, the podcast from the WTF is Cloud Native team here at Container Solutions. I'm Charles Humble, Container Solutions editor in chief.

Before we get into the podcast proper, it would be remiss of me not to mention that our conference WTF is SRE is happening any moment now. It's May the 4th to the 5th in London, here in the UK. As I record this, there are still some tickets available, so if you haven't got your tickets yet, get along to our website, I'll include a link in the show notes, and see if you can still pick some up. My guest today, Charity Majors, is one of our keynote speakers and I'm also speaking there, as are people like Matt Turner, Crystal Hirschorn, Sarah Hsu, Jamie Dobson, and many others.

Charity is an ops engineer. She describes herself as an accidental startup founder at honeycomb.io where she is CTO. Before she co-founded Honeycomb, she worked at Parse and then at Facebook on infrastructure and developer tools. And she always seemed to wind up running the databases. She is the co-author of O'Reilly's Database Reliability Engineering book, and she and I also share a fondness for single malt scotch.

I should say too, that we do swear a certain amount at Container Solutions. I normally manage to keep the podcast clean, but there is a little bit of bad language in this particular episode, so apologies if that bothers you.

Charity, welcome to the show.

Charity Majors: Thank you.

Why did you rewrite Parse?

Charles Humble: It's brilliant to have you on. I'd like to start maybe by talking a little bit about your time at Parse, because I think there's so much there that's interesting, 'cause obviously it went through that huge scale up process, kind of riding on the whole rise in mobile phone stuff. And I was thinking about that this morning when I was prepping for this, and thinking that there's a mistake we often make, which is we try and design a system for huge scale when we've got, I don't know, half a dozen users or something. And we tend to do that badly, 'cause I think it's really hard to see beyond maybe one order of magnitude or one step beyond where you are. And I know that Parse, at one point you had to do a complete rewrite from scratch as well, which is always one of those things that feels like that's a terrible idea. Why on earth would you do that? Was it as terrible as I imagine?

Charity Majors: Worse. And yet I would make the same decisions again.

Charles Humble: So given that, why did you do it? Why did you feel you had to?

Charity Majors: Well, you're absolutely right that most startups optimise for scalability before they should. Most startups fail and in the early days, the only thing you should be optimising for is your survival 'cause the odds are stacked against you.

So we started building Parse on Ruby on Rails, and Ruby let us develop very fast with a very small team. We were acquired for, what was it? I don't know, $50 million when we only had 10 engineers. It was no Instagram, but it was not bad, and our growth curve of users was just the steepest hockey stick curve I've ever seen a company have. I mean, it helps that we were basically giving it away, but still, we were growing super, super fast. But we started running into this inherent limitation of Ruby on Rails, which is that there's no such thing as a thread per request. There is a pool of worker threads that's a fixed pool on every host, which is mostly fine when you have a single database behind it, because there are only so many things that are likely to go wrong with your database on the backend, right.

But we weren't running just one database. We were running MySQL, and we were also running MongoDB, which at that point was around 2.0. They had just added multiple locks. No, they only had a single lock per replica set when we were developing on top of it. So it was very bleeding edge. And then we started adding more of them because we were just provisioning new users onto basically a replica set. And then we outgrew that replica set. And so we had to add another, and then we had to add another. And at some point we had 15 or 16 replica sets and MySQL and Redis behind this fixed pool of unicorn workers. And something was always breaking. Like, some queue was always backing up or something was always getting slow. And as soon as anything behind this fixed pool of workers got slow, all of the available unicorn workers filled up with requests in flight to that backend.

And it was so predictable and it was so common and it would happen every single day. And there's just nothing you can do about that. We did investigate JRuby, but it looked like just about as much work to port up from Ruby to JRuby as it would be to rewrite it in a different language. And at the time when we had started building Parse, Golang wasn't available. It wasn't a thing. It only started taking off while we were building Parse. And we realised that if we were going to build it in Golang, at least it would be a great recruiting thing that we could dangle out to compete for engineers. And every other language out there with threads would significantly constrain the pool of engineers that we could recruit from.

So we decided to rewrite it in Go, and it was every inch as painful as you imagine and then some. I don't think we shipped any new features for almost a year and a half and it was brutal, but it had to be done and it worked. And ultimately I think it was the right decision, and it would have paid off. Part of the reason it was so difficult was because we were continuing to grow like crazy. But about six to eight months after we finished it, of course, Facebook shut the service down.

Charles Humble: Right. Yes.

Charity Majors: And yes, I bear a grudge and I will till my dying day because, okay, here I have to digress just a little bit into acquisitions, because acquisitions suck. They almost never work well. And even when they do work well, they almost always work well by losing all the people who loved it in the first place, 'cause it's just such a different company. Never accept an acquisition offer if you don't know that you have an executive sponsor on the other side of it. Someone who's in the inner circle of the exec team, a C level, a VP, who is championing this acquisition and wants you to succeed. We never had that at Facebook. This whole scheme was cooked up by Zuckerberg and one of his PM flunkies over there under platform engineering. All the VPs were shocked. They're like, "What? We're acquiring who?" And as a result, we got bounced around. We were under three different VPs in the first year we were there. Nobody was invested in our success.

What was Facebook trying to achieve with their acquisition of Parse?

Charles Humble: What was the logic for the acquisition? What were they trying to achieve?

Charity Majors: The reason that they decided to acquire us was because they're like, "Huh, developers hate our platform. Why do developers hate our platform? We don't know. Let's try buying a company that developers like." And I'm not kidding, the two that they came up with were Docker and Parse. They're like, "Let's buy one of these and then maybe developers will love us again." So they bought Parse and of course they immediately put us in the platform engineering team and they're like, "Hey, help our platform understand why developers hate us." And we're like, "Okay, this is an easy one. We can help you with this. Stop making breaking changes to your API." And they're like, "Mm, no. We're definitely not going to do that. Any other ideas?" And we're like, "Nope, none." So they moved us into infrastructure. It was that ridiculous. Never accede to an acquisition if you don't have an executive sponsor.

So then, just after a year or two, you remember, Facebook decided not to do platforms. They're like, "Well, nope. We're going to keep breaking our APIs and developers are going to keep hating us. We're going to lock it down, make it a shadow of its former self. We no longer care about platforms." And so then why should Parse exist? I don't know. They didn't know. The problem was, even though they kept starving us for resources, Parse kept growing, 'cause it was a great idea and a good implementation and our users loved us. To this day, I get fanboys all over sometimes who are like, "Oh my God, you worked at Parse. I loved it so much. Why did it get shut down?" And I'm like, "Let me tell you a story."

But that's not actually why I will bear a grudge. That's just basic startup crap that you just have to deal with. I get it, right. Things change very quickly on the ground. The reason I will bear a grudge is that they had offers to acquire Parse after that. Both Google and Microsoft, I've heard, offered to buy Parse when they were shutting it down, and they were like, "Nope, not worth the paperwork." And they killed it. So I'm sorry. Fuck that. Fuck them. Ah, it just makes me angry every time I think about it. It was a great product. It was a great service. It was a great group and it deserved better.

Charles Humble: Yeah, I actually remember it really well 'cause I was following it for InfoQ where I was working at the time, and it was one of those things, it was so obviously a really good idea. I remember just sort of watching it rise. It was so good. And you just said, "That's brilliant." And then they kind of killed it and open-sourced it, which didn't make a lot of sense.

Charity Majors: And the open-sourcing it... Yeah. And this was another thing that bothered me: clearly the founders knew this was coming. The whole open-sourcing Parse thing never made any sense except as a gateway to not having Parse. And they weren't upfront with us. They weren't honest with us. As a founder, I get that you don't want to tell your team bad news, but they deserved to know. This is what Christine and I feel, to the depths of our bones: we believe in transparency and we believe that everyone who works with us has an absolute right to know how the business is doing and what important decisions are ahead of us.

There have been a couple of times when we're like, we think we're running out of money and we're going to have to shut down, or we might have to get acquired. We try not to dwell on it. And as we've gotten larger, I don't go into the nitty-gritty details as much because you don't want to distract people, but the bottom line is, they have a fundamental right to know what's going on. And if they're asking questions, it is your absolute duty as a leader of the company to tell them what is up. So that's the lesson I took away.

Do you want to talk a little bit about Scuba?

Charles Humble: Right. Yes. For all it was a pretty difficult time, I think you did come across their Scuba tool at Facebook, which was obviously a big influence on you. So do you want to talk a little bit about Scuba?

Charity Majors: Scuba was critical to our rewrite. I don't think it would've happened without Scuba. This is something that we're trying to figure out how to message to Honeycomb prospects today, because I cannot emphasise enough just how doomed the rewrite effort was before we started using Scuba. Think about it. Ruby is a language that makes all these assumptions and all these guesses, and it's just like, well, let's call this a zero. Let's call that a null. Let's call that something else, garbage, whatever. And it's just throwing this crap all over the place, just willy-nilly, and then storing that stuff in your database. Mongo is famously a JavaScript-based engine, and between the two of them, they just came up with some random crap, and Golang of course is very statically typed and very strict about its types and stuff.

So when we started rewriting and we were going endpoint by endpoint, starting with the CRUD operators as you do, it took us four or five months to get the first endpoint out. Because what we would do is we would write the functionality in parallel: alongside the Ruby unicorn workers we would put up a Golang worker, and we would pipe the traffic through both workers, compare the output at a little proxy level there, and then return the Ruby output to the user, so it was completely transparent to the user. And so we could just run them side-by-side and see what was happening. And the long tail of exceptions that got turned up by that was just like, it took so long, and every time we would roll it out, maybe it'd run for a week and then we'd get corrupted data somewhere and we'd have to roll it back. We were just rolling this one endpoint back and forth and back and forth because it was all just a guessing game and it was all just after the fact.
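To make the shadow-traffic comparison Charity describes a little more concrete, here is a minimal sketch in Go of that kind of proxy: each request is replayed against both the old Ruby backend and the new Go backend, the two responses are compared, mismatches are logged, and the user only ever receives the Ruby response. The backend addresses, routing, and error handling here are illustrative assumptions, not Parse's actual code.

```go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
)

// forward replays the original request against one backend and
// returns the response body. (Illustrative sketch only.)
func forward(backend string, r *http.Request, body []byte) ([]byte, error) {
	req, err := http.NewRequest(r.Method, backend+r.URL.RequestURI(), bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header = r.Header.Clone()
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	rubyBackend := "http://ruby-unicorn:8080" // hypothetical addresses
	goBackend := "http://go-worker:8081"

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)

		rubyOut, rubyErr := forward(rubyBackend, r, body)
		goOut, goErr := forward(goBackend, r, body)

		// Mismatches feed the "long tail of exceptions" the rewrite had to chase down.
		if rubyErr == nil && goErr == nil && !bytes.Equal(rubyOut, goOut) {
			log.Printf("mismatch on %s %s", r.Method, r.URL.Path)
		}

		// The user always gets the battle-tested Ruby response, so the
		// comparison is invisible to them.
		if rubyErr != nil {
			http.Error(w, "upstream error", http.StatusBadGateway)
			return
		}
		w.Write(rubyOut)
	})

	log.Fatal(http.ListenAndServe(":8000", nil))
}
```

The value of running the two side by side is exactly the long tail of mismatches it surfaces; Scuba's contribution, as she explains next, was making that tail something you could slice and query rather than a pile of ad hoc comparisons.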

And when we started using Scuba, we were finally able to instrument at a level where we could do this without having to rely on just manual comparisons. We could look at the results in Scuba, graph them, break down by build ID, by Go versus Ruby, and we could watch all of the differences and we could see the latency spikes. We could just see it at a much finer level of detail, so we were finally able to start making progress because we had visibility into what we were doing. I don't think we ever could have completed it without Scuba. It was a game changer in so many ways. Visibility into MongoDB, visibility into the rewrite, visibility into what was happening between the SDKs and the backend. Yes, the rewrite had a lot to do with our reliability improving, but I would say it was two-thirds the rewrite and one-third Scuba.

Charles Humble: I find this really interesting 'cause a lot of my ops experience kind of predates all of this, we're talking... I actually worked for a monitoring company for a period of time that was eventually acquired by IBM. And there were a couple of different scenarios. One was that you would rock up at a customer site and say, "What do you want to monitor?" And they wouldn't really know. And so you'd just do the sort of standard best guess estimates of the sort of things that could go wrong. But the other thing would be we had an incident, and then we would spend some time instrumenting for that incident and building a dashboard. So if that thing ever happened again, we would know exactly what was going on, because we'd seen it before. But of course that's very rarely the case. We were talking sort of distributed MQ systems. They very rarely failed the same way twice.

Charity Majors: Yeah. History doesn't repeat, but it rhymes. It's like that was the problem with the model of metrics and dashboards, right? 'Cause you had to create custom metrics for exactly this specific thing. Right, like, say you had a bug in your language pack or something, you'd have to create a custom metric for this device type, this language pack, this version of the software, this language, create a custom metric for all these conditions. And then if that never happened again, you're screwed, right. But if you're instrumenting with events, you just instrument all of the individual things, right. You instrument the language pack, and the build ID, and all the different languages, and all the different origin points, and all the different device names, all the different device IDs, all the different device versions, you know, instrument for all these different things. And then you wait for any possible combination of them to show up and you can just pinpoint it that way without having to predict in advance what's going to happen.
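As a rough sketch of the distinction she's drawing, assuming a hypothetical pipeline that just writes JSON lines to stdout, this is roughly what instrumenting with wide, per-request events rather than pre-defined custom metrics might look like in Go: every dimension (build ID, device, language pack, and so on) is recorded on each event, and the slicing happens later at query time, for whatever combination of conditions turns out to matter.

```go
package main

import (
	"encoding/json"
	"os"
	"time"
)

// Event is one wide, per-request record. Every dimension you might
// later want to slice by is recorded individually; nothing is
// pre-aggregated into a purpose-built metric.
type Event struct {
	Timestamp    time.Time `json:"timestamp"`
	RequestID    string    `json:"request_id"`
	BuildID      string    `json:"build_id"`
	DeviceType   string    `json:"device_type"`
	DeviceID     string    `json:"device_id"`
	OSVersion    string    `json:"os_version"`
	LanguagePack string    `json:"language_pack"`
	Endpoint     string    `json:"endpoint"`
	StatusCode   int       `json:"status_code"`
	DurationMS   float64   `json:"duration_ms"`
}

// emit writes the event as one JSON line; a real pipeline would send it
// to an event store where any combination of fields can be queried.
func emit(e Event) {
	json.NewEncoder(os.Stdout).Encode(e)
}

func main() {
	emit(Event{
		Timestamp:    time.Now(),
		RequestID:    "req-123",
		BuildID:      "build-457",
		DeviceType:   "phone",
		DeviceID:     "device-9f2",
		OSVersion:    "16.4",
		LanguagePack: "pt-BR",
		Endpoint:     "/classes/GameScore",
		StatusCode:   200,
		DurationMS:   12.7,
	})
}
```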

Do you have an example of this core idea that you can't predict where a failure is going to happen in a system?

Charles Humble: Right. Yes. Do you have any examples of things that you saw in production maybe at Parse or elsewhere that illustrate this core idea, you know, how you can't predict where a failure is going to happen in a system? Was there one that you tracked down that maybe is a good illustration of it?

Charity Majors: Oh God, yes. Honestly, almost all debugging stories are some version of: how is this thing I care about different from all the things that I don't care about, right. So Parse was very much reliant on developers uploading their own code or releasing their own code, and then we would execute it for them. And so, if one of our 1 million users out there uploaded a snippet of JavaScript with a library that wasn't working or that was returning bogus codes, how would you find it?

Or if, say, oh, here's a good one. Say, you've got all these developers out there who are uploading database queries and they make a change to one of those queries where the individual execution time only goes up, it only doubles say, which is pretty small, but the number of those executions goes up by a lot, so you can't track any one of them. But what is the aggregate amount of execution time that's coming from this user's database calls and is it say 90% of your overall database time that's being eaten up, right. That's something that's not going to be a spike anywhere on any dashboard, but it can take your database service down.
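A small Go sketch of the kind of analysis she's describing, with made-up numbers: no individual query is slow enough to top any list, but grouping raw query events by app shows one tenant's workload quietly eating most of the total database time.

```go
package main

import (
	"fmt"
	"sort"
)

// QueryEvent is one observed database query. No single one of these is
// slow enough to show up on a "top 10 slowest queries" list.
type QueryEvent struct {
	AppID      string
	DurationMS float64
}

func main() {
	// app-B has one visibly slow query; app-A has thousands of cheap
	// ones whose aggregate cost quietly dwarfs it.
	events := []QueryEvent{{AppID: "app-B", DurationMS: 350}}
	for i := 0; i < 2000; i++ {
		events = append(events, QueryEvent{AppID: "app-A", DurationMS: 4})
	}

	totals := map[string]float64{}
	var grand float64
	for _, e := range events {
		totals[e.AppID] += e.DurationMS
		grand += e.DurationMS
	}

	type share struct {
		app string
		pct float64
	}
	var shares []share
	for app, total := range totals {
		shares = append(shares, share{app, 100 * total / grand})
	}
	// Rank apps by their share of total database execution time.
	sort.Slice(shares, func(i, j int) bool { return shares[i].pct > shares[j].pct })

	for _, s := range shares {
		fmt.Printf("%s: %.1f%% of total database time\n", s.app, s.pct)
	}
}
```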

Charles Humble: Right. Yes. It's the sort of top 10 slowest queries list or something. And it's never one of those that's actually causing you the problem. It's always buried down on page 27, number 500 or whatever, if it's a database, right?

Charity Majors: Exactly. It's usually the 500th one on the list. And often when you've got those top 10 lists, even the ones that are super slow, they might not be slow because they're causing the slowness. They might be slow as a result of the slowness that's caused by something else.

Do you think there are still places where sort of an old style monitoring system or a dashboard does actually make sense?

Charles Humble: Right. Yes. Given all of that, do you think there are still places where sort of an old style monitoring system or a dashboard does actually make sense?

Charity Majors: Absolutely. I think there's a pretty clear division of labour, appropriate division of labour that's springing up around infrastructure, broadly speaking, which is the code you have to run in order to get to the code that you want to run. And then the code that's yours, like your crown jewels, the code that makes you exist as a business.

And I think that you want observability for that code because you need to be able to understand it so intimately. You need to be able to understand how it affects every single user individually. And users are usually the highest cardinality dimension in your data set, right. Which means that finding that 500th user that's doing something crazy or bad or having a terrible experience might make or break your business. But when you're dealing with infrastructure monitoring tools, metrics, dashboards are absolutely the right thing to use because if you think about it, observability tools aggregate around the request.

That's the only thing they aggregate around. They're events based. So for every request per service, you get an event, that's very wide, it has all of these key value pairs. But metrics aggregate around the system, or the host, or the container, or something like that. And so you do want to be able to track, like, these are two different domains. There's a domain of this code that I'm writing and doing something to, does it need to change? Does it affect people? And then there's the domain of, the code that I'm running in order to serve those requests. Is it healthy? Those are very different things. One can be broken without the other being broken. You can have 25% of your capacity down from the system side and still have everything be executing fine on the user side.

And everything can be busted from the user side and everything can look fine from the system side, so those are very different areas of domain. And when it comes to understanding is my system healthy, is the CPU spiking, is the RAM spiking? What are the statistics? Even for understanding is my database healthy? Right. You absolutely want metrics and monitoring tools for those.
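A minimal sketch of the two domains she's separating, with hypothetical host and field names: the metric is sampled around the host or process ("is my system healthy?") and has no notion of any individual request, while the event is recorded around one request ("is this user's experience healthy?").

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// A system-scoped metric: sampled on a schedule, aggregated around the
// host or process, with no notion of any individual request.
func reportHostMetrics() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("metric host=web-1 heap_bytes=%d goroutines=%d\n",
		m.HeapAlloc, runtime.NumGoroutine())
}

// A request-scoped event: one wide record per request per service,
// aggregated around the request itself.
func reportRequestEvent(userID, endpoint string, dur time.Duration, status int) {
	fmt.Printf("event user_id=%s endpoint=%s duration_ms=%.1f status=%d\n",
		userID, endpoint, float64(dur.Microseconds())/1000, status)
}

func main() {
	reportHostMetrics()                                                   // monitoring: "is my infrastructure healthy?"
	reportRequestEvent("user-500", "/checkout", 42*time.Millisecond, 200) // observability: "is this user's experience healthy?"
}
```

Either view can look fine while the other is on fire, which is why she treats them as different jobs needing different tools.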

How do you accidentally start a company?

Charles Humble: Right. Yes. Changing topic a little bit, I said in the intro you described yourself as having accidentally started a company. How do you accidentally start a company?

Charity Majors: Oh, yeah. I mean, I was never one of those kids who was like, "I want to start a company." I kind of low-key hate those people. The whole founder industrial complex just gives me... it's just this terrible smell. So when I was leaving Facebook, I was planning on going to be an engineering manager at Slack or Stripe. But I wasn't honestly getting offered the jobs that I wanted and thought I deserved, and some people were actually kind of pursuing me, like, some seed investors were pursuing me after Facebook, and I kind of went, "Wow, I've never had a, quote unquote, 'pedigree' in my life." You know, serial dropout. I've always worked at startups. When I was leaving Facebook, I was like, "Oh crap. For the first time in my life, I have a pedigree. This is probably the only time that people are ever going to be chasing after me and being like, 'Woo, would you like some money?'"

I felt an obligation on behalf of all dropouts, all women, all queers in tech to take the money and run. So then it was like, "Well, do I have an idea?" Well, there's this tool that I really don't want to live without. But I was so ill-equipped to start a company. I had never even heard the words 'product market fit'. I'd never worked with product or design as an engineer; I was just straight up an infrastructure engineer, right. I had never thought twice about sales or marketing or anything, although fortunately I had spent my time at Parse getting better at public speaking. If I hadn't done that, I think we would've been screwed and Honeycomb never would've made it off the shelf.

Why did you switch roles from CEO to CTO?

Charles Humble: There's an interesting thing that you did about four years ago, I think, which was you and your co-founder basically swapped roles more or less, right?

Charity Majors: Yep.

Charles Humble: You swapped.

Charity Majors: Yep.

Charles Humble: She moved into the CEO role and you sort of switched into CTO'ing. Can you talk a bit about that move? What was that like? Why did you do it?

Charity Majors: Well, I never intended on being CEO either. In the beginning we had a third co-founder. And he was the one... We had our token straight white dude, the CEO, but it didn't work out. Pretty quickly, it was like two or three months in, he left, and when he left it was like, "Oh crap, I guess I'm stuck with this." Right. And I did the best I could. I kept the company alive. I guess that's the best you can say for it. I didn't want to do it. I'm not good at it. I will never do it again. Like, I had nightmares about never again being taken seriously as a technical person, thought I was going to get stuffed in the room with all the PMs and people who were failed engineers, and I didn't like the work. I didn't know how to do the work.

Every minute of every day I felt like I was failing. We had so many near-death experiences as a startup, I still can't believe that we survived, because every year it was like, "Well, this is definitely the year we're going to fail." And we didn't reach product market fit until about the time that Christine and I swapped places. But once we did reach product market fit, then it was time for someone like Christine. You know, what you want in a CEO is someone who is predictable, and regular, and plans, and shows up at 9:00 AM every day, same time, same place... And who loves that sort of stuff. And that's Christine. And she has come so far. She's way better at CEO'ing than I was, and I am still recovering from my time as CEO. Like, I lost my marriage, my health, I stopped being able to sleep. It was not good for me psychologically or physically or in any other way. So it's done. Did what I needed to do. Will never do it again.

What was it like when everyone started to co-opt the term observability?

Charles Humble: There's an interesting thing about the product market fit stuff, because one of the things you had to do is you had to explain to people why what you were doing was different from monitoring. So you found the observability term, which was a mechanical engineering term, and you did a fantastic job of communicating, you know, how this is different from monitoring. And you sort of won that argument. But then all of the old APM vendors went, "Look, we're observability vendors too." They weren't, and that caused quite a lot of confusion in the market for a while, I think: what is this? Why is this thing different? What was that like from your point of view? Was it just intensely frustrating, or was there a bit of going, well, at least we're winning?

Charity Majors: Surreal. Like, it was so weird. I mean, our only marketing strategy for the first three years was any place that was willing to have me, I would go and give a talk. And yeah, I mean you saw a version of that, and it was super basic and it appealed only to a very few people, but the people who understood it, the people who got it, went, "Oh," and it made sense. It really jibed with what they had heard. But for two and a half, three years, every day it was just like, what's observability? We don't care about observability. What are you doing?

And then one day, it felt like one day, I woke up and the entire world was like, "Well, of course we do observability. We love observability. We've always needed observability and we do observability too." And it was just like, "What?" It just sent me reeling, you know, and I feel like I spent a couple of years there very indignantly trying to explain to everyone why they were wrong. Never a great idea. And I'm done with that. I'll take it.

This is a better set of problems to have than the set of problems when nobody knew about it, right. The essential problem is that for early adopters, it was fine to be doing something radically different, radically discontinuous, something that they had to understand. But now that we're trying to move into the early majority, they're not going to do that work for us. We have to do the work to meet them where they're at. And we're still figuring out how to do that, although I'm optimistic that over the next year or two we're going to get a lot better at this. But the fundamental thing that we're dealing with is, you can't explain to people the magic of observability because every vendor out there is using the same words, the same examples, the same crap, but when people try Honeycomb, it's different.

We have no churn. Like, none of our customers leave. Like, ever. Even the ones who probably should. Unless companies go out of business, like, nobody leaves, because it is so much more powerful and it does what it says on the sticker, right. It changes your life. But we have to get a lot better at equipping our champions inside companies who have to then make the argument for using Honeycomb against established vendors, and to execs who don't understand the difference and don't care, who see them all as being the same. We have to get better at equipping our champions, we have to get better at speaking to the C-suite and to the buyers as well as the users. There are all of these defined marketing things where we just have to do the work, because the product is that different and it is that much better and it does change lives. So it's really our responsibility to figure out how to get that into people's hands.

How do you think we should be measuring and perhaps talking to the business about technical team performance?

Charles Humble: Changing tack again a little bit, I'm interested to get your perspective on how you think we should be measuring and perhaps talking to the business about team performance in a technical context.

Charity Majors: Yeah. That is a great, great question and I think it's getting even more interesting with the rise of platform engineering because in the olden days, in the olden days, last year, I would've said, you start with the DORA metrics, right, which is at least a good measure of how happy your customers are, right. How often do you deploy? How long does deploy take? How often does it fail and how long does it take to recover? I think those are great fundamentals for every team to know.
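As a back-of-the-envelope illustration of the four DORA metrics she lists, here is a sketch in Go that derives them from a log of deploy records; the record shape and the sample numbers are assumptions made up for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Deploy is one production deployment. CommitTime is when the change
// was written; Failed marks deploys that needed remediation, with
// RestoredAt the time service was recovered.
type Deploy struct {
	CommitTime time.Time
	DeployedAt time.Time
	Failed     bool
	RestoredAt time.Time
}

func main() {
	day := 24 * time.Hour
	start := time.Date(2023, 4, 1, 9, 0, 0, 0, time.UTC)
	deploys := []Deploy{
		{start.Add(-2 * time.Hour), start, false, time.Time{}},
		{start.Add(day - 3*time.Hour), start.Add(day), true, start.Add(day + 30*time.Minute)},
		{start.Add(2*day - time.Hour), start.Add(2 * day), false, time.Time{}},
		{start.Add(3*day - 90*time.Minute), start.Add(3 * day), false, time.Time{}},
	}
	windowDays := 7.0

	var leadTime, restoreTime time.Duration
	failures := 0
	for _, d := range deploys {
		leadTime += d.DeployedAt.Sub(d.CommitTime)
		if d.Failed {
			failures++
			restoreTime += d.RestoredAt.Sub(d.DeployedAt)
		}
	}

	// The four DORA metrics: deploy frequency, lead time for changes,
	// change failure rate, and time to restore service.
	fmt.Printf("Deploy frequency: %.1f per day\n", float64(len(deploys))/windowDays)
	fmt.Printf("Lead time for changes: %s (avg)\n", leadTime/time.Duration(len(deploys)))
	fmt.Printf("Change failure rate: %.0f%%\n", 100*float64(failures)/float64(len(deploys)))
	if failures > 0 {
		fmt.Printf("Time to restore: %s (avg)\n", restoreTime/time.Duration(failures))
	}
}
```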

But then there's the stuff that's behind the scenes, and the problem is that everything you try to measure there is bad if people know you're measuring it and start to optimise for it. Even something that seems very innocent, like number of tickets closed. You don't want engineers optimising for number of tickets closed, right. Or speed of landing things. You don't want engineers optimising for that. Size of diffs. You don't want engineers optimising for that.

You don't want engineers optimising for anything but doing good work, doing the right things and getting stuff done. But that's incredibly hard. So the best answer that I really have is to measure a handful of things. Don't dwell on any of them. Don't slap the results up in team meetings and be like, "So our time to close tickets is getting a little bit slower, people." These are trailing indicators of health for you as leaders to be aware of, but you don't fix them by pointing to the indicators of health. You fix them by looking at the socio-technical systems upstream that output those bits of health. So if your deploys are failing a lot, like, why is that? Dig into it upstream. I do think that one thing that almost every team out there should be doing more of is keeping the time between when they write code and when that code goes live as short as possible, you know, making those really tight virtuous feedback loops.

People underestimate how much this pain compounds over time. It gets worse and worse as it flows downhill. Like, if you can get your stuff out in under an hour or under 15 minutes, you have a healthy team, I can guarantee you that. Right. And if it's longer, I don't know if you have a healthy team or not; you really have to start looking at a bunch of different things. Keeping the time to production low, instrumenting your code, and then just watching a bunch of those sort of health indicators as a basket, not any individual one, and almost not revealing those to the engineering team. You almost don't want them to know what you're looking at.

Charles Humble: Right. Yes. 'Cause you get into the whole business of perverse incentives if you're not careful. I mean, I'm old enough to remember when we used to think that we could measure engineering performance using lines of code as a proxy for something.

Charity Majors: Such a bad idea.

Charles Humble: Yeah. It really was. But it was very kind of widely followed for a period of time.

Charity Majors: This is also, I think, why it's so important to have engineering managers who are good engineers, because ultimately you rely on the engineering managers to understand someone's impact and make the case for it on the grounds of its impact alone, not on the grounds of any of those other indicators, right. When it's your review time, your engineering manager probably has to do some sort of calibration with all the other engineering managers and justify why they gave you an "exceeds expectations." You know, you don't want them going in there and going, "Well, so-and-so wrote X lines of code or closed so many tickets." You want them going in and going, "This is what they did and this is the impact on the business, and that's why they deserve their rating."

Charles Humble: In the context of the DORA metrics in the Accelerate book, there's another thing I've sort of been thinking about, which is, that there's almost a sort of anti-physics thing in software. So if you are doing something incredibly high risk, I don't know, you are trying to cross a river that's full of crocodiles by walking across a plank or something, and you slow down so you don't fall in 'cause that would be bad. But in software, that isn't true, right. In software, what we've found is, if you can go quicker, if you can deploy more frequently and move faster, that's actually safer. But instinctively that feels totally wrong.

Charity Majors: It's deeply ingrained in us as humans to slow down when we get nervous or scared, when we think things are breaking. But in software, yes, speed is safety in software. The thing is that, yes, this is a reaction that almost all humans have. That's fine, we know this. What we then know is that we have to train ourselves against it, right. This is also completely doable, but we have to do it consciously because we know this reaction is going to happen. We have to train ourselves to speed up. We have to train ourselves to never accede to the idea that, ooh, we should freeze everything because things are getting slower. Well then, what do you think is going to happen when you open the floodgates and everything comes out all at once? This is where process is king, and engineering leaders, whether you're ICs, individual contributors, or managers, your job is to protect the process.

Your job is to optimise the socio-technical feedback loops that are at the heart of your system. And that means aggressively defending this. The thing that I will tell any engineer who's like, "I get this, but my counterparts in the business don't get this and they're forcing this on me," is: make them read Accelerate. Ask them to read that and then tell them you'll do whatever they recommend, but you really need them to read this book first, because it not only lays out all the stuff that we know in our guts to be true, it backs it with a ton of evidence. And once confronted with the evidence, one hopes that it would override the instinctual reactions of the business side to slow things down. So does it help? Sometimes it does, sometimes it doesn't.

Charles Humble: I've seen this with technical teams as well though, to be fair; it's not just the business. I've had technical teams that have been like, "No, we have to have a code freeze 'cause otherwise terrible things will happen." If you haven't experienced it before, I think you are having to take an awful lot on faith.

Charity Majors: Absolutely. To me, this is always a sign that the team is software engineering heavy and light on operations expertise, 'cause this is something that you learn by being in the trenches operating complex software. Which is not to say that every ops person knows this, of course, you know that's not true either, but you're right, it doesn't make sense instinctively; it makes sense only once you've experienced it.

What have you and Christine been thinking about recently?

Charles Humble: We are unfortunately getting towards the end of our time. But I wonder if to finish this off, you could maybe talk a little bit about what you and perhaps Christine have been thinking about or working on recently.

Charity Majors: Over the past four months, Christine and I have been really leaning deep into strategy. There's this great book called Good Strategy Bad Strategy. I don't know if you've read it. It's one of the only business books that I love and will recommend to people. The guy is this professor, I don't remember what he does. I think he was an engineer at one point, and he has this dry sense of humour, and he is clearly just so irritated by how many ways people will use the word strategy when they actually mean something like plans or goals or something that's not strategy at all. They're like, "My strategy is to double over the next year." And he's just like, "That's not a strategy. That is not a strategy. That's a great statement of intent. That is not a strategy." So just learning what strategy is, it sounds so simple and it is really challenging, but that's the job, right.

And so we came up with a strategy. We just had our first in-person offsite in three years. We flew everyone out to LA; it was magical. It was a good reminder that getting together in person is not optional, it's not a nice-to-have. For distributed companies like ours, it is mandatory. It's like the yeast in the bread, it's the bitters in your cocktail. It is a necessary ingredient to make the whole succeed. So we presented it there and now it's working its way through the rest of the company and it's pretty exciting. I think good things are ahead of us.

Charles Humble: That's really fantastic. And I totally agree, by the way. I think, particularly if you're a distributed or remote organisation, you do have to make the effort to get people together in person, and isn't it great that we now can. Charity, thank you so much. It's been brilliant to talk to you on this, the 13th episode of the Hacking the Org podcast from the WTF is Cloud Native team here at Container Solutions.

