
Podcast: GitHub Director of Engineering, Liz Saling, on Pausing Feature Development to Focus on Tech Debt and DevEx

Charles Humble talks to Liz Saling, Director of Engineering at GitHub. They discuss the role of developer experience at GitHub, why the firm paused feature development for an entire quarter to focus on making developer experience improvements, service ownership, and handling feedback.


About the interviewee

Liz Saling is Director of Engineering at GitHub. As part of GitHub's Developer Experience leadership team, which aims to set the example for the world on how to build software with GitHub, Liz leads the teams that focus on how we use the environments we code on and the platform and patterns to test it. She’s launched more than a dozen engineering teams and departments, and created mentorship programs and coaching workshops to reinforce a culture of growth and innovation.


Full transcript

Introduction

Charles Humble:

Hello, and welcome to the sixth episode of Hacking the Org, the podcast from the "WTF is Cloud Native?" team here at Container Solutions. I'm Charles Humble, Container Solutions Editor in Chief. Imagine that you had the opportunity to pause feature development for an entire quarter to focus solely on developer experience and technical debt. What would you do with the time? This was a decision that GitHub took in the summer of 2020, and I'm joined today by Liz Saling, a director of engineering at the firm, to talk about that decision and what's happened since, and to give us some background and context on developer experience at GitHub. Liz, welcome to the show.

Liz Saling:

Hey, thanks, Charles. It's so great to be here. I'm sure you hear that a lot, but I truly feel it is really fun to be here with you today.

Can you give us just a really high level overview of GitHub's architecture?

Charles Humble:

Oh, thank you. Lovely thing to say. I'm thrilled to have you on the show. I really am. So thank you for making the time. So can you give us just a really high level overview of GitHub's architecture? I kind of think of it as being a Ruby on Rails monolith, but I presume you've done some shifting to microservices, at least. What does it kind of look like now?

Liz Saling:

Actually we have a little bit of both. So GitHub was originally built 15 years ago with a monolithic architecture, and that monolith still lives today. And we also have added many services that ride along with it. So together that monolith plus those services make up what we see on github.com.

How often are you shipping updates to production?

Charles Humble:

And then how often are you shipping updates to production?

Liz Saling:

So for that monolith, we are shipping approximately 20 times a day. We also are shipping several hundred pull requests every month, several commits. And then each individual service obviously has its own individual cadence that ships in with that. But yeah, github.com is changing multiple times a day.

How does the role of developer experience at GitHub work?

Charles Humble:

And then in terms of developer experience at GitHub, it strikes me as being particularly interesting because it's obviously a high profile, widely used service for developers. So I'm imagining that your DevEx team has kind of two different audiences in effect. You've got the public facing part, but then you've also got internal developer experience. So is that right? How does the role of developer experience at GitHub work?

Liz Saling:

The way we like to frame it here is we're maybe the product manager's best dream and worst nightmare all wrapped up in one because we are the subject matter experts. We use GitHub every day to build GitHub. We keep our issues and our shipping flows and everything is in GitHub. So we get it. We know what developers want because we are developers and we want to tap into that developer community all the time and stay really close with it and understand it very well because we're part of it. So it's a super interesting space when you're talking about building github.com because it's not some product that we don't get for some other customers. It's for us, it's for everybody.

What is the goal for DevEx at GitHub?

Charles Humble:

So diving into that a little bit more, what's the kind of big goal for DevEx?

Liz Saling:

Yeah. I mean our big vision, what we really aim to do, is to figure out the best ways to build software with GitHub. And then we want to share that with the world. We want to be that innovative cutting edge, testing its limits, pushing it beyond what anybody thinks it's capable of. And we want to get it out there and help others understand how we do it. And at the same time, we're always listening. We want that developer feedback. We're engaged with the community. We want to hear what are you experiencing? Because our use cases, we're not that much of a snowflake, but at the same time, everybody has a little bit of those unique characteristics out there. So we want to bring that all together and we want to experiment. We want to prove that out and then share it with everybody.

How did you get to this point that prompted the creation of a developer experience team at GitHub in the first place?

Charles Humble:

How did you get to this point that prompted the creation of a developer experience team at GitHub in the first place?

Liz Saling:

So in the early days, this actually came, I think the DX team officially started, I don't know, about seven years ago [The DX team launched within GitHub in 2011]. The dates get fuzzy for me, of course. But it's the common story of, "Oh, you're working on this. So you have a gap to fill and you have a gap to fill. How about we just get together and do this and how about we just have some people specialise on it?" So it was very grassroots, which definitely fit with the GitHub culture. People would come together and work on the most important things and then kind of swarm around a problem to solve it and do it well and share it out. As that continued, we more formally established a DX team. And as the platform began to evolve, we introduced an on-premises hosted offering for enterprise customers, and then a solution that we host for customers.

Liz Saling:

As the platform continued to evolve and the needs got larger, that's where we started to really focus in on, "Hey, we really need to dedicate some effort into our internal developer experience." Because again, the tools that we used to build GitHub at the time were not GitHub. It predates Actions and a lot of these things. So we're still supporting those and we have to make those work as we're trying to shift into what we've shipped to our customers and what we want to prove out as the best ways to use that. So that's where we really started dedicating an increased amount of activity and focus and value into our internal developer experience.

Where does the developer experience team sit within GitHub engineering?

Charles Humble:

And then where does the developer experience team sit within GitHub engineering?

Liz Saling:

We have three different main groups within engineering. We have our product engineering focus. A lot of that is how we get our code to the cloud and all the features around that. We have our data and services and security products and that's one section. And then we have the platform. So those are the three major areas. Developer experience falls within the platform group. And we're nestled nicely in between the feature developers and the infrastructure and how we interact with it all. And then again, we get to feed back into the features that everybody gets to use. So it's a really unique placement and it's perfect within the platform organisation so that we understand more closely all the different sets of infrastructure that we need to support well and how GitHub works with them.

As an organisation, you decided to stop shipping features for a quarter?

Charles Humble:

Now what prompted this particular episode of the podcast was a talk that you co-gave at GitHub Universe in 2020. And I'll include a link to the talk in the show notes because it's well worth watching if people haven't seen it. There are also some blogs that came out afterwards, which, again, I'll include links to, they're well worth reading. But just for context, am I right in saying that as an organisation, you decided to stop shipping features for a quarter, you paused your entire roadmap and instead you are focusing on fundamental health measures, paying down technical debt, improving DevEx, is that right?

Liz Saling:

It is. So as we were going through this large evolution, GitHub as an organisation was growing, our product offering was growing. And we started increasing awareness and paying really close attention to some of these basic fundamental measures like you were mentioning: what our availability rates were, how many times we were shipping, what the success rates of that were, how long it took from the time you queued a change to the time it shipped. As we started looking at that and piecing together anec... I call it anecdata. That anecdotal evidence of the things that you're hearing, but you're actually seeing how it's trending. And that created the impetus of like, "Whoa, we have a lot of big things we want to do, and we're holding ourselves back a bit. Why don't we slow down, reinvest some of that energy to set us up so that we can move faster." That quintessential cliche phrase. Slow down now to move fast later. That was exactly what happened.

How did you arrive at that decision and then convince everybody that it was the right thing to do?

Charles Humble:

It is I think kind of astonishing this, how on earth did you arrive at that decision and then convince everybody that it was the right thing to do?

Liz Saling:

So first and foremost, there had to be alignment between the engineering organisation and the product organisation. To make this big of an investment and this big of a move meant, a) we have to understand the value of what we're going to be investing in. So we did almost like a request for comment on the ideas that we had. These are the investments that we want to make. We checked with our own internal users. We checked with product. Do we see the value of this activity? Do we agree that these are the right things to focus on? Because obviously given the intention to do this, the possibilities are endless what you might be able to accomplish. So what do we feel is going to add the most value?

Liz Saling:

So for example, I'll just give a highlight of a couple of the things that we chose. How do we store information and retrieve it? We came in with some things were documented in repositories. Some things were in Google documentation. Some things were scattered here and there. We need to bring this together and have a better mechanism and methodology around storing and retrieving information that we share internally as part of planning, as part of executing. And then another thing was look at our shipping times. What can we do to invest in reducing the build time, reducing the shipping time, reducing the validations?

Liz Saling:

One of the things we mentioned in our Universe talk was we run every test all the time. Can we take some of those out? What would be a good bar to set? If it's longer than this, it needs to run nightly. And then we have a different mechanism for rolling that back in, should there be failures there. But the big thing was making sure that there was alignment between leadership and the engineering organisation and the product organisation so that we could all support each other to make this massive investment work. And how do we collaborate? Because these aren't the normal things that much of the organisation is working on. So how do we set that up for success? Those are some of the key parts to making this project really sing.
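The "bar" Liz describes for moving slow tests to a nightly run could be sketched roughly as follows. This is only an illustration, not GitHub's actual tooling: the 120-second threshold, the test names, and the `split_by_duration` helper are all assumptions invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    name: str
    seconds: float  # typical wall-clock duration of the test


def split_by_duration(tests, max_seconds):
    """Split tests into (per_change, nightly) lists using a duration bar."""
    per_change = [t.name for t in tests if t.seconds <= max_seconds]
    nightly = [t.name for t in tests if t.seconds > max_seconds]
    return per_change, nightly


# Hypothetical timings; real values would come from CI history.
timings = [
    Timing("test_login", 3.0),
    Timing("test_full_repo_import", 600.0),
    Timing("test_render_markdown", 12.5),
]

# Assumed bar: anything slower than 120 seconds moves to the nightly run.
per_change, nightly = split_by_duration(timings, max_seconds=120.0)
print(per_change)  # ['test_login', 'test_render_markdown']
print(nightly)     # ['test_full_repo_import']
```

A nightly failure would then need its own path back to the owning team, which is the separate mechanism mentioned above.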

How did you decide what to focus on?

Charles Humble:

Now you said there were a huge number of different things that you could have focused on.

Liz Saling:

Yes, there was.

Charles Humble:

So given that, how did you decide which ones to actually do?

Liz Saling:

So there was a period of time for probably about two months before we really set into these efforts in earnest, where we were gathering the ideas, trying to assess the impact that they would make on the organisation, make on the future product offerings, make on availability. And literally just like any planning exercise, you're going through and you're stack ranking based on the impact that we believe would be felt. Does it align with where we're heading as a platform direction? Would it be easy for people to come in and work on? Because, again, we're getting people to come in and engage in these areas that they don't work in day to day. So there was some evaluation about how generalized are the skill sets required to work on these things. Ultimately, we came up with a list of 20 work streams that we felt going into the start of this project that would make sense, that would land the most value, that would align with the future needs the best.

Liz Saling:

And even then when we had that 20 set, I think we went in with an air of, and we could be wrong. So as we're looking into this and breaking these down further, once we have people engaged, tell us, are these wrong? Should we pick something else? Should we do this thing differently? We might not even know. So have that curiosity going into it and a way to factor in that new information and make adjustments as we're going through it together. I think we completed 17 of the 20 work streams. We cut three of them. And even one of them we morphed. One of them was what if we had a different deployment mechanism that was able to group together our pull requests into batches? Right now that's all tightly coupled in our deployment system. And so the idea was an exploratory notion. And two weeks into it, they're like, "Forget exploring it. Let's just start it. Let's just do a proof of concept. Let's see if we could actually do this." And that proof of concept is the merge queues that we've just recently shipped.

Charles Humble:

So that's really interesting. So you've actually ended up with a product feature from this tech debt work.

Liz Saling:

That turned into a product feature. Yep.

Can you talk a little bit more about the merge queues?

Charles Humble:

That's fantastic. So can you talk a little bit more about the merge queues? What was the problem that you were trying to solve internally? And then how did that evolve into the public facing version of the merge queue?

Liz Saling:

We still have an element of this today. We're still completing our own internal transition to this. So we would take in batch groups of PRs together into what we call trains. And we have this entire train mechanism that was controlled by our deployment system. And so when you queue up something to deploy and merge, that queue in the train is all managed within the deployment system, which means that, of course, with any tightly coupled system, if you have to make a change to it, that's very fragile. There's a lot of considerations that need to be made. The train experience in and of itself, because of the amount of testing and validations that had to be done on that train, means that we can only, best case scenario, we can ship one approximately every 90 minutes or so. That's best case. If there's a problem and we need to make a change that just extends that. And that means all the queue is just continuing to build behind it through the deployment mechanism.

Liz Saling:

And also I should mention, sorry, let me back up. There was this entire fear of being what we call the train conductor. The person who needs to be in charge and make sure everybody's staying engaged. And if there's a problem, they need to handle it. And we wanted to eliminate all that. It should be very easy to ship. We should be building that confidence. So we wanted to start to decouple all those pieces. Let's handle the merging and the building and the validating separately. Let's handle the shipping separately. String them all together. The end result is the same, but as much as we can start to break this down and improve each individual experience, the net result would be something that would flow much easier, would improve confidence much sooner, and wouldn't be seen as being as big and scary as conducting a current train is. So enter the merge queue prospect, which a lot of customers have also expressed an interest in having baked right into the product. So it was just a perfect opportunity to just dig in and start solving for that for us.
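To make the idea of batching pull requests and validating them separately from deployment more concrete, here is a small sketch. It is not how GitHub's merge queue is implemented; the `run_queue` function, the batch size, the fall-back-to-single-PR behaviour, and the pull request numbers are all assumptions chosen for illustration.

```python
from collections import deque


def run_queue(pending, batch_size, validate):
    """Merge queued pull requests in batches, independent of any deploy step.

    `validate` is a caller-supplied function that takes a list of PR ids
    and returns True if the combined build and tests pass.
    Returns (merged, rejected).
    """
    queue = deque(pending)
    merged, rejected = [], []
    while queue:
        # Take up to `batch_size` PRs from the front of the queue.
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        if validate(batch):
            merged.extend(batch)
        else:
            # Fall back to checking each PR on its own so that one bad
            # change does not block the others in its batch.
            for pr in batch:
                (merged if validate([pr]) else rejected).append(pr)
    return merged, rejected


# Hypothetical validation: PR 103 is the only change that breaks the build.
broken = {103}


def fake_validate(batch):
    return not any(pr in broken for pr in batch)


merged, rejected = run_queue([101, 102, 103, 104, 105], batch_size=2,
                             validate=fake_validate)
print(merged)    # [101, 102, 104, 105]
print(rejected)  # [103]
```

Because merging and validating live in their own step here, the shipping step can consume the merged result on its own schedule, which is the decoupling described above.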

What issues did you have around documentation?

Charles Humble:

You mentioned documentation as being something that got particular focus during the sprint. And this is maybe a bit of a pet subject of mine because essentially I write for a living these days. And so I maybe feel it more keenly than some do, but I do think it's desperately important. And you mentioned that one of the things was you had your documentation kind of spread around in lots of different places. So can you talk a little bit about this? What was it? Were you kind of consolidating stuff? How were you working? What were you trying to do?

Liz Saling:

That was exactly it. We started with a vision that it should be easy to find the engineering information, the engineering documentation that people need to do their work day to day. And granted GitHub search has improved a lot recently, but still back then it had room for improvement. And so what we looked at were more traditional content management solutions, still using GitHub and GitHub Pages, but having one centralised location. So we essentially created a new repository with new standards on it. And we began to centralise the information there for better indexing and discoverability, and then where documents were located within a repository, at least sharing that over or pointing the link over to the new source of truth and creating this centralised place for easier searching and location. And really starting to remind us and reinforce that this is important. Documentation is the easy thing that everybody wants to burn past, or just do lightly and get onto the next cool thing.

Liz Saling:

But the value of having the information available, for onboarding, for activities like this where we're stopping and bringing people from different areas and they need to quickly come up to speed, for scaling the organisation. So it really reiterated the importance of having a good internal communication strategy, especially within our own engineering documentation. Are we recording decisions? This is one of my favourite things is what is our design process? Where do those proposals go? How do we preserve them? How do we understand who weighed in on them? Once the decision is made, where are we recording that? We use architectural design records, ADRs. We talk about these all the time. To capture those decisions made and have that archeology of what alternatives were considered. Why did we pick the one that we did? Again so that when you have new information, what is the process to factor that back in and keep iterating and keep going? So yeah, establishing that single source of truth was key.

Charles Humble:

Yeah, yeah. No, I'm a huge fan of the architectural record thing, because so often you come back to, why was that decision? Sometimes even if it's your decision.

Liz Saling:

Wait, have you ever written one and you go back and look and you're like, "I wrote that? I don't even remember."

Charles Humble:

Yeah, yeah. Totally. I don't remember making that decision. I'm sure it was somebody. No, it was me. It says it right here. Yeah, totally. Totally. It's a really important thing I think. How is internal documentation at GitHub managed?

Liz Saling:

We still are using the hub as a central place of engineering information, especially what DX calls user facing documentation: our intention of how we use GitHub and how we standardise. So that is all still in, we call it the hub. It's our central location. There is still much information that we keep right close to the code, either as a source of truth that the hub will reference, or vice versa, so that we can refer back out to its location in the hub. Because we do feel that it's important to keep documentation discoverable right where it's needed in the repository, in the work that you're doing. So there is some cross referencing and indexing there.

Were there other major work-streams we haven’t talked about?

Charles Humble:

So we've talked about merge queues. We've talked about documentation. We've talked a little bit about build time. In terms of the sort of other work streams, were there any particular major ones that we haven't touched on?

Liz Saling:

So we took on a huge effort to completely wrap up containerizing all of the things. Every different piece we had, there were still pieces that were shipping to bare metal. There were virtualized pieces. We had a bunch of different deployment mechanisms. We had long before decided that we would fully containerize. And now was the time like, "Okay, we need to finish this. Let's go." That was a huge push. And there were some tricky bits to figure out how do we do this? In particular, when you're talking about Git, when you're talking about databases and stateful and stateless containers and what is our strategy and how do we do the very specific bits of this? So yeah, we ended up being able to wrap that up as part of that effort that summer.

Having made this huge investment in DevEx, how do you then avoid everything sort of slipping backwards again?

Charles Humble:

Having made this huge investment in DevEx, how do you then avoid everything sort of slipping backwards again, as you get back into focusing on features?

Liz Saling:

Oh, this is my favourite thing that came out of this effort. And to be clear, it started before, but it was really solidified as part of that summer project that we wanted what we called anti-slip measures. We make all these investments, we are essentially setting a new baseline that needs to be met. So the key here is a) make sure you're watching those metrics and have some built in alarms going off when you're starting to see that trend backwards. Yes, of course, sometimes this is what tech debt is. We're going to make the investment. We're going to let that slide a little bit to invest in something else, but do it on purpose. Understand how much that represents. How much are you sliding? Now we know how much it's going to take to get that back. It's much easier to just keep that momentum up as you go and make little adjustments as needed, or decide we're just not making enough progress to what our target is and then you can make a larger investment.
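The "anti-slip" idea of holding a baseline and raising an alarm when a metric trends backwards could look something like this minimal sketch. The metric names, the 5% tolerance, and the `find_slips` helper are assumptions made up for the example; the transcript does not describe how GitHub's alarms are actually built.

```python
def find_slips(baselines, current, tolerance=0.05):
    """Return metrics that regressed past their baseline by more than `tolerance`.

    All metrics here are "lower is better" (for example, minutes to build).
    The result maps metric name to (baseline, current value).
    """
    slips = {}
    for name, baseline in baselines.items():
        value = current.get(name)
        if value is None:
            continue  # no reading for this metric in the current period
        if value > baseline * (1 + tolerance):
            slips[name] = (baseline, value)
    return slips


# Hypothetical baselines set after an investment period, and this month's readings.
baselines = {"build_minutes": 10.0, "deploy_minutes": 30.0}
current = {"build_minutes": 10.2, "deploy_minutes": 36.0}

print(find_slips(baselines, current))  # {'deploy_minutes': (30.0, 36.0)}
```

The tolerance leaves room for the deliberate, understood slippage Liz mentions, while anything beyond it gets surfaced so the trade-off is made on purpose.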

Liz Saling:

But having that data coming in and understanding where that baseline is so that you can just say, "Nope, we're setting the bar. We're not going to fall below this." So we have a fundamentals program that sets us up on a monthly review cycle of essentially where those different bars are at. And the bars can be everything from: does all of our code have what we call durable ownership? Is that present? Do we know who's responsible for this and watching over this? Through security reviews. What are the dependencies? Are they out of date and do we need to make updates? And to what we call flaky tests. We watch for tests that are failing and passing on the same sets of data and the same runs.

Liz Saling:

Are we paying attention to those? Because those cause friction. So we've got a whole host of things that we measure and watch over to make sure that we're not slipping too far, or if we are, we're aware and we know how to factor that back into planning. And that fundamentals program makes it really easy to continue to factor that information back into planning, so that product is aware and engineering has an audience to say, "Hold on, things are getting a little... We need to take care of this." And it just gives a really nice routine around how we do that to literally prevent having to stop for another quarter as that awareness comes. And we're like, "Oh no, no, no, we really need to do this."
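One simple way to spot the flaky tests described here is to look for a test that has both passed and failed against the same commit. The sketch below assumes a hypothetical list of `(test_name, commit_sha, passed)` records; the record shape and the `find_flaky` helper are illustrative, not GitHub's internal format.

```python
from collections import defaultdict


def find_flaky(results):
    """Return names of tests that both passed and failed on the same commit.

    `results` is an iterable of (test_name, commit_sha, passed) tuples.
    """
    outcomes = defaultdict(set)
    for test, commit, passed in results:
        outcomes[(test, commit)].add(passed)
    # A test is flaky if any single commit produced both outcomes for it.
    return sorted({test for (test, _), seen in outcomes.items()
                   if seen == {True, False}})


# Hypothetical CI history.
results = [
    ("test_a", "abc", True),
    ("test_a", "abc", False),   # same commit, different outcome: flaky
    ("test_b", "abc", True),
    ("test_b", "def", False),   # different commit: a real change, not flaky
]

print(find_flaky(results))  # ['test_a']
```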

How do you manage service ownership?

Charles Humble:

Yes. Right. It's interesting you mentioned the business of service ownership because this came up when I interviewed Sarah Wells. So Sarah Wells ran engineering enablement at the Financial Times for a period of time. And I interviewed her for the previous episode of this podcast. So I'll link to this in the show notes. It's another good episode. And she mentioned that the Financial Times have a tool called BizOps. And BizOps includes various bits of information about a microservice, including who owns it. What's the equivalent of that at GitHub?

Liz Saling:

With GitHub, the way we do that is with code owners so that we can establish down to lines of code almost who's responsible for what areas, because that's a whole thing. It's our natural tendency. We want to keep moving. We want to keep generating the new features and attracting customers and building value for our users. And you have to kind of fight that natural tendency to just keep rolling and make sure that we're taking care of what we have. And if we're just quickly cranking out new things and not taking real intention, depending on the size of the feature or the product line, is there a team representative that can pick this up and support it and maintain it? So this helps us fight that natural like, "Oh, let me just take one or two people and just throw them at this and get this solved and then move on to the next thing." It's like, well, yes. And we need to make sure we're set up for a little more long term success as we go too. And that fundamentals program and that durable ownership there really, really helped nail that for us.

Do product teams look at tech debt metrics?

Charles Humble:

In terms of the sort of health metrics that you're monitoring, who is involved in that? Is that purely engineering or do you have product teams looking at the sort of tech debt related metrics as well?

Liz Saling:

We do have product looking at that with us. And we have project and program managers that help make sure that this is happening on the regular. And as much as possible we've made this just part of our routine planning cadence so that it gets factored in just as equally as the read on the market and the users' needs and what's coming out of support. This is just another element that gets factored in regularly for us.

How do you monitor internal developer satisfaction?

Charles Humble:

And then how do you look at your own internal developer satisfaction? Do you run developer satisfaction surveys or something like that?

Liz Saling:

We definitely do. So between every three to six months, we will issue a new survey that can be as light as which areas are you in and quick beat on how it's going for you to very detailed specifics, measuring every part of the developer experience and providing space for open feedback. So we generate what's pretty common with a satisfaction score so that we understand are we holding steady overall? How does it break down across different platforms and different product lines and offerings and different workflows? And this really helps us dial in that investment. Again, marries that what we're hearing and what we're suspecting and what we're seeing with the metrics and brings it all together so that we can really focus on the developer experience specifics.

Liz Saling:

Now I will say this. The satisfaction surveys also are what I call a trailing indicator. I want to make sure that we're getting incremental feedback out of the support channels, out of the leading indicators, such as shipping times or build times or success rates, even down to what are the number of steps, manual steps, that it takes to ship a thing. So you take all that together and it just gives this nice circular feedback loop and input mechanism for us to factor all of this back into where are we going to invest? Do we need to invest more in the monolith development experience right now or has that hit a point where it's holding steady? And now we need to turn and look at shipping on one of our enterprise platforms perhaps. And that really helps us understand the investment and how many engineers it's affecting in a quick... I say quick and easy, except I'm not the one putting the dev sat survey together. So barring that, it's a quick and easy mechanism for us to get that information quickly and factor that back into our plans.

How do you avoid survey fatigue setting in?

Charles Humble:

If you're surveying that frequently, how do you avoid survey fatigue setting in? I think this is a real phenomenon. And I think we've all experienced it probably where we've worked as an organisation where they send an employee survey out, it might be weekly or quarterly or monthly or whatever, but you dutifully fill it in the first few times and then precisely nothing happens. And after a while you start, or I do anyway, getting a bit snarky in the answers and saying, "I refer the right honourable person from HR to the answer I gave last time."

Liz Saling:

Yes, exactly that. Well, and let's be clear. I think we had an element of that for sure. There were a few years where there weren't large investments being made at the time, they were being directed elsewhere. Totally made sense for the business, but keeping the surveys going, we were seeing the same themes. And I think that actually helped. When you see the same thing come up survey after survey, it helped justify the reinvestment and the change in focus. And that is exactly the key because, yes, survey fatigue is real. I skipped. Should I say this out loud? I skip surveys. So the whole point is when people understand, "Hey, we are going to be investing more. We want to hear from you. Where do you think the biggest impact for you would be? This is your chance to have a say in the planning process."

Liz Saling:

And that definitely has helped increase the engagement. And then of course communicating back, "Hey, we heard from you that our testing times could be better. So this is what we have done. This is the early feedback from the people that we've engaged to prove this out with." So that circling back with communications on how we were using that survey data and what it actually turned into. It turned into creating a new team to focus on this problem, which turned into a new testing environment, which turned into these are the things that help reinforce... We're actually using this. This information is really important. If you want help, this is one way that you can provide input into getting the help that you need.

Charles Humble:

I think that's absolutely key. I think letting people see that you're not just paying lip service, but you are actually doing things on the back of what they're saying is super important.

Liz Saling:

Exactly. And frankly admitting when we're just paying lip service. Like, "Yeah. You know what? We see this. This is the third time this has been a trend. We made decisions to focus on another thing and we hear you. Sorry we didn't get on this sooner." Admit it when it happens.

How do you handle feedback and manage communication across different groups at GitHub?

Charles Humble:

Absolutely. I do think that's key really. Just being honest about it. How do you handle feedback and manage communication across different groups at GitHub?

Liz Saling:

A big thing that we try to do is really connect with our users, our internal users I'm speaking to right now. Of course, GitHub wants to connect with its customers and we do that as well. But even internally the developer experience team is a huge fan of... We have a couple things that we do. We do rotation programs where people can come and work with us, especially as we're big on Ruby on Rails, that's what the monolith was developed in. So we have what we call a Ruby Architecture Rotation Group, where people can come in and work with some of those core contributors and partner with them and understand the more inner workings of the monolith.

Liz Saling:

And then similarly, with more of the other service based things, we'll have these opportunities to come and partner with us. And we're actually discussing now reversing that, where we go out and work with teams more directly so that they understand. Because a lot of times you can produce documentation, you can hold demos and you can talk, but people will still miss the opportunities that are there to use the systems that we have. So what can we do to really increase the awareness there with some partnering and embedding? So these are all the things that we have done and are considering starting to do more of. We're huge fans of the partnering and collaboration model. I mean, we're GitHub, that's what we're built on is open collaboration. So we try to do that in our engineering group as well.

What are the key takeaways would you say from the last couple of years of work?

Charles Humble:

So trying to bring this all together, if you look back over the last couple of years, so since you did this quarter long pause, what are the key takeaways would you say from the last couple of years of work?

Liz Saling:

I think the big takeaways there were really doubling down on that fundamentals program and making sure that was just a routine part of how we run our business, making sure that we're listening to the user input and factoring that in just as much. Again, we're really unique because it also ends up feeding back into our platform. So there's a lot of synergy there. And then proving out the solutions that we send to our customers internally and making sure that the platform is really solid and we're proud to use it and we can't wait to tell everybody about it. These are some of the key hallmarks of what helps when we're thinking about tech debt. Take it in a bigger picture and look at how the impact is. Not only to us, but again, to our entire customer base that we adore and we identify with and making that an intentional investment and part of our everyday operations.

Charles Humble:

Liz, thank you so much for taking the time to talk to me today for this episode of the Hacking the Org podcast from the WTF is Cloud Native? team here at Container Solutions.

Liz Saling:

Absolutely. My pleasure. Thank you for having me on.
