Culture, Continuous Delivery, WTF Is Cloud Native, Hacking the Org

Podcast: Randy Shoup, VP Engineering and Chief Architect at eBay, on Improving Developer Velocity

In our first “Hacking the Org” podcast, Charles Humble talks to Randy Shoup, VP Engineering and Chief Architect at eBay. They discuss the evolution of eBay’s architecture, discovering problems in engineering using value stream mapping, measuring engineering team velocity, and making cultural change.

Subscribe:  Amazon Music | Apple Podcasts | Google Podcasts | Spotify

About the interviewee

Randy has spent more than two decades building distributed systems and high performing teams, and has worked as a senior technology leader at eBay, Google, and Stitch Fix. He coaches CTOs, advises companies, and generally makes a nuisance of himself wherever possible. He talks a lot -- sometimes at conferences about software -- and is interested in the nexus of culture, technology, and organization. He is currently VP Engineering and Chief Architect at eBay.

Resources mentioned

Accelerate by Nicole Forsgren, Jez Humble and Gene Kim
The Flow Framework by Dr. Mik Kersten
Making Work Visible by Dominica DeGrandis
Project to Product by Mik Kersten
The Principles of Product Development Flow by Don Reinertsen

Full transcript

Introductions

Charles Humble:

Hello, and welcome to the first ever episode of "Hacking the Org", a brand new podcast from the WTF is Cloud Native team here at Container Solutions. With this podcast, we're aiming to bring together some of the most experienced software engineering leaders and talk to them about their experiences, covering topics such as building and leading high-performing software engineering teams. I'm Charles Humble, Container Solutions' editor in chief, and I'm joined today by Randy Shoup. Randy has spent more than two decades building distributed systems and high-performing teams and has worked as a senior technology leader at high-profile firms, including eBay, Google and Stitch Fix. He coaches CTOs, advises companies and is a regular conference speaker. He is currently VP Engineering and Chief Architect at eBay. Randy, welcome to the show.

Randy Shoup:

Thank you, Charles. It's wonderful to be with you personally. We've known each other for many years and what an honor to be your first guest.

What are you and your team responsible for at eBay?

Charles Humble:

Thank you. What are you and your team responsible for at eBay?

Randy Shoup:

Like you mentioned, I'm the Chief Architect and I'm the VP of Engineering for what we call internally "engineering ecosystem and experience". That's a big old mouthful, but just think the developer experience. eBay has on the order of 4,000 engineers, and my teams build the developer frameworks, the CI/CD pipelines, the environments where people do development, staging, et cetera. We do the mobile foundations and, actually, we're also responsible for the external APIs that third-party developers use to help our buyers and sellers interact with eBay. But yeah, you can broadly think of it as the developer experience part of eBay.

How has eBay’s architecture evolved?

Charles Humble:

Now, eBay has obviously been around a relatively long time, sort of 26 or 27 years, and architecturally I kind of think of it as being J2EE, sort of Java enterprise from that period. But obviously, the architecture will have evolved over time. So can you give us an idea of what the current architecture looks like?

Randy Shoup:

We're definitely still a lot of Java, but never EE, and I'll tell you about that. POJOs all the way. I'll give you a super brief history. We are, yeah, 26, 27 years old now; we've been around for a while. The first version was built by Pierre Omidyar, the founder, over a three-day weekend, over Labor Day in 1995. He wrote it in Perl. Every item on the quote-unquote site was a file. There was no database, so it didn't scale, and it wasn't intended to; he was just messing around with this new cool thing called the web. It wasn't even called eBay at the time. The next generation is what we called V2, which was built in C++. It grew over several years, call it five, six, seven years, to 3.4 million lines of code in a single ISAPI DLL, and if you understand a lot of those words, you're shrinking and cringing. But just know that it was 3.4 million lines of code, not just in a single process, not just in a single shared library, but basically in a single class.

Charles Humble:

Woah!

Randy Shoup:

A single class. As bad as you can imagine, it was worse. V3 was the modernization of that. That started about 20 years ago now, in 2002, and it was a full rebuild in Java. No EE; it wasn't called Java EE, it was J2EE at that time, but we didn't use any of the EJB stuff. We learned super quickly that wasn't going to scale, as Spring showed us later: everything POJOs, Servlets, et cetera. You wouldn't have called it microservices; now you would call what we did microservices, but at the time it was more mini-applications. We took this monolith, the entire eBay in a single class, and broke it up into, at the time, let's call it 200-ish different clusters: "This application server cluster serves the selling pages, another one the buying pages, another one the search pages, et cetera."

You can think of it, in modern terminology, as divided by domains at the top, and then there was a sea of shared databases. We can talk about that if you want, but we learned the sharing part of that wasn't a great idea. We actually are just now, literally in the next couple of months, getting rid of the last vestiges of that V3 on the site. But since that time we've had two full generations of further improvements in Java. One we called, very creatively, V4, which morphed into something that we internally call Raptor; that was Spring, but not Spring Boot.

Spring pre-Boot: pretty monolithic, not the entire eBay in one JAR, but a whole area in several JARs, and very tightly coupled. That started about 10 years ago. Then, starting about five years ago, there's our more modern framework, which we call, super creatively, raptor.io, which is Spring Boot, so in the Spring ecosystem: very separable, pluggable JARs. And as part of that move, about five years ago, we moved essentially the entire front end to Node. You can think of it as, let's call it, 10% of eBay represents that top-level user-visible UI, which is mostly Node, with isomorphic JavaScript back and forth between the browser and the server, and then the back end is mostly various flavors of the Java frameworks.

Charles Humble:

How do you handle interservice communication between the various Java microservices that make up your backend?

Randy Shoup:

Yeah, again, 27 years; we have a lot of ways of messaging back and forth. We have a pretty good legacy thing that we built in 2006. The internal name doesn't matter; we call it the Business Event System, or BES. I won't say it's Kafka, but it was a lot more like Kafka than you'd think. All the messages were stored persistently in an Oracle database, but then individual consumers, just like in the Kafka ecosystem, and I'm using Kafka terminology here, essentially have their own pointers into the messages. The different consumers can be at different points in the event stream, if that's making sense to your listeners. We also have Kafka, so just as with the multiple generations of the core stack I was telling you about, we have multiple generations here: there are still some applications using that legacy BES system we built internally, and then the modern stuff is all built on Kafka.
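For readers who haven't used Kafka, a minimal consumer sketch with the standard Kafka Java client illustrates the "pointer" idea Randy is describing: each consumer group commits its own offset into the topic, so different consumers can sit at different points in the same event stream. The broker address, topic, and group id below are hypothetical.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ItemEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        // The group id is the key idea: each group has its own committed
        // offset ("pointer") into the topic, so a search indexer and a
        // billing consumer can be at different points in the same stream.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "search-indexer");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("item-events")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
                consumer.commitSync(); // advance only this group's pointer
            }
        }
    }
}
```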

What prompted you to come back to eBay?

Charles Humble:

Changing topics a bit, this is your second stint at eBay. What was it that prompted you to come back?

Randy Shoup:

I joined 18 years ago, back in 2004, for the first time, and I worked there for about seven years, mostly on eBay's real-time search engine. Happy to talk about that in great detail, but that's probably not the subject of this podcast; it was super interesting and groundbreaking at the time. I worked for eBay for seven years and then left to co-found a startup with another eBay colleague. No one's ever heard of it, so we were in the 99% as opposed to the 1%. As you mentioned in your intro, I worked at Google, worked at Stitch Fix, worked at WeWork actually more recently, and a few other places in between for short stints. And yeah, then just about two years ago now, the current CTO, who was a colleague of mine back in the day, asked me to come back, basically to solve product development velocity, or that's at least how we phrased it internally.

Do we need to do more Agile? Yes. Lean? Yes. DevOps? Yes. The idea was to kind of transform eBay, and I was super interested. I love eBay. Maybe we'll talk in more detail about why I came back, because there are several reasons. But the thing I wanted to come back and do was this: bring eBay into the modern world of Lean and DevOps and Continuous Delivery. And I felt like I had the tools in my toolbox to be able to help, because I'd seen it done at other places. Maybe 10 years ago I had a sense of what I would want it to be, but I didn't have the tactile feeling of what it would be like to be there and what it would be like to get there, if that makes any sense?

What problems did you find when you returned?

Charles Humble:

Yeah, it totally does. And what did you find when you came back? I'm imagining after 26 or 27 years, there would be a reasonable accumulation of technical debt. Is that fair?

Randy Shoup:

Yeah. Oh, completely fair. When I came back in June of 2020, in the middle of the pandemic, there were actually a bunch of people I worked with for over a year and a half without ever meeting. I'm sure that's a familiar situation for your listeners; hopefully it will change. I did what the Lean people would call a value stream map. Again, eBay's a big place, call it 4,500 applications and services, roughly 4,000 engineers, so I didn't talk to everybody. But I did come back and say, "Hey, I want to solve this problem of making eBay faster," whatever that means. "Give me three teams as a sampling that I can talk with in detail." Really walk through with them: what does it look like when somebody has an idea, and how long does it take? What happens between that idea and it becoming a feature that customers are actually using in the real world?

Again, the Lean people would call that a value stream map or a value chain. Basically, you're just looking at the system end to end: what are the steps that happen between somebody having this idea? An idea becomes a project, a project becomes committed code, committed code becomes a feature on the site, and then we iterate on that feature in real time with experimentation and analytics. And so I looked end to end with the teams at all those things, and we identified problems in every area. Everywhere there were opportunities for improvement.

And again, like you say, 27 years of the business, and we'd accumulated lots of debt: a lot of technical debt, a lot of organizational debt, and just a lot of "we've always done things X way", where that way was the best practice we had 20 years ago. Lots of opportunities for improvement. And so we looked end to end at those phases I mentioned: planning is idea to project, software development is project to code, software delivery is code to a feature on the site, and post-release iteration is how I frame the experimentation and analytics. And again, like I say, there were problems and issues and opportunities in every one of those areas.

But what became very clear was that the bottleneck, think theory of constraints for your listeners that know Lean, was absolutely in the software delivery area. On average, each of those 4,000 applications and services was deploying once or twice a month. Now, there are a lot of things upstream I want to change about the architecture. I have this Chief Architect title, and you'd think that I'd come in and say, "Let's make all these architectural changes." And believe me, I want to. But the open and honest truth is that you can't change the architecture if you only have 12 bites at the apple every year, if that makes any sense, right?

That's a big lift to say, "Yeah, let's do a lot of typing and then, every month, get it out there." And I was just using the architecture as an example: we want to do much faster experimentation, we want to get features to customers much faster. So in all areas, that software delivery part of the product development life cycle was the bottleneck, and that's what we focused on for the last year, and we made some really significant improvements there.

What is causing the bottleneck?

Charles Humble:

In terms of that bottleneck, can you break down a little bit more what's causing that? Is that too much work in progress? Is that lengthy build times, too much manual testing? What would be causing that bottleneck?

Randy Shoup:

It's all of the above. The things we have been tackling, and continue to tackle, are improving what I'll call the inner loop for developers: improving the day-to-day, hour-to-hour productivity of developers. So that's things like build time, startup time, PR validation time, testing: way more automated testing, the testing pyramid, right? We traditionally have had a lot of integration tests that are done expensively and extensively at the end of the process, and we're trying, slowly but surely, to move that to the left. Move all the other things to the left too: move security testing and accessibility testing and performance things earlier and earlier in the pipeline, and have a much better CI/CD pipeline. We've had one for a while that we rolled ourselves; my team builds it and maintains it, and so there's a lot more investment in making that better.

And so that is, in some sense, about the individual productivity of individual developers. The software delivery half of that is all about, and this is right out of the Accelerate book, lead time for change: after a developer commits her code, how long does it take, and what happens, between that and when it becomes a feature on the site? Sometimes that was measured in weeks here, and it shouldn't be. Some of those same things I mentioned dovetail with that: way more automated testing, way more shifting left on things like security, but also automated deployments, right? Canary deployments, which I'm happy to talk about. We also introduced a capability, which we didn't invent, but I don't know that there's an industry term for it; we call it traffic mirroring. Usually in a read-only situation, you fire a real-world production request at both the production version, the old version, and the new version you'd like to use, and then you compare the results, right? You get your JSON responses and you compare them. And so that's a way that is, A, super easy to tell whether we've changed something and, B, works as if that thing is a black box. Let's imagine you had no tests because it's super legacy or nobody invested in it; that's still a technique you could use for just about anything.
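To make the comparison step concrete, here is a minimal sketch using Java's built-in HttpClient. The host names and endpoint are hypothetical, and a production implementation would normalize the JSON (timestamps, field ordering, trace IDs) before diffing, and log mismatches for inspection rather than just printing them.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

public class TrafficMirror {
    private static final HttpClient client = HttpClient.newHttpClient();

    // Fire the same read-only production request at the current (old)
    // version and the candidate (new) version, then compare the bodies.
    static void mirror(String path) throws Exception {
        HttpRequest current = HttpRequest.newBuilder(
                URI.create("https://current.internal.example" + path)).GET().build();
        HttpRequest candidate = HttpRequest.newBuilder(
                URI.create("https://candidate.internal.example" + path)).GET().build();

        CompletableFuture<HttpResponse<String>> oldResp =
                client.sendAsync(current, HttpResponse.BodyHandlers.ofString());
        CompletableFuture<HttpResponse<String>> newResp =
                client.sendAsync(candidate, HttpResponse.BodyHandlers.ofString());

        if (!oldResp.get().body().equals(newResp.get().body())) {
            // A mismatch means the candidate changed observable behavior,
            // whether or not the service has any tests of its own.
            System.out.println("MISMATCH on " + path);
        }
    }

    public static void main(String[] args) throws Exception {
        mirror("/item/123"); // hypothetical read-only endpoint
    }
}
```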

Anyway, we invested in all those areas. The other area that's related is the mobile space. When I arrived, we were releasing our mobile iOS and Android apps once a month, and then I challenged the team, my own team. I said, "Hey, let's try to see if we can release once a week." And they were like, "You're crazy. Here's all the stuff that happens after we do this code freeze." And I'm like, "Okay, let's work on the process of how that works, do more things in parallel and more things upstream, and automate more stuff."

And I'm so proud of this, because the biggest skeptics were my release team. After a couple of months, we went from monthly to biweekly releases: we started in January of last year, and by, let's call it, April or May, we went to biweekly releases. They were like, "Well, that's not so bad." And then our big old stretch goal, which people thought was unattainable, was weekly by the end of the year. And then, while we were doing these biweekly releases, they were like, "Let's try one in July. Let's try a weekly release and see how that goes." And it was like, "This is great." Five months early, we met the stretch goal people didn't think we could meet.

So we've been weekly ever since July of last year, and that's amazing. Obviously, we're getting features out to customers much, much faster; we've reduced the lead time there from being measured in months to being measured in days. But the other thing, as you'd imagine from the Accelerate book and the State of DevOps reports, you would predict, and it has happened: it has completely changed the mental model of the development teams and the product teams, because they're not rushing to make the train that's leaving when the next one doesn't leave for a month.

Charles Humble:

Right.

Randy Shoup:

Right? Instead, it's like, "We're going to try to make this Wednesday. And if we don't make this Wednesday, we're going to make next Wednesday." No big deal. And again, you can imagine what a reduction of the stress level that is for the teams, but also what an increase in the quality that would be. Because again, nobody's racing and cutting corners and stuff to get there, which, open and honest, there were some incentives to do before, and there aren't now. And similarly, when, not if, when we find issues that happen in our apps, software's hard, so we find an issue where there's a bug or a certain set of users aren't able to do stuff, we used to have a big old panic, like, "Oh, we're going to have this big exception process where we have to release off cycle." And now it's like, "I don't know, can you wait till Wednesday? Because that's what's coming." And we're actually in the middle of one of those right now and it's no big deal. I love it.

Charles Humble:

It's such a profound shift, and it is quite counterintuitive. I mean, when I started in big old enterprise IT, 20-whatever years ago, we might release quarterly or even twice a year, so if you missed the release, it could be three or four or six months before the next one. You didn't want to do that, right? So you would rush and slam things in, and then there'd be this long code freeze process when everyone was trying to fix things. But this idea that moving faster would give you higher-quality outcomes: when I first heard that, it seemed very wrong. I accept it now, but it took a long time to get my head around it.

Randy Shoup:

Charles, it's so incredibly counterintuitive, even to those of us who have experienced it. We have lived this world and it's still counterintuitive, this idea of, "Okay, I want to increase quality. Well, what I should do is slow down and think harder, think really hard upfront, and try really hard and put a lot of stress into it." And it turns out that's exactly the wrong thing to do. It's like, "Hey, let's run a marathon. Let's just start right now. Start running." No, you exercise every day and you work your way up to it slowly but surely. I started my career, and I'll date myself super clearly because it's easy to find on LinkedIn, in 1990 working at Oracle. I didn't work on the database, but I worked on an ad hoc query tool on top of it. And we did yearly releases.

That was Oracle version six. I remember that one going out; that's how old I'm talking about. But anyway, the whole suite would all go out together, and there was a big date, I don't remember, June 1st of 1991, when all this stuff's going out. And so yeah, to your point, it was not only months and months and months of planning and months and months and months of typing, but then months and months and months of "We're not changing anything in this code freeze, but it doesn't work." I'm making it up, but it's probably not wrong that four months of those 12 were just getting the features we thought were in, and thought we were done with, to actually work. Yeah, 100%.

And this is the great beauty of Nicole Forsgren's research that's part of the State of DevOps surveys and expressed in her Accelerate book: there is no trade-off between speed and stability or quality; in fact, they are self-reinforcing. The best way to get faster is to invest in higher quality through automation and good practices, and the best way to have good quality is to invest in going faster. And the insight, which again is not intuitive, and I know you know this, is smaller units of work, smaller units of work. Why is that a thing? It's because, as we were talking about, if I release a year's worth of software, it's going to be four months to stabilize that thing, right? We've lived that back in the bad old days. If we're releasing a day's worth of software, or an hour's worth of software, and it breaks, there's not that many things that it could be. Right?

Charles Humble:

Right, Yes.

Randy Shoup:

You can tell by inspection, looking at it, "That's not going to break these 95 things. I guarantee you that. That doesn't touch any of those things." Okay, so the code review in the forward direction is super easy because it fits in your head, and the diagnosis in the reverse direction, like, "Okay, when I'm wrong, how many things could it be? We only changed two things or one thing." And so, A, it's easy to diagnose, but B, whether I diagnose it or not, just roll it back; what's the big deal, right? "Oh my God, we lost an hour of Randy's work." "I'm okay with that."

If it stops customers from having a bad experience, then we can calmly figure it all out, right? This is why the Accelerate metrics around speed, deployment frequency and lead time for change, are coupled with stability or quality metrics, change failure rate and time to recover. And those things are all self-reinforcing. The organizations that are really bad at some of them are bad at all of them, and the organizations that are fantastically world-beating at some of them are world-beating at all of them.

How do you find the trade-off between independently deployable microservices and operational complexity?

Charles Humble:

Right, yeah. I want to talk a bit more about the Accelerate metrics in a moment, but just to carry on this train of thought for a moment: part of where this idea of speed of deployment leads you is into highly distributed systems, sort of independently deployable microservice-type units, or whatever terminology you want to use. Which is fine, but there is an important trade-off there, which is that the interactions between the different components become more and more complex. So what you are doing is trading off speed of deployment against a kind of operational complexity; keeping the thing running becomes harder. And sometimes, as well, you can get weird interactions between services that you didn't necessarily predict. You put a new service in and yes, the change in the service is small, but it turns out to have an unexpected consequence somewhere else that you couldn't reason about, because it's in the glue code, as it were, rather than the service itself. Is that something that you've got experience of?

Randy Shoup:

All the time. I know you know this; it's in your question. I mean, you know this, yeah, 100%. You just described distributed systems. Was it Lamport who said, "The definition of a distributed system is a machine I've never heard of can break my thing"?

Charles Humble:

Well, it's not exact, but yes, absolutely, it was Lamport. Yes.

Randy Shoup:

I love that, man; he's so clever. It's 100% what you're expressing. I'll just take it a step back: you're describing the trade-off of going from a mostly monolithic system, where everything's in one place and it's easy to see all the interactions because they're in one place, to something more distributed, like you say, microservices or individually deployable components, processes, services, whatever word you want to use. And like you say, the interactions between them are more complicated, and therefore there's more opportunity for not seeing them. I think that is true. Also, it doesn't have to make it worse. The organizations that do service-oriented architecture really well think very carefully about those boundaries, and eBay's getting there, I'm open and honest. But when I worked at Google, and I'm making this up, but Google probably had 10,000 services when I was there. And as a service owner, which I was, I ran the engineering for App Engine, that's Google's Platform as a Service.

We interacted with tons of other systems, but not 10,000, you follow me. We interacted with, let's call it, tens of services, a small number. And so that was our world: those tens of services, the ones that depended on us and the ones that we depended on. And we just needed to make sure that those interactions made sense. You're not really implying this, Charles, but somebody who hasn't had the experience of working in a very large service architecture might think it's really overwhelming: "Oh my gosh, how can I possibly keep thousands of services in my head?" And you can't, and you don't have to. No one can do that, literally no one. The cognitive load is too high, and so what you do need to understand, as a service owner and a contributor to a service, is: what services depend on you, and what services do you depend on?

What messages do you send and receive? And that's it. Is there other stuff going on broader than that that is interesting? Yeah. But mostly you don't have to care about it. It's exactly the same way we go about our daily life: there's a whole world with 8 billion people on it, but we're not thinking about dealing with 8 billion people. We go about our lives and we deal with some small number of hundreds of people, or tens of people, or ones of people in the pandemic. Even though there is a lot more to the overall ecosystem, our experience of it as an individual human is pretty bounded.

Using the Accelerate metrics

Charles Humble:

Now, you mentioned the Accelerate metrics, sometimes called the DORA metrics, earlier. Are you using those metrics to measure the impact you are having as you work on the developer velocity issue at eBay?

Randy Shoup:

Exactly, right. Yeah. It wasn't my word, but it's a fine word: we called what we're doing the velocity initiative. And again, as I mentioned, we did those value stream maps and we figured out what most teams bottlenecked on, and we determined that it was mostly software delivery. And that's wonderful, because Nicole did nearly a decade of research to tell us how to fix that, which is amazing. Again, as we mentioned, the variously named Accelerate metrics or DORA metrics are the four, right? Two speed metrics, deployment frequency and lead time for change, and then two stability or quality metrics, change failure rate and mean time to restore. So yes, we are measuring our progress using those metrics, and I'd be happy to talk in more detail about that. Actually, super briefly, we can go into a lot more detail, but in 2021 we did what we call internally a pilot. A bunch of us knew it was going to work, but we didn't know what we didn't know about eBay, if that makes any sense?

We worked with about 10% of the team, say 300 to 400 engineers, across a cross-section of all parts of eBay: teams in search and selling and buying and shipping and payments, a cross-section of everywhere. The punchline is we ended up doubling their productivity through all the work that we did. Again, improving build and startup time, improving their CI/CD infrastructure, improving their practices. We worked together with the domain teams, in our phrase the pilot product engineering teams, where they made a bunch of changes to their processes. They told us things that they needed out of the tooling and the infrastructure, and my team built, or partnered to improve, the tooling and the infrastructure, and that partnership worked great. The punchline there is we doubled the teams' productivity, holding the team size and the team composition constant.

They produced twice the features and bug fixes that they were doing before. That's a metric you would call flow velocity if you use the Flow Framework from Mik Kersten, but basically think of it as stuff getting through the overall product development life cycle more quickly than it was before. In terms of the Accelerate metrics, eBay overall was then, and is still now on average, a medium performer along those four key metrics: again, deployments once or twice a month, lead time of a week and a half. And now those teams that we worked with are solidly high performers: instead of twice a month, they deploy two or three times a week, and lead time has gone from 10 days down to two days. And we're keeping going, so those teams we worked with before are going to continue to get better, hopefully on the way to multiple deployments a day and a lead time of an hour, which is industry leading.

Yeah, we definitely used those Accelerate metrics to notice areas of opportunity, right? Like, "Wow, it looks like your lead time is high." And we worked, over time, with the teams to make that not feel blamey, do you know what I mean? At first the teams were like, "Wait, what? You've got a big old fancy title and you're coming in and telling me that I'm not good. That doesn't feel good." But then you couple that with "I'm from the government and I'm here to help", and that's a joke in my country but is real in yours, and it's like, "Hey, open and honest, we actually have the same goals. I want you to be better, and I bet you do too. I'm here to offer you suggestions and tools and resources to help you do that."

Where are you getting the data from?

Charles Humble:

Where are you actually getting the data from to produce the measurements for your DORA metrics?

Randy Shoup:

I'm so glad you asked that, because I did really want to tell this story and I had almost forgotten. When I arrived at eBay, because I also have the team that builds the developer tools, including the dashboards for these metrics, we already had a dashboard where we were tracking bugs and various other things. And the leader at the time came to me very proudly and said, "Hey, can we show you this? We have this one metric, this one number that expresses", I forget the phrase, but something like production stability. And basically, it was a combination of all the Accelerate metrics.

They weren't familiar with the Accelerate metrics, and they came and gave me this one number. I was a math major, so they showed me the formula, and it's got exponential decay and weighting and all this stuff, and I'm like, "It's too complicated. I know exactly what all the Greek letters in there mean, and it's still too complicated. Instead, what if we broke it out into these four things? And oh, by the way, there's this book that says these four things actually matter."

We do actually measure these; we've had a dashboard going back almost a year and a half. It was the first thing we did, because we had all the data, which was amazing, and we just put it together in a form where we measured the Accelerate metrics by their definitions. Where do we get them? Let me just do it in order. Deployment frequency: we have our own deployment tooling, we know when we're deploying things, so we can tell when those things start and end, and we can tell what we're deploying. That was relatively straightforward. Then for lead time, we look at GitHub to see when the work starts: when does the commit happen or, in our definition, when does the PR get opened? Because we think that means we're done, or when we think we are done, we're ready for a review.

I'm not talking about draft PRs, but when the developer's ready to be done, that's when the clock should start ticking. And then we know when the deployment happens, right? That's the lead time. And then change failure rate: actually, we have an imperfect measure right now; we measure rollbacks. We measure when we've rolled something out and then rolled it back. We're actually going to be improving that, looking at what we would call P1 bugs, which are the equivalent of what you need a hotfix or a patch for in the Accelerate definition. We haven't integrated that yet. And then mean time to restore is, again, the time between when that deployment happened, which we know, and when that rollback happened, or, in the future, when that P1 bug gets resolved. Does that make sense, Charles?
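As a rough sketch of how those four numbers fall out of the data sources Randy describes, here is a minimal calculation assuming each deployment record carries a PR-opened timestamp from GitHub, a deploy timestamp from the deployment tooling, and an optional rollback timestamp. The record shape and sample data are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

public class DoraMetrics {
    // One row per deployment: PR opened (the start of the lead-time clock,
    // per eBay's definition), deploy finished, and rollback time if any
    // (the current, admittedly imperfect, change-failure signal).
    record Deployment(Instant prOpened, Instant deployed, Instant rolledBack) {
        boolean failed() { return rolledBack != null; }
    }

    public static void main(String[] args) {
        List<Deployment> deploys = List.of( // hypothetical sample data
                new Deployment(Instant.parse("2022-07-01T09:00:00Z"),
                        Instant.parse("2022-07-03T15:00:00Z"), null),
                new Deployment(Instant.parse("2022-07-04T10:00:00Z"),
                        Instant.parse("2022-07-06T11:00:00Z"),
                        Instant.parse("2022-07-06T13:00:00Z")));

        // Deployment frequency: deploys counted over the reporting window.
        long frequency = deploys.size();

        // Lead time for change: PR opened -> deployed.
        double leadHours = deploys.stream()
                .mapToLong(d -> Duration.between(d.prOpened(), d.deployed()).toHours())
                .average().orElse(0);

        // Change failure rate: rolled-back deploys as a share of all deploys.
        long failures = deploys.stream().filter(Deployment::failed).count();
        double cfr = (double) failures / frequency;

        // Mean time to restore: deployed -> rolled back, failed deploys only.
        double mttrHours = deploys.stream().filter(Deployment::failed)
                .mapToLong(d -> Duration.between(d.deployed(), d.rolledBack()).toHours())
                .average().orElse(0);

        System.out.printf("deploys=%d leadTime=%.1fh CFR=%.0f%% MTTR=%.1fh%n",
                frequency, leadHours, cfr * 100, mttrHours);
    }
}
```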

How are you managing the cultural aspects of the change?

Charles Humble:

Yes, it does. That's perfect. I'd like to move on to talk a little bit about the cultural aspects. We've touched on this a bit already, but basically what you're describing is a form of organizational transformation. Now, this is very much the work that Container Solutions does with Cloud Native transformations, where we help companies get the benefits out of a shift to cloud, and we like to say this is really a cultural transformation at least as much as it is a technology transformation. How have you found those cultural aspects? For example, have you got good buy-in from the executive team?

Randy Shoup:

It's definitely a cultural and behavioral change at every level, right? I mentioned a little bit, not too much in detail, the individual team level: it's one thing to give them tools and infrastructure, it's another thing to switch from long-lived feature branches, which has been our default for many years, to trunk-based development. That's a huge behavior change for the individual development teams. Now that we're measuring lead time, there are PR reminders, right? Like, "Hey, you're done with your code and you're waiting on Randy to review it, and I don't want to make you wait, and that's a problem." Things like that, and that's cultural and behavioral at the much lower level. At the much higher level, from the executive organization, there's definitely also a cultural shift.

Charles Humble:

That's really interesting. It sounds like you've got executive buy-in and it sounds like you are getting developer buy-in over time as you're showing them what's happening, but are there other places where you are finding, getting that level of support more difficult?

Randy Shoup:

The ends of the hierarchy really see the problem and the value, and we're still working on the middle. The individual teams that live every day in these things really want to go faster and to have it be easier for them; there's no argument about that. Developers want their inner loops to be tight and to stay in flow, and then the execs at every company I've ever been at are like, "Why is it so darn slow?" I've never found an exec at a company who thought we were going too fast. The execs see the problem and are interested in the solution. What we're working on is the cultural change in the middle, because the incentive structure does change, from "Let's do big stuff with big batches and think really hard" to "Let's do small experiments and iterate quickly."

What are you focusing on next?

Charles Humble:

And that leads naturally on to what had better be our last question, because we are unfortunately getting short on time. Can you just expand a little bit on what it is you're focusing on now at eBay?

Randy Shoup:

Yeah, that's great. In this year, 2022, we are expanding that "pilot" from 10% of eBay's teams to 50%, so the big thing we're doing is scaling this program and bringing the goodness to everybody, or half of everybody. We were talking about, let's call it, 300 engineers; now we're talking about 1,500 or 2,000 engineers, so it's a big scaling of it, and it can't be done in the same way, right? It can't just be me and my partner, Mark Weinberg, on the product engineering side; we can't personally meet with all 1,500 engineers, so we're trying to scale that. And for those teams that are already quite good, the product engineering teams that were in the pilot, they've got a ways to go, right? They're high performers, but there's still an elite performer status to get to, where it's multiple deploys a day and a lead time of an hour.

We're continuing to work with them and remove more of their bottlenecks. And also we're moving upstream and downstream in the product development life cycle from the software delivery part, so there's more focus on the downstream iteration in production. That's a major focus on feature flags this year, getting our experimentation platform even easier to use, and better analytics and faster analytics: not waiting days to get results but trying to get them in hours and minutes. And then there's upstream.

The next bottleneck, I predict, is all about what eBay would call planning, and the fact that there is a word for that is, I guess, itself a little bit of an issue, right? We just need to understand that we shouldn't think in quarters and years; we should have visions in quarters and years, but we should execute in weeks and days. So there's that cultural shift around smaller batch sizes and fewer inter-team dependencies. And then, you hinted at this before, but every team has too much WIP, too much work in progress, and that's a thing that, again, is very counterintuitive to tell execs: "You know what? We're going to go faster if we're not working on so many things at once." And they're like, "What are you talking about, you lazy engineer." "No, no, please read this book by Don Reinertsen." And they're like, "No, I'm not going to read that."

But anyway, no again, it's equally counterintuitive, but trying to reduce WIP to increase flow.

Charles Humble:

Randy, thank you so much. I could chat with you for hours. I'll include links to all of the books and other resources we've mentioned, including "The Principles of Product Development Flow" by Don Reinertsen, in the show notes. And I just wanted to say thank you so much for being the first guest on the first episode of "Hacking the Org" from the WTF is Cloud Native team here at Container Solutions.

Randy Shoup:

Thank you Charles.
