Leadership, WTF Is Cloud Native, strategy, Hacking the Org

Podcast: Toli and Andy Norton from Cinch on Team Topologies, Theory of Constraints, and Serverless

Charles Humble talks to Apostolis Apostolidis (aka Toli) and Andy Norton from Cinch, arguably one of the UK's most successful start-ups. They discuss: how the company got started; how Team Topologies, the Spotify Model, and the Theory of Constraints influenced how the organisation was designed; building a learning organisation; migrating from Kubernetes to Serverless; and observability and SRE practices at the firm.

 

Subscribe: Amazon Music | Apple Podcasts | Google Podcasts | Spotify

 

About the interviewees

Apostolis Apostolidis (Toli) is Head of Engineering Practice at Cinch, arguably one of the most successful startups in UK history. Having been at Cinch since the early days of its startup and scale-up journey, Toli has helped establish DevOps, observability, and event-driven socio-technical practices across teams as they built on an entirely serverless platform on AWS. At Cinch, Toli really "saw" Mr Conway in action.

Toli draws on more than a decade of experience as a software engineer across many software paradigms, stacks, and industries. His view of software engineering as a socio-technical practice has been influenced by his academic background in Mathematics and by playing basketball semi-professionally for a number of years.

Andy Norton started his career as a software engineer in the mid-2000s, and over the last few years has been a Technical Architect, Engineering Manager and Head of Engineering at organisations big and small.

At Cinch he was the Head of Engineering Practice. He has a keen focus on how people, communities of practice and lean agile ways of working can help us to build better outcomes.

Resources mentioned

CFP - WTF is SRE 2023

Theory of Constraints
Team Topologies
The Spotify Model

AWS Lambda

Full Transcript

Introductions

Charles Humble:

Hello and welcome to the ninth episode of "Hacking the Org", the podcast from the WTF is Cloud Native team here at Container Solutions. With this podcast, we bring together some of the most experienced software engineering leaders and talk to them about their experiences, covering topics including building and leading high-performing software engineering teams.

I'm Charles Humble, Container Solutions editor-in-chief.

Before we get into the podcast, I would be remiss not to mention that our conference WTF is SRE is back for a third year and will be held in person for the first time. The conference takes place in London here in the UK, May the 4th to the 5th, 2023, with four tracks covering observability, DevSecOps, reliability and, for the first time this year, DevEx. Our call for papers is now open, so if you fancy speaking, point your browser of choice to www.cloud-native-sre.wtf/cfp where you can submit a talk.

If you didn't manage to get that URL, I will put it in the show notes or you can find it by Googling WTF is SRE.

I'm on the content committee for this, so if you do submit a talk, I look forward to reviewing it. And I'd like to make a special plea: if you're from an underrepresented group in technology, I'd particularly love you to submit.

For this episode of the podcast, I'm joined by Apostolis Apostolidis, known as Toli for short, and Andy Norton from Cinch. Toli is head of engineering practice at the firm, which is arguably one of the most successful startups in UK history. Having been at the firm since the early days of their startup and now scale-up journey, Toli has established DevOps, observability and event-driven socio-technical practices across teams as they built on an entirely serverless platform on AWS. Andy Norton started his career as a software engineer in the mid-2000s and over the last few years has been a technical architect, an engineering manager, and head of engineering at organisations large and small.

At Cinch, he was head of engineering practice, and he's just left the firm to join Prolific. He has a keen focus on how people, communities of practice and lean, agile ways of working can help us to build better outcomes. It's such a great story, this. I'm thrilled to have both of you on the show. Thanks for coming.

Apostolis Apostolidis:

Thanks for having us.

How did Cinch get going in the first place?

Charles Humble:

Toli, maybe you could start by telling us a little bit about how the company came into being. How did Cinch get going in the first place?

Apostolis Apostolidis:

Well, it all started in 2019. We're owned by an automotive group that had the idea of bringing a solution to consumers. Until then, their primary focus was B2B, so interaction with car dealers more than anything. So the idea started in 2019, and we started building out a team, or rather a company, at the end of 2019. The offering at the time was: how can we connect dealers to the public, so the public can easily, in a nice digital way, find cars from various dealerships? Around 2020, at the start of the first COVID lockdowns, we decided to pivot to selling cars directly to the public, cars that Cinch itself owned, so we pivoted to building a platform that would enable that.

Did the nature of what you were trying to do influence how the company was designed?

Charles Humble:

So I have to admit that when I first heard about Cinch, it seemed to me a bit of a mad idea to be honest. I did find myself thinking, "Will people really buy second hand cars on the internet? That seems kind of crazy." And I wonder if that influenced how you thought about the design of the company at all, whether for example you had a particularly strong need for experimentation or a product mindset or something like that given what you were trying to do. Andy, do you want to come in on that?

Andy Norton:

I think the most important thing was the trust element. Like Toli mentioned, doing the whole thing online, there's no kicking tyres, you can't see the thing until it's on your driveway. That is quite a unique proposition for something that's so expensive. If you're buying a house, you go and look around it. You don't buy a house without seeing it. It wasn't so much experimentation, I think it was more really understanding the customer journey and really understanding the jobs to be done. Why would a customer want to buy a car online? Who's crazy enough to want to do that? What are the points in that journey that would put people off and maybe turn them away? What are the points that actually scare people in that journey? Because, as you mentioned, you're not going to go around the showroom for hours on end and sit in all the cars, so what unmet needs do customers have that we need to figure out? "Okay, actually we need to go and make sure that they're happy to do this. They're going to part with quite a lot of money."

I'd say probably the customer journey stuff and that the trust is probably the bit that I picked up on quite early on.

Apostolis Apostolidis:

And I would add to what Andy said: we weren't the only ones doing it. It was a bit of a trend in the States with companies like Carvana. But actually, one of the things we wanted to establish early on was optimising for learning. So what can we learn? What are customers doing? And what are their unmet needs? What troubles are customers having when they go to physical dealers, or even online dealers? And where are the gaps? And by optimising for learning, we discovered some things that we wouldn't have hypothesised at the start. A great example is that buying a car online is not buying a book. You typically can't just use one single credit card to buy a car outright. So we had to find innovative solutions to allow customers to easily buy a car outright. An example of that is splitting your payment across cards. If you go and ask your bank to raise your daily limits just to buy a car, you'll probably struggle a bit.

You'll have to be on the phone for a long time, whereas we offered the option of splitting the payment of, say, 10k between two or three or four cards, so you start to discover those types of solutions.

How do you teach people to be curious?

Charles Humble:

I think there's a really interesting thread here, and I'm wondering if this resonates with you: how do you teach people to be curious? Is that something you thought about at Cinch?

Apostolis Apostolidis:

To establish a culture of curiosity and learning, everything starts with recruitment. For example, when we were hiring engineers, we didn't care what tech stack you'd worked in or what language you were used to; we wanted you to want to solve problems and want to collaborate. One of the big things we had very early on is that it's not about engineers with a headset writing code while other people in your cross-functional team work completely separately. We want people to work together. So collaboration was a big thing. So I feel like it fell into place quite naturally to be curious when you're told to work with a UX designer or a product owner. They have a different set of questions to what a set of devs would typically have in the traditional manner, so it starts becoming more about the product.

There are various techniques that you can apply. One interesting technique from the very, very early days was that one of the UX leads at the time asked the engineers, the entire team, to go out into the streets of Manchester and ask people questions about how they buy cars. And even though none of that research output was useful, it sparked an interest: you should be thinking about talking to customers, you should be thinking about the world as your library. How can you learn? How can you understand problems and then think about solutions?

Andy Norton:

Yeah, I think there's a point in there as well around the type of engineers that we were trying to recruit. Toli mentioned product engineering is the name of the wider group, but we want people that are really curious and want to solve customer problems. They care about the customers. They go and see the thing being done. Very early on, in some of the early discussions I had with people at Cinch, it was, "What problem is that solving?" And I wasn't used to hearing that from engineers as much. They really want to use data to decide whether this is the right thing to do, and yeah, go and see the thing being done. So in my second week I went to one of the sites that refurbishes the cars, and I was told to take a pen and paper and ask lots of questions, and we did, because it's not enough to just understand some things that may be in Miro or in Confluence; you have to actually go and see the whole process and speak to the people that do this job.

It's part of the value stream that you're working in. Go and ask the questions and understand where the pinch points are in this whole process. So very early on it was: be curious, go and find out what's going on, and don't take things for granted. If you can give people the opportunity, give them a pen and paper and say, "Go and have a look," it really does help, I think.

Apostolis Apostolidis:

Just to add to that, around 2020 when we were hiring and when we were talking about our engineering practices and how we're going to be thinking about software in terms of DevOps and building and shipping and supporting, we were asking candidates in interviews, "Okay, you built your software, how do you know it's going to be working in production?"

Charles Humble:

Yes.

Apostolis Apostolidis:

And we asked that question over and over again, to the point where we were able to establish observability as a core practice, like testability, like TDD and all the rest, because that's where you get the feedback. That's where you start becoming curious, because you start asking the questions that you didn't know you wanted to ask before, but you have the capability to do it if you've built it into the process. And I've always seen testability and observability as equivalent practices; it's just that observability shifts the attention right, to production, to the things that are happening in our live systems.

Were there specific books or articles or models that influenced how you thought about the design of the company itself?

Charles Humble:

Were there specific books or articles or models that influenced how you thought about the design of the company itself?

Apostolis Apostolidis:

Yeah, we naturally fell into the Spotify model language, so our first team structures were squads, cross-functional teams, and we did call them squads. The characteristics of those squads were very much a tech lead and product owner as a dual-leadership thing, plus software engineers and whatever other roles the squad needed to be autonomous. So the Spotify model was an influence from that perspective. Team Topologies has been quite a big influence from very early on, because we started to think about what these teams are, which is stream-aligned, but as we grew the company, we started thinking about other team structures and the collaboration modes, and thinking in terms of cognitive load.

And a big one for us, and I think Andy's touched on this a bit, is that the Theory of Constraints has influenced our thinking a lot. Rather than defaulting to standard practices like, for example, having multiple environments in which we can test our software before we go to production, one of the things we decided was that we wouldn't add an extra environment beyond production until we actually needed it. So we took a very lean approach and thought about our value stream towards production almost every day. It wasn't something that we did as a one-off; it was a practice. We asked, what is getting in the way of us delivering value? And we tried not to add more constraints than we need.

Andy Norton:

I think that fits in as well with the Theory of Constraints, in that we actually went and saw the process. If you've read the book, the constraint in the process there is that you have a piece of machinery which needs paint applied and then has to be cooked, and while that's happening, the step before it has to wait and the steps after it are starved. We literally saw that problem in some areas. So it wasn't just the technical side: you could see the thing happening in the real world and you could ask questions. If a car has to wait for that to happen, then you have to wait, and that creates a bottleneck. That thinking's gone a long way and continues now as a source of asking: what's the constraint in the process? And that constraint might be people, it might be technology. But yeah, I think that has had a big impact on us, as well as Team Topologies and the Spotify model.

Did you do anything in terms of deliberately introducing constraints at the beginning as a way of influencing how the organisation would develop?

Charles Humble:

Did you do anything in terms of deliberately introducing constraints at the beginning as a way of influencing how the organisation would develop?

Andy Norton:

I think one of the constraints that has been introduced, and that's been really helpful, is limiting the decisions that people have to actually care about. Technically speaking, having a paved path. We were saying, well, these are the things that we want to adopt and we want to optimise for. So for example, serverless, NoSQL databases, event-driven architecture: these are the things that we want to do. There are other opportunities and other things you can do, but we want to focus our attention so you don't have to make all of these decisions. You can almost take technical choices off the shelf and run with them. Some of these are solved problems as well. So there are quite a lot of things around the paved path which I think have been really useful for us. And some of the third parties we use as well. I think we said, "Well, we'll just use this. We're not going to build this, we're going to buy it. Use that as a source of truth for this thing."

That's really helped people to not over-engineer things and think that they have to build these things. So I say constraints, but they're also heuristics around the approach to software, and they have been really useful.
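As an illustration of the kind of paved path Andy describes, a team-facing template might look something like the following AWS CDK sketch in TypeScript. This is a hypothetical example, not Cinch's actual template; the construct names, the event bus name and the "OrderCompleted" event type are assumptions.

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';

// A "paved path" service stack: one Lambda handler, one NoSQL table,
// one EventBridge subscription. Teams take these defaults off the shelf
// instead of making every infrastructure decision themselves.
export class PavedPathServiceStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Serverless NoSQL storage with sensible defaults (on-demand billing).
    const table = new dynamodb.Table(this, 'ServiceTable', {
      partitionKey: { name: 'pk', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
    });

    // The team's handler code; bundling is handled for them.
    const handler = new NodejsFunction(this, 'Handler', {
      entry: 'src/handler.ts',
    });
    table.grantReadWriteData(handler);

    // Subscribe to business events on a shared bus rather than exposing an API.
    const bus = events.EventBus.fromEventBusName(this, 'Bus', 'shared-events'); // assumed name
    new events.Rule(this, 'OrderCompletedRule', {
      eventBus: bus,
      eventPattern: { detailType: ['OrderCompleted'] }, // assumed event type
      targets: [new targets.LambdaFunction(handler)],
    });
  }
}
```

The point is the defaults: a new team gets storage, compute and event wiring without having to debate any of those choices.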

Apostolis Apostolidis:

I think it's funny: early on, the first iteration of Cinch was built by an agency and it did the bare minimum. The first thing that we launched, the website, used a lot of really good practices: everything was in Azure, a lot of Infrastructure as Code, a lot of containers, interestingly. But when we started building the team at the end of 2019, we looked at the architecture that had been built and we could see Conway's law in action. There was a front end that was very different to the back end, and there was infrastructure that was very different to the other two, and you could see that reflected in the teams who had built the software. They were all really good teams, really, really advanced. They did an excellent job. But we recognised that it wouldn't scale for us. We wouldn't want to have a separate front-end team, separate back-end team and separate infrastructure team, because we'd be introducing bottlenecks at each stage.

So the first iteration of our design was, I think, four cross-functional teams aligned to domains. One would look after searching cars, another would look after inventory, and so on, and each team would own the entire stack, if you like: front end, back end and infrastructure. We looked at that and we were happy with it, and then we looked at what we had and thought, "Well, what's holding us back? What's going to reduce our cognitive load, and what's going to free us to be curious about what product to build rather than about knotty technical decisions?" And one of those things was Kubernetes. It was all hosted in Azure, and there was a lot of knowledge needed about AKS, the hosted Kubernetes solution in Azure. We would potentially have had to build a team around it, because early on we also brought the website down because we didn't know enough about upgrading Kubernetes.

So the decision we made was quite radical, and credit to the directors at the time who were able to persuade the business that the right decision was to move from a container-based solution to an entirely serverless solution. Despite the fact that we had a code base that was working in production, we decided that for our new B2C solution we'd do everything in serverless, so that we reduced our cognitive load in terms of infrastructure. We don't have to have a team around it. We can just have stream-aligned teams, as Team Topologies calls them. Looking at the architecture, we sensed Conway's law and we adjusted our architectural approach, our golden path if you like, and we moved to a completely different cloud, from Azure to AWS. So there's a real separation.

How many of the Team Topologies teams do you have at Cinch?

Charles Humble:

I really want to get into that story because it's really interesting. But before we do, I just want to sort of finish off the conversation about Team Topologies. So Team Topologies talks about four types of teams. It has stream aligned teams, enabling teams, complicated subsystem teams and platform teams. How many of those do you actually have at Cinch, would you say?

Apostolis Apostolidis:

At the start, about two and a half years ago, we started very lean. Cross-functional teams were essentially stream-aligned teams aligned to a domain; they owned the value stream. But then as we grew, we realised quite quickly that we needed enabling teams. I think they came first: we needed a set of principal engineers, for example, and you could count a set of delivery managers or engineering managers as enabling teams as well. Then we started building out platform teams a few weeks or months later, around the infrastructure, around things like the design system, and around things a bit more domain-specific, like orders. Platform teams I would say are probably the hardest to think about and the hardest to implement, and their goal is to optimise for cognitive load, to help the stream-aligned teams deliver value quicker.

So I think the answer is that we started really lean and we ended up incorporating almost all the team types apart from complicated-subsystem teams, although you could argue a couple of ours are, but they tend to be more platform. I think the interesting point with Team Topologies is the collaboration modes between teams. That's where you see the real added cognitive load: where teams need to collaborate quite closely, or whether they're X-as-a-Service. The three interaction modes are X-as-a-Service, collaboration, and facilitating, and you could sense them, you could see them, and they informed decisions about whether a new team should be created or whether two teams should merge, things like that.

Charles Humble:

That's very interesting. So just bringing all of that together. So you've got cross-functional teams which are effectively squads I guess in Spotify parlance, right?

Apostolis Apostolidis:

Yes. Yeah, that's right.

Charles Humble:

You've got a very product-engineering-centric organisation, by the sounds of it, from the get-go.

Apostolis Apostolidis:

Yep. Yeah, yeah.

Charles Humble:

Is that a sort of reasonable sum up of where we are?

Apostolis Apostolidis:

Yeah.

What drove the decision to migrate from Kubernetes to Serverless?

Charles Humble:

And then you said you've made this transition from the agency-built site, which was on containers and running on Azure in the cloud, I believe.

Apostolis Apostolidis:

Yep.

Charles Humble:

You've gone into some of the reasons for that. The stack itself, so the stack you moved away from sounds pretty reasonable. So I mean AKS, Azure. I believe it was TypeScript and .Net Core as well, if I remember rightly. Is that-

Apostolis Apostolidis:

And C#.

Charles Humble:

So it's not like it's a great big legacy thing. So it's really interesting what actually drove that move. Was that related to the pivot you were making as a company? Because I think that happened broadly about the same time, didn't it?

Apostolis Apostolidis:

Yeah, absolutely. The trigger was the pivot. We knew we needed new capabilities, we knew we needed a new platform, and we were aiming to launch within six months. We were having the discussion in March 2020, and we wanted to launch in September 2020, and we thought: what's getting in the way of actually delivering the platform and the website, but also of accelerating after that? I think that was the big thing.

So we wanted to be in a position where we could actually evolve the system based on what we were learning. And that's when we landed on TypeScript, serverless and event-driven architectures. I think event-driven architecture was a very important aspect, almost a side effect of our decision: we thought about going serverless, then we thought about TypeScript, and then we thought, well, serverless lends itself quite neatly to event-driven architectures, and that offers low coupling between teams. Their integration points could be EventBridge and serverless, and that's it.

They don't have heavily coupled APIs between them. And that allowed us to optimise for economies of flow, so teams could autonomously have a flow of changes and a flow of discovery that they couldn't have otherwise.
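To make the "integration points could be EventBridge" idea concrete, publishing a domain event from one team's service might look like the sketch below. The event shape, bus name and source are illustrative assumptions, not Cinch's actual contracts.

```typescript
import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';

// The contract between teams is just the event's shape: a signature
// contract, not a performance contract. This shape is hypothetical.
interface VehicleUpdated {
  vehicleId: string;
  price: number;
  updatedAt: string; // ISO 8601 timestamp
}

const client = new EventBridgeClient({});

// Publish the event; consuming teams subscribe asynchronously via rules,
// so the publisher never knows or cares who is listening.
export async function publishVehicleUpdated(detail: VehicleUpdated): Promise<void> {
  await client.send(new PutEventsCommand({
    Entries: [{
      EventBusName: 'shared-events',   // assumed bus name
      Source: 'inventory',             // the owning team's domain
      DetailType: 'VehicleUpdated',
      Detail: JSON.stringify(detail),
    }],
  }));
}
```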

Andy Norton:

I think as well there's the optionality: being serverless, having these components that are decoupled through event-driven architecture. As we learn more about the domain and the used-car world, it's easier to add additional capabilities on. It's inherently easier when, like Toli mentioned, you already have these events that are based in the business domain. A car's been bought, finance has been raised, finance has been closed. If you then have add-ons to that, additional revenue streams and things like that, you're hooking into existing processes or existing events, but you're not impacting the teams around you as much. So you can add new things without going, "Okay, now everybody has to change all their stuff because we want to do this new thing." So it meant that teams were able to experiment a lot more as well, right? They're able to go away and do their thing and actually do some stuff in production, test some new products out, and the blast radius of that was quite small. The risk was quite small because you're decoupled by nature, to be honest.

Apostolis Apostolidis:

And I think the big aspect of this is the self-serve aspect of serverless. It's funny, it's almost too quick how quickly you can get into production with serverless. You're not waiting on any team to build out clusters or anything you need. I think both models would ultimately work; it would just mean that we would have to have a team owning the platform, whereas we kind of outsourced that team to AWS, and we were happy with that for our use case.

Does Lambda offer any particular advantages to you in terms of your particular workloads?

Charles Humble:

So does Lambda offer any particular advantages to you in terms of your particular workloads?

Apostolis Apostolidis:

We're an e-commerce website; we sell cars. Our workloads are not too intensive, so whatever we do doesn't need loads of time and loads of power. What we do need is elasticity, so we want to know that we don't pay for things when we're not using them. I don't know if you've seen the ads for Cinch on TV, but we have loads of spikes because of the TV adverts or because of events that we're sponsoring, and the elasticity that it provides us is amazing.

The first day of our launch, we had a TV advert for the first time, and the website actually got faster because of the higher traffic, with minimal cost. I think that's a big advantage: there's the cost aspect, and we don't need to do anything for elasticity at all.

Did you hit any specific problems with Lambda in terms of things like cold startup time or dwell time?

Charles Humble:

Did you hit any specific problems with Lambda in terms of things like cold startup time maybe or dwell time if you've got long function chains, those sort of problems?

Apostolis Apostolidis:

I think cold starts were okay thanks to TypeScript; we didn't really have an issue with that. I think it's more the complexity, because it's a new architectural paradigm. You have to think in different ways, and our initial solutions had various different paradigms blended together, because we were hiring for this kind of setup, this paradigm of event-driven and serverless, but we were hiring people with more traditional backgrounds, if you like: more object-oriented, or thinking more in terms of CRUD. With event-driven you have to take on a bit of a different paradigm, and that was a learning curve.

You presumably had to scale up very, very rapidly?

Charles Humble:

And presumably you had to get up to speed really quite quickly as well. Because as I remember it, after that pivot happened, it wasn't that long before there was a huge nationwide marketing blitz with TV ads, you sponsored the England cricket team, which was actually where I first heard about you. So presumably you had to scale up very, very rapidly.

Apostolis Apostolidis:

Yeah, we had a day between our launch and our advert on TV. We had to scale by doing, and I think that was the beauty of all of it, and that's why the culture was really important, because there was a sense of, "We're all in this together." And I think what event-driven architecture offers you is that teams were not talking in terms of APIs, they were talking in terms of events: "What event are you publishing and what can I get out of it?" And they consume that asynchronously, so the contract between teams is not a performance contract, it's just a signature contract. So as long as you commit that your event will contain this information, I can figure everything else out, whereas if teams were talking via APIs, then there would have to be an inherent contract about availability, it's going to be generally up, for example, or we're going to give you a response within 200 milliseconds. Those kinds of conversations went away, and it was all about the data within the event and the nature of the event.

Things like, "Vehicle updated," "Order completed," things like that was the language that you were hearing between teams and then teams within their context. They were learning how to write Lambda function handlers and how to interact with AWS services and that's where the learning curve is. It's fascinating what happens when you put very smart people together to solve a problem and empower them and enable them to do what they need to do. The achievement is pretty fascinating.

Charles Humble:

Andy, do you want to come in on this as well?

Andy Norton:

One of the things that was really interesting is that there is a learning curve around serverless, but the way that we tried to approach it was to create some examples of what we mean. You can talk about observability, but what's our take on that? You can talk about services, but what's our take on that? What do we mean? So we built some examples of how we see the services working. The engineers built a fake version of Cinch called Cinch Brew, which is about ordering coffee, funnily enough, but it fit all of the different parts that we cared about. It brought together the observability practices, the way we want to use events, and how we wanted to deploy code, in something that fits in your head and that you can run on your own machine. It's something that, when you join, lets you see an example of how this all fits together without having to learn all about Cinch's core architecture in one go.

So I think there's a really good opportunity sometimes to create something as an exemplar, put it out there, get feedback and grow it over time. Having that really helped, especially for people coming from maybe a .NET and C# background, to see the way you need to change how you architect your code, because there's a lot less you have to care about, and how to chain events together, how to use something like EventBridge in AWS, and what it means to add different events onto that. So those examples really help with that learning curve, especially when people haven't been exposed to some of this before.

Apostolis Apostolidis:

I think, given the audience of this podcast, it's interesting. So we have this Cinch Brew example, which is about observability, but we also have a backend template, which is a starter project for people to start what you would term microservices: to make it really easy, lower the threshold, have some sensible defaults and go with that. What's interesting is who creates these things and who owns them, and that's where the things we've talked about, Team Topologies and the Spotify model, are a bit vague. These two things probably had different reasons to start, and different people started them, but ultimately the vehicles for creation and ownership were working groups. Very early on we had various things that we had to do across teams and no one to do them. We had no dedicated enabling teams; we weren't paying anyone to sit there and create things.

So we had to figure out solutions, and one of those was working groups. We kept, again, the collaborative angle that we had from the get-go: get a lot of people from different teams together, figure out a solution and hand it back to the teams. One of those solutions was this backend template, which is used extensively at Cinch right now with no explicit owner, and the Cinch Brew observability exemplar was very similar. And the big trade-off is what happens with these things after the working groups finish, because the working groups are just time-limited groups of people, and once the job is done they go back to their teams. What happens long-term is probably a question that we're still dealing with, but as a mid-term solution we have a lot of communities, communities of interest and communities of practice, that take ownership of some of these things and use them as artefacts to make sense of things.

So our observability community would own Cinch Brew, and I use the word "own" loosely. Our backend community would own the templates, again loosely, and there would be some sense of evolving these things that are used across the entirety of the engineering community at Cinch.

Andy Norton:

I think the communities of practice, and the communities of interest, have been really useful for us, because we've been able to identify the core capabilities that we think are important to engineers, build a community around each, and give them time and space to get better at it. Observability is a really good one, but also those things that might be an individual role in some organisations: we don't have QAs, but we have people that are really passionate about quality assurance and testing. So that is most likely going to be a community, a group of people who care about a thing and want to get better at it together. Especially because, you mentioned the Spotify model, I think our teams naturally fell into the pattern of being quite tribal, and if you've got five tribes, you have five groups of people who optimise for the group that they're in, but at the expense of the other groups. You lose the learning and the experimentation: what they're doing in one group, other teams don't really find out about. So the communities...

Again, it's a pun on vehicles at Cinch, but they're a bit of a vehicle for learning. They actually connect the dots across five tribes by having people from each of those tribes come along, bring their experience and the things that they're learning in their teams, and codify it into our overall view of what we mean by this technology, this capability we want to develop.

Charles Humble:

You've mentioned observability in passing a few times, and I'd like to dig into that a little bit more. I wonder whether you think that with serverless it's a little bit different in terms of how you think about building observability and monitoring in, in that the nature of the abstraction maybe makes you think a bit more about business processes and a bit less about individual services and boxes and wires and so on. Has that been your experience? Do you use custom metrics in Lambda to capture things like, I don't know, "order complete", for example?

Apostolis Apostolidis:

Oh, absolutely. I'm smiling here; it's something I'm really passionate about. From day one, we wanted to be able to observe our systems. I heard a really good term at one of the conferences this year: know what the health of your business transactions looks like. And what serverless does is shift the cognitive load from infrastructure to observable systems, shifting right to looking at our systems. With Lambda we were able to use metrics, and we use tracing quite a bit, to know both the "what", so how many orders are completed today, but also the "why": why is something going wrong? Something is weird about this, what's going on? We're able to answer those questions, and things have happened in the last two and a half years that I never expected when we first started out, which is testament to what happens when you have space to do things and don't get bogged down by technology.

And one of those is that teams actually know at any point in time what's going on with their systems. They can quickly find out if they have any sort of trigger. And when I say teams, I mean engineers here; engineers at Cinch are talking about observable systems, talking about, "How can I make this observable?" There are some silos in terms of analytics or decision intelligence, but they still get that benefit from our observability tool. We have a bit of a decoupling between our operational platform, which is AWS, and our observability platform, which is Datadog. So you go to Datadog to understand the system, and you go to AWS or your infrastructure as code to change the system.
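As a sketch of what a business-level custom metric like "order completed" might look like in a Lambda handler instrumented for Datadog, using the datadog-lambda-js library. The metric name, tag and event shape are illustrative assumptions, not Cinch's actual instrumentation.

```typescript
import { datadog, sendDistributionMetric } from 'datadog-lambda-js';

interface OrderCompletedDetail {
  orderId: string;
  paymentMethod: string;
}

// Completing an order also emits a business-level metric, so a dashboard
// can answer "are customers able to order a car?" directly.
async function orderCompleted(event: { detail: OrderCompletedDetail }): Promise<void> {
  // ...domain logic for completing the order would run here...

  // One data point per completed order; tags let you slice into the "why".
  sendDistributionMetric(
    'orders.completed',                              // hypothetical metric name
    1,
    `payment_method:${event.detail.paymentMethod}`,  // hypothetical tag
  );
}

// Wrapping the handler adds tracing of the invocation.
export const handler = datadog(orderCompleted);
```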

Andy Norton:

I think a big part of the learning around observability is that we've got into this culture now of having really human-readable dashboards. We don't want lots and lots of things that blind people with science; teams maintain these really nice-to-look-at dashboards that answer questions: are customers able to order a car? Are customers able to search? And within the dashboard there are the answers to those questions. You can see all of that without having to dive into anything more; we've abstracted the business process into a dashboard, and you can see if something's going wrong. Teams use that and it's become their home: they've decided to add links to what the team works on and where they do their stand-ups, all this information, alongside the systems that they work on, which has been really interesting to see, and that's evolved quite a lot over the last, I'd say, 12 months.

And I think a lot of that is because it's now a group of people who, as a community, are coming together and making it better, and there have been roles over the last year as well that have helped to amplify those ways of working.

Apostolis Apostolidis:

I think what's interesting, again going back to who does all this stuff, is that it doesn't just happen. Just because we're using serverless doesn't mean that people care about observability. That's my core learning point. I sat there shouting at everyone, "But what about observability? Why is no one thinking about observability?" And the reason is that you have to understand the value in order to make software testable, for example, and it's the same with observability. So what we did in terms of org design is we had an embedded DevOps-minded person on each team, which we termed the Automation Engineer. They would be software engineers, but they would be the people who'd go, "Hey, how do we know this is going to be working in production?" Or, "Hey, our pipelines are really slow, we need to improve this." Or, "Hey, we should really have this as infrastructure as code."

So they ask all these questions that remove toil and get us back to the DevOps mindset of flow, feedback, experimentation and learning. Obviously that doesn't scale as you grow; you can't keep having these roles, and they're really hard to hire for as well. So the evolution of that is that we've introduced a staff engineer role, that's what we've called it: people with subject matter expertise. They sit within the product engineering organisation, they work across it, and they work as a team. We've got three of them right now, we're trialling it and working towards the end of the trial, and they are there to answer questions and support teams, enable teams, as an enabling team in Team Topologies speak. They will help with observability. They will help with CI/CD.

They will help with infrastructure, but not by offering solutions in the absence of the teams. They're solving problems that the teams have and offering good engineering practices from their experience. So my core learning is that this is where the socio-technical aspect of things is really important. You can't just throw a tool at people and expect them to use it. They need to learn the value of it and learn the good practices.

Andy Norton:

Yeah, I was going to say, one of the things that's quite interesting is, as Toli said, whose responsibility is it to get better? It's everybody's, but I think there's a bystander effect if you don't have people who are optimising for learning as well. The automation engineer role, which morphed into the staff engineer, is very dedicated: they can help make sure that we don't end up with the bystander effect of everybody expecting someone else to push forward observability and good DevOps practices. We've said, "You're the person that people are going to look to, and you're going to help them get better at it." Otherwise, I think it's difficult to find time for these things, because everyone's so busy with the product work; you need these specific people, I think.

Charles Humble:

So you've mentioned dashboards, and dashboards have, I think, a bit of a poor reputation these days. I remember visiting ops bridges when companies ran their own data centres, and there'd be a dashboard flashing red and you'd say, "Well, does that matter?" And everyone would be like, "Oh no, it always does that. It doesn't matter, it's fine." So no one was paying attention. How do you make your dashboards alive? Do you have particular rituals or ceremonies around them to make sure that people do actually look at them and pay attention to them? How do you make that work?

Apostolis Apostolidis:

Yeah, absolutely. That's a really good shout. I think it doesn't work without rituals; that would be my main thing. We have a lot of teams who look at their dashboards at stand-up, or just after stand-up. For example, we have teams who will do a weekly improvement kata around observability or any other improvement points. So they'll look at their dashboard and go, "Hmm, I don't really trust this thing. Is it telling me what it needs to tell me?" Or, "We've got this new feature, and we don't really have it represented anywhere, and we should." So it's catching those things. It's having that rhythm of improvement and that rhythm of curiosity, which is really, really important. I think the hard bit with observability in general is that there are a few different approaches to observability, which is probably out of scope for this podcast, but ultimately you have to do what works well for you, and I think that's what we try to do.

We try and empower teams to decide what's useful for them to gain an understanding of what's going on in their software system and the bits that they own and ownership is probably an important catalyst for that to succeed.

So if something's going wrong and they all know about it, there are techniques to figure out how important it is: things like Service Level Objectives, which we're delving into, or creating a bit more of a ratio of what's going wrong: what percentage of the traffic, what percentage of users, are having this problem? If it's one user, it's a lot less important than if it's more widespread. Observability and good telemetry data often contextualise these things quite well, and in that light we favour real traffic over synthetic traffic. So it's more about how users are using the software rather than artificial pings to the software, if that makes sense: the traditional health checks. The health checks tell you a lot, if it's down, it's down, but the real traffic will give you the in-betweens.
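The ratio-based view Toli describes can be expressed as a simple error-budget calculation. A sketch, with the 99.9% objective and the counting window as assumed values:

```typescript
// A ratio-based view of "what percentage of users are having this problem":
// counts come from real traffic over some window, not synthetic pings.
interface WindowCounts {
  good: number;  // requests that met the objective
  total: number; // all real requests in the window
}

const SLO_TARGET = 0.999; // hypothetical 99.9% objective

// One failing user among thousands barely moves this ratio; a widespread
// problem moves it sharply, which is exactly the signal you want.
function errorBudgetRemaining({ good, total }: WindowCounts): number {
  if (total === 0) return 1;
  const budget = 1 - SLO_TARGET;  // allowed failure fraction
  const spent = 1 - good / total; // observed failure fraction
  return Math.max(0, (budget - spent) / budget); // 1 = untouched, 0 = exhausted
}

// Example: 99,950 good requests out of 100,000 leaves half the budget.
console.log(errorBudgetRemaining({ good: 99_950, total: 100_000 })); // 0.5
```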

It sounds like you are embracing more SRE-related concepts over time

Charles Humble:

It sounds like you are embracing more SRE-related concepts over time. You've mentioned eliminating toil.

Apostolis Apostolidis:

Absolutely.

Charles Humble:

You've also mentioned SLOs, of course. Do you do other sort of things in that territory? Do you use things like feature flags? Do you do progressive deployments?

Apostolis Apostolidis:

We use feature flags, but we don't use them in any elaborate way. We've rolled our own feature flags at points. We use A/B experimentation at points as well, but I think that's the extent of it. Dark launching, but nothing too elaborate. We've got teams who do trunk-based development, so they trust their tests and trust their observability. So I think it's more about when we want to release something and how a particular experiment is performing. I don't know, is that your experience as well, Andy?

Andy Norton:

Yeah, I'd say so. Because we move very quickly, I think it's not as important to us for some elements. We're able to move quite quickly and we're decoupled anyway.
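Rolling your own feature flags, as Toli mentions, can be as small as a typed lookup with environment overrides. A hypothetical sketch of the general idea, not Cinch's implementation:

```typescript
// A minimal home-rolled feature flag store: code defaults that can be
// overridden per environment at deploy time, no external service needed.
type FlagName = 'split-card-payments' | 'new-search-ui'; // hypothetical flags

const defaults: Record<FlagName, boolean> = {
  'split-card-payments': true,
  'new-search-ui': false, // dark-launched: shipped but switched off
};

export function isEnabled(flag: FlagName): boolean {
  // An environment variable beats the code default, so a flag can be
  // flipped without a code change, e.g. FLAG_NEW_SEARCH_UI=true.
  const envKey = `FLAG_${flag.toUpperCase().replace(/-/g, '_')}`;
  const override = process.env[envKey];
  return override === undefined ? defaults[flag] : override === 'true';
}

// Usage: guard the new code path behind the flag.
if (isEnabled('new-search-ui')) {
  // render the new experience
}
```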

Charles Humble:

We're running very short of time, so I just want to fire in a couple of rapid questions to round us off. The first thing is: do you run a UAT or a staging environment, or do you find you just don't need one?

Apostolis Apostolidis:

So we started out with a dev environment, and then we built out a production environment, and I remember distinctly having a conversation with the engineering director at the time about our, I think we called it, UAT environment, when we weren't going to hit the deadline of launching the platform back in 2020. We had this discussion: what do we want the UAT environment for, and what would its function be? And I remember thinking and deciding that we didn't need one. If we can't name it, we don't need it.

So we went with development and production as our only two environments, dev being, to an extent, our integration environment for the website. And now we're in a place where most teams work with production and ephemeral environments. There's only one non-ephemeral environment for the front end. Because we don't have QAs, it goes to dev as part of your pipeline, almost, and then it goes to prod, so you've got a very short feedback loop, and that's generally worked quite well.

For the backend it works a lot more easily, for our Lambdas and our components, but the front end is a bit more complex because there are a lot of teams owning parts of it.

And how many times are you committing code?

Charles Humble:

And how many times are you committing code? How many times are you deploying code into production per day?

Apostolis Apostolidis:

I find that a fascinating question, because we've been talking about this quite a bit in terms of the industry: how many commits per day do you do? I think we can make as many deployments to production as we want. We don't really have much friction, apart from the front end, which is a bit more contentious.

I think it's more about, there are times where you are doing more discovery work as a team, so you're not doing continuous deployment all the time. But I think ultimately we... I don't know, I can't remember the last time we checked. We probably have 50, 60 changes to production per day.

But ultimately it's not really a problem of deploying. I think our suggestion, and what we're aiming for, is to not deploy big batches of changes. So most of the time we tend to have fairly small batches that can go to production, with a lower risk profile attached to them.

Andy Norton:

I think the observability practices have increased our tolerance of risk. We've got enough things in place to really understand our systems and be able to observe them post-deployment, which has meant we've been able to have less heavy-handed testing and more, "We'll see," because we're pretty confident: we've seen this before, and we've got the things in place that tell us if it's not working, and we'll know very quickly. That has been, I think, quite refreshing, to be honest, without having to have much complexity or too many long-running tests in production. You can do a lot with what you've got when you've got observability in place.

Charles Humble:

It's such an interesting story, this. So thank you so much to both of my guests, Toli and Andy, for taking the time out to talk to me for this episode of "Hacking the Org" from the WTF is Cloud Native team here at Container Solutions.

