Charles Humble talks to Adrian Cockcroft, ex of AWS, Battery Ventures and Netflix. They discuss: memes in computing, serverless first, chaos and ideas around continuous resilience, strategy and Wardley Mapping, using large hardware such as AWS ultra clusters, and sustainable software.
About the interviewee
Adrian Cockcroft has had a long career working at the leading edge of technology. He’s always been fascinated by what comes next, and he writes and speaks extensively on a range of subjects. He joined Amazon as their VP of Cloud Architecture Strategy in 2016, where he recruited and led their open source community engagement team. He was previously a Technology Fellow at Battery Ventures, where he advised the firm and its portfolio companies about technology issues and also assisted with deal sourcing and due diligence. Before joining Battery, Adrian helped lead Netflix’s migration to a large-scale, highly available public-cloud architecture and the open sourcing of the cloud-native NetflixOSS platform. Prior to that at Netflix he managed a team working on personalization algorithms and service-oriented refactoring. Adrian was a founding member of eBay Research Labs, developing advanced mobile applications and even building his own homebrew phone, years before iPhone and Android launched. As a distinguished engineer at Sun Microsystems he wrote the best-selling “Sun Performance and Tuning” book and was chief architect for High Performance Technical Computing. He graduated from The City University, London with a BSc in Applied Physics and Electronics, and was named one of the top leaders in Cloud Computing in 2011 and 2012 by SearchCloudComputing magazine.
- WTF is SRE
- Adrian Cockcroft's architecture trends and topics for 2021
- Trends and Topics for 2022
- Supercomputing Predictions: Custom CPUs, CXL3.0, and Petalith Architectures
- Evolution of business logic from monoliths through microservices, to functions
- Aerin Booth's upcoming course on Cloud Sustainability: LinkedIn Post, Early access link.
- Matthew Clark on the BBC’s Migration from LAMP to the Cloud with AWS Lambda, React and CI/CD
- Toli and Andy Norton from Cinch on Team Topologies, Theory of Constraints, and Serverless
- Team Topologies
- The Value Flywheel Effect
- Engineering a Safer World: Systems Thinking Applied to Safety
Hello and welcome to the 10th episode of "Hacking the Org", the podcast from the WTF is Cloud Native team here at Container Solutions. I'm Charles Humble, Container Solutions Editor-in-chief.
Before we get into the podcast, I would be remiss not to mention that our conference, WTF is SRE, is back for a third year and will be held in person for the first time. The conference takes place in London here in the UK, May the fourth to the fifth of 2023, with four tracks covering observability, DevSecOps, reliability, and for the first time this year DevEx. If you'd like to find out more, point your browser of choice to www.cloud-native-sre.wtf and click "I want to attend". I'll include that link in the show notes.
Now my guest on the podcast today is Adrian Cockcroft. Adrian retired from Amazon and indeed full-time corporate work at the beginning of June of 2022 after a long and frankly glittering career working at the leading edge of technology. He's always been fascinated by what comes next. And he writes and speaks extensively on a whole range of subjects.
He joined Amazon as their VP of cloud architecture strategy in 2016, recruited and led their open source community engagement team and then turned his attention to sustainability.
Before that, he was a technology fellow at Battery Ventures where he advised the firm and its portfolio companies about technology issues and also assisted with deal sourcing and due diligence.
And before joining Battery, Adrian helped lead Netflix's migration to a large scale, highly available public cloud architecture and the open sourcing of the cloud native NetflixOSS platform. Prior to that at Netflix, he managed the team working on personalization algorithms and service oriented refactoring.
He was a founding member of eBay's research labs, developing advanced mobile applications and even building his own home brew phone years before the iPhone and Android launched. And he was a distinguished engineer at Sun Microsystems where he wrote the best selling "Sun Performance and Tuning" book and was chief architect for high performance technical computing.
Adrian, it's brilliant to have you on the show. Welcome.
Thank you. Great to be here and good to see you again.
The meme phenomenon as it applies to technology
Likewise. Really lovely to see you as well. Thank you so much.
So we were talking off mic before we started this recording about the meme phenomenon as it applies to technology. So how something gradually gets a word associated with it and builds up a head of steam. And you said you have a little bit of a theory about that, so that seemed like quite a good place to start. So do you want to talk a little bit about your meme theory?
Yeah, generally the new word comes along and the reaction to it is a mixture of some people saying, "oh, this is a cool new thing that everyone should understand." And then a bunch of people saying, well, "we've always been doing that, why have you misappropriated an old name that whatever." And so you get a bunch of grumpy, get off my lawn reactions.
And I think we're seeing that right now with platform engineering. So I dug into it a little bit to figure out what was going on here. Because when I looked at it, it was 10 years ago we did that, of course we did that. That looks like that's what we did at Netflix. We set up platform teams and we had loads of platforms and things. And the question is, why does it happen at a certain time to become a buzzword that's got currency and you start seeing discussions about it?
And if we go back to microservices, probably a clearer one, I mean, people said, "oh, it's just SOA, it's nothing new. It's service oriented architecture, we've been doing that before." And it was, I mean, I called it fine grain SOA because we were doing smaller chunks and we weren't writing it in SOAP, which was the SOA model that everyone had previously anyway. So the concepts were there.
The thing that changed was that around 2010 or so, when we were doing it, we'd gone to at least gigabit ethernet. We had faster CPUs and we had more efficient on-the-wire protocols. So it was possible to break the larger systems into smaller chunks without having too much overhead. If you tried implementing the fine grain microservice architectures about 10 years before, you would've found that you'd spend all your time in overhead, marshalling SOAP messages and XML, and slow CPUs and slow networks. It just wasn't viable.
So you could do fairly coarse grain, but there was this technology limit that we had to get to where microservices became viable. And the other thing that happened was cloud came along, so creating and updating your own system by just deploying code was trivial to do, API driven. So this combination of it becoming technically feasible and the delivery mechanism becoming much simpler was really what I think drove microservices then.
And then there were a whole bunch of us calling it different things. And there was a conference called MicroExchange in I think 2012, 2013, something like that in Berlin, where a whole lot of us got together and James Lewis said, well, he wrote the blog post, Sam Newman wrote the book. I basically went around telling everyone else that that's what Netflix had done. And a few other people were involved with it.
And we all watched each other giving talks on the subject and decided to synchronise what we were doing and say, okay, we're going to call this the same thing. And we're all going to agree that we are talking with a common basic definition of it.
And then we all spread out. And a few years later it started taking off pretty rapidly. And I think there's this sense of getting a clear identity around it, but it wasn't driven by any one company as a marketing buzzword. It took off more organically. So that was an interesting one.
You saw it again with Cloud Native. And Cloud Native computing was something we were talking about for a while, and then the CNCF came along and said, well, this is Kubernetes, and CNCF became a synonym because it was almost like a marketing brand around Kubernetes. You wanted cloud native in your data centre, that was how you did it. So that was another one that took off that way.
More recently with platform engineering, I was trying to understand why is it happening now? And I think it's because of Team Topologies which came out a year or two ago. And one of the key teams is the platform team. So people started organising or at least thinking about the way they're organising their teams. And there's this definition of what is a platform team.
And I think that is underlying why now. It's almost like there's a theoretical basis behind having a platform team rather than just being a name for a thing that people were doing. And it's a little bit more defined. So I'm happy to see it and I'm trying to not be a grumpy old get off my lawn person for this one because I think it is a useful concept and I'd like to see it done more.
What do you think of Team Topologies as a book?
You mentioned Team Topologies briefly there. So what do you think of Team Topologies as a book?
There's some really good ideas, and one of the ones that, not just like how you arrange your teams, but the idea of curating a team. A team is a group of people that over time gel together to work effectively as a team. And then you can give different work to that team and they're adaptable in what they do. But you don't just blow it up and reassign the people to different things. There's a sense of working together as a team. That idea I think is a really cool one. That management just do random reorgs and trash things too easily without realising that what they're doing is they're breaking the ability to be productive for the people in the teams.
So reorganising the work and moving teams around is different to laying out everything and splitting teams up. It takes a long time to recover from actually shuffling all the people across teams.
A bunch of thoughts there, but I think there's some interesting things. And the other thing about this is there's always a meme going along because somebody's investing marketing in it, and the conferences are looking for what's the latest thing that's going to get somebody interested in a conference. So all the program committees get together and decide on this is the theme for the year, and we're going to be talking about platform is one of those new things. So you'll tend to see it in lots of conferences through the year and then there's probably a new one next year. So there's a bunch of reinforcing conference driven development stuff going on as well.
Right. Yes. And having been on the other end of that process quite a few times, the other thing that happens is if you want to go and speak at a conference, you think, well, what topic is likely to get picked up? Oh, everyone's talking about platform engineering, so I'll do a platform engineering talk. In the same way that a few years ago everyone was doing microservices talks. So it becomes self-referential, self-feeding, if you will, both from speakers and conference organisers.
I've been talking about platform engineering as part of some of my architecture talks. I've got to write a blog post at least saying how people should do platforms.
Do you see serverless as a natural evolution from microservices?
That would be good. I look forward to it.
So you gave a talk at re:Invent in 2020 called rather brilliantly something like "Adrian Cockcroft's Topics and Trends for 2021", which was such a clever title because it meant you could actually find it.
That's exactly why I do that.
One of the really annoying things with re:Invent is that they don't list the authors. You can't search by author. So there's all these cool people doing talks at re:Invent, and no one had any clue I was doing a talk at re:Invent. So I decided I was going to do a talk where my name was in the title and it somehow got through the system. No one stopped me doing it. And then it was the first time everyone could actually find that I'd done a talk at re:Invent. So it's hacking the system.
Always good. And actually it turned out that that was a very Google-able title as well. So when I was doing the research for this podcast, I wanted to go and watch the talk again and it was very easy to find.
So in that talk and indeed the follow on blog post that you did the next year updating it a bit, you covered basically five topics, so serverless first, chaos and resilience, Wardley mapping, large memory systems and sustainability. And I'm going to use that as the scaffolding for the conversation. We'll basically try and hit as many of those topics as we can in the time.
With serverless, I have followed this for a while actually, and interviewed quite a few people including Matthew Clark at the BBC who was head of architecture for the BBC's digital products. This was when I was chief editor at InfoQ, so a few years ago now.
And in that podcast they were migrating basically from LAMP to the cloud with AWS Lambda and React. It was a good case study. And then last month on this show I spoke to Andy Norton and Apostolis Apostolidis from Cinch. Again, really interesting case study for serverless because they started with managed Kubernetes and React and TypeScript and C# all running on Azure. So basically it was already a very modern stack. And they made this decision to migrate away from that to AWS Lambda. It was part of a business pivot.
And I think my understanding from talking to them was it was mostly driven by just a need to be able to move even faster. Which of course is, you know, it's why we did microservices in the first place was so we could innovate more quickly. Can you unpack this a bit for us? Do you see serverless as a natural evolution from microservices?
Yeah, actually on my Medium blog, there's a post called The Evolution from Monoliths to Microservices to Functions. And it's the most highly trafficked thing I have on my blog. I forget, 55,000 views or something like that. Where I set out this thought originally.
But yes, there is some element that it's an evolution and there's some places where it really works and there's other places where it doesn't work as well. But generally the places where it doesn't work are being chipped away by AWS and the other vendors over time.
So I did a talk called Serverless First, and that's the attitude. The other interesting book right now is "The Value Flywheel Effect" by David Anderson. He might be good to have on the podcast sometime. I just recorded a fireside chat with him and the other authors for IT Revolution a week or two ago. So that's up on YouTube.
But they basically were trying to go faster and ended up going just ridiculously faster. This is Liberty Mutual, which you wouldn't think would be a good example, but they are one of the fastest moving IT organizations I've ever seen because they really got deeply into serverless and Wardley mapping and a whole bunch of other things. So there's some cool stuff there.
So in some sense, you can use serverless first to do anything you want to do because you can build it in a few days or a few hours. There's a bunch of examples that they have where a team that had never used their technology stack before, went on a training course for half a day, two days later, they'd shipped a completely new service that they'd built from scratch. And they have stories like that over and over again. Once you're up to speed, you can do stuff in a few hours.
Most people can't conceive of having a completely new idea for a completely new service with a completely new endpoint that does something and getting it out later that day, it doesn't make sense. No one can do that.
But it's possible because you're just writing the core business logic and then getting it out. With all of the other stuff templated and wrapped and architected around it using things like CDK, the Cloud Development Kit, to wrap up all of the best practices. So that's one end of it.
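As a rough illustration of "just writing the core business logic", here is a minimal sketch of a Python function in the shape AWS Lambda expects (a handler that takes `event` and `context`). The pricing rule, the field names `base_price` and `quantity`, and the API Gateway-style response are all hypothetical, and the templating and deployment that CDK would provide are left out.

```python
import json

def quote_price(base_price: float, quantity: int) -> float:
    """Hypothetical business rule: 10% discount on orders of 10 or more."""
    total = base_price * quantity
    return round(total * 0.9, 2) if quantity >= 10 else round(total, 2)

def handler(event, context):
    """Lambda-style entry point: parse the request, run the logic, respond."""
    body = json.loads(event.get("body") or "{}")
    try:
        total = quote_price(float(body["base_price"]), int(body["quantity"]))
    except (KeyError, ValueError, TypeError):
        # Missing or malformed fields are the caller's problem, not a crash.
        return {"statusCode": 400, "body": json.dumps({"error": "bad request"})}
    return {"statusCode": 200, "body": json.dumps({"total": total})}
```

Everything outside `quote_price` is boilerplate that a template can own, which is the point being made: the team only has to write and test the one function that carries the business rule.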
But there's still things that you build it that way and you go, it isn't really working for some reason. And the reasons it doesn't work are usually because you need it to do something that had a lot of persistence in it. So you need to set up some new database or data store so that you'd run that somewhere else. Or you'd need to get some specialised data store that isn't available as a service.
Or you need an instance type, you need GPUs or huge memory or something, you need a special instance and it's going to be long-lived. And then the other case is mostly around efficiency and things like that.
But if you've got a very constant workload, the traffic's coming in all the time, it's just always pumping in, like if you're doing ingestion for a monitoring service, for example, you know how many customers you've got, you know how many machines you're monitoring, a few of them add and go away maybe, but mostly you are monitoring, say, a hundred thousand machines out there and they're sending you data every minute. You know the data's coming in, you just pump it straight into Kafka or something. And that's a pretty constant size thing. Dropbox is another one. They're just a bunch of full disks and the disks gradually get fuller. And that's basically most of their cost is in the storage part. So if you've got a very constant workload, it makes sense to optimize with just building that.
I listened to the Cinch example you were talking about. And they've got a very bursty workload. They have a TV ad that causes a whole load of traffic to turn up. You see this anytime where you've got spiky traffic. BBC is very driven by programming. You are delivering to a very large number of people who all act in concert. So they come in with a big spike. Any use of flash sales, things like that. And things that are intermittent. iRobot's the other one that's interesting. Over Christmas, everybody gets new little robot vacuum cleaners and they all turn them on the same day. So they have a massive spike in onboarding and it all works fine. So there's things like that that are natural fits. You get very high utilisation.
Most of the time when people do their cost analysis of their AWS bill, the Lambda part is so small it's lost in "other". So if you do everything with it, then it comes up to the top and you find that your API front end and your logging are still more than your Lambda cost. But altogether it's a cheaper way of doing things as well. So that's why I think it's interesting.
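The constant-versus-spiky trade-off can be put into rough arithmetic. The sketch below compares an always-on fleet sized for peak traffic with a pay-per-request model. All prices and capacities are made-up placeholders chosen to show the shape of the comparison, not real AWS pricing, and the fleet is assumed not to autoscale.

```python
import math

def provisioned_cost(peak_rps: float, rps_per_instance: float,
                     instance_hourly: float, hours: float) -> float:
    """Always-on fleet sized for peak traffic, paid for every hour."""
    instances = math.ceil(peak_rps / rps_per_instance)
    return instances * instance_hourly * hours

def per_request_cost(avg_rps: float, price_per_million: float,
                     hours: float) -> float:
    """Pay-per-invocation: cost follows average traffic, not the peak."""
    requests = avg_rps * 3600 * hours
    return requests / 1_000_000 * price_per_million

HOURS = 730  # roughly one month
workloads = {
    "spiky":  {"peak_rps": 1000, "avg_rps": 20},   # e.g. traffic after a TV ad
    "steady": {"peak_rps": 1000, "avg_rps": 900},  # e.g. monitoring ingestion
}
for name, w in workloads.items():
    fixed = provisioned_cost(w["peak_rps"], 100, 0.10, HOURS)
    usage = per_request_cost(w["avg_rps"], 1.00, HOURS)
    print(f"{name}: provisioned={fixed:.2f} per-request={usage:.2f}")
```

With these placeholder numbers the spiky workload is far cheaper per request, while the steady one is cheaper on a fixed fleet, which matches the argument: the gap between peak and average traffic is what decides it.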
And we've seen AWS working through all the issues like start-up latency with Java was one of the most recent ones. Where Java takes a long time to start a JVM. They've basically figured out how to snapshot a JVM after it starts so that when you deploy the new function, it actually runs the initialization phase and snapshots after that point. So you've got a very rapid start for your Java-based system. So things like cold start latency, they've found ways around that.
So that's what I was thinking. And so the point here is that the best way to approach something is to start with the serverless and see if you can solve with that.
And then the other tool that's also getting more powerful is Step Functions. So it's almost, if you're doing complex business logic, just do Step Functions first. And then figure out what you can't do with Step Functions, do that with Lambda. Figure out what you can't do with Lambda, and then you build that with containers one way or another. So that's the most effective way of building something new that I've seen.
I remember seeing a statistic being quoted a couple of years ago that Lambda was used on more than 50% of new AWS projects. Do you have any insight on that?
And I think I'm right in saying as well, I don't know if you know, but I think that Lambda is used very heavily internally at AWS as well. I remember seeing a statistic being quoted a couple of years ago that it was more than 50% of new AWS projects. Do you have any insight on that?
Yeah, it's very hard to tell. There's so many. I was in Amazon, I didn't know what all the teams at Amazon were doing. It's a very distributed organisation. And yeah, there's a lot of use of it.
Generally AWS tries to dogfood its own services. For example, when Graviton first came out, there was a big internal push to migrate to Graviton. It's the customers who take a little bit longer to adopt a new technology, but you can push it internally.
So you tend to see at launch or near launch a very high internal use of an AWS service. But as it rolls out and matures, that'll become a minority use. It'll become mostly the customers. But it's quite normal for early adoption to be internally driven. And there's a number of generally just people doing it because it's the right thing to do. And also internal pricing is sometimes used to make new things cheaper for internal. So to use some economic incentives, games you can play with internal chargeback.
I see a lot of people talking about chaos testing, but I don't actually see that many firms using it as part of their day-to-day practice. Does that chime with your experience?
Right. Yes. A second topic you talked about in that talk was around resilience and chaos testing. And I think a lot of us that have been around the industry for a while are familiar with the idea of, when we all ran our own data centres and we would have a failover data centre in theory, but we never actually tried to use it. Or even maybe taking backups that we then never tried to restore from. And in theory the cloud gets us away from a lot of this. But my own experience with chaos testing is, I see again a lot of people talking about it, but I don't actually see that many firms using it as part of their day-to-day practice. Does that chime with your experience?
Yeah. Chaos engineering, we popularised it probably around 2012 or so. There's two ways of looking at it. One is as an architectural design control. When you think about this, it makes sense.
So we wanted to have microservices that were stateless and autoscaled. That means that when we deploy them, we have an autoscale group that says make n of these things. And what we wanted to do was encourage people to do that and to not store state within their instances, didn't want a session state. We wanted basically all session state had to be somewhere else. We had like a Memcache tier for your session state, for whatever you'd normally put on a session cookie inside a monolith. So monoliths have a lot of state in them in memory as well as persisted. So we wanted to move away from that pattern of having a session cookie and a stateful instance to the stateless ones.
So one of the ways we enforced that was by deleting machines at random. So the state would go away. So don't do that. And then the other thing is that autoscaler scaling up is easy because you just add machines. But scaling down or rolling out a new version of the code by scaling them down means you have to think about, is it okay to just delete them? Or do you have to drain traffic through them? Like turn off traffic and wait a while and hope that everything's drained out and be nice and shut it down nicely and whatever.
We just said, no, we don't want to have to shut down things nicely. Things might die. But we also want to be able to make sure that if you don't get a response from your request, you just retry it and you assume that machine's gone away.
So there's a few things you have to do at the back end by making sure that your data writes are idempotent and things like that. So you can write the same thing twice and it's not a problem basically. So those kinds of things. So it's a little more tricky.
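The "retry and assume the machine has gone away" rule only works if the write is safe to repeat. A minimal sketch, assuming a toy in-memory store and a client-chosen request ID as the idempotency key (both hypothetical, not Netflix's actual implementation):

```python
import random
import uuid

class FlakyStore:
    """Toy data store: applies the write, but sometimes loses the response."""
    def __init__(self, failure_rate: float, seed: int = 0):
        self.rows = {}
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)

    def put(self, key: str, value: dict) -> None:
        self.rows[key] = value  # keyed upsert: same key always gives same row
        if self.rng.random() < self.failure_rate:
            raise TimeoutError("no response; the instance may have gone away")

def write_with_retry(store: FlakyStore, key: str, value: dict,
                     attempts: int = 5) -> int:
    """Retry on timeout; returns how many attempts were needed."""
    for attempt in range(1, attempts + 1):
        try:
            store.put(key, value)
            return attempt
        except TimeoutError:
            continue
    raise RuntimeError("gave up after retries")

store = FlakyStore(failure_rate=0.5, seed=1)
request_id = str(uuid.uuid4())  # client-chosen idempotency key
write_with_retry(store, request_id, {"user": "u1", "plan": "basic"}, attempts=20)
assert len(store.rows) == 1  # repeated writes did not create duplicate rows
```

Because the write is an upsert on a key the client picked before the first attempt, a retry after a lost response overwrites the same row instead of appending a second one.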
But basically that was the idea was we wanted to enforce the idea everyone was going to stateless autoscale services. And that was the only deployment model we had. And you could do anything you wanted, but it had to run in that way. So that was the initial thing.
And then we were going, okay, well, we deleted that. So we had this other design rule that AWS has zones in a region, and we're running on three zones, and in theory you're supposed to be able to lose a zone and keep working. We just said, well, we should test that. So we built the Chaos Gorilla, which deleted everything in the zone and it picked the zone at random.
I think the first time we ran it, we discovered that everyone had put their MySQL masters in zone A, and it picked zone A and all the MySQLs were failing over at once. We didn't have that much MySQL, it was mostly Cassandra, which was fine. Cassandra's built to deal with this. It's one reason we use Cassandra.
This is, okay, if you've got anything that cares about zones, you have to deal with it. The fact that every couple of weeks Netflix deletes a zone and has been doing that for, I don't know, at this point 10 years or so, means that everything they've got works when a zone goes down. So again, it's a design control. But the testing for the design control, first you got to do it in test.
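The zone drill can be modelled in a few lines. This is a simplified sketch with a hypothetical placement table. It only checks whether each service still has capacity somewhere after one zone is removed, and it ignores the failover behaviour a real database would have.

```python
import random

# Hypothetical placement: service name -> zones where it has live capacity.
placement = {
    "api": {"zone-a", "zone-b", "zone-c"},
    "cassandra": {"zone-a", "zone-b", "zone-c"},
    "mysql-primary": {"zone-a"},  # the single-zone mistake described above
}

def survivors_after_zone_loss(placement: dict, lost_zone: str) -> dict:
    """Return, per service, the zones still serving after one zone is removed."""
    return {svc: zones - {lost_zone} for svc, zones in placement.items()}

def zone_drill(placement: dict, rng: random.Random) -> tuple:
    """Pick a zone at random, remove it, and list services left with no capacity."""
    all_zones = sorted(set().union(*placement.values()))
    lost = rng.choice(all_zones)
    remaining = survivors_after_zone_loss(placement, lost)
    broken = sorted(svc for svc, zones in remaining.items() if not zones)
    return lost, broken

lost, broken = zone_drill(placement, random.Random())
print(f"lost {lost}; services with no remaining zone: {broken or 'none'}")
```

Run repeatedly, the drill eventually picks `zone-a` and reports `mysql-primary`, which is the kind of hidden placement assumption a regular zone outage exercise is meant to surface.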
But the thing is we set up the design controls at the beginning and then built the architecture to that design. So everything is designed to work with this. Now if you look at a normal architecture and you go, I'm just going to turn off a zone, everyone's like, no, no, no. It's like that data centre failover. You know everything won't work. It's going to take down your whole site. So you don't just do that.
So it's hard to get into it. So the right thing to do in my opinion is to start with a fresh set of AWS accounts and zones and chaos testing. And you put the chaos testing in first. And then you gradually migrate your services that need to be resilient into that system, one at a time, data sources and your services. And then you use peer pressure.
The management say, we've got this resilient system and we've got the non-resilient system. And management says, well, I want everything to be resilient, because of course you do. And then the non-resilient system will fall over occasionally and everyone will come shout at them until they finally figure out how to make their code work in the resilient way and move it into the resilient system.
And that way I think you can gradually get there. And that's what we did. Our old data center kept falling over and it was a monolith and Oracle and all this stuff, the SANs kept crapping out and various things went wrong, storage area networks and whatever. So the cloud was more resilient fairly early on, even though it was built out of these less reliable components if you like. In the early days, AWS instances would just randomly disappear, but that doesn't really happen, it's much more rare now. But in the early days, that was also a thing.
And we decided we would say when we were moving from IBM P series hardware, which is horribly expensive and supposedly very reliable, to random whatever AWS was running, Intel, Opteron, AMD machines, something like that, that were certainly cheaper and more likely to fail spontaneously. So that was the mindset that we went through.
The thing that's happening now, and actually I'm doing a keynote at the Chaos Carnival in mid-December, is to move to continuous resilience. And the point here is, let's put the resilience in the CI/CD pipeline so that as your code runs through, a new version of the code runs through, we run some chaos tests on that service. And it's like canary testing. Canary testing is testing does it work? Does the old and the new code work normally with good traffic?
But if you introduce failures as well, you've got to make sure your old code and your new code both fail in the same way. Or your new code doesn't have a new failure mode under stress. So you want to overload it, you want to break some of its dependencies, you want to make things slow, whatever, make it run out of disk. So all those kinds of chaos tests.
So putting it into continuous resilience in the delivery pipeline I think is where most people probably should be. And you're basically just doing test driven development with a bit more operational testing in there too. I think it's really an extension of TDD overall.
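One way to picture a chaos check in the delivery pipeline is to run the old and new versions under the same injected faults and flag any fault where the new version behaves differently. The sketch below is a toy: the fault names, the dependency stub and both service versions are hypothetical stand-ins for a real canary environment.

```python
from typing import Callable, Dict

# Hypothetical fault labels understood by the dependency stub below.
FAULTS = ["none", "dependency_timeout", "dependency_error"]

def flaky_dependency(fault: str) -> str:
    """Stand-in for a downstream call with a fault injected on demand."""
    if fault == "dependency_timeout":
        raise TimeoutError("dependency timed out")
    if fault == "dependency_error":
        raise ConnectionError("dependency refused connection")
    return "fresh-data"

def service_v1(fault: str) -> str:
    """Old version: falls back to cached data on any dependency failure."""
    try:
        return flaky_dependency(fault)
    except (TimeoutError, ConnectionError):
        return "cached-data"

def service_v2(fault: str) -> str:
    """New version with a regression: it only handles timeouts."""
    try:
        return flaky_dependency(fault)
    except TimeoutError:
        return "cached-data"

def failure_profile(service: Callable[[str], str]) -> Dict[str, str]:
    """Run the service under each fault; record 'ok' or the exception name."""
    profile = {}
    for fault in FAULTS:
        try:
            service(fault)
            profile[fault] = "ok"
        except Exception as exc:
            profile[fault] = type(exc).__name__
    return profile

def new_failure_modes(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, str]:
    """Faults where the new version behaves differently from the old one."""
    return {f: new[f] for f in FAULTS if new[f] != old[f]}

regressions = new_failure_modes(failure_profile(service_v1),
                                failure_profile(service_v2))
print("new failure modes:", regressions)
```

A pipeline step could fail the build whenever `regressions` is non-empty, which is the same idea as a canary comparison, extended from good traffic to failure conditions.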
What do you think we can learn from Professor Nancy Leveson's work?
I was thinking as you were talking there, we've put a lot of emphasis on detecting problems as an industry. So the whole observability thing is about detecting problems and diagnosing them quickly. We've spent rather less time I think talking about response and how we respond to incidents. You cited Professor Nancy Leveson's work, the book "Engineering a Safer World", she's from MIT. What do you think we can learn from her work?
This goes back to thinking that it's a control theory problem. And unfortunately control theory isn't on the syllabus in most developers' academic careers. You can learn a lot of development stuff before having to deal with control theory. I did an applied physics and electronics degree, which includes some control theory, so I have a little bit of extra background, not that I've ever been a full control engineer.
But the point about "Engineering a Safer World" is to think of it as a control feedback loop. And the correct terminology is observability and then effectively modelability and controllability. And we've been focusing a lot on observability. And the definition of observability is very clear, and it goes back to a paper that Kalman wrote in 1961. So some of us purists go, stop misappropriating good names for things. But that's fine.
But fundamentally, observability as a property means you can construct the behaviour of a system from its externally visible components. So if you have a black box system, if it is exporting enough information about what it's doing that you can predict its behaviour, that is what observability is. It's a very pure conceptual term.
And what we're doing with our systems is trying to get better. So tracing and some of these things definitely give you better observability on what's going on inside a system.
So now you've got this data coming up. The next point is can you make sense of it? And there's a few start-ups and a few people looking into modelability. Can you make sense of this? AI ops or whatever you're doing, you're looking at all this data and you're trying to decide can you model what's going on? Maybe you've got a runbook, or a few experienced engineers who have seen it before.
And then the next thing is controllability. Do you have the ability to control the system to steer it back onto a good path? Because generally speaking, systems are running in control with an internal control feedback loop. They just run. And every now and again they get out of control. That's when the human has to get in the loop to fix it. So there's usually an automated feedback loop like an autoscaler as part of a control system. And then if the autoscaler stops working because you're out of memory and you weren't looking at memory you were looking at CPU for example, autoscaler will just crap out. So you have to intervene and say, no, we see a different problem, we need to go do something about it.
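The inner and outer loops can be sketched as two small functions. This is a toy model, assuming a proportional CPU-only scaling rule and a hypothetical memory threshold. It is meant to show why an autoscaler that watches one signal stays quiet while another resource runs out.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    cpu: float     # fraction of CPU in use, 0.0 to 1.0
    memory: float  # fraction of memory in use, 0.0 to 1.0

def cpu_autoscaler(instances: int, m: Metrics, target_cpu: float = 0.6) -> int:
    """Inner loop: proportional scaling that looks at CPU only."""
    desired = round(instances * m.cpu / target_cpu)
    return max(1, desired)

def outer_loop_alerts(m: Metrics, memory_limit: float = 0.9) -> list:
    """Outer loop: flags conditions the inner loop is not watching."""
    alerts = []
    if m.memory >= memory_limit:
        alerts.append("memory near limit; CPU-based autoscaler will not react")
    return alerts

# CPU sits exactly on target, so the autoscaler holds the fleet size steady
# even though memory is nearly exhausted.
observed = Metrics(cpu=0.6, memory=0.95)
print("desired instances:", cpu_autoscaler(10, observed))
print("outer loop alerts:", outer_loop_alerts(observed))
```

The outer loop here is only an alert for a human, matching the description above: the automated loop runs until it meets a condition it was never modelled for, and then someone has to step in.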
So this outer loop I think is interesting. So I'm certainly expecting over time that more people will figure out this modelling and control side. And then you've got a control system which needs to be independent of the thing it's controlling. And if you put it all in the same zones, in the same region, in the same system, it's all going to fall over together and you'll be blind. So you need it to be externally independent and more resilient. Your control system needs to be more resilient than the systems it's controlling, which is also hard for some people.
Like some monitoring and observability systems are basically best effort. Like Prometheus is just scraping the system and sometimes it doesn't see the data, whatever. I wouldn't build a control system based on something that's grabbing data on a best effort basis. You've got to have a much stronger understanding of the data. And the Levinson gives you some models for doing that. Anyway, that's an area I've been particularly interested in and advising a few startups and people like that in that space. So I think there's still interesting work to be done there.
Is usage of Wardley mapping still nascent?
You mentioned Wardley mapping earlier, and funnily enough, I was talking about Wardley mapping with a colleague of mine. And his take was very much, it's a nice idea in theory, but I don't think anyone's actually using it. I'm not sure that's entirely fair. But I do think its usage is quite nascent. But I do think it's really useful. Interested to get your perspective on it.
And I think it is relatively nascent. I think at some point it'll start becoming standard curriculum in business school stuff. That's when it will be adopted. It's really one of those topics. It's a strategy mapping tool.
I find it useful in my usual on-ramp. I did a little talk once called "Mapping Your Stack", which was a very simple on-ramp to it. Basically you have a technology stack, you've got users at the top, you've got your layers of technology all the way down to databases and cloud and data centers. You can just map your dependencies as a value chain. And then you can look at how evolved it is and say, okay, and the classic case is, we have some code which writes custom files to a file system. And it reads and writes these custom files and eventually somebody says, we should put that in a database, that's more evolved.
So now we've got a database where you're putting it in Oracle or something like that. So it's okay, this is costing a lot of money, let's go find a cheaper one. So you put it in MySQL, it's open source, and then eventually you put it in some database as a service from your cloud vendor.
You're functionally the same layer in your application stack, but you've gone from full custom product to a more generic product to a utility. You can drive all the levels in your stack to the right in that way. And that's a useful way of using Wardley mapping for a very specific purpose that gives people an on-ramp to really get used to it.
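The storage example above can be written down as data. This is only a sketch: the four evolution stages follow the usual Wardley mapping axis, but where each step is placed on that axis is an assumption for illustration, and the `Evolution` enum and helper function are hypothetical names.

```python
from enum import IntEnum


class Evolution(IntEnum):
    """Left-to-right evolution axis of a Wardley map."""
    GENESIS = 1
    CUSTOM = 2
    PRODUCT = 3
    UTILITY = 4


# The same functional layer (storage) at successive points in time.
# Stage placements are assumptions, not taken from a published map.
storage_journey = [
    ("custom files on a file system", Evolution.CUSTOM),
    ("commercial database such as Oracle", Evolution.PRODUCT),
    ("open source database such as MySQL", Evolution.PRODUCT),
    ("database as a service from a cloud vendor", Evolution.UTILITY),
]


def only_moves_right(journey) -> bool:
    """True if every step is at least as evolved as the one before it."""
    stages = [stage for _, stage in journey]
    return all(a <= b for a, b in zip(stages, stages[1:]))


def candidates_to_evolve(layers) -> list:
    """Names of layers that have not yet reached the utility stage."""
    return [name for name, stage in layers if stage < Evolution.UTILITY]
```

Running `candidates_to_evolve` over every layer in a stack, rather than one layer over time, gives a rough list of places where a more evolved option might exist.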
I think if you come in to Wardley mapping more generally you say, well, you can also use it to map the landscape, the market you are in, your competitive issues, all these kinds of things. People get confused because they see it being used in 10 different ways.
And we were talking about this with the value chain mapping, value chain flywheel people. And they're saying, a good analogy here is music. Most people can listen to music and enjoy it and understand it to some extent. And then some people can listen to the music and say what key it's in and how it's constructed and can make music because they know the rules of music for constructing a piece of music. And Wardley mapping's like that.
A skilled practitioner knows a whole lot of theory and practice so they can construct a Wardley map that's going to be effective and work in a situation. The people seeing the map don't generally need to know much about mapping. It should be fairly obvious, but exactly why you choose to put a thing in a certain place is like choosing a certain chord because it's in key or whatever. You have to know that there's some theory behind why you do things in certain ways. So it takes a little while to learn, you see lots of maps. Eventually you start learning that there's patterns and there's all the stuff behind it.
So it's one of those things where I find it useful, it's one of the tools in the kit bag, you get it out at the right times to go solve problems of certain types. And occasionally if you've got a room full of people that can't figure out what to do, you just get up on a whiteboard and start mapping and everybody zones into, oh, that, yes, that. What did you just do? And mostly they go, how did you do that? But it has a very useful ability to coalesce a conversation down to something that makes sense to everybody. And make it clear what you're going to do next and why. And that's really the most powerful thing.
Do you think the map itself is disposable?
Yeah. And funnily enough, I found just the exercise of building a map collectively is in and of itself incredibly useful. Even if the map at the end of it is actually fairly disposable, if you see what I mean. The business of getting there.
I also think it can be quite good for depersonalising things. So you know how you can get into situations where someone is quite precious about their idea and it feels like they're personally being attacked, not the idea is being attacked. And when you've got a map, you can be attacking the map because the map's wrong and it doesn't feel personal in quite the same way. That might just be my own very specific experiences of where I've applied it. But I think there's some truth in that as well.
Yeah, definitely. And Cat Swetel, who has done a few talks on this, actually says, throw away your maps when you're done. The act of creating it is the thing. It's something like, if anyone's ever written a book about something, or you write a paper about something, or a blog post, you actually end up understanding it better after you've written the thing than before.
And it's similar. By drawing the map, it gets a lot of ideas straightened out. So yeah, I write quite a lot of maps that I never actually share with other people, I'm just trying to understand something.
Right. Do you mainly do them on your own then or collaboratively?
I mostly do them on my own. Occasionally, in collaborative situations, it's powerful.
How would you take advantage of EC2 Ultra clusters?
We're quite pushed for time and we've got two more topics to do, so we'll need to do these quite quickly. And another thing you talked about, which I know is an area of interest, is this trend towards ever larger hardware again. So you talked about EC2 Ultra clusters, for example, which are these massive 4,000 or more A100 GPUs, 400 gigabit links, petabit scale fabric, I mean, huge, huge machines. I mean, there's machine learning applications and those things, but how would you take advantage of those capabilities? Why would you want something like that?
Yeah, those systems, when they appeared, you're like, what is that? It turns out this is what the big AI companies are using. If you think of a company that has got a very heavyweight AI training workload, they're off doing this. I won't name names, but you can probably guess.
This is the stuff they're doing and they're just driving AWS to do it. These are big expensive training systems. Since I did that talk, there's a newer version. It's 800 gigabits of bandwidth per node, and the nodes are based on these Trainium machines that have several times the capacity with these custom chips on. And they're running even bigger clusters of it. These are truly ludicrously sized machines, and you're just in there solving. This is the kind of thing ChatGPT runs on. ChatGPT is running on Microsoft I think, but it's the equivalent thing. So people building those types of systems are doing it in these huge clusters.
And then there's this interesting segue into high performance computing, which I've got more involved in recently. I just did a blog post on high performance computing. They use some of the same hardware, some of the same techniques, and there's some overlap. But HPC is solving some more different problems, computational fluid dynamics and all these kinds of things, crash testing things without actually physically crash testing them and the world's weather and stuff, climate, things like that. So big model solving.
But there's definitely a place, and we're starting to see more and more HPC on cloud. And some of it is being driven by the same underlying technology. So effectively you can see AI cross funding development which is actually useful for building supercomputers, which is assisting HPC to move to cloud.
The HPC marketplace is a decent size, but it generally is not a big driver of business. It's always been a side thing in the computer industry. And AI and machine learning is big enough now that it's actually driving it. So they're piggybacking on each other from a technology point of view. So that's been happening.
The thing I was specifically interested in was very large memory. You can get machines now, I think the biggest AWS machine is 24 terabytes of RAM. I believe Azure has a 48 terabyte RAM machine, mostly for running SAP HANA, which is an in-memory database.
But I think that we are under investing in what you could do with these huge machines because basically there's this idea of big data. Big data's data that doesn't fit on one machine. So from my point of view, big data is anything bigger than about 20 terabytes. If you've got less than 20 terabytes, you should put it all in memory in one machine and process it in place. And the fact is, well, yeah, but you have less CPU in that one place. But the system is so much more efficient because it turns out you're spending most of your CPU time converting things into strings and sending them somewhere else and trying to guess what's in that string and then doing the compute on it.
The cost of serialising and deserialising is actually really high when you do it at scale, isn't it?
Right. Yes. Because the cost of serialising and deserialising is actually really high when you do it at scale, isn't it?
It is. Particularly if you're not careful with the mechanisms you're using for it. So that's one of the things I think if you take away all of those overheads, you can get a lot more done in a single machine than you'd think. And I just think we're under invested in that space.
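A rough way to see the overhead being discussed is to do the same sum twice, once on data already in memory and once after a round trip through a string. This is a toy sketch using Python's standard `json` module; the function names and the data size are assumptions, and actual timings will vary by machine and by serialisation format.

```python
import json
import time


def sum_in_place(values):
    """Compute directly on data that is already in memory."""
    return sum(values)


def sum_via_strings(values):
    """Same computation, but serialise to a string and parse it back first."""
    payload = json.dumps(values)    # what a sender would do
    received = json.loads(payload)  # what a receiver would do
    return sum(received)


def compare(n: int = 1_000_000):
    """Return (seconds in place, seconds via strings) for n integers."""
    values = list(range(n))
    t0 = time.perf_counter()
    sum_in_place(values)
    t1 = time.perf_counter()
    sum_via_strings(values)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1
```

Calling `compare()` returns the two elapsed times so they can be set side by side. Both paths return the same answer, so any difference is time spent encoding and decoding.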
There's a really interesting research project, a new operating system called Twizzler, out of UC Santa Cruz. There's a company called Elephance, with a CE at the end, which is starting to commercialise it and figure out what to do with it. But it's a completely new operating system designed for these huge memory machines and where data is the essential object. You have global addresses to things, it's a bit ... I don't think we have time to explain it here. But if anyone's interested in what you would do if you were trying to build an operating system for terabytes or petabytes of data in memory, that's what's going on there.
And there are some others, there's a company called I think it's Brytlyt in the UK that's looking at in memory database stuff I think.
Brytlyt's different. This is another aspect of it, the GPU memory is also getting very big. Brytlyt run a PostgreSQL-compatible database in the GPU. There you have this huge SIMD array processor where, without even indexing, you can do look-ups in parallel using all the processing power. So a very powerful system. It's used for some real-time analytics, particularly in the telco industry from what I've heard. So there's some interesting work there, a bunch of other stuff going on.
I'm still waiting for persistent memory to come along, it's almost coming along and then going away again, I think it may come back. There's this thing called CXL, which I've been talking about in this recent blog post, and maybe it'll come along as a CXL thing in another year or two. So that's pretty much where we are.
I came up with an architecture I call Petalith, which I describe a bit in this blog post, and I've been thinking about for a while. It's a petascale monolith where all the different pieces are in one address space. Each piece is running like a microservice, but it's connected using a memory transport instead of a network transport. That's it conceptually.
What are you doing in the sustainability space now?
Right. And it would use less compute and be more efficient and therefore also more sustainable, which is obviously a hugely important topic. I know you are doing some work in your semi-retirement around this. Do you want to talk a little bit about what you are doing in the sustainability space now?
I'm advising a few companies on adding carbon as a metric. So I think all the monitoring tools that are out there will eventually have carbon as just another metric like utilisation and response time. So I think that's the end state. We're a few years away from it.
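As a sketch of what carbon "as just another metric" could look like next to utilisation and response time, the example below multiplies energy used by the carbon intensity of the grid supplying it. The `Sample` type, its field names and every number are hypothetical, and how energy gets attributed to a service is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One hypothetical monitoring sample for a service over some interval."""
    service: str
    utilisation: float       # fraction 0..1
    response_time_ms: float
    energy_kwh: float        # energy attributed to the service (assumed to be known)
    grid_g_per_kwh: float    # grid carbon intensity in gCO2e per kWh (assumed to be known)

    @property
    def carbon_g(self) -> float:
        """Estimated emissions in grams of CO2e: energy times grid intensity."""
        return self.energy_kwh * self.grid_g_per_kwh


# Made-up values, for illustration only.
sample = Sample("checkout", utilisation=0.5, response_time_ms=120.0,
                energy_kwh=2.0, grid_g_per_kwh=400.0)
```

Here `sample.carbon_g` sits alongside `sample.utilisation` and `sample.response_time_ms` as one more field on the same record.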
But I did a talk at Monitorama last year about that. And I'm pushing along trying to get people to think about what that would be. There's a number of things in its way right now, but I think they'll resolve over the next few years. But that's where we're going to be.
There was a statistic I saw actually that claims that just by moving to, this was specifically to AWS, to public cloud on AWS, you would get something like an 88% reduction in carbon footprint. Which I presume is primarily just down to greater workload density, greater efficiency and those sort of things. Is that right?
It's a combination of much higher utilisation, more efficient machines end to end, and then the clouds are building large wind farms, solar farms and battery farms, all of the cloud vendors. So literally all the cloud vendors in Europe and the US right now, you're pretty much at zero scope two carbon, which is the carbon burnt by the machines. Most of the carbon that needs to be eliminated right now is still in Asia.
Aerin Booth, who you may know in the UK, he's doing an online course on renewable and low carbon computing, and I'm going to be helping with that. So if anyone's interested in that, look up the class that he's putting up. He's doing it on Raven.com.
Brilliant. I'll include a link to that in the show notes as well. Adrian, I wish we had longer. It's always brilliant to talk to you. And thank you so much for taking the time to be my guest on this episode of Hacking the Org.