Keeping our software up and running isn’t so different from keeping our organisations functional. We can learn from each other and use the same techniques.
The field of site reliability engineering originated at Google with Ben Treynor Sloss, who founded a site reliability team after joining the company in 2003, but the practice has spread across most or...
There are two hard problems in tech: cache invalidation, naming things, and off-by-one errors. We have proven this over and over again through a multitude of poorly named things. Whether it’s AWS Serv...
Prometheus is a simple and effective open-source monitoring system. In the years since we published the article Monitoring Microservices with Prometheus, the system has graduated from the Cloud Native...
Those of us who make a living producing software or managing software teams are ultimately getting paid to improve business processes, be it making cars autonomous to improve safety, save people time ...
About a year ago, brick-and-mortar businesses like restaurants and grocery stores were scrambling to set up delivery and curbside pickup. A lot of them used chaos engineering, in production, to hunt for failure...
Supporting Cloud Native applications is no easy task. Through offering Customer Reliability Engineering (CRE) support—essentially, Site Reliability Engineering (SRE) as a service—for multiple customer...
When you’re offered a Covid-19 vaccine this year, which sort would you like? One that’s been through animal and human trials, received government approval, is made on a standardised production line, a...
The truly Cloud Native way to work in teams, according to the Maturity Matrix, means SRE and DevOps. But what does that mean? You might be wondering, Isn’t SRE basically just DevOps?