Charles Humble talks to Jennifer Mace, aka as Macey. She gives us her definitions of "site reliability engineer" and “toil”, discusses how google recruits SREs, explores how to manage risk and speed, ...
Why should we try to make incident reports engaging?
Recently, in Container Solutions’ engineering Slack channel, a heated argument ensued amongst our engineers after a Pulumi-related story was posted. I won’t recount the hundreds of posts in the thread...
Keeping our software up and running isn’t so different from keeping our organisations functional. We can learn from each other and use the same techniques.
The field of site reliability engineering originated at Google with Ben Treynor Sloss, who founded a site reliability team after joining the company in 2003, but the practice has spread across most or...
There are two hard problems in tech: cache invalidation, naming things, and off by one errors. We have proven this over and over again through a multitude of poorly named things. Whether it’s AWS Serv...
Prometheus is a simple and effective open-source monitoring system. In the years after we published the article Monitoring Microservices with Prometheus, the system has graduated from the Cloud Native...
Those of us who make a living producing software or managing software teams are ultimately getting paid to improve business processes, be it making cars autonomous to improve safety, save people time ...
About a year ago, brick and mortars like restaurants and grocery stores were scrambling to set up delivery and curbside pickup. A lot of them used chaos engineering, in production, to hunt for failure...