Podcast: Jennifer Mace on How Google Does SRE

    Charles Humble talks to Jennifer Mace, aka Macey. She gives us her definitions of “site reliability engineer” and “toil”, discusses how Google recruits SREs, explores how to manage risk and speed, ...

    Ditch the Template: Incident Write-ups They Want to Read

    Why should we try to make incident reports engaging?

    Is it Imperative to be Declarative?

    Recently, in Container Solutions’ engineering Slack channel, a heated argument ensued amongst our engineers after a Pulumi-related story was posted. I won’t recount the hundreds of posts in the thread...

    19 min read

    What SRE Teams Can Learn from Business Continuity and Vice Versa

    Keeping our software up and running isn’t so different from keeping our organisations functional. We can learn from each other and use the same techniques.

    Almost 20 Years In, SREs Are Still Finding Their Place

    The field of site reliability engineering originated at Google with Ben Treynor Sloss, who founded a site reliability team after joining the company in 2003, but the practice has spread across most or...

    DevOps - The Sec is Silent

    There are two hard problems in tech: cache invalidation, naming things, and off-by-one errors. We have proven this over and over again through a multitude of poorly named things. Whether it’s AWS Serv...

    16 min read

    A Beginner's Guide to Using the Prometheus Operator

    Prometheus is a simple and effective open-source monitoring system. In the years after we published the article Monitoring Microservices with Prometheus, the system has graduated from the Cloud Native...

    13 min read

    Why Should We Care about AIOps?

    Those of us who make a living producing software or managing software teams are ultimately getting paid to improve business processes, be it making cars autonomous to improve safety, save people time ...

    Why You Need Chaos Engineering Now More Than Ever

    About a year ago, brick-and-mortar businesses like restaurants and grocery stores were scrambling to set up delivery and curbside pickup. A lot of them used chaos engineering, in production, to hunt for failure...