37 min read

    Cloud Native Java: Infrastructure Automation with Kubernetes Operators

    Java (and its other JDK-based siblings) is the most widely used programming language in large companies. Java developers are backend-focused and used to building complex distributed systems. Yet these...

    Comparing Chaos Engineering Tools for Kubernetes Workloads

    For most people, the word ‘chaos’ means complete disorder and confusion. So what does it mean to engineer chaos? The distributed systems we build are becoming more and more complex, thus their state ca...

    SRE

    2 min read

    What Does an SRE Do?

    Being a Site Reliability Engineer, or SRE, is a hot job—and an expensive one to keep on staff.

    What Is CRE, and What Does It Have to Do With SRE?

    Site Reliability Engineering, or SRE, an engineering practice formalised and named by Google, has helped many organisations maintain their platforms and ensure application performance and reliability,...

    3 min read

    Incident Management: 9 Great Resources to Tackle Unexpected Problems

    As Site Reliability Engineers, or SREs, we spend our days (and sometimes nights and weekends) making sure the platforms we oversee run smoothly. We also follow careful protocols for responding when so...

    SRE

    3 min read

    Isn’t SRE Just DevOps?

    This is the conclusion of a three-part blog series. For more information, request our free e-book, SRE: The Cloud Native Approach to Operations. If you’ve been following parts 1 and 2 of this blog ser...

    SRE

    6 min read

    What Does SRE Have to Do With Cloud Native?

    This is part 2 of a three-part blog series on Site Reliability Engineering. You can read Part 1 here. Part 3 is here. To learn more, request a free copy of the e-book on SRE from which this was excerp...

    What SRE Is—and How It Helps You Keep Innovating

    This is the start of a three-part blog series on Site Reliability Engineering. To learn more, request a free copy of the e-book on SRE from which this was excerpted. Almost all enterprises nowadays lo...

    9 min read

    What We've Learned from Launching a Runbooks Project

    Back in 2017, I wrote on my personal blog about Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites. A lot of it focussed on runbooks, or checklists, or whatever ...