Java (and its JDK-based siblings) is the most widely used programming language in large companies. Java developers are backend-focused and used to building complex distributed systems. Yet these...
For most people the word ‘chaos’ means complete disorder and confusion. So what does it mean to engineer chaos? The distributed systems we build are becoming more and more complex, thus their state ca...
Site Reliability Engineer, or SRE, is a hot job, and SREs are expensive to keep on staff.
Site Reliability Engineering, or SRE, an engineering practice formalised and named by Google, has helped many organisations maintain their platforms and ensure application performance and reliability,...
As Site Reliability Engineers, or SREs, we spend our days (and sometimes nights and weekends) making sure the platforms we oversee run smoothly. We also follow careful protocols for responding when so...
This is the conclusion of a three-part blog series. For more information, request our free e-book, SRE: The Cloud Native Approach to Operations. If you’ve been following parts 1 and 2 of this blog ser...
This is part 2 of a three-part blog series on Site Reliability Engineering. You can read Part 1 here. Part 3 is here. To learn more, request a free copy of the e-book on SRE from which this was excerp...
This is the start of a three-part blog series on Site Reliability Engineering. To learn more, request a free copy of the e-book on SRE from which this was excerpted. Almost all enterprises nowadays lo...
Back in 2017, I wrote on my personal blog about Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites. A lot of it focussed on runbooks, or checklists, or whatever ...