
Incident Management: 9 Great Resources to Tackle Unexpected Problems

As Site Reliability Engineers, or SREs, we spend our days (and sometimes nights and weekends) making sure the platforms we oversee run smoothly. We also follow careful protocols for responding when something goes unexpectedly wrong. Here’s a list of some of the most useful resources for incident management, compiled by Container Solutions’ Customer Reliability Engineering team.

  1. PagerDuty’s Incident Response documentation. The company’s comprehensive resource covers incident management and response end to end, with sections on on-call, pre-incident, response/triage, post-incident, and training, plus resources for further reading.
  2. Google’s SRE book: Emergency Response. The Emergency Response chapter from the Site Reliability Engineering book walks through three incident scenarios: a test-induced emergency, a change-induced emergency, and a process-induced emergency. Each is a real account of an incident caused by unanticipated side effects, followed by the Google team’s findings (what went well and what was learned) and the takeaways.
  3. Google’s SRE book: Managing Incidents. We also recommend the Managing Incidents chapter from the same book.
  4. Google’s SRE book: Incident Response. More wisdom from the company that formalised the SRE role, this time in the Incident Response chapter.
  5. ‘How to Establish a High Severity Incident Management Program’. Gremlin’s guide on establishing a program to tackle big problems. 
  6. ‘Incident Management’. A collection of tutorials, tips, and best practices around incident management from Atlassian.
  7. ‘Incident Management at Heroku’. A look at how Heroku, a cloud platform company, handles incident response. 
  8. GitLab’s ‘Incident Management’. This guide from GitLab’s Engineering Handbook defines roles and responsibilities, lays out guidelines for incident statuses and labelling, and links to runbooks for on-call engineers.
  9. ‘Extended Dreyfus Model for Incident Lifecycles’. This can be useful for identifying competencies and behaviours.

