Humans are built to make mistakes. It’s how we learn.
That’s true whether you work for a small startup or a global enterprise. The trick is to try to avoid making the same mistakes, over and over.
No matter how brilliant your engineers are, or how cohesive your team is, things will happen. Things like service outages, or security breaches. Things like, say, the whole world being stuck at home for months while your e-commerce site is unprepared for the spike in traffic.
At Container Solutions, we hold to a principle of psychological safety—the idea that no one who works with us will be punished or ridiculed for speaking up or sharing an idea. It’s not just because it makes for a better work environment; it does, but that’s not the point. The point is that when people feel safe, they can solve problems collaboratively, better and faster than if they felt too intimidated to share their thoughts.
This is never more important than during and after a crisis. That’s where the blameless postmortem comes in.
The blameless postmortem is a best practice for Site Reliability Engineering and for IT more generally. It’s also a practice that more organisations need to adopt in order to give themselves every advantage in solving problems. The blameless postmortem helps teams discover what happened when something goes wrong, and why—and how to prevent it from happening again.
At Container Solutions, we have created a procedure for running blameless postmortems, both with our customers and internally. Here are our guidelines:
Create a document. It can live in a GitLab repository as an issue, or in the document or ticket system of your choice: Google Docs, Confluence, Jira, and so on. Open it as soon as possible after an incident, with just a title; everything else can be filled in later.
Include some standard information in the document. It should always cover:
- Who discovered the incident
- Impact of the incident
- Timeline of the incident
- Answers to the 5-Whys, an iterative questioning technique
- Ways to prevent the incident
- Related incidents that happened before
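As a rough illustration, the standard fields above could be captured in a small template that reports which sections are still empty. This is only a sketch in Python, not the tooling we use: the class name `Postmortem`, the `owner` field, and the seven-day `review_after_days` default are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Section labels taken from the list above, paired with hypothetical field names.
SECTIONS = [
    ("Who discovered the incident", "discovered_by"),
    ("Impact of the incident", "impact"),
    ("Timeline of the incident", "timeline"),
    ("Answers to the 5-Whys", "five_whys"),
    ("Ways to prevent the incident", "prevention"),
    ("Related incidents that happened before", "related_incidents"),
]


@dataclass
class Postmortem:
    """Hypothetical postmortem record; only the title is needed to open it."""
    title: str
    opened_on: date
    owner: str = ""                      # assumed field: the facilitator
    discovered_by: str = ""
    impact: str = ""
    timeline: list[str] = field(default_factory=list)
    five_whys: list[str] = field(default_factory=list)
    prevention: list[str] = field(default_factory=list)
    related_incidents: list[str] = field(default_factory=list)
    review_after_days: int = 7           # assumed default, not a recommendation

    @property
    def due_date(self) -> date:
        """Date by which the postmortem should be reviewed."""
        return self.opened_on + timedelta(days=self.review_after_days)

    def missing_sections(self) -> list[str]:
        """Labels of sections that are still empty."""
        return [label for label, attr in SECTIONS if not getattr(self, attr)]

    def render(self) -> str:
        """Plain-text version of the document, with TBD for empty sections."""
        lines = [
            f"Postmortem: {self.title}",
            f"Opened: {self.opened_on.isoformat()}",
            f"Owner: {self.owner or 'TBD'}",
            f"Review by: {self.due_date.isoformat()}",
            "",
        ]
        for label, attr in SECTIONS:
            value = getattr(self, attr)
            lines.append(f"{label}:")
            if not value:
                lines.append("  TBD")
            elif isinstance(value, list):
                lines.extend(f"  - {item}" for item in value)
            else:
                lines.append(f"  {value}")
        return "\n".join(lines)


# Open the document with only a title, then fill in the rest later.
pm = Postmortem(title="Checkout outage", opened_on=date.today())
pm.five_whys.append("Why did checkout fail? The payment service timed out.")
print(pm.render())
print("Still to fill in:", pm.missing_sections())
```

Keeping the section labels in one list means the rendered document and the "still to fill in" check cannot drift apart.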
Keep track of the document. Use GitLab's due dates and checklists, a calendar event, or another task-tracking tool of your choice to ensure the postmortem is not forgotten.
Assign one person responsible for the postmortem. It does not have to be the person who found the incident or the one who solved it, just somebody who facilitates the conversation and makes sure that the postmortem gets done.
Hold regular sharing sessions. Thank everyone involved in solving and learning from the incident.
Celebrate your learnings. Have a small prize (cake at the office, maybe?) every time a postmortem is closed.
Resources for a Deep Dive
The internet is full of information on the subject of blameless postmortems. We recommend two places in particular for a deep dive:
Code as Craft, a project of Etsy, the e-commerce company, includes lots of valuable information about how and why to conduct a blameless postmortem as part of incident response. You can find its open-source repository, including a Debriefing Facilitation Guide, here.
The Postmortem, an online resource by PagerDuty, is an exhaustive guide to the blameless postmortem, explaining not only the concept but also how to introduce it to a team, steps to take, templates to fill out for incident reports, and resources for further reading. There’s also a GitHub repo.