Supporting Cloud Native applications is no easy task. Through offering Customer Reliability Engineering (CRE) support—essentially, Site Reliability Engineering (SRE) as a service—for multiple customers, we here at Container Solutions have learned that the incident response process needs to be as clear and concise as possible.
Fire drills are a way to help any company improve its incident-response processes and to validate that its engineers know how to react when an incident occurs. And yet, we have found little guidance available on how to design or run a fire drill. So this blog post is going to cover the main things you need to keep in mind while designing your own fire drills and then running them.
Experiencing how to react to an outage before a real one occurs will give your engineers confidence and help the company to refine its processes. Fire drills will help you achieve that, and we are going to describe some guidelines and provide an example of how to do this.
At the end of this blog post, you will have a better understanding of fire drills and some practical guidelines to help you build your own. They can help your company level up its incident-response readiness and processes, making your engineers truly ready for anything.
What's a Fire Drill?
Fire drills are fictitious events based on a custom scenario simulating a real-world incident. They help you to validate incident-response processes and check if your engineers know how to react under the different scenarios that can come up during an incident.
A fire drill can be seen as a role-playing game, but this doesn’t reduce its importance. It shouldn’t be treated any differently than a real incident when it’s happening, as the idea is to detect flaws in the incident-response process by being as realistic as possible.
There are other ways to refer to this kind of event, like ‘Wheel of Misfortune’ or ‘Walk the Plank’, referring to the feeling participants can get while going through it and being the team’s center of attention. One term that is often confused with ‘fire drill’ is ‘game day’, the objective of which is to test infrastructure resilience rather than incident-response processes. Game days are closely related to chaos engineering practices, which are also an important part of SRE work (but chaos engineering is out of this article’s scope).
Unlike chaos engineering and game days, which consist of introducing failures into specific components of the system, an important element of fire drills is their ‘flexibility’. This means that when you define a scenario, you need to set a path to follow but also be flexible enough to respond to unexpected questions and guide the participants through the areas of your incident-response process that you want to test.
Fire drills have two main objectives: training engineers, and testing incident response.

Training Engineers
Like any other skilled task, if you don't practice you can't expect to be any good at it. Fire drills give engineers in a team a way to practice debugging and incident-response skills before a real event. Otherwise they won't be any good at them when it counts.
While preparing a team to start the on-call process, you need to understand the system you are going to work with. If a fire-drill scenario is based on a real application, engineers can develop intuition about how to debug a specific system.
When joining an SRE team, new members commonly work as shadow on-call engineers to see how other engineers react to incidents in the actual system. A good way to give a new team member intuition about your system is to run fire drills with them as the on-call engineer.
Testing Incident Response
For any SRE team, it's key to have an Incident Response Guide in place. This allows you to standardise how engineers respond to incidents and reduce the need to make decisions that are not relevant to mitigating or solving the current incident (for example: naming files, defining customer communication channels, escalation policies, etc.). An incident-response process is a list of procedures, and fire drills let us validate that those steps make sense before testing them in a real incident.
Running fire drills periodically can help you to find flaws in your procedures, the tools you are using, or even your communication processes.
As an example: After running our first fire drill, we found that our engineers weren’t comfortable following our incident-response process and reported that it was too complicated. As a result, multiple improvements were made to our incident-response process to streamline it and make it easier to follow.
It is important that the fire drills happen frequently enough so that you can keep refining your incident-response processes. It is also important that your scenarios are flexible enough to test almost anything that could happen. Keep in mind problems related to applications, infrastructure, or external elements. For example:
- A microservice starts failing
- An availability zone of the service provider goes down
- Primary and secondary on-call engineers are not reachable
While running a fire drill, it’s important that you respond to the incident as if it is a real incident. Only in this way will you be able to detect flaws in your processes and procedures.
Our Fire Drill Framework
At Container Solutions, we have created some guidelines that help us to define our fire drill scenarios.
Sometimes there is no need to create a fire drill scenario. You could simply replicate a previous incident that you have faced. This would give other team members who were not involved in the original incident exposure to the same problems. It could also help you see more clearly what went wrong the first time, and what could be done differently if something similar should occur again.
If you choose to work with a custom scenario you need to define a ‘story’ that makes sense. Remember to add context to your story and think where you want your team to go during the fire drill. Try to think in advance of the different paths the on-call engineer can take to solve a problem and prepare responses for their questions. Your story needs to have a well-defined end; it could be a definition of done to indicate that the problem is mitigated, or a limited amount of time.
The different elements of a fire drill are described below.

Roles
Roles are the people involved in the event. You can define roles for every possible element in real life. For a fire drill, you need at least two roles: one for the ‘game master’ and one for the ‘on-call engineer’. The game master is the person in charge of running the fire drill, while the on-call engineer is the person who is going to triage the incident. The game master needs to fully understand the scenario in advance, and the on-call engineer should ideally have no prior knowledge about the scenario.
Additionally, you can add a variety of roles that will help you to simulate participants in a real incident, such as:
- Support game master
- Secondary on-call engineer
- Service provider support engineers
- A friend you can call for advice
Building Blocks

It might take too much work and time to define a complex scenario from scratch. Our interest is not in crafting elaborate scenarios but in testing the incident-response procedures, so whenever possible we try to recycle the pieces of a fire-drill path. At Container Solutions, we have defined problems by level, and we combine these like building blocks to create different paths within the same scenario. This means you can reuse the infrastructure description or other items that required some work.
The levels of problems on which we build a fire drill scenario are:
- Application
- Platform
- Service Provider
- External
Scenarios built on these problems, whether from a single level or from several, will allow you to test different parts of an incident-response process. You can create a list of problems of each type and combine them, depending on the difficulty of the event you want to host.
For example, if you want to solve a simple problem just at the application level, you can create a scenario where the front-end application is not responding. But if you want to test how to react to unusual problems on the service-provider level, you can add things like, ‘a complete availability zone of the service provider is down’ or ‘the two on-call engineers are unreachable’ from the external problems level. The ‘unusual’ problems can help you to detect edge cases in your incident-response process.
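As a rough sketch of this building-block idea (all names and problem descriptions below are illustrative, not part of any real tooling), reusable problem definitions keyed by level can be combined into scenario paths of varying difficulty:

```python
# Illustrative sketch: reusable problem "blocks", keyed by level,
# combined into a fire-drill scenario path. All entries are invented.
PROBLEM_BLOCKS = {
    "application": [
        "Front-end application is not responding",
        "Pod A not working because Pod B can't read a volume",
    ],
    "platform": [
        "Cluster is not scheduling new pods",
    ],
    "service_provider": [
        "A complete availability zone of the service provider is down",
    ],
    "external": [
        "Primary and secondary on-call engineers are unreachable",
    ],
}

def build_scenario(picks):
    """Combine one or more (level, index) picks into a scenario path."""
    return [(level, PROBLEM_BLOCKS[level][i]) for level, i in picks]

# A simple drill at the application level only:
easy = build_scenario([("application", 0)])

# A harder drill mixing an unusual service-provider problem
# with an external one:
hard = build_scenario([("service_provider", 0), ("external", 0)])
```

Because the blocks are independent of any one story, the same application problem can reappear in several scenarios with different storylines around it.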
Scenario Description

For each piece of the scenario path you need a description of the problem. When building a scenario, a storyline to connect the blocks can be helpful. Try to be as clear and detailed as possible, so other participants can run this fire drill as game masters in the future. If the scenario is clear enough, you will be able to provide almost everything the on-call engineer asks for. It is important to define these details, but also important that you’re prepared to improvise when you don’t have a predefined answer.
Add things like:
- High-level overview of the system
- Triggering alerts
- How systems are affected
- Whether there are multiple alerts
- The root cause
- Location of the on-call engineer or others involved
This information is aimed at providing the game master and other participants with the necessary context for the fire drill. But don’t share everything with the on-call engineer: sharing the root cause defeats the purpose of the entire exercise.
Try to provide the on-call engineer with clear descriptions. These don’t need to be a detailed collection, but they should reflect the status of the problem described.
These could be the primary inputs for the on-call engineer:
- Missed service-level objectives (SLOs)
- Error codes
- Latency of service responses
- Broken services
- Log messages
- Alerts received
- A customer request through the service desk
The initial pager call could include some of the symptoms defined for the scenario. You can also provide more information as the fire drill develops, or when the on-call engineer asks for it.
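To make the opening page concrete, here is a minimal sketch (the service name, symptoms, and message format are all assumptions, not a real paging API) that turns a scenario's symptom list into the text of the initial alert, revealing only a subset of the symptoms up front:

```python
# Illustrative sketch: compose the initial page text from a subset of
# the scenario's predefined symptoms. All names here are invented.
def initial_page(service, symptoms, reveal=2):
    """Build the first alert message, showing only the first few
    symptoms; the rest are disclosed later or when the engineer asks."""
    shown = symptoms[:reveal]
    lines = [f"[FIRE DRILL] Alert for {service}"]
    lines += [f"- {symptom}" for symptom in shown]
    return "\n".join(lines)

page = initial_page(
    "checkout-frontend",
    [
        "SLO for request latency missed",
        "HTTP 5xx error rate is elevated",
        "Pod restarts increasing",
    ],
)
print(page)
```

Holding back some symptoms mirrors a real incident, where the full picture only emerges as the engineer investigates.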
Resources and Preparation
What could you expect the on-call engineer to interact with? This could include logs, monitoring-system screenshots, and documents.
The scenario should be as flexible as possible so the participant can move around and ask questions about multiple parts of the system. Resources could be obtained from past outages, or simple reference images or logs created by the author.
More important than having lots of logs, images, or precise responses is to have answers for most of the questions the participants might ask. That is why the game master should understand the scenario in order to provide high-level responses to participants. You need to anticipate questions like:
- What touched the problem area last? → version release, hardware upgrade
- What can I see in Grafana dashboards? → reference image or short descriptive answer
- Where are the logs? → log fragment
- Where is the documentation? → link to documentation
- Are there any other alerts going on? → No or yes (with details)
- What is the status of the pods? → Response with clues
- What do the logs of pod X say? → the logs show HTTP status codes xxx
- Is curl <load_balancer_ip> responding? → responding/not responding
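One lightweight way to prepare these answers (a sketch; the questions and responses below are placeholders, not from a real drill) is a simple lookup the game master consults, with a fallback that flags when improvisation is needed:

```python
# Illustrative sketch: canned answers the game master prepares, with a
# fallback marker for questions that must be improvised on the spot.
PREPARED_ANSWERS = {
    "what touched the problem area last?":
        "A version release went out 30 minutes ago.",
    "where are the logs?":
        "Log fragment: Pod B reports 'permission denied' reading a volume.",
    "are there any other alerts going on?":
        "No other alerts are firing.",
    "is curl <load_balancer_ip> responding?":
        "Not responding: the connection times out.",
}

def answer(question):
    """Return the prepared answer, or a marker that the game master
    must improvise within the bounds of the scenario."""
    key = question.strip().lower()
    return PREPARED_ANSWERS.get(key, "(no prepared answer: game master improvises)")
```

The fallback case is the point: a well-prepared scenario covers the likely questions, and everything else is answered in character, consistently with the story so far.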
Timeline

The game master or author of the scenario needs to plan how the fire drill is going to develop. This means setting an expected timeline and a list of events that should happen.
The author could also bring external elements into the scenario, like an angry customer messaging the team, or unexpected and unrelated alerts coming in while the on-call engineer is working to solve the problem.
Time is important during an incident; that is why it’s OK to jump ahead in the timeline. The game master can announce things like ‘20 minutes have passed’ in order to reach the desired point in the incident. The timing of external interactions should be defined by the scenario author.
The scenario can go in any direction that allows the engineer to ask all necessary questions. The game master should always try to drive the situation through a known path, or could add someone to help the on-call engineer, such as the ‘call a friend’ option. The game master might also deny help to the on-call engineer, if the idea is to test what happens in that case.
Here’s a sample scenario:
- Application → Pod A not working because Pod B can’t read a volume
- Platform → Normal status at this level
- Service Provider → Normal status at this level
- External → Customer is super annoyed and is spamming through communications channels.
- Definition of done → Pod B responds 200 to Pod A
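The same sample scenario could be captured as a small data structure (a sketch with invented field names) so it can be stored, reviewed, and reused by future game masters:

```python
# Illustrative sketch: the sample scenario above as reusable data.
# Field names and the time limit are invented for this example.
scenario = {
    "levels": {
        "application": "Pod A not working because Pod B can't read a volume",
        "platform": "Normal status at this level",
        "service_provider": "Normal status at this level",
        "external": "Customer is super annoyed and is spamming "
                    "through communications channels",
    },
    "definition_of_done": "Pod B responds 200 to Pod A",
    "time_limit_minutes": 60,  # hard stop even if not mitigated
}

def is_done(last_status_code):
    """Check the drill's stop condition against the latest
    status code the on-call engineer has obtained."""
    return last_status_code == 200
```

Encoding the definition of done explicitly makes it unambiguous, for both the game master and the participants, when the drill is over.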
In your chat platform, the fire drill then plays out as a conversation between the game master and the on-call engineer.
Running a Fire Drill
The fire drill session should be driven by the same guidelines defined in your incident-response process. For example, the fire drill should start with an alert from your real paging/alerting system.
The one difference between these fire drills and a real outage is that instead of using a terminal and interacting with real systems, the incident runs in a channel called #firedrill on your communication platform, and/or over video chat. In that channel, the on-call engineer chats with the game master and requests information. The idea is that the incident evolves as a discussion, but also that the on-call engineer can issue commands and expect a concrete, descriptive response to each one. The commands could retrieve information or apply changes.
For customer communication, you can use the channel #firedrill-client, where the game master or secondary game master could act as a customer.
You should also have a definition of done, or a ‘stop’ condition, to know when to finish the event.
We have described what a fire drill is and how it can help any SRE team to train engineers and test incident-response processes. We have also described how fire drills differ from ‘game days’.
Keep in mind the different types of problems and try to create generic abstractions, so you can reuse each block you create multiple times. Also, when building a scenario, having an objective, clear roles, and definitions of done can help create your narrative.
Running a fire drill means defining how much time the event should take, a definition of done, how the game master and the on-call engineer will communicate, and how other roles interact with the exercise.
As incident response is a wide topic and could be different for every organisation, we haven’t delved deeply into that subject here. But now you have a nice reference to test any process and help you to find any gaps in it.
Running fire drills will help you to achieve two key objectives of an SRE team: keeping your engineers trained, and improving your company’s incident-response management.
Now go start building and running your own fire drills.
Related Cloud Native Patterns
Use metrics as the primary data source to monitor application and system health.
Cloud native distributed systems require constant insight into the behavior of all running services in order to understand the system’s behavior and to predict potential problems or incidents.
Use runbooks to provide a set of instructions and documents that help to quickly identify the root cause of incidents.