What SRE Teams Can Learn from Business Continuity and Vice Versa

Keeping our software up and running isn’t so different from keeping our organisations functional. We can learn from each other and use the same techniques.

I started writing code in 1999, just in time for the millennium bug. For a long time, I wrote code, checked it in, and that was pretty much it. Then, as we started to adopt distributed architectures, move to the cloud, and release code more often, I needed to broaden my skills to include deployment pipelines, infrastructure, observability, and more.

But then I took a new role at the Financial Times, as the Tech Director in charge of Operations & Reliability. What I didn’t know until after I started was that I now also needed to know and care about business continuity.

Essentially, business continuity is about making sure your business continues to function when something goes wrong, whether that is an earthquake, a bomb scare or a pandemic.

It means making plans for what happens if you lose access to your office, your electricity, your network, and your critical software systems, but also means understanding what you would do if half your staff were off sick at the same time.

It’s about having a backup plan, and knowing what is critical and what can be left not quite working for a while. Which sounds a lot like reliability engineering!

And it turns out that some of the techniques we use to make sure our software stays functional also work for the wider organisation. For me, the learning went both ways: companies have been planning for business continuity for a long time, and have experience of what does and doesn’t work.

Let me start by talking you through some of the FT’s business continuity considerations.

Can we print the newspaper?

Printing the newspaper is not the only critical functionality for the FT, but it combines a tight time schedule with a pretty big impact if we get it wrong, so it is a major focus. The FT has never missed a print edition, which was a terrifying thing to learn as the new owner of the business continuity plan.

For the Financial Times, there is a critical period in the late afternoon/early evening when the print newspaper is being put together. Newspaper printing and distribution is a complicated business where physical copies go through a fanout process that often ends up with someone in a van driving bundles round. There are fine margins of error and if you miss the print deadline, lots of people won’t get their paper.

During this critical afternoon period, we have maybe 30 minutes to respond to any major issue and “fail over” to the backup plans.

Until fairly recently, the FT’s plan for loss of the office, power, network, etc. was a disaster recovery site 20 minutes’ walk from our office in London. We held a permanent suite there for a small group, the people essential to publishing the newspaper.

We also had a larger suite that could be set up within a day with space for around 10% of staff. Each department would pre-nominate the ‘critical’ people they would need in order to provide “good enough” business functionality if we couldn’t access our office for several days.

We had similar setups in place for other offices, notably our office in Manila.

It may not sound too likely that you would lose your ability to work in your office, but in just four years both a volcano and an earthquake affected our Manila office, we had a bomb scare in London where we were all evacuated for hours, and we were only a few hundred metres outside the cordon for the Borough Market terror attack. Had we been inside that cordon, we would have had up to a week of restricted access to our office. So, these things happen pretty regularly.

And then of course, the big one: the coronavirus pandemic.

Covid-19: big challenges

Early in 2020 our first office shut, in Hong Kong. If, like me, you were following the news closely, it seemed very likely that it was coming our way, in all our locations.

It was pretty clear our backup site wasn’t the solution here: we needed a plan for responding to a lockdown, which meant people working from home, safely isolated.

Luckily the FT in London, our biggest office, had finished an office move in mid 2019, and most people now had laptops and VoIP softphones. We had made many things available over the public internet rather than via our office network. Lots of people worked one day a week from home. All these changes helped our move to get everyone working from home.

Our biggest concern was about the impact of minor snags hitting everyone at once. A small percentage of a big number can still have a big impact. We thought our service desk would be overloaded, and we might have teams where everyone was struggling to be able to work effectively, with a clear risk if that meant the company couldn’t carry out some critical task.

To mitigate that, in mid February we started asking one department at a time to work from home for a day.

Doing this as a drill meant that if people really were stuck, they could head in to the office and get sorted out. And it forced departments to think carefully about how to handle their ‘in person’ requirements — for example, signing of cheques.

We also started anticipating lockdown at very short notice, by asking people to make sure they took their laptops home with them overnight, and reminding people about this regularly.

By the time we made the call to work from home in London, a week ahead of the government lockdown, every department had gone through the drill and the process went pretty smoothly.

There are many things that business continuity shares with operations and reliability, and one of those things is that you may have to run in degraded mode for a while.

They are also similar in that reality may not quite match what you planned for. And in both, you handle incidents under stress, and to do that, effective communication and a shared understanding are crucial.

I’m going to cover some of the things I see as particularly valuable to bear in mind, whether you are looking at the reliability of your software or of your organisation.

Understand what is critical functionality

Having a shared understanding of what is critical is really important, because you have to make decisions quite quickly in a crisis. You need to be acting on information you already understand.

A few years ago, we tried to categorise all the business functionality flows in the company that relied on software somewhere — which is of course basically all of them. We started by identifying those flows that are brand critical, i.e. if we couldn’t do this, it would impact people’s perception of our brand. Publishing the newspaper and keeping our paywall secure fit here, as well as keeping the website available.

Then we looked at business critical stuff. This is less likely to be a big story, but would stop us from doing our jobs. Losing customer care tools or finance systems falls in this area.

We linked each of these to a service tier. Brand critical is platinum level, and that comes with a high level of reliability requirements: for example, we expect these systems to run in multiple cloud regions, rather than just in multiple availability zones. We also expect them to be fixed out of hours, with development teams available on call. Business critical is gold level, and comes with slightly lower reliability requirements.

The FT also has a bronze level tier, where many systems live. If these break, they are not fixed outside of normal working hours. They are less likely to be in multiple regions and while they are likely to be in multiple availability zones, this is as much for zero downtime releases as anything else.

These kinds of distinctions really help you focus on the most important stuff. It’s common to have multiple issues happening at once, and we would look at things based on the service tier. This is an exact parallel to the way that business continuity focuses on the brand critical stuff first — and actually, that was a good guide to us when trying to assess where things sat: does this software get used by people who are set up with access to the disaster recovery site?
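As a rough illustration of looking at simultaneous issues by service tier, here is a minimal sketch. The `Issue` type, the system names and the "oldest first within a tier" tie-break are my assumptions for the example, not a description of the FT's actual tooling.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Service tiers from the text; a lower value means more urgent."""
    PLATINUM = 1  # brand critical
    GOLD = 2      # business critical
    BRONZE = 3    # not fixed outside normal working hours


@dataclass(frozen=True)
class Issue:
    system: str        # hypothetical system name
    tier: Tier
    minutes_open: int


def triage(issues):
    """Order open issues by tier, then oldest first within a tier.

    The tie-break on age is an assumption made for this sketch.
    """
    return sorted(issues, key=lambda issue: (issue.tier, -issue.minutes_open))


open_issues = [
    Issue("internal-wiki", Tier.BRONZE, 90),
    Issue("paywall", Tier.PLATINUM, 5),
    Issue("customer-care", Tier.GOLD, 30),
    Issue("website", Tier.PLATINUM, 20),
]

for issue in triage(open_issues):
    print(issue.tier.name, issue.system)
```

The point of encoding the tier up front is that nobody has to argue about priority in the middle of an incident: the ordering falls out of decisions already made.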

There is another aspect of understanding what’s critical that isn’t easily captured when talking about e.g. five nines of availability, which is the key times when the system is needed: lots of processes are similar to newspaper production in that they are critical only at specific times.

A common example for many companies is the payroll run. It needs to happen at the end of the month and if the finance systems are down at that point, the impact is big. No one wants to miss paying their staff.

Other examples at the FT include publishing to the website: new content is published regularly but an outage on a bank holiday has much less impact than on budget day, and you may approach the investigation and fix differently.
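Recurring critical periods like these can at least be written down, so that whoever is handling an incident can check whether they are in one. The sketch below is one way to do that. The `CriticalWindow` type, the 16:00 to 20:00 hours and the Monday to Saturday schedule are all hypothetical, chosen only to make the example concrete.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class CriticalWindow:
    name: str
    start: time
    end: time
    weekdays: frozenset  # Monday is 0, matching datetime.weekday()

    def contains(self, moment: datetime) -> bool:
        # Only handles windows that start and end on the same day.
        return (
            moment.weekday() in self.weekdays
            and self.start <= moment.time() < self.end
        )


# Hypothetical hours and days, for illustration only.
PRINT_PRODUCTION = CriticalWindow(
    name="print production",
    start=time(16, 0),
    end=time(20, 0),
    weekdays=frozenset(range(0, 6)),
)


def active_windows(moment, windows):
    """Return the names of the critical windows that contain this moment."""
    return [w.name for w in windows if w.contains(moment)]


print(active_windows(datetime(2024, 3, 6, 17, 30), [PRINT_PRODUCTION]))
```

A monthly window such as the payroll run would need a date-based rule rather than a weekday one, but the idea is the same.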

More tricky to capture is when the editor is about to publish an exclusive story and you have an issue: you can only know that it’s a critical time by building good relationships with your stakeholders.

Work out what is acceptable in the short term

One thing business continuity planning teaches you is to work out what you can accept when you are in ‘crisis mode’.

If we had to evacuate our office building because we lost our power and were working off backup generators, we wouldn’t expect a standard day’s work. Similarly, when we sent people to work from home, we were expecting some people to be in less than optimal conditions. Maybe they didn’t have a quiet place to work, a desk, a monitor, or a comfortable chair.

Think back to early 2020. You probably thought — as I did — that we would be out of the office for a month or two. But at some point, we realised this was going to be the way we were working for a longish time. So, the move to WFH had two phases. Get everyone out of the office, and then get everyone set up properly.

When dealing with software incidents, I often find that developers are focused on finding out what has gone wrong and fixing it. But that ignores something important, which is mitigation.

Even if you don’t yet understand what is happening, there are things you can try that might help. That can be a failover to a region that doesn’t have the weird error spike, or adding more instances to see if this is caused by load.

It can also be a decision to fall back to a reduced set of functionality: one option with a paywall is to turn it off if there are problems, and give anyone access, even non-subscribers. This is a lot easier to do during an incident if you have planned for it in advance. For the FT, “failing open” was a resilience choice. Netflix similarly architect for graceful degradation, for example by using non-personalised or cached results.
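To make "failing open" concrete, here is a minimal sketch of the decision. The names `can_read`, `check_subscription` and `EntitlementServiceError` are invented for the example; this is not the FT's paywall code.

```python
class EntitlementServiceError(Exception):
    """Raised by the (hypothetical) subscription check when it cannot answer."""


def can_read(user_id, check_subscription, fail_open=True):
    """Decide whether a reader gets access to an article.

    check_subscription is any callable that returns True or False, or
    raises EntitlementServiceError when the paywall backend is unhealthy.
    With fail_open=True, a broken backend means everyone gets in.
    """
    try:
        return check_subscription(user_id)
    except EntitlementServiceError:
        return fail_open


def healthy_check(user_id):
    return user_id == "subscriber-1"


def broken_check(user_id):
    raise EntitlementServiceError("entitlement service timed out")


print(can_read("subscriber-1", healthy_check))   # normal path: subscriber
print(can_read("anonymous", healthy_check))      # normal path: non-subscriber
print(can_read("anonymous", broken_check))       # degraded path: fail open
```

The `fail_open` flag is the part worth deciding in advance. Whether a broken paywall should let everyone in or keep everyone out is a business decision, and an incident is a bad time to make it.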

I also recently saw a talk from Vitor Pellegrino and Anderson Parra of SeatGeek about high-demand ticket sales, and they were very clear that SeatGeek has a different mode of operation under extreme load, where you open a waiting room and have people queue for access.

This isn’t something you’d want as your standard mode, but it’s a great approach for really spiky demand.
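The waiting room idea can be sketched in a few lines. This is a toy in-memory version written for illustration, with a hypothetical `WaitingRoom` class and a fixed capacity. It says nothing about how SeatGeek actually built theirs.

```python
from collections import deque


class WaitingRoom:
    """Admit visitors up to a capacity and queue the rest in arrival order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.active = set()
        self.queue = deque()

    def arrive(self, visitor):
        """Return 'admitted' or 'queued' for a newly arriving visitor."""
        # Nobody jumps the queue, even if a slot happens to be free.
        if len(self.active) < self.capacity and not self.queue:
            self.active.add(visitor)
            return "admitted"
        self.queue.append(visitor)
        return "queued"

    def leave(self, visitor):
        """Free a slot and return the visitors admitted from the queue."""
        self.active.discard(visitor)
        admitted = []
        while self.queue and len(self.active) < self.capacity:
            next_visitor = self.queue.popleft()
            self.active.add(next_visitor)
            admitted.append(next_visitor)
        return admitted


room = WaitingRoom(capacity=2)
print([room.arrive(v) for v in ["a", "b", "c", "d"]])
print(room.leave("a"))
```

Under normal load the queue stays empty and the waiting room is invisible. It only changes behaviour once demand goes past what the system can serve.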

Prepare

Graceful degradation is often a consequence of scars from a previous incident. For example, the data for the FT’s operational runbooks (the things that explain how to troubleshoot and fix systems) lives in a graph database that is not multi-region. To ensure a higher level of reliability, the data is regularly extracted and put into S3 buckets in multiple regions.

However, when we lost our single sign-on system, we discovered we couldn’t access the runbook for it! This led us to add an additional fairly basic level of backup, which was zipping those files and sticking them in Google Drive.
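The layered backups described here amount to trying a list of sources in order until one answers. A minimal sketch of that pattern follows. The store names, the `StoreUnavailable` exception and the fake fetch functions are all made up for the example, and real stores would sit behind real client libraries.

```python
class StoreUnavailable(Exception):
    """Raised when a runbook source cannot be reached (hypothetical)."""


def fetch_runbook(system_code, stores):
    """Try each (name, fetch) pair in order and return the first success.

    Returns a (store_name, runbook_text) tuple, or raises StoreUnavailable
    listing every failure if no store could answer.
    """
    failures = []
    for name, fetch in stores:
        try:
            return name, fetch(system_code)
        except StoreUnavailable as exc:
            failures.append(f"{name}: {exc}")
    raise StoreUnavailable("; ".join(failures))


def primary_behind_sso(system_code):
    # Simulates the failure in the story: the primary needs the broken system.
    raise StoreUnavailable("single sign-on is down")


def offline_copy(system_code):
    return f"Runbook for {system_code}: step 1, check the status page"


source, text = fetch_runbook(
    "sso",
    [("primary", primary_behind_sso), ("offline-copy", offline_copy)],
)
print(source)
```

The ordering matters less than the independence. A fallback only helps if it does not depend on the thing that has just broken.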

This is where chaos experiments really help, i.e. you deliberately simulate a failure that you think your system can handle because of the resilience you’ve built in. Even just the process of planning a chaos experiment can give value as you think through what you expect will happen.

For example, about 5 years ago the FT was still operating in data centres and we wanted to test that we could successfully run out of just one.

My team was writing a new system, which didn’t yet need high levels of resilience, and we were not running in both data centres. In fact, we were running in the data centre that was going to be cut off for the test. We knew we would be down for the duration of the test and that was fine. But when we went through the flows on a whiteboard, ahead of the test, we realised there was code in the active data centre that would call our system, and that call would time out.

Unfortunately for us, this call was from a critical system, the main content management system used to publish articles. Even worse, this timeout was set to 10 minutes and hidden deep in the configuration of a message queue, meaning that any journalist publishing a story during this test would be blocked from doing anything else for quite a long time.

We would never have considered this without knowing that this test was coming up, but thinking about it meant we had the chance to implement a workaround in time.
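One general way to stop a non-critical dependency from blocking a critical system is to put a short, explicit deadline on the call and carry on without the result. The sketch below shows that idea using Python's standard library. It is not the workaround we actually implemented, and the function names and timings are invented for the example.

```python
import concurrent.futures
import time

# A small shared pool; the size is arbitrary for this sketch.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_deadline(fn, timeout_seconds, fallback):
    """Run fn in a worker thread and wait at most timeout_seconds for it.

    Returns fn's result, or fallback if the deadline passes. The worker
    thread is not cancelled; the caller simply stops waiting for it.
    """
    future = _POOL.submit(fn)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        return fallback


def notify_unreachable_system():
    # Stands in for a call to a system in the cut-off data centre.
    time.sleep(0.3)
    return "notified"


# The publish flow waits 50 milliseconds instead of 10 minutes.
print(call_with_deadline(notify_unreachable_system, 0.05, "skipped"))
```

The choice of fallback is the important design decision: it is only safe to skip the call if the critical flow genuinely does not need the answer.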

Similarly with business continuity, preparation pays off. The move to WFH for the FT benefited from several levels of up-front preparation.

Firstly, our IT department had made those changes to more flexible working: laptops, softphones, web access to software. That was a strategic choice and it paid off. We would have had a much more difficult switch if the pandemic had hit just a year earlier.

But secondly, two of our biggest departments had previously run the drill for working from home, meaning we had a template for doing this and could also focus on those departments who had never run the drill. Despite turnover and office changes, we still felt secure that these departments would have enough people already set up for successful WFH.

Try things out

My own department, Product & Technology, was the second to run a work from home test, and largely copied the approach of the other department: announce the test up front, set a day, and ask everyone to make sure they took their work laptop home.

However, unlike the first department, we didn’t check in with everyone upfront; we just shared what was going to happen and set up a Slack channel for any questions or concerns. This really reduced the effort, since all we had to do was agree on a date and let external stakeholders know.

On the day, a few people had issues, but most people were ok and used the channel to share pictures of where they ended up working from—including a bar, a train, many gardens.

As expected, some people couldn’t work from home: several were mid house move and had no wifi; others had nowhere to work that was distraction free. We told them that this was absolutely fine, it was what we expected—but that did mean making sure they didn’t get hassle from their managers. A few people travelled into the office because of commitments with external people or because they knew they couldn’t work from home, which again was fine by us.

Keep things simple

There are many scenarios where you might end up at home without your work laptop. You could be away from your desk when a fire alarm goes off and have to leave without it. Or you could get told overnight that you can’t come in to work the next day.

A ‘full’ test would have sprung this on people with no notice, but that’s a lot more disruptive, and you still get a lot of value from giving people advance notice.

In fact, we did ask people to put their work laptop to one side for a couple of hours in the middle of the day and use whatever they had available. This led to some people spending a couple of hours setting up their development environment on their home laptop. Others chose to work on other tasks.

A few people told me they didn’t have a home laptop and asked if the FT would buy one if this was our business continuity plan. My answer was that actually, our business continuity plan is that enough people can work from home rather than everyone, but that if the issue went on for days we’d work out how to send work laptops home or find some other solution.

Expect your predictions to be wrong

We don’t have perfect information: your predictions will very likely be wrong.

In March 2020, I thought one of our biggest challenges of the pandemic was going to be the impact of sickness on teams. My head of operations and I spent a lot of time working out the absolute minimum number of people we could run a first line support team with, and where we could bring in extra incident managers, etc. But we were very lucky, possibly because the FT sent people into lockdown early. We didn’t have lots of people off sick at once.

On the flipside, generally when you have multiple offices, part of your business continuity plan relies on passing work off to an unaffected office. There were times in 2020 when there were no unaffected offices, with lockdowns meaning access was severely restricted in all locations. We hadn’t predicted that, and had to find other solutions for things that required in-person activity, or access to specialist equipment.

Communicate

Whether you are thinking about business continuity or resilience, communication is important. Part of that is choosing the right communication channel.

Around the time I started being part of our business continuity group, we set up a WhatsApp group as the hub for working out what to do. This recognised that when things go wrong, people are far more likely to have a phone with them than a laptop, and even if they do have the laptop no-one wants to have to balance it on a bin while at the evacuation point. WhatsApp is on most people’s phones, where Slack may not be.

The people coordinating our response, in that WhatsApp group, would send information to their departments in whatever way worked. Different departments like different communication tools!

For operational incidents, Slack was the primary channel. It’s where the tech team spends most of their time and we were using a Slack bot to manage incidents (based on Monzo’s open-sourced Response bot).

But you do also need a backup plan for when you lose your main communication channel. The first time Slack goes down, you realise you need to plan for being able to run an incident without it. In the FT’s case, that was via Google Chat (or whichever equivalent Google had in place at the time).

Once you are all in the right channel(s), you need to get talking.

You need to communicate repeatedly, and you should still expect someone to miss the communication.

For example, we still had multiple people turn up to the office on the work from home test days. They hadn’t seen—or maybe hadn’t registered—any of the communication!

Similarly, even though we have a system for emergency information via text, some people still turned up to work the day after we locked down, because they didn’t read email or Slack out of hours and hadn’t got the text.

The same sort of thing happens when you run a chaos experiment. Maybe you are taking down an API to test the website’s ability to degrade gracefully. We did that once and a developer on the API team restarted it within minutes because they saw the alerts but they’d missed the comms about the experiment!

Almost every problem is different

Business continuity and reliability engineering are both about responding to unexpected events, and almost every problem is different.

Even when you have a similar cause—for example, power is out in an office—you will likely have to respond differently. Losing power at 5 in the morning might mean you prioritise stopping people from coming to work. Losing power at 3 in the afternoon might mean working out how to get people out of the lifts!

That doesn’t mean you can’t prepare yourselves. Set up the communication channels, write templates and runbooks, build relationships, try things out, and make sure the right people are involved. And then enjoy the fun!