Benjamin Franklin once said “The only two certainties in life are death and taxes”. If he had been an engineer, he would probably have added a third to that list: outages. Engineers at Facebook would doubtless agree after the recent outage, seemingly caused by a command that unintentionally took down all the connections in their backbone network and shut down not only Facebook, but also WhatsApp and Instagram.
During an outage you’ll see engineers frantically trying to figure out what is going on. Stress levels are high, and pressure comes from all sides as the business loses money. If the company handles this badly, the extra stress can stop the engineers from solving the problem. It can also cause reputational damage, the risk of engineers and others quitting... and CEOs losing billions in investment.
When the shit is hitting the proverbial fan, one of the first things you probably want as an engineer is a reassuring hug, whether physical or virtual. This year's Nobel Prize in Physiology or Medicine went to research on how we sense temperature and touch, which is about as close to the science of a good hug as a Nobel gets, and hugs have been shown to be good for your health and to improve your ability to bounce back from a crisis. No wonder, then, that when an outage happens Twitter is inundated with the hashtag #HugOps. Engineers understand that the team fixing the issue is dealing with high levels of stress, and a show of support, even just tweeting #HugOps, can make a real difference.
How would you go about making HugOps a real practice inside your company, and not just something you tweet?
Start with psychological safety. This easy-to-talk-about but hard-to-develop team skill is about making sure people feel safe to raise issues, and secure enough in both their job and themselves to know they won’t be punished for doing so. It relies on effort from the whole team to hold each person accountable when they aren’t allowing others to feel safe.
If you go into a postmortem looking for who was responsible, then all the anxiety your engineers were feeling is justified. Structuring your postmortem to look for root causes, not root people, will allow you to focus on improving the system without the stress. If the outage was caused by “human error”, look deeper - why was the error possible? A poor UI, or a lack of checks and balances in a system or process, likely allowed a mistake to make it to production. Ideally you’d have a cool, calm chairperson who has taken their own emotion out of the situation, helps others reflect, and can deal with those who come into the postmortem with emotions still running high.
Mike Tyson said it best: “Everybody has a plan until they get punched in the face”. To paraphrase, Iron Mike is talking about how the plans we make when we’re calm and things are going well don’t hold up when everything goes to hell. Chaos Engineering is designed to predict how the world might change, using both science and art. The art comes from Futures Thinking processes, which look at the world of the possible. The science is asking your system ‘what if’ questions based on these futures: “What happens to my data centre if the available energy drops to 10%?” or “What happens to my company if us-east-1 goes down?”
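To make the ‘what if’ idea a little more concrete, here is a minimal sketch in Python. It is a paper exercise, not a real chaos experiment: it only asks which services in a made-up inventory would still have somewhere to run if one region disappeared. The service names, the region lists and the `what_if` helper are all assumptions invented for illustration. Real Chaos Engineering goes further and injects the failure into a running system, in a controlled way, to see what actually happens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    regions: frozenset  # regions where this (hypothetical) service runs


# Hypothetical inventory, invented for this example.
SERVICES = [
    Service("checkout", frozenset({"us-east-1", "eu-west-1"})),
    Service("login", frozenset({"us-east-1"})),
    Service("search", frozenset({"eu-west-1", "ap-southeast-2"})),
]


def what_if(services, failed_region):
    """Map each service name to 'survives' or 'down' if failed_region is lost."""
    outcome = {}
    for service in services:
        # A service survives if at least one of its regions is still up.
        remaining = service.regions - {failed_region}
        outcome[service.name] = "survives" if remaining else "down"
    return outcome


if __name__ == "__main__":
    for name, result in what_if(SERVICES, "us-east-1").items():
        print(f"{name}: {result}")
```

In this toy inventory, losing us-east-1 takes down `login`, because it runs nowhere else. That is exactly the kind of finding you want to make on a calm afternoon rather than in the middle of an outage.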
So how would you know a company is doing HugOps well? We think low anxiety, high accountability, great team empathy and a habit of trying to predict the future are all good indicators. Conversely, being yelled at by the boss before 20 engineers suddenly become unemployed is not.