Dashbored: Stop Looking at the Same Tired Data

Dashboards are everywhere. Every service needs a dashboard, every team uses at least one dashboard. We have high-level dashboards, low-level dashboards, platform dashboards, infrastructure dashboards; it’s dashboards all the way down. Needless to say, as soon as an incident occurs we load up our favourite trusted dashboard and start looking at the metrics, hoping we can gain some insight. Who else has looked around a conference hall like WTF is SRE and seen different people huddled around laptops pointing at a dashboard as they grapple with their latest incident? Who else has been one of those people?

So where did the dashboard come from? Well, the word dashboard originates from horse-drawn carriages, where it referred to a bit of wood used to protect the driver from mud or anything else for that matter. With the advent of the car, dashboards gained extra functionality as instruments such as a speedometer were added to them, allowing the driver to see useful data easily whilst driving. This idea of presenting information at a glance was then taken further in the cockpits of fighter planes, where the pilot needed every bit of useful information as part of their decision-making process.

By the 70s and 80s Decision Support Systems were trying to implement digital dashboards as a way to help decision making in businesses and in 1990 the Oxford English Dictionary recognised the dashboard as a “screen giving a graphical summary of several types of information, typically used to give an overview of a business organisation”. 

Within the tech industry during the 90s big changes were taking place: companies were moving away from treating each server as a precious thing that they knew and loved and had given a bespoke name, often based on Greek mythology or characters from The Simpsons. This was because, partly driven by the rise of the internet, distributed systems were gaining in popularity. By the late 1990s Service-Oriented Architecture existed, which is a forerunner of modern microservices architecture. As a result, monitoring moved from manually checking each server to providing a way to visualise KPIs for all servers in one place, and the way we did this was with... the dashboard!

When I started my journey with monitoring and observability tools around 2007 / 2008 I loved the idea of dashboards. I worked in a central team that helped other teams implement monitoring tools and I would tell them they needed to get themselves a dashboard so they could understand their service. But as I started to get more involved with actual incidents, and tried to use these dashboards in anger, I found that I was continually facing challenges. Whilst all that data looked great on a big screen, I wasn't able to translate the data the dashboard was displaying into why a service was broken or what needed to be done to fix it. It turned out I wasn't alone in this, and I started seeing the same pattern happen over and over again, like some sort of Groundhog Day. It went something like this:

  • Incident occurs, teams look at their dashboards
  • No-one understands what's wrong, despite investigating every spike they see on every dashboard
  • The investigation continues and eventually someone finds a piece of information, not on any dashboard, that gives insight into what happened
  • The incident is resolved
  • This piece of information gets added to the dashboard
  • Everyone feels reassured that we'll definitely catch similar issues next time

The problem with this is that services rarely break in exactly the same way twice, and as a result that priceless nugget of information from the last incident is unlikely to be useful for the next one. The teams were just adding more and more metrics to their dashboards, often with little context about what the metric was or what it told them. As a result they ended up with a *lot* of metrics, and every time they looked at their dashboards they had no clue what was a strong signal (if there was one) and what was noise. A lot of metrics on a screen during an incident leads to a lot of wasted time investigating the wrong thing - 'Metric X had a spike around the same time, maybe that's related to our problem?'. I'm clearly not alone in experiencing this, as Lili Cosic describes the exact same thing in her excellent blog on observability.

It's worth noting that sometimes this pattern would change and people would create a *new* dashboard rather than adding metrics to the existing one. This was no better though as all it meant was that people didn't know which dashboard to look at. Ever tried to find the right dashboard for a particular service when you've got hundreds to choose from? I have and it's not fun, particularly when (as we all know) naming things is hard and you end up with something like `Service X dashboard` next to `Service X dashboard updated` followed by `Service X dashboard 2.0`.

Unfortunately in our incidents people were turning to the same data over and over again, displayed in a convenient fashion, whilst the services they supported were constantly evolving and breaking in new and varied ways. They were looking at data for previous problems expecting to find the solution for an entirely new and different problem. 

This led to teams suffering from the 'watermelon effect' where the dashboards showed everything as green, yet beneath the surface everything was very much red! I can't count the number of times I saw incidents where teams were stuck because according to their view of the world everything was fine but the customer complaints were still rolling in. 
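To make the watermelon effect concrete, here's a minimal sketch of how an aggregate can stay green while one group of customers is completely broken. Everything in it is made up for illustration: the request counts, the segment names (`web`, `mobile`, `partner-api`) and the 98% threshold are assumptions, not data from a real incident.

```python
from collections import defaultdict

# Hypothetical request log as (customer_segment, succeeded) pairs.
# The numbers are invented purely for illustration.
requests = (
    [("web", True)] * 960
    + [("mobile", True)] * 25
    + [("partner-api", False)] * 15
)

SLO_TARGET = 0.98  # assumed threshold for showing a "green" tile


def success_rate(rows):
    """Fraction of rows whose request succeeded."""
    return sum(ok for _, ok in rows) / len(rows)


def rate_by_segment(rows):
    """Group rows by segment and compute each segment's success rate."""
    grouped = defaultdict(list)
    for segment, ok in rows:
        grouped[segment].append((segment, ok))
    return {segment: success_rate(group) for segment, group in grouped.items()}


# The single number a dashboard tile would typically show.
overall = success_rate(requests)
dashboard_is_green = overall >= SLO_TARGET

# The same data, broken down by a dimension nobody put on the dashboard.
per_segment = rate_by_segment(requests)

print(f"overall: {overall:.1%}, green: {dashboard_is_green}")
for segment, rate in sorted(per_segment.items()):
    print(f"{segment}: {rate:.1%}")
```

The overall tile stays above the threshold because the failing segment is a small share of traffic, yet every one of that segment's requests fails. You only see it by slicing the data along a dimension that wasn't chosen in advance.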

With the advent of Application Performance Management (APM) this wasn't supposed to happen! We had all the service information, we were showing the real-time data that related to actual user experience, and we had evolved from the old days of using infrastructure metrics like CPU and memory to try and approximate what the experience was. But alas, the dashboard culture had taken over and removed the need to gain experience in debugging systems and to build up in-depth knowledge of the way the different components interacted with each other. Complex systems and interactions were being simplified into 'a single pane of glass' and much of the context and detail was being lost along the way. No wonder people were struggling to work out what on earth was going on!

The way the incidents actually got solved was by involving experts in these services who had their own mental models of how things worked and who knew how to interrogate the service for more information based on those models. They were able to systematically work through the signals and debug the issue. They never even looked at a dashboard! They had built a kind of muscle memory: a seemingly instinctive ability to build a hypothesis about what had happened, find the data needed to prove it and consequently take corrective action. It quickly became apparent to me that the level of expertise and experience that a person had with the service, the quality of their mental models, and their willingness and ability to explore the data they were seeing were more important to resolving issues than any dashboard could ever be.

One of the speakers at WTF is SRE is Charity Majors, someone whose views and expertise have not only helped define my own and many others' thoughts on observability, but who has been largely responsible for observability being a thing in the tech world at all! When I first heard her say things like metrics dashboards should die I thought 'I'm not a fan of them but she can't mean this, can she?' but by the time I came across tweets like this one I was a convert:

Dashboards absolutely have their place in the tech industry. They're great for telling a specific story to a specific audience, and I see a lot of value in reviewing an SLO dashboard each week, for example, but they're not for debugging. We've reached a point where it is no longer possible to neatly summarise everything we need in one place. The very reason for using dashboards in the first place is no longer achievable because there's simply too much going on in a modern service to ever be able to fit it all onto one screen. Not only that, but with every deployment new features are being released, new potential causes of failure are added, and as an industry we're deploying more frequently than ever!

Let's instead focus on the state of our service now, using the data relevant for this incident. Let's be willing to explore our data, and not wed ourselves to a few metrics we once ordained as the most important. If we find a new metric that was useful this time, save the query by all means but don't add it to a dashboard! Let's face it, it may never be useful to us again. All in all let's stop looking at the same tired data every time we get paged. 
