Mark Coleman is blogging live from Velocity 2014 Barcelona for Container Solutions.
Ryan’s talk, “It’s 3AM, do you know why you got paged?”, looks at strategies for dealing with the increasing amount of monitoring information coming from production systems. Ryan notes that he has proved that “he sleeps the deepest” the night after being on call, and attributes this to ‘alert fatigue’.
Part of this alert fatigue stems from high-volume, low-quality alerts, which leave operations engineers in a constant state of “being on edge”.
Ryan suggests that the way to mitigate this issue is to add ‘context’ to our alerts. There are some simple ways to do this, and colour is one. Everybody is familiar with a traffic light: green means go, red means stop. Interestingly, Ryan also notes that a traffic light provides information on two layers: the primary, colour, and the secondary, the position of the lights. A colour-blind driver can still know when the light is green, because the green light is always at the bottom.
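To make the two-layer idea concrete, here is a minimal sketch of an alert line that carries its severity in both a colour and a text label, so the meaning survives even if the colour is not seen. This is my own illustration, not something shown in the talk: the `SEVERITY_STYLES` table, the label names and the `format_alert_html` function are all hypothetical.

```python
from html import escape

# Hypothetical severity table: every level has a colour (primary layer)
# AND a text label (secondary layer), like the position of a traffic light.
SEVERITY_STYLES = {
    "ok": ("green", "OK"),
    "warning": ("orange", "WARN"),
    "critical": ("red", "CRIT"),
}


def format_alert_html(severity: str, host: str, check: str) -> str:
    """Render one alert line for an HTML email.

    The label is still readable if the colour is lost, for example for a
    colour-blind reader or a plain-text mail client.
    """
    colour, label = SEVERITY_STYLES[severity]
    return (
        f'<span style="color: {colour}">[{label}]</span> '
        f"{escape(host)}: {escape(check)}"
    )


if __name__ == "__main__":
    print(format_alert_html("critical", "web01", "disk_space"))
```

The point of the design is simply that no single channel carries the severity on its own.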
Apart from making alerts easier to understand with context such as colour or symbols, Ryan notes that “computers can augment”. Augmentation, Ryan says, is increasing the capability of a person to approach a complex problem situation.
Ryan introduced the idea of OODA loops (Observe, Orient, Decide, Act) and suggested that by including better contextual information in our alerts, we can help operations engineers to tighten this loop. An important axis of context here is history: some alerts are much easier to understand when you can see how the metric has behaved in the recent past.
The talk then moved on to specifics, looking at using nagios-herald to add contextual information to alerts before they are sent out. nagios-herald can run commands on the machine that is registering an alert, and can also query other data sources such as logstash or graphite. Ryan showed a couple of examples of how context taken from various data sources could turn seemingly illegible alerts into easy-to-understand ones.
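The general pattern of enriching an alert before it goes out can be sketched in a few lines. This is an illustration of the idea only and does not use nagios-herald's actual API (nagios-herald itself is written in Ruby); the `enrich` and `historical_context` functions, the alert fields and the sample values are all hypothetical, and the history is passed in as a plain list rather than fetched from graphite.

```python
from statistics import mean
from typing import Callable, Dict, List

Alert = Dict[str, object]
# A context source takes the alert and returns one line of extra context.
ContextSource = Callable[[Alert], str]


def historical_context(history: List[float]) -> ContextSource:
    """Build a source that compares the current value with recent history.

    In a real setup the history might come from a metrics store; here it
    is just a list of numbers supplied by the caller (an assumption).
    """
    def source(alert: Alert) -> str:
        avg = mean(history)
        return (
            f"Current value {alert['value']} vs. average {avg:.1f} "
            f"over the last {len(history)} samples"
        )
    return source


def enrich(alert: Alert, sources: List[ContextSource]) -> str:
    """Return the alert text with one extra line per context source."""
    lines = [f"{alert['state']}: {alert['check']} on {alert['host']}"]
    for source in sources:
        try:
            lines.append(source(alert))
        except Exception as exc:
            # A failing context source must never block the alert itself.
            lines.append(f"(context unavailable: {exc})")
    return "\n".join(lines)


if __name__ == "__main__":
    alert = {"host": "web01", "check": "disk_space",
             "state": "CRITICAL", "value": 95.0}
    print(enrich(alert, [historical_context([60.0, 62.0, 64.0])]))
```

Catching errors per source is a deliberate choice here: extra context is a nice-to-have, and the page still has to arrive when a data source is down.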
The most interesting question in this area, I think, is how far we can improve our monitoring before we need to start handing the Decide and Act steps of the OODA loop over to computers too. There are already some projects looking at this problem area.