When something unexpected happens in a system, you probably want to know about it - and if it is causing a change in your buffer, you definitely want to do something about it.
Alerts give you the context you need to act when unexpected things occur and sufficient lead time to be effective when you act.
A good alert shouldn’t tell you something that you already know. If an application is down, you don’t need to be constantly reminded of it. You just need sufficient context to bring it back up.
Contrasting alerting with monitoring, we could say that while monitoring gives us data, alerting gives us information.
Alerts can be broadly divided into two categories.
The first category is alerts that are surfaced when something is anomalous. These alerts are created when a behaviour or change occurs that deviates from what is normal or expected in your system.
As an example, you might have an anomaly detection query in place that looks for servers or applications performing outside of their normal “safe” bounds (either by responding to requests slowly, or by using excessive amounts of computing resources). When this query finds an anomaly (as defined by you), it can surface an alert with sufficient context about where the anomaly is occurring and what you might need to do to resolve it.
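To make that concrete, here is a minimal sketch of such a check in Python. It is not tied to any particular monitoring system: the function and field names are illustrative, and the “safe” bound is a deliberately simple three-sigma threshold over recent history.

```python
from statistics import mean, stdev

def latency_anomaly_alert(server: str, history_ms: list[float], latest_ms: float) -> dict | None:
    """Return an alert payload if the latest latency is outside the server's normal bounds.

    The "safe" bound here is a simple three-sigma threshold over recent history;
    a real system would tune this per service.
    """
    upper_bound = mean(history_ms) + 3 * stdev(history_ms)
    if latest_ms <= upper_bound:
        return None  # behaviour is within normal bounds - no alert needed
    return {
        "summary": f"{server}: latency {latest_ms:.0f}ms exceeds normal bound of {upper_bound:.0f}ms",
        "context": "Check recent deploys and resource saturation on this host.",
    }

# Example: a server that normally responds in ~100ms suddenly takes 400ms.
print(latency_anomaly_alert("web-01", [95, 102, 98, 110, 105], 400))
```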
The second category is alerts that are surfaced when someone needs to do something. These alerts lend themselves well to change detection and change management processes, where systemic changes require an operator to act.
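One illustrative way to surface this second kind of alert is to compare a fingerprint of a component’s configuration against its last known state, and raise an alert that names the action required. The sketch below uses hypothetical component and field names.

```python
import hashlib
import json

def _fingerprint(config: dict) -> str:
    """Stable hash of a configuration blob."""
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()

def change_alert(component: str, previous: dict, current: dict) -> dict | None:
    """Return an alert asking an operator to act, if the component's configuration changed."""
    if _fingerprint(previous) == _fingerprint(current):
        return None  # nothing changed - nothing for anyone to do
    return {
        "summary": f"{component} configuration changed",
        "action": "Review the change and confirm it followed the change management process.",
    }

# Example: a database's replication setting was flipped.
print(change_alert("database-01", {"sync_replica": False}, {"sync_replica": True}))
```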
Alerts can serve many purposes - but at the core of their utility is their ability to enable your organisation to respond.
In instances when something goes wrong (your buffer or margin of safety has run out due to an incident), an alert can give you the impetus to react to failure.
Beyond reactive responses, alerts can let you know when your buffer or safety margin is close to running out. This allows you to respond to incidents proactively and prevent failure.
Finally, alerts can serve as informational tools that give the correct operators context on occurrences that they need to know about.
Perhaps there is a critical event that has impacted the underlying infrastructure your systems run on (a region or zone outage). Alerting in this scenario can give operators enough context so that they can action failover or recovery procedures.
Another use case for alerting is follow-ups. In scenarios where a change management process is protracted and requires follow-up after a period of time has elapsed (or after some other piece of the puzzle has fallen into place), alerts can let the correct people know at the correct time - so that they can do what they need to do.
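A follow-up alert can be as simple as a periodic check over outstanding changes. The sketch below is illustrative only, assuming a hypothetical list of completed changes with timestamps and owners.

```python
from datetime import datetime, timedelta, timezone

def follow_up_alerts(pending_changes: list[dict],
                     follow_up_after: timedelta = timedelta(days=7)) -> list[dict]:
    """Return follow-up alerts for changes whose review window has elapsed."""
    now = datetime.now(timezone.utc)
    due = []
    for change in pending_changes:
        if now - change["completed_at"] >= follow_up_after:
            due.append({
                "summary": f"Follow up on change {change['id']}",
                "owner": change["owner"],
                "action": "Confirm the change is still behaving as expected, then close it out.",
            })
    return due

# Example: a change completed eight days ago is now due for follow-up.
changes = [{"id": "CHG-42", "owner": "db-team",
            "completed_at": datetime.now(timezone.utc) - timedelta(days=8)}]
print(follow_up_alerts(changes))
```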
To ensure that your alerts are actionable and high context, it is useful to alert at the component level. By doing this, you can give an operator all the tools they need to do something about the alert.
As an example, envision a scenario where a database component is producing a large number of errors related to write failures. It might also have failed several health checks in the last hour, and was recently changed to synchronise with a replica.
A good component-level alert might be:
Database-01 error rate is increasing, there were 8 health check failures in the last 60 minutes, and the last change was <commit-id>.
An alert like the above, sent to a team that manages and supports the database, would provide context related to:
1. The increasing error rate
2. The recent health check failures
3. The most recent change made to the component
This provides them with the tools to do something about the alert.
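Assembling an alert like this programmatically is mostly a matter of gathering the component’s signals into one message. A small sketch, with illustrative parameter names, that reproduces the example above:

```python
def component_alert(component: str, error_trend: str, health_check_failures: int,
                    window_minutes: int, last_change: str) -> str:
    """Format a component-level alert that carries its own context."""
    return (
        f"{component} error rate is {error_trend}, "
        f"there were {health_check_failures} health check failures "
        f"in the last {window_minutes} minutes, "
        f"and the last change was {last_change}."
    )

# Reproduces the example alert above.
print(component_alert("Database-01", "increasing", 8, 60, "<commit-id>"))
```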
Creating good alerts is a complex and challenging task. There are a number of things that can go wrong when designing and using alerts.
Alert fatigue can occur when there are too many alerts, causing operators to grow tired and desensitised to the severity and urgency of the alerts they are receiving. This can be a symptom of the bulk of alerts being false alarms, causing them to be ignored or given a cursory glance at best. Ultimately this leads to delayed responses, or cascading failures due to missed alerts.
Alerts become subject to alert fatigue when they become a normal, repetitive part of an operator’s life.
The opposite issue is alerting too little. This can happen as a knee-jerk response to alert fatigue, where all alerts are turned off. In this scenario, key incidents are overlooked, and failure modes similar to those of alert fatigue play out.
Lastly, alerts can be delivered with low context. Low-context alerts are less actionable and can lead to wasted time on context-gathering, with the knock-on effect of delayed response times.
Given the above issues that might occur with alerting, it is important to have strong criteria for creating and responding to alerts.
In general, alerts should be:
1. Something actionable
a) When an alert is received, the recipient should be able to do something about it.
2. Something impactful
a) An alert should be related to an incident that is having material impact on something important.
By these criteria, alerts are actually one of two things:
1. Incidents
a) Something I need to do now
b) Something that I need to follow up on to ensure that I am moving forward with my work or roadmap
2. Changes
a) Something that changed that I need to do something about.
When something changes (or a failure occurs), this should trigger the creation of an incident and an alert. Once an incident has been created in your incident tracking system, the alert should be silenced (whilst still tracking its occurrence). Once the work to resolve the incident is completed, the alert should be restored.
With Prometheus, it is possible to manage the lifecycle of alerts using AlertManager. You can split and group alerts by context, and then silence them if needed.
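As a rough sketch of that lifecycle against AlertManager’s HTTP API, the snippet below creates a silence through the v2 silences endpoint. The AlertManager address, alert label, creator name, and incident ID are all placeholders, and expiring the silence once the incident is resolved would be a corresponding delete call against the returned silence ID.

```python
from datetime import datetime, timedelta, timezone

import requests  # third-party HTTP client, assumed to be installed

ALERTMANAGER_URL = "http://localhost:9093"  # assumption: a locally reachable Alertmanager

def silence_alert(alertname: str, incident_id: str, hours: int = 4) -> str:
    """Silence a firing alert while its incident is worked, via Alertmanager's v2 silences API.

    Returns the silence ID so the silence can be expired once the incident is resolved.
    """
    now = datetime.now(timezone.utc)
    silence = {
        "matchers": [{"name": "alertname", "value": alertname, "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": "incident-bot",
        "comment": f"Being handled in {incident_id}",
    }
    response = requests.post(f"{ALERTMANAGER_URL}/api/v2/silences", json=silence, timeout=10)
    response.raise_for_status()
    return response.json()["silenceID"]

# Example (not run here): silence the database alert while INC-123 is worked on.
# silence_id = silence_alert("DatabaseErrorRate", "INC-123")
```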