You can't improve what you don't measure. Tracking incident metrics over time tells you whether your incident management process is getting better, staying flat, or degrading.
Mean Time to Detect (MTTD)
The average time between an incident beginning and your team becoming aware of it. This metric reflects the effectiveness of your monitoring. A decreasing MTTD means your monitoring is catching problems faster. A high MTTD, especially when customers consistently notice incidents before your monitors do, indicates gaps in your monitoring coverage. Investing in comprehensive uptime and performance monitoring is the most direct way to reduce MTTD.
Mean Time to Acknowledge (MTTA)
The average time between an alert firing and a human acknowledging it. This metric reflects the effectiveness of your escalation policies and on-call practices. A high MTTA suggests that alerts are being ignored, the on-call process is not working, or alerts are firing at such volume that the important ones get buried.
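As a rough illustration of how MTTD and MTTA might be computed, the sketch below averages the gaps between timestamps on a handful of incident records. The records and the field names (`started`, `detected`, `alerted`, `acknowledged`) are hypothetical; a real incident tracker will expose its own schema.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; the field names are assumptions for illustration.
incidents = [
    {
        "started": datetime(2024, 3, 1, 10, 0),
        "detected": datetime(2024, 3, 1, 10, 4),
        "alerted": datetime(2024, 3, 1, 10, 4),
        "acknowledged": datetime(2024, 3, 1, 10, 9),
    },
    {
        "started": datetime(2024, 3, 7, 22, 30),
        "detected": datetime(2024, 3, 7, 22, 46),
        "alerted": datetime(2024, 3, 7, 22, 46),
        "acknowledged": datetime(2024, 3, 7, 22, 49),
    },
]


def mean_minutes(records, start_field, end_field):
    """Average gap in minutes between two timestamp fields."""
    gaps = [
        (r[end_field] - r[start_field]).total_seconds() / 60
        for r in records
    ]
    return mean(gaps)


mttd = mean_minutes(incidents, "started", "detected")      # start -> awareness
mtta = mean_minutes(incidents, "alerted", "acknowledged")  # alert -> human ack
print(f"MTTD: {mttd:.1f} min, MTTA: {mtta:.1f} min")
```

The same helper works for any pair of timestamps, so long as every record carries both fields.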
Mean Time to Resolve (MTTR)
The average time from incident detection to resolution. This is the headline metric for incident management effectiveness. MTTR is driven by diagnosis efficiency, fix implementation time, and verification that the fix works; if you measure from the start of the incident instead of from detection, detection speed counts too. Reducing MTTR requires improvements across all of these stages.
Be careful with how you calculate MTTR. Averaging across all severity levels obscures important distinctions. Track MTTR separately for each severity level. A SEV-4 that takes two days to resolve in the normal development cycle is very different from a SEV-1 that takes two days -- the first is acceptable, the second is a serious problem.
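To make the point about averaging concrete, here is a minimal sketch that compares one overall MTTR with per-severity MTTRs. The data is made up, and the tuple layout (severity, hours from detection to resolution) is an assumption for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: (severity, hours from detection to resolution).
resolutions = [
    ("SEV-1", 1.5),
    ("SEV-1", 2.5),
    ("SEV-4", 48.0),
    ("SEV-4", 40.0),
]


def mttr_by_severity(records):
    """Group resolution times by severity and average each group."""
    grouped = defaultdict(list)
    for severity, hours in records:
        grouped[severity].append(hours)
    return {severity: mean(hours) for severity, hours in grouped.items()}


overall = mean(hours for _, hours in resolutions)
per_severity = mttr_by_severity(resolutions)

print(f"Overall MTTR: {overall:.1f} h")
for severity, hours in sorted(per_severity.items()):
    print(f"{severity}: {hours:.1f} h")
```

With this sample, the overall figure sits far from both groups: it makes the SEV-1 response look slow and hides how long SEV-4 incidents actually take.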
Incident frequency
Track the number of incidents per week or month, broken down by severity. This metric tells you whether your systems are becoming more or less reliable over time. An increasing trend in SEV-1 and SEV-2 incidents should trigger a strategic discussion about reliability investment, not just tactical responses to individual incidents.
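One simple way to produce this breakdown is to count incidents by month and severity. The sketch below assumes a hypothetical log of (date, severity) pairs; the `monthly_counts` helper is illustrative, not part of any particular tool.

```python
from collections import Counter
from datetime import date

# Hypothetical incident log: (date opened, severity).
incident_log = [
    (date(2024, 1, 12), "SEV-2"),
    (date(2024, 1, 20), "SEV-3"),
    (date(2024, 2, 3), "SEV-1"),
    (date(2024, 2, 17), "SEV-2"),
    (date(2024, 2, 25), "SEV-2"),
]


def monthly_counts(log):
    """Count incidents per (YYYY-MM, severity) pair."""
    return Counter((day.strftime("%Y-%m"), severity) for day, severity in log)


counts = monthly_counts(incident_log)
for (month, severity), n in sorted(counts.items()):
    print(month, severity, n)
```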
Repeat incident rate
What percentage of your incidents have the same root cause as a previous incident? A high repeat rate indicates that your post-mortem process is failing -- either action items aren't being generated, aren't being completed, or aren't addressing the actual root cause. This is one of the most actionable incident metrics because the path to improvement is clear: complete your post-mortem action items.
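A minimal sketch of the calculation, assuming each post-mortem records a root-cause label and that incidents are listed in chronological order. The labels below are hypothetical, and in practice the hard part is labelling root causes consistently, which this example does not address.

```python
# Hypothetical root-cause labels, one per incident, in chronological order.
root_causes = [
    "expired-cert",
    "db-failover",
    "expired-cert",
    "bad-deploy",
    "db-failover",
    "expired-cert",
]


def repeat_incident_rate(causes):
    """Percentage of incidents whose root cause was already seen earlier."""
    seen = set()
    repeats = 0
    for cause in causes:
        if cause in seen:
            repeats += 1
        else:
            seen.add(cause)
    return 100 * repeats / len(causes) if causes else 0.0


rate = repeat_incident_rate(root_causes)
print(f"Repeat incident rate: {rate:.0f}%")
```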
Customer-reported vs. monitor-detected ratio
What percentage of incidents are first detected by your monitoring versus reported by customers? Your goal should be to detect the vast majority of incidents automatically before any customer notices. Track this ratio over time as a measure of monitoring effectiveness.
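The ratio can be tracked with a simple tally, assuming each incident record notes who detected it first. The `"monitor"` and `"customer"` labels here are hypothetical placeholders for whatever your incident tracker stores.

```python
from collections import Counter

# Hypothetical first-detection source for each incident in a period.
detected_by = [
    "monitor", "monitor", "customer", "monitor",
    "customer", "monitor", "monitor", "monitor",
]


def monitor_detection_share(sources):
    """Percentage of incidents first detected by monitoring."""
    counts = Counter(sources)
    total = sum(counts.values())
    return 100 * counts["monitor"] / total if total else 0.0


share = monitor_detection_share(detected_by)
print(f"Monitor-detected: {share:.0f}%, customer-reported: {100 - share:.0f}%")
```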
Building a feedback loop
Review these metrics monthly with your engineering team. Identify trends, celebrate improvements, and allocate resources to address deteriorating metrics. Connect incident metrics to your broader engineering goals -- if you're investing in reliability, these numbers should reflect it. If they don't, the investment is either misdirected or insufficient.
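For the monthly review, a small comparison like the one below could flag which metrics got worse since the previous period. The metric names, the values, and the 10% tolerance are all assumptions for illustration, and the sketch only handles metrics where lower is better.

```python
# Hypothetical monthly values; for every metric here, lower is better.
previous = {"mttd_min": 12.0, "mtta_min": 5.0, "mttr_sev1_hours": 2.0}
current = {"mttd_min": 9.0, "mtta_min": 7.5, "mttr_sev1_hours": 2.0}


def deteriorating(prev, curr, tolerance=0.10):
    """Names of metrics that rose by more than `tolerance` (as a fraction)."""
    return sorted(
        name
        for name in curr
        if name in prev
        and prev[name] > 0
        and curr[name] > prev[name] * (1 + tolerance)
    )


flagged = deteriorating(previous, current)
print("Needs attention:", flagged)
```

A tolerance keeps small month-to-month noise from triggering a discussion; the right threshold depends on how many incidents feed each average.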
Incident management is ultimately a practice, not a one-time setup. The organisations that handle incidents most effectively are those that treat every incident as a learning opportunity, invest consistently in their detection and response capabilities, and measure their performance honestly. Combined with robust monitoring and a clear understanding of downtime costs, a strong incident management process becomes the foundation of operational excellence.