After having a priority 1 issue in production, we conduct a post-mortem analysis.
Some best practices (ground rules)
We try to see the issue as an holistic situation where we can learn more than expected.
From | To | |
---|---|---|
Seeking Single Root Cause for an outage | ➡ | Addressing the multiple systemic contributing factors to an outage |
Prevention (only) | ➡ | Breaking down TTR (time to resolve components) |
Blameful post-mortems | ➡ | Blameless post-mortems |
Surface Level post incident reviews | ➡ | Strategic Level understanding and improvement |
Mean Time to Recover (MTTR) Metric
MTTR = MTTD (Detect) + MTTI (Isolate) + MTTF (Fix)