After having a priority 1 issue in production, we conduct a post-mortem analysis.
Some best practices (ground rules)
We try to see the issue as an holistic situation where we can learn more than expected.
|Seeking Single Root Cause for an outage||➡||Addressing the multiple systemic contributing factors to an outage|
|Prevention (only)||➡||Breaking down TTR (time to resolve components)|
|Blameful post-mortems||➡||Blameless post-mortems|
|Surface Level post incident reviews||➡||Strategic Level understanding and improvement|
Mean Time to Recover (MTTR) Metric
MTTR = MTTD (Detect) + MTTI (Isolate) + MTTF (Fix)