After a priority 1 (P1) issue in production, we conduct a post-mortem analysis.
Some best practices (ground rules)
We try to view the incident as a holistic situation from which we can learn more than the immediate fix alone would teach us.
| From | | To |
|---|---|---|
| Seeking Single Root Cause for an outage | ➡ | Addressing the multiple systemic contributing factors to an outage |
| Prevention (only) | ➡ | Breaking down TTR (time to resolve) into its components |
| Blameful post-mortems | ➡ | Blameless post-mortems |
| Surface-level post-incident reviews | ➡ | Strategic-level understanding and improvement |
Mean Time to Recover (MTTR) Metric
MTTR = MTTD (Detect) + MTTI (Isolate) + MTTF (Fix)
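
As a rough illustration of the breakdown above, the sketch below computes each component from incident timestamps and sums them into MTTR. The `Incident` class, its field names, and the sample data are assumptions made for this example only; they do not come from any real tracking tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical timestamps for one incident; field names are illustrative.
    started: datetime   # when the outage began
    detected: datetime  # when we noticed it
    isolated: datetime  # when the faulty component was identified
    fixed: datetime     # when service was restored


def mean_minutes(deltas):
    """Average a collection of timedeltas, expressed in minutes."""
    return mean(d.total_seconds() / 60 for d in deltas)


def mttr_breakdown(incidents):
    """Return MTTD, MTTI, MTTF and their sum (MTTR), all in minutes."""
    mttd = mean_minutes([i.detected - i.started for i in incidents])
    mtti = mean_minutes([i.isolated - i.detected for i in incidents])
    mttf = mean_minutes([i.fixed - i.isolated for i in incidents])
    return {"MTTD": mttd, "MTTI": mtti, "MTTF": mttf, "MTTR": mttd + mtti + mttf}


# Invented sample data: two incidents starting at the same reference time.
t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=30), t0 + timedelta(minutes=60)),
    Incident(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=30), t0 + timedelta(minutes=40)),
]
breakdown = mttr_breakdown(incidents)
print(breakdown)
```

Reporting the three components separately, rather than MTTR alone, shows which phase (detection, isolation, or fix) dominates recovery time and is therefore the best candidate for improvement.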