After having a priority 1 issue in production, we conduct a post-mortem analysis.

Some best practices (ground rules)

We try to see the issue as an holistic situation where we can learn more than expected.

From		To
Seeking Single Root Cause for an outage	➡	Addressing the multiple systemic contributing factors to an outage
Prevention (only)	➡	Breaking down TTR (time to resolve components)
Blameful post-mortems	➡	Blameless post-mortems
Surface Level post incident reviews	➡	Strategic Level understanding and improvement

Mean Time to Recover (MTTR) Metric

MTTR = MTTD (Detect) + MTTI (Isolate) + MTTF (Fix)