Mean Time to Recovery measures the average duration between detecting a production incident and fully restoring normal service operation. It reflects an organization's incident response maturity, including alerting effectiveness, runbook quality, and team coordination under pressure. Shorter MTTR directly reduces the cumulative user impact of failures and is a key indicator of operational resilience.
Time to restore service after a failure. Elite level: <1 hour.
IMPROVED: Monorepos make rollbacks easier as the full change context is captured in a single commit. Combined with feature flags, recovery is faster because the exact scope of impact across all services is immediately visible and actionable.
AMPLIFIED: When the author of a change is asleep during a production incident, the on-call responder in another TZ lacks critical context. This knowledge gap at each boundary multiplies recovery time, often leaving domain experts unreachable for hours.
Diagnosis time scales with system complexity. Meta invests heavily in observability because finding root cause across thousands of services requires specialized tooling.
Slow CI directly delays deployment of hotfixes and rollbacks. Fast pipelines enable sub-hour recovery; slow ones can extend incidents by hours.
Toggle off vs full pipeline re-run.
Fast detection + diagnosis.
Incident handoffs between TZ shifts take hours. Responding TZ lacks context from the TZ that caused the issue.
Well-written runbooks and incident context enable cross-TZ incident handoffs without waiting for the originating TZ.