Mean Time to Recovery (MTTR)

Delivery & Pipeline IMPROVED IN MONOREPO AMPLIFIED DISTRIBUTED

Mean Time to Recovery (MTTR)

Mean Time to Recovery measures the average duration between detecting a production incident and fully restoring normal service operation. It reflects an organization's incident response maturity, including alerting effectiveness, runbook quality, and team coordination under pressure. Shorter MTTR directly reduces the cumulative user impact of failures and is a key indicator of operational resilience.

Time to restore service after a failure. Elite level: <1 hour.

MONOREPO CONTEXT

IMPROVED: Monorepos make rollbacks easier as the full change context is captured in a single commit. Combined with feature flags, recovery is faster because the exact scope of impact across all services is immediately visible and actionable.

DISTRIBUTED CONTEXT

AMPLIFIED: When the author of a change is asleep during a production incident, the on-call responder in another TZ lacks critical context. This knowledge gap at each boundary multiplies recovery time, often leaving domain experts unreachable for hours.

Scale Impact

👤 Solo / Pair (1–3)

0.4

👥 Team (4–15)

0.5

🏢 Department (15–100)

0.8

🏛️ Organization (100+)

Diagnosis time scales with system complexity. Meta invests heavily in observability because finding root cause across thousands of services requires specialized tooling.