Technical Product Management Course · by Stanislav Belyaev

Mean Time to Recovery (MTTR)

1 outgoing · 5 incoming · 6 total connections

Delivery & Pipeline · IMPROVED IN MONOREPO · AMPLIFIED DISTRIBUTED

Mean Time to Recovery measures the average duration between detecting a production incident and fully restoring normal service operation. It reflects an organization's incident response maturity, including alerting effectiveness, runbook quality, and team coordination under pressure. Shorter MTTR directly reduces the cumulative user impact of failures and is a key indicator of operational resilience.

Time to restore service after a failure. DORA elite level: under 1 hour.
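As a minimal sketch of the definition above, MTTR can be computed as the mean of (restored time minus detected time) over a set of incidents. The `Incident` record, its field names, and the sample timestamps are hypothetical and exist only for illustration; the 1-hour threshold is the elite level quoted above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; the field names are assumptions."""
    detected_at: datetime   # when the incident was detected
    restored_at: datetime   # when normal service was fully restored


def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """Average of (restored_at - detected_at) across all incidents."""
    if not incidents:
        raise ValueError("MTTR is undefined for an empty incident list")
    total = sum((i.restored_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


ELITE_THRESHOLD = timedelta(hours=1)  # elite level: under 1 hour

# Made-up sample data, for illustration only.
sample = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),   # 30 min
    Incident(datetime(2024, 1, 9, 22, 15), datetime(2024, 1, 9, 23, 45)),  # 90 min
]

mttr = mean_time_to_recovery(sample)
print(mttr)                    # 1:00:00
print(mttr < ELITE_THRESHOLD)  # False: exactly one hour is not under the threshold
```

Because the result is a plain mean, a single long incident can dominate it; a team tracking this in practice might also want to look at the median or the distribution.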

MONOREPO CONTEXT

IMPROVED: Monorepos make rollbacks easier because the full context of a change is captured in a single commit. Combined with feature flags, recovery is faster because the scope of impact across all services is immediately visible and actionable.

DISTRIBUTED CONTEXT

AMPLIFIED: When the author of a change is asleep during a production incident, the on-call responder in another time zone lacks critical context. This knowledge gap recurs at each time-zone boundary and multiplies recovery time, often leaving domain experts unreachable for hours.

Scale Impact
👤 Solo / Pair (1–3): 0.4
👥 Team (4–15): 0.5
🏢 Department (15–100): 0.8
🏛️ Organization (100+): 1

Diagnosis time scales with system complexity. Meta invests heavily in observability because finding root cause across thousands of services requires specialized tooling.

→ Influences

On-Call Burden

Slow recovery = longer incidents.

More person-hours
Harness MTTR Blog

← Influenced by

CI/CD Pipeline Speed

Slow CI directly delays deployment of hotfixes and rollbacks. Fast pipelines enable sub-hour recovery; slow ones can extend incidents by hours.

DORA: elite <1hr MTTR requires fast CI
DORA Metrics - Failed Deployment Recovery Time
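A rough arithmetic sketch of why pipeline duration matters: if every hotfix or rollback has to pass through CI, recovery time is at least the diagnosis time plus one fix-and-pipeline cycle per deployment attempt. The function name and all numbers below are illustrative assumptions, not measurements.

```python
from datetime import timedelta


def estimated_recovery(diagnosis: timedelta, fix: timedelta,
                       pipeline: timedelta, attempts: int = 1) -> timedelta:
    """Lower-bound estimate: diagnose once, then (fix + pipeline run) per attempt."""
    if attempts < 1:
        raise ValueError("at least one deployment attempt is needed")
    return diagnosis + attempts * (fix + pipeline)


# Hypothetical incident: 15 min to diagnose, 10 min to prepare each fix,
# and the first hotfix does not resolve the issue, so two attempts are needed.
common = dict(diagnosis=timedelta(minutes=15), fix=timedelta(minutes=10), attempts=2)

fast = estimated_recovery(pipeline=timedelta(minutes=5), **common)
slow = estimated_recovery(pipeline=timedelta(minutes=60), **common)

print(fast)  # 0:45:00
print(slow)  # 2:35:00
```

Under these assumed numbers, the same incident stays under an hour with a 5-minute pipeline and runs past two and a half hours with a 60-minute one, with identical diagnosis and fix effort.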

Feature Flags

Toggling a flag off restores service without a full pipeline re-run.

Seconds vs hours
LaunchDarkly Feature Flags Blog
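The toggle-off path can be sketched as a kill switch that is evaluated at request time. This is a minimal in-memory illustration, not the API of LaunchDarkly or any other flag service; `FlagStore`, the flag name `new-pricing-engine`, and `checkout_total` are all hypothetical.

```python
import threading


class FlagStore:
    """In-memory stand-in for a feature flag service (hypothetical)."""

    def __init__(self) -> None:
        self._flags: dict[str, bool] = {}
        self._lock = threading.Lock()

    def set(self, name: str, enabled: bool) -> None:
        with self._lock:
            self._flags[name] = enabled

    def is_enabled(self, name: str, default: bool = False) -> bool:
        with self._lock:
            return self._flags.get(name, default)


flags = FlagStore()


def checkout_total(prices: list[float]) -> float:
    """The flag is read on every call, so a toggle applies to the next request."""
    if flags.is_enabled("new-pricing-engine"):
        return round(sum(prices) * 0.9, 2)  # new code path (the one under suspicion)
    return round(sum(prices), 2)            # known-good legacy path


flags.set("new-pricing-engine", True)
print(checkout_total([10.0, 5.0]))  # 13.5

# Incident response: flip the flag off; no rebuild or redeploy is involved.
flags.set("new-pricing-engine", False)
print(checkout_total([10.0, 5.0]))  # 15.0
```

The design choice that matters here is that the legacy path stays deployed alongside the new one, so recovery is a configuration change rather than a code change.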

Observability Quality

Fast detection + diagnosis.

80% fewer incidents
DORA State of DevOps 2021

Handoff Latency

Incident handoffs between time-zone shifts take hours. The responding time zone lacks context from the time zone that caused the issue.

MTTR measured in shifts, not hours
Google SRE, PagerDuty incident management

Async Comm Quality

Well-written runbooks and incident context enable cross-time-zone incident handoffs without waiting for the originating time zone.

Google SRE: clear communication critical for incident response; Lowe's: 82% MTTR improvement with better processes
Google SRE, incident management research
Metrics map by Stanislav Belyaev · Analysis powered by Anthropic Claude Opus 4.6 · All data validated by human experts