Technical Product Management Course · by Stanislav Belyaev

Incident Frequency

4 outgoing · 9 incoming · 13 total connections

Operational · AMPLIFIED (DISTRIBUTED)

Incident Frequency measures how often production systems experience outages, degradations, or other events that impact users or require engineering intervention. It reflects overall system reliability and the effectiveness of preventive measures such as testing, monitoring, and change management. A rising incident trend signals systemic quality issues that need to be addressed at their root cause.

Rate of production failures. Each incident destroys 2–3 hours of productive time.
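
As a rough illustration of the definition above, incident frequency can be computed as a count per ISO week, with the 2–3 hour figure applied as a per-incident cost range. This is a minimal sketch: the function names and the sample incident log are hypothetical, and the per-incident cost is simply the range quoted above.

```python
from collections import Counter
from datetime import date

# Assumption: 2-3 productive hours lost per incident, as quoted in the text.
HOURS_LOST_LOW = 2.0
HOURS_LOST_HIGH = 3.0


def incidents_per_week(incident_dates):
    """Count incidents per ISO week, keyed by (iso_year, iso_week)."""
    counts = Counter()
    for d in incident_dates:
        iso = d.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


def hours_lost_range(incident_count):
    """Return the (low, high) estimate of productive hours lost."""
    return (incident_count * HOURS_LOST_LOW, incident_count * HOURS_LOST_HIGH)


# Hypothetical incident log, for illustration only.
incident_log = [date(2024, 3, 4), date(2024, 3, 6), date(2024, 3, 12)]
weekly = incidents_per_week(incident_log)
lost = hours_lost_range(len(incident_log))
```

Tracking the weekly counts over time is what makes a rising trend visible; a single week's number says little on its own.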

MONOREPO CONTEXT

Mixed. Monorepos can increase incident frequency if shared library changes aren't properly tested against all consumers. But atomic rollbacks and better observability across services can reduce MTTR.
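
One way to test a shared library change against all consumers is to walk the dependency graph in reverse and collect every package that depends on the changed one, directly or transitively. The sketch below assumes the monorepo's dependencies are available as a simple mapping; the package names and the `affected_consumers` helper are hypothetical, not part of any specific build tool.

```python
from collections import deque


def affected_consumers(depends_on, changed):
    """Return every package that directly or transitively depends on `changed`.

    `depends_on` maps each package name to the packages it imports.
    """
    # Invert the graph: dependency -> packages that consume it.
    consumers = {}
    for pkg, deps in depends_on.items():
        for dep in deps:
            consumers.setdefault(dep, set()).add(pkg)

    # Breadth-first walk outward from the changed package.
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for consumer in consumers.get(current, ()):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen


# Hypothetical monorepo layout, for illustration only.
graph = {
    "billing-service": ["payments-lib"],
    "payments-lib": ["shared-utils"],
    "search-service": ["shared-utils"],
    "docs-site": [],
}
to_test = affected_consumers(graph, "shared-utils")
```

Running the test suites of everything in `to_test` before merging is the kind of check the paragraph above refers to; real build systems typically derive this set from their own dependency metadata.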

DISTRIBUTED CONTEXT

AMPLIFIED: Incidents during one time zone's off-hours mean the on-call responder lacks context from the team that caused the issue. Handoff gaps between time-zone shifts can delay incident resolution by hours.
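
To see how often incidents land in the owning team's off-hours, each incident timestamp can be converted to that team's local time and compared against working hours. This is a minimal sketch under loud assumptions: a 09:00–18:00 Monday-to-Friday workday, fixed UTC offsets that ignore daylight saving time, and a hypothetical `is_off_hours` helper.

```python
from datetime import datetime, timedelta, timezone


def is_off_hours(incident_utc, utc_offset_hours, workday_start=9, workday_end=18):
    """Return True if the incident falls outside the team's local working hours."""
    # Assumption: a fixed UTC offset stands in for the team's time zone (no DST).
    local = incident_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    is_weekend = local.weekday() >= 5
    return is_weekend or not (workday_start <= local.hour < workday_end)


# Hypothetical example: an incident at 02:30 UTC on a Wednesday.
incident = datetime(2024, 3, 6, 2, 30, tzinfo=timezone.utc)
owning_team_off = is_off_hours(incident, utc_offset_hours=1)  # team at UTC+1
on_call_off = is_off_hours(incident, utc_offset_hours=8)      # responder at UTC+8
```

Here the owning team is asleep while the responder is mid-morning, which is exactly the situation where context is missing and a handoff is needed.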

Scale Impact
👤 Solo / Pair (1–3): 0.3
👥 Team (4–15): 0.5
🏢 Department (15–100): 0.7
🏛️ Organization (100+): 1.0

More deployments and more services mean more incidents. Incident frequency rises 16% YoY at scale. Each incident destroys 2–3 hours of team productivity.
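
To make the 16% YoY figure concrete, the sketch below projects yearly incident counts under compound growth. The baseline of 100 incidents per year and the `project_incidents` helper are hypothetical, and compounding at a constant rate is an assumption rather than something the source states.

```python
def project_incidents(baseline_per_year, years, yoy_growth=0.16):
    """Project yearly incident counts under compound year-over-year growth.

    Returns a list with one entry per year, starting with the baseline year.
    """
    return [baseline_per_year * (1 + yoy_growth) ** n for n in range(years + 1)]


# Hypothetical baseline of 100 incidents per year, projected three years out.
projection = project_incidents(100, 3)
```

Under these assumptions the count grows by a little over half in three years, so the 2–3 hours lost per incident compounds at the same rate.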

→ Influences

On-Call Burden

More incidents = more pages.

General industry trends
High · Critical (DIST)
Context Switching

Each incident = unplanned switch. 2–3 hrs each.

2–3 hrs destroyed
Context switching research, UC Irvine
Distributed: Incidents during off-hours create morning firefighting that destroys the entire first work block.
Developer Satisfaction

Firefighting culture → demoralization.

#1 departure reason
Organizational culture research
Technical Debt

Hotfixes under pressure → shortcuts.

Hotfix debt
Technical debt management research

← Influenced by

Test Flakiness

Eroded trust → ignored failures → bugs reach production.

84% of observed pass-to-fail transitions in post-submit testing involve a flaky test
Google Testing Blog - Flaky Tests at Google
PR Size

Larger PRs let more undetected defects reach production. Size warnings → 35% fewer defects.

Microsoft data
Microsoft, SmartBear/Cisco, PropelCode
Technical Debt

Debt-prone bugs: incomplete fixes introduce new defects. Defect ratio rises with accumulated debt.

arXiv: debt-prone bugs pattern; Stripe: 23-42% capacity lost to debt
Academic research on debt-prone bugs
High · Medium (MONO)
Dependency Mgmt

Unpatched CVEs in outdated deps.

Avg breach: $4.2M
OWASP / IBM Cost of a Data Breach Report
Monorepo: Centralized version control means security patches can be applied repo-wide in one commit.
Observability Quality

Proactive monitoring prevents escalation.

Reactive → proactive
New Relic DORA Case Study
Change Failure Rate (CFR)

More failures → more incidents.

Direct causal
GitLab DORA Metrics Documentation
Code Ownership Clarity

Owned code gets maintained. Unowned shared libs accumulate bugs. Clear ownership → faster incident routing.

Reduces 'orphaned' code
Aviator, web.codeowners.com, Harness
Shared Lib Blast Radius

A bug in widely-shared code can cause cascading failures across many services simultaneously.

Single point of failure risk
Google SRE Workbook, Etsy Engineering
AI Security Vuln Rate

AI-generated security vulnerabilities lead to production incidents.

Veracode: 45% of AI-generated code has OWASP Top 10 vulns; Georgetown CSET: 68–73% contains vulnerabilities
Veracode 2025 & Georgetown CSET
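
The Change Failure Rate entry above describes a direct link between failed changes and incidents. A minimal sketch of that relationship follows, using the common DORA definition of CFR as failed deployments divided by total deployments. The helper names and numbers are hypothetical, and treating each failed change as exactly one incident is a simplifying assumption.

```python
def change_failure_rate(total_deployments, failed_deployments):
    """Fraction of deployments that caused a production failure."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    if not 0 <= failed_deployments <= total_deployments:
        raise ValueError("failed_deployments must be between 0 and total_deployments")
    return failed_deployments / total_deployments


def expected_incidents(planned_deployments, cfr):
    """Rough incident estimate, assuming each failed change causes one incident."""
    return planned_deployments * cfr


# Hypothetical numbers, for illustration only.
cfr = change_failure_rate(total_deployments=200, failed_deployments=10)
estimate = expected_incidents(planned_deployments=300, cfr=cfr)
```

At a constant CFR, deploying more often raises the incident count proportionally, which is why the two metrics are usually read together.
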
Metrics map by Stanislav Belyaev · Analysis powered by Anthropic Claude Opus 4.6 · All data validated by human experts