Technical Product Management Course · by Stanislav Belyaev

Incident Frequency

4 outgoing · 9 incoming · 13 total connections

Operational · AMPLIFIED (Distributed)

Incident Frequency measures how often production systems experience outages, degradations, or other events that impact users or require engineering intervention. It reflects overall system reliability and the effectiveness of preventive measures such as testing, monitoring, and change management. A rising incident trend signals systemic quality issues that need to be addressed at their root cause.

Rate of production failures. Each incident destroys 2–3 hours of productive time.
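
As a rough illustration of the definition above, the sketch below counts incidents per ISO week and estimates lost productive time from the 2–3 hour figure. The `Incident` record, the sample data, and the function names are hypothetical; they are assumptions for illustration, not part of any tool referenced in this map.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; real trackers expose richer fields."""
    opened_on: date
    summary: str


def incidents_per_week(incidents):
    """Count incidents per ISO (year, week) pair."""
    counts = Counter()
    for inc in incidents:
        iso = inc.opened_on.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


def lost_hours_range(incident_count, low=2.0, high=3.0):
    """Estimate lost hours using the 2-3 hours per incident figure."""
    return incident_count * low, incident_count * high


# Made-up sample data, for illustration only.
sample = [
    Incident(date(2024, 3, 4), "checkout latency spike"),
    Incident(date(2024, 3, 6), "failed deploy rollback"),
    Incident(date(2024, 3, 12), "queue backlog"),
]

weekly = incidents_per_week(sample)
low, high = lost_hours_range(len(sample))
print(weekly)
print(f"Estimated loss: {low:.0f} to {high:.0f} hours")
```

Grouping by week is only one possible choice; a team could equally normalize by deployments or by service count, depending on what trend it wants to watch.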

MONOREPO CONTEXT

Mixed. Monorepos can increase incident frequency if shared library changes aren't properly tested against all consumers. But atomic rollbacks and better observability across services can reduce MTTR.

DISTRIBUTED CONTEXT

AMPLIFIED: When an incident occurs during one time zone's off-hours, the on-call responder lacks context from the team that caused the issue. Handoff gaps between time-zone shifts can delay incident resolution by hours.

Scale Impact

👤 Solo / Pair (1–3): 0.3
👥 Team (4–15): 0.5
🏢 Department (15–100): 0.7
🏛️ Organization (100+): 1.0

More deployments and more services mean more incidents. Incident frequency rises 16% YoY at scale. Each incident destroys 2–3 hours of team productivity.
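
To make the arithmetic concrete, here is a minimal sketch that compounds the 16% YoY figure over a few years and converts the result into lost hours. The baseline of 100 incidents per year is hypothetical, the assumption that growth compounds steadily is ours, and 2.5 hours is simply the midpoint of the 2–3 hour range.

```python
def project_incidents(baseline, years, yoy_growth=0.16):
    """Projected annual incident counts for year 0..years, assuming steady compound growth."""
    return [baseline * (1 + yoy_growth) ** y for y in range(years + 1)]


def lost_hours(incidents, hours_per_incident=2.5):
    """Lost productive hours; 2.5 is an assumed midpoint of the 2-3 hour range."""
    return incidents * hours_per_incident


# Hypothetical baseline of 100 incidents in year 0.
projection = project_incidents(baseline=100, years=3)
for year, count in enumerate(projection):
    print(f"year {year}: ~{count:.0f} incidents, ~{lost_hours(count):.0f} hours lost")
```

Under these assumptions the incident count grows by a little more than half over three years, which is why a modest-looking annual rate still matters at organization scale.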

→ Influences

▲ On-Call Burden

More incidents = more pages.

General industry trends

High → Critical (Distributed)

▲ Context Switching

Each incident = unplanned switch. 2–3 hrs each.

2–3 hrs destroyed

Context switching research, UC Irvine

Distributed: Incidents during off-hours create morning firefighting that destroys the entire first work block.

▼ Developer Satisfaction

Firefighting culture → demoralization.

#1 departure reason

Organizational culture research

▲ Technical Debt

Hotfixes under pressure → shortcuts.

Technical debt management research

← Influenced by

▲ Test Flakiness

Eroded trust → ignored failures → bugs reach production.

84% of post-submit failures are false alarms

Google Testing Blog - Flaky Tests at Google

Undetected defects reach production. Size warnings → 35% fewer defects.

Microsoft, SmartBear/Cisco, PropelCode

▲ Technical Debt

Debt-prone bugs: incomplete fixes introduce new defects. Defect ratio rises with accumulated debt.

arXiv: debt-prone bugs pattern; Stripe: 23–42% capacity lost to debt

Academic research on debt-prone bugs

High → Medium (Monorepo)

▲ Dependency Mgmt

Unpatched CVEs in outdated deps.

Avg breach: $4.2M

OWASP / IBM Cost of Data Breach

Monorepo: Centralized version control means security patches can be applied repo-wide in one commit.

▼ Observability Quality

Proactive monitoring prevents escalation.

Reactive → proactive

New Relic DORA Case Study

▲ Change Failure Rate (CFR)

More failures → more incidents.

GitLab DORA Metrics Documentation

▼ Code Ownership Clarity

Owned code gets maintained. Unowned shared libs accumulate bugs. Clear ownership → faster incident routing.

Reduces 'orphaned' code

Aviator, web.codeowners.com, Harness

▲ Shared Lib Blast Radius

A bug in widely-shared code can cause cascading failures across many services simultaneously.

Single point of failure risk

Google SRE Workbook, Etsy Engineering

▲ AI Security Vuln Rate

AI-generated security vulnerabilities lead to production incidents.

Veracode: 45% of AI-generated code has OWASP Top 10 vulnerabilities; Georgetown CSET: 68–73% contain vulnerabilities

Veracode 2025 & Georgetown CSET

Metrics map by Stanislav Belyaev · Analysis powered by Anthropic Claude Opus 4.6 · All data validated by human experts