Technical Product Management Course · by Stanislav Belyaev

Test Flakiness

8 outgoing · 2 incoming · 10 total connections

Testing & Quality · AMPLIFIED IN MONOREPO

Test Flakiness measures the percentage of test executions that produce non-deterministic results — passing on one run and failing on another without any code change. Flaky tests erode developer trust in the test suite, cause unnecessary investigation, and can mask real regressions. Tracking and reducing flakiness is essential for maintaining a reliable and actionable CI signal.

Metric: rate of non-deterministic test results. Google reports that about 16% of its tests show some level of flakiness.
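
One way to make this definition measurable is sketched below. It assumes CI history is available as records of (test, commit, outcome); the `Execution` type and all function names are hypothetical, and treating "passed and failed on the same commit" as the flakiness signal is only one possible operationalization. The sketch computes two numbers: the share of tests that have ever behaved flakily, and the share of executions that were flaky failures.

```python
from collections import defaultdict
from typing import NamedTuple, Sequence


class Execution(NamedTuple):
    test_id: str   # hypothetical test identifier
    commit: str    # revision the test ran against
    passed: bool


def flaky_pairs(executions: Sequence[Execution]) -> set[tuple[str, str]]:
    """Return (test, commit) pairs that both passed and failed with no code change."""
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for run in executions:
        outcomes[(run.test_id, run.commit)].add(run.passed)
    return {pair for pair, seen in outcomes.items() if len(seen) == 2}


def flaky_test_share(executions: Sequence[Execution]) -> float:
    """Share of distinct tests that showed flaky behaviour at least once."""
    tests = {run.test_id for run in executions}
    if not tests:
        return 0.0
    flaky = {test_id for test_id, _ in flaky_pairs(executions)}
    return len(flaky) / len(tests)


def flaky_execution_rate(executions: Sequence[Execution]) -> float:
    """Share of executions that failed on a commit where the same test also passed."""
    if not executions:
        return 0.0
    pairs = flaky_pairs(executions)
    flaky_failures = sum(
        1 for run in executions
        if not run.passed and (run.test_id, run.commit) in pairs
    )
    return flaky_failures / len(executions)


# Illustrative data: t1 is flaky, t2 is stable, t3 fails consistently (a real failure).
runs = [
    Execution("t1", "a", True), Execution("t1", "a", False),
    Execution("t2", "a", True), Execution("t2", "a", True),
    Execution("t3", "a", False), Execution("t3", "a", False),
]
print(flaky_test_share(runs), flaky_execution_rate(runs))
```

The two numbers answer different questions, so a figure such as "16% of tests" should not be read as a per-execution failure rate.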

MONOREPO CONTEXT

AMPLIFIED: Broader dependency graphs mean more tests run per change, which multiplies flake exposure. A change to a shared library triggers the tests of every consumer, and a single flaky failure among them fails the whole build. Merge queues are hit hardest, because one flaky failure can reset every queued PR.
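
The exposure effect can be approximated with a simple probability model. The sketch below assumes every test execution fails spuriously with the same probability and independently of the others, which real suites only roughly satisfy; the function name and the sample numbers are illustrative, not measured data.

```python
def p_build_hits_flake(tests_run: int, flake_rate: float) -> float:
    """Probability that at least one of `tests_run` executions fails spuriously.

    Assumes independent executions with an identical per-execution flake rate.
    """
    if tests_run < 0:
        raise ValueError("tests_run must be non-negative")
    if not 0.0 <= flake_rate <= 1.0:
        raise ValueError("flake_rate must be between 0 and 1")
    return 1.0 - (1.0 - flake_rate) ** tests_run


# Same hypothetical 0.1% per-execution flake rate, different blast radius.
small_change = p_build_hits_flake(tests_run=100, flake_rate=0.001)
shared_lib_change = p_build_hits_flake(tests_run=2000, flake_rate=0.001)
print(f"100 tests: {small_change:.1%}, 2000 tests: {shared_lib_change:.1%}")
```

Under these assumptions the per-test flake rate stays constant, yet a change that fans out to 2000 tests fails spuriously far more often than one that runs 100.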

DISTRIBUTED CONTEXT

Flakiness impact is amplified indirectly: in a distributed team, recovering from a flake-induced merge queue failure can take 12-24h instead of minutes, because the developer may be asleep when the failure occurs.

Scale Impact
👤 Solo / Pair (1–3): 0.2
👥 Team (4–15): 0.4
🏢 Department (15–100): 0.8
🏛️ Organization (100+): 1.0

Flaky test failures scale multiplicatively: more builds × more flaky tests × higher execution frequency. Atlassian processes 350M+ test executions/day to detect and manage flaky tests.

→ Influences

High · Critical MONO
PR Size

Developers batch changes into larger PRs to minimize their exposure to flaky CI gates.

Google: 2-16% of compute resources spent rerunning flaky tests
Google Research - Continuous Integration Testing
Monorepo: A flaky test in ANY consumer of a shared lib blocks the PR, which massively amplifies the incentive to batch.
High · Critical MONO
CI/CD Pipeline Speed

Reruns consume pipeline capacity, multiplying CI time 2–3×.

Slack: 20% main branch stability
Google + industrial case study (Leinen et al. 2023)
Monorepo: Broader dependency graphs mean more tests run per change, so the chance of hitting at least one flake rises with every additional test.
Merge Queue Wait

One flaky failure triggers a cascading reset of all queued PRs.

10%+ flake rate makes queues unusable
GitHub Merge Queue Documentation & Trunk.io
Monorepo: At Expedia, 20+ PRs compete for the queue simultaneously, and a flaky test in any affected project resets the entire queue.
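
To see why a 10% flake rate is so damaging at this queue depth, the sketch below uses a deliberately simplified model: every queued PR's CI run fails spuriously with the same independent probability, and any failure resets the whole queue. Real merge queues differ in how they eject and rebuild PRs, so the function names and the model itself are assumptions for illustration only.

```python
def queue_pass_probability(queued_prs: int, spurious_failure_rate: float) -> float:
    """Probability that every queued PR passes CI with no spurious failure."""
    if queued_prs < 0:
        raise ValueError("queued_prs must be non-negative")
    if not 0.0 <= spurious_failure_rate <= 1.0:
        raise ValueError("spurious_failure_rate must be between 0 and 1")
    return (1.0 - spurious_failure_rate) ** queued_prs


def expected_queue_attempts(queued_prs: int, spurious_failure_rate: float) -> float:
    """Expected number of full queue runs until one completes without a reset."""
    p_clean = queue_pass_probability(queued_prs, spurious_failure_rate)
    if p_clean == 0.0:
        return float("inf")
    # Attempts until the first clean run follow a geometric distribution.
    return 1.0 / p_clean


for rate in (0.01, 0.10):
    clean = queue_pass_probability(20, rate)
    attempts = expected_queue_attempts(20, rate)
    print(f"flake rate {rate:.0%}: clean run {clean:.1%}, expected attempts {attempts:.1f}")
```

In this model a queue of 20 PRs at a 10% per-PR flake rate completes cleanly only about one time in eight, while at 1% it succeeds most of the time.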
High · Critical MONO
Change Lead Time

Reruns, larger PRs, and queue delays compound across the pipeline.

Affects 3+ delivery stages
Katalon + Gradle research
Developer Satisfaction

Phantom failures create frustration and erode CI trust.

Google: 16% of tests flaky, 84% of pass→fail transitions are false alarms; Atlassian: 21% of master failures caused by flakes
Mozilla Foundation - Understanding Flaky Tests Research
Incident Frequency

Eroded trust → ignored failures → bugs reach production.

84% of post-submit failures are false alarms
Google Testing Blog - Flaky Tests at Google
Code Coverage

Teams disable flaky tests, creating persistent coverage gaps.

Common in large codebases
Gradle + Atlassian research
PRs Completed per Week

Each flaky failure requires a rerun. With 16% of tests showing some flakiness (Google), a PR that runs many tests is likely to hit at least one flake.

Google: 84% of pass→fail transitions are flaky; reruns add pipeline cost
Google flaky test analysis
Monorepo: Broader dependency graphs mean more tests run per PR and therefore more flake exposure, which sharply reduces throughput.
Distributed: A flaky failure that occurs overnight causes a 12-24h delay instead of an immediate retry, which amplifies the impact dramatically.

← Influenced by

Medium · High MONO
Test Suite Exec Time

Large tests are 14% flaky versus 0.5% for small tests.

Google data
Google Testing Blog (April 2017)
Monorepo: Affected-project detection can pull in large integration tests from other teams, amplifying flake exposure.
Affected-Project Detection

Running fewer tests means fewer opportunities for flaky failures per PR.

Proportional to test reduction
Logical inference + test execution data
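
Affected-project detection is typically a reverse walk over the dependency graph. The sketch below is a minimal illustration, not any specific tool's algorithm: the `deps` mapping, the project names, and the function name are hypothetical. It returns the changed projects plus everything that transitively consumes them, which is the set whose tests would run.

```python
from collections import deque


def affected_projects(deps: dict[str, set[str]], changed: set[str]) -> set[str]:
    """Return the changed projects plus all of their transitive consumers.

    `deps` maps each project to the projects it depends on.
    """
    # Invert the graph: dependency -> projects that consume it.
    consumers: dict[str, set[str]] = {}
    for project, requires in deps.items():
        for dependency in requires:
            consumers.setdefault(dependency, set()).add(project)

    affected = set(changed)
    queue = deque(changed)
    while queue:  # breadth-first walk over reverse edges
        current = queue.popleft()
        for consumer in consumers.get(current, ()):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected


# Hypothetical monorepo: checkout -> billing -> shared-lib, docs stands alone.
deps = {
    "shared-lib": set(),
    "billing": {"shared-lib"},
    "checkout": {"billing"},
    "docs": set(),
}
print(sorted(affected_projects(deps, {"docs"})))
print(sorted(affected_projects(deps, {"shared-lib"})))
```

A change to `docs` runs only its own tests, while a change to `shared-lib` pulls in every consumer. The same mechanism that trims test runs for isolated changes also widens flake exposure for shared code.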
Metrics map by Stanislav Belyaev · Analysis powered by Anthropic Claude Opus 4.6 · All data validated by human experts