Technical Product Management Course · by Stanislav Belyaev


Test Flakiness

8 outgoing · 2 incoming · 10 total connections

Testing & Quality · Amplified in monorepo


Test Flakiness measures the percentage of test executions that produce non-deterministic results — passing on one run and failing on another without any code change. Flaky tests erode developer trust in the test suite, cause unnecessary investigation, and can mask real regressions. Tracking and reducing flakiness is essential for maintaining a reliable and actionable CI signal.

Rate of non-deterministic test results. Google: 16% of tests are prone to flakiness.
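As a minimal sketch of how this rate could be computed, assume CI history is available as (test_id, commit_sha, passed) records. That record shape and the name flakiness_percent are illustrative assumptions, not part of any specific CI tool. An execution is counted as flaky here when the same test on the same commit has both passing and failing runs, which is one reasonable reading of the definition above.

```python
from collections import defaultdict


def flakiness_percent(executions):
    """Percentage of executions that belong to a flaky (test, commit) group.

    `executions` is an iterable of (test_id, commit_sha, passed) tuples.
    This record shape is a hypothetical one chosen for the sketch.
    """
    outcomes = defaultdict(list)
    for test_id, commit_sha, passed in executions:
        outcomes[(test_id, commit_sha)].append(passed)

    total = sum(len(runs) for runs in outcomes.values())
    if total == 0:
        return 0.0

    # A group is flaky when the same test on the same commit both passed and failed.
    flaky = sum(len(runs) for runs in outcomes.values() if len(set(runs)) > 1)
    return 100.0 * flaky / total


history = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", False),  # same commit, different outcome
    ("test_search", "abc123", True),
    ("test_search", "abc123", True),
]
print(flakiness_percent(history))
```

Counting per test instead of per execution is an equally valid choice; what matters is that the same definition is used every time the metric is reported.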

MONOREPO CONTEXT

AMPLIFIED: Broader dependency graphs mean more tests run per change, multiplying flake exposure. A change touching a shared library triggers tests across all of its consumers, and if any one of them flakes, the whole build fails. The impact on merge queues is especially severe.

DISTRIBUTED CONTEXT

Flakiness impact is amplified indirectly: recovering from a flake-induced merge queue failure takes 12–24 h instead of minutes when the developer is asleep at the time the failure occurs.

Scale Impact

👤 Solo / Pair (1–3): 0.2

👥 Team (4–15): 0.4

🏢 Department (15–100): 0.8

🏛️ Organization (100+): 1.0

Flaky test failures scale multiplicatively: more builds × more flaky tests × higher execution frequency. Atlassian processes 350M+ test executions/day to manage this.


→ Influences

High → Critical (MONO)

Developers batch changes into larger PRs to minimize exposure to flaky CI gates.

Google: 2–16% of compute resources spent on flaky reruns

Google Research - Continuous Integration Testing

Monorepo: A flaky test in any consumer of a shared library blocks the PR, which greatly amplifies the incentive to batch.

High → Critical (MONO)

▼ CI/CD Pipeline Speed

Reruns consume pipeline capacity, multiplying CI time 2–3×.

Slack: 20% main branch stability

Google + industrial case study (Leinen et al. 2023)

Monorepo: Broader dependency graphs mean more tests run per change, so flake exposure grows with every additional test pulled into the build.
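To make the rerun cost concrete, here is a rough sketch under a deliberately simple assumption: every CI run fails for flaky reasons with the same independent probability, and the pipeline is rerun until it passes. The function name expected_ci_runs and that model are illustrative assumptions, not measurements from the sources cited here. Under this model the number of runs follows a geometric distribution with mean 1 / (1 - f).

```python
def expected_ci_runs(build_flake_prob):
    """Expected number of CI runs until one passes.

    Assumes each run fails from flakiness independently with probability
    `build_flake_prob` and is retried until it passes (geometric distribution).
    """
    if not 0.0 <= build_flake_prob < 1.0:
        raise ValueError("build_flake_prob must be in [0, 1)")
    return 1.0 / (1.0 - build_flake_prob)


for f in (0.1, 0.3, 0.5):
    print(f"flake probability {f:.0%}: {expected_ci_runs(f):.2f} expected runs")
```

In this simplified model, a 2× multiplier corresponds to half of all builds failing from flakes, so the 2–3× figure describes pipelines where flaky failures are the norm rather than the exception.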

▲ Merge Queue Wait

One flaky failure triggers a cascading reset of all queued PRs.

A flake rate of 10% or more makes queues unusable

GitHub Merge Queue Documentation & Trunk.io

Monorepo: At Expedia, 20+ PRs compete in the queue simultaneously, and a flaky test in any affected project resets the entire queue.
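The cascading reset can be illustrated with a small simulation. This is a sketch of a simplified queue model, not the behavior of GitHub Merge Queue, Trunk.io, or any other product: every queued PR gets one speculative CI run per round, and the first flaky failure forces that PR and everything behind it to run again. The names simulate_queue_runs and mean_runs, and the flake probability used, are assumptions made for illustration.

```python
import random


def simulate_queue_runs(queue_len, flake_prob, rng):
    """Count CI runs needed to merge `queue_len` PRs under a simple reset model."""
    if not 0.0 <= flake_prob < 1.0:
        raise ValueError("flake_prob must be in [0, 1)")
    remaining = queue_len
    total_runs = 0
    while remaining > 0:
        # Every PR still in the queue gets a speculative CI run this round.
        total_runs += remaining
        merged = 0
        for _ in range(remaining):
            if rng.random() < flake_prob:
                break  # flaky failure: this PR and all PRs behind it run again
            merged += 1
        remaining -= merged
    return total_runs


def mean_runs(queue_len, flake_prob, trials=2000, seed=0):
    """Average CI runs over many simulated queues (seeded for repeatability)."""
    rng = random.Random(seed)
    runs = [simulate_queue_runs(queue_len, flake_prob, rng) for _ in range(trials)]
    return sum(runs) / trials


print(mean_runs(20, 0.0))   # no flakes: one run per PR
print(mean_runs(20, 0.10))  # hypothetical 10% flake probability per run
```

With no flakes the queue needs exactly one run per PR. Any nonzero flake probability adds wasted runs, and the waste grows with queue length because each reset repeats the work of every PR behind the failure.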

High → Critical (MONO)

▲ Change Lead Time

Reruns + larger PRs + queue delays compound across the pipeline.

Affects 3+ delivery stages

Katalon + Gradle research

▼ Developer Satisfaction

Phantom failures create frustration and erode CI trust.

Google: 16% of tests flaky, 84% of post-submit failures are false alarms; Atlassian: 21% of master failures caused by flakes

Mozilla Foundation - Understanding Flaky Tests Research

▲ Incident Frequency

Eroded trust → ignored failures → bugs reach production.

84% of post-submit failures are false alarms

Google Testing Blog - Flaky Tests at Google

▼ Code Coverage

Teams disable flaky tests, creating persistent coverage gaps.

Common in large codebases

Gradle + Atlassian research

▼ PRs Completed per Week

Each flaky failure requires a rerun. At 16% flakiness (Google), most PRs hit at least one flake.

Google: 84% of pass→fail transitions are flaky; reruns add pipeline cost

Google flaky test analysis

Monorepo: Broader dependency graphs mean more tests run per PR and therefore more flake exposure, which compounds the hit to throughput.

Distributed: Flaky failures that occur overnight create 12–24 h delays instead of immediate retries, which amplifies the impact dramatically.
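The link between test count and flake exposure can be sketched with basic probability. Assume each test execution flakes independently with a small probability p (a hypothetical per-run rate, which is not the same thing as the 16% share of tests that show some flakiness). The chance that a PR running n tests hits at least one flake is then 1 - (1 - p)^n. The function name flake_exposure and the numbers below are illustrative assumptions.

```python
def flake_exposure(tests_run, per_run_flake_prob):
    """Probability that at least one of `tests_run` executions flakes.

    Assumes independent executions with an identical flake probability,
    which real suites only approximate.
    """
    if tests_run < 0:
        raise ValueError("tests_run must be non-negative")
    if not 0.0 <= per_run_flake_prob <= 1.0:
        raise ValueError("per_run_flake_prob must be in [0, 1]")
    return 1.0 - (1.0 - per_run_flake_prob) ** tests_run


p = 0.001  # hypothetical: one flaky failure per 1,000 executions
for n in (100, 1_000, 10_000):
    print(f"{n:>6} tests per PR: {flake_exposure(n, p):.1%} chance of at least one flake")
```

Even a per-run rate that looks negligible makes a flake close to certain once a PR runs thousands of tests, which is why running fewer tests per PR reduces exposure so directly.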

← Influenced by

Medium → High (MONO)

▲ Test Suite Exec Time

Large tests: 14% flaky vs. small tests: 0.5%.

Google Testing Blog (April 2017)

Monorepo: Affected-project detection can pull in large integration tests from other teams, amplifying flake exposure.

▼ Affected-Project Detection

Fewer tests run means fewer opportunities for flaky failures per PR.

Proportional to test reduction

Logical inference + test execution data

Metrics map by Stanislav Belyaev · Analysis powered by Anthropic Claude Opus 4.6 · All data validated by human experts