Test Flakiness measures the percentage of test executions that produce non-deterministic results — passing on one run and failing on another without any code change. Flaky tests erode developer trust in the test suite, cause unnecessary investigation, and can mask real regressions. Tracking and reducing flakiness is essential for maintaining a reliable and actionable CI signal.
Rate of non-deterministic test results. Google reports that about 16% of its tests show some level of flakiness.
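A minimal sketch of how this rate could be computed from raw CI records. It assumes each record is a hypothetical (test_name, commit_sha, passed) tuple and treats an execution as non-deterministic when the same test on the same commit both passed and failed. That is one possible operationalization, not a standard definition.

```python
from collections import defaultdict


def flakiness_rate(executions):
    """Share of executions in (test, commit) groups with mixed outcomes.

    executions: iterable of (test_name, commit_sha, passed) tuples.
    The tuple layout is an assumption made for this sketch.
    """
    outcomes = defaultdict(list)
    for test_name, commit_sha, passed in executions:
        outcomes[(test_name, commit_sha)].append(passed)

    total = sum(len(results) for results in outcomes.values())
    if total == 0:
        return 0.0

    # A group is flaky when it holds both a pass and a fail on identical code.
    flaky = sum(
        len(results) for results in outcomes.values() if len(set(results)) > 1
    )
    return flaky / total


# Illustrative data only, not taken from any real pipeline.
runs = [
    ("test_login", "abc123", True),
    ("test_login", "abc123", False),   # same commit, different outcome: flaky
    ("test_login", "abc123", True),
    ("test_search", "abc123", True),
    ("test_search", "abc123", True),
    ("test_export", "abc123", False),  # fails consistently: broken, not flaky
    ("test_export", "abc123", False),
    ("test_search", "def456", True),
]
rate = flakiness_rate(runs)  # 3 of the 8 executions sit in a mixed group
```

Grouping by commit keeps real regressions out of the count: a test that fails consistently after a code change is broken rather than flaky.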
AMPLIFIED: Broader dependency graphs mean more tests run per change, multiplying flake exposure. A change touching a shared library triggers tests across all consumers, and a single flaky failure among them fails the whole build. Impact on merge queues is especially devastating.
Flakiness impact is amplified indirectly: recovery from a flake-induced merge queue failure takes 12–24h instead of minutes when the developer is asleep at the time the failure occurs.
Flaky test failures scale multiplicatively: more builds × more flaky tests × higher execution frequency. Atlassian processes 350M+ test executions/day to manage this.
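As a rough illustration of how these factors combine, the expected number of flaky failures per day is the product of the inputs. The function name, the parameters, and the numbers below are hypothetical, and the sketch assumes a single average per-run flake rate across all tests.

```python
def expected_flaky_failures_per_day(builds_per_day, tests_per_build, per_run_flake_rate):
    """Expected count of spurious test failures per day.

    per_run_flake_rate is the assumed average probability that one test
    execution fails spuriously. It is an input to this sketch, not a
    measured value.
    """
    return builds_per_day * tests_per_build * per_run_flake_rate


# Hypothetical pipeline: 2,000 builds/day, 5,000 tests each, 0.1% flake rate.
baseline = expected_flaky_failures_per_day(2_000, 5_000, 0.001)

# Doubling the build count doubles the expected number of flaky failures.
doubled = expected_flaky_failures_per_day(4_000, 5_000, 0.001)
```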
Developers batch changes into larger PRs to reduce how often they pass through flaky gates.
Reruns consume pipeline capacity, multiplying CI time 2–3×.
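One way to see where a multiplier of this size can come from: if a pipeline is simply rerun until it passes and each attempt fails spuriously with the same independent probability, the number of attempts follows a geometric distribution. This is a simplified model, assumed here for illustration only. Real pipelines may cap retries or rerun only the failed tests.

```python
def expected_attempts(spurious_failure_prob):
    """Expected number of full pipeline runs until one passes.

    Assumes independent attempts with a constant spurious failure
    probability (geometric distribution), so the mean is 1 / (1 - p).
    """
    if not 0.0 <= spurious_failure_prob < 1.0:
        raise ValueError("spurious_failure_prob must be in [0, 1)")
    return 1.0 / (1.0 - spurious_failure_prob)


# A pipeline that fails spuriously half the time needs 2 runs on average.
multiplier = expected_attempts(0.5)  # 2.0
```

Under this model, a 2× cost corresponds to a spurious failure probability of one half per attempt, and 3× to about two thirds.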
One flaky failure cascades through the merge queue and resets all queued PRs.
Reruns, larger PRs, and queue delays compound across the pipeline.
Phantom failures create frustration and erode CI trust.
Eroded trust → ignored failures → bugs reach production.
Teams disable flaky tests, creating persistent coverage gaps.
Each flaky failure requires a rerun. With 16% of tests prone to flakiness (Google), most PRs hit at least one flake.
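The reasoning behind this can be sketched with a simple probability model. It assumes every test in a PR's run fails spuriously with the same small, independent per-run probability. That per-run rate is a hypothetical input and is not the same figure as the 16% share of tests that are prone to flakiness.

```python
def prob_at_least_one_flake(num_tests, per_run_flake_rate):
    """Probability that a run of num_tests sees one or more flaky failures.

    Assumes independent tests with an identical per-run flake rate, so
    the chance that every test behaves is (1 - rate) ** num_tests.
    """
    return 1.0 - (1.0 - per_run_flake_rate) ** num_tests


# Hypothetical PR that triggers 1,000 tests, each flaking 0.1% of the time:
# roughly 63% of such runs hit at least one flake.
p_full = prob_at_least_one_flake(1_000, 0.001)

# Running half as many tests lowers the exposure.
p_half = prob_at_least_one_flake(500, 0.001)
```

Even a per-test flake rate that looks negligible produces frequent spurious failures once enough tests run per change, which is why reducing the number of tests executed per PR lowers flake exposure.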
Large tests: 14% flaky vs. small tests: 0.5% flaky.
Fewer tests run = fewer opportunities for flaky failures per PR.