Flaky Root Cause Hunter
Find the real source of nondeterminism, not just the symptom
Diagnose flaky tests by categorizing nondeterminism (timing, order-dependence, shared state, network, resource contention) and applying targeted fixes.
Submitted by Community · Work · 12 min
INGREDIENTS
🐙GitHub
PROMPT
Create a skill called "Flaky Root Cause Hunter".
Given:
- A flaky test name/file and recent failure logs
- The test type (unit/integration/e2e) and environment (CI/local)
Output:
- A likely root-cause category and how to confirm it
- Concrete fix patterns for that category
- A verification plan (stress reruns, seed capture, isolation)
How It Works
This recipe structures flake debugging so teams stop inflating timeouts and start removing
nondeterminism from tests and environments.
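As one illustration of removing nondeterminism instead of inflating a timeout, the sketch below injects a manually advanced clock into the code under test, so an expiry check never depends on real elapsed time. `FakeClock` and `TokenCache` are hypothetical names invented for this example; they do not come from the recipe or from any particular library.

```python
import time
from typing import Callable, Optional


class FakeClock:
    """Hypothetical test clock that only moves when the test advances it."""

    def __init__(self, start: float = 0.0) -> None:
        self._now = start

    def now(self) -> float:
        return self._now

    def advance(self, seconds: float) -> None:
        self._now += seconds


class TokenCache:
    """Hypothetical code under test: a token that expires after `ttl` seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock  # injected so tests can control time
        self._stored_at: Optional[float] = None

    def store(self) -> None:
        self._stored_at = self._clock()

    def is_valid(self) -> bool:
        if self._stored_at is None:
            return False
        return self._clock() - self._stored_at < self._ttl


def test_token_expires_without_sleeping() -> None:
    clock = FakeClock()
    cache = TokenCache(ttl=30.0, clock=clock.now)
    cache.store()

    clock.advance(29.9)  # just inside the TTL
    assert cache.is_valid()

    clock.advance(0.2)  # now past the TTL, with no real waiting
    assert not cache.is_valid()
```

The same test written with `time.sleep` would be both slow and sensitive to CI load; with an injected clock the result depends only on the values the test sets.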
Triggers
- Quarantined flaky tests accumulate
- Retries/timeouts are used as the primary fix
- Failures are hard to reproduce locally
Steps
- Re-run the single test repeatedly and record its failure rate and failure modes.
- Categorize:
  - race/timing
  - order dependence
  - shared global state
  - network/IO instability
  - resource contention
- Apply fixes:
  - hermetic test data
  - explicit waits and deterministic clocks
  - isolate shared state
  - remove external network dependencies or mock safely
- Add instrumentation to tests (timestamps, retry counts, random seeds).
- Confirm the fix with stress reruns, then remove the quarantine tag.
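The first and last steps above (repeated runs, failure rate, failure modes, seed capture) can be sketched as a small harness. This is a minimal illustration, assuming the test can be called as a plain function that takes a seeded `random.Random`; `stress_rerun` and `flaky_test` are hypothetical names, and a real suite would more likely drive its own test runner.

```python
import random
from collections import Counter
from typing import Callable, Dict, Tuple


def stress_rerun(
    test: Callable[[random.Random], None],
    runs: int,
    base_seed: int = 0,
) -> Tuple[float, Counter, Dict[str, int]]:
    """Run `test` `runs` times, each with its own recorded seed.

    Returns the failure rate, a count per failure mode, and the first
    seed that produced each mode so it can be replayed.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")

    modes: Counter = Counter()
    first_seed: Dict[str, int] = {}
    for i in range(runs):
        seed = base_seed + i
        try:
            test(random.Random(seed))
        except Exception as exc:  # includes AssertionError
            mode = f"{type(exc).__name__}: {exc}"
            modes[mode] += 1
            first_seed.setdefault(mode, seed)

    return sum(modes.values()) / runs, modes, first_seed


def flaky_test(rng: random.Random) -> None:
    """Hypothetical flaky test: fails when simulated latency is too high."""
    latency_ms = rng.randint(1, 100)
    assert latency_ms <= 95, "latency budget exceeded"


if __name__ == "__main__":
    rate, modes, seeds = stress_rerun(flaky_test, runs=200)
    print(f"failure rate: {rate:.1%}")
    for mode, count in modes.items():
        print(f"{count:4d} x {mode} (replay with seed {seeds[mode]})")
```

A captured seed turns an intermittent failure into a repeatable one: calling `flaky_test(random.Random(seed))` with a recorded seed reproduces the same failure mode, and running the harness again after a fix gives a before/after failure rate to justify removing the quarantine tag.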
Expected Outcome
- A lower flake rate and restored trust in CI.
- Less time spent rerunning, more time shipping.
Example Inputs
- "This test fails 1/20 runs only on CI."
- "E2E flake occurs when CI is slow."
- "Order-dependent failures in a shared DB test suite."
Tips
- If you can't explain why the test failed, you haven't fixed the flake yet.
Tags: #flaky-tests #testing #debugging #ci-cd