Experiment Referee
Prevent stakeholders from calling experiments too early
Generates automated A/B test reports with required sample sizes, current statistical significance, confidence intervals, and a clear "keep running" or "ready to call" recommendation, so premature peeking doesn't kill valid experiments.
INGREDIENTS
PROMPT
Create a skill called "Experiment Referee". For each A/B test I configure: (1) Before launch: Calculate required sample size given the primary metric, baseline conversion rate, minimum detectable effect, and desired power (default 80%). (2) During the test: Generate periodic reports showing current sample sizes per variant, observed effect size, p-value, confidence interval, and a clear verdict: "NOT SIGNIFICANT — keep running (estimated X days remaining)" or "SIGNIFICANT — ready to call." (3) For segment analysis: Apply Bonferroni or Benjamini-Hochberg correction for multiple comparisons and warn when segments are underpowered. (4) Include a non-technical summary for stakeholders that explains the conclusion without statistical jargon. (5) Alert me when an experiment reaches significance. Never recommend stopping a test before it reaches the pre-calculated sample size unless using a valid sequential testing framework.
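The pre-launch calculation in step (1) can be sketched with the standard two-sided two-proportion formula. This is a minimal illustration, not the skill's actual implementation; the function name and defaults are assumptions.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(baseline, mde, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-sided two-proportion z-test.

    baseline: control conversion rate (e.g. 0.10 for 10%)
    mde: minimum detectable effect, absolute (e.g. 0.02 means detect 0.12)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)           # power requirement
    p1, p2 = baseline, baseline + mde
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of unpooled variances
    n = (z_alpha + z_beta) ** 2 * variance / mde ** 2
    return ceil(n)

# 10% baseline, 2-point absolute MDE, defaults of 80% power / 5% alpha
print(required_sample_size(0.10, 0.02))
```

Note how quickly the requirement grows as the MDE shrinks: halving the detectable effect roughly quadruples the sample size, which is exactly why the pre-launch number has to be fixed before anyone looks at results.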
How It Works
Product managers peek at A/B test results on day 2 and want to ship the
"winning" variant. But it's not significant yet — the p-value is 0.3 and
the sample size is 20% of what's needed. This skill generates experiment
reports that make the statistical guardrails impossible to ignore.
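The guardrail described above amounts to two checks: a two-proportion z-test for significance, and a comparison against the pre-calculated sample size. A rough sketch, with illustrative names and a hypothetical `required_n` parameter:

```python
from math import sqrt
from statistics import NormalDist

def ab_report(n_a, conv_a, n_b, conv_b, required_n, alpha=0.05):
    """Two-sided two-proportion z-test plus a keep-running/ready verdict."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error for the hypothesis test
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_pool)))
    # Unpooled standard error for the confidence interval on the difference
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    # Never call before the pre-calculated sample size is reached
    if min(n_a, n_b) >= required_n and p_value < alpha:
        verdict = "SIGNIFICANT — ready to call"
    else:
        verdict = "NOT SIGNIFICANT — keep running"
    return {"diff": diff, "p_value": p_value, "ci": ci, "verdict": verdict}
```

Because the sample-size condition is checked before the p-value, a lucky day-2 dip below 0.05 still produces a "keep running" verdict.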
What You Get
- Required sample size calculation before the test starts
- Automated daily/weekly experiment status reports
- Clear "not significant yet — keep running" vs. "ready to call" verdicts
- Confidence intervals visualized (not just p-values)
- Segment-level breakdowns with proper multiple comparison corrections
- A non-technical summary stakeholders can understand
- Alerts when an experiment reaches significance
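The segment-level corrections mentioned above guard against the fact that testing ten segments at alpha = 0.05 will produce false "winners" by chance. A minimal sketch of the Benjamini-Hochberg procedure (one of the two corrections named in the prompt):

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    prev = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, p_values[i] * m / rank)
        adjusted[i] = prev
    return adjusted

# Four segments: two survive correction at q < 0.05, two do not
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Bonferroni (multiply each p-value by the number of segments) is stricter but simpler; BH controls the false discovery rate and keeps more power when many segments are tested.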
Setup Steps
- Ask your Claw to create an "Experiment Referee" skill with the prompt below
- Provide experiment parameters (variants, primary metric, minimum detectable effect)
- Connect to your experiment data source
- Configure the report schedule and distribution
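The parameters gathered in the steps above might look something like this. The field names are purely illustrative, not a real schema the skill requires:

```python
# Hypothetical experiment configuration; every key name is an assumption
experiment = {
    "name": "checkout-button-color",
    "variants": ["control", "treatment"],
    "primary_metric": "checkout_conversion",
    "baseline_rate": 0.10,               # current conversion rate
    "minimum_detectable_effect": 0.02,   # absolute, i.e. detect 0.12
    "alpha": 0.05,
    "power": 0.80,
    "report_schedule": "daily",
}
```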
Tips
- The pre-test sample size calculation is the most important step — don't skip it
- Share the "keep running" reports proactively before stakeholders ask
- The non-technical summary uses language like "We need 2 more weeks of data" rather than statistical jargon
- Supports sequential testing if you need valid early stopping