Experiment Referee

Prevent stakeholders from calling experiments too early

Generates automated A/B test reports with required sample sizes, current statistical significance, confidence intervals, and a clear "keep running" or "ready to call" recommendation. Stops premature peeking from killing valid experiments.

Community · Submitted by Community · Work · 5 min setup

INGREDIENTS

💬Slack

PROMPT

Create a skill called "Experiment Referee". For each A/B test I configure: (1) Before launch: Calculate required sample size given the primary metric, baseline conversion rate, minimum detectable effect, and desired power (default 80%). (2) During the test: Generate periodic reports showing current sample sizes per variant, observed effect size, p-value, confidence interval, and a clear verdict: "NOT SIGNIFICANT — keep running (estimated X days remaining)" or "SIGNIFICANT — ready to call." (3) For segment analysis: Apply Bonferroni or Benjamini-Hochberg correction for multiple comparisons and warn when segments are underpowered. (4) Include a non-technical summary for stakeholders that explains the conclusion without statistical jargon. (5) Alert me when an experiment reaches significance. Never recommend stopping a test before it reaches the pre-calculated sample size unless using a valid sequential testing framework.
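The pre-launch calculation in step (1) can be sketched with the standard two-proportion sample size formula. This is a minimal illustration, not the skill's actual implementation; the baseline rate and MDE values below are made up for the example.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(baseline, mde, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-sided two-proportion z-test.

    baseline: control conversion rate (e.g. 0.10)
    mde: minimum detectable effect, absolute (e.g. 0.02 for a 2-point lift)
    """
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Example: 10% baseline, detect a lift to 12% with 80% power
n = required_sample_size(0.10, 0.02)  # roughly 3,800-3,900 users per variant
```

This per-variant number is the guardrail the skill enforces: until both arms reach it, every report should say "keep running."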

How It Works

Product managers peek at A/B test results on day 2 and want to ship the "winning" variant. But it's not significant yet — the p-value is 0.3 and the sample size is 20% of what's needed. This skill generates experiment reports that make the statistical guardrails impossible to ignore.

What You Get

  • Required sample size calculation before the test starts
  • Automated daily/weekly experiment status reports
  • Clear "not significant yet — keep running" vs. "ready to call" verdicts
  • Confidence intervals visualized (not just p-values)
  • Segment-level breakdowns with proper multiple comparison corrections
  • A non-technical summary stakeholders can understand
  • Alerts when an experiment reaches significance
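The verdict logic behind these reports can be sketched as a pooled two-proportion z-test plus a Wald confidence interval for the lift. This is an assumed implementation; the conversion counts, `required_n`, and daily traffic below are invented for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

def report(conv_a, n_a, conv_b, n_b, required_n, daily_users_per_variant,
           alpha=0.05):
    """Return (p_value, confidence_interval, verdict) for one check-in."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled z-test for H0: the two conversion rates are equal
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled (Wald) confidence interval for the observed lift
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (p_b - p_a - z_crit * se, p_b - p_a + z_crit * se)
    # Never call a test early: require both significance AND full sample
    if p_value < alpha and min(n_a, n_b) >= required_n:
        verdict = "SIGNIFICANT — ready to call"
    else:
        days_left = max(0, ceil((required_n - min(n_a, n_b))
                                / daily_users_per_variant))
        verdict = (f"NOT SIGNIFICANT — keep running "
                   f"(estimated {days_left} days remaining)")
    return p_value, ci, verdict

# Day-6 check-in: 2,000 users per arm of a required ~3,839, 300 new users/day
p, ci, verdict = report(200, 2000, 224, 2000,
                        required_n=3839, daily_users_per_variant=300)
```

Note the `and` in the verdict check: a small p-value alone is not enough, which is exactly the peeking guardrail the skill is meant to enforce.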

Setup Steps

  1. Ask your Claw to create an "Experiment Referee" skill with the prompt below
  2. Provide experiment parameters (variants, primary metric, minimum detectable effect)
  3. Connect to your experiment data source
  4. Configure the report schedule and distribution

Tips

  • The pre-test sample size calculation is the most important step — don't skip it
  • Share the "keep running" reports proactively before stakeholders ask
  • The non-technical summary uses language like "We need 2 more weeks of data" rather than statistical jargon
  • Supports sequential testing if you need valid early stopping
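For the segment-level breakdowns, the Benjamini–Hochberg correction mentioned in the prompt can be sketched in a few lines. The segment p-values here are invented for illustration; Bonferroni would be the stricter alternative (reject only when p ≤ q/m).

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of segments still significant at FDR level q."""
    m = len(p_values)
    # Sort p-values ascending, remembering each segment's original position
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * q
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k = rank
    return sorted(order[:k])

# Five hypothetical segment p-values; the last one does not survive correction
segment_p = [0.01, 0.02, 0.03, 0.04, 0.20]
rejected = benjamini_hochberg(segment_p)  # segments 0-3 stay significant
```

Under Bonferroni the same inputs would be compared against 0.05 / 5 = 0.01, so only the first segment would survive — which is why the skill offers both and warns when segments are underpowered.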
Tags: #experimentation #ab-testing #statistics #reporting