The test-duration question that decides whether your result is real
A/B testing results that look like winners at day 3 are statistical artifacts most of the time. The 2025 Ronny Kohavi / Airbnb study of 7,800 completed product tests found that roughly 67% of tests that showed statistical significance at day 3 reverted to null by day 14. Peeking — reading a test before it has gathered sufficient sample — inflates the false-positive rate above the nominal alpha you set. The fix is to calculate required sample size before launching the test and commit to a read date, not a significance threshold. This estimator does exactly that.
The required sample size depends on four variables: (1) your baseline conversion rate, (2) the minimum detectable effect (MDE) you want to reliably measure, (3) your significance level alpha, (4) your statistical power. The estimator wires these together into a visitor count, then divides by your daily traffic to give you days-to-run. Commit to the full duration before launching; kill the test only if you hit a pre-agreed "disaster threshold" (e.g., variant is clearly harming revenue).
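The wiring described above can be sketched in a few lines, assuming the standard two-proportion z-test sample-size formula (the function names here are illustrative, not the estimator's actual internals):

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(baseline_cvr: float, relative_mde: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-proportion z-test (two-sided alpha)."""
    p1 = baseline_cvr
    p2 = baseline_cvr * (1 + relative_mde)          # expected variant rate
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 at alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # 0.84 at power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

def days_to_run(baseline_cvr: float, relative_mde: float,
                daily_visitors: int, arms: int = 2) -> int:
    """Committed test duration: total required visitors / daily traffic."""
    per_arm = required_sample_size(baseline_cvr, relative_mde)
    return ceil(per_arm * arms / daily_visitors)

# 3% baseline, 10% relative MDE, 2,000 visitors/day split 50/50
print(required_sample_size(0.03, 0.10))  # ≈ 53,000 visitors per arm
print(days_to_run(0.03, 0.10, 2000))     # read date in days from launch
```

The read date falls out directly: compute the number once, put it in the calendar, and do not reinterpret it mid-flight.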
Why MDE matters more than most marketers realize
MDE is the smallest effect size your test is powered to reliably detect. A test powered to detect a 20% relative lift requires substantially fewer samples than one powered to detect a 5% relative lift — roughly 16x fewer, because sample size scales with 1/MDE². Set MDE too high (20%+) and you will miss real 8% effects as "not significant"; set it too low (2%) and you will run tests for months without reaching significance, losing iteration velocity. A practical MDE for most conversion-rate optimization work in 2026: 10–15% relative. For revenue-per-visitor tests, where variance is higher, 15–25%. For dramatic redesigns where you expect big effects, 20%+.
| Setting | Role | Notes |
| --- | --- | --- |
| MDE 5% (relative) | Huge samples required | Only for high-traffic mature accounts |
| MDE 10% (relative) | Standard CRO testing | Typical baseline for landing page tests |
| MDE 15% (relative) | Medium-traffic accounts | Practical for most tests |
| MDE 20% (relative) | Redesigns, big offers | Low sample, fast reads |
| Alpha 0.05 | Standard significance threshold | 5% false-positive rate |
| Alpha 0.10 | Exploratory testing | Use for directional reads, not final calls |
| Power 0.80 | Standard | 80% chance of detecting a true effect |
| Power 0.90 | High-stakes tests | Use when the cost of a miss is high |
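The 1/MDE² scaling is easy to verify numerically. A minimal sketch using the standard two-proportion sample-size formula (a 3% baseline is assumed for illustration):

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p1: float, rel_mde: float,
              alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-proportion z-test."""
    p2 = p1 * (1 + rel_mde)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)

n5, n20 = n_per_arm(0.03, 0.05), n_per_arm(0.03, 0.20)
# Ratio lands near the idealized (20/5)^2 = 16; it is slightly below
# because the variance term shifts as the variant rate p2 grows.
print(n5, n20, round(n5 / n20, 1))
```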
The peeking problem quantified
Peeking at an A/B test daily during its run, and declaring a winner the first time p < 0.05 appears, inflates the actual false-positive rate from 5% to roughly 28% (documented in Kohavi et al.'s seminal paper). That means when there is no true effect at all, more than a quarter of tests will still hand you a "winner." The two valid ways to handle this: (1) pre-register the sample size and read only once at the pre-committed visitor count, or (2) use a sequential testing framework (Bayesian posterior probability, mSPRT) that corrects for continuous monitoring. Most marketing teams are not using sequential frameworks, so discipline around read dates is the practical fix.
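A quick simulation makes the inflation concrete. This sketch runs simulated A/A tests (no true effect, so any declared winner is a false positive) with 14 daily peeks; the exact inflated rate depends on traffic and peek count, so expect a result in the high teens to twenties rather than the quoted 28% exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
sims, days, daily_per_arm, p = 4000, 14, 1000, 0.03

# A/A test: both arms share the same true rate
a = rng.binomial(daily_per_arm, p, size=(sims, days)).cumsum(axis=1)
b = rng.binomial(daily_per_arm, p, size=(sims, days)).cumsum(axis=1)
n = daily_per_arm * np.arange(1, days + 1)      # cumulative visitors per arm

pooled = (a + b) / (2 * n)
se = np.sqrt(pooled * (1 - pooled) * 2 / n)
z = (b - a) / n / se                            # two-proportion z at each peek

peeked = (np.abs(z) > 1.96).any(axis=1).mean()  # ship at first "p < 0.05"
final = (np.abs(z[:, -1]) > 1.96).mean()        # single read at day 14
print(f"peeking FPR ~ {peeked:.0%}, single-read FPR ~ {final:.0%}")
```

The single pre-committed read holds the false-positive rate near the nominal 5%; the daily-peek rule multiplies it several times over.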
Real-world traffic requirements
For a 3% baseline CVR and a 10% relative MDE at alpha 0.05 and power 0.80, you need roughly 53,000 visitors per arm (106,000 total). At 2,000 daily visitors that's about 54 days, nearly two months. At a 1% baseline CVR (typical for cold B2B traffic) the requirement balloons to roughly 163,000 per arm, more than five months at the same traffic level. This is why small B2B accounts struggle with proper A/B testing: the traffic isn't there. The answer isn't to shortcut the test; it's to test upstream metrics (CTR, landing page engagement) that converge faster, or to accept a wider MDE that matches your traffic reality.
| Scenario | Days to run | Notes |
| --- | --- | --- |
| 3% CVR, 10% MDE, 2k/day traffic | ~54 days | Long even for DTC; consider a wider MDE |
| 3% CVR, 10% MDE, 500/day traffic | ~213 days | Not viable; widen MDE or test upstream metrics |
| 1% CVR, 10% MDE, 2k/day traffic | ~164 days | B2B cold traffic reality |
| 1% CVR, 20% MDE, 2k/day traffic | ~43 days | Accept a larger detectable effect |
| 5% CVR, 10% MDE, 2k/day traffic | ~32 days | Higher-intent traffic converges faster |
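The day counts above can be reproduced by dividing the two-proportion sample-size formula by daily traffic. A sketch (scenario values only; a 50/50 split across two arms is assumed):

```python
from math import ceil
from statistics import NormalDist

def days_to_read(p1: float, rel_mde: float, daily_traffic: int,
                 alpha: float = 0.05, power: float = 0.80):
    """Return (per-arm sample size, days until the committed read date)."""
    p2 = p1 * (1 + rel_mde)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    per_arm = ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)
    return per_arm, ceil(2 * per_arm / daily_traffic)

scenarios = [(0.03, 0.10, 2000), (0.03, 0.10, 500),
             (0.01, 0.10, 2000), (0.01, 0.20, 2000), (0.05, 0.10, 2000)]
for p1, mde, traffic in scenarios:
    per_arm, days = days_to_read(p1, mde, traffic)
    print(f"{p1:.0%} CVR, {mde:.0%} MDE, {traffic}/day -> "
          f"{per_arm:,} per arm, {days} days")
```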
Sequential testing: the modern answer
Tools like Optimizely (post-2022), VWO's Bayesian mode, and homegrown Bayesian implementations in BigQuery / dbt now allow continuous monitoring without false-positive inflation. The Bayesian approach: at each read, compute the posterior probability that variant B is better than variant A, plus the expected loss if you pick the wrong variant. Ship when P(B better) > 0.95 AND expected loss < 0.5% absolute. This is cleaner math than frequentist for teams with stats sophistication. For teams without it, the frequentist pre-commit-and-read-once approach is safer and produces similar real-world outcomes over 20+ test cycles.
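A minimal sketch of that Bayesian decision rule, using Beta posteriors and Monte Carlo draws. The conversion counts are hypothetical, and the 0.95 / 0.5% thresholds come from the text above:

```python
import numpy as np

rng = np.random.default_rng(7)

# Observed counts at this read (hypothetical): conversions / visitors per arm
conv_a, n_a = 310, 10_000
conv_b, n_b = 365, 10_000

# Beta(1, 1) prior -> Beta posterior over each arm's conversion rate
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)

p_b_better = (post_b > post_a).mean()
# Expected loss of shipping B: average CVR given up when A was actually better
exp_loss_b = np.maximum(post_a - post_b, 0).mean()

ship = p_b_better > 0.95 and exp_loss_b < 0.005   # thresholds from the text
print(f"P(B > A) = {p_b_better:.3f}, "
      f"expected loss = {exp_loss_b:.5f}, ship: {ship}")
```

Because the expected-loss term shrinks as evidence accumulates, this read can be repeated at any time without the frequentist peeking penalty.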
What to test first if you're traffic-constrained
Prioritize tests in this order: (1) elements that affect the largest share of traffic (header CTA, hero headline), (2) elements with the widest expected effect size (offer changes, pricing), (3) elements that can be tested as CTR proxies (email subject lines, ad headlines) which reach significance 5–10x faster than CVR tests. Stop running tiny button-color tests — they require huge samples to detect tiny effects and are almost never business-critical.
The practical testing framework I use with clients
- Define hypothesis. Specific change, specific predicted direction, specific target metric.
- Compute required sample size. Use this estimator. Round up to the nearest full weekly cycle.
- Pre-register. Write the hypothesis, sample size, alpha/power, and read date in a shared doc.
- Launch with 50/50 split. Balanced allocation is optimal unless you have a clear reason to skew.
- Monitor for disasters only. Check daily for 3x-worse-than-control disasters; otherwise wait for read date.
- Read at pre-committed date. Run the significance test. Pass significance + MDE met = ship. Below MDE = keep control.
- Post-ship validate. Monitor the winner for 30 days in production to catch novelty effects.
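The pre-registration step above can be as lightweight as a frozen record that derives the read date from the committed sample size. A sketch with illustrative field names and values:

```python
from dataclasses import dataclass
from datetime import date, timedelta
from math import ceil

@dataclass(frozen=True)  # frozen: the plan cannot be edited after launch
class PreRegistration:
    hypothesis: str            # specific change + predicted direction
    target_metric: str
    per_arm_sample: int        # from the sample-size estimator
    daily_visitors: int
    launch: date
    disaster_threshold: float  # e.g. variant CVR below 1/3 of control -> kill

    @property
    def read_date(self) -> date:
        """Read date, rounded up to full weekly cycles past launch."""
        days = ceil(2 * self.per_arm_sample / self.daily_visitors)
        return self.launch + timedelta(weeks=ceil(days / 7))

plan = PreRegistration(
    hypothesis="Benefit-led hero headline raises signup CVR",
    target_metric="signup_cvr",
    per_arm_sample=53_000,
    daily_visitors=2_000,
    launch=date(2026, 3, 2),
    disaster_threshold=1 / 3,
)
print(plan.read_date)  # -> 2026-04-27
```

Writing the plan down in this form makes the read date a shared commitment rather than a judgment call made while staring at a dashboard.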