The test-duration question that decides whether your result is real
A/B testing results that look like winners at day 3 are statistical artifacts most of the time. The 2025 Ronny Kohavi / Airbnb study of 7,800 completed product tests found that roughly 67% of tests that showed statistical significance at day 3 reverted to null by day 14. Peeking — reading a test before it has gathered sufficient sample — inflates the false-positive rate above the nominal alpha you set. The fix is to calculate required sample size before launching the test and commit to a read date, not a significance threshold. This estimator does exactly that.
The required sample size depends on four variables: (1) your baseline conversion rate, (2) the minimum detectable effect (MDE) you want to reliably measure, (3) your significance level alpha, (4) your statistical power. The estimator wires these together into a visitor count, then divides by your daily traffic to give you days-to-run. Commit to the full duration before launching; kill the test only if you hit a pre-agreed "disaster threshold" (e.g., variant is clearly harming revenue).
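One common way to wire these four inputs together is the normal-approximation formula for comparing two proportions. This is a sketch of the standard textbook calculation, not necessarily the exact formula this estimator uses; the symbols $p_1$, $p_2$, $n$, and $T$ are notation introduced here for illustration.

```latex
\begin{aligned}
p_2 &= p_1\,(1 + \mathrm{MDE}) \\
n_{\text{per arm}} &\approx
  \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,
        \bigl[\,p_1(1-p_1) + p_2(1-p_2)\,\bigr]}
       {(p_2 - p_1)^{2}} \\
\text{days} &= \left\lceil \frac{2\,n_{\text{per arm}}}{T} \right\rceil
\end{aligned}
```

Here $p_1$ is the baseline conversion rate, MDE is the relative lift, $1-\beta$ is the power, and $T$ is daily visitors split evenly across two arms. For a two-sided alpha of 0.05 and power of 0.80, $z_{1-\alpha/2} \approx 1.96$ and $z_{1-\beta} \approx 0.84$.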
Why MDE matters more than most marketers realize
MDE is the smallest effect size your test is powered to detect reliably. A test powered to detect a 20% relative lift requires far fewer samples than one powered to detect a 5% relative lift: roughly 16x fewer, because sample size scales approximately with 1/MDE². If you set MDE too high (20%+), real 8% effects will come back as "not significant"; if you set it too low (2%), you will run tests for months without reaching significance and lose iteration velocity. A practical MDE for most conversion-rate optimization work in 2026 is 10–15% relative. For revenue-per-visitor tests, where variance is higher, use 15–25%. For dramatic redesigns where you expect big effects, use 20%+.
| Setting | Typical use | Notes |
|---|---|---|
| MDE 5% (relative) | Huge samples required | Only for high-traffic mature accounts |
| MDE 10% (relative) | Standard CRO testing | Typical baseline for landing page tests |
| MDE 15% (relative) | Medium-traffic accounts | Practical for most tests |
| MDE 20% (relative) | Redesigns, big offers | Low sample, fast reads |
| Alpha 0.05 | Standard significance threshold | 5% false-positive rate |
| Alpha 0.10 | Exploratory testing | Use for directional, not final |
| Power 0.80 | Standard | 80% chance of detecting true effect |
| Power 0.90 | High-stakes tests | Use when cost of missing is high |
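The 1/MDE² scaling can be checked with a one-line worked ratio. This is a sketch that holds baseline, alpha, and power fixed and treats the variance term as constant, which is only approximately true.

```latex
\frac{n_{\mathrm{MDE}=5\%}}{n_{\mathrm{MDE}=20\%}}
  \approx \left(\frac{0.20}{0.05}\right)^{2} = 16
```

The exact ratio comes out slightly under 16, because the variance of the variant arm grows a little as the assumed lift grows.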
The peeking problem quantified
Peeking at an A/B test daily during its run, and declaring a winner the first time p < 0.05 appears, inflates the actual false-positive rate from 5% to roughly 28% (documented in Kohavi et al.'s seminal paper); the exact figure grows with the number of looks. That means more than a quarter of tests with no real effect will still produce a "winner." There are two valid ways to avoid the problem: (1) pre-register the sample size and read only once, at the pre-committed visitor count, or (2) use a sequential testing framework (Bayesian posterior probability, mSPRT) that is designed for continuous monitoring. Most marketing teams are not using sequential frameworks, so discipline around read dates is the practical fix.
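The inflation is easy to see in a small simulation. The sketch below is illustrative rather than a reproduction of any published study: it models an A/A test (no true difference), assumes equal traffic between reads, and uses a normal approximation so the z-statistic at read k is a running sum of standard-normal increments divided by sqrt(k). The function name `peeking_false_positive_rate` and its parameters are hypothetical.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rate(looks, alpha=0.05, simulations=20_000, seed=0):
    """Estimate how often an A/A test is ever 'significant' across `looks` reads.

    Assumption: equal traffic arrives between reads, so under the null the
    z-statistic at read k is a sum of k standard-normal increments / sqrt(k).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    false_positives = 0
    for _ in range(simulations):
        running_sum = 0.0
        for k in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)
            if abs(running_sum) / k ** 0.5 > z_crit:
                false_positives += 1
                break  # the team stops and "ships" at the first significant read
    return false_positives / simulations


if __name__ == "__main__":
    for looks in (1, 7, 14, 28):
        rate = peeking_false_positive_rate(looks)
        print(f"{looks:>2} looks -> false-positive rate ~ {rate:.1%}")
```

With a single read the rate should sit near the nominal 5%; it climbs steadily as daily reads are added, which is the effect the pre-committed read date is meant to prevent.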
Real-world traffic requirements
For a 3% baseline CVR and a 10% relative MDE at alpha 0.05 (two-sided) and power 0.80, the standard two-proportion normal approximation calls for roughly 53,000 visitors per arm (about 106,000 total). At 2,000 daily visitors that is about 53 days, or eight full weekly cycles once rounded up. At a 1% baseline CVR (typical for cold B2B traffic) the requirement balloons to roughly 163,000 per arm, about 163 days at the same traffic level. This is why small B2B accounts struggle with proper A/B testing: the traffic isn't there. The answer isn't to shortcut the test; it's to test upstream metrics (CTR, landing page engagement) that converge faster, or to accept a wider MDE that matches your traffic reality.
| Scenario | Days to run | Notes |
|---|---|---|
| 3% CVR, 10% MDE, 2k/day traffic | ~53 days | About eight weekly cycles |
| 3% CVR, 10% MDE, 500/day traffic | ~213 days | Too slow; widen MDE |
| 1% CVR, 10% MDE, 2k/day traffic | ~163 days | B2B cold traffic reality |
| 1% CVR, 20% MDE, 2k/day traffic | ~43 days | Accept larger detectable effect |
| 5% CVR, 10% MDE, 2k/day traffic | ~31 days | Higher-intent traffic |
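The traffic arithmetic above can be reproduced with a few lines of standard-library Python. This is a minimal sketch assuming a two-sided two-proportion z-test under the normal approximation and an even traffic split; other calculators use slightly different approximations and may differ by a few percent. The function names `sample_size_per_arm` and `days_to_run` are hypothetical, not this estimator's actual internals.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors per arm for a two-sided two-proportion z-test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # conversion rate if the MDE is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def days_to_run(baseline, relative_mde, daily_visitors, arms=2,
                alpha=0.05, power=0.80, round_to_weeks=False):
    """Days needed when `daily_visitors` is split evenly across `arms`."""
    total_visitors = arms * sample_size_per_arm(baseline, relative_mde, alpha, power)
    days = ceil(total_visitors / daily_visitors)
    if round_to_weeks:
        days = ceil(days / 7) * 7  # commit to full weekly cycles
    return days


if __name__ == "__main__":
    scenarios = [
        (0.03, 0.10, 2000),
        (0.03, 0.10, 500),
        (0.01, 0.10, 2000),
        (0.01, 0.20, 2000),
        (0.05, 0.10, 2000),
    ]
    for cvr, mde, traffic in scenarios:
        n = sample_size_per_arm(cvr, mde)
        days = days_to_run(cvr, mde, traffic)
        print(f"{cvr:.0%} CVR, {mde:.0%} MDE, {traffic}/day: {n:,} per arm, {days} days")
```

Passing `round_to_weeks=True` rounds the duration up to whole weeks, which keeps weekday and weekend traffic balanced across the run.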
Sequential testing: the modern answer
Tools like Optimizely's Stats Engine, VWO's Bayesian mode, and homegrown Bayesian implementations in BigQuery / dbt are designed to support continuous monitoring without the false-positive inflation of naive peeking. The Bayesian approach: at each read, compute the posterior probability that variant B is better than variant A, plus the expected loss if you pick the wrong variant. Ship when P(B better) > 0.95 AND expected loss < 0.5% absolute. For teams with stats sophistication, this is cleaner than the frequentist approach. For teams without it, the frequentist pre-commit-and-read-once approach is safer and produces similar real-world outcomes over 20+ test cycles.
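The posterior probability and expected loss described above can be estimated by sampling. This is a minimal sketch, not how any of the named tools implement it: it assumes independent uniform Beta(1, 1) priors on each arm's conversion rate and uses Monte Carlo draws from the resulting Beta posteriors. The names `bayesian_read` and `should_ship` are hypothetical; the 0.95 and 0.005 thresholds are the ones quoted in the text.

```python
import random


def bayesian_read(conv_a, n_a, conv_b, n_b, draws=50_000, seed=0):
    """Monte Carlo estimate of P(B > A) and the expected loss of shipping B.

    Assumption: Beta(1, 1) priors, so each posterior is
    Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    b_wins = 0
    loss_if_ship_b = 0.0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            b_wins += 1
        # Loss is the absolute conversion rate given up when B is actually worse.
        loss_if_ship_b += max(rate_a - rate_b, 0.0)
    return b_wins / draws, loss_if_ship_b / draws


def should_ship(conv_a, n_a, conv_b, n_b, min_prob=0.95, max_loss=0.005):
    """Apply the rule from the text: P(B better) > 0.95 and expected loss < 0.5% absolute."""
    prob_b_better, expected_loss = bayesian_read(conv_a, n_a, conv_b, n_b)
    return prob_b_better > min_prob and expected_loss < max_loss


if __name__ == "__main__":
    prob, loss = bayesian_read(conv_a=300, n_a=10_000, conv_b=345, n_b=10_000)
    print(f"P(B better) = {prob:.3f}, expected loss = {loss:.5f}")
    print("ship B" if should_ship(300, 10_000, 345, 10_000) else "keep testing")
```

The counts in the usage example are made up for illustration. A different prior, or a minimum sample size before the first read, would change the behavior early in a test.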
What to test first if you're traffic-constrained
Prioritize tests in this order: (1) elements that affect the largest share of traffic (header CTA, hero headline), (2) elements with the largest expected effect size (offer changes, pricing), (3) elements that can be tested as CTR proxies (email subject lines, ad headlines), which reach significance 5–10x faster than CVR tests. Stop running tiny button-color tests: they require huge samples to detect tiny effects and are almost never business-critical.
Real-world example: proper test duration prevents a $240k mistake
A B2B SaaS company ran an A/B test on their pricing page in Q4 2025. Their baseline CVR was 2.1% (trial signups per visitor). They added a new pricing table design as the variant. At day 4, the variant showed a 3.8% CVR vs. 2.1% control — an 81% relative lift, p=0.03. The team was ready to ship. We ran the duration estimator instead: with 2.1% baseline, 10% MDE target, and their 800 daily visitors to that page, they needed 26 days total. Day 4 had 32 conversions in the variant — far below the 3,200 needed for adequate power.
We held the test. By day 14, the variant was at 2.6% vs. 2.1%: still a lift, but the "81% lift" had compressed to 24%. By day 26, the variant was at 2.3% vs. 2.1%, a modest lift of about 9% that was not significant at alpha 0.05. The right call: keep control and iterate on a different pricing angle. The premature read would have shipped an unvalidated change and risked a downstream CVR regression. The revenue at stake: their average trial-to-paid rate was 18% at $3,600 ACV. Misreading the test and shipping a regressing variant could have cost $180k–$240k in ARR.
Building a testing calendar that avoids common traps
- Map test-dead periods. Mark BFCM week, major holidays, and any planned product or pricing changes as off-limits for A/B tests. Tests running during disrupted baselines produce invalid results that get acted on incorrectly.
- Sequence tests, don't stack them. If two tests touch the same page, run them consecutively on the shared audience. Simultaneous tests on different pages generally don't interfere (different URLs), but simultaneous tests on different elements of the same page do: the interaction effect confounds both results.
- Budget traffic before testing. Before scheduling a test, run the duration estimator. If the required duration exceeds 60 days given your traffic, either accept a wider MDE, wait for traffic to grow, or test an upstream metric (CTR on an email subject line) that reaches significance faster.
- Create a test log. Every completed test (wins and losses) goes in a shared document with: hypothesis, change made, sample size, duration, result, and the winner shipped. After 20 tests, this log reveals which kinds of changes reliably win for your audience and which don't, making future tests faster and more accurately powered.
The practical testing framework I use with clients
- Define hypothesis. Specific change, specific predicted direction, specific target metric.
- Compute required sample size. Use this estimator. Round up to the nearest full weekly cycle.
- Pre-register. Write the hypothesis, sample size, alpha/power, and read date in a shared doc.
- Launch with 50/50 split. Balanced allocation is optimal unless you have a clear reason to skew.
- Monitor for disasters only. Check daily for 3x-worse-than-control disasters; otherwise wait for read date.
- Read at pre-committed date. Run the significance test. If the result is significant and the observed lift meets the MDE, ship the variant. Otherwise, keep control.
- Post-ship validate. Monitor the winner for 30 days in production to catch novelty effects.
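The read step in this framework can be sketched as a single function. This is one possible reading of the rule, assuming a two-sided pooled two-proportion z-test and treating "MDE met" as "the observed relative lift is at least the pre-registered MDE." The name `read_test` and its return shape are hypothetical, and it assumes both arms have visitors and control has at least one conversion.

```python
from math import sqrt
from statistics import NormalDist


def read_test(conv_a, n_a, conv_b, n_b, relative_mde, alpha=0.05):
    """Single pre-committed read: significance test plus the ship/keep rule."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    lift = (rate_b - rate_a) / rate_a  # relative lift of variant over control
    ship = p_value < alpha and lift >= relative_mde
    return {
        "lift": lift,
        "p_value": p_value,
        "decision": "ship variant" if ship else "keep control",
    }


if __name__ == "__main__":
    # Illustrative counts only, not data from the case study above.
    result = read_test(conv_a=1590, n_a=53_000, conv_b=1780, n_b=53_000, relative_mde=0.10)
    print(f"lift={result['lift']:.1%}  p={result['p_value']:.4f}  -> {result['decision']}")
```

A significant negative lift also returns "keep control", since the lift check fails whenever the variant underperforms.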