A/B Test Duration Estimator

The test-duration question that decides whether your result is real

A/B testing results that look like winners at day 3 are statistical artifacts most of the time. The 2025 Ronny Kohavi / Airbnb study of 7,800 completed product tests found that roughly 67% of tests that showed statistical significance at day 3 reverted to null by day 14. Peeking — reading a test before it has gathered sufficient sample — inflates the false-positive rate above the nominal alpha you set. The fix is to calculate required sample size before launching the test and commit to a read date, not a significance threshold. This estimator does exactly that.

The required sample size depends on four variables: (1) your baseline conversion rate, (2) the minimum detectable effect (MDE) you want to reliably measure, (3) your significance level alpha, (4) your statistical power. The estimator wires these together into a visitor count, then divides by your daily traffic to give you days-to-run. Commit to the full duration before launching; kill the test only if you hit a pre-agreed "disaster threshold" (e.g., variant is clearly harming revenue).
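Written out, the calculation looks like the formula below. It is the standard normal-approximation sample size for comparing two proportions with a two-sided test and a 50/50 split. Whether this estimator uses exactly this variant is an assumption, and the symbols (p_1 for baseline conversion rate, p_2 for the variant rate at the MDE, T for total daily traffic across both arms) are introduced here for illustration only.

```latex
p_2 = p_1\,(1 + \mathrm{MDE}), \qquad
n_{\text{per arm}} = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2
  \left[\,p_1(1-p_1) + p_2(1-p_2)\,\right]}{(p_2 - p_1)^2}, \qquad
\text{days} = \left\lceil \frac{2\,n_{\text{per arm}}}{T} \right\rceil
```

Here power equals 1 - β. For alpha 0.05 and power 0.80, the two z-values are approximately 1.96 and 0.84.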

Why MDE matters more than most marketers realize

MDE is the smallest effect size your test is powered to detect reliably. A test powered to detect a 20% relative lift requires substantially fewer samples than one powered to detect a 5% relative lift — roughly 16x fewer, because sample size scales with 1/MDE². If you set MDE too high (20%+) you will miss real 8% effects as "not significant"; if you set it too low (2%) you will run tests for months without reaching significance, losing iteration velocity. The 2026 practical MDE for most conversion-rate optimization work: 10–15% relative. For revenue-per-visitor tests where variance is higher, 15–25%. For dramatic redesigns where you expect big effects, 20%+.
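To see the 1/MDE² scaling in practice, here is a minimal Python sketch that compares a 5% and a 20% relative MDE. It assumes the standard two-sided, two-proportion normal approximation; the function name `sample_size_per_arm` and the 3% baseline are hypothetical choices for illustration, not this estimator's internals.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors per arm for a two-sided, two-proportion z-test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical 3% baseline, chosen only to illustrate the scaling.
n_small_mde = sample_size_per_arm(0.03, 0.05)
n_large_mde = sample_size_per_arm(0.03, 0.20)
ratio = n_small_mde / n_large_mde

print(f"5% MDE:  {n_small_mde:,} per arm")
print(f"20% MDE: {n_large_mde:,} per arm")
print(f"ratio:   {ratio:.1f}x")
```

The ratio lands close to, but slightly under, 16x, because the variance term also shifts a little as the variant rate moves.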

Setting	Typical use	Notes
MDE 5% (relative)	High-traffic mature accounts only	Huge samples required
MDE 10% (relative)	Standard CRO testing	Typical baseline for landing page tests
MDE 15% (relative)	Medium-traffic accounts	Practical for most tests
MDE 20% (relative)	Redesigns, big offers	Low sample, fast reads
Alpha 0.05	Standard significance threshold	5% false-positive rate
Alpha 0.10	Exploratory testing	Use for directional, not final
Power 0.80	Standard	80% chance of detecting true effect
Power 0.90	High-stakes tests	Use when cost of missing is high

The peeking problem quantified

Peeking at an A/B test daily during its run, and declaring a winner the first time p < 0.05 appears, inflates the actual false-positive rate from 5% to roughly 28% (documented in Kohavi et al.'s seminal paper). The exact figure depends on how many times you look: the more reads, the higher the inflation. That means more than a quarter of tests with no real effect will still hand you a "winner" that is actually noise. The two valid ways to avoid peeking: (1) pre-register the sample size and read only once at the pre-committed visitor count, or (2) use a sequential testing framework (such as mSPRT, or a Bayesian decision rule) that is built for continuous monitoring. Most marketing teams are not using sequential frameworks, so discipline around read dates is the practical fix.
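You can reproduce the inflation yourself with a small A/A simulation. The sketch below is simplified: it assumes equal-sized daily batches and treats the test statistic as normally distributed, so under the null each batch adds an independent N(0, 1) increment. The function name `false_positive_rate` and the look counts are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(looks, simulations=20_000, alpha=0.05, seed=7):
    """Share of A/A tests declared 'significant' at any of `looks` equally spaced reads."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(simulations):
        cumulative = 0.0
        for k in range(1, looks + 1):
            # Under the null, each batch contributes an independent N(0, 1) increment.
            cumulative += rng.gauss(0.0, 1.0)
            # z-statistic computed on all data gathered so far.
            if abs(cumulative / sqrt(k)) > z_crit:
                hits += 1  # stop at the first "significant" read
                break
    return hits / simulations


for looks in (1, 7, 14, 28):
    print(f"{looks:>2} looks: {false_positive_rate(looks):.1%} false positives")
```

A single read stays near the nominal 5%, and the rate climbs steadily as daily reads are added.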

Real-world traffic requirements

For a 3% baseline CVR and 10% relative MDE at alpha 0.05 (two-sided) and power 0.80, the standard two-proportion formula gives roughly 53,000 visitors per arm (about 106,000 total). At 2,000 daily visitors that's about 54 days, or eight full weeks. At a 1% baseline CVR (typical for cold B2B traffic) the requirement balloons to roughly 163,000 per arm, which is more than five months at the same traffic level. This is why small B2B accounts struggle with proper A/B testing: the traffic isn't there. The answer isn't to shortcut the test; it's to test upstream metrics (CTR, landing page engagement) that converge faster, or to accept a wider MDE that matches your traffic reality.

Scenario	Duration	Verdict
3% CVR, 10% MDE, 2k/day traffic	~54 days	Feasible, but slow
3% CVR, 10% MDE, 500/day traffic	~213 days	Too slow; widen MDE
1% CVR, 10% MDE, 2k/day traffic	~164 days	B2B cold traffic reality
1% CVR, 20% MDE, 2k/day traffic	~43 days	Accept larger detectable effect
5% CVR, 10% MDE, 2k/day traffic	~32 days	Higher-intent traffic
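The scenarios in the table above can be recomputed with a short script. This is a sketch under stated assumptions: a two-sided test, a 50/50 split, alpha 0.05, power 0.80, and the unpooled normal-approximation formula. Other calculators use slightly different approximations (pooled variance, continuity corrections, one-sided tests), so treat the output as a cross-check. The helper names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors per arm for a two-sided, two-proportion z-test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def days_to_run(baseline, relative_mde, daily_visitors, alpha=0.05, power=0.80):
    """Whole days needed when daily_visitors is split evenly across two arms."""
    total = 2 * sample_size_per_arm(baseline, relative_mde, alpha, power)
    return ceil(total / daily_visitors)


# (baseline CVR, relative MDE, daily visitors) for each row of the table.
scenarios = [
    (0.03, 0.10, 2000),
    (0.03, 0.10, 500),
    (0.01, 0.10, 2000),
    (0.01, 0.20, 2000),
    (0.05, 0.10, 2000),
]

for baseline, mde, traffic in scenarios:
    n = sample_size_per_arm(baseline, mde)
    days = days_to_run(baseline, mde, traffic)
    print(f"{baseline:.0%} CVR, {mde:.0%} MDE, {traffic}/day: {n:,} per arm, {days} days")
```

Rounding the result up to whole weeks before committing to a read date keeps every weekday equally represented.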

Sequential testing: the modern answer

Tools like Optimizely (post-2022), VWO's Bayesian mode, and homegrown Bayesian implementations in BigQuery / dbt are built for continuous monitoring. Frequentist sequential methods such as mSPRT are designed to hold the false-positive rate at alpha no matter how often you look; Bayesian decision rules don't carry that guarantee, but they frame the stopping decision around the expected cost of being wrong. The Bayesian approach: at each read, compute the posterior probability that variant B is better than variant A, plus the expected loss if you pick the wrong variant. Ship when P(B better) > 0.95 AND expected loss < 0.5% absolute. This is cleaner math than frequentist for teams with stats sophistication. For teams without it, the frequentist pre-commit-and-read-once approach is safer and produces similar real-world outcomes over 20+ test cycles.
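A minimal sketch of that Bayesian read, using only the Python standard library. It assumes a uniform Beta(1, 1) prior on each arm's conversion rate and interprets "expected loss" as the conversion rate (in absolute terms) you give up on average if you ship B and A is actually better. The function name `bayesian_read` and the visitor counts are hypothetical; commercial tools may define these quantities differently.

```python
import random


def bayesian_read(conv_a, n_a, conv_b, n_b, draws=20_000, seed=11):
    """Monte Carlo estimate of P(B > A) and the expected loss from shipping B."""
    rng = random.Random(seed)
    wins = 0
    loss = 0.0
    for _ in range(draws):
        # Beta(1, 1) prior updated with observed conversions and non-conversions.
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
        # Conversion rate given up if B ships but A is actually better.
        loss += max(rate_a - rate_b, 0.0)
    return wins / draws, loss / draws


# Hypothetical counts, for illustration only.
prob_b_better, expected_loss = bayesian_read(conv_a=300, n_a=10_000, conv_b=360, n_b=10_000)
ship = prob_b_better > 0.95 and expected_loss < 0.005

print(f"P(B better) = {prob_b_better:.3f}, expected loss = {expected_loss:.5f}, ship = {ship}")
```

Both conditions have to hold before shipping, which is what keeps a high-probability but high-risk variant from going out early.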

What to test first if you're traffic-constrained

Prioritize tests in this order: (1) elements that affect the largest share of traffic (header CTA, hero headline), (2) elements with the largest expected effect size (offer changes, pricing), (3) elements that can be tested as CTR proxies (email subject lines, ad headlines) which reach significance 5–10x faster than CVR tests. Stop running tiny button-color tests: they require huge samples to detect tiny effects and are almost never business-critical.

Real-world example: proper test duration prevents a $240k mistake

A B2B SaaS company ran an A/B test on their pricing page in Q4 2025. Their baseline CVR was 2.1% (trial signups per visitor). They added a new pricing table design as the variant. At day 4, the variant showed a 3.8% CVR vs. 2.1% control — an 81% relative lift, p=0.03. The team was ready to ship. We ran the duration estimator instead: with 2.1% baseline, 10% MDE target, and their 800 daily visitors to that page, they needed 26 days total. Day 4 had 32 conversions in the variant — far below the 3,200 needed for adequate power.

We held the test. By day 14, the variant was at 2.6% vs 2.1%: still a lift, but the "81% lift" had compressed to 24%. By day 26, the variant was at 2.3% vs 2.1%, a modest 9% lift that was not significant at alpha 0.05. The right call: keep control, iterate on a different pricing angle. The premature read would have shipped a change with no proven benefit and risked a downstream CVR regression. The revenue at stake: their average trial-to-paid rate was 18% at $3,600 ACV. Misreading the test and shipping a regressing variant could have cost $180–$240k in ARR.

Building a testing calendar that avoids common traps

Map test-dead periods. Mark BFCM week, major holidays, and any planned product or pricing changes as off-limits for A/B tests. Tests running during disrupted baselines produce invalid results that get acted on incorrectly.
Sequence tests, don't stack them. If two elements on the same page both need testing, run the tests consecutively on the shared audience. Simultaneous tests on different pages rarely interfere (different URLs), but simultaneous tests on different elements of the same page do: the interaction effect confounds both results.
Budget traffic before testing. Before scheduling a test, run the duration estimator. If the required duration exceeds 60 days given your traffic, either accept a wider MDE, wait for traffic to grow, or test an upstream metric (CTR on an email subject line) that reaches significance faster.
Create a test log. Every completed test (wins and losses) goes in a shared document with: hypothesis, change made, sample size, duration, result, and the winner shipped. After 20 tests, this log reveals which frameworks reliably win for your audience and which don't — making future tests faster and more accurately powered.

The practical testing framework I use with clients

Define hypothesis. Specific change, specific predicted direction, specific target metric.
Compute required sample size. Use this estimator. Round the duration up to a whole number of weeks so every weekday is sampled equally.
Pre-register. Write the hypothesis, sample size, alpha/power, and read date in a shared doc.
Launch with 50/50 split. Balanced allocation is optimal unless you have a clear reason to skew.
Monitor for disasters only. Check daily for 3x-worse-than-control disasters; otherwise wait for read date.
Read at pre-committed date. Run the significance test. Pass significance + MDE met = ship. Below MDE = keep control.
Post-ship validate. Monitor the winner for 30 days in production to catch novelty effects.
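The read-date step in the list above can be sketched as a two-sided, pooled two-proportion z-test followed by the ship / keep-control rule. This is an illustration under assumptions, not the estimator's own code: the function name `read_test`, the dictionary keys, and the visitor counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def read_test(conv_a, n_a, conv_b, n_b, relative_mde, alpha=0.05):
    """Two-sided pooled z-test on conversion rates plus the ship / keep-control rule."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    lift = rate_b / rate_a - 1
    # Ship only if the result is significant AND the observed lift meets the MDE.
    ship = p_value < alpha and lift >= relative_mde
    return {"lift": lift, "p_value": p_value, "ship": ship}


# Hypothetical read-date counts, for illustration only.
result = read_test(conv_a=1600, n_a=53_300, conv_b=1790, n_b=53_300, relative_mde=0.10)
print(f"lift = {result['lift']:.1%}, p = {result['p_value']:.4f}, ship = {result['ship']}")
```

Run it once, at the pre-committed read date; calling it every day reintroduces the peeking problem.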

Frequently asked questions

Q1. What alpha and power should I use?

For standard conversion-rate optimization: alpha 0.05, power 0.80. These are long-standing conventions in applied statistics. Use alpha 0.01 or power 0.90 only for high-stakes decisions (pricing changes, checkout rewrites) where a false positive or false negative is expensive.

Q2. Can I run two tests on the same page simultaneously?

Only if they are on independent elements (different parts of the page, different traffic sources). Two tests on overlapping elements introduce interaction effects and invalidate both. Use a multivariate design or sequence the tests.

Q3. How do I handle holidays and sales during a test?

Pause the test or invalidate the affected days. Black Friday, Cyber Monday, Christmas week, and major company announcements all disrupt baseline behavior enough to bias any in-progress test. Plan test windows to avoid these anchors.

Q4. What if my test hits significance way before the read date?

Wait anyway. 'Stopping for success' inflates false-positive rate the same way peeking does. Document the early significance as a secondary data point but read officially at the pre-committed date. If the result persists to the read date, you have a real winner.

Q5. How do I know my MDE is realistic?

Look at your last 5 completed tests. What was the median lift of the winners? Use 70-80% of that as your MDE floor. Setting MDE higher than your historical winners means you will miss real effects; setting it lower guarantees months-long tests.
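As a quick worked example of that rule of thumb, the snippet below takes the winners among a set of recent tests and applies the 70-80% range to their median lift. The lift values are hypothetical placeholders; substitute your own test log.

```python
from statistics import median

# Hypothetical relative lifts of the winners among the last five completed tests.
winner_lifts = [0.12, 0.18, 0.09]

median_lift = median(winner_lifts)
mde_floor_low = 0.70 * median_lift
mde_floor_high = 0.80 * median_lift

print(f"median winning lift: {median_lift:.1%}")
print(f"MDE floor range: {mde_floor_low:.1%} to {mde_floor_high:.1%}")
```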

Q6. Should I test at the page level or campaign level?

Both have a role. Page-level tests (CRO) isolate one variable cleanly and attribute effects precisely. Campaign-level tests (geo holdouts, PSA tests) measure true incrementality including halo effects. Use page-level for iterative optimization; campaign-level for big strategic questions (is this channel working?).

Q7. What's the minimum traffic needed to run useful A/B tests?

At a 2% baseline CVR and 10% MDE (alpha 0.05, power 0.80), you need roughly 80,000 visitors per arm, about 160,000 total. At 500 visitors/day, that's more than 300 days per test. At this traffic level, run only tests expected to produce 20%+ relative lift (set a wider MDE), and budget close to three months even for those. Below 300 visitors/day, A/B testing on conversion rates is impractical; test upstream metrics instead (email open rate, ad CTR) where you can gather enough data faster.
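If traffic is the fixed constraint, you can turn the question around and ask what MDE your traffic can actually support. The sketch below bisects on the relative MDE until the required sample fits a duration cap. It assumes the same two-sided normal-approximation formula as before; `smallest_mde`, the 60-day cap, and the example inputs are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors per arm for a two-sided, two-proportion z-test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def smallest_mde(baseline, daily_visitors, max_days, alpha=0.05, power=0.80):
    """Smallest relative MDE whose required sample fits within max_days (bisection)."""
    budget_per_arm = daily_visitors * max_days / 2
    # Search relative lifts from 0.1% up to 500%, keeping the variant rate below 1.
    low, high = 0.001, min(5.0, 0.99 / baseline - 1)
    for _ in range(60):
        mid = (low + high) / 2
        if sample_size_per_arm(baseline, mid, alpha, power) > budget_per_arm:
            low = mid   # needs more visitors than the budget allows: widen the MDE
        else:
            high = mid  # fits the budget: try a tighter MDE
    return high


# Hypothetical account: 2% baseline CVR, 500 visitors/day, 60-day cap.
mde = smallest_mde(baseline=0.02, daily_visitors=500, max_days=60)
print(f"smallest detectable relative lift: {mde:.1%}")
```

If the answer is larger than any lift you realistically expect, that is the signal to move the test to an upstream metric.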

Q8. How does novelty effect bias test results?

Novelty effect: users are more likely to click on or engage with something new because it's different, not because it's better. This inflates early test results. The fix: run the test long enough (at least 2 full weeks) to let novelty decay, then read results. For major redesigns (new navigation, complete page restructure), consider a 4-6 week test window and compare week-1 performance to week-3+ performance to isolate novelty from real lift.
