Marketing Calc Hub

A/B test duration estimator

Estimate how long a test will run given baseline CVR, MDE, traffic, and statistical power.

Inputs

Sample size per arm: 53,228
Total visitors needed: 106,456
Days to run: 54
Read: at 3% baseline with a 10% relative MDE, you need 53,228 visitors per arm. At 2,000 visitors/day that's 54 days. Peeking before day 54 inflates false-positive rate above your stated alpha.

[Chart: cumulative traffic toward target]


The test-duration question that decides whether your result is real

A/B test results that look like winners at day 3 are statistical artifacts most of the time. The 2025 Ronny Kohavi / Airbnb study of 7,800 completed product tests found that roughly 67% of tests that showed statistical significance at day 3 reverted to null by day 14. Peeking — reading a test before it has gathered sufficient sample — inflates the false-positive rate above the nominal alpha you set. The fix is to calculate required sample size before launching the test and commit to a read date, not a significance threshold. This estimator does exactly that.

The required sample size depends on four variables: (1) your baseline conversion rate, (2) the minimum detectable effect (MDE) you want to reliably measure, (3) your significance level alpha, (4) your statistical power. The estimator wires these together into a visitor count, then divides by your daily traffic to give you days-to-run. Commit to the full duration before launching; kill the test only if you hit a pre-agreed "disaster threshold" (e.g., variant is clearly harming revenue).
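The wiring described above is a standard two-proportion power calculation. A minimal sketch, using the 3% baseline / 10% relative MDE / 2,000-visitors-a-day running example:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline_cvr, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per arm (two-proportion z-test, normal approximation)."""
    p1 = baseline_cvr
    p2 = baseline_cvr * (1 + relative_mde)         # CVR you want to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

per_arm = sample_size_per_arm(0.03, 0.10)  # ~53,000 visitors per arm
days = ceil(2 * per_arm / 2_000)           # total visitors / daily traffic -> 54
```

Small differences from the estimator's 53,228 come down to rounding and variance conventions (pooled vs. unpooled); the order of magnitude and the days-to-run answer are the same.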

Why MDE matters more than most marketers realize

MDE is the smallest effect size your test is powered to detect reliably. A test powered to detect a 20% relative lift requires substantially fewer samples than one powered to detect a 5% relative lift — roughly 16x fewer, because sample size scales with 1/MDE². Set MDE too high (20%+) and you will miss real 8% effects as "not significant"; set it too low (2%) and you will run tests for months without reaching significance, losing iteration velocity. As of 2026, a practical MDE for most conversion-rate optimization work is 10–15% relative; for revenue-per-visitor tests, where variance is higher, 15–25%; for dramatic redesigns where you expect big effects, 20%+.
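The 1/MDE² scaling is easy to verify numerically with the same normal-approximation formula (3% baseline, alpha 0.05, power 0.80 assumed):

```python
from statistics import NormalDist

def per_arm(cvr, mde, alpha=0.05, power=0.80):
    """Per-arm sample size for a relative MDE (normal approximation)."""
    p2 = cvr * (1 + mde)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z ** 2 * (cvr * (1 - cvr) + p2 * (1 - p2)) / (p2 - cvr) ** 2

ratio = per_arm(0.03, 0.05) / per_arm(0.03, 0.20)
# close to (0.20 / 0.05) ** 2 == 16; slightly under, because the
# variance term also shifts with p2
```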

MDE 5% (relative) | Huge samples required | Only for high-traffic, mature accounts
MDE 10% (relative) | Standard CRO testing | Typical baseline for landing page tests
MDE 15% (relative) | Medium-traffic accounts | Practical for most tests
MDE 20% (relative) | Redesigns, big offers | Low sample, fast reads
Alpha 0.05 | Standard significance threshold | 5% false-positive rate
Alpha 0.10 | Exploratory testing | Use for directional reads, not final calls
Power 0.80 | Standard | 80% chance of detecting a true effect
Power 0.90 | High-stakes tests | Use when the cost of missing a real effect is high

The peeking problem quantified

Peeking at an A/B test daily during its run, and declaring a winner the first time p < 0.05 appears, inflates the actual false-positive rate from 5% to roughly 28% (documented in Kohavi et al.'s seminal paper). That means over a quarter of your "winning" tests are actually noise. The two valid ways to avoid peeking: (1) pre-register the sample size and read only once at the pre-committed visitor count, or (2) use a sequential testing framework (Bayesian posterior probability, mSPRT) that corrects for continuous monitoring. Most marketing teams are not using sequential frameworks, so discipline around read dates is the practical fix.
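The inflation is easy to reproduce by simulation: run A/A tests (no real effect), peek daily, and "ship" the first time p < 0.05. Every ship is a false positive. A minimal sketch — the simulation parameters (300 tests, 30 daily peeks, 100 visitors per arm per day) are illustrative, not from the Kohavi paper:

```python
import math
import random

def two_sided_p(c1, n1, c2, n2):
    """Two-proportion z-test p-value (pooled variance, normal approximation)."""
    p = (c1 + c2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (c1 / n1 - c2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))

def peeking_false_positive_rate(tests=300, days=30, daily=100, cvr=0.03, seed=1):
    """A/A tests with daily peeking: any 'significant' stop is a false win."""
    rng = random.Random(seed)
    false_wins = 0
    for _ in range(tests):
        ca = cb = na = nb = 0
        for _ in range(days):
            ca += sum(rng.random() < cvr for _ in range(daily))
            cb += sum(rng.random() < cvr for _ in range(daily))
            na += daily
            nb += daily
            if two_sided_p(ca, na, cb, nb) < 0.05:
                false_wins += 1
                break
    return false_wins / tests

fpr = peeking_false_positive_rate()  # well above the nominal 5%
```

Reading once at the pre-committed sample size collapses this back to the nominal alpha, because there is only one look.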

Real-world traffic requirements

For a 3% baseline CVR and 10% relative MDE at alpha 0.05 and power 0.80, you need roughly 53,000 visitors per arm (about 106,000 total) — the scenario in the estimator above. At 2,000 daily visitors that's about 54 days. At a 1% baseline CVR (typical for cold B2B traffic) the requirement balloons to roughly 164,000 per arm — about 164 days at the same traffic level. This is why small B2B accounts struggle with proper A/B testing: the traffic isn't there. The answer isn't to shortcut the test; it's to test upstream metrics (CTR, landing page engagement) that converge faster, or to accept a wider MDE that matches your traffic reality.

3% CVR, 10% MDE, 2k/day traffic | ~54 days | Workable DTC window with patience
3% CVR, 10% MDE, 500/day traffic | ~213 days | Too slow: widen the MDE
1% CVR, 10% MDE, 2k/day traffic | ~164 days | B2B cold-traffic reality
1% CVR, 20% MDE, 2k/day traffic | ~43 days | Accept a larger detectable effect
5% CVR, 10% MDE, 2k/day traffic | ~32 days | Higher-intent traffic converges faster
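Days-to-run for any scenario can be computed directly from the same sample-size arithmetic. A compact sketch, assuming alpha 0.05, power 0.80, and a 50/50 split:

```python
from math import ceil
from statistics import NormalDist

def days_to_run(cvr, mde, daily, alpha=0.05, power=0.80):
    """Days until both arms of a 50/50 test reach the required sample."""
    p2 = cvr * (1 + mde)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    n = ceil(z ** 2 * (cvr * (1 - cvr) + p2 * (1 - p2)) / (p2 - cvr) ** 2)
    return ceil(2 * n / daily)  # two arms, divided by daily traffic

for cvr, mde, daily in [(0.03, 0.10, 2000), (0.03, 0.10, 500),
                        (0.01, 0.10, 2000), (0.01, 0.20, 2000),
                        (0.05, 0.10, 2000)]:
    print(f"{cvr:.0%} CVR, {mde:.0%} MDE, {daily}/day -> "
          f"{days_to_run(cvr, mde, daily)} days")
```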

Sequential testing: the modern answer

Tools like Optimizely (post-2022), VWO's Bayesian mode, and homegrown Bayesian implementations in BigQuery / dbt now allow continuous monitoring without false-positive inflation. The Bayesian approach: at each read, compute the posterior probability that variant B is better than variant A, plus the expected loss if you pick the wrong variant. Ship when P(B better) > 0.95 AND expected loss < 0.5% absolute. For teams with the statistical sophistication to run it, this is cleaner math than frequentist corrections. For teams without it, the frequentist pre-commit-and-read-once approach is safer and produces similar real-world outcomes over 20+ test cycles.
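That ship rule can be sketched with Beta posteriors and Monte Carlo sampling. A minimal sketch — the conversion counts below are made-up illustrations, not benchmarks:

```python
import random

def bayes_read(conv_a, n_a, conv_b, n_b, draws=50_000, seed=7):
    """Posterior P(B beats A) and expected loss of shipping B,
    using Beta(1 + conversions, 1 + non-conversions) posteriors."""
    rng = random.Random(seed)
    b_wins = 0
    loss = 0.0
    for _ in range(draws):
        pa = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        pb = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if pb > pa:
            b_wins += 1
        loss += max(pa - pb, 0.0)  # regret if B ships but A was truly better
    return b_wins / draws, loss / draws

# hypothetical read: A converts 300/10,000 (3.0%), B converts 360/10,000 (3.6%)
prob_b_better, expected_loss = bayes_read(300, 10_000, 360, 10_000)
ship_b = prob_b_better > 0.95 and expected_loss < 0.005
```

The expected-loss term is what keeps this honest: it stays high when the posteriors overlap, even if P(B better) briefly crosses 0.95.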

What to test first if you're traffic-constrained

Prioritize tests in this order: (1) elements that affect the largest share of traffic (header CTA, hero headline), (2) elements with the widest expected effect size (offer changes, pricing), (3) elements that can be tested as CTR proxies (email subject lines, ad headlines) which reach significance 5–10x faster than CVR tests. Stop running tiny button-color tests — they require huge samples to detect tiny effects and are almost never business-critical.

The practical testing framework I use with clients

  1. Define hypothesis. Specific change, specific predicted direction, specific target metric.
  2. Compute required sample size. Use this estimator. Round up to the nearest full weekly cycle.
  3. Pre-register. Write the hypothesis, sample size, alpha/power, and read date in a shared doc.
  4. Launch with 50/50 split. Balanced allocation is optimal unless you have a clear reason to skew.
  5. Monitor for disasters only. Check daily for 3x-worse-than-control disasters; otherwise wait for read date.
  6. Read at pre-committed date. Run the significance test. Pass significance + MDE met = ship. Below MDE = keep control.
  7. Post-ship validate. Monitor the winner for 30 days in production to catch novelty effects.
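Steps 2 and 3 above amount to writing the plan down as data before launch. A minimal sketch of a pre-registration record — every field value here is hypothetical:

```python
from dataclasses import dataclass
from datetime import date, timedelta

DAILY_VISITORS = 2_000  # assumed traffic for the example

@dataclass(frozen=True)
class TestPlan:
    hypothesis: str
    metric: str
    baseline_cvr: float
    relative_mde: float
    alpha: float
    power: float
    sample_per_arm: int
    launch: date

    @property
    def read_date(self) -> date:
        """Pre-committed read date: launch plus days to reach both arms' sample."""
        days = -(-2 * self.sample_per_arm // DAILY_VISITORS)  # ceiling division
        return self.launch + timedelta(days=days)

plan = TestPlan(
    hypothesis="New hero headline lifts signup CVR",
    metric="signup_cvr",
    baseline_cvr=0.03, relative_mde=0.10,
    alpha=0.05, power=0.80,
    sample_per_arm=53_228,  # from the estimator above
    launch=date(2026, 1, 5),
)
```

Freezing the record (`frozen=True`) is the point: the read date is computed once from the plan and cannot be quietly edited mid-test.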

Frequently asked questions

Q1. What alpha and power should I use?
For standard conversion-rate optimization: alpha 0.05, power 0.80. These are 50-year industry standards. Use alpha 0.01 or power 0.90 only for high-stakes decisions (pricing changes, checkout rewrites) where a false positive or false negative is expensive.
Q2. Can I run two tests on the same page simultaneously?
Only if they are on independent elements (different parts of the page, different traffic sources). Two tests on overlapping elements introduce interaction effects and invalidate both. Use a multivariate design or sequence the tests.
Q3. How do I handle holidays and sales during a test?
Pause the test or invalidate the affected days. Black Friday, Cyber Monday, Christmas week, and major company announcements all disrupt baseline behavior enough to bias any in-progress test. Plan test windows to avoid these anchors.
Q4. What if my test hits significance way before the read date?
Wait anyway. 'Stopping for success' inflates false-positive rate the same way peeking does. Document the early significance as a secondary data point but read officially at the pre-committed date. If the result persists to the read date, you have a real winner.
Q5. How do I know my MDE is realistic?
Look at your last 5 completed tests. What was the median lift of the winners? Use 70-80% of that as your MDE floor. Setting MDE higher than your historical winners means you will miss real effects; setting it lower guarantees months-long tests.
Q6. Should I test at the page level or campaign level?
Both have a role. Page-level tests (CRO) isolate one variable cleanly and attribute effects precisely. Campaign-level tests (geo holdouts, PSA tests) measure true incrementality including halo effects. Use page-level for iterative optimization; campaign-level for big strategic questions (is this channel working?).
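The Q5 rule of thumb takes two lines to apply. The historical lifts below are hypothetical placeholders, not benchmarks:

```python
from statistics import median

winning_lifts = [0.12, 0.09, 0.18, 0.07, 0.14]  # last 5 winners' relative lifts (hypothetical)
mde_floor = 0.75 * median(winning_lifts)        # 70-80% of the median; 0.75 used here
# an MDE well above mde_floor misses real effects; far below it means months-long tests
```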
