Check statistical significance for conversion tests: p-value, lift, and sample-size check.
Results
Variant A conv rate: 5.00%
Variant B conv rate: 6.25%
Lift: 25.0% (B wins)
p-value: 0.0153
✓ Statistically significant
Insight: Declare the winner with 98.5% confidence.
What p-value means
p < 0.05 is the industry standard for 'significant' (95% confidence). A p-value of 0.04 means that if there were no real difference between the variants, you'd see a gap this large only 4% of the time.
Common A/B test mistakes
Stopping early (when the test happens to look good). Low sample size. Multiple simultaneous tests interfering. Running different ad spend across variants.
Sample size rule of thumb
You need at least 1000 conversions per variant for small lifts (<10%) to be detectable. For 20%+ lifts, 300 conversions per variant is usually enough.
Frequently asked questions
1. How long should I run a test?
Minimum one full business cycle (usually 2 weeks). Longer if weekend behavior differs from weekday.
2. Why p<0.05 and not p<0.01?
0.05 is convention. 0.01 is stricter but requires much larger samples. For high-stakes decisions, use 0.01.
3. What if my test is inconclusive?
Run it longer, increase traffic, or design a bigger experiment. Don't ship 'inconclusive' wins; they revert.
4. Can I peek at results early?
No; it inflates false positives. Set your sample size upfront and wait. Or use sequential testing methods.
5. What's the difference between Bayesian and frequentist?
Frequentist (this tool): 'if there were no real difference, a result this extreme would show up only 4% of the time.' Bayesian: 'there's a 96% chance B is better.' Frequentist is more common; Bayesian is more intuitive.
A/B testing: the math is easy, the discipline is hard
The math behind statistical significance hasn't changed in 100 years: it's a Z-test on proportions for conversion-rate tests, or a t-test on continuous metrics like revenue. What breaks 80% of the A/B tests I audit isn't the math; it's the process around the math. Teams peek at results early. They stop tests when they "look good." They run too many tests simultaneously on overlapping audiences. They declare winners at n=200 conversions per variant because "the lift is huge." And then the "winner" reverts to baseline the following quarter and everyone wonders what happened.
This calculator gives you the frequentist p-value, judged against the industry-standard 95% confidence threshold, and tells you whether the test is conclusive. Below, I'll walk through the rules that turn a calculator result into a real business decision, and the traps that turn significance into self-deception.
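For the curious, here is a minimal sketch of the two-proportion Z-test a calculator like this runs under the hood. The 4,000-visitors-per-variant count is an illustrative assumption (not taken from the tool), chosen so the conversion rates match the 5.00% vs 6.25% example above; with it, the sketch reproduces a p-value close to the 0.0153 shown.

```python
# Two-proportion Z-test: a minimal sketch of the math behind the calculator.
# The 4,000-visitors-per-variant figure is an illustrative assumption; it makes
# the conversion rates match the 5.00% vs 6.25% example above.
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (relative lift, z, two-sided p-value) for a conversion A/B test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided
    return (p_b - p_a) / p_a, z, p_value

lift, z, p = two_proportion_z_test(conv_a=200, n_a=4_000, conv_b=250, n_b=4_000)
print(f"lift={lift:.1%}  z={z:.2f}  p={p:.4f}")         # ~25% lift, p ≈ 0.015
```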
Sample size planning (do this before you launch)
The #1 reason tests don't reach significance is under-powered sample sizes. Use a power-analysis calculator (Evan Miller's sample-size calculator is the industry standard) and plug in your baseline conversion rate, minimum detectable effect (MDE), and statistical power (usually 80%). A typical B2C ecommerce test with a 3% baseline conversion rate and a 10% MDE requires ~14,000 visitors per variant. At 8,000 daily visits split 50/50, that's ~3.5 days to reach adequate power, but most teams need 14–21 days to cover a full business cycle regardless. A minimal planning sketch follows the reference figures below.
Baseline 2%, MDE 20%: ~4,300 per variant (small-traffic friendly)
Baseline 3%, MDE 10%: ~14,100 per variant (tight tests get big)
Baseline 5%, MDE 5%: ~31,500 per variant (most realistic on-site)
Baseline 10%, MDE 5%: ~15,000 per variant (CTA-button tests)
Power target: 80% (standard; 90% for critical tests)
Significance threshold: p < 0.05 (0.01 for high-stakes)
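The sketch below implements the standard two-proportion sample-size formula (two-sided α, 80% power by default). Planning tools differ in their assumptions (one- vs two-sided tests, pooled vs unpooled variance, continuity corrections), so its output will not match every rounded figure above exactly; treat it as an order-of-magnitude planner.

```python
# Per-variant sample size for a two-proportion conversion test, using the
# standard normal-approximation formula. Planning tools differ in details
# (one- vs two-sided, pooled variance), so treat results as order-of-magnitude.
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline: float, mde_rel: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed per variant to detect a relative lift of mde_rel."""
    p1, p2 = baseline, baseline * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_power = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p2 - p1) ** 2
    return ceil(n)

# Example: 5% baseline conversion, 10% relative MDE, default alpha/power.
print(sample_size_per_variant(baseline=0.05, mde_rel=0.10), "visitors per variant")
```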
The 10 rules of honest A/B testing
1. Decide sample size and test duration BEFORE launching. Document the commitment. This prevents early-stopping temptation.
2. Run for a full business cycle. Minimum 7 full days. For B2B, usually 14–21 days to capture weekday-vs-weekend behavior. Longer if your cycle is monthly (subscription renewals, salary-pay cadence).
3. Don't peek. Checking early and stopping when "significance is reached" inflates your false-positive rate from 5% to as high as 30%. Academic papers have been written on this (Johari et al., 2017).
4. One variable at a time. If you change the headline AND the hero image AND the CTA color, a win tells you nothing about which change worked.
5. Equal traffic allocation. Don't run 70/30 unless you have a specific reason (e.g., exposure-limited test). 50/50 maximizes statistical power.
6. Watch for interaction effects. If you run 5 tests simultaneously on the same page, they can interact. Use a proper experimentation platform (Optimizely, VWO, Statsig, GrowthBook) that handles this.
7. Segment post-hoc only to generate hypotheses, not to declare winners. "It won for mobile but lost for desktop" could be real or could be p-hacking. Re-run the test segmented if the segment effect matters.
8. Declare ties. If p > 0.20 after adequate sample, the variants are truly equivalent. Ship the cheaper/simpler one.
9. Check for sample ratio mismatch. If variant A got 50,000 and variant B got 48,000 visitors, your randomization is broken. Throw out the test (a minimal check is sketched after this list).
10. Replicate critical winners. Before committing engineering resources to permanent implementation, re-run the winning variant against control for another cycle. "Regression to the mean" is real.
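Rule 9's sample-ratio-mismatch check is a one-line chi-square goodness-of-fit test. A minimal sketch, using the 50,000 vs 48,000 split from the rule as input; for one degree of freedom, the chi-square p-value can be read off the normal distribution.

```python
# Sample ratio mismatch (SRM) check: chi-square goodness-of-fit against the
# planned 50/50 split. A tiny p-value means randomization is broken.
from math import sqrt
from statistics import NormalDist

def srm_check(visitors_a: int, visitors_b: int, expected_ratio: float = 0.5):
    total = visitors_a + visitors_b
    expected_a = total * expected_ratio
    expected_b = total * (1 - expected_ratio)
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # 1 degree of freedom: chi2 is the square of a standard normal variate
    p_value = 2 * (1 - NormalDist().cdf(sqrt(chi2)))
    return chi2, p_value

chi2, p = srm_check(50_000, 48_000)
print(f"chi2={chi2:.1f}  p={p:.2e}")   # p far below 0.001 -> throw out the test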
Minimum detectable effect: the number that decides everything
MDE is the smallest lift you can reliably detect given your traffic. If you have 5,000 visits per variant per month and a 2% baseline, you can only detect 30%+ lifts with reasonable power. Most "real" lifts from landing-page tweaks are 5–15%, so that test is permanently under-powered. Two responses: (1) only test bold changes that plausibly move the needle 30%+ (different hero concepts, different offer structure), or (2) aggregate smaller pages into one test by running a site-wide change.
Most small-business owners I consult with are testing at traffic levels where only massive wins are detectable. They burn 12 weeks on a test that looks inconclusive and conclude "A/B testing doesn't work for us." The real diagnosis: their traffic is too low for the effect sizes they're trying to measure. Either test bigger changes or test less often with longer cycles.
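To make the MDE math concrete, here is a short sketch of the approximation spelled out in the FAQ further down (the 2.8 factor is the two-sided z for α = 0.05 plus the z for 80% power). Plugging in the low-traffic case from this section, a 2% baseline with 5,000 visits per variant, returns a relative MDE of roughly 39%, consistent with the "30%+ lifts only" point above.

```python
# Relative MDE for a given per-variant sample size, using the rule-of-thumb
# approximation MDE ≈ 2.8 * sqrt(2 * p * (1 - p) / n) / p
# (2.8 ≈ 1.96 for two-sided alpha = 0.05 plus 0.84 for 80% power).
from math import sqrt

def relative_mde(baseline: float, n_per_variant: int) -> float:
    return 2.8 * sqrt(2 * baseline * (1 - baseline) / n_per_variant) / baseline

# Low-traffic example from this section: 2% baseline, 5,000 visits per variant.
print(f"{relative_mde(0.02, 5_000):.0%}")   # ~39% relative lift needed
```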
Frequentist vs. Bayesian: which to use
This tool is frequentist (p-value based, the classical approach). Bayesian testing, used natively by VWO, Statsig, and GrowthBook, answers a different question: "what's the probability that B is actually better than A?" rather than "if there were no real difference, how likely is this much variation?"
Both are valid. Frequentist is standard in academic and regulated settings. Bayesian is more intuitive for business stakeholders ("B has a 94% chance of being better") and handles early stopping more gracefully. If you're running a modern experimentation platform, Bayesian with proper priors is usually the better default. For hand-calculated tests (this tool), frequentist is simpler and well-understood.
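For comparison, a minimal Bayesian sketch: with uniform Beta(1, 1) priors, the posterior for each conversion rate is a Beta distribution, and P(B > A) can be estimated by Monte Carlo. The 200/4,000 and 250/4,000 counts are the same illustrative assumption used in the frequentist sketch earlier, not output from any of the platforms named above.

```python
# Bayesian A/B comparison: Beta-Binomial posteriors with uniform priors,
# P(B > A) estimated by Monte Carlo. Counts are illustrative (same as the
# frequentist sketch above), not from any particular platform.
import random

def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   draws: int = 100_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate: Beta(conversions + 1, non-conversions + 1)
        rate_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        rate_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += rate_b > rate_a
    return wins / draws

print(f"P(B > A) = {prob_b_beats_a(200, 4_000, 250, 4_000):.1%}")  # ≈ 99%
```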
Tests have diminishing returns. The first few tests you run on a new landing page or email template will produce big wins; that's typically where the page has glaring problems. By test 10–15, you're into the 3–8% lift range. By test 30+, you're often running tests that cost more in engineering and ops time than the lift recovers.
Prioritize tests by ICE score (Impact, Confidence, Ease): high-traffic pages with clear hypotheses backed by session-replay data or user interviews score highest. Avoid low-traffic pages, tests driven by "let's just try X," and aesthetic tweaks (button color, font size) that rarely produce meaningful lifts.
Frequently asked questions
Q1. Why did my 'winning' test revert to baseline after shipping?
Most likely: you stopped early (before reaching planned sample size), you ran the test during an unusual traffic period (holiday, one-time promo), or you fell for regression to the mean on a lucky variant. Re-run the test for a full cycle before shipping permanent changes.
Q2. What sample size do I need?
Depends on baseline conversion rate and the lift you want to detect. 3% baseline + 10% detectable lift requires ~14,000 visitors per variant at 80% power. Use Evan Miller's sample-size calculator before launching to plan properly.
Q3. Can I run multiple A/B tests at once?
Yes, on different pages or audiences. Simultaneous tests on the same page can interact; use a proper experimentation platform if you need to overlap them. Tests on independent pages with independent audiences can run in parallel without issue.
Q4. What p-value should I use?
0.05 (95% confidence) is the industry default. For high-stakes decisions (pricing, homepage redesigns, major policy changes), use 0.01 (99% confidence) to cut false-positive rate. For low-stakes (button color on a low-traffic page), 0.10 is sometimes acceptable.
Q5. What if my test is inconclusive?
Either run longer (increase sample size) or accept the null. An inconclusive result is valid data โ it means the two variants are effectively equivalent given your traffic. Ship the cheaper or simpler option and move on.
Q6. How does iOS ATT affect A/B testing?
For on-site tests (landing pages, funnel), not at all; cookies are first-party. For ad-platform tests (Meta A/B testing across campaigns), ATT reduces attribution accuracy and inflates variance. Use geo-holdout or lift-study methodology instead of platform-reported conversions.
Q7. What testing tools should I use in 2026?
Experimentation platforms: Optimizely at $36k-$240k/year enterprise (legacy king), VWO at $199-$2,000+/month (strong Bayesian), Statsig at $150-$2,500+/month plus a free tier (developer-first, CUPED variance reduction built in), GrowthBook from $0 open-source or $300-$2,500/month cloud, LaunchDarkly at $10-$20/user/month for feature flags + limited A/B. For survey-based pre-test qualitative research: Hotjar at $80-$350/month, Maze at $99-$399/month. Skip Google Optimize; it was sunset in September 2023.
Q8. Explain MDE selection mathematically
MDE (minimum detectable effect) = the smallest true effect size you can detect with acceptable statistical power at your sample size. For a conversion test with baseline p, 80% power (1 − β), significance α = 0.05, and sample size n per variant, MDE ≈ 2.8 × sqrt(2p(1−p)/n) / p (as a relative lift). Lower MDE requires larger n quadratically: cutting MDE from 10% to 5% requires 4x the sample. Pick MDE at the smallest lift that would change your product decision, not at 'what I'd like to find.'
Q9. Why does peeking bias inflate false positives?
Frequentist p-values assume you look once, at the planned sample size. Each additional peek is another chance the random walk dips below p=0.05. Peeking 5 times inflates your true false-positive rate from 5% to ~14%; peeking daily for 30 days takes it to 30%+ (Johari et al., 2017). If you must peek, use sequential-testing methods (e.g., SPRT, group-sequential boundaries) or switch to Bayesian, which is less sensitive to peeking. A short A/A simulation of this effect appears after this FAQ.
Q10. How do I interpret p > 0.05?
It does NOT mean 'no effect'; it means 'the data is consistent with no effect given the variance you observed.' If you ran an adequately powered test and still got p > 0.05, accept the null (variants are equivalent for practical purposes). If you ran an under-powered test, p > 0.05 is ambiguous: you might have a real effect you couldn't detect. Check your post-hoc power to distinguish.
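The inflation from peeking (Q9) is easy to demonstrate with an A/A simulation: both arms share the same true conversion rate, a z-test is checked after every batch of traffic, and any peek that crosses p < 0.05 counts as a (false) win. The batch size, peek count, and simulation count below are arbitrary illustrative choices; the exact inflation depends on cadence and traffic.

```python
# A/A simulation of peeking: both variants share the same true conversion rate,
# yet checking a z-test after every traffic batch pushes the false-positive
# rate well above the nominal 5%. Batch/peek/sim counts are illustrative.
from math import sqrt
from statistics import NormalDist
import numpy as np

def peeking_false_positive_rate(true_rate: float = 0.05, batch: int = 1_000,
                                peeks: int = 10, sims: int = 2_000,
                                alpha: float = 0.05, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    norm = NormalDist()
    false_positives = 0
    for _ in range(sims):
        conv_a = conv_b = n = 0
        for _ in range(peeks):
            conv_a += rng.binomial(batch, true_rate)
            conv_b += rng.binomial(batch, true_rate)
            n += batch
            pooled = (conv_a + conv_b) / (2 * n)
            if pooled == 0 or pooled == 1:
                continue
            se = sqrt(pooled * (1 - pooled) * 2 / n)
            z = (conv_b - conv_a) / n / se
            if 2 * (1 - norm.cdf(abs(z))) < alpha:   # "significant" at this peek
                false_positives += 1
                break   # a peeker stops and ships as soon as it looks good
    return false_positives / sims

print(f"False-positive rate with 10 peeks: {peeking_false_positive_rate():.0%}")
```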
Three A/B test archetypes with full sample-size and MDE math
Archetype 1: DTC landing-page headline test (high traffic)
Baseline add-to-cart rate: 4.2%. Desired MDE: 8% relative lift (practical threshold; anything smaller doesn't change the roadmap). Power 80%, significance 0.05. Required sample per variant: ~28,400 sessions. At 6,000 daily paid sessions split 50/50 = 3,000/variant/day, adequately powered in ~9.5 days. Plan to run 14 days to cover the full weekly cycle. Expected outcome: if the true lift is 10%+, the test reaches significance in the planned window 83% of the time. Tool: VWO at $199/month for the SMB tier or Optimizely at $4k/month for enterprise.
Archetype 2: B2B SaaS signup-flow test (low traffic)
Baseline signup rate: 2.1%. Traffic: 800 sessions/day. Desired MDE: 15% relative lift. Required sample per variant: ~29,000 sessions. At 400/variant/day, that's 72 days to reach power, which is too long given SaaS release cycles. Solutions: (1) increase MDE to 25% (only big wins detectable; requires ~10,400 per variant, 26 days), (2) switch to Bayesian inference via Statsig (can call practical significance earlier with proper priors), (3) use CUPED variance reduction (Microsoft Research methodology) to cut the required sample ~30%. Most PLG SaaS teams run (2) + (3) combined.
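CUPED, mentioned in solution (3), adjusts each user's in-experiment metric with a pre-experiment covariate, shrinking variance without biasing the A/B comparison. A minimal sketch with synthetic data; the variable names and the synthetic pre-period/in-period spend are illustrative assumptions, and the ~30% saving cited above depends entirely on how correlated the covariate is with the outcome.

```python
# CUPED variance reduction: adjust the experiment metric with a pre-experiment
# covariate. Assumes per-user arrays are already available; data below is
# synthetic and illustrative. Variance shrinks by roughly corr(x, y)^2.
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """y: in-experiment metric per user, x: pre-experiment covariate per user."""
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Illustrative synthetic data: pre-period spend correlated with in-period spend.
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=10.0, size=10_000)      # pre-period spend
y = 0.6 * x + rng.normal(0, 8.0, size=10_000)          # in-period spend
y_adj = cuped_adjust(y, x)
print(f"variance reduction: {1 - y_adj.var(ddof=1) / y.var(ddof=1):.0%}")
```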
Archetype 3: Email subject-line test (Klaviyo, 50k list)
Baseline open rate: 32%. Klaviyo native A/B test on 20% of the list (10,000 recipients, 5,000 per variant). Desired MDE on open rate: 5 percentage points (37% vs 32%). With n=5,000 per variant, the MDE at 80% power and α = 0.05 is ~1.8 percentage points, so a 5-point MDE is over-powered and you will detect smaller effects reliably. Klaviyo automatically sends the winner to the remaining 80% after a 4-hour significance window. Cost: already bundled in Klaviyo's $150/month at that list size. This is the cleanest, cheapest A/B testing that most DTC brands never fully leverage.
Testing-stack reference, April 2026
Optimizely Web Personalization: $36k–$240k/year (enterprise, MVT + personalization)
VWO Growth plan: $199/month (SMB Bayesian + heatmaps)
VWO Enterprise: $2,000+/month (multivariate + advanced segmentation)
Statsig free tier: $0 (1M events/month, full features)
Statsig Pro: $150+/month (CUPED, sequential testing)
GrowthBook open source: $0 self-hosted (cloud from $300/month)
LaunchDarkly Starter: $10/user/month (feature flags + basic experimentation)
Adobe Target: $75k–$250k/year (legacy enterprise)
Evan Miller sample-size calc: free web tool (industry default planning tool)
Decision framework: which experimentation platform fits your org
Pick based on team composition and scale. Under 500k monthly visitors with 1 marketer and 0 engineers: VWO for SMB; the visual editor lets non-technical users ship tests without an engineering bottleneck. 500k-5M monthly visitors with 2-3 engineers: Statsig or GrowthBook; both expose a feature-flag API, and engineers will ship experiments as part of normal releases. Above 5M monthly visitors with a dedicated experimentation team: Optimizely, Statsig Enterprise, or build in-house on top of LaunchDarkly + Mixpanel. Skip the "free with GA4" path (GA4 experiments were sunset in 2023 alongside Optimize); you need a real platform with proper sequential testing and variance reduction if you want to run more than 5 tests/month without false-positive drift.