Check statistical significance for conversion tests: p-value, lift, and sample-size check.
Results
Variant A conv rate: 5.00%
Variant B conv rate: 6.25%
Lift: 25.0% (B wins)
p-value: 0.0153
✓ Statistically significant
Insight: Declare the winner with 98.5% confidence.
What p-value means
p < 0.05 is the industry standard for 'significant' (95% confidence). A p-value of 0.04 means that if there were no real difference between the variants, you'd see a gap this large only 4% of the time.
Common A/B test mistakes
Stopping early (when the test happens to look good). Low sample size. Multiple simultaneous tests interfering. Running different ad spend across variants.
Sample size rule of thumb
You need at least 1000 conversions per variant for small lifts (<10%) to be detectable. For 20%+ lifts, 300 conversions per variant is usually enough.
Frequently asked questions
1. How long should I run a test?
Minimum one full business cycle (usually 2 weeks). Longer if weekend behavior differs from weekday.
2. Why p<0.05 and not p<0.01?
0.05 is convention. 0.01 is stricter but requires much larger samples. For high-stakes decisions, use 0.01.
3. What if my test is inconclusive?
Run it longer, increase traffic, or design a bigger experiment. Don't ship 'inconclusive' wins; they revert.
4. Can I peek at results early?
No; it inflates false positives. Set your sample size upfront and wait. Or use sequential testing methods.
5. What's the difference between Bayesian and frequentist?
Frequentist (this tool): 'if there were no real difference, a result this extreme would show up only 4% of the time.' Bayesian: 'there's a 96% chance B is better.' Frequentist is more common; Bayesian is more intuitive.
A/B testing: the math is easy, the discipline is hard
The math behind statistical significance hasn't changed in 100 years: it's a Z-test on proportions for conversion-rate tests, or a t-test on continuous metrics like revenue. What breaks 80% of the A/B tests I audit isn't the math; it's the process around the math. Teams peek at results early. They stop tests when they "look good." They run too many tests simultaneously on overlapping audiences. They declare winners at n=200 conversions per variant because "the lift is huge." And then the "winner" reverts to baseline the following quarter and everyone wonders what happened.
This calculator gives you the frequentist p-value, judged against the industry-standard 95% confidence threshold, and tells you whether the test is conclusive. Below, I'll walk through the rules that turn a calculator result into a real business decision, and the traps that turn significance into self-deception.
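For the curious, here is a minimal sketch of the two-proportion Z-test a calculator like this runs under the hood. The 4,000-visitors-per-variant count is an illustrative assumption (not taken from the tool), chosen so the conversion rates match the 5.00% vs 6.25% example above; with it, the sketch reproduces a p-value close to the 0.0153 shown.

```python
# Two-proportion Z-test: a minimal sketch of the math behind the calculator.
# The 4,000-visitors-per-variant figure is an illustrative assumption; it makes
# the conversion rates match the 5.00% vs 6.25% example above.
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (relative lift, z, two-sided p-value) for a conversion A/B test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided
    return (p_b - p_a) / p_a, z, p_value

lift, z, p = two_proportion_z_test(conv_a=200, n_a=4_000, conv_b=250, n_b=4_000)
print(f"lift={lift:.1%}  z={z:.2f}  p={p:.4f}")         # ~25% lift, p ≈ 0.015
```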
Sample size planning (do this before you launch)
The #1 reason tests don't reach significance is under-powered sample sizes. Use a power-analysis calculator (Evan Miller's sample-size calculator is the industry standard) and plug in your baseline conversion rate, minimum detectable effect (MDE), and statistical power (usually 80%). A typical B2C ecommerce test with a 3% baseline conversion rate and a 10% MDE requires ~14,000 visitors per variant. At 8,000 daily visits split 50/50, that's ~3.5 days to reach adequate power, but most teams need 14–21 days to cover a full business cycle regardless. A minimal planning sketch follows the reference figures below.
Baseline 2%, MDE 20%: ~4,300 per variant (small-traffic friendly)
Baseline 3%, MDE 10%: ~14,100 per variant (tight tests get big)
Baseline 5%, MDE 5%: ~31,500 per variant (most realistic on-site)
Baseline 10%, MDE 5%: ~15,000 per variant (CTA-button tests)
Power target: 80% (standard; 90% for critical tests)
Significance threshold: p < 0.05 (0.01 for high-stakes)
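The sketch below implements the standard two-proportion sample-size formula (two-sided α, 80% power by default). Planning tools differ in their assumptions (one- vs two-sided tests, pooled vs unpooled variance, continuity corrections), so its output will not match every rounded figure above exactly; treat it as an order-of-magnitude planner.

```python
# Per-variant sample size for a two-proportion conversion test, using the
# standard normal-approximation formula. Planning tools differ in details
# (one- vs two-sided, pooled variance), so treat results as order-of-magnitude.
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline: float, mde_rel: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed per variant to detect a relative lift of mde_rel."""
    p1, p2 = baseline, baseline * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_power = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p2 - p1) ** 2
    return ceil(n)

# Example: 5% baseline conversion, 10% relative MDE, default alpha/power.
print(sample_size_per_variant(baseline=0.05, mde_rel=0.10), "visitors per variant")
```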
The 10 rules of honest A/B testing
1. Decide sample size and test duration BEFORE launching. Document the commitment. This prevents early-stopping temptation.
2. Run for a full business cycle. Minimum 7 full days. For B2B, usually 14–21 days to capture weekday-vs-weekend behavior. Longer if your cycle is monthly (subscription renewals, salary-pay cadence).
3. Don't peek. Checking early and stopping when "significance is reached" inflates your false-positive rate from 5% to as high as 30%. Academic papers have been written on this (Johari et al., 2017).
4. One variable at a time. If you change the headline AND the hero image AND the CTA color, a win tells you nothing about which change worked.
5. Equal traffic allocation. Don't run 70/30 unless you have a specific reason (e.g., exposure-limited test). 50/50 maximizes statistical power.
6. Watch for interaction effects. If you run 5 tests simultaneously on the same page, they can interact. Use a proper experimentation platform (Optimizely, VWO, Statsig, GrowthBook) that handles this.
7. Segment post-hoc only to generate hypotheses, not to declare winners. "It won for mobile but lost for desktop" could be real or could be p-hacking. Re-run the test segmented if the segment effect matters.
8. Declare ties. If p > 0.20 after adequate sample, the variants are truly equivalent. Ship the cheaper/simpler one.
9. Check for sample ratio mismatch. If variant A got 50,000 and variant B got 48,000 visitors, your randomization is broken. Throw out the test (a minimal check is sketched after this list).
10. Replicate critical winners. Before committing engineering resources to permanent implementation, re-run the winning variant against control for another cycle. "Regression to the mean" is real.
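Rule 9's sample-ratio-mismatch check is a one-line chi-square goodness-of-fit test. A minimal sketch, using the 50,000 vs 48,000 split from the rule as input; for one degree of freedom, the chi-square p-value can be read off the normal distribution.

```python
# Sample ratio mismatch (SRM) check: chi-square goodness-of-fit against the
# planned 50/50 split. A tiny p-value means randomization is broken.
from math import sqrt
from statistics import NormalDist

def srm_check(visitors_a: int, visitors_b: int, expected_ratio: float = 0.5):
    total = visitors_a + visitors_b
    expected_a = total * expected_ratio
    expected_b = total * (1 - expected_ratio)
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # 1 degree of freedom: chi2 is the square of a standard normal variate
    p_value = 2 * (1 - NormalDist().cdf(sqrt(chi2)))
    return chi2, p_value

chi2, p = srm_check(50_000, 48_000)
print(f"chi2={chi2:.1f}  p={p:.2e}")   # p far below 0.001 -> throw out the test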
Minimum detectable effect: the number that decides everything
MDE is the smallest lift you can reliably detect given your traffic. If you have 5,000 visits per variant per month and a 2% baseline, you can only detect 30%+ lifts with reasonable power. Most "real" lifts from landing-page tweaks are 5–15%, so that test is permanently under-powered. Two responses: (1) only test bold changes that plausibly move the needle 30%+ (different hero concepts, different offer structure), or (2) aggregate smaller pages into one test by running a site-wide change.
Most small-business owners I consult with are testing at traffic levels where only massive wins are detectable. They burn 12 weeks on a test that looks inconclusive and conclude "A/B testing doesn't work for us." The real diagnosis: their traffic is too low for the effect sizes they're trying to measure. Either test bigger changes or test less often with longer cycles.
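To make the MDE math concrete, here is a short sketch of the approximation spelled out in the FAQ further down (the 2.8 factor is the two-sided z for α = 0.05 plus the z for 80% power). Plugging in the low-traffic case from this section, a 2% baseline with 5,000 visits per variant, returns a relative MDE of roughly 39%, consistent with the "30%+ lifts only" point above.

```python
# Relative MDE for a given per-variant sample size, using the rule-of-thumb
# approximation MDE ≈ 2.8 * sqrt(2 * p * (1 - p) / n) / p
# (2.8 ≈ 1.96 for two-sided alpha = 0.05 plus 0.84 for 80% power).
from math import sqrt

def relative_mde(baseline: float, n_per_variant: int) -> float:
    return 2.8 * sqrt(2 * baseline * (1 - baseline) / n_per_variant) / baseline

# Low-traffic example from this section: 2% baseline, 5,000 visits per variant.
print(f"{relative_mde(0.02, 5_000):.0%}")   # ~39% relative lift needed
```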
Frequentist vs. Bayesian: which to use
This tool is frequentist (p-value based, the classical approach). Bayesian testing, used natively by VWO, Statsig, and GrowthBook, answers a different question: "what's the probability that B is actually better than A?" rather than "if there were no real difference, how likely is this much variation?"
Both are valid. Frequentist is standard in academic and regulated settings. Bayesian is more intuitive for business stakeholders ("B has a 94% chance of being better") and handles early stopping more gracefully. If you're running a modern experimentation platform, Bayesian with proper priors is usually the better default. For hand-calculated tests (this tool), frequentist is simpler and well-understood.
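For comparison, a minimal Bayesian sketch: with uniform Beta(1, 1) priors, the posterior for each conversion rate is a Beta distribution, and P(B > A) can be estimated by Monte Carlo. The 200/4,000 and 250/4,000 counts are the same illustrative assumption used in the frequentist sketch earlier, not output from any of the platforms named above.

```python
# Bayesian A/B comparison: Beta-Binomial posteriors with uniform priors,
# P(B > A) estimated by Monte Carlo. Counts are illustrative (same as the
# frequentist sketch above), not from any particular platform.
import random

def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   draws: int = 100_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate: Beta(conversions + 1, non-conversions + 1)
        rate_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        rate_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += rate_b > rate_a
    return wins / draws

print(f"P(B > A) = {prob_b_beats_a(200, 4_000, 250, 4_000):.1%}")  # ≈ 99%
```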
Tests have diminishing returns. The first few tests you run on a new landing page or email template will produce big wins; that's typically where the page has glaring problems. By test 10–15, you're into the 3–8% lift range. By test 30+, you're often running tests that cost more in engineering and ops time than the lift recovers.
Prioritize tests by ICE score (Impact, Confidence, Ease): high-traffic pages with clear hypotheses backed by session-replay data or user interviews score highest. Avoid low-traffic pages, tests driven by "let's just try X," and aesthetic tweaks (button color, font size) that rarely produce meaningful lifts.
Frequently asked questions
Q1. Why did my 'winning' test revert to baseline after shipping?
Most likely: you stopped early (before reaching planned sample size), you ran the test during an unusual traffic period (holiday, one-time promo), or you fell for regression to the mean on a lucky variant. Re-run the test for a full cycle before shipping permanent changes.
Q2. What sample size do I need?
Depends on baseline conversion rate and the lift you want to detect. 3% baseline + 10% detectable lift requires ~14,000 visitors per variant at 80% power. Use Evan Miller's sample-size calculator before launching to plan properly.
Q3. Can I run multiple A/B tests at once?
Yes, on different pages or audiences. Simultaneous tests on the same page can interact; use a proper experimentation platform if you need to overlap them. Tests on independent pages with independent audiences can run in parallel without issue.
Q4. What p-value should I use?
0.05 (95% confidence) is the industry default. For high-stakes decisions (pricing, homepage redesigns, major policy changes), use 0.01 (99% confidence) to cut false-positive rate. For low-stakes (button color on a low-traffic page), 0.10 is sometimes acceptable.
Q5. What if my test is inconclusive?
Either run longer (increase sample size) or accept the null. An inconclusive result is valid data โ it means the two variants are effectively equivalent given your traffic. Ship the cheaper or simpler option and move on.
Q6. How does iOS ATT affect A/B testing?
For on-site tests (landing pages, funnel), not at all; cookies are first-party. For ad-platform tests (Meta A/B testing across campaigns), ATT reduces attribution accuracy and inflates variance. Use geo-holdout or lift-study methodology instead of platform-reported conversions.
Q7. What testing tools should I use in 2026?
Experimentation platforms: Optimizely at $36k-$240k/year enterprise (legacy king), VWO at $199-$2,000+/month (strong Bayesian), Statsig at $150-$2,500+/month plus a free tier (developer-first, CUPED variance reduction built in), GrowthBook from $0 open-source or $300-$2,500/month cloud, LaunchDarkly at $10-$20/user/month for feature flags + limited A/B. For survey-based pre-test qualitative research: Hotjar at $80-$350/month, Maze at $99-$399/month. Skip Google Optimize; it was sunset in September 2023.
Q8. Explain MDE selection mathematically
MDE (minimum detectable effect) = the smallest true effect size you can detect with acceptable statistical power at your sample size. For a conversion test with baseline p, 80% power (1 − β), significance α = 0.05, and sample size n per variant, MDE ≈ 2.8 × sqrt(2p(1−p)/n) / p (as a relative lift). Lower MDE requires larger n quadratically: cutting MDE from 10% to 5% requires 4x the sample. Pick MDE at the smallest lift that would change your product decision, not at 'what I'd like to find.'
Q9. Why does peeking bias inflate false positives?
Frequentist p-values assume you look once, at the planned sample size. Each additional peek is another chance the random walk dips below p=0.05. Peeking 5 times inflates your true false-positive rate from 5% to ~14%; peeking daily for 30 days takes it to 30%+ (Johari et al., 2017). If you must peek, use sequential-testing methods (e.g., SPRT, group-sequential boundaries) or switch to Bayesian, which is less sensitive to peeking. A short A/A simulation of this effect appears after this FAQ.
Q10. How do I interpret p > 0.05?
It does NOT mean 'no effect'; it means 'the data is consistent with no effect given the variance you observed.' If you ran an adequately powered test and still got p > 0.05, accept the null (variants are equivalent for practical purposes). If you ran an under-powered test, p > 0.05 is ambiguous: you might have a real effect you couldn't detect. Check your post-hoc power to distinguish.
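The inflation from peeking (Q9) is easy to demonstrate with an A/A simulation: both arms share the same true conversion rate, a z-test is checked after every batch of traffic, and any peek that crosses p < 0.05 counts as a (false) win. The batch size, peek count, and simulation count below are arbitrary illustrative choices; the exact inflation depends on cadence and traffic.

```python
# A/A simulation of peeking: both variants share the same true conversion rate,
# yet checking a z-test after every traffic batch pushes the false-positive
# rate well above the nominal 5%. Batch/peek/sim counts are illustrative.
from math import sqrt
from statistics import NormalDist
import numpy as np

def peeking_false_positive_rate(true_rate: float = 0.05, batch: int = 1_000,
                                peeks: int = 10, sims: int = 2_000,
                                alpha: float = 0.05, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    norm = NormalDist()
    false_positives = 0
    for _ in range(sims):
        conv_a = conv_b = n = 0
        for _ in range(peeks):
            conv_a += rng.binomial(batch, true_rate)
            conv_b += rng.binomial(batch, true_rate)
            n += batch
            pooled = (conv_a + conv_b) / (2 * n)
            if pooled == 0 or pooled == 1:
                continue
            se = sqrt(pooled * (1 - pooled) * 2 / n)
            z = (conv_b - conv_a) / n / se
            if 2 * (1 - norm.cdf(abs(z))) < alpha:   # "significant" at this peek
                false_positives += 1
                break   # a peeker stops and ships as soon as it looks good
    return false_positives / sims

print(f"False-positive rate with 10 peeks: {peeking_false_positive_rate():.0%}")
```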
Three A/B test archetypes with full sample-size and MDE math
Archetype 1: DTC landing-page headline test (high traffic)
Baseline add-to-cart rate: 4.2%. Desired MDE: 8% relative lift (practical threshold; anything smaller doesn't change the roadmap). Power 80%, significance 0.05. Required sample per variant: ~28,400 sessions. At 6,000 daily paid sessions split 50/50 = 3,000/variant/day, adequately powered in ~9.5 days. Plan to run 14 days to cover the full weekly cycle. Expected outcome: if the true lift is 10%+, the test reaches significance in the planned window 83% of the time. Tool: VWO at $199/month for the SMB tier or Optimizely at $4k/month for enterprise.
Archetype 2: B2B SaaS signup-flow test (low traffic)
Baseline signup rate: 2.1%. Traffic: 800 sessions/day. Desired MDE: 15% relative lift. Required sample per variant: ~29,000 sessions. At 400/variant/day, that's 72 days to reach power, which is too long given SaaS release cycles. Solutions: (1) increase MDE to 25% (only big wins detectable; requires ~10,400 per variant, 26 days), (2) switch to Bayesian inference via Statsig (can call practical significance earlier with proper priors), (3) use CUPED variance reduction (Microsoft Research methodology) to cut the required sample ~30%. Most PLG SaaS teams run (2) + (3) combined.
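CUPED, mentioned in solution (3), adjusts each user's in-experiment metric with a pre-experiment covariate, shrinking variance without biasing the A/B comparison. A minimal sketch with synthetic data; the variable names and the synthetic pre-period/in-period spend are illustrative assumptions, and the ~30% saving cited above depends entirely on how correlated the covariate is with the outcome.

```python
# CUPED variance reduction: adjust the experiment metric with a pre-experiment
# covariate. Assumes per-user arrays are already available; data below is
# synthetic and illustrative. Variance shrinks by roughly corr(x, y)^2.
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """y: in-experiment metric per user, x: pre-experiment covariate per user."""
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Illustrative synthetic data: pre-period spend correlated with in-period spend.
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=10.0, size=10_000)      # pre-period spend
y = 0.6 * x + rng.normal(0, 8.0, size=10_000)          # in-period spend
y_adj = cuped_adjust(y, x)
print(f"variance reduction: {1 - y_adj.var(ddof=1) / y.var(ddof=1):.0%}")
```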
Archetype 3: Email subject-line test (Klaviyo, 50k list)
Baseline open rate: 32%. Klaviyo native A/B test on 20% of the list (10,000 recipients, 5,000 per variant). Desired MDE on open rate: 5 percentage points (37% vs 32%). With n=5,000 per variant, the MDE at 80% power and α = 0.05 is ~1.8 percentage points, so a 5-point MDE is over-powered and you will detect smaller effects reliably. Klaviyo automatically sends the winner to the remaining 80% after a 4-hour significance window. Cost: already bundled in Klaviyo's $150/month at that list size. This is the cleanest, cheapest A/B testing that most DTC brands never fully leverage.
Testing-stack reference, April 2026
Optimizely Web Personalization: $36k–$240k/year (enterprise, MVT + personalization)
VWO Growth plan: $199/month (SMB Bayesian + heatmaps)
VWO Enterprise: $2,000+/month (multivariate + advanced segmentation)
Statsig free tier: $0 (1M events/month, full features)
Statsig Pro: $150+/month (CUPED, sequential testing)
GrowthBook open source: $0 self-hosted (cloud from $300/month)
LaunchDarkly Starter: $10/user/month (feature flags + basic experimentation)
Adobe Target: $75k–$250k/year (legacy enterprise)
Evan Miller sample-size calc: free web tool (industry default planning tool)
Decision framework: which experimentation platform fits your org
Pick based on team composition and scale. Under 500k monthly visitors with 1 marketer and 0 engineers: VWO for SMB; the visual editor lets non-technical users ship tests without an engineering bottleneck. 500k-5M monthly visitors with 2-3 engineers: Statsig or GrowthBook; both expose a feature-flag API, and engineers will ship experiments as part of normal releases. Above 5M monthly visitors with a dedicated experimentation team: Optimizely, Statsig Enterprise, or build in-house on top of LaunchDarkly + Mixpanel. Skip the "free with GA4" path (GA4 experiments were sunset in 2023 alongside Optimize); you need a real platform with proper sequential testing and variance reduction if you want to run more than 5 tests/month without false-positive drift.