A/B Testing in Web Analytics
A/B testing (split testing) compares two or more versions of a page, UI element, or marketing asset to find the variant that performs best. Instead of deciding by opinion, A/B tests give statistically reliable evidence about which changes actually move business metrics.
How A/B Testing Works
A/B testing is a controlled experiment. Traffic is randomly split between a control (A) and one or more variants (B, C, D). Each visitor sees one version. The analytics system records their behavior and conversions.
Process components:
- Randomization: visitors are split between variants at random
- Variable isolation: only one element changes per experiment
- Statistical analysis: results are evaluated for reliability
- Experience consistency: a returning visitor always sees the same variant
Traffic distribution example
An e-commerce store tests a new "Add to Cart" button:
- Variant A (control): blue button, "Add to Cart", 50% of traffic
- Variant B (test): green button, "Buy Now", 50% of traffic
After 10,000 visits (5,000 per variant):
- Variant A: 250 conversions (5% conversion rate)
- Variant B: 325 conversions (6.5% conversion rate)
Improvement: +30% relative (+1.5 percentage points), statistically significant at the 95% confidence level
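The significance claim in this example can be checked with a two-proportion z-test. The sketch below assumes the 10,000 visits were split evenly (5,000 per variant) and uses a pooled standard error; the function names are illustrative, not part of any analytics library. Under these assumptions z comes out at about 3.2 and the two-sided p-value at about 0.001.

```javascript
// Standard normal CDF via the Abramowitz-Stegun 7.1.26 erf approximation
// (absolute error around 1.5e-7, which is enough for an illustration).
function normalCdf(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 +
    t * (-0.284496736 +
    t * (1.421413741 +
    t * (-1.453152027 +
    t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Two-sided z-test for the difference between two conversion rates.
function twoProportionZTest(convA, visitsA, convB, visitsB) {
  const rateA = convA / visitsA;
  const rateB = convB / visitsB;
  const pooled = (convA + convB) / (visitsA + visitsB);
  const stdErr = Math.sqrt(pooled * (1 - pooled) * (1 / visitsA + 1 / visitsB));
  const z = (rateB - rateA) / stdErr;
  const pValue = 2 * (1 - normalCdf(Math.abs(z)));
  return { rateA, rateB, relativeLift: (rateB - rateA) / rateA, z, pValue };
}

// Numbers from the "Add to Cart" example, assuming 5,000 visits per variant.
const result = twoProportionZTest(250, 5000, 325, 5000);
console.log(result);
```

A p-value below 0.05 corresponds to the 95% significance threshold used in the example.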
Statistical Significance and Sample Size
Calculating Required Sample Size
Sample size depends on several factors:
| Parameter | Description | Typical Value |
|---|---|---|
| Baseline conversion | Current conversion rate | 2-5% for e-commerce |
| Minimum Detectable Effect (MDE) | Smallest improvement worth detecting | 10-20% relative change |
| Statistical power | Probability of detecting a real effect | 80% (standard) |
| Significance level | Acceptable false-positive rate | 5% (reject when p-value < 0.05) |
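As a rough illustration of how these parameters combine, the sketch below uses the common normal-approximation formula for comparing two proportions. The z-scores are hard-coded for the standard 5% two-sided significance level and 80% power from the table; the function name and the example inputs (5% baseline, 20% relative MDE) are assumptions made for the example.

```javascript
// z-scores for the standard settings in the table (hard-coded assumptions):
const Z_ALPHA_TWO_SIDED_05 = 1.96; // significance level 5%, two-sided
const Z_POWER_80 = 0.8416;         // statistical power 80%

// Approximate visitors needed per variant to detect a relative lift
// of `relativeMde` over `baselineRate`.
function sampleSizePerVariant(baselineRate, relativeMde) {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeMde);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  const zSum = Z_ALPHA_TWO_SIDED_05 + Z_POWER_80;
  return Math.ceil((zSum * zSum * variance) / ((p2 - p1) ** 2));
}

// Example: 5% baseline conversion, 20% relative MDE (5% -> 6%).
const perVariant = sampleSizePerVariant(0.05, 0.20);
console.log(perVariant);
```

With these inputs the estimate lands on the order of 8,000 visitors per variant. Halving the MDE roughly quadruples the requirement, which is why small effects need long tests.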
Test Duration
Minimum test duration is set by sample size and by business cycles:
- Weekly cycles: cover at least one full week
- Seasonality: account for holidays, weekends, month boundaries
- Statistical validity: don't stop at the first sign of significance
Peeking problem
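Repeatedly checking results during a test and stopping at the first significant reading inflates the false-positive rate. If 20 checks were independent, the chance of at least one false positive would rise from 5% to about 64% (1 − 0.95^20). Successive looks at accumulating data are correlated, so the real inflation is smaller, but it remains well above the nominal 5%. Set test duration in advance and read results only after completion.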
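The effect of peeking can be checked with a small A/A simulation, where both arms share the same true conversion rate and any "significant" result is a false positive by construction. This is a sketch under stated assumptions: a seeded PRNG (mulberry32) for reproducibility, a 10% true rate, 20 evenly spaced looks, and a pooled z-test at |z| > 1.96. All names and parameters are illustrative.

```javascript
// Small seeded PRNG (mulberry32) so the simulation is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Pooled two-proportion z statistic for equal-sized arms of n users each.
function zStat(convA, convB, n) {
  const pooled = (convA + convB) / (2 * n);
  const stdErr = Math.sqrt(pooled * (1 - pooled) * (2 / n));
  return stdErr === 0 ? 0 : (convB / n - convA / n) / stdErr;
}

// Runs many A/A tests and counts how often each reading strategy
// would have declared a (false) winner.
function simulateAATests({ tests, looks, usersPerLook, rate, seed }) {
  const rand = mulberry32(seed);
  let flaggedWhenPeeking = 0;
  let flaggedAtEnd = 0;
  for (let i = 0; i < tests; i++) {
    let convA = 0;
    let convB = 0;
    let n = 0;
    let everSignificant = false;
    for (let look = 0; look < looks; look++) {
      for (let u = 0; u < usersPerLook; u++) {
        if (rand() < rate) convA++;
        if (rand() < rate) convB++;
      }
      n += usersPerLook;
      const significant = Math.abs(zStat(convA, convB, n)) > 1.96;
      if (significant) everSignificant = true;
      if (look === looks - 1 && significant) flaggedAtEnd++;
    }
    if (everSignificant) flaggedWhenPeeking++;
  }
  return { peeking: flaggedWhenPeeking / tests, singleLook: flaggedAtEnd / tests };
}

const rates = simulateAATests({ tests: 1000, looks: 20, usersPerLook: 100, rate: 0.1, seed: 42 });
console.log(rates);
```

The `singleLook` rate should stay near the nominal 5%, while the `peeking` rate (stopping at the first significant look) should come out several times higher.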
Types of A/B Tests
Classic A/B Test
Two versions, one element changed:
- Page headline
- Button color or text
- Product image
- Price or discount
A/B/n Testing
Several variants tested at once:
- Needs more traffic to reach significance
- Finds the best of several variants faster than running separate A/B tests one after another
- Higher false-discovery risk from multiple comparisons
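To make the multiple-comparisons risk concrete, the sketch below computes the chance of at least one false positive across several variant-vs-control comparisons (treating them as independent, which is a simplifying assumption) and the Bonferroni-adjusted threshold that compensates for it. Bonferroni is one simple, conservative correction among several; the function names are illustrative.

```javascript
// Probability of at least one false positive across `comparisons`
// independent tests, each run at significance level `alpha`.
function familywiseErrorRate(alpha, comparisons) {
  return 1 - Math.pow(1 - alpha, comparisons);
}

// Bonferroni correction: per-comparison threshold that keeps the
// familywise error rate at or below `alpha`.
function bonferroniAlpha(alpha, comparisons) {
  return alpha / comparisons;
}

const comparisons = 3; // e.g. variants B, C, D each compared against control A
console.log(familywiseErrorRate(0.05, comparisons)); // about 0.143
console.log(bonferroniAlpha(0.05, comparisons));     // about 0.0167
```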
Multivariate Testing (MVT)
Combinations of multiple elements at once:
- Headline (2 variants) × Image (3 variants) × Button (2 variants) = 12 combinations
- Needs significant traffic
- Reveals interactions between elements
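The combination count grows multiplicatively, which is what drives the traffic requirement. A minimal sketch that enumerates every test cell from a set of factors; the factor and option names are placeholders:

```javascript
// Builds every combination (cartesian product) of the given factors.
function combinations(factors) {
  return Object.entries(factors).reduce(
    (acc, [name, options]) =>
      acc.flatMap((combo) => options.map((option) => ({ ...combo, [name]: option }))),
    [{}]
  );
}

const cells = combinations({
  headline: ["H1", "H2"],
  image: ["img1", "img2", "img3"],
  button: ["B1", "B2"],
});
console.log(cells.length); // 12
```

Each of the 12 cells needs enough traffic on its own, so the per-cell sample size requirement is multiplied by the number of combinations.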
Split URL Testing
Comparing radically different page versions on different URLs:
- Full page redesigns
- Different content structure
- Alternative user flows
Choosing test type
- Low traffic (< 1,000 conversions/month): classic A/B with large changes
- Medium traffic (1,000-10,000 conversions/month): A/B/n with 3-4 variants
- High traffic (> 10,000 conversions/month): MVT for combination optimization
Metrics to Track
Primary Metrics
Core metrics tied to business goals:
- Conversion rate: share of visitors completing the target action
- Average Order Value (AOV): average purchase amount
- Revenue Per Visitor (RPV): total revenue / number of visitors
- Lead count: forms, subscriptions, registrations completed
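The first three metrics are related: RPV equals conversion rate times AOV. The sketch below computes them for one variant from a visitor count and a list of orders. The input shape (`visitors`, `orders[].amount`) is an assumption for the example, and it treats each order as one converting visitor.

```javascript
// Computes conversion rate, AOV, and RPV for a single variant.
// Assumes at most one order per visitor.
function primaryMetrics({ visitors, orders }) {
  const revenue = orders.reduce((sum, order) => sum + order.amount, 0);
  return {
    conversionRate: orders.length / visitors,
    averageOrderValue: orders.length ? revenue / orders.length : 0,
    revenuePerVisitor: revenue / visitors,
  };
}

const metrics = primaryMetrics({
  visitors: 1000,
  orders: [{ amount: 40 }, { amount: 60 }, { amount: 50 }],
});
console.log(metrics);
```

Tracking RPV alongside conversion rate matters because a variant can raise conversions while lowering order value, or the reverse.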
Secondary Metrics
Diagnostic metrics for understanding the change:
- Bounce rate: whether landing-page interest shifted
- Time on page: engagement impact
- Scroll depth: content consumption
- Element clicks: micro-conversions and interactions
Guardrail Metrics
Metrics that must not deteriorate:
- Page load speed: a redesign shouldn't slow the site
- Return rate: more conversions shouldn't mean more refunds
- Newsletter unsubscribes: aggressive CTAs shouldn't backfire
OEC metric system
The Overall Evaluation Criterion (OEC) is a composite metric that combines several metrics into a single score for the full picture.
Coefficient weights are set from business priorities.
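One simple way to build such a composite is a weighted sum of relative lifts. The sketch below is an assumption about form, not a prescribed formula: the metric names and weights are hypothetical and would need to reflect actual business priorities.

```javascript
// Hypothetical weights; they should sum to 1 and reflect business priorities.
const weights = { conversionLift: 0.5, aovLift: 0.3, rpvLift: 0.2 };

// Weighted sum of relative lifts (variant vs. control) for each metric.
// Metrics missing from `lifts` contribute zero.
function oec(lifts, metricWeights) {
  return Object.keys(metricWeights).reduce((score, key) => {
    const lift = key in lifts ? lifts[key] : 0;
    return score + metricWeights[key] * lift;
  }, 0);
}

// Example: +10% conversion, -2% AOV, +5% RPV.
const score = oec({ conversionLift: 0.10, aovLift: -0.02, rpvLift: 0.05 }, weights);
console.log(score);
```

A positive score suggests the variant is a net improvement under the chosen weights; the weights themselves deserve as much scrutiny as the test results.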
Common Mistakes
Technical Errors
Flicker effect:
Users briefly see the original version before it's replaced with the test version. This distorts results and degrades UX.
Variant leakage:
One user sees different variants on repeat visits because of cookie clearing or cross-device use.
Unbalanced distribution:
Uneven split between variants caused by code bugs or caching.
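An uneven split can be detected with a chi-square goodness-of-fit test on the visitor counts, often called a sample ratio mismatch check. The sketch below assumes two variants and uses a strict threshold (p < 0.001, critical value 10.828 for one degree of freedom) so that ordinary noise is not flagged; the function name and threshold choice are assumptions for the example.

```javascript
// Chi-square critical value for 1 degree of freedom at p = 0.001.
const CHI2_CRITICAL_1DF_P001 = 10.828;

// Compares observed visitor counts against the intended split.
function checkSplit(countA, countB, expectedShareA = 0.5) {
  const total = countA + countB;
  const expectedA = total * expectedShareA;
  const expectedB = total - expectedA;
  const chi2 =
    (countA - expectedA) ** 2 / expectedA +
    (countB - expectedB) ** 2 / expectedB;
  return { chi2, mismatch: chi2 > CHI2_CRITICAL_1DF_P001 };
}

console.log(checkSplit(5050, 4950)); // chi2 = 1, within normal noise
console.log(checkSplit(5200, 4800)); // chi2 = 16, flagged as a mismatch
```

When a mismatch is flagged, the test results should not be trusted until the cause (redirect bugs, caching, bot filtering applied to one arm) is found.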
Methodological Errors
Testing without a hypothesis:
"Let's try a red button" is not a hypothesis. Better: "A red button will increase conversion by 15% because it creates urgency."
Stopping early on significance:
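Early results are noisy, and a variant that crosses the significance threshold early often does so by chance. Such early positive results tend to shrink as the test continues (regression to the mean).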
Ignoring segmentation:
Overall result may be flat while the new version wins on mobile and loses on desktop.
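Segment reversal example
Overall test result:
- Variant A: 1,000 conversions from 10,000 visits (10%)
- Variant B: 950 conversions from 10,000 visits (9.5%)
Variant A looks better. With segmentation:
New users:
- Variant A: 200 from 3,000 (6.7%)
- Variant B: 250 from 3,000 (8.3%)
Returning users:
- Variant A: 800 from 7,000 (11.4%)
- Variant B: 700 from 7,000 (10%)
Variant B wins for new users, the key growth segment, even though Variant A wins overall and among returning users. Strictly speaking this is a heterogeneous effect rather than Simpson's paradox: a true Simpson's reversal, where one variant wins in every segment yet loses overall, requires the segment mix to differ between variants, which should not happen in a properly randomized test and is worth investigating if it does.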
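The breakdown above can be reproduced by aggregating the same rows at two levels. A minimal sketch using the numbers from the example; the row shape and helper name are illustrative:

```javascript
const results = [
  { segment: "new", variant: "A", conversions: 200, visits: 3000 },
  { segment: "new", variant: "B", conversions: 250, visits: 3000 },
  { segment: "returning", variant: "A", conversions: 800, visits: 7000 },
  { segment: "returning", variant: "B", conversions: 700, visits: 7000 },
];

// Sums conversions and visits per group key, then returns conversion rates.
function rateBy(rows, keyFn) {
  const totals = {};
  for (const row of rows) {
    const key = keyFn(row);
    if (!totals[key]) totals[key] = { conversions: 0, visits: 0 };
    totals[key].conversions += row.conversions;
    totals[key].visits += row.visits;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([key, t]) => [key, t.conversions / t.visits])
  );
}

const overall = rateBy(results, (r) => r.variant);
const bySegment = rateBy(results, (r) => `${r.segment}/${r.variant}`);
console.log(overall);
console.log(bySegment);
```

Segments should be chosen before the test starts; slicing results many ways after the fact reintroduces the multiple-comparisons problem.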
Alternative Approaches
Bayesian Approach
Compared to the frequentist approach with p-values, Bayesian testing:
- Reports the probability that B is better than A
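- Tolerates flexible stopping better than fixed-horizon p-values, though a stopping rule should still be set in advance because stopping very early can still produce misleading winners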
- Incorporates prior knowledge about the metric
- Provides interval estimates of the expected improvement
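As an illustration of the first point, the sketch below estimates the probability that B is better than A. It assumes a uniform Beta(1, 1) prior on each conversion rate and approximates the two Beta posteriors with normal distributions, which is reasonable at sample sizes like these but is a shortcut rather than exact Bayesian computation. Function names are illustrative.

```javascript
// Standard normal CDF (Abramowitz-Stegun 7.1.26 erf approximation).
function normalCdf(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 +
    t * (-0.284496736 +
    t * (1.421413741 +
    t * (-1.453152027 +
    t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Mean and variance of the Beta posterior for a conversion rate.
function betaPosterior(conversions, visits, priorAlpha = 1, priorBeta = 1) {
  const alpha = priorAlpha + conversions;
  const beta = priorBeta + visits - conversions;
  const total = alpha + beta;
  return {
    mean: alpha / total,
    variance: (alpha * beta) / (total * total * (total + 1)),
  };
}

// P(rate_B > rate_A), using a normal approximation to both posteriors.
function probabilityBBeatsA(a, b) {
  const postA = betaPosterior(a.conversions, a.visits);
  const postB = betaPosterior(b.conversions, b.visits);
  const z = (postB.mean - postA.mean) / Math.sqrt(postA.variance + postB.variance);
  return normalCdf(z);
}

const probB = probabilityBBeatsA(
  { conversions: 250, visits: 5000 },
  { conversions: 325, visits: 5000 }
);
console.log(probB);
```

For the "Add to Cart" numbers (assuming 5,000 visits per variant) the probability comes out above 99%. For small samples, sampling from the Beta posteriors directly would be more accurate than this approximation.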
Multi-Armed Bandits
Algorithms that dynamically shift traffic toward winning variants:
- Epsilon-greedy: random variant with probability ε, otherwise the best
- Thompson Sampling: variants chosen proportionally to their probability of being best
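- Upper Confidence Bound: picks the variant with the highest optimistic estimate (observed rate plus an uncertainty bonus), so under-explored variants still receive traffic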
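Of the three, epsilon-greedy is the simplest to sketch. The version below assumes per-variant counters of visits and conversions and takes the random source as a parameter so the behaviour can be tested deterministically; the data shape and function names are illustrative.

```javascript
// Observed conversion rate, treating an unseen variant as 0.
function observedRate(stat) {
  return stat.visits > 0 ? stat.conversions / stat.visits : 0;
}

// Epsilon-greedy: with probability epsilon pick a random variant (explore),
// otherwise pick the variant with the best observed rate (exploit).
function chooseVariant(stats, epsilon, rng = Math.random) {
  const names = Object.keys(stats);
  if (rng() < epsilon) {
    return names[Math.floor(rng() * names.length)];
  }
  return names.reduce((best, name) =>
    observedRate(stats[name]) > observedRate(stats[best]) ? name : best
  );
}

const stats = {
  A: { visits: 1000, conversions: 50 },
  B: { visits: 1000, conversions: 65 },
  C: { visits: 1000, conversions: 40 },
};
console.log(chooseVariant(stats, 0.1));
```

With `epsilon = 0.1`, roughly 90% of traffic goes to the current leader and 10% is spread at random across all variants, which keeps estimates for the trailing variants from going stale.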
Advantages:
- Minimizes losses from exposing users to losing variants
- Self-optimizing without manual intervention
Disadvantages:
- Harder to interpret
- Poorly suited to tests with delayed conversions, because traffic is reallocated before outcomes are known
- More complex to implement
A/B Testing Infrastructure
Analytics System Requirements
A working A/B testing platform should provide:
Technical infrastructure:
- Random user assignment via hash functions
- Variant persistence through cookies or user ID
- Minimal latency on variant decision (< 50ms)
- Bot and automated traffic protection
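Hash-based assignment and variant persistence can be combined: hashing a stable user ID together with the experiment ID gives the same variant on every visit without storing anything. A minimal sketch, assuming a 32-bit FNV-1a hash over UTF-16 code units and traffic shares expressed in whole percent; the function names and the `experimentId:userId` key format are assumptions, and a production system would likely use a hash with better-studied distribution properties.

```javascript
// 32-bit FNV-1a hash over the string's UTF-16 code units.
function fnv1a32(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Maps a user deterministically to a variant. `variants` is a list of
// { name, share } entries whose shares (in percent) sum to 100.
function assignVariant(userId, experimentId, variants) {
  const bucket = fnv1a32(`${experimentId}:${userId}`) % 100;
  let cumulative = 0;
  for (const { name, share } of variants) {
    cumulative += share;
    if (bucket < cumulative) return name;
  }
  return variants[variants.length - 1].name; // fallback if shares sum below 100
}

const variants = [
  { name: "control", share: 50 },
  { name: "test", share: 50 },
];
console.log(assignVariant("user-123", "add-to-cart-button", variants));
```

Including the experiment ID in the hashed key keeps assignments independent across experiments, so the same users do not always land in the test group.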
Analytics capabilities:
- Real-time significance calculation
- Result segmentation by audience
- Multiple metrics tracked at once
- Anomaly and data-quality detection
Integration with Other Systems
A/B testing doesn't live in a vacuum:
- CRM systems: variant sync with customer profiles
- Email platforms: consistency between site and newsletters
- Personalization systems: test results feed into recommendation algorithms
- Data Warehouse: raw export for deep analysis
Data architecture
Minimum dataset for each test event:
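A sketch of what such a record might contain. The field names and the helper below are illustrative assumptions, not a prescribed schema; the point is that every event carries the experiment, the variant, a stable user identifier, and a timestamp so results can be rebuilt from raw data.

```javascript
// Builds one raw event record for an experiment (hypothetical schema).
function buildTestEvent({ experimentId, variant, userId, eventName, properties = {} }) {
  return {
    experiment_id: experimentId,         // which test the event belongs to
    variant,                             // variant shown to the user
    user_id: userId,                     // stable ID used for assignment
    event_name: eventName,               // e.g. "exposure", "purchase"
    timestamp: new Date().toISOString(), // when the event happened (UTC)
    properties,                          // segment fields, revenue, device, etc.
  };
}

const event = buildTestEvent({
  experimentId: "add-to-cart-button",
  variant: "test",
  userId: "user-123",
  eventName: "purchase",
  properties: { revenue: 49.9, device: "mobile" },
});
console.log(event);
```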
Culture of Experimentation
Test Prioritization
Not all test ideas are worth the same. ICE framework:
- Impact: potential metric improvement (1-10)
- Confidence: how confident you are in a positive result (1-10)
- Ease: implementation simplicity (1-10)
Final priority = (Impact × Confidence × Ease) / 30
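The formula can be applied to a backlog to rank ideas. The sketch below uses the formula exactly as stated; the test ideas and their scores are made up for illustration. Dividing by a constant does not change the ordering, it only rescales the scores.

```javascript
// ICE priority as defined above: (Impact x Confidence x Ease) / 30.
function iceScore({ impact, confidence, ease }) {
  return (impact * confidence * ease) / 30;
}

// Hypothetical backlog of test ideas, each scored 1-10 per dimension.
const ideas = [
  { name: "Button color", impact: 3, confidence: 5, ease: 10 },
  { name: "New checkout headline", impact: 8, confidence: 6, ease: 5 },
  { name: "Pricing page redesign", impact: 9, confidence: 4, ease: 2 },
];

const ranked = ideas
  .map((idea) => ({ ...idea, score: iceScore(idea) }))
  .sort((a, b) => b.score - a.score);

console.log(ranked.map((idea) => `${idea.name}: ${idea.score}`));
```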
Experiment Documentation
Each test deserves a record.
Before the test:
- Hypothesis and reasoning
- Experiment design and variants
- Target metrics and MDE
- Success criteria
After the test:
- Actual results across all metrics
- Segment analysis
- Conclusions and recommendations
- Next steps
Scaling the Testing Program
A/B testing maturity goes through stages:
- Initial: 1-2 tests per month, focus on obvious improvements
- Developing: 5-10 tests per month, systematic approach
- Mature: 20+ tests in parallel, data-driven decision culture
- Advanced: automation, machine learning, personalization
A/B testing turns product development from guesswork into a systematic process of data-backed improvements. Every test is a chance to understand the audience better and ship more effective experiences.
Tracking A/B tests with Statable
Statable does not run experiments. It does not split traffic, assign variants, or compute statistical significance. For that, integrate a dedicated experiment platform (LaunchDarkly, GrowthBook, Optimizely, PostHog, or your own server-side splitter).
What Statable does well is measure the test once it is running. The recommended pattern: when your experiment platform assigns a variant, attach the variant name as a custom property on every event for that visitor. For example:
<script>
// After your experiment platform tells you the variant:
window.statable.t('Pricing Page Loaded', { ab_variant: window.experimentVariant });
</script>
Or apply the property page-wide via the script tag:
<script defer
src="https://statable.com/js/YOUR_SITE_ID/s.js"
data-statable-ab-variant="control"></script>
The prop_ab_variant filter then lets you slice every dashboard report by variant: bounce rate per variant, conversion rate per variant, traffic source per variant, and so on. Combine with Goals to compare goal completion rates between variants.