A/B Testing in Web Analytics

A/B testing (split testing) compares two or more versions of a page, UI element, or marketing asset to find the variant that performs best. Instead of deciding by opinion, A/B tests give statistically reliable evidence about which changes actually move business metrics.

How A/B Testing Works

A/B testing is a controlled experiment. Traffic is randomly split between a control (A) and one or more variants (B, C, D). Each visitor sees one version. The analytics system records their behavior and conversions.

Process components:

Randomization: visitors are split between variants at random
Variable isolation: only one element changes per experiment
Statistical analysis: results are evaluated for reliability
Experience consistency: a returning visitor always sees the same variant

Traffic distribution example

An e-commerce store tests a new "Add to Cart" button:

Variant A (control): blue button, "Add to Cart", 50% of traffic
Variant B (test): green button, "Buy Now", 50% of traffic

After 10,000 visits (5,000 per variant):

Variant A: 250 conversions (5% conversion rate)
Variant B: 325 conversions (6.5% conversion rate)

Improvement: +30% relative (+1.5 percentage points), statistically significant at the 95% confidence level (two-sided z-test, p ≈ 0.001)
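The significance of the example above can be checked with a two-proportion z-test. This is a minimal sketch, assuming the 10,000 visits were split evenly (5,000 per variant); the function name is illustrative and not part of any analytics product.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Figures from the example, assuming an even 5,000 / 5,000 split.
z, p_value = two_proportion_z_test(250, 5000, 325, 5000)
relative_lift = (325 / 5000) / (250 / 5000) - 1
print(f"lift={relative_lift:.0%}, z={z:.2f}, p={p_value:.4f}")
```

A p-value below 0.05 corresponds to significance at the 95% level. The test is only valid if the sample size was fixed before the test started.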

Statistical Significance and Sample Size

Calculating Required Sample Size

Sample size depends on several factors:

Parameter	Description	Typical Value
Baseline conversion	Current conversion rate	2-5% for e-commerce
Minimum Detectable Effect (MDE)	Smallest improvement worth detecting	10-20% relative change
Statistical power	Probability of detecting a real effect	80% (standard)
Significance level	Acceptable false-positive rate	5% (p-value < 0.05)
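The four parameters in the table combine into a required sample size. The sketch below uses the standard normal-approximation formula for comparing two proportions with a two-sided test; the function name and the 5% baseline with a 20% relative MDE are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed in each variant for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical inputs: 5% baseline conversion, 20% relative MDE (5% -> 6%).
n = sample_size_per_variant(baseline=0.05, relative_mde=0.20)
print(n)
```

With these inputs the result is on the order of 8,000 visitors per variant. Halving the MDE roughly quadruples the requirement, which is why small expected effects need long tests.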

Test Duration

Minimum test duration is set by sample size and by business cycles:

Weekly cycles: cover at least one full week
Seasonality: account for holidays, weekends, month boundaries
Statistical validity: don't stop at the first sign of significance

Peeking problem

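Repeatedly checking results during a test inflates the false-positive rate. Checking significance 20 times during a test pushes the actual Type I error rate far above the nominal 5%: the often-quoted 64% figure (1 − 0.95^20) is the upper bound for 20 independent checks, and although successive looks at accumulating data are correlated and the true rate is lower, it is still several times 5%. Set test duration in advance and read results only after completion.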
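The effect of peeking can be estimated with a small Monte Carlo simulation of A/A tests, where no real difference exists. This is a sketch under a normal approximation with equally sized batches of data between looks; the function name, seed, and number of simulations are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def peeking_false_positive_rate(looks: int, simulations: int = 20_000,
                                alpha: float = 0.05, seed: int = 1) -> float:
    """Share of A/A tests declared significant at any of `looks` interim checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(simulations):
        cumulative = 0.0
        for look in range(1, looks + 1):
            # Each batch of new data adds an independent standard-normal increment;
            # the running z-statistic after `look` batches is cumulative / sqrt(look).
            cumulative += rng.gauss(0.0, 1.0)
            if abs(cumulative / sqrt(look)) > z_crit:
                false_positives += 1
                break
    return false_positives / simulations


single = peeking_false_positive_rate(looks=1)
twenty = peeking_false_positive_rate(looks=20)
independent_bound = 1 - 0.95 ** 20  # about 0.64, if the 20 checks were independent
print(f"1 look: {single:.3f}, 20 looks: {twenty:.3f}, bound: {independent_bound:.2f}")
```

A single look stays near 5%, while stopping at the first significant reading among 20 looks produces a false positive several times as often, though below the independent-checks bound.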

Types of A/B Tests

Classic A/B Test

Two versions, one element changed:

Page headline
Button color or text
Product image
Price or discount

A/B/n Testing

Several variants tested at once:

Needs more traffic to reach significance
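Finds the best variant faster than running several A/B tests back to back
Higher false-discovery risk from multiple comparisons unless the significance threshold is corrected (for example, Bonferroni: α divided by the number of comparisons)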

Multivariate Testing (MVT)

Combinations of multiple elements at once:

Headline (2 variants) × Image (3 variants) × Button (2 variants) = 12 combinations
Needs substantially more traffic, because every combination must reach its own sample size
Reveals interactions between elements
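To make the traffic cost concrete, the combinations can be enumerated and the available traffic divided among them. The variant labels and the monthly visitor count are hypothetical.

```python
from itertools import product

# Hypothetical labels for the 2 x 3 x 2 design described above.
headlines = ["headline_1", "headline_2"]
images = ["image_1", "image_2", "image_3"]
buttons = ["button_1", "button_2"]

combinations = list(product(headlines, images, buttons))

# Assumed traffic: 120,000 visitors per month, split evenly across all cells.
monthly_visitors = 120_000
visitors_per_combination = monthly_visitors // len(combinations)

print(len(combinations), visitors_per_combination)
```

Each added element multiplies the number of cells, so the traffic per cell shrinks quickly.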

Split URL Testing

Comparing radically different page versions on different URLs:

Full page redesigns
Different content structure
Alternative user flows

Choosing test type

Low traffic (< 1000 conversions/month): classic A/B with large changes
Medium traffic (1000-10,000 conversions/month): A/B/n with 3-4 variants
High traffic (> 10,000 conversions/month): MVT for combination optimization

Metrics to Track

Primary Metrics

Core metrics tied to business goals:

Conversion rate: share of visitors completing the target action
Average Order Value (AOV): average purchase amount
Revenue Per Visitor (RPV): total revenue / number of visitors
Lead count: forms, subscriptions, registrations completed
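The first three metrics are related: RPV equals conversion rate times AOV. A minimal sketch, assuming one order per converting visitor and using made-up order amounts:

```python
def primary_metrics(visitors: int, orders: list[float]) -> dict[str, float]:
    """Compute conversion rate, AOV, and RPV from a visitor count and order amounts."""
    revenue = sum(orders)
    return {
        "conversion_rate": len(orders) / visitors,
        "aov": revenue / len(orders) if orders else 0.0,
        "rpv": revenue / visitors,
    }


# Hypothetical data: 1,000 visitors, four orders.
metrics = primary_metrics(visitors=1000, orders=[40.0, 60.0, 25.0, 75.0])
print(metrics)
```

Because RPV combines both factors, it catches the case where a variant raises conversion but lowers the average order.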

Secondary Metrics

Diagnostic metrics for understanding the change:

Bounce rate: did landing-page interest shift
Time on page: engagement impact
Scroll depth: content consumption
Element clicks: micro-conversions and interactions

Guardrail Metrics

Metrics that must not deteriorate:

Page load speed: a redesign shouldn't slow the site
Return rate: more conversions shouldn't mean more refunds
Newsletter unsubscribes: aggressive CTAs shouldn't backfire

OEC metric system

Overall Evaluation Criterion (OEC), a composite metric for the full picture:

OEC = Conversion × 0.5 + (AOV / 1000) × 0.3 + (1 - Bounce Rate) × 0.2

Coefficient weights are set from business priorities.
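The formula above can be applied to each variant to get a single number to compare. In this sketch, rates are expressed as fractions (5% is 0.05) and the input values for both variants are hypothetical.

```python
def oec(conversion: float, aov: float, bounce_rate: float) -> float:
    """Composite score using the weights from the formula above (rates as fractions)."""
    return conversion * 0.5 + (aov / 1000) * 0.3 + (1 - bounce_rate) * 0.2


# Hypothetical per-variant measurements.
control = oec(conversion=0.050, aov=80.0, bounce_rate=0.45)
variant = oec(conversion=0.065, aov=76.0, bounce_rate=0.47)
print(f"control={control:.4f}, variant={variant:.4f}")
```

The three terms sit on different scales, so the real influence of each weight depends on the typical size of the metric it multiplies. It is worth checking that the composite moves the way the business expects before relying on it.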

Common Mistakes

Technical Errors

Flicker effect:

Users briefly see the original version before it's replaced with the test version. Distorts results and degrades UX.

Variant leakage:

One user sees different variants on repeat visits because of cookie clearing or cross-device use.

Unbalanced distribution:

Uneven split between variants caused by code bugs or caching.
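An unbalanced split (often called sample ratio mismatch) can be detected with a chi-square goodness-of-fit test on the visitor counts. This is a sketch for a two-variant test; the function name and the visitor counts are illustrative.

```python
from math import erfc, sqrt


def srm_p_value(visitors_a: int, visitors_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-way split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


balanced = srm_p_value(5020, 4980)  # small deviation, consistent with chance
skewed = srm_p_value(5300, 4700)    # large deviation, likely a bug
print(f"balanced p={balanced:.3f}, skewed p={skewed:.2e}")
```

A very small p-value (teams often use a strict threshold such as 0.001) suggests the assignment or tracking is broken. In that case the test results should not be trusted until the cause is found.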

Methodological Errors

Testing without a hypothesis:

"Let's try a red button" is not a hypothesis. Better: "A red button will increase conversion by 15% because it creates urgency."

Stopping early on significance:

Early positive results often shrink as the test continues (regression to the mean), so a test stopped at the first significant reading tends to overstate the effect.

Ignoring segmentation:

Overall result may be flat while the new version wins on mobile and loses on desktop.

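Segmentation example

Overall test result:

Variant A: 1,000 conversions from 10,000 visits (10%)
Variant B: 950 conversions from 10,000 visits (9.5%)

Variant A looks better. With segmentation:

New users:

Variant A: 200 from 3,000 (6.7%)
Variant B: 250 from 3,000 (8.3%)

Returning users:

Variant A: 800 from 7,000 (11.4%)
Variant B: 700 from 7,000 (10%)

Variant B wins for new users, the key growth segment, while Variant A wins for returning users and overall. Strictly speaking this is a segment-level difference in effect rather than Simpson's paradox: a true Simpson's reversal would have one variant win in every segment yet lose in aggregate, which can only happen when the segment mix differs between variants.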
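The segment breakdown above can be reproduced with a few lines. The counts are copied from the example; the data structure is an illustrative choice.

```python
# (conversions, visits) per segment and variant, copied from the example above.
results = {
    "new": {"A": (200, 3000), "B": (250, 3000)},
    "returning": {"A": (800, 7000), "B": (700, 7000)},
}


def conversion_rate(conversions: int, visits: int) -> float:
    return conversions / visits


segment_winner = {}
for segment, variants in results.items():
    rates = {name: conversion_rate(*counts) for name, counts in variants.items()}
    segment_winner[segment] = max(rates, key=rates.get)
    print(segment, {name: f"{rate:.1%}" for name, rate in rates.items()})

overall = {}
for name in ("A", "B"):
    conversions = sum(results[segment][name][0] for segment in results)
    visits = sum(results[segment][name][1] for segment in results)
    overall[name] = conversion_rate(conversions, visits)
overall_winner = max(overall, key=overall.get)
print("overall", {name: f"{rate:.1%}" for name, rate in overall.items()})
```

Segment comparisons use smaller samples than the overall test, so each one needs its own significance check before acting on it.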

Alternative Approaches

Bayesian Approach

Compared to the frequentist approach with p-values, Bayesian testing:

Reports the probability that B is better than A
Tolerates interim looks better than fixed-horizon p-values, though the stopping rule should still be set in advance
Incorporates prior knowledge about the metric
Provides interval estimates of the expected improvement
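The "probability that B is better than A" can be estimated by sampling from each variant's posterior. This is a minimal sketch, assuming uniform Beta(1, 1) priors and an even 5,000 / 5,000 split of the earlier button example; the function name, seed, and number of draws are arbitrary.

```python
import random


def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   draws: int = 20_000, seed: int = 7) -> float:
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a conversion rate: Beta(1 + conversions, 1 + non-conversions).
        sample_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        sample_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += sample_b > sample_a
    return wins / draws


probability = prob_b_beats_a(250, 5000, 325, 5000)
print(f"P(B > A) = {probability:.3f}")
```

An informative prior based on past experiments can replace the Beta(1, 1) parameters. How strongly to weight that prior is a judgment call.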

Multi-Armed Bandits

Algorithms that dynamically shift traffic toward winning variants:

Epsilon-greedy: random variant with probability ε, otherwise the best
Thompson Sampling: variants chosen proportionally to their probability of being best
Upper Confidence Bound (UCB): picks the variant with the highest optimistic estimate, the observed mean plus a bonus that shrinks as the variant collects data

Advantages:

Minimizes losses from exposing users to losing variants
Self-optimizing without manual intervention

Disadvantages:

Harder to interpret
Poorly suited to tests with delayed conversions, because traffic is reallocated before outcomes are known
More complex to implement
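Thompson Sampling, listed above, is compact enough to sketch in full. The simulation below assumes immediate yes/no conversions and made-up true rates of 5% and 10%; it is an illustration of the idea, not a production allocator.

```python
import random


def thompson_sampling(true_rates: list[float], visitors: int, seed: int = 3) -> list[int]:
    """Simulate Thompson Sampling on Bernoulli arms; return visitors sent to each arm."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(visitors):
        # Draw a plausible rate for each arm from its Beta posterior, play the highest draw.
        draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        arm = draws.index(max(draws))
        # Simulated visitor: converts with the arm's (hidden) true rate.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical true conversion rates, unknown to the algorithm.
allocation = thompson_sampling(true_rates=[0.05, 0.10], visitors=10_000)
print(allocation)
```

Traffic drifts toward the better arm as evidence accumulates. The resulting unequal sample sizes are part of why bandit results are harder to interpret than a fixed split.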

A/B Testing Infrastructure

Analytics System Requirements

A working A/B testing platform should provide:

Technical infrastructure:

Random user assignment via hash functions
Variant persistence through cookies or user ID
Low latency on the variant decision (< 50 ms)
Bot and automated traffic protection
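Hash-based assignment gives both randomization and persistence without storing state: the same user ID always hashes to the same variant. This is a minimal sketch with equal weights; the function name is illustrative, SHA-256 is one reasonable hash choice among several, and the IDs reuse the format of the sample event in this article.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str,
                   variants: tuple[str, ...] = ("A", "B")) -> str:
    """Deterministically map a user to a variant with equal weights."""
    # Salting with the experiment ID keeps assignments independent across experiments.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


first = assign_variant("usr_123456", "exp_homepage_hero")
again = assign_variant("usr_123456", "exp_homepage_hero")

# Check the split over 10,000 synthetic user IDs.
share_a = sum(
    assign_variant(f"usr_{i}", "exp_homepage_hero") == "A" for i in range(10_000)
) / 10_000
print(first, again, share_a)
```

This only stays consistent while the user ID is stable. Anonymous visitors still need a cookie or similar identifier to hash.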

Analytics capabilities:

Real-time significance calculation
Result segmentation by audience
Multiple metrics tracked at once
Anomaly and data-quality detection

Integration with Other Systems

A/B testing doesn't live in a vacuum:

CRM systems: variant sync with customer profiles
Email platforms: consistency between site and newsletters
Personalization systems: test results feed into recommendation algorithms
Data Warehouse: raw export for deep analysis

Data architecture

Minimum dataset for each test event:

{
  "user_id": "usr_123456",
  "session_id": "ses_789012",
  "timestamp": "2024-01-15T10:30:00Z",
  "experiment_id": "exp_homepage_hero",
  "variant": "B",
  "event_type": "pageview",
  "metadata": {
    "device": "mobile",
    "source": "organic",
    "new_user": false
  }
}

Culture of Experimentation

Test Prioritization

Not all test ideas are worth the same. ICE framework:

Impact: potential metric improvement (1-10)
Confidence: probability of a positive result (1-10)
Ease: implementation simplicity (1-10)

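Final priority = (Impact × Confidence × Ease) / 30

The divisor only rescales the product (to a maximum of about 33); it does not change how ideas rank against each other.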
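Applied to a backlog, the formula gives a simple ranking. The test ideas and their scores below are hypothetical.

```python
def ice_priority(impact: int, confidence: int, ease: int) -> float:
    """ICE priority as defined above: product of the three scores divided by 30."""
    return impact * confidence * ease / 30


# Hypothetical backlog: (impact, confidence, ease), each scored 1-10.
ideas = {
    "Shorter checkout form": (8, 6, 4),
    "New hero headline": (5, 5, 9),
    "Pricing page redesign": (9, 4, 2),
}

ranked = sorted(ideas, key=lambda name: ice_priority(*ideas[name]), reverse=True)
for name in ranked:
    print(f"{ice_priority(*ideas[name]):5.1f}  {name}")
```

Because the scores are multiplied, a single low score (here, the ease of a full redesign) pulls an idea down sharply even when its potential impact is high.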

Experiment Documentation

Each test deserves a record.

Before the test:

Hypothesis and reasoning
Experiment design and variants
Target metrics and MDE
Success criteria

After the test:

Actual results across all metrics
Segment analysis
Conclusions and recommendations
Next steps

Scaling the Testing Program

A/B testing maturity goes through stages:

Initial: 1-2 tests per month, focus on obvious improvements
Developing: 5-10 tests per month, systematic approach
Mature: 20+ tests in parallel, data-driven decision culture
Advanced: automation, machine learning, personalization

A/B testing turns product development from guesswork into a systematic process of data-backed improvements. Every test is a chance to understand the audience better and ship more effective experiences.

Tracking A/B tests with Statable

Statable does not run experiments. It does not split traffic, assign variants, or compute statistical significance. For that, integrate a dedicated experiment platform (LaunchDarkly, GrowthBook, Optimizely, PostHog, or your own server-side splitter).

What Statable does well is measure the test once it is running. The recommended pattern: when your experiment platform assigns a variant, attach the variant name as a custom property on every event for that visitor. For example:

<script>
  // After your experiment platform tells you the variant:
  window.statable.t('Pricing Page Loaded', { ab_variant: window.experimentVariant });
</script>

Or apply the property page-wide via the script tag:

<script defer
  src="https://statable.com/js/YOUR_SITE_ID/s.js"
  data-statable-ab-variant="control"></script>

The prop_ab_variant filter then lets you slice every dashboard report by variant: bounce rate per variant, conversion rate per variant, traffic source per variant, and so on. Combine with Goals to compare goal completion rates between variants.
