
A/B Testing in Web Analytics

A/B testing (split testing) compares two or more versions of a page, UI element, or marketing asset to find the variant that performs best. Instead of deciding by opinion, A/B tests give statistically reliable evidence about which changes actually move business metrics.

How A/B Testing Works

A/B testing is a controlled experiment. Traffic is randomly split between a control (A) and one or more variants (B, C, D). Each visitor sees one version. The analytics system records their behavior and conversions.

Process components:

  • Randomization: visitors are split between variants at random
  • Variable isolation: only one element changes per experiment
  • Statistical analysis: results are evaluated for reliability
  • Experience consistency: a returning visitor always sees the same variant

Traffic distribution example

An e-commerce store tests a new "Add to Cart" button:

  • Variant A (control): blue button, "Add to Cart", 50% of traffic
  • Variant B (test): green button, "Buy Now", 50% of traffic

After 10,000 visits (5,000 per variant):

  • Variant A: 250 conversions (5% conversion rate)
  • Variant B: 325 conversions (6.5% conversion rate)

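Improvement: +30% relative lift (1.5 percentage points absolute), statistically significant at the 95% confidence level (two-sided p ≈ 0.001 under a two-proportion z-test).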
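The significance claim for this example can be checked with a two-proportion z-test. The sketch below is one common way to do it, not necessarily what any given platform uses; it assumes the 10,000 visits were split evenly (5,000 per variant), and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (relative_lift, z, two_sided_p) for variant B versus variant A."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    lift = (rate_b - rate_a) / rate_a
    return lift, z, p_value


# Assumption: an even split, so each variant received 5,000 visits.
lift, z, p_value = two_proportion_z_test(250, 5000, 325, 5000)
print(f"lift={lift:.1%}  z={z:.2f}  p={p_value:.4f}")
```

A p-value below 0.05 corresponds to the "95% significance" wording used above.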

Statistical Significance and Sample Size

Calculating Required Sample Size

Sample size depends on several factors:

Parameter                       | Description                          | Typical value
Baseline conversion             | Current conversion rate              | 2-5% for e-commerce
Minimum Detectable Effect (MDE) | Smallest improvement worth detecting | 10-20% relative change
Statistical power               | Probability of detecting a real effect | 80% (standard)
Significance level              | Acceptable false-positive rate       | 5% (p-value < 0.05)
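The four parameters above combine into a per-variant sample size. The following is a minimal sketch using the standard normal-approximation formula for comparing two proportions; the function name and the example inputs (5% baseline, 20% relative MDE) are illustrative assumptions, and dedicated calculators may use slightly different formulas.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed in EACH variant for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical inputs: 5% baseline conversion, 20% relative MDE.
n = sample_size_per_variant(baseline=0.05, relative_mde=0.20)
print(f"{n} visitors per variant")
```

Halving the MDE roughly quadruples the required sample, which is why low-traffic sites are better off testing large changes.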

Test Duration

Minimum test duration is set by sample size and by business cycles:

  • Weekly cycles: cover at least one full week
  • Seasonality: account for holidays, weekends, month boundaries
  • Statistical validity: don't stop at the first sign of significance
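One way to turn these rules into a planned duration is to divide the required sample by daily traffic and round up to whole weeks. This is a sketch under stated assumptions: the function name and the traffic figures are hypothetical, and it does not model seasonality.

```python
from math import ceil


def planned_duration_days(sample_per_variant, n_variants, daily_visitors):
    """Days needed to collect the sample, rounded up to full weeks."""
    total_sample = sample_per_variant * n_variants
    raw_days = ceil(total_sample / daily_visitors)
    # Round up to a multiple of 7 so every weekday is covered equally often.
    return ceil(raw_days / 7) * 7


# Hypothetical plan: 8,000 visitors per variant, 2 variants, 1,500 visitors/day.
days = planned_duration_days(8000, 2, 1500)
print(f"Run the test for {days} days")
```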

Peeking problem

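Repeatedly checking results during a test and stopping at the first significant reading inflates the false-positive rate. If 20 checks were fully independent, the chance of at least one false positive at the 5% level would be 1 - 0.95^20 ≈ 64%. Successive checks reuse the same accumulating data, so the real inflation is smaller, but the actual Type I error rate still ends up several times the nominal 5%. Set test duration in advance and read results only after completion.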
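The size of the peeking effect can be estimated by simulating A/A tests, where no real difference exists. The sketch below is a simplified model, not a reproduction of any specific platform: it assumes equal-sized batches of traffic between looks and a normal approximation, so the cumulative z-statistic is a scaled random walk. The function name, simulation count, and seed are arbitrary choices.

```python
import random
from math import sqrt


def false_positive_rate(n_looks, n_sims=4000, z_crit=1.96, seed=1):
    """Share of A/A tests declared significant at ANY of n_looks interim checks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            # Each equal-sized batch adds an independent N(0, 1) increment, so
            # the z-statistic on all data so far is running_sum / sqrt(look).
            running_sum += rng.gauss(0.0, 1.0)
            if abs(running_sum / sqrt(look)) > z_crit:
                hits += 1  # the experimenter stops here and declares a winner
                break
    return hits / n_sims


single = false_positive_rate(1)
peeking = false_positive_rate(20)
print(f"1 look: {single:.1%}   20 looks: {peeking:.1%}")
```

With a single look the rate stays near the nominal 5%; with 20 looks it rises well above it even though the variants are identical.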

Types of A/B Tests

Classic A/B Test

Two versions, one element changed:

  • Page headline
  • Button color or text
  • Product image
  • Price or discount

A/B/n Testing

Several variants tested at once:

  • Needs more traffic to reach significance
  • Finds the best variant faster than running a series of separate A/B tests
  • Higher false-discovery risk from multiple comparisons
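The multiple-comparison risk can be quantified, and one simple (conservative) remedy is a Bonferroni correction. The sketch below treats the comparisons as independent, which is an approximation when every variant is compared with the same control; the function names are illustrative.

```python
def familywise_error(alpha, n_comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_comparisons


# Three variants, each compared with the control: 3 comparisons.
uncorrected = familywise_error(0.05, 3)
per_test_alpha = bonferroni_alpha(0.05, 3)
corrected = familywise_error(per_test_alpha, 3)
print(f"uncorrected: {uncorrected:.3f}  per-test alpha: {per_test_alpha:.4f}  "
      f"corrected: {corrected:.3f}")
```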

Multivariate Testing (MVT)

Combinations of multiple elements at once:

  • Headline (2 variants) × Image (3 variants) × Button (2 variants) = 12 combinations
  • Needs significant traffic
  • Reveals interactions between elements
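The combination count, and the traffic it implies, can be enumerated directly. In this sketch the variant labels and the per-combination sample size are hypothetical; only the 2 × 3 × 2 structure comes from the example above.

```python
from itertools import product

# Hypothetical labels; only the counts (2 x 3 x 2) come from the example.
headlines = ["headline_1", "headline_2"]
images = ["image_1", "image_2", "image_3"]
buttons = ["button_1", "button_2"]

combinations = list(product(headlines, images, buttons))

# Assumption: each combination needs its own full sample.
visitors_per_cell = 8000
total_visitors = len(combinations) * visitors_per_cell

print(f"{len(combinations)} combinations, {total_visitors} visitors in total")
```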

Split URL Testing

Comparing radically different page versions on different URLs:

  • Full page redesigns
  • Different content structure
  • Alternative user flows

Choosing test type

  • Low traffic (< 1,000 conversions/month): classic A/B with large changes
  • Medium traffic (1,000-10,000 conversions/month): A/B/n with 3-4 variants
  • High traffic (> 10,000 conversions/month): MVT for combination optimization

Metrics to Track

Primary Metrics

Core metrics tied to business goals:

  • Conversion rate: share of visitors completing the target action
  • Average Order Value (AOV): average purchase amount
  • Revenue Per Visitor (RPV): total revenue / number of visitors
  • Lead count: forms, subscriptions, registrations completed
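The first three metrics are simple ratios and are related: RPV equals conversion rate times AOV. A minimal sketch with hypothetical per-variant totals (the function name and numbers are illustrative):

```python
def primary_metrics(visitors, orders, revenue):
    """Compute conversion rate, AOV and RPV from per-variant totals."""
    conversion_rate = orders / visitors
    aov = revenue / orders if orders else 0.0
    rpv = revenue / visitors
    return {"conversion_rate": conversion_rate, "aov": aov, "rpv": rpv}


# Hypothetical totals for one variant.
metrics = primary_metrics(visitors=5000, orders=250, revenue=12500.0)
print(metrics)
```

Because RPV combines the other two, it is a useful tie-breaker when a variant raises conversion but lowers order value.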

Secondary Metrics

Diagnostic metrics for understanding the change:

  • Bounce rate: did landing-page interest shift
  • Time on page: engagement impact
  • Scroll depth: content consumption
  • Element clicks: micro-conversions and interactions

Guardrail Metrics

Metrics that must not deteriorate:

  • Page load speed: a redesign shouldn't slow the site
  • Return rate: more conversions shouldn't mean more refunds
  • Newsletter unsubscribes: aggressive CTAs shouldn't backfire

OEC metric system

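An Overall Evaluation Criterion (OEC) is a composite metric that folds several metrics into one score:

OEC = Conversion × 0.5 + (AOV / 1000) × 0.3 + (1 - Bounce Rate) × 0.2

The weights are set from business priorities. Conversion and bounce rate enter as fractions between 0 and 1, and AOV is divided by 1000 to bring it onto a comparable scale.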
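The OEC formula translates directly into code. In this sketch the per-variant inputs are hypothetical, and conversion and bounce rate are assumed to be fractions between 0 and 1.

```python
def oec(conversion, aov, bounce_rate, weights=(0.5, 0.3, 0.2)):
    """Composite score from the formula above; rates are fractions in [0, 1]."""
    w_conversion, w_aov, w_bounce = weights
    return (conversion * w_conversion
            + (aov / 1000) * w_aov
            + (1 - bounce_rate) * w_bounce)


# Hypothetical per-variant metrics.
oec_a = oec(conversion=0.050, aov=80.0, bounce_rate=0.40)
oec_b = oec(conversion=0.065, aov=75.0, bounce_rate=0.42)
print(f"OEC A = {oec_a:.3f}, OEC B = {oec_b:.3f}")
```

With inputs on these scales the bounce-rate term contributes the largest share of the score despite having the smallest weight, so it is worth checking each term's magnitude before settling on weights.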

Common Mistakes

Technical Errors

Flicker effect:

Users briefly see the original version before it is replaced with the test version. This distorts results and degrades the user experience.

Variant leakage:

One user sees different variants on repeat visits because of cookie clearing or cross-device use.

Unbalanced distribution:

Uneven split between variants caused by code bugs or caching.
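An uneven split can be detected with a sample ratio mismatch (SRM) check: a chi-square goodness-of-fit test of observed visitor counts against the intended split. The sketch below covers the two-variant case only; the function name and the example counts are illustrative.

```python
from math import erfc, sqrt


def srm_p_value(visitors_a, visitors_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # For 1 degree of freedom the chi-square tail equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


# Hypothetical counts for an intended 50/50 split.
print(srm_p_value(5050, 4950))  # small imbalance, consistent with chance
print(srm_p_value(5200, 4800))  # large imbalance, likely a bug
```

A very small p-value means the imbalance is unlikely to be chance, and the test results should not be trusted until the cause is found.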

Methodological Errors

Testing without a hypothesis:

"Let's try a red button" is not a hypothesis. Better: "A red button will increase conversion by 15% because it creates urgency."

Stopping early on significance:

Early positive results often shrink as the test continues (regression to the mean), so stopping at the first significant reading tends to overstate the effect.

Ignoring segmentation:

The overall result may be flat while the new version wins on mobile and loses on desktop.

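Segment reversal example

Overall test result:

  • Variant A: 1,000 conversions from 10,000 visits (10%)
  • Variant B: 950 conversions from 10,000 visits (9.5%)

Variant A looks better. With segmentation:

New users:

  • Variant A: 200 from 3,000 (6.7%)
  • Variant B: 250 from 3,000 (8.3%)

Returning users:

  • Variant A: 800 from 7,000 (11.4%)
  • Variant B: 700 from 7,000 (10%)

Variant B wins for new users, the key growth segment, while Variant A wins for returning users. This is an effect that differs by segment rather than Simpson's paradox in the strict sense: Simpson's paradox requires one variant to win in every segment yet lose overall, which can only happen when the segment mix differs between variants.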
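The figures in this example can be tabulated programmatically to find the winner per segment and overall. This is a sketch; the data layout and function names are illustrative choices.

```python
# (conversions, visits) per variant and segment, copied from the example above.
results = {
    "A": {"new": (200, 3000), "returning": (800, 7000)},
    "B": {"new": (250, 3000), "returning": (700, 7000)},
}


def conversion_rate(conversions, visits):
    return conversions / visits


def winners_by_segment(results):
    """Return the best variant for each segment and for all traffic combined."""
    segments = next(iter(results.values())).keys()
    winners = {}
    for segment in segments:
        winners[segment] = max(
            results, key=lambda v: conversion_rate(*results[v][segment]))
    # Sum conversions and visits across segments to get the overall rate.
    totals = {
        variant: (sum(c for c, _ in segs.values()),
                  sum(n for _, n in segs.values()))
        for variant, segs in results.items()
    }
    winners["overall"] = max(totals, key=lambda v: conversion_rate(*totals[v]))
    return winners


winners = winners_by_segment(results)
print(winners)
```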

Alternative Approaches

Bayesian Approach

Compared to the frequentist approach with p-values, Bayesian testing:

  • Reports the probability that B is better than A
  • Tolerates interim looks better than a fixed-horizon p-value test, although the stopping rule should still be set in advance
  • Incorporates prior knowledge about the metric
  • Provides interval estimates of the expected improvement
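The "probability that B is better than A" can be estimated by sampling from each variant's posterior. The sketch below assumes a Beta-Binomial model with uniform Beta(1, 1) priors, which is a common default rather than the only choice, and reuses the button example (250 and 325 conversions from 5,000 visits each). The function name, draw count, and seed are arbitrary.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=7):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate is Beta(1 + conversions, 1 + non-conversions).
        sample_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        sample_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += sample_b > sample_a
    return wins / draws


probability = prob_b_beats_a(250, 5000, 325, 5000)
print(f"P(B > A) = {probability:.3f}")
```

Replacing the uniform prior with one based on historical conversion rates is how prior knowledge enters the analysis.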

Multi-Armed Bandits

Algorithms that dynamically shift traffic toward winning variants:

  • Epsilon-greedy: random variant with probability ε, otherwise the best
  • Thompson Sampling: variants chosen proportionally to their probability of being best
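  • Upper Confidence Bound (UCB): picks the variant whose estimated conversion rate has the highest upper confidence bound, so under-sampled variants keep getting explored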
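As an illustration of one algorithm from this list, here is a minimal Thompson Sampling simulation for conversion-style (success or failure) outcomes. It is a sketch under stated assumptions: the true conversion rates are hypothetical and would be unknown in practice, outcomes are assumed to arrive immediately, and the function name is illustrative.

```python
import random


def thompson_sampling(true_rates, n_visitors, seed=3):
    """Simulate Thompson Sampling; return how many visitors each variant got."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(n_visitors):
        # Draw one plausible conversion rate per variant from its Beta posterior
        # and show the visitor the variant with the highest draw.
        samples = [rng.betavariate(1 + s, 1 + f)
                   for s, f in zip(successes, failures)]
        arm = samples.index(max(samples))
        # Simulated outcome; in production this is the observed conversion.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical true rates: 5% for the control, 10% for the variant.
pulls = thompson_sampling([0.05, 0.10], n_visitors=5000)
print(pulls)
```

Traffic drifts toward the better variant as evidence accumulates, which is the loss-minimizing behavior bandits are used for.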

Advantages:

  • Minimizes losses from exposing users to losing variants
  • Self-optimizing without manual intervention

Disadvantages:

  • Harder to interpret
  • Poorly suited to delayed conversions, because traffic is reallocated before outcomes are known
  • More complex to implement

A/B Testing Infrastructure

Analytics System Requirements

A working A/B testing platform should provide:

Technical infrastructure:

  • Random user assignment via hash functions
  • Variant persistence through cookies or user ID
  • Minimal latency on variant decision (< 50ms)
  • Bot and automated traffic protection
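Hash-based assignment gives both randomization and persistence: the same user and experiment always map to the same variant, with no lookup required. The sketch below is one possible scheme, not a description of any particular platform; the function name, the choice of SHA-256, and the key format are assumptions.

```python
import hashlib


def assign_variant(user_id, experiment_id, variants=("A", "B")):
    """Deterministically map a user to a variant for a given experiment."""
    # Including the experiment ID keeps assignments independent across tests.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and pick a variant by modulo.
    index = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[index]


first = assign_variant("usr_123456", "exp_homepage_hero")
print(first)
```

Because assignment depends only on the IDs, it stays stable across sessions and devices as long as the same user ID is available.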

Analytics capabilities:

  • Real-time significance calculation
  • Result segmentation by audience
  • Multiple metrics tracked at once
  • Anomaly and data-quality detection

Integration with Other Systems

A/B testing doesn't live in a vacuum:

  • CRM systems: variant sync with customer profiles
  • Email platforms: consistency between site and newsletters
  • Personalization systems: test results feed into recommendation algorithms
  • Data Warehouse: raw export for deep analysis

Data architecture

Minimum dataset for each test event:

{
  "user_id": "usr_123456",
  "session_id": "ses_789012",
  "timestamp": "2024-01-15T10:30:00Z",
  "experiment_id": "exp_homepage_hero",
  "variant": "B",
  "event_type": "pageview",
  "metadata": {
    "device": "mobile",
    "source": "organic",
    "new_user": false
  }
}
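Incoming events can be validated against this minimum dataset before they are stored. The following is a sketch only: the set of required fields mirrors the top-level keys of the example event, and the function name and error handling are illustrative assumptions.

```python
import json
from datetime import datetime

# Top-level keys taken from the example event above.
REQUIRED_FIELDS = {"user_id", "session_id", "timestamp",
                   "experiment_id", "variant", "event_type"}


def parse_event(raw):
    """Parse a JSON event and check that the minimum fields are present."""
    event = json.loads(raw)
    missing = REQUIRED_FIELDS - set(event)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Replace the trailing "Z" so older Python versions can parse the timestamp.
    event["timestamp"] = datetime.fromisoformat(
        event["timestamp"].replace("Z", "+00:00"))
    return event


raw = """{
  "user_id": "usr_123456", "session_id": "ses_789012",
  "timestamp": "2024-01-15T10:30:00Z",
  "experiment_id": "exp_homepage_hero", "variant": "B",
  "event_type": "pageview",
  "metadata": {"device": "mobile", "source": "organic", "new_user": false}
}"""
event = parse_event(raw)
print(event["experiment_id"], event["variant"], event["timestamp"].isoformat())
```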

Culture of Experimentation

Test Prioritization

Not all test ideas are worth the same. ICE framework:

  • Impact: potential metric improvement (1-10)
  • Confidence: probability of a positive result (1-10)
  • Ease: implementation simplicity (1-10)

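Final priority = (Impact × Confidence × Ease) / 30, giving a score between roughly 0.03 and 33.3. The divisor only rescales the product, so it does not change how ideas rank against each other.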
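Applied to a backlog, the formula gives a simple ranking. The test ideas and their scores below are hypothetical, and the function name is illustrative.

```python
def ice_priority(impact, confidence, ease):
    """ICE priority as defined above; each input is scored from 1 to 10."""
    return impact * confidence * ease / 30


# Hypothetical backlog: name -> (impact, confidence, ease).
ideas = {
    "Shorter checkout form": (8, 6, 4),
    "New hero headline": (5, 5, 9),
    "Pricing page redesign": (9, 4, 2),
}

ranked = sorted(ideas, key=lambda name: ice_priority(*ideas[name]), reverse=True)
for name in ranked:
    print(f"{ice_priority(*ideas[name]):5.1f}  {name}")
```

Because the three scores are multiplied, a low score on any one dimension pulls the whole idea down, which favors cheap, well-understood changes.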

Experiment Documentation

Each test deserves a record.

Before the test:

  • Hypothesis and reasoning
  • Experiment design and variants
  • Target metrics and MDE
  • Success criteria

After the test:

  • Actual results across all metrics
  • Segment analysis
  • Conclusions and recommendations
  • Next steps

Scaling the Testing Program

A/B testing maturity goes through stages:

  1. Initial: 1-2 tests per month, focus on obvious improvements
  2. Developing: 5-10 tests per month, systematic approach
  3. Mature: 20+ tests in parallel, data-driven decision culture
  4. Advanced: automation, machine learning, personalization

A/B testing turns product development from guesswork into a systematic process of data-backed improvements. Every test is a chance to understand the audience better and ship more effective experiences.

Tracking A/B tests with Statable

Statable does not run experiments. It does not split traffic, assign variants, or compute statistical significance. For that, integrate a dedicated experiment platform (LaunchDarkly, GrowthBook, Optimizely, PostHog, or your own server-side splitter).

What Statable does well is measure the test once it is running. The recommended pattern: when your experiment platform assigns a variant, attach the variant name as a custom property on every event for that visitor. For example:

<script>
  // After your experiment platform tells you the variant:
  window.statable.t('Pricing Page Loaded', { ab_variant: window.experimentVariant });
</script>

Or apply the property page-wide via the script tag:

<script defer
  src="https://statable.com/js/YOUR_SITE_ID/s.js"
  data-statable-ab-variant="control"></script>

The prop_ab_variant filter then lets you slice every dashboard report by variant: bounce rate per variant, conversion rate per variant, traffic source per variant, and so on. Combine with Goals to compare goal completion rates between variants.

