Online Experiments

TL;DR

Online experiments measure causal impact under live traffic. They are the bridge between offline ML metrics and real product outcomes. A good experimentation platform controls assignment, exposure logging, metric computation, guardrails, sample ratio checks, segmentation, and rollout decisions. Without this discipline, teams mistake correlation, novelty effects, or broken logging for model improvement.

Experiment Lifecycle

An experiment is not just a traffic split. It is a measurement system with a decision rule.

A/B Test, Canary, Shadow, or Bandit?

Technique	Primary question	Traffic impact	Best for
Shadow	Can it run safely?	No user decision impact	Runtime validation
Canary	Is it safe enough to continue?	Small impact	Release safety
A/B test	Is it better?	Controlled user/entity impact	Product and model quality
Interleaving	Which ranker wins per query/session?	Mixed results in one surface	Search/recommendation rankers
Bandit	Which variant should get more traffic while learning?	Adaptive impact	Fast optimization with clear reward

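Use the least invasive technique that answers the decision at hand. Do not use a bandit when you need an unbiased product readout, because adaptive allocation shifts traffic while the metric is still being measured.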

Randomization Unit

The unit of randomization must match the interference pattern.

Unit	Use when	Risk
Request	Stateless prediction	Same user sees inconsistent behavior
User	Personalized product surface	Household/team effects ignored
Session	Short-lived experience	Cross-session learning contamination
Entity	Marketplace item, merchant, creator	User experience mixes treatment/control
Cluster/region	Network or marketplace interference	Needs more traffic and longer duration

For recommender systems, user-level assignment is common, but marketplace and social systems often have interference: one user's treatment changes another user's inventory or content exposure.
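Whatever unit is chosen, assignment is usually made deterministic so the same unit always lands in the same variant. A minimal sketch, assuming a hash-based bucketing scheme; the function name `assign_variant` and the `experiment_id:unit_id` salt format are illustrative, not part of any specific platform:

```python
import hashlib


def assign_variant(experiment_id, unit_id, variants=("control", "treatment")):
    """Deterministically map a randomization unit to a variant.

    unit_id is whatever the chosen unit is: a user ID, session ID,
    entity ID, or cluster/region ID.
    """
    # Salting with the experiment ID keeps bucketing independent across
    # experiments, so the same users are not always grouped together.
    key = f"{experiment_id}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 32 bits of the hash as a uniform bucket number.
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]
```

Switching from user-level to cluster-level randomization then only changes which identifier is passed as `unit_id`; the analysis must use the same unit.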

Metric Hierarchy

Examples:

Primary: conversion, retention, fraud loss, successful task completion.
Guardrails: latency, error rate, complaint rate, refund rate, manual review volume.
Diagnostics: feature miss rate, score distribution, ranking depth, cache hit rate.
Slices: new users, geography, device, language, high-risk tenants, cold-start items.

If the primary metric wins but guardrails fail, the experiment should not ship.

Exposure Logging

Correct exposure logging is the foundation of experiment analysis.

Log:

Experiment ID and variant.
Assignment unit and stable identifier.
Exposure timestamp.
Model and policy version.
Surface or placement.
Candidate set and rank when relevant.
Eligibility reason and filters.
Downstream outcomes with event time.

Analyzing by assignment alone, without exposure, counts users who were eligible but never saw the treatment, which dilutes the measured effect.
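The fields above can be sketched as a small record type, plus a helper that keeps the earliest exposure per unit so analysis starts from the moment a unit actually saw a variant. This is an illustrative shape only; the class name, field names, and `first_exposures` helper are assumptions, and a real schema would also carry candidate sets, eligibility reasons, and outcome joins:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ExposureEvent:
    experiment_id: str
    variant: str
    unit_type: str        # e.g. "user", "session", "entity"
    unit_id: str          # stable identifier for the assignment unit
    exposed_at: datetime  # exposure timestamp
    model_version: str
    surface: str          # surface or placement


def first_exposures(events):
    """Return the earliest exposure per (experiment_id, unit_id)."""
    earliest = {}
    for event in events:
        key = (event.experiment_id, event.unit_id)
        if key not in earliest or event.exposed_at < earliest[key].exposed_at:
            earliest[key] = event
    return earliest
```

Units that were assigned but have no entry in the result never reached the surface and can be excluded from an exposure-triggered analysis.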

Common Statistical Checks

Sample Ratio Mismatch

A sample ratio mismatch (SRM) occurs when the observed split differs from the designed one, for example an expected 50/50 split that shows 60/40 exposure. This usually indicates assignment, eligibility, logging, or caching bugs.
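One common way to check for SRM is a chi-square goodness-of-fit test on the exposure counts. The sketch below handles the two-variant case using only the standard library; the function name `srm_p_value` is hypothetical, and any alerting threshold (a strict one such as 0.001 is often used) is a choice the platform owner has to make:

```python
import math


def srm_p_value(control_count, treatment_count, expected_control_share=0.5):
    """P-value of a chi-square goodness-of-fit test for a two-variant split."""
    total = control_count + treatment_count
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (control_count - expected_control) ** 2 / expected_control
        + (treatment_count - expected_treatment) ** 2 / expected_treatment
    )
    # With two variants there is 1 degree of freedom, and the chi-square
    # survival function reduces to erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))
```

For example, `srm_p_value(5050, 4950)` is about 0.32, which is consistent with a 50/50 design, while `srm_p_value(6000, 4000)` is vanishingly small and should block analysis until the cause is found.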

Novelty Effects

Users react strongly because a feature is new, not because it is better long term.

Carryover Effects

Treatment changes user behavior after the experiment ends or contaminates later sessions.

Multiple Comparisons

If teams inspect many metrics and slices, some will look significant by chance.
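One standard correction is the Benjamini-Hochberg procedure, which controls the false discovery rate across a family of tests. The sketch below is a plain implementation under the assumption that each metric or slice already has a p-value; the function name and the dictionary input shape are illustrative:

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the names of tests that remain significant after BH correction.

    p_values maps a metric or slice name to its raw p-value.
    """
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    cutoff_rank = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        # Find the largest rank whose p-value is under its scaled threshold.
        if p <= fdr * rank / m:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is declared significant.
    return {name for name, _ in ranked[:cutoff_rank]}
```

With raw p-values of 0.001, 0.02, 0.04, and 0.30, an uncorrected 0.05 threshold would flag three results; the corrected procedure keeps only the first two.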

Peeking and Early Stopping

Stopping the first time a metric looks good inflates false positives unless the test is designed for sequential analysis.
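The inflation is easy to see in a small A/A simulation where there is no true effect. The sketch below is a simplified model, not a production analysis: it assumes each look adds one batch whose standardized difference is drawn from N(0, 1), so the running z-statistic after k batches is the cumulative sum divided by sqrt(k). All names and parameters are illustrative:

```python
import math
import random


def false_positive_rate(n_experiments, n_looks, z_threshold=1.96, peek=True, seed=0):
    """Share of A/A experiments declared significant.

    peek=True stops at the first look where |z| exceeds the threshold;
    peek=False tests only once, at the final look.
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_experiments):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            cumulative += rng.gauss(0.0, 1.0)  # null effect: mean zero
            z = cumulative / math.sqrt(look)
            is_final = look == n_looks
            if (peek or is_final) and abs(z) > z_threshold:
                false_positives += 1
                break
    return false_positives / n_experiments
```

Comparing `false_positive_rate(2000, 10, peek=True)` with `false_positive_rate(2000, 10, peek=False)` shows the peeking rate well above the nominal level of a single fixed-horizon test, which is why sequential designs adjust their stopping boundaries.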

Experiment Platform Architecture

The SDK should make assignment and exposure logging hard to separate. Broken logging creates more bad launches than weak statistical tests.
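One way to make the two hard to separate is to expose a single call that both returns the variant and records the exposure. The class below is a hypothetical sketch, not a real SDK: `ExperimentClient`, `variant_for`, the injected `assign` callable, and the list-like `sink` are all assumptions made for illustration:

```python
import time


class ExperimentClient:
    """Couples variant lookup with exposure logging in a single call."""

    def __init__(self, assign, sink):
        self._assign = assign  # callable: (experiment_id, unit_id) -> variant
        self._sink = sink      # any object with an append(dict) method

    def variant_for(self, experiment_id, unit_id, surface):
        variant = self._assign(experiment_id, unit_id)
        # The exposure is recorded on the same code path that returns the
        # variant, so callers cannot read an assignment without logging it.
        self._sink.append({
            "experiment_id": experiment_id,
            "unit_id": unit_id,
            "variant": variant,
            "surface": surface,
            "exposed_at": time.time(),
        })
        return variant
```

This only works if `variant_for` is called at the point where the user can actually see the treatment; calling it earlier reintroduces the gap between assignment and exposure.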

ML-Specific Experiment Issues

Delayed Labels

Fraud, credit, churn, and abuse outcomes may arrive days or weeks later. Use short-term proxies for monitoring, but wait for delayed labels before declaring long-term quality.
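A simple guard is to restrict the long-term readout to units whose labels have had time to arrive. The helper below is a minimal sketch; the function name and the idea of a single fixed `maturation_window` are assumptions, and real label delays often vary by outcome type:

```python
from datetime import datetime, timedelta


def mature_units(exposures, now, maturation_window):
    """Return unit IDs whose exposure is old enough for labels to have arrived.

    exposures maps unit_id to its first exposure time.
    """
    return {
        unit_id
        for unit_id, exposed_at in exposures.items()
        if now - exposed_at >= maturation_window
    }
```

Applying the same window to both variants keeps the comparison fair; filtering only one arm would bias the result.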

Training Contamination

If treatment traffic enters the next training dataset, the retrained model learns from behavior the treatment induced, which can contaminate future control comparisons.

Feedback Loop Changes

Recommendation and ranking models change what data the system collects. Holdout traffic and exploration logs help measure whether the model is learning from its own bias.

Segment-Specific Regressions

Aggregate wins can hide regressions in important groups. Require slice review before full launch.
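A slice review can start from something as simple as per-slice relative lift with a tolerance for regressions. The sketch below assumes a higher-is-better metric and an arbitrary 2% tolerance; the function name and input shape are illustrative, and it deliberately ignores statistical uncertainty, which a real review would handle with confidence intervals and a multiple-comparison correction:

```python
def slice_regressions(slice_metrics, max_relative_drop=0.02):
    """Flag slices where treatment underperforms control beyond a tolerance.

    slice_metrics maps slice name to (control_mean, treatment_mean).
    Returns a dict of slice name to relative lift for flagged slices.
    """
    flagged = {}
    for name, (control, treatment) in slice_metrics.items():
        if control == 0:
            continue  # relative lift is undefined without a baseline
        lift = (treatment - control) / control
        if lift < -max_relative_drop:
            flagged[name] = lift
    return flagged
```

For instance, an overall conversion move from 0.10 to 0.11 can coexist with a new-user slice that drops from 0.08 to 0.07, and only the latter is flagged.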

Decision Matrix

Result	Decision
Primary wins, guardrails pass, slices pass	Ramp or launch
Primary wins, guardrail fails	Do not launch; fix safety or cost
Primary neutral, diagnostics improve	Keep learning; ship only if the operational benefit justifies it
Primary loses, one slice wins	Consider targeted rollout only if policy and sample size support it
SRM detected	Invalidate result until root cause is fixed
Delayed labels unavailable	Continue canary or limit authority until labels mature
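The matrix can be encoded as a small decision function so the rule is written down before results arrive. This is a sketch under assumptions: the order of precedence (validity checks first), the string labels, and the handling of a primary win with a failing slice are choices made here, not rules stated by the table:

```python
def decide(primary, guardrails_pass, slices_pass, srm_detected=False, labels_mature=True):
    """Map experiment results to a decision.

    primary is one of "win", "neutral", or "loss".
    """
    if primary not in ("win", "neutral", "loss"):
        raise ValueError(f"unknown primary result: {primary!r}")
    # Validity checks come first: an invalid readout cannot justify a launch.
    if srm_detected:
        return "invalidate result until root cause is fixed"
    if not labels_mature:
        return "continue canary or limit authority until labels mature"
    if primary == "win":
        if not guardrails_pass:
            return "do not launch; fix safety or cost"
        if not slices_pass:
            # Assumption: the table has no row for this case.
            return "hold for slice review"
        return "ramp or launch"
    if primary == "neutral":
        return "keep learning"
    return "do not launch; consider targeted rollout only with policy and sample support"
```

Writing the rule as code also makes it reviewable: a change to the launch criteria becomes a visible diff rather than a judgment made after seeing the data.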

Failure Modes

Metric Mismatch

The experiment optimizes clicks, but the business needs retention or trust.

Mitigation: define metric hierarchy before launch and include long-term guardrails.

Interference

Control and treatment users affect each other through shared inventory, social graphs, or marketplaces.

Mitigation: cluster randomization, marketplace-level analysis, or experiments isolated by segment.

Logging Drift

The treatment changes what events are logged, making outcomes incomparable.

Mitigation: event contract tests, invariant metrics, and preflight validation with shadow traffic.

Experiment Debt

Old experiments keep running, flags stack up, and assignment behavior becomes unreadable.

Mitigation: expiration dates, ownership, cleanup automation, and experiment registry review.

Key Takeaways

Online experiments measure causal impact; canaries measure rollout safety.
Randomization unit must match the system's interference pattern.
Exposure logging is part of the product surface, not an analytics afterthought.
Guardrails and slice metrics prevent aggregate wins from becoming production regressions.
Delayed labels and feedback loops make ML experiments slower and more operationally complex than ordinary UI tests.
