The experiment concluded. Variant B beat variant A by 4.2% on conversions, p < 0.01. Now what?
For most growth teams, the expensive work starts here. Someone drafts a debrief in a Google Doc. Someone else forgets to check whether the pricing test running in parallel may have skewed the read. Nobody connects this win to the three failed checkout experiments from last quarter that pointed in the same direction the whole time.
The gap between getting a result and knowing what to do next is the most expensive bottleneck in modern experimentation programs. According to Optimizely's analysis of more than 127,000 experiments, roughly 12% of tests produce a statistically significant win on their primary metric[2] — which means about 88% require careful interpretation to extract any value at all. Your specific win rate depends on design rigor and product maturity. The interpretation problem applies regardless.
The structural read: most teams optimize the wrong constraint. They invest in running more tests when the binding constraint is interpreting the ones they already shipped.
Spotify's engineering team named the failure mode and built around it. Their Experiments with Learning (EwL) framework reports that roughly 64% of experiments produce actionable learning[1] — even when the test does not "win." Caught regressions and well-powered neutral results carry real business signal. Capturing that signal requires every result to land inside a context that most teams do not maintain.
That is the role of what we are calling the Experiment Interpreter — an agent wedged between your experimentation platform and your strategy process, doing the cross-referencing humans skip.
What the Interpreter Actually Does
Six functions that compress the path from concluded result to the next test.
- [01]
Pull the concluded result
When an experiment hits significance or its predetermined runtime, the agent ingests the full payload: variant performance, confidence intervals, segment breakdowns, guardrail metric impacts. The whole record, not the headline.
- [02]
Retrieve the 20 most related historical experiments
A context retrieval query (the architectural decision that determines whether this whole system works) surfaces past tests sharing feature areas, metrics, segments, or hypotheses with the concluded one.
- [03]
Cross-reference release events
The agent pulls the deployment log to identify code releases, feature launches, and infra changes inside the experiment window. Anything shipped during the test is a candidate confound — name it before it shows up in the debrief.
- [04]
Generate a causal debrief
Restating that variant B won is not analysis. The agent triangulates the current result against historical patterns and release context to name the probable causal mechanism — and flag the alternatives it cannot rule out.
- [05]
Flag interactions with active experiments
Every currently-running experiment gets checked for audience overlap and metric collision. Interaction effects are rare. When they happen, they invalidate both reads. The agent surfaces the risk before the data gets corrupted, not after.
- [06]
Recommend the next three experiments
Based on the debrief, the gaps surfaced by historical review, and the team's roadmap, the agent proposes three follow-ups ranked by expected learning value. The output is a prioritized queue, not a brainstorm.
Retrieval Query Design Is the Whole Game
Pull back the wrong experiments and the debrief is noise. This is the leverage point most teams underestimate.
The agent's value compounds or collapses on one decision: the quality of the related experiments it retrieves. Retrieve irrelevant tests and the debrief fills with spurious patterns. Miss the critical predecessor and you repeat the mistake. The retrieval query is the hardest engineering problem in this system, and the one most teams hand-wave past on the way to building the LLM call.
Looking at how Statsig, Optimizely, and GrowthBook handle experiment metadata, a workable retrieval strategy combines five dimensions of relatedness. Weight them deliberately.
| Dimension | Signal Source | Weight | Match Example |
|---|---|---|---|
| Feature area | Page/component tags, feature flags | High | Two tests both target the checkout flow |
| Metric overlap | Primary and guardrail metric sets | High | Both measure cart abandonment rate |
| User segment | Audience targeting rules | Medium | Both target mobile users in the NA region |
| Hypothesis cluster | Semantic similarity of hypothesis text | Medium | Both test whether urgency cues lift conversion |
| Temporal proximity | Experiment date ranges | Low | Ran inside a 90-day window of each other |
The weighting is load-bearing. Two experiments that share the same feature area and the same primary metric are almost certainly related, even if they ran a year apart. Two experiments that ran the same week on different parts of the product almost certainly are not.
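One way to make the table's weights explicit is a small scoring function. The sketch below is illustrative only: the Jaccard overlap, the 90-day half-life, the `Scorable` shape, and the numeric weights (High = 2.0, Medium = 1.0, Low = 0.5) are assumptions chosen to mirror the table, not values taken from any platform.

```typescript
// Illustrative sketch: implementations and weight values are assumptions.

/** Jaccard overlap of two tag lists: |A ∩ B| / |A ∪ B|, in [0, 1]. */
function overlapScore(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  const union = new Set(a.concat(b));
  if (union.size === 0) return 0;
  let shared = 0;
  setA.forEach((tag) => {
    if (setB.has(tag)) shared += 1;
  });
  return shared / union.size;
}

/** Cosine similarity of two equal-length vectors; 0 if either is all zeros. */
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

/** Exponential decay: 1 for a test that just finished, 0.5 after one half-life. */
function recencyBonus(completedAt: Date, now: Date = new Date(), halfLifeDays = 90): number {
  const ageDays = Math.max(0, (now.getTime() - completedAt.getTime()) / MS_PER_DAY);
  return Math.pow(0.5, ageDays / halfLifeDays);
}

// Hypothetical numeric mapping of the table's High / Medium / Low weights.
const WEIGHTS = { featureArea: 2.0, primaryMetrics: 2.0, segments: 1.0, hypothesis: 1.0, recency: 0.5 };

// Hypothetical minimal shape needed for scoring.
interface Scorable {
  featureArea: string[];
  primaryMetrics: string[];
  segments: string[];
  hypothesisEmbedding: number[];
  completedAt: Date;
}

/** Weighted sum across the five relatedness dimensions. */
function relatednessScore(concluded: Scorable, past: Scorable, now: Date = new Date()): number {
  return (
    WEIGHTS.featureArea * overlapScore(concluded.featureArea, past.featureArea) +
    WEIGHTS.primaryMetrics * overlapScore(concluded.primaryMetrics, past.primaryMetrics) +
    WEIGHTS.segments * overlapScore(concluded.segments, past.segments) +
    WEIGHTS.hypothesis * cosineSimilarity(concluded.hypothesisEmbedding, past.hypothesisEmbedding) +
    WEIGHTS.recency * recencyBonus(past.completedAt, now)
  );
}
```

Because recency carries the smallest weight, a year-old test that shares feature area and primary metric still outranks a recent test that shares neither, which is the behavior the weighting is meant to produce.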
In practice, the retrieval query works as a hybrid: a structured filter on feature area and metrics to find candidates, then semantic similarity on hypothesis descriptions to rank them. SQL WHERE clauses for the hard filters. Vector search for the fuzzy matching. Treat the structured filter as the primary mechanism. Treat embeddings as a re-ranker.
context-retrieval.ts

```typescript
// Two-phase retrieval. Structured filter first. Semantic re-rank second.
// Assumed to exist elsewhere: `db` (a Prisma-style client), the `Experiment` row type,
// and the helpers `embed`, `cosineSimilarity`, `overlapScore`, and `recencyBonus`.
interface ExperimentContext {
  id: string;                // Used to exclude the concluded experiment from its own results
  featureArea: string[];     // e.g., ["checkout", "payment-form"]
  primaryMetrics: string[];  // e.g., ["conversion_rate", "aov"]
  guardMetrics: string[];    // e.g., ["bounce_rate", "latency_p99"]
  segments: string[];        // e.g., ["mobile", "na-region"]
  hypothesis: string;        // Free-text hypothesis description
  dateRange: { start: Date; end: Date };
}

async function findRelatedExperiments(
  concluded: ExperimentContext,
  limit: number = 20
): Promise<Experiment[]> {
  // Phase 1: structured filter does 80% of the work
  const candidates = await db.experiments.findMany({
    where: {
      OR: [
        { featureArea: { hasSome: concluded.featureArea } },
        { primaryMetrics: { hasSome: concluded.primaryMetrics } },
        { guardMetrics: { hasSome: concluded.guardMetrics } },
      ],
      status: "completed",
      id: { not: concluded.id },
    },
    orderBy: { completedAt: "desc" },
    take: 100,
  });

  // Phase 2: semantic similarity re-ranks the candidate set
  const hypothesisEmbedding = await embed(concluded.hypothesis);
  const ranked = candidates
    .map(exp => ({
      ...exp,
      score: cosineSimilarity(hypothesisEmbedding, exp.hypothesisEmbedding)
        + (overlapScore(concluded.featureArea, exp.featureArea) * 2.0)
        + (overlapScore(concluded.primaryMetrics, exp.primaryMetrics) * 2.0)
        + (overlapScore(concluded.segments, exp.segments) * 1.0)
        + (recencyBonus(exp.completedAt) * 0.5),
    }))
    .sort((a, b) => b.score - a.score);

  return ranked.slice(0, limit);
}
```

Debriefs That Name the Mechanism, Not Just the Winner
Reporting which variant won is restating the result. Naming why is the work.
A debrief that says "the green button outperformed the blue one by 4.2%" is not analysis. It is paraphrase. The strategist needs three things: why the result happened, what it confirms or contradicts from prior tests, and where the team should put the next dollar.
The Interpreter builds causal narratives by triangulating three evidence layers.
Direct experimental evidence
Statistical significance and effect size of the concluded test
Segment-level breakdowns showing where the effect was strongest
Guardrail metrics confirming no negative side effects
Historical pattern evidence
Prior experiments in the same feature area that moved the same metrics
Consistent direction of effect across multiple related tests
Previous failures that narrow the candidate causal mechanism
Confounding context evidence
Release events inside the experiment window that could explain the lift
Seasonal or external factors active during the test period
Concurrent experiments with overlapping audiences or metrics
The operative word in causal inference is triangulation. No single experiment proves a mechanism. When multiple lines of evidence converge on the same explanation, confidence rises. The agent's job is to do that triangulation automatically, every time, without requiring someone to remember.
A concrete read. The pricing page experiment showed a 7% lift in trial signups when the annual plan moved to the left column. A naive debrief stops there. The Interpreter retrieves three related tests from the past six months: a test that made the annual plan visually larger (3% lift), a test that added a "most popular" badge to the annual plan (5% lift), and a test that changed the annual price (neutral). The pattern resolves: users respond to visual prominence of the annual option, not to price. That mechanism reshapes what gets tested next.
A failure mode we hit early. The agent triangulated a confident causal story from three experiments that all ran during a holiday season. Seasonal effect was the real driver. Adding release-event cross-referencing — and flagging calendar anomalies as candidate confounds — caught this class of error. Drift toward over-confident causal claims is the default state when the agent has no view of the calendar.
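The confound check described above reduces to a date-range overlap test. The following is a minimal sketch under assumed names: `DateRange`, `DatedEvent`, and `candidateConfounds` are hypothetical, and the release and calendar events would come from whatever deployment log and holiday calendar a team already keeps.

```typescript
// Hypothetical types; real deployment logs and calendars will carry more fields.
interface DateRange {
  start: Date;
  end: Date;
}

interface DatedEvent {
  name: string;
  kind: "release" | "calendar"; // code/infra release vs. holiday or seasonal window
  range: DateRange;
}

/** True when two inclusive date ranges share at least one instant. */
function rangesOverlap(a: DateRange, b: DateRange): boolean {
  return a.start.getTime() <= b.end.getTime() && b.start.getTime() <= a.end.getTime();
}

/** Every release or calendar event that overlaps the experiment window. */
function candidateConfounds(experimentWindow: DateRange, events: DatedEvent[]): DatedEvent[] {
  return events.filter((event) => rangesOverlap(experimentWindow, event.range));
}
```

Treating calendar windows as events of the same shape as releases means a holiday season is surfaced through the same code path as a deploy, so neither depends on someone remembering to check.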
Interactions Are Rare. When They Happen They Invalidate Both Reads.
Overlapping experiments are necessary for velocity and corrosive when they collide silently.
Growth teams at any throughput run dozens of experiments in parallel. Statsig has written extensively on the tension between experiment velocity and interaction risk[3]. The pragmatic consensus: overlap is fine most of the time, and the rare interactions that do occur produce wildly wrong conclusions if nobody is looking for them.
Interaction effects appear when the combined impact of two experiments diverges from the sum of their individual effects[4]. If experiment A lifts conversion by 3% and experiment B lifts it by 2%, an additive world expects 5%. The interaction might produce 8% — or 0%. The failure mode is not frequency. It is invisibility. Interactions do not announce themselves.
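The arithmetic in that example can be written down directly. This is a sketch under two assumptions: all three lifts are measured on the same scale against a shared control, and `interactionEffect` is a hypothetical name. It computes the size of the departure from additivity only. Deciding whether that departure is statistically real needs a proper test, which is not shown here.

```typescript
interface InteractionRead {
  additiveExpectation: number; // what independent effects would predict
  interaction: number;         // observed combined lift minus that prediction
}

/**
 * Compare the lift seen by users exposed to both experiments with the sum
 * of the individual lifts. A value near zero is consistent with additivity.
 */
function interactionEffect(liftA: number, liftB: number, liftBoth: number): InteractionRead {
  const additiveExpectation = liftA + liftB;
  return { additiveExpectation, interaction: liftBoth - additiveExpectation };
}

// The two scenarios from the text: 3% and 2% individual lifts.
const amplified = interactionEffect(0.03, 0.02, 0.08); // combined lift of 8%
const cancelled = interactionEffect(0.03, 0.02, 0.0);  // combined lift of 0%
```

In the first scenario the interaction term is about +3 points above the additive expectation. In the second it is about 5 points below, meaning the two changes cancel each other out.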
Without interaction checks
Results assumed to be independent; the assumption is never tested
Conflicting reads attributed to random variation and shipped anyway
Shipped features collide in production and produce regressions nobody predicted
Post-mortems surface the interaction weeks after the fact
With the Interpreter
Audience overlap between active experiments flagged in real time
Statistical tests identify non-additive effects across concurrent tests
Interaction risks surface before the experiment concludes, not after
Every debrief carries an interaction analysis as a default section
The Interpreter checks three dimensions of interaction risk for every active pair: audience overlap (are the same users in both?), metric collision (do both target the same primary metric?), and feature adjacency (do the changes touch related parts of the experience?). When all three score high, the pair gets flagged and the debrief carries an interaction analysis. The check runs by default. The check is not optional.
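The three-dimension check can be sketched as a pairwise scan over active experiments. Everything named here is an assumption for illustration: the `ActiveExperiment` shape, the 20% overlap threshold, and the choice to treat "feature adjacency" as sharing at least one feature-area tag.

```typescript
// Hypothetical shape; a real platform would expose assignment data differently.
interface ActiveExperiment {
  id: string;
  audience: Set<string>;  // user IDs currently assigned to the experiment
  primaryMetric: string;
  featureArea: string[];
}

interface PairRisk {
  pair: [string, string];
  audienceOverlap: number;   // share of the smaller audience present in both
  metricCollision: boolean;  // same primary metric
  featureAdjacency: boolean; // at least one shared feature-area tag (assumed proxy)
  flagged: boolean;          // true only when all three dimensions score high
}

function audienceOverlap(a: Set<string>, b: Set<string>): number {
  const smaller = a.size <= b.size ? a : b;
  const larger = smaller === a ? b : a;
  if (smaller.size === 0) return 0;
  let shared = 0;
  smaller.forEach((userId) => {
    if (larger.has(userId)) shared += 1;
  });
  return shared / smaller.size;
}

function assessPair(a: ActiveExperiment, b: ActiveExperiment, minOverlap = 0.2): PairRisk {
  const overlap = audienceOverlap(a.audience, b.audience);
  const metricCollision = a.primaryMetric === b.primaryMetric;
  const featureAdjacency = a.featureArea.some((area) => b.featureArea.indexOf(area) !== -1);
  return {
    pair: [a.id, b.id],
    audienceOverlap: overlap,
    metricCollision,
    featureAdjacency,
    flagged: overlap >= minOverlap && metricCollision && featureAdjacency,
  };
}

/** Check every unordered pair of active experiments and keep the flagged ones. */
function flagRiskyPairs(active: ActiveExperiment[], minOverlap = 0.2): PairRisk[] {
  const flagged: PairRisk[] = [];
  for (let i = 0; i < active.length; i++) {
    for (let j = i + 1; j < active.length; j++) {
      const risk = assessPair(active[i], active[j], minOverlap);
      if (risk.flagged) flagged.push(risk);
    }
  }
  return flagged;
}
```

The pairwise scan is quadratic in the number of active experiments, which is fine at dozens of concurrent tests. A program running thousands would likely pre-filter by feature area first.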
Recommendations: From Analyst to Strategist
Reporting what happened is half the work. Naming the highest-leverage next test is the other half.
Recommendations are where the Interpreter shifts register. Reporting becomes proposing. The mechanism: identify the highest-value knowledge gaps surfaced by the debrief, and rank candidate next tests against the team's roadmap.
The scoring weighs four factors.
Back to the pricing-page case. After the Interpreter resolves that visual prominence — not price — drives annual plan adoption, it proposes three follow-ups. First, a comparison table that visually favors the annual plan, to test whether the mechanism amplifies. Second, the same prominence treatment on the upgrade page for existing free users, to test whether the mechanism transfers. Third, removing the monthly option entirely, to test whether choice reduction helps or hurts.
Each recommendation carries a hypothesis grounded in the historical evidence, a predicted effect range, and a list of metrics to track. The team can evaluate and launch the next experiment within hours of reading the debrief — not days inside a planning meeting that exists to make planning feel like progress.
Wiring the Interpreter Into the Existing Stack
Event-driven, stateless, no new platform. Slots into the experimentation tools you already run.
The Interpreter runs as an event-driven agent. Your experimentation platform emits a webhook when a test hits its stopping criteria. The agent picks up the event and orchestrates a pipeline: result ingestion, context retrieval, release event cross-reference, interaction check, debrief generation, recommendation scoring. Stateless. Restartable. No new platform.
Most experimentation platforms — Statsig, LaunchDarkly, Optimizely, GrowthBook — emit webhooks or expose API polling for status changes. The agent runs as a serverless function on the event, with a vector database for hypothesis embeddings and a structured store for experiment metadata. The whole topology is unremarkable. That is the point.
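A stateless pipeline of this kind can be sketched as a list of named steps threaded through a plain context object. The payload shape, the `"concluded"` status value, and the stage names below are hypothetical. Each platform defines its own webhook format, and the step implementations are injected rather than shown.

```typescript
type PipelineContext = Record<string, unknown>;
type Step = (context: PipelineContext) => Promise<unknown>;

interface NamedStep {
  name: string;
  run: Step;
}

// Stage order follows the pipeline described above; the key names are illustrative.
const STAGES = [
  "result",
  "relatedExperiments",
  "releaseEvents",
  "interactions",
  "debrief",
  "recommendations",
];

/**
 * Run the steps in order. All state lives in the returned context object,
 * so a failed run can be restarted from the event alone.
 */
async function runPipeline(experimentId: string, steps: NamedStep[]): Promise<PipelineContext> {
  let context: PipelineContext = { experimentId };
  for (const step of steps) {
    const output = await step.run(context);
    context = { ...context, [step.name]: output };
  }
  return context;
}

// Hypothetical webhook payload; real platforms each define their own shape.
interface ConcludedWebhook {
  experimentId: string;
  status: string;
}

/** Entry point for the serverless function: ignore anything that is not a conclusion event. */
async function handleWebhook(
  payload: ConcludedWebhook,
  steps: NamedStep[]
): Promise<PipelineContext | null> {
  if (payload.status !== "concluded") return null;
  return runPipeline(payload.experimentId, steps);
}
```

Because each step reads only from the context it is handed, individual stages can be tested with stubs, and the polling adapter can call `runPipeline` directly without going through the webhook path.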
Experiment Interpreter Project Structure
```text
experiment-interpreter/
├── src/
│   ├── triggers/
│   │   ├── webhook-handler.ts
│   │   └── polling-adapter.ts
│   ├── retrieval/
│   │   ├── context-query.ts
│   │   ├── embedding-service.ts
│   │   └── release-events.ts
│   ├── analysis/
│   │   ├── causal-debrief.ts
│   │   ├── interaction-detector.ts
│   │   └── segment-analysis.ts
│   ├── recommendations/
│   │   ├── gap-identifier.ts
│   │   ├── scoring-engine.ts
│   │   └── hypothesis-generator.ts
│   └── integrations/
│       ├── statsig.ts
│       ├── optimizely.ts
│       ├── growthbook.ts
│       └── launchdarkly.ts
├── schema.prisma
└── vector-config.ts
```

Rules That Keep the Interpreter Trustworthy
The constraints that determine whether the team trusts the output six months in.
Operational Rules
Never auto-ship based on agent recommendations
The Interpreter generates debriefs and proposes follow-ups. A human owns every launch decision. The agent handles analysis. Authority stays with the operator.
Refuse debriefs on experiments with less than 80% statistical power
Underpowered experiments produce unreliable patterns. The agent should not generate causal narratives from noisy data — it should halt and surface the power problem.
Tag every causal claim with a confidence level
Distinguish triangulated evidence from single-experiment observations. Label claims 'high confidence,' 'moderate,' or 'speculative.' Confidence theater — claims without a level — gets blocked at template validation.
Re-index the experiment database after every 50 new experiments
Embedding drift and stale feature-area tags degrade retrieval quality. Drift is the default state. Regular re-indexing is the only mechanism that reverses it.
Log every retrieval query and its results for audit
When a debrief gets challenged, you need to reconstruct exactly which historical experiments informed the analysis and why they were ranked where they were. Without it, the system is unauditable and the team stops trusting it.
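The 80% power rule can be enforced with a small gate in front of debrief generation. The sketch below assumes a two-sided two-proportion z-test at α = 0.05 with equal arm sizes, and uses the standard normal approximation for power. The function names are hypothetical, and a team with a different test design would substitute its own power calculation.

```typescript
/** Error function via the Abramowitz-Stegun 7.1.26 polynomial approximation. */
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

/** Standard normal cumulative distribution function. */
function normalCdf(z: number): number {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

/**
 * Approximate power to detect a shift from rate p1 to rate p2 with n users per arm.
 * zAlpha = 1.96 corresponds to a two-sided test at alpha = 0.05 (assumed default).
 */
function approxPower(p1: number, p2: number, nPerArm: number, zAlpha = 1.96): number {
  if (nPerArm <= 0) return 0;
  const standardError = Math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / nPerArm);
  if (standardError === 0) return 0;
  return normalCdf(Math.abs(p2 - p1) / standardError - zAlpha);
}

interface GateDecision {
  allowed: boolean;
  reason: string;
}

/** Halt debrief generation and surface the power problem when power is too low. */
function debriefGate(power: number, minPower = 0.8): GateDecision {
  if (power < minPower) {
    const shown = (power * 100).toFixed(0);
    const floor = (minPower * 100).toFixed(0);
    return { allowed: false, reason: "Estimated power " + shown + "% is below the " + floor + "% floor" };
  }
  return { allowed: true, reason: "ok" };
}
```

For a baseline rate of 10% and a target of 12%, 1,000 users per arm falls well short of the floor under this approximation, while 10,000 per arm clears it.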
Proving the Interpreter Earns Its Keep
You shipped the agent. Now prove it changed something. The metrics that matter are not about the agent's speed. They are about decision quality and learning velocity on the team that depends on it.
Value Metrics to Track
Time from experiment conclusion to strategic decision under 24 hours
Above 80% of debriefs cite at least one related historical experiment
Every high-risk interaction pair flagged before the experiment completes
Above 60% follow-up experiment launch rate within 7 days of debrief
Team-reported usefulness above 4 of 5 on generated debriefs
50% reduction in repeated experiment hypotheses across quarters
Spotify's read is instructive. Their shift from measuring experiment velocity to measuring learning rate exposed the same pattern across programs: the quality of insight per experiment matters more than the count of experiments shipped[1]. The Interpreter attacks the quality dimension directly by ensuring no result gets read in isolation.
Practitioners cited by GrowthMethod report that growth teams measuring learning velocity (validated insights per quarter relative to experiment duration) outperform teams optimizing purely for experiment throughput[6]. The Interpreter compresses time-from-result-to-insight from days to minutes. The realized gain depends on retrieval quality and team adoption, neither of which arrives automatically.
Practical Rollout: Four Phases, In Order
- [01]
Tag your experiment backlog
```sql
-- Structured metadata is the precondition. No tags, no retrieval.
ALTER TABLE experiments ADD COLUMN feature_area TEXT[];
ALTER TABLE experiments ADD COLUMN hypothesis_embedding VECTOR(1536);

-- Backfill feature areas from existing tags
UPDATE experiments
SET feature_area = ARRAY(
  SELECT tag
  FROM experiment_tags
  WHERE experiment_tags.experiment_id = experiments.id
    AND tag_type = 'feature_area'
);
```
- [02]
Build the retrieval layer and validate it against 10 recent experiments before anything else
- [03]
Run the agent in read-only mode for 4 weeks. No recommendations, no actions — only debriefs.
- [04]
Enable recommendations only after the team has validated debrief quality on real concluded experiments
Operator Questions
How many historical experiments are needed before the Interpreter adds value?
Fifty completed experiments with usable metadata is the practical floor. Below that, retrieval has too little signal to surface real patterns and the debriefs read like noise. Most teams with an established practice have hundreds — the binding constraint is metadata quality, not count. Start by tagging the 20 most recent experiments with feature areas and hypothesis text. Even a partial index surfaces patterns that would otherwise stay invisible.
Does this replace a data scientist on the growth team?
No. Positioning it that way will burn team trust on day one. The Interpreter automates the mechanical 60% of analysis: result summarization, historical cross-referencing, interaction checking. A data scientist still designs novel statistical approaches, handles edge cases, and validates that the agent's causal reasoning holds up. The point is to remove the routine load so the remaining 40% gets the attention it requires — not to delete the role.
What happens if experiments do not have structured feature-area tags?
Classify the most recent 100 experiments into feature areas by hand. A few hours of work that pays for itself immediately. Use those labels to fine-tune an automatic classifier and run it across the rest of the backlog. Feature-area tags are the single highest-leverage piece of metadata in the system. Skip this step and retrieval quality collapses to the point that debriefs become noise — that is not a hypothetical, it is what we observed when we tried to shortcut it.
How is hallucinated causality prevented?
Every causal claim has to cite specific experiments and data points. The debrief template requires each mechanism to link to at least two supporting experiments, or one experiment plus a release event. Unsupported claims get tagged speculative. Then the agent has to state what evidence would contradict the proposed mechanism — that constraint forces it to reason about alternatives instead of post-hoc rationalizing the result it already saw. Hallucination protection is a template constraint, not a prompt instruction.
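Expressed as code, the template constraint might look like the validator below. The `CausalClaim` shape and field names are hypothetical. The rules encoded are the ones stated above: two supporting experiments (or one plus a release event), a forced "speculative" tag otherwise, a required confidence level, and a required statement of contradicting evidence.

```typescript
type Confidence = "high" | "moderate" | "speculative";

// Hypothetical claim shape for a debrief template.
interface CausalClaim {
  mechanism: string;
  supportingExperimentIds: string[];
  supportingReleaseEventIds: string[];
  wouldContradict: string[]; // evidence that would falsify the mechanism
  confidence?: Confidence;
}

interface ValidationResult {
  accepted: boolean;
  claim: CausalClaim; // copy of the input, possibly downgraded to "speculative"
  errors: string[];
}

/** At least two experiments, or one experiment plus one release event. */
function hasRequiredSupport(claim: CausalClaim): boolean {
  const experiments = claim.supportingExperimentIds.length;
  const releases = claim.supportingReleaseEventIds.length;
  return experiments >= 2 || (experiments >= 1 && releases >= 1);
}

function validateClaim(claim: CausalClaim): ValidationResult {
  const errors: string[] = [];
  const checked: CausalClaim = { ...claim };

  // Unsupported claims are not rejected; they are forced to "speculative".
  if (!hasRequiredSupport(checked)) checked.confidence = "speculative";

  // Supported claims without a confidence level are blocked.
  if (checked.confidence === undefined) errors.push("missing confidence level");

  // Every claim must say what evidence would contradict it.
  if (checked.wouldContradict.length === 0) {
    errors.push("missing contradicting-evidence statement");
  }

  return { accepted: errors.length === 0, claim: checked, errors };
}
```

Running the validator on structured output, rather than asking the model to police itself, is what makes this a template constraint instead of a prompt instruction.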
The Interpreter changes how growth teams operate at a structural level. Each A/B test stops being an isolated event — run, read, move on — and starts feeding through a lens of accumulated organizational memory. Related experiments surface automatically. Release events get cross-referenced without anyone remembering to check. Interaction risks get flagged before they corrupt the data.
The hardest part is not the agent. It is designing the context retrieval query that defines what "related" means inside your specific product and team. Get that wrong and the system produces confident noise. Get it right and the question of how the team interpreted experiments without it stops having a comfortable answer.
- [1] Spotify Engineering: Experiments with Learning (EwL) Framework (engineering.atspotify.com)
- [2] Optimizely: Analysis of 127,000 Experiments (optimizely.com)
- [3] Statsig: Interaction Effect Detection in A/B Tests (statsig.com)
- [4] Statsig: Embracing Overlapping A/B Tests and the Danger of Isolating Experiments (statsig.com)
- [5] Sparkco: Build an Experiment Result Analysis Framework (sparkco.ai)
- [6] GrowthMethod: Testing Velocity and Learning Rate (growthmethod.com)
- [7] ContentSquare: Agent-to-Agent A/B Testing (contentsquare.com)