Monday morning. Someone opens the dashboard. Revenue is up 8% week-over-week. Or down 12%. The team starts hunting for the reason. Marketing claims the uptick. Engineering blames the deploy. The CEO reads a macro headline and nods.
Revenue movements do not have a single cause. They are the intersection of experiments running, features shipping, incidents degrading the product, ad spend drifting, and external forces nobody owns. Forcing the delta onto one factor is not analysis. It is theater. And it actively distorts the next decision.
The fix is structural. An agent fires every Monday before anyone walks into the standup. It pulls the delta, queries five evidence systems in parallel, scores each candidate cause, and ships a ranked hypothesis list with explicit confidence levels and an unexplained remainder. The agent does not produce the answer. It produces a structured decomposition with the uncertainty surfaced — which is the only honest output.
Manual Attribution Loses Every Time
Three structural failures the Monday meeting cannot escape.
The Monday meeting is rigged before anyone speaks. Whoever frames first owns the narrative. Recency bias drags credit toward the most recent change. Confirmation bias routes the explanation toward whatever the room already wants to be true. Marketing sees marketing. Product sees product. Leadership sees vindication.
Forrester named the second failure mode: false precision in analytics.[2] End users never see the formulas behind the dashboard, so they make decisions with confidence the data does not support. "The new landing page drove the 8%" carries an implied certainty that was never measured.
The third failure is timing. By the time the meeting resolves a story, the window for corrective action has closed. The agent has no ego, no recency bias, and no calendar. It runs the same evidence pass every week, every Monday, including the Monday after a long weekend when the data analyst is heads-down on something else.
| Manual Monday meeting | Monday morning agent |
|---|---|
| Whoever speaks first owns the narrative | Evidence assembled before any human opens the meeting |
| Single-cause explanations dominate by social default | Multiple hypotheses ranked, with overlap surfaced |
| Burns 45-60 minutes of senior time, every week | Report lands in Slack before standup |
| No systematic evidence pull | Pulls from experiments, deploys, incidents, spend, macro feeds |
| Confidence levels never named | Every hypothesis carries an explicit range and confidence label |
| Output drifts with whoever attended | Same methodology every week; consistency is the point |
Anatomy of the Monday Morning Agent
From delta detection to ranked output, the agent runs five investigations in parallel.
- [01] Pull the Delta and Score It Against the Distribution
The agent queries the revenue source — Stripe, the warehouse, a BI API — and computes the week-over-week delta. It also pulls the 4-week and 13-week trend, then computes a z-score against the 13-week distribution. If the delta sits inside one standard deviation of recent variance, the report flags it as 'within noise floor' and ships a short version. Most weeks land inside the noise floor. Investigating noise as if it were signal is how teams burn senior cycles on stories that explain nothing.
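The noise-floor check can be sketched in a few lines. This is a minimal sketch, not the article's implementation: `scoreDelta`, `NoiseCheck`, and the choice of the sample standard deviation are assumptions, and the default threshold of 1 mirrors the "one standard deviation" rule described above.

```typescript
// Hypothetical helper names; the article does not prescribe an implementation.
interface NoiseCheck {
  zScore: number;
  withinNoiseFloor: boolean;
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Sample standard deviation (n - 1 in the denominator), an assumption.
function sampleStdDev(xs: number[]): number {
  const m = mean(xs);
  const variance =
    xs.reduce((sum, x) => sum + (x - m) * (x - m), 0) / (xs.length - 1);
  return Math.sqrt(variance);
}

// priorDeltas: the previous weekly deltas (13 of them in the article's setup).
function scoreDelta(
  currentDelta: number,
  priorDeltas: number[],
  threshold = 1,
): NoiseCheck {
  if (priorDeltas.length < 2) {
    throw new Error("need at least two prior deltas to estimate variance");
  }
  const m = mean(priorDeltas);
  const sd = sampleStdDev(priorDeltas);
  // A perfectly flat history has no variance: any move away from it is signal.
  const zScore =
    sd === 0
      ? currentDelta === m
        ? 0
        : Math.sign(currentDelta - m) * Infinity
      : (currentDelta - m) / sd;
  return { zScore, withinNoiseFloor: Math.abs(zScore) <= threshold };
}
```

When `withinNoiseFloor` is true, the agent would ship the short version of the report and skip the full investigation.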
- [02] Query Five Evidence Systems in Parallel
Five distinct data sources, queried at the same time. A/B test platform for experiments started, stopped, or ramped. Release tracker for features shipped, with an estimate of user reach. Incident log for outages, error spikes, and degradations. Ad platforms for spend deltas by channel. External feeds for holidays, competitor moves, economic shifts. The agent does not pick favorites at this stage. It collects.
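One way to collect from all five systems at once, sketched under assumptions: the `Fetcher` type, `gatherEvidence`, and the `failedSources` field are hypothetical, and each fetcher stands in for a real client (A/B platform, release tracker, and so on) that the article does not specify. The sketch records a failed source instead of aborting the run, so an outage in one API shows up as missing data in the report.

```typescript
type Category = "experiment" | "feature" | "incident" | "marketing" | "external";

interface RawEvidence {
  category: Category;
  description: string;
}

// Hypothetical: one async fetcher per evidence system.
type Fetcher = () => Promise<RawEvidence[]>;

interface GatherResult {
  evidence: RawEvidence[];
  failedSources: Category[]; // sources that errored, to be named in the report
}

async function gatherEvidence(
  fetchers: Record<Category, Fetcher>,
): Promise<GatherResult> {
  const categories = Object.keys(fetchers) as Category[];
  // All fetchers start immediately; Promise.all waits for every one to settle
  // because each wrapper catches its own failure.
  const outcomes = await Promise.all(
    categories.map(async (category) => {
      try {
        return { category, ok: true, items: await fetchers[category]() };
      } catch {
        return { category, ok: false, items: [] as RawEvidence[] };
      }
    }),
  );
  return {
    evidence: outcomes.reduce<RawEvidence[]>(
      (all, outcome) => all.concat(outcome.items),
      [],
    ),
    failedSources: outcomes.filter((o) => !o.ok).map((o) => o.category),
  };
}
```

No ranking happens here. The function only collects, which matches the "does not pick favorites at this stage" rule.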
- [03] Score Each Candidate by Reach, Timing, Magnitude, Corroboration
Plausibility is not a vibe. Each candidate gets four numeric inputs. Reach: what fraction of users could this have moved? Timing: did the cause precede the effect at a plausible lag? Magnitude: is the implied impact in the right order of magnitude? Corroboration: do independent signals agree? A pricing test with 95% statistical power and a measured lift scores high. A macro headline with no measurable link to product behavior scores low. The agent is honest about what it can and cannot measure.
- [04] Rank, Range, and Surface the Unexplained Remainder
The agent ranks candidates by impact range, not by point estimate. Each hypothesis carries a low-to-high revenue impact, a confidence label, and the evidence behind both. When the sum of hypothesis ranges does not cover the delta, the report names the unexplained remainder explicitly. The remainder is the operating mechanism that prevents false-precision narratives from filling the gap.
- [05] Ship to Slack and Archive for Calibration
The report lands before Monday standup. Every report is archived. When ground truth eventually arrives — a test reaches power, a feature is measured over 90 days, a channel is paused and isolated — the archived hypothesis is graded against the outcome. The calibration loop is the only mechanism that turns 'high confidence' into a labeled probability instead of a color.
Five Evidence Categories. Not All Equally Knowable.
Signal strength is the structural axis. Treat all five evenly and the report lies.
| Category | Data Sources | Signal Strength | Typical Lag |
|---|---|---|---|
| Experiments (A/B tests) | LaunchDarkly, Optimizely, Statsig, internal tools | High — direct measurement available | 0-2 days |
| Features Shipped | GitHub releases, Linear, Jira deploy logs | Medium — reach known, impact estimated | 1-7 days |
| Incidents and Degradations | PagerDuty, Statuspage, error rate dashboards | High — duration and affected users measurable | 0-1 days |
| Marketing Spend Changes | Google Ads, Meta Ads, attribution platforms | Medium — spend known, incremental lift uncertain | 3-14 days |
| External / Macro Events | News APIs, economic feeds, competitor monitors | Low — correlation possible, causation unverified | Variable |
Read the third column. That is the entire structural argument. The A/B platform can tell you, with statistical rigor, that variant B lifted conversion by 3.1%. The news feed cannot tell you whether a competitor's outage drove signups. Treat those signals as equivalent and the agent is producing decoration, not analysis.
The agent should weight evidence by signal strength and say so out loud. When marketing asks whether the new campaign drove the increase, the honest output is: spend rose 15%, the timing aligns, no incrementality test was running, plausibility moderate, confidence low. That sentence is more useful than any point estimate the room would otherwise improvise.
False Precision Is the Failure Mode. Ranges Are the Fix.
Point estimates with no uncertainty are the most expensive output an attribution system can produce.
Behavioral economics has measured the structural cause. Work in Management Science on overconfidence in interval estimates[7] shows the effect is driven by neglecting unknowns. Prompt people to consider what they do not know, and overconfidence drops. The agent should bake this into its output schema, not its tone.
Every hypothesis ships with three explicit fields:
- Impact range — a low-to-high revenue bound, never a point estimate
- Confidence level — high, medium, low, or speculative — each with a defined evidence threshold
- Evidence basis — what data supports the call, and what data is missing
When the sum of impact ranges does not cover the full delta, the agent reports an unexplained remainder. This is not a failure mode. It is the audit trail for everything the system cannot measure at a weekly cadence — word-of-mouth momentum, brand-perception drift, slow product-quality compounding. Suppressing the remainder makes the report look complete at the cost of being misleading.
What the Agent Actually Looks Like in Code
Practical architecture for teams that want this running by end of sprint.
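```typescript
// attribution-agent.ts
// One file. One job. Decompose the weekly delta into ranked hypotheses.

interface CandidateCause {
  category: 'experiment' | 'feature' | 'incident' | 'marketing' | 'external';
  description: string;
  timing: { start: Date; end?: Date };
  reachEstimate: number; // fraction of users affected (0-1)
  impactRange: { low: number; high: number }; // bounds, not point estimates
  confidence: 'high' | 'medium' | 'low' | 'speculative';
  evidence: string[];
  dataMissing: string[]; // name what you cannot measure
}

interface AttributionReport {
  period: { start: Date; end: Date };
  revenueDelta: { absolute: number; percentage: number };
  zScore: number; // vs 13-week distribution: noise floor check
  hypotheses: CandidateCause[];
  unexplainedRemainder: { low: number; high: number };
  summaryNarrative: string;
}

// The report fields that are known before any attribution runs.
type RevenueDelta = Pick<AttributionReport, 'period' | 'revenueDelta' | 'zScore'>;

// Helpers implemented elsewhere; these signatures are assumed, not prescribed.
declare function pullRevenueDelta(): Promise<RevenueDelta>;
declare function gatherCandidateCauses(period: RevenueDelta['period']): Promise<CandidateCause[]>;
declare function scoreCandidate(c: CandidateCause, delta: RevenueDelta): CandidateCause;
declare function computeExplainedRange(causes: CandidateCause[]): { low: number; high: number };
declare function buildNarrative(causes: CandidateCause[], delta: RevenueDelta): string;

async function generateWeeklyAttribution(): Promise<AttributionReport> {
  const delta = await pullRevenueDelta();
  const candidates = await gatherCandidateCauses(delta.period);
  const scored = candidates.map(c => scoreCandidate(c, delta));
  // Sort a copy so `scored` keeps its original order.
  const sorted = [...scored].sort((a, b) => b.impactRange.high - a.impactRange.high);
  const explained = computeExplainedRange(sorted);
  const { absolute } = delta.revenueDelta;
  return {
    ...delta,
    hypotheses: sorted,
    // Assumes a revenue increase; a decline needs the mirrored calculation.
    unexplainedRemainder: {
      low: Math.max(0, absolute - explained.high),
      high: Math.max(0, absolute - explained.low),
    },
    summaryNarrative: buildNarrative(sorted, delta),
  };
}
```

The scoring function carries the judgment. Four signals, each normalized to a 0-1 scale, combined into a plausibility score. They are not equally weighted by default. Reach and corroboration earn more than timing alone.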
Reach. What fraction of users was exposed? A feature behind a 10% rollout flag cannot explain more than 10% of the movement. An incident that took down checkout for three hours has a fundamentally different blast radius.
Temporal alignment. Did the cause precede the effect at a plausible lag? A campaign that started two days ago aligns. One that started six weeks ago does not, except for slow-burn channels — and the agent should model the expected lag per category. Ad spend has a 3-14 day delay. Incidents hit revenue immediately.
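A per-category lag model could look like the sketch below. The lag windows come from the "Typical Lag" column of the evidence table; everything else is an assumption for illustration: the name `timingScore`, the linear ramp and decay outside the window, and the neutral 0.5 for external events, whose lag the table lists only as "Variable".

```typescript
type Category = "experiment" | "feature" | "incident" | "marketing" | "external";

// Typical lag in days between a cause starting and revenue moving,
// taken from the article's evidence table.
const LAG_WINDOWS_DAYS: Record<
  Exclude<Category, "external">,
  { min: number; max: number }
> = {
  experiment: { min: 0, max: 2 },
  feature: { min: 1, max: 7 },
  incident: { min: 0, max: 1 },
  marketing: { min: 3, max: 14 },
};

// lagDays: days from the cause starting to the revenue movement starting.
// Returns a 0-1 temporal alignment score.
function timingScore(category: Category, lagDays: number): number {
  if (lagDays < 0) return 0; // the cause started after the effect
  if (category === "external") return 0.5; // assumption: no lag model, stay neutral
  const { min, max } = LAG_WINDOWS_DAYS[category];
  if (lagDays >= min && lagDays <= max) return 1;
  if (lagDays < min) return lagDays / min; // earlier than typical: partial credit
  // Later than typical: decay linearly to zero over one more window length.
  return Math.max(0, 1 - (lagDays - max) / max);
}
```

Under these assumptions, a campaign that started six weeks (42 days) ago scores 0, while one that started a week ago sits inside the marketing window and scores 1.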
Magnitude plausibility. Is the implied effect size in the right order of magnitude? Suppose total weekly revenue is $500K, ad spend went up by $2K, and the delta is $40K. For the spend increase to explain the delta, every extra dollar would have to return $20 within the week. The math does not support the story. The agent should refuse to inflate the impact range past what reach allows.
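The reach ceiling can be enforced mechanically. The sketch below applies the article's rule of thumb that a cause reaching a fraction of users cannot explain more than that fraction of the movement; `capByReach` and `Range` are hypothetical names, and a real system might want a softer cap.

```typescript
interface Range {
  low: number;
  high: number;
}

// Clamp an impact range so its magnitude never exceeds reach * |delta|.
// reach: fraction of users exposed (0-1); delta: absolute revenue change.
function capByReach(range: Range, reach: number, delta: number): Range {
  if (reach < 0 || reach > 1) {
    throw new Error("reach must be between 0 and 1");
  }
  const ceiling = reach * Math.abs(delta);
  const clamp = (x: number): number => Math.min(ceiling, Math.max(-ceiling, x));
  return { low: clamp(range.low), high: clamp(range.high) };
}
```

For a $40K delta and a feature behind a 25% rollout, a proposed range of $2K to $15K would be trimmed to $2K to $10K.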
Corroboration. Do independent signals agree? Conversion drops during the exact incident window, support tickets spike in the same window — corroboration is strong. A single signal is not corroboration. It is a hypothesis.
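The four signals can be combined with a weighted sum. In this sketch the weights are illustrative assumptions, not values from the article; the only constraint taken from the text is that reach and corroboration outweigh timing. `Signals`, `WEIGHTS`, and `plausibility` are hypothetical names.

```typescript
// Each signal is expected to be normalized to the 0-1 range already.
interface Signals {
  reach: number;
  timing: number;
  magnitude: number;
  corroboration: number;
}

// Illustrative weights summing to 1; tune them against calibration data.
const WEIGHTS: Signals = {
  reach: 0.3,
  corroboration: 0.3,
  magnitude: 0.25,
  timing: 0.15,
};

function plausibility(signals: Signals): number {
  const keys = Object.keys(WEIGHTS) as (keyof Signals)[];
  for (const key of keys) {
    if (signals[key] < 0 || signals[key] > 1) {
      throw new Error(`${key} must be normalized to the 0-1 range`);
    }
  }
  return keys.reduce((sum, key) => sum + WEIGHTS[key] * signals[key], 0);
}
```

With these weights, a candidate supported by timing alone tops out at 0.15, which keeps "it happened the same week" from carrying a hypothesis by itself.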
What to Adopt Instead of Building From Scratch
The causal inference primitives already exist. Borrow them.
Causal Inference Libraries
DoWhy (Python)[1] — open-source library from Microsoft and PyWhy with built-in root cause analysis and confidence intervals for metric attribution
CausalNex (Python) — Bayesian network reasoning when business variables interact in non-trivial ways
EconML — heterogeneous treatment effects when different segments respond differently to the same change
Commercial Platforms Moving Toward Automated Attribution
Triple Whale Moby — agent that surfaces attribution insights and flags budget reallocation for e-commerce
Statsig — experimentation platform with automated metric impact analysis the attribution agent can ingest directly
HockeyStack Odin — revenue attribution that maps marketing touchpoints to pipeline and revenue outcomes
Data Infrastructure the Agent Cannot Run Without
- A centralized event log or warehouse (BigQuery, Snowflake, or a well-structured Postgres)
- API access to the experiment platform, release tracker, incident system, and ad platforms
- A scheduler (cron, GitHub Actions, Temporal) that runs the agent without anyone remembering to trigger it
What a Real Report Looks Like
A concrete output from a hypothetical SaaS company that saw a $47K week-over-week increase.
The week below: a B2B SaaS company posts a $47K weekly revenue increase, up 9.2% WoW, with a z-score of 1.8 against the 13-week distribution. Above the noise floor. Worth investigating. Below the threshold where a single-cause explanation would be defensible.
| Hypothesis | Impact Range | Confidence | Key Evidence |
|---|---|---|---|
| Pricing page A/B test (variant B won) | +$18K to +$26K | High | Statsig shows 3.1% conversion lift at 95% CI, test ran the full week |
| Enterprise onboarding flow shipped | +$8K to +$15K | Medium | 12 enterprise trials started, 40% above weekly avg, timing aligns |
| Competitor X 4-hour outage on Tuesday | +$3K to +$12K | Low | Signups spiked 2x during outage window, retention unmeasured |
| Google Ads spend +22% WoW | +$2K to +$7K | Medium | Spend confirmed, no incrementality test running |
| End-of-quarter budget flush | +$0 to +$8K | Speculative | March is historically strong, no direct evidence this week |
The hypothesis ranges overlap. Their upper bounds sum to $68K, well above the $47K delta. That is correct. Attribution is not an accounting closeout where causes must add to 100%. Causes interact. The pricing test and the competitor outage may have amplified each other. Forcing them into non-overlapping buckets is what creates false precision in the first place.
The unexplained remainder is the second mechanism. The lower bounds sum to only $31K, so as much as $16K of the delta may have no identified cause. A remainder range of $0 to $16K is uncomfortable. It is also accurate. Some portion of every weekly movement is driven by forces the system cannot observe at this cadence. Naming that portion is what separates the report from a story.
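The remainder arithmetic for this example can be checked directly. The sketch uses the same `Math.max(0, ...)` formula as the agent code earlier in the article and the five impact ranges from the table; the function name `unexplainedRemainder` is hypothetical, and the formula assumes a revenue increase.

```typescript
interface Range {
  low: number;
  high: number;
}

// Remainder = delta minus what the hypotheses could explain, floored at zero.
// The remainder's low end uses the hypotheses' high bounds, and vice versa.
function unexplainedRemainder(delta: number, ranges: Range[]): Range {
  const explainedLow = ranges.reduce((sum, r) => sum + r.low, 0);
  const explainedHigh = ranges.reduce((sum, r) => sum + r.high, 0);
  return {
    low: Math.max(0, delta - explainedHigh),
    high: Math.max(0, delta - explainedLow),
  };
}

// Impact ranges from the example report, in dollars.
const exampleRanges: Range[] = [
  { low: 18_000, high: 26_000 }, // pricing page A/B test
  { low: 8_000, high: 15_000 }, // enterprise onboarding flow
  { low: 3_000, high: 12_000 }, // competitor outage
  { low: 2_000, high: 7_000 }, // Google Ads spend
  { low: 0, high: 8_000 }, // end-of-quarter budget flush
];

const remainder = unexplainedRemainder(47_000, exampleRanges);
// remainder is { low: 0, high: 16000 }
```

The low bounds sum to $31K and the high bounds to $68K, which is where the $0 to $16K remainder comes from.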
Without a Calibration Loop, Confidence Labels Are Theater
The agent gets sharper only when archived hypotheses are graded against ground truth.
The agent improves only if the loop closes. After a quarter, walk back through the archived reports. Where ground truth has arrived — a test reached power, a feature was measured over 90 days, a marketing channel was paused and isolated — grade the agent's earlier estimate against what actually happened.
Two metrics carry the loop:
Coverage rate. How often did the true cause appear somewhere in the agent's hypothesis list? If a real cause was never even a candidate, the data-gathering layer has a blind spot. Add the source.
Confidence calibration. When the agent labels a hypothesis 'high confidence,' is it right 80%+ of the time? When it labels 'low,' is it right roughly 30-40%? If high and low are equally accurate, the scoring function is broken. The labels are decoration, not signal.
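Both metrics reduce to simple counts over graded reports. The sketch below assumes a hypothetical archive shape (`GradedReport`, `GradedHypothesis`) in which each report has known ground truth and each hypothesis has been marked as a true cause or not; the article does not define how grading is recorded.

```typescript
type Confidence = "high" | "medium" | "low" | "speculative";

interface GradedHypothesis {
  confidence: Confidence;
  wasTrueCause: boolean; // filled in once ground truth arrives
}

interface GradedReport {
  hypotheses: GradedHypothesis[];
}

// Share of reports whose hypothesis list contained at least one true cause.
function coverageRate(reports: GradedReport[]): number {
  if (reports.length === 0) return NaN;
  const covered = reports.filter((r) =>
    r.hypotheses.some((h) => h.wasTrueCause),
  ).length;
  return covered / reports.length;
}

// Observed accuracy per confidence label, for labels that were actually used.
function accuracyByLabel(
  reports: GradedReport[],
): Partial<Record<Confidence, number>> {
  const tally: Partial<Record<Confidence, { right: number; total: number }>> = {};
  for (const report of reports) {
    for (const h of report.hypotheses) {
      const counts = tally[h.confidence] ?? { right: 0, total: 0 };
      counts.total += 1;
      if (h.wasTrueCause) counts.right += 1;
      tally[h.confidence] = counts;
    }
  }
  const accuracy: Partial<Record<Confidence, number>> = {};
  for (const label of Object.keys(tally) as Confidence[]) {
    const counts = tally[label];
    if (counts) accuracy[label] = counts.right / counts.total;
  }
  return accuracy;
}
```

If `accuracyByLabel` returns similar numbers for "high" and "low", that is the broken-scoring signal described above.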
Quarterly Calibration Review
- All attribution reports from the past quarter pulled into one place
- Cases with known ground truth identified and isolated
- Agent hypothesis rankings graded against actual outcomes
- Coverage rate computed across all reports
- Confidence calibration curve measured (high / medium / low accuracy)
- Blind spots named: true causes that never appeared as candidates
- Scoring weights adjusted by category based on historical accuracy
- New data sources added for any recurring blind spot
Operating Rules the Agent Cannot Violate
The constraints that prevent the automation from manufacturing certainty.
Attribution Agent Operating Rules
Never ship a single-cause explanation
Revenue movements are multi-causal. Even when one factor dominates, alternatives must surface alongside it, and the unexplained remainder must be stated.
Output ranges. Never point estimates
'Marketing drove $14,200' implies measurement precision that does not exist. Use '$10K-$18K' with a confidence label and the evidence behind both.
Every hypothesis carries an explicit confidence label
High, medium, low, or speculative. Each label maps to a defined evidence threshold. Readers must know the quality of the data behind every line.
Surface the unexplained remainder
It is the audit trail for everything the system cannot measure. Suppressing it produces a report that looks complete and is misleading.
Distinguish correlation from causation in the evidence field
Co-occurrence is not the same as experimental evidence. The agent must label which one supports each hypothesis.
Archive every report
Without the calibration loop, confidence labels are decoration. Storage is what makes them accountable.
What Breaks. Watch for These.
Honest failure modes from teams that have shipped this.
What happens when the agent confidently attributes revenue to the wrong cause?
It will. The mitigation is the calibration loop. When the agent is wrong, trace it: usually a data source was incomplete or the scoring function over-weighted temporal alignment. Adjust the weights and document the failure mode. One pattern catches teams off guard — the agent being wrong confidently is worse than humans being uncertain loudly. A confident-looking report ends the search for alternative explanations. Build prominent uncertainty signaling into the UI before this happens, not after. A 'High' label should display the actual accuracy rate from the calibration history, not just a color.
How are interactions between causes handled?
Two changes that each individually move revenue +5% might combine to +12% or to +3%, depending on whether they amplify or cannibalize. The agent flags when high-plausibility candidates overlap in timing and user population. Interaction effects are genuinely difficult to measure at a weekly cadence. Naming that limitation in the report is the correct output. Pretending otherwise is the failure mode.
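Flagging possible interactions does not require measuring them. A minimal sketch, assuming each candidate carries a time window, a plausibility score, and a list of affected user segments: the `segments` field, the 0.6 threshold, and the function names are all hypothetical, and shared segment tags are only a rough proxy for overlapping user populations.

```typescript
interface Candidate {
  id: string;
  start: Date;
  end: Date;
  segments: string[]; // hypothetical tags for the user populations affected
  plausibility: number; // 0-1 score from the scoring function
}

function windowsOverlap(a: Candidate, b: Candidate): boolean {
  return (
    a.start.getTime() <= b.end.getTime() && b.start.getTime() <= a.end.getTime()
  );
}

// Returns id pairs of strong candidates that overlap in both time and segment.
function flagInteractions(
  candidates: Candidate[],
  minPlausibility = 0.6,
): [string, string][] {
  const strong = candidates.filter((c) => c.plausibility >= minPlausibility);
  const pairs: [string, string][] = [];
  strong.forEach((a, i) => {
    strong.slice(i + 1).forEach((b) => {
      const sharedSegment = a.segments.some((s) => b.segments.includes(s));
      if (windowsOverlap(a, b) && sharedSegment) {
        pairs.push([a.id, b.id]);
      }
    });
  });
  return pairs;
}
```

The output is a list of pairs to name in the report as "may interact, not separable this week", not an estimate of the interaction itself.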
What if stakeholders ignore the confidence labels and treat 'low' estimates as fact?
That is a UI problem, not a data problem. Render low-confidence hypotheses visually different — desaturated, badged 'speculative,' or moved to a separate section. Make it structurally harder to quote a low-confidence number without its qualifier. Format is enforcement.
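One way to make the qualifier hard to drop is to never emit the number without it. In this sketch the impact range and the confidence label are rendered as a single string, optionally with the label's historical accuracy from the calibration archive; `formatHypothesisImpact` and its output format are assumptions for illustration.

```typescript
type Confidence = "high" | "medium" | "low" | "speculative";

interface Range {
  low: number;
  high: number;
}

// Render a dollar amount in signed thousands, e.g. 12000 -> "+$12K".
function formatUsd(amount: number): string {
  const sign = amount < 0 ? "-" : "+";
  return `${sign}$${Math.round(Math.abs(amount) / 1000)}K`;
}

// The range and its qualifier travel together as one string.
// historicalAccuracy: observed hit rate for this label (0-1), if known.
function formatHypothesisImpact(
  range: Range,
  confidence: Confidence,
  historicalAccuracy?: number,
): string {
  const label = confidence.charAt(0).toUpperCase() + confidence.slice(1);
  const accuracy =
    historicalAccuracy === undefined
      ? ""
      : `, right ${Math.round(historicalAccuracy * 100)}% of the time historically`;
  return `${formatUsd(range.low)} to ${formatUsd(range.high)} [${label} confidence${accuracy}]`;
}
```

Calling `formatHypothesisImpact({ low: 3000, high: 12000 }, "low")` yields `+$3K to +$12K [Low confidence]`, so anyone copying the figure into a slide copies the label with it.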
Is this just marketing attribution with extra steps?
Marketing attribution covers marketing touchpoints. Revenue delta attribution covers everything that could move the number — product changes, incidents, competitive dynamics, macro shifts. Marketing attribution is one input. The agent operates above it.
How do you keep the agent from replacing actual thinking?
The agent ships hypotheses, not conclusions. The Monday meeting still runs — but it starts with structured evidence instead of opening anecdotes. The team's job is to decide which hypotheses warrant deeper investigation and action. The agent's job is to make sure the conversation starts from facts rather than memory.
How to Have This Running by End of Sprint
Phased rollout. Value lands in week one.
The full system is not week-one work. The first version is.
Week 1. Pull the revenue delta. Compute the z-score against recent history. Knowing whether this week is signal or noise is itself a meaningful output, before any cause is investigated.
Week 2-3. Add one evidence category. Experiments if the team runs an A/B platform. Incidents if PagerDuty is wired up. Get the data flowing into a structured report.
Week 4-6. Add the remaining categories one at a time. Each new source adds explanatory power and a maintenance cost. Prioritize sources with high signal strength.
Month 2-3. Implement the scoring function and confidence framework. This is the hard part. It will need iteration. Start with simple heuristics and refine against calibration data, not against intuition.
Ongoing. Run the quarterly calibration review. Adjust weights. Add data sources when blind spots show up.
One admission from teams that have shipped this: after three or four months, the scoring model matters less than the discipline of running the process consistently. The benefit is not that the agent is smarter than a senior analyst. It is that the agent runs every week without fail, including the Monday after a long weekend, including crunch weeks, including the months when the analyst is on something else. Consistency beats accuracy. The structured report changes how decisions get made over a quarter, regardless of any single week's verdict.
- [1] Root Cause Analysis With DoWhy: An Open Source Python Library For Causal Machine Learning (aws.amazon.com)
- [2] Beware False Precision In Your Analytics (forrester.com)
- [3] The Rise Of Agentic Marketing Mix Modeling (sellforte.com)
- [4] PyWhy DoWhy: Online Shop Example Notebooks (pywhy.org)
- [5] Top Enterprise Marketing Analytics And Attribution Platforms (segmentstream.com)
- [6] Causal AI Disruption Across Industries 2025–2026 (acalytica.com)
- [7] Overconfidence in Interval Estimates, Management Science (pubsonline.informs.org)