Manual Monday attribution is the loudest voice winning the narrative. Replace it with an agent that pulls the delta, queries five evidence systems, and ships a ranked hypothesis list with explicit confidence — and an unexplained remainder.
Why the Monday attribution meeting manufactures false certainty — and the structural fix
Five evidence categories ranked by signal strength, with exact data sources for each
How to score candidates by reach, timing, magnitude, and corroboration — not by who speaks first
A working TypeScript schema and Python DoWhy snippet you can drop into a sprint
Confidence calibration: how to turn labeled probabilities into real accountability
Phased rollout: value lands in week one, full system by month three
Monday morning. Someone opens the dashboard. Revenue is up 8% week-over-week. Or down 12%. The team starts hunting for the reason. Marketing claims the uptick. Engineering blames the deploy. The CEO reads a macro headline and nods.
Revenue movements don't have a single cause. They're the intersection of experiments running, features shipping, incidents degrading the product, ad spend drifting, and external forces nobody owns. Forcing the delta onto one factor isn't analysis. It's theater. And it actively distorts the next decision.
The fix is structural. An agent fires every Monday before anyone walks into the standup. It pulls the delta, queries five evidence systems in parallel, scores each candidate cause, and ships a ranked hypothesis list with explicit confidence levels and an unexplained remainder. The agent doesn't produce the answer. It produces a structured decomposition with uncertainty surfaced — which is the only honest output.
Three structural failures no amount of process improvement can fix.
Whoever frames the narrative first owns it. Recency bias drags credit toward the most recent change — the launch that happened Thursday, not the experiment that ramped six weeks ago and hit its traffic target quietly on Saturday. Confirmation bias routes the explanation toward whatever the room already wants true. Marketing sees marketing. Product sees product. Leadership sees vindication.
Forrester named the second failure: false precision in analytics.[2] End users never see the formulas behind the dashboard, so they make decisions with confidence the data doesn't support. "The new landing page drove the 8%" carries an implied certainty that was never measured. Nobody argues with it because the number sounds real.
The third failure is timing. By the time the meeting resolves a story, the correction window is closed. Budgets got allocated Tuesday. The experiment got shut down Thursday because it "already worked." The agent has no ego, no recency bias, no calendar. It runs the same evidence pass every week — including the Monday after a long weekend when the analyst is heads-down on something else.
| Manual Monday meeting | Attribution agent |
|---|---|
| Whoever speaks first owns the narrative | Evidence assembled before any human opens the meeting |
| Single-cause explanations win by social default | Multiple hypotheses ranked; overlaps and interactions surfaced |
| Burns 45–60 min of senior time, weekly, for a story | Report lands in Slack before standup; the meeting starts at the second bullet |
| No systematic evidence pull; whoever has a tab open wins | Pulls from experiments, deploys, incidents, spend, macro feeds |
| Confidence levels are implied, never named | Every hypothesis carries an explicit range and confidence label |
| Output drifts with whoever attended that week | Same methodology every week; consistency beats any individual analyst |
From delta detection to ranked output — parallel queries, scored by evidence quality.
Query the revenue source (Stripe, the warehouse, a BI API) and compute the week-over-week delta. Pull the 4-week rolling average and the 13-week distribution, then compute a z-score. Use |z| > 1.5 as the investigation threshold for most product contexts; tighten to |z| > 2.0 if your team runs more than one investigation weekly. The multiple-comparisons problem matters here: assuming roughly normal weekly deltas, about 13% of weeks exceed |z| = 1.5 by chance, so 52 annual runs produce around 7 false positives per year. If the delta sits inside the threshold, the report flags it as 'within noise floor' and ships a two-sentence summary. Most weeks land inside the noise floor. Investigating noise as if it were signal is how teams burn senior cycles on stories that explain nothing.
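A minimal sketch of the noise-floor check, using only the Python standard library. It assumes the z-score is computed on the latest week-over-week delta against the previous 13 deltas; the text does not pin that detail down, so treat it as one reasonable reading. The function names and the sample revenue series are hypothetical.

```python
from statistics import mean, stdev


def delta_z_score(weekly_revenue: list[float], window: int = 13) -> float:
    """z-score of the latest week-over-week delta against the prior `window` deltas."""
    deltas = [later - earlier for earlier, later in zip(weekly_revenue, weekly_revenue[1:])]
    if len(deltas) < window + 1:
        raise ValueError(f"need at least {window + 2} weeks of revenue")
    history, latest = deltas[-(window + 1):-1], deltas[-1]
    spread = stdev(history)  # sample standard deviation of the baseline deltas
    if spread == 0:
        raise ValueError("baseline has zero variance; z-score is undefined")
    return (latest - mean(history)) / spread


def classify_delta(z: float, threshold: float = 1.5) -> str:
    """Apply the investigation threshold: investigate only outside the noise floor."""
    return "investigate" if abs(z) > threshold else "within noise floor"


# Hypothetical data: 14 quiet weeks alternating around $500K, then a jump.
revenue = [500_000 + (2_000 if i % 2 else 0) for i in range(14)] + [515_000]
z = delta_z_score(revenue)
print(f"z = {z:.2f} -> {classify_delta(z)}")
```

Scoring deltas rather than raw revenue keeps a steady growth trend from registering as an anomaly every week. If your revenue is flat, scoring the level directly would work as well.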
Five data sources, queried simultaneously. A/B test platform for experiments started, stopped, or ramped. Release tracker for features shipped, with user reach estimates. Incident log for outages, error spikes, and degradations. Ad platforms for spend deltas by channel and campaign. External feeds for holidays, competitor moves, macro shifts. The agent doesn't pick favorites here. It collects — and records what it couldn't fetch.
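The parallel pull, including the "records what it couldn't fetch" requirement, can be sketched with `asyncio.gather`. This is an illustration under stated assumptions: the fetcher functions are hypothetical stubs standing in for real API clients, and the 30-second timeout is an arbitrary default.

```python
import asyncio
from typing import Any, Awaitable, Callable

# A fetcher is any zero-argument coroutine function returning a list of evidence records.
Fetcher = Callable[[], Awaitable[list[dict[str, Any]]]]


async def gather_evidence(sources: dict[str, Fetcher], timeout_s: float = 30.0):
    """Query every source concurrently; return (evidence, failures) keyed by source name."""

    async def guarded(fetch: Fetcher):
        return await asyncio.wait_for(fetch(), timeout=timeout_s)

    names = list(sources)
    # return_exceptions=True keeps one failing source from cancelling the others.
    results = await asyncio.gather(
        *(guarded(sources[name]) for name in names), return_exceptions=True
    )
    evidence: dict[str, list[dict[str, Any]]] = {}
    failures: dict[str, str] = {}
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            failures[name] = f"{type(result).__name__}: {result}"
        else:
            evidence[name] = result
    return evidence, failures


# Hypothetical stubs: one source responds, one is down.
async def fetch_experiments():
    return [{"name": "pricing-page-test", "status": "running"}]


async def fetch_incidents():
    raise ConnectionError("incident API unreachable")


evidence, failures = asyncio.run(
    gather_evidence({"experiments": fetch_experiments, "incidents": fetch_incidents})
)
print(evidence)
print(failures)
```

Carrying `failures` into the report matters: a missing incident feed should lower confidence in the whole decomposition instead of silently vanishing.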
Plausibility is not a vibe. Each candidate gets four numeric inputs, each normalized to 0–1. Reach: what fraction of users could this have moved? Timing: did the cause precede the effect at a plausible lag? Magnitude: is the implied effect size in the right order of magnitude? Corroboration: do independent signals agree? A pricing test that reached its planned statistical power and shows a measured lift at 95% confidence scores high on all four. A macro headline with no measurable product-behavior link scores low on reach, magnitude, and corroboration. The agent is explicit about what it can and cannot measure.
The agent ranks candidates by impact range — low-to-high revenue bounds — not by a point estimate. Each hypothesis carries an impact range, a confidence label, and the evidence behind both. When the sum of hypothesis ranges doesn't cover the full delta, the report names the unexplained remainder explicitly. The remainder is the operating mechanism that prevents false-precision narratives from filling the gap. It also tells you where your data collection has blind spots.
The report lands before Monday standup. Every report is archived, ideally in an append-only store (a Postgres table, a BigQuery partition, an S3 prefix). When ground truth eventually arrives (a test reaches statistical power, a feature gets measured over 90 days, a channel is paused and isolated), the archived hypothesis is graded against the outcome. The calibration loop is the only mechanism that turns 'high confidence' into a labeled probability instead of a color.
Signal strength is the structural axis. Treat all five evenly and the report lies.
| Category | Data Sources | Signal Strength | Typical Lag | What the Agent Can and Cannot Measure |
|---|---|---|---|---|
| Experiments (A/B tests) | LaunchDarkly, Optimizely, Statsig, internal flagging | High — direct measurement available | 0–2 days | CAN: lift, CI, p-value, reached power. CANNOT: interaction with other concurrent tests |
| Features Shipped | GitHub releases, Linear, Jira deploy logs | Medium — reach known, impact estimated | 1–7 days | CAN: rollout %, feature exposure events. CANNOT: behavioral impact without instrumentation |
| Incidents and Degradations | PagerDuty, Statuspage, error rate dashboards | High — duration and affected users measurable | 0–1 days | CAN: duration, error volume, affected user count. CANNOT: downstream churn from poor experience |
| Marketing Spend Changes | Google Ads, Meta Ads, attribution platforms | Medium — spend known, incremental lift uncertain | 3–14 days | CAN: spend delta, impressions, attributed conversions. CANNOT: true incrementality without holdout |
| External / Macro Events | News APIs, economic feeds, competitor monitors | Low — correlation possible, causation unverified | Variable | CAN: co-occurrence timing. CANNOT: causal link without controlled comparison |
Read the third column. That's the entire structural argument.
The A/B platform can tell you, with statistical rigor, that variant B lifted conversion by 3.1%. The news feed cannot tell you whether a competitor's outage drove signups. Treat those signals as equivalent and the agent is producing decoration, not analysis.
The agent should weight evidence by signal strength and say so explicitly. When marketing asks whether the new campaign drove the increase, the honest output is: spend rose 15%, timing aligns, no incrementality test was running, plausibility moderate, confidence low. That sentence is more useful than any point estimate the room would otherwise improvise.
This is also where incrementality tests earn their place. A holdout group study — where a representative cohort is deliberately excluded from a campaign — directly measures the lift the attribution agent can only estimate.[9] If you run no incrementality tests, your marketing rows will always carry 'medium' confidence at best, regardless of how much spend data the agent collects.
Four inputs, each normalized. Combined into a plausibility score that determines rank order.
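A minimal sketch of such a scorer. The weights are the defaults quoted in the rollout plan (reach 0.30, timing 0.25, magnitude 0.25, corroboration 0.20); combining them as a plain weighted sum is an assumption, and the candidate values are made up for illustration.

```python
from dataclasses import dataclass

# Default weights from the rollout plan; the linear combination is an assumption.
WEIGHTS = {"reach": 0.30, "timing": 0.25, "magnitude": 0.25, "corroboration": 0.20}


@dataclass(frozen=True)
class Candidate:
    name: str
    reach: float          # fraction of users the cause could have moved
    timing: float         # did it precede the effect at a plausible lag?
    magnitude: float      # is the implied effect size the right order of magnitude?
    corroboration: float  # do independent signals agree?


def plausibility(candidate: Candidate, weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the four normalized inputs; returns a value in [0, 1]."""
    score = 0.0
    for field_name, weight in weights.items():
        value = getattr(candidate, field_name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{field_name} must be normalized to 0-1, got {value}")
        score += weight * value
    return score


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Highest plausibility first."""
    return sorted(candidates, key=plausibility, reverse=True)


# Hypothetical inputs for two candidates.
pricing_test = Candidate("pricing page A/B test", 0.9, 0.9, 0.8, 0.9)
macro_headline = Candidate("macro headline", 0.2, 0.6, 0.1, 0.1)
for c in rank([macro_headline, pricing_test]):
    print(f"{c.name}: {plausibility(c):.3f}")
```

The score only decides rank order. Impact ranges and confidence labels still come from the evidence itself, not from this number.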
DoWhy's GCM module handles the cases where influence variables interact — and produces uncertainty estimates the heuristic cannot.
The scoring function above is fast, transparent, and good for weekly cadence. It starts failing when candidate causes interact — when the pricing test and the ad spend increase both hit the same user cohort and you need to decompose which one drove what.
That's where DoWhy's Graphical Causal Model (GCM) module earns its place.[1] You define a directed acyclic graph (DAG) of the causal relationships you believe hold — revenue depends on conversion rate; conversion rate depends on page experience and offer quality; page experience depends on uptime — then fit the model on historical data and call gcm.attribute_anomalies() to decompose an anomalous week. The output is a per-node attribution score: positive scores are contributing factors, negative scores are factors that reduced the anomaly's likelihood.[4]
This isn't a replacement for the weekly heuristic agent. It's a deeper investigation tool for the cases where the heuristic returns a large unexplained remainder or high overlap between candidates.
Point estimates with no uncertainty bounds are the most expensive output an attribution system can produce.
Behavioral economics research in Management Science on overconfidence in interval estimates[7] shows the effect is driven by neglecting unknowns. Prompt people to consider what they don't know, and overconfidence drops. The agent should bake this into its output schema — not its tone.
Every hypothesis ships with three explicit fields: an impact range, a confidence label, and the evidence behind both.
When the sum of impact ranges doesn't cover the full delta, the agent reports an unexplained remainder. This isn't a failure mode. It's the audit trail for everything the system can't observe at a weekly cadence — word-of-mouth momentum, brand-perception drift, slow product-quality compounding. Suppressing the remainder makes the report look complete at the cost of being misleading.
One file, one job. The data shape prevents the common failures before they ship.
The dataMissing field is not optional. Teams that omit it end up with reports that look authoritative but hide their own blind spots. When a cause is 'medium confidence,' the missing data field should say exactly what would have to be true to promote it to 'high' — an incrementality holdout, a user-level event log, an error rate correlated to the specific timeframe. The reader should know what to go collect next.
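The report shape can be sketched as follows. The article's own schema is TypeScript; this is a Python rendering under assumptions, with `data_missing` standing in for the `dataMissing` field and every other field name hypothetical. The point it illustrates is that the data shape itself rejects a hypothesis that hides its blind spots.

```python
from dataclasses import dataclass, field
from enum import Enum


class Confidence(str, Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    SPECULATIVE = "speculative"


@dataclass(frozen=True)
class Hypothesis:
    cause: str
    impact_low: float       # lower revenue bound, in dollars
    impact_high: float      # upper revenue bound, in dollars
    confidence: Confidence
    evidence: list[str]
    data_missing: str       # what would have to be collected to raise confidence

    def __post_init__(self) -> None:
        if self.impact_low > self.impact_high:
            raise ValueError("impact_low must not exceed impact_high")
        if not self.evidence:
            raise ValueError("every hypothesis needs at least one evidence item")
        if not self.data_missing.strip():
            raise ValueError("data_missing is required: name what to go collect next")


@dataclass(frozen=True)
class AttributionReport:
    week_ending: str
    revenue_delta: float
    z_score: float
    hypotheses: list[Hypothesis]
    unexplained_low: float
    unexplained_high: float
    sources_unavailable: list[str] = field(default_factory=list)


ads = Hypothesis(
    cause="Google Ads spend +22% WoW",
    impact_low=2_000,
    impact_high=7_000,
    confidence=Confidence.MEDIUM,
    evidence=["Spend confirmed via API", "Attributed conversions up 18%"],
    data_missing="No incrementality holdout running; true lift unmeasured",
)
print(ads.confidence.value, ads.impact_low, ads.impact_high)
```

A range expressed as two required bounds makes a bare point estimate impossible to construct, which is the same guardrail enforced at the type level.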
For durable scheduling, Temporal is a strong fit for this kind of agent — it handles retries, state persistence across API timeouts, and observable execution history in a way that cron alone doesn't.[10] GitHub Actions works for lower-frequency cadences. Pick the one your team will actually maintain.
A concrete worked example from a hypothetical B2B SaaS week — z-score 1.8, above the noise floor.
The week below: a B2B SaaS company posts a $47K weekly revenue increase, up 9.2% WoW. Z-score of 1.8 against the 13-week distribution. Above the noise floor. Worth investigating. Below the threshold where a single-cause explanation would be defensible.
| Hypothesis | Impact Range | Confidence | Key Evidence | Data Missing |
|---|---|---|---|---|
| Pricing page A/B test (variant B) | +$18K to +$26K | High | Statsig: 3.1% conversion lift at 95% CI, test ran full week, n=14,200 | Long-term retention impact — measured conversion, not LTV |
| Enterprise onboarding flow shipped | +$8K to +$15K | Medium | 12 enterprise trials started, 40% above weekly avg; timing aligns | No controlled comparison — no holdout cohort, no feature flag isolation |
| Competitor X 4-hr outage (Tuesday) | +$3K to +$12K | Low | Signups spiked 2× during outage window; no retention data | No competitor monitor confirming causation; signup cohort may churn |
| Google Ads spend +22% WoW | +$2K to +$7K | Medium | Spend confirmed via API; attributed conversions up 18% | No incrementality holdout running — true lift unmeasured |
| End-of-quarter budget flush | +$0 to +$8K | Speculative | March is historically +12% vs Jan–Feb; no direct evidence this week | No cohort data confirming budget-driven vs. organic purchase intent |
The hypothesis ranges overlap, and the sum of their upper bounds ($68K) exceeds the $47K delta. That's correct and expected. Attribution isn't an accounting closeout where causes must add to 100%. Causes interact. The pricing test and the competitor outage may have amplified each other in the enterprise segment. Forcing them into non-overlapping buckets produces false precision.
The unexplained remainder of $0–$16K is uncomfortable. It's also accurate. Some portion of every weekly movement is driven by forces the system can't observe at this cadence — word-of-mouth, brand drift, slow product-quality compounding. Naming that portion is what separates the report from a story.
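One way to compute the remainder that reproduces the $0–$16K figure from the table: subtract the summed upper bounds from the delta for the low end, subtract the summed lower bounds for the high end, and floor both at zero. The formula is an inference from the table's numbers, not something the text states, and this sketch only handles a positive delta.

```python
Range = tuple[float, float]


def unexplained_remainder(delta: float, ranges: list[Range]) -> Range:
    """Remainder bounds for a positive delta, floored at zero."""
    low_total = sum(low for low, _ in ranges)
    high_total = sum(high for _, high in ranges)
    # If every cause hit its upper bound, little or nothing is left unexplained;
    # if every cause hit its lower bound, the gap is at its widest.
    return (max(0.0, delta - high_total), max(0.0, delta - low_total))


# Impact ranges from the worked example, in dollars.
hypotheses: dict[str, Range] = {
    "Pricing page A/B test (variant B)": (18_000, 26_000),
    "Enterprise onboarding flow shipped": (8_000, 15_000),
    "Competitor X 4-hr outage": (3_000, 12_000),
    "Google Ads spend +22% WoW": (2_000, 7_000),
    "End-of-quarter budget flush": (0, 8_000),
}

low, high = unexplained_remainder(47_000, list(hypotheses.values()))
print(f"${low / 1000:.0f}K to ${high / 1000:.0f}K")  # $0K to $16K
```

The lower bounds sum to $31K and the upper bounds to $68K, so the remainder runs from zero (the causes could fully cover the delta) to $16K (they could leave a third of it unexplained).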
The causal inference primitives already exist. The agent's job is orchestration, not invention.
| Tool / Library | Use Case | When to Pick It | Tradeoff |
|---|---|---|---|
| DoWhy (Python)[1] | Causal graph + anomaly attribution | Multi-variable interactions; need confidence intervals on attribution | Requires you to specify and defend the causal DAG; not a black box |
| Google Meridian[8] | Marketing mix modeling (MMM) for media channels | Teams with 2+ years of channel spend history; Bayesian MCMC output | GPU recommended; 2–4 hr training runs; overkill for weekly cadence unless pre-trained |
| Meta Robyn | Frequentist MMM with Nevergrad optimization | Smaller teams; R-fluent data team; want faster iteration on model shape | Ridge regression — less uncertainty quantification than Meridian's MCMC |
| Statsig / LaunchDarkly API | Experiment data ingest for the attribution agent | Always — experiment data is the highest-signal input the agent has | API rate limits; need to cache results for the weekly run |
| Temporal[10] | Durable orchestration of the weekly agent run | When cron alone isn't enough — you need retries, state, and audit trail | Operational overhead; adds a Temporal cluster to your infrastructure |
| BigQuery / Snowflake | Historical archive for calibration | Always — the calibration loop requires persistent storage of every report | Query costs if you're not careful with partitioning; partition by week |
The agent gets sharper only when archived hypotheses are graded against ground truth.
The agent improves only if the loop closes. After a quarter, walk back through the archived reports. Where ground truth has arrived — a test reached power, a feature was measured over 90 days, a channel was paused and isolated — grade the agent's earlier estimate against what actually happened.
Two metrics carry the loop:
Coverage rate. How often did the true cause appear somewhere in the agent's hypothesis list? If a real cause was never even a candidate, the data-gathering layer has a blind spot. Add the source.
Confidence calibration. When the agent labels a hypothesis 'high confidence,' it should be right 80%+ of the time. When it labels 'low,' roughly 30–40% accuracy is appropriate: the label means 'plausible, unverified,' not 'wrong.' If high and low are equally accurate, the scoring function is broken and the labels are decoration, not signal.
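Both metrics can be computed from the archive with a few lines. This sketch assumes each archived hypothesis has been graded with a boolean `confirmed` flag (for example, true when the measured outcome fell inside the stated impact range); that grading rule, the record shape, and the sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedHypothesis:
    week: str         # identifier of the archived report, e.g. "W10"
    cause: str
    confidence: str   # "high" | "medium" | "low" | "speculative"
    confirmed: bool   # assumption: True when ground truth supported the hypothesis


def accuracy_by_label(graded: list[GradedHypothesis]) -> dict[str, tuple[int, int]]:
    """Return {label: (confirmed_count, graded_count)}."""
    tally: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for item in graded:
        tally[item.confidence][0] += int(item.confirmed)
        tally[item.confidence][1] += 1
    return {label: (hits, total) for label, (hits, total) in tally.items()}


def coverage_rate(true_cause_by_week: dict[str, str], graded: list[GradedHypothesis]) -> float:
    """Share of weeks whose true cause appeared anywhere in that week's hypothesis list."""
    if not true_cause_by_week:
        raise ValueError("no weeks with ground truth yet")
    listed: dict[str, set[str]] = defaultdict(set)
    for item in graded:
        listed[item.week].add(item.cause)
    covered = sum(cause in listed[week] for week, cause in true_cause_by_week.items())
    return covered / len(true_cause_by_week)


archive = [
    GradedHypothesis("W10", "pricing test", "high", True),
    GradedHypothesis("W10", "ads spend", "low", False),
    GradedHypothesis("W11", "onboarding flow", "high", True),
    GradedHypothesis("W11", "competitor outage", "low", True),
    GradedHypothesis("W12", "ads spend", "high", False),
]
truth = {"W10": "pricing test", "W11": "onboarding flow", "W12": "checkout incident"}

print(accuracy_by_label(archive))
print(f"coverage: {coverage_rate(truth, archive):.0%}")
```

In this toy archive the W12 cause never appeared as a candidate, which is the blind-spot signal the coverage metric exists to catch.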
The constraints that prevent the automation from manufacturing certainty.
Revenue movements are multi-causal. Even when one factor dominates, alternatives must surface alongside it, and the unexplained remainder must be stated.
'Marketing drove $14,200' implies measurement precision that doesn't exist. Use '$10K–$18K' with a confidence label and the evidence behind both.
Confidence comes in four labels: high, medium, low, or speculative. Each maps to a defined evidence threshold. Readers must know the quality of the data behind every line.
The unexplained remainder is the audit trail for everything the system cannot measure. Suppressing it produces a report that looks complete and is misleading.
Co-occurrence is not experimental evidence. The agent must label which of the two supports each hypothesis: 'timing aligns' is different from 'A/B test at 95% CI confirms.'
Without the calibration loop, confidence labels are decoration. Archiving every report makes them accountable.
Honest failure modes from teams that have shipped this, not theoretical risks.
What happens when the agent confidently attributes revenue to the wrong cause?
It will. The mitigation is the calibration loop. When the agent is wrong, trace it: usually a data source was incomplete or the scoring function over-weighted temporal alignment for a slow-burn effect. Adjust the weights and document the failure mode.
One pattern catches teams off guard: a confident-looking report ends the search for alternative explanations faster than a vague human guess does. Build prominent uncertainty signaling into the UI before this happens, not after. A 'High' label should display the actual accuracy rate from the calibration history — e.g., 'High (83% accurate, n=12)' — not just a green badge.
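A small sketch of that label rendering. The `min_graded` cutoff, below which the label is shown as uncalibrated, is an added assumption and not something the text specifies.

```python
def calibrated_label(label: str, hits: int, graded: int, min_graded: int = 5) -> str:
    """Render a confidence label with its track record from the calibration history."""
    if not 0 <= hits <= graded:
        raise ValueError("hits must be between 0 and graded")
    if graded < min_graded:
        # Too few graded outcomes to quote an accuracy rate honestly.
        return f"{label} (uncalibrated, n={graded})"
    return f"{label} ({hits / graded:.0%} accurate, n={graded})"


print(calibrated_label("High", 10, 12))  # High (83% accurate, n=12)
print(calibrated_label("Low", 1, 2))     # Low (uncalibrated, n=2)
```

Showing `n` alongside the rate lets readers discount a label whose history is still thin.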
How are interactions between causes handled?
Two changes that each individually move revenue +5% might combine to +12% or +3%, depending on whether they amplify or cannibalize. The agent flags when high-plausibility candidates overlap in timing and user population.
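The overlap flag can be sketched as a pairwise check on time windows and user segments. The 0.6 plausibility cutoff, the segment names, and the dates are hypothetical; a real implementation would compare user-level exposure, which this sketch approximates with coarse segment sets.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class ScoredCandidate:
    name: str
    plausibility: float
    start: date
    end: date
    segments: frozenset[str]  # user populations the change could have touched


def overlapping_pairs(candidates: list[ScoredCandidate], min_plausibility: float = 0.6):
    """Flag high-plausibility pairs that share both a time window and a user segment."""
    strong = [c for c in candidates if c.plausibility >= min_plausibility]
    flagged = []
    for a, b in combinations(strong, 2):
        same_time = a.start <= b.end and b.start <= a.end  # closed-interval overlap
        shared = a.segments & b.segments
        if same_time and shared:
            flagged.append((a.name, b.name, sorted(shared)))
    return flagged


pricing = ScoredCandidate(
    "pricing test", 0.88, date(2025, 3, 3), date(2025, 3, 9), frozenset({"enterprise", "smb"})
)
ads = ScoredCandidate(
    "ads spend increase", 0.65, date(2025, 3, 5), date(2025, 3, 9), frozenset({"smb"})
)
macro = ScoredCandidate(
    "macro headline", 0.25, date(2025, 3, 4), date(2025, 3, 4), frozenset({"enterprise", "smb"})
)

print(overlapping_pairs([pricing, ads, macro]))
```

A flagged pair is a prompt to widen both impact ranges or to note the possible interaction in the report; the check itself says nothing about whether the two causes amplified or cannibalized each other.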
Interaction effects are genuinely hard to measure at weekly cadence. The DoWhy GCM approach handles this better than the heuristic scorer — the causal graph explicitly models which variables influence others, so the attribution scores account for mediated paths. For teams not using GCM, naming the limitation in the report is the correct output. Pretending otherwise is the failure mode.
What if stakeholders ignore the confidence labels and quote 'low' estimates as fact?
That's a UI enforcement problem, not a data problem. Render low-confidence hypotheses visually distinct — desaturated, badged 'speculative,' or moved to a collapsible 'also considered' section. Make it structurally harder to screenshot a low-confidence number without its qualifier. Format is enforcement. If it's too easy to quote the number without the caveat, someone will.
Is this just marketing attribution with extra steps?
Marketing attribution covers marketing touchpoints. Revenue delta attribution covers everything that could move the number — product changes, incidents, competitive dynamics, macro shifts. Marketing attribution is one input layer. The agent operates above it, and it specifically avoids the original sin of marketing attribution: treating every conversion as attributable to a channel, even when most would have happened anyway.
When does the z-score threshold need adjusting?
Three scenarios. First: if your revenue is highly seasonal, use a year-over-year comparison in addition to the rolling 13-week window, because the 13-week z-score will fire false positives every time a seasonal pattern recurs. Second: if the team runs the agent weekly, you're doing 52 tests per year. At |z| > 1.5, expect about 7 false positives annually from chance alone, assuming roughly normal deltas; at |z| > 2.0 that drops to 2 or 3. Raise the threshold to 2.0 if false positives are burning review capacity. Third: new product lines or recent pivots break the 13-week baseline. Re-seed the distribution when the business changes materially.
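The false-positive arithmetic behind the second scenario can be checked directly. This assumes weekly z-scores are approximately standard normal under "nothing happened," which real revenue series only roughly satisfy; under that assumption the expected count lands at roughly 7 per year for |z| > 1.5.

```python
import math


def two_sided_tail(z: float) -> float:
    """P(|Z| > z) for a standard normal variable."""
    return math.erfc(z / math.sqrt(2))


def expected_false_alarms(z_threshold: float, runs_per_year: int = 52) -> float:
    """Expected number of threshold crossings per year from chance alone."""
    return runs_per_year * two_sided_tail(z_threshold)


for threshold in (1.5, 2.0):
    alarms = expected_false_alarms(threshold)
    print(f"|z| > {threshold}: about {alarms:.1f} false alarms per year")
```

Heavy-tailed or autocorrelated revenue will produce more crossings than this estimate, so treat the result as a floor when deciding whether to raise the threshold.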
How do you keep the agent from replacing actual thinking?
The agent ships hypotheses, not conclusions. The Monday meeting still runs — but it starts with structured evidence instead of opening anecdotes. The team's job is deciding which hypotheses warrant deeper investigation and action. The agent's job is ensuring the conversation starts from data rather than memory. After four months of consistent reporting, the biggest observable change isn't the accuracy of any single week's verdict — it's that the team stops making confident single-cause claims without evidence, because the report format makes the habit visible.
Phased rollout. Value lands in week one, before any scoring logic is written.
The full system isn't week-one work. The first version is.
Week 1. Pull the revenue delta. Compute the z-score against 13-week history. Knowing whether this week is signal or noise is itself a meaningful output — before any cause is investigated. Post it to Slack. That's it.
Week 2–3. Add one evidence category. Experiments if the team runs an A/B platform. Incidents if PagerDuty is wired up. Get the data flowing into a structured report with a single candidate category.
Week 4–6. Add the remaining four categories one at a time. Each new source adds explanatory power and a maintenance cost. Prioritize by the Signal Strength column in the evidence-category table above.
Month 2–3. Implement the scoring function and confidence framework. This is the hard part — it will need iteration. Start with the default weights (reach: 0.30, timing: 0.25, magnitude: 0.25, corroboration: 0.20) and refine against calibration data, not intuition.
Ongoing. Run the quarterly calibration review. Adjust weights. Add data sources when blind spots show up. If a recurring blind spot won't go away, consider switching from the heuristic scorer to DoWhy GCM for that hypothesis category.
One admission from teams that have shipped this: after three or four months, the scoring model matters less than the discipline of running the process consistently. The benefit isn't that the agent is smarter than a senior analyst for any given week. It's that the agent runs every week without fail — including crunch weeks, the Monday after a long weekend, the months when the analyst is on something else. Consistency beats accuracy. The structured report changes how decisions get made over a quarter, regardless of any single week's verdict.