A weekly agent reads PRDs, roadmaps, ADRs, and OKRs, extracts the implicit assumptions buried between paragraphs, and ranks the conflicts by blast radius. Surface the contradictions before code gets written. The agent finds. Humans decide.
Why semantic drift — not explicit disagreement — is the real alignment threat
How a weekly document-scanning agent extracts implicit assumptions across team boundaries
The claim taxonomy that makes implicit assumption detection actually work
Blast-radius scoring: why time proximity outweighs semantic distance
The coherence brief format, resolution path selection, and what to track
Scaling from 10 teams to 30+ with topical clustering
Pre-deployment checklist and the five non-negotiable deployment rules
Most organizations discover strategic misalignment the expensive way. Two teams ship contradictory features. A platform migration collides with a launch. An architecture decision quietly invalidates three months of roadmap planning. Every one of those incidents was preventable — the conflict was sitting in documents nobody compared side by side.
The coherence gap is the distance between what separate teams believe to be true about shared goals, timelines, and constraints. It doesn't grow from explicit disagreement — that gets aired in planning. It grows from semantic drift: the same strategic vocabulary, used in good faith, decoded into different operational definitions. One team reads "scalable" and pictures horizontal autoscaling. Another reads the same word and plans a monolith refactor with bigger machines. Both are internally consistent. Neither knows the other exists.
A 2025 survey by Mural and The Martec Group, covering 350 marketing, sales, and R&D professionals across the US and Europe, found that 85% of GTM teams report regularly working toward different goals — despite the same 85% feeling confident about their collaboration.[9] The paradox is exact: high confidence, high misalignment, coexisting in the same room.
A weekly document-scanning agent closes the gap before it costs a quarter. Not by resolving conflicts — by making them specific enough that the right humans cannot pretend they did not see them.
Semantic drift is not a typo or a missing ticket. It's a slow divergence of meaning that no standup catches.
Semantic drift is the operational definition that two teams never reconciled because they were never in the same document at the same time. Same vocabulary. Different mechanics underneath.
A real shape: Team Alpha's PRD calls for "multi-tenant isolation" in the billing service. Team Beta's ADR pins "tenant-level data separation" as a Q3 milestone. The platform OKRs measure success by "customer onboarding velocity." All three documents touch the same surface. None defines where tenant isolation ends and onboarding speed begins. The billing team implements strict row-level security. Every API call costs an extra 200ms. The onboarding team's latency targets break the day the migration ships.
No standup catches this.[7] It lives in the assumptions between paragraphs of documents written weeks apart by people who never sat at the same table.
Zalando's engineering team discovered the same pattern when they applied AI to two years of postmortem data: the most persistent failure mode wasn't missed dependencies recorded in tickets — it was implicit assumptions embedded in narrative text that no automated check ever read.[11] Their AI pipeline found that root cause analyses systematically reflected team-level assumptions that went unspoken in the actual incident record.
Team A wants microservices, Team B wants monolith — argued in a meeting
PM and engineering disagree on the launch date — surfaced in planning
Two department heads fight over the budget — visible in the spreadsheet
Competing feature requests from different stakeholders — written in tickets
Both teams write "scalable" — one means horizontal autoscaling, the other means vertical scaling with bigger machines
PRD says "real-time" meaning seconds. ADR assumes "real-time" meaning minutes. Nobody compared.
"Customer-first" decoded as NPS by one team, revenue retention by another
OKR targets quietly assume different dependency-resolution timelines from each other
The agent runs weekly, compares cross-team documents, and surfaces the conflicts. It does not resolve them. Resolution needs authority the agent does not have.
The coherence agent runs on a fixed weekly cadence. It pulls the latest versions of strategic documents from each team, normalizes them into a shared semantic representation, then performs cross-document comparison to identify assumption conflicts. Output: a coherence brief delivered to leadership and team leads.
The agent surfaces. Humans resolve. Resolution requires judgment, context, and organizational authority. The agent's job ends at making conflicts visible with enough specificity that the right people can't route around them.
The key technical challenge is not retrieval — pulling documents from Notion or Confluence is straightforward. It's extraction: getting from "what the document says" to "what the document assumes." Explicit claims are easy to compare. Implicit ones require a prompt architecture designed around the specific ways your organization's documents embed undeclared constraints.
The agent pulls the latest published versions of PRDs, roadmaps, ADRs, and OKR documents from the document store — Notion, Confluence, Google Docs, or a Git repository. Each document gets parsed into a structured representation that preserves section hierarchy, timestamps, and authorship metadata. Draft documents are excluded by default: authors are still working through their thinking, and scanning in-progress documents produces false positives. Only published or approved documents enter the comparison pipeline.
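The status filter described above can be sketched in a few lines. This is a minimal illustration, not a real connector API: the `StrategicDoc` shape, the `earlyScanOptIn` flag, and `selectForScan` are hypothetical names, and the opt-in mirrors the pre-publication option mentioned in the FAQ.

```typescript
// Hypothetical document metadata; field names are assumptions, not a connector API.
type DocStatus = "draft" | "published" | "approved";

interface StrategicDoc {
  id: string;
  team: string;
  status: DocStatus;
  earlyScanOptIn?: boolean; // set by the owning team to request a pre-publication check
}

// Keep published and approved documents, plus drafts a team explicitly opted in.
function selectForScan(docs: StrategicDoc[]): StrategicDoc[] {
  return docs.filter((d) => d.status !== "draft" || d.earlyScanOptIn === true);
}

const scanned = selectForScan([
  { id: "prd-billing", team: "alpha", status: "published" },
  { id: "adr-draft", team: "beta", status: "draft" },
  { id: "okr-early", team: "platform", status: "draft", earlyScanOptIn: true },
]);
console.log(scanned.map((d) => d.id)); // [ 'prd-billing', 'okr-early' ]
```

Making the exclusion the default and the opt-in explicit keeps the noisy path something a team has to choose.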
Surface-level document comparison fails. Two PRDs can share zero keywords and still contradict each other. The agent decomposes each document into discrete claims — factual assertions, assumed constraints, stated timelines, dependency expectations. Each claim gets embedded into a vector space alongside its surrounding context. Research from ACL 2025 on atomic fact extraction found that decomposing complex claims before comparison substantially improves detection of cross-document contradictions.[12] This is where coherence detection works or fails silently.
The agent compares claims across team boundaries, looking for pairs where semantic similarity is high (same topic) but operational meaning diverges (different mechanics underneath). Same-team comparison is filtered out — that conflict gets resolved in the team's own planning. Only cross-boundary pairs matter. Sentence embedding vectors convert claims to numerical representations, enabling semantic search that catches definitional drift that keyword comparison misses entirely.
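A rough sketch of the candidate-selection half of this step, assuming each claim already carries an embedding vector from whatever model you use. The `EmbeddedClaim` shape, the `candidatePairs` name, and the 0.8 threshold are illustrative assumptions. This only finds same-topic, cross-team pairs; deciding whether their operational meaning diverges is a separate check.

```typescript
interface EmbeddedClaim {
  docId: string;
  team: string;
  text: string;
  vector: number[]; // produced upstream by an embedding model (assumed)
}

// Cosine similarity between two vectors; returns 0 for zero-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Cross-team claim pairs that are close enough in topic to deserve a divergence check.
function candidatePairs(
  claims: EmbeddedClaim[],
  minSimilarity = 0.8, // assumed threshold; tune during calibration
): Array<[EmbeddedClaim, EmbeddedClaim]> {
  const pairs: Array<[EmbeddedClaim, EmbeddedClaim]> = [];
  for (let i = 0; i < claims.length; i++) {
    for (let j = i + 1; j < claims.length; j++) {
      const a = claims[i]!;
      const b = claims[j]!;
      if (a.team === b.team) continue; // same-team pairs are filtered out
      if (cosine(a.vector, b.vector) >= minSimilarity) pairs.push([a, b]);
    }
  }
  return pairs;
}
```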
Each conflict gets scored on four dimensions: dependency depth (how many downstream teams sit under the conflicting assumption), time proximity (days until the conflict surfaces as a blocking issue), reversibility (cost to undo once shipped), and customer exposure (whether external users absorb the impact). On the first build we weighted semantic divergence highest. That put definitional differences the teams had already resolved verbally at the top of the brief, while the real critical conflict — a timeline assumption that would break three production services in six weeks — ranked lower. Time proximity must outweigh semantic distance.
Detected conflicts compile into a structured brief that names the conflicting documents, quotes the specific passages, explains the divergence, and proposes a resolution format — async doc review, 30-minute sync, or escalation to a shared decision record. The brief is the product. Everything upstream is plumbing. Not every conflict warrants a meeting; the brief makes that call explicit so recipients don't have to.
The hardest problem is not extracting what a document says. It's extracting what the document assumes.
The hardest problem in coherence detection is the gap between what a document states and what it takes for granted. A PRD might state "the recommendation engine will use collaborative filtering" (explicit claim) while assuming "the data pipeline delivers fresh user events within 5 minutes" (implicit dependency). If the data team's ADR pins batch processing on a 6-hour cycle, the implicit assumption is already broken. No keyword search catches it.
Reliable detection requires a claim taxonomy the agent can extract against. The taxonomy teaches the model where implicit assumptions live — which sentences are load-bearing and which are decoration. Chain-of-thought prompting works substantially better than simple structured extraction for this task: asking the model to reason step-by-step about what a document takes for granted before listing explicit claims surfaces assumptions that direct extraction misses.[13]
Building this taxonomy is the calibration work that most teams skip. Generic prompts produce generic results.
| Claim Type | Definition | Example | Detection Difficulty |
|---|---|---|---|
| Explicit Goal | A directly stated objective with measurable criteria | "Reduce checkout latency to under 400ms by Q3" | Low |
| Explicit Constraint | A stated limitation or boundary condition | "Hold SOC 2 compliance through the migration" | Low |
| Implicit Dependency | An unstated reliance on another team's output or timeline | PRD assumes pipeline freshness that no ADR specifies | High |
| Implicit Definition | A term used without operational definition that splits across teams | "Real-time" decoded as seconds in one doc, minutes in another | High |
| Timeline Assumption | A date or duration assumed but never validated cross-team | Roadmap pins API v2 to Q2. API team's OKR targets Q3. | Medium |
| Resource Assumption | Assumed availability of people, budget, or infrastructure | Three teams independently plan the same ML engineer for Q3 | Medium |
| Architectural Assumption | Assumed technical approach that conflicts with another team's plan | Frontend assumes REST. Backend ADR specifies GraphQL migration. | Medium |
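One way to make the taxonomy something the extractor can target is to encode it as types. The sketch below mirrors the table; the identifier names, the `Claim` fields, and the `needsHumanReview` helper are assumptions for illustration, and the example claims are the recommendation-engine case described earlier.

```typescript
type ClaimType =
  | "explicit_goal"
  | "explicit_constraint"
  | "implicit_dependency"
  | "implicit_definition"
  | "timeline_assumption"
  | "resource_assumption"
  | "architectural_assumption";

type Difficulty = "low" | "medium" | "high";

// Detection difficulty per claim type, copied from the table above.
const DETECTION_DIFFICULTY: Record<ClaimType, Difficulty> = {
  explicit_goal: "low",
  explicit_constraint: "low",
  implicit_dependency: "high",
  implicit_definition: "high",
  timeline_assumption: "medium",
  resource_assumption: "medium",
  architectural_assumption: "medium",
};

interface Claim {
  type: ClaimType;
  docId: string;
  team: string;
  text: string; // the assertion, written out explicitly
  evidence: string; // the passage the claim was extracted from
}

// Hypothetical helper: route the hardest-to-detect claim types to review while calibrating.
function needsHumanReview(claim: Claim): boolean {
  return DETECTION_DIFFICULTY[claim.type] === "high";
}

const exampleClaims: Claim[] = [
  {
    type: "explicit_goal",
    docId: "prd-recs",
    team: "recommendations",
    text: "The recommendation engine will use collaborative filtering",
    evidence: "the recommendation engine will use collaborative filtering",
  },
  {
    type: "implicit_dependency",
    docId: "prd-recs",
    team: "recommendations",
    text: "The data pipeline delivers fresh user events within 5 minutes",
    evidence: "the recommendation engine will use collaborative filtering",
  },
];
console.log(exampleClaims.filter(needsHumanReview).map((c) => c.type)); // [ 'implicit_dependency' ]
```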
The Zalando postmortem team found ~10% attribution errors persist even with Claude Sonnet — unless the pipeline architecture separates causality classification from description generation.
A direct prompt — "list all assumptions in this document" — fails for a predictable reason: the model treats the instruction as a request for what is unusual, not for what is taken for granted. Assumptions that are deeply embedded in an author's mental model produce no signal in the text that marks them as assumptions. They look like facts.
Zalando's two-year AI postmortem project discovered the equivalent failure mode in incident analysis: LLMs would attribute root causes to technologies mentioned prominently in the text, regardless of actual causality. Their fix was a multi-stage pipeline where separate stages handled description, causality classification, and recommendation — preventing the model from short-circuiting reasoning by picking the most salient noun.[11]
The same architecture applies to assumption extraction. Build stages that are each solely responsible for a single extraction task.
Keeping these stages separate prevents the model from anchoring on surface similarity and missing mechanically divergent assumptions buried underneath.
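The stage separation might look like the following. The two stages here (list explicit claims, then infer what each claim takes for granted) are an assumed split for illustration, not the article's or Zalando's actual pipeline. The stages are plain synchronous functions so the sketch runs without any model API; in practice each would wrap its own model call.

```typescript
// Each stage does exactly one job. Real stages would be async model calls.
type Stage<I, O> = (input: I) => O;

interface ExplicitClaim {
  id: number;
  text: string;
}

interface Assumption {
  supportsClaimId: number; // the explicit claim this assumption underpins
  text: string;
}

interface StageSet {
  listExplicitClaims: Stage<string, ExplicitClaim[]>;
  inferAssumptions: Stage<ExplicitClaim, Assumption[]>;
}

function extractAssumptions(docText: string, stages: StageSet): Assumption[] {
  const explicit = stages.listExplicitClaims(docText);
  const assumptions: Assumption[] = [];
  // Stage 2 sees one explicit claim at a time, so it cannot anchor on
  // whichever term is most prominent elsewhere in the document.
  for (const claim of explicit) {
    assumptions.push(...stages.inferAssumptions(claim));
  }
  return assumptions;
}

// Stub stages standing in for model calls (hypothetical behavior).
const stubStages: StageSet = {
  listExplicitClaims: (text) =>
    text.split(". ").map((sentence, i) => ({ id: i, text: sentence })),
  inferAssumptions: (claim) =>
    claim.text.indexOf("collaborative filtering") !== -1
      ? [{ supportsClaimId: claim.id, text: "Fresh user events are available to the engine" }]
      : [],
};

console.log(
  extractAssumptions("The engine will use collaborative filtering. Launch is in Q3.", stubStages),
);
```

Because each stage has a narrow input and output, stages can be swapped or re-prompted independently during calibration.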
Flagging everything equally is a noise generator. Prioritization is the entire product.
An agent that flags every conflict equally is a noise generator. The product is the ranking. Blast-radius estimation scores each conflict by how many teams, timelines, and customer commitments will absorb damage if it stays unresolved.
Four factors drive blast radius. Dependency depth: how many downstream teams rely on the assumption. Time proximity: how soon the contradiction materializes in production. Reversibility: whether the decision can be undone cheaply or requires a rewrite. Customer exposure: whether external users absorb the impact.
What we got wrong on the first build: the original scoring weighted semantic divergence highest. That produced a brief where the top-ranked conflict was a definitional difference that two teams had already verbally resolved but hadn't updated in their documents. The actual critical conflict ranked lower — a timeline assumption that would break three production services in six weeks. Blast-radius scoring only became useful once we weighted time proximity above semantic distance. Semantic difference tells you the conflict exists. Blast radius tells you which one to fix this week.
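A minimal scoring sketch that weights time proximity above semantic distance. The weights, the 90-day horizon, the cap of five dependent teams, and every field name are assumptions chosen for illustration; the article does not publish its actual formula.

```typescript
interface ConflictSignals {
  dependentTeams: number; // downstream teams relying on the assumption
  daysUntilBlocking: number; // days until the conflict blocks real work
  reversalCost: number; // 0 (cheap to undo) to 1 (requires a rewrite)
  customerFacing: boolean; // whether external users absorb the impact
  semanticDivergence: number; // 0 to 1, from the comparison step
}

// Assumed weights summing to 1. Time proximity is deliberately the largest
// and semantic divergence the smallest.
const WEIGHTS = { time: 0.4, depth: 0.25, reversal: 0.2, customer: 0.1, semantic: 0.05 };
const HORIZON_DAYS = 90; // conflicts further out than this get no urgency credit
const MAX_TEAMS = 5; // dependency depth saturates at this many teams

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Returns a 0-100 blast-radius score; higher means fix sooner.
function blastRadius(c: ConflictSignals): number {
  const time = 1 - clamp01(c.daysUntilBlocking / HORIZON_DAYS);
  const depth = clamp01(c.dependentTeams / MAX_TEAMS);
  const score =
    WEIGHTS.time * time +
    WEIGHTS.depth * depth +
    WEIGHTS.reversal * clamp01(c.reversalCost) +
    WEIGHTS.customer * (c.customerFacing ? 1 : 0) +
    WEIGHTS.semantic * clamp01(c.semanticDivergence);
  return Math.round(score * 100);
}

// A large definitional gap with no deadline pressure...
const definitional = blastRadius({
  dependentTeams: 1, daysUntilBlocking: 90, reversalCost: 0.1,
  customerFacing: false, semanticDivergence: 0.9,
});
// ...versus a modest wording gap that breaks three services in six weeks.
const timeline = blastRadius({
  dependentTeams: 3, daysUntilBlocking: 42, reversalCost: 0.7,
  customerFacing: true, semanticDivergence: 0.3,
});
console.log({ definitional, timeline });
```

With these assumed weights the six-week timeline conflict outranks the definitional one, which is the ordering the first build got backwards.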
The software engineering field uses blast radius for exactly this kind of risk containment: not every change has the same risk profile, but without explicit scoring, small and large-impact conflicts flow through the same process.[14]
Structured output that names the conflict, attributes it to a passage, and proposes the resolution format. The format is not optional.
Each weekly brief carries three sections: a summary of the coherence score trend with contributing factors; a ranked list of detected conflicts with full source context; and a proposed resolution format for each one.
The resolution format is not decoration. Not every conflict warrants a meeting. Some need an async clarification in a shared document. Others require an ADR revision. The high-blast-radius, low-reversibility conflicts need a sync with decision-making authority physically in the room. Picking the wrong format is how good detection turns into wasted hours.
The brief also tracks recurring misalignment patterns. A conflict that reappears week over week — even after resolution — signals a structural cause: a shared glossary that doesn't exist, a dependency nobody owns, or a planning process that generates ambiguity by design. These patterns are worth more than any single conflict finding.
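Recurrence tracking can be as simple as counting the distinct weeks in which the same conflict appears. A sketch, assuming each finding carries a stable `conflictKey` for the claim pair; the key, the `WeeklyFinding` shape, and the three-week threshold are all hypothetical.

```typescript
interface WeeklyFinding {
  week: string; // e.g. an ISO week label such as "2025-W14"
  conflictKey: string; // stable identifier for the conflicting claim pair (assumed)
}

// Conflicts seen in at least `minWeeks` distinct weekly briefs.
function recurringConflicts(findings: WeeklyFinding[], minWeeks = 3): string[] {
  const weeksByKey = new Map<string, Set<string>>();
  for (const f of findings) {
    const weeks = weeksByKey.get(f.conflictKey) ?? new Set<string>();
    weeks.add(f.week);
    weeksByKey.set(f.conflictKey, weeks);
  }
  return Array.from(weeksByKey.entries())
    .filter(([, weeks]) => weeks.size >= minWeeks)
    .map(([key]) => key);
}

const history: WeeklyFinding[] = [
  { week: "2025-W10", conflictKey: "real-time-definition" },
  { week: "2025-W11", conflictKey: "real-time-definition" },
  { week: "2025-W12", conflictKey: "real-time-definition" },
  { week: "2025-W12", conflictKey: "api-v2-date" },
];
console.log(recurringConflicts(history)); // [ 'real-time-definition' ]
```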
Overall coherence score (0-100) with week-over-week trend and contributing factors named
Top 3-5 ranked conflicts with source document excerpts and team attribution
Blast-radius estimate per conflict with dependency chain visualization
Proposed resolution format: async doc comment, 30-minute sync, or decision-record escalation
Recurring misalignment patterns across teams — drift that keeps coming back signals a structural cause
Async doc review — low-blast-radius definitional mismatches that one team can clarify alone
30-minute sync — medium-blast-radius timeline or dependency conflicts needing two-team negotiation
Decision-record escalation — high-blast-radius architectural conflicts that need leadership sign-off
Shared glossary update — recurring implicit definition conflicts that need permanent disambiguation
| Blast Radius Score | Cost to Reverse | Resolution Format | Time to Resolution | Owner |
|---|---|---|---|---|
| 0–29 | Low | Async doc comment | 24–48 hrs | Single team lead |
| 30–59 | Low–Medium | 30-min cross-team sync | 3–5 days | Both team leads |
| 60–79 | Medium | ADR revision + review cycle | 1–2 weeks | Lead + architect |
| 80–100 | High | Escalation to decision record with leadership | Same week | VP or equivalent |
| Any | Any (recurring pattern) | Shared glossary update or process fix | Within a sprint | Eng + Product together |
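The table maps naturally onto a small lookup function. In this sketch the function and type names are hypothetical, and scores that fall exactly on a band edge (30, 60, 80) are assigned to the higher band, which is an assumption. A recurring conflict keeps its score-based format and also gets flagged for a structural fix.

```typescript
type ResolutionFormat =
  | "async_doc_comment"
  | "cross_team_sync_30min"
  | "adr_revision"
  | "leadership_decision_record";

interface ResolutionPlan {
  format: ResolutionFormat;
  needsStructuralFix: boolean; // shared glossary update or process fix
}

function planResolution(blastRadius: number, recurring = false): ResolutionPlan {
  let format: ResolutionFormat;
  if (blastRadius < 30) format = "async_doc_comment";
  else if (blastRadius < 60) format = "cross_team_sync_30min";
  else if (blastRadius < 80) format = "adr_revision";
  else format = "leadership_decision_record";
  return { format, needsStructuralFix: recurring };
}

console.log(planResolution(12)); // { format: 'async_doc_comment', needsStructuralFix: false }
console.log(planResolution(85, true)); // { format: 'leadership_decision_record', needsStructuralFix: true }
```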
The infrastructure is straightforward. The calibration is where this stops working.
coherence-agent/
├── connectors/
│ ├── notion.ts
│ ├── confluence.ts
│ ├── google-docs.ts
│ └── git-markdown.ts
├── extraction/
│ ├── claim-extractor.ts
│ ├── taxonomy-classifier.ts
│ ├── implicit-detector.ts
│ └── calibration-examples.ts
├── analysis/
│ ├── semantic-comparator.ts
│ ├── conflict-ranker.ts
│ ├── blast-radius.ts
│ └── dependency-graph.ts
├── output/
│ ├── brief-generator.ts
│ ├── slack-notifier.ts
│ └── dashboard-api.ts
├── config.ts
├── scheduler.ts
└── index.ts
Surface conflicts. Propose formats. Stop there. Auto-resolution requires organizational authority the agent doesn't have, and a wrong resolution applied at scale is worse than the original drift.
Documents change constantly during drafting. Real-time scanning produces false positives that train recipients to ignore the brief. Weekly cadence aligns with publication rhythms and protects signal-to-noise.
"Teams may be misaligned on data freshness" is useless. "PRD line 47 assumes 5-minute pipeline freshness; ADR-104 specifies 6-hour batch" is actionable. Vague warnings get ignored on the second brief.
Feed the agent 5–10 past misalignment incidents your organization actually experienced. Use them as few-shot calibration to tune implicit assumption detection for your document patterns. Generic prompts produce generic noise.
Missing a conflict costs less than flooding leaders with false positives. A brief with three real conflicts builds trust over months. A brief with twenty noise items gets filtered to spam permanently — and the agent dies with it.
Ten teams generate 45 pairwise relationships. Thirty teams generate 435. Brute force is not a strategy.
Pairwise comparison stops scaling fast. Ten teams produce 45 pairwise relationships. Thirty teams produce 435.[8] Comparing every document against every other document burns tokens and surfaces noise. The agent needs a smarter topology.
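The counts follow from the number of unordered pairs among n teams, n(n - 1)/2. A quick check (`pairCount` is just an illustrative name):

```typescript
// Number of unordered team pairs among n teams.
function pairCount(n: number): number {
  return (n * (n - 1)) / 2;
}

console.log(pairCount(10)); // 45
console.log(pairCount(30)); // 435
```

Tripling the team count multiplies the pair count by almost ten, and the document-level comparison grows faster still once each team publishes several documents.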
Topical clustering with boundary detection is the strategy that holds up. The agent first groups documents by the strategic topics they address — billing, onboarding, data platform, observability — then concentrates comparison on documents that straddle topic boundaries. The highest-value conflicts live at those boundaries, where two teams touch the same system from different angles with different assumptions. Comparing inside a cluster catches drift the cluster owners would have caught themselves. Comparing across clusters catches drift nobody is currently watching for.
This means the dependency graph is structural input, not just metadata. You need to know which teams share systems before you can know which document pairs deserve full comparison. Build the graph once; the agent uses it every week to filter pairwise candidates before any embedding comparison runs.
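A sketch of using the graph as a pre-filter, assuming it is stored as a map from team to the systems that team touches. The `SystemsByTeam` representation and the function name are hypothetical; only team pairs that share at least one system go on to embedding comparison.

```typescript
type SystemsByTeam = Record<string, string[]>;

// Team pairs that touch at least one common system.
function teamPairsSharingSystems(graph: SystemsByTeam): Array<[string, string]> {
  const teams = Object.keys(graph).sort();
  const pairs: Array<[string, string]> = [];
  for (let i = 0; i < teams.length; i++) {
    for (let j = i + 1; j < teams.length; j++) {
      const a = teams[i]!;
      const b = teams[j]!;
      const systemsA = new Set(graph[a] ?? []);
      const shares = (graph[b] ?? []).some((s) => systemsA.has(s));
      if (shares) pairs.push([a, b]);
    }
  }
  return pairs;
}

const graph: SystemsByTeam = {
  billing: ["billing-db", "tenant-api"],
  onboarding: ["tenant-api", "signup-flow"],
  observability: ["metrics-pipeline"],
};
console.log(teamPairsSharingSystems(graph)); // [ [ 'billing', 'onboarding' ] ]
```

Here three teams yield three possible pairs but only one candidate, so two comparisons are skipped before any tokens are spent.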
ROI lives in conflicts caught before they became incidents. Track three numbers, not opinions.
ROI for coherence detection lives in conflicts caught before they materialized as incidents. Three numbers prove value. Conflicts surfaced per weekly brief. Resolution time from detection to decision. Production incidents and delivery delays that trace back to assumption mismatches — month-over-month, before and after the agent shipped.
Organizations running coherence analysis have reported reductions in cross-team coordination failures, but the size of the effect depends heavily on document quality, team participation, and calibration effort. Treat any reported figure as directional. The actual mechanism isn't that the agent is smarter than humans. It's that the agent is more consistent. It reads every document, every week, and never assumes two teams have already talked.
For a team with $10M in payroll, the research on organizational misalignment suggests wasted effort in the $2M–$3M range annually.[10] A coherence agent running weekly costs roughly $5–15 per run in LLM inference for 50–100 documents — under $1,000 annually. The infrastructure cost is negligible against a single avoided misalignment incident.
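The annual figure is simple arithmetic on the per-run range, assuming one run per week for 52 weeks:

```typescript
const RUNS_PER_YEAR = 52; // one run per week (assumed to run every week)

function annualInferenceCostUsd(perRunUsd: number): number {
  return perRunUsd * RUNS_PER_YEAR;
}

console.log(annualInferenceCostUsd(5)); // 260
console.log(annualInferenceCostUsd(15)); // 780
```

Both ends of the range land under $1,000, consistent with the estimate above.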
Does the coherence agent replace cross-team planning meetings?
No. It makes them shorter and more specific. Instead of spending 45 minutes discovering that two teams hold different assumptions, the meeting opens with the conflict already named, the source documents linked, and a proposed resolution format on the table. Most conflicts get resolved asynchronously after the brief lands — one team updates a document, flags the resolution, and the meeting that would have surfaced the conflict never happens. The meetings that do happen are tighter because the discovery work was done by the agent, not by six people in a room.
How do you handle documents that are still in draft?
Filter by document status. Only published or approved documents enter the comparison pipeline. Scanning drafts produces noise — authors are still working through their own thinking and contradicting themselves on purpose. Allow teams to opt specific documents into early scanning when they want pre-publication coherence checks. Never make it the default.
What if teams intentionally hold different assumptions as part of an A/B strategy?
Surface them anyway. The resolution format becomes "acknowledged divergence." Teams mark specific conflicts as intentional, the agent records the annotation, and stops re-flagging that pair. Intentional divergence that is documented is not a coherence gap. Intentional divergence that is undocumented eventually becomes one — somebody new joins, reads one document, and assumes the other doesn't exist.
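The "acknowledged divergence" annotation could be kept in a small registry keyed by document pair. The class, field names, and document ids below are hypothetical; the sketch only shows the suppression mechanics.

```typescript
interface Conflict {
  docA: string;
  docB: string;
  summary: string;
}

// Order-independent key so (A, B) and (B, A) refer to the same pair.
function pairKey(docA: string, docB: string): string {
  return [docA, docB].sort().join("::");
}

class DivergenceRegistry {
  private readonly notes = new Map<string, string>();

  // Record that a team marked this pair's divergence as intentional.
  acknowledge(docA: string, docB: string, note: string): void {
    this.notes.set(pairKey(docA, docB), note);
  }

  isAcknowledged(c: Conflict): boolean {
    return this.notes.has(pairKey(c.docA, c.docB));
  }
}

// Drop conflicts whose document pair has been marked intentional.
function withoutAcknowledged(conflicts: Conflict[], registry: DivergenceRegistry): Conflict[] {
  return conflicts.filter((c) => !registry.isAcknowledged(c));
}

const registry = new DivergenceRegistry();
registry.acknowledge("prd-pricing-a", "prd-pricing-b", "A/B test on pricing page, owner: growth");

const remaining = withoutAcknowledged(
  [
    { docA: "prd-pricing-b", docB: "prd-pricing-a", summary: "Different default plan" },
    { docA: "prd-recs", docB: "adr-batch", summary: "5-minute freshness vs 6-hour batch" },
  ],
  registry,
);
console.log(remaining.map((c) => c.summary)); // [ '5-minute freshness vs 6-hour batch' ]
```

Keying on the document pair suppresses every conflict between those two documents, including new ones. A finer-grained key, such as the claim pair, avoids hiding unrelated drift.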
How much does it cost to run a coherence agent weekly?
For a typical organization with 50–100 strategic documents, expect 200–500K tokens per weekly run. At current LLM pricing, that runs roughly $5–15 per run for extraction and analysis, plus embedding costs that round to nothing. The infrastructure cost is negligible compared to one week of misaligned engineering work. The expensive part is calibration time, not runtime.
Can the agent work with documents in multiple languages?
Generally, yes. Multilingual embedding models map text from different languages into a shared vector space, so the extraction and comparison layers can operate on semantic meaning rather than surface text. Quality varies by language and model, so include cross-language examples in your calibration set. Generate the brief in the organization's primary working language so leadership reads one register, not three.
When should you NOT build a coherence agent?
Two situations: when the organization has fewer than four teams actively producing strategic documents (pairwise comparison is fast enough to do manually), and when document quality is very low — teams writing vague OKRs and thin PRDs will produce vague conflicts. The agent amplifies what's already in the documents. Fix the document culture first, then automate the comparison.
The coherence gap isn't a failure of communication. It's a failure of comparison infrastructure. Teams produce the signal — it's in the documents they write every quarter. The agent's only job is to read those documents in the same week, across the same team boundaries, and produce a ranked list of the conflicts nobody has yet named out loud. The brief lands. Humans decide. The quarter ships without the drift.
The agent needs read access to strategic documents across teams. That raises real concerns about information barriers, competitive sensitivity, and access control. Implement with the minimum necessary permissions. Extract claims and discard raw document content after processing. Store the structured claims and conflict pairs, not the full text of every PRD. Work with the security team to define access boundaries before deployment, not after the first incident.