Meeting transcripts produce decisions. The decisions vanish into a Notion graveyard within thirty days. A two-agent workflow extracts structured records and attaches review triggers that fire when conditions actually change — not on a calendar.
- Why decision decay is a structural problem, not a discipline problem
- A seven-field decision record schema borrowed and extended from ADR practice
- A two-agent extraction and scanning workflow: concrete prompts, routing logic, code
- Condition-based review triggers vs. calendar-based theater
- Edge cases: multi-meeting threads, contradictions, implicit authority
- Prompt tuning strategy: ground truth, precision vs. recall, few-shot calibration
- A setup checklist and governance rules for keeping the registry alive
Engineering managers in a 2024 survey reported losing roughly four hours per week to meetings that relitigate decisions already made. Ten percent of a forty-hour week, spent re-arguing settled ground because the original reasoning never made it into a form anyone could find.
Every org makes hundreds of decisions a quarter. Strategy pivots, vendor selections, hiring freezes, architecture choices, budget calls. They happen in a meeting. They get captured in notes. They vanish into a Notion graveyard nobody searches.
Three months later someone asks why you picked Vendor X over Vendor Y. The answer lives in a Tuesday standup transcript nobody tagged. The person who made the call left the company. The rationale is gone.
IDC put a number on it: Fortune 500 companies lose roughly $31.5 billion a year by failing to share knowledge.[9] That figure covers re-work, re-explanation, and decisions made blind — without knowing what was already settled and why. For a 200-person engineering org, the math is less dramatic but still real: when institutional knowledge leaves with each departing engineer, it costs up to 213% of that person's annual salary to rebuild at equivalent proficiency.[10]
This is the org-memory failure mode. Teams re-litigate settled questions. They reverse decisions without knowing the originals existed. They build on assumptions that were explicitly rejected six months earlier.
The fix is not another documentation initiative — your team already has documentation fatigue. The fix is an agent workflow that processes transcripts on a cadence, extracts decisions into a structured schema, and attaches review triggers that tell a separate scanning agent when to surface them again. Capture is one job. Re-surfacing is a different job. Both have to ship.
Three failure modes turn last quarter's reasoning into this quarter's argument.
Most knowledge management systems treat decisions as a documentation problem. Write it down. Put it somewhere. Hope someone finds it. Documentation alone has three failure modes that compound:
Capture failure. Nobody writes the decision down. The meeting ends, the next thing starts. Action items get tracked because somebody assigns them. The reasoning behind the action — the part that ages well — does not.
Retrieval failure. The decision was documented but lives in a format that resists search. Buried in paragraph seven of February twelfth's notes, tagged only with the meeting series name. Nobody thinks to look there when the question resurfaces.
Staleness failure. The decision was correct when made. Conditions changed. The vendor you rejected raised a round and dropped pricing forty percent. The constraint that drove the architecture call got removed. No mechanism flagged it for review.
A structured record solves all three. Automated extraction handles capture. Indexed schema handles retrieval. A review-trigger field — the piece nearly every system misses — handles staleness.
Knowledge workers already spend an average of 1.8–2.5 hours per day searching for information they need to do their jobs.[11] Decision records don't eliminate that search tax — but they shrink the worst version of it: the search for institutional reasoning that left the building with a former colleague.
The schema is the contract. Without rationale and alternatives, you are just logging events.
The schema borrows from Architectural Decision Records (ADRs) but extends the concept past software architecture into general organizational decisions.[1] ADRs have been battle-tested at companies from startups to enterprises — AWS documents the practice across teams of varying sizes.[2] The load-bearing insight from ADR practice: recording alternatives considered and rationale matters more than recording the decision itself.[3] The decision is the easy part. The reasoning is what disappears.
Recent LLM research on automated ADR generation confirms the same field hierarchy. Structured context strategies that capture rationale and constraints outperform those that capture only the decision outcome — the model produces more complete, accurate records when it has access to the deliberation, not just the conclusion.[12]
| Field | Type | Purpose | Example |
|---|---|---|---|
| decision | string | The specific choice made | Adopt PostgreSQL for the analytics data store |
| rationale | string | Why this option won over alternatives | Needed JSON support; team has existing Postgres expertise |
| alternatives | string[] | Options rejected and the reason each was cut | MongoDB (scaling concerns), BigQuery (cost at our volume) |
| date | ISO 8601 | When the decision was finalized | 2026-02-14T10:30:00Z |
| decider | string | Person or group with final authority | Sarah Chen, VP Engineering |
| affected_teams | string[] | Teams whose work changes because of this decision | Data Platform, Product Analytics, Backend |
| review_trigger | string | Specific condition that forces re-evaluation | If monthly analytics queries exceed 50M rows or BigQuery drops below $3/TB |
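To make the contract concrete, here is a minimal sketch of the seven fields as a TypeScript type with a structural check. The field names come from the table; `validateDecisionRecord`, `isDecisionRecord`, and `example` are illustrative names, not part of any library, and your data store may enforce the same rules differently.

```typescript
// Sketch only: field names mirror the schema table; the validator is an assumption.
interface DecisionRecord {
  decision: string;
  rationale: string;
  alternatives: string[];
  date: string; // ISO 8601 timestamp
  decider: string;
  affected_teams: string[];
  review_trigger: string;
}

const STRING_FIELDS = ["decision", "rationale", "date", "decider", "review_trigger"] as const;
const LIST_FIELDS = ["alternatives", "affected_teams"] as const;

// Returns a list of problems; an empty list means the record has all seven fields.
function validateDecisionRecord(input: unknown): string[] {
  if (typeof input !== "object" || input === null) return ["record is not an object"];
  const rec = input as Record<string, unknown>;
  const errors: string[] = [];

  for (const field of STRING_FIELDS) {
    const value = rec[field];
    if (typeof value !== "string" || value.trim() === "") {
      errors.push(`${field} must be a non-empty string`);
    }
  }
  for (const field of LIST_FIELDS) {
    const value = rec[field];
    if (!Array.isArray(value) || !value.every((item) => typeof item === "string")) {
      errors.push(`${field} must be an array of strings`);
    }
  }
  // Date.parse is lenient, so this only catches values that do not parse at all.
  const date = rec.date;
  if (typeof date === "string" && date.trim() !== "" && Number.isNaN(Date.parse(date))) {
    errors.push("date must be a parseable timestamp");
  }
  return errors;
}

// Type guard so downstream code can rely on the full shape.
function isDecisionRecord(input: unknown): input is DecisionRecord {
  return validateDecisionRecord(input).length === 0;
}

// The example row from the table above.
const example: DecisionRecord = {
  decision: "Adopt PostgreSQL for the analytics data store",
  rationale: "Needed JSON support; team has existing Postgres expertise",
  alternatives: ["MongoDB (scaling concerns)", "BigQuery (cost at our volume)"],
  date: "2026-02-14T10:30:00Z",
  decider: "Sarah Chen, VP Engineering",
  affected_teams: ["Data Platform", "Product Analytics", "Backend"],
  review_trigger: "If monthly analytics queries exceed 50M rows or BigQuery drops below $3/TB",
};
```

Returning a list of problems instead of a boolean gives a reviewer something to act on when the extractor drops a field.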
Extraction is a language problem. Re-surfacing is a condition-evaluation problem. Different prompts, different evals.
The extractor runs weekly, processing every transcript generated since its last run. The scanner runs daily. Two agents, two distinct responsibilities:
Agent 1: Decision Extractor. Takes raw transcript text. Identifies segments where decisions actually happened. Outputs structured records that conform to the schema above.[6] During the first eight weeks, those records route through a human review step before reaching active status.
Agent 2: Review Scanner. Runs daily. Reads all existing records. Evaluates each review trigger against current conditions. Flags any record where the world has moved.
Keep them separate. The extractor needs strong language understanding and reliable structured output. The scanner needs to query external systems and reason about whether a condition fires. Different skills, different prompts, different evals. Merge them into one agent and both jobs degrade.
The human-in-the-loop step in the diagram is not optional in early weeks. It is what builds the ground truth your eval depends on.
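The review gate itself is small. A minimal sketch, assuming the eight-week full-review window and the 20% spot-check rate used in this article; `needsHumanReview` and its parameters are hypothetical names, and the random sample is injectable so the rule can be tested.

```typescript
// Sketch only: thresholds are the ones stated in this article, not a standard.
const FULL_REVIEW_WEEKS = 8;
const SPOT_CHECK_RATE = 0.2;

// `sample` is a number in [0, 1); pass a fixed value in tests.
function needsHumanReview(weeksSinceLaunch: number, sample: number = Math.random()): boolean {
  // During the first eight weeks every extracted record is reviewed.
  if (weeksSinceLaunch < FULL_REVIEW_WEEKS) return true;
  // After that, roughly one record in five is spot-checked.
  return sample < SPOT_CHECK_RATE;
}
```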
Most transcripts are messy. Most decisions are implicit. The prompt has to handle both.
Extraction is the hard part. Transcripts are messy. Decisions are usually implicit. Someone says "okay, let's go with option B then" — that is a decision. There is no formal announcement. No gavel. No declaration.
The prompt has to handle several categories of decision language:
The prompt also has to distinguish decisions from opinions, preferences, and speculation. "I think we should use Postgres" is not a decision. "We're going with Postgres" is. The whole system fails if the extractor cannot tell those two apart.
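The opinion-versus-commitment split can be illustrated with a crude phrase check. This is a sketch, not a replacement for the extractor: the patterns are lifted from the examples in this article, `classifyUtterance` is a hypothetical helper, and real transcripts will need far more phrasing than this. It is mainly useful for pulling candidate lines when you build a labeled test set.

```typescript
// Sketch only: a phrase-level heuristic, assumed useful for sampling eval candidates.
type UtteranceKind = "commitment" | "non_decision" | "unclear";

// Opinions, proposals, and speculation (no `g` flag, so test() is stateless).
const NON_DECISION_PATTERNS: RegExp[] = [
  /\bI think we should\b/i,
  /\bwhat if we\b/i,
  /\bwe might want to\b/i,
];

// Commitment language; both straight and curly apostrophes are accepted.
const COMMITMENT_PATTERNS: RegExp[] = [
  /\bwe['’]re going with\b/i,
  /\blet['’]s go with\b/i,
];

function classifyUtterance(text: string): UtteranceKind {
  // Hedged language wins: a hedge anywhere in the line blocks a "commitment" label.
  if (NON_DECISION_PATTERNS.some((pattern) => pattern.test(text))) return "non_decision";
  if (COMMITMENT_PATTERNS.some((pattern) => pattern.test(text))) return "commitment";
  return "unclear";
}
```

Checking hedges first matches the precision-first stance: a missed decision costs less than a false one.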
One practical detail: transcripts from different sources have different quality profiles. Otter.ai and Fireflies produce speaker-attributed transcripts that let the agent assign the decider field reliably. Google Meet's built-in transcription is accurate but attribution is less consistent. The hard requirement is speaker attribution — without knowing who said what, the decider field becomes unreliable, and a record without a decider is half a record.
```typescript
// prompts/extract-decisions.ts
// One prompt. Negative examples carry as much weight as positive ones.
const DECISION_EXTRACTION_PROMPT = `
You are a decision extraction agent. You will receive a meeting transcript
and must identify all decisions that were made during the meeting.

A DECISION is a commitment to a specific course of action that was agreed
upon or authorized by someone with the authority to do so.

A decision is NOT:
- An opinion or preference ("I think we should...")
- A question or proposal under discussion ("What if we...")
- An action item without a preceding choice ("John will send the report")
- Speculation about future plans ("We might want to consider...")

For each decision found, extract ALL of the following fields:
1. decision: A clear, concise statement of what was decided. Use active
   voice. Start with a verb when possible.
2. rationale: Why this choice was made. Pull actual reasoning from the
   transcript; paraphrase but preserve the logic.
3. alternatives: Other options that were mentioned during discussion.
   For each, note why it was not chosen if stated.
4. date: Use the meeting date provided in the transcript metadata.
5. decider: The person who made the final call or the group if by
   consensus. Use full names when available.
6. affected_teams: Teams or groups whose work will change. Infer from
   context if not explicitly stated.
7. review_trigger: Define a specific, measurable condition that should
   cause this decision to be revisited. Do NOT use time-based triggers
   like "revisit in 6 months." Instead, identify what assumption or
   constraint would need to change.

IMPORTANT: The review_trigger must be falsifiable and externally
verifiable. Good: "If customer churn exceeds 5% monthly." Bad:
"If things change significantly."

Output a JSON array of decision records. If no decisions were made
in the transcript, return an empty array.
`;
```

A trigger is only useful if a machine can evaluate it without subjective judgment.
Vague triggers:

- Revisit this decision next quarter
- Review if circumstances change
- Check back in 6 months
- Reconsider if the team grows
- Reassess when we have more data

Strong triggers:

- If monthly active users exceed 50,000
- If the Datadog bill exceeds $15K/month for two consecutive months
- If more than 3 engineers request TypeScript migration in feedback surveys
- If competitor Y launches a self-serve tier below $99/month
- If P95 API latency exceeds 400ms on the current architecture
Strong review triggers share three properties. They are specific — they reference a measurable quantity or observable event. They are falsifiable — you can check whether the condition is true or false without subjective judgment. They are externally verifiable — the scanner can query monitoring dashboards, billing systems, surveys, or competitive intel feeds to evaluate them.
The trigger is what turns a record from a historical artifact into an active governance signal. Without it, you are building a better filing cabinet. With it, you are building an early warning system that pings you when your own assumptions stop holding.
When writing triggers at extraction time, the agent often produces vague ones by default. A post-processing step that evaluates each trigger against three questions — "Is this measurable? Can a machine check it? What system holds the data?" — catches the failures before they enter the registry.
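A cheap first pass at that post-processing step can be done with pattern checks before any model is involved. The sketch below is an assumption about how such a lint might look: `lintTrigger` and its regexes are illustrative, tuned only against the example triggers in this article. A purely event-based trigger with no number in it would be flagged as not measurable and need a human decision.

```typescript
// Sketch only: a heuristic lint, not a substitute for reviewer judgment.
interface TriggerLint {
  ok: boolean;
  problems: string[];
}

const TIME_BASED = /\b(next (quarter|month|year)|in \d+ (days?|weeks?|months?|years?)|check back)\b/i;
const VAGUE = /\b(circumstances change|things change|more data|significantly)\b/i;
const HAS_NUMBER = /\d/;
const HAS_COMPARISON = /\b(exceeds?|below|above|drops?|more than|fewer than|less than|at least)\b/i;

// `dataSource` answers "what system holds the data?"; leave it undefined if nobody knows yet.
function lintTrigger(trigger: string, dataSource?: string): TriggerLint {
  const problems: string[] = [];
  if (TIME_BASED.test(trigger)) problems.push("time-based: name the assumption, not a date");
  if (VAGUE.test(trigger)) problems.push("vague: no observable condition");
  if (!HAS_NUMBER.test(trigger) || !HAS_COMPARISON.test(trigger)) {
    problems.push("not measurable: needs a quantity and a comparison");
  }
  if (!dataSource) problems.push("no data source named");
  return { ok: problems.length === 0, problems };
}
```

Records that fail the lint can stay out of the registry until a reviewer rewrites the trigger.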
| Trigger Category | Example Trigger | Data Source | Check Frequency |
|---|---|---|---|
| metric_threshold | P95 API latency exceeds 400ms | Datadog / Grafana / CloudWatch | Daily |
| cost_threshold | Monthly Datadog bill exceeds $15K for 2 consecutive months | Billing API / Cost Explorer | Monthly |
| market_event | Competitor Y launches self-serve tier below $99/month | Web scraper / news feed | Weekly |
| team_feedback | 3+ engineers request TypeScript migration in retro survey | Survey tool (Typeform, Retrium) | After each retro |
| growth_threshold | Monthly active users exceed 50,000 | Product analytics (Amplitude, Mixpanel) | Daily |
| vendor_pricing | BigQuery drops below $3/TB processed | Vendor pricing page scraper | Monthly |
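The table maps directly onto a lookup the scanner can use to decide where to look and how often. A minimal sketch, assuming the six categories above; `TRIGGER_ROUTES`, `isCheckDue`, and the 30-day approximation of "monthly" are illustrative choices, not part of any framework.

```typescript
// Sketch only: categories, sources, and frequencies are copied from the table above.
type TriggerCategory =
  | "metric_threshold"
  | "cost_threshold"
  | "market_event"
  | "team_feedback"
  | "growth_threshold"
  | "vendor_pricing";

type CheckFrequency = "daily" | "weekly" | "monthly" | "after_retro";

interface TriggerRoute {
  dataSource: string;
  frequency: CheckFrequency;
}

const TRIGGER_ROUTES: Record<TriggerCategory, TriggerRoute> = {
  metric_threshold: { dataSource: "Datadog / Grafana / CloudWatch", frequency: "daily" },
  cost_threshold: { dataSource: "Billing API / Cost Explorer", frequency: "monthly" },
  market_event: { dataSource: "Web scraper / news feed", frequency: "weekly" },
  team_feedback: { dataSource: "Survey tool", frequency: "after_retro" },
  growth_threshold: { dataSource: "Product analytics", frequency: "daily" },
  vendor_pricing: { dataSource: "Vendor pricing page scraper", frequency: "monthly" },
};

// "monthly" is approximated as 30 days for scheduling purposes.
const INTERVAL_DAYS = { daily: 1, weekly: 7, monthly: 30 };
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Decides whether the scanner should evaluate a trigger of this category now.
function isCheckDue(
  category: TriggerCategory,
  lastChecked: Date | null,
  now: Date,
  retroJustEnded: boolean = false,
): boolean {
  const { frequency } = TRIGGER_ROUTES[category];
  // Event-driven checks ignore the clock entirely.
  if (frequency === "after_retro") return retroJustEnded;
  // Never checked before: check now.
  if (lastChecked === null) return true;
  const elapsedDays = (now.getTime() - lastChecked.getTime()) / MS_PER_DAY;
  return elapsedDays >= INTERVAL_DAYS[frequency];
}
```

Typing the map as `Record<TriggerCategory, TriggerRoute>` means adding a category without a route fails at compile time.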
Classify the trigger, route to the right data source, evaluate, escalate if it fires.
```typescript
const decisions = await db.decisions.findMany({
  where: { status: 'active' },
  orderBy: { date: 'desc' }
});
```

```typescript
// The remaining steps run once per record: `decision` is one element of `decisions`.
const classified = await agent.classify(decision.review_trigger, {
  categories: [
    'metric_threshold', // Check monitoring/analytics
    'market_event',     // Check news/competitive intel
    'team_feedback',    // Check survey/retro data
    'time_elapsed',     // Simple calendar check
    'external_pricing', // Check vendor pricing pages
  ]
});
```

```typescript
const result = await evaluateTrigger({
  trigger: decision.review_trigger,
  category: classified.category,
  dataSources: getSourcesForCategory(classified.category),
});
// result: { triggered: boolean, evidence: string, confidence: number }
```

```typescript
if (result.triggered && result.confidence > 0.8) {
  await createReviewRequest({
    decision,
    triggerEvidence: result.evidence,
    originalContext: decision.rationale,
    suggestedReviewers: decision.affected_teams,
  });
}
```

Multi-meeting threads, contradictions, and missing authority: none of them are anomalies.
Some decisions span meetings: discussed in one, finalized in another.

- Track 'pending_decision' status for discussions that have not resolved
- Link related records with a thread_id so the deliberation history is preserved
- Promote to 'active' only when explicit agreement or an authority call is detected

Later decisions contradict earlier ones without explicit reference.

- The scanner detects semantic overlap between new and existing records
- Flag contradictions for human review; never auto-resolve
- Keep both records with cross-references; the contradiction is the signal

Many meetings have no clear decision-maker present.

- When authority is ambiguous, tag the record with 'consensus' as decider
- Block promotion to 'active' until a designated owner validates
- Wire in an org-chart integration so the agent infers domain authority automatically
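The promotion and contradiction rules above fit in a few lines. This is a sketch under stated assumptions: `explicitAgreement`, `ownerValidated`, `contradicts`, and `needsHumanReview` are hypothetical fields added for illustration, and detecting the contradiction itself (the semantic-overlap step) is out of scope here.

```typescript
// Sketch only: status values and thread_id come from the rules above; other fields are assumed.
type RecordStatus = "pending_decision" | "active";

interface ThreadedRecord {
  id: string;
  thread_id: string;
  status: RecordStatus;
  decider: string;
  explicitAgreement: boolean; // set when agreement or an authority call was detected
  ownerValidated: boolean;    // set when a designated owner has signed off
  contradicts: string[];      // ids of records this one conflicts with
  needsHumanReview: boolean;
}

// A pending record may become active only with explicit agreement,
// and a 'consensus' decider additionally needs owner validation.
function canPromote(record: ThreadedRecord): boolean {
  if (record.status !== "pending_decision") return false;
  if (!record.explicitAgreement) return false;
  if (record.decider === "consensus" && !record.ownerValidated) return false;
  return true;
}

// Cross-references both records and flags them; neither status changes,
// so nothing is auto-resolved.
function linkContradiction(a: ThreadedRecord, b: ThreadedRecord): [ThreadedRecord, ThreadedRecord] {
  return [
    { ...a, contradicts: [...a.contradicts, b.id], needsHumanReview: true },
    { ...b, contradicts: [...b.contradicts, a.id], needsHumanReview: true },
  ];
}
```

`linkContradiction` returns new objects and leaves the inputs untouched, which keeps the original records available for an audit trail.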
Without labeled transcripts, you are guessing. Without precision, you are eroding trust.
Generic prompts against meeting transcripts run around 70% field-level accuracy on the first pass. Few-shot prompting with real examples from your org pushes that number to the 85–90% range — but the gain comes from the examples matching your culture, not from prompt cleverness.[7] The model needs to see how decisions get expressed in your meeting cadences, with your jargon, by your people.
Few-shot improvements are not uniform across fields. The decision field is easiest: the model reliably identifies commitment language. The review_trigger field is hardest: it requires inferring what assumption underlies the decision, then specifying the condition that would invalidate it. Plan for the trigger field to need the most human correction in early weeks.
Field-level accuracy from LLM structured output benchmarks (2025) shows a similar pattern: high-frequency, well-structured fields (dates, names) hit 90%+ accuracy quickly; inferential fields (rationale, trigger) stabilize more slowly and benefit most from domain-specific examples.[12]
Have a human analyst manually label 20-30 transcripts, marking every actual decision. That set becomes your eval. Without ground truth, prompt changes are guesses — you cannot tell whether a tweak helped or made things worse.
Precision tells you what fraction of extracted records are real decisions. Recall tells you what fraction of real decisions were captured. Optimize for precision first. False positives destroy trust faster than missed extractions.
Generic prompts run around 70% accuracy. Three to five examples from your actual transcripts — your people, your jargon, your meeting cadence — push accuracy to 85-90%.[7] The examples teach the model how decisions get expressed in your culture, which is the part no generic prompt can know.
Route every extracted record through a quick human review before it reaches active status. The reviewer corrects errors, and the corrections feed back into the prompt. After 8 weeks, drop to spot-checking 20% of extractions. Cut human-in-the-loop review too early and silent prompt drift compounds.
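Scoring a prompt change against the labeled set comes down to two ratios. A minimal sketch, assuming a reviewer has already matched each extracted record to a labeled decision id (or to a fresh id when it matches nothing); `scoreExtraction` is an illustrative name, and the convention of returning 1 for an empty denominator is a choice, not a rule.

```typescript
// Sketch only: ids are assumed to be assigned during human matching against ground truth.
interface ExtractionScore {
  truePositives: number;
  falsePositives: number;
  falseNegatives: number;
  precision: number; // fraction of extracted records that are real decisions
  recall: number;    // fraction of real decisions that were extracted
}

function scoreExtraction(extractedIds: string[], labeledIds: string[]): ExtractionScore {
  const truth = new Set(labeledIds);
  const found = new Set(extractedIds); // de-duplicates repeated extractions

  let truePositives = 0;
  found.forEach((id) => {
    if (truth.has(id)) truePositives += 1;
  });

  const falsePositives = found.size - truePositives;
  const falseNegatives = truth.size - truePositives;
  return {
    truePositives,
    falsePositives,
    falseNegatives,
    // Nothing extracted means nothing wrong was extracted; nothing labeled means nothing was missed.
    precision: found.size === 0 ? 1 : truePositives / found.size,
    recall: truth.size === 0 ? 1 : truePositives / truth.size,
  };
}

// Five labeled decisions; the extractor found three of them plus one spurious record.
const score = scoreExtraction(["d1", "d2", "d3", "x1"], ["d1", "d2", "d3", "d4", "d5"]);
// score.precision is 0.75, score.recall is 0.6
```

Run the same function before and after every prompt edit; if precision drops, the edit does not ship.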
The minimum viable surface — pilot before org-wide rollout.
- Transcript source wired up: Otter, Fireflies, Google Meet, or custom
- Decision record schema defined in the data store, all seven fields enforced
- Extraction prompt tested against 10+ real transcripts before production
- Agent configured with structured output enforcement, not free-text
- Review-trigger classification taxonomy specified
- Data sources connected for trigger evaluation: metrics, billing, surveys
- Scanner agent deployed on a daily or weekly schedule
- Notification routing in place (Slack channel or email) for triggered reviews
- Searchable dashboard live for browsing the registry
- 4-week pilot with one team before any org-wide rollout
The wrong team size or decision culture will make the registry rot faster than it fills.
This system earns its keep around 8–10 people. Below that, institutional memory lives in a few heads and informal communication is enough — you do not need a registry when the whole team fits in one Slack DM. Above 8–10, the combinatorial explosion of who-knows-what makes systematic capture worth building. For distributed orgs above 50, it is close to mandatory.
There are three situations where the system actively backfires:
High-trust, low-documentation cultures. Some orgs run on implicit authority and fast iteration. Introducing a formal registry slows the decision loop in ways that cost more than the re-litigation it prevents. Fit matters. A team on a quarterly planning cadence benefits more than a sprint team shipping every week.
Orgs with weak transcript hygiene. The extractor is only as good as its input. If transcripts are incomplete, speaker-unattributed, or generated from meetings where key discussion happened off-channel, extraction quality degrades fast. Fix the transcript problem first.
When the registry becomes political. Decision records cited selectively to shut down legitimate reconsideration turn the system from memory into a veto mechanism. The healthy norm is: the review-trigger path is the legitimate route to revisiting a call. If that path gets blocked or ignored, the registry calcifies into bureaucratic lock-in — worse than the filing cabinet it replaced.
Onboarding gets faster. Re-litigation drops. Decision-making gets more deliberate.
Two quarters in, the decision registry becomes a real asset. New hires search it during onboarding instead of asking "why do we do it this way?" in every meeting. Planning sessions reference specific records instead of leaning on collective memory. When a trigger fires, the team revisits the call with full context — original rationale, alternatives, and the specific condition that flipped.
The second-order shift is cultural. Once people know decisions are being captured and indexed, they articulate reasoning more clearly in the meeting itself. Once they know review triggers exist, they think harder about what conditions would invalidate the choice. The system makes the org more deliberate, not just better at remembering.[8]
The registry does not prevent bad decisions. It prevents the specific failure mode where good decisions get reversed for bad reasons — or bad decisions never get revisited because no one remembers why they were made.
How accurate is AI extraction compared to manual decision logging?
With a tuned prompt and 3-5 few-shot examples from your own transcripts, accuracy lands around 85-90% on decision identification. That number varies by transcript quality and meeting style. Field-level accuracy splits: decision and date are reliable. The review_trigger field is the hardest and benefits most from human review during the first weeks. Treat any external benchmark as a starting estimate. Measure against your own ground truth.
What transcript sources work best with this workflow?
Anything that produces timestamped, speaker-attributed transcripts. Otter.ai, Fireflies.ai, Google Meet's built-in transcription — all usable. The hard requirement is speaker attribution. Without knowing who said what, the decider field is unreliable, and a record without a decider is half a record.
How do you keep the registry from becoming another graveyard?
The review-trigger mechanism is the answer. Static documentation rots. Triggered reviews push records back into view when conditions change. Pair the registry with a searchable dashboard and Slack notifications, and the system reaches out to people instead of waiting to be found. Reach-out is the difference.
Can this work for async decisions made in Slack or email?
Yes, with prompt adjustments. Written communication patterns differ from spoken ones. Async decisions tend to be more explicit — people write "Decision: we're doing X" — which makes extraction easier. The harder problem is identifying the right threads to process. Channel filters and role-based selectors handle most of it.
What is the minimum team size where this earns its keep?
Around 8-10 people is where return starts showing up. Below that, institutional memory lives in a few heads and informal communication is enough — you do not need a system when the whole team fits in one Slack DM. Above 8-10, the combinatorial explosion of who-knows-what makes systematic capture worth building. For distributed orgs above 50, it is close to mandatory.
How does this interact with existing ADR practices in engineering?
Cleanly, if you keep them separate. ADRs are human-authored records for architecture decisions — they belong in version control alongside the code they govern. The decision registry covers a broader class of organizational choices (vendor, hiring, process, strategy) and is auto-populated from transcripts. Wire up the registry to detect when an extracted decision matches an ADR topic, then link the records. The ADR stays the authoritative technical record; the registry entry adds the meeting context and review trigger.
Records without triggers default to static documentation. The trigger is what makes the system push instead of wait.
Vague triggers like "if things change" cannot be evaluated by the scanner. The trigger references a specific, measurable condition or it is not a trigger.
A new decision that overrides an old one must reference the original. The evolution of thinking is the asset. Lose the link, lose the asset.
Prompt drift is real. Models change, meeting culture changes, jargon shifts. Monthly measurement is the only thing that stops silent degradation. Skip it and you will only notice when trust has already collapsed.
A record that gets cited to shut down legitimate reconsideration is being misused. Any team member can request review via the trigger path. That path must stay open.