Organizational Memory: A Two-Agent Decision-Record Workflow

Organizational Memory: A Two-Agent Workflow That Keeps Decisions Retrievable

Meeting transcripts produce decisions. The decisions vanish into a Notion graveyard within thirty days. A two-agent workflow extracts structured records and attaches review triggers that fire when conditions actually change — not on a calendar.

What this covers

✓ Why decision decay is a structural problem, not a discipline problem
✓ A seven-field decision record schema borrowed and extended from ADR practice
✓ A two-agent extraction + scanning workflow: concrete prompts, routing logic, code
✓ Condition-based review triggers vs. calendar-based theater
✓ Edge cases: multi-meeting threads, contradictions, implicit authority
✓ Prompt tuning strategy: ground truth, precision vs. recall, few-shot calibration
✓ A setup checklist and governance rules for keeping the registry alive

Engineering managers in a 2024 survey reported losing roughly four hours per week to meetings that relitigate decisions already made. Ten percent of a forty-hour week, spent re-arguing settled ground because the original reasoning never made it into a form anyone could find.

Every org makes hundreds of decisions a quarter. Strategy pivots, vendor selections, hiring freezes, architecture choices, budget calls. They happen in a meeting. They get captured in notes. They vanish into a Notion graveyard nobody searches.

Three months later someone asks why you picked Vendor X over Vendor Y. The answer lives in a Tuesday standup transcript nobody tagged. The person who made the call left the company. The rationale is gone.

IDC put a number on it: Fortune 500 companies lose roughly $31.5 billion a year by failing to share knowledge.^[9] That figure covers re-work, re-explanation, and decisions made blind — without knowing what was already settled and why. For a 200-person engineering org, the math is less dramatic but still real: when institutional knowledge leaves with each departing engineer, it costs up to 213% of that person's annual salary to rebuild at equivalent proficiency.^[10]

This is the org-memory failure mode. Teams re-litigate settled questions. They reverse decisions without knowing the originals existed. They build on assumptions that were explicitly rejected six months earlier.

The fix is not another documentation initiative — your team already has documentation fatigue. The fix is an agent workflow that processes transcripts on a cadence, extracts decisions into a structured schema, and attaches review triggers that tell a separate scanning agent when to surface them again. Capture is one job. Re-surfacing is a different job. Both have to ship.

Decisions Decay. Without Enforcement, Decay Is the Default.

Three failure modes turn last quarter's reasoning into this quarter's argument.

~67% of decisions undocumented within 30 days, per Atlassian's 2024 Teamwork Survey. Rates vary by org size and culture.

~4 hrs of weekly time spent re-arguing previously decided topics, self-reported across 12 companies. Individual experiences vary.

~23% of reversed decisions had no record of original rationale in surveyed orgs. Yours will differ based on documentation culture.

Most knowledge management systems treat decisions as a documentation problem. Write it down. Put it somewhere. Hope someone finds it. Documentation alone has three failure modes that compound:

Capture failure. Nobody writes the decision down. The meeting ends, the next thing starts. Action items get tracked because somebody assigns them. The reasoning behind the action — the part that ages well — does not.

Retrieval failure. The decision was documented but lives in a format that resists search. Buried in paragraph seven of February twelfth's notes, tagged only with the meeting series name. Nobody thinks to look there when the question resurfaces.

Staleness failure. The decision was correct when made. Conditions changed. The vendor you rejected raised a round and dropped pricing forty percent. The constraint that drove the architecture call got removed. No mechanism flagged it for review.

A structured record solves all three. Automated extraction handles capture. Indexed schema handles retrieval. A review-trigger field — the piece nearly every system misses — handles staleness.

Knowledge workers already spend an average of 1.8–2.5 hours per day searching for information they need to do their jobs.^[11] Decision records don't eliminate that search tax — but they shrink the worst version of it: the search for institutional reasoning that left the building with a former colleague.

Seven Fields. Anything Missing Means the Record Will Rot.

The schema is the contract. Without rationale and alternatives, you are just logging events.

The schema borrows from Architectural Decision Records (ADRs) but extends the concept past software architecture into general organizational decisions.^[1] ADRs have been battle-tested at companies from startups to enterprises — AWS documents the practice across teams of varying sizes.^[2] The load-bearing insight from ADR practice: recording alternatives considered and rationale matters more than recording the decision itself.^[3] The decision is the easy part. The reasoning is what disappears.

Recent LLM research on automated ADR generation confirms the same field hierarchy. Structured context strategies that capture rationale and constraints outperform those that capture only the decision outcome — the model produces more complete, accurate records when it has access to the deliberation, not just the conclusion.^[12]

Field	Type	Purpose	Example
decision	string	The specific choice made	Adopt PostgreSQL for the analytics data store
rationale	string	Why this option won over alternatives	Needed JSON support; team has existing Postgres expertise
alternatives	string[]	Options rejected and the reason each was cut	MongoDB (scaling concerns), BigQuery (cost at our volume)
date	ISO 8601	When the decision was finalized	2026-02-14T10:30:00Z
decider	string	Person or group with final authority	Sarah Chen, VP Engineering
affected_teams	string[]	Teams whose work changes because of this decision	Data Platform, Product Analytics, Backend
review_trigger	string	Specific condition that forces re-evaluation	If monthly analytics queries exceed 50M rows or BigQuery drops below $3/TB
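One way to make the schema enforceable is to express it as a type plus a runtime check. The sketch below is illustrative only: the `DecisionRecord` interface and `validateDecisionRecord` helper are hypothetical names, and allowing an empty `alternatives` list is an assumption, not something the schema above dictates.

```typescript
// Hypothetical type mirroring the seven-field table; adapt names to your data store.
interface DecisionRecord {
  decision: string;
  rationale: string;
  alternatives: string[];
  date: string; // ISO 8601 timestamp
  decider: string;
  affected_teams: string[];
  review_trigger: string;
}

const TEXT_FIELDS = ["decision", "rationale", "date", "decider", "review_trigger"] as const;
const LIST_FIELDS = ["alternatives", "affected_teams"] as const;

// Returns a list of problems; an empty list means all seven fields are usable.
function validateDecisionRecord(input: unknown): string[] {
  if (typeof input !== "object" || input === null) {
    return ["record is not an object"];
  }
  const candidate = input as { [key: string]: unknown };
  const problems: string[] = [];

  for (const field of TEXT_FIELDS) {
    const value = candidate[field];
    if (typeof value !== "string" || value.trim() === "") {
      problems.push(`${field} must be a non-empty string`);
    }
  }
  for (const field of LIST_FIELDS) {
    const value = candidate[field];
    // Assumption: an empty list is allowed (for example, no alternatives were discussed).
    if (!Array.isArray(value) || !value.every((item) => typeof item === "string")) {
      problems.push(`${field} must be an array of strings`);
    }
  }
  // Date.parse returns NaN for strings it cannot interpret.
  const dateValue = candidate["date"];
  if (typeof dateValue === "string" && dateValue.trim() !== "" && Number.isNaN(Date.parse(dateValue))) {
    problems.push("date is not a parseable timestamp");
  }
  return problems;
}

// The example row from the table, expressed as a record.
const exampleRecord: DecisionRecord = {
  decision: "Adopt PostgreSQL for the analytics data store",
  rationale: "Needed JSON support; team has existing Postgres expertise",
  alternatives: ["MongoDB (scaling concerns)", "BigQuery (cost at our volume)"],
  date: "2026-02-14T10:30:00Z",
  decider: "Sarah Chen, VP Engineering",
  affected_teams: ["Data Platform", "Product Analytics", "Backend"],
  review_trigger: "If monthly analytics queries exceed 50M rows or BigQuery drops below $3/TB",
};
```

Returning a list of problems rather than a boolean gives a human reviewer something concrete to correct when a record is rejected.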

Two Agents, Two Jobs. Conflating Them Is How Both Fail.

Extraction is a language problem. Re-surfacing is a condition-evaluation problem. Different prompts, different evals.

Two-Agent Decision Architecture

The extractor runs weekly against transcripts. The scanner runs daily against current conditions. The registry is the only handoff between them.

Extraction runs weekly, processing every transcript generated since the last run. Scanning runs daily. Two agents, two distinct responsibilities:

Agent 1: Decision Extractor. Takes raw transcript text. Identifies segments where decisions actually happened. Outputs structured records that conform to the schema above.^[6] During the first eight weeks, those records route through a human review step before reaching active status.

Agent 2: Review Scanner. Runs daily. Reads all existing records. Evaluates each review trigger against current conditions. Flags any record where the world has moved.

Keep them separate. The extractor needs strong language understanding and reliable structured output. The scanner needs to query external systems and reason about whether a condition fires. Different skills, different prompts, different evals. Merge them into one agent and both jobs degrade.

The human-in-the-loop step in the diagram is not optional in early weeks. It is what builds the ground truth your eval depends on.

The Hard Part Is Telling Decisions From Opinions

Most transcripts are messy. Most decisions are implicit. The prompt has to handle both.

Extraction is the hard part. Transcripts are messy. Decisions are usually implicit. Someone says "okay, let's go with option B then" — that is a decision. There is no formal announcement. No gavel. No declaration.

The prompt has to handle several categories of decision language:

Explicit decisions: "We've decided to…" or "The decision is…"
Implicit consensus: "Sounds like we're aligned on…" or "Let's move forward with…"
Authority decisions: "I'm calling this — we'll go with…" or "As the owner here, I want to…"
Negative decisions: "We're not doing X" or "Let's table that for now"

The prompt also has to distinguish decisions from opinions, preferences, and speculation. "I think we should use Postgres" is not a decision. "We're going with Postgres" is. The whole system fails if the extractor cannot tell those two apart.

One practical detail: transcripts from different sources have different quality profiles. Otter.ai and Fireflies produce speaker-attributed transcripts that let the agent assign the decider field reliably. Google Meet's built-in transcription is accurate but attribution is less consistent. The hard requirement is speaker attribution — without knowing who said what, the decider field becomes unreliable, and a record without a decider is half a record.

prompts/extract-decisions.ts

// One prompt. Negative examples carry as much weight as positive ones.
const DECISION_EXTRACTION_PROMPT = `
You are a decision extraction agent. You will receive a meeting transcript
and must identify all decisions that were made during the meeting.

A DECISION is a commitment to a specific course of action that was agreed
upon or authorized by someone with the authority to do so.

A decision is NOT:
- An opinion or preference ("I think we should...")
- A question or proposal under discussion ("What if we...")
- An action item without a preceding choice ("John will send the report")
- Speculation about future plans ("We might want to consider...")

For each decision found, extract ALL of the following fields:

1. decision: A clear, concise statement of what was decided. Use active
   voice. Start with a verb when possible.
2. rationale: Why this choice was made. Pull actual reasoning from the
   transcript — paraphrase but preserve the logic.
3. alternatives: Other options that were mentioned during discussion.
   For each, note why it was not chosen if stated.
4. date: Use the meeting date provided in the transcript metadata.
5. decider: The person who made the final call or the group if by
   consensus. Use full names when available.
6. affected_teams: Teams or groups whose work will change. Infer from
   context if not explicitly stated.
7. review_trigger: Define a specific, measurable condition that should
   cause this decision to be revisited. Do NOT use time-based triggers
   like "revisit in 6 months." Instead, identify what assumption or
   constraint would need to change.

IMPORTANT: The review_trigger must be falsifiable and externally
verifiable. Good: "If customer churn exceeds 5% monthly." Bad:
"If things change significantly."

Output a JSON array of decision records. If no decisions were made
in the transcript, return an empty array.
`;
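A prompt that asks for a JSON array still needs a defensive wrapper, because model output is not guaranteed to parse. The following is a minimal sketch under stated assumptions: `ModelCall` is a hypothetical function type standing in for whatever model client you use (it is not a real SDK signature), and `extractDecisions` is an illustrative helper that takes the system prompt as an argument.

```typescript
// Hypothetical signature for your model client; not a real SDK call.
type ModelCall = (systemPrompt: string, userMessage: string) => Promise<string>;

interface ExtractionResult {
  records: unknown[];   // items that parsed as JSON objects; validate fields downstream
  error: string | null; // set when the model output was not a JSON array
}

async function extractDecisions(
  systemPrompt: string,
  transcript: string,
  meetingDate: string,
  callModel: ModelCall,
): Promise<ExtractionResult> {
  // The prompt takes the date from transcript metadata, so supply it explicitly.
  const userMessage = `Meeting date: ${meetingDate}\n\nTranscript:\n${transcript}`;
  const raw = await callModel(systemPrompt, userMessage);

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { records: [], error: "model output was not valid JSON" };
  }
  if (!Array.isArray(parsed)) {
    return { records: [], error: "model output was not a JSON array" };
  }
  // Drop stray strings or numbers; keep only object-shaped items.
  const records = parsed.filter(
    (item) => typeof item === "object" && item !== null && !Array.isArray(item),
  );
  return { records, error: null };
}
```

Surfacing a parse failure as an `error` value, instead of treating it as "no decisions", keeps a broken run from looking like a quiet week.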

Time-Based Triggers Are Theater. Condition-Based Triggers Are Enforcement.

A trigger is only useful if a machine can evaluate it without subjective judgment.

Theater

Revisit this decision next quarter
Review if circumstances change
Check back in 6 months
Reconsider if the team grows
Reassess when we have more data

Enforcement

If monthly active users exceed 50,000
If the Datadog bill exceeds $15K/month for two consecutive months
If more than 3 engineers request TypeScript migration in feedback surveys
If competitor Y launches a self-serve tier below $99/month
If P95 API latency exceeds 400ms on the current architecture

Strong review triggers share three properties. They are specific — they reference a measurable quantity or observable event. They are falsifiable — you can check whether the condition is true or false without subjective judgment. They are externally verifiable — the scanner can query monitoring dashboards, billing systems, surveys, or competitive intel feeds to evaluate them.

The trigger is what turns a record from a historical artifact into an active governance signal. Without it, you are building a better filing cabinet. With it, you are building an early warning system that pings you when your own assumptions stop holding.

When writing triggers at extraction time, the agent often produces vague ones by default. A post-processing step that evaluates each trigger against three questions — "Is this measurable? Can a machine check it? What system holds the data?" — catches the failures before they enter the registry.
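The three questions can be approximated with a cheap lint pass before any record reaches the registry. This is a heuristic sketch, not a definitive check: `lintTrigger` and the pattern list are hypothetical, "measurable" is approximated as "contains a number", and the data source is assumed to arrive as a separate value alongside the trigger text.

```typescript
interface TriggerLintResult {
  ok: boolean;
  problems: string[];
}

// Calendar-based or vague wording of the kind listed under "Theater".
const VAGUE_PATTERNS: RegExp[] = [
  /\bnext (quarter|year|month)\b/i,
  /\b(revisit|check back|review) in \d+ (days?|weeks?|months?|years?)\b/i,
  /\bcircumstances change\b/i,
  /\bthings change\b/i,
  /\bmore data\b/i,
];

function lintTrigger(trigger: string, dataSource: string | null): TriggerLintResult {
  const problems: string[] = [];

  // "Is this measurable?" Approximated as: the trigger names at least one number.
  if (!/\d/.test(trigger)) {
    problems.push("no measurable quantity");
  }
  // "Can a machine check it?" Approximated as: no calendar-only or vague phrasing.
  if (VAGUE_PATTERNS.some((pattern) => pattern.test(trigger))) {
    problems.push("calendar-based or vague wording");
  }
  // "What system holds the data?"
  if (dataSource === null || dataSource.trim() === "") {
    problems.push("no data source named");
  }
  return { ok: problems.length === 0, problems };
}
```

A heuristic like this will also flag legitimate event triggers that contain no number, so a failed lint is best treated as a reason to route the record to a human, not as an automatic rejection.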

Trigger Category	Example Trigger	Data Source	Check Frequency
metric_threshold	P95 API latency exceeds 400ms	Datadog / Grafana / CloudWatch	Daily
cost_threshold	Monthly Datadog bill exceeds $15K for 2 consecutive months	Billing API / Cost Explorer	Monthly
market_event	Competitor Y launches self-serve tier below $99/month	Web scraper / news feed	Weekly
team_feedback	3+ engineers request TypeScript migration in retro survey	Survey tool (Typeform, Retrium)	After each retro
growth_threshold	Monthly active users exceed 50,000	Product analytics (Amplitude, Mixpanel)	Daily
vendor_pricing	BigQuery drops below $3/TB processed	Vendor pricing page scraper	Monthly
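The table translates fairly directly into a typed routing map. The sketch below is one possible shape: the source identifiers (`"billing_api"`, `"survey_tool"`, and so on) are placeholder strings, and this `getSourcesForCategory` is a hypothetical implementation, since the article does not show how that lookup is built.

```typescript
type TriggerCategory =
  | "metric_threshold"
  | "cost_threshold"
  | "market_event"
  | "team_feedback"
  | "growth_threshold"
  | "vendor_pricing";

type CheckFrequency = "daily" | "weekly" | "monthly" | "after_each_retro";

interface TriggerRoute {
  dataSources: string[]; // placeholder identifiers, not real integration names
  checkFrequency: CheckFrequency;
}

// One entry per row of the trigger table.
const TRIGGER_ROUTES: Record<TriggerCategory, TriggerRoute> = {
  metric_threshold: { dataSources: ["datadog", "grafana", "cloudwatch"], checkFrequency: "daily" },
  cost_threshold: { dataSources: ["billing_api", "cost_explorer"], checkFrequency: "monthly" },
  market_event: { dataSources: ["web_scraper", "news_feed"], checkFrequency: "weekly" },
  team_feedback: { dataSources: ["survey_tool"], checkFrequency: "after_each_retro" },
  growth_threshold: { dataSources: ["product_analytics"], checkFrequency: "daily" },
  vendor_pricing: { dataSources: ["pricing_page_scraper"], checkFrequency: "monthly" },
};

function getSourcesForCategory(category: TriggerCategory): string[] {
  return TRIGGER_ROUTES[category].dataSources;
}

// Lets a scheduled run skip categories that are not due at this frequency.
function categoriesDue(frequency: CheckFrequency): TriggerCategory[] {
  return (Object.keys(TRIGGER_ROUTES) as TriggerCategory[]).filter(
    (category) => TRIGGER_ROUTES[category].checkFrequency === frequency,
  );
}
```

Typing the map as `Record<TriggerCategory, TriggerRoute>` means adding a category to the union without adding a route fails at compile time, which keeps the taxonomy and the routing from drifting apart.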

The Scanner Is Just a Loop. The Hard Part Is the Data Sources.

Classify the trigger, route to the right data source, evaluate, escalate if it fires.

[01]

Load every active decision record from the registry

typescript

const decisions = await db.decisions.findMany({
  where: { status: 'active' },
  orderBy: { date: 'desc' }
});

[02]

Classify each trigger into a category the scanner can route

typescript

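// Runs once per record, e.g. inside `for (const decision of decisions) { ... }`.
// Categories mirror the trigger table; there is no calendar category by design.
const classified = await agent.classify(decision.review_trigger, {
  categories: [
    'metric_threshold',  // Check monitoring (latency, error rates)
    'cost_threshold',    // Check billing APIs
    'market_event',      // Check news/competitive intel
    'team_feedback',     // Check survey/retro data
    'growth_threshold',  // Check product analytics
    'vendor_pricing',    // Check vendor pricing pages
  ]
});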

[03]

Route to the right data source. Evaluate. Return a structured verdict.

typescript

const result = await evaluateTrigger({
  trigger: decision.review_trigger,
  category: classified.category,
  dataSources: getSourcesForCategory(classified.category),
});
// result: { triggered: boolean, evidence: string, confidence: number }

[04]

If the trigger fires with high confidence, raise a review request with full context

typescript

if (result.triggered && result.confidence > 0.8) {
  await createReviewRequest({
    decision,
    triggerEvidence: result.evidence,
    originalContext: decision.rationale,
    suggestedReviewers: decision.affected_teams,
  });
}
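The four steps can be combined into a single loop. This is a rough sketch under stated assumptions: every dependency is injected through a hypothetical `ScannerDeps` bundle so the loop can be exercised with stubs, and the catch-and-continue error handling is an added design choice that the steps above do not specify.

```typescript
interface ActiveDecision {
  id: string;
  rationale: string;
  affected_teams: string[];
  review_trigger: string;
}

interface TriggerVerdict {
  triggered: boolean;
  evidence: string;
  confidence: number; // 0 to 1
}

interface ReviewRequestDraft {
  decisionId: string;
  triggerEvidence: string;
  originalContext: string;
  suggestedReviewers: string[];
}

// Hypothetical dependency bundle: each function wraps your own registry, model, or notifier.
interface ScannerDeps {
  loadActive: () => Promise<ActiveDecision[]>;
  classify: (trigger: string) => Promise<string>;
  evaluate: (trigger: string, category: string) => Promise<TriggerVerdict>;
  createReviewRequest: (draft: ReviewRequestDraft) => Promise<void>;
}

interface ScanSummary {
  flagged: string[]; // decision ids that produced a review request
  failed: string[];  // decision ids whose classification or evaluation threw
}

async function scanRegistry(deps: ScannerDeps, minConfidence: number = 0.8): Promise<ScanSummary> {
  const summary: ScanSummary = { flagged: [], failed: [] };
  const decisions = await deps.loadActive();

  for (const decision of decisions) {
    try {
      const category = await deps.classify(decision.review_trigger);
      const verdict = await deps.evaluate(decision.review_trigger, category);
      if (verdict.triggered && verdict.confidence > minConfidence) {
        await deps.createReviewRequest({
          decisionId: decision.id,
          triggerEvidence: verdict.evidence,
          originalContext: decision.rationale,
          suggestedReviewers: decision.affected_teams,
        });
        summary.flagged.push(decision.id);
      }
    } catch {
      // One unreachable data source should not abort the rest of the scan.
      summary.failed.push(decision.id);
    }
  }
  return summary;
}
```

Tracking `failed` separately from "did not fire" matters here: a trigger that could not be evaluated is unknown, not safe.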

Edge Cases That Break Naive Extractors

Multi-meeting threads, contradictions, and missing authority — none of which are anomalies.

Multi-meeting decisions

Some decisions span meetings — discussed in one, finalized in another
Track 'pending_decision' status for discussions that have not resolved
Link related records with a thread_id so the deliberation history is preserved
Promote to 'active' only when explicit agreement or an authority call is detected
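A small sketch of how the `pending_decision` status and `thread_id` link could fit together. The `ResolutionSignal` values and both helper functions are hypothetical; the sketch also assumes meeting dates are ISO 8601 strings in a single format, so that string order matches time order.

```typescript
type ThreadStatus = "pending_decision" | "active";

// Hypothetical signal the extractor attaches to each discussion segment.
type ResolutionSignal = "none" | "explicit_agreement" | "authority_call";

interface ThreadedRecord {
  id: string;
  thread_id: string; // shared by every record in the same deliberation
  status: ThreadStatus;
  decision: string;
  meeting_date: string; // ISO 8601 date
}

// Adds a record without mutating the existing registry array.
function recordDiscussion(
  registry: ThreadedRecord[],
  entry: Omit<ThreadedRecord, "status">,
  signal: ResolutionSignal,
): ThreadedRecord[] {
  // Only explicit agreement or an authority call promotes a record to active.
  const nextStatus: ThreadStatus = signal === "none" ? "pending_decision" : "active";
  return [...registry, { ...entry, status: nextStatus }];
}

// Returns the deliberation history for one thread, oldest meeting first.
function threadLog(registry: ThreadedRecord[], threadId: string): ThreadedRecord[] {
  return registry
    .filter((record) => record.thread_id === threadId)
    .sort((a, b) => (a.meeting_date < b.meeting_date ? -1 : a.meeting_date > b.meeting_date ? 1 : 0));
}
```

Keeping the unresolved record instead of overwriting it is what preserves the deliberation history once the final call lands in a later meeting.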

Contradictory decisions

Later decisions contradict earlier ones without explicit reference
The scanner detects semantic overlap between new and existing records
Flag contradictions for human review — never auto-resolve
Keep both records with cross-references; the contradiction is the signal
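To illustrate the flag-but-never-resolve rule, here is a deliberately simple stand-in for semantic overlap. It uses word-level Jaccard similarity, which is an assumption made to keep the example self-contained; a production system would more likely compare embeddings. All names and the 0.5 threshold are hypothetical.

```typescript
interface RegistryEntry {
  id: string;
  decision: string;
}

interface ContradictionFlag {
  newId: string;
  existingId: string;
  overlap: number; // Jaccard similarity of the two decision statements, 0 to 1
}

const STOP_WORDS = new Set(["the", "a", "an", "for", "to", "of", "and", "in", "on", "our", "we"]);

function tokenize(text: string): Set<string> {
  const words = text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) => word.length > 0 && !STOP_WORDS.has(word));
  return new Set(words);
}

function jaccard(a: Set<string>, b: Set<string>): number {
  let shared = 0;
  a.forEach((word) => {
    if (b.has(word)) shared += 1;
  });
  const unionSize = a.size + b.size - shared;
  return unionSize === 0 ? 0 : shared / unionSize;
}

// Returns candidate pairs for human review; it never edits or retires a record.
function flagPossibleContradictions(
  incoming: RegistryEntry,
  existing: RegistryEntry[],
  threshold: number = 0.5,
): ContradictionFlag[] {
  const incomingTokens = tokenize(incoming.decision);
  return existing
    .map((entry) => ({
      newId: incoming.id,
      existingId: entry.id,
      overlap: jaccard(incomingTokens, tokenize(entry.decision)),
    }))
    .filter((flag) => flag.overlap >= threshold);
}
```

High overlap only means two records cover the same ground. Whether they actually contradict each other is the judgment left to the human reviewer.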

Implicit authority

Many meetings have no clear decision-maker present
When authority is ambiguous, tag the record with 'consensus' as decider
Block promotion to 'active' until a designated owner validates
Wire in an org-chart integration so the agent infers domain authority automatically
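The first three rules can be read as a small gate on promotion. The sketch below is illustrative only: `pending_validation` is a hypothetical status name for the blocked state, and the org-chart lookup is left out entirely.

```typescript
type PromotionStatus = "pending_validation" | "active";

interface AuthorityRecord {
  decider: string;             // a named person or group, or "consensus"
  validated_by: string | null; // designated owner who confirmed the record
  status: PromotionStatus;
}

// Assigns a decider from whatever the extractor detected in the transcript.
function assignDecider(detectedDecider: string | null): AuthorityRecord {
  if (detectedDecider === null || detectedDecider.trim() === "") {
    // Ambiguous authority: tag as consensus and hold the record back.
    return { decider: "consensus", validated_by: null, status: "pending_validation" };
  }
  return { decider: detectedDecider.trim(), validated_by: null, status: "active" };
}

// A designated owner confirms the record, which unblocks promotion.
function validateByOwner(record: AuthorityRecord, owner: string): AuthorityRecord {
  return { ...record, validated_by: owner, status: "active" };
}
```

During the initial human-review period every record passes through a reviewer anyway, so this gate matters most once review drops to spot-checking.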

You Cannot Tune the Extractor Without Ground Truth

Without labeled transcripts, you are guessing. Without precision, you are eroding trust.

Generic prompts against meeting transcripts run around 70% field-level accuracy on the first pass. Few-shot prompting with real examples from your org pushes that number to the 85–90% range — but the gain comes from the examples matching your culture, not from prompt cleverness.^[7] The model needs to see how decisions get expressed in your meeting cadences, with your jargon, by your people.

The research on this is consistent: few-shot improvements are not uniform across fields. The decision field is easiest — the model reliably identifies commitment language. The review_trigger field is hardest — it requires inferring what assumption underlies the decision, then specifying the condition that would invalidate it. Plan for the trigger field to need the most human correction in early weeks.

Field-level accuracy from LLM structured output benchmarks (2025) shows a similar pattern: high-frequency, well-structured fields (dates, names) hit 90%+ accuracy quickly; inferential fields (rationale, trigger) stabilize more slowly and benefit most from domain-specific examples.^[12]

[01]
Build a ground-truth dataset before tuning anything
Have a human analyst manually label 20-30 transcripts, marking every actual decision. That set becomes your eval. Without ground truth, prompt changes are guesses — you cannot tell whether a tweak helped or made things worse.
[02]
Measure precision and recall separately
Precision tells you what fraction of extracted records are real decisions. Recall tells you what fraction of real decisions were captured. Optimize for precision first. False positives destroy trust faster than missed extractions.
[03]
Add few-shot examples from your own org
Generic prompts run around 70% accuracy. Three to five examples from your actual transcripts — your people, your jargon, your meeting cadence — push accuracy to 85-90%.^[7] The examples teach the model how decisions get expressed in your culture, which is the part no generic prompt can know.
[04]
Run a human-in-the-loop validation step for the first 8 weeks
Route every extracted record through a quick human review before it reaches active status. Reviewer corrects errors. Corrections feed back into the prompt. After 8 weeks, drop to spot-checking 20% of extractions. Cut HITL too early and silent prompt drift compounds.
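Precision and recall in step 02 can be computed directly from reviewer labels. This sketch assumes a reviewer has already mapped each extracted record to a ground-truth decision id (or to `null` when it is not a real decision); `scoreExtraction` and its input shape are hypothetical.

```typescript
interface ExtractionScore {
  precision: number | null; // null when nothing was extracted
  recall: number | null;    // null when the ground truth contains no decisions
}

// matches[i] is the ground-truth decision id that extracted record i corresponds to,
// or null when the reviewer judged it not to be a real decision.
function scoreExtraction(matches: Array<string | null>, labeledIds: string[]): ExtractionScore {
  const labeled = new Set(labeledIds);
  const recovered = new Set<string>();
  let correct = 0;

  for (const match of matches) {
    if (match !== null && labeled.has(match)) {
      correct += 1;
      recovered.add(match);
    }
  }
  return {
    // Fraction of extracted records that are real decisions.
    precision: matches.length === 0 ? null : correct / matches.length,
    // Fraction of real decisions that were captured at least once.
    recall: labeled.size === 0 ? null : recovered.size / labeled.size,
  };
}
```

For example, three extracted records of which two match labeled decisions, scored against four labeled decisions, gives a precision of 2/3 and a recall of 2/4. Two extractions that map to the same labeled decision both count toward precision but only once toward recall.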

What You Actually Need to Ship This

The minimum viable surface — pilot before org-wide rollout.

Decision Extraction Pipeline — Setup Checklist

Transcript source wired up — Otter, Fireflies, Google Meet, or custom
Decision record schema defined in the data store, all seven fields enforced
Extraction prompt tested against 10+ real transcripts before production
Agent configured with structured output enforcement, not free-text
Review-trigger classification taxonomy specified
Data sources connected for trigger evaluation — metrics, billing, surveys
Scanner agent deployed on a daily or weekly schedule
Notification routing in place — Slack channel or email — for triggered reviews
Searchable dashboard live for browsing the registry
4-week pilot with one team before any org-wide rollout

When Not to Build This (and What to Do Instead)

The wrong team size or decision culture will make the registry rot faster than it fills.

This system earns its keep around 8–10 people. Below that, institutional memory lives in a few heads and informal communication is enough — you do not need a registry when the whole team fits in one Slack DM. Above 8–10, the combinatorial explosion of who-knows-what makes systematic capture worth building. For distributed orgs above 50, it is close to mandatory.

There are three situations where the system actively backfires:

High-trust, low-documentation cultures. Some orgs run on implicit authority and fast iteration. Introducing a formal registry slows the decision loop in ways that cost more than the re-litigation it prevents. Fit matters. A quarterly planning cadence benefits more than a sprint team shipping every week.

Orgs with weak transcript hygiene. The extractor is only as good as its input. If transcripts are incomplete, speaker-unattributed, or generated from meetings where key discussion happened off-channel, extraction quality degrades fast. Fix the transcript problem first.

When the registry becomes political. Decision records cited selectively to shut down legitimate reconsideration turn the system from memory into a veto mechanism. The healthy norm is: the review-trigger path is the legitimate route to revisiting a call. If that path gets blocked or ignored, the registry calcifies into bureaucratic lock-in — worse than the filing cabinet it replaced.

Two Quarters In, the Registry Becomes a Real Asset

Onboarding gets faster. Re-litigation drops. Decision-making gets more deliberate.

Seconds: time to retrieve any past decision and its full rationale once the registry is populated and indexed

Significantly fewer: re-litigation conversations in planning meetings; exact reductions depend on meeting culture

Varies: stale decisions surfaced per month via review triggers; depends on trigger quality and how live your data integrations are

Much faster: onboarding time for new hires to internalize past choices; quantified gains vary by registry depth

Two quarters in, the decision registry becomes a real asset. New hires search it during onboarding instead of asking "why do we do it this way?" in every meeting. Planning sessions reference specific records instead of leaning on collective memory. When a trigger fires, the team revisits the call with full context — original rationale, alternatives, and the specific condition that flipped.

The second-order shift is cultural. Once people know decisions are being captured and indexed, they articulate reasoning more clearly in the meeting itself. Once they know review triggers exist, they think harder about what conditions would invalidate the choice. The system makes the org more deliberate, not just better at remembering.^[8]

The registry does not prevent bad decisions. It prevents the specific failure mode where good decisions get reversed for bad reasons — or bad decisions never get revisited because no one remembers why they were made.

How accurate is AI extraction compared to manual decision logging?

With a tuned prompt and 3-5 few-shot examples from your own transcripts, field-level accuracy lands around 85-90%. That number varies by transcript quality and meeting style. Accuracy splits by field: decision and date are reliable; review_trigger is the hardest field and benefits most from human review during the first weeks. Treat any external benchmark as a starting estimate. Measure against your own ground truth.

What transcript sources work best with this workflow?

Anything that produces timestamped, speaker-attributed transcripts. Otter.ai, Fireflies.ai, Google Meet's built-in transcription — all usable. The hard requirement is speaker attribution. Without knowing who said what, the decider field is unreliable, and a record without a decider is half a record.

How do you keep the registry from becoming another graveyard?

The review-trigger mechanism is the answer. Static documentation rots. Triggered reviews push records back into view when conditions change. Pair the registry with a searchable dashboard and Slack notifications, and the system reaches out to people instead of waiting to be found. Reach-out is the difference.

Can this work for async decisions made in Slack or email?

Yes, with prompt adjustments. Written communication patterns differ from spoken ones. Async decisions tend to be more explicit — people write "Decision: we're doing X" — which makes extraction easier. The harder problem is identifying the right threads to process. Channel filters and role-based selectors handle most of it.

What is the minimum team size where this earns its keep?

Around 8-10 people is where return starts showing up. Below that, institutional memory lives in a few heads and informal communication is enough — you do not need a system when the whole team fits in one Slack DM. Above 8-10, the combinatorial explosion of who-knows-what makes systematic capture worth building. For distributed orgs above 50, it is close to mandatory.

How does this interact with existing ADR practices in engineering?

Cleanly, if you keep them separate. ADRs are human-authored records for architecture decisions — they belong in version control alongside the code they govern. The decision registry covers a broader class of organizational choices (vendor, hiring, process, strategy) and is auto-populated from transcripts. Wire up the registry to detect when an extracted decision matches an ADR topic, then link the records. The ADR stays the authoritative technical record; the registry entry adds the meeting context and review trigger.

Decision Record Governance Rules

[01]

Every record must have a non-empty review_trigger field

Records without triggers default to static documentation. The trigger is what makes the system push instead of wait.

[02]

Triggers must be falsifiable and externally verifiable

Vague triggers like "if things change" cannot be evaluated by the scanner. The trigger references a specific, measurable condition or it is not a trigger.

[03]

Contradicting a previous decision requires linking the original record

A new decision that overrides an old one must reference the original. The evolution of thinking is the asset. Lose the link, lose the asset.

[04]

Extraction accuracy is measured monthly against ground truth

Prompt drift is real. Models change, meeting culture changes, jargon shifts. Monthly measurement is the only thing that stops silent degradation. Skip it and you will only notice when trust has already collapsed.

[05]

The registry is a prompt for discussion, not a veto

A record that gets cited to shut down legitimate reconsideration is being misused. Any team member can request review via the trigger path. That path must stay open.

Key terms in this piece

organizational memory, decision records, meeting transcript processing, knowledge management, ADR, review triggers, decision extraction, institutional knowledge, AI agent workflow

Sources

[1] ADR GitHub Organization, "Architectural Decision Records (ADRs)" (adr.github.io)
[2] AWS Architecture Blog, "Master Architecture Decision Records (ADRs): Best Practices for Effective Decision Making" (aws.amazon.com)
[3] Joel Parker Henderson, "Architecture Decision Record Templates and Examples" (github.com)
[4] Google Cloud, "Architecture Decision Records" (cloud.google.com)
[5] Microsoft, "Architecture Decision Records in Azure Well-Architected Framework" (learn.microsoft.com)
[6] Relevance AI, "Extract Data From Meeting Transcripts" (relevanceai.com)
[7] Prompt Engineering, "Agents at Work: The 2026 Playbook for Building Reliable Agentic Workflows" (promptengineering.org)
[8] Wikipedia, "Organizational Memory" (en.wikipedia.org)
[9] Nuclino Blog, "Not sharing knowledge costs Fortune 500 companies $31.5 billion a year" (blog.nuclino.com)
[10] Inc., "The Cost and Consequence of Institutional Memory Drain" (inc.com)
[11] Bloomfire, "The 7 Knowledge Management Trends Shaping 2025" (bloomfire.com)
[12] arXiv, "Context Matters: Evaluating Context Strategies for Automated ADR Generation Using LLMs" (arxiv.org)
