Retro-to-Pattern Engine: A Year of Sprint Retros, Clustered

The Retro-to-Pattern Engine: Surface What Your Team Keeps Tripping On

Twenty-six retros a year, three platforms, zero memory. The same friction points keep resurfacing because nobody re-reads 26 documents. An agent that normalizes, clusters, and ranks the patterns turns retro output into a longitudinal record the team can act on.

Workflow Automation · Intermediate · Dec 17, 2025 · 7 min read

By Viktor Bezdek · VP Engineering, Groupon

Sprint retrospectives are the only ceremony most teams actually like. They are also the ceremony with the shortest memory.

Every two weeks the team gathers, writes the sticky notes, surfaces the friction points — and surfaces them again six sprints later, and again six sprints after that. The retro board gets archived. The action items half-land. Three months later someone says "didn't we talk about this before?" with the tired certainty of a person who already knows the answer.

The failure mode is not honesty. Most teams are surprisingly direct when given the space.^[2] The failure mode is structural: retro output lives in dozens of unconnected documents scattered across Confluence pages, Notion databases, and Google Docs, each formatted differently, each forgotten within days. No human re-reads 26 retro transcripts back-to-back. The longitudinal record does not exist.

An agent can hold all 26 in working memory. That is the entire premise of this article: a pipeline that ingests a year of retros, normalizes their incompatible formats, clusters recurring themes by semantic similarity, and ranks them by how much the team is actually paying for each one. Output is a report ordered by frequency, sentiment, and recurrence velocity — not a feeling.

Many

Retro themes recur within 3 months. Recurrence rate tracks action-item follow-through, not retro quality. Anonymized retro datasets show a majority of themes resurface.

40-50%

Of action items never reach completion, per aggregated data from ScatterSpoke and TeamRetro. Teams with explicit ownership do better; teams without owners drift to the low end.

Hours saved

Per quarter vs. manual review — scales with how many retro documents you have and how thorough the current process pretends to be.

3-5

Hidden pattern clusters most teams miss — rough estimate from teams with 6–18 months of retro data. Your number depends on team stability.

Why Retros Forget Their Own Lessons

The retro is honest. The system around the retro has no memory.

Retrospectives sit in an odd spot in agile practice. Teams rate them higher than planning or grooming. Teams also operationalize them less than any other ceremony.^[2] A retro produces a Confluence page, sometimes a Jira ticket, occasionally a Slack thread. It almost never produces a longitudinal record that connects this sprint's friction to last quarter's friction.

Three structural forces produce the amnesia.

Format drift. The scrum master who ran retros in Q1 used a Confluence template with three columns. The new facilitator switched to Notion and a Start/Stop/Continue layout. Someone else ran a Google Doc with freeform bullets. The data exists. It resists comparison.

Volume blindness. Two-week sprints generate 26 retro documents per year. Nobody re-reads 26 documents. Most people barely remember the previous retro by the time the next one starts.

Action-item decay. Aggregated data from ScatterSpoke and TeamRetro shows teams complete only roughly 40–50% of retro action items on average^[6]^[5] — wide variance by team maturity and ownership. The incomplete ones do not carry forward. They vanish from collective memory and resurface as fresh complaints months later, indistinguishable from new problems.

Memory by Vibe

Re-reading 26 documents across 3 platforms by hand
Theme identification by whoever has the longest tenure
No frequency tracking across sprints
Action items lost between retros
Patterns visible only to people who were there

Longitudinal Record

Agent ingests every document in minutes
Semantic clustering groups themes by meaning, not by phrasing
Frequency and recurrence span tracked per pattern
Unresolved patterns flagged with full appearance history
Patterns visible to anyone, regardless of tenure

The Normalization Layer: Three Incompatible Formats, One Schema

Before you cluster anything, you need a common shape. Each platform fights you differently.

Clustering does not work on heterogeneous data. Retros arrive in at least three incompatible formats, and each platform demands its own extraction strategy.

The normalization layer flattens every retro document into a list of retro items. Each item carries a sentiment polarity (positive, negative, neutral), a sprint identifier, and the cleaned raw text. Everything downstream — embeddings, clustering, ranking — runs against this schema. Get the schema wrong and the rest of the pipeline produces noise that looks like signal.

[01]
Pull raw content from each platform
Use the Confluence REST API (GET /wiki/api/v2/pages/{id}?body-format=storage), the Notion API (query a database filtering by 'Retrospective' type), or the Google Docs API (documents.get with body content parsing). Three platforms, three return shapes: Confluence hands you XHTML storage format, Notion returns block arrays, Google Docs returns a structural elements tree. There is no shortcut — write a connector per source.
[02]
Parse platform-specific structure into retro items
Confluence templates lean on tables or panels with category headers (What went well, What didn't, Actions). Notion stores items as child blocks under labeled sections. Google Docs rely on heading styles or bold text to mark categories. Each parser walks its own tree, extracts items, and maps each one to a sentiment category.
[03]
Normalize into the unified RetroItem schema
Every extracted item becomes a RetroItem with fields: id (UUID), sprintId (e.g. 'sprint-47'), date (ISO 8601), text (cleaned string), sentiment (positive | negative | neutral), source (confluence | notion | gdocs), and raw (original text for debugging). Strip markdown, normalize whitespace, drop facilitator meta-comments. The cleaner this layer is, the less noise propagates downstream.
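The parsing step above can be sketched as a walk over a flat node list. This is a minimal sketch, assuming each connector first reduces its platform tree (XHTML, block array, structural elements) to a simple sequence of headings and items. `FlatNode`, `SectionItem`, and `collectSectionItems` are illustrative names, not part of any platform SDK.

```typescript
// Hypothetical intermediate shape: every connector flattens its platform
// tree into this sequence before the shared parsing logic runs.
type FlatNode =
  | { kind: 'heading'; text: string }
  | { kind: 'item'; text: string };

interface SectionItem {
  header: string; // nearest preceding heading, e.g. "What went well"
  text: string;
}

// Walk nodes in document order and attach each item to the most recent
// heading. Items before the first heading, and blank items, are skipped.
function collectSectionItems(nodes: FlatNode[]): SectionItem[] {
  const out: SectionItem[] = [];
  let currentHeader: string | null = null;
  for (const node of nodes) {
    const text = node.text.trim();
    if (node.kind === 'heading') {
      currentHeader = text;
    } else if (currentHeader !== null && text !== '') {
      out.push({ header: currentHeader, text });
    }
  }
  return out;
}

const sampleNodes: FlatNode[] = [
  { kind: 'item', text: 'Facilitator: please add notes below' }, // no heading yet
  { kind: 'heading', text: 'What went well' },
  { kind: 'item', text: 'Pairing on the migration' },
  { kind: 'heading', text: 'Challenges' },
  { kind: 'item', text: 'Deployments take too long' },
  { kind: 'item', text: '   ' }, // blank
];
const sectionItems = collectSectionItems(sampleNodes);
```

Each `header` can then be passed to a header-to-sentiment lookup, so the platform-specific code ends at the flattening step and everything after it is shared.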

lib/retro-normalizer.ts

// One schema for every platform. Mismatches die here, not in the cluster pool.
interface RetroItem {
  id: string;
  sprintId: string;
  date: string; // ISO 8601
  text: string;
  sentiment: 'positive' | 'negative' | 'neutral';
  source: 'confluence' | 'notion' | 'gdocs';
  raw: string;
}

const SENTIMENT_MAP: Record<string, RetroItem['sentiment']> = {
  'what went well': 'positive',
  'went well': 'positive',
  'keep': 'positive',
  'positives': 'positive',
  'what didn\'t go well': 'negative',
  'challenges': 'negative',
  'stop': 'negative',
  'frustrations': 'negative',
  'actions': 'neutral',
  'try': 'neutral',
  'start': 'neutral',
  'experiments': 'neutral',
};

function classifySentiment(sectionHeader: string): RetroItem['sentiment'] {
  // Unknown headers default to neutral. Tighten the map before you tighten the model.
  const normalized = sectionHeader.toLowerCase().trim();
  return SENTIMENT_MAP[normalized] ?? 'neutral';
}
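The cleaning rules from step 03 (strip markdown, normalize whitespace, keep the original in `raw`) might look like the sketch below. `cleanText`, `toRetroItem`, and the injected `makeId` are assumptions for illustration; in a real run `makeId` would return a UUID. The regexes cover only common markers and are not a full markdown parser.

```typescript
type Sentiment = 'positive' | 'negative' | 'neutral';
type Source = 'confluence' | 'notion' | 'gdocs';

interface RetroItem {
  id: string;
  sprintId: string;
  date: string; // ISO 8601
  text: string;
  sentiment: Sentiment;
  source: Source;
  raw: string;
}

// Strip common markdown markers and collapse whitespace.
// Underscores are left alone so identifiers like min_cluster_size survive.
function cleanText(raw: string): string {
  return raw
    .replace(/^\s*(?:[-*+]|\d+\.)\s+/, '') // leading bullet or "1." marker
    .replace(/\*+|`+|~~/g, '')             // emphasis, inline code, strikethrough
    .replace(/\s+/g, ' ')
    .trim();
}

// Returns null when nothing is left after cleaning, so empty rows never
// reach the embedding step.
function toRetroItem(
  raw: string,
  sentiment: Sentiment,
  meta: { sprintId: string; date: string; source: Source },
  makeId: () => string, // hypothetical: supply a UUID generator in production
): RetroItem | null {
  const text = cleanText(raw);
  if (text === '') return null;
  return { id: makeId(), ...meta, text, sentiment, raw };
}

let demoCounter = 0;
const demoItem = toRetroItem(
  '  - **Deploys** take   `too` long ',
  'negative',
  { sprintId: 'sprint-47', date: '2025-11-14', source: 'notion' },
  () => `demo-${++demoCounter}`,
);
const emptyItem = toRetroItem(
  '  ** ',
  'neutral',
  { sprintId: 'sprint-47', date: '2025-11-14', source: 'notion' },
  () => `demo-${++demoCounter}`,
);
```

Keeping `raw` untouched means a bad cleaning rule can be diagnosed later without re-pulling the source document.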

Semantic Clustering: Same Pain, Three Different Phrasings

Keyword matching misses the pattern. Embeddings catch it because they ignore the words.

Here is the core problem. "Deployments take too long" from Sprint 31, "CI pipeline is a bottleneck" from Sprint 38, and "we spent half of Thursday waiting for staging to deploy" from Sprint 42 are the same pattern. Keyword matching will report three unrelated complaints. Semantic embedding collapses them into one cluster.

The approach is not exotic. Embed every normalized retro item into a vector space, then cluster the vectors.

Embedding choice. For retro items at 5–30 words each, a lightweight model — text-embedding-3-small from OpenAI or voyage-3-lite from Voyage AI — is accurate enough. The texts are short and domain-specific; you do not need a large model. Batch the embedding call: ~2000 items per request keeps latency and cost flat.
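Batching as described can be sketched independently of the provider. This is a minimal sketch, assuming a hypothetical `EmbedBatch` function that wraps whichever embedding SDK you use and returns one vector per input, in order. The 2000-item default mirrors the figure above; check your provider's actual per-request limit.

```typescript
// Hypothetical adapter signature: wire this to your embedding provider.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

// Split a list into consecutive chunks of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error('size must be >= 1');
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Embed all texts batch by batch, preserving input order so the result
// stays parallel to the item array.
async function embedAll(
  texts: string[],
  embedBatch: EmbedBatch,
  batchSize = 2000,
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of chunk(texts, batchSize)) {
    const result = await embedBatch(batch);
    if (result.length !== batch.length) {
      throw new Error(`expected ${batch.length} vectors, got ${result.length}`);
    }
    vectors.push(...result);
  }
  return vectors;
}
```

The length check matters more than it looks: a silently dropped vector shifts every later embedding onto the wrong retro item.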

Clustering choice: HDBSCAN, not K-means. You do not know the number of clusters in advance, and retro items produce clusters of wildly different sizes.^[3] A deployment-pain cluster might hold 15 items. A meeting-fatigue cluster might hold 4. K-means forces a number you do not have. HDBSCAN handles density variance natively and — this is the part that matters — it identifies noise points: items that belong to no cluster. Noise filtering is how you separate one-off complaints from recurring patterns.

lib/retro-clusterer.ts

import { HDBSCAN } from 'hdbscanjs';

// HDBSCAN over cosine distance. Noise is signal: it filters one-off complaints.
// The constructor options and fit() call follow this article's usage;
// verify them against the HDBSCAN package version you install.
interface ClusterInput {
  items: RetroItem[];
  embeddings: number[][]; // parallel array: embeddings[i] belongs to items[i]
}

interface ThemeCluster {
  id: string;
  label: string; // filled by LLM labeling pass
  items: RetroItem[];
  centroid: number[];
  frequency: number; // count of unique sprints represented
  firstSeen: string; // earliest sprint date
  lastSeen: string;  // most recent sprint date
  recurrenceSpan: number; // days between first and last
}

// Component-wise mean of a non-empty list of equal-length vectors.
function computeCentroid(vectors: number[][]): number[] {
  const dim = vectors[0].length;
  const sum = new Array<number>(dim).fill(0);
  for (const v of vectors) {
    for (let d = 0; d < dim; d++) sum[d] += v[d];
  }
  return sum.map(x => x / vectors.length);
}

function clusterRetroItems(input: ClusterInput): ThemeCluster[] {
  const clusterer = new HDBSCAN({
    minClusterSize: 3,
    minSamples: 2,
    metric: 'cosine',
  });

  const labels: number[] = clusterer.fit(input.embeddings);

  // Group item indices by cluster label, dropping noise points (label === -1).
  // Indices are kept (not items) so each cluster can look up its own embeddings.
  const groups = new Map<number, number[]>();
  labels.forEach((label, idx) => {
    if (label === -1) return;
    if (!groups.has(label)) groups.set(label, []);
    groups.get(label)!.push(idx);
  });

  // Build ThemeCluster objects, then order by recurrence span, longest first.
  return Array.from(groups.entries())
    .map(([id, indices]) => {
      const items = indices.map(i => input.items[i]);
      const dates = items.map(i => new Date(i.date)).sort((a, b) => +a - +b);
      const sprintIds = new Set(items.map(i => i.sprintId));
      return {
        id: `cluster-${id}`,
        label: '',
        items,
        // Average only the embeddings that belong to this cluster.
        centroid: computeCentroid(indices.map(i => input.embeddings[i])),
        frequency: sprintIds.size,
        firstSeen: dates[0].toISOString(),
        lastSeen: dates[dates.length - 1].toISOString(),
        recurrenceSpan: (+dates[dates.length - 1] - +dates[0]) / 86400000,
      };
    })
    .sort((a, b) => b.recurrenceSpan - a.recurrenceSpan);
}
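For readers who want to see what "cosine distance" means for the clusterer, here is a minimal sketch. The function name and the zero-vector convention are assumptions of this example. Cosine distance is 1 minus cosine similarity, so it depends on the direction of two vectors and not on their length.

```typescript
// Cosine distance: 0 for same direction, 1 for orthogonal, 2 for opposite.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have equal length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // Convention chosen for this sketch: a zero vector is treated as unrelated.
  if (normA === 0 || normB === 0) return 1;
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const sameDirection = cosineDistance([1, 2], [2, 4]); // scaled copy, close to 0
const orthogonal = cosineDistance([1, 0], [0, 1]);
const opposite = cosineDistance([1, 0], [-1, 0]);
```

Because scaling a vector does not change the distance, a long complaint and a short one about the same topic can still land in the same cluster.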

Label the Clusters. Rank by What They Actually Cost.

Frequency alone ranks badly. Combine three signals or you will surface noise.

A raw cluster is a numbered group of similar text. It becomes useful only when labeled by what it is and ranked by what it costs.

LLM labeling. Pass each cluster's items to a model with a prompt: "These retro items appeared across multiple sprints. Generate a 3–8 word theme label and a one-sentence summary." The model sees real complaints — not centroids — so it produces "Deployment pipeline bottlenecks," not "Cluster 7."
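The labeling prompt can be assembled with a few lines of string handling. This is a sketch, assuming a hypothetical `buildLabelPrompt` helper; the `maxItems` cap is an assumption added to keep large clusters from inflating the prompt, and the call to the model itself is left out.

```typescript
// Build the labeling prompt from a cluster's item texts.
// The model sees the real complaints, one per bullet.
function buildLabelPrompt(itemTexts: string[], maxItems = 20): string {
  const bullets = itemTexts
    .slice(0, maxItems)
    .map((t) => `- ${t}`)
    .join('\n');
  return [
    'These retro items appeared across multiple sprints.',
    'Generate a 3–8 word theme label and a one-sentence summary.',
    '',
    bullets,
  ].join('\n');
}

const labelPrompt = buildLabelPrompt([
  'Deployments take too long',
  'CI pipeline is a bottleneck',
  'we spent half of Thursday waiting for staging to deploy',
]);
```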

Impact scoring. Frequency alone is a weak ranking signal. A pattern that appeared in 20 of 26 sprints sounds severe; if it is "standup runs long," the actual cost is small. Combine three signals into a composite:

Frequency (F): unique sprints where the pattern appears, divided by total sprints analyzed. Range 0–1.
Sentiment weight (S): proportion of negative items in the cluster. Pure negativity scores higher than mixed clusters.
Recurrence velocity (V): inverse of the average gap between appearances. A pattern hitting every sprint outranks one with two appearances three months apart.

The composite is impact = (0.4 * F) + (0.3 * S) + (0.3 * V), normalized to 0–100. The weighting favors patterns that are both frequent and persistently negative over patterns that are simply common. Tune the weights to your team. The 0.4/0.3/0.3 split is a starting point, not a verdict.
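The composite above can be worked through in code. This is a sketch under one stated assumption: the gap for V is measured in sprints, so a pattern that appears every sprint gets V = 1 and a single appearance gets V = 0. `ClusterStats`, `Weights`, and `impactScore` are illustrative names.

```typescript
interface ClusterStats {
  sprintNumbers: number[]; // sprint number of every item in the cluster; repeats allowed
  negativeItems: number;
  totalItems: number;
}

interface Weights {
  frequency: number;
  sentiment: number;
  velocity: number;
}

const DEFAULT_WEIGHTS: Weights = { frequency: 0.4, sentiment: 0.3, velocity: 0.3 };

function impactScore(
  stats: ClusterStats,
  totalSprints: number,
  w: Weights = DEFAULT_WEIGHTS,
): number {
  if (totalSprints < 1) throw new Error('totalSprints must be >= 1');
  const sprints = Array.from(new Set(stats.sprintNumbers)).sort((a, b) => a - b);

  // F: share of analyzed sprints in which the pattern appears.
  const F = sprints.length / totalSprints;
  // S: share of the cluster's items that are negative.
  const S = stats.totalItems === 0 ? 0 : stats.negativeItems / stats.totalItems;
  // V: inverse of the average gap (in sprints) between consecutive appearances.
  // Average gap = (last - first) / (appearances - 1).
  const first = sprints[0] ?? 0;
  const last = sprints[sprints.length - 1] ?? 0;
  const V = sprints.length < 2 ? 0 : (sprints.length - 1) / (last - first);

  // Weighted sum scaled to 0-100 and rounded to a whole number.
  return Math.round(100 * (w.frequency * F + w.sentiment * S + w.velocity * V));
}

const everySprint = Array.from({ length: 26 }, (_, i) => i + 1);
const chronic = impactScore(
  { sprintNumbers: everySprint, negativeItems: 30, totalItems: 30 },
  26,
); // F = 1, S = 1, V = 1 gives 100
const occasional = impactScore(
  { sprintNumbers: [1, 7], negativeItems: 1, totalItems: 2 },
  26,
); // F = 2/26, S = 0.5, V = 1/6 gives 23
```

With the default weights, the occasional pattern scores 23 mostly because of its sentiment share, which shows why a small mixed cluster should not outrank a frequent one.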

Rank	Pattern	Sprints Hit	Impact Score	First Seen	Status
1	Deployment pipeline bottlenecks	18 / 26	87	Sprint 22	Unresolved
2	Unclear acceptance criteria on stories	14 / 26	72	Sprint 24	Partially addressed
3	Cross-team dependency delays	12 / 26	68	Sprint 25	Unresolved
4	Test environment instability	11 / 26	61	Sprint 29	Resolved Sprint 41
5	Sprint scope creep from stakeholders	9 / 26	54	Sprint 30	Unresolved

Pipeline Architecture: From Documents to Decisions

End-to-end shape of the engine. Each box exists because something specific breaks without it.

Retro Pattern Engine Pipeline

Weekly run pulls from three platforms, normalizes into one schema, gates on validation, clusters semantically, ranks by impact, ships the report.

Retro Pattern Engine Data Flow

Items flow from raw documents through the normalizer, the embedder, HDBSCAN, the impact ranker, and the report. The empty-output path exists on purpose.

The Report Generates Action — or It Generates Defensiveness

Same data, two framings. One drives change. The other gets the engine quietly turned off.

The fastest way to kill a pattern engine is to ship the first report as a list of "things you keep screwing up." That triggers defensiveness, not improvement.^[4] The presentation layer carries as much weight as the analysis.

Three design constraints keep the output usable.

Show trajectory, not just snapshots. Every pattern needs a sparkline or timeline showing when it appeared and where it is trending. A pattern that hit 8 of the first 13 sprints but only 2 of the last 13 is a success story even if the lifetime count looks alarming. Teams need to see the curve, not just the total.
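The split-half comparison in that example can be sketched as follows. This is a minimal sketch, assuming a per-sprint boolean series in chronological order; the `tolerance` of one sprint before a change counts as a trend is an assumption, not a recommendation from the data.

```typescript
type Trend = 'improving' | 'stable' | 'worsening';

interface Trajectory {
  firstHalf: number;  // sprints hit in the earlier half
  secondHalf: number; // sprints hit in the later half
  trend: Trend;
}

// hits[i] is true when the pattern appeared in sprint i (chronological order).
// With an odd number of sprints, the later half is one sprint longer.
function trajectory(hits: boolean[], tolerance = 1): Trajectory {
  const mid = Math.floor(hits.length / 2);
  const count = (slice: boolean[]) => slice.filter(Boolean).length;
  const firstHalf = count(hits.slice(0, mid));
  const secondHalf = count(hits.slice(mid));
  let trend: Trend = 'stable';
  if (secondHalf < firstHalf - tolerance) trend = 'improving';
  else if (secondHalf > firstHalf + tolerance) trend = 'worsening';
  return { firstHalf, secondHalf, trend };
}

// The example above: 8 of the first 13 sprints, 2 of the last 13.
const exampleHits = [
  ...Array.from({ length: 13 }, (_, i) => i < 8),
  ...Array.from({ length: 13 }, (_, i) => i < 2),
];
const exampleTrend = trajectory(exampleHits);
```

The same boolean series is enough to drive a sparkline, so the trend label and the picture never disagree.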

Separate observation from prescription. The report says "Deployment pipeline bottlenecks appeared in 18 of 26 sprints, with the highest concentration in Sprints 33–38." It stops there. It does not say "Fix your deployment pipeline." The team already knows. What they lack is the evidence to prioritize the fix over the next feature ticket.

Link patterns to specific retro items. Every cluster expands to show the actual quotes from each sprint. This does two jobs at once: it builds trust in the clustering ("yes, these really are the same complaint") and it gives the team the specifics needed to draft a targeted intervention. A pattern with no source quotes is a number; a pattern with the actual sentences is a fight someone can pick up.

Report Sections That Drive Action

✓ Executive summary: top 3 patterns with impact scores and trend arrows
✓ Pattern detail cards: theme label, timeline visualization, every source quote, suggested next step
✓ Resolution tracker: previously identified patterns that have improved or resolved, with dates
✓ New signals: themes appearing for the first time in recent sprints (the early warning channel)
✓ Sentiment shift: categories where team mood has measurably moved quarter over quarter

Anti-patterns in Report Design

Naming individuals associated with complaints — pattern engine becomes blame engine
Red/green color coding that implies pass/fail judgment on the team
Ranking teams against each other when multiple teams feed the engine
Raw sentiment scores published without trend or context

Building It: Five Commands, End to End

Concrete steps to deploy the engine on your team's data. None of them are exotic.

[01]

Wire the platform connectors

typescript

// Confluence connector: one auth boundary per platform.
// ConfluenceAPI stands in for whatever client wrapper you use around the REST API.
const confluenceClient = new ConfluenceAPI({
  baseUrl: process.env.CONFLUENCE_URL,
  token: process.env.CONFLUENCE_TOKEN,
});

const retroPages = await confluenceClient.search({
  cql: 'label = "retrospective" AND created >= "2025-03-01"',
  expand: ['body.storage'],
});

[02]

Run the normalization pipeline

bash

# Pull every retro from every source. Output is the unified schema.
bun run retro-engine normalize \
  --sources confluence,notion,gdocs \
  --date-range 2025-03-01:2026-03-01 \
  --output normalized-items.json

[03]

Embed and cluster

bash

# Embed all items in one batched call. HDBSCAN over cosine distance.
bun run retro-engine cluster \
  --input normalized-items.json \
  --model text-embedding-3-small \
  --min-cluster-size 3 \
  --output clusters.json

[04]

Label clusters and score impact

bash

# LLM labeling pass + composite impact score per cluster.
bun run retro-engine rank \
  --input clusters.json \
  --weights frequency=0.4,sentiment=0.3,velocity=0.3 \
  --output pattern-report.json

[05]

Render the pattern report

bash

# Final HTML with timelines, drill-downs, source quotes per pattern.
bun run retro-engine report \
  --input pattern-report.json \
  --format html \
  --output retro-patterns-2026-q1.html

Where the Engine Breaks

Five rules that survive contact with real retro data.

Rules That Survive Real Retro Data

[01]

Minimum 6 months of retro data before running the engine

Fewer than 12–13 retros give HDBSCAN too little to work with. Density is the whole game; with sparse data most items fall out as noise, and the few clusters that do form are not reliable.

[02]

Re-embed when team composition changes significantly

Lose 3 of 5 members and hire replacements: the team is effectively new. Patterns from the old team may not apply. Tag every item with a team-composition version and let the consumer filter.

[03]

Never auto-assign action items from the report

The engine identifies. Humans decide what to do. Auto-assigning by frequency manufactures busywork and corrodes trust in the tool — both at once.

[04]

Spot-check cluster coherence on every run

After clustering, manually read 2–3 clusters. If items in a cluster do not feel related, lower minClusterSize or switch from cosine to euclidean. Coherence breaks silently; eyeballs catch it.

[05]

Handle multilingual retros explicitly

If the team writes in multiple languages, use a multilingual embedding model (Cohere embed-multilingual-v3) or pre-translate to a common language. Mixed-language clusters look broken because they are.

Project Layout

How to organize the engine codebase. Each directory carries one responsibility.

Retro Pattern Engine Project Layout

tree

retro-pattern-engine/
├── src/
│   ├── connectors/
│   │   ├── confluence.ts
│   │   ├── notion.ts
│   │   └── gdocs.ts
│   ├── normalizer/
│   │   ├── parser.ts
│   │   ├── sentiment-mapper.ts
│   │   └── deduplicator.ts
│   ├── clustering/
│   │   ├── embedder.ts
│   │   ├── hdbscan.ts
│   │   └── labeler.ts
│   ├── ranking/
│   │   ├── impact-scorer.ts
│   │   └── trend-analyzer.ts
│   └── report/
│       ├── generator.ts
│       └── templates/
├── config.ts
├── cli.ts
└── package.json

Does the Engine Actually Help, or Is It a Pretty Dashboard?

A pattern engine that produces beautiful reports and changes nothing is overhead.

Track these signals or do not bother running the engine.

Pattern resolution rate. Of the top 10 patterns identified in Q1, how many moved to "resolved" or "improving" by Q2? Realistic target: 2–3 of the top 10 showing measurable progress per quarter.^[7] Below that, you are running observability theater.

Action item completion rate. If the team uses the report to generate focused action items, watch whether completion rates climb off the typical 40–50% baseline.^[6] The hypothesis: data-backed priorities are harder to deprioritize than vibes-based ones.

New pattern emergence. A healthy team rotates patterns. New ones appear; old ones resolve. If the same top 5 persist for three consecutive quarters despite full visibility, the problem is not visibility. Something structural is blocking resolution and the report should call that stagnation out explicitly.
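The stagnation check can be sketched as a set intersection over quarterly top lists. This is a minimal sketch, assuming pattern labels are stable across quarterly runs; in practice, clusters from different runs would first need to be matched to each other. `stagnantPatterns` is an illustrative name.

```typescript
// topByQuarter: one list of top pattern labels per quarter, oldest first.
// Returns labels present in every one of the most recent `quarters` lists.
function stagnantPatterns(topByQuarter: string[][], quarters = 3): string[] {
  if (quarters < 1 || topByQuarter.length < quarters) return [];
  const recent = topByQuarter.slice(-quarters);
  const newest = recent[recent.length - 1] ?? [];
  return newest.filter((label) => recent.every((q) => q.includes(label)));
}

const stuck = stagnantPatterns([
  ['Deployment pipeline bottlenecks', 'Unclear acceptance criteria', 'Scope creep'],
  ['Deployment pipeline bottlenecks', 'Scope creep', 'Test environment instability'],
  ['Scope creep', 'Deployment pipeline bottlenecks', 'Cross-team dependency delays'],
]);
```

Anything this returns belongs in the report as an explicit stagnation callout, separate from the ranked list.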

Retro engagement. Anecdotally, teams report higher retro engagement once they know the output feeds a longitudinal system. People contribute more carefully when the input has a shelf life longer than two weeks.

Now the uncomfortable finding from teams that have run this for more than a year: the pattern engine usually confirms what senior engineers already knew and had been saying for months. "Deployment bottlenecks" appearing in 18 of 26 sprints surprises nobody who actually deploys. What the engine changes is not the discovery. It changes the political authority to act. A staff engineer saying "this is a problem" is one input. A chart showing the same pattern in 18 sprints is a different input.

If the organization needs that kind of quantitative cover before fixing obvious problems, the engine is not solving a visibility gap. It is patching a process dysfunction. Both are useful. They are not the same thing.

2-3

Top patterns resolved per quarter — realistic target. Below this, the engine is producing reports nobody acts on.

65%+

Action item completion rate when priorities are data-backed — aspirational vs. the ~40–50% baseline. Driven by ownership clarity, not the engine itself.

3 quarters

Stagnation threshold. A pattern persisting this long with full visibility means the blocker is structural, not informational.

Beyond Basic Clustering

Once the pipeline runs reliably, three extensions are worth the effort.

Temporal pattern analysis. Apply a time-weighted decay so recent sprints count more than older ones. Use a sliding window of 6 sprints to detect emerging patterns before they entrench. The engine flips from a retrospective tool into a near-real-time early warning system.^[1]
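Both ideas can be sketched in a few lines. This is a minimal sketch, assuming exponential decay with a half-life measured in sprints; the decay shape and the default of 6 are assumptions, since the text only calls for "time-weighted" scoring. `decayedFrequency` and `windowHits` are illustrative names.

```typescript
// hits[i] is true when the pattern appeared in sprint i (chronological order).
// Each sprint's weight halves every `halfLifeSprints` sprints into the past.
function decayedFrequency(hits: boolean[], halfLifeSprints = 6): number {
  if (halfLifeSprints <= 0) throw new Error('halfLifeSprints must be > 0');
  let weightedHits = 0;
  let totalWeight = 0;
  hits.forEach((hit, i) => {
    const age = hits.length - 1 - i; // 0 for the most recent sprint
    const weight = Math.pow(0.5, age / halfLifeSprints);
    totalWeight += weight;
    if (hit) weightedHits += weight;
  });
  return totalWeight === 0 ? 0 : weightedHits / totalWeight;
}

// Count of appearances inside the most recent window of sprints.
function windowHits(hits: boolean[], windowSize = 6): number {
  if (windowSize < 1) throw new Error('windowSize must be >= 1');
  return hits.slice(-windowSize).filter(Boolean).length;
}

// A pattern absent for 20 sprints that then shows up in 5 of the last 6.
const emerging = [
  ...Array.from({ length: 20 }, () => false),
  true, true, false, true, true, true,
];
const plainFrequency = emerging.filter(Boolean).length / emerging.length;
const weightedFrequency = decayedFrequency(emerging);
const recentCount = windowHits(emerging);
```

For the emerging pattern, the decayed frequency is well above the plain 5-of-26 share, which is the property that lets a new problem surface before it has a long history.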

Cross-team pattern detection. When multiple teams run the engine, aggregate their reports to find systemic problems. "Deployment pipeline bottlenecks" appearing across four teams is not a team problem. It is a platform problem. The same data, viewed at a different aggregation level, points at a different owner.

Correlation with delivery metrics. Link pattern data to sprint velocity, cycle time, defect rates. If a pattern cluster correlates with velocity drops, you have quantitative evidence for the cost of inaction: "deployment friction costs us 15% of sprint capacity." That sentence reorders a roadmap. Vague friction does not.
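A first pass at that link is a Pearson correlation between pattern presence and a delivery metric. The sketch below assumes one value per sprint for each series; the sample numbers are made up for illustration and are not measurements. With only a couple dozen sprints, treat the result as a prompt for investigation, not as proof of cause.

```typescript
// Pearson correlation coefficient between two equal-length series.
function pearson(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error('series must have equal length of at least 2');
  }
  const mean = (v: number[]) => v.reduce((s, x) => s + x, 0) / v.length;
  const mx = mean(xs);
  const my = mean(ys);
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < xs.length; i++) {
    const dx = (xs[i] ?? 0) - mx;
    const dy = (ys[i] ?? 0) - my;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  // Correlation is undefined for a constant series; return 0 by convention here.
  if (varX === 0 || varY === 0) return 0;
  return cov / Math.sqrt(varX * varY);
}

// Illustrative data only: 1 = pattern appeared that sprint, 0 = it did not.
const patternPresent = [1, 1, 0, 0, 1, 0];
const velocityPoints = [30, 32, 41, 40, 29, 42];
const r = pearson(patternPresent, velocityPoints); // strongly negative for this sample
```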

Pre-Launch Checklist

Verifiable states. Not aspirations.

Retro Pattern Engine Launch Readiness

API credentials provisioned for every source platform — never shared across connectors
At least 12 retro documents available (6+ months of biweekly sprints)
Normalization parser tested against each platform's actual format, not the docs
Embedding model selected; API key in the secret store, not the prompt
HDBSCAN parameters tuned on a sample dataset, not the full year
LLM labeling prompt reviewed for tone — labels carry the report's voice
Impact scoring weights agreed on with the team that will read the output
Report template reviewed by a non-technical stakeholder before first run
Data retention policy confirmed — retro data names people and complaints
Team briefed: this is a longitudinal record, not a performance review tool

Frequently Asked Questions

How many retro documents before the engine produces useful results?

12–13 retrospectives — roughly 6 months of biweekly sprints — gives HDBSCAN enough density to form real clusters. Below that, most items end up classified as noise. For sturdier output, aim for 20+ retros covering at least 9 months. The engine is a density tool; sparse data produces sparse results.

Does it work if our retros are in different languages?

Yes — with the right embedding model. Cohere's embed-multilingual-v3 or OpenAI's text-embedding-3-large project text from different languages into the same vector space, so 'deployment problems' in English and 'Bereitstellungsprobleme' in German land near each other. With a single-language model, you will get language-segregated clusters and miss the underlying pattern.

Does the engine replace retrospectives?

No. It analyzes retro output. The conversation itself — the psychological safety, the live discussion, the surfacing of new friction — is the retro's actual job. The engine extends the value of that conversation by connecting it to a year of prior conversations. Replacing the retro would remove the input the engine depends on.

How do we keep the report from turning into a blame tool?

Three constraints. Never attach individual names to patterns. Frame every pattern as a system observation, not a team failure. Always show trajectory — improving, stable, worsening — so teams see progress next to problems. Have the facilitator present the first report to set the tone. After the first report lands well, the team will defend the framing themselves.

What if our retros are unstructured — freeform text, no categories?

Default every item to neutral sentiment and lean on the embedding layer to discover structure. You lose the sentiment signal, which weakens impact scoring, but clustering still works on semantic similarity alone. With 15+ freeform retros, run a one-time LLM pass to retroactively classify each item as positive, negative, or action-oriented. A few-shot prompt — "Classify this retro item as positive, negative, or action-oriented" with 3–5 examples — clears 90% accuracy on typical engineering retro text and recovers most of the sentiment signal.
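The few-shot pass might be wired like this. It is a sketch: the example shots are hypothetical and should be replaced with items from your own retros, and the model call is omitted. Mapping "action-oriented" to neutral follows the sentiment map used for structured retros, where action headers are neutral.

```typescript
type FreeformLabel = 'positive' | 'negative' | 'action-oriented';
type Sentiment = 'positive' | 'negative' | 'neutral';

interface Shot {
  text: string;
  label: FreeformLabel;
}

// Hypothetical examples; swap in 3-5 real items from your own retros.
const SHOTS: Shot[] = [
  { text: 'Pairing sessions sped up the migration', label: 'positive' },
  { text: 'Staging was down for most of Tuesday', label: 'negative' },
  { text: 'Try a shorter standup next sprint', label: 'action-oriented' },
];

function buildClassifyPrompt(item: string, shots: Shot[] = SHOTS): string {
  const examples = shots
    .map((s) => `Item: ${s.text}\nLabel: ${s.label}`)
    .join('\n\n');
  return [
    'Classify this retro item as positive, negative, or action-oriented.',
    '',
    examples,
    '',
    `Item: ${item}`,
    'Label:',
  ].join('\n');
}

// Map the model's reply onto the RetroItem sentiment values.
// Anything unrecognized falls back to neutral.
function parseLabel(reply: string): Sentiment {
  const normalized = reply.trim().toLowerCase();
  if (normalized.startsWith('positive')) return 'positive';
  if (normalized.startsWith('negative')) return 'negative';
  return 'neutral';
}

const classifyPrompt = buildClassifyPrompt('Deployments take too long');
```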

Key terms in this piece

sprint retrospective analysis · retro pattern engine · semantic clustering retrospectives · agile continuous improvement · retrospective anti-patterns · NLP sprint data · team learning curves · HDBSCAN text clustering · retrospective action items · sprint history patterns

Sources

[1] GoRetro: AI and the Data-Driven Future of Sprint Retrospectives (goretro.ai)
[2] Scrum.org: What Is a Sprint Retrospective? (scrum.org)
[3] MDPI Applied Sciences: Automated Analysis of Sprint Retrospectives Using NLP and Clustering (mdpi.com)
[4] Scrum.org: 21 Sprint Retrospective Anti-Patterns (scrum.org)
[5] TeamRetro: Avoid These Retrospective Anti-Patterns in 2025 (teamretro.com)
[6] ScatterSpoke: Agile Retrospective Antipatterns That Most Scrum Masters Never Realize (scatterspoke.com)
[7] Easy Agile: Actionable Agile Sprint Retrospective Expert Advice (easyagile.com)
