New hires don't lack capability. They lack context. Three onboarding agents — orientation, historical reasoning, starter-ticket matching — index the institutional knowledge that already exists in PRs, ADRs, and post-mortems. Ramp compresses.
Why ramp failure is an indexing problem, not a documentation problem
Three-layer agent architecture: orientation, historical reasoning, starter-ticket matching
Exact filtering thresholds that separate signal from noise in PR mining
The RAG retrieval model that makes the context index queryable
Failure modes — what the first version gets wrong and why
A pre-launch checklist and a decision heuristic for Monday morning
The senior engineer you just hired is not confused because they are incompetent. They are confused because the context they need to operate is distributed across systems that were never designed to be read by anyone other than the people who built them. Old PRs. ADR files three migrations behind. Slack threads from 2024. The heads of three engineers who joined before the Series B.
The questions are predictable: why did we build it this way, where does this service talk to that one, what should I actually work on first. The answers exist. They are just not retrievable.
This is an indexing problem, not a documentation problem. The onboarding agent does not write new docs. It extracts the reasoning that is already in the system — PR comments, post-mortems, ADRs, incident channels — and structures it for someone encountering the codebase cold. Three layers, each targeting a question the new hire is actually asking on a specific day of their ramp.
Each layer collapses a different feedback loop between the new hire and the people who already know the answer.
One agent is not enough. The questions a new hire asks on day one are different from the ones that show up on day fifteen, and the same retrieval pipeline cannot serve both well. Layer 1 hands over the map. Layer 2 explains why the map looks the way it does. Layer 3 picks the first move. Run sequentially, each layer narrows the surface area the next one has to cover.
Generated structural orientation. Not auto-doc theater — a brief built for someone reading the system cold.
The orientation agent runs against every repository the new hire will touch and produces a service brief per repo. This is not the README rewritten by a model. The README is an input — usually a stale one — among several.
For each service, the brief carries four things:
What it does. A two-paragraph plain-language description derived from README, API surface, and route handlers. If the README has not been touched in six months, the agent flags it and falls back to code-level analysis.[1] Stale documentation with no freshness signal is worse than no documentation — it manufactures false confidence.
How it connects. A dependency map extracted from imports, API client configurations, and infrastructure-as-code. What this service calls. What calls it. What shared state it touches. The map is generated, not maintained: nobody edits it by hand, so it can never drift further from the code than the interval between regeneration runs.
Where the load-bearing files are. A guided tour of the directories that matter, weighted by commit frequency over the last 90 days. Files that change often are files the new hire will touch. Stable utility code can wait.
Who actually owns it. CODEOWNERS plus the most active contributors in the last 90 days plus the on-call rotation. The right answer to 'who do I ask' is a name, not a team channel.
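The "load-bearing files" ranking can be sketched in a few lines. This is a minimal illustration, not the agent's actual implementation: the `(commit_date, paths)` input shape, the function name, and the sample file paths are all assumptions, and in practice the input would be parsed from something like `git log --name-only`.

```python
from collections import Counter
from datetime import date, timedelta


def rank_key_files(commits, today, window_days=90, top_n=3):
    """Rank files by how many commits touched them inside the window.

    `commits` is an iterable of (commit_date, [paths]) tuples. That shape is
    hypothetical; adapt it to however you parse your history.
    """
    cutoff = today - timedelta(days=window_days)
    counts = Counter()
    for commit_date, paths in commits:
        if commit_date >= cutoff:
            counts.update(set(paths))  # count each file once per commit
    # Sort by count descending, then by path, so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Illustrative data only.
today = date(2025, 6, 1)
commits = [
    (date(2025, 5, 20), ["src/payments/retry.ts", "src/payments/client.ts"]),
    (date(2025, 5, 2), ["src/payments/retry.ts"]),
    (date(2025, 4, 11), ["src/payments/client.ts", "README.md"]),
    (date(2024, 11, 3), ["src/utils/format.ts"]),  # outside the 90-day window
]
key_files = rank_key_files(commits, today)
```

With this sample history the two payment files tie at two commits each and the stable utility file drops out entirely, which is the behavior the guided tour relies on.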
onboarding-briefs/
├── payments-service/
│ ├── overview.md
│ ├── dependency-map.json
│ ├── key-files-guide.md
│ ├── ownership.md
│ └── recent-changes.md
├── user-service/
│ ├── overview.md
│ ├── dependency-map.json
│ ├── key-files-guide.md
│ ├── ownership.md
│ └── recent-changes.md
└── notification-service/
  ├── overview.md
  ├── dependency-map.json
  ├── key-files-guide.md
  ├── ownership.md
  └── recent-changes.md

The historical context exists in PR threads, ADRs, and post-mortems. It just was never indexed for retrieval.
The most expensive sentence in any onboarding conversation is 'there's a reason for that, but nobody remembers exactly what it was.' The reason exists. Some senior engineer typed it out two years ago in a PR review or an incident channel and then closed the tab. Layer 2 indexes those typings.
Three sources, ranked by signal density:
Architecture Decision Records. Where they exist, ADRs are the highest-density context source the team produces. The agent parses every ADR, links each decision to the services and code paths it affects, and surfaces them on lookup.[4] AWS's prescriptive guidance — drawn from more than 200 real ADR cycles — notes that teams spend material time re-litigating decisions that were already made and documented, simply because the documentation was never tied to the code it governed. When the new hire asks why the notification service polls instead of taking webhooks, the answer is the 2024 ADR, not a Slack ping to the senior on call.
PR review threads. The richest source most teams have, and the most under-indexed. Significant PRs — 8+ review comments, multiple revision cycles, large diffs — carry inline arguments about tradeoffs, rejected alternatives, and the failure modes the author is defending against. The agent indexes those discussions and ties them to the files they touch.
Post-mortem documents. Defensive code looks excessive until you read the post-mortem. The retry wrapper that wraps every external call has a date attached to it: the cascading timeout in 2024 that took the payment pipeline down for three hours. The agent links recommendations to the diffs that implemented them so the new hire reads code with the scar tissue visible.
Embedding-only retrieval fails on the questions new hires actually ask. The architecture needs three retrieval modes.
Most onboarding knowledge base attempts use a single vector store and call it RAG. This works well for recall — 'show me something about the payment service' — and poorly for reasoning — 'why does this file exist and why does it look the way it does.' The queries that matter during onboarding skew heavily toward the second type.[7]
A more robust setup uses three retrieval modes and routes between them:
Semantic retrieval (vector similarity) handles 'what is X' and 'show me examples of Y', the orientation questions that dominate Layer 1. The embedding model indexes file content, README excerpts, and service summaries. Chunk size matters here: 512-token chunks with 64-token overlap work better than naive paragraph splits, because a single logical unit of code often spans several blank-line-separated blocks and a paragraph split cuts it apart.
Keyword + path retrieval (BM25 or exact match) handles 'find the ADR that changed the auth flow' or 'show me context for src/payments/retry.ts'. File path matching is the highest-precision signal when the new hire is already looking at specific code. Hybrid search, which fuses the semantic and BM25 rankings with Reciprocal Rank Fusion, tends to outperform either alone on the 'explain this file' queries that Layer 2 needs to answer.
Graph traversal handles 'trace this decision forward' — starting from an ADR and following which files changed, which incidents referenced the same pattern, and which subsequent ADRs superseded or modified the original decision. The dependency map from Layer 1 becomes the graph; each ContextEntry's relevantPaths are edges. This is the retrieval mode that no embedding model replaces.
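The fixed-size chunking used for semantic retrieval can be sketched as a sliding window. This is a minimal version under stated assumptions: the tokens are plain strings standing in for whatever the embedding model's tokenizer produces, and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Synthetic tokens for illustration; a real pipeline would tokenize file content.
tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
```

For 1,000 tokens this yields three chunks of 512, 512, and 104 tokens, and the last 64 tokens of each chunk reappear at the start of the next one, so a function body that straddles a boundary is still seen whole by at least one chunk when it is shorter than the overlap.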
The routing logic is straightforward: path present in the query → keyword + graph. No path but a time reference → BM25 with date filter. Freeform question → semantic + rerank. Start simple. The failure mode of over-engineering the router before the index has real content is spending two sprints on architecture and shipping nothing.
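A "start simple" router along those lines might look like the sketch below. The two regular expressions and the mode names are assumptions made for illustration; they would need tuning to the repository layout and to how the team actually phrases questions.

```python
import re

# Hypothetical patterns, not a definitive grammar for paths or dates.
PATH_RE = re.compile(r"\b[\w.-]+(?:/[\w.-]+)+\.\w+\b")
TIME_RE = re.compile(
    r"\b(?:19|20)\d{2}\b|\blast (?:week|month|quarter|year)\b", re.IGNORECASE
)


def route(query):
    """Pick retrieval modes using the three rules described above."""
    if PATH_RE.search(query):
        return ["keyword", "graph"]
    if TIME_RE.search(query):
        return ["bm25_date_filtered"]
    return ["semantic", "rerank"]


by_path = route("show me context for src/payments/retry.ts")
by_time = route("find the ADR that changed the auth flow in 2024")
freeform = route("why does the notification service poll instead of taking webhooks")
```

The path check runs first on purpose: a query that names both a file and a year is still best served by the file, which is the higher-precision signal.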
The starter ticket either teaches the codebase or wastes a week. Match scope, skill, and sprint context — or don't ship the agent at all.
Most teams botch the starter ticket. Either it is so trivial it teaches nothing — fix a typo in a doc nobody reads — or it is so sprawling the new hire spends three weeks understanding why the architecture is the way it is before they touch a line of code. The right starter ticket sits in a narrow band: meaningful enough to force engagement with the codebase, scoped enough to ship in two to three days, and connected to something the team will actually merge.
The matching agent takes three inputs and returns a ranked list:
Sprint context. Current backlog, team velocity, upcoming deadlines. The agent surfaces tickets that are needed but not on the critical path — work the team wants done but will not block the sprint if the new hire takes longer than expected.
Skill profile. Stated experience with languages, frameworks, and domain areas, captured during the interview process or a structured intake. A frontend specialist does not get a Kubernetes networking ticket on day three. This is not a kindness. It is a calibration of where the new hire's existing context actually transfers.
Codebase accessibility. From Layer 1: which services have current documentation, the highest test coverage, and the most active reviewers. The agent biases toward those. The first ticket lands where the safety net is densest.
Output is a ranked list of three to five tickets, each with the rationale attached: what the new hire will learn, which services the change will touch, who they should pair with on review.
| Factor | Weight | Score 1 (Cut It) | Score 5 (Ship It) |
|---|---|---|---|
| Scope clarity | 0.25 | Vague requirements. Done criteria undefined. | Bounded scope. Acceptance criteria written down. |
| Learning value | 0.25 | Trivial change. Codebase exposure: zero. | Touches 2-3 services. Forces engagement with core patterns. |
| Safety net | 0.20 | No tests. No active owner. Sole reviewer on PTO. | Strong test suite. Reviewers responsive within hours. |
| Sprint relevance | 0.15 | Backlog filler. Team will not notice if it ships. | Current sprint. Team needs it merged this cycle. |
| Skill match | 0.15 | Requires a stack the new hire has never touched. | Aligns directly with stated experience. |
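The table reduces to a weighted average. The sketch below uses the weights exactly as listed; the ticket IDs, the per-factor scores, and the function names are made up for illustration.

```python
# Weights copied from the table above; they sum to 1.0.
WEIGHTS = {
    "scope_clarity": 0.25,
    "learning_value": 0.25,
    "safety_net": 0.20,
    "sprint_relevance": 0.15,
    "skill_match": 0.15,
}


def ticket_score(scores):
    """Weighted average of 1-5 factor scores; the result stays on the 1-5 scale."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing factors: {sorted(missing)}")
    for name in WEIGHTS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def rank_tickets(tickets, limit=5):
    """Return up to `limit` tickets, best score first."""
    return sorted(tickets, key=lambda t: ticket_score(t["scores"]), reverse=True)[:limit]


# Hypothetical tickets and scores.
tickets = [
    {"id": "DOC-7", "scores": {"scope_clarity": 5, "learning_value": 1,
                               "safety_net": 3, "sprint_relevance": 1, "skill_match": 5}},
    {"id": "PAY-101", "scores": {"scope_clarity": 5, "learning_value": 4,
                                 "safety_net": 5, "sprint_relevance": 3, "skill_match": 4}},
    {"id": "INFRA-42", "scores": {"scope_clarity": 2, "learning_value": 5,
                                  "safety_net": 2, "sprint_relevance": 4, "skill_match": 1}},
]
ranked_ids = [t["id"] for t in rank_tickets(tickets)]
```

Here the typo-fix-style ticket (DOC-7) scores about 3.0 despite perfect scope and skill match, because its learning value is 1, while PAY-101 lands at about 4.3 and ranks first.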
The most valuable onboarding artifact is not a document anyone wrote. It is the by-product of every team's daily communication that nobody curated for onboarding. The agent harvests this passively — no new process, no one drafting onboarding wikis on top of their actual work.
PR comment mining. Review comments are direct links between code patterns and operational scars. When a reviewer writes 'use the existing retry wrapper — last time someone rolled their own we got the cascading timeout from INC-2847,' that comment is a pointer from a code line to an incident. The agent indexes those pointers.
Slack thread analysis. Project channels carry decision narratives that never become formal docs. The agent scans threads with high engagement — many participants, many messages — in channels tagged to the new hire's team, extracts the decision, and links it to the relevant code or ticket.
Incident context. Post-mortems document what broke. The richer context lives in the incident channel itself: the hypotheses that were tested and rejected, the workarounds that became permanent, the assumptions that turned out to be wrong. The agent indexes the channel transcript alongside the formal write-up.
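The "pointer from a code line to an incident" idea can be made concrete with a small extraction pass over review comments. This sketch assumes incident IDs follow the `INC-2847` shape quoted above and that each comment is a dict with `path`, `line`, and `body` keys; both are assumptions about your tracker and your review tool's export format.

```python
import re

# Assumed ID format, based on the "INC-2847" example; adjust to your tracker.
INCIDENT_RE = re.compile(r"\bINC-\d+\b")


def incident_pointers(comments):
    """Return (path, line, incident_id) for each incident mentioned in a comment."""
    pointers = []
    for comment in comments:
        for incident_id in INCIDENT_RE.findall(comment["body"]):
            pointers.append((comment["path"], comment["line"], incident_id))
    return pointers


# Hypothetical review comments.
comments = [
    {"path": "src/payments/retry.ts", "line": 42,
     "body": "use the existing retry wrapper, last time someone rolled their own "
             "we got the cascading timeout from INC-2847"},
    {"path": "src/payments/client.ts", "line": 10, "body": "nit: rename this variable"},
]
pointers = incident_pointers(comments)
```

Only the first comment produces a pointer; the nit carries no incident reference and contributes nothing to the index.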
All of this runs continuously. By the time the new hire signs their offer, the index is already populated with months of institutional knowledge.[3]
A caveat that cost us a sprint to learn. The first version pulled every PR comment indiscriminately. Result: noise. Snarky review banter, debates from a pre-migration architecture, context tied to code that had since been deleted. The index is only as useful as its filter. The threshold that worked for us: 8+ review comments, 2+ participants, merged within the last 18 months. Older PRs are excluded unless they touch infrastructure that has not changed. The filter is not optional. It is the product.
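Those thresholds translate directly into a predicate. In the sketch below, the `PullRequest` fields are hypothetical, 18 months is approximated as 548 days, and the "unchanged infrastructure" exception is modeled as a boolean flag that something upstream would have to compute. It also assumes the comment and participant thresholds still apply to older PRs, which the text does not state either way.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class PullRequest:
    number: int
    review_comments: int
    participants: int
    merged_on: date
    touches_unchanged_infra: bool = False  # hypothetical flag for the age exception


def is_significant(pr, today, min_comments=8, min_participants=2, max_age_days=548):
    """Apply the 8+ comments, 2+ participants, ~18 months filter."""
    if pr.review_comments < min_comments or pr.participants < min_participants:
        return False
    recent = pr.merged_on >= today - timedelta(days=max_age_days)
    # Assumption: older PRs must still clear the comment/participant thresholds.
    return recent or pr.touches_unchanged_infra


# Illustrative data only.
today = date(2025, 6, 1)
prs = [
    PullRequest(1, 12, 3, date(2025, 1, 15)),
    PullRequest(2, 3, 2, date(2025, 3, 1)),    # too few comments
    PullRequest(3, 15, 4, date(2022, 5, 10)),  # too old
    PullRequest(4, 9, 2, date(2022, 2, 2), touches_unchanged_infra=True),
]
kept = [pr.number for pr in prs if is_significant(pr, today)]
```

Running the filter before anything is embedded keeps PRs 1 and 4 and drops the other two, so the noise never reaches the index in the first place.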
Without the agent:
README last touched 14 months ago. Treated as truth.
Shadow a senior engineer for a week. Hope they remember to explain the load-bearing parts.
Ask 'why' in Slack. Wait hours. Get half an answer.
Starter ticket assigned by guesswork. Teaches nothing or overwhelms.
Tribal knowledge transferred by accident, through trial and error.
First meaningful PR: 60-90 days in.

With the agent:
Service brief regenerated weekly. Freshness timestamped.
Query the context index. Historical reasoning surfaces in seconds with sources attached.
PR threads, ADRs, and post-mortems indexed and tied to the files they explain.
Starter ticket matched to skill profile, sprint priority, and codebase safety net.
Institutional knowledge searchable from day one. Senior engineers stop being a verbal cache.
First meaningful PR: 2-4 weeks, depending on codebase complexity.
The failure modes are predictable. Name them before you hit them.
Every team that builds this system makes roughly the same mistakes in the same order. Here is the sequence:
Indexing too much, filtering too little. The first instinct is to ingest everything — all PRs, all Slack channels, all doc pages. The index balloons. Query latency climbs. Retrieval quality collapses because the signal-to-noise ratio is 1:20. The fix is not a better embedding model. It is the filter. Enforce the thresholds before you build the retrieval layer.
Treating the service brief as documentation. The brief is context for someone unfamiliar with the system. If engineers start citing the brief as the authoritative source — instead of the actual code or the ADR — you've created a new stale layer on top of the old stale layer. The brief regenerates weekly and carries no normative authority. The code is the source of truth. The brief is a reading guide.
Wiring the Slack scanner to DMs or private channels. This happened at two organizations I'm aware of. In both, the engineering team objected immediately, and they were right to. The scanner got shut down, the trust damage took months to repair, and the whole project was set back. Public engineering channels only. The list is a config file checked into the repo. Any engineer can audit it.
Shipping Layer 3 before Layer 1 is accurate. Starter ticket matching generates visible output fast, which creates pressure to ship it first. Resist. A miscalibrated skill profile combined with an inaccurate service brief produces a ticket recommendation that is actively harmful — it lands a new hire in the wrong part of the codebase with the wrong frame for what they're looking at. Get the orientation brief right first.
Not measuring time-to-first-PR before deploying. The metric that validates this whole system is time-to-first-meaningful-PR. If you don't establish a baseline before the agent ships, you have no way to claim improvement. The measurement cost is a spreadsheet and a conversation with engineering managers about what 'meaningful' means. Do it before you write the first line of agent code.
The agent earns its cost at a specific team size and hiring velocity. Below those thresholds, a structured wiki beats an agent.
| Signal | Build the Agent | Use a Structured Wiki Instead |
|---|---|---|
| Hiring velocity | 4+ engineers per quarter. Onboarding is a recurring event. | Fewer than 4 per year. Onboarding is rare enough to handle manually. |
| Codebase age | 3+ years. ADRs, post-mortems, and PR threads have accumulated density. | Under 2 years. Not enough historical signal to make the index valuable. |
| Team size | 20+ engineers. Context gap between senior and new hire is wide. | Under 10. The founding team still remembers why everything was built. |
| PR history | 50+ significant PRs (8+ comments). Enough contested decisions to index. | Fewer than 20. Not enough signal. Mine what you have manually. |
| ADR practice | At least 10 ADRs exist as a starting point. The agent fills the gaps. | Zero ADRs, zero post-mortems. Index the PR threads first; start ADR practice simultaneously. |
The time-to-first-PR ranges quoted earlier come from self-reported outcomes at teams running structured onboarding, not controlled studies. Codebases with active ADR practices and decent test coverage see the biggest gains. Teams with sparse documentation see modest improvements until the context index matures, which takes a quarter, not a week. Pilot with one team. Measure time-to-first-PR before claiming org-wide impact. Numbers without a baseline are theater.
How do you keep the service briefs from going stale?
Run Layer 1 on a weekly cron, not on hire dates. Diff every run. Briefs that change every week are telling you the service is in flux — that is itself useful context for the next new hire. Version each brief with a generated-on timestamp. A stale brief with no freshness signal is worse than no brief at all because it manufactures false confidence.
What if the team doesn't keep ADRs?
Most teams don't. That is the default state, not the exception. The agent compensates by raising the weight on PR mining — significant PRs with 8+ comments and multiple revision cycles serve as informal ADRs. The agent can also draft retroactive ADRs from the highest-signal PR threads and surface them to engineering leadership for formalization. Onboarding becomes the forcing function for the ADR practice the team should have started two years ago.
How do you keep sensitive material out of the Slack scan?
Scope hard. Public engineering channels only — never DMs, never private channels, never HR or leadership channels. Keyword filter against 'salary', 'performance review', 'layoff', 'HR'. Publish the full list of scanned channels in the repo as a config file so any engineer can read it and propose changes via PR. The list is not a secret. The transparency is the point — it stops the scanner from becoming an unmonitored surveillance vector.
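A scoping check along those lines could look like the following sketch. The channel names and the message dict shape are hypothetical, and in the setup described above the allowlist would be loaded from the config file in the repo rather than hard-coded.

```python
import re

# Hypothetical allowlist; in practice, load it from the audited config file.
SCANNED_CHANNELS = {"eng-payments", "eng-platform", "incidents"}
BLOCKED_KEYWORDS = ("salary", "performance review", "layoff", "hr")
# Word boundaries keep "hr" from matching inside words such as "three".
BLOCKED_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in BLOCKED_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_indexable(message):
    """Allow only public, allowlisted channels and keyword-clean text."""
    if message["channel_type"] != "public":
        return False
    if message["channel"] not in SCANNED_CHANNELS:
        return False
    return BLOCKED_RE.search(message["text"]) is None


ok = is_indexable({"channel_type": "public", "channel": "eng-payments",
                   "text": "We chose polling over webhooks three sprints ago."})
blocked = is_indexable({"channel_type": "public", "channel": "eng-payments",
                        "text": "Talk to HR about it."})
```

The allowlist does the real work. The keyword filter is a second line of defense for sensitive remarks that land in an otherwise appropriate channel, and it will miss anything phrased without the listed terms.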
Does the agent replace the buddy or mentor?
No, and conflating them is the mistake teams make on first deployment. The agent handles information transfer — the structured, factual context about code, architecture, and historical decisions that takes hours to communicate verbally and is mostly already written down somewhere. The buddy handles cultural integration, unwritten norms, and the judgment calls no index will ever capture. The agent's job is to stop the buddy from being a walking encyclopedia so they can focus on the parts of onboarding that actually require a human: psychological safety, team dynamics, advocating for the new hire when nobody else will.
What retrieval architecture actually works at scale?
A hybrid approach: semantic search (vector similarity) for 'what is X' queries, BM25 keyword search for 'find the ADR about Y' queries, and graph traversal for 'trace this decision forward' queries. Route between them based on whether the query contains a file path or date reference. Reciprocal Rank Fusion to combine semantic and keyword results. Reranking on the top 20 candidates before returning the top 5. The embedding chunk size that works in practice is 512 tokens with 64-token overlap — larger chunks lose precision on file-specific questions, smaller chunks lose the context that makes summaries coherent.
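Reciprocal Rank Fusion works on ranks rather than raw scores, which is why it can combine vector and BM25 results without normalizing either. A minimal sketch follows; the document IDs are made up, and `k=60` is a commonly used constant, chosen here as an assumption rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; each list is ordered best-first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from the two retrievers.
semantic = ["adr-012", "pr-481", "pm-2024-03", "pr-502"]
keyword = ["pr-481", "pr-377", "adr-012"]

fused = reciprocal_rank_fusion([semantic, keyword])
candidates = fused[:20]  # these would go to the reranker, which returns the top 5
```

In this example `pr-481` comes out first because both retrievers rank it highly, ahead of `adr-012`, which tops only the semantic list. Documents found by a single retriever still make the candidate set but sit lower.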
How long does it take for the index to become useful?
For PR mining and post-mortems, the index is useful from day one if the codebase is more than 18 months old — there's already enough accumulated signal. ADR indexing is only as good as ADR coverage, which is often thin at first. The full system reaches steady-state usefulness — where the new hire can answer 80% of their 'why' questions without interrupting a senior engineer — after about one quarter of continuous indexing and one or two onboarding cycles that generate feedback on retrieval quality.
The bottleneck is not access to code. It is access to context. Every engineering organization sits on months of accumulated decisions, tradeoffs, and operational scars embedded in artifacts nobody curates for onboarding. The agent does not write new docs. It indexes the artifacts that already exist and makes them retrievable on the day someone new needs them.
Ship Layer 1 first. Repository access only — minimal integration cost, immediate signal. Add Layer 2 once the service briefs are accurate enough that engineers stop correcting them. Add Layer 3 last, after sprint integration and skill profile intake are stable. The system compounds: every new hire who uses it sharpens the index for the next one.[2]
Tribal knowledge is a single point of failure. Index it before the next hire shows up.