The primary consumer of your documentation is no longer a human. It is an agent making code changes, retrieving context, executing workflows. Treat docs as infrastructure — versioned, tested, owned — or ship guesses every time the model runs.
Every Claude session starts from zero unless something carries the org forward. CLAUDE.md is that something — a persistent context layer that encodes team topology, current priorities, and the decisions you have already paid for. Treat it as a config file and you keep paying the coordination tax.
Four layers between a generic assistant and a colleague: always-on slow facts, on-demand skill files, live MCP data, and a persistent entity graph. One architecture. Zero fine-tuning. The teams that ship all four cut correction cycles in half.
New hires don't lack capability. They lack context. Three onboarding agents — orientation, historical reasoning, starter-ticket matching — index the institutional knowledge that already exists in PRs, ADRs, and post-mortems. Ramp compresses.
Most enterprise AI failures are not model failures. They are retrieval failures. Chunking, embeddings, vector stores, knowledge graphs, and the context budget — what actually breaks at scale and how to build the memory layer that holds.
SRE runbooks assume one process, one stack trace, one bad line. Agent failures are distributed across dozens of reasoning steps — the wrong premise gets laundered through 33 more calls before the user sees it. Here is the taxonomy, the triage, the postmortem.
Most agent failures return HTTP 200. The dashboard stays green while the reasoning chain quietly compounds the wrong premise. Here is the triage runbook, the failure-mode field guide, and the postmortem template that survives non-deterministic systems.
Most production agent failures are not model failures. They are missing constraints — business rules carried in four engineers' heads with no formal representation agents can query. The fix is a versioned, governed context store the data team owns instead of answers.