Four layers between a generic assistant and a colleague: always-on slow facts, on-demand skill files, live MCP data, and a persistent entity graph. One architecture. Zero fine-tuning. The teams that ship all four cut correction cycles by more than half.
First time you point Claude at a real codebase, it behaves like a brilliant contractor who has never seen your org chart, your deployment topology, or your naming conventions. Every question needs preamble. Every answer needs correction. The coordination tax is the entire interaction.
A context layer is the structural fix. Not fine-tuning. Not a custom model. A deliberate architecture of context files, skill definitions, live data connections, and entity relationships that loads the right information at the right time, scoped to the task at hand.[4] Teams that build all four layers cut correction cycles by more than half inside the first month.
Four layers, each tuned to a different access pattern: always-on slow facts, on-demand domain skills, live real-time data, and a persistent entity graph. Most teams get Layer 1 right and stop. The teams that build all four report fewer corrections, fewer hallucinations about internal systems, and far less time spent re-explaining context the assistant should already have. One honest caveat: Layer 4 is hard to maintain. If your organizational data is scattered across Jira, Slack, and Notion with no source of truth, the entity graph will cost more than it returns for the first six months. Build it last, on purpose.
Why CLAUDE.md is the highest-leverage context investment — and where teams overload it
How skill files solve the context budget problem without limiting coverage
The real token cost of MCP connections: when always-on becomes always-expensive
When to build Layer 4 (entity graph) and when to skip it for six months
Concrete SKILL.md frontmatter syntax, MCP server selection heuristics, and Monday-morning actions
Slow facts, domain skills, live data, entity relationships
Skills load on-demand, so they cost nothing until the task needs them
A 50-tool server consumes 25,000–75,000 tokens before you type a word
Magnitude depends on documentation quality and team discipline
Always-on context. The slow facts that survive next month's roadmap.
CLAUDE.md is the first file the assistant reads when it enters your project.[1] Treat it as the employee handbook plus the architecture decision record plus the tribal knowledge that lives in three Slack threads and one engineer's head. Most teams treat it as a README. That is the mistake.
The always-on layer holds slow facts — information with a long shelf life that applies to almost every interaction. Tech stack and versions. Deployment targets. Naming conventions. Testing philosophy. Directory structure. The specific things a new hire would need on day one to avoid an embarrassing first commit.
Anthropic's own documentation recommends keeping CLAUDE.md under 200 lines.[6] That number is a forcing function, not an aesthetic preference. CLAUDE.md loads at session start and stays loaded across every turn of every conversation. A 5,000-token file costs 5,000 tokens before you've typed a word, then again on every subsequent message. The fix is not simply to keep the file sparse; it's to move procedure and domain-specific knowledge into Layer 2 skills, where they cost nothing until invoked.
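The per-turn arithmetic can be sketched in a few lines. This is a rough illustration, assuming the file is counted in full on every turn and ignoring any prompt caching; the function name `always_on_cost` and the 20-turn session are hypothetical.

```python
def always_on_cost(file_tokens: int, turns: int) -> int:
    """Cumulative tokens an always-loaded file contributes over a session.

    Assumption: the file is counted once per turn, with no caching discount.
    """
    if file_tokens < 0 or turns < 0:
        raise ValueError("file_tokens and turns must be non-negative")
    return file_tokens * turns


# A 5,000-token CLAUDE.md over a hypothetical 20-turn session.
print(always_on_cost(5_000, 20))  # 100000
```

Under these assumptions the cost scales linearly with session length, which is why trimming the always-on file pays back on every turn.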
What does not belong here: anything that changes weekly, anything specific to a single feature, and anything that reads like a procedure rather than a fact. If a CLAUDE.md section has grown into a multi-step deployment workflow, it belongs in a skill. If it describes Stripe webhook handling, it belongs in a payments skill. The file passes when you can state its content as a set of invariants — things that are always true about this codebase — rather than a set of instructions.
Domain knowledge that loads only when the task touches the domain. Everything else stays out of the window.
Layer 2 fixes the budget problem. Your organization has deep knowledge about payments, authentication, onboarding, billing, compliance, and a dozen other domains. Loading all of it into every interaction wastes tokens and dilutes the signal until the assistant treats noise like instruction.[2]
Skill files are domain-specific context packages that load on demand. Each skill is a directory with a SKILL.md file holding instructions, conventions, and resources for that domain. Task is about payments — the payments skill loads. Task shifts to authentication — the auth skill loads, the payments skill drops out.[2]
The mechanism that makes this work is YAML frontmatter in the SKILL.md file. The description field tells the assistant what this skill covers and when to activate it. The content below the frontmatter carries the actual instructions. Claude reads the description first (low cost), then loads the full skill body only when the task matches.[6] This is progressive disclosure: metadata first, instructions on match, supporting resource files only when explicitly referenced.
The mechanism lets you maintain a library of 50+ skills without any of them consuming context until the task makes them relevant. Without progressive disclosure, you'd be choosing between coverage and budget. With it, you stop choosing.
One failure mode to watch: skills whose description field is too broad. A skill described as 'handles backend work' activates on nearly every engineering task — defeating progressive disclosure entirely. Write descriptions as trigger conditions: 'Use when working with Stripe payments, webhooks, or refund flows.' Specific triggers mean the skill loads when it should, stays dormant when it shouldn't.
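A minimal sketch of the frontmatter split described above, assuming a SKILL.md with flat `key: value` frontmatter between `---` delimiters. The `name` field, the skill body, and the helper `split_frontmatter` are illustrative; a real parser would use a YAML library rather than string splitting.

```python
SKILL_MD = """\
---
name: payments
description: Use when working with Stripe payments, webhooks, or refund flows.
---
# Payments conventions
(Full instructions would go here and load only on a match.)
"""


def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md string into (frontmatter fields, body).

    Handles only flat `key: value` lines, which is enough for this sketch.
    """
    lines = text.splitlines()
    stripped = [line.strip() for line in lines]
    if not stripped or stripped[0] != "---":
        return {}, text  # no frontmatter present
    end = stripped.index("---", 1)  # position of the closing delimiter
    fields: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields, "\n".join(lines[end + 1:])


meta, body = split_frontmatter(SKILL_MD)
# Only the short description needs to be visible up front;
# the body is the part that loads when the task matches.
print(meta["description"])
print(len(meta["description"]), "chars of metadata vs", len(body), "chars of body")
```

The description here is written as a trigger condition, so the cheap metadata alone is enough to decide whether the full body is worth loading.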
.claude/
├── commands/
│   ├── deploy.md
│   ├── create-migration.md
│   └── generate-api-route.md
└── skills/
    ├── payments/
    │   ├── SKILL.md
    │   ├── stripe-conventions.md
    │   └── refund-flow.md
    ├── auth/
    │   ├── SKILL.md
    │   ├── session-management.md
    │   └── rbac-model.md
    └── onboarding/
        ├── SKILL.md
        ├── wizard-steps.md
        └── trial-logic.md

| Characteristic | Layer 1 (CLAUDE.md) | Layer 2 (Skill Files) |
|---|---|---|
| Update frequency | Monthly or less | Weekly to monthly |
| Scope | Entire organization | Single domain or feature area |
| Loading behavior | Always loaded | Loaded on demand when trigger matches |
| Token budget | ≤200 lines, hard ceiling | 5–10% of window per skill while active |
| Content type | Stack, conventions, architecture invariants | Domain logic, API patterns, edge cases |
| Owner | Tech lead or platform team | Domain team or feature owner |
| Examples | Naming conventions, deploy targets, anti-patterns | Stripe webhook handling, auth session flow, RBAC rules |
| Wrong to put here | Deployment procedures, domain-specific flows | Universal conventions, global anti-patterns |
The information that changes faster than anyone updates the doc — and why the context tax matters.
Layers 1 and 2 carry static knowledge — things you write down and update on a cadence. Layer 3 carries the information that changes faster than anyone documents it. Current database schema. State of the CI pipeline. The Notion page edited an hour ago. The latest production error log.
MCP connections give the assistant runtime access to external systems.[1] Instead of pasting a schema or describing an error log from memory, you let the assistant query the source and operate on the current state. The source is the truth. The doc is a snapshot that lies the moment it stops being maintained.
Here is the failure mode most teams discover after they've already committed to twelve MCP servers: each well-documented MCP tool definition consumes 500–1,500 tokens at initialization.[7] A server with 50 tools costs 25,000–75,000 tokens before you've typed a single prompt. A typical enterprise stack — GitHub MCP, Jira, Notion, database schema, CI/CD — can consume 100,000–200,000 tokens before the conversation starts, leaving almost no room for actual work on a 200,000-token context window.[7]
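The range quoted above is a straight multiplication, sketched here for checking your own stack. The per-tool bounds of 500 and 1,500 tokens and the 200,000-token window come from the text; the function names are hypothetical.

```python
def tool_definition_cost(n_tools: int, low: int = 500, high: int = 1_500) -> tuple[int, int]:
    """Return the (low, high) token estimate for loading n_tools definitions up front."""
    return n_tools * low, n_tools * high


def remaining_window(used_tokens: int, window: int = 200_000) -> int:
    """Tokens left for actual work after tool definitions are loaded (never negative)."""
    return max(window - used_tokens, 0)


low, high = tool_definition_cost(50)
print(low, high)                # 25000 75000
print(remaining_window(high))   # 125000
```

These figures apply only when definitions are injected up front; with deferred loading the up-front cost is smaller.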
Claude Code's answer to this is deferred loading: by default, MCP tool definitions are not injected up-front; only tool names enter context until Claude calls a specific tool.[6] That deferred model only holds if you're running a current version and haven't overridden the behavior. Verify it. Run /mcp to see what's connected and /usage to see how much of your context window the current configuration is consuming.
The leverage point in Layer 3 is which connections live always-on versus which sit behind explicit invocation. Database schema is relevant to almost every code task — keep it always-on. Production logs only matter during debugging — keep them on-demand. Analytics only matter during reporting tasks. Default to on-demand. Promote to always-on only when the connection earns it across most interactions.
What fails:
Connect every available MCP server by default
Assume more tools means smarter responses
Never audit which servers are actually used
Always-on for everything including analytics and logs
Discover context budget problem when quality degrades

What works:
Start with zero servers; add when a concrete need is named
Run /usage weekly to see which servers consume context
Demote servers to on-demand if not used in most sessions
Always-on only for schema and CI status; everything else on-demand
Set a 25% context ceiling on Layer 1 + always-on Layer 3 combined
Always-on candidates:
Database schema browser (relevant to most code generation tasks)
Git status and recent commit history (needed to understand current work)
Project configuration and environment variable names (never values)
CI/CD pipeline status (required for deployment and testing decisions)

On-demand candidates:
Production error logs (activated during debugging workflows)
Analytics and metrics dashboards (activated during reporting tasks)
Notion or Confluence pages (activated when referencing documentation)
Slack channel history (activated when researching team decisions)
Calendar and scheduling (activated for planning and coordination tasks)
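The promote-or-demote rule for Layer 3 can be sketched as a usage tally. This is an illustration only: it assumes you keep a per-session record of which servers were actually called, reads "most sessions" as a simple majority, and uses hypothetical server names.

```python
from collections import Counter


def classify_servers(sessions: list[set[str]], threshold: float = 0.5) -> dict[str, str]:
    """Label a server 'always-on' if it was used in more than `threshold` of sessions."""
    if not sessions:
        return {}
    counts = Counter(name for used in sessions for name in used)
    total = len(sessions)
    return {
        name: "always-on" if count / total > threshold else "on-demand"
        for name, count in sorted(counts.items())
    }


# Hypothetical log: which servers each of four sessions actually called.
sessions = [
    {"db-schema", "ci-status"},
    {"db-schema", "prod-logs"},
    {"db-schema", "ci-status", "analytics"},
    {"db-schema", "ci-status"},
]
for name, mode in classify_servers(sessions).items():
    print(f"{name}: {mode}")
```

A server that never appears in the log is not classified at all, which is itself a signal that it can be disconnected.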
Projects, teams, people, decisions, systems — and the relationships between them. The hardest layer. The one that changes everything.
Layer 4 is where the context layer stops being merely useful and becomes structural. An entity graph stores the relationships between projects, teams, people, decisions, and systems. When the assistant knows that the payments team owns the Stripe integration, that Sarah is the tech lead, that the team migrated from REST to tRPC last quarter, and that there's an open RFC about webhook retry logic, answers start accounting for organizational reality, not just technical correctness.
Context graphs connect entities, events, decisions, policies, and evidence so the assistant can answer why, not just what.[3] Most implementations converge on a durable master graph plus query-specific subgraphs that load into context based on the task in flight.[3]
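A minimal sketch of the master-graph-plus-subgraph idea, assuming the graph is stored as (subject, relation, object) triples and that a "query-specific subgraph" means everything within a fixed number of hops of the entity the task mentions. The entity names, the two-hop default, and the `subgraph` helper are hypothetical.

```python
from collections import deque

Edge = tuple[str, str, str]

# Hypothetical master graph as (subject, relation, object) triples.
EDGES: list[Edge] = [
    ("payments-team", "owns", "stripe-integration"),
    ("sarah", "leads", "payments-team"),
    ("payments-team", "migrated", "rest-to-trpc"),
    ("rfc-webhook-retry", "concerns", "stripe-integration"),
    ("platform-team", "owns", "ci-pipeline"),
]


def subgraph(edges: list[Edge], start: str, max_hops: int = 2) -> list[Edge]:
    """Collect edges within `max_hops` of `start`, treating edges as undirected."""
    neighbors: dict[str, list[tuple[str, Edge]]] = {}
    for edge in edges:
        subject, _, obj = edge
        neighbors.setdefault(subject, []).append((obj, edge))
        neighbors.setdefault(obj, []).append((subject, edge))

    seen = {start}
    picked: list[Edge] = []
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand past the hop limit
        for other, edge in neighbors.get(node, []):
            if edge not in picked:
                picked.append(edge)
            if other not in seen:
                seen.add(other)
                queue.append((other, hops + 1))
    return picked


for edge in subgraph(EDGES, "stripe-integration"):
    print(edge)
```

For a task that touches the Stripe integration, this pulls in the owning team, its lead, the migration, and the open RFC, and leaves the unrelated platform-team edge out of context.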
This is the hardest layer because the data is the most dispersed. Team ownership lives in the org chart tool. Project relationships live in Jira or Linear. Decision history lives in RFCs, Slack threads, and meeting notes. Architecture relationships live in code and the docs nobody updates. Building the graph means pulling from every one of those sources and maintaining edges as they drift.
Drift is the default state of any graph without an owner. Knowledge graphs reduce hallucination rates significantly when properly maintained — but 'properly maintained' is the constraint.[8] Pick the owner before you pick the schema. Define the update cadence before you write the first edge. If neither of those decisions can happen in the next sprint, skip Layer 4 for now and revisit in six months when you have more organizational context on where the assistant's blind spots actually hurt.
| Signal | Build Layer 4 now | Skip for 6 months |
|---|---|---|
| Org data location | Single source of truth (one Jira, one org chart tool) | Scattered across Jira + Linear + Notion + Slack with no canonical source |
| Team ownership | Team-system mapping is documented and stable | Ownership is disputed or changes faster than quarterly |
| Graph maintainer | Named owner, update cadence defined, tied to existing process | Would be maintained as a side project with no clear owner |
| Correction pattern | Assistant makes org-level mistakes: wrong team, wrong system owner | Assistant makes technical mistakes that Layers 1-3 already fix |
| Team maturity | Layers 1-3 stable and passing the first-PR test | Layers 1-3 not yet built or still being tuned |
Audit existing onboarding docs, ADRs, and convention notes. Distill the slow facts into one file. Test it the only way that matters: ask the assistant to explain your project structure. If it gets the structure wrong, the file is missing information. Iterate until the explanation is correct without prompting.
Pick the three domains where the assistant gets corrected most. Write skill files for those first. Each skill carries domain-specific conventions, common patterns, and the edge cases someone has already learned the hard way. Coverage everywhere is a trap. Density where it costs you is the leverage point.
Start with the data sources engineers paste into chat the most: database schema, CI status, documentation. Add production monitoring and analytics as on-demand servers, not always-on. Watch the context budget after each addition — run /usage weekly. Always-on creep is how the budget breaks.
Start narrow: teams, the systems they own, the people on each team. Expand into project dependencies, decision history, and RFC references only after the narrow graph is stable. Update the graph as part of regular team process — not as a side project that an intern owns. Drift is the default. Process is the mechanism that reverses it.
Frequency, update rate, size. The same three questions decide every piece of knowledge.
The most common failure mode in building a context layer is shoving everything into CLAUDE.md. The second most common is having no framework for placement, which produces the same result eventually.[5]
Three variables decide where knowledge belongs.
Access frequency. How often does this matter? Almost every interaction — Layer 1. Only when working on payments — Layer 2. Only during production debugging — Layer 3.
Update rate. How fast does it drift? Slow facts (monthly or less) — Layers 1–2. Fast facts (daily or weekly) — Layer 3. Relationship data that evolves gradually — Layer 4.
Context size. How many tokens does it cost? Large knowledge — full schemas, complete API docs — never always-on. Load through Layer 2 skills or Layer 3 MCP. Small knowledge — naming conventions, deploy targets — can earn the always-on slot.
Run every new piece of knowledge through these three. The answer falls out. The mistake comes from skipping the question.
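One way to encode the three questions as a placement function. This is a sketch of the rules as stated above, not a definitive policy: the parameter names are hypothetical, and large slow-changing knowledge is routed to Layer 2 here even though the text allows Layer 3 MCP as an alternative.

```python
def place(
    *,
    is_relationship: bool,
    changes_weekly_or_faster: bool,
    needed_in_most_tasks: bool,
    is_large: bool,
) -> str:
    """Pick a layer for one piece of knowledge using the three variables."""
    if is_relationship:
        return "Layer 4"  # ownership, decisions, people: the entity graph
    if changes_weekly_or_faster:
        return "Layer 3"  # fast facts come from the live source
    if needed_in_most_tasks and not is_large:
        return "Layer 1"  # small, slow, universal: earns the always-on slot
    return "Layer 2"      # slow but domain-specific or too large for always-on


# Naming conventions: small, slow, needed almost everywhere.
print(place(is_relationship=False, changes_weekly_or_faster=False,
            needed_in_most_tasks=True, is_large=False))   # Layer 1
# Stripe webhook handling: slow, but only relevant to payments work.
print(place(is_relationship=False, changes_weekly_or_faster=False,
            needed_in_most_tasks=False, is_large=False))  # Layer 2
# Production error logs: change constantly.
print(place(is_relationship=False, changes_weekly_or_faster=True,
            needed_in_most_tasks=False, is_large=True))   # Layer 3
```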
Operational checks that prevent context layer debt from accumulating silently.
How large should CLAUDE.md actually be?
Anthropic's own documentation recommends keeping it under 200 lines.[6] The right number depends on project complexity, but the framing matters: 200 lines is a forcing function, not a target. The file should hold only invariants — facts that are true on every session, across every task. Anything that reads like a procedure belongs in a skill, where it's free until needed. Validate the right length empirically — the file should pass the first-PR test without the assistant needing follow-up corrections about structure or conventions.
Can I have multiple CLAUDE.md files in a monorepo?
Yes. Claude Code reads CLAUDE.md at multiple levels: project root, subdirectories, and user-level (~/.claude/CLAUDE.md). A monorepo can pair a root CLAUDE.md holding shared conventions with per-package CLAUDE.md files holding package-specific context. The files merge hierarchically, with more specific files overriding more general ones. Use the hierarchy intentionally. Root carries cross-repo invariants. Package-level files carry the local stack and conventions without duplicating what's already upstream.
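To make the hierarchy concrete, the sketch below lists the candidate CLAUDE.md paths from most general to most specific for a given working directory. It only builds path strings and does not touch the filesystem; the repository layout is hypothetical, and the exact order in which Claude Code discovers and merges these files is an assumption here, not something this code verifies.

```python
from pathlib import PurePosixPath


def claude_md_chain(repo_root: str, working_dir: str) -> list[str]:
    """List candidate CLAUDE.md paths, general first, specific last."""
    root = PurePosixPath(repo_root)
    cwd = PurePosixPath(working_dir)
    rel = cwd.relative_to(root)  # raises ValueError if cwd is outside the repo

    chain = ["~/.claude/CLAUDE.md", str(root / "CLAUDE.md")]
    current = root
    for part in rel.parts:
        current = current / part
        chain.append(str(current / "CLAUDE.md"))
    return chain


# Hypothetical monorepo layout.
for path in claude_md_chain("/repo", "/repo/packages/billing"):
    print(path)
```

Reading the list top to bottom mirrors the intended division of labor: cross-repo invariants first, package-local conventions last.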
What's the real token cost of adding an MCP server?
With deferred loading enabled (the default in current Claude Code), you pay tool name overhead only until a tool is actually called.[6] Without it, a well-documented 50-tool server costs 25,000–75,000 tokens at initialization, and an enterprise stack of five to ten servers can consume 100,000–200,000 tokens before you type anything.[7] Confirm deferred loading is active by running /mcp and checking your server configuration. Even with deferral, more servers means more names in context — audit with /usage and disable anything not used in most sessions.
How do I measure whether the context layer is working?
Two metrics. Correction rate: how often you correct the assistant on internal systems or conventions. Re-explanation rate: how often you paste the same context back into a conversation. A working context layer cuts both by more than half within the first month. The cheap measurement: tally corrections for one week before and one week after adding skill files for your top two domains. The delta is usually obvious without tooling.
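The before-and-after tally reduces to one ratio. A minimal sketch, with hypothetical counts of 24 corrections in the week before and 10 in the week after:

```python
def reduction(before: int, after: int) -> float:
    """Fractional drop in weekly corrections; 0.5 means the count was cut in half."""
    if before <= 0:
        raise ValueError("need at least one correction in the baseline week")
    return (before - after) / before


# Hypothetical tallies: one week before and one week after adding skill files.
print(f"{reduction(24, 10):.0%}")  # 58%
```

The same function works for the re-explanation rate; only the thing being tallied changes.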
What about sensitive information in the context layer?
Never put secrets, credentials, or PII in CLAUDE.md or skill files. For sensitive organizational data, use Layer 3 MCP with proper access controls — the data stays in the source system and is queried on demand instead of sitting in plaintext files. Layer 4 entity graphs hold roles and relationships, not personal details. The rule: if it would be wrong to commit to the repo, it's wrong to put in a context file.
When should I fine-tune instead of building a context layer?
Almost never, for organizational knowledge. Fine-tuning changes model weights — it's expensive, requires training data curation, and the knowledge it encodes goes stale without re-training. A context layer externalizes the same knowledge into files you can edit and version-control today. Fine-tuning earns its cost when you need specialized task behavior or inference style that can't be prompted — not when you need the assistant to know your naming conventions or your team's domain patterns.
Anything that drifts weekly belongs in Layer 3, not Layer 1. Stale CLAUDE.md content is worse than no content because it teaches the assistant incorrect facts about your current state — the assistant trusts the file, the file is wrong, and you're debugging hallucinations that came from your own commits.
Without a concrete trigger, the assistant either loads everything (wasting context) or loads nothing (missing the domain knowledge it needs). The trigger is the contract. Write it as a named condition — 'Use when working with Stripe payments or code in /lib/payments/' — not a vague topic label.
Layers add up. If always-on context (Layer 1 plus always-on Layer 3) eats more than 25% of the window, there's not enough room left for the actual task. Drift in always-on context is the failure mode nobody notices until the assistant gets noticeably worse. Run /usage; fix what you find.
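The 25% ceiling is easy to check once you have token counts for the always-on pieces. A sketch, assuming a 200,000-token window as in the example earlier and hypothetical token counts:

```python
def always_on_share(layer1_tokens: int, always_on_mcp_tokens: int, window: int = 200_000) -> float:
    """Fraction of the context window consumed before the task begins."""
    return (layer1_tokens + always_on_mcp_tokens) / window


def over_ceiling(layer1_tokens: int, always_on_mcp_tokens: int,
                 window: int = 200_000, ceiling: float = 0.25) -> bool:
    """True when always-on context exceeds the ceiling."""
    return always_on_share(layer1_tokens, always_on_mcp_tokens, window) > ceiling


# Hypothetical budgets: a 4,000-token CLAUDE.md with two different MCP loads.
print(over_ceiling(4_000, 30_000))  # False (17% of the window)
print(over_ceiling(4_000, 60_000))  # True  (32% of the window)
```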
CLAUDE.md and skill files affect every interaction every team member has with the assistant. Untested changes introduce errors that are hard to trace because they show up as quality regressions, not as broken builds. Pull request review is the minimum bar. The first-PR test is the acceptance test.
CLI tools like gh, aws, and gcloud add zero token overhead — Claude calls them as shell commands with no MCP definition sitting in context. An equivalent MCP server costs 500–1,500 tokens per tool just to be listed. Default to CLI; use MCP only when you need bidirectional state or structured tool output that CLI parsing can't handle reliably.[6]