Most production agent failures are not model failures. They are missing constraints: business rules carried in four engineers' heads, with no formal representation that agents can query. The fix is a versioned, governed context store that the data team owns and publishes, instead of answering each question by hand.
A support routing agent classifies enterprise customers using a column that was semantically reclassified eighteen months ago. The values still look correct. The column still exists. The rule changed. Nobody updated the record because nobody owned the record. Enterprise accounts get routed to the wrong queue for weeks before someone notices the pattern in escalations.
The model did not malfunction. It executed against the data it was given, applying the only meaning it could infer from the schema. The constraint that would have caught the error (use customers.current_tier, not orders.customer_tier, for routing) exists only in the heads of the four people on the data team. There is no formal representation an agent can query. So the question routes back to a human, or it doesn't get asked at all.
This is a data sovereignty problem in the precise sense: the rules that make data mean something are concentrated in a small number of people and have no executable shape. Every agent question that touches business semantics turns into a Slack thread. The data engineer becomes the bottleneck. Adding more agents multiplies the load proportionally.
A context store is the structural answer — a versioned, governed, freshness-monitored layer that encodes what data means and what constraints apply, so agents query it directly instead of routing questions back to humans. Data engineers ship semantic schemas the way they ship dbt models. This article covers what goes in one, how to structure the schema, how to keep it current, and the ownership inversion that turns the data team from per-question answerers into per-schema publishers.
Why NL2SQL accuracy collapses from benchmark to production — and the specific failure class responsible
What a context store actually is versus a RAG index, data catalog, or wiki
The four-field schema that separates a governed document from a YAML graveyard
Freshness states (fresh / stale / expired / conflicted) and the agent behavior each one demands
CDC-driven freshness monitoring versus polling — why the 59-minute gap matters
The ownership inversion: data engineers as schema publishers, not per-question answerers
Decision table: context store versus alternatives, with thresholds
A working Python integration example you can adapt Monday morning
Why data engineers become the rate-limiter once agents are in the loop, and why adding more agents makes it worse.
Research on NL2SQL agents matches what every practitioner who has shipped one already knows: data agents fail not from retrieval errors or model limitations, but because they form misconceptions about what data actually represents.[1] They query the right column for the wrong concept. They join on a deprecated ID field. They ignore the rule that inventory counts below 5 are unreliable due to sync lag from a legacy ERP. The model did nothing wrong. Nobody encoded the rules.
The performance gap between benchmark and production is stark. Models reach 75–85% execution accuracy on clean academic benchmarks like Spider and BIRD — but in real enterprise environments with unfamiliar schemas and domain-specific logic, that number collapses. The SIGKDD NL2SQL-BUGs study found that roughly 25% of failures on curated benchmarks trace to semantic errors alone — in real databases with more complex schemas and diverse query patterns, the rate is higher.[10] Tk-Boost, a tribal knowledge injection framework, recovers up to 16.9% accuracy on Spider 2.0 and 13.7% on BIRD by pre-loading agents with the institutional rules that production databases carry but never document.[1]
Meta hit this at scale. Across 4,100+ code files in three repositories, only 5% had any form of codified tribal knowledge — the rules experienced engineers applied automatically but had never written down. To close the gap, they built a pre-compute engine using 50+ specialized AI agents that systematically read code files and produced context documents encoding what the data meant, what to watch for, and how to navigate edge cases.[2] Coverage went from 5% to 100% of code modules.
Most teams can't run 50 agents to bootstrap. They face the same structural problem regardless. A data engineer who already spends 30% of the week answering semantic questions from analysts won't survive a system where agents ask the same questions ten times faster. The Datadog State of AI Engineering report found that 69% of all LLM input tokens in production traces were consumed by system prompts — internal instructions and policy definitions pumped in from outside because the underlying data layer carries no semantics of its own.[9] Teams are compensating for a missing context layer by stuffing it into prompts. The questions have to stop routing to humans. The only way to stop them is to publish the answers as queryable artifacts before the agent asks.
Agents query data with no semantic anchor
Every business rule question routes back to a data engineer per agent
Each new agent re-discovers the same gaps
Stale rules fail silently — no observable signal until a customer complains
Data engineers are a per-question bottleneck
69% of input tokens spent on policy prompts stuffed in at runtime
Agents query versioned context documents alongside data
Data engineers publish rules once; every agent queries them at runtime
New agents inherit the full governed semantic layer on day one
Freshness states (fresh, stale, expired, conflicted) are observable, alertable, enforceable
Data engineers are per-schema owners — one schema serves many agents
System prompt shrinks; context tokens route to the decision, not the boilerplate
The distinctions that matter: RAG is not it, the data catalog is not it, the wiki is not it.
A context store is a versioned, governed semantic layer that gives AI agents the business rules, constraints, and freshness metadata they need to interpret data correctly — as a queryable, owned artifact, not as institutional folklore.
It is not a RAG knowledge base. RAG retrieves chunks of text by semantic similarity. A context store serves governed rules with explicit ownership, version history, and freshness SLAs. The operational difference is unforgiving: a RAG document goes stale silently and retrieval quality drifts. A context store document has an owner who is responsible for keeping it current, and a freshness state agents observe at query time. The two are complementary, not competitors. RAG handles unstructured retrieval. The context store handles governed business rule delivery for operational decisions.
It is not a data catalog either. Catalogs describe what data exists and where. A context store encodes what data means and what constraints apply — the rules that can't be derived from a column description. Catalogs are inventory. Context stores are doctrine.
The Atlan context engineering framework treats every piece of context as a versioned, auditable data product with the same lifecycle and ownership discipline you'd apply to a core dbt model.[4] The a16z framing is blunter: agents need a business knowledge layer that answers the questions raw data can't.[5]
Context Kubernetes, the research architecture for managing enterprise context at scale, names the components precisely: a Context Registry managing document identity, a Freshness Manager tracking four states (fresh, stale, expired, conflicted), a Permission Engine enforcing RBAC, and a Trust Policy determining which documents agents can act on without human verification.[6] These are the infrastructure primitives. The discipline they imply is the actual product.
| Tool | What it does | What it can't do | When to add a context store |
|---|---|---|---|
| RAG / vector index | Retrieves relevant text chunks by semantic similarity | Ownership, versioning, freshness SLAs, negative constraints | Always — they're complementary, not competing |
| Data catalog | Inventories tables, columns, lineage, ownership | Encode business rules and constraints; tell agents what data must NOT be used for | When agents need rule enforcement, not just discovery |
| System prompt | Injects context at request time | Scales poorly (system prompts already consume 69% of LLM input tokens in production traces); no versioning or freshness | When you've hit prompt bloat: consolidate rules into a queryable store |
| Data dictionary / wiki | Human-readable rule documentation | No freshness state, no ownership enforcement, no machine-queryable structure | When the wiki has more than 20 rules and agents can't reliably find the right one |
| Semantic layer (dbt metrics, Looker) | Defines certified metrics with consistent definitions | Doesn't cover routing rules, classification thresholds, or negative constraints | When agents need domain rules beyond metric definitions |
5% of 4,100+ code files had any written institutional knowledge. The rest lived in engineers' heads. (Engineering at Meta, Apr 2026)
Teams compensate for a missing context layer by stuffing policy rules into prompts at runtime — a scaling dead end. (Datadog State of AI Engineering 2026)
Tk-Boost recovers up to 16.9% accuracy on Spider 2.0 by pre-loading agents with institutional schema rules. (arXiv 2602.13521)
Fresh, stale, expired, conflicted. Each demands different agent behavior at query time. (Context Kubernetes, arXiv 2604.11623)
What separates a governed context document from a wiki page with extra YAML.
Most teams that try to build a context store start by porting their existing data documentation — column descriptions, table readmes, dictionary entries. The output is a searchable knowledge base. It's not a governed context store, because it's missing the three fields that make a document agent-safe: constraints, freshness, and owner.
The constraints field is the one almost nobody covers. It encodes what agents must not do with this data: the negative knowledge that lives in experienced engineers' heads and routinely causes failures when absent. A rule that says "enterprise tier requires ARR >= $50K" is useful. A constraint that says "never use orders.customer_tier for current-state routing; use customers.current_tier" is the difference between a routing agent that works and one that silently misclassifies for months.
The freshness object anchors the SLA. Without it, the context document is a snapshot with no expiry. Agents that can see freshness state include uncertainty in their output when a rule is stale — instead of making confident wrong decisions.
A minimal but complete schema for a context document:
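One way to picture the schema is as a typed record that refuses to exist without an owner or constraints. The sketch below is illustrative, not the article's canonical format: the field names follow the list used later in this article (type, version, owner, content, constraints, freshness, retrieval hints), while the `Freshness` sub-fields, the document id, the team name, and the underscore spelling of the column names are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Freshness:
    sla_minutes: int            # assumed: max tolerated age before the rule counts as stale
    last_verified: datetime     # assumed: when the owner last confirmed the rule
    source_models: list[str]    # assumed: upstream models whose changes invalidate the rule


@dataclass
class ContextDocument:
    id: str
    type: str                   # e.g. "classification", "routing", "threshold"
    version: str                # semantic version, 1.0.0 on first publication
    owner: str                  # accountable team
    content: str                # the positive rule
    constraints: list[str]      # negative knowledge: what agents must NOT do
    freshness: Freshness
    retrieval_hints: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject documents that lack the fields that make them agent-safe.
        if not self.constraints:
            raise ValueError(f"{self.id}: constraints must not be empty")
        if not self.owner:
            raise ValueError(f"{self.id}: owner is required")


# Hypothetical example document for the routing rule discussed above.
doc = ContextDocument(
    id="customer-tier-routing",
    type="routing",
    version="1.0.0",
    owner="data-platform",
    content="Enterprise tier requires ARR >= $50K.",
    constraints=[
        "Never use orders.customer_tier for current-state routing; "
        "use customers.current_tier."
    ],
    freshness=Freshness(
        sla_minutes=60,
        last_verified=datetime(2026, 1, 15, tzinfo=timezone.utc),
        source_models=["dim_customers"],
    ),
    retrieval_hints=["enterprise", "tier", "support routing"],
)
```

Validating at construction time means an empty constraints list fails at authoring, before any agent can query the document. The same structure serializes naturally to YAML or JSON if the documents live in Git.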
Freshness fails silently by default. Observable state — not hope — is the fix.
Freshness is where most context store implementations rot in production. Teams build the initial schema, populate it with the rules that exist on day one, and ship. Six months later the business has changed, the documents haven't, and agents are confidently applying retired rules to live decisions. The failure mode is structural: degradation is silent by default. A stale document doesn't throw. It serves wrong information with full confidence.
The four-state model from the Context Kubernetes architecture cuts the problem cleanly:[6]
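As a rough sketch, the four states can be modeled as an enum plus a function that derives the state from a document's age. The state names come from the cited architecture; the thresholds, the `hard_expiry_factor` parameter, and the expired-state agent action are assumptions made for illustration, not definitions taken from the paper.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class FreshnessState(Enum):
    FRESH = "fresh"
    STALE = "stale"
    EXPIRED = "expired"
    CONFLICTED = "conflicted"


# Assumed agent behavior per state (illustrative policy).
AGENT_ACTION = {
    FreshnessState.FRESH: "use the rule",
    FreshnessState.STALE: "use the rule and flag uncertainty",
    FreshnessState.EXPIRED: "do not use; escalate to the owning team",
    FreshnessState.CONFLICTED: "do not decide; surface the conflict",
}


def freshness_state(last_verified, sla, now, hard_expiry_factor=3, has_conflict=False):
    """Derive the state from the document's age relative to its SLA.

    Assumption: a document is stale once it is older than its SLA and
    expired once it is older than hard_expiry_factor times the SLA.
    """
    if has_conflict:
        return FreshnessState.CONFLICTED
    age = now - last_verified
    if age <= sla:
        return FreshnessState.FRESH
    if age <= sla * hard_expiry_factor:
        return FreshnessState.STALE
    return FreshnessState.EXPIRED


# Usage: a rule with a one-hour SLA, last verified two hours ago.
now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
state = freshness_state(now - timedelta(hours=2), timedelta(hours=1), now)
print(state.value, "->", AGENT_ACTION[state])
```

Passing `now` explicitly keeps the function deterministic and easy to test. A production monitor would also mark a document stale when its source model changes, not only when its age exceeds the SLA.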
Different context types take different SLAs. Inventory and real-time pricing rules need freshness within minutes — a stale inventory rule causes agents to promise availability that doesn't exist. Financial reporting context tolerates daily freshness. Support routing rules sit in between and need near-real-time updates to avoid misrouting live cases.[3] The SLA belongs in the schema itself, not in a separate monitoring config, because agents inspect it at query time to decide whether to use the document or escalate uncertainty.
The production-grade pattern is CDC-driven freshness, not polling. When dim_customers updates, the freshness monitor checks whether any context document references that model and transitions staleness state immediately.[3] Polling hourly means you can be 59 minutes behind a breaking rule change. CDC-driven updates catch it in minutes. The cost difference shows up in a customer escalation.
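The CDC-driven pattern can be sketched as an index from source models to the documents that reference them, with a handler that flips those documents to stale when a change event arrives. This is a minimal in-memory illustration: the `FreshnessMonitor` class, the event shape (`{"model": ...}`), and the document ids are hypothetical, and a real deployment would consume events from whatever CDC tooling you already run.

```python
from collections import defaultdict


class FreshnessMonitor:
    def __init__(self):
        self._docs_by_model = defaultdict(set)  # source model -> document ids
        self.state = {}                         # document id -> freshness state

    def register(self, doc_id, source_models):
        """Record which source models a context document depends on."""
        self.state[doc_id] = "fresh"
        for model in source_models:
            self._docs_by_model[model].add(doc_id)

    def on_change_event(self, event):
        """Handle one change event; return the ids of documents that became stale."""
        affected = self._docs_by_model.get(event["model"], set())
        newly_stale = sorted(d for d in affected if self.state[d] == "fresh")
        for doc_id in newly_stale:
            self.state[doc_id] = "stale"
        return newly_stale

    def reverify(self, doc_id):
        """Called after the owning team confirms the rule still holds."""
        self.state[doc_id] = "fresh"


monitor = FreshnessMonitor()
monitor.register("customer-tier-routing", ["dim_customers"])
monitor.register("inventory-reliability", ["fct_inventory"])
print(monitor.on_change_event({"model": "dim_customers"}))
```

The returned list is the natural hook for alerting the owning team. Documents that depend on other models are left untouched.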
A concrete integration pattern — not the YAML store itself, but the API surface agents talk to.
The context store is useless if agents read raw YAML files. They need a queryable API that enforces access control, returns freshness state, and logs every lookup. The schema lives in Git; the API is the runtime surface.
A minimal FastAPI wrapper is enough to start. The important design decision is: freshness state must be in every response. Agents that receive a stale document need to know it at the time of the response, not after a human reads the audit log. Here's a pattern you can adapt:
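The sketch below shows the shape of that API surface as a plain function using only the standard library, so it runs without installing anything. It is not FastAPI code, and wrapping `lookup` in a route is left out. The in-memory `DOCUMENTS` store, the role names, and the `flag_uncertainty` field are assumptions for illustration.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("context_store")

# Hypothetical in-memory store; in practice this is loaded from the Git-backed schema files.
DOCUMENTS = {
    "customer-tier-routing": {
        "version": "1.2.0",
        "owner": "data-platform",
        "content": "Enterprise tier requires ARR >= $50K.",
        "constraints": ["Never use orders.customer_tier for current-state routing."],
        "freshness_state": "stale",
        "allowed_roles": {"support-router"},
    }
}


def lookup(doc_id, agent_role, now=None):
    """Return (http_status, body) for one context lookup."""
    now = now or datetime.now(timezone.utc)
    doc = DOCUMENTS.get(doc_id)
    if doc is None:
        status, body = 404, {"error": "unknown document"}
    elif agent_role not in doc["allowed_roles"]:
        status, body = 403, {"error": "role not permitted"}
    else:
        status = 200
        # Every successful response carries the freshness state; ACL data is stripped.
        body = {k: v for k, v in doc.items() if k != "allowed_roles"}
        # Assumed policy: anything other than fresh asks the agent to flag uncertainty.
        body["flag_uncertainty"] = doc["freshness_state"] != "fresh"
    body["served_at"] = now.isoformat()
    # Log every lookup, including denied ones, for audit and coverage analysis.
    logger.info("lookup doc=%s role=%s status=%s", doc_id, agent_role, status)
    return status, body


status, body = lookup("customer-tier-routing", "support-router")
print(status, json.dumps(body, indent=2))
```

Returning a status code and a JSON-serializable body keeps the function easy to mount behind any web framework. Logging denied and unknown lookups alongside successful ones gives you the raw data for spotting rules agents ask for that nobody has published yet.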
Reactive answering doesn't scale to agent traffic. Schema authoring does. The work is the same; the delivery changes.
The traditional data engineering model is reactive. Analysts ask questions. Engineers answer them. The throughput is bounded by how fast the engineer can context-switch and type a Slack reply. When agents enter the loop, that ceiling collapses. Agents generate questions continuously, with no natural throttle.
The ownership inversion changes the direction of the work. Data engineers become context publishers. They formalize the rules they currently hold in their heads into governed context documents — and agents query those documents directly. The engineer's output changes from verbal explanations to schema artifacts. The questions stop routing to humans because the answers are already published.
This is what a16z means by "your data agents need context":[5] data stops being a pipeline artifact and becomes a context product. One well-maintained context document serves a dozen agents consistently. The marginal cost of adding another agent drops near zero once the schema exists.
The friction is the quality of the initial formalization. Getting data engineers to write constraints[] fields — not just column descriptions — requires a different mental model: writing not just what the data is, but what can go wrong when an agent uses it incorrectly. That work is unfamiliar. Teams that rush it produce documents that are technically accurate and operationally incomplete. The implementations that work treat initial formalization as a sprint goal, not a backlog item, and pair engineers with agents during discovery to surface the questions agents actually ask before writing the schema.
The Atlan context engineering framework names the architectural answer to this drift: Phase 3 orchestration, which routes only the certified context product a query needs — blocking uncertified documents from reaching agents entirely.[4] Once a document is published, it's the only version agents see. The certification step is the gate.
Before any schema gets written, name the top 20 questions that agents in your system already ask about data. These are the questions that currently route to a data engineer. They become your first 20 context documents. Start with the rules that cause the most support tickets or silent agent failures in staging.
For each identified rule, ship the full schema: type, version, owner, content, constraints[], freshness, retrieval hints. The constraints field is mandatory — a document without negative knowledge will misfire on the first edge case. Version at 1.0.0 for initial publication. Bump minor on rule updates, major on breaking changes.
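The versioning rule above is small enough to encode directly. A minimal sketch, assuming three-part version strings and two hypothetical change labels (`"rule_update"` and `"breaking"`):

```python
def bump_version(version: str, change: str) -> str:
    """Bump minor on rule updates, major on breaking changes."""
    major, minor, _patch = (int(part) for part in version.split("."))
    if change == "breaking":
        return f"{major + 1}.0.0"
    if change == "rule_update":
        return f"{major}.{minor + 1}.0"
    raise ValueError(f"unknown change type: {change!r}")


print(bump_version("1.0.0", "rule_update"))
print(bump_version("1.4.2", "breaking"))
```

Raising on an unknown change type forces the author to decide whether a change is breaking, which is the judgment the version number is supposed to record.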
When the underlying source changes, affected documents transition to stale automatically. Polling does not catch this fast enough — a 1-hour poll interval means up to 59 minutes of stale context before agents notice. CDC catches it in minutes. Configure pre-expiry alerts to the owning team.
Agents query the context store via an API, not by reading YAML files directly. The API enforces access control, returns freshness state in every response, and logs every lookup for audit and coverage analysis. A REST endpoint returning JSON with staleness metadata is enough to start.
The maintenance bottleneck re-emerges only if authoring is ad-hoc. When a new dbt model ships, the owning team publishes the corresponding context document before agents can query the model. That makes context documents a first-class deliverable and distributes authoring to the teams that already own the underlying data.
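That gate can be enforced mechanically, for example as a CI check that fails when a model has no context document referencing it. A minimal sketch, assuming each document lists its upstream models under a hypothetical `source_models` key:

```python
def models_missing_context(models, documents):
    """Return the models that no context document references, sorted by name."""
    covered = {model for doc in documents for model in doc["source_models"]}
    return sorted(set(models) - covered)


# Hypothetical inputs: model names from the dbt project, documents from the store.
models = ["dim_customers", "fct_orders", "fct_inventory"]
documents = [
    {"id": "customer-tier-routing", "source_models": ["dim_customers"]},
    {"id": "inventory-reliability", "source_models": ["fct_inventory"]},
]

missing = models_missing_context(models, documents)
print(missing)
```

A CI job could exit non-zero whenever the returned list is non-empty, which keeps the check in the same pipeline that ships the model.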
Pre-production controls that separate a governed context store from a YAML graveyard.
Name the failure before you build the defense. Each mode has a different root cause and a different fix.
Teams building context stores often treat them as a general good-practice measure — which makes them hard to prioritize and harder to staff. The argument gets sharper when you name the specific failure modes they prevent.
Silent semantic drift. A business rule changes. The column doesn't. The agent keeps applying the old rule with full confidence. No exception is raised. The only observable signal is a downstream customer complaint or a skewed metric someone catches in a quarterly review. A freshness-monitored context document makes the drift visible: the source model updates, the document transitions to stale, the owning team gets an alert, and agents flag uncertainty on every decision until the document is re-verified.
Column aliasing. Two tables have columns named the same thing that mean different things in different contexts — customer_tier in orders is the tier at time of purchase; customer_tier in customers is the current tier. An agent with no context store queries either column depending on which it retrieves first. A context document with a constraints[] entry saying which column to use for which decision class eliminates the ambiguity entirely. This is the class of error the NL2SQL-BUGs benchmark was designed to measure — and it accounts for a material fraction of production failures on real enterprise schemas.[10]
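A constraints entry of this kind can be checked before a query runs. The sketch below is one possible guard, not a prescribed mechanism: the `FORBIDDEN_COLUMNS` mapping, the decision class name, and the fully qualified column names are hypothetical, modeled on the two `customer_tier` columns described above.

```python
# Hypothetical constraint table: decision class -> {forbidden column: column to use instead}.
FORBIDDEN_COLUMNS = {
    "current_state_routing": {
        "orders.customer_tier": "customers.customer_tier",
    }
}


def check_columns(decision_class, columns):
    """Return one violation message per forbidden column in the planned query."""
    rules = FORBIDDEN_COLUMNS.get(decision_class, {})
    return [
        f"{column} is not valid for {decision_class}; use {rules[column]}"
        for column in columns
        if column in rules
    ]


violations = check_columns(
    "current_state_routing",
    ["orders.customer_tier", "orders.order_id"],
)
print(violations)
```

An empty list means the planned columns pass. A non-empty list gives the agent a concrete correction instead of a silent misclassification.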
Threshold ossification. A business threshold (enterprise ARR cutoff, inventory reliability floor, latency SLA) was defined 18 months ago and encoded in 14 different places. Eleven of them got updated when the threshold changed. Three didn't. One of the three is a context store document that an agent queries. A versioned, owned document with a known source makes the update surface explicit. The owning team updates one artifact; every agent picks up the new version at next query.
How is a context store actually different from a RAG knowledge base?
RAG retrieves text chunks by semantic similarity. A context store serves governed business rules with explicit ownership, version history, and freshness SLAs. The operational difference is what fails silently. RAG documents go stale and retrieval quality drifts without anyone noticing. Context store documents have an owner accountable for currency and a freshness state agents observe at query time. Use RAG for unstructured retrieval and the context store for governed rule delivery for operational decisions. They belong in the same stack.
What should the first 20 context documents cover?
Start with the rules behind the most agent failures and the questions that route most often to data engineers. The highest-leverage documents cover: business classifications (what 'enterprise' means in each system), column semantic corrections (use this column for X, not that one), threshold definitions (the exact numbers and conditions that define a category), and workflow routing rules (which values map to which queues). Mine incident postmortems and Slack threads for recurring patterns. Any question that has appeared more than twice is a top candidate.
How do you handle conflicting context documents?
The conflicted state exists for exactly this case — two documents making contradictory claims about the same data. When the freshness monitor detects a conflict, both documents transition to conflicted and agents surface uncertainty rather than commit to a decision. A human from the owning team resolves it, updates one or both documents, and re-publishes. In practice, conflicts surface most often during business rule changes when the old rule was never explicitly retired. The fix is a retirement protocol: when a new rule supersedes an old one, set the old document's status to archived before the new one ships.
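Both halves of that answer, conflict detection and the retirement protocol, fit in a short sketch. The store layout, the `subject` key used to group documents about the same data, and the status names other than archived and conflicted are assumptions for illustration.

```python
def publish(store, new_doc, supersedes=None):
    """Publish a document; archive the superseded one first if given."""
    if supersedes is not None:
        store[supersedes]["status"] = "archived"
    store[new_doc["id"]] = {**new_doc, "status": "active"}


def detect_conflicts(store):
    """Mark active documents that disagree about the same subject as conflicted."""
    by_subject = {}
    for doc_id, doc in store.items():
        if doc["status"] == "active":
            by_subject.setdefault(doc["subject"], []).append(doc_id)
    conflicted = []
    for ids in by_subject.values():
        if len({store[i]["content"] for i in ids}) > 1:
            for i in ids:
                store[i]["status"] = "conflicted"
            conflicted.extend(ids)
    return sorted(conflicted)


old_rule = {"id": "tier-v1", "subject": "enterprise-tier", "content": "ARR >= $25K"}
new_rule = {"id": "tier-v2", "subject": "enterprise-tier", "content": "ARR >= $50K"}

# With the retirement protocol: the old rule is archived, so nothing conflicts.
store = {}
publish(store, old_rule)
publish(store, new_rule, supersedes="tier-v1")
print(detect_conflicts(store))
```

Publishing the new rule without `supersedes` would leave both documents active and both would be marked conflicted on the next check, which is the failure the retirement protocol prevents. Comparing `content` strings is a deliberately crude stand-in for real conflict detection.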
Do we need dedicated infrastructure, or can we start simpler?
A Git repository of YAML files plus a REST API is enough for the first iteration. Schema design, versioning discipline, and ownership model matter more than tooling. Atlan, Snowflake Cortex, and similar platforms provide native context layers for teams already on those tools. Context Kubernetes is the reference architecture for declarative orchestration at scale. Don't invest in purpose-built infrastructure until at least 20 governed documents are running against real agent queries — early iteration usually surfaces schema changes that are easier to make before the tooling hardens around them.
How do we stop context store maintenance from becoming the next bottleneck?
The work shifts from reactive answering to proactive schema authoring — a one-time cost per rule that pays out per agent query. The bottleneck re-emerges only if authoring is ad-hoc. The fix: make context document creation part of the definition of done for any significant data model change. When a new dbt model ships, the owning team publishes the corresponding context document before agents can query the model. That makes documents a first-class deliverable and distributes authoring to the teams that already own the underlying data — rather than centralizing it on one overburdened platform team.
Which context document types need the shortest freshness SLAs?
Inventory counts and real-time pricing rules need freshness measured in minutes, because a stale inventory rule causes agents to promise availability that doesn't exist. Workflow routing rules (support tier, escalation path) need hourly SLAs at most to avoid live misroutes. Business classification thresholds (ARR cutoffs, churn risk bands) typically tolerate daily freshness. Financial reporting definitions can tolerate daily or longer. The rule of thumb: set the freshness SLA to the longest lag you can tolerate before a wrong answer causes observable customer harm.
Context stores are not new. Semantic layers and data catalogs have existed for years. What has changed is the ownership model and the cost of getting it wrong. When agents acted on queries one at a time with a human in the loop, a stale business rule caused one bad answer. When agents act continuously at scale, the same stale rule causes a systematic failure that compounds until someone reads the escalation log.
The teams that get this right are the ones where data engineers stop seeing themselves as answerers and start seeing themselves as publishers. The work is the same: encoding the rules they already know. The delivery is the difference. A Slack reply serves one analyst once. A schema artifact serves every agent that ever queries it — at near-zero marginal cost per query.
Start with the failures. Pick five agent decisions that went wrong in the last sprint and trace each one back to the missing or stale rule. Write those five as governed context documents with non-empty constraints[]. Set a freshness SLA. Ship them behind a simple REST API. Measure whether the same failures recur. That's one week of work for a clear signal about whether the pattern scales. The agents that stay shipped are the ones running on top of context the data team owns instead of remembers.