Every enterprise deployment hits the same wall. The model is capable. The prompts are tight. The framework is wired. Then someone asks a question that requires knowing what was in a Confluence page from eight months ago, cross-referenced with a Slack thread from last Tuesday, filtered by what the user is actually authorized to see.
The agent hallucinates. Or returns something generic. Or — the failure mode that costs the most — returns a confidently wrong answer pulled from a stale document that was superseded three revisions ago.
This is the context engineering problem. It is now the primary bottleneck in enterprise AI. Gartner's framing — "context engineering is in, prompt engineering is out" — lands once you have competent models, because the quality of what you feed them outweighs how you ask[2].
Context engineering is the discipline of structuring everything an LLM needs — prompts, memory, retrieved documents, tool outputs, conversation history — so it can make reliable decisions. For enterprise systems that means a memory layer: persistent infrastructure that gives agents access to institutional knowledge across sessions, users, and time[1].
The non-obvious leverage point: upgrading the model is the least efficient investment once an agent is working. The same improvement in answer quality costs roughly 10x less when it comes from better retrieval than from a more capable model. Spend the engineering hours on the memory layer first.
Tutorial RAG Doesn't Survive Contact With Production
The gap between a single-PDF demo and a memory layer that holds across 40,000 documents is the entire job.
The standard RAG tutorial: chunk the documents, embed them, store them in a vector database, retrieve top-k, stuff the prompt. It works on a single PDF.
Then point it at 40,000 Confluence pages, 200,000 Slack messages, a Salesforce instance, three SharePoint sites, and a Notion workspace. Different problem class entirely.
The failure pattern is consistent: roughly 80% of RAG failures trace back to the ingestion and chunking layer, not the LLM[5]. Most teams discover this after weeks of prompt tuning and model swaps, only to find the retrieved context was wrong before the model ever saw it.
Enterprise RAG fails in three predictable ways.
- Retrieval noise: the vector search returns documents that are semantically similar but factually irrelevant. "Q3 budget planning" matches "Q1 budget planning" with high cosine similarity. The Q1 doc is useless.
- Stale context: documents indexed once, never updated. The agent cites a policy revised two months ago.
- Permission leakage: the retrieval layer has no concept of who is asking. An intern's query returns the same executive compensation data as the CFO's.
Fixing these requires treating context as an engineering discipline, not a hook bolted onto an agent framework.
Chunking Is Where Most Teams Burn the Most Time on the Wrong Approach
The instinct says semantic. The benchmarks disagree. Here's what the data actually shows.
The instinct is to reach for semantic chunking — splitting along "meaning boundaries" instead of fixed token counts. It sounds obviously better. The benchmarks tell a different story.
A February 2026 benchmark by PremAI across 50 academic papers put recursive 512-token splitting at roughly 69% accuracy, with semantic chunking at 54% — fifteen points behind[4]. The mechanism: semantic chunking produced fragments averaging 43 tokens. Too small. Too fragmented. The chunks lost the surrounding context that made them useful. Caveat — the benchmark used academic texts. Enterprise corpora behave differently.
We deployed semantic chunking across an enterprise corpus of 30,000 documents because it sounded correct. After six weeks we ran a retrieval recall comparison against recursive splitting. Recall was 12% lower with semantic. The corpus was mostly internal policy prose, which benefited from the larger, more stable window of recursive splitting. We migrated everything back. A painful week of re-embedding. Run the comparison before committing.
The practical default is simpler than most teams expect: recursive character splitting at 256-512 tokens with 10-20% overlap. Roughly 50-100 overlap tokens for 512-token chunks. As of early 2026 this default outperforms more elaborate strategies on most enterprise corpora. Benchmark against your own data before locking it in.
Common mistakes:
- Defaulting to semantic chunking without benchmarking against recursive splitting
- Fixed 1024-token chunks regardless of document type
- Zero overlap between chunks, so context is lost at every boundary
- Treating prose, code, tables, and logs as the same input
- Ignoring document structure (headers, sections, list hierarchies)
What works instead:
- Start at recursive 512-token splitting; benchmark alternatives against a labeled set
- Match chunk size to the embedding model's actual sweet spot (typically 256-512 tokens)
- 10-20% overlap to preserve boundary context
- Structure-aware splitting for Markdown, HTML, and code, because header hierarchy carries meaning
- 50-100 query-answer pairs as a labeled test set before any strategy ships
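The recursive default described above can be sketched in a few functions. This is a minimal illustration, not a production splitter: `countTokens` approximates tokens with whitespace-delimited words (an assumption; a real pipeline would use the embedding model's tokenizer), and the names `splitRecursive` and `chunkText` are hypothetical.

```typescript
// Assumption: whitespace-delimited words stand in for model tokens.
const countTokens = (text: string): number =>
  text.split(/\s+/).filter(Boolean).length;

// Coarse-to-fine separators: paragraph, line, sentence, word.
const SEPARATORS: string[] = ["\n\n", "\n", ". ", " "];

// Split on `sep` but keep the separator attached to the preceding piece.
function splitKeeping(text: string, sep: string): string[] {
  const parts = text.split(sep);
  return parts
    .map((part, i) => (i < parts.length - 1 ? part + sep : part))
    .filter((part) => part.trim().length > 0);
}

// Recurse to a finer separator only for pieces that are still too large.
function splitRecursive(text: string, maxTokens: number, level = 0): string[] {
  const sep = SEPARATORS[level];
  if (sep === undefined || countTokens(text) <= maxTokens) return [text];
  const out: string[] = [];
  for (const piece of splitKeeping(text, sep)) {
    out.push(...splitRecursive(piece, maxTokens, level + 1));
  }
  return out;
}

// Pack pieces into chunks of at most `chunkSize` tokens, carrying the last
// `overlap` tokens of each closed chunk into the next one.
function chunkText(text: string, chunkSize = 512, overlap = 64): string[] {
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be in [0, chunkSize)");
  }
  // Pieces are capped at chunkSize - overlap so a carried-over tail never overflows.
  const pieces = splitRecursive(text, chunkSize - overlap);
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    if (current !== "" && countTokens(current) + countTokens(piece) > chunkSize) {
      chunks.push(current.trim());
      const words = current.split(/\s+/).filter(Boolean);
      const tail = overlap > 0 ? words.slice(-overlap).join(" ") : "";
      current = tail === "" ? "" : tail + " ";
    }
    current += piece;
  }
  if (current.trim() !== "") chunks.push(current.trim());
  return chunks;
}
```

Running the same labeled query set against `chunkText(doc, 256, 32)` and `chunkText(doc, 512, 64)` is one cheap way to check which end of the 256-512 range suits a given corpus.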
Late chunking is the most promising development here. Instead of splitting first and embedding each chunk independently, you feed the whole document into a long-context embedding model, then split the resulting token-level embeddings at chunk boundaries and pool each span into one vector. Each chunk retains awareness of the full document. Pronouns resolve. Headers carry through. Cross-references stay intact.
Jina's embeddings-v4 and Voyage AI's 32K-token context window both support the pattern. The tradeoff is compute: you embed the full document instead of individual chunks. For corpora where accuracy outweighs marginal embedding cost, late chunking is worth evaluating.
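The pooling step of late chunking can be sketched as follows. The sketch assumes you already have one embedding per token for the whole document; how to obtain those depends on the model and is not shown. `TokenSpan` and `poolSpans` are hypothetical names, and mean pooling is one common choice rather than the only one.

```typescript
// Token index range for one chunk; `end` is exclusive.
interface TokenSpan {
  start: number;
  end: number;
}

// Mean-pool the document-level token embeddings inside each chunk span.
// Because every token vector was computed with the full document in view,
// each pooled chunk vector carries document-wide context.
function poolSpans(tokenEmbeddings: number[][], spans: TokenSpan[]): number[][] {
  return spans.map(({ start, end }) => {
    const rows = tokenEmbeddings.slice(start, end);
    if (rows.length === 0) throw new Error(`empty span ${start}-${end}`);
    const dim = rows[0].length;
    const mean = new Array<number>(dim).fill(0);
    for (const row of rows) {
      for (let d = 0; d < dim; d++) mean[d] += row[d] / rows.length;
    }
    return mean;
  });
}
```

The chunk boundaries themselves can come from any splitter; only the order of operations changes, with embedding happening before the split instead of after it.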
Embedding Choice Is a One-Way Door — Pick Carefully
Mixing models in one index is not an option. The cost of getting it wrong is re-embedding the entire corpus.
The embedding market shifted hard in 2025-2026. The old default — OpenAI's text-embedding-ada-002 — is two generations behind. The current landscape has three tiers, each with sharp tradeoffs.
| Model | Context Window | MTEB Retrieval Score | Price per 1M tokens | Best For |
|---|---|---|---|---|
| Voyage AI voyage-3-large | 32K tokens | Highest (MTEB leader) | $0.06-$0.18 | Maximum retrieval quality, long documents |
| OpenAI text-embedding-3-large | 8K tokens | Strong baseline | $0.13 | Broad ecosystem integration, balanced cost |
| Jina embeddings-v4 | 8K (dense) / 8K (ColBERT) | Competitive with Voyage | Varies | Multi-modal retrieval, late interaction |
| Google Gemini Embedding | Up to 3K tokens | Cross-lingual leader | Free | Multilingual corpora, cost-sensitive workloads |
| Open-source (BGE, E5) | 512-8K tokens | Varies widely | Self-hosted cost | Air-gapped environments, full data control |
Voyage AI's voyage-3-large beats OpenAI's text-embedding-3-large by 9.74% and Cohere's embed-v3 by 20.71% on MTEB retrieval as of March 2026[8]. Its 32K-token context window means longer documents embed without chunking, or with late chunking to preserve context across boundaries. Benchmark rankings shift as new models ship, so verify against the current MTEB leaderboard before locking in a choice.
Raw scores are not the whole story. The constraints decide:
- Already in the OpenAI ecosystem? The text-embedding-3 family gives you good-enough quality with the simplest integration. The `small` variant at $0.02/M is hard to beat for cost-sensitive workloads.
- Multi-modal retrieval? Jina's v4 handles text and images through one pathway, with both dense and ColBERT-style multi-vector embeddings.
- Multilingual corpus? Google's Gemini embedding leads on cross-lingual benchmarks and costs nothing through the API.
- Regulatory constraints? Self-hosted BGE-M3 or E5-Mistral give you full data sovereignty and an operational tax to match.
Vectors vs Graphs vs Hybrid: Which Fight Are You Actually Having?
Three retrieval architectures, three different failure classes. The right answer for most enterprises is the hybrid.
The retrieval backend is one of the most consequential architectural choices in the stack. Three viable patterns. The right answer for most enterprises is the combination.
Vector stores win on semantic similarity over unstructured text. "How do we handle customer refunds?" surfaces documents on refund policies, return procedures, and chargeback handling — even when none of them use the word "refund." This is the breadth play. Vector search covers messy, unstructured knowledge bases where relationships between documents are implicit.
Knowledge graphs win on structured relationships and multi-hop reasoning. "Which teams depend on the payments service and what are their SLAs?" resolves through explicit traversal: payments-service → consumed-by → [checkout, subscriptions, invoicing] → SLA nodes. This is the depth play. Graphs win when the question requires traversing relationships, enforcing permissions, or reasoning across connected entities[7].
The hybrid uses vectors for breadth and graphs for depth. A query router classifies the incoming question and fans out to both systems when both are needed. Results merge, deduplicate, and pass through a cross-encoder re-ranker before context assembly. Schema App's published evaluation reports roughly 15-30% improvements in faithfulness and answer relevancy with hybrid retrieval[6]. Actual gains depend heavily on corpus structure, re-ranking configuration, and caching. Calibrate against your own data.
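The merge-and-deduplicate step has to combine two ranked lists whose raw scores are not comparable (cosine similarity versus graph path relevance). One common way to do that is reciprocal rank fusion, sketched below. The source does not say which merge the author uses, so treat this as one option under that assumption; `Ranked` and `reciprocalRankFusion` are hypothetical names, and `k = 60` is the conventional default for the technique.

```typescript
interface Ranked {
  id: string;
  content: string;
}

// Reciprocal rank fusion: each list contributes 1 / (k + rank) per item,
// so items found by both backends rise and duplicates collapse by id.
function reciprocalRankFusion(lists: Ranked[][], k = 60): Ranked[] {
  const scores = new Map<string, { item: Ranked; score: number }>();
  for (const list of lists) {
    list.forEach((item, index) => {
      const entry = scores.get(item.id) ?? { item, score: 0 };
      entry.score += 1 / (k + index + 1); // ranks are 1-based
      scores.set(item.id, entry);
    });
  }
  return Array.from(scores.values())
    .sort((a, b) => b.score - a.score)
    .map((entry) => entry.item);
}
```

The fused list is still only a candidate set. The cross-encoder re-ranker runs on its output before context assembly.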
Vector-only is the right call when
- ✓ The corpus is primarily unstructured text: docs, emails, chat logs, support tickets
- ✓ Queries are open-ended and exploratory; the user does not know exactly what they want
- ✓ You need to ship and iterate fast; vector stores carry less operational tax
- ✓ Document relationships are not well-defined or change frequently
A knowledge graph earns its keep when
- ✓ Questions require multi-hop reasoning across entities (team → service → SLA → incident)
- ✓ Access control is the constraint; graphs model permissions as first-class relationships
- ✓ The domain has stable ontologies (org charts, service maps, compliance frameworks)
- ✓ You need explainable retrieval; graph paths produce audit trails that vector scores cannot
Bigger Context Windows Don't Solve the Allocation Problem
The context window is a budget, not a bucket. Attention degrades long before the limit does.
Models keep getting larger context windows. Claude supports 200K tokens. Gemini 2M. GPT-4o 128K. Why does context window management still matter?
Because attention degrades with length. Mid-2025 research showed retrieval quality drops measurably even for models with massive context windows when the prompt is stuffed with too much retrieved text. Shorter, more precise context consistently produces better answers than 50K tokens of "potentially relevant" documents. The threshold varies by model and task type, but the pattern is consistent.
The context window is a budget. You allocate a finite resource across competing demands.
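A static allocation can be expressed as a small function. This is a sketch under assumptions: the split rule (fixed reservations for the system prompt and output, then a share of the remainder for retrieval) and the 0.55 default are illustrative choices, and `allocateBudget` is a hypothetical name.

```typescript
interface ContextBudget {
  systemPrompt: number;        // tokens reserved for instructions
  retrievedContext: number;    // tokens for RAG results
  conversationHistory: number; // tokens for past messages
  outputReserve: number;       // tokens reserved for generation
}

// Reserve the fixed costs first, then divide what is left between
// retrieved context and conversation history.
function allocateBudget(
  windowTokens: number,
  systemPromptTokens: number,
  outputReserve: number,
  retrievalShare = 0.55 // assumed default, tune per application
): ContextBudget {
  const remaining = windowTokens - systemPromptTokens - outputReserve;
  if (remaining <= 0) {
    throw new Error("system prompt and output reserve exceed the window");
  }
  const retrievedContext = Math.floor(remaining * retrievalShare);
  return {
    systemPrompt: systemPromptTokens,
    retrievedContext,
    conversationHistory: remaining - retrievedContext,
    outputReserve,
  };
}
```

Note that `windowTokens` does not have to be the model's hard limit. Passing a deliberately smaller working window is one way to act on the point that attention degrades before the limit does.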
Static allocation is the easy part. The real fight is dynamic rebalancing as conversations extend. As history grows, the space available for retrieved context shrinks. Four strategies handle this.
Sliding window with summarization. Keep the last N turns verbatim. Summarize everything older. The summary consumes fewer tokens while preserving key context. Simplest approach. Works for most conversational agents.
Retrieval budget scaling. Reduce the number of retrieved chunks as conversation history grows. Top-8 on the first turn, top-3 by turn fifteen. Trades retrieval breadth for conversation continuity.
Hierarchical context. Maintain two tiers — a compact summary layer always included, a detail layer included only when the query requires depth. The summary costs 200-500 tokens and provides ambient context. The detail layer carries specifics on demand.
Context eviction with recency bias. Score each piece of context by relevance to the current query and recency of insertion. Evict the lowest scores first. Requires scoring infrastructure. Holds up under long-running sessions.
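Two of these strategies reduce to small scoring functions. The sketch below shows retrieval budget scaling as a linear decay from top-8 to top-3, and eviction as a weighted blend of relevance and an exponentially decaying recency term. The linear decay, the 0.3 recency weight, and the five-turn half-life are all assumptions for illustration, as are the names `topKForTurn` and `evictToBudget`.

```typescript
// Retrieval budget scaling: top-8 on the first turn, top-3 by turn fifteen.
function topKForTurn(turn: number, maxK = 8, minK = 3, floorTurn = 15): number {
  if (turn <= 1) return maxK;
  if (turn >= floorTurn) return minK;
  const progress = (turn - 1) / (floorTurn - 1);
  return Math.round(maxK - progress * (maxK - minK));
}

interface ContextItem {
  id: string;
  tokens: number;
  relevance: number;      // 0..1, relevance to the current query
  insertedAtTurn: number; // turn at which the item entered the context
}

// Eviction with recency bias: blend relevance with a decaying recency
// score, then drop the lowest-scoring items until the budget is met.
function evictToBudget(
  items: ContextItem[],
  currentTurn: number,
  budgetTokens: number,
  recencyWeight = 0.3, // assumed weighting
  halfLifeTurns = 5    // assumed decay rate
): ContextItem[] {
  const scored = items.map((item) => {
    const age = currentTurn - item.insertedAtTurn;
    const recency = Math.pow(0.5, age / halfLifeTurns);
    const score = (1 - recencyWeight) * item.relevance + recencyWeight * recency;
    return { item, score };
  });
  scored.sort((a, b) => b.score - a.score); // highest score first
  let total = scored.reduce((sum, s) => sum + s.item.tokens, 0);
  while (total > budgetTokens) {
    const evicted = scored.pop(); // lowest score sits at the end
    if (!evicted) break;
    total -= evicted.item.tokens;
  }
  return scored.map((s) => s.item);
}
```

The relevance scores are the expensive input here. In practice they would come from the same re-ranker used at retrieval time, which is the scoring infrastructure the eviction strategy depends on.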
Institutional Memory Lives in Silos. Your Job Is to Unify It.
From scattered knowledge stores to a single memory layer agents can query — with permissions intact.
Institutional memory is the accumulated knowledge an organization has about how it works. Not just documents — the decisions, the context around those decisions, the informal rules that never got written down, and the relationships between all of these.
Most enterprises have this fragmented across dozens of silos: Confluence, Jira, SharePoint, Slack, Google Drive, CRMs, ticketing systems, code repos, and the heads of long-tenured employees. Context engineering for enterprise AI means building a unified memory layer agents can query across all of these sources with appropriate access controls intact.
- [01] Inventory the knowledge sources
  Map every system that holds institutional knowledge. Categorize by freshness (real-time, daily, weekly, static), structure (structured, semi-structured, unstructured), and access model (public, role-based, sensitive). The inventory drives the ingestion pipeline. Without it you ingest blind and discover the gaps in production.
- [02] Design ingestion with freshness tiers
  Real-time indexing for everything is expensive and unnecessary. Separate sources into tiers: real-time (Slack, ticketing), daily (Confluence, SharePoint), weekly (archived documents, historical data). Each tier gets its own sync cadence and resource allocation. Mismatched tiers either waste compute or ship stale answers.
- [03] Embed with metadata, not just vectors
  Every chunk carries metadata beyond the embedding: source system, document title, last modified date, author, access permissions, version hash. The metadata is what enables filtered retrieval, freshness sorting, and access control at query time. A vector without metadata is a guess waiting to happen.
- [04] Build the access control layer
  The retrieval layer enforces permissions or it does not enforce anything. When a user queries, retrieved results respect what that user is authorized to see. That means stamping each chunk with its source document's permissions at ingestion time, then mapping the caller's identity from the auth provider to a permission filter that is applied before similarity search, not after it.
- [05] Route queries; re-rank results
  A single retrieval strategy will not handle every query type. Build a query router that classifies incoming questions and dispatches to the right backend: vector search for open-ended questions, graph traversal for entity lookups, keyword search for exact matches. The cross-encoder re-ranker sits between retrieval and context assembly and catches most of the false positives that embedding similarity misses.
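The router in the last step can start as a handful of rules before it grows into a trained classifier or an LLM call. The sketch below is a deliberately naive, rule-based version; the regular expressions and the `routeQuery` name are assumptions for illustration, not a recommended production heuristic.

```typescript
type Backend = "vector" | "graph" | "keyword";

// Classify a question and return the backends to query, in priority order.
function routeQuery(query: string): Backend[] {
  const q = query.trim();

  // Exact-match signals: a quoted phrase or a ticket-style id such as PAY-1234.
  if (/"[^"]+"/.test(q) || /\b[A-Z]{2,}-\d+\b/.test(q)) {
    return ["keyword"];
  }

  // Relationship signals: the question asks how entities connect.
  // Fan out to both systems, since graph hits often need supporting prose.
  if (/\b(depends? on|owned by|owns|reports to|consumed by|which teams?)\b/i.test(q)) {
    return ["graph", "vector"];
  }

  // Default: open-ended question, semantic search.
  return ["vector"];
}
```

Logging the router's decision alongside retrieval results makes misroutes visible early, which matters more than the sophistication of the first classifier.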
Five Memory Types. Each Has a Different Eviction Policy.
Mature agent memory is not document retrieval with extra steps. It is multiple stores with distinct write cadences.
Production-grade agent memory in 2026 is more than document retrieval. Mature systems implement multiple memory types, each serving a distinct purpose in the agent's reasoning.
| Memory Type | Scope | Storage | Example |
|---|---|---|---|
| Working memory | Current session | Context window (ephemeral) | The user's current question, recent tool outputs, active plan |
| Episodic memory | Cross-session | Vector store + timestamps | "Last week this user asked about the payments API migration" |
| Semantic memory | Organizational | Vector store + knowledge graph | Company policies, architecture docs, product specs |
| Procedural memory | Agent-specific | Prompt templates + tool configs | How to query Jira, how to format a code review, escalation rules |
| Shared memory | Multi-agent | Shared store with namespaces | Research agent stores findings; writing agent retrieves them later |
The structural insight is that these memory types have different write cadences and eviction policies. Working memory is ephemeral — it exists only for the current interaction. Episodic memory accumulates per-user and needs periodic summarization to prevent unbounded growth. Semantic memory reflects organizational knowledge and updates on the cadence of the source systems. Procedural memory changes only when agent behavior is updated.
Frameworks like Mem0 and LangGraph's checkpointing handle working and episodic memory adequately. Semantic memory is where most enterprise teams need custom infrastructure, because it ties directly into the ingestion pipeline and access control layer — the parts that cannot be outsourced.
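The periodic summarization that keeps episodic memory bounded can be sketched as a compaction pass. Assumptions: all episodes passed in belong to one user, the summarizer is injected by the caller (in practice an LLM call, shown here as a plain function), and `Episode` and `compactEpisodes` are hypothetical names rather than any framework's API.

```typescript
interface Episode {
  userId: string;
  timestamp: number; // epoch milliseconds
  text: string;
  isSummary?: boolean;
}

// When a user's episodes exceed `maxEntries`, fold the oldest ones into a
// single summary entry and keep the most recent episodes verbatim.
function compactEpisodes(
  episodes: Episode[],
  maxEntries: number,
  summarize: (texts: string[]) => string
): Episode[] {
  if (maxEntries < 1) throw new Error("maxEntries must be at least 1");
  const sorted = episodes.slice().sort((a, b) => a.timestamp - b.timestamp);
  if (sorted.length <= maxEntries) return sorted;

  // One slot is reserved for the summary itself.
  const overflow = sorted.length - maxEntries + 1;
  const old = sorted.slice(0, overflow);
  const recent = sorted.slice(overflow);
  const newestOld = old[old.length - 1];

  const summary: Episode = {
    userId: newestOld.userId,
    timestamp: newestOld.timestamp,
    text: summarize(old.map((e) => e.text)),
    isSummary: true,
  };
  return [summary].concat(recent);
}
```

Because an earlier summary is just another episode, repeated passes fold it into the next summary, so the store stays at or below `maxEntries` per user.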
What the Pipeline Looks Like in Code
Concrete patterns for retrieval, re-ranking, ingestion, and version eviction.
lib/context-engine.ts

```typescript
// Helpers such as classifyQuery, getUserPermissions, vectorStore,
// knowledgeGraph, and crossEncoderRerank are assumed to exist elsewhere.
interface RetrievalResult {
  content: string;
  metadata: {
    source: string;
    lastModified: string;
    permissions: string[];
    score: number;
  };
}

interface ContextBudget {
  systemPrompt: number;        // tokens reserved for instructions
  retrievedContext: number;    // tokens for RAG results
  conversationHistory: number; // tokens for past messages
  outputReserve: number;       // tokens reserved for generation
}

// Permissions filtered before similarity scoring, never after.
async function assembleContext(
  query: string,
  userId: string,
  history: Message[],
  budget: ContextBudget
): Promise<string> {
  // Route query to the right backend
  const queryType = await classifyQuery(query);

  // Resolve the caller's permissions once; every backend gets the same filter
  const permissions = await getUserPermissions(userId);

  // Retrieve with user permissions pre-filtered
  const vectorResults = await vectorStore.search(query, {
    topK: 10,
    filter: { permissions: { $in: permissions } },
  });

  // The graph path needs the same ACL filter, or entity lookups leak
  const graphResults = queryType === 'entity_lookup'
    ? await knowledgeGraph.traverse(query, { maxDepth: 3, permissions })
    : [];

  // Merge, deduplicate, re-rank
  const merged = deduplicateResults([...vectorResults, ...graphResults]);
  const reranked = await crossEncoderRerank(query, merged);

  // Trim to the retrieval budget
  const contextChunks = fitToBudget(reranked, budget.retrievedContext);

  // Summarize history when over budget
  const trimmedHistory = await trimHistory(
    history, budget.conversationHistory
  );

  return formatContext(contextChunks, trimmedHistory);
}
```

lib/ingestion-pipeline.ts

```typescript
interface ChunkingConfig {
  strategy: 'recursive' | 'semantic' | 'structure-aware';
  chunkSize: number;          // target tokens per chunk
  chunkOverlap: number;       // overlap tokens between chunks
  respectBoundaries: boolean; // never split mid-sentence
}

// Default that survives most enterprise corpora. Verify against yours.
const PRODUCTION_DEFAULTS: ChunkingConfig = {
  strategy: 'recursive',
  chunkSize: 512,
  chunkOverlap: 64, // ~12% overlap
  respectBoundaries: true,
};

async function ingestDocument(
  doc: SourceDocument,
  config: ChunkingConfig = PRODUCTION_DEFAULTS
): Promise<void> {
  // Skip unchanged documents before any chunking work; version hash decides
  const versionHash = hash(doc.content);
  const existing = await vectorStore.getBySourceId(doc.id);
  if (existing?.versionHash === versionHash) return;

  // Detect document structure first
  const docType = detectDocumentType(doc); // markdown, html, code, plain
  const effectiveConfig: ChunkingConfig = docType === 'markdown'
    ? { ...config, strategy: 'structure-aware' }
    : config;

  // Chunk with metadata preserved end-to-end
  const chunks = await chunk(doc.content, {
    ...effectiveConfig,
    metadata: {
      sourceId: doc.id,
      sourceSystem: doc.system,
      lastModified: doc.updatedAt,
      permissions: doc.acl,
      versionHash,
    },
  });

  // Embed before evicting, so a failed embedding call leaves the old chunks in place
  const embeddings = await embedBatch(chunks.map(c => c.content));

  // Evict stale chunks before writing new ones
  if (existing) await vectorStore.deleteBySourceId(doc.id);
  await vectorStore.upsertBatch(chunks, embeddings);
}
```

Anti-Patterns That Look Reasonable and Fail at Scale
Each one survives a demo. None of them survives a real corpus.
Anti-Patterns to Avoid
Stuffing the entire context window with retrieved text
More context is not always better. Attention degrades with length. Keep retrieved context to 50-60% of the available window and reserve space for conversation history and output. The model's failure mode at long context is silent — the wrong answer with no warning.
Skipping re-ranking after retrieval
Vector similarity is a rough proxy for relevance. A cross-encoder re-ranker examines query-document pairs jointly and catches false positives that embedding similarity misses. Highest-leverage single addition to most RAG pipelines.
Indexing once and forgetting
Enterprise knowledge changes constantly. Build change detection into the ingestion pipeline. Stale chunks that cite outdated information are worse than no retrieval — they produce confidently wrong answers, which is the failure mode that erodes trust fastest.
Ignoring chunk boundaries when documents reference each other
A chunk that says 'as described in the architecture doc' is useless without that referenced document. Use parent-child chunk relationships or include document-level summaries alongside fine-grained chunks.
Treating permissions as an afterthought
Retrofitting access control onto a flat vector store is painful. Design ACL-aware retrieval from day one. Pre-filter on permissions before similarity search, not after. The default state of an unaudited retrieval layer is permission leakage.
Verify These Before the Context Engine Goes Live
What a real deployment looks like — every item is a verifiable state, not a soft goal.
Enterprise Context Engine — Go-Live Checklist
- Chunking strategy benchmarked against 50+ query-answer pairs from the actual corpus, with recall numbers logged
- Embedding model selected and tested; migration plan documented for the next switch
- Ingestion pipeline running with change detection and stale chunk eviction, not full re-indexing
- Access control pre-filtering verified against adversarial test queries
- Cross-encoder re-ranker deployed between retrieval and context assembly
- Context window budget allocation defined and enforced programmatically, not in the prompt
- Retrieval logging active: every query logged with results, scores, and latency
- Freshness monitoring alerting on stale index segments before they reach users
- Fallback behavior defined for zero-retrieval and low-confidence scenarios
- Load tested at 10x expected query volume, with p95 latency inside the budget
If You Cannot Measure Context Quality, You Cannot Improve It
Three metrics. Each one isolates a different failure class. None of them is optional.
Context quality is measurable. The metrics are simpler than most teams expect. Three carry the load.
Retrieval recall — what percentage of the relevant chunks appear in the retrieved set? Measure against a labeled test set. Roughly 85%+ is a reasonable starting target before tuning anything else. The right threshold depends on the application's tolerance for missing context.
Context precision — what percentage of retrieved chunks are actually relevant to the query? Low precision wastes context window budget on noise. A re-ranker is the fastest fix.
Answer faithfulness — does the model's response stay grounded in the retrieved context? Catches both hallucinations and cases where the model ignores the retrieved context in favor of its parametric knowledge. RAGAS and TruLens automate this evaluation.
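The first two metrics are simple set arithmetic over chunk ids once a labeled test set exists. A minimal sketch, following the definitions given above (evaluation libraries may define rank-aware variants under the same names); the function names are hypothetical.

```typescript
// Share of the labeled relevant chunks that made it into the retrieved set.
function retrievalRecall(retrievedIds: string[], relevantIds: string[]): number {
  if (relevantIds.length === 0) return 1; // nothing to find, nothing missed
  const retrieved = new Set(retrievedIds);
  const hits = relevantIds.filter((id) => retrieved.has(id)).length;
  return hits / relevantIds.length;
}

// Share of the retrieved chunks that are actually relevant to the query.
function contextPrecision(retrievedIds: string[], relevantIds: string[]): number {
  if (retrievedIds.length === 0) return 0; // empty retrieval carries no signal
  const relevant = new Set(relevantIds);
  const hits = retrievedIds.filter((id) => relevant.has(id)).length;
  return hits / retrievedIds.length;
}
```

Averaging both over the 50-100 labeled query-answer pairs gives the baseline numbers that every chunking, embedding, or re-ranking change should be compared against.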
Managed vector database or self-host?
Start managed. Pinecone, Weaviate Cloud, and Qdrant Cloud handle scaling, backups, and index optimization. Self-hosting earns its keep when data residency is non-negotiable or when you need custom index configurations the managed offerings will not support. The operational tax of self-hosted vector databases is real — index compaction, backup strategy, version upgrades. Pay it deliberately, not by accident.
How do you handle documents that change frequently?
Build a change-detection pipeline using webhooks or polling with content hashing. When a document changes, evict its old chunks and re-embed the new version. A version hash stored alongside each chunk makes the comparison cheap. For extremely dynamic sources like Slack, use windowed indexing — index the last 90 days, expire older messages.
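Windowed indexing comes down to a cutoff comparison run on a schedule. A minimal sketch, assuming each indexed message carries its post time; `IndexedMessage` and `partitionByWindow` are hypothetical names, and deleting the expired ids from the vector store is left to the caller.

```typescript
interface IndexedMessage {
  id: string;
  postedAt: Date;
}

// Split indexed messages into those inside the retention window and
// those that should be expired from the index.
function partitionByWindow(
  messages: IndexedMessage[],
  now: Date,
  windowDays = 90
): { keep: IndexedMessage[]; expire: IndexedMessage[] } {
  const cutoff = now.getTime() - windowDays * 24 * 60 * 60 * 1000;
  const keep: IndexedMessage[] = [];
  const expire: IndexedMessage[] = [];
  for (const message of messages) {
    if (message.postedAt.getTime() >= cutoff) keep.push(message);
    else expire.push(message);
  }
  return { keep, expire };
}
```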
Is RAG still necessary with 1M+ token context windows?
Yes. Even with massive context windows, you still decide WHAT to put in them. A million tokens of irrelevant documents produces worse results than 10K tokens of precisely relevant ones. Long context windows change the math — you can include more per query — but retrieval, filtering, and ranking remain essential. RAG is evolving into context engineering, not disappearing. The economic constraint is also real: at current pricing, filling a 1M-token context costs $1-5 per query. Not viable for high-volume applications. Selective retrieval is the correct approach even when the technical limit is removed.
How many chunks should you retrieve per query?
Retrieve 8-12. Re-rank down to 3-5 for context assembly. The retrieval-to-context ratio is the lever: retrieve broadly to maximize recall, then re-rank aggressively to maximize precision. Watch the marginal relevance of each additional chunk. If chunks 6-12 are consistently irrelevant after re-ranking, reduce initial retrieval to save latency.
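The retrieve-broad, keep-few step is a sort and a cut once the re-ranker has scored each candidate. A sketch, assuming scores are already attached; `Reranked`, `selectForContext`, and the `minScore` floor are illustrative assumptions.

```typescript
interface Reranked {
  id: string;
  rerankScore: number; // higher means more relevant
}

// Keep the best `keep` candidates, dropping anything below `minScore`
// even if that leaves fewer than `keep` chunks.
function selectForContext(
  candidates: Reranked[],
  keep = 5,
  minScore = 0
): Reranked[] {
  return candidates
    .slice()
    .sort((a, b) => b.rerankScore - a.rerankScore)
    .filter((c) => c.rerankScore >= minScore)
    .slice(0, keep);
}
```

Tracking how often candidates ranked sixth or lower clear `minScore` is a direct way to see whether the initial retrieval depth can be reduced.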
Is GraphRAG production-ready?
GraphRAG is maturing fast. As of early 2026, it is production-viable for domains with well-defined entity relationships — org structures, service architectures, compliance frameworks. The constraint is graph construction: building the knowledge graph from unstructured text still requires real engineering effort. Start with a manually curated graph for the highest-value entities and expand incrementally. Premature graph construction is wasted compute.
- [1] The New Stack, "Memory For AI Agents: A New Paradigm Of Context Engineering" (thenewstack.io)
- [2] Anthropic, "Effective Context Engineering For AI Agents" (anthropic.com)
- [3] Weaviate, "Context Engineering" (weaviate.io)
- [4] PremAI, "RAG Chunking Strategies: The 2026 Benchmark Guide" (blog.premai.io)
- [5] Firecrawl, "Best Chunking Strategies For RAG" (firecrawl.dev)
- [6] Schema App, "Why Hybrid Graph + Vector RAG Is The Future Of Enterprise AI" (schemaapp.com)
- [7] Machine Learning Mastery, "Vector Databases vs Graph RAG For Agent Memory: When To Use Which" (machinelearningmastery.com)
- [8] Elephas, "Best Embedding Models" (elephas.app)
- [9] Towards Data Science, "Beyond RAG" (towardsdatascience.com)