Every enterprise AI deployment hits the same wall. The model is capable. The prompts are solid. The agent framework is wired up. And then someone asks a question that requires knowing what happened in a Confluence page from eight months ago, cross-referenced with a Slack thread from last Tuesday, filtered by what the current user is actually authorized to see.
The agent hallucinates. Or returns something generic. Or — the most insidious failure — returns a confidently wrong answer drawn from a stale document that was superseded three revisions ago.
This is the context engineering problem, and it is now the primary bottleneck for enterprise AI. The industry refrain has become "context engineering is in, and prompt engineering is out" — a shift that makes sense once you have competent models, because the quality of what you feed them matters more than how you ask[2].
Context engineering is the discipline of structuring everything an LLM needs — prompts, memory, retrieved documents, tool outputs, conversation history — so it can make reliable decisions. For enterprise systems, this means building a memory layer: persistent infrastructure that gives agents access to institutional knowledge across sessions, users, and time[1].
Why Naive RAG Fails at Enterprise Scale
The gap between tutorial RAG and production memory systems
The standard RAG tutorial goes like this: chunk your documents, embed them, store them in a vector database, retrieve the top-k results, stuff them into the prompt. It works beautifully on a single PDF.
Then you try it on 40,000 Confluence pages, 200,000 Slack messages, a Salesforce instance, three SharePoint sites, and a Notion workspace. Suddenly you are dealing with a different class of problem entirely.
Based on available research and practitioner reports, roughly 80% of RAG failures trace back to the ingestion and chunking layer, not the LLM[5]. Most teams discover this after spending weeks tuning prompts and swapping models, only to realize the retrieved context was wrong from the start.
Enterprise RAG fails in three predictable ways:

- Retrieval noise — the vector search returns documents that are semantically similar but factually irrelevant. "Q3 budget planning" matches "Q1 budget planning" with high cosine similarity, but the Q1 doc is useless for a Q3 question.
- Stale context — documents were indexed once and never updated. The agent cites a policy that was revised two months ago.
- Permission leakage — the retrieval layer has no concept of who is asking. An intern's query returns the same executive compensation data as the CFO's.
Fixing these requires treating context as an engineering discipline, not an afterthought you bolt onto an agent framework.
Chunking Decisions That Actually Matter
What the benchmarks say — and where your intuition is wrong
Chunking is where most teams waste the most time on the wrong approach. The instinct is to reach for semantic chunking — splitting documents along "meaning boundaries" rather than fixed token counts. It sounds obviously better. The benchmarks tell a more nuanced story.
A February 2026 benchmark by PremAI across 50 academic papers placed recursive 512-token splitting at approximately 69% accuracy, while semantic chunking landed at roughly 54% — fifteen points behind[4]. Why? Semantic chunking produced fragments averaging just 43 tokens. Too small. Too fragmented. The chunks lost the surrounding context that made them useful. Note that this benchmark used academic texts; your enterprise corpus may behave differently.
The practical starting point is simpler than you think: recursive character splitting at 256-512 tokens with 10-20% overlap. That is roughly 50-100 tokens of overlap for 512-token chunks. This is a commonly validated default as of early 2026, and it outperforms fancier approaches on most enterprise corpora — though always benchmark against your own data before committing.
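The default above is simple enough to sketch directly. This is a minimal illustration, not a production splitter: it counts characters rather than tokens for simplicity, and the names (`recursiveSplit`, `SplitConfig`) and separator list are hypothetical.

```typescript
// Minimal sketch of recursive splitting with overlap. Separators are
// tried coarsest-first; sizes are in characters here — production code
// would count tokens with the embedding model's tokenizer.

interface SplitConfig {
  chunkSize: number;    // max characters per chunk
  chunkOverlap: number; // characters carried over between chunks
}

const SEPARATORS = ["\n\n", "\n", ". ", " "];

function recursiveSplit(text: string, cfg: SplitConfig, depth = 0): string[] {
  if (text.length <= cfg.chunkSize) return text.trim() ? [text.trim()] : [];
  if (depth >= SEPARATORS.length) {
    // No separators left: hard-split with overlap.
    const chunks: string[] = [];
    const step = cfg.chunkSize - cfg.chunkOverlap;
    for (let i = 0; i < text.length; i += step) {
      chunks.push(text.slice(i, i + cfg.chunkSize));
    }
    return chunks;
  }
  // Split on the current separator, then greedily pack pieces up to chunkSize.
  const sep = SEPARATORS[depth];
  const pieces = text.split(sep);
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current ? current + sep + piece : piece;
    if (candidate.length > cfg.chunkSize && current) {
      chunks.push(...recursiveSplit(current, cfg, depth + 1));
      // Seed the next chunk with the tail of the previous one as overlap.
      current = current.slice(-cfg.chunkOverlap) + sep + piece;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(...recursiveSplit(current, cfg, depth + 1));
  return chunks;
}
```

Libraries like LangChain ship hardened versions of this splitter; the point is that the recursive strategy is simple enough to own, instrument, and benchmark yourself.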
Common chunking mistakes:

- Defaulting to semantic chunking without benchmarking against recursive splitting
- Using fixed 1024-token chunks regardless of document type
- Zero overlap between chunks — losing context at boundaries
- Treating all document types the same (code, prose, tables, logs)
- Ignoring document structure (headers, sections, list hierarchies)

What works instead:

- Start with recursive 512-token splitting, then benchmark alternatives
- Match chunk size to your embedding model's sweet spot (typically 256-512 tokens)
- Use 10-20% overlap to preserve boundary context
- Use structure-aware splitting for Markdown, HTML, and code — follow the header hierarchy
- Build a 50-100 pair query-answer test set before committing to any strategy
Late chunking is the most promising development in this space. Instead of splitting first and embedding each chunk independently, you run the entire document through a long-context embedding model to get token-level embeddings, then pool those token embeddings into one vector per chunk. Each chunk's vector retains awareness of the full document context — pronouns resolve correctly, headers carry through, and cross-references stay intact.
Jina's embeddings-v4 and Voyage AI's 32K-token context window both support this pattern. The tradeoff is compute cost: you embed the full document instead of individual chunks. For enterprise corpora where accuracy matters more than marginal embedding cost, late chunking is worth evaluating.
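The pooling step itself is the easy part. A rough sketch, assuming you already have one embedding per token from a long-context model (the `tokenVecs` input below; real provider APIs differ, and all names here are illustrative):

```typescript
// Late chunking sketch: the document is embedded once at token level,
// then each chunk's vector is the mean of the token vectors in its span.
// Because every token vector was computed with full-document attention,
// the pooled chunk vectors inherit document-wide context.

type Vec = number[];

function meanPool(tokenVecs: Vec[]): Vec {
  const dim = tokenVecs[0].length;
  const out = new Array(dim).fill(0);
  for (const v of tokenVecs) for (let d = 0; d < dim; d++) out[d] += v[d];
  return out.map(x => x / tokenVecs.length);
}

// spans are [start, end) token offsets produced by your chunker.
function lateChunkEmbeddings(
  tokenVecs: Vec[],
  spans: Array<[number, number]>
): Vec[] {
  return spans.map(([start, end]) => meanPool(tokenVecs.slice(start, end)));
}
```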
Embedding Strategy Selection
Choosing the right embedding model for your retrieval architecture
The embedding model market shifted significantly in 2025-2026. The old default — OpenAI's text-embedding-ada-002 — is now two generations behind. The current landscape has three tiers of options, each with distinct tradeoffs.
| Model | Context Window | MTEB Retrieval Score | Price per 1M tokens | Best For |
|---|---|---|---|---|
| Voyage AI voyage-3-large | 32K tokens | Highest (MTEB leader) | $0.06-$0.18 | Maximum retrieval quality, long documents |
| OpenAI text-embedding-3-large | 8K tokens | Strong baseline | $0.13 | Broad ecosystem integration, balanced cost |
| Jina embeddings-v4 | 8K (dense) / 8K (ColBERT) | Competitive with Voyage | Varies | Multi-modal retrieval, late interaction |
| Google Gemini Embedding | Up to 3K tokens | Cross-lingual leader | Free | Multilingual corpora, cost-sensitive workloads |
| Open-source (BGE, E5) | 512-8K tokens | Varies widely | Self-hosted cost | Air-gapped environments, full data control |
Voyage AI's voyage-3-large outperforms OpenAI's text-embedding-3-large by approximately 9.74% and Cohere's embed-v3 by roughly 20.71% on MTEB retrieval benchmarks as of March 2026[8]. Its 32K-token context window means you can embed longer documents without chunking — or use late chunking to preserve context across boundaries. Benchmark rankings change as new models are released, so verify against the current MTEB leaderboard before making production commitments.
But raw benchmark scores are not the whole story. The right choice depends on your constraints:
- Already in the OpenAI ecosystem? The text-embedding-3 family offers good-enough quality with the simplest integration path. The `small` variant at $0.02/M tokens is hard to beat for cost-sensitive workloads.
- Need multi-modal retrieval? Jina's v4 handles text and images through a unified pathway, supporting both dense and ColBERT-style multi-vector embeddings.
- Multilingual corpus? Google's Gemini embedding leads on cross-lingual benchmarks and costs nothing through the API.
- Regulatory constraints? Self-hosted open-source models like BGE-M3 or E5-Mistral give you full data sovereignty.
Vector Stores vs Knowledge Graphs vs Hybrid
Three retrieval architectures and when each one wins
The retrieval backend decision is one of the most consequential architectural choices in your context engineering stack. There are three viable patterns, and for most enterprises the right answer is a combination of the first two: vector search plus a knowledge graph.
Vector stores excel at semantic similarity over unstructured text. Ask "how do we handle customer refunds?" and the vector search finds documents about refund policies, return procedures, and chargeback handling — even if they never use the word "refund." This is the breadth play. Vector search covers messy, unstructured knowledge bases where the relationships between documents are implicit.
Knowledge graphs excel at structured relationships and multi-hop reasoning. Ask "which teams depend on the payments service and what are their SLAs?" and a graph traversal finds the answer through explicit entity relationships: payments-service → consumed-by → [checkout, subscriptions, invoicing] → each with SLA nodes. This is the depth play. Graphs shine when the question requires traversing relationships, enforcing permissions, or reasoning across connected entities[7].
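The traversal in that example is plain breadth-first search over labeled edges. A minimal sketch, using a hypothetical adjacency-list store (a production graph database would do this with a query language, not application code):

```typescript
// Multi-hop traversal over a tiny labeled graph, mirroring the
// payments-service example: collect every entity reachable within
// maxDepth hops of the start node.

type Edge = { label: string; to: string };
type Graph = Map<string, Edge[]>;

function traverse(graph: Graph, start: string, maxDepth: number): string[] {
  const seen = new Set<string>([start]);
  let frontier = [start];
  for (let depth = 0; depth < maxDepth; depth++) {
    const next: string[] = [];
    for (const node of frontier) {
      for (const { to } of graph.get(node) ?? []) {
        if (!seen.has(to)) {
          seen.add(to);
          next.push(to);
        }
      }
    }
    frontier = next;
  }
  seen.delete(start);
  return Array.from(seen); // reachable entities, excluding the start node
}
```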
The hybrid approach uses vectors for breadth and graphs for depth. A query router classifies the incoming question and fans out to both systems when needed. Results are merged, deduplicated, and re-ranked by a cross-encoder before context assembly. According to Schema App's published evaluation, teams have reported roughly 15-30% improvements in faithfulness and answer relevancy with hybrid retrieval[6] — though actual gains depend heavily on corpus structure, re-ranking configuration, and caching. Your mileage may vary.
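The merge-and-deduplicate step can be as simple as reciprocal rank fusion (RRF), a common rank-based way to combine lists from heterogeneous backends before the more expensive cross-encoder pass. A sketch, with hypothetical string ids:

```typescript
// Reciprocal rank fusion: a document's fused score is the sum of
// 1/(k + rank) over every result list it appears in, so items ranked
// by both backends rise to the top. k = 60 is the conventional constant.

function rrfMerge(lists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id);
}
```

RRF needs no score calibration between backends, which is exactly why it suits merging vector similarities with graph traversal results; the cross-encoder then does the fine-grained ordering.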
When to use vector-only retrieval

- ✓ Your corpus is primarily unstructured text (docs, emails, chat logs, support tickets)
- ✓ Queries are open-ended and exploratory — users do not know exactly what they are looking for
- ✓ You need to get to production fast and iterate — vector stores have simpler operational overhead
- ✓ The relationships between documents are not well-defined or change frequently
When to add a knowledge graph

- ✓ Questions require multi-hop reasoning across entities (team → service → SLA → incident)
- ✓ Access control is critical — graphs model permissions as first-class relationships
- ✓ Your domain has stable ontologies (org charts, service maps, compliance frameworks)
- ✓ You need explainable retrieval — graph paths provide audit trails that vectors cannot
Context Window Management Under Pressure
How to allocate tokens when everything is competing for space
Models keep getting larger context windows — Claude supports 200K tokens, Gemini 2M, GPT-4o 128K. So why does context window management still matter?
Because attention degrades with length. Research from mid-2025 showed that retrieval quality drops measurably even for models with massive context windows when you stuff them with too much retrieved text. Shorter, more precise context consistently produces better answers than dumping 50K tokens of "potentially relevant" documents — though the optimal threshold varies by model and task type.
The enterprise context window is a budget, not a bucket. You are allocating a finite resource across competing demands.
The real challenge is not the static allocation — it is the dynamic rebalancing as conversations extend. As conversation history grows, the space available for retrieved context shrinks. You have four strategies to manage this:
Sliding window with summarization. Keep the last N turns verbatim and summarize everything older. The summary consumes fewer tokens while preserving key context. This is the simplest approach and works well for most conversational agents.
Retrieval budget scaling. Reduce the number of retrieved chunks as conversation history grows. Start with top-8 retrieval on the first turn, scale down to top-3 by turn fifteen. This trades retrieval breadth for conversation continuity.
Hierarchical context. Maintain two retrieval tiers — a compact summary layer (always included) and a detail layer (included only when the query requires depth). The summary layer costs 200-500 tokens and provides ambient context. The detail layer provides specifics on demand.
Context eviction with recency bias. Score each piece of context by relevance to the current query and recency of insertion. Evict the lowest-scoring items first. This requires scoring infrastructure but handles long-running sessions gracefully.
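The first strategy is compact enough to sketch. The summarizer below is a stub standing in for an LLM call, and all names are illustrative:

```typescript
// Sliding window with summarization: keep the last `keepTurns` messages
// verbatim and collapse everything older into a single summary message.

interface Msg {
  role: "user" | "assistant" | "system";
  content: string;
}

// Placeholder summarizer — a real implementation would call the model.
function summarizeStub(msgs: Msg[]): string {
  return "Summary of earlier turns: " +
    msgs.map(m => m.content.slice(0, 40)).join(" | ");
}

function slidingWindowTrim(history: Msg[], keepTurns: number): Msg[] {
  if (history.length <= keepTurns) return history;
  const older = history.slice(0, -keepTurns);   // to be summarized
  const recent = history.slice(-keepTurns);     // kept verbatim
  return [{ role: "system", content: summarizeStub(older) }, ...recent];
}
```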
Building Institutional Memory That Agents Can Use
From scattered knowledge silos to a unified memory layer
Institutional memory is the accumulated knowledge an organization has about how it works — not just the documents, but the decisions, the context around those decisions, the informal rules that never got written down, and the relationships between all of these.
Most enterprises have this knowledge fragmented across dozens of silos: Confluence, Jira, SharePoint, Slack, Google Drive, CRMs, ticketing systems, code repositories, and the heads of long-tenured employees. Context engineering for enterprise AI means building a unified memory layer that agents can query across all of these sources with appropriate access controls.
1. Inventory your knowledge sources. Map every system that contains institutional knowledge. Categorize by freshness (real-time, daily, weekly, static), structure (structured, semi-structured, unstructured), and access model (public, role-based, sensitive). This inventory drives your ingestion pipeline design.
2. Design your ingestion pipeline with freshness tiers. Not everything needs real-time indexing. Separate your sources into freshness tiers: real-time (Slack, ticketing), daily (Confluence, SharePoint), and weekly (archived documents, historical data). Each tier gets its own sync cadence and resource allocation.
3. Embed with metadata. Every chunk needs metadata beyond the embedding vector: source system, document title, last modified date, author, access permissions, and a document version hash. This metadata enables filtered retrieval, freshness sorting, and access control at query time.
4. Build the access control layer. The retrieval layer must enforce permissions. When a user queries the system, the retrieved results must respect what that user is authorized to see. This means mapping identity from your auth provider to permissions on each chunk.
5. Implement query routing and re-ranking. A single retrieval strategy will not handle every query type. Build a query router that classifies incoming questions and dispatches to the appropriate retrieval backend — vector search for open-ended questions, graph traversal for entity lookups, keyword search for exact matches.
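A keyword heuristic gets that router surprisingly far before you need an LLM classifier. A toy sketch, in which the categories and regex patterns are purely illustrative:

```typescript
// Toy query router: cheap pattern checks standing in for what would
// usually be a small LLM classifier. Order matters — most specific first.

type QueryType = "entity_lookup" | "exact_match" | "open_ended";

function classifyQuery(query: string): QueryType {
  // Quoted phrases or ticket-style ids (e.g. PAY-123) suggest keyword search.
  if (/"[^"]+"|[A-Z]{2,}-\d+/.test(query)) return "exact_match";
  // Ownership/dependency language suggests graph traversal.
  if (/\b(who owns|depends? on|reports to|sla for)\b/i.test(query)) {
    return "entity_lookup";
  }
  // Everything else falls through to vector search.
  return "open_ended";
}
```

In production you would log misroutes and promote recurring patterns into the heuristic, or replace it wholesale with a classifier once you have labeled traffic.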
Memory Types Your Agents Need
Short-term, long-term, episodic, and procedural — each serves a different purpose
Production-grade agent memory in 2026 goes beyond simple document retrieval. Mature agent systems implement multiple memory types, each serving a distinct purpose in the agent's reasoning.
| Memory Type | Scope | Storage | Example |
|---|---|---|---|
| Working memory | Current session | Context window (ephemeral) | The user's current question, recent tool outputs, active plan |
| Episodic memory | Cross-session | Vector store + timestamps | "Last week this user asked about the payments API migration" |
| Semantic memory | Organizational | Vector store + knowledge graph | Company policies, architecture docs, product specs |
| Procedural memory | Agent-specific | Prompt templates + tool configs | How to query Jira, how to format a code review, escalation rules |
| Shared memory | Multi-agent | Shared store with namespaces | Research agent stores findings that writing agent later retrieves |
The critical insight is that these memory types have different write cadences and eviction policies. Working memory is ephemeral — it exists only for the current interaction. Episodic memory accumulates per-user and should be summarized periodically to prevent unbounded growth. Semantic memory reflects organizational knowledge and updates on the same cadence as your source systems. Procedural memory changes only when agent behavior is updated.
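One way to make those cadences concrete is a per-tier policy table the memory layer consults on every write. A hypothetical sketch, not any particular framework's API:

```typescript
// Per-memory-type policy table: each tier gets its own namespace scheme,
// time-based eviction window, and summarization trigger, mirroring the
// table above. Values are illustrative defaults, not recommendations.

type MemoryTier = "working" | "episodic" | "semantic" | "procedural" | "shared";

interface TierPolicy {
  namespace: (agentId: string, userId?: string) => string;
  ttlDays: number | null;        // null = no time-based eviction
  summarizeAfter: number | null; // record count before periodic summarization
}

const POLICIES: Record<MemoryTier, TierPolicy> = {
  working:    { namespace: a => `wm:${a}:session`,   ttlDays: 0,    summarizeAfter: null },
  episodic:   { namespace: (a, u) => `ep:${a}:${u}`, ttlDays: 90,   summarizeAfter: 200 },
  semantic:   { namespace: () => "sem:org",          ttlDays: null, summarizeAfter: null },
  procedural: { namespace: a => `proc:${a}`,         ttlDays: null, summarizeAfter: null },
  shared:     { namespace: a => `shared:${a}`,       ttlDays: 30,   summarizeAfter: null },
};
```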
Frameworks like Mem0 and LangGraph's checkpointing handle working and episodic memory reasonably well. Semantic memory is where most enterprise teams need custom infrastructure, because it ties directly into your ingestion pipeline and access control layer.
Implementation Patterns in Code
Concrete code for the retrieval and context assembly pipeline
lib/context-engine.ts

interface RetrievalResult {
content: string;
metadata: {
source: string;
lastModified: string;
permissions: string[];
score: number;
};
}
interface ContextBudget {
systemPrompt: number; // tokens reserved for instructions
retrievedContext: number; // tokens for RAG results
conversationHistory: number; // tokens for past messages
outputReserve: number; // tokens reserved for generation
}
async function assembleContext(
query: string,
userId: string,
history: Message[],
budget: ContextBudget
): Promise<string> {
// 1. Route query to appropriate retrieval backends
const queryType = await classifyQuery(query);
// 2. Retrieve with user permissions pre-filtered
const vectorResults = await vectorStore.search(query, {
topK: 10,
filter: { permissions: { $in: getUserPermissions(userId) } },
});
const graphResults = queryType === 'entity_lookup'
? await knowledgeGraph.traverse(query, { maxDepth: 3 })
: [];
// 3. Merge and re-rank
const merged = deduplicateResults([...vectorResults, ...graphResults]);
const reranked = await crossEncoderRerank(query, merged);
// 4. Trim to budget
const contextChunks = fitToBudget(reranked, budget.retrievedContext);
// 5. Summarize history if over budget
const trimmedHistory = await trimHistory(
history, budget.conversationHistory
);
return formatContext(contextChunks, trimmedHistory);
}

lib/ingestion-pipeline.ts

interface ChunkingConfig {
strategy: 'recursive' | 'semantic' | 'structure-aware';
chunkSize: number; // target tokens per chunk
chunkOverlap: number; // overlap tokens between chunks
respectBoundaries: boolean; // never split mid-sentence
}
const PRODUCTION_DEFAULTS: ChunkingConfig = {
strategy: 'recursive',
chunkSize: 512,
chunkOverlap: 64, // ~12% overlap
respectBoundaries: true,
};
async function ingestDocument(
doc: SourceDocument,
config: ChunkingConfig = PRODUCTION_DEFAULTS
) {
// 1. Detect document structure
const docType = detectDocumentType(doc); // markdown, html, code, plain
const effectiveConfig = docType === 'markdown'
? { ...config, strategy: 'structure-aware' as const }
: config;
// 2. Chunk with metadata preservation
const chunks = await chunk(doc.content, {
...effectiveConfig,
metadata: {
sourceId: doc.id,
sourceSystem: doc.system,
lastModified: doc.updatedAt,
permissions: doc.acl,
versionHash: hash(doc.content),
},
});
// 3. Check for existing version — skip if unchanged
const existing = await vectorStore.getBySourceId(doc.id);
if (existing?.versionHash === hash(doc.content)) return;
// 4. Evict stale chunks, embed and store new ones
if (existing) await vectorStore.deleteBySourceId(doc.id);
const embeddings = await embedBatch(chunks.map(c => c.content));
await vectorStore.upsertBatch(chunks, embeddings);
}

Context Engineering Anti-Patterns
Mistakes that look reasonable but fail at scale
Do not stuff the entire context window with retrieved text
More context is not always better. At long context lengths, retrieval quality degrades. Keep retrieved context to 50-60% of the available window and reserve space for conversation history and output.
Do not skip re-ranking after retrieval
Vector similarity is a rough proxy for relevance. A cross-encoder re-ranker examines query-document pairs jointly and catches false positives that embedding similarity misses. This is the single highest-ROI addition to most RAG pipelines.
Do not index once and forget
Enterprise knowledge changes constantly. Build change detection into your ingestion pipeline. Stale chunks that cite outdated information are worse than no retrieval at all — they produce confidently wrong answers.
Do not ignore chunk boundaries when documents reference each other
A chunk that says 'as described in the architecture doc' is useless without that referenced document. Use parent-child chunk relationships or include document-level summaries alongside fine-grained chunks.
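A minimal sketch of the parent-child pattern: search stays over fine-grained chunks, but context assembly prepends each parent document's summary once, before that parent's first matching chunk. Store shapes are hypothetical:

```typescript
// Parent-child expansion: each child chunk carries a parentId so that
// a retrieved fragment like "as described in the architecture doc"
// arrives alongside a document-level summary giving it meaning.

interface ChildChunk { id: string; parentId: string; content: string }
interface ParentDoc { id: string; summary: string }

function expandWithParents(
  hits: ChildChunk[],
  parents: Map<string, ParentDoc>
): string[] {
  const blocks: string[] = [];
  const seenParents = new Set<string>();
  for (const hit of hits) {
    // Include each parent summary once, before its first matching child.
    if (!seenParents.has(hit.parentId)) {
      seenParents.add(hit.parentId);
      const parent = parents.get(hit.parentId);
      if (parent) blocks.push(`[Doc summary] ${parent.summary}`);
    }
    blocks.push(hit.content);
  }
  return blocks;
}
```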
Do not treat permissions as an afterthought
Retrofitting access control onto a flat vector store is painful. Design ACL-aware retrieval from day one. Pre-filter on permissions before similarity search, not after.
Production Readiness Checklist
What to verify before your context engine handles real queries
Enterprise Context Engine — Go-Live Checklist
Chunking strategy benchmarked against 50+ query-answer pairs from your corpus
Embedding model selected and tested — migration plan documented for future model switches
Ingestion pipeline running with change detection and stale chunk eviction
Access control pre-filtering verified with adversarial test queries
Cross-encoder re-ranker deployed between retrieval and context assembly
Context window budget allocation defined and enforced programmatically
Retrieval logging active — every query logged with results, scores, and latency
Freshness monitoring alerting on stale index segments
Fallback behavior defined for zero-retrieval and low-confidence scenarios
Load tested at 10x expected query volume with acceptable p95 latency
Measuring Context Quality
You cannot improve what you do not measure
Context quality is measurable, and the metrics are more straightforward than most teams expect. The three that matter most:
Retrieval recall — what percentage of the relevant chunks appear in the retrieved set? Measure this against your labeled test set. A target of roughly 85%+ is a reasonable starting point before tuning anything else — though the right threshold depends on your application's tolerance for missing context.
Context precision — what percentage of retrieved chunks are actually relevant to the query? Low precision means you are wasting context window budget on noise. A re-ranker is the fastest fix.
Answer faithfulness — does the model's response stay grounded in the retrieved context? This catches both hallucinations and cases where the model ignores the retrieved context in favor of its parametric knowledge. Frameworks like RAGAS and TruLens automate this evaluation.
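The first two metrics need no framework, only a labeled test set. A sketch with hypothetical shapes (faithfulness requires an LLM judge, e.g. via RAGAS, and is not shown):

```typescript
// Retrieval recall and context precision, averaged over a labeled test
// set. Each case lists the chunk ids that SHOULD have been retrieved.

interface EvalCase {
  relevantIds: Set<string>; // ground-truth relevant chunk ids
  retrievedIds: string[];   // what the pipeline actually returned
}

function retrievalMetrics(cases: EvalCase[]) {
  let recallSum = 0;
  let precisionSum = 0;
  for (const c of cases) {
    const hits = c.retrievedIds.filter(id => c.relevantIds.has(id)).length;
    recallSum += c.relevantIds.size ? hits / c.relevantIds.size : 1;
    precisionSum += c.retrievedIds.length ? hits / c.retrievedIds.length : 0;
  }
  return {
    recall: recallSum / cases.length,       // share of relevant chunks found
    precision: precisionSum / cases.length, // share of retrieved chunks relevant
  };
}
```

Run this on every chunking or embedding change; a recall regression here is far cheaper to catch than a quality complaint downstream.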
Should I use a managed vector database or self-host?
For most enterprises, start managed. Pinecone, Weaviate Cloud, and Qdrant Cloud handle scaling, backups, and index optimization. Self-hosting makes sense when you have strict data residency requirements or need custom index configurations that managed offerings do not support. The operational overhead of self-hosted vector databases is real — plan for index compaction, backup strategies, and version upgrades.
How do I handle documents that change frequently?
Implement a change-detection pipeline using webhooks or polling with content hashing. When a document changes, evict its old chunks and re-embed the new version. Use a version hash stored alongside each chunk to make this comparison fast. For extremely dynamic sources like Slack, consider windowed indexing — only index the last 90 days and expire older messages.
Is RAG still necessary with 1M+ token context windows?
Yes. Even with massive context windows, you still need to decide WHAT to put in them. A million tokens of irrelevant documents produces worse results than 10K tokens of precisely relevant ones. Long context windows change the math — you can include more context per query — but retrieval, filtering, and ranking remain essential. RAG is evolving into context engineering, not disappearing.
How many chunks should I retrieve per query?
Start with 8-12 chunks retrieved, then re-rank down to 3-5 for context assembly. The retrieval-to-context ratio matters: retrieve broadly to maximize recall, then re-rank aggressively to maximize precision. Monitor the marginal relevance of each additional chunk — if chunks 6-12 are consistently irrelevant after re-ranking, reduce your initial retrieval to save latency.
What about GraphRAG — is it production-ready?
GraphRAG is maturing fast. As of early 2026, it is production-viable for domains with well-defined entity relationships — org structures, service architectures, compliance frameworks. The main limitation is graph construction: building the knowledge graph from unstructured text still requires significant engineering effort. Start with a manually curated graph for your highest-value entities and expand incrementally.
> "We spent three months optimizing prompts before realizing the problem was what we were feeding the model, not how we were asking. Once we rebuilt the ingestion pipeline with proper chunking, metadata, and freshness controls, answer quality jumped 40% without changing a single prompt."
- [1] The New Stack — Memory for AI Agents: A New Paradigm of Context Engineering (thenewstack.io)
- [2] Anthropic — Effective Context Engineering for AI Agents (anthropic.com)
- [3] Weaviate — Context Engineering (weaviate.io)
- [4] PremAI — RAG Chunking Strategies: The 2026 Benchmark Guide (blog.premai.io)
- [5] Firecrawl — Best Chunking Strategies for RAG (firecrawl.dev)
- [6] Schema App — Why Hybrid Graph + Vector RAG Is the Future of Enterprise AI (schemaapp.com)
- [7] Machine Learning Mastery — Vector Databases vs Graph RAG for Agent Memory: When to Use Which (machinelearningmastery.com)
- [8] Elephas — Best Embedding Models (elephas.app)
- [9] Towards Data Science — Beyond RAG (towardsdatascience.com)