AI Native Builders

Context Engineering for Enterprise AI: Building the Memory Layer Your Agents Need

A practitioner's guide to building enterprise memory infrastructure for AI agents — covering embedding strategies, chunking decisions, vector stores vs knowledge graphs, and context window management that actually works at scale.

Data, Context & Knowledge · Advanced · Feb 24, 2026 · 8 min read
(Illustration: a robot librarian organizing an impossibly large card catalog of enterprise knowledge sources.)
Building the memory layer that turns stateless AI into systems with institutional knowledge

Every enterprise AI deployment hits the same wall. The model is capable. The prompts are solid. The agent framework is wired up. And then someone asks a question that requires knowing what happened in a Confluence page from eight months ago, cross-referenced with a Slack thread from last Tuesday, filtered by what the current user is actually authorized to see.

The agent hallucinates. Or returns something generic. Or — the most insidious failure — returns a confidently wrong answer drawn from a stale document that was superseded three revisions ago.

This is the context engineering problem, and it is now the primary bottleneck for enterprise AI. According to Gartner, "context engineering is in, and prompt engineering is out" — a shift that makes sense once you have competent models, because the quality of what you feed them matters more than how you ask[2].

Context engineering is the discipline of structuring everything an LLM needs — prompts, memory, retrieved documents, tool outputs, conversation history — so it can make reliable decisions. For enterprise systems, this means building a memory layer: persistent infrastructure that gives agents access to institutional knowledge across sessions, users, and time[1].

Why Naive RAG Fails at Enterprise Scale

The gap between tutorial RAG and production memory systems

The standard RAG tutorial goes like this: chunk your documents, embed them, store them in a vector database, retrieve the top-k results, stuff them into the prompt. It works beautifully on a single PDF.

Then you try it on 40,000 Confluence pages, 200,000 Slack messages, a Salesforce instance, three SharePoint sites, and a Notion workspace. Suddenly you are dealing with a different class of problem entirely.

Based on available research and practitioner reports, roughly 80% of RAG failures trace back to the ingestion and chunking layer, not the LLM[5]. Most teams discover this after spending weeks tuning prompts and swapping models, only to realize the retrieved context was wrong from the start.

  • ~80% of RAG failures are reported to originate in the ingestion/chunking layer, not the model (practitioner data; your results may vary).

  • 15-30% improvement in answer quality from hybrid vector+graph retrieval vs vector-only (Schema App evaluation; actual gains depend on query patterns and corpus).

  • 512 tokens is the typical chunking sweet spot: recursive splitting at this size beats semantic chunking in benchmarks, though results vary by corpus type.

Enterprise RAG fails in three predictable ways:

  • Retrieval noise: the vector search returns documents that are semantically similar but factually irrelevant. "Q3 budget planning" matches "Q1 budget planning" with high cosine similarity, but the Q1 doc is useless for a Q3 question.

  • Stale context: documents were indexed once and never updated. The agent cites a policy that was revised two months ago.

  • Permission leakage: the retrieval layer has no concept of who is asking. An intern's query returns the same executive compensation data as the CFO's.

Fixing these requires treating context as an engineering discipline, not an afterthought you bolt onto an agent framework.

Chunking Decisions That Actually Matter

What the benchmarks say — and where your intuition is wrong

Chunking is where teams waste the most time on the wrong approach. The instinct is to reach for semantic chunking — splitting documents along "meaning boundaries" rather than fixed token counts. It sounds obviously better. The benchmarks tell a more nuanced story.

A February 2026 benchmark by PremAI across 50 academic papers placed recursive 512-token splitting at approximately 69% accuracy, while semantic chunking landed at roughly 54% — fifteen points behind[4]. Why? Semantic chunking produced fragments averaging just 43 tokens. Too small. Too fragmented. The chunks lost the surrounding context that made them useful. Note that this benchmark used academic texts; your enterprise corpus may behave differently.

The practical starting point is simpler than you think: recursive character splitting at 256-512 tokens with 10-20% overlap. That is roughly 50-100 tokens of overlap for 512-token chunks. This is a commonly validated default as of early 2026, and it outperforms fancier approaches on most enterprise corpora — though always benchmark against your own data before committing.
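A minimal sketch of recursive splitting with a greedy repack, approximating token counts as whitespace-delimited words to stay self-contained. Production code should use the embedding model's tokenizer, and in this sketch overlap is applied only at hard word-level splits:

```typescript
// Recursive splitting sketch: try coarse separators first, recurse to
// finer ones when a piece is still over budget. Token counts are
// approximated as word counts here (an assumption for brevity).
const SEPARATORS = ["\n\n", "\n", ". ", " "];

function wordCount(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}

function recursiveSplit(
  text: string,
  maxTokens = 512,
  overlapTokens = 64,
  separators: string[] = SEPARATORS
): string[] {
  if (wordCount(text) <= maxTokens) return text.trim() ? [text.trim()] : [];
  const [sep, ...rest] = separators;
  if (!sep) {
    // No separators left: hard-split on words, applying overlap.
    const words = text.split(/\s+/).filter(Boolean);
    const chunks: string[] = [];
    const step = Math.max(1, maxTokens - overlapTokens);
    for (let i = 0; i < words.length; i += step) {
      chunks.push(words.slice(i, i + maxTokens).join(" "));
      if (i + maxTokens >= words.length) break;
    }
    return chunks;
  }
  // Split on the current separator, then greedily pack pieces back
  // together up to the token budget.
  const pieces = text.split(sep);
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current ? current + sep + piece : piece;
    if (wordCount(candidate) <= maxTokens) {
      current = candidate;
    } else {
      if (current.trim()) chunks.push(current.trim());
      current = "";
      if (wordCount(piece) > maxTokens) {
        // Piece alone is too big: recurse with finer separators.
        chunks.push(...recursiveSplit(piece, maxTokens, overlapTokens, rest));
      } else {
        current = piece;
      }
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

The key property is that splits happen at the coarsest boundary that fits, so paragraphs stay intact when they can.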

Common Chunking Mistakes
  • Defaulting to semantic chunking without benchmarking against recursive splitting

  • Using fixed 1024-token chunks regardless of document type

  • Zero overlap between chunks — losing context at boundaries

  • Treating all document types the same (code, prose, tables, logs)

  • Ignoring document structure (headers, sections, list hierarchies)

What Works in Production
  • Start with recursive 512-token splitting, then benchmark alternatives

  • Match chunk size to your embedding model's sweet spot (typically 256-512 tokens)

  • 10-20% overlap to preserve boundary context

  • Structure-aware splitting for Markdown, HTML, and code — use header hierarchy

  • Build a 50-100 pair test set before committing to any strategy

Late chunking is the most promising development in this space. Instead of splitting first and embedding each chunk independently, you feed the entire document into a long-context embedding model, then split the resulting embeddings. Each chunk retains awareness of the full document context — pronouns resolve correctly, headers carry through, and cross-references stay intact.

Jina's embeddings-v4 and Voyage AI's 32K-token context window both support this pattern. The tradeoff is compute cost: you embed the full document instead of individual chunks. For enterprise corpora where accuracy matters more than marginal embedding cost, late chunking is worth evaluating.
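The pooling step at the heart of late chunking can be sketched as follows, assuming you already have per-token embeddings for the whole document from a long-context model (represented here as plain number arrays):

```typescript
// Late chunking sketch: embed the WHOLE document once at the token level,
// then mean-pool token embeddings over each chunk's span. Because the
// token embeddings were produced with full-document attention, each chunk
// embedding is conditioned on the entire document.
type Span = { start: number; end: number }; // token indices, end exclusive

function meanPool(tokenEmbeddings: number[][], span: Span): number[] {
  const dim = tokenEmbeddings[0].length;
  const pooled = new Array(dim).fill(0);
  for (let t = span.start; t < span.end; t++) {
    for (let d = 0; d < dim; d++) pooled[d] += tokenEmbeddings[t][d];
  }
  const n = span.end - span.start;
  return pooled.map((v) => v / n);
}

function lateChunk(tokenEmbeddings: number[][], spans: Span[]): number[][] {
  return spans.map((span) => meanPool(tokenEmbeddings, span));
}
```

Contrast this with naive chunking, where each chunk is embedded in isolation and loses everything outside its own boundaries.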

Embedding Strategy Selection

Choosing the right embedding model for your retrieval architecture

The embedding model market shifted significantly in 2025-2026. The old default — OpenAI's text-embedding-ada-002 — is now two generations behind. The current landscape has three tiers of options, each with distinct tradeoffs.

| Model | Context Window | MTEB Retrieval Score | Price per 1M tokens | Best For |
|---|---|---|---|---|
| Voyage AI voyage-3-large | 32K tokens | Highest (MTEB leader) | $0.06-$0.18 | Maximum retrieval quality, long documents |
| OpenAI text-embedding-3-large | 8K tokens | Strong baseline | $0.13 | Broad ecosystem integration, balanced cost |
| Jina embeddings-v4 | 8K (dense) / 8K (ColBERT) | Competitive with Voyage | Varies | Multi-modal retrieval, late interaction |
| Google Gemini Embedding | Up to 3K tokens | Cross-lingual leader | Free | Multilingual corpora, cost-sensitive workloads |
| Open-source (BGE, E5) | 512-8K tokens | Varies widely | Self-hosted cost | Air-gapped environments, full data control |

Voyage AI's voyage-3-large outperforms OpenAI's text-embedding-3-large by approximately 9.74% and Cohere's embed-v3 by roughly 20.71% on MTEB retrieval benchmarks as of March 2026[8]. Its 32K-token context window means you can embed longer documents without chunking — or use late chunking to preserve context across boundaries. Benchmark rankings change as new models are released, so verify against the current MTEB leaderboard before making production commitments.

But raw benchmark scores are not the whole story. The right choice depends on your constraints:

  • Already in the OpenAI ecosystem? The text-embedding-3 family offers good enough quality with the simplest integration path. The small variant at $0.02/M tokens is hard to beat for cost-sensitive workloads.
  • Need multi-modal retrieval? Jina's v4 handles text and images through a unified pathway, supporting both dense and ColBERT-style multi-vector embeddings.
  • Multilingual corpus? Google's Gemini embedding leads on cross-lingual benchmarks and costs nothing through the API.
  • Regulatory constraints? Self-hosted open-source models like BGE-M3 or E5-Mistral give you full data sovereignty.

Vector Stores vs Knowledge Graphs vs Hybrid

Three retrieval architectures and when each one wins

The retrieval backend decision is one of the most consequential architectural choices in your context engineering stack. There are three viable patterns, and for most enterprises the right answer combines vector search with graph traversal.

Hybrid Retrieval Architecture
A hybrid architecture routes queries through both vector similarity search and graph traversal, then merges and re-ranks results before context assembly.

Vector stores excel at semantic similarity over unstructured text. Ask "how do we handle customer refunds?" and the vector search finds documents about refund policies, return procedures, and chargeback handling — even if they never use the word "refund." This is the breadth play. Vector search covers messy, unstructured knowledge bases where the relationships between documents are implicit.

Knowledge graphs excel at structured relationships and multi-hop reasoning. Ask "which teams depend on the payments service and what are their SLAs?" and a graph traversal finds the answer through explicit entity relationships: payments-service → consumed-by → [checkout, subscriptions, invoicing] → each with SLA nodes. This is the depth play. Graphs shine when the question requires traversing relationships, enforcing permissions, or reasoning across connected entities[7].

The hybrid approach uses vectors for breadth and graphs for depth. A query router classifies the incoming question and fans out to both systems when needed. Results are merged, deduplicated, and re-ranked by a cross-encoder before context assembly. According to Schema App's published evaluation, teams have reported roughly 15-30% improvements in faithfulness and answer relevancy with hybrid retrieval[6] — though actual gains depend heavily on corpus structure, re-ranking configuration, and caching. Your mileage may vary.

When to use vector-only retrieval

  • Your corpus is primarily unstructured text (docs, emails, chat logs, support tickets)

  • Queries are open-ended and exploratory — users do not know exactly what they are looking for

  • You need to get to production fast and iterate — vector stores have simpler operational overhead

  • The relationships between documents are not well-defined or change frequently

When to add a knowledge graph

  • Questions require multi-hop reasoning across entities (team → service → SLA → incident)

  • Access control is critical — graphs model permissions as first-class relationships

  • Your domain has stable ontologies (org charts, service maps, compliance frameworks)

  • You need explainable retrieval — graph paths provide audit trails that vectors cannot

Context Window Management Under Pressure

How to allocate tokens when everything is competing for space

Models keep getting larger context windows — Claude supports 200K tokens, Gemini 2M, GPT-4o 128K. So why does context window management still matter?

Because attention degrades with length. Research from mid-2025 showed that retrieval quality drops measurably even for models with massive context windows when you stuff them with too much retrieved text. Shorter, more precise context consistently produces better answers than dumping 50K tokens of "potentially relevant" documents — though the optimal threshold varies by model and task type.

The enterprise context window is a budget, not a bucket. You are allocating a finite resource across competing demands.

  • 10-15%: system prompt and instructions

  • 50-60%: retrieved context (documents, data)

  • 15-20%: conversation history

  • 15-20%: reserved for model output
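In code, the static allocation is just a set of ratios applied to the model's window. A minimal sketch using mid-points of the ranges above (the exact percentages are suggested starting points, not hard rules):

```typescript
interface ContextBudget {
  systemPrompt: number;
  retrievedContext: number;
  conversationHistory: number;
  outputReserve: number;
}

// Turn the percentage allocation into concrete token budgets for a
// given model window. Tune the ratios per model and workload.
function allocateBudget(windowTokens: number): ContextBudget {
  return {
    systemPrompt: Math.floor(windowTokens * 0.12),        // 10-15%
    retrievedContext: Math.floor(windowTokens * 0.55),    // 50-60%
    conversationHistory: Math.floor(windowTokens * 0.17), // 15-20%
    outputReserve: Math.floor(windowTokens * 0.16),       // 15-20%
  };
}
```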

The real challenge is not the static allocation — it is the dynamic rebalancing as conversations extend. As conversation history grows, the space available for retrieved context shrinks. You have four strategies to manage this:

Sliding window with summarization. Keep the last N turns verbatim and summarize everything older. The summary consumes fewer tokens while preserving key context. This is the simplest approach and works well for most conversational agents.
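A minimal sketch of the pattern; the summarize callback stands in for an LLM summarization call, kept synchronous here for brevity:

```typescript
type Message = { role: "user" | "assistant"; content: string };

// Sliding-window history: keep the last `keepCount` messages verbatim
// and collapse everything older into a single summary message. In
// production, `summarize` would be an async LLM call.
function slidingWindowHistory(
  history: Message[],
  keepCount: number,
  summarize: (msgs: Message[]) => string
): Message[] {
  if (history.length <= keepCount) return history;
  const older = history.slice(0, history.length - keepCount);
  const recent = history.slice(history.length - keepCount);
  return [
    { role: "assistant", content: `Summary of earlier conversation: ${summarize(older)}` },
    ...recent,
  ];
}
```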

Retrieval budget scaling. Reduce the number of retrieved chunks as conversation history grows. Start with top-8 retrieval on the first turn, scale down to top-3 by turn fifteen. This trades retrieval breadth for conversation continuity.

Hierarchical context. Maintain two retrieval tiers — a compact summary layer (always included) and a detail layer (included only when the query requires depth). The summary layer costs 200-500 tokens and provides ambient context. The detail layer provides specifics on demand.

Context eviction with recency bias. Score each piece of context by relevance to the current query and recency of insertion. Evict the lowest-scoring items first. This requires scoring infrastructure but handles long-running sessions gracefully.
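One way to sketch the scoring: combine relevance with an exponential recency decay. The 0.7/0.3 weights and five-turn half-life below are illustrative defaults, not canonical values:

```typescript
interface ContextItem {
  text: string;
  relevance: number;     // similarity to the current query, 0-1
  insertedAtTurn: number;
}

// Recency-biased eviction: score each item by a weighted mix of
// relevance and recency, sort best-first, and evict from the tail
// when the context is over budget.
function evictionOrder(
  items: ContextItem[],
  currentTurn: number,
  halfLifeTurns = 5
): ContextItem[] {
  const scored = items.map((item) => {
    const age = currentTurn - item.insertedAtTurn;
    const recency = Math.pow(0.5, age / halfLifeTurns); // halves every halfLifeTurns
    return { item, score: 0.7 * item.relevance + 0.3 * recency };
  });
  scored.sort((a, b) => b.score - a.score);
  return scored.map((s) => s.item);
}
```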

Building Institutional Memory That Agents Can Use

From scattered knowledge silos to a unified memory layer

Institutional memory is the accumulated knowledge an organization has about how it works — not just the documents, but the decisions, the context around those decisions, the informal rules that never got written down, and the relationships between all of these.

Most enterprises have this knowledge fragmented across dozens of silos: Confluence, Jira, SharePoint, Slack, Google Drive, CRMs, ticketing systems, code repositories, and the heads of long-tenured employees. Context engineering for enterprise AI means building a unified memory layer that agents can query across all of these sources with appropriate access controls.

  1. Inventory your knowledge sources

     Map every system that contains institutional knowledge. Categorize by freshness (real-time, daily, weekly, static), structure (structured, semi-structured, unstructured), and access model (public, role-based, sensitive). This inventory drives your ingestion pipeline design.

  2. Design your ingestion pipeline with freshness tiers

     Not everything needs real-time indexing. Separate your sources into freshness tiers: real-time (Slack, ticketing), daily (Confluence, SharePoint), and weekly (archived documents, historical data). Each tier gets its own sync cadence and resource allocation.

  3. Embed with metadata

     Every chunk needs metadata beyond the embedding vector: source system, document title, last modified date, author, access permissions, and a document version hash. This metadata enables filtered retrieval, freshness sorting, and access control at query time.

  4. Build the access control layer

     The retrieval layer must enforce permissions. When a user queries the system, the retrieved results must respect what that user is authorized to see. This means mapping identity from your auth provider to permissions on each chunk.

  5. Implement query routing and re-ranking

     A single retrieval strategy will not handle every query type. Build a query router that classifies incoming questions and dispatches to the appropriate retrieval backend: vector search for open-ended questions, graph traversal for entity lookups, keyword search for exact matches.
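A rule-based sketch of such a router follows. Production routers are often small LLM classifiers; the quoted-phrase and known-entity heuristics here are illustrative:

```typescript
type Backend = "vector" | "graph" | "keyword";

// Heuristic query router: quoted phrases suggest exact-match intent,
// known entity mentions suggest graph traversal, and vector search is
// always included for breadth.
function routeQuery(query: string, knownEntities: Set<string>): Backend[] {
  const backends: Backend[] = [];
  const quoted = /"[^"]+"/.test(query); // exact-match intent
  const mentionsEntity = Array.from(knownEntities).some((e) =>
    query.toLowerCase().includes(e.toLowerCase())
  );
  if (quoted) backends.push("keyword");
  if (mentionsEntity) backends.push("graph"); // entity lookup / multi-hop
  backends.push("vector");                    // breadth fallback
  return backends;
}
```

Results from all dispatched backends are then merged and re-ranked before context assembly.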

Memory Types Your Agents Need

Short-term, long-term, episodic, and procedural — each serves a different purpose

Production-grade agent memory in 2026 goes beyond simple document retrieval. Mature agent systems implement multiple memory types, each serving a distinct purpose in the agent's reasoning.

| Memory Type | Scope | Storage | Example |
|---|---|---|---|
| Working memory | Current session | Context window (ephemeral) | The user's current question, recent tool outputs, active plan |
| Episodic memory | Cross-session | Vector store + timestamps | "Last week this user asked about the payments API migration" |
| Semantic memory | Organizational | Vector store + knowledge graph | Company policies, architecture docs, product specs |
| Procedural memory | Agent-specific | Prompt templates + tool configs | How to query Jira, how to format a code review, escalation rules |
| Shared memory | Multi-agent | Shared store with namespaces | Research agent stores findings that writing agent later retrieves |

The critical insight is that these memory types have different write cadences and eviction policies. Working memory is ephemeral — it exists only for the current interaction. Episodic memory accumulates per-user and should be summarized periodically to prevent unbounded growth. Semantic memory reflects organizational knowledge and updates on the same cadence as your source systems. Procedural memory changes only when agent behavior is updated.

Frameworks like Mem0 and LangGraph's checkpointing handle working and episodic memory reasonably well. Semantic memory is where most enterprise teams need custom infrastructure, because it ties directly into your ingestion pipeline and access control layer.

Implementation Patterns in Code

Concrete code for the retrieval and context assembly pipeline

lib/context-engine.ts

```typescript
// Illustrative sketch: vectorStore, knowledgeGraph, classifyQuery,
// getUserPermissions, deduplicateResults, crossEncoderRerank, fitToBudget,
// trimHistory, formatContext, and the Message type are assumed to be
// provided elsewhere in the codebase.
interface RetrievalResult {
  content: string;
  metadata: {
    source: string;
    lastModified: string;
    permissions: string[];
    score: number;
  };
}

interface ContextBudget {
  systemPrompt: number;        // tokens reserved for instructions
  retrievedContext: number;    // tokens for RAG results
  conversationHistory: number; // tokens for past messages
  outputReserve: number;       // tokens reserved for generation
}

async function assembleContext(
  query: string,
  userId: string,
  history: Message[],
  budget: ContextBudget
): Promise<string> {
  // 1. Route query to appropriate retrieval backends
  const queryType = await classifyQuery(query);

  // 2. Retrieve with user permissions pre-filtered
  const vectorResults = await vectorStore.search(query, {
    topK: 10,
    filter: { permissions: { $in: getUserPermissions(userId) } },
  });

  const graphResults = queryType === 'entity_lookup'
    ? await knowledgeGraph.traverse(query, { maxDepth: 3 })
    : [];

  // 3. Merge and re-rank
  const merged = deduplicateResults([...vectorResults, ...graphResults]);
  const reranked = await crossEncoderRerank(query, merged);

  // 4. Trim to budget
  const contextChunks = fitToBudget(reranked, budget.retrievedContext);

  // 5. Summarize history if over budget
  const trimmedHistory = await trimHistory(
    history, budget.conversationHistory
  );

  return formatContext(contextChunks, trimmedHistory);
}
```
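The fitToBudget helper referenced above could be as simple as a greedy pack over re-ranked chunks. The chars/4 token estimate is an assumption for brevity; swap in your model's tokenizer for real budgets:

```typescript
interface ScoredChunk {
  content: string;
  score: number;
}

// Greedy budget fit: take the highest-scoring chunks first, skipping any
// that would overflow, so smaller lower-ranked chunks can backfill the
// remaining space. Token cost is approximated as chars/4 (an assumption).
function fitToBudget(ranked: ScoredChunk[], budgetTokens: number): ScoredChunk[] {
  const kept: ScoredChunk[] = [];
  let used = 0;
  for (const chunk of ranked) { // ranked = best first
    const cost = Math.ceil(chunk.content.length / 4);
    if (used + cost > budgetTokens) continue; // skip, try smaller chunks
    kept.push(chunk);
    used += cost;
  }
  return kept;
}
```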
lib/ingestion-pipeline.ts

```typescript
// Illustrative sketch: detectDocumentType, chunk, hash, embedBatch,
// vectorStore, and the SourceDocument type are assumed helpers.
interface ChunkingConfig {
  strategy: 'recursive' | 'semantic' | 'structure-aware';
  chunkSize: number;          // target tokens per chunk
  chunkOverlap: number;       // overlap tokens between chunks
  respectBoundaries: boolean; // never split mid-sentence
}

const PRODUCTION_DEFAULTS: ChunkingConfig = {
  strategy: 'recursive',
  chunkSize: 512,
  chunkOverlap: 64, // ~12% overlap
  respectBoundaries: true,
};

async function ingestDocument(
  doc: SourceDocument,
  config: ChunkingConfig = PRODUCTION_DEFAULTS
) {
  // 1. Detect document structure
  const docType = detectDocumentType(doc); // markdown, html, code, plain
  const effectiveConfig = docType === 'markdown'
    ? { ...config, strategy: 'structure-aware' as const }
    : config;

  // 2. Check for existing version — skip if unchanged
  const versionHash = hash(doc.content);
  const existing = await vectorStore.getBySourceId(doc.id);
  if (existing?.versionHash === versionHash) return;

  // 3. Chunk with metadata preservation
  const chunks = await chunk(doc.content, {
    ...effectiveConfig,
    metadata: {
      sourceId: doc.id,
      sourceSystem: doc.system,
      lastModified: doc.updatedAt,
      permissions: doc.acl,
      versionHash,
    },
  });

  // 4. Evict stale chunks, embed and store new ones
  if (existing) await vectorStore.deleteBySourceId(doc.id);
  const embeddings = await embedBatch(chunks.map(c => c.content));
  await vectorStore.upsertBatch(chunks, embeddings);
}
```

Context Engineering Anti-Patterns

Mistakes that look reasonable but fail at scale

Anti-Patterns to Avoid

Do not stuff the entire context window with retrieved text

More context is not always better. At long context lengths, retrieval quality degrades. Keep retrieved context to 50-60% of the available window and reserve space for conversation history and output.

Do not skip re-ranking after retrieval

Vector similarity is a rough proxy for relevance. A cross-encoder re-ranker examines query-document pairs jointly and catches false positives that embedding similarity misses. This is the single highest-ROI addition to most RAG pipelines.

Do not index once and forget

Enterprise knowledge changes constantly. Build change detection into your ingestion pipeline. Stale chunks that cite outdated information are worse than no retrieval at all — they produce confidently wrong answers.

Do not ignore chunk boundaries when documents reference each other

A chunk that says 'as described in the architecture doc' is useless without that referenced document. Use parent-child chunk relationships or include document-level summaries alongside fine-grained chunks.

Do not treat permissions as an afterthought

Retrofitting access control onto a flat vector store is painful. Design ACL-aware retrieval from day one. Pre-filter on permissions before similarity search, not after.

Production Readiness Checklist

What to verify before your context engine handles real queries

Enterprise Context Engine — Go-Live Checklist

  • Chunking strategy benchmarked against 50+ query-answer pairs from your corpus

  • Embedding model selected and tested — migration plan documented for future model switches

  • Ingestion pipeline running with change detection and stale chunk eviction

  • Access control pre-filtering verified with adversarial test queries

  • Cross-encoder re-ranker deployed between retrieval and context assembly

  • Context window budget allocation defined and enforced programmatically

  • Retrieval logging active — every query logged with results, scores, and latency

  • Freshness monitoring alerting on stale index segments

  • Fallback behavior defined for zero-retrieval and low-confidence scenarios

  • Load tested at 10x expected query volume with acceptable p95 latency

Measuring Context Quality

You cannot improve what you do not measure

Context quality is measurable, and the metrics are more straightforward than most teams expect. The three that matter most:

Retrieval recall — what percentage of the relevant chunks appear in the retrieved set? Measure this against your labeled test set. A target of roughly 85%+ is a reasonable starting point before tuning anything else — though the right threshold depends on your application's tolerance for missing context.

Context precision — what percentage of retrieved chunks are actually relevant to the query? Low precision means you are wasting context window budget on noise. A re-ranker is the fastest fix.

Answer faithfulness — does the model's response stay grounded in the retrieved context? This catches both hallucinations and cases where the model ignores the retrieved context in favor of its parametric knowledge. Frameworks like RAGAS and TruLens automate this evaluation.
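Retrieval recall and context precision are straightforward to compute once you have a labeled test set; a minimal sketch:

```typescript
interface EvalCase {
  relevantIds: Set<string>; // labeled ground-truth chunk ids
  retrievedIds: string[];   // what the pipeline actually returned
}

// Retrieval recall: fraction of relevant chunks that were retrieved.
// Context precision: fraction of retrieved chunks that were relevant.
// Both are averaged across the test set.
function retrievalMetrics(cases: EvalCase[]) {
  let recallSum = 0;
  let precisionSum = 0;
  for (const c of cases) {
    const hits = c.retrievedIds.filter((id) => c.relevantIds.has(id)).length;
    recallSum += c.relevantIds.size ? hits / c.relevantIds.size : 1;
    precisionSum += c.retrievedIds.length ? hits / c.retrievedIds.length : 1;
  }
  return {
    recall: recallSum / cases.length,
    precision: precisionSum / cases.length,
  };
}
```

Faithfulness needs an LLM-based judge and is better delegated to a framework like RAGAS or TruLens.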

  • ~85%+ retrieval recall on a labeled test set: a reasonable starting target; calibrate to your corpus and risk tolerance.

  • ~90%+ answer faithfulness (grounded in context): target varies by application; higher for compliance use cases.

  • < 500ms p95 retrieval + re-ranking latency as a starting point; complex corpora may require tuning.

  • < 4 hrs maximum staleness for real-time sources; adjust based on how frequently your source data changes.

Should I use a managed vector database or self-host?

For most enterprises, start managed. Pinecone, Weaviate Cloud, and Qdrant Cloud handle scaling, backups, and index optimization. Self-hosting makes sense when you have strict data residency requirements or need custom index configurations that managed offerings do not support. The operational overhead of self-hosted vector databases is real — plan for index compaction, backup strategies, and version upgrades.

How do I handle documents that change frequently?

Implement a change-detection pipeline using webhooks or polling with content hashing. When a document changes, evict its old chunks and re-embed the new version. Use a version hash stored alongside each chunk to make this comparison fast. For extremely dynamic sources like Slack, consider windowed indexing — only index the last 90 days and expire older messages.

Is RAG still necessary with 1M+ token context windows?

Yes. Even with massive context windows, you still need to decide WHAT to put in them. A million tokens of irrelevant documents produces worse results than 10K tokens of precisely relevant ones. Long context windows change the math — you can include more context per query — but retrieval, filtering, and ranking remain essential. RAG is evolving into context engineering, not disappearing.

How many chunks should I retrieve per query?

Start with 8-12 chunks retrieved, then re-rank down to 3-5 for context assembly. The retrieval-to-context ratio matters: retrieve broadly to maximize recall, then re-rank aggressively to maximize precision. Monitor the marginal relevance of each additional chunk — if chunks 6-12 are consistently irrelevant after re-ranking, reduce your initial retrieval to save latency.

What about GraphRAG — is it production-ready?

GraphRAG is maturing fast. As of early 2026, it is production-viable for domains with well-defined entity relationships — org structures, service architectures, compliance frameworks. The main limitation is graph construction: building the knowledge graph from unstructured text still requires significant engineering effort. Start with a manually curated graph for your highest-value entities and expand incrementally.

We spent three months optimizing prompts before realizing the problem was what we were feeding the model, not how we were asking. Once we rebuilt the ingestion pipeline with proper chunking, metadata, and freshness controls, answer quality jumped 40% without changing a single prompt.

Dana Reeves, Head of AI Platform, Series C Enterprise SaaS
Key terms in this piece
context engineering, enterprise AI memory, RAG architecture, embedding strategies, chunking decisions, vector store, knowledge graph, context window management, institutional memory, retrieval augmented generation
Sources
  1. [1] The New Stack, "Memory For AI Agents: A New Paradigm Of Context Engineering" (thenewstack.io)
  2. [2] Anthropic, "Effective Context Engineering For AI Agents" (anthropic.com)
  3. [3] Weaviate, "Context Engineering" (weaviate.io)
  4. [4] PremAI, "RAG Chunking Strategies: The 2026 Benchmark Guide" (blog.premai.io)
  5. [5] Firecrawl, "Best Chunking Strategies For RAG" (firecrawl.dev)
  6. [6] Schema App, "Why Hybrid Graph + Vector RAG Is The Future Of Enterprise AI" (schemaapp.com)
  7. [7] Machine Learning Mastery, "Vector Databases vs Graph RAG For Agent Memory: When To Use Which" (machinelearningmastery.com)
  8. [8] Elephas, "Best Embedding Models" (elephas.app)
  9. [9] Towards Data Science, "Beyond RAG" (towardsdatascience.com)