Cosine similarity scores look fine while your RAG pipeline gives wrong answers. Four failure modes that produce confident, wrong outputs — and the retrieval stack that actually fixes them.
Why cosine similarity scores look healthy while RAG answers are wrong — the alibi problem
Four failure modes with distinct operational signatures: chunking, vocabulary mismatch, semantic smear, lost-in-middle
The two-stage retrieval stack that closes the gap: hybrid BM25+dense retrieval with RRF, then cross-encoder reranking
Exact latency budgets and candidate pool sizes for cross-encoder reranking in production
How to set a confidence floor that makes your system abstain rather than hallucinate from noise
Building a retrieval eval harness from production query logs — no labeled dataset required
Your RAG pipeline logged zero retrieval errors last week. Every query returned results. Cosine similarity scores averaged 0.82. What the system didn't record: how many of those answers were wrong.
Cosine similarity is not a relevance signal. It measures proximity in embedding space, which correlates loosely with semantic similarity and not at all with whether the retrieved chunk contained the information the user actually needed. A 0.82 similarity score on the wrong document is not a near-miss. It's a confident hallucination with a clean audit trail.
Dense-only retrieval on heterogeneous enterprise corpora achieves Recall@10 of roughly 0.58–0.65 [2] — meaning the right chunk doesn't make the candidate list on 35–42% of queries. Not occasionally. Routinely, on a predictable slice of your query distribution. And because the system always returns something, nothing in your logs signals failure. The RAG failure modes that matter aren't exceptions. They're the default behavior of naive retrieval on production data.
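To make the metric concrete, here is a minimal sketch of Recall@k computed as a per-query hit rate, which is the reading used above: a query counts as a hit if any relevant chunk appears in its top k results. The function name, the data layout, and the toy chunk IDs are illustrative assumptions, not taken from [2].

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of queries whose top-k results contain at least one relevant chunk.

    retrieved_ids: dict of query id -> ranked list of chunk ids (best first)
    relevant_ids:  dict of query id -> set of chunk ids that answer the query
    """
    if not retrieved_ids:
        return 0.0
    hits = 0
    for query_id, ranked in retrieved_ids.items():
        relevant = relevant_ids.get(query_id, set())
        if any(chunk_id in relevant for chunk_id in ranked[:k]):
            hits += 1
    return hits / len(retrieved_ids)


# Toy data (hypothetical): q2's relevant chunk never makes the candidate list.
retrieved = {
    "q1": ["c7", "c2", "c9"],
    "q2": ["c4", "c1", "c3"],
    "q3": ["c8", "c5", "c6"],
}
relevant = {"q1": {"c2"}, "q2": {"c10"}, "q3": {"c6"}}

print(recall_at_k(retrieved, relevant, k=10))
```

Note that q2 still returned three results. Nothing in the retrieval call itself distinguishes it from q1 or q3.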
The alibi problem: similarity scores provide cover for retrieving the wrong thing.
Most vector stores return top-k results unconditionally. Ask about a deprecated API endpoint, and the store returns the five most similar chunks — whether or not any of them contain the actual answer. The LLM receives them and generates a plausible response grounded in whatever fragments it got. The response is fluent, confident, internally consistent.
It's also wrong.
This failure shape differs from a runtime error in one critical way: nothing fires. No latency spike. No exception. The system behaved exactly as designed — retrieved documents, injected them into context, generated a response grounded in those documents. The failure lives entirely in the gap between retrieved and relevant. That gap is invisible to your monitoring stack.
One production team rebuilding a team knowledge assistant documented this precisely. Their baseline system — embedding model plus ChromaDB with cosine similarity top-k — was confidently wrong roughly 20% of the time from day one [1]. The system had no mechanism to know. The error rate wasn't discovered through observability. It was discovered by manually reviewing outputs. The system's own metrics showed green: queries answered, latency stable, similarity scores in range.
That is the specific shape of this failure. The feedback loop is broken at the source. Without a signal that retrieval failed, there's no pressure to fix it.
Dense-only Recall@10 of roughly 0.58–0.65 on heterogeneous enterprise corpora. The right chunk doesn't appear in the top 10 on roughly 35–42% of queries. [2]
Exact-match queries make up 20–40% of production traffic in enterprise RAG over technical docs, contracts, or support tickets. Dense embeddings miss this slice systematically. [2]
A 21-point precision lift (roughly 58% to 79%) from BM25 + dense vector with RRF fusion over dense-only. Observed in production rebuilds before any reranking. [1]
When the right chunk reaches position 1–3, LLM accuracy reaches 92%. Ranking position is the primary bottleneck. [6]
Each of the four failure modes has a different operational signature. The shared root cause is treating similarity as a proxy for relevance.
| Failure Mode | Where It Hides | Production Symptom | Diagnostic Signal |
|---|---|---|---|
| Chunking destroys context | Index build | Answer split across chunk boundaries — neither retrieves well | Low similarity on queries where the answer spans a paragraph break |
| Vocabulary mismatch | Retrieval stage | Exact identifiers, error codes, version strings missed | High failure rate on queries with proper nouns or numeric identifiers |
| Semantic smear | Retrieval stage | Topically similar chunks returned, specific fact absent | High similarity scores, low faithfulness — model hedges or fabricates |
| Lost in the middle | Context assembly | Right chunk at rank 5–8, model ignores it | High Recall@10, low end-to-end accuracy on synthesis queries |
Chunking destroys context. Fixed-size chunking splits documents every 512 or 1024 characters regardless of where a thought ends. A question whose answer spans a paragraph boundary gets split: the first chunk contains the setup, the second contains the conclusion. Neither chunk, scored against the query in isolation, embeds the full reasoning, so each partial chunk scores lower than a complete passage would and can fall below the top-k cutoff. The right answer slips out of the candidate set.
Semantic chunking addresses this by measuring cosine similarity between adjacent sentences and opening a new chunk when similarity drops below a threshold — detecting natural topic boundaries rather than counting characters. The tradeoff is real: semantic chunking runs roughly 14x slower at index build time compared to fixed-size splitting, and the accuracy gains are inconsistent across corpus types. A clinical decision support study found adaptive topic-aligned chunking hit 87% retrieval accuracy versus 13% for fixed-size baselines [11] — but a multi-strategy benchmark of academic papers placed recursive 512-token splitting ahead of semantic chunking on final RAGAS scores. The right call depends on your corpus: semantic chunking pays off on long-form prose where topic transitions are unmarked; recursive splitting is cheaper and often sufficient on structured documents with consistent formatting.
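The boundary-detection idea can be sketched in a few lines. This is a minimal illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, and the 0.2 threshold is arbitrary. A production version would swap in an actual embedding model and tune the threshold on its own corpus.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, threshold=0.2):
    """Open a new chunk whenever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) < threshold:
            chunks.append([curr])      # similarity dropped: topic boundary
        else:
            chunks[-1].append(curr)    # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "Refunds for annual plans are prorated.",
    "Annual plans refunds are issued within ten days.",
    "The API rate limit is 100 requests per minute.",
    "Rate limit errors return status 429.",
]
chunks = semantic_chunks(sentences)
for chunk in chunks:
    print(chunk)
```

The extra index-time cost comes from the embedding call per sentence, which is why the slowdown grows with corpus size rather than staying constant.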
Vocabulary mismatch. Dense embedding models handle semantic paraphrase well — "compensation" and "salary" sit close in embedding space. What they don't handle is lexical precision. Query for error code ERR_CONN_REFUSED or model version claude-opus-4-7, and the embedding model maps these to approximate representations based on surrounding vocabulary context. The actual document containing the specific identifier may rank far below documents that are thematically related but factually irrelevant. In enterprise RAG over technical documentation, contracts, or support tickets, exact-match queries account for 20–40% of production traffic [2]. Dense-only retrieval fails this slice systematically, and because the failures look like normal successful retrievals, this failure mode persists undetected for months.
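One way to surface this slice is to tag identifier-heavy queries and compare failure rates between the tagged and untagged groups. The sketch below is a rough heuristic, not a classifier: treating any token with a digit or underscore as identifier-like is an assumption, and the `judged` pairs are made-up examples standing in for manually or LLM-reviewed production queries.

```python
import re


def is_identifier_heavy(query):
    """Heuristic (assumption): a token containing a digit or underscore is identifier-like."""
    tokens = re.findall(r"[\w.-]+", query)
    return any("_" in tok or any(ch.isdigit() for ch in tok) for tok in tokens)


def failure_rate_by_slice(judged_queries):
    """judged_queries: iterable of (query_text, retrieval_ok) pairs."""
    totals = {"identifier": [0, 0], "other": [0, 0]}  # [failures, count]
    for query, ok in judged_queries:
        key = "identifier" if is_identifier_heavy(query) else "other"
        totals[key][0] += 0 if ok else 1
        totals[key][1] += 1
    return {key: (fails / count if count else 0.0)
            for key, (fails, count) in totals.items()}


# Hypothetical reviewed queries: True means retrieval surfaced the right chunk.
judged = [
    ("What does ERR_CONN_REFUSED mean?", False),
    ("Is claude-opus-4-7 supported?", False),
    ("Which endpoint replaced v2 login?", True),
    ("How do refunds work for annual plans?", True),
    ("How do I reset my password?", True),
    ("Who approves contract renewals?", False),
]
rates = failure_rate_by_slice(judged)
print(rates)
```

A large gap between the two rates points at vocabulary mismatch rather than chunking or ranking.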
Semantic smear is subtler. The retriever returns documents about the same topic as the query, but not documents that contain the specific fact. A query about the refund policy for annual subscription plans might surface five chunks about billing procedures, cancellation flows, and account management settings. All are topically similar. None contain the actual refund policy clause. Similarity scores look healthy — 0.78, 0.74, 0.72. The LLM receives five documents, none of which have the answer, and either hedges with generic information or fills in a plausible-sounding response from training data. The interaction is logged as successful.
This is where cosine similarity functions as an alibi. The system can point to five chunks with scores above 0.72 and assert it tried. The audit trail is clean. The answer is wrong.
Lost in the middle. Retrieval can return the right chunk but rank it at position 6 out of 10. The LLM then processes a 10-document context window with the relevant passage buried inside. A well-documented effect in transformer models causes attention to concentrate at the beginning and end of the context window while systematically underweighting the middle positions: performance can degrade by more than 30% when the relevant passage shifts from the first or last position to a middle slot [9]. One commonly cited contributor is Rotary Position Embedding (RoPE): its long-term decay causes models to de-emphasize tokens as their distance from the current position grows, which favors the end of the context. Causal attention separately tends to overweight the earliest tokens, so passages in the middle benefit from neither effect.
The relevant chunk was retrieved. It was in context. The model still missed it.
The fix is not more context — it's better ranking. A cross-encoder reranker that promotes the right chunk to position 1 or 2 matters more than expanding the context window to accommodate additional noise. Thomson Reuters, analyzing their deployed customer support RAG system, found that when relevant documents rank in the top 3, their system generates accurate responses in 92% of cases [6]. That number dropped substantially when the relevant chunk ranked at position 6 or later. Position is the bottleneck, not recall.
Two changes — hybrid retrieval and cross-encoder reranking — cover most of the failure surface.
| Naive retrieval | Two-stage retrieval stack |
|---|---|
| Fixed-size character chunking (512–1024 chars) | Semantic or recursive chunking at topic/markup boundaries |
| Dense-only bi-encoder retrieval | Hybrid BM25 + dense retrieval, RRF fusion |
| Top-k ranked by cosine similarity | Retrieve top-50 candidates, rerank to top-5 via cross-encoder |
| No confidence threshold, always returns k results | Confidence gate: abstain if reranker score < threshold |
| Similarity rank == LLM input rank, no reranking | Cross-encoder precision: right chunk at position 1–2 |
| No abstention mechanism, fabricates from noise | Explicit abstention: 'insufficient context in knowledge base' |
| Eval set built from queries that already worked | Eval set sampled from production query logs |
Hybrid retrieval is the highest-leverage single change. Run BM25 alongside your dense vector search, then fuse the ranked lists using Reciprocal Rank Fusion. Dense retrieval handles semantic paraphrase — finding documents that express the same concept in different words. Sparse retrieval handles lexical precision — exact keyword matching for product names, error codes, identifiers, version strings. These failure modes are complementary: queries that fail dense retrieval typically succeed with BM25, and vice versa.
RRF sidesteps the score-scale incompatibility problem. BM25 and cosine similarity are on completely different scales and can't be directly combined. RRF converts each ranked list into position-based contributions — 1/(k + rank) where k=60 by convention — so a document that ranks high in both lists accumulates a stronger combined score regardless of the raw similarity values. The tradeoff: RRF discards score magnitude, so a document ranked first with 0.99 cosine similarity gets the same RRF contribution as a document ranked first with 0.51. When your corpus has genuine quality information embedded in the similarity scores, not just ranks, that information is lost.
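A minimal sketch of the fusion step, using the 1/(k + rank) contribution with k=60 described above. The function name and the document IDs are illustrative. The inputs are just ranked lists of IDs, so no raw scores are needed.

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of doc ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    Returns (doc_id, fused_score) pairs, best first.
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ranking = ["d3", "d1", "d2"]   # hypothetical sparse results
dense_ranking = ["d1", "d4", "d3"]  # hypothetical dense results

fused = rrf_fuse([bm25_ranking, dense_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Here d1 (ranks 2 and 1) edges out d3 (ranks 1 and 3), and both beat the documents that appear in only one list.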
One alternative worth knowing: convex combination (alpha * dense_score + (1 - alpha) * bm25_score, with both scores first normalized to a common range such as [0, 1]) preserves score magnitude and can outperform RRF, but it requires tuning the alpha parameter on labeled data. With as few as 40 query-relevance pairs, tuned convex combination consistently outperforms RRF both in-domain and out-of-domain [8]. The 2025 frontier approach takes this further with per-query dynamic alpha: detect at query time whether the incoming query is keyword-heavy or semantically diffuse, and adjust alpha accordingly rather than fixing it per-collection. If you have labeled data, start there instead of RRF.
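A sketch of convex combination with a simple grid search over alpha. Several choices here are assumptions for illustration rather than the method in [8]: min-max normalization per query, mean reciprocal rank as the tuning objective, a five-point grid, and the two toy labeled queries. With real data you would use your 40 or more query-relevance pairs.

```python
def min_max(scores):
    """Rescale one query's scores to [0, 1]; all-equal scores map to 0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def convex_fuse(dense, bm25, alpha):
    """Rank docs by alpha * dense + (1 - alpha) * bm25 on normalized scores."""
    dense_n, bm25_n = min_max(dense), min_max(bm25)
    docs = set(dense_n) | set(bm25_n)
    fused = {d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * bm25_n.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=lambda d: (-fused[d], d))


def tune_alpha(labeled, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Pick the alpha with the best mean reciprocal rank on labeled queries."""
    def mrr(alpha):
        total = 0.0
        for dense, bm25, relevant in labeled:
            ranking = convex_fuse(dense, bm25, alpha)
            if relevant in ranking:
                total += 1.0 / (ranking.index(relevant) + 1)
        return total / len(labeled)
    return max(grid, key=mrr)  # first best value wins on ties


# Hypothetical labeled queries: (dense scores, BM25 scores, relevant doc id).
labeled = [
    ({"a": 0.9, "b": 0.8, "x": 0.1}, {"a": 1.0, "b": 7.0, "x": 1.0}, "b"),
    ({"c": 0.9, "d": 0.5, "y": 0.1}, {"c": 2.0, "d": 2.5, "y": 0.5}, "c"),
]
best_alpha = tune_alpha(labeled)
print(best_alpha)
```

In this toy set the first query needs BM25 to win and the second needs dense to win, so the extremes of the grid each get one query wrong and a middle value gets both right.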
BM25 alone achieves roughly 40% retrieval precision. Dense vectors alone achieve roughly 58%. Hybrid search with RRF reaches roughly 79% before any reranking [1] — a 21-point lift from a structural change, not a model upgrade.
Precision gains of 25–40% in exchange for 50–300ms — when that trade is worth making and when it isn't.
A cross-encoder reads the query and each candidate document as a pair and scores them jointly. That joint scoring is what makes it accurate — and expensive. The bi-encoder that drove your initial retrieval embeds the query and documents independently; the cross-encoder sees them together, so it can catch signal the bi-encoder missed.
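Structurally, the reranking stage is a loop that scores each (query, candidate) pair and keeps the best few. The sketch below keeps the scorer pluggable: `toy_pair_score` is a hypothetical token-overlap stand-in, not a cross-encoder, and exists only so the example runs without a model. In practice you would pass in a function backed by whichever reranker you deploy.

```python
import re


def toy_pair_score(query, doc):
    """Placeholder scorer (assumption): share of query tokens present in the doc."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    d = set(re.findall(r"[a-z0-9]+", doc.lower()))
    return len(q & d) / len(q) if q else 0.0


def rerank(query, candidates, score_pair, top_n=5):
    """Score every (query, candidate) pair jointly and keep the top_n.

    Returns (score, candidate) pairs, best first.
    """
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


query = "refund policy annual subscription"
candidates = [
    "Billing procedures for all accounts.",
    "Annual subscription refund policy: prorated within 30 days.",
    "How to cancel a subscription.",
]
top = rerank(query, candidates, toy_pair_score, top_n=2)
for score, doc in top:
    print(round(score, 2), doc)
```

The cost profile follows directly from the loop: one scorer call per candidate, which is why the candidate pool size drives latency.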
The latency numbers are concrete: reranking 30 candidates takes roughly 100–150ms on CPU, 30–50ms on a T4 GPU. Push the candidate pool above 100 and you're looking at 300ms or more [10]. That changes the math on when reranking is worth it.
If your LLM generation step runs 2–3 seconds (typical for a streaming response), adding 150ms of reranking is negligible from a user-experience standpoint. If your end-to-end target is under 500ms, reranking on CPU will eat 20–30% of your budget and you'll need either a self-hosted GPU-backed model (BGE reranker v2 family under Apache 2.0) or a managed API like Cohere Rerank 3.5 (~$2.00 per 1,000 queries). FlashRank can rerank 50 candidates in under 20ms on CPU — a practical option when latency is tight and a smaller model's precision is sufficient [10].
The practical default: rerank the top-30 candidates and return the top-5. This fits most sub-500ms SLA budgets on CPU hardware and delivers 25–40% improvement in Precision@5 and NDCG@5 [10]. The compression step is where the gains live: 30 noisy candidates down to 5 precise results is 6:1, and a top-50 pool gives 10:1 if your reranker is fast enough. Don't try to rerank 100 candidates if you can't afford the latency; trim the candidate pool and accept a minor recall cost.
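The pool-size decision is simple arithmetic once you have a per-candidate cost. The sketch below assumes latency scales roughly linearly with pool size, which is a simplification (batching changes the curve), and derives the per-candidate figures from the numbers above: 100–150ms for 30 candidates on CPU is about 3.3–5ms each. Measure your own model before relying on these.

```python
def max_pool_size(rerank_budget_ms, ms_per_candidate, cap=100):
    """Largest candidate pool that fits the reranking budget, assuming linear cost."""
    if ms_per_candidate <= 0:
        raise ValueError("ms_per_candidate must be positive")
    return min(cap, int(rerank_budget_ms // ms_per_candidate))


# Pessimistic CPU estimate from the figures above: 150 ms / 30 candidates = 5 ms each.
cpu_ms_per_candidate = 5.0

print(max_pool_size(150, cpu_ms_per_candidate))   # budget of 150 ms
print(max_pool_size(75, cpu_ms_per_candidate))    # tighter budget, smaller pool
print(max_pool_size(1000, cpu_ms_per_candidate))  # capped regardless of budget
```

The cap reflects the advice above: past roughly 100 candidates the latency cost grows faster than the recall benefit.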
Setting a retrieval abstention threshold is a product decision about when to admit ignorance.
Most vector stores return top-k results unconditionally by default. This is the mechanism behind what one team called "hallucinations grounded in noise": the model cites real documents but answers the wrong question, because the retrieved documents were the least-wrong available rather than actually useful [1]. The LLM never knew retrieval failed. Neither did you.
A confidence floor turns retrieval failure into an explicit signal. After cross-encoder reranking, check whether the top-1 document's relevance score clears a minimum threshold. If it doesn't, return empty context and respond with "I don't have sufficient information to answer this" rather than generating from whatever noise the retriever surfaced. A threshold around 0.3–0.4 on normalized cross-encoder output is a reasonable starting point for most domains [2]; calibrate it against 50 hand-reviewed queries from your specific corpus.
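The gate itself is a few lines. This sketch assumes reranker scores are already normalized to [0, 1] and sorted best first. The function names, the 0.35 default (the midpoint of the starting range above), and the `generate` callable standing in for the LLM call are all illustrative.

```python
ABSTAIN_MESSAGE = "I don't have sufficient information to answer this"


def gate_context(reranked, threshold=0.35, top_n=5):
    """Return the top chunks, or None if the best score is below the floor.

    reranked: list of (score, chunk) pairs, best first, scores in [0, 1].
    """
    if not reranked or reranked[0][0] < threshold:
        return None
    return [chunk for _, chunk in reranked[:top_n]]


def answer(query, reranked, generate, threshold=0.35):
    """Abstain explicitly instead of generating from low-confidence context."""
    context = gate_context(reranked, threshold)
    if context is None:
        return ABSTAIN_MESSAGE
    return generate(query, context)


# Hypothetical stand-in for the LLM call.
def fake_generate(query, context):
    return f"answer grounded in {len(context)} chunks"


confident = [(0.81, "Refund policy clause"), (0.42, "Billing overview")]
noise = [(0.22, "Cancellation flow"), (0.18, "Account settings")]

print(answer("refund policy for annual plans?", confident, fake_generate))
print(answer("refund policy for annual plans?", noise, fake_generate))
```

Gating on the top-1 score only, as described above, keeps the rule easy to calibrate: there is a single number to review against hand-labeled queries.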
The threshold is an editorial decision disguised as a number. Setting it lower preserves coverage — more queries get an answer — but some answers will be fabricated from low-quality context. Setting it higher reduces fabrication but increases abstention rate. Where you draw that line depends on the downstream cost of a wrong answer. A support agent acting on a wrong refund policy has a different risk profile than a developer reading an architecture summary.
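The tradeoff can be made visible by sweeping candidate thresholds over a hand-reviewed sample. The sketch below assumes each reviewed query is reduced to a pair: the top-1 reranker score and a human judgment of whether the retrieved context could answer the question. The eight sample pairs are invented for illustration.

```python
def sweep_thresholds(reviewed, thresholds):
    """For each threshold, report abstention rate and bad-context rate.

    reviewed: list of (top1_score, context_can_answer) pairs from manual review.
    bad-context rate: share of answered queries whose context could not answer.
    """
    rows = []
    for t in thresholds:
        answered = [ok for score, ok in reviewed if score >= t]
        abstention_rate = 1 - len(answered) / len(reviewed)
        bad_rate = answered.count(False) / len(answered) if answered else 0.0
        rows.append((t, abstention_rate, bad_rate))
    return rows


# Hypothetical hand-reviewed sample.
reviewed = [
    (0.82, True), (0.61, True), (0.44, True), (0.38, False),
    (0.29, True), (0.21, False), (0.12, False), (0.05, False),
]
rows = sweep_thresholds(reviewed, [0.2, 0.3, 0.4])
for t, abstain, bad in rows:
    print(f"threshold={t:.1f} abstain={abstain:.3f} bad_context={bad:.3f}")
```

Raising the floor from 0.2 to 0.4 in this sample removes all bad-context answers but also abstains on one query that was answerable. That lost query is the coverage cost the product decision has to price.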
The counterintuitive consequence: a system that occasionally says "I don't know" earns more trust than one that always produces an answer. Users build mental models of system behavior over time. A system that abstains teaches users that abstention is meaningful — when it answers, the answer is grounded. A system that never abstains trains users to treat every answer with equal suspicion.
The eval bootstrap trap: test sets built from queries that already work miss every failure mode that matters.
Most teams build their retrieval evaluation set from queries where they already knew the expected answer — which systematically excludes the query types where retrieval fails. The eval set passes. Production failures stay invisible. This is why the problem persists for months after deployment: the system is being evaluated on a distribution it was never actually failing on.
The bootstrap approach starts from production query logs, not constructed test cases. Pull 200–400 real queries from the past 30 days. For each query, use an LLM judge to determine whether the top-5 retrieved chunks contain enough information to support a correct answer. This is a cheap classification task — relevant / partial / not relevant — and it covers the actual query distribution your system encounters, not the one you expected when you built the demo.
Calibrate the judge against 20–30 human-reviewed examples to catch systematic biases in its scoring. Then run it weekly. A drop in the "relevant" category signals retrieval degradation before users start filing tickets.
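Calibration comes down to comparing judge labels with human labels on the same examples and looking at where they disagree. A minimal sketch, with invented labels and the article's three-way label set:

```python
from collections import Counter


def agreement_rate(judge_labels, human_labels):
    """Share of examples where the LLM judge matches the human reviewer."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        return 0.0
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def disagreements(judge_labels, human_labels):
    """Count (judge, human) label pairs that differ, to expose systematic bias."""
    return Counter((j, h) for j, h in zip(judge_labels, human_labels) if j != h)


# Hypothetical review of ten examples.
human = ["relevant", "relevant", "partial", "not relevant", "not relevant",
         "relevant", "partial", "not relevant", "relevant", "partial"]
judge = ["relevant", "relevant", "relevant", "not relevant", "relevant",
         "relevant", "partial", "not relevant", "relevant", "partial"]

rate = agreement_rate(judge, human)
bias = disagreements(judge, human)
print(rate)
print(bias)
```

Both disagreements here have the judge saying "relevant" when the human did not, which is the kind of one-directional pattern worth fixing in the judge prompt.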
For ongoing monitoring, track three metrics: context precision (what fraction of the top-5 chunks are actually useful for the query — measurable without labels via LLM judge), faithfulness (does the generated answer stay within the retrieved context — measurable without labels using RAGAS or DeepEval), and abstention rate (rising abstention without corpus changes signals query distribution shift, not a retrieval bug). Production consensus as of 2026: target faithfulness ≥ 0.75 and context precision ≥ 0.70 before shipping any retrieval change, and treat scores above 0.80 on both as production-ready [12]. None of these require a hand-labeled ground-truth dataset to start.
Set a deployment gate on context precision: if it drops more than 5 percentage points from the baseline on your query sample, block the change. That gate, applied consistently, catches corpus drift problems that would otherwise reach users silently.
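A sketch of the metric and the gate together. Two assumptions are baked in: "partial" chunks count as not useful, and context precision is averaged per query rather than pooled across all chunks. Either choice could reasonably go the other way; what matters is applying the same definition to the baseline and the candidate.

```python
def context_precision(judged_queries):
    """Mean share of retrieved chunks judged 'relevant', averaged per query.

    judged_queries: one list of judge labels per query (e.g. five labels each).
    """
    per_query = [
        sum(label == "relevant" for label in labels) / len(labels)
        for labels in judged_queries
        if labels
    ]
    return sum(per_query) / len(per_query) if per_query else 0.0


def passes_gate(baseline, candidate, max_drop=0.05):
    """Block the change if context precision drops more than max_drop (5pp)."""
    return (baseline - candidate) <= max_drop


# Hypothetical judge output for the same two queries, before and after a change.
baseline_labels = [
    ["relevant"] * 4 + ["partial"],
    ["relevant"] * 3 + ["not relevant"] * 2,
]
candidate_labels = [
    ["relevant"] * 3 + ["partial"] * 2,
    ["relevant"] * 3 + ["not relevant"] * 2,
]

baseline = context_precision(baseline_labels)
candidate = context_precision(candidate_labels)
print(round(baseline, 2), round(candidate, 2), passes_gate(baseline, candidate))
```

In CI, a failed gate would exit non-zero so the index rebuild or retrieval change does not ship.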
Pull 200–400 real queries from the past 30 days of logs. Stratify by query type if possible: factual lookups, synthesis questions, identifier-heavy searches. This covers failure modes your demo set missed.
For each query, retrieve top-5 chunks and classify: relevant / partial / not relevant. Use a few-shot prompt with two or three reference examples. Budget ~$0.01 per query: 400 queries cost under $5.
Manual spot-check catches systematic judge biases: overconfidence on short chunks, false positives on topic-adjacent content. Adjust the judge prompt until its labels agree with the human labels on at least 85% of the reviewed examples.
Record context precision, faithfulness (via RAGAS), and abstention rate on your production query sample. These are your baseline values. Targets: faithfulness ≥ 0.75, context precision ≥ 0.70.
Block any retrieval or chunking change that drops context precision more than 5pp from baseline on the same sample. Run this check in CI on every index rebuild. Rising abstention rate without corpus changes = query distribution shift; investigate before deploying.
Log every abstention with the query text. This log is your corpus gap map. What the system consistently can't answer tells you exactly where your knowledge base needs coverage.
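The sampling step at the start of this workflow can be sketched as a stratified draw from the query log. The `classify_query` rules below are deliberately crude placeholders (digits or underscores mean "identifier", a leading "compare" or "summarize" means "synthesis"), and the log entries are invented. Replace both with whatever query typing fits your traffic.

```python
import random
from collections import Counter, defaultdict


def classify_query(query):
    """Crude query-type heuristic (assumption), for stratification only."""
    if any(ch.isdigit() or ch == "_" for ch in query):
        return "identifier"
    if query.lower().startswith(("compare", "summarize")):
        return "synthesis"
    return "factual"


def stratified_sample(queries, classify, per_stratum, seed=0):
    """Draw up to per_stratum queries from each query type, reproducibly."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for query in queries:
        buckets[classify(query)].append(query)
    sample = []
    for name in sorted(buckets):
        bucket = buckets[name]
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample


# Hypothetical log entries.
logged = [
    "What does ERR_42 mean?",
    "Is v2 of the export API deprecated?",
    "Where is ERR_CONN_REFUSED documented?",
    "Compare the monthly and annual plans",
    "Summarize the onboarding policy",
    "Who approves refunds?",
    "What is the notice period?",
    "Where do I file expenses?",
]
sample = stratified_sample(logged, classify_query, per_stratum=2)
print(Counter(classify_query(q) for q in sample))
```

Fixing the seed keeps the sample stable between runs, so week-over-week metric changes reflect the system and not a different draw.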
The bottleneck is almost never where teams assume it is.
The instinct when RAG fails is to swap the embedding model — try a newer model, a domain-specific one, or a larger one. This is usually the wrong order of operations. Swapping embedding models requires re-embedding the entire corpus: a batch operation with downtime risk, and the side effect that existing vectors and new vectors live in incompatible spaces until re-indexing completes.
Do this first: fix chunking, add hybrid retrieval, add cross-encoder reranking. Measure context precision and faithfulness before and after each change. If context precision is above 0.80 and faithfulness is above 0.75 after those changes, the embedding model probably isn't your bottleneck. If you're still seeing failures concentrated on a specific query type — semantic smear on policy documents, vocabulary mismatch on product identifiers — then profile where those failures live before reaching for a new model.
When you do swap: run both indexes in parallel during a transition period, route a fraction of traffic to the new index, and measure context precision on the same production query sample. Don't trust offline benchmarks on synthetic datasets — they rarely predict production behavior on heterogeneous enterprise corpora.
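One simple way to route a fraction of traffic is a deterministic hash of a stable query or session ID, so the same ID always hits the same index during the transition. This is a sketch of that idea. The function name and the 10% default are assumptions, and a real rollout would likely sit behind your existing feature-flag system.

```python
import hashlib


def route_to_new_index(query_id, fraction=0.1):
    """Deterministically send roughly `fraction` of ids to the new index."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction


ids = [f"q{i}" for i in range(10000)]
share = sum(route_to_new_index(query_id) for query_id in ids) / len(ids)
print(f"share routed to new index: {share:.3f}")
```

Because the routing is a pure function of the ID, the two indexes can be compared on disjoint but stable slices of traffic, alongside the fixed production query sample used for context precision.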
When do I add a reranker vs. swap the embedding model?
Add the reranker first. Swapping embedding models requires re-embedding your entire corpus — a batch operation with downtime risk — and the side effect that existing vectors and new vectors live in incompatible spaces until re-indexing completes. A cross-encoder reranker layers on top of existing retrieval without touching the index. Fix Recall@50 with hybrid search, then add reranking, then evaluate whether the embedding model is still the bottleneck. In most production corpora it isn't — the chunking and retrieval architecture are the binding constraints.
Does adding more retrieval signals (query expansion, fusion) always improve end-to-end accuracy?
No. A 2026 arXiv evaluation showed recall gains from retrieval fusion are frequently neutralized by downstream reranking and context truncation [5]. In several configurations, fusion variants underperformed single-query baselines on end-to-end accuracy. More retrieval signals produce a larger, noisier candidate set that the reranker has to sort through. If the reranker capacity is fixed and the context window truncates additional candidates anyway, the upstream recall gain disappears. Improve your reranking layer before adding retrieval breadth.
What's a safe starting threshold for the confidence gate?
0.3–0.4 on normalized cross-encoder score is a reasonable starting point. Calibrate against 50 manually reviewed queries from your corpus: find the score at which chunks transition from 'could answer the question' to 'cannot answer the question.' The correct threshold varies by corpus density and query specificity. Technical documentation with sparse coverage needs a lower threshold than a knowledge base with high topical density.
How do I handle abstention without breaking the user experience?
Treat abstention as a first-class response type, not a fallback. The response should be specific about what failed: 'The knowledge base doesn't contain information about X. You can ask about Y or Z.' This is more useful than a vague 'I don't know' and more accurate than a hallucinated answer. Log every abstention with the query — that log is your corpus gap map. What the system consistently cannot answer tells you exactly where your knowledge base needs coverage.
Should I use semantic chunking or fixed-size chunking?
It depends on your corpus structure. Semantic chunking — splitting at measured topic boundaries rather than character counts — pays off on long-form prose where answers span unmarked transitions. It runs roughly 14x slower at index build time, so the accuracy gain needs to justify the cost. For structured documents (support tickets, contracts with consistent formatting, code files), recursive fixed-size splitting at 512–1024 tokens often matches or beats semantic chunking at a fraction of the compute. Measure context precision on your corpus before committing to the slower approach.
RRF vs. convex combination — which should I use by default?
RRF if you have no labeled data. Convex combination if you have 40 or more query-relevance pairs — with that little data, tuned alpha consistently outperforms RRF both in-domain and out-of-domain [8]. RRF's advantage is that it requires nothing beyond the two ranked lists. Its limitation is that it discards score magnitude, so a marginally relevant top-1 gets the same contribution as a highly relevant top-1. When score quality is meaningful in your corpus, that information is lost.
Retrieval quality is a first-class engineering concern. It determines what context the model receives, which determines what users get, which determines whether the system is trusted. The gap between 58% and 91% retrieval precision [1] is not a model problem. It's a retrieval architecture problem — and dense-only retrieval with no reranking and no confidence gate is the architecture that produces that gap.
Hybrid retrieval, cross-encoder reranking, and explicit abstention close it. All three are available without fine-tuning, without a new embedding model, and without a labeled dataset to start. A retrieval system without a confidence floor is a hallucination engine with a polite interface: it generates plausible responses from whatever noise it retrieved, logs every interaction as success, and gives you no signal that anything went wrong. The failure is invisible until someone manually reviews outputs — and by then, user trust is already gone.