What this covers
- Why cosine similarity scores look healthy while RAG answers are wrong: the alibi problem
- Four failure modes with distinct operational signatures: chunking, vocabulary mismatch, semantic smear, lost-in-middle
- The two-stage retrieval stack that closes the gap: hybrid BM25+dense retrieval, then cross-encoder reranking
- How to set a confidence floor that makes your system abstain rather than hallucinate from noise
- Building a retrieval eval harness from production query logs, no labeled dataset required
Your RAG pipeline logged zero retrieval errors last week. Every query returned results. Cosine similarity scores averaged 0.82. What the system didn't record: how many of those answers were wrong.
Cosine similarity is not a relevance signal. It measures proximity in embedding space, which correlates loosely with semantic similarity and says nothing directly about whether the retrieved chunk contained the information the user actually needed. A 0.82 similarity score on the wrong document is not a near-miss. It is a confident hallucination with a clean audit trail.
Dense-only retrieval on heterogeneous enterprise corpora achieves Recall@10 of roughly 0.58–0.65 [2] — meaning the right chunk doesn't make the candidate list on 35–42% of queries. Not occasionally. Routinely, on a predictable slice of your query distribution. And because the system always returns something, nothing in your logs signals failure. The RAG failure modes that matter aren't exceptions. They are the default behavior of naive retrieval on production data.
The System That Logs Success While Failing
The alibi problem: similarity scores provide cover for retrieving the wrong thing.
Most vector stores return top-k results unconditionally. Ask about a deprecated API endpoint, and the store returns the five most similar chunks — whether or not any of them contain the actual answer. The LLM receives them and generates a plausible response grounded in whatever fragments it got. The response is fluent, confident, internally consistent.
It is also wrong.
This failure shape differs from a runtime error in one critical way: nothing fires. No latency spike. No exception. The system behaved exactly as designed — retrieved documents, injected them into context, generated a response grounded in those documents. The failure lives entirely in the gap between retrieved and relevant. That gap is invisible to your monitoring stack.
One production team rebuilding a team knowledge assistant documented this precisely. Their baseline system — embedding model plus ChromaDB with cosine similarity top-k — was confidently wrong roughly 20% of the time from day one [1]. The system had no mechanism to know. The error rate wasn't discovered through observability. It was discovered by manually reviewing outputs. The system's own metrics showed green: queries answered, latency stable, similarity scores in range.
That is the specific shape of this failure. The feedback loop is broken at the source. Without a signal that retrieval failed, there is no pressure to fix it.
- Recall@10 of roughly 0.58–0.65 for dense-only retrieval on heterogeneous enterprise corpora. The right chunk doesn't appear in the top 10 on roughly 35–42% of queries. [2]
- Exact-match queries make up 20–40% of production traffic in enterprise RAG over technical docs, contracts, or support tickets. Dense embeddings miss this slice systematically. [2]
- A 21-point precision lift from BM25 + dense vector with RRF fusion over dense-only, observed in production rebuilds before any reranking. [1]
- When the right chunk reaches position 1–3, LLM accuracy reaches 92%. Ranking position is the primary bottleneck. [6]
Four Failure Modes, One Common Cause
Each has a different operational signature. The root cause is treating similarity as a proxy for relevance.
| Failure Mode | Where It Hides | Production Symptom | Diagnostic Signal |
|---|---|---|---|
| Chunking destroys context | Index build | Answer split across chunk boundaries — neither retrieves well | Low similarity on queries where the answer spans a paragraph break |
| Vocabulary mismatch | Retrieval stage | Exact identifiers, error codes, version strings missed | High failure rate on queries with proper nouns or numeric identifiers |
| Semantic smear | Retrieval stage | Topically similar chunks returned, specific fact absent | High similarity scores, low faithfulness — model hedges or fabricates |
| Lost in the middle | Context assembly | Right chunk at rank 5–8, model ignores it | High Recall@10, low end-to-end accuracy on synthesis queries |
Chunking destroys context. Fixed-size chunking splits documents every 512 or 1024 characters regardless of where a thought ends. An answer that spans a paragraph boundary gets split: the first chunk contains the setup, the second contains the conclusion. Neither chunk, scored against the query in isolation, embeds the full reasoning. Each partial chunk scores lower than the complete passage would, so both can fall below the top-k cutoff. The right answer slips out of the candidate set.
Semantic chunking addresses this by measuring cosine similarity between adjacent sentences and opening a new chunk when similarity drops below a threshold — detecting natural topic boundaries rather than counting characters. The index takes longer to build. The retrieval quality improvement is consistent across corpus types and worth it in every production context where answers depend on multi-sentence reasoning.
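The boundary rule described above can be sketched in a few lines. This is a minimal illustration, not a production chunker: `toy_embed` is a stand-in bag-of-words embedding so the example runs without a model, and the `threshold` value of 0.2 is an assumption chosen for this toy data. In practice you would pass a real sentence encoder as `embed` and tune the threshold on your own corpus.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(sentence: str) -> Vector:
    """Stand-in embedding: bag-of-words counts. Swap in a real sentence encoder."""
    return dict(Counter(re.findall(r"[a-z0-9]+", sentence.lower())))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.2,  # assumed value; tune on your own corpus
) -> List[List[str]]:
    """Open a new chunk whenever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine(prev_vec, vec) < threshold:
            chunks.append([])  # similarity dropped: treat as a topic boundary
        chunks[-1].append(sentence)
        prev_vec = vec
    return chunks


doc = [
    "Refunds for annual plans are prorated by month.",
    "Annual plans cancelled early get refunds for unused months.",
    "The API rate limit is 100 requests per minute.",
    "Requests over the rate limit return HTTP 429.",
]
chunks = semantic_chunks(doc)
```

On this toy document the two refund sentences land in one chunk and the two rate-limit sentences in another, because the only similarity drop falls between sentences two and three.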
Vocabulary mismatch. Dense embedding models handle semantic paraphrase well — "compensation" and "salary" sit close in embedding space. What they don't handle is lexical precision. Query for error code ERR_CONN_REFUSED or model version claude-opus-4-7, and the embedding model maps these to approximate representations based on their surrounding vocabulary context. The actual document containing the specific identifier may rank far below documents that are thematically related but factually irrelevant. In enterprise RAG over technical documentation, contracts, or support tickets, exact-match queries account for 20–40% of production traffic [2]. Dense-only retrieval fails this slice systematically, and because the failures look like normal successful retrievals, this failure mode persists undetected for months.
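One way to see whether this slice is hurting you is to split a reviewed query sample into identifier-bearing queries and everything else, then compare failure rates. The sketch below assumes you already have `(query, retrieval_was_relevant)` pairs from manual review or a judge. The regex patterns are rough heuristics invented for this example (they will over-match strings like "top-5"), and the sample records are made up for illustration.

```python
import re
from typing import Dict, List, Tuple

# Heuristic patterns, all assumptions: adapt them to the identifiers in your corpus.
IDENTIFIER_PATTERNS = [
    re.compile(r"\b[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+\b"),        # ERR_CONN_REFUSED
    re.compile(r"\bv?\d+(?:\.\d+){1,3}\b"),                   # 2.4.1, v1.10
    re.compile(r"\b[a-z]+(?:-[a-z0-9]+)*-\d+(?:-\d+)*\b"),    # claude-opus-4-7
]


def has_exact_identifier(query: str) -> bool:
    """True if the query contains something that looks like an exact-match token."""
    return any(p.search(query) for p in IDENTIFIER_PATTERNS)


def failure_rate_by_slice(records: List[Tuple[str, bool]]) -> Dict[str, float]:
    """records: (query, retrieval_was_relevant) pairs from a reviewed sample."""
    totals = {"identifier": 0, "other": 0}
    failures = {"identifier": 0, "other": 0}
    for query, ok in records:
        key = "identifier" if has_exact_identifier(query) else "other"
        totals[key] += 1
        failures[key] += 0 if ok else 1
    return {k: failures[k] / totals[k] for k in totals if totals[k]}


# Illustrative records only, not real measurements.
records = [
    ("What does ERR_CONN_REFUSED mean?", False),
    ("Is claude-opus-4-7 supported?", False),
    ("upgrade to v2.4.1", True),
    ("How do refunds work for annual plans?", True),
]
rates = failure_rate_by_slice(records)
```

A large gap between `rates["identifier"]` and `rates["other"]` on your own sample is the diagnostic signal for vocabulary mismatch.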
Semantic smear is subtler. The retriever returns documents about the same topic as the query, but not documents that contain the specific fact. A query about the refund policy for annual subscription plans might surface five chunks about billing procedures, cancellation flows, and account management settings. All are topically similar. None contain the actual refund policy clause. Similarity scores look healthy — 0.78, 0.74, 0.72. The LLM receives five documents, none of which have the answer, and either hedges with generic information or fills in a plausible-sounding response from training data. The interaction is logged as successful.
This is where cosine similarity functions as an alibi. The system can point to five chunks with scores above 0.72 and assert it tried. The audit trail is clean. The answer is wrong.
Lost in the middle. Retrieval can return the right chunk but rank it at position 6 out of 10. The LLM then processes a 10-document context window with the relevant passage buried somewhere inside. The 2023 paper "Lost in the Middle: How Language Models Use Long Contexts" (Liu et al., Stanford and collaborators) documented that LLMs attend most reliably to content at the beginning and end of their context window and systematically underweight the middle positions. The relevant chunk was retrieved. It was in context. The model still missed it.
The fix is not more context — it's better ranking. A cross-encoder reranker that promotes the right chunk to position 1 or 2 matters more than expanding the context window to accommodate additional noise. Thomson Reuters, analyzing their deployed customer support RAG system, found that when relevant documents rank in the top 3, their system generates accurate responses in 92% of cases [6]. That number dropped substantially when the relevant chunk ranked at position 6 or later. Position is the bottleneck, not recall.
The Retrieval Stack That Actually Holds
Two changes — hybrid retrieval and cross-encoder reranking — cover most of the failure surface.
| Naive retrieval | Two-stage retrieval stack |
|---|---|
| Fixed-size character chunking (512–1024 chars) | Semantic chunking at topic boundaries |
| Dense-only bi-encoder retrieval | Hybrid BM25 + dense retrieval, RRF fusion |
| Top-k ranked by cosine similarity | Retrieve top-50 candidates, rerank to top-5 via cross-encoder |
| No confidence threshold: always returns k results | Confidence gate: abstain if reranker score < threshold |
| Similarity rank == LLM input rank, no reranking | Cross-encoder precision: right chunk at position 1–2 |
| No abstention mechanism: fabricates from noise | Explicit abstention: 'insufficient context in knowledge base' |
| Eval set built from queries that already worked | Eval set sampled from production query logs |
Hybrid retrieval is the highest-leverage single change. Run BM25 alongside your dense vector search, then fuse the ranked lists using Reciprocal Rank Fusion. Dense retrieval handles semantic paraphrase — finding documents that express the same concept in different words. Sparse retrieval handles lexical precision — exact keyword matching for product names, error codes, identifiers, version strings. These failure modes are complementary: queries that fail dense retrieval typically succeed with BM25, and vice versa.
RRF sidesteps the score-scale incompatibility problem. BM25 and cosine similarity are on completely different scales and cannot be directly combined. RRF converts each ranked list into position-based contributions — 1/(k + rank) where k=60 by convention — so a document that ranks high in both lists accumulates a stronger combined score regardless of the raw similarity values. The tradeoff: RRF discards score magnitude, so a document ranked first with 0.99 cosine similarity gets the same RRF contribution as one ranked first with 0.51. When your corpus has genuine quality information embedded in the similarity scores, not just ranks, that information is lost. This is a real limitation, not a footnote.
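A quick worked example makes the arithmetic concrete. It uses the 1/(k + rank) formula and k=60 from the text, with ranks starting at 1. The two documents are hypothetical.

```python
def rrf_contribution(rank: int, k: int = 60) -> float:
    """Contribution of one ranked list to a document's fused score (rank starts at 1)."""
    return 1 / (k + rank)


# Document A: ranked 1st by dense retrieval, absent from the BM25 list.
score_a = rrf_contribution(1)

# Document B: ranked 3rd in both the dense list and the BM25 list.
score_b = rrf_contribution(3) + rrf_contribution(3)

print(f"A: {score_a:.4f}  B: {score_b:.4f}")
```

Document B fuses to about 0.0317 and document A to about 0.0164, so agreement between the two retrievers outweighs a single first place. Note that `rrf_contribution` never sees a raw score: that is the magnitude loss described above.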
BM25 alone achieves roughly 40% retrieval precision. Dense vectors alone achieve roughly 58%. Hybrid search with RRF reaches roughly 79% before any reranking [1]. That is a 21-point lift from a structural change, not a model upgrade.
```python
# retrieval.py
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(
    dense_results: List[Tuple[str, float]],  # (chunk_id, similarity_score)
    bm25_results: List[Tuple[str, float]],   # (chunk_id, bm25_score)
    k: int = 60,  # standard default; lower k amplifies top-rank differences
) -> List[Tuple[str, float]]:
    """Fuse two ranked lists by summing 1 / (k + rank), with rank starting at 1."""
    scores: Dict[str, float] = {}
    for results in (dense_results, bm25_results):
        for rank, (chunk_id, _) in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)


def retrieve(
    query: str,
    vector_store,  # assumed wrapper: .query(query, top_k) -> [(chunk_id, score)]
    bm25_index,    # assumed wrapper (e.g. around rank_bm25.BM25Okapi), same interface
    reranker,      # assumed wrapper: .rank(query, candidates) -> hits sorted by .score
    chunks: Dict[str, str],  # chunk_id -> chunk text
    top_k: int = 50,
    top_n: int = 5,
    min_reranker_score: float = 0.3,  # tune per domain; abstain below this
) -> list:
    # Stage 1: hybrid candidate retrieval, recall over precision here
    dense_hits = vector_store.query(query, top_k=top_k)
    bm25_hits = bm25_index.query(query, top_k=top_k)
    fused = reciprocal_rank_fusion(dense_hits, bm25_hits)
    candidates = [chunks[cid] for cid, _ in fused[:top_k]]

    # Stage 2: cross-encoder reranking, precision over recall here
    reranked = reranker.rank(query, candidates)
    top_hits = reranked[:top_n]

    # Stage 3: confidence gate, return empty rather than fabricate from noise
    if not top_hits or top_hits[0].score < min_reranker_score:
        return []  # caller should respond: 'insufficient context in knowledge base'
    return top_hits
```

The Confidence Floor Is an Editorial Decision, Not a Hyperparameter
Setting a retrieval abstention threshold is a product decision about when to admit ignorance.
Most vector stores return top-k results unconditionally. This is the mechanism behind what one team called "hallucinations grounded in noise": the model cites real documents but answers the wrong question, because the retrieved documents were the least-wrong available rather than actually useful [1]. The LLM never knew retrieval failed. Neither did you.
A confidence floor turns retrieval failure into an explicit signal. After cross-encoder reranking, check whether the top-1 document's relevance score clears a minimum threshold. If it doesn't, return empty context and respond with "I don't have sufficient information to answer this" rather than generating from whatever noise the retriever surfaced. A threshold around 0.3–0.4 on normalized cross-encoder output is a reasonable starting point for most domains [2]; calibrate it against 50 hand-reviewed queries from your specific corpus.
The threshold is an editorial decision disguised as a number. Setting it lower preserves coverage — more queries get an answer — but some answers will be fabricated from low-quality context. Setting it higher reduces fabrication but increases abstention rate. Where you draw that line depends on the downstream cost of a wrong answer. A support agent acting on a wrong refund policy has a different risk profile than a developer reading an architecture summary.
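To make that tradeoff visible before committing to a number, you can sweep candidate thresholds over the hand-reviewed sample. The sketch below assumes each reviewed query gives you the top-1 reranker score plus a human verdict on whether the chunk could answer the question. The function name, field names, and the eight sample pairs are all invented for illustration.

```python
from typing import List, Tuple


def sweep_thresholds(
    reviewed: List[Tuple[float, bool]],  # (top-1 reranker score, human: chunk can answer)
    thresholds: List[float],
) -> List[dict]:
    """For each threshold, count what the gate would let through and what it would block."""
    if not reviewed:
        raise ValueError("need at least one reviewed query")
    rows = []
    for t in thresholds:
        answered = [(s, ok) for s, ok in reviewed if s >= t]
        rows.append({
            "threshold": t,
            "abstention_rate": 1 - len(answered) / len(reviewed),
            # Gate passed, but the reviewer says the context cannot support an answer.
            "answered_without_support": sum(1 for _, ok in answered if not ok),
            # Gate blocked a query the context could have answered.
            "abstained_despite_support": sum(1 for s, ok in reviewed if s < t and ok),
        })
    return rows


# Illustrative data only: eight reviewed queries.
reviewed = [
    (0.91, True), (0.62, True), (0.41, True), (0.35, False),
    (0.28, True), (0.22, False), (0.10, False), (0.05, False),
]
rows = sweep_thresholds(reviewed, [0.2, 0.3, 0.4])
```

On this toy sample, raising the threshold from 0.2 to 0.4 removes both unsupported answers but pushes abstention from 25% to 62.5% and blocks one answerable query. Which row is acceptable is the editorial call.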
The counterintuitive consequence: a system that occasionally says "I don't know" earns more trust than one that always produces an answer. Users model system behavior over time. A system that abstains teaches users that abstention is meaningful — that when it does answer, the answer is grounded. A system that never abstains trains users to treat every answer with equal suspicion.
How to Measure Retrieval Quality Before You Have Labels
The eval bootstrap trap: test sets built from queries that already work miss every failure mode that matters.
Most teams build their retrieval evaluation set from queries where they already knew the expected answer — which systematically excludes the query types where retrieval fails. The eval set passes. Production failures stay invisible. This is why the problem persists for months after deployment: the system is being evaluated on a distribution it was never actually failing on.
The bootstrap approach starts from production query logs, not constructed test cases. Pull 200–400 real queries from the past 30 days. For each query, use an LLM judge to determine whether the top-5 retrieved chunks contain enough information to support a correct answer. This is a cheap classification task — relevant / partial / not relevant — and it covers the actual query distribution your system encounters, not the one you expected when you built the demo.
Calibrate the judge against 20–30 human-reviewed examples to catch systematic biases in its scoring. Then run it weekly. A drop in the "relevant" category signals retrieval degradation before users start filing tickets.
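The calibration step can be as simple as comparing judge labels with human labels on the same queries. This sketch assumes the judge's output has already been collected as one label per query (the LLM call itself is out of scope here). The function names and the five sample labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

LABELS = ("relevant", "partial", "not relevant")


def label_distribution(labels: List[str]) -> Dict[str, float]:
    """Share of each label in a weekly sample; watch the 'relevant' share over time."""
    counts = Counter(labels)
    return {label: counts[label] / len(labels) for label in LABELS}


def judge_agreement(judge: List[str], human: List[str]) -> float:
    """Fraction of calibration queries where the judge matches the human label."""
    if len(judge) != len(human) or not human:
        raise ValueError("judge and human labels must be non-empty and aligned")
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def disagreements(judge: List[str], human: List[str]) -> Counter:
    """Counts of (human_label, judge_label) pairs that differ, to expose systematic bias."""
    pairs: List[Tuple[str, str]] = [(h, j) for j, h in zip(judge, human) if j != h]
    return Counter(pairs)


# Illustrative labels only.
human = ["relevant", "relevant", "partial", "not relevant", "not relevant"]
judge = ["relevant", "relevant", "relevant", "not relevant", "partial"]

agreement = judge_agreement(judge, human)
bias = disagreements(judge, human)
weekly = label_distribution(judge)
```

Here both disagreements point the same way: the judge rated chunks one step more generously than the human did. A pattern like that is the systematic bias to correct in the judge prompt before trusting the weekly numbers.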
For ongoing monitoring, track three metrics: context precision (what fraction of the top-5 chunks are actually useful for the query — measurable without labels via LLM judge), faithfulness (does the generated answer stay within the retrieved context — measurable without labels using RAGAS or DeepEval), and abstention rate (rising abstention without corpus changes signals query distribution shift, not a retrieval bug). None of these require a hand-labeled ground-truth dataset to start.
Set a deployment gate on context precision: if it drops more than 5 percentage points from the baseline on your query sample, block the change. That gate, applied consistently, catches the corpus drift problems that would otherwise reach users silently.
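A minimal version of that gate, assuming the judge has already produced a useful/not-useful verdict for each retrieved chunk. The function names and sample verdicts are hypothetical; the 5-point default comes from the text.

```python
from typing import List


def context_precision(judgments: List[List[bool]]) -> float:
    """Mean fraction of useful chunks per query.

    judgments[i][j] is the judge's verdict for chunk j of query i.
    """
    per_query = [sum(j) / len(j) for j in judgments if j]
    if not per_query:
        raise ValueError("no judged queries")
    return sum(per_query) / len(per_query)


def passes_gate(baseline: float, candidate: float, max_drop_pp: float = 5.0) -> bool:
    """Block the change if precision drops more than max_drop_pp percentage points."""
    drop_pp = round((baseline - candidate) * 100, 6)  # round away float noise at the boundary
    return drop_pp <= max_drop_pp


# Illustrative verdicts for two queries, top-5 chunks each.
baseline_judgments = [[True, True, True, False, False], [True, True, False, False, False]]
candidate_judgments = [[True, False, False, False, False], [True, True, False, False, False]]

baseline = context_precision(baseline_judgments)
candidate = context_precision(candidate_judgments)
ship = passes_gate(baseline, candidate)
```

In this toy run precision falls from 0.5 to 0.3, a 20-point drop, so `ship` is `False`. Run both configurations against the same production query sample, otherwise the comparison measures sample drift rather than the change.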
Retrieval Eval Harness
- Production query sample: 200+ real queries from last 30 days, not hand-crafted test cases
- LLM-as-judge relevance scoring: classifies whether top-5 chunks support a correct answer (relevant / partial / not relevant)
- Context precision tracking: fraction of retrieved chunks that are actually useful, measured weekly
- Faithfulness score via RAGAS or DeepEval: target >= 0.85 before shipping any retrieval change
- Abstention rate dashboard: rising abstention without corpus changes signals query distribution shift
- Similarity score distribution monitor: downward drift in average top-1 similarity flags embedding index rot over time
- Deploy gate on context precision: block changes that drop precision more than 5pp from baseline on production query sample
When do I add a reranker vs. swap the embedding model?
Add the reranker first. Swapping embedding models requires re-embedding your entire corpus — a batch operation with downtime risk, and the side effect that existing vectors and new vectors live in incompatible spaces until re-indexing completes. A cross-encoder reranker layers on top of existing retrieval without touching the index. Fix Recall@50 with hybrid search, then add reranking, then evaluate whether the embedding model is still the bottleneck. In most production corpora it isn't — the chunking and retrieval architecture are the binding constraints, not embedding model quality.
Does adding more retrieval signals (query expansion, fusion) always improve end-to-end accuracy?
No. A 2026 arXiv evaluation showed recall gains from retrieval fusion are frequently neutralized by downstream reranking and context truncation [5]. In several configurations, fusion variants underperformed single-query baselines on end-to-end accuracy. More retrieval signals produce a larger, noisier candidate set that the reranker has to sort through. If the reranker capacity is fixed and the context window truncates additional candidates anyway, the upstream recall gain disappears. Improve your reranking layer before adding retrieval breadth.
What's a safe starting threshold for the confidence gate?
0.3–0.4 on normalized cross-encoder score is a reasonable starting point. Calibrate against 50 manually reviewed queries from your corpus: find the score at which chunks transition from 'could answer the question' to 'cannot answer the question.' The correct threshold varies by corpus density and query specificity. Technical documentation with sparse coverage needs a lower threshold than a knowledge base with high topical density.
How do I handle abstention without breaking the user experience?
Treat abstention as a first-class response type, not a fallback. The response should be specific about what failed: 'The knowledge base doesn't contain information about X. You can ask about Y or Z.' This is more useful than a vague 'I don't know' and more accurate than a hallucinated answer. Log every abstention with the query — that log is your corpus gap map. What the system consistently cannot answer tells you exactly where your knowledge base needs coverage.
Retrieval quality is a first-class engineering concern. It determines what context the model receives, which determines what users get, which determines whether the system is trusted. The gap between 58% and 91% retrieval precision [1] is not a model problem. It is a retrieval architecture problem. Dense-only retrieval with no reranking and no confidence gate is the architecture that produces that gap.
Hybrid retrieval, cross-encoder reranking, and explicit abstention close it — and all three are available without fine-tuning, without a new embedding model, and without a labeled dataset to start. A retrieval system without a confidence floor is a hallucination engine with a polite interface. It generates plausible responses from whatever noise it retrieved, logs every interaction as success, and gives you no signal that anything went wrong.
- [1] Angel Kurten, "RAG Pipeline: From 58% to 91% Retrieval Precision" (angelkurten.com)
- [2] Tensoria, "Dense RAG Fails on Rare Terms. Hybrid Search Fixes It [2026 Guide]" (tensoria.fr)
- [3] ML Journey, "RAG in Production: Fixing Retrieval Failures with Hybrid Search and Reranking" (mljourney.com)
- [4] Mudassir Khan, "Production RAG: Why Retrieval Fails and How to Fix It" (mudassirkhan.me)
- [5] arXiv 2026, "Retrieval Fusion Under Production Constraints: Diminishing Returns After Reranking" (arxiv.org)
- [6] Thomson Reuters Labs, "Retrieval Enhancements for RAG: Insights from a Deployed Customer Support Chatbot" (aclanthology.org)
- [7] Tianpan, "Hybrid Search in Production: Why BM25 Still Wins on the Queries That Matter" (tianpan.co)