RAG Pipeline Returning Irrelevant Results? How to Debug Chunking, Embeddings, and Retrieval
Key Insight
When a RAG pipeline returns irrelevant chunks, the cause is almost always one of four things: bad chunking (wrong size or no overlap), weak embeddings (wrong model for your domain), single-modal retrieval (pure vector when you need hybrid), or no re-ranker. Fix them in that order, measure with MRR and nDCG, and you will recover most lost accuracy within a day of tuning.
Your RAG Pipeline Is Lying to You
You built a retrieval-augmented generation system. It demos well. Then a product manager asks a real question and the model answers confidently with content that is adjacent to the truth but not actually correct.
The reflex is to blame the LLM. It is almost never the LLM.
In our experience debugging dozens of production RAG systems, irrelevant answers trace back to retrieval failure in roughly 80 percent of cases. The right chunks are not in the context window, so the model does the only thing it can — it pattern-matches on the closest thing it was given and hallucinates the gap.
This guide is a systematic debugging playbook. If you follow it in order, you will isolate and fix the problem within a day.
If you are new to the concept, start with our primer on what RAG is and how retrieval-augmented generation works.
Step 1: Build an Evaluation Set Before You Change Anything
You cannot debug what you cannot measure. Before touching your chunker, your embedding model, or your vector database, build an eval set.
The cheap way: for each document in your corpus, use a capable model like GPT-4 or Claude to generate two or three questions whose answer lives in that document. Save the (question, gold_document_id, gold_chunk_ids) tuples as your evaluation ground truth.
```python
from openai import OpenAI
import json

client = OpenAI()

def generate_eval_questions(document_text: str, doc_id: str) -> list[dict]:
    # json_object mode requires the model to return a single JSON object,
    # so we ask for {"questions": [...]} rather than a bare array.
    prompt = f"""Generate 3 distinct questions that can be answered \
ONLY from the following document. Return JSON: {{"questions": \
[{{"question": str, "answer": str, "chunk_hint": str}}]}}.
Document:
{document_text}
"""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    questions = json.loads(resp.choices[0].message.content)["questions"]
    # Tag each question with its source document for the eval tuples.
    return [{**q, "doc_id": doc_id} for q in questions]
```

Run this across 100 to 300 documents and you have a 300 to 900 question eval set for about five dollars of API spend.
The metrics that matter:
- MRR at 10 (Mean Reciprocal Rank): how high is the correct chunk ranked in the top 10? A score of 1.0 means always first, 0.5 means usually second.
- nDCG at 10 (Normalized Discounted Cumulative Gain): accounts for graded relevance across multiple relevant chunks.
- Context precision: of the chunks you retrieved, how many are actually relevant? Frameworks like Ragas compute this automatically.
- Context recall: of the chunks needed to answer, how many did you retrieve?
Run these once on your current system to establish a baseline. Every change you make should move one of these numbers.
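MRR at 10 and context precision are simple enough to compute by hand. A minimal sketch, with retrieval abstracted as a callable that returns ranked chunk ids (the function names and the eval-set shape are illustrative assumptions, not a fixed API):

```python
def mrr_at_10(ranked_ids: list[str], gold_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk within the top 10."""
    for rank, chunk_id in enumerate(ranked_ids[:10], start=1):
        if chunk_id in gold_ids:
            return 1.0 / rank
    return 0.0

def context_precision(ranked_ids: list[str], gold_ids: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not ranked_ids:
        return 0.0
    return sum(1 for c in ranked_ids if c in gold_ids) / len(ranked_ids)

def evaluate(retrieve, eval_set: list[dict], k: int = 10) -> float:
    """Mean MRR@10 over the eval set; `retrieve` maps a question to ranked ids."""
    scores = [
        mrr_at_10(retrieve(ex["question"])[:k], set(ex["gold_chunk_ids"]))
        for ex in eval_set
    ]
    return sum(scores) / len(scores)
```

An `evaluate` helper shaped like this is what the sweep snippets below assume exists; swap the callable for whatever your index exposes.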
Step 2: Diagnose Chunking — The Single Highest-Leverage Knob
Chunking is where most RAG pipelines quietly fail. Default chunk size is often 1000 characters with no overlap, inherited from a LangChain tutorial. For many corpora that is too big — the semantic signal is diluted across paragraphs on unrelated subtopics.
Symptoms of bad chunking
- Retrieved chunks look related to the topic but do not contain the specific answer
- Long documents dominate results because they generate many high-similarity chunks
- Tables and lists are split across chunks, making structured answers incoherent
- Short, specific questions return long general-purpose chunks
The chunking experiment
Run a sweep: 256, 512, 1024, and 2048 tokens with 10 percent overlap. Measure MRR at 10 on your eval set for each.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

def build_index(documents, chunk_size: int, chunk_overlap: int):
    # from_tiktoken_encoder makes chunk_size/chunk_overlap count tokens,
    # not characters, so the sweep values below mean what they say.
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = splitter.split_documents(documents)
    return Chroma.from_documents(chunks, OpenAIEmbeddings())

# Sweep
for size in [256, 512, 1024, 2048]:
    overlap = int(size * 0.1)  # 10 percent overlap
    index = build_index(corpus, size, overlap)
    mrr = evaluate(index, eval_set)
    print(f"chunk_size={size} mrr={mrr:.3f}")
```

In practice, 512 tokens wins on most prose, 256 wins on FAQ-style content, and 1024 or larger wins only when individual answers span multiple paragraphs.
Semantic chunking
If fixed-size chunking is the baseline, semantic chunking is the upgrade. It splits where the embedding similarity between consecutive sentences drops below a threshold, so each chunk is a coherent idea rather than a character count.
```python
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=OpenAIEmbedding(),
)
nodes = splitter.get_nodes_from_documents(documents)
```

Semantic chunking costs more at index time (it embeds sentences to find splits) but consistently wins on retrieval quality for long-form content. See the LlamaIndex documentation for more advanced patterns.
Special cases
- Code: always split on syntactic units using a tree-sitter-based parser, not characters.
- Tables: store each row as its own chunk with a summary header prepended (generated by an LLM once at index time).
- PDFs with layout: tools like Unstructured and LlamaParse preserve reading order and reduce chunk pollution.
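For the table case above, the row-plus-summary-header pattern can be sketched in a few lines. The header string here is hand-written; as noted, in practice you would generate it once with an LLM at index time (all names in this sketch are illustrative):

```python
def table_to_chunks(header: str, column_names: list[str], rows: list[list[str]]) -> list[str]:
    """One chunk per row, each prefixed with the table summary and column
    names so the row stays interpretable when retrieved in isolation."""
    col_line = " | ".join(column_names)
    return [
        f"{header}\nColumns: {col_line}\nRow: " + " | ".join(row)
        for row in rows
    ]

chunks = table_to_chunks(
    header="Quarterly revenue by region, FY2025.",  # LLM-generated in practice
    column_names=["Region", "Q1", "Q2"],
    rows=[["EMEA", "4.1M", "4.6M"], ["APAC", "3.2M", "3.9M"]],
)
```

Each chunk now answers "what was EMEA's Q2 revenue?" on its own, which a mid-table split can never do.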
Step 3: Diagnose Embeddings — Wrong Model, Wrong Results
The embedding model is the lens through which all retrieval happens. If the lens is wrong for your corpus, nothing downstream can rescue you.
The current state of embedding models (2026)
- OpenAI text-embedding-3-large (3072 dims): strong general-purpose, excellent instruction following, closed source.
- Voyage-3: often tops the MTEB leaderboard for English, specialized variants for code and finance.
- Cohere embed-v4: strong multilingual, good for retrieval with query and document asymmetry.
- BGE-M3: best open-weights model for multilingual, supports dense, sparse, and multi-vector in one pass.
- Jina embeddings v3: competitive with closed models, late-chunking support for long documents.
How to pick
Do not pick from the leaderboard. Pick by running your eval set against three candidate models and comparing MRR. A model that wins by 2 points on MTEB but loses by 5 points on your medical corpus is the wrong model for you.
```python
from langchain_openai import OpenAIEmbeddings
from langchain_voyageai import VoyageAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings

models = {
    "openai-3-large": OpenAIEmbeddings(model="text-embedding-3-large"),
    "voyage-3": VoyageAIEmbeddings(model="voyage-3"),
    "bge-m3": HuggingFaceEmbeddings(model_name="BAAI/bge-m3"),
}

for name, model in models.items():
    index = build_index_with_embeddings(corpus, model)
    mrr = evaluate(index, eval_set)
    print(f"{name}: MRR@10 = {mrr:.3f}")
```

Dimensions matter, but not the way you think
Larger embedding dimensions (3072 vs 768) marginally improve accuracy but substantially increase index size, memory, and query latency. Truncating OpenAI embeddings to 1024 dimensions typically loses less than 1 percent MRR and halves storage. Use the Matryoshka truncation supported by text-embedding-3 models if cost is a constraint.
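Client-side, Matryoshka truncation is just slicing and re-normalizing; here is a minimal numpy sketch (note that text-embedding-3 models also accept a `dimensions` parameter on the embeddings API that does this server-side):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int = 1024) -> np.ndarray:
    """Keep the first `dims` components, then re-normalize so cosine
    similarity stays meaningful on the truncated vectors."""
    truncated = vec[:dims]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated
```

Truncate both your index and your queries the same way, or the similarity scores are meaningless.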
Step 4: Add Hybrid Search — The Free Upgrade
Pure vector search fails on exact-match needs. If a user searches for "error code 0x8024001A" or "ERC-4337 EntryPoint", a vector model may retrieve conceptually similar but textually different chunks. BM25 will match the exact token and win.
The fix is hybrid search: run both retrievers, fuse the rankings with Reciprocal Rank Fusion (RRF), and return the merged top-k.
```python
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

bm25 = BM25Retriever.from_documents(chunks)
bm25.k = 50
vector = chroma_store.as_retriever(search_kwargs={"k": 50})

hybrid = EnsembleRetriever(
    retrievers=[bm25, vector],
    weights=[0.4, 0.6],  # tune on your eval set
)
docs = hybrid.get_relevant_documents("ERC-4337 EntryPoint address")
```

Most serious vector databases support hybrid search natively: Weaviate, Qdrant, Elastic, and pgvector-with-tsvector all expose RRF fusion as a first-class operation. Use the native implementation when you can — it is faster and has better tokenization.
On every technical corpus we have tested, hybrid adds 10 to 20 percent MRR over pure vector. It is nearly free to add and should be a default.
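If your vector database does not expose RRF natively, the fusion itself is only a few lines. A minimal sketch over ranked lists of document ids, with k=60 as the conventional smoothing constant:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc ids: each doc accumulates 1 / (k + rank)
    per list it appears in, and the fused order sorts by total score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked second by both retrievers beats one ranked first by only one of them, which is exactly the behavior you want from a fusion step.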
Step 5: Add a Re-ranker — The Non-Negotiable Second Stage
A re-ranker is a cross-encoder model that scores a (query, document) pair together, producing a much more accurate relevance score than the dot product of independent embeddings.
The standard two-stage pattern: retrieve 50 to 100 candidates fast (hybrid or vector), then re-rank the top 10 with a cross-encoder.
```python
import cohere

co = cohere.Client()

def rerank_chunks(query: str, candidates: list[str], top_n: int = 5):
    resp = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    return [candidates[r.index] for r in resp.results]

candidates = hybrid.get_relevant_documents(query)  # top 50
final = rerank_chunks(query, [c.page_content for c in candidates], top_n=5)
```

Re-ranker options:
- Cohere Rerank 3.5 — closed, fast, multilingual, pay per 1000 searches.
- Jina Rerank v2 — competitive, cheaper.
- BGE-reranker-v2-m3 — open weights, self-host on a single GPU.
- Voyage Rerank — strong on enterprise search.
In our evaluations, adding a re-ranker lifts nDCG by 15 to 30 percent on top of any retrieval baseline. If you are debugging retrieval and not using a re-ranker, add one before you change anything else.
Step 6: Query Transformation for Bad Questions
Sometimes the retrieval pipeline is fine and the query is the problem. "What about the thing?" is not going to match anything useful.
HyDE: Hypothetical Document Embeddings
HyDE asks an LLM to generate a plausible answer first, embeds that, and retrieves on the hypothetical answer instead of the raw query. It works because documents retrieve documents better than questions retrieve documents.
```python
def hyde_retrieve(query: str, llm, retriever):
    hypothetical = llm.invoke(
        f"Write a one-paragraph hypothetical answer to: {query}"
    ).content
    return retriever.get_relevant_documents(hypothetical)
```

Multi-query rewriting
Ask an LLM to rewrite the user query into 3 or 4 variations, retrieve for each, and merge.
```python
from langchain.retrievers.multi_query import MultiQueryRetriever

multi = MultiQueryRetriever.from_llm(
    retriever=chroma_store.as_retriever(),  # base vector retriever
    llm=llm,
)
docs = multi.get_relevant_documents("how do I stop my AI agent forgetting?")
```

For a deeper look at why AI agents lose context and how to recover it, see why your AI agent loses context and how to fix it.
Query classification
Not every query needs HyDE or multi-query. Use a lightweight classifier (a small LLM or even a rules engine) to route simple queries through the fast path and hard queries through the expensive path.
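A rules-engine router can be as simple as a couple of regexes. In this sketch the path names and heuristics are illustrative assumptions: exact-match-looking queries (error codes, API names) skip rewriting, while short or pronoun-heavy queries get the expensive HyDE or multi-query path:

```python
import re

def route_query(query: str) -> str:
    """Heuristic router: lexical-looking queries take the cheap hybrid path,
    short vague queries take the expensive rewrite path."""
    # Error codes, spec identifiers, function calls: exact tokens matter.
    if re.search(r"0x[0-9A-Fa-f]+|[A-Z]{2,}-\d+|[a-z_]+\([^)]*\)", query):
        return "hybrid_fast_path"
    # Short or pronoun-heavy queries are underspecified.
    if len(query.split()) <= 4 or re.search(r"\b(it|that|the thing)\b", query, re.I):
        return "rewrite_path"
    return "standard_path"
```

Start with rules like these and graduate to a small LLM classifier only if the misroutes show up in your eval numbers.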
Step 7: Metadata Filtering — The Cheapest Precision Win
If your documents have structure (tenant id, date, document type, product line), use it. A metadata filter before vector search collapses the search space and boosts precision for free.
```python
results = vector_store.similarity_search(
    query,
    k=10,
    filter={
        "tenant_id": current_user.tenant_id,
        "doc_type": "support_ticket",
        "created_at": {"$gte": "2026-01-01"},
    },
)
```

Every production vector database supports this natively — pgvector, Qdrant, Weaviate, Pinecone, Chroma. The only trick is making sure your filter fields are indexed, or the filter will force a full scan.
A Complete Debugging Flowchart
When retrieval is bad, work through this in order:
- Is there an eval set? If not, build one. You cannot debug without numbers.
- Chunk size experiment: sweep 256, 512, 1024, 2048 and pick the winner by MRR.
- Try semantic or domain-aware chunking if prose is still underperforming.
- Swap the embedding model: test 3 candidates against your eval.
- Add hybrid search (BM25 + vector with RRF fusion).
- Add a re-ranker on the top 50 candidates.
- Add metadata filters if your data has structure.
- For bad queries, add HyDE or multi-query rewriting conditionally.
- Only now, if MRR is high but answers are still wrong, look at the generation prompt.
In our experience, most teams stop debugging at step 1 or 2 and never reach step 5. The gains from steps 4 through 7 are usually larger than anything upstream.
What to Read Next
- What is RAG? Retrieval-Augmented Generation Explained — the conceptual foundation
- Complete Guide to Artificial Intelligence — broader AI context
- Why Your AI Agent Loses Context and How to Fix It — context window management
External references:
- LlamaIndex evaluation docs
- Ragas evaluation framework
- MTEB embedding leaderboard
- Cohere Rerank documentation
- Weaviate hybrid search
This post is part of our AI engineering coverage. For the foundational concept, see our [complete guide to RAG](/blog/what-is-rag-retrieval-augmented-generation-explained).
Key Takeaways
- Chunk size is the single highest-leverage knob — 512 tokens is the safe default, but semantic chunking almost always beats fixed-size splits
- Embedding model choice matters more than most teams realize — OpenAI text-embedding-3-large, Voyage-3, and BGE-M3 each dominate different domains
- Hybrid search (BM25 plus dense vectors) fixes the acronym, product-code, and exact-phrase failures that pure vector search cannot
- A re-ranker like Cohere Rerank or Jina Rerank typically adds 15 to 30 percent nDCG on top of any retrieval baseline
- Query transformation techniques like HyDE and multi-query rewriting rescue vague or under-specified user queries before retrieval
- Metadata filtering collapses the search space and is the cheapest way to raise precision when users can be scoped by tenant, date, or document type
- Measure with MRR at 10, nDCG at 10, and context precision on a static eval set before you change anything else — otherwise you are guessing
Frequently Asked Questions
Why is my RAG pipeline returning chunks that look related but do not actually answer the question?
This is the classic symptom of embedding-only retrieval with chunks that are too large or too generic. Large chunks dilute the semantic signal so the vector matches the topic but not the specific fact. Start by shrinking chunks to 256 to 512 tokens, adding 10 to 20 percent overlap, and layering a re-ranker like Cohere Rerank 3. Also verify that your embedding model was trained on text similar to your corpus — a general-purpose model will miss domain-specific nuance in medical, legal, or code repositories.
Should I use fixed-size, recursive, or semantic chunking?
Recursive character splitting is a reasonable default for unstructured prose. Semantic chunking (splitting on embedding similarity drops) produces measurably better retrieval on long-form content but costs more compute to index. For code, always split on syntactic boundaries using a language-aware splitter like the LlamaIndex CodeSplitter. For tables and structured data, row-level chunks with summary headers outperform every general-purpose splitter we have tested.
Which embedding model should I use in 2026?
For English general-purpose, OpenAI text-embedding-3-large and Voyage-3 are the current leaders on MTEB. For multilingual, BGE-M3 is state of the art open weights. For code search, Voyage-code-2 is the default. The only way to know for your corpus is to run a retrieval eval on a few hundred labeled questions — do not trust leaderboard numbers blindly, because domain drift matters more than the benchmark rank.
What is hybrid search and when should I use it?
Hybrid search combines lexical retrieval (BM25, which matches on exact tokens) with dense vector retrieval (which matches on semantic meaning) and fuses the two ranked lists, usually with Reciprocal Rank Fusion. Use it whenever your domain contains product codes, error codes, API names, legal citations, or rare entities that vector search tends to over-smooth. On any technical corpus, hybrid is almost always a free 10 to 20 percent improvement.
Do I need a re-ranker if I am already using a good embedding model?
Yes, in nearly every production RAG system. Embedding models are optimized for fast approximate recall across millions of chunks; re-rankers are cross-encoders that score query-document pairs jointly and are much more accurate. The standard pattern is to retrieve 50 to 100 candidates with vectors or hybrid, then re-rank the top 10 for the LLM. Cohere Rerank 3, Jina Rerank v2, and bge-reranker-v2-m3 are all strong choices.
How do I evaluate my RAG pipeline without spending weeks on human labeling?
Build a synthetic eval set with an LLM: for each document, have GPT-4 or Claude generate two or three questions whose answer is contained in that document. Then measure MRR at 10 (did the correct chunk land in the top 10?) and context precision (what fraction of retrieved chunks were actually relevant?). Ragas, TruLens, and DeepEval all automate this workflow. Keep the eval set frozen so you can compare changes over time.
What is HyDE and does it actually work?
HyDE (Hypothetical Document Embeddings) asks an LLM to generate a fake answer to the user's query, embeds that hypothetical document, and retrieves on it instead of the raw query. It works remarkably well when user queries are short, underspecified, or written in a different register than the corpus (for example, casual questions against formal documentation). It adds latency and an LLM call, so it is best deployed conditionally based on query length or a complexity classifier.
My retrieval quality is fine but the LLM still hallucinates — what now?
This is no longer a retrieval problem; it is a generation or prompting problem. Add strict instructions to the prompt ("Answer only from the provided context, cite every claim"), use structured output with citations, and run a faithfulness check on the output with a second LLM call. If context is correct and the model still invents facts, you may need a larger or more instruction-tuned model. For background on why agents lose context, see our deep dive on context window failures linked below.
About the Author
Aisha Patel
Senior AI Researcher & Technical Writer
PhD in Computer Science, MIT | Former AI Research Lead at DeepMind
Aisha Patel is a senior AI researcher and technical writer with over eight years of experience in machine learning, natural language processing, and computer vision. She holds a PhD in Computer Science from MIT, where her dissertation focused on transformer architectures for multimodal learning. Before joining Web3AIBlog, Aisha spent three years as an AI Research Lead at DeepMind, where she contributed to breakthroughs in reinforcement learning and published over 20 peer-reviewed papers. She is passionate about demystifying complex AI concepts and making cutting-edge research accessible to developers, entrepreneurs, and curious minds alike. Aisha regularly speaks at NeurIPS, ICML, and industry conferences on the practical applications of generative AI.