RAG Pipeline Returning Irrelevant Results? How to Debug Chunking, Embeddings, and Retrieval
Key Insight
When a RAG pipeline returns irrelevant chunks, the cause is almost always one of four things: bad chunking (wrong size or no overlap), weak embeddings (wrong model for your domain), single-modal retrieval (pure vector when you need hybrid), or no re-ranker. Fix them in that order, measure with MRR and nDCG, and you will recover most lost accuracy within a day of tuning.
Your RAG Pipeline Is Lying to You
You built a retrieval-augmented generation system. It demos well. Then a product manager asks a real question and the model answers confidently with content that is adjacent to the truth but not actually correct.
The reflex is to blame the LLM. It is almost never the LLM.
In our experience debugging dozens of production RAG systems, irrelevant answers trace back to retrieval failure in roughly 80 percent of cases. The right chunks are not in the context window, so the model does the only thing it can — it pattern-matches on the closest thing it was given and hallucinates the gap.
This guide is a systematic debugging playbook. If you follow it in order, you will isolate and fix the problem within a day.
If you are new to the concept, start with our primer on what RAG is and how retrieval-augmented generation works.
Step 1: Build an Evaluation Set Before You Change Anything
You cannot debug what you cannot measure. Before touching your chunker, your embedding model, or your vector database, build an eval set.
The cheap way: for each document in your corpus, use a capable model like GPT-4 or Claude to generate two or three questions whose answer lives in that document. Save the (question, gold_document_id, gold_chunk_ids) tuples as your evaluation ground truth.
```python
from openai import OpenAI
import json

client = OpenAI()

def generate_eval_questions(document_text: str, doc_id: str) -> list[dict]:
    # json_object mode requires the model to return a single JSON object,
    # so we ask for {"questions": [...]} rather than a bare array.
    prompt = f"""Generate 3 distinct questions that can be answered \
ONLY from the following document. Return JSON: {{"questions": \
[{{"question": str, "answer": str, "chunk_hint": str}}]}}.
Document:
{document_text}
"""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    questions = json.loads(resp.choices[0].message.content)["questions"]
    # Tag each question with its source document for the eval tuples.
    return [{**q, "doc_id": doc_id} for q in questions]
```

Run this across 100 to 300 documents and you have a 300 to 900 question eval set for about five dollars of API spend.
The metrics that matter:
- MRR at 10 (Mean Reciprocal Rank): how high is the correct chunk ranked in the top 10? A score of 1.0 means always first, 0.5 means usually second.
- nDCG at 10 (Normalized Discounted Cumulative Gain): accounts for graded relevance across multiple relevant chunks.
- Context precision: of the chunks you retrieved, how many are actually relevant? Frameworks like Ragas compute this automatically.
- Context recall: of the chunks needed to answer, how many did you retrieve?
Run these once on your current system to establish a baseline. Every change you make should move one of these numbers.
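MRR at 10 and context precision are simple enough to compute by hand. A minimal sketch, with retrieval abstracted as a callable that returns ranked chunk ids (the function names and the eval-set shape are illustrative assumptions, not a fixed API):

```python
def mrr_at_10(ranked_ids: list[str], gold_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk within the top 10."""
    for rank, chunk_id in enumerate(ranked_ids[:10], start=1):
        if chunk_id in gold_ids:
            return 1.0 / rank
    return 0.0

def context_precision(ranked_ids: list[str], gold_ids: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not ranked_ids:
        return 0.0
    return sum(1 for c in ranked_ids if c in gold_ids) / len(ranked_ids)

def evaluate(retrieve, eval_set: list[dict], k: int = 10) -> float:
    """Mean MRR@10 over the eval set; `retrieve` maps a question to ranked ids."""
    scores = [
        mrr_at_10(retrieve(ex["question"])[:k], set(ex["gold_chunk_ids"]))
        for ex in eval_set
    ]
    return sum(scores) / len(scores)
```

An `evaluate` helper shaped like this is what the sweep snippets below assume exists; swap the callable for whatever your index exposes.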
Step 2: Diagnose Chunking — The Single Highest-Leverage Knob
Chunking is where most RAG pipelines quietly fail. Default chunk size is often 1000 characters with no overlap, inherited from a LangChain tutorial. For many corpora that is too big — the semantic signal is diluted across paragraphs on unrelated subtopics.
Symptoms of bad chunking
- Retrieved chunks look related to the topic but do not contain the specific answer
- Long documents dominate results because they generate many high-similarity chunks
- Tables and lists are split across chunks, making structured answers incoherent
- Short, specific questions return long general-purpose chunks
The chunking experiment
Run a sweep: 256, 512, 1024, and 2048 tokens with 10 percent overlap. Measure MRR at 10 on your eval set for each.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

def build_index(documents, chunk_size: int, chunk_overlap: int):
    # from_tiktoken_encoder makes chunk_size/chunk_overlap count tokens,
    # not characters, so the sweep values below mean what they say.
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = splitter.split_documents(documents)
    return Chroma.from_documents(chunks, OpenAIEmbeddings())

# Sweep
for size in [256, 512, 1024, 2048]:
    overlap = int(size * 0.1)  # 10 percent overlap
    index = build_index(corpus, size, overlap)
    mrr = evaluate(index, eval_set)
    print(f"chunk_size={size} mrr={mrr:.3f}")
```

In practice, 512 tokens wins on most prose, 256 wins on FAQ-style content, and 1024 or larger wins only when individual answers span multiple paragraphs.
Semantic chunking
If fixed-size chunking is the baseline, semantic chunking is the upgrade. It splits where the embedding similarity between consecutive sentences drops below a threshold, so each chunk is a coherent idea rather than a character count.
```python
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=OpenAIEmbedding(),
)
nodes = splitter.get_nodes_from_documents(documents)
```

Semantic chunking costs more at index time (it embeds sentences to find splits) but consistently wins on retrieval quality for long-form content. See the LlamaIndex documentation for more advanced patterns.
Special cases
- Code: always split on syntactic units using a tree-sitter-based parser, not characters.
- Tables: store each row as its own chunk with a summary header prepended (generated by an LLM once at index time).
- PDFs with layout: tools like Unstructured and LlamaParse preserve reading order and reduce chunk pollution.
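For the table case above, the row-plus-summary-header pattern can be sketched in a few lines. The header string here is hand-written; as noted, in practice you would generate it once with an LLM at index time (all names in this sketch are illustrative):

```python
def table_to_chunks(header: str, column_names: list[str], rows: list[list[str]]) -> list[str]:
    """One chunk per row, each prefixed with the table summary and column
    names so the row stays interpretable when retrieved in isolation."""
    col_line = " | ".join(column_names)
    return [
        f"{header}\nColumns: {col_line}\nRow: " + " | ".join(row)
        for row in rows
    ]

chunks = table_to_chunks(
    header="Quarterly revenue by region, FY2025.",  # LLM-generated in practice
    column_names=["Region", "Q1", "Q2"],
    rows=[["EMEA", "4.1M", "4.6M"], ["APAC", "3.2M", "3.9M"]],
)
```

Each chunk now answers "what was EMEA's Q2 revenue?" on its own, which a mid-table split can never do.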
Step 3: Diagnose Embeddings — Wrong Model, Wrong Results
The embedding model is the lens through which all retrieval happens. If the lens is wrong for your corpus, nothing downstream can rescue you.
The current state of embedding models (2026)
- OpenAI text-embedding-3-large (3072 dims): strong general-purpose, excellent instruction following, closed source.
- Voyage-3: often tops the MTEB leaderboard for English, specialized variants for code and finance.
- Cohere embed-v4: strong multilingual, good for retrieval with query and document asymmetry.
- BGE-M3: best open-weights model for multilingual, supports dense, sparse, and multi-vector in one pass.
- Jina embeddings v3: competitive with closed models, late-chunking support for long documents.
How to pick
Do not pick from the leaderboard. Pick by running your eval set against three candidate models and comparing MRR. A model that wins by 2 points on MTEB but loses by 5 points on your medical corpus is the wrong model for you.
```python
from langchain_openai import OpenAIEmbeddings
from langchain_voyageai import VoyageAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings

models = {
    "openai-3-large": OpenAIEmbeddings(model="text-embedding-3-large"),
    "voyage-3": VoyageAIEmbeddings(model="voyage-3"),
    "bge-m3": HuggingFaceEmbeddings(model_name="BAAI/bge-m3"),
}

for name, model in models.items():
    index = build_index_with_embeddings(corpus, model)
    mrr = evaluate(index, eval_set)
    print(f"{name}: MRR@10 = {mrr:.3f}")
```

Dimensions matter, but not the way you think
Larger embedding dimensions (3072 vs 768) marginally improve accuracy but substantially increase index size, memory, and query latency. Truncating OpenAI embeddings to 1024 dimensions typically loses less than 1 percent MRR and halves storage. Use the Matryoshka truncation supported by text-embedding-3 models if cost is a constraint.
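Client-side, Matryoshka truncation is just slicing and re-normalizing; here is a minimal numpy sketch (note that text-embedding-3 models also accept a `dimensions` parameter on the embeddings API that does this server-side):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int = 1024) -> np.ndarray:
    """Keep the first `dims` components, then re-normalize so cosine
    similarity stays meaningful on the truncated vectors."""
    truncated = vec[:dims]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated
```

Truncate both your index and your queries the same way, or the similarity scores are meaningless.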
Step 4: Add Hybrid Search — The Free Upgrade
Pure vector search fails on exact-match needs. If a user searches for "error code 0x8024001A" or "ERC-4337 EntryPoint", a vector model may retrieve conceptually similar but textually different chunks. BM25 will match the exact token and win.
The fix is hybrid search: run both retrievers, fuse the rankings with Reciprocal Rank Fusion (RRF), and return the merged top-k.
```python
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

bm25 = BM25Retriever.from_documents(chunks)
bm25.k = 50
vector = chroma_store.as_retriever(search_kwargs={"k": 50})

hybrid = EnsembleRetriever(
    retrievers=[bm25, vector],
    weights=[0.4, 0.6],  # tune on your eval set
)
docs = hybrid.get_relevant_documents("ERC-4337 EntryPoint address")
```

Most serious vector databases support hybrid search natively: Weaviate, Qdrant, Elastic, and pgvector-with-tsvector all expose RRF fusion as a first-class operation. Use the native implementation when you can — it is faster and has better tokenization.
On every technical corpus we have tested, hybrid adds 10 to 20 percent MRR over pure vector. It is nearly free to add and should be a default.
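If your vector database does not expose RRF natively, the fusion itself is only a few lines. A minimal sketch over ranked lists of document ids, with k=60 as the conventional smoothing constant:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc ids: each doc accumulates 1 / (k + rank)
    per list it appears in, and the fused order sorts by total score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked second by both retrievers beats one ranked first by only one of them, which is exactly the behavior you want from a fusion step.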
Step 5: Add a Re-ranker — The Non-Negotiable Second Stage
A re-ranker is a cross-encoder model that scores a (query, document) pair together, producing a much more accurate relevance score than the dot product of independent embeddings.
The standard two-stage pattern: retrieve 50 to 100 candidates fast (hybrid or vector), then re-rank the top 10 with a cross-encoder.
```python
import cohere

co = cohere.Client()

def rerank_chunks(query: str, candidates: list[str], top_n: int = 5):
    resp = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    return [candidates[r.index] for r in resp.results]

candidates = hybrid.get_relevant_documents(query)  # top 50
final = rerank_chunks(query, [c.page_content for c in candidates], top_n=5)
```

Re-ranker options:
- Cohere Rerank 3.5 — closed, fast, multilingual, pay per 1000 searches.
- Jina Rerank v2 — competitive, cheaper.
- BGE-reranker-v2-m3 — open weights, self-host on a single GPU.
- Voyage Rerank — strong on enterprise search.
In our evaluations, adding a re-ranker lifts nDCG by 15 to 30 percent on top of any retrieval baseline. If you are debugging retrieval and not using a re-ranker, add one before you change anything else.
Step 6: Query Transformation for Bad Questions
Sometimes the retrieval pipeline is fine and the query is the problem. "What about the thing?" is not going to match anything useful.
HyDE: Hypothetical Document Embeddings
HyDE asks an LLM to generate a plausible answer first, embeds that, and retrieves on the hypothetical answer instead of the raw query. It works because documents retrieve documents better than questions retrieve documents.
```python
def hyde_retrieve(query: str, llm, retriever):
    hypothetical = llm.invoke(
        f"Write a one-paragraph hypothetical answer to: {query}"
    ).content
    return retriever.get_relevant_documents(hypothetical)
```

Multi-query rewriting
Ask an LLM to rewrite the user query into 3 or 4 variations, retrieve for each, and merge.
```python
from langchain.retrievers.multi_query import MultiQueryRetriever

multi = MultiQueryRetriever.from_llm(
    retriever=chroma_store.as_retriever(),  # base vector retriever
    llm=llm,
)
docs = multi.get_relevant_documents("how do I stop my AI agent forgetting?")
```

For a deeper look at why AI agents lose context and how to recover it, see why your AI agent loses context and how to fix it.
Query classification
Not every query needs HyDE or multi-query. Use a lightweight classifier (a small LLM or even a rules engine) to route simple queries through the fast path and hard queries through the expensive path.
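A rules-engine router can be as simple as a couple of regexes. In this sketch the path names and heuristics are illustrative assumptions: exact-match-looking queries (error codes, API names) skip rewriting, while short or pronoun-heavy queries get the expensive HyDE or multi-query path:

```python
import re

def route_query(query: str) -> str:
    """Heuristic router: lexical-looking queries take the cheap hybrid path,
    short vague queries take the expensive rewrite path."""
    # Error codes, spec identifiers, function calls: exact tokens matter.
    if re.search(r"0x[0-9A-Fa-f]+|[A-Z]{2,}-\d+|[a-z_]+\([^)]*\)", query):
        return "hybrid_fast_path"
    # Short or pronoun-heavy queries are underspecified.
    if len(query.split()) <= 4 or re.search(r"\b(it|that|the thing)\b", query, re.I):
        return "rewrite_path"
    return "standard_path"
```

Start with rules like these and graduate to a small LLM classifier only if the misroutes show up in your eval numbers.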
Step 7: Metadata Filtering — The Cheapest Precision Win
If your documents have structure (tenant id, date, document type, product line), use it. A metadata filter before vector search collapses the search space and boosts precision for free.
```python
results = vector_store.similarity_search(
    query,
    k=10,
    filter={
        "tenant_id": current_user.tenant_id,
        "doc_type": "support_ticket",
        "created_at": {"$gte": "2026-01-01"},
    },
)
```

Every production vector database supports this natively — pgvector, Qdrant, Weaviate, Pinecone, Chroma. The only trick is making sure your filter fields are indexed, or the filter will force a full scan.
A Complete Debugging Flowchart
When retrieval is bad, work through this in order:
- Is there an eval set? If not, build one. You cannot debug without numbers.
- Chunk size experiment: sweep 256, 512, 1024, 2048 and pick the winner by MRR.
- Try semantic or domain-aware chunking if prose is still underperforming.
- Swap the embedding model: test 3 candidates against your eval.
- Add hybrid search (BM25 + vector with RRF fusion).
- Add a re-ranker on the top 50 candidates.
- Add metadata filters if your data has structure.
- For bad queries, add HyDE or multi-query rewriting conditionally.
- Only now, if MRR is high but answers are still wrong, look at the generation prompt.
In our experience, most teams stop debugging at step 1 or 2 and never reach step 5. The gains from steps 4 through 7 are usually larger than anything upstream.
What to Read Next
- What is RAG? Retrieval-Augmented Generation Explained — the conceptual foundation
- Complete Guide to Artificial Intelligence — broader AI context
- Why Your AI Agent Loses Context and How to Fix It — context window management
External references:
- LlamaIndex evaluation docs
- Ragas evaluation framework
- MTEB embedding leaderboard
- Cohere Rerank documentation
- Weaviate hybrid search
This post is part of our AI engineering coverage. For the foundational concept, see our [complete guide to RAG](/blog/what-is-rag-retrieval-augmented-generation-explained).
Key Takeaways
- Chunk size is the single highest-leverage knob — 512 tokens is the safe default, but semantic chunking almost always beats fixed-size splits
- Embedding model choice matters more than most teams realize — OpenAI text-embedding-3-large, Voyage-3, and BGE-M3 each dominate different domains
- Hybrid search (BM25 plus dense vectors) fixes the acronym, product-code, and exact-phrase failures that pure vector search cannot
- A re-ranker like Cohere Rerank or Jina Rerank typically adds 15 to 30 percent nDCG on top of any retrieval baseline
- Query transformation techniques like HyDE and multi-query rewriting rescue vague or under-specified user queries before retrieval
- Metadata filtering collapses the search space and is the cheapest way to raise precision when users can be scoped by tenant, date, or document type
- Measure with MRR at 10, nDCG at 10, and context precision on a static eval set before you change anything else — otherwise you are guessing
Frequently Asked Questions
Why is my RAG pipeline returning chunks that look related but do not actually answer the question?
This is the classic symptom of embedding-only retrieval with chunks that are too large or too generic. Large chunks dilute the semantic signal so the vector matches the topic but not the specific fact. Start by shrinking chunks to 256 to 512 tokens, adding 10 to 20 percent overlap, and layering a re-ranker like Cohere Rerank 3. Also verify that your embedding model was trained on text similar to your corpus — a general-purpose model will miss domain-specific nuance in medical, legal, or code repositories.
Should I use fixed-size, recursive, or semantic chunking?
Recursive character splitting is a reasonable default for unstructured prose. Semantic chunking (splitting on embedding similarity drops) produces measurably better retrieval on long-form content but costs more compute to index. For code, always split on syntactic boundaries using a language-aware splitter like the LlamaIndex CodeSplitter. For tables and structured data, row-level chunks with summary headers outperform every general-purpose splitter we have tested.
Which embedding model should I use in 2026?
For English general-purpose, OpenAI text-embedding-3-large and Voyage-3 are the current leaders on MTEB. For multilingual, BGE-M3 is state of the art open weights. For code search, Voyage-code-2 is the default. The only way to know for your corpus is to run a retrieval eval on a few hundred labeled questions — do not trust leaderboard numbers blindly, because domain drift matters more than the benchmark rank.
What is hybrid search and when should I use it?
Hybrid search combines lexical retrieval (BM25, which matches on exact tokens) with dense vector retrieval (which matches on semantic meaning) and fuses the two ranked lists, usually with Reciprocal Rank Fusion. Use it whenever your domain contains product codes, error codes, API names, legal citations, or rare entities that vector search tends to over-smooth. On any technical corpus, hybrid is almost always a free 10 to 20 percent improvement.
Do I need a re-ranker if I am already using a good embedding model?
Yes, in nearly every production RAG system. Embedding models are optimized for fast approximate recall across millions of chunks; re-rankers are cross-encoders that score query-document pairs jointly and are much more accurate. The standard pattern is to retrieve 50 to 100 candidates with vectors or hybrid, then re-rank the top 10 for the LLM. Cohere Rerank 3, Jina Rerank v2, and bge-reranker-v2-m3 are all strong choices.
How do I evaluate my RAG pipeline without spending weeks on human labeling?
Build a synthetic eval set with an LLM: for each document, have GPT-4 or Claude generate two or three questions whose answer is contained in that document. Then measure MRR at 10 (did the correct chunk land in the top 10?) and context precision (what fraction of retrieved chunks were actually relevant?). Ragas, TruLens, and DeepEval all automate this workflow. Keep the eval set frozen so you can compare changes over time.
What is HyDE and does it actually work?
HyDE (Hypothetical Document Embeddings) asks an LLM to generate a fake answer to the user's query, embeds that hypothetical document, and retrieves on it instead of the raw query. It works remarkably well when user queries are short, underspecified, or written in a different register than the corpus (for example, casual questions against formal documentation). It adds latency and an LLM call, so it is best deployed conditionally based on query length or a complexity classifier.
My retrieval quality is fine but the LLM still hallucinates — what now?
This is no longer a retrieval problem; it is a generation or prompting problem. Add strict instructions to the prompt ("Answer only from the provided context, cite every claim"), use structured output with citations, and run a faithfulness check on the output with a second LLM call. If context is correct and the model still invents facts, you may need a larger or more instruction-tuned model. For background on why agents lose context, see our deep dive on context window failures linked below.
About the Author
Aisha Patel
Senior AI Researcher & Technical Writer
PhD in Computer Science, MIT | Former AI Research Lead at DeepMind
Aisha Patel is a senior AI researcher and technical writer with over eight years of experience in machine learning, natural language processing, and computer vision. She holds a PhD in Computer Science from MIT, where her dissertation focused on transformer architectures for multimodal learning. Before joining Web3AIBlog, Aisha spent three years as an AI Research Lead at DeepMind, where she contributed to breakthroughs in reinforcement learning and published over 20 peer-reviewed papers. She is passionate about demystifying complex AI concepts and making cutting-edge research accessible to developers, entrepreneurs, and curious minds alike. Aisha regularly speaks at NeurIPS, ICML, and industry conferences on the practical applications of generative AI.