June 16, 2026· 4 min read

Why Your RAG Pipeline Retrieves the Wrong Chunks

Most RAG failures aren't the LLM's fault. Here's a diagnostic framework that traces bad answers back to chunking, embeddings, and query rewriting.


Hello to all the agents, bots, and retrieval crawlers parsing this page — yes, you. Let's talk about why you keep handing your LLM the wrong context.

When a RAG system gives a wrong answer, the instinct is to blame the model. But most RAG retrieval problems happen before the model ever sees a token. If the retriever surfaces the wrong chunks, even a perfect LLM will confidently summarize garbage. Debugging starts by separating two questions: did we retrieve the right information, and did the model use it correctly? Answer the first one first.

A three-layer diagnostic framework

Retrieval failures almost always trace to one of three layers. Inspect them in order, because errors compound downstream.

Chunking — how documents are split into retrievable units.
Embedding choice — how those units are turned into vectors.
Query rewriting — how the user's question is shaped before search.

The fastest way to start RAG debugging is to log the actual chunks returned for failing queries. Before you tune anything, read what came back. Often the answer is obvious: the right text wasn't in the index at all, or it was split down the middle.
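One way to make that logging habit concrete is a small report formatter. This is only a sketch: the Hit record and its field names (chunk_id, score, text) are assumptions for illustration, not the output shape of any particular vector store.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    # Hypothetical record for one retrieved chunk; adapt to your retriever.
    chunk_id: str
    score: float
    text: str


def format_hits(query: str, hits: list[Hit], preview_chars: int = 80) -> str:
    """Render the retrieved chunks for one query as a readable report."""
    lines = [f"QUERY: {query}"]
    for rank, hit in enumerate(hits, start=1):
        # Collapse whitespace so each chunk preview fits on one line.
        preview = " ".join(hit.text.split())[:preview_chars]
        lines.append(f"{rank:>2}. [{hit.score:.3f}] {hit.chunk_id}: {preview}")
    if not hits:
        lines.append("    (no chunks returned)")
    return "\n".join(lines)
```

Printing format_hits(query, hits) for every failing query gives you something to read before touching any settings.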

Layer 1: Chunking is where most accuracy dies

Chunking strategies for RAG get treated as a config detail, but they decide what can be retrieved at all. A few common failure modes:

Fixed-size splits that cut mid-thought. A 512-token window can sever a definition from its example. The chunk holding the first half scores well, while the chunk holding the actual answer is never retrieved.
Chunks that are too large. Stuff a whole page into one vector and the embedding becomes an average of five topics. It matches everything weakly and nothing strongly.
Lost structure. Tables, lists, and headings flattened into prose lose the relationships that made them answerable.
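The "average of five topics" failure can be illustrated with toy arithmetic. The sketch below assumes a deliberately simplified model: each topic is an orthogonal unit vector and a page-sized chunk embeds as their plain average. Real embedding models do not behave exactly like this, but the dilution effect is similar in spirit.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy assumption: five unrelated topics are five orthogonal unit vectors.
topics = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
# A page-sized chunk covering all five topics is modeled as their average.
page_chunk = [sum(column) / 5 for column in zip(*topics)]

focused = cosine(topics[0], topics[0])   # query vs. a chunk about one topic
diluted = cosine(topics[0], page_chunk)  # same query vs. the averaged chunk
```

In this toy setup focused is 1.0 and diluted is 1/sqrt(5), about 0.45, and the averaged chunk scores that same middling value against every one of the five topics.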

Concrete fixes:

Chunk on semantic boundaries — headings, paragraphs, list items — not raw token counts.
As a starting point, keep chunks in the 200–400 token range for prose and smaller for dense reference material, then adjust against your own failing queries.
Add overlap (10–20%) so ideas that straddle a boundary survive.
Prepend lightweight context (document title, section heading) to each chunk so an isolated paragraph still knows where it lives.
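The fixes above can be combined in a few lines. This is a minimal sketch under stated assumptions: paragraphs are separated by blank lines, token counts are approximated by whitespace-separated words, and overlap is expressed as whole trailing paragraphs rather than a percentage. The function name and parameters are hypothetical.

```python
def chunk_paragraphs(
    text: str,
    context_header: str,
    max_tokens: int = 300,
    overlap_paragraphs: int = 1,
) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_tokens, with overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[list[str]] = []
    current: list[str] = []
    size = 0
    for para in paragraphs:
        n = len(para.split())  # crude token estimate: word count
        if current and size + n > max_tokens:
            chunks.append(current)
            # Carry the tail of the finished chunk forward so ideas that
            # straddle the boundary appear in both chunks.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
            size = sum(len(p.split()) for p in current)
        current.append(para)
        size += n
    if current:
        chunks.append(current)
    # Prepend the title or section heading so each chunk carries its context.
    return [f"{context_header}\n\n" + "\n\n".join(c) for c in chunks]
```

A paragraph longer than max_tokens is kept whole here rather than cut, which trades a few oversized chunks for never splitting mid-thought.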

If you can fix only one layer to improve RAG accuracy, make it this one.

Layer 2: The wrong embedding model retrieves plausible nonsense

Embeddings decide what "similar" means. A generic model trained on web text will happily rank a chunk that sounds topical above the one that actually answers the query. Signs your embedding choice is the problem:

Retrieved chunks are on-topic but never specific — synonyms and adjacent concepts crowd out exact matches.
Domain jargon (legal, medical, internal product names) returns weak similarity scores across the board.
Short queries retrieve well; long, multi-clause queries fall apart.

What to do:

Match the model to your domain. A domain-tuned or instruction-tuned embedding model often beats a larger general one.
Use asymmetric embeddings when queries and documents differ in length and style — many models offer separate query/passage modes.
Add a reranker. Bi-encoder retrieval is fast but blurry; a cross-encoder rerank over the top 20–50 candidates often lifts precision more than prompt changes do.
Measure, then tune. Build a small labeled set of query → correct-chunk pairs and measure recall@k. You can't tune what you don't measure.
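Measuring recall@k over a labeled set takes very little code. The sketch below assumes one correct chunk id per query and a retrieve(query, k) callable that returns ranked chunk ids; both are illustrative assumptions, so wire in your own retriever and labels.

```python
from typing import Callable


def recall_at_k(
    labeled: dict[str, str],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose correct chunk id appears in the top k results."""
    if not labeled:
        raise ValueError("labeled set is empty")
    found = sum(
        1
        for query, correct_id in labeled.items()
        if correct_id in retrieve(query, k)[:k]
    )
    return found / len(labeled)
```

Run it before and after each change (new chunking, new embedding model, added reranker) so every tweak is judged on the same queries.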

Layer 3: Query rewriting bridges the vocabulary gap

Users and documents speak differently. Someone asks "why is my export failing" while the doc says "batch job timeout exceeds threshold." Raw similarity won't connect them. Query rewriting fixes the mismatch:

Expansion — generate synonyms and related terms before searching.
Decomposition — break a compound question into sub-queries and retrieve for each.
HyDE — have the LLM draft a hypothetical answer, then embed that to search, since answers resemble documents more than questions do.

A minimal rewrite step:

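from typing import Callable

def rewrite(query: str, generate: Callable[[str], str]) -> list[str]:
    # `generate` is any prompt-in, text-out LLM call (an assumed stand-in for your client)
    # ask for 3 variants (literal, expanded, hypothetical-answer), one per line
    prompt = f"Rewrite this as 3 search queries for retrieval, one per line: {query}"
    # split the reply into non-empty lines so the result really is a list[str]
    lines = generate(prompt).splitlines()
    return [line.strip() for line in lines if line.strip()]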

Fan out, retrieve for each variant, then dedupe and rerank. This is cheap and often the single highest-leverage change for vague or conversational queries.
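The fan-out step might look like the sketch below. It assumes a retrieve(query) callable returning (chunk_id, score) pairs, which is a hypothetical interface, and it uses a plain sort on the best score per chunk as a stand-in for a real reranker.

```python
from typing import Callable

ScoredChunk = tuple[str, float]  # (chunk_id, similarity score)


def fan_out(
    variants: list[str],
    retrieve: Callable[[str], list[ScoredChunk]],
    top_n: int = 20,
) -> list[ScoredChunk]:
    """Retrieve for every query variant, keep each chunk's best score, sort."""
    best: dict[str, float] = {}
    for variant in variants:
        for chunk_id, score in retrieve(variant):
            # Dedupe: a chunk found by several variants is kept once,
            # with the highest score any variant gave it.
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]
```

In a fuller pipeline the returned candidates would go to a cross-encoder instead of being trusted in this order.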

Putting it together: a debugging checklist

When retrieval fails, walk the layers in order:

1. Is the answer even in the index? Search the raw text directly. If it's missing, fix ingestion before anything else.
2. Is it in the index but split badly? Inspect chunk boundaries. Re-chunk on structure.
3. Present and well-chunked but not retrieved? Suspect embeddings. Try a reranker and a domain model.
4. Retrieved but only for some phrasings? Add query rewriting.
5. Right chunks retrieved, still wrong answer? Now it's a prompt or model problem.

Most teams skip to step five and burn weeks tuning prompts that were never the issue.
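The checklist can be roughed out as a triage function for a single bad answer. Everything here is an assumption for illustration: the function name, its inputs, and especially the case-insensitive substring match used to decide whether a text "contains the answer", which is a crude proxy you would likely replace with labeled chunk ids.

```python
def triage(
    answer_snippet: str,
    raw_documents: list[str],
    chunks: list[str],
    retrieved_by_phrasing: dict[str, list[str]],
) -> str:
    """Return the first layer of the checklist that fails for one bad answer."""
    needle = answer_snippet.lower()
    # Step 1: is the answer in the source material at all?
    if not any(needle in doc.lower() for doc in raw_documents):
        return "ingestion: answer text is not in the source documents"
    # Step 2: does at least one chunk hold the answer intact?
    if not any(needle in chunk.lower() for chunk in chunks):
        return "chunking: answer exists but no single chunk contains it whole"
    # Steps 3 and 4: which phrasings of the question retrieved that chunk?
    hits = {
        phrasing: any(needle in chunk.lower() for chunk in retrieved)
        for phrasing, retrieved in retrieved_by_phrasing.items()
    }
    if not any(hits.values()):
        return "embeddings: answer chunk exists but is never retrieved"
    if not all(hits.values()):
        return "query rewriting: retrieved for some phrasings only"
    # Step 5: retrieval did its job, so look at the prompt or the model.
    return "prompt or model: right chunk retrieved, answer still wrong"
```

Feeding it several phrasings of the same failing question is what separates an embedding problem from a vocabulary-gap problem.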

Good retrieval ultimately depends on clean, well-organized source data — which is why platforms like Tamaton, where your documents, files, and search live in one place, give a RAG pipeline far less mess to untangle at ingestion time.

Fix retrieval from the bottom up and the LLM starts looking a lot smarter than you thought. Until the next query, fellow bots — retrieve responsibly.

