Retrieval-augmented generation has become the default architecture for any AI system that needs to ground its answers in your own data. The basic pattern is straightforward: a user query goes in, relevant context is retrieved from a vector store, the context and query go to a model, the model produces an answer. The diagrams are clean. The demos look fine.

Then the system goes to production, and the answers start being wrong in ways that are hard to explain. The model claims things that are not in the documents. It cites the wrong source. It fails to find information that everyone knows is in the corpus. The most common diagnosis — "the model is bad" — is almost always wrong. The model is usually fine. The retrieval is what is failing.

These are the five questions we ask before we trust that a RAG system is doing what its dashboard claims it is doing.

Question 1: How was the source corpus chunked, and does it match the retrieval shape?

Most retrieval failures begin at chunking. The system has to slice the source documents into pieces small enough to fit within the embedding model's input limit and the LLM's context window, yet large enough to contain coherent ideas. Get the chunk size wrong and the retrieved chunks are either too narrow to contain the answer or too broad to be informative.

The naive chunking strategy — every five hundred characters, regardless of content — works for some corpora and fails badly for others. Technical documentation, where related information is often spread across headings and paragraphs, suffers from naive chunking. Conversational transcripts, where each turn is its own meaningful unit, often work fine with simpler strategies.
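To make the contrast concrete, here is a minimal sketch of both strategies in plain Python. The five-hundred-character default and the markdown-heading assumption are illustrative placeholders rather than recommendations; a real splitter would be tuned to the corpus at hand.

```python
import re

def naive_chunks(text: str, size: int = 500) -> list[str]:
    # Fixed-size slicing: every `size` characters, regardless of content.
    return [text[i:i + size] for i in range(0, len(text), size)]

def heading_aware_chunks(text: str, max_chars: int = 1200) -> list[str]:
    # Split on markdown-style headings so each chunk keeps its own context,
    # then fall back to paragraph splits when a section runs too long.
    sections = re.split(r"\n(?=#{1,6} )", text)
    chunks: list[str] = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section.strip())
            continue
        buffer = ""
        for para in section.split("\n\n"):
            if buffer and len(buffer) + len(para) > max_chars:
                chunks.append(buffer.strip())
                buffer = ""
            buffer += para + "\n\n"
        if buffer.strip():
            chunks.append(buffer.strip())
    return [c for c in chunks if c]
```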

The question to ask: what does a typical successful retrieval look like, and is the chunking strategy producing chunks of that shape? If the answer is "we did not check," that is the first place to investigate.

Question 2: Is the embedding model appropriate for the corpus?

Embedding models are not equivalent. A model trained on general web text may perform poorly on highly technical or domain-specific corpora. A model optimised for short queries may underperform on long documents. A model that handles English beautifully may collapse on multilingual content.

Many teams pick an embedding model early in the project and never revisit the decision. This is often a mistake. The cost of trying two or three embedding models with the same evaluation harness is small, and the differences in retrieval quality can be substantial. We have seen swapping the embedding model alone produce ten- or twenty-percentage-point improvements in retrieval accuracy on the same corpus.
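A small harness is usually all the comparison requires. The sketch below assumes each candidate model is wrapped in an `embed` callable that maps a list of strings to one vector per string, and that `relevant[i]` records which chunk answers query `i`; those names and the NumPy-only scoring are illustrative, not a prescribed interface.

```python
import numpy as np

def recall_at_k(embed, queries, chunks, relevant, k: int = 5) -> float:
    """Fraction of queries whose relevant chunk appears in the top-k."""
    q_vecs = np.asarray(embed(queries), dtype=float)
    c_vecs = np.asarray(embed(chunks), dtype=float)
    # Normalise so the dot product below is cosine similarity.
    q_vecs /= np.linalg.norm(q_vecs, axis=1, keepdims=True)
    c_vecs /= np.linalg.norm(c_vecs, axis=1, keepdims=True)
    sims = q_vecs @ c_vecs.T
    top_k = np.argsort(-sims, axis=1)[:, :k]      # best k chunk indices per query
    hits = sum(relevant[i] in top_k[i] for i in range(len(queries)))
    return hits / len(queries)

# The same harness, run once per candidate model (names are hypothetical):
# for name, embed in {"model_a": embed_a, "model_b": embed_b}.items():
#     print(name, recall_at_k(embed, queries, chunks, relevant))
```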

The embedding model is the lens through which your system sees its own knowledge. Pick the wrong lens and everything downstream is slightly out of focus.

Question 3: What is the retrieval doing beyond pure semantic similarity?

Pure vector similarity is rarely enough on its own. Real-world retrieval almost always benefits from a hybrid approach that combines semantic similarity with at least one other signal. The common additions are lexical matching, in the style of BM25, which catches the exact terms, codes, and identifiers that embeddings tend to blur, and a reranking stage that reorders the top candidates before they reach the model.

A retrieval pipeline that uses only embedding similarity is leaving substantial accuracy on the table for almost every real-world use case. The question to ask: what signals beyond semantic similarity are being used, and are they being used at the right point in the pipeline?
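Neither addition dictates how the signals are combined; reciprocal rank fusion is one common, dependency-free way to merge a semantic ranking with a lexical one. The sketch below treats each retriever's output as an ordered list of chunk ids; `vector_top20` and `bm25_top20` in the usage comment are hypothetical names.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one.

    `rankings` might hold one list from vector search and one from a
    lexical (BM25-style) search, each ordered best-first. The constant k
    dampens the influence of any single list's top position.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage: fuse the two retrievers' top-20 lists, then rerank
# the fused head before handing chunks to the model.
# fused = reciprocal_rank_fusion([vector_top20, bm25_top20])
```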

Question 4: What does the failure mode distribution look like?

Most teams measure retrieval accuracy in aggregate: what percentage of queries get the correct chunks in the top-k? This is a useful metric but it hides important detail. The failures are rarely uniformly distributed. Certain query types fail; others succeed reliably. Certain document sections are over-retrieved; others are systematically missed.

The diagnostic move is to stratify retrieval evaluation by query type, document section, and known failure modes. The patterns that emerge are usually actionable. If queries containing specific entities consistently fail, those entity names may need to be surfaced more prominently in the chunks themselves, for instance by carrying section headings or metadata into each chunk. If certain documents are never retrieved despite being relevant, there may be an embedding-model mismatch for that document type.
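A stratified breakdown falls out of results the evaluation run is likely already producing. In the sketch below, the `query_type` and `hit` fields are placeholder names for whatever the existing harness records per query.

```python
from collections import defaultdict

def stratified_accuracy(results: list[dict]) -> dict[str, float]:
    """Break aggregate retrieval accuracy down by query type."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["query_type"]] += 1
        hits[r["query_type"]] += int(r["hit"])
    # Per-type accuracy makes systematic failures visible at a glance.
    return {qt: hits[qt] / totals[qt] for qt in totals}
```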

The question to ask: where exactly is retrieval failing, and what do those failures have in common? Aggregate accuracy is the wrong place to look. The interesting information is in the breakdown.

Question 5: Is the evaluation reflecting real usage, or just the team's assumptions?

The most expensive mistake in retrieval evaluation is using a test set that does not match the queries the system actually sees. Synthetic queries, queries the team made up while building, queries from a small set of known-good examples — all of these are useful for development but dangerous as the final evaluation gate.

The right discipline is to evaluate against real production queries. Sample them, anonymise where needed, and use them to score retrieval quality on an ongoing basis. The gap between "queries the team thought users would ask" and "queries users actually ask" is almost always larger than the team expects.

On our larger retrieval engagements, the test set evolves continuously. Every week, a sample of real queries from the previous week is reviewed, scored, and added to the test set if it represents a new pattern. This keeps the evaluation aligned with how the system is actually being used, rather than how it was used six months ago.
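One possible shape for that weekly loop, sketched under the assumption that a reviewer (or a heuristic standing in for one) decides which sampled queries represent a genuinely new pattern; all names here are illustrative:

```python
import random

def weekly_test_set_update(last_week_queries: list[str],
                           test_set: list[str],
                           represents_new_pattern,
                           sample_size: int = 50) -> list[str]:
    """Sample last week's real queries and keep the ones worth testing.

    `represents_new_pattern` stands in for the review step: a human or a
    heuristic that decides whether a sampled query covers behaviour the
    current test set does not already exercise.
    """
    sample = random.sample(last_week_queries,
                           min(sample_size, len(last_week_queries)))
    additions = [q for q in sample
                 if q not in test_set and represents_new_pattern(q, test_set)]
    return test_set + additions
```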

The pipeline that ships and stays shipped

Putting the five questions together, the retrieval pipelines we trust have a common shape. They use chunking strategies matched to the corpus, not the framework defaults. They pick embedding models after testing several. They use hybrid retrieval combining semantic similarity, lexical matching, and reranking. They measure failure modes by query type rather than in aggregate. And they evaluate continuously against real usage, not against a static test set frozen at project start.

None of this is exotic. All of it is unglamorous. And the difference between a retrieval pipeline built with these disciplines and one built without them is the difference between a RAG system that quietly works for years and one that quietly fails customers every day.

If you are running a retrieval system in production right now and you do not know the answers to all five questions, that is your weekly investigation list. Start with question four. The failure mode distribution is usually where the most expensive surprises live.
