Evaluating RAG Pipelines: Mastering Recall, Precision, and Faithfulness
Apr 22, 2026
Most people building AI agents eventually hit a wall where the model starts making things up, even when the right data is sitting in the prompt. This is the core struggle of Retrieval-Augmented Generation (RAG). You can't just plug in a vector database and hope for the best; you need a way to prove your system actually works. The problem is that traditional LLM metrics like perplexity don't tell you whether your retriever is failing or your generator is simply ignoring the facts. To fix this, you need an evaluation strategy that separates the search process from the writing process.
Evaluating a RAG pipeline isn't a single task; it's a three-part diagnostic. You have to figure out whether the system failed to find the information (Retrieval), found the info but ignored it (Generation), or found the wrong info and tried to make it work (Faithfulness). If you treat these as one big black box, you'll spend weeks tweaking prompts when the real issue was your chunking strategy.
The Retrieval Layer: Did we find the right needles?
Before the LLM even sees a word, your retriever has to do the heavy lifting. If the retriever fails, the generator has no chance. In this stage, we focus on how well the system narrows millions of documents down to the few that actually matter. Recall measures how many of the relevant documents in the knowledge base the system actually retrieves. If your recall is low, your AI is essentially "blind" to the truth, no matter how smart the LLM is.
Then there is Precision, which tells you how much "noise" you're feeding the model. High precision means the retrieved chunks are almost entirely relevant. If precision is low, you're stuffing the prompt with irrelevant data, which often leads to the model getting distracted or hitting context window limits.
To get a real sense of retrieval health, we use specific benchmarks. For instance, Recall@k measures if the correct document appears within the top k results. If you're pulling 5 chunks but the answer is always in the 7th, your Recall@5 is 0%, and your user gets a wrong answer every time.
| Metric | What it Measures | Failure Symptom |
|---|---|---|
| Recall@k | % of relevant docs in top k results | Model says "I don't know" despite data existing |
| Precision | Ratio of relevant docs to total retrieved | Model gets confused by irrelevant "noise" |
| MRR (Mean Reciprocal Rank) | How high the first correct doc is ranked | Correct doc buried below distractors; poor top-1 accuracy |
| Latency | Time to fetch documents | Laggy user experience (high TTFT) |
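The metrics in this table are simple to compute once you have relevance judgments for each query. Here is a minimal sketch in plain Python; the doc IDs and judgments are toy data for illustration:

```python
def recall_at_k(retrieved: list[str], relevant: list[str], k: int) -> float:
    """Fraction of the relevant doc IDs that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def precision_at_k(retrieved: list[str], relevant: list[str], k: int) -> float:
    """Fraction of the top-k results that are actually relevant."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k

def reciprocal_rank(retrieved: list[str], relevant: list[str]) -> float:
    """1 / rank of the first relevant doc; 0.0 if none was retrieved.
    Average this value over all queries to get MRR."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1 / rank
    return 0.0

# Toy judgments: the only relevant doc ("d7") sits at rank 7,
# so Recall@5 is 0 even though the corpus contains the answer.
retrieved = ["d3", "d9", "d1", "d4", "d2", "d8", "d7"]
relevant = ["d7"]
assert recall_at_k(retrieved, relevant, 5) == 0.0
assert recall_at_k(retrieved, relevant, 7) == 1.0
```

This is exactly the failure mode described above: bumping k from 5 to 7 (or improving the ranking) flips recall from 0 to 1 without touching the generator.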
The Generation Layer: Truth vs. Hallucination
Once the retriever hands over the documents, the generator takes over. This is where Faithfulness comes in. Faithfulness is the degree to which the generated answer is derived solely from the retrieved context without adding external, unverified information. This is the primary weapon against hallucinations. A response can be factually correct based on the LLM's internal training, but if it isn't in the retrieved documents, it isn't "faithful." In a corporate legal or medical setting, a "correct" answer that isn't grounded in your specific documents is a liability.
We also look at Groundedness. While similar to faithfulness, groundedness specifically checks if every claim in the response can be traced back to a specific sentence in the source. If the LLM says "The company grew by 20%" but the source only says "The company grew," the model is hallucinating the specific number. This is where FactScore becomes useful, breaking down a response into individual atomic facts and verifying each one against the source.
Another critical check is Completeness. Does the answer actually solve the user's problem, or did the model just summarize the first paragraph of the retrieved text and stop? A faithful but incomplete answer is just as useless as a complete but unfaithful one.
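A FactScore-style check can be approximated crudely without any model calls: split the answer into sentence-level claims and test lexical support against the source. This is a toy proxy, not the actual FactScore method; the regex tokenizer and the 0.6 threshold are arbitrary assumptions for the sketch:

```python
import re

def support_score(claim: str, context: str) -> float:
    """Crude lexical proxy: fraction of the claim's tokens found in the context."""
    claim_tokens = set(re.findall(r"[a-z0-9%]+", claim.lower()))
    context_tokens = set(re.findall(r"[a-z0-9%]+", context.lower()))
    return len(claim_tokens & context_tokens) / len(claim_tokens) if claim_tokens else 0.0

def groundedness(answer: str, context: str, threshold: float = 0.6):
    """Split the answer into sentence-level 'atomic facts' and flag each one
    as supported (True) or unsupported (False) by the retrieved context."""
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [(claim, support_score(claim, context) >= threshold) for claim in claims]

# The "20%" example from above: the source never states the number.
context = "The company grew this quarter."
answer = "The company grew. Growth was 20%."
results = groundedness(answer, context)
# results[0] -> ("The company grew.", True)
# results[1] -> ("Growth was 20%.", False)  <- hallucinated specific
```

A production system would replace `support_score` with an NLI model or a judge LLM per claim, but the shape of the check (decompose, then verify each fact) is the same.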
The End-to-End View: Putting it all Together
You can have a perfect retriever and a faithful generator, and the system can still fail. This is why end-to-end evaluation is the final step. This often involves LLM-as-a-Judge, where a more powerful model (like GPT-4o or Claude 3.5) acts as the grader. The judge is given the question, the retrieved context, and the generated answer, then asked to score it on a scale of 1-5 based on specific rubrics.
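In code, the judge setup is mostly prompt assembly plus strict parsing of the verdict. A minimal sketch, assuming you ask the judge to reply in JSON; the rubric wording and the `parse_verdict` helper are illustrative, not a standard API:

```python
import json

RUBRIC = """You are grading a RAG answer. Score it from 1 to 5:
1 = completely hallucinatory, unsupported by the context
3 = partially grounded, with some unsupported claims
5 = perfectly grounded and complete.
Reply as JSON: {"score": <int>, "reason": "<one sentence>"}"""

def build_judge_prompt(query: str, context: str, response: str) -> str:
    """Assemble the (Query, Context, Response) triple with the rubric."""
    return "\n\n".join([
        RUBRIC,
        f"QUESTION:\n{query}",
        f"RETRIEVED CONTEXT:\n{context}",
        f"ANSWER TO GRADE:\n{response}",
    ])

def parse_verdict(raw: str) -> tuple[int, str]:
    """Parse the judge's JSON reply; fail loudly on malformed output."""
    verdict = json.loads(raw)
    return int(verdict["score"]), str(verdict["reason"])
```

The prompt string then goes to whichever judge model you use; keeping the reasoning string alongside the score is what makes the grades debuggable.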
To make this scientific, many teams use a Golden Dataset: a hand-curated set of question-answer pairs that represent the "perfect" behavior of the system. By comparing the system's live output to this ground truth using semantic similarity, you can track if a new prompt version actually improves performance or just changes the wording.
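A golden-dataset regression check boils down to a loop over question-answer pairs, comparing live output to ground truth with cosine similarity. In this sketch, `answer_fn` and `embed_fn` are placeholders for your pipeline and your embedding model, and the 0.8 threshold is an arbitrary assumption you would tune:

```python
def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def regression_check(golden_set, answer_fn, embed_fn, threshold: float = 0.8) -> float:
    """Run every golden question through the live pipeline and return the
    fraction whose answer stays semantically close to the golden answer."""
    passed = 0
    for question, golden_answer in golden_set:
        live_answer = answer_fn(question)
        if cosine(embed_fn(live_answer), embed_fn(golden_answer)) >= threshold:
            passed += 1
    return passed / len(golden_set)
```

Run this on every prompt or chunking change: a drop in the pass rate tells you the change hurt something, even when individual spot-checks look fine.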
One professional tip for those in the trenches: look at Context Overlap. This measures the intersection between the answer and the retrieved documents. If you see high overlap but low correctness, your retriever is finding the right documents, but your LLM is failing to synthesize them. If you see low overlap and low correctness, your retriever is the problem.
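Context overlap itself reduces to a token-set intersection. A minimal sketch (whitespace tokenization is a simplification; real implementations typically normalize text and compare n-grams):

```python
def context_overlap(answer: str, retrieved_docs: list[str]) -> float:
    """Fraction of the answer's tokens that also appear in the retrieved context."""
    answer_tokens = set(answer.lower().split())
    context_tokens: set[str] = set()
    for doc in retrieved_docs:
        context_tokens |= set(doc.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)

# High overlap + low correctness  -> generator is failing to synthesize.
# Low overlap + low correctness   -> retriever is the problem.
```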
How to Optimize Based on Your Findings
Once your metrics show you where the leak is, don't just change the prompt. Target the specific failure point with these tactics:
- For Low Recall: Try Semantic Chunking, which breaks documents based on meaning rather than fixed character counts. If you cut a sentence in half, the retriever might miss the core concept.
- For Low Precision: Implement a Reranker. A reranker is a second-stage model that looks at the top 20 results from the retriever and re-sorts them more accurately before passing the top 5 to the LLM.
- For Low Faithfulness: Use Prompt Scaffolding. Explicitly tell the model: "Use ONLY the provided context. If the answer isn't there, say 'I don't know.' Do not use outside knowledge."
- For Domain-Specific Failures: Fine-tune your retriever using a contrastive loss function. For example, in a medical RAG, you want the system to know that "Cold" as a symptom is very different from "Cold" as a temperature.
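The reranker tactic above slots in as a second scoring pass between retrieval and generation. The sketch below uses a toy word-overlap scorer in place of a real cross-encoder model, just to show the two-stage shape:

```python
from typing import Callable

def rerank(query: str,
           candidates: list[str],
           score_fn: Callable[[str, str], float],
           top_n: int = 5) -> list[str]:
    """Second-stage rerank: score each (query, doc) pair with score_fn
    and keep only the best top_n for the LLM's context window."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)[:top_n]

def toy_score(query: str, doc: str) -> float:
    """Stand-in scorer: shared-word count. A real system would call a
    cross-encoder that reads the query and doc together."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

# First stage returns ~20 candidates; the reranker re-sorts and trims them.
candidates = ["blue bird sings", "red fox jumps high", "a fox"]
top = rerank("red fox", candidates, toy_score, top_n=2)
# top[0] -> "red fox jumps high"
```

Because the scorer is injected as a callable, you can swap the toy function for a model-backed one without changing the pipeline code.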
Testing these is an iterative game. You should test different chunk sizes (e.g., 400 vs 1200 characters) and different retrieval strategies (e.g., passing matched chunks vs passing the entire parent document) to see which combination maximizes your faithfulness and recall.
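That iteration loop can be expressed as a small grid search. Here `evaluate_fn` is a placeholder for whatever harness computes your recall or faithfulness score for a given configuration:

```python
from itertools import product
from typing import Callable

def sweep(chunk_sizes: list[int],
          strategies: list[str],
          evaluate_fn: Callable[[int, str], float]):
    """Score every (chunk_size, strategy) combination and return the
    configurations sorted best-first by the harness's score."""
    results = [((size, strategy), evaluate_fn(size, strategy))
               for size, strategy in product(chunk_sizes, strategies)]
    return sorted(results, key=lambda item: item[1], reverse=True)

# Example grid matching the article's suggestion: two chunk sizes,
# matched-chunks vs. parent-document retrieval.
grid = sweep([400, 1200], ["matched_chunks", "parent_doc"],
             lambda size, strategy: 0.0)  # plug in your real harness here
```

The point is to change one variable at a time and let the metrics, not intuition, pick the winning configuration.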
What is the difference between Groundedness and Correctness?
Groundedness only cares if the answer is based on the provided context. Correctness cares if the answer is actually true in the real world. For example, if your retrieved document contains a mistake saying "The moon is made of cheese" and the LLM repeats that, the answer is highly grounded (faithful to the source) but completely incorrect.
Why is Recall more important than Precision in some RAG pipelines?
In many cases, it's easier for a capable LLM to ignore a few irrelevant chunks (low precision) than to answer correctly when the retriever misses the key document entirely (low recall). If the data isn't in the prompt, the model cannot be faithful.
How does LLM-as-a-Judge work in practice?
You provide a "judge" LLM with a detailed rubric, for example a 1-5 scale where 1 is "completely hallucinatory" and 5 is "perfectly grounded and complete." The judge analyzes the triple (Query, Context, Response) and returns a score and a reasoning string, which helps developers understand why the system failed.
What is the best way to reduce hallucinations in a RAG system?
The most effective approach is a combination of high-precision retrieval (using a reranker), strict system prompting (forcing the model to cite its sources), and measuring faithfulness via a groundedness metric to identify and fix failure patterns during development.
Does semantic chunking always improve retrieval?
Not always, but usually. While fixed-size chunking is faster and simpler, semantic chunking ensures that related ideas stay together. This significantly improves Recall because the vector embedding of a complete thought is much more accurate than the embedding of a fragment.