RAG as an Agent Pattern

Retrieval-Augmented Generation (RAG) is how an agent grounds its answers in real data instead of relying on whatever the model memorized during training. The idea is straightforward: before the model generates a response, retrieve relevant documents from an external knowledge base and inject them into the context window. The model reads the evidence, then answers. This is one of the most effective techniques for reducing hallucination and keeping agent output anchored to facts that are current, private, or domain-specific.

We covered the retrieval pipeline at a high level when we discussed memory and context engineering. Here we go deeper — into the engineering decisions that determine whether RAG actually works in practice: how to split documents, how to represent them, how to get the right chunks back, and how the pattern evolves from a simple lookup into something an agent orchestrates dynamically.

The Core Pipeline

Every RAG system, no matter how sophisticated, is built on the same four-stage pipeline: chunk, embed, retrieve, generate. Documents go in one end, grounded answers come out the other.

                        Indexing (offline)
  ┌────────────┐     ┌──────────┐     ┌──────────────┐
  │ Documents  │───▶ │  Chunk   │───▶ │   Embed &    │
  │ (PDF, DB,  │     │  (split  │     │   store in   │
  │ wiki, API) │     │   text)  │     │ vector store │
  └────────────┘     └──────────┘     └──────────────┘

                        Querying (online)
  ┌───────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
  │  User     │───▶ │  Embed   │───▶ │ Retrieve │───▶ │ Generate │
  │  query    │     │  query   │     │  top-k   │     │  answer  │
  └───────────┘     └──────────┘     │  chunks  │     │  (LLM)   │
                                     └──────────┘     └──────────┘

During indexing, you break your documents into chunks, convert each chunk into a vector embedding, and store those embeddings in a vector database alongside the original text. During querying, you embed the user's question with the same model, search the vector store for the most similar chunks, and pass those chunks to the LLM as context for generating an answer.

This two-phase design is what makes RAG practical. The heavy work — processing and embedding thousands of documents — happens offline. The online path is fast: one embedding call, one vector search, one generation call. The model never sees the entire knowledge base. It sees only the handful of chunks the retrieval step deemed most relevant.

def rag_query(question: str, top_k: int = 5) -> str:
    # Embed the question
    query_vector = embedding_model.encode(question)

    # Retrieve the most similar chunks
    results = vector_store.search(query_vector, limit=top_k)
    context = "\n\n".join([r.text for r in results])

    # Generate an answer grounded in the retrieved context
    response = llm.complete(
        system="Answer the question using only the provided context. "
               "If the context does not contain the answer, say so.",
        prompt=f"Context:\n{context}\n\nQuestion: {question}"
    )
    return response.text

Simple as it looks, this pattern handles a surprising range of use cases — FAQ bots, documentation search, internal knowledge assistants. The engineering challenge is not the glue code. It is the quality of each stage: how you chunk, how you embed, and how you retrieve.
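
The query function above covers only the online half. As a rough sketch of the offline half, the following builds a tiny in-memory index and searches it. Everything here is an assumption made for illustration: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model (it matches shared words, not meaning), `IndexedChunk`, `build_index`, and `search` are made-up names, and a plain list plays the role of the vector store.

```python
import math
from collections import Counter
from dataclasses import dataclass


def toy_embed(text: str) -> dict[str, float]:
    """Hypothetical stand-in for an embedding model: a normalized sparse bag of words."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def dot(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two sparse vectors."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


@dataclass
class IndexedChunk:
    chunk_id: str
    text: str
    vector: dict[str, float]


def build_index(documents: dict[str, str], chunk_chars: int = 200) -> list[IndexedChunk]:
    """Offline phase: chunk each document, embed each chunk, store text next to its vector."""
    index = []
    for doc_id, text in documents.items():
        for n, start in enumerate(range(0, len(text), chunk_chars)):
            chunk = text[start:start + chunk_chars]
            index.append(IndexedChunk(f"{doc_id}#{n}", chunk, toy_embed(chunk)))
    return index


def search(index: list[IndexedChunk], query: str, limit: int = 3) -> list[IndexedChunk]:
    """Online phase: embed the query the same way, then rank chunks by similarity."""
    query_vector = toy_embed(query)
    ranked = sorted(index, key=lambda c: dot(query_vector, c.vector), reverse=True)
    return ranked[:limit]


index = build_index({
    "refunds": "Refunds are issued within 14 days of a return request.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
})
top = search(index, "how long do refunds take", limit=1)
print(top[0].chunk_id)
```

The point of the sketch is the shape of the data: each stored record keeps the original text alongside its vector, so the query path can hand readable text to the model after the vector comparison.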

Chunking and Embedding

Chunking is where most RAG systems succeed or fail, and it gets far less attention than it deserves. The goal is to split documents into pieces that are small enough to be specific but large enough to be self-contained. A chunk that is too small loses context — a sentence fragment about "the policy" is useless if the reader does not know which policy. A chunk that is too large dilutes the signal — a ten-page section retrieves well for any question about any topic it mentions, but buries the specific answer in noise.

Chunking Strategies

Fixed-size chunking splits text into blocks of N tokens (or characters) with some overlap between adjacent blocks. It is simple, predictable, and works well for homogeneous text. The overlap — typically 10-20% of the chunk size — ensures that sentences split across a boundary still appear in at least one chunk. This is the default starting point.

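# Requires the langchain-text-splitters package; older LangChain releases
# exposed the same class as langchain.text_splitter.RecursiveCharacterTextSplitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # measured in characters by default, not tokens
    chunk_overlap=64,   # 12.5% overlap between adjacent chunks
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_text(document_text)  # document_text: the raw document string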
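
To make the overlap mechanics concrete without a library, here is a minimal sliding-window sketch. It assumes whitespace-separated words as a stand-in for real tokens, and `chunk_fixed` is a hypothetical helper, not part of any library.

```python
def chunk_fixed(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


words = "one two three four five six seven eight nine ten".split()
chunks = chunk_fixed(words, size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts `size - overlap` tokens after the previous one, so the last token of one chunk reappears as the first token of the next.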

Semantic chunking uses the structure of the document to find natural break points — section headings, paragraph boundaries, topic shifts. It produces chunks that align with the author's intent, which generally retrieves better because each chunk covers a coherent idea. The trade-off is that it requires more preprocessing (Markdown headers, HTML tags, or an NLP model to detect topic boundaries), and chunk sizes vary unpredictably.
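
For documents that already carry structure, the simplest version of this idea is to split on headings. The sketch below assumes Markdown input with `#`-style headings; `chunk_by_headings` is a hypothetical helper, and it does not handle edge cases such as `#` lines inside code fences.

```python
import re


def chunk_by_headings(markdown: str) -> list[dict[str, str]]:
    """Return one chunk per heading-delimited section, keeping the heading as metadata."""
    chunks = []
    heading, lines = "", []

    def flush() -> None:
        # Emit the section collected so far, unless it is completely empty.
        if heading or any(line.strip() for line in lines):
            chunks.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s+\S", line):
            flush()
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


doc = "# Leave policy\nStaff get 25 days.\n\n## Carry-over\nUp to 5 days roll over.\n"
sections = chunk_by_headings(doc)
for section in sections:
    print(section)
```

Chunk sizes here follow the document, not a token budget, which is exactly the unpredictability the paragraph above warns about; oversized sections may still need a fixed-size pass.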

Hierarchical chunking creates chunks at multiple granularities — paragraphs, sections, and full documents — and indexes all of them. At retrieval time, you can match at the paragraph level for precision or the section level for broader context. Some systems retrieve a paragraph and then expand to its parent section before passing it to the model, giving the LLM both the specific hit and the surrounding context.
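
The paragraph-to-parent expansion can be sketched in a few lines. This assumes each paragraph record carries the ID of its parent section; the `Paragraph` class and `expand_to_sections` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    para_id: str
    section_id: str
    text: str


def expand_to_sections(
    hit_ids: list[str], paragraphs: list[Paragraph], max_sections: int = 2
) -> list[str]:
    """Map paragraph-level hits to the full text of their parent sections, deduplicated."""
    by_id = {p.para_id: p for p in paragraphs}
    section_text: dict[str, list[str]] = {}
    for p in paragraphs:  # paragraphs are assumed to be listed in document order
        section_text.setdefault(p.section_id, []).append(p.text)

    ordered: list[str] = []
    for hit in hit_ids:  # preserve the ranking order of the hits
        section_id = by_id[hit].section_id
        if section_id not in ordered:
            ordered.append(section_id)
    return ["\n".join(section_text[s]) for s in ordered[:max_sections]]


paragraphs = [
    Paragraph("p1", "leave", "Staff accrue leave monthly."),
    Paragraph("p2", "leave", "Unused leave expires in March."),
    Paragraph("p3", "travel", "Book travel through the portal."),
]
context = expand_to_sections(["p2", "p1"], paragraphs)
print(context)
```

Two hits in the same section collapse into one expanded section, which keeps the context window from filling up with duplicates.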

The right chunk size depends on the embedding model and the domain. Many widely used embedding models accept at most 256-512 tokens per input and truncate anything longer, so text beyond that limit never reaches the vector; very short chunks, in turn, carry too little context to embed distinctively. Experimentation is unavoidable: try two or three chunk sizes, run a set of test queries, and measure retrieval quality before committing.

Embeddings

An embedding model converts text into a dense vector (a list of floating-point numbers, typically 384 to 1536 dimensions) that captures the semantic meaning of the input. Two pieces of text about the same concept produce vectors that are close together in the embedding space, even if they use completely different words. This is the mechanism that makes semantic search possible: you find relevant chunks by measuring the distance between vectors, so a match does not depend on the query and the chunk sharing keywords.
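
"Close together" is usually measured with cosine similarity. The sketch below computes it by hand; the three-dimensional vectors are made up for illustration and do not come from any real model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 for similar direction, near 0.0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity with a zero vector is undefined; treat it as no match
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings
termination = [0.9, 0.1, 0.2]   # "employee termination procedures"
separation = [0.8, 0.2, 0.3]    # "separation policies"
invoice = [0.1, 0.9, 0.1]       # "invoice payment terms"

print(cosine_similarity(termination, separation))
print(cosine_similarity(termination, invoice))
```

The first pair scores much higher than the second, which is the property a vector search relies on when it ranks chunks against a query.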

The choice of embedding model matters a lot. A model trained on general web text will produce decent embeddings for most domains, but a model fine-tuned on your specific domain — legal documents, medical records, codebases — will capture domain-specific nuances that a general model misses. The practical trade-off is between embedding quality and operational simplicity. Fine-tuning an embedding model is an investment; using an off-the-shelf model is fast and often good enough.

from sentence_transformers import SentenceTransformer

# General-purpose model — good starting point
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384 dimensions

# Embed a chunk
vector = model.encode("The patient presented with acute respiratory distress.")
# vector.shape = (384,)

One decision that is easy to overlook: embed the chunks and the queries with the same model. If the chunk embeddings come from Model A and the query embedding comes from Model B, the vector spaces will not align and similarity scores become meaningless. This sounds obvious, but it is easy to get wrong in practice: if you upgrade the embedding model and do not re-index the existing chunks, queries and chunks end up in different spaces.
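
One cheap safeguard is to record the embedding model's name with the index and check it at query time. This is a sketch of that idea, not a feature of any particular vector database; `VectorIndex` and `EmbeddingModelMismatch` are hypothetical names.

```python
from dataclasses import dataclass, field


class EmbeddingModelMismatch(ValueError):
    """Raised when a query is embedded with a different model than the index."""


@dataclass
class VectorIndex:
    embedding_model: str  # name recorded when the index was built
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def check_query_model(self, query_model: str) -> None:
        """Refuse to search if the query vector comes from a different model."""
        if query_model != self.embedding_model:
            raise EmbeddingModelMismatch(
                f"index built with {self.embedding_model!r}, "
                f"query embedded with {query_model!r}; re-index first"
            )


index = VectorIndex(embedding_model="all-MiniLM-L6-v2")
index.check_query_model("all-MiniLM-L6-v2")  # same model: passes silently
```

Failing loudly here is preferable to returning results, because mismatched vector spaces still produce similarity scores; they are just meaningless.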

Retrieval Strategies

The retrieval step determines what the model actually sees, which makes it the highest-leverage point in the entire pipeline. A perfect chunking strategy with a mediocre retrieval strategy still produces mediocre results. Retrieval is where you choose between speed and precision, simplicity and sophistication.

The baseline approach: embed the query, score it against the chunk embeddings by cosine similarity (or dot product), and return the top-k results. Comparing against every chunk is exact but slow at scale, so modern vector databases use approximate nearest neighbor (ANN) indexes, which trade a small amount of accuracy for dramatically faster search and stay efficient even at millions of chunks.

Vector search excels at semantic matching. A query about "employee termination procedures" will retrieve chunks about "letting staff go" or "separation policies" even though the keywords do not overlap. But it struggles with exact-match needs — if the user asks for "error code E-4021," a semantic search might return chunks about error handling in general instead of the specific code.

Hybrid search combines vector similarity with keyword matching (BM25 or similar). The idea is to get the best of both worlds: semantic understanding for conceptual queries and lexical precision for specific terms. Most production RAG systems end up here.

  Query: "error code E-4021 in the payment module"
          │
          ├──▶ Vector search  ──▶  semantically similar chunks
          │
          └──▶ Keyword search ──▶  chunks containing "E-4021"
                    │
                    ▼
              Merge & rerank  ──▶  final top-k results

The merging step is where hybrid search gets interesting. You can use reciprocal rank fusion (RRF), which combines the rank positions from both search methods without needing to normalize scores. Or you can use a weighted combination of similarity scores, tuning the balance between semantic and lexical results. The right weight depends on your data — technical documentation with lots of codes and identifiers needs more lexical weight; conversational knowledge bases lean semantic.
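
Reciprocal rank fusion is short enough to write out. In this sketch each document's fused score is the sum of 1 / (k + rank) over every ranking it appears in; k = 60 is the constant commonly used for RRF, and the chunk IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists into one, using rank positions only (no score normalization)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["c7", "c2", "c9"]    # ranked by embedding similarity
keyword_hits = ["c4", "c7", "c1"]   # ranked by a keyword scorer such as BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

The chunk that appears in both lists ("c7") rises to the top even though it was not first in the keyword ranking, which is the behavior that makes RRF a reasonable default before tuning weighted score combinations.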

Reranking

A reranker is a second model that scores the relevance of each candidate chunk against the original query, after the initial retrieval step. The retrieval step is fast but approximate — it works from embeddings alone. The reranker is slower but more accurate — it reads both the query and the chunk text together and produces a fine-grained relevance score.

def retrieve_with_rerank(query: str, top_k: int = 5) -> list[str]:
    # Stage 1: fast retrieval — cast a wide net
    candidates = vector_store.search(embed(query), limit=top_k * 4)

    # Stage 2: rerank — score each candidate against the query
    scored = reranker.rank(query, [c.text for c in candidates])
    scored.sort(key=lambda x: x.score, reverse=True)

    return [s.text for s in scored[:top_k]]

A rough rule of thumb is that reranking improves precision by 10-25% over vector-only retrieval, though the actual figure depends on the data and the baseline, so measure it on your own test set. The cost is an extra model call per query that scales with the number of candidates. The standard pattern is to over-retrieve by 3-4x in the first stage, then rerank down to the final top-k. This keeps the reranker's input manageable while giving it a richer candidate pool to work with.

Graph-Based Retrieval

Standard vector search treats every chunk as an isolated island. It has no concept of relationships between entities or connections across documents. Graph-based retrieval addresses this by extracting entities and relationships from the text, building a knowledge graph alongside the vector index, and using graph traversal during retrieval.

During indexing, an LLM extracts entities (people, organizations, concepts) and their relationships from each chunk, then organizes them into a graph. Each entity and relationship gets a text summary and embedding. During retrieval, the system matches at two levels: low-level retrieval finds specific entities and their direct relationships, while high-level retrieval traverses the graph to find broader themes and multi-hop connections.

  ┌──────────────────────────────────────────────────┐
  │              Knowledge Graph                     │
  │                                                  │
  │   [Cardiologist]──diagnoses──▶[Heart Disease]    │
  │        │                           │             │
  │     treats                     requires          │
  │        │                           │             │
  │        ▼                           ▼             │
  │   [Patient]──covered by──▶[Insurance Policy]     │
  │                                                  │
  └──────────────────────────────────────────────────┘

  Low-level query:  "Who diagnoses heart disease?"
    → retrieves: Cardiologist ──diagnoses──▶ Heart Disease

  High-level query: "How does cardiac care relate to insurance?"
    → traverses: Cardiologist → treats → Patient → covered by → Insurance Policy

Graph-based retrieval shines when the answer requires connecting information scattered across multiple documents — "which team owns the service that caused last week's outage?" requires linking a team, a service, and an incident across three different sources. The cost is significant: building and maintaining the graph requires LLM calls for entity extraction, deduplication logic, and ongoing updates as documents change. It is worth the complexity when your data has rich entity relationships and your queries frequently span multiple concepts.

The RAG Spectrum

Not every RAG system needs to be the same. There is a spectrum of complexity, and the right choice depends on the query patterns, accuracy requirements, and budget. Understanding where your use case falls on this spectrum saves you from over-engineering simple problems or under-engineering hard ones.

Naive RAG is the simplest form: retrieve chunks by keyword or basic embedding similarity, pass them straight to the model, and return whatever it generates. There is no reranking, no query rewriting, no validation. It works for narrow domains with predictable queries — an internal FAQ bot, a documentation search where the questions closely match the document language. It is fast, cheap, and easy to debug.

Simple RAG adds the quality-of-life features that make retrieval more robust: hybrid search, chunk overlap, a relevance threshold that filters out low-scoring results rather than always returning top-k. It is the sweet spot for most production deployments. The incremental cost over naive RAG is small, and the quality improvement is significant.
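
The relevance threshold is the least obvious of these features, so here is a minimal sketch. It assumes the search step returns (chunk_id, score) pairs with higher scores meaning more relevant; the 0.5 cutoff is an arbitrary illustration, since score scales differ between embedding models and must be calibrated on real queries.

```python
def filter_by_threshold(
    results: list[tuple[str, float]], min_score: float, top_k: int
) -> list[tuple[str, float]]:
    """Keep at most top_k results, and only those scoring at or above min_score."""
    kept = [(chunk_id, score) for chunk_id, score in results if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Made-up similarity scores for illustration
results = [("a", 0.82), ("c", 0.41), ("b", 0.79), ("d", 0.12)]
kept = filter_by_threshold(results, min_score=0.5, top_k=3)
print(kept)
```

Returning two strong chunks instead of padding the list to three is the whole point: an empty or short result lets the model say the context is insufficient instead of reasoning over noise.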

Self-RAG introduces a feedback loop. Before retrieving, the system rewrites the query to make it more specific or to fill in implied context. After generating, it evaluates whether the answer is actually supported by the retrieved evidence. If the answer fails the check, it reformulates and tries again. This is valuable when queries are vague or ambiguous, but it doubles or triples the latency because of the extra model calls.

def self_rag_query(question: str) -> str:
    # Step 1: rewrite the query for better retrieval
    refined_query = llm.complete(
        prompt=f"Rewrite this question to be more specific and search-friendly:\n{question}"
    ).text

    # Step 2: retrieve and generate
    context = retrieve(refined_query)
    answer = generate(question, context)

    # Step 3: self-evaluate — is the answer grounded?
    evaluation = llm.complete(
        prompt=f"Is this answer fully supported by the context?\n"
               f"Context: {context}\nAnswer: {answer}\n"
               f"Respond YES or NO with a brief explanation."
    ).text

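    # Check the leading verdict only; a substring test would also match "NOT" or "KNOWN"
    if evaluation.strip().upper().startswith("NO"):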
        # Reformulate and retry
        better_query = llm.complete(
            prompt=f"The answer was not well-supported. Suggest a better search query.\n"
                   f"Original: {question}\nPrevious query: {refined_query}"
        ).text
        context = retrieve(better_query)
        answer = generate(question, context)

    return answer

Corrective RAG focuses specifically on post-retrieval validation. After retrieving chunks, a separate evaluation step scores each chunk for relevance. Low-scoring chunks are dropped. If too few high-quality chunks remain, the system triggers a new retrieval with a modified query — sometimes falling back to a web search. The corrective step acts as a quality gate between retrieval and generation.
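
The quality gate can be sketched as a function that takes its collaborators as parameters. All four callables are assumptions: in a real system `grade` would be an LLM or a trained relevance classifier, `rewrite` an LLM call, and `fallback` something like a web search. The stubs below exist only so the sketch runs.

```python
from typing import Callable


def corrective_retrieve(
    query: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[str, str], float],
    rewrite: Callable[[str], str],
    fallback: Callable[[str], list[str]],
    min_score: float = 0.5,
    min_chunks: int = 2,
) -> list[str]:
    """Drop low-scoring chunks; retry with a rewritten query, then fall back, if too few remain."""
    chunks = [c for c in retrieve(query) if grade(query, c) >= min_score]
    if len(chunks) >= min_chunks:
        return chunks

    # Second attempt: modified query, graded against the original question
    for c in retrieve(rewrite(query)):
        if c not in chunks and grade(query, c) >= min_score:
            chunks.append(c)
    if len(chunks) >= min_chunks:
        return chunks

    # Last resort: a different source entirely
    return chunks + [c for c in fallback(query) if c not in chunks]


# Stub collaborators, for illustration only
corpus = {
    "reset password": ["To reset your password, open Settings.", "Our office is closed on Sundays."],
    "reset account password steps": ["Password resets require email verification."],
}

def retrieve(q: str) -> list[str]:
    return corpus.get(q, [])

def grade(q: str, chunk: str) -> float:
    return 1.0 if "password" in chunk.lower() else 0.0

def rewrite(q: str) -> str:
    return "reset account password steps"

def fallback(q: str) -> list[str]:
    return ["(web) Password reset guide"]

result = corrective_retrieve("reset password", retrieve, grade, rewrite, fallback)
print(result)
```

Here the first retrieval yields one relevant chunk and one irrelevant one, so the gate drops the irrelevant chunk, retries with the rewritten query, and stops as soon as two good chunks are in hand.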

Agentic RAG is where the retrieval pattern merges with the agent loop we discussed earlier. Instead of a fixed retrieve-then-generate pipeline, an agent decides how to retrieve. It might query one knowledge base, examine the results, decide they are insufficient, query a different source, combine the findings, and only then generate an answer. The retrieval strategy becomes part of the agent's reasoning — a tool it wields.

tools = [
    {
        "name": "search_docs",
        "description": "Search the internal documentation knowledge base",
        "parameters": {"query": "string", "top_k": "integer"}
    },
    {
        "name": "search_tickets",
        "description": "Search resolved support tickets",
        "parameters": {"query": "string", "status": "string"}
    },
    {
        "name": "search_code",
        "description": "Search the codebase for function definitions and usage",
        "parameters": {"query": "string", "language": "string"}
    }
]

# The agent decides WHICH knowledge base to query, and HOW to combine results.
# It might search docs first, realize it needs code context,
# then search the codebase, and finally synthesize an answer.

Agentic RAG is the most flexible and the most expensive. Each retrieval decision costs a model call, and the agent might make several before it is satisfied. But for complex questions that span multiple data sources — "why is service X returning 500 errors after last Tuesday's deploy?" — the agent's ability to reason about what to retrieve and where to look is exactly what a fixed pipeline cannot provide.
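
The control flow behind this is a small loop: ask the model which tool to call, run it, feed the result back, and repeat until the model chooses to answer. In the sketch below the model's decision is replaced by `scripted_decide`, a hypothetical hard-coded policy, and the search tools are stubs; only the loop structure is the point.

```python
def search_docs(query: str, top_k: int = 3) -> list[str]:
    return [f"doc hit for '{query}'"]  # stub standing in for a real documentation search


def search_code(query: str, language: str = "python") -> list[str]:
    return [f"{language} code hit for '{query}'"]  # stub standing in for a real code search


def scripted_decide(question: str, findings: list[tuple[str, list[str]]]) -> dict:
    """Stand-in for the LLM's tool-choice step: docs first, then code, then answer."""
    used = {tool for tool, _ in findings}
    if "search_docs" not in used:
        return {"tool": "search_docs", "args": {"query": question}}
    if "search_code" not in used:
        return {"tool": "search_code", "args": {"query": question, "language": "python"}}
    return {"tool": "answer", "text": f"Answer based on {len(findings)} retrievals."}


def run_agent(question: str, tools: dict, decide, max_steps: int = 4):
    """Loop: decide, call the chosen tool, record the result, until the policy answers."""
    findings: list[tuple[str, list[str]]] = []
    for _ in range(max_steps):
        action = decide(question, findings)
        if action["tool"] == "answer":
            return action["text"], findings
        result = tools[action["tool"]](**action["args"])
        findings.append((action["tool"], result))
    return "No answer within the step budget.", findings


tools = {"search_docs": search_docs, "search_code": search_code}
answer, trace = run_agent("why does service X return 500s?", tools, scripted_decide)
print(answer)
```

The `max_steps` budget matters in practice: since every iteration costs a model call, an explicit cap is what keeps the flexibility from turning into unbounded latency.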

The progression is clear: each step up the spectrum adds a feedback loop or a decision point, trading latency and cost for retrieval quality and answer accuracy. Most projects should start at simple RAG and move up only when they can measure the quality gap.

Grounding and Retrieval Quality

RAG exists to ground the model's output in evidence. But retrieval is only as good as what comes back — and measuring retrieval quality is surprisingly tricky.

Failure Modes

Retrieval can fail in several ways, each with different downstream effects.

False positives — retrieving chunks that look relevant but are not. A chunk about "Python memory management" retrieves for a query about "Python data types" because the embedding space sees them as related. The model reads the irrelevant chunk, gets confused, and either generates a wrong answer or hedges unnecessarily. False positives are insidious because the user often cannot tell whether the retrieved evidence was good.

False negatives — missing chunks that are relevant. The answer exists in the knowledge base, but the retrieval step does not find it. This happens when the query and the relevant chunk use different vocabulary, when the chunk is buried in a large document and the embedding is diluted, or when the chunk size is wrong for the query. The model, seeing no relevant evidence, either says "I don't know" (good) or hallucinates an answer from its training data (bad).

Stale data — retrieving chunks from documents that are out of date. The knowledge base says the API endpoint is /v2/users, but it was changed to /v3/users six months ago. The model confidently returns the old endpoint. This is a data freshness problem, not a retrieval problem, but it manifests the same way — the model produces a grounded but wrong answer.

Measuring Quality

The standard retrieval metrics are precision (what fraction of retrieved chunks are relevant) and recall (what fraction of relevant chunks are retrieved). In practice, you measure these with a test set: a collection of questions paired with the chunks that should be retrieved for each.

def evaluate_retrieval(test_set: list[dict], retriever) -> dict:
    precisions, recalls = [], []

    for case in test_set:
        query = case["question"]
        expected = set(case["relevant_chunk_ids"])
        # Assumes retriever.search returns an iterable of chunk IDs
        retrieved = set(retriever.search(query, top_k=5))

        hits = len(expected & retrieved)
        # Guard both denominators so empty results or empty labels score 0.0
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(expected) if expected else 0.0

        precisions.append(precision)
        recalls.append(recall)

    return {
        "mean_precision": sum(precisions) / len(precisions),
        "mean_recall": sum(recalls) / len(recalls),
    }

End-to-end metrics matter too. Retrieval precision tells you whether you are fetching the right chunks, but it does not tell you whether the final answer is correct. For that, you need answer correctness (does the generated answer match the expected answer) and faithfulness (does the generated answer stick to what the retrieved chunks say, or does it hallucinate beyond the evidence). Faithfulness is the RAG-specific metric — it measures whether the grounding actually worked.
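
Faithfulness is normally scored with an LLM judge or an entailment model. As a very crude lexical proxy, useful mainly as a smoke test, one can check what fraction of answer sentences are made up mostly of words that appear in the retrieved context. The function name, the tokenization, and the 0.6 cutoff below are all assumptions chosen for illustration, and word overlap cannot detect a sentence that reuses context words to say something false.

```python
import re

WORD = re.compile(r"[a-z0-9/]+")


def support_ratio(answer: str, context: str, min_overlap: float = 0.6) -> float:
    """Fraction of answer sentences whose words mostly occur in the context (lexical proxy only)."""
    context_words = set(WORD.findall(context.lower()))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = set(WORD.findall(sentence.lower()))
        if words and len(words & context_words) / len(words) >= min_overlap:
            supported += 1
    return supported / len(sentences)


context = "The users endpoint is /v3/users. It requires an API key."
answer = "The users endpoint is /v3/users. It was introduced in 2019."
print(support_ratio(answer, context))
```

The second sentence of the answer adds a claim the context never makes, so only one of the two sentences counts as supported.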

Practical Improvements

A few techniques consistently improve RAG quality across domains.

Metadata filtering. Attach metadata to each chunk at indexing time (source document, date, category, access level) and filter on it during retrieval. A query about "2025 Q4 revenue" should not retrieve chunks from 2023. Applied as a pre-filter, this narrows the search space before vector similarity even runs, which improves both precision and speed.
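
A pre-filter is just an exact-match pass over metadata that runs before any similarity scoring. The sketch below assumes chunks are stored as dictionaries with a `metadata` field; the field names and the revenue figures are made up, and most vector databases expose an equivalent filter argument on their search call.

```python
def prefilter(chunks: list[dict], **required) -> list[dict]:
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [
        chunk for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in required.items())
    ]


# Made-up sample data for illustration
chunks = [
    {"id": "c1", "text": "Q4 revenue grew 12%.", "metadata": {"year": 2025, "quarter": "Q4"}},
    {"id": "c2", "text": "Q4 revenue was flat.", "metadata": {"year": 2023, "quarter": "Q4"}},
    {"id": "c3", "text": "Q1 revenue dipped.", "metadata": {"year": 2025, "quarter": "Q1"}},
]
candidates = prefilter(chunks, year=2025, quarter="Q4")
print([c["id"] for c in candidates])
```

Note that "c1" and "c2" would look almost identical to an embedding model; the year filter is the only thing that reliably separates them.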

Query expansion. Generate multiple search queries from the user's question, retrieve for each, then merge the results. This catches relevant chunks that a single query phrasing might miss. It costs one extra LLM call to generate the variants, plus an embedding call and a search per variant, but the recall improvement is often worth it.

def expanded_retrieval(question: str, top_k: int = 5) -> list[str]:
    # Generate alternative query phrasings, dropping blank lines from the output
    raw = llm.complete(
        prompt=f"Generate 3 alternative search queries for: {question}\n"
               f"Return one per line, no numbering."
    ).text
    expansions = [line.strip() for line in raw.splitlines() if line.strip()]

    # Retrieve for each query and merge
    all_results = {}
    for query in [question] + expansions:
        for result in vector_store.search(embed(query), limit=top_k):
            if result.id not in all_results:
                all_results[result.id] = result

    # Rerank the merged pool and keep the highest-scoring chunks
    merged = list(all_results.values())
    scored = reranker.rank(question, [r.text for r in merged])
    scored.sort(key=lambda x: x.score, reverse=True)
    return [s.text for s in scored[:top_k]]

Contextual enrichment. When indexing, prepend each chunk with a brief summary of the document it came from or the section heading it falls under. This gives the embedding model more context about what the chunk is about, reducing the chance that a chunk about "the policy" embeds generically instead of specifically.
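
In its simplest form, enrichment is string concatenation done at indexing time. The sketch below assumes the document title and section heading are available for each chunk; `enrich_for_embedding` and the record layout are hypothetical.

```python
def enrich_for_embedding(chunk_text: str, doc_title: str, section: str = "") -> str:
    """Prefix a chunk with its document title and section heading before embedding it."""
    header = f"{doc_title} > {section}" if section else doc_title
    return f"{header}\n\n{chunk_text}"


chunk = "The policy applies to all full-time staff."
record = {
    "text": chunk,  # original text, shown to the model at generation time
    "embed_text": enrich_for_embedding(chunk, "Employee Handbook", "Parental Leave Policy"),
}
print(record["embed_text"])
```

Storing both versions is a design choice: the enriched text goes to the embedding model, while the original text can be passed to the LLM unchanged if the added header would be redundant there.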

Incremental indexing. When documents change, re-index only the affected chunks rather than rebuilding the entire vector store. This keeps the knowledge base fresh without the cost or downtime of a full re-index. It requires tracking which chunks came from which source documents and maintaining a mapping between them.
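
One common way to detect what changed is to store a content hash per source document and compare hashes on each indexing run. The sketch below works at document granularity for brevity (hashing each chunk works the same way); `plan_reindex` is a hypothetical helper that only plans the work and leaves the actual upserts and deletes to the vector store.

```python
import hashlib


def plan_reindex(
    previous: dict[str, str], current_docs: dict[str, str]
) -> tuple[list[str], list[str], dict[str, str]]:
    """Compare stored content hashes with current documents.

    Returns (doc IDs to re-chunk and upsert, doc IDs whose chunks to delete, new hash map).
    """
    new_hashes = {
        doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for doc_id, text in current_docs.items()
    }
    to_upsert = sorted(d for d, h in new_hashes.items() if previous.get(d) != h)
    to_delete = sorted(d for d in previous if d not in new_hashes)
    return to_upsert, to_delete, new_hashes


# First run: nothing indexed yet, so everything is new
upsert_1, delete_1, hashes = plan_reindex({}, {"a": "version 1", "b": "version 1"})

# Second run: "a" was removed, "b" changed, "c" was added
upsert_2, delete_2, hashes = plan_reindex(hashes, {"b": "version 2", "c": "version 1"})
print(upsert_2, delete_2)
```

Handling deletions explicitly matters as much as handling edits: chunks from a removed document are a direct source of the stale-data failure described earlier.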

Conclusion

RAG is the bridge between an agent's knowledge base and its context window — the mechanism that turns a general-purpose model into one that can answer questions about your data.

Key takeaways:

  • The core pipeline is chunk, embed, retrieve, generate — simple in concept, but each stage involves engineering trade-offs that directly affect answer quality
  • Chunking strategy matters — chunk size, overlap, and boundary detection determine whether the right information is retrievable at all
  • Hybrid search (vector + keyword) outperforms pure vector search in most production settings, and reranking adds another measurable quality boost
  • Graph-based retrieval captures entity relationships that flat vector search misses, at the cost of significant indexing complexity
  • The RAG spectrum — from naive to agentic — represents a trade-off between simplicity and retrieval sophistication; start simple and move up only when you can measure the gap
  • Grounding is not automatic; measuring retrieval precision and answer faithfulness, and building feedback loops on top of those measurements, is what separates a demo from a production system
  • RAG becomes most powerful when it is a tool the agent can reason about — deciding what to retrieve, where to look, and when to try again