AI/ML Engineering

RAG Explained: Building AI Applications That Know Your Data

A practical guide to building Retrieval-Augmented Generation pipelines -- from document chunking and embedding to hybrid retrieval and evaluation metrics.

Abhishek Patel · 9 min read



Why Your AI App Gives Wrong Answers -- and How RAG Fixes It

Retrieval-Augmented Generation -- RAG -- is the dominant pattern for building AI applications that can answer questions about your own data. Instead of cramming everything into a fine-tuned model or hoping the LLM memorized what you need, RAG retrieves relevant documents at query time and feeds them to the model as context. It's the difference between asking someone to recall a book from memory versus handing them the relevant pages and asking them to answer.

I've built RAG pipelines for everything from internal knowledge bases to customer support systems, and the pattern is deceptively simple on the surface. The devil is in the chunking, retrieval, and evaluation details. Get those wrong, and you'll get confident-sounding garbage. Get them right, and you have a system that's genuinely useful.

What Is Retrieval-Augmented Generation (RAG)?

Definition: Retrieval-Augmented Generation (RAG) is an architecture pattern where an AI application retrieves relevant documents from an external knowledge base at query time, then passes those documents as context to a large language model to generate grounded, accurate responses.

The core idea is straightforward: instead of relying solely on what the LLM learned during pre-training, you give it access to your specific data at inference time. This solves three major problems with vanilla LLMs:

  • Hallucination -- the model makes up facts because it doesn't have the right information
  • Stale knowledge -- the model's training data has a cutoff date
  • No access to private data -- the model never saw your internal docs, databases, or proprietary content

The RAG Pipeline: End to End

A RAG system has two phases: an ingestion pipeline (offline, runs ahead of time) and a query pipeline (online, runs per user request). Here's how they fit together.

Step 1: Document Ingestion

Collect your source documents -- PDFs, web pages, Markdown files, database records, Confluence pages, Slack messages. Each source needs a loader that extracts clean text. Libraries like LangChain, LlamaIndex, and Unstructured provide pre-built loaders for most formats. The quality of your extraction directly limits the quality of your answers.
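If your sources are plain files, a loader can be little more than a directory walk. Here's a minimal sketch for Markdown files -- a toy stand-in for the libraries above, which also handle parsing, front matter, and dozens of formats:

```python
from pathlib import Path

def load_markdown_docs(root: str) -> dict[str, str]:
    """Load every .md file under `root` into a {path: text} map."""
    docs = {}
    for path in Path(root).rglob("*.md"):
        # Read as UTF-8 and normalize line endings. Real loaders also
        # strip markup, front matter, and navigation boilerplate.
        docs[str(path)] = path.read_text(encoding="utf-8").replace("\r\n", "\n")
    return docs
```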

Step 2: Chunking

Split documents into smaller pieces. This is where most RAG pipelines succeed or fail. Chunks need to be small enough to be specific but large enough to carry meaningful context. There are three main strategies:

| Strategy | How It Works | Best For | Typical Size |
|---|---|---|---|
| Fixed-size | Split every N tokens/characters with overlap | Quick prototypes, uniform content | 256-512 tokens |
| Semantic | Split at natural boundaries (paragraphs, sections) using embedding similarity | Long-form documents, varied structure | 200-800 tokens |
| Hierarchical | Maintain parent-child relationships (doc > section > paragraph) | Complex documents, multi-level retrieval | Multiple levels |

Pro tip: Start with fixed-size chunks of 512 tokens and 50-token overlap. This works surprisingly well for most use cases. Only move to semantic or hierarchical chunking when you've confirmed that chunk boundaries are actually causing retrieval failures.
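The fixed-size strategy is simple enough to sketch in a few lines. This version splits on whitespace for illustration; a production pipeline would count tokens with the embedding model's actual tokenizer:

```python
def chunk_fixed(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` tokens, with `overlap` tokens
    shared between consecutive chunks so sentences cut at a boundary
    still appear whole in at least one chunk."""
    tokens = text.split()  # simplification: real code uses a tokenizer
    step = size - overlap
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), step)]
```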

Step 3: Embedding Generation

Convert each chunk into a vector embedding -- a dense numerical representation that captures semantic meaning. Two chunks about the same topic will have similar embeddings, even if they use different words.

from openai import OpenAI

client = OpenAI()

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    # One API call embeds the whole batch; the response preserves input order.
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks
    )
    return [item.embedding for item in response.data]

Popular embedding models include OpenAI's text-embedding-3-small (1536 dimensions, cheap), Cohere's embed-v3, and open-source options like BAAI/bge-large-en-v1.5 that you can self-host.

Step 4: Vector Storage

Store embeddings in a vector database alongside the original text and metadata. At query time, you'll search this store for chunks similar to the user's question. Options range from PostgreSQL with pgvector to purpose-built databases like Pinecone, Weaviate, and Qdrant.
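To demystify what the database is doing, here's a brute-force version of the similarity search: plain cosine similarity over an in-memory list. Real vector databases get the same answer faster using approximate nearest-neighbor indexes like HNSW, but the semantics are identical:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """`store` is a list of (chunk_text, embedding) pairs.
    Return the k chunks whose embeddings are closest to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```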

Step 5: Retrieval

When a user asks a question, embed their query using the same embedding model, then find the most similar chunks in your vector store. This is where retrieval strategy matters enormously.

| Retrieval Type | Mechanism | Strengths | Weaknesses |
|---|---|---|---|
| Dense retrieval | Cosine similarity between query and chunk embeddings | Semantic understanding, handles synonyms | Misses exact keyword matches |
| Sparse retrieval | BM25 / TF-IDF keyword matching | Precise keyword matching, fast | No semantic understanding |
| Hybrid retrieval | Combines dense + sparse with reciprocal rank fusion | Best of both worlds | More complex to implement and tune |

Watch out: Dense retrieval alone will miss queries that use exact terminology, product names, or codes. If a user asks about "ERR-4012" and your chunks contain that error code, BM25 will find it instantly while dense retrieval might not. Always benchmark hybrid retrieval against dense-only for your specific data.
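Reciprocal rank fusion -- the merging step from the table above -- is short enough to show in full. It takes one ranked list of chunk IDs per retriever and scores each chunk by 1/(k + rank), with k=60 as the conventional constant from the original RRF paper:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs (e.g. one from dense retrieval,
    one from BM25). A chunk earns 1/(k + rank) from each list it
    appears in, so items ranked well by both retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```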

Step 6: Context Injection and Generation

Take the retrieved chunks, assemble them into a prompt with the user's question, and send it to the LLM. The prompt template matters -- you need to instruct the model to answer based on the provided context and to say "I don't know" when the context doesn't contain the answer.

def build_rag_prompt(question: str, chunks: list[str]) -> str:
    context = "\n\n---\n\n".join(chunks)
    return f"""Answer the question based ONLY on the following context.
If the context doesn't contain enough information, say "I don't have
enough information to answer that."

Context:
{context}

Question: {question}

Answer:"""

Common RAG Failure Modes

I've debugged dozens of RAG systems, and the failures fall into predictable categories:

  1. Bad chunking -- relevant information is split across chunk boundaries, so no single chunk contains a complete answer
  2. Wrong retrieval -- the top-K chunks don't contain the answer, even though it exists in the corpus. Usually a mismatch between query phrasing and document phrasing
  3. Lost in the middle -- the LLM ignores relevant context buried in the middle of a long prompt. Put the most relevant chunks first
  4. Context poisoning -- irrelevant or contradictory chunks confuse the model. Fewer high-quality chunks beat many low-quality ones
  5. Stale embeddings -- source documents changed but embeddings weren't re-generated

How to Evaluate a RAG Pipeline

Direct answer: Evaluate RAG pipelines using three metrics: faithfulness (does the answer match the retrieved context?), answer relevance (does the answer address the question?), and context precision (are the retrieved chunks actually relevant?). Tools like RAGAS automate this evaluation.

Key Evaluation Metrics

| Metric | What It Measures | How to Compute |
|---|---|---|
| Faithfulness | Is the answer grounded in retrieved context? | LLM-as-judge checks each claim against context |
| Answer Relevance | Does the answer actually address the question? | Generate questions from the answer, compare to original |
| Context Precision | Are retrieved chunks relevant? | Check if top-K chunks contain ground-truth answer |
| Context Recall | Did we retrieve all relevant chunks? | Compare retrieved set to complete relevant set |

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Prepare the evaluation dataset; RAGAS expects a Hugging Face Dataset
# with these exact column names
eval_data = {
    "question": ["What is our refund policy?"],
    "answer": ["Our refund policy allows returns within 30 days."],
    "contexts": [["Refund Policy: Customers may return items within 30 days..."]],
    "ground_truth": ["Customers can return items within 30 days of purchase."]
}

result = evaluate(
    Dataset.from_dict(eval_data),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)

RAG Pipeline Costs: What to Expect

RAG isn't free, and costs can surprise you at scale. Here's a realistic breakdown:

| Component | Cost Driver | Typical Range |
|---|---|---|
| Embedding generation | Per-token API pricing | $0.02-0.13 per 1M tokens |
| Vector storage | Vectors stored + queries/sec | $0-70/month for small-to-mid workloads |
| LLM generation | Input + output tokens per query | $0.15-15 per 1M input tokens |
| Re-indexing | How often source data changes | Embedding cost × corpus size |

Pro tip: The single biggest cost lever is how many tokens you feed to the LLM per query. Retrieve 5 chunks of 512 tokens each, and you're sending roughly 2,560 extra input tokens per request. At GPT-4o's input pricing (about $2.50 per 1M tokens), that's roughly $0.0064 per query -- at 100K queries/month, over $600 just for context tokens. Optimize retrieval precision to reduce the number of chunks needed.
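That arithmetic is worth wiring into a small helper so you can sanity-check costs before shipping. The price you pass in is an assumption -- check your provider's current rate card:

```python
def context_cost_per_month(chunks_per_query: int, tokens_per_chunk: int,
                           queries_per_month: int,
                           usd_per_1m_input_tokens: float) -> float:
    """Back-of-envelope monthly cost of retrieved context tokens alone
    (excludes the question, system prompt, and output tokens)."""
    tokens_per_query = chunks_per_query * tokens_per_chunk
    return tokens_per_query * queries_per_month * usd_per_1m_input_tokens / 1_000_000
```

For example, 5 chunks of 512 tokens at 100K queries/month and an assumed $2.50 per 1M input tokens works out to $640/month in context tokens alone.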

Frequently Asked Questions

What is the difference between RAG and fine-tuning?

RAG retrieves external knowledge at query time and injects it into the prompt. Fine-tuning modifies the model's weights so it internalizes specific knowledge or behavior. RAG is better for factual, frequently-updated data. Fine-tuning is better for consistent style, format, or specialized reasoning. Many production systems combine both.

How many chunks should I retrieve per query?

Start with 3-5 chunks and measure answer quality. More chunks provide more context but increase cost and risk of confusing the model with irrelevant content. Use a reranker (like Cohere Rerank or a cross-encoder) to score and filter retrieved chunks before passing them to the LLM.

Can RAG work with images and other non-text data?

Yes. Multimodal RAG pipelines use vision models to extract descriptions from images, then embed those descriptions alongside text. You can also use multimodal embedding models like CLIP to embed images directly. This is still maturing but works well for product catalogs and documentation with diagrams.

What's the best vector database for RAG?

If you already use PostgreSQL, start with pgvector. It handles millions of vectors without adding infrastructure. Move to Pinecone, Weaviate, or Qdrant when you need sub-10ms latency at scale, advanced filtering, or features like built-in hybrid search. Most teams over-invest in the vector DB and under-invest in chunking.

How do I handle documents that change frequently?

Implement an incremental ingestion pipeline that watches for changes (via webhooks, polling, or change data capture) and re-embeds only the affected chunks. Track document versions and chunk lineage so you can invalidate stale embeddings. Most vector databases support upsert operations for this purpose.
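A content hash per document is the simplest change-detection mechanism. A sketch, assuming you persist a document-ID-to-hash map alongside your vectors:

```python
import hashlib

def changed_docs(current: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return IDs of documents whose content hash differs from the stored
    one (or is new). Only these need re-chunking and re-embedding."""
    changed = []
    for doc_id, text in current.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if stored_hashes.get(doc_id) != digest:
            changed.append(doc_id)
    return changed
```

After re-embedding, upsert the affected chunks by ID and update the stored hashes in the same transaction, so a failed run doesn't mark stale documents as fresh.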

Does RAG eliminate hallucinations completely?

No. RAG significantly reduces hallucinations by grounding the model in retrieved context, but the model can still hallucinate details not present in the chunks, misinterpret the context, or combine information from multiple chunks incorrectly. Always evaluate faithfulness and consider adding citation extraction to your pipeline.

What is hybrid search in RAG and why does it matter?

Hybrid search combines dense vector retrieval (semantic similarity) with sparse keyword retrieval (BM25). Dense retrieval understands meaning but can miss exact terms. Sparse retrieval finds exact matches but misses synonyms. Combining both with reciprocal rank fusion gives the best retrieval accuracy for most real-world queries.

Build It Incrementally

The best RAG systems I've seen were built incrementally. Start with the simplest possible pipeline -- fixed-size chunks, dense retrieval, a basic prompt template. Measure with a small evaluation dataset. Then improve one component at a time: better chunking, hybrid retrieval, reranking, prompt optimization. Each change should measurably improve your evaluation metrics. If it doesn't, revert it. RAG is an engineering discipline, not a collection of tricks.


Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
