RAG Architecture: The Three Stages
Explore the complete RAG pipeline: indexing (document preparation), retrieval (semantic search), and generation (grounded response). Understand how each stage connects.
RAG systems have three stages, each with its own challenges and optimization opportunities. Understanding the full pipeline is essential before diving into each component.
The Complete RAG Pipeline
STAGE 1: INDEXING (Offline — runs when documents change)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Documents → Parse → Chunk → Embed → Store in Vector DB
STAGE 2: RETRIEVAL (Online — runs per query)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
User Query → Embed Query → Search Vector DB → Rank Results
STAGE 3: GENERATION (Online — runs per query)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Retrieved Context + Query → LLM → Grounded Response
Stage 1: Indexing
Indexing converts your documents into a searchable format. This happens offline — before any user asks a question.
Step 1: Parse Documents
Extract text from different formats:
| Format | Challenge | Solution |
|---|---|---|
| PDF | Tables, images, multi-column layouts | Specialized PDF parsers (PyMuPDF, Unstructured) |
| HTML | Navigation, ads, boilerplate | Content extraction libraries |
| Word/Excel | Formatting, embedded objects | Document parsing libraries |
| Markdown | Structured headings, code blocks | Native text processing |
Step 2: Chunk Documents
Split documents into smaller pieces that fit in the LLM’s context and can be retrieved individually. We’ll cover chunking strategies in depth in Lesson 3.
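As a preview of what Lesson 3 covers, here is a minimal fixed-size chunker with overlap, sketched in Python. The `chunk_size` and `overlap` values are illustrative defaults, not recommendations.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks, each overlapping the previous one.

    Overlap keeps sentences that straddle a chunk boundary retrievable
    from both sides of the split.
    """
    chunks = []
    start = 0
    step = chunk_size - overlap  # advance less than chunk_size to create overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```

Fixed-size chunking is the simplest strategy; semantic chunking (covered later) splits on meaning boundaries instead of character counts.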
Step 3: Generate Embeddings
Convert each chunk into a vector — a numerical representation that captures its meaning. Similar meanings produce similar vectors, enabling semantic search.
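"Similar meanings produce similar vectors" is usually measured with cosine similarity. A small pure-Python version (real systems use an embedding model to produce the vectors; the two-dimensional vectors in the test are toy values):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction,
    0.0 = unrelated (orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```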
Step 4: Store in Vector Database
Save the embeddings and their source text in a vector database for fast similarity search.
✅ Quick Check: Your company has 10,000 PDF documents. You index them all, creating 150,000 chunks in your vector database. A month later, 200 documents are updated. Do you need to re-index all 150,000 chunks? (Answer: No — only re-index the chunks from the 200 updated documents. Efficient indexing pipelines track which documents have changed and process only the delta. Re-indexing everything wastes compute and money. This is why document metadata (source file, last modified date, chunk position) matters — it enables incremental updates.)
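The incremental-update idea from the quick check can be sketched with a content hash per document: re-index only when the hash changes. The function name and the `seen_hashes` dict are illustrative; real pipelines often persist this state alongside the vector database.

```python
import hashlib

def needs_reindex(doc_id: str, content: bytes, seen_hashes: dict[str, str]) -> bool:
    """Return True if this document is new or changed since last indexing.

    Records the new hash as a side effect, so the next call with the same
    content returns False.
    """
    digest = hashlib.sha256(content).hexdigest()
    if seen_hashes.get(doc_id) == digest:
        return False  # unchanged: skip re-chunking and re-embedding
    seen_hashes[doc_id] = digest
    return True
```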
Stage 2: Retrieval
When a user asks a question, retrieval finds the most relevant chunks from the vector database.
How Semantic Search Works
User query: "What is our return policy for electronics?"
↓
Embed the query → [0.23, -0.45, 0.67, ...] (same model as indexing)
↓
Search vector DB → Find chunks with similar vectors
↓
Top results:
1. "Electronics can be returned within 30 days..." (similarity: 0.92)
2. "All returns require original packaging..." (similarity: 0.87)
3. "Electronics department hours: 9 AM - 9 PM..." (similarity: 0.71)
The key insight: the query embedding and document chunk embeddings must use the same embedding model. Different models produce incompatible vector spaces.
Retrieval Parameters
| Parameter | What It Controls | Typical Range |
|---|---|---|
| Top-K | Number of chunks returned | 3-10 |
| Similarity threshold | Minimum relevance score | 0.7-0.85 |
| Metadata filters | Narrow search by document type, date, category | Depends on your schema |
Stage 3: Generation
The LLM receives the retrieved chunks along with the original question and produces an answer grounded in that context.
The Generation Prompt
<system>
You are a helpful assistant. Answer questions using ONLY the
provided context. If the context doesn't contain the answer,
say "I don't have information about that in our knowledge base."
Do not use your general knowledge to fill gaps.
</system>
<context>
{retrieved_chunk_1}
Source: electronics-return-policy.pdf, page 3
{retrieved_chunk_2}
Source: general-return-guidelines.pdf, page 1
</context>
<question>{user_question}</question>
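Assembling this prompt is plain string formatting. A sketch, assuming each retrieved chunk carries `text`, `source`, and `page` fields (that dict shape is an assumption for illustration, not a standard):

```python
def build_prompt(chunks: list[dict], question: str) -> str:
    """Fill the generation prompt template with retrieved chunks."""
    context = "\n\n".join(
        f"{c['text']}\nSource: {c['source']}, page {c['page']}" for c in chunks
    )
    system = (
        "You are a helpful assistant. Answer questions using ONLY the "
        "provided context. If the context doesn't contain the answer, say "
        '"I don\'t have information about that in our knowledge base." '
        "Do not use your general knowledge to fill gaps."
    )
    return (
        f"<system>\n{system}\n</system>\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"<question>{question}</question>"
    )
```

Keeping the source line next to each chunk is what makes per-chunk citation possible in the next section.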
The grounding instruction (“Answer ONLY using the provided context”) is critical. Without it, the LLM will mix retrieved facts with its own knowledge — which may be incorrect or outdated.
Citation in Generation
Well-designed RAG systems tell users where the answer came from:
Answer: Electronics can be returned within 30 days of
purchase with original packaging and receipt.
[Source: electronics-return-policy.pdf, page 3]
All returns are subject to a 15% restocking fee for
opened items.
[Source: general-return-guidelines.pdf, page 1]
Citations build trust and let users verify the answer against the source document.
✅ Quick Check: Your RAG system retrieves 5 chunks, but only 2 are relevant to the question. The LLM uses information from an irrelevant chunk in its answer. What’s the fix? (Answer: Two fixes work together: (1) Add a reranking step between retrieval and generation to filter out low-relevance chunks, and (2) strengthen the generation prompt to instruct the LLM to ignore context that isn’t relevant to the specific question. Retrieval casts a wide net; generation should be selective about what it uses.)
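The first fix from the quick check (a filtering step between retrieval and generation) reduces to: score each chunk against the question, then drop chunks below a threshold. A sketch where the scores are a stand-in for a real reranker model's output (a cross-encoder in practice), which this toy code does not implement:

```python
def rerank_filter(scored_chunks: list[tuple[str, float]],
                  threshold: float = 0.5) -> list[str]:
    """Keep only chunks whose reranker score clears the threshold,
    best-first. scored_chunks: (chunk_text, relevance_score) pairs."""
    ranked = sorted(scored_chunks, key=lambda cs: cs[1], reverse=True)
    return [chunk for chunk, score in ranked if score >= threshold]
```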
Naive RAG vs. Advanced RAG
The pipeline above is “Naive RAG” — the simplest implementation. Advanced RAG adds optimizations at each stage:
| Stage | Naive RAG | Advanced RAG |
|---|---|---|
| Indexing | Fixed-size chunks | Semantic chunking, metadata enrichment |
| Retrieval | Vector similarity only | Hybrid search + reranking + query rewriting |
| Generation | Basic prompt with context | Grounding checks, citation, faithfulness scoring |
You’ll learn each advanced technique in the upcoming lessons.
Practice Exercise
- Pick a document collection you work with (policies, manuals, reports)
- Trace a question through all three stages: How would you chunk the documents? How would the query find the right chunk? What would the generation prompt look like?
- Identify where each stage could fail and what the user would see
Key Takeaways
- RAG has three stages: indexing (offline, per-document), retrieval (online, per-query), and generation (online, per-query)
- Indexing converts documents into searchable embeddings — run it once and update incrementally
- Retrieval uses semantic search to find relevant chunks — the query and document embeddings must use the same model
- Generation combines retrieved context with the question — grounding instructions prevent the LLM from inventing answers
- Naive RAG is the starting point; advanced techniques optimize each stage for production quality
Up Next
In the next lesson, you’ll dive deep into the indexing stage — specifically document chunking: how to split documents into pieces that are the right size, preserve meaning, and retrieve well.