RAG Architecture: The Three Stages
Explore the complete RAG pipeline: indexing (document preparation), retrieval (semantic search), and generation (grounded response). Understand how each stage connects.
RAG systems have three stages, each with its own challenges and optimization opportunities. Understanding the full pipeline is essential before diving into each component.
The Complete RAG Pipeline
STAGE 1: INDEXING (Offline — runs when documents change)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Documents → Parse → Chunk → Embed → Store in Vector DB
STAGE 2: RETRIEVAL (Online — runs per query)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
User Query → Embed Query → Search Vector DB → Rank Results
STAGE 3: GENERATION (Online — runs per query)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Retrieved Context + Query → LLM → Grounded Response
Stage 1: Indexing
Indexing converts your documents into a searchable format. This happens offline — before any user asks a question.
Step 1: Parse Documents
Extract text from different formats:
| Format | Challenge | Solution |
|---|---|---|
| PDF | Tables, images, multi-column layouts | Specialized PDF parsers (PyMuPDF, Unstructured) |
| HTML | Navigation, ads, boilerplate | Content extraction libraries |
| Word/Excel | Formatting, embedded objects | Document parsing libraries |
| Markdown | Structured headings, code blocks | Native text processing |
Step 2: Chunk Documents
Split documents into smaller pieces that fit in the LLM’s context and can be retrieved individually. We’ll cover chunking strategies in depth in Lesson 3.
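As a preview of what Lesson 3 covers, here is a minimal fixed-size chunker with overlap, sketched in Python. The `chunk_size` and `overlap` values are illustrative defaults, not recommendations.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks, each overlapping the previous one.

    Overlap keeps sentences that straddle a chunk boundary retrievable
    from both sides of the split.
    """
    chunks = []
    start = 0
    step = chunk_size - overlap  # advance less than chunk_size to create overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```

Fixed-size chunking is the simplest strategy; semantic chunking (covered later) splits on meaning boundaries instead of character counts.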
Step 3: Generate Embeddings
Convert each chunk into a vector — a numerical representation that captures its meaning. Similar meanings produce similar vectors, enabling semantic search.
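"Similar meanings produce similar vectors" is usually measured with cosine similarity. A small pure-Python version (real systems use an embedding model to produce the vectors; the two-dimensional vectors in the test are toy values):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction,
    0.0 = unrelated (orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```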
Step 4: Store in Vector Database
Save the embeddings and their source text in a vector database for fast similarity search.
✅ Quick Check: Your company has 10,000 PDF documents. You index them all, creating 150,000 chunks in your vector database. A month later, 200 documents are updated. Do you need to re-index all 150,000 chunks? (Answer: No — only re-index the chunks from the 200 updated documents. Efficient indexing pipelines track which documents have changed and process only the delta. Re-indexing everything wastes compute and money. This is why document metadata (source file, last modified date, chunk position) matters — it enables incremental updates.)
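The incremental-update idea from the quick check can be sketched with a content hash per document: re-index only when the hash changes. The function name and the `seen_hashes` dict are illustrative; real pipelines often persist this state alongside the vector database.

```python
import hashlib

def needs_reindex(doc_id: str, content: bytes, seen_hashes: dict[str, str]) -> bool:
    """Return True if this document is new or changed since last indexing.

    Records the new hash as a side effect, so the next call with the same
    content returns False.
    """
    digest = hashlib.sha256(content).hexdigest()
    if seen_hashes.get(doc_id) == digest:
        return False  # unchanged: skip re-chunking and re-embedding
    seen_hashes[doc_id] = digest
    return True
```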
Stage 2: Retrieval
When a user asks a question, retrieval finds the most relevant chunks from the vector database.
How Semantic Search Works
User query: "What is our return policy for electronics?"
↓
Embed the query → [0.23, -0.45, 0.67, ...] (same model as indexing)
↓
Search vector DB → Find chunks with similar vectors
↓
Top results:
1. "Electronics can be returned within 30 days..." (similarity: 0.92)
2. "All returns require original packaging..." (similarity: 0.87)
3. "Electronics department hours: 9 AM - 9 PM..." (similarity: 0.71)
The key insight: the query embedding and document chunk embeddings must use the same embedding model. Different models produce incompatible vector spaces.
Retrieval Parameters
| Parameter | What It Controls | Typical Range |
|---|---|---|
| Top-K | Number of chunks returned | 3-10 |
| Similarity threshold | Minimum relevance score | 0.7-0.85 |
| Metadata filters | Narrow search by document type, date, category | Depends on your schema |
Stage 3: Generation
The LLM receives the retrieved chunks along with the original question and produces an answer grounded in that context.
The Generation Prompt
<system>
You are a helpful assistant. Answer questions using ONLY the
provided context. If the context doesn't contain the answer,
say "I don't have information about that in our knowledge base."
Do not use your general knowledge to fill gaps.
</system>
<context>
{retrieved_chunk_1}
Source: electronics-return-policy.pdf, page 3
{retrieved_chunk_2}
Source: general-return-guidelines.pdf, page 1
</context>
<question>{user_question}</question>
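Assembling this prompt is plain string formatting. A sketch, assuming each retrieved chunk carries `text`, `source`, and `page` fields (that dict shape is an assumption for illustration, not a standard):

```python
def build_prompt(chunks: list[dict], question: str) -> str:
    """Fill the generation prompt template with retrieved chunks."""
    context = "\n\n".join(
        f"{c['text']}\nSource: {c['source']}, page {c['page']}" for c in chunks
    )
    system = (
        "You are a helpful assistant. Answer questions using ONLY the "
        "provided context. If the context doesn't contain the answer, say "
        '"I don\'t have information about that in our knowledge base." '
        "Do not use your general knowledge to fill gaps."
    )
    return (
        f"<system>\n{system}\n</system>\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"<question>{question}</question>"
    )
```

Keeping the source line next to each chunk is what makes per-chunk citation possible in the next section.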
The grounding instruction (“Answer ONLY using the provided context”) is critical. Without it, the LLM will mix retrieved facts with its own knowledge — which may be incorrect or outdated.
Citation in Generation
Well-designed RAG systems tell users where the answer came from:
Answer: Electronics can be returned within 30 days of
purchase with original packaging and receipt.
[Source: electronics-return-policy.pdf, page 3]
All returns are subject to a 15% restocking fee for
opened items.
[Source: general-return-guidelines.pdf, page 1]
Citations build trust and let users verify the answer against the source document.
✅ Quick Check: Your RAG system retrieves 5 chunks, but only 2 are relevant to the question. The LLM uses information from an irrelevant chunk in its answer. What’s the fix? (Answer: Two fixes work together: (1) Add a reranking step between retrieval and generation to filter out low-relevance chunks, and (2) strengthen the generation prompt to instruct the LLM to ignore context that isn’t relevant to the specific question. Retrieval casts a wide net; generation should be selective about what it uses.)
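The first fix from the quick check (a filtering step between retrieval and generation) reduces to: score each chunk against the question, then drop chunks below a threshold. A sketch where the scores are a stand-in for a real reranker model's output (a cross-encoder in practice), which this toy code does not implement:

```python
def rerank_filter(scored_chunks: list[tuple[str, float]],
                  threshold: float = 0.5) -> list[str]:
    """Keep only chunks whose reranker score clears the threshold,
    best-first. scored_chunks: (chunk_text, relevance_score) pairs."""
    ranked = sorted(scored_chunks, key=lambda cs: cs[1], reverse=True)
    return [chunk for chunk, score in ranked if score >= threshold]
```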
Naive RAG vs. Advanced RAG
The pipeline above is “Naive RAG” — the simplest implementation. Advanced RAG adds optimizations at each stage:
| Stage | Naive RAG | Advanced RAG |
|---|---|---|
| Indexing | Fixed-size chunks | Semantic chunking, metadata enrichment |
| Retrieval | Vector similarity only | Hybrid search + reranking + query rewriting |
| Generation | Basic prompt with context | Grounding checks, citation, faithfulness scoring |
You’ll learn each advanced technique in the upcoming lessons.
Practice Exercise
- Pick a document collection you work with (policies, manuals, reports)
- Trace a question through all three stages: How would you chunk the documents? How would the query find the right chunk? What would the generation prompt look like?
- Identify where each stage could fail and what the user would see
Key Takeaways
- RAG has three stages: indexing (offline, per-document), retrieval (online, per-query), and generation (online, per-query)
- Indexing converts documents into searchable embeddings — run it once and update incrementally
- Retrieval uses semantic search to find relevant chunks — the query and document embeddings must use the same model
- Generation combines retrieved context with the question — grounding instructions prevent the LLM from inventing answers
- Naive RAG is the starting point; advanced techniques optimize each stage for production quality
Up Next
In the next lesson, you’ll dive deep into the indexing stage — specifically document chunking: how to split documents into pieces that are the right size, preserve meaning, and retrieve well.