Capstone: Design Your RAG System
Apply everything from the course: design a complete RAG system with document processing, retrieval strategy, generation prompt, evaluation plan, and production architecture.
You’ve learned every component of a RAG system. Now design a complete one — making real decisions at each stage and justifying each choice.
🔄 Quick Recall: Across this course you’ve covered: why RAG matters (Lesson 1), the three-stage architecture (Lesson 2), document chunking (Lesson 3), embeddings and vector databases (Lesson 4), retrieval strategies (Lesson 5), generation and grounding (Lesson 6), and evaluation (Lesson 7). This capstone integrates all of them.
Capstone Exercise: Company Knowledge Base
Design a RAG system for a mid-size company (500 employees) that needs to answer questions about internal policies, procedures, and documentation.
The Knowledge Base
| Document Type | Count | Format | Update Frequency |
|---|---|---|---|
| HR policies | 50 | | Annually |
| Product documentation | 200 | Markdown | Monthly |
| Meeting notes | 2,000 | Google Docs | Weekly |
| Support ticket archives | 10,000 | JSON | Daily |
| Training materials | 100 | PowerPoint/PDF | Quarterly |
Total: ~12,350 documents, estimated 500,000 chunks after processing.
Step 1: Document Processing
Decision: Chunking strategy per document type
| Document Type | Strategy | Why |
|---|---|---|
| HR policies | Structure-aware (by section/clause) | Policies have clear headings; each section is a self-contained topic |
| Product docs | Structure-aware (by heading) | Markdown headings define logical sections |
| Meeting notes | Semantic chunking | Notes don’t have consistent structure; split by topic shift |
| Support tickets | Document-level (one ticket = one chunk) | Each ticket is a self-contained Q&A unit |
| Training materials | Structure-aware + slide-level | Each slide or section is a distinct learning unit |
Chunk size: 300-500 tokens with 20% overlap for fixed/semantic chunks.
Metadata: Source file, document type, department, last updated, author, confidentiality level.
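The per-type dispatch above can be sketched in a few lines of Python. This is a minimal illustration, not production code: `chunk_markdown_by_heading` and `chunk_document` are hypothetical names, and semantic chunking is stubbed with a paragraph split (a real implementation would split where embedding similarity between adjacent sentences drops).

```python
import re

def chunk_markdown_by_heading(text: str) -> list[str]:
    """Structure-aware chunking: split a Markdown document at headings,
    so each section becomes one chunk (strategy for product docs)."""
    # Split before any line starting with one or more '#' characters.
    parts = re.split(r"\n(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]

def chunk_document(doc_type: str, text: str) -> list[str]:
    """Dispatch to a per-type strategy, mirroring the table above."""
    if doc_type in ("hr_policy", "product_doc"):
        return chunk_markdown_by_heading(text)
    if doc_type == "support_ticket":
        return [text]  # one ticket = one chunk
    # Stand-in for semantic chunking (meeting notes, training materials):
    # naive paragraph split instead of topic-shift detection.
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

In practice each chunk would also carry the metadata row above (source file, department, confidentiality level) rather than bare text.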
✅ Quick Check: A new employee asks “What’s the PTO policy?” The answer spans sections 3.1 (vacation days), 3.2 (sick leave), and 3.3 (personal days) of the HR handbook. With structure-aware chunking, each section is a separate chunk. Will the system retrieve all three? (Answer: Maybe — it depends on the top-K setting and how relevant each section scores against the query “PTO policy.” To improve this, add a parent-child relationship: a “Chapter 3: Time Off” parent chunk that references all three subsections. The retrieval first finds the parent, then includes the children. Alternatively, increase top-K to 5-7 for policy questions.)
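The parent-child fix described in that answer can be sketched as a post-retrieval expansion step (the function name and chunk IDs are illustrative):

```python
def expand_with_children(retrieved_ids: list[str],
                         children: dict[str, list[str]]) -> list[str]:
    """If a retrieved chunk is a parent (e.g. 'Chapter 3: Time Off'),
    append its child sections so the answer can cover all of them."""
    expanded: list[str] = []
    for chunk_id in retrieved_ids:
        expanded.append(chunk_id)
        for child in children.get(chunk_id, []):
            if child not in expanded:
                expanded.append(child)
    return expanded
```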
Step 2: Embeddings and Storage
Embedding model: OpenAI text-embedding-3-small
- Reason: Good balance of quality and cost for English text. 500K chunks at ~400 tokens each is ~200M tokens; at $0.02 per 1M tokens, total indexing cost is ≈ $4.
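The indexing-cost estimate above can be reproduced with a few lines (assuming ~400 tokens per chunk, the midpoint of the 300-500 token range):

```python
def indexing_cost_usd(num_chunks: int,
                      avg_tokens_per_chunk: int = 400,
                      price_per_million_tokens: float = 0.02) -> float:
    """One-time embedding cost: total tokens x per-token price."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens
```

Note this is the one-time cost to index; re-embedding on document updates and embedding each incoming query add a small ongoing cost on top.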
Vector database: pgvector (PostgreSQL extension)
- Reason: Company already uses PostgreSQL. 500K vectors is well within pgvector’s range. No new infrastructure needed.
- If growth exceeds 2M vectors: migrate to Weaviate Cloud.
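A minimal sketch of the table this design implies, written as a helper that emits the DDL. Column names are assumptions; pgvector's `VECTOR` type and HNSW index and PostgreSQL's `TSVECTOR` full-text type are real features, and text-embedding-3-small produces 1536-dimensional vectors by default.

```python
def pgvector_schema(dim: int = 1536) -> list[str]:
    """DDL for the chunk store: a pgvector column for embeddings plus a
    tsvector column, so keyword search lives in the same table."""
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        f"""CREATE TABLE chunks (
            id BIGSERIAL PRIMARY KEY,
            content TEXT NOT NULL,
            embedding VECTOR({dim}),
            doc_type TEXT,
            department TEXT,
            updated_at DATE,
            content_tsv TSVECTOR
        );""",
        "CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);",
        "CREATE INDEX ON chunks USING gin (content_tsv);",
    ]
```

Keeping vectors, metadata, and full text in one table is what makes the metadata filtering and hybrid search in Step 3 a single query.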
Step 3: Retrieval Strategy
User Query
↓
Query Rewriting (LLM-based)
↓
Metadata Filter (department, document type)
↓
Hybrid Search (vector via pgvector + keyword via Postgres full-text search)
↓
Reranking (Cohere Rerank, top 5)
↓
Generation
Why this stack:
- Query rewriting handles casual employee language (“Where do I find the thing about PTO?”)
- Metadata filtering by department prevents HR questions from surfacing engineering docs
- Hybrid search catches both semantic and exact term matches (policy numbers, acronyms)
- Reranking ensures the final 5 chunks are truly relevant, not just topically related
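One common way to fuse the vector and keyword result lists before reranking is Reciprocal Rank Fusion (RRF): each document scores 1/(k + rank) in every list it appears in, and the totals decide the merged order. A minimal sketch:

```python
def reciprocal_rank_fusion(vector_hits: list[str],
                           keyword_hits: list[str],
                           k: int = 60) -> list[str]:
    """Merge two ranked ID lists: documents appearing high in both
    lists accumulate the largest scores. k=60 is the value commonly
    used in the RRF literature."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The fused list (say, top 20) is what gets passed to the reranker, which trims it to the final 5.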
Step 4: Generation
Grounding prompt:
<system>
You are the company knowledge base assistant. Answer questions
using ONLY the provided context documents.
Rules:
1. Answer based ONLY on provided context — never use general knowledge
2. Cite every fact: [Source: filename, section]
3. If context partially answers the question, say what you know
and what's missing
4. If context doesn't answer at all: "I couldn't find information
about that. Try asking HR directly or checking the intranet."
5. For policy questions, include the policy effective date
6. For conflicting information, cite both sources and note the
more recent one
</system>
Temperature: 0.1 (near-deterministic for policy accuracy)
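A sketch of how the retrieved chunks might be packaged into the user message so the model can produce the `[Source: filename, section]` citations the system prompt demands. The function name, dict fields, and exact layout are assumptions, not a fixed format:

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Number each context block and tag it with its source, so every
    fact in the answer can be traced to a filename and section."""
    context = "\n\n".join(
        f"[Document {i}: {c['filename']}, {c['section']}]\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return f"Context documents:\n{context}\n\nQuestion: {question}"
```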
Step 5: Evaluation Plan
Test suite: 50 questions
- 20 factual (policy, procedure, product specs)
- 10 multi-document (questions requiring info from 2+ sources)
- 8 ambiguous (vague or underspecified questions)
- 7 unanswerable (questions the knowledge base has no information about — the system should decline rather than guess)
- 5 adversarial (injection attempts, out-of-scope requests)
Target metrics:
- Faithfulness: > 95%
- Context Relevance: > 85%
- Answer Relevance: > 90%
- Latency: < 3 seconds per query
Monitoring: Weekly evaluation against the full test suite. Alert on any metric dropping > 5% from baseline.
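The alert rule above can be sketched as a small comparison function (metric names and the relative-drop interpretation are illustrative):

```python
def check_regression(baseline: dict[str, float],
                     current: dict[str, float],
                     max_drop: float = 0.05) -> list[str]:
    """Return the metrics whose current value dropped more than
    max_drop (relative) below baseline -- the weekly alert rule."""
    alerts = []
    for metric, base in baseline.items():
        if base > 0 and (base - current.get(metric, 0.0)) / base > max_drop:
            alerts.append(metric)
    return alerts
```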
Step 6: Production Architecture
┌─────────────────────────────────────────────┐
│ User Interface (Chat) │
└──────────────────┬──────────────────────────┘
│
┌──────────────────▼──────────────────────────┐
│ Semantic Cache (Redis) │
│ Hit? → Return cached answer │
│ Miss? ↓ │
├──────────────────────────────────────────────┤
│ Query Processing │
│ Rewrite → Filter → Hybrid Search → Rerank │
├──────────────────────────────────────────────┤
│ Generation (Claude/GPT-4) │
│ Grounding prompt + Context → Response │
├──────────────────────────────────────────────┤
│ Output Guardrails │
│ PII masking, citation check, policy check │
└──────────────────┬──────────────────────────┘
│
┌──────────────────▼──────────────────────────┐
│ PostgreSQL + pgvector │
│ 500K vectors + metadata + full text │
└─────────────────────────────────────────────┘
Course Recap
| Lesson | Core Concept | Key Takeaway |
|---|---|---|
| 1. Welcome | Why RAG matters | LLMs hallucinate, RAG grounds them in real data |
| 2. Architecture | Three stages | Indexing (offline) → Retrieval (per-query) → Generation (per-query) |
| 3. Chunking | Document processing | Match strategy to document structure; overlap prevents boundary splits |
| 4. Embeddings | Vector representation | Same model for indexing and querying; choose DB for your stage |
| 5. Retrieval | Advanced search | Hybrid search + reranking > vector-only search |
| 6. Generation | Grounding and citation | “ONLY based on” not “using”; cite every claim |
| 7. Evaluation | RAGAS metrics | Measure each stage separately to diagnose problems |
| 8. Capstone | Complete system design | Design decisions cascade — each stage affects the next |
RAG Design Checklist
Document Processing:
□ Chunking strategy matched to each document type
□ Overlap configured (10-25%) to prevent boundary splits
□ Metadata extracted (source, date, type, department)
Embeddings & Storage:
□ Embedding model selected and consistent across all documents
□ Vector database appropriate for current scale
□ Migration path defined for growth
Retrieval:
□ Hybrid search (vector + keyword) for technical content
□ Reranking for precision
□ Query rewriting for casual/ambiguous queries
□ Metadata filtering for scoped searches
Generation:
□ Restrictive grounding prompt ("ONLY based on")
□ Citation format defined and enforced
□ Gap handling (partial answers, conflicts, unknowns)
□ Low temperature (0.0-0.2)
Evaluation:
□ Test suite with 50+ questions across all categories
□ RAGAS metrics tracked: faithfulness, context relevance, answer relevance
□ Weekly evaluation against full test suite
□ Alerts on metric degradation
Production:
□ Semantic cache for frequent queries
□ Output guardrails (PII, citations, policy compliance)
□ Incremental indexing for document updates
□ Monitoring and cost tracking
Key Takeaways
- RAG system design is a series of cascading decisions — each stage affects the next
- Match document processing to document structure — no universal chunking strategy works for all content
- Start with the simplest architecture that works (pgvector, not Milvus) and scale when needed
- Grounding and citation are non-negotiable for trustworthy RAG
- Measure each stage independently with RAGAS metrics — end-to-end testing alone can’t diagnose which stage failed
- Semantic caching handles the 60%+ of queries that are variations of the same questions