Capstone: Design Your RAG System
Apply everything from the course: design a complete RAG system with document processing, retrieval strategy, generation prompt, evaluation plan, and production architecture.
You’ve learned every component of a RAG system. Now design a complete one — making real decisions at each stage and justifying each choice.
🔄 Quick Recall: Across this course you’ve covered: why RAG matters (Lesson 1), the three-stage architecture (Lesson 2), document chunking (Lesson 3), embeddings and vector databases (Lesson 4), retrieval strategies (Lesson 5), generation and grounding (Lesson 6), and evaluation (Lesson 7). This capstone integrates all of them.
Capstone Exercise: Company Knowledge Base
Design a RAG system for a mid-size company (500 employees) that needs to answer questions about internal policies, procedures, and documentation.
The Knowledge Base
| Document Type | Count | Format | Update Frequency |
|---|---|---|---|
| HR policies | 50 | | Annually |
| Product documentation | 200 | Markdown | Monthly |
| Meeting notes | 2,000 | Google Docs | Weekly |
| Support ticket archives | 10,000 | JSON | Daily |
| Training materials | 100 | PowerPoint/PDF | Quarterly |
Total: ~12,350 documents, estimated 500,000 chunks after processing.
Step 1: Document Processing
Decision: Chunking strategy per document type
| Document Type | Strategy | Why |
|---|---|---|
| HR policies | Structure-aware (by section/clause) | Policies have clear headings; each section is a self-contained topic |
| Product docs | Structure-aware (by heading) | Markdown headings define logical sections |
| Meeting notes | Semantic chunking | Notes don’t have consistent structure; split by topic shift |
| Support tickets | Document-level (one ticket = one chunk) | Each ticket is a self-contained Q&A unit |
| Training materials | Structure-aware + slide-level | Each slide or section is a distinct learning unit |
Chunk size: 300-500 tokens with 20% overlap for fixed/semantic chunks.
Metadata: Source file, document type, department, last updated, author, confidentiality level.
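The per-type dispatch above can be sketched in a few lines of Python. This is a minimal illustration, not production code: `chunk_markdown_by_heading` and `chunk_document` are hypothetical names, and semantic chunking is stubbed with a paragraph split (a real implementation would split where embedding similarity between adjacent sentences drops).

```python
import re

def chunk_markdown_by_heading(text: str) -> list[str]:
    """Structure-aware chunking: split a Markdown document at headings,
    so each section becomes one chunk (strategy for product docs)."""
    # Split before any line starting with one or more '#' characters.
    parts = re.split(r"\n(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]

def chunk_document(doc_type: str, text: str) -> list[str]:
    """Dispatch to a per-type strategy, mirroring the table above."""
    if doc_type in ("hr_policy", "product_doc"):
        return chunk_markdown_by_heading(text)
    if doc_type == "support_ticket":
        return [text]  # one ticket = one chunk
    # Stand-in for semantic chunking (meeting notes, training materials):
    # naive paragraph split instead of topic-shift detection.
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

In practice each chunk would also carry the metadata row above (source file, department, confidentiality level) rather than bare text.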
✅ Quick Check: A new employee asks “What’s the PTO policy?” The answer spans sections 3.1 (vacation days), 3.2 (sick leave), and 3.3 (personal days) of the HR handbook. With structure-aware chunking, each section is a separate chunk. Will the system retrieve all three? (Answer: Maybe — it depends on the top-K setting and how relevant each section scores against the query “PTO policy.” To improve this, add a parent-child relationship: a “Chapter 3: Time Off” parent chunk that references all three subsections. The retrieval first finds the parent, then includes the children. Alternatively, increase top-K to 5-7 for policy questions.)
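The parent-child fix described in that answer can be sketched as a post-retrieval expansion step (the function name and chunk IDs are illustrative):

```python
def expand_with_children(retrieved_ids: list[str],
                         children: dict[str, list[str]]) -> list[str]:
    """If a retrieved chunk is a parent (e.g. 'Chapter 3: Time Off'),
    append its child sections so the answer can cover all of them."""
    expanded: list[str] = []
    for chunk_id in retrieved_ids:
        expanded.append(chunk_id)
        for child in children.get(chunk_id, []):
            if child not in expanded:
                expanded.append(child)
    return expanded
```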
Step 2: Embeddings and Storage
Embedding model: OpenAI text-embedding-3-small
- Reason: Good balance of quality and cost for English text. 500K chunks at ~400 tokens each is ~200M tokens; at $0.02 per 1M tokens, total indexing cost is ≈ $4.
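The indexing-cost estimate above can be reproduced with a few lines (assuming ~400 tokens per chunk, the midpoint of the 300-500 token range):

```python
def indexing_cost_usd(num_chunks: int,
                      avg_tokens_per_chunk: int = 400,
                      price_per_million_tokens: float = 0.02) -> float:
    """One-time embedding cost: total tokens x per-token price."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens
```

Note this is the one-time cost to index; re-embedding on document updates and embedding each incoming query add a small ongoing cost on top.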
Vector database: pgvector (PostgreSQL extension)
- Reason: Company already uses PostgreSQL. 500K vectors is well within pgvector’s range. No new infrastructure needed.
- If growth exceeds 2M vectors: migrate to Weaviate Cloud.
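A minimal sketch of the table this design implies, written as a helper that emits the DDL. Column names are assumptions; pgvector's `VECTOR` type and HNSW index and PostgreSQL's `TSVECTOR` full-text type are real features, and text-embedding-3-small produces 1536-dimensional vectors by default.

```python
def pgvector_schema(dim: int = 1536) -> list[str]:
    """DDL for the chunk store: a pgvector column for embeddings plus a
    tsvector column, so keyword search lives in the same table."""
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        f"""CREATE TABLE chunks (
            id BIGSERIAL PRIMARY KEY,
            content TEXT NOT NULL,
            embedding VECTOR({dim}),
            doc_type TEXT,
            department TEXT,
            updated_at DATE,
            content_tsv TSVECTOR
        );""",
        "CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);",
        "CREATE INDEX ON chunks USING gin (content_tsv);",
    ]
```

Keeping vectors, metadata, and full text in one table is what makes the metadata filtering and hybrid search in Step 3 a single query.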
Step 3: Retrieval Strategy
User Query
↓
Query Rewriting (LLM-based)
↓
Metadata Filter (department, document type)
↓
Hybrid Search (vector via pgvector + keyword via Postgres full-text search)
↓
Reranking (Cohere Rerank, top 5)
↓
Generation
Why this stack:
- Query rewriting handles casual employee language (“Where do I find the thing about PTO?”)
- Metadata filtering by department prevents HR questions from surfacing engineering docs
- Hybrid search catches both semantic and exact term matches (policy numbers, acronyms)
- Reranking ensures the final 5 chunks are truly relevant, not just topically related
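One common way to fuse the vector and keyword result lists before reranking is Reciprocal Rank Fusion (RRF): each document scores 1/(k + rank) in every list it appears in, and the totals decide the merged order. A minimal sketch:

```python
def reciprocal_rank_fusion(vector_hits: list[str],
                           keyword_hits: list[str],
                           k: int = 60) -> list[str]:
    """Merge two ranked ID lists: documents appearing high in both
    lists accumulate the largest scores. k=60 is the value commonly
    used in the RRF literature."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The fused list (say, top 20) is what gets passed to the reranker, which trims it to the final 5.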
Step 4: Generation
Grounding prompt:
<system>
You are the company knowledge base assistant. Answer questions
using ONLY the provided context documents.
Rules:
1. Answer based ONLY on provided context — never use general knowledge
2. Cite every fact: [Source: filename, section]
3. If context partially answers the question, say what you know
and what's missing
4. If context doesn't answer at all: "I couldn't find information
about that. Try asking HR directly or checking the intranet."
5. For policy questions, include the policy effective date
6. For conflicting information, cite both sources and note the
more recent one
</system>
Temperature: 0.1 (near-deterministic for policy accuracy)
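A sketch of how the retrieved chunks might be packaged into the user message so the model can produce the `[Source: filename, section]` citations the system prompt demands. The function name, dict fields, and exact layout are assumptions, not a fixed format:

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Number each context block and tag it with its source, so every
    fact in the answer can be traced to a filename and section."""
    context = "\n\n".join(
        f"[Document {i}: {c['filename']}, {c['section']}]\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return f"Context documents:\n{context}\n\nQuestion: {question}"
```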
Step 5: Evaluation Plan
Test suite: 50 questions
- 20 factual (policy, procedure, product specs)
- 10 multi-document (questions requiring info from 2+ sources)
- 8 ambiguous (vague or underspecified questions)
- 7 unanswerable (questions the knowledge base has no information about — the system should decline rather than guess)
- 5 adversarial (injection attempts, out-of-scope requests)
Target metrics:
- Faithfulness: > 95%
- Context Relevance: > 85%
- Answer Relevance: > 90%
- Latency: < 3 seconds per query
Monitoring: Weekly evaluation against the full test suite. Alert on any metric dropping > 5% from baseline.
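The alert rule above can be sketched as a small comparison function (metric names and the relative-drop interpretation are illustrative):

```python
def check_regression(baseline: dict[str, float],
                     current: dict[str, float],
                     max_drop: float = 0.05) -> list[str]:
    """Return the metrics whose current value dropped more than
    max_drop (relative) below baseline -- the weekly alert rule."""
    alerts = []
    for metric, base in baseline.items():
        if base > 0 and (base - current.get(metric, 0.0)) / base > max_drop:
            alerts.append(metric)
    return alerts
```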
Step 6: Production Architecture
┌─────────────────────────────────────────────┐
│ User Interface (Chat) │
└──────────────────┬──────────────────────────┘
│
┌──────────────────▼──────────────────────────┐
│ Semantic Cache (Redis) │
│ Hit? → Return cached answer │
│ Miss? ↓ │
├──────────────────────────────────────────────┤
│ Query Processing │
│ Rewrite → Filter → Hybrid Search → Rerank │
├──────────────────────────────────────────────┤
│ Generation (Claude/GPT-4) │
│ Grounding prompt + Context → Response │
├──────────────────────────────────────────────┤
│ Output Guardrails │
│ PII masking, citation check, policy check │
└──────────────────┬──────────────────────────┘
│
┌──────────────────▼──────────────────────────┐
│ PostgreSQL + pgvector │
│ 500K vectors + metadata + full text │
└─────────────────────────────────────────────┘
Course Recap
| Lesson | Core Concept | Key Takeaway |
|---|---|---|
| 1. Welcome | Why RAG matters | LLMs hallucinate, RAG grounds them in real data |
| 2. Architecture | Three stages | Indexing (offline) → Retrieval (per-query) → Generation (per-query) |
| 3. Chunking | Document processing | Match strategy to document structure; overlap prevents boundary splits |
| 4. Embeddings | Vector representation | Same model for indexing and querying; choose DB for your stage |
| 5. Retrieval | Advanced search | Hybrid search + reranking > vector-only search |
| 6. Generation | Grounding and citation | “ONLY based on” not “using”; cite every claim |
| 7. Evaluation | RAGAS metrics | Measure each stage separately to diagnose problems |
| 8. Capstone | Complete system design | Design decisions cascade — each stage affects the next |
RAG Design Checklist
Document Processing:
□ Chunking strategy matched to each document type
□ Overlap configured (10-25%) to prevent boundary splits
□ Metadata extracted (source, date, type, department)
Embeddings & Storage:
□ Embedding model selected and consistent across all documents
□ Vector database appropriate for current scale
□ Migration path defined for growth
Retrieval:
□ Hybrid search (vector + keyword) for technical content
□ Reranking for precision
□ Query rewriting for casual/ambiguous queries
□ Metadata filtering for scoped searches
Generation:
□ Restrictive grounding prompt ("ONLY based on")
□ Citation format defined and enforced
□ Gap handling (partial answers, conflicts, unknowns)
□ Low temperature (0.0-0.2)
Evaluation:
□ Test suite with 50+ questions across all categories
□ RAGAS metrics tracked: faithfulness, context relevance, answer relevance
□ Weekly evaluation against full test suite
□ Alerts on metric degradation
Production:
□ Semantic cache for frequent queries
□ Output guardrails (PII, citations, policy compliance)
□ Incremental indexing for document updates
□ Monitoring and cost tracking
Key Takeaways
- RAG system design is a series of cascading decisions — each stage affects the next
- Match document processing to document structure — no universal chunking strategy works for all content
- Start with the simplest architecture that works (pgvector, not Milvus) and scale when needed
- Grounding and citation are non-negotiable for trustworthy RAG
- Measure each stage independently with RAGAS metrics — end-to-end testing alone can’t diagnose which stage failed
- Semantic caching handles the 60%+ of queries that are variations of the same questions