RAG Workflows: AI That Knows Your Data
Build a RAG Knowledge Base Bot — embed your documents in a vector store, then ask questions and get answers grounded in your actual data.
🔄 Your agent has tools (Lesson 4) and memory (Lesson 5). But when a customer asks “What’s your refund policy?”, the agent has to guess — because it’s never read your documentation. RAG changes that. Instead of hoping the LLM knows the answer, you give it your actual documents and let it find the relevant information before responding.
What Is RAG?
Retrieval-Augmented Generation is a three-step process:
- Embed — Convert your documents into numerical vectors (embeddings) and store them in a vector database
- Retrieve — When a question arrives, find the most relevant document chunks by comparing the question’s embedding to stored embeddings
- Generate — Feed the retrieved chunks + the question to an LLM, which generates an answer grounded in your actual data
The result: an AI that answers questions from your documents instead of making things up. The LLM’s response is grounded in your data — and it can cite where the information came from.
```
User Question → Embed Question → Search Vector Store → Top 3 Chunks
                                                           ↓
LLM: "Based on these chunks, here's the answer..."
```
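The three steps can be sketched in a few lines of Python. The keyword-count "embedding" below is a toy stand-in for a real embedding model such as `text-embedding-3-small`; the vocabulary and documents are made up for illustration, but the embed → retrieve → generate shape is the same:

```python
import math

# Toy "embedding": counts of a few vocabulary words. A real pipeline
# would call an embedding model instead.
VOCAB = ["refund", "password", "plan", "support", "workflow"]

def embed(text: str) -> list[float]:
    return [float(text.lower().count(word)) for word in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# 1. Embed: vectorise each document chunk and store the pair.
docs = [
    "Refunds are available within 30 days of purchase.",
    "The Pro plan includes unlimited workflows and priority support.",
    "Reset your password from the account settings page.",
]
store = [(doc, embed(doc)) for doc in docs]

# 2. Retrieve: embed the question, rank stored chunks by similarity.
question = "What is the refund policy?"
q_vec = embed(question)
best = max(store, key=lambda row: cosine(q_vec, row[1]))

# 3. Generate: a real workflow sends chunk + question to the LLM;
# here we only assemble the grounded prompt.
prompt = f"Context: {best[0]}\nQuestion: {question}\nAnswer from the context only."
```

The retrieval step picks the refund document because its vector is closest to the question's vector, which is exactly what the vector store does at scale.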
Why RAG instead of stuffing documents into the prompt? Token limits. GPT-4o has a 128K context window — big, but a mid-size company’s documentation easily exceeds that. RAG lets you search millions of documents and only feed the relevant chunks to the LLM.
✅ Quick Check: A company has 500 pages of product documentation. Why can’t they just paste all of it into the LLM prompt? (Answer: Token limits and cost. Even 128K token models can’t hold 500 pages of text. And even if they could, processing all that context for every question would be extremely expensive and slow. RAG retrieves only the 3-5 most relevant passages, keeping costs low and responses fast.)
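The arithmetic behind that answer, assuming a rough 500 tokens per documentation page:

```python
# Back-of-the-envelope token budget (all figures are rough assumptions).
pages = 500
tokens_per_page = 500               # dense pages run higher
doc_tokens = pages * tokens_per_page
context_window = 128_000            # GPT-4o class models

print(doc_tokens)                   # 250000, roughly double the window
print(doc_tokens > context_window)  # True: the full docs cannot fit

# A RAG query instead sends only a few retrieved chunks:
retrieved_tokens = 4 * 800          # e.g. Top K = 4 chunks of ~800 tokens
print(retrieved_tokens)             # 3200 tokens per question
```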
The RAG Pipeline in n8n
n8n’s RAG pipeline uses four types of nodes:
1. Document Loaders — Ingest your source material
- PDF Loader, Google Drive Loader, Notion Loader, Web Scraper
- Converts documents into text that can be chunked and embedded
2. Text Splitters — Break documents into chunks
- Recursive Character Text Splitter (default, works well for most text)
- Token Text Splitter (splits by token count — more precise for LLM input)
- Configure chunk size (typically 500-1000 tokens) and overlap (10-20%)
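The splitter settings above can be sketched as a function. n8n's Recursive Character Text Splitter additionally prefers to break on paragraph and sentence boundaries, so treat this as a simplified fixed-window version:

```python
def split_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Fixed-window splitter: each chunk starts (chunk_size - overlap)
    characters after the previous one, so neighbouring chunks share
    `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, a 2,000-character document yields three chunks, and the last 100 characters of each chunk reappear at the start of the next.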
3. Embeddings — Convert text chunks into vectors
- OpenAI Embeddings (`text-embedding-3-small` is cheap and effective)
- Cohere, HuggingFace, or local models via Ollama
4. Vector Stores — Store and search embeddings
- Supabase (pgvector) — free tier available, persistent, SQL-queryable
- Pinecone — managed service, high performance, free tier
- Qdrant — open source, self-hostable, great for privacy
- In-Memory — testing only (resets on restart)
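Under the hood, every option on this list does the same job the in-memory store does, just persistently and at scale. A minimal sketch of that job, brute-force cosine search over stored (text, vector) pairs:

```python
import math

class InMemoryVectorStore:
    """Minimal vector store for testing: keeps (text, vector) pairs in a
    list and does brute-force cosine search. Like n8n's in-memory option,
    everything is lost when the process exits."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, list[float]]] = []

    def insert(self, text: str, vector: list[float]) -> None:
        self._rows.append((text, vector))

    def search(self, query: list[float], top_k: int = 4) -> list[str]:
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0

        ranked = sorted(self._rows, key=lambda r: cosine(query, r[1]), reverse=True)
        return [text for text, _ in ranked[:top_k]]
```

Production stores (pgvector, Pinecone, Qdrant) replace the linear scan with approximate nearest-neighbour indexes, but the interface, insert vectors and search by similarity, is the same.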
Build: RAG Knowledge Base Bot
You’ll build a bot that answers questions from a set of documents. We’ll use Supabase as the vector store because it’s free, persistent, and pairs well with n8n.
Part A: Document Ingestion Pipeline
This workflow embeds your documents into the vector store. You run it once (or whenever documents change).
Step 1: Create the ingestion workflow
- New workflow → add a Manual Trigger
- Add a Document Loader — for this example, use the Default Data Loader and paste some test text, or use the PDF Loader if you have a PDF to upload
- Add a Recursive Character Text Splitter:
- Chunk Size: 800 characters
- Chunk Overlap: 100 characters
- Add a Supabase Vector Store node in Insert mode:
- Connect to your Supabase instance (create a free project at supabase.com)
- Table: `documents` (you'll need to enable the pgvector extension and create this table; Supabase has a one-click setup for this)
- Add OpenAI Embeddings sub-node to the vector store:
- Model: `text-embedding-3-small`
- This converts each text chunk into a 1536-dimensional vector
Connect them: Trigger → Loader → Splitter → Vector Store (with Embeddings attached)
Step 2: Run the ingestion
Click “Test workflow.” Watch as your documents are chunked, embedded, and stored. Check your Supabase dashboard — you should see rows in the documents table, each with a text chunk and its vector embedding.
Part B: Question-Answering Workflow
This workflow takes user questions and retrieves answers from your stored documents.
Step 1: Create the Q&A workflow
- New workflow → add a Chat Trigger
- Add a Q&A Chain node (not the AI Agent — Q&A Chain is optimized for document retrieval)
- Attach an OpenAI Chat Model sub-node (gpt-4o-mini is fine for Q&A)
- Attach a Supabase Vector Store sub-node in Retrieve mode:
- Connect to the same Supabase instance and table
- Top K: 4 (retrieves the 4 most relevant chunks)
- Attach OpenAI Embeddings (same model you used for ingestion — this is important)
Step 2: Test it
Click “Test workflow.” Ask questions about your documents:
- “What’s the refund policy?”
- “How do I reset my password?”
- “What features are included in the Pro plan?”
The Q&A Chain will search the vector store, retrieve the most relevant chunks, and generate an answer grounded in your actual documents.
✅ Quick Check: You ingested documents using OpenAI's `text-embedding-3-small` model. For the retrieval workflow, can you use a different embedding model? (Answer: No. You must use the same embedding model for both ingestion and retrieval. Different models produce different vector representations — searching with mismatched embeddings returns irrelevant results. This is a common mistake when switching models mid-project.)
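A guard like the following, a sketch, not an n8n feature, makes the failure mode concrete. The dimensions are the documented defaults for OpenAI's models; note that even two models with matching dimensions are incompatible, because they map text into different vector spaces:

```python
# Documented default output dimensions for OpenAI embedding models.
DIMS = {"text-embedding-3-small": 1536, "text-embedding-3-large": 3072}

def check_compatibility(ingest_model: str, query_model: str) -> None:
    """Refuse to query a store with a different model than the one
    that produced its vectors."""
    if ingest_model != query_model:
        raise ValueError(
            f"Stored vectors came from {ingest_model} ({DIMS[ingest_model]} dims); "
            f"querying with {query_model} ({DIMS[query_model]} dims) would "
            f"compare vectors from incompatible embedding spaces."
        )
```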
Chunking: The Make-or-Break Decision
Chunking strategy determines your RAG quality more than model selection. Bad chunks = bad answers.
| Chunk Size | Pros | Cons | Good For |
|---|---|---|---|
| Small (200-400 tokens) | Precise retrieval | May split important context | FAQ-style knowledge bases |
| Medium (500-1000 tokens) | Balanced precision/context | Standard tradeoff | Most documentation |
| Large (1000-2000 tokens) | Full context preserved | May include irrelevant text | Long-form articles, reports |
Overlap matters too. Without overlap, the boundary between chunks can split a sentence — and the critical information lives in neither chunk. A 10-20% overlap ensures sentences at chunk boundaries appear in both adjacent chunks.
Rule of thumb: Start with 800-token chunks and 100-token overlap. Test with real questions. If answers miss relevant context, increase chunk size. If answers include too much irrelevant text, decrease it.
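The boundary problem is easy to demonstrate. Using a simplified fixed-window splitter (real splitters also respect sentence boundaries), the phrase "within 30 days" straddles the chunk boundary and survives in no chunk — unless overlap is added:

```python
def split(text: str, size: int, overlap: int) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step) if text[i:i + size]]

text = ("Our support team replies in one business day. "
        "Refunds are available within 30 days of purchase.")

no_overlap = split(text, size=75, overlap=0)
with_overlap = split(text, size=75, overlap=20)

# Without overlap, the boundary at character 75 cuts "within 30 days"
# in half, so neither chunk contains the phrase whole.
print(any("within 30 days" in c for c in no_overlap))    # False
print(any("within 30 days" in c for c in with_overlap))  # True
```

A question like "How long do I have to request a refund?" would retrieve nothing useful from the no-overlap chunks, even though the answer is in the source text.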
RAG vs. Fine-Tuning vs. Long Context
Three approaches to giving an LLM your knowledge:
| Approach | When to Use | Tradeoff |
|---|---|---|
| RAG | Dynamic knowledge that changes (docs, policies, product info) | Depends on retrieval quality — garbage in, garbage out |
| Fine-Tuning | Teaching the model a specific style or pattern | Expensive, slow to update, requires training data |
| Long Context | Small, fixed knowledge base (<100 pages) | Expensive per query, no scaling beyond context limit |
For most business workflows, RAG is the right choice. Your docs change, your policies update, your product evolves — and you don’t want to retrain a model every time.
Advanced: RAG with the AI Agent
The Q&A Chain is great for straightforward document retrieval. But what if you want an agent that can both search your documents and use other tools (web search, code execution)?
Connect the vector store as a Vector Store Tool to the AI Agent node instead of using the Q&A Chain. The agent can then decide when to check your documents vs. when to search the web — combining RAG with the tool-use patterns from Lesson 4.
Update the system prompt:
```
You have access to:
- Company Knowledge Base (vector store): Use for internal policies, product docs, procedures
- Web Search: Use for external information, industry benchmarks, competitor data

Always check the knowledge base FIRST for internal questions.
Only use web search if the knowledge base doesn't have the answer.
```
Key Takeaways
- RAG lets your AI answer questions from your actual documents — not from the LLM’s training data
- The pipeline is: ingest (load → chunk → embed → store) then query (question → embed → search → generate)
- Chunking is the most impactful decision — start with 800 tokens and 100 overlap, then tune
- Use the same embedding model for both ingestion and retrieval — mismatched models produce bad results
- Supabase (pgvector) is a solid free option for vector storage; Pinecone and Qdrant are alternatives
- Combine RAG with the AI Agent to build assistants that search your docs and use external tools
Up Next
You’ve now built AI workflows with classification, agents, memory, and RAG. But none of them are production-ready. What happens when the LLM API times out? When your vector store is down? When credentials expire? In Lesson 7, you’ll learn production patterns — error handling, retry strategies, credential management, and monitoring — that turn your prototypes into workflows you can trust.