Document Processing: Chunking and Preparation
Master document chunking strategies: fixed-size, semantic, and structure-aware chunking. Learn metadata extraction, overlap, and how chunk size affects retrieval quality.
Chunking is the most underrated step in RAG. Get it wrong, and even the best embedding model and vector database won’t help — because the information the user needs will be split, buried, or missing.
🔄 Quick Recall: In the previous lesson, you learned the three RAG stages: indexing, retrieval, and generation. Chunking is the critical step in indexing that determines what the retrieval stage can find. Bad chunks = bad retrieval = bad answers.
Why Chunking Matters
An LLM’s context window is finite. You can’t feed it your entire document collection — you need to find the specific pieces that answer the question. Chunking determines what those pieces look like.
Too small: Chunks lack context. “30 days” means nothing without “electronics return policy” around it.
Too large: Chunks contain too many topics. A 5-page chunk about “all return policies” dilutes the specific electronics policy the user asked about.
Just right: Chunks contain one complete idea with enough context to stand alone.
Strategy 1: Fixed-Size Chunking
Split by a fixed number of tokens, with overlap:
Document: [..........500 tokens..........]
Chunk 1: [===== 200 tokens =====]
Chunk 2: [===== 200 tokens =====] (50-token overlap)
Chunk 3: [===== 200 tokens =====]
Settings:
- Chunk size: 200-512 tokens (experiment for your content)
- Overlap: 10-25% of chunk size (prevents boundary splits)
Pros: Simple to implement, predictable chunk count
Cons: Ignores document structure — splits mid-sentence, mid-paragraph, mid-section
Best for: Unstructured text without clear headings (chat logs, emails, plain text)
✅ Quick Check: You set chunk size to 100 tokens with no overlap. Your document says: “The cancellation fee is 15% of the remaining contract value. This fee is waived for customers who have been with us for more than 3 years.” This critical two-sentence policy gets split across two chunks. What percentage overlap would prevent this? (Answer: 20-25% overlap (20-25 tokens). The second sentence starts around token 60-70, and with 20-25 tokens of overlap, the full policy would appear intact in at least one chunk. As a rule: set overlap to cover at least 2-3 sentences to prevent splitting complete thoughts.)
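The splitting itself takes only a few lines. Here is a minimal Python sketch that assumes the document has already been tokenized into a list (a real pipeline would use a tokenizer such as tiktoken to produce the tokens):

```python
def fixed_size_chunks(tokens, chunk_size=200, overlap=50):
    """Split a token list into overlapping fixed-size chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk already reached the end of the document
    return chunks
```

With the 500-token document from the diagram above, `chunk_size=200` and `overlap=50` yield three chunks, and each chunk's first 50 tokens repeat the previous chunk's last 50 — which is exactly what keeps a two-sentence policy intact across a boundary.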
Strategy 2: Semantic Chunking
Split at natural semantic boundaries — sentences, paragraphs, or topic shifts:
Document about return policies:
Chunk 1: [Paragraph about electronics returns]
Chunk 2: [Paragraph about clothing returns]
Chunk 3: [Paragraph about furniture returns]
How Semantic Chunking Works
- Split text into sentences
- Generate embeddings for each sentence
- Compare adjacent sentence embeddings
- When similarity drops below a threshold (topic shift), insert a chunk boundary
Pros: Chunks contain complete, coherent ideas
Cons: Variable chunk sizes, more complex to implement
Best for: Articles, reports, documentation — content with clear topic flow
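The four steps above can be sketched in a few lines of Python. The `embed` argument is a placeholder for a real embedding model (in production it would call something like a sentence-transformer); the cosine comparison and threshold check are the core of the technique:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(sentences, embed, threshold=0.5):
    """Start a new chunk whenever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine(prev_vec, vec) < threshold:
            chunks.append([sentence])  # topic shift: insert a chunk boundary
        else:
            chunks[-1].append(sentence)
        prev_vec = vec
    return [" ".join(chunk) for chunk in chunks]
```

The threshold is a tuning knob: a higher value produces more, smaller chunks; a lower value merges loosely related sentences together.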
Strategy 3: Structure-Aware Chunking
Use the document’s own structure (headings, sections, chapters) as chunk boundaries:
# Employee Handbook
## Chapter 1: Onboarding → Chunk 1
### First Day Checklist
### System Access Setup
## Chapter 2: Time Off Policy → Chunk 2
### Vacation Days
### Sick Leave
### Parental Leave
## Chapter 3: Benefits → Chunk 3
### Health Insurance
### Retirement Plan
Pros: Preserves the document’s logical organization
Cons: Sections may be too large or too small — needs post-processing
Best for: Documentation, manuals, handbooks, structured reports
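A minimal sketch of heading-based splitting for Markdown. Here `level=2` treats each `##` chapter as one chunk; a production splitter would also attach the parent `#` title to each chunk as metadata:

```python
def chunk_by_headings(markdown_text, level=2):
    """Split Markdown into one chunk per heading at the given level."""
    marker = "#" * level + " "
    chunks, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith(marker):
            if current:
                chunks.append("\n".join(current))
            current = [line]  # the heading opens a new chunk
        else:
            current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Note that `### ` subsections stay inside their parent chapter because they do not match the `## ` marker, and anything before the first `##` (the document title here) becomes its own leading chunk.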
Hierarchical Chunking
For structure-aware content, keep both the section and subsection levels:
Parent chunk: "Chapter 2: Time Off Policy" (full chapter)
Child chunks:
- "Vacation Days" (subsection)
- "Sick Leave" (subsection)
- "Parental Leave" (subsection)
Search first finds the relevant parent, then drills into the specific child chunk.
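One way to represent this parent/child structure in code, as a sketch. The `drill_down` helper is hypothetical, and its keyword-overlap scoring merely stands in for real vector search — the point is the two-stage lookup: best parent first, then best child within it:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    title: str
    text: str
    children: list = field(default_factory=list)

def hierarchical_chunks(sections):
    """sections: list of (chapter_title, [(subsection_title, subsection_text), ...])."""
    parents = []
    for chapter_title, subsections in sections:
        children = [Chunk(title, text) for title, text in subsections]
        parent_text = "\n".join(child.text for child in children)  # full chapter
        parents.append(Chunk(chapter_title, parent_text, children))
    return parents

def drill_down(parents, query):
    """Toy two-stage retrieval: score chunks by keyword overlap with the query."""
    words = query.lower().split()
    score = lambda chunk: sum(w in chunk.text.lower() for w in words)
    parent = max(parents, key=score)       # stage 1: find the relevant chapter
    child = max(parent.children, key=score)  # stage 2: drill into the subsection
    return parent, child
```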
Metadata Enrichment
Raw text chunks aren’t enough. Add metadata to improve retrieval:
{
  "text": "Electronics can be returned within 30 days...",
  "metadata": {
    "source": "return-policy.pdf",
    "page": 3,
    "section": "Electronics Returns",
    "document_type": "policy",
    "last_updated": "2025-11-15",
    "department": "customer_service"
  }
}
Metadata enables:
- Filtered search: “Find only chunks from the policy department”
- Source attribution: “This answer comes from return-policy.pdf, page 3”
- Freshness control: “Prefer documents updated in the last 6 months”
✅ Quick Check: A user asks “What’s the current vacation policy?” Your knowledge base has two versions of the HR policy: one from 2024 and one from 2025. Without metadata, the retrieval might return chunks from the outdated 2024 version. How does metadata solve this? (Answer: The last_updated metadata field lets you filter or prioritize recent documents. Your retrieval query becomes: “Find chunks about vacation policy, prefer documents updated after 2025-01-01.” This ensures the 2025 policy ranks higher than the 2024 version. Metadata turns a simple text search into a structured, context-aware search.)
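A sketch of what metadata filtering looks like, assuming chunks shaped like the JSON example above. In practice your vector database applies these filters natively alongside the similarity search, but the logic is the same:

```python
from datetime import date

def filter_chunks(chunks, department=None, updated_after=None):
    """Keep only chunks whose metadata passes the given filters."""
    results = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if department is not None and meta.get("department") != department:
            continue  # filtered search: wrong department
        if updated_after is not None and \
                date.fromisoformat(meta["last_updated"]) <= updated_after:
            continue  # freshness control: too old
        results.append(chunk)
    return results
```

Running this before (or alongside) similarity ranking is what makes the 2025 policy win over the 2024 one.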
Choosing the Right Strategy
| Content Type | Recommended Strategy | Chunk Size |
|---|---|---|
| Chat logs, emails | Fixed-size with overlap | 200-300 tokens |
| Articles, reports | Semantic chunking | Variable (200-500 tokens) |
| Manuals, handbooks | Structure-aware | Section-based |
| Code documentation | Structure-aware (by function/class) | Function-level |
| FAQ pages | One Q&A per chunk | Variable |
| Legal contracts | Structure-aware (by clause) | Clause-level |
Practice Exercise
- Take a document from your work (policy, manual, report)
- Try chunking it three ways: fixed-size (300 tokens), by paragraph, and by heading
- For each approach, ask yourself: If I searched for a specific question, would the right chunk come back?
- Note which strategy keeps the most complete, meaningful chunks
Key Takeaways
- Chunk size is a trade-off: too small loses context, too large dilutes relevance
- Overlap (10-25%) prevents splitting meaningful content at chunk boundaries
- Fixed-size chunking is simplest but ignores document structure
- Semantic chunking splits at topic boundaries for coherent chunks
- Structure-aware chunking uses the document’s own headings and sections
- Metadata (source, date, section, type) enables filtered search and source attribution
- Match your strategy to your content type — there’s no universal best approach
Up Next
In the next lesson, you’ll learn how embeddings convert text chunks into vectors and how to choose the right vector database for your knowledge base.