Document Processing: Chunking and Preparation
Master document chunking strategies: fixed-size, semantic, and structure-aware chunking. Learn metadata extraction, overlap, and how chunk size affects retrieval quality.
Chunking is the most underrated step in RAG. Get it wrong, and even the best embedding model and vector database won’t help — because the information the user needs will be split, buried, or missing.
🔄 Quick Recall: In the previous lesson, you learned the three RAG stages: indexing, retrieval, and generation. Chunking is the critical step in indexing that determines what the retrieval stage can find. Bad chunks = bad retrieval = bad answers.
Why Chunking Matters
An LLM’s context window is finite. You can’t feed it your entire document collection — you need to find the specific pieces that answer the question. Chunking determines what those pieces look like.
Too small: Chunks lack context. “30 days” means nothing without “electronics return policy” around it.
Too large: Chunks contain too many topics. A 5-page chunk about “all return policies” dilutes the specific electronics policy the user asked about.
Just right: Chunks contain one complete idea with enough context to stand alone.
Strategy 1: Fixed-Size Chunking
Split by a fixed number of tokens, with overlap:
Document: [..........500 tokens..........]
Chunk 1: [===== 200 tokens =====]
Chunk 2: [===== 200 tokens =====] (50-token overlap)
Chunk 3: [===== 200 tokens =====]
Settings:
- Chunk size: 200-512 tokens (experiment for your content)
- Overlap: 10-25% of chunk size (prevents boundary splits)
Pros: Simple to implement, predictable chunk count
Cons: Ignores document structure — splits mid-sentence, mid-paragraph, mid-section
Best for: Unstructured text without clear headings (chat logs, emails, plain text)
✅ Quick Check: You set chunk size to 100 tokens with no overlap. Your document says: “The cancellation fee is 15% of the remaining contract value. This fee is waived for customers who have been with us for more than 3 years.” This critical two-sentence policy gets split across two chunks. What percentage overlap would prevent this? (Answer: 20-25% overlap (20-25 tokens). The second sentence starts around token 60-70, and with 20-25 tokens of overlap, the full policy would appear intact in at least one chunk. As a rule: set overlap to cover at least 2-3 sentences to prevent splitting complete thoughts.)
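The splitting itself takes only a few lines. Here is a minimal Python sketch that assumes the document has already been tokenized into a list (a real pipeline would use a tokenizer such as tiktoken to produce the tokens):

```python
def fixed_size_chunks(tokens, chunk_size=200, overlap=50):
    """Split a token list into overlapping fixed-size chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk already reached the end of the document
    return chunks
```

With the 500-token document from the diagram above, `chunk_size=200` and `overlap=50` yield three chunks, and each chunk's first 50 tokens repeat the previous chunk's last 50 — which is exactly what keeps a two-sentence policy intact across a boundary.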
Strategy 2: Semantic Chunking
Split at natural semantic boundaries — sentences, paragraphs, or topic shifts:
Document about return policies:
Chunk 1: [Paragraph about electronics returns]
Chunk 2: [Paragraph about clothing returns]
Chunk 3: [Paragraph about furniture returns]
How Semantic Chunking Works
- Split text into sentences
- Generate embeddings for each sentence
- Compare adjacent sentence embeddings
- When similarity drops below a threshold (topic shift), insert a chunk boundary
Pros: Chunks contain complete, coherent ideas
Cons: Variable chunk sizes, more complex to implement
Best for: Articles, reports, documentation — content with clear topic flow
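The four steps above can be sketched in a few lines of Python. The `embed` argument is a placeholder for a real embedding model (in production it would call something like a sentence-transformer); the cosine comparison and threshold check are the core of the technique:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(sentences, embed, threshold=0.5):
    """Start a new chunk whenever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine(prev_vec, vec) < threshold:
            chunks.append([sentence])  # topic shift: insert a chunk boundary
        else:
            chunks[-1].append(sentence)
        prev_vec = vec
    return [" ".join(chunk) for chunk in chunks]
```

The threshold is a tuning knob: a higher value produces more, smaller chunks; a lower value merges loosely related sentences together.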
Strategy 3: Structure-Aware Chunking
Use the document’s own structure (headings, sections, chapters) as chunk boundaries:
# Employee Handbook
## Chapter 1: Onboarding → Chunk 1
### First Day Checklist
### System Access Setup
## Chapter 2: Time Off Policy → Chunk 2
### Vacation Days
### Sick Leave
### Parental Leave
## Chapter 3: Benefits → Chunk 3
### Health Insurance
### Retirement Plan
Pros: Preserves the document’s logical organization
Cons: Sections may be too large or too small — needs post-processing
Best for: Documentation, manuals, handbooks, structured reports
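A minimal sketch of heading-based splitting for Markdown. Here `level=2` treats each `##` chapter as one chunk; a production splitter would also attach the parent `#` title to each chunk as metadata:

```python
def chunk_by_headings(markdown_text, level=2):
    """Split Markdown into one chunk per heading at the given level."""
    marker = "#" * level + " "
    chunks, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith(marker):
            if current:
                chunks.append("\n".join(current))
            current = [line]  # the heading opens a new chunk
        else:
            current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Note that `### ` subsections stay inside their parent chapter because they do not match the `## ` marker, and anything before the first `##` (the document title here) becomes its own leading chunk.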
Hierarchical Chunking
For structure-aware content, keep both the section and subsection levels:
Parent chunk: "Chapter 2: Time Off Policy" (full chapter)
Child chunks:
- "Vacation Days" (subsection)
- "Sick Leave" (subsection)
- "Parental Leave" (subsection)
Search first finds the relevant parent, then drills into the specific child chunk.
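One way to represent this parent/child structure in code, as a sketch. The `drill_down` helper is hypothetical, and its keyword-overlap scoring merely stands in for real vector search — the point is the two-stage lookup: best parent first, then best child within it:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    title: str
    text: str
    children: list = field(default_factory=list)

def hierarchical_chunks(sections):
    """sections: list of (chapter_title, [(subsection_title, subsection_text), ...])."""
    parents = []
    for chapter_title, subsections in sections:
        children = [Chunk(title, text) for title, text in subsections]
        parent_text = "\n".join(child.text for child in children)  # full chapter
        parents.append(Chunk(chapter_title, parent_text, children))
    return parents

def drill_down(parents, query):
    """Toy two-stage retrieval: score chunks by keyword overlap with the query."""
    words = query.lower().split()
    score = lambda chunk: sum(w in chunk.text.lower() for w in words)
    parent = max(parents, key=score)       # stage 1: find the relevant chapter
    child = max(parent.children, key=score)  # stage 2: drill into the subsection
    return parent, child
```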
Metadata Enrichment
Raw text chunks aren’t enough. Add metadata to improve retrieval:
{
  "text": "Electronics can be returned within 30 days...",
  "metadata": {
    "source": "return-policy.pdf",
    "page": 3,
    "section": "Electronics Returns",
    "document_type": "policy",
    "last_updated": "2025-11-15",
    "department": "customer_service"
  }
}
Metadata enables:
- Filtered search: “Find only chunks from the policy department”
- Source attribution: “This answer comes from return-policy.pdf, page 3”
- Freshness control: “Prefer documents updated in the last 6 months”
✅ Quick Check: A user asks “What’s the current vacation policy?” Your knowledge base has two versions of the HR policy: one from 2024 and one from 2025. Without metadata, the retrieval might return chunks from the outdated 2024 version. How does metadata solve this? (Answer: The last_updated metadata field lets you filter or prioritize recent documents. Your retrieval query becomes: “Find chunks about vacation policy, prefer documents updated after 2025-01-01.” This ensures the 2025 policy ranks higher than the 2024 version. Metadata turns a simple text search into a structured, context-aware search.)
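A sketch of what metadata filtering looks like, assuming chunks shaped like the JSON example above. In practice your vector database applies these filters natively alongside the similarity search, but the logic is the same:

```python
from datetime import date

def filter_chunks(chunks, department=None, updated_after=None):
    """Keep only chunks whose metadata passes the given filters."""
    results = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if department is not None and meta.get("department") != department:
            continue  # filtered search: wrong department
        if updated_after is not None and \
                date.fromisoformat(meta["last_updated"]) <= updated_after:
            continue  # freshness control: too old
        results.append(chunk)
    return results
```

Running this before (or alongside) similarity ranking is what makes the 2025 policy win over the 2024 one.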
Choosing the Right Strategy
| Content Type | Recommended Strategy | Chunk Size |
|---|---|---|
| Chat logs, emails | Fixed-size with overlap | 200-300 tokens |
| Articles, reports | Semantic chunking | Variable (200-500 tokens) |
| Manuals, handbooks | Structure-aware | Section-based |
| Code documentation | Structure-aware (by function/class) | Function-level |
| FAQ pages | One Q&A per chunk | Variable |
| Legal contracts | Structure-aware (by clause) | Clause-level |
Practice Exercise
- Take a document from your work (policy, manual, report)
- Try chunking it three ways: fixed-size (300 tokens), by paragraph, and by heading
- For each approach, ask yourself: If I searched for a specific question, would the right chunk come back?
- Note which strategy keeps the most complete, meaningful chunks
Key Takeaways
- Chunk size is a trade-off: too small loses context, too large dilutes relevance
- Overlap (10-25%) prevents splitting meaningful content at chunk boundaries
- Fixed-size chunking is simplest but ignores document structure
- Semantic chunking splits at topic boundaries for coherent chunks
- Structure-aware chunking uses the document’s own headings and sections
- Metadata (source, date, section, type) enables filtered search and source attribution
- Match your strategy to your content type — there’s no universal best approach
Up Next
In the next lesson, you’ll learn how embeddings convert text chunks into vectors and how to choose the right vector database for your knowledge base.