Dataset Preparation
How to build, clean, and format training datasets for fine-tuning. Covers data quality rules, synthetic generation, and minimum dataset sizes by task.
🔄 You’ve got the tools (Unsloth, Axolotl, OpenAI API) and you understand the methods (SFT, DPO, QLoRA). But here’s the uncomfortable truth: none of that matters if your data is bad. A perfectly configured training run on garbage data produces a garbage model. Every time.
Dataset preparation is where most fine-tuning projects succeed or fail. And it’s the step most tutorials skip.
What You’ll Learn
By the end of this lesson, you’ll know how to build a training dataset from scratch — the right format, the right size, and the quality bar that actually matters.
The Quality Rule
This is the single most important thing in this entire course:
1,000 high-quality examples > 10,000 mediocre ones.
Not sometimes. Not usually. Every time.
A “high-quality” example means:
- The input is realistic (something the model will actually see in production)
- The output is exactly what you want (not “close enough”)
- The format is consistent across examples
- There are no contradictions between examples
✅ Quick Check: You’re building a customer support fine-tune. Which dataset is better? (A) 5,000 real support tickets with auto-generated responses, or (B) 800 tickets where a senior support agent wrote the ideal response for each? (B. The agent’s responses set the quality ceiling. Auto-generated responses teach the model to produce average outputs.)
Data Formats
Every fine-tuning tool expects a specific format. Here are the three you’ll encounter:
OpenAI / Chat Format (Most Common)
{"messages": [{"role": "system", "content": "You are a medical coding assistant."}, {"role": "user", "content": "Patient presents with acute bronchitis."}, {"role": "assistant", "content": "ICD-10: J20.9\nCPT: 99213"}]}
{"messages": [{"role": "system", "content": "You are a medical coding assistant."}, {"role": "user", "content": "Patient diagnosed with type 2 diabetes."}, {"role": "assistant", "content": "ICD-10: E11.9\nCPT: 99214"}]}
One JSON object per line. One conversation per object. This is JSONL — JSON Lines.
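Before training, it pays to verify every line of your JSONL file actually parses and follows the chat schema. Here's a minimal validator sketch; the role set and the "last message must be from the assistant" rule reflect the format shown above, but adapt the checks to your own tooling's requirements:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_chat_line(line: str) -> bool:
    """Return True if a JSONL line is a well-formed chat-format example."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = obj.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for msg in messages:
        # Every message needs a known role and string content
        if msg.get("role") not in VALID_ROLES or not isinstance(msg.get("content"), str):
            return False
    # The last message is the training target, so it should be the assistant's
    return messages[-1]["role"] == "assistant"
```

Run it over your file line by line and reject (or log) anything that fails; one malformed line can abort an entire training job.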
Alpaca Format (Hugging Face)
{"instruction": "Classify the sentiment", "input": "This product is terrible.", "output": "Negative"}
{"instruction": "Classify the sentiment", "input": "Best purchase I've made all year!", "output": "Positive"}
Simpler but less flexible. Works for single-turn tasks.
ShareGPT Format (Multi-Turn)
{"conversations": [{"from": "human", "value": "What's LoRA?"}, {"from": "gpt", "value": "LoRA is..."}, {"from": "human", "value": "How does it save memory?"}, {"from": "gpt", "value": "It freezes..."}]}
For multi-turn conversations. Axolotl supports this natively.
Which to use? Start with the OpenAI chat format. It works with OpenAI’s API, Unsloth, and most Hugging Face tools. You can always convert later.
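Converting between formats is mechanical. Here's a sketch of an Alpaca-to-chat converter, assuming the field names shown above (`instruction`, `input`, `output`); how you join `instruction` and `input` is a design choice, and a blank-line separator is just one common convention:

```python
def alpaca_to_chat(example: dict, system_prompt: str = "") -> dict:
    """Convert one Alpaca-format example to the OpenAI chat format."""
    user_content = example["instruction"]
    if example.get("input"):
        # Append the optional input field below the instruction
        user_content += "\n\n" + example["input"]
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": example["output"]})
    return {"messages": messages}
```

Apply it to every line of an Alpaca JSONL file and you have a chat-format dataset ready for OpenAI's API or Unsloth.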
How Many Examples Do You Need?
This depends entirely on what you’re teaching the model:
| Task | Minimum | Sweet Spot | What You’re Teaching |
|---|---|---|---|
| Style/tone transfer | 50-100 | 200-500 | “Sound like this” |
| Output format | 100-200 | 300-500 | “Always return JSON like this” |
| Classification | 200-500 | 1,000-2,000 | “Put things in these categories” |
| Instruction following | 500-1,000 | 2,000-5,000 | “When asked X, do Y” |
| Domain specialization | 1,000+ | 5,000-10,000 | “Be an expert in this field” |
Notice the pattern: the more complex the behavior change, the more examples you need. But the ceiling is lower than you’d think — going from 5,000 to 50,000 examples rarely improves results and can actually hurt (overfitting).
Building Your Dataset: Three Approaches
1. Curate Real Data
The gold standard. Collect real inputs your model will see in production, then write (or select) the ideal outputs.
Where to find inputs:
- Customer support tickets, chat logs, emails
- Your company’s existing classification data
- Domain documents (contracts, medical records, code)
- User queries from your product’s search logs
How to create outputs:
- Have domain experts write ideal responses
- Select the best existing responses from your data
- Use a rubric: each response scores 1-5 on accuracy, format, tone
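The rubric step translates directly into a filter. This is a sketch that assumes a hypothetical `scores` dict attached to each example during review; the 1-5 axes match the rubric above, and the cutoff is an assumption you should tune:

```python
MIN_SCORE = 4  # hypothetical quality bar: keep only examples scoring 4+ on every axis

def passes_rubric(example: dict) -> bool:
    """Check every rubric axis (e.g. accuracy, format, tone) against the bar."""
    scores = example["scores"]  # assumed shape: {"accuracy": 5, "format": 4, "tone": 4}
    return all(s >= MIN_SCORE for s in scores.values())

def filter_by_rubric(examples: list) -> list:
    return [ex for ex in examples if passes_rubric(ex)]
```

Requiring the minimum on *every* axis (rather than averaging) is deliberate: one weak dimension, like broken formatting, can poison training even when the rest of the response is strong.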
2. Synthetic Data Generation
When you don’t have enough real data, use a larger model to generate training examples:
Prompt to GPT-4 / Claude:
You are generating training data for a customer support fine-tune.
The model should respond in a friendly, professional tone and always
include an order number reference.
Given this customer message:
"I ordered a blue sweater but received a red one."
Write the ideal support response.
The workflow:
- Start with 20-50 hand-written “seed” examples
- Use GPT-4/Claude to generate 500+ similar examples
- Manually review 10-20% of generated examples
- Filter out anything that doesn’t match your quality bar
- Repeat with different prompts to increase diversity
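The "review 10-20%" step is easy to automate the sampling for. A minimal sketch, assuming your generated examples are a list of dicts; the fraction and seed are illustrative defaults:

```python
import random

def sample_for_review(examples: list, fraction: float = 0.15, seed: int = 42) -> list:
    """Randomly sample a fraction of generated examples for manual review."""
    rng = random.Random(seed)  # fixed seed so the review set is reproducible
    k = max(1, int(len(examples) * fraction))
    return rng.sample(examples, k)
```

Sampling randomly (rather than reviewing the first N) matters: generation quality often drifts over a long run, and the first examples are rarely representative of the whole batch.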
The trap: Synthetic data inherits the generating model’s biases and patterns. If you don’t review, your fine-tuned model will sound like GPT-4 — which defeats the purpose of fine-tuning for your specific style.
3. Use Existing Datasets
Hugging Face Hub has thousands of instruction-tuning datasets. Some useful ones:
- OpenAssistant/oasst1 — 160K human-written conversations
- teknium/OpenHermes — 1M+ synthetic instruction examples
- HuggingFaceH4/ultrachat_200k — 200K high-quality dialogues
- yahma/alpaca-cleaned — 52K cleaned instruction examples
These work for general fine-tuning. For domain-specific tasks, you’ll still need your own data — but these can supplement.
✅ Quick Check: You’re fine-tuning a model to extract structured data from legal contracts. Can you just use the Alpaca dataset? (No. Alpaca is general-purpose instruction data. For legal extraction, you need examples of actual contracts with the structured output you want. General datasets teach general behavior — they won’t teach domain-specific extraction patterns.)
Cleaning Your Dataset
Raw data is never clean enough. Here’s the checklist:
Remove:
- Duplicate examples (exact or near-duplicate)
- Examples shorter than your minimum useful length
- Examples with formatting errors (broken JSON, missing fields)
- Examples that contradict each other
- PII (names, emails, phone numbers) unless intentional
Verify:
- Output format is consistent across all examples
- System prompts are identical (if using them)
- Token length fits your max_seq_length setting
- Label distribution is roughly balanced (for classification)
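The length check can be scripted too. This sketch uses a crude ~4-characters-per-token heuristic so it stays dependency-free; for an exact count, swap in your model's actual tokenizer:

```python
MAX_SEQ_LENGTH = 2048  # assumption: must match the max_seq_length in your training config

def within_length(example: dict, max_tokens: int = MAX_SEQ_LENGTH) -> bool:
    """Rough length check using a ~4-chars-per-token heuristic.

    Replace with your model's real tokenizer for exact counts.
    """
    text = " ".join(m["content"] for m in example["messages"])
    approx_tokens = len(text) // 4
    return approx_tokens <= max_tokens
```

Examples that exceed the limit get silently truncated during training, which usually means the model trains on an input with its target answer cut off, so filter them out up front.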
A practical deduplication check:
import hashlib

seen = set()
clean_data = []
for example in dataset:
    # Hash the serialized example to detect exact duplicates
    h = hashlib.md5(str(example).encode()).hexdigest()
    if h not in seen:
        seen.add(h)
        clean_data.append(example)

print(f"Removed {len(dataset) - len(clean_data)} duplicates")
Train/Test Split
Always hold out data for evaluation. Never train on your test set.
The split:
- Training set: 80-90% of your data
- Test set: 10-20% of your data (minimum 50 examples)
Critical rule: Split before any augmentation or synthetic generation. Your test set must contain only real, unmodified examples. Otherwise you’re testing the model on data it’s seen before — which tells you nothing.
For classification tasks, make sure the split is stratified — each category appears in the same proportion in both train and test sets.
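A stratified split doesn't need a library. Here's a minimal pure-Python sketch, assuming each example is a dict carrying its label under a known key:

```python
import random
from collections import defaultdict

def stratified_split(examples, label_key="label", test_frac=0.1, seed=42):
    """Split so each label appears in roughly the same proportion in both sets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        # Take the test fraction from each label group, at least one example
        n_test = max(1, int(len(group) * test_frac))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

If you'd rather not hand-roll this, `train_test_split(..., stratify=labels)` from scikit-learn does the same thing; the point is that splitting per label group is what keeps rare categories from vanishing entirely from your test set.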
DPO Preference Pairs
If you’re doing SFT + DPO (the modern standard for quality), you also need preference data:
{
  "prompt": "Explain machine learning to a marketing manager",
  "chosen": "Think of ML as teaching a computer to spot patterns in your customer data...",
  "rejected": "Machine learning is a subset of artificial intelligence that uses statistical methods..."
}
How to create preference pairs:
- Generate 2-3 responses per prompt (different models or temperature settings)
- Have a human rank them: which is better?
- The best becomes “chosen,” the worst becomes “rejected”
You need fewer preference pairs than SFT examples — 200-500 is usually enough to noticeably improve quality.
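The ranking step above maps straight onto the pair format. A sketch, assuming a human has already ordered the candidate responses best-first:

```python
def to_preference_pair(prompt: str, ranked_responses: list) -> dict:
    """Turn a human-ranked list of responses (best first) into one DPO pair."""
    return {
        "prompt": prompt,
        "chosen": ranked_responses[0],    # top-ranked response
        "rejected": ranked_responses[-1], # bottom-ranked response
    }
```

Pairing the best against the *worst* (rather than against the runner-up) gives DPO a clearer preference signal, since the contrast between the two responses is what the training objective learns from.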
Key Takeaways
- Data quality beats quantity — 1,000 curated examples outperform 10,000 noisy ones
- Use the OpenAI chat format (JSONL with messages array) — it works everywhere
- Dataset size depends on task complexity: 50-100 for style, 1,000+ for domain specialization
- Synthetic data generation works but needs manual quality review on 10-20% of samples
- Always hold out 10-20% as a test set, and split before any augmentation
- For DPO, create preference pairs by ranking multiple responses per prompt
Up Next
You’ve got data. You’ve got tools. In the next lesson, you’ll put them together — your first fine-tuning run. Step-by-step QLoRA fine-tuning of Llama 3.2 3B on Google Colab’s free T4 GPU. The whole thing takes about 15 minutes of compute time.