Dataset Preparation
How to build, clean, and format training datasets for fine-tuning. Covers data quality rules, synthetic generation, and minimum dataset sizes by task.
🔄 You’ve got the tools (Unsloth, Axolotl, OpenAI API) and you understand the methods (SFT, DPO, QLoRA). But here’s the uncomfortable truth: none of that matters if your data is bad. A perfectly configured training run on garbage data produces a garbage model. Every time.
Dataset preparation is where most fine-tuning projects succeed or fail. And it’s the step most tutorials skip.
What You’ll Learn
By the end of this lesson, you’ll know how to build a training dataset from scratch — the right format, the right size, and the quality bar that actually matters.
The Quality Rule
This is the single most important thing in this entire course:
1,000 high-quality examples > 10,000 mediocre ones.
Not sometimes. Not usually. Every time.
A “high-quality” example means:
- The input is realistic (something the model will actually see in production)
- The output is exactly what you want (not “close enough”)
- The format is consistent across examples
- There are no contradictions between examples
✅ Quick Check: You’re building a customer support fine-tune. Which dataset is better? (A) 5,000 real support tickets with auto-generated responses, or (B) 800 tickets where a senior support agent wrote the ideal response for each? (B. The agent’s responses set the quality ceiling. Auto-generated responses teach the model to produce average outputs.)
Data Formats
Every fine-tuning tool expects a specific format. Here are the three you’ll encounter:
OpenAI / Chat Format (Most Common)
{"messages": [{"role": "system", "content": "You are a medical coding assistant."}, {"role": "user", "content": "Patient presents with acute bronchitis."}, {"role": "assistant", "content": "ICD-10: J20.9\nCPT: 99213"}]}
{"messages": [{"role": "system", "content": "You are a medical coding assistant."}, {"role": "user", "content": "Patient diagnosed with type 2 diabetes."}, {"role": "assistant", "content": "ICD-10: E11.9\nCPT: 99214"}]}
One JSON object per line. One conversation per object. This is JSONL — JSON Lines.
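Before training, it pays to verify every line of your JSONL file actually parses and follows the chat schema. Here's a minimal validator sketch; the role set and the "last message must be from the assistant" rule reflect the format shown above, but adapt the checks to your own tooling's requirements:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_chat_line(line: str) -> bool:
    """Return True if a JSONL line is a well-formed chat-format example."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = obj.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for msg in messages:
        # Every message needs a known role and string content
        if msg.get("role") not in VALID_ROLES or not isinstance(msg.get("content"), str):
            return False
    # The last message is the training target, so it should be the assistant's
    return messages[-1]["role"] == "assistant"
```

Run it over your file line by line and reject (or log) anything that fails; one malformed line can abort an entire training job.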
Alpaca Format (Hugging Face)
{"instruction": "Classify the sentiment", "input": "This product is terrible.", "output": "Negative"}
{"instruction": "Classify the sentiment", "input": "Best purchase I've made all year!", "output": "Positive"}
Simpler but less flexible. Works for single-turn tasks.
ShareGPT Format (Multi-Turn)
{"conversations": [{"from": "human", "value": "What's LoRA?"}, {"from": "gpt", "value": "LoRA is..."}, {"from": "human", "value": "How does it save memory?"}, {"from": "gpt", "value": "It freezes..."}]}
For multi-turn conversations. Axolotl supports this natively.
Which to use? Start with the OpenAI chat format. It works with OpenAI’s API, Unsloth, and most Hugging Face tools. You can always convert later.
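Converting between formats is mechanical. Here's a sketch of an Alpaca-to-chat converter, assuming the field names shown above (`instruction`, `input`, `output`); how you join `instruction` and `input` is a design choice, and a blank-line separator is just one common convention:

```python
def alpaca_to_chat(example: dict, system_prompt: str = "") -> dict:
    """Convert one Alpaca-format example to the OpenAI chat format."""
    user_content = example["instruction"]
    if example.get("input"):
        # Append the optional input field below the instruction
        user_content += "\n\n" + example["input"]
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": example["output"]})
    return {"messages": messages}
```

Apply it to every line of an Alpaca JSONL file and you have a chat-format dataset ready for OpenAI's API or Unsloth.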
How Many Examples Do You Need?
This depends entirely on what you’re teaching the model:
| Task | Minimum | Sweet Spot | What You’re Teaching |
|---|---|---|---|
| Style/tone transfer | 50-100 | 200-500 | “Sound like this” |
| Output format | 100-200 | 300-500 | “Always return JSON like this” |
| Classification | 200-500 | 1,000-2,000 | “Put things in these categories” |
| Instruction following | 500-1,000 | 2,000-5,000 | “When asked X, do Y” |
| Domain specialization | 1,000+ | 5,000-10,000 | “Be an expert in this field” |
Notice the pattern: the more complex the behavior change, the more examples you need. But the ceiling is lower than you’d think — going from 5,000 to 50,000 examples rarely improves results and can actually hurt (overfitting).
Building Your Dataset: Three Approaches
1. Curate Real Data
The gold standard. Collect real inputs your model will see in production, then write (or select) the ideal outputs.
Where to find inputs:
- Customer support tickets, chat logs, emails
- Your company’s existing classification data
- Domain documents (contracts, medical records, code)
- User queries from your product’s search logs
How to create outputs:
- Have domain experts write ideal responses
- Select the best existing responses from your data
- Use a rubric: each response scores 1-5 on accuracy, format, tone
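The rubric step translates directly into a filter. This is a sketch that assumes a hypothetical `scores` dict attached to each example during review; the 1-5 axes match the rubric above, and the cutoff is an assumption you should tune:

```python
MIN_SCORE = 4  # hypothetical quality bar: keep only examples scoring 4+ on every axis

def passes_rubric(example: dict) -> bool:
    """Check every rubric axis (e.g. accuracy, format, tone) against the bar."""
    scores = example["scores"]  # assumed shape: {"accuracy": 5, "format": 4, "tone": 4}
    return all(s >= MIN_SCORE for s in scores.values())

def filter_by_rubric(examples: list) -> list:
    return [ex for ex in examples if passes_rubric(ex)]
```

Requiring the minimum on *every* axis (rather than averaging) is deliberate: one weak dimension, like broken formatting, can poison training even when the rest of the response is strong.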
2. Synthetic Data Generation
When you don’t have enough real data, use a larger model to generate training examples:
Prompt to GPT-4 / Claude:
You are generating training data for a customer support fine-tune.
The model should respond in a friendly, professional tone and always
include an order number reference.
Given this customer message:
"I ordered a blue sweater but received a red one."
Write the ideal support response.
The workflow:
- Start with 20-50 hand-written “seed” examples
- Use GPT-4/Claude to generate 500+ similar examples
- Manually review 10-20% of generated examples
- Filter out anything that doesn’t match your quality bar
- Repeat with different prompts to increase diversity
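The "review 10-20%" step is easy to automate the sampling for. A minimal sketch, assuming your generated examples are a list of dicts; the fraction and seed are illustrative defaults:

```python
import random

def sample_for_review(examples: list, fraction: float = 0.15, seed: int = 42) -> list:
    """Randomly sample a fraction of generated examples for manual review."""
    rng = random.Random(seed)  # fixed seed so the review set is reproducible
    k = max(1, int(len(examples) * fraction))
    return rng.sample(examples, k)
```

Sampling randomly (rather than reviewing the first N) matters: generation quality often drifts over a long run, and the first examples are rarely representative of the whole batch.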
The trap: Synthetic data inherits the generating model’s biases and patterns. If you don’t review, your fine-tuned model will sound like GPT-4 — which defeats the purpose of fine-tuning for your specific style.
3. Use Existing Datasets
Hugging Face Hub has thousands of instruction-tuning datasets. Some useful ones:
- OpenAssistant/oasst1 — 160K human-written conversations
- teknium/OpenHermes — 1M+ synthetic instruction examples
- HuggingFaceH4/ultrachat_200k — 200K high-quality dialogues
- yahma/alpaca-cleaned — 52K cleaned instruction examples
These work for general fine-tuning. For domain-specific tasks, you’ll still need your own data — but these can supplement.
✅ Quick Check: You’re fine-tuning a model to extract structured data from legal contracts. Can you just use the Alpaca dataset? (No. Alpaca is general-purpose instruction data. For legal extraction, you need examples of actual contracts with the structured output you want. General datasets teach general behavior — they won’t teach domain-specific extraction patterns.)
Cleaning Your Dataset
Raw data is never clean enough. Here’s the checklist:
Remove:
- Duplicate examples (exact or near-duplicate)
- Examples shorter than your minimum useful length
- Examples with formatting errors (broken JSON, missing fields)
- Examples that contradict each other
- PII (names, emails, phone numbers) unless intentional
Verify:
- Output format is consistent across all examples
- System prompts are identical (if using them)
- Token length fits your max_seq_length setting
- Label distribution is roughly balanced (for classification)
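The length check can be scripted too. This sketch uses a crude ~4-characters-per-token heuristic so it stays dependency-free; for an exact count, swap in your model's actual tokenizer:

```python
MAX_SEQ_LENGTH = 2048  # assumption: must match the max_seq_length in your training config

def within_length(example: dict, max_tokens: int = MAX_SEQ_LENGTH) -> bool:
    """Rough length check using a ~4-chars-per-token heuristic.

    Replace with your model's real tokenizer for exact counts.
    """
    text = " ".join(m["content"] for m in example["messages"])
    approx_tokens = len(text) // 4
    return approx_tokens <= max_tokens
```

Examples that exceed the limit get silently truncated during training, which usually means the model trains on an input with its target answer cut off, so filter them out up front.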
A practical deduplication check:
import hashlib

seen = set()
clean_data = []
for example in dataset:
    # Hash the serialized example to detect exact duplicates
    h = hashlib.md5(str(example).encode()).hexdigest()
    if h not in seen:
        seen.add(h)
        clean_data.append(example)

print(f"Removed {len(dataset) - len(clean_data)} duplicates")
Train/Test Split
Always hold out data for evaluation. Never train on your test set.
The split:
- Training set: 80-90% of your data
- Test set: 10-20% of your data (minimum 50 examples)
Critical rule: Split before any augmentation or synthetic generation. Your test set must contain only real, unmodified examples. Otherwise you’re testing the model on data it’s seen before — which tells you nothing.
For classification tasks, make sure the split is stratified — each category appears in the same proportion in both train and test sets.
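A stratified split doesn't need a library. Here's a minimal pure-Python sketch, assuming each example is a dict carrying its label under a known key:

```python
import random
from collections import defaultdict

def stratified_split(examples, label_key="label", test_frac=0.1, seed=42):
    """Split so each label appears in roughly the same proportion in both sets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        # Take the test fraction from each label group, at least one example
        n_test = max(1, int(len(group) * test_frac))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

If you'd rather not hand-roll this, `train_test_split(..., stratify=labels)` from scikit-learn does the same thing; the point is that splitting per label group is what keeps rare categories from vanishing entirely from your test set.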
DPO Preference Pairs
If you’re doing SFT + DPO (the modern standard for quality), you also need preference data:
{
  "prompt": "Explain machine learning to a marketing manager",
  "chosen": "Think of ML as teaching a computer to spot patterns in your customer data...",
  "rejected": "Machine learning is a subset of artificial intelligence that uses statistical methods..."
}
How to create preference pairs:
- Generate 2-3 responses per prompt (different models or temperature settings)
- Have a human rank them: which is better?
- The best becomes “chosen,” the worst becomes “rejected”
You need fewer preference pairs than SFT examples — 200-500 is usually enough to noticeably improve quality.
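The ranking step above maps straight onto the pair format. A sketch, assuming a human has already ordered the candidate responses best-first:

```python
def to_preference_pair(prompt: str, ranked_responses: list) -> dict:
    """Turn a human-ranked list of responses (best first) into one DPO pair."""
    return {
        "prompt": prompt,
        "chosen": ranked_responses[0],    # top-ranked response
        "rejected": ranked_responses[-1], # bottom-ranked response
    }
```

Pairing the best against the *worst* (rather than against the runner-up) gives DPO a clearer preference signal, since the contrast between the two responses is what the training objective learns from.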
Key Takeaways
- Data quality beats quantity — 1,000 curated examples outperform 10,000 noisy ones
- Use the OpenAI chat format (JSONL with messages array) — it works everywhere
- Dataset size depends on task complexity: 50-100 for style, 1,000+ for domain specialization
- Synthetic data generation works but needs manual quality review on 10-20% of samples
- Always hold out 10-20% as a test set, and split before any augmentation
- For DPO, create preference pairs by ranking multiple responses per prompt
Up Next
You’ve got data. You’ve got tools. In the next lesson, you’ll put them together — your first fine-tuning run. Step-by-step QLoRA fine-tuning of Llama 3.2 3B on Google Colab’s free T4 GPU. The whole thing takes about 15 minutes of compute time.