Your First Fine-Tuning Run
Step-by-step QLoRA fine-tuning of Llama 3.2 3B on a free Google Colab T4 GPU. From loading the model to testing your fine-tuned output.
Five lessons of theory. Now you build something real. In this lesson, you’ll fine-tune Llama 3.2 3B using QLoRA on a free Google Colab T4 — from loading the model to generating output with your fine-tuned version. Total compute time: about 15 minutes.
What You’ll Learn
By the end of this lesson, you’ll have run a complete fine-tuning job and tested the result. Not a tutorial you read about — one you actually execute.
Prerequisites
- A Google account (for Colab)
- Basic Python comfort (you’ll copy and run code, not write it from scratch)
- No GPU needed on your local machine — Colab provides one for free
Step 1: Set Up Colab
Open Google Colab and create a new notebook. Change the runtime:
Runtime → Change runtime type → T4 GPU
The free T4 has 16 GB VRAM. Our QLoRA setup uses about 5-6 GB. Plenty of headroom.
Install Unsloth:
%%capture
!pip install unsloth
This takes 2-3 minutes. The %%capture hides the install output.
Step 2: Load the Model
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA: 4-bit quantization
)
load_in_4bit=True is the QLoRA switch. The 3B model loads in about 3.5 GB of VRAM instead of 6+ GB.
✅ Quick Check: What happens if you set `load_in_4bit=False`? (The model loads in 16-bit precision — about 6 GB for a 3B model. Standard LoRA, not QLoRA. Still fits on a T4, but you’d have less headroom for larger batch sizes. For a 7B model, you’d need 14 GB in 16-bit — much tighter on a T4.)
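The VRAM figures above are easy to sanity-check with back-of-the-envelope arithmetic. A rough sketch (weights only — real usage adds overhead for the CUDA context, activations, and quantization metadata, which is why the 4-bit model loads in ~3.5 GB rather than ~1.6 GB):

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 3.2e9  # Llama 3.2 3B has roughly 3.2 billion parameters

print(round(weight_memory_gb(n_params, 4), 1))   # 4-bit (QLoRA) weights: 1.6 GB
print(round(weight_memory_gb(n_params, 16), 1))  # 16-bit weights: 6.4 GB
```

The same arithmetic explains the Quick Check answer: at 16 bits a 7B model needs ~14 GB for weights alone, which is why it barely fits a 16 GB T4 without quantization.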
Step 3: Add LoRA Adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,               # Rank: 16 is a good default
    lora_alpha=16,      # Scaling factor (usually = rank)
    lora_dropout=0.05,  # Light regularization
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
This adds LoRA adapters to the attention and feedforward layers. With rank 16 on all seven linear projections, you’re training under 1% of the model’s total parameters — on the order of 24 million out of roughly 3.2 billion.
What each parameter does:
- `r=16`: Adapter dimensionality. Higher = more expressive but more memory. 8-32 covers most tasks.
- `lora_alpha=16`: Scaling factor. Setting it equal to the rank is the standard.
- `target_modules`: Which layers get adapters. Including all linear layers (not just attention) is current best practice.
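To see where the trainable-parameter count comes from: each adapted weight matrix of shape (d_out, d_in) gains r × (d_in + d_out) LoRA parameters. A sketch using the published Llama 3.2 3B dimensions (hidden size 3072, intermediate size 8192, 28 layers, grouped-query attention with 1024-wide k/v projections — treat these as assumptions, and check the model's `print_trainable_parameters()` output for the exact figure):

```python
r = 16
hidden, inter, kv = 3072, 8192, 1024  # assumed Llama 3.2 3B dimensions
layers = 28

# (d_in, d_out) for each target module in one decoder layer
shapes = [
    (hidden, hidden),  # q_proj
    (hidden, kv),      # k_proj (grouped-query attention: narrower output)
    (hidden, kv),      # v_proj
    (hidden, hidden),  # o_proj
    (hidden, inter),   # gate_proj
    (hidden, inter),   # up_proj
    (inter, hidden),   # down_proj
]

lora_params = layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)
print(f"{lora_params:,}")  # 24,313,856 — under 1% of ~3.2B
```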
Step 4: Prepare Your Dataset
For this walkthrough, we’ll use a subset of the Alpaca instruction dataset:
from datasets import load_dataset

dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")

# Format as chat messages
def format_example(example):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": example["instruction"] +
            ("\n" + example["input"] if example["input"] else "")},
        {"role": "assistant", "content": example["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )}

dataset = dataset.map(format_example)
We take 1,000 examples and format them into the Llama 3.2 chat template. In a real project, you’d use your own domain-specific data here — the format stays the same.
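The only non-obvious piece of the formatting code is how `instruction` and `input` are merged into a single user turn. That logic in isolation (a minimal sketch — in the real pipeline, `tokenizer.apply_chat_template` then wraps these messages in Llama 3.2’s special header tokens):

```python
def build_messages(instruction: str, input_text: str, output: str) -> list:
    """Merge Alpaca-style fields into chat messages. The 'input' field is
    appended to the instruction on a new line only when it is non-empty."""
    user = instruction + ("\n" + input_text if input_text else "")
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user},
        {"role": "assistant", "content": output},
    ]

msgs = build_messages("Summarize this.", "LoRA trains small adapters.", "It trains adapters.")
print(msgs[1]["content"])  # "Summarize this.\nLoRA trains small adapters."
```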
Step 5: Configure Training
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    args=SFTConfig(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # Effective batch = 16
        num_train_epochs=1,
        learning_rate=2e-4,
        warmup_ratio=0.03,
        max_seq_length=2048,
        output_dir="outputs",
        logging_steps=10,
        fp16=True,           # Mixed precision on T4
        optim="adamw_8bit",  # 8-bit optimizer saves VRAM
    ),
)
Key decisions explained:
| Setting | Value | Why |
|---|---|---|
| `batch_size` | 4 | Fits in T4 VRAM with QLoRA |
| `gradient_accumulation` | 4 | Effective batch of 16 without extra VRAM |
| `epochs` | 1 | One pass — enough for 1,000 quality examples |
| `learning_rate` | 2e-4 | Standard for LoRA fine-tuning |
| `warmup_ratio` | 0.03 | Gentle start to avoid early instability |
| `fp16` | True | Mixed precision — T4 supports FP16 natively |
| `optim` | adamw_8bit | 8-bit optimizer uses half the memory of standard Adam |
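The step count in the next section follows directly from these settings. A quick sketch of the arithmetic (the HF Trainer counts optimizer steps, i.e. one step per effective batch, so exact rounding may differ slightly by version):

```python
import math

examples = 1000
per_device_batch = 4
grad_accum = 4
epochs = 1

effective_batch = per_device_batch * grad_accum                # 16
optimizer_steps = math.ceil(examples / effective_batch) * epochs  # logged step count
micro_batches = math.ceil(examples / per_device_batch) * epochs   # forward passes

print(effective_batch, optimizer_steps, micro_batches)  # 16 63 250
```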
Step 6: Train
trainer.train()
On 1,000 examples with batch size 4, one epoch is 250 forward passes; with gradient accumulation of 4, the Trainer logs roughly 63 optimizer steps. On a T4, that takes roughly 10-15 minutes. You’ll see output like:
Step 10: loss = 1.42
Step 20: loss = 1.28
Step 30: loss = 1.15
...
Step 60: loss = 0.85
The loss should trend downward. If it jumps around wildly or doesn’t decrease, something’s wrong with your data or learning rate.
What to watch for:
- Loss not decreasing at all → Learning rate too low, or data has problems
- Loss drops to near zero → Overfitting (the model memorized your data)
- Loss oscillates wildly → Learning rate too high
- Out of memory error → Reduce batch size to 2
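If you collect the logged losses, the “trending downward” check can be automated with a crude comparison of window averages (an illustrative sketch, not part of the Trainer API — real curves are noisy, so compare averages rather than individual steps):

```python
def loss_trend_ok(losses, window=5):
    """True when the average of the last `window` losses is lower than
    the average of the first `window`, i.e. the curve trends downward."""
    if len(losses) < 2 * window:
        return True  # too early to judge
    first = sum(losses[:window]) / window
    last = sum(losses[-window:]) / window
    return last < first

healthy = [1.42, 1.31, 1.28, 1.20, 1.15, 1.02, 0.98, 0.92, 0.88, 0.85]
flat    = [1.42, 1.40, 1.43, 1.41, 1.42, 1.43, 1.40, 1.42, 1.41, 1.43]
print(loss_trend_ok(healthy), loss_trend_ok(flat))  # True False
```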
Step 7: Test Your Model
Before saving, test the fine-tuned model on a few examples:
FastLanguageModel.for_inference(model)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the difference between a list and a tuple in Python."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,  # required for temperature to take effect
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Compare the output to what the base model produces (before fine-tuning). With Alpaca data, the difference is subtle — the fine-tuned model follows instructions more crisply. With domain-specific data, the difference is dramatic.
✅ Quick Check: Your fine-tuned model produces great outputs on training examples but terrible outputs on new questions. What happened? (Overfitting. The model memorized specific examples instead of learning general patterns. Solutions: use more diverse training data, reduce epochs, increase dropout, or reduce rank.)
Step 8: Save Your Model
Save the LoRA adapters (small, fast):
model.save_pretrained("my-fine-tuned-model")
tokenizer.save_pretrained("my-fine-tuned-model")
This saves only the adapter weights — about 30-80 MB depending on rank and target modules. The base model isn’t duplicated.
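The 30-80 MB figure also checks out arithmetically: adapters are stored in 16-bit precision, so the file size is roughly trainable parameters × 2 bytes. A sketch, assuming the ~24M trainable parameters estimated in Step 3:

```python
trainable_params = 24_313_856  # assumed count from Step 3 (r=16, all linear layers)
bytes_per_param = 2            # adapter weights stored in 16-bit precision

size_mb = trainable_params * bytes_per_param / 1e6
print(round(size_mb))  # ~49 MB, within the stated 30-80 MB range
```

Lower ranks or fewer target modules shrink the file toward the bottom of that range.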
To merge adapters into the base model (for deployment):
merged_model = model.merge_and_unload()
merged_model.save_pretrained("my-merged-model")
The merged model is a standalone — no adapter loading needed at inference. Same speed as the original base model.
Push to Hugging Face Hub (optional):
model.push_to_hub("your-username/my-fine-tuned-model")
What Just Happened
In about 20 minutes of total work, you:
- Loaded a 3B model in 4-bit precision (QLoRA)
- Added LoRA adapters (0.5% of parameters)
- Fine-tuned on 1,000 examples
- Tested the output
- Saved a 30-80 MB adapter file
Total cost: $0. On a free Colab T4.
The same process works for 7-8B models (like Mistral 7B or Llama 3.1 8B) — just reduce batch size to 2. For 13B models, you’ll need a paid Colab (A100) or a local RTX 4090.
Key Takeaways
- A complete QLoRA fine-tune runs in ~15 minutes on a free Colab T4
- Unsloth handles quantization, LoRA setup, and training with under 30 lines of code
- One epoch on 1,000 quality examples is a solid starting point — don’t over-train
- Always test on held-out examples before declaring victory — loss curves alone aren’t enough
- Saved adapters are 30-80 MB — merge into base model for deployment or keep separate for hot-swapping
- The same workflow scales to 7B models on free Colab and 13B+ on paid GPUs
Up Next
You trained a model. But is it actually better? In the next lesson, you’ll learn evaluation and iteration — how to measure whether your fine-tuned model outperforms the base, using held-out test sets, automated metrics, and LLM-as-judge comparisons.