Your First Fine-Tuning Run
Step-by-step QLoRA fine-tuning of Llama 3.2 3B on a free Google Colab T4 GPU. From loading the model to testing your fine-tuned output.
Five lessons of theory. Now you build something real. In this lesson, you’ll fine-tune Llama 3.2 3B using QLoRA on a free Google Colab T4 — from loading the model to generating output with your fine-tuned version. Total compute time: about 15 minutes.
What You’ll Learn
By the end of this lesson, you’ll have run a complete fine-tuning job and tested the result. Not a tutorial you read about — one you actually execute.
Prerequisites
- A Google account (for Colab)
- Basic Python comfort (you’ll copy and run code, not write it from scratch)
- No GPU needed on your local machine — Colab provides one for free
Step 1: Set Up Colab
Open Google Colab and create a new notebook. Change the runtime:
Runtime → Change runtime type → T4 GPU
The free T4 has 16 GB VRAM. Our QLoRA setup uses about 5-6 GB. Plenty of headroom.
Install Unsloth:
%%capture
!pip install unsloth
This takes 2-3 minutes. The %%capture hides the install output.
Step 2: Load the Model
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA: 4-bit quantization
)
load_in_4bit=True is the QLoRA switch. The 3B model loads in about 3.5 GB of VRAM instead of 6+ GB.
✅ Quick Check: What happens if you set `load_in_4bit=False`? (The model loads in 16-bit precision — about 6 GB for a 3B model. Standard LoRA, not QLoRA. Still fits on a T4, but you’d have less headroom for larger batch sizes. For a 7B model, you’d need 14 GB in 16-bit — much tighter on a T4.)
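The VRAM figures above are easy to sanity-check with back-of-the-envelope arithmetic. A rough sketch (weights only — real usage adds overhead for the CUDA context, activations, and quantization metadata, which is why the 4-bit model loads in ~3.5 GB rather than ~1.6 GB):

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 3.2e9  # Llama 3.2 3B has roughly 3.2 billion parameters

print(round(weight_memory_gb(n_params, 4), 1))   # 4-bit (QLoRA) weights: 1.6 GB
print(round(weight_memory_gb(n_params, 16), 1))  # 16-bit weights: 6.4 GB
```

The same arithmetic explains the Quick Check answer: at 16 bits a 7B model needs ~14 GB for weights alone, which is why it barely fits a 16 GB T4 without quantization.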
Step 3: Add LoRA Adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,               # Rank: 16 is a good default
    lora_alpha=16,      # Scaling factor (usually = rank)
    lora_dropout=0.05,  # Light regularization
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
This adds LoRA adapters to the attention and feedforward layers. With rank 16 on all seven linear projections, you’re training under 1% of the model’s total parameters — on the order of 24 million out of roughly 3.2 billion.
What each parameter does:
- `r=16`: Adapter dimensionality. Higher = more expressive but more memory. 8-32 covers most tasks.
- `lora_alpha=16`: Scaling factor. Setting it equal to the rank is the standard.
- `target_modules`: Which layers get adapters. Including all linear layers (not just attention) is current best practice.
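To see where the trainable-parameter count comes from: each adapted weight matrix of shape (d_out, d_in) gains r × (d_in + d_out) LoRA parameters. A sketch using the published Llama 3.2 3B dimensions (hidden size 3072, intermediate size 8192, 28 layers, grouped-query attention with 1024-wide k/v projections — treat these as assumptions, and check the model's `print_trainable_parameters()` output for the exact figure):

```python
r = 16
hidden, inter, kv = 3072, 8192, 1024  # assumed Llama 3.2 3B dimensions
layers = 28

# (d_in, d_out) for each target module in one decoder layer
shapes = [
    (hidden, hidden),  # q_proj
    (hidden, kv),      # k_proj (grouped-query attention: narrower output)
    (hidden, kv),      # v_proj
    (hidden, hidden),  # o_proj
    (hidden, inter),   # gate_proj
    (hidden, inter),   # up_proj
    (inter, hidden),   # down_proj
]

lora_params = layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)
print(f"{lora_params:,}")  # 24,313,856 — under 1% of ~3.2B
```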
Step 4: Prepare Your Dataset
For this walkthrough, we’ll use a subset of the Alpaca instruction dataset:
from datasets import load_dataset

dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")

# Format as chat messages
def format_example(example):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": example["instruction"] +
            ("\n" + example["input"] if example["input"] else "")},
        {"role": "assistant", "content": example["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )}

dataset = dataset.map(format_example)
We take 1,000 examples and format them into the Llama 3.2 chat template. In a real project, you’d use your own domain-specific data here — the format stays the same.
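The only non-obvious piece of the formatting code is how `instruction` and `input` are merged into a single user turn. That logic in isolation (a minimal sketch — in the real pipeline, `tokenizer.apply_chat_template` then wraps these messages in Llama 3.2’s special header tokens):

```python
def build_messages(instruction: str, input_text: str, output: str) -> list:
    """Merge Alpaca-style fields into chat messages. The 'input' field is
    appended to the instruction on a new line only when it is non-empty."""
    user = instruction + ("\n" + input_text if input_text else "")
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user},
        {"role": "assistant", "content": output},
    ]

msgs = build_messages("Summarize this.", "LoRA trains small adapters.", "It trains adapters.")
print(msgs[1]["content"])  # "Summarize this.\nLoRA trains small adapters."
```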
Step 5: Configure Training
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    args=SFTConfig(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # Effective batch = 16
        num_train_epochs=1,
        learning_rate=2e-4,
        warmup_ratio=0.03,
        max_seq_length=2048,
        output_dir="outputs",
        logging_steps=10,
        fp16=True,           # Mixed precision on T4
        optim="adamw_8bit",  # 8-bit optimizer saves VRAM
    ),
)
Key decisions explained:
| Setting | Value | Why |
|---|---|---|
| `batch_size` | 4 | Fits in T4 VRAM with QLoRA |
| `gradient_accumulation` | 4 | Effective batch of 16 without extra VRAM |
| `epochs` | 1 | One pass — enough for 1,000 quality examples |
| `learning_rate` | 2e-4 | Standard for LoRA fine-tuning |
| `warmup_ratio` | 0.03 | Gentle start to avoid early instability |
| `fp16` | True | Mixed precision — T4 supports FP16 natively |
| `optim` | adamw_8bit | 8-bit optimizer uses half the memory of standard Adam |
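The step count in the next section follows directly from these settings. A quick sketch of the arithmetic (the HF Trainer counts optimizer steps, i.e. one step per effective batch, so exact rounding may differ slightly by version):

```python
import math

examples = 1000
per_device_batch = 4
grad_accum = 4
epochs = 1

effective_batch = per_device_batch * grad_accum                # 16
optimizer_steps = math.ceil(examples / effective_batch) * epochs  # logged step count
micro_batches = math.ceil(examples / per_device_batch) * epochs   # forward passes

print(effective_batch, optimizer_steps, micro_batches)  # 16 63 250
```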
Step 6: Train
trainer.train()
On 1,000 examples with batch size 4, one epoch is 250 forward passes; with gradient accumulation of 4, the Trainer logs roughly 63 optimizer steps. On a T4, that takes roughly 10-15 minutes. You’ll see output like:
Step 10: loss = 1.42
Step 20: loss = 1.28
Step 30: loss = 1.15
...
Step 60: loss = 0.85
The loss should trend downward. If it jumps around wildly or doesn’t decrease, something’s wrong with your data or learning rate.
What to watch for:
- Loss not decreasing at all → Learning rate too low, or data has problems
- Loss drops to near zero → Overfitting (the model memorized your data)
- Loss oscillates wildly → Learning rate too high
- Out of memory error → Reduce batch size to 2
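If you collect the logged losses, the “trending downward” check can be automated with a crude comparison of window averages (an illustrative sketch, not part of the Trainer API — real curves are noisy, so compare averages rather than individual steps):

```python
def loss_trend_ok(losses, window=5):
    """True when the average of the last `window` losses is lower than
    the average of the first `window`, i.e. the curve trends downward."""
    if len(losses) < 2 * window:
        return True  # too early to judge
    first = sum(losses[:window]) / window
    last = sum(losses[-window:]) / window
    return last < first

healthy = [1.42, 1.31, 1.28, 1.20, 1.15, 1.02, 0.98, 0.92, 0.88, 0.85]
flat    = [1.42, 1.40, 1.43, 1.41, 1.42, 1.43, 1.40, 1.42, 1.41, 1.43]
print(loss_trend_ok(healthy), loss_trend_ok(flat))  # True False
```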
Step 7: Test Your Model
Before saving, test the fine-tuned model on a few examples:
FastLanguageModel.for_inference(model)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the difference between a list and a tuple in Python."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,  # required for temperature to take effect
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Compare the output to what the base model produces (before fine-tuning). With Alpaca data, the difference is subtle — the fine-tuned model follows instructions more crisply. With domain-specific data, the difference is dramatic.
✅ Quick Check: Your fine-tuned model produces great outputs on training examples but terrible outputs on new questions. What happened? (Overfitting. The model memorized specific examples instead of learning general patterns. Solutions: use more diverse training data, reduce epochs, increase dropout, or reduce rank.)
Step 8: Save Your Model
Save the LoRA adapters (small, fast):
model.save_pretrained("my-fine-tuned-model")
tokenizer.save_pretrained("my-fine-tuned-model")
This saves only the adapter weights — about 30-80 MB depending on rank and target modules. The base model isn’t duplicated.
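The 30-80 MB figure also checks out arithmetically: adapters are stored in 16-bit precision, so the file size is roughly trainable parameters × 2 bytes. A sketch, assuming the ~24M trainable parameters estimated in Step 3:

```python
trainable_params = 24_313_856  # assumed count from Step 3 (r=16, all linear layers)
bytes_per_param = 2            # adapter weights stored in 16-bit precision

size_mb = trainable_params * bytes_per_param / 1e6
print(round(size_mb))  # ~49 MB, within the stated 30-80 MB range
```

Lower ranks or fewer target modules shrink the file toward the bottom of that range.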
To merge adapters into the base model (for deployment):
merged_model = model.merge_and_unload()
merged_model.save_pretrained("my-merged-model")
The merged model is a standalone — no adapter loading needed at inference. Same speed as the original base model.
Push to Hugging Face Hub (optional):
model.push_to_hub("your-username/my-fine-tuned-model")
What Just Happened
In about 20 minutes of total work, you:
- Loaded a 3B model in 4-bit precision (QLoRA)
- Added LoRA adapters (0.5% of parameters)
- Fine-tuned on 1,000 examples
- Tested the output
- Saved a 30-80 MB adapter file
Total cost: $0. On a free Colab T4.
The same process works for 7-8B models (like Mistral 7B or Llama 3.1 8B) — just reduce batch size to 2. For 13B models, you’ll need a paid Colab (A100) or a local RTX 4090.
Key Takeaways
- A complete QLoRA fine-tune runs in ~15 minutes on a free Colab T4
- Unsloth handles quantization, LoRA setup, and training with under 30 lines of code
- One epoch on 1,000 quality examples is a solid starting point — don’t over-train
- Always test on held-out examples before declaring victory — loss curves alone aren’t enough
- Saved adapters are 30-80 MB — merge into base model for deployment or keep separate for hot-swapping
- The same workflow scales to 7B models on free Colab and 13B+ on paid GPUs
Up Next
You trained a model. But is it actually better? In the next lesson, you’ll learn evaluation and iteration — how to measure whether your fine-tuned model outperforms the base, using held-out test sets, automated metrics, and LLM-as-judge comparisons.