LoRA & QLoRA: Fine-Tuning on a Budget
How LoRA and QLoRA make fine-tuning accessible: low-rank adaptation, 4-bit quantization, and VRAM math that lets you train a 7B model on a free GPU.
🔄 In the last lesson, you learned the three fine-tuning methods: SFT, RLHF, and DPO. The missing piece was hardware — full fine-tuning of a 7B model requires 100-120 GB of VRAM. That’s multiple H100 GPUs at $2-3/hour each, or $50,000+ to buy.
LoRA and QLoRA changed everything. Now you can fine-tune a 7B model on a $0 Google Colab GPU with a T4 (16 GB VRAM). Here’s how.
What You’ll Learn
By the end of this lesson, you’ll understand how LoRA and QLoRA work, why they’re 10-20x cheaper than full fine-tuning, and what hyperparameters to pick.
The VRAM Problem
Fine-tuning means updating model weights with gradient descent. For each weight, you need:
- The weight itself (parameter)
- Its gradient (same size as the weight)
- Optimizer states (Adam stores 2 states per parameter)
For a 7B model in 16-bit precision:
- Parameters: 7B × 2 bytes = 14 GB
- Gradients: 14 GB
- Optimizer: 28 GB
- Activations: ~20-40 GB (depending on batch size)
- Total: ~80-120 GB VRAM
That’s 2-3 A100 GPUs (80 GB each). Not practical for most people.
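The arithmetic above can be wrapped in a small helper as a sanity check. This is a back-of-envelope sketch (function name ours) that mirrors the lesson's 16-bit assumptions; real training adds framework overhead on top:

```python
def full_ft_vram_gb(params_b: float, activations_gb: float = 30.0) -> float:
    """Rough full fine-tuning VRAM estimate, mirroring the lesson's numbers.

    params_b: model size in billions of parameters.
    Assumes 16-bit weights (2 bytes/param), gradients the same size,
    and two Adam optimizer states per parameter (also 2 bytes each here).
    """
    weights = params_b * 2        # 7B -> 14 GB
    gradients = weights           # same size as the weights -> 14 GB
    optimizer = params_b * 2 * 2  # two states per parameter -> 28 GB
    return weights + gradients + optimizer + activations_gb

print(full_ft_vram_gb(7))  # ~86 GB with mid-range activations
```

Vary `activations_gb` between 20 and 40 to reproduce the ~80-120 GB range quoted above.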
LoRA: The Elegant Hack
LoRA (Low-Rank Adaptation) is based on a simple insight: you don’t need to update all 7 billion parameters. Most of the “learning” during fine-tuning happens in a small subspace of the weight matrix.
How It Works
Instead of updating a full weight matrix W (size d × k), LoRA adds two small matrices:
Original: W (frozen, not updated)
LoRA adds: A (d × r) and B (r × k), where r ≪ d, k
At inference: W_new = W + A × B
The rank r is typically 8-64. For a layer with d=4096 and k=4096:
- Full update: 4096 × 4096 = 16.7 million parameters
- LoRA (r=8): 4096 × 8 + 8 × 4096 = 65,536 parameters (0.4%)
That’s 99.6% fewer trainable parameters. The gradients and optimizer states are proportionally smaller too.
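The parameter arithmetic is easy to verify yourself. A minimal pure-Python sketch (function name ours):

```python
def lora_param_counts(d: int, k: int, r: int):
    """Compare a full weight update (d x k) with LoRA adapters A (d x r) + B (r x k)."""
    full = d * k          # every entry of W is trainable
    lora = d * r + r * k  # only the two small adapter matrices are trainable
    return full, lora, lora / full

full, lora, frac = lora_param_counts(4096, 4096, 8)
# full = 16,777,216   lora = 65,536   frac = 0.0039 (~0.4%)
```

Try `r=64` to see why higher ranks cost more: the adapter size grows linearly with r while the full matrix stays fixed.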
✅ Quick Check: If you merge the LoRA adapters back into the base model after training, is there added latency at inference? (No. A × B produces a matrix the same size as W. You add it once: W_new = W + A × B. The merged model has the same architecture and speed as the original — zero overhead.)
QLoRA: Even Cheaper
QLoRA takes LoRA one step further. The base model (the frozen weights) gets quantized to 4-bit precision:
| Component | Precision | Memory |
|---|---|---|
| Base model (frozen) | 4-bit NF4 | ~3.5 GB (7B model) |
| LoRA adapters (trainable) | bfloat16 | ~50-200 MB |
| Gradients | bfloat16 | ~50-200 MB |
| Optimizer states | 32-bit | ~100-400 MB |
| Activations | varies | ~2-4 GB |
| Total (7B) | | ~8-10 GB |
From 100+ GB to under 10 GB. That fits on a free Google Colab T4 (16 GB).
The Quality Trade-Off
Does 4-bit quantization hurt quality? Barely. Multiple studies show QLoRA retains 90-95% of full fine-tuning quality. For practical tasks (classification, format consistency, style transfer), the difference is negligible.
The magic is in NormalFloat4 (NF4) — a quantization format designed specifically for the weight distributions in neural networks. It’s not just rounding to 4 bits; it’s smart rounding that preserves the most information.
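The mechanics of codebook quantization can be sketched in a few lines. Note the hedge: the code below uses evenly spaced levels for simplicity, whereas real NF4 places its 16 levels to match a normal weight distribution; treat this purely as an illustration of the scale-and-snap idea:

```python
# 16 evenly spaced levels in [-1, 1]. NF4 instead spaces its levels to
# match a normal distribution -- uniform levels are used here only to
# illustrate the codebook idea.
LEVELS = [i / 7.5 - 1.0 for i in range(16)]

def quantize_4bit(weights, levels=LEVELS):
    """Scale by absmax, then snap each weight to the nearest codebook level."""
    absmax = max(abs(w) for w in weights) or 1.0
    idxs = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return idxs, absmax  # store 4-bit indices plus one scale per block

def dequantize_4bit(idxs, absmax, levels=LEVELS):
    """Reconstruct approximate weights from indices and the stored scale."""
    return [levels[i] * absmax for i in idxs]
```

Each weight collapses to a 4-bit index, and only one extra scale is stored per block, which is where the ~4x memory saving over 16-bit comes from.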
VRAM Requirements: The Cheat Sheet
| Model Size | Full FT | LoRA (16-bit) | QLoRA (4-bit) |
|---|---|---|---|
| 3B | ~50 GB | ~12 GB | ~5-6 GB |
| 7B | 100-120 GB | ~20 GB | ~8-10 GB |
| 13B | ~200 GB | ~35 GB | ~15 GB |
| 30B | ~480 GB | ~80 GB | ~48 GB |
| 70B | ~1120 GB | ~180 GB | ~80 GB |
The sweet spot for most people: QLoRA on a 3B-7B model. Fits on consumer GPUs or free cloud tiers.
✅ Quick Check: You have a 24 GB GPU (RTX 4090). What’s the largest model you can fine-tune with QLoRA? (13B comfortably. 30B is tight but possible with aggressive batch size settings. Most practical: fine-tune a 7B model with plenty of headroom for larger batch sizes.)
Hyperparameters That Matter
You don’t need to be an ML expert to pick good hyperparameters. Here are the ones that actually affect your results:
LoRA-Specific
| Parameter | What It Does | Recommended |
|---|---|---|
| rank (r) | Adapter dimensionality | 8-32 for most tasks |
| alpha | Scaling factor (usually = rank) | Same as rank |
| target_modules | Which layers get LoRA | All linear layers (default) |
| dropout | Regularization | 0.05-0.1 |
rank = 8 is sufficient for style transfer and classification. Use 16-32 for more complex behavior changes. Going above 64 rarely helps and costs more memory.
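Put together, a typical adapter configuration looks like the dict below. This is a plain-Python sketch; the key names mirror common library conventions (e.g. peft's `LoraConfig` fields), but check your library's documentation for the exact spelling:

```python
lora_config = {
    "r": 16,                         # adapter rank
    "lora_alpha": 16,                # scaling numerator, usually equal to r
    "lora_dropout": 0.05,            # regularization on adapter inputs
    "target_modules": "all-linear",  # apply LoRA to every linear layer
}

# LoRA scales the adapter's contribution by alpha / r, so alpha == r gives
# a neutral scale of 1.0; raising alpha without raising r amplifies the
# adapter's effect on the merged weights.
scaling = lora_config["lora_alpha"] / lora_config["r"]
```

This is why the table recommends "same as rank" for alpha: it keeps the effective scale at 1.0 regardless of which rank you choose.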
Training
| Parameter | What It Does | Recommended |
|---|---|---|
| learning_rate | Step size for weight updates | 2e-5 to 2e-4 |
| epochs | How many times through the dataset | 1-3 (more = overfitting) |
| batch_size | Examples per gradient update | 4-16 (limited by VRAM) |
| warmup_ratio | Learning rate warmup | 0.03-0.1 |
| max_seq_length | Maximum token length | 512-2048 |
The most common mistake: Too many epochs. One pass through a high-quality dataset often works. Two is safe. Three starts overfitting on most datasets.
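When VRAM caps your per-step batch size, gradient accumulation is the standard workaround (not covered in the table above, so treat this as supplementary): run several small forward/backward passes and apply one combined update. A small illustration, with function name ours:

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int,
                         num_gpus: int = 1) -> int:
    """Effective batch size = per-device batch x accumulation steps x GPUs.

    Example: a VRAM-limited per-device batch of 4 with 4 accumulation
    steps behaves like a batch of 16 for the purposes of each update.
    """
    return per_device_batch * grad_accum_steps * num_gpus

print(effective_batch_size(4, 4))  # 16
```

So "batch_size 4-16, limited by VRAM" in the table refers to what fits per step; accumulation lets the optimizer see a larger effective batch.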
LoRA Adapters: Portable Specializations
Here’s something powerful: LoRA adapters are small files (typically 10-200 MB) that you can swap at inference time. One base model, multiple specialists:
Llama 3.2 7B (base)
├── customer-support-adapter.bin (50 MB) → Support tone
├── legal-review-adapter.bin (80 MB) → Legal language
└── code-review-adapter.bin (60 MB) → Code analysis
You load the base model once, then hot-swap adapters based on the task. No need to store three separate 7B models.
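The swap pattern above can be sketched as a tiny router. Everything here is illustrative (class name, task names, and paths are hypothetical); in practice a library such as peft loads the adapter weights and attaches them to the frozen base model in memory:

```python
class AdapterRouter:
    """Toy illustration of one frozen base model with swappable task adapters."""

    def __init__(self, base_model):
        self.base_model = base_model  # loaded once, shared by every task
        self.adapters = {}            # task name -> adapter file path
        self.active = None

    def register(self, task, adapter_path):
        self.adapters[task] = adapter_path

    def activate(self, task):
        """Hot-swap: only the small adapter changes, never the base weights."""
        self.active = task
        return self.adapters[task]

router = AdapterRouter("llama-3.2-7b")
router.register("support", "customer-support-adapter.bin")
router.register("legal", "legal-review-adapter.bin")
router.activate("legal")
```

The key property: switching tasks costs a 50-80 MB adapter load, not a 14 GB model load.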
Key Takeaways
- LoRA freezes original weights and trains small adapter matrices (0.4% of parameters) — 10-20x memory reduction
- QLoRA adds 4-bit quantization of the base model — a 7B model fits in ~8 GB VRAM (free Colab T4)
- Quality trade-off is minimal: QLoRA retains 90-95% of full fine-tuning quality
- Rank 8-32 is sufficient for most practical tasks. Higher rank = more parameters but rarely better results
- LoRA adapters are small (10-200 MB) and swappable — one base model, multiple specialists
- The same fine-tune that required $50,000 in H100s now runs on a $1,500 RTX 4090 — or a free Colab GPU
Up Next
You know the theory — LoRA, QLoRA, hyperparameters. But what software do you actually use? In the next lesson, we’ll cover the tools and infrastructure: Unsloth (speed champion), Axolotl (production), Hugging Face ecosystem, and OpenAI’s managed API.