LoRA & QLoRA: Fine-Tuning on a Budget
How LoRA and QLoRA make fine-tuning accessible: low-rank adaptation, 4-bit quantization, and VRAM math that lets you train a 7B model on a free GPU.
🔄 In the last lesson, you learned the three fine-tuning methods: SFT, RLHF, and DPO. The missing piece was hardware — full fine-tuning of a 7B model requires 100-120 GB of VRAM. That’s multiple H100 GPUs at $2-3/hour each, or $50,000+ to buy.
LoRA and QLoRA changed everything. Now you can fine-tune a 7B model on a $0 Google Colab GPU with a T4 (16 GB VRAM). Here’s how.
What You’ll Learn
By the end of this lesson, you’ll understand how LoRA and QLoRA work, why they’re 10-20x cheaper than full fine-tuning, and what hyperparameters to pick.
The VRAM Problem
Fine-tuning means updating model weights with gradient descent. For each weight, you need:
- The weight itself (parameter)
- Its gradient (same size as the weight)
- Optimizer states (Adam stores 2 states per parameter)
For a 7B model in 16-bit precision:
- Parameters: 7B × 2 bytes = 14 GB
- Gradients: 14 GB
- Optimizer: 28 GB
- Activations: ~20-40 GB (depending on batch size)
- Total: ~80-120 GB VRAM
That’s 2-3 A100 GPUs (80 GB each). Not practical for most people.
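The arithmetic above can be wrapped in a small helper as a sanity check. This is a back-of-envelope sketch (function name ours) that mirrors the lesson's 16-bit assumptions; real training adds framework overhead on top:

```python
def full_ft_vram_gb(params_b: float, activations_gb: float = 30.0) -> float:
    """Rough full fine-tuning VRAM estimate, mirroring the lesson's numbers.

    params_b: model size in billions of parameters.
    Assumes 16-bit weights (2 bytes/param), gradients the same size,
    and two Adam optimizer states per parameter (also 2 bytes each here).
    """
    weights = params_b * 2        # 7B -> 14 GB
    gradients = weights           # same size as the weights -> 14 GB
    optimizer = params_b * 2 * 2  # two states per parameter -> 28 GB
    return weights + gradients + optimizer + activations_gb

print(full_ft_vram_gb(7))  # ~86 GB with mid-range activations
```

Vary `activations_gb` between 20 and 40 to reproduce the ~80-120 GB range quoted above.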
LoRA: The Elegant Hack
LoRA (Low-Rank Adaptation) is based on a simple insight: you don’t need to update all 7 billion parameters. Most of the “learning” during fine-tuning happens in a small subspace of the weight matrix.
How It Works
Instead of updating a full weight matrix W (size d × k), LoRA adds two small matrices:
Original: W (frozen, not updated)
LoRA adds: A (d × r) and B (r × k), where r ≪ d, k
At inference: W_new = W + A × B
The rank r is typically 8-64. For a layer with d=4096 and k=4096:
- Full update: 4096 × 4096 = 16.7 million parameters
- LoRA (r=8): 4096 × 8 + 8 × 4096 = 65,536 parameters (0.4%)
That’s 99.6% fewer trainable parameters. The gradients and optimizer states are proportionally smaller too.
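The parameter arithmetic is easy to verify yourself. A minimal pure-Python sketch (function name ours):

```python
def lora_param_counts(d: int, k: int, r: int):
    """Compare a full weight update (d x k) with LoRA adapters A (d x r) + B (r x k)."""
    full = d * k          # every entry of W is trainable
    lora = d * r + r * k  # only the two small adapter matrices are trainable
    return full, lora, lora / full

full, lora, frac = lora_param_counts(4096, 4096, 8)
# full = 16,777,216   lora = 65,536   frac = 0.0039 (~0.4%)
```

Try `r=64` to see why higher ranks cost more: the adapter size grows linearly with r while the full matrix stays fixed.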
✅ Quick Check: If you merge the LoRA adapters back into the base model after training, is there added latency at inference? (No. A × B produces a matrix the same size as W. You add it once: W_new = W + A × B. The merged model has the same architecture and speed as the original — zero overhead.)
QLoRA: Even Cheaper
QLoRA takes LoRA one step further. The base model (the frozen weights) gets quantized to 4-bit precision:
| Component | Precision | Memory |
|---|---|---|
| Base model (frozen) | 4-bit NF4 | ~3.5 GB (7B model) |
| LoRA adapters (trainable) | bfloat16 | ~50-200 MB |
| Gradients | bfloat16 | ~50-200 MB |
| Optimizer states | 32-bit | ~100-400 MB |
| Activations | varies | ~2-4 GB |
| Total (7B) | | ~8-10 GB |
From 100+ GB to under 10 GB. That fits on a free Google Colab T4 (16 GB).
The Quality Trade-Off
Does 4-bit quantization hurt quality? Barely. Multiple studies show QLoRA retains 90-95% of full fine-tuning quality. For practical tasks (classification, format consistency, style transfer), the difference is negligible.
The magic is in NormalFloat4 (NF4) — a quantization format designed specifically for the weight distributions in neural networks. It’s not just rounding to 4 bits; it’s smart rounding that preserves the most information.
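The mechanics of codebook quantization can be sketched in a few lines. Note the hedge: the code below uses evenly spaced levels for simplicity, whereas real NF4 places its 16 levels to match a normal weight distribution; treat this purely as an illustration of the scale-and-snap idea:

```python
# 16 evenly spaced levels in [-1, 1]. NF4 instead spaces its levels to
# match a normal distribution -- uniform levels are used here only to
# illustrate the codebook idea.
LEVELS = [i / 7.5 - 1.0 for i in range(16)]

def quantize_4bit(weights, levels=LEVELS):
    """Scale by absmax, then snap each weight to the nearest codebook level."""
    absmax = max(abs(w) for w in weights) or 1.0
    idxs = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return idxs, absmax  # store 4-bit indices plus one scale per block

def dequantize_4bit(idxs, absmax, levels=LEVELS):
    """Reconstruct approximate weights from indices and the stored scale."""
    return [levels[i] * absmax for i in idxs]
```

Each weight collapses to a 4-bit index, and only one extra scale is stored per block, which is where the ~4x memory saving over 16-bit comes from.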
VRAM Requirements: The Cheat Sheet
| Model Size | Full FT | LoRA (16-bit) | QLoRA (4-bit) |
|---|---|---|---|
| 3B | ~50 GB | ~12 GB | ~5-6 GB |
| 7B | 100-120 GB | ~20 GB | ~8-10 GB |
| 13B | ~200 GB | ~35 GB | ~15 GB |
| 30B | ~480 GB | ~80 GB | ~48 GB |
| 70B | ~1120 GB | ~180 GB | ~80 GB |
The sweet spot for most people: QLoRA on a 3B-7B model. Fits on consumer GPUs or free cloud tiers.
✅ Quick Check: You have a 24 GB GPU (RTX 4090). What’s the largest model you can fine-tune with QLoRA? (13B comfortably. 30B is tight but possible with aggressive batch size settings. Most practical: fine-tune a 7B model with plenty of headroom for larger batch sizes.)
Hyperparameters That Matter
You don’t need to be an ML expert to pick good hyperparameters. Here are the ones that actually affect your results:
LoRA-Specific
| Parameter | What It Does | Recommended |
|---|---|---|
| rank (r) | Adapter dimensionality | 8-32 for most tasks |
| alpha | Scaling factor (usually = rank) | Same as rank |
| target_modules | Which layers get LoRA | All linear layers (default) |
| dropout | Regularization | 0.05-0.1 |
rank = 8 is sufficient for style transfer and classification. Use 16-32 for more complex behavior changes. Going above 64 rarely helps and costs more memory.
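Put together, a typical adapter configuration looks like the dict below. This is a plain-Python sketch; the key names mirror common library conventions (e.g. peft's `LoraConfig` fields), but check your library's documentation for the exact spelling:

```python
lora_config = {
    "r": 16,                         # adapter rank
    "lora_alpha": 16,                # scaling numerator, usually equal to r
    "lora_dropout": 0.05,            # regularization on adapter inputs
    "target_modules": "all-linear",  # apply LoRA to every linear layer
}

# LoRA scales the adapter's contribution by alpha / r, so alpha == r gives
# a neutral scale of 1.0; raising alpha without raising r amplifies the
# adapter's effect on the merged weights.
scaling = lora_config["lora_alpha"] / lora_config["r"]
```

This is why the table recommends "same as rank" for alpha: it keeps the effective scale at 1.0 regardless of which rank you choose.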
Training
| Parameter | What It Does | Recommended |
|---|---|---|
| learning_rate | Step size for weight updates | 2e-5 to 2e-4 |
| epochs | How many times through the dataset | 1-3 (more = overfitting) |
| batch_size | Examples per gradient update | 4-16 (limited by VRAM) |
| warmup_ratio | Learning rate warmup | 0.03-0.1 |
| max_seq_length | Maximum token length | 512-2048 |
The most common mistake: Too many epochs. One pass through a high-quality dataset often works. Two is safe. Three starts overfitting on most datasets.
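When VRAM caps your per-step batch size, gradient accumulation is the standard workaround (not covered in the table above, so treat this as supplementary): run several small forward/backward passes and apply one combined update. A small illustration, with function name ours:

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int,
                         num_gpus: int = 1) -> int:
    """Effective batch size = per-device batch x accumulation steps x GPUs.

    Example: a VRAM-limited per-device batch of 4 with 4 accumulation
    steps behaves like a batch of 16 for the purposes of each update.
    """
    return per_device_batch * grad_accum_steps * num_gpus

print(effective_batch_size(4, 4))  # 16
```

So "batch_size 4-16, limited by VRAM" in the table refers to what fits per step; accumulation lets the optimizer see a larger effective batch.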
LoRA Adapters: Portable Specializations
Here’s something powerful: LoRA adapters are small files (typically 10-200 MB) that you can swap at inference time. One base model, multiple specialists:
Llama 3.2 7B (base)
├── customer-support-adapter.bin (50 MB) → Support tone
├── legal-review-adapter.bin (80 MB) → Legal language
└── code-review-adapter.bin (60 MB) → Code analysis
You load the base model once, then hot-swap adapters based on the task. No need to store three separate 7B models.
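The swap pattern above can be sketched as a tiny router. Everything here is illustrative (class name, task names, and paths are hypothetical); in practice a library such as peft loads the adapter weights and attaches them to the frozen base model in memory:

```python
class AdapterRouter:
    """Toy illustration of one frozen base model with swappable task adapters."""

    def __init__(self, base_model):
        self.base_model = base_model  # loaded once, shared by every task
        self.adapters = {}            # task name -> adapter file path
        self.active = None

    def register(self, task, adapter_path):
        self.adapters[task] = adapter_path

    def activate(self, task):
        """Hot-swap: only the small adapter changes, never the base weights."""
        self.active = task
        return self.adapters[task]

router = AdapterRouter("llama-3.2-7b")
router.register("support", "customer-support-adapter.bin")
router.register("legal", "legal-review-adapter.bin")
router.activate("legal")
```

The key property: switching tasks costs a 50-80 MB adapter load, not a 14 GB model load.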
Key Takeaways
- LoRA freezes original weights and trains small adapter matrices (0.4% of parameters) — 10-20x memory reduction
- QLoRA adds 4-bit quantization of the base model — a 7B model fits in ~8 GB VRAM (free Colab T4)
- Quality trade-off is minimal: QLoRA retains 90-95% of full fine-tuning quality
- Rank 8-32 is sufficient for most practical tasks. Higher rank = more parameters but rarely better results
- LoRA adapters are small (10-200 MB) and swappable — one base model, multiple specialists
- The same fine-tune that required $50,000 in H100s now runs on a $1,500 RTX 4090 — or a free Colab GPU
Up Next
You know the theory — LoRA, QLoRA, hyperparameters. But what software do you actually use? In the next lesson, we’ll cover the tools and infrastructure: Unsloth (speed champion), Axolotl (production), Hugging Face ecosystem, and OpenAI’s managed API.