Transformers & LLMs for NLP
How BERT, GPT, and T5 transformed every NLP task — architecture differences, when to use each, and the shift from task-specific to general-purpose models.
The Transformer Revolution
🔄 Lessons 2-6 built your NLP toolkit — preprocessing, representations, classification, NER, and sentiment analysis. Each task used specific models and techniques. Then transformers arrived and changed everything.
Before 2018, each NLP task required a separate model trained from scratch on task-specific data. After BERT (2018), the approach flipped: pretrain one massive model on general language, then fine-tune it for any task with minimal labeled data. This single shift improved accuracy across every NLP benchmark simultaneously.
Three Transformer Architectures
All transformers use self-attention — the mechanism that lets the model consider relationships between every pair of words in the input. But they differ in architecture and what they’re best at.
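To make "relationships between every pair of words" concrete, here is a minimal, dependency-free sketch of scaled dot-product attention, the core computation inside every transformer layer. The 2-d token vectors are toy values, and the learned query/key/value projection matrices of a real transformer are omitted:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: each position attends to every position.

    queries/keys/values: lists of equal-length vectors, one per token.
    Returns one output vector per token: a weighted mix of all value vectors.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every key (every pair of words interacts)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # attention weights sum to 1
        # Output = attention-weighted average of the value vectors
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy 2-d embeddings for three tokens; in self-attention Q = K = V = the input
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(tokens, tokens, tokens)
```

Each output vector is a convex combination of the inputs, which is why attention "mixes" context into every position.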
Encoder-only (BERT family)
BERT reads text bidirectionally — it sees words to the left AND right of each position simultaneously. This makes it excellent at understanding the full context of a passage.
- How it works: Mask random words during pretraining, predict what’s missing using surrounding context
- Best for: Classification, NER, sentiment analysis, question answering — tasks that require understanding
- Models: BERT, RoBERTa, ALBERT, DeBERTa, DistilBERT
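The masked-word pretraining objective can be sketched in a few lines. This toy version only replaces tokens with `[MASK]`; real BERT also sometimes substitutes a random word or keeps the original (the 80/10/10 rule), which is omitted here:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """BERT-style pretraining data: hide a fraction of tokens behind [MASK].

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original word the model must predict using the
    context on BOTH sides of the mask.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

sentence = "the cat sat on the mat because it was warm".split()
masked, targets = mask_tokens(sentence, mask_rate=0.3, seed=1)
```

Because the targets are derived from the text itself, pretraining needs no human labels, which is what makes training on billions of sentences feasible.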
Decoder-only (GPT family)
GPT reads text left-to-right and predicts the next word. This makes it a powerful text generator that can produce coherent, fluent output.
- How it works: Given a sequence, predict the next token. Trained on trillions of words
- Best for: Text generation, summarization, translation, conversational AI — tasks that produce text
- Models: GPT-4, Claude, Gemini, Llama, Mistral
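Autoregressive decoding is just a loop: predict, append, repeat. In this sketch a toy bigram table stands in for the trained transformer, and greedy decoding stands in for sampling:

```python
def generate(next_token_probs, start, max_tokens=8):
    """Decoder-style generation: repeatedly predict the next token
    from what has been generated so far.

    next_token_probs: dict mapping a token to a probability
    distribution over possible next tokens (a toy stand-in for
    a trained language model).
    """
    out = [start]
    for _ in range(max_tokens):
        dist = next_token_probs.get(out[-1])
        if dist is None:
            break  # no continuation known: stop generating
        # Greedy decoding: always take the highest-probability next token
        out.append(max(dist, key=dist.get))
    return out

# Toy "model": next-token distributions, as if learned from a tiny corpus
next_token_probs = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"down": 1.0},
}
print(" ".join(generate(next_token_probs, "the")))  # the cat sat down
```

A real GPT conditions on the entire generated prefix, not just the last token, but the predict-append-repeat loop is the same.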
Encoder-decoder (T5 family)
T5 reads the full input (encoder) then generates output (decoder). It treats every NLP task as a text-to-text problem: classification becomes “classify: [text]” → “positive”; translation becomes “translate English to French: [text]” → “[French text].”
- How it works: Encode the full input, then decode to generate the output
- Best for: Tasks that require both understanding input and generating output
- Models: T5, BART, mBART, Flan-T5
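The text-to-text framing can be sketched as a simple prompt formatter. The task prefixes below mirror the T5 convention, but the exact strings are illustrative:

```python
def to_text_to_text(task, text, target_language="French"):
    """Frame an NLP task as T5-style text in, text out.

    Every task becomes: prefix + input text, and the model's answer
    is itself text ("positive", a summary, a translation, ...).
    """
    prefixes = {
        "sentiment": "classify sentiment: ",
        "summarize": "summarize: ",
        "translate": f"translate English to {target_language}: ",
    }
    return prefixes[task] + text

prompt = to_text_to_text("translate", "The weather is nice.")
```

The payoff of this framing is that one model, one loss function, and one decoding procedure cover classification, translation, and summarization alike.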
✅ Quick Check: You need to extract the answer to a question from a paragraph of text (extractive QA). The answer is a span of text within the paragraph — you’re identifying it, not generating new text. Which architecture fits best?

Answer: Encoder-only (BERT). Extractive QA requires deep understanding of both the question and the paragraph to locate the answer span. BERT’s bidirectional attention excels at this — it scores each position in the paragraph as a potential start or end of the answer. GPT could generate an answer, but that’s generative QA: a different task, with a risk of hallucination.
When to Use Which
| Task | Best Architecture | Why |
|---|---|---|
| Text classification | Encoder (BERT) | Needs full context understanding |
| NER | Encoder (BERT) | Each token needs bidirectional context |
| Sentiment analysis | Encoder (BERT) | Understanding trumps generation |
| Text generation | Decoder (GPT) | Autoregressive generation |
| Summarization | Encoder-decoder (T5) or Decoder (GPT) | Needs to understand then produce |
| Translation | Encoder-decoder (T5/mBART) | Map from one language to another |
| Conversational AI | Decoder (GPT/Claude) | Generate natural responses |
| Question answering | Encoder (BERT) for extractive; Decoder for generative | Depends on whether answer is in the text |
The Size-Accuracy Trade-off
Transformer models come in dramatically different sizes:
| Model | Parameters | Size | Inference Speed | Accuracy |
|---|---|---|---|---|
| DistilBERT | 66M | 250MB | ~3ms | Good |
| BERT-base | 110M | 440MB | ~8ms | Very good |
| BERT-large | 340M | 1.3GB | ~25ms | Excellent |
| Llama 3 8B | 8B | 16GB | ~200ms | Very good |
| GPT-4 | ~1.8T (est.) | Cloud only | ~1-3s | Best (general) |
Bigger isn’t always better. For classification and NER, a fine-tuned BERT-base often matches or outperforms GPT-4 zero-shot — at 1/100th the cost and 100x the speed. Fine-tuned GPT-3.5 achieves F1 scores of 0.95-0.97, matching fine-tuned BERT on many tasks.
✅ Quick Check: A startup needs sentiment analysis on 1 million product reviews per day, each classified in under 50ms, on a limited budget. What’s the right model choice?

Answer: Fine-tuned DistilBERT or BERT-base. At 3-8ms per inference, these handle 1M reviews in hours on a single GPU; GPT-4 would take days and cost thousands of dollars. DistilBERT is 40% smaller than BERT-base (66M vs 110M parameters) but retains 97% of its accuracy — the ideal choice when speed and cost matter more than marginal accuracy gains.
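The arithmetic behind the quick check is worth doing explicitly. This sketch assumes one inference at a time (a single stream on one GPU); real deployments batch requests, so actual throughput is higher:

```python
def daily_capacity(latency_ms, hours=24):
    """How many sequential inferences fit in a day at a given per-item latency."""
    return int(hours * 3600 * 1000 / latency_ms)

# Per-model latencies taken from the size-accuracy table above
for model, ms in [("DistilBERT", 3), ("BERT-base", 8), ("GPT-4 API", 2000)]:
    print(f"{model}: ~{daily_capacity(ms):,} items/day")
```

At 8ms per review, BERT-base clears over 10M items/day sequentially, while a 2s API call tops out around 43K/day — an order of magnitude short of the 1M target without heavy parallelism.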
Zero-Shot vs Fine-Tuned
The biggest practical decision in modern NLP: prompt a general-purpose LLM (zero-shot) or fine-tune a specialized model?
| Factor | Zero-Shot LLM | Fine-Tuned BERT |
|---|---|---|
| Labeled data needed | None | Hundreds to thousands |
| Setup time | Minutes | Days (labeling + training) |
| Per-query cost | $0.001-$0.03 | ~$0.0001 (self-hosted) |
| Accuracy | Good (F1 0.70-0.85) | Excellent (F1 0.90-0.97) |
| Latency | 500ms-3s | 5-25ms |
| Data privacy | Data sent to API provider | Data stays on your servers |
| Flexibility | Change task with a prompt change | Retrain for each new task |
The practical strategy: Start with zero-shot (fast prototyping, no data needed). If accuracy matters, invest in labeled data and fine-tune. Many production systems use both — zero-shot for rare/new categories, fine-tuned for high-volume categories.
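One way to make the start-zero-shot-then-fine-tune decision concrete is a break-even calculation. The setup cost below (labeling plus training) is a hypothetical figure you would estimate for your own project:

```python
def break_even_queries(api_cost_per_query, hosted_cost_per_query, setup_cost):
    """Number of queries after which fine-tuning + self-hosting
    becomes cheaper than paying the API per query.

    setup_cost: one-off cost of labeling data and training the model
    (an assumption; estimate it for your own project).
    """
    saving_per_query = api_cost_per_query - hosted_cost_per_query
    return setup_cost / saving_per_query

# Illustrative figures from the comparison table: $0.01/query via API
# vs $0.0001/query self-hosted, with a hypothetical $5,000 setup cost
n = break_even_queries(0.01, 0.0001, 5000)
```

Under these assumptions the crossover sits near half a million queries: a weekend prototype should use the API, while a system serving millions of queries pays for fine-tuning quickly.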
Open-Source NLP Ecosystem
The open-source transformer ecosystem is massive and growing:
- Hugging Face Hub: 500,000+ pretrained models, ready to fine-tune
- spaCy + Transformers: Production pipeline with transformer-powered NER and classification
- Sentence Transformers: Specialized models for text similarity and search
- Ollama, vLLM: Run open LLMs (Llama, Mistral) locally
This ecosystem means you can deploy state-of-the-art NLP without sending data to any third-party API — critical for regulated industries and privacy-sensitive applications.
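Sentence-embedding models like those in Sentence Transformers reduce text similarity to comparing vectors, typically with cosine similarity. The embeddings below are made-up numbers standing in for real model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: 1.0 means
    identical direction (maximally similar), 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-d "embeddings"; a real model produces vectors with hundreds of dims
query_vec = [0.9, 0.1, 0.3]
doc_vecs = {
    "refund policy": [0.8, 0.2, 0.4],
    "shipping times": [0.1, 0.9, 0.2],
}
# Semantic search: return the document whose vector is closest to the query
best = max(doc_vecs, key=lambda name: cosine_similarity(query_vec, doc_vecs[name]))
```

This nearest-vector lookup is the whole idea behind embedding-based search: encode once, then compare cheap vectors instead of re-running a model per document pair.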
Key Takeaways
- Three transformer architectures: encoder (BERT — understanding), decoder (GPT — generation), encoder-decoder (T5 — both)
- BERT excels at classification, NER, sentiment; GPT excels at generation; T5 unifies all tasks as text-to-text
- Fine-tuned BERT often matches GPT-4 zero-shot on classification at 1/100th the cost and 100x speed
- Zero-shot LLMs need no training data — ideal for prototyping; fine-tuned models win on accuracy and cost at scale
- Open-source models (Llama, Mistral, BERT) enable on-premises NLP with full data sovereignty
- Bigger isn’t always better: DistilBERT (66M params) retains 97% of BERT’s accuracy at 60% the size
Up Next
You now have the full NLP toolkit — from preprocessing to transformers. Lesson 8 brings it all together: designing your first NLP project, choosing a career path, and mapping the skills that command $107K-$206K salaries.