Transformers & LLMs for NLP
How BERT, GPT, and T5 transformed every NLP task — architecture differences, when to use each, and the shift from task-specific to general-purpose models.
The Transformer Revolution
🔄 Lessons 2-6 built your NLP toolkit — preprocessing, representations, classification, NER, and sentiment analysis. Each task used specific models and techniques. Then transformers arrived and changed everything.
Before 2018, each NLP task required a separate model trained from scratch on task-specific data. After BERT (2018), the approach flipped: pretrain one massive model on general language, then fine-tune it for any task with minimal labeled data. This single shift improved accuracy across every NLP benchmark simultaneously.
Three Transformer Architectures
All transformers use self-attention — the mechanism that lets the model consider relationships between every pair of words in the input. But they differ in architecture and what they’re best at.
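To make "relationships between every pair of words" concrete, here is a minimal, dependency-free sketch of scaled dot-product attention, the core computation inside every transformer layer. The 2-d token vectors are toy values, and the learned query/key/value projection matrices of a real transformer are omitted:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: each position attends to every position.

    queries/keys/values: lists of equal-length vectors, one per token.
    Returns one output vector per token: a weighted mix of all value vectors.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every key (every pair of words interacts)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # attention weights sum to 1
        # Output = attention-weighted average of the value vectors
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy 2-d embeddings for three tokens; in self-attention Q = K = V = the input
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(tokens, tokens, tokens)
```

Each output vector is a convex combination of the inputs, which is why attention "mixes" context into every position.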
Encoder-only (BERT family)
BERT reads text bidirectionally — it sees words to the left AND right of each position simultaneously. This makes it excellent at understanding the full context of a passage.
- How it works: Mask random words during pretraining, predict what’s missing using surrounding context
- Best for: Classification, NER, sentiment analysis, question answering — tasks that require understanding
- Models: BERT, RoBERTa, ALBERT, DeBERTa, DistilBERT
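The masked-word pretraining objective can be sketched in a few lines. This toy version only replaces tokens with `[MASK]`; real BERT also sometimes substitutes a random word or keeps the original (the 80/10/10 rule), which is omitted here:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """BERT-style pretraining data: hide a fraction of tokens behind [MASK].

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original word the model must predict using the
    context on BOTH sides of the mask.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

sentence = "the cat sat on the mat because it was warm".split()
masked, targets = mask_tokens(sentence, mask_rate=0.3, seed=1)
```

Because the targets are derived from the text itself, pretraining needs no human labels, which is what makes training on billions of sentences feasible.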
Decoder-only (GPT family)
GPT reads text left-to-right and predicts the next word. This makes it a powerful text generator that can produce coherent, fluent output.
- How it works: Given a sequence, predict the next token. Trained on trillions of words
- Best for: Text generation, summarization, translation, conversational AI — tasks that produce text
- Models: GPT-4, Claude, Gemini, Llama, Mistral
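Autoregressive decoding is just a loop: predict, append, repeat. In this sketch a toy bigram table stands in for the trained transformer, and greedy decoding stands in for sampling:

```python
def generate(next_token_probs, start, max_tokens=8):
    """Decoder-style generation: repeatedly predict the next token
    from what has been generated so far.

    next_token_probs: dict mapping a token to a probability
    distribution over possible next tokens (a toy stand-in for
    a trained language model).
    """
    out = [start]
    for _ in range(max_tokens):
        dist = next_token_probs.get(out[-1])
        if dist is None:
            break  # no continuation known: stop generating
        # Greedy decoding: always take the highest-probability next token
        out.append(max(dist, key=dist.get))
    return out

# Toy "model": next-token distributions, as if learned from a tiny corpus
next_token_probs = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"down": 1.0},
}
print(" ".join(generate(next_token_probs, "the")))  # the cat sat down
```

A real GPT conditions on the entire generated prefix, not just the last token, but the predict-append-repeat loop is the same.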
Encoder-decoder (T5 family)
T5 reads the full input (encoder) then generates output (decoder). It treats every NLP task as a text-to-text problem: classification becomes “classify: [text]” → “positive”; translation becomes “translate English to French: [text]” → “[French text].”
- How it works: Encode the full input, then decode to generate the output
- Best for: Tasks that require both understanding input and generating output
- Models: T5, BART, mBART, Flan-T5
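The text-to-text framing can be sketched as a simple prompt formatter. The task prefixes below mirror the T5 convention, but the exact strings are illustrative:

```python
def to_text_to_text(task, text, target_language="French"):
    """Frame an NLP task as T5-style text in, text out.

    Every task becomes: prefix + input text, and the model's answer
    is itself text ("positive", a summary, a translation, ...).
    """
    prefixes = {
        "sentiment": "classify sentiment: ",
        "summarize": "summarize: ",
        "translate": f"translate English to {target_language}: ",
    }
    return prefixes[task] + text

prompt = to_text_to_text("translate", "The weather is nice.")
```

The payoff of this framing is that one model, one loss function, and one decoding procedure cover classification, translation, and summarization alike.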
✅ Quick Check: You need to extract the answer to a question from a paragraph of text (extractive QA). The answer is a span of text within the paragraph — you’re identifying it, not generating new text. Which architecture fits best?

Answer: Encoder-only (BERT). Extractive QA requires deep understanding of both the question and the paragraph to locate the answer span. BERT’s bidirectional attention excels at this — it scores each position in the paragraph as a potential start or end of the answer. GPT could generate an answer, but that’s generative QA: a different task, with a risk of hallucination.
When to Use Which
| Task | Best Architecture | Why |
|---|---|---|
| Text classification | Encoder (BERT) | Needs full context understanding |
| NER | Encoder (BERT) | Each token needs bidirectional context |
| Sentiment analysis | Encoder (BERT) | Understanding trumps generation |
| Text generation | Decoder (GPT) | Autoregressive generation |
| Summarization | Encoder-decoder (T5) or Decoder (GPT) | Needs to understand then produce |
| Translation | Encoder-decoder (T5/mBART) | Map from one language to another |
| Conversational AI | Decoder (GPT/Claude) | Generate natural responses |
| Question answering | Encoder (BERT) for extractive; Decoder for generative | Depends on whether answer is in the text |
The Size-Accuracy Trade-off
Transformer models come in dramatically different sizes:
| Model | Parameters | Size | Inference Speed | Accuracy |
|---|---|---|---|---|
| DistilBERT | 66M | 250MB | ~3ms | Good |
| BERT-base | 110M | 440MB | ~8ms | Very good |
| BERT-large | 340M | 1.3GB | ~25ms | Excellent |
| Llama 3 8B | 8B | 16GB | ~200ms | Very good |
| GPT-4 | ~1.8T (est.) | Cloud only | ~1-3s | Best (general) |
Bigger isn’t always better. For classification and NER, a fine-tuned BERT-base often matches or outperforms GPT-4 zero-shot — at 1/100th the cost and 100x the speed. Fine-tuned GPT-3.5 achieves F1 scores of 0.95-0.97, matching fine-tuned BERT on many tasks.
✅ Quick Check: A startup needs sentiment analysis on 1 million product reviews per day, each classified in under 50ms, on a limited budget. What’s the right model choice?

Answer: Fine-tuned DistilBERT or BERT-base. At 3-8ms per inference, these handle 1M reviews in hours on a single GPU; GPT-4 would take days and cost thousands of dollars. DistilBERT is 40% smaller than BERT-base (66M vs 110M parameters) but retains 97% of its accuracy — the ideal choice when speed and cost matter more than marginal accuracy gains.
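The arithmetic behind the quick check is worth doing explicitly. This sketch assumes one inference at a time (a single stream on one GPU); real deployments batch requests, so actual throughput is higher:

```python
def daily_capacity(latency_ms, hours=24):
    """How many sequential inferences fit in a day at a given per-item latency."""
    return int(hours * 3600 * 1000 / latency_ms)

# Per-model latencies taken from the size-accuracy table above
for model, ms in [("DistilBERT", 3), ("BERT-base", 8), ("GPT-4 API", 2000)]:
    print(f"{model}: ~{daily_capacity(ms):,} items/day")
```

At 8ms per review, BERT-base clears over 10M items/day sequentially, while a 2s API call tops out around 43K/day — an order of magnitude short of the 1M target without heavy parallelism.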
Zero-Shot vs Fine-Tuned
The biggest practical decision in modern NLP: prompt a general-purpose LLM (zero-shot) or fine-tune a specialized model?
| Factor | Zero-Shot LLM | Fine-Tuned BERT |
|---|---|---|
| Labeled data needed | None | Hundreds to thousands |
| Setup time | Minutes | Days (labeling + training) |
| Per-query cost | $0.001-$0.03 | ~$0.0001 (self-hosted) |
| Accuracy | Good (F1 0.70-0.85) | Excellent (F1 0.90-0.97) |
| Latency | 500ms-3s | 5-25ms |
| Data privacy | Data sent to API provider | Data stays on your servers |
| Flexibility | Change task with a prompt change | Retrain for each new task |
The practical strategy: Start with zero-shot (fast prototyping, no data needed). If accuracy matters, invest in labeled data and fine-tune. Many production systems use both — zero-shot for rare/new categories, fine-tuned for high-volume categories.
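One way to make the start-zero-shot-then-fine-tune decision concrete is a break-even calculation. The setup cost below (labeling plus training) is a hypothetical figure you would estimate for your own project:

```python
def break_even_queries(api_cost_per_query, hosted_cost_per_query, setup_cost):
    """Number of queries after which fine-tuning + self-hosting
    becomes cheaper than paying the API per query.

    setup_cost: one-off cost of labeling data and training the model
    (an assumption; estimate it for your own project).
    """
    saving_per_query = api_cost_per_query - hosted_cost_per_query
    return setup_cost / saving_per_query

# Illustrative figures from the comparison table: $0.01/query via API
# vs $0.0001/query self-hosted, with a hypothetical $5,000 setup cost
n = break_even_queries(0.01, 0.0001, 5000)
```

Under these assumptions the crossover sits near half a million queries: a weekend prototype should use the API, while a system serving millions of queries pays for fine-tuning quickly.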
Open-Source NLP Ecosystem
The open-source transformer ecosystem is massive and growing:
- Hugging Face Hub: 500,000+ pretrained models, ready to fine-tune
- spaCy + Transformers: Production pipeline with transformer-powered NER and classification
- Sentence Transformers: Specialized models for text similarity and search
- Ollama, vLLM: Run open LLMs (Llama, Mistral) locally
This ecosystem means you can deploy state-of-the-art NLP without sending data to any third-party API — critical for regulated industries and privacy-sensitive applications.
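Sentence-embedding models like those in Sentence Transformers reduce text similarity to comparing vectors, typically with cosine similarity. The embeddings below are made-up numbers standing in for real model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: 1.0 means
    identical direction (maximally similar), 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-d "embeddings"; a real model produces vectors with hundreds of dims
query_vec = [0.9, 0.1, 0.3]
doc_vecs = {
    "refund policy": [0.8, 0.2, 0.4],
    "shipping times": [0.1, 0.9, 0.2],
}
# Semantic search: return the document whose vector is closest to the query
best = max(doc_vecs, key=lambda name: cosine_similarity(query_vec, doc_vecs[name]))
```

This nearest-vector lookup is the whole idea behind embedding-based search: encode once, then compare cheap vectors instead of re-running a model per document pair.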
Key Takeaways
- Three transformer architectures: encoder (BERT — understanding), decoder (GPT — generation), encoder-decoder (T5 — both)
- BERT excels at classification, NER, sentiment; GPT excels at generation; T5 unifies all tasks as text-to-text
- Fine-tuned BERT often matches GPT-4 zero-shot on classification at 1/100th the cost and 100x speed
- Zero-shot LLMs need no training data — ideal for prototyping; fine-tuned models win on accuracy and cost at scale
- Open-source models (Llama, Mistral, BERT) enable on-premises NLP with full data sovereignty
- Bigger isn’t always better: DistilBERT (66M params) retains 97% of BERT’s accuracy at 60% the size
Up Next
You now have the full NLP toolkit — from preprocessing to transformers. Lesson 8 brings it all together: designing your first NLP project, choosing a career path, and mapping the skills that command $107K-$206K salaries.