Text Classification
How to build systems that automatically categorize text — spam detection, topic labeling, content moderation, and the models that power them.
Sorting Text at Scale
🔄 Lesson 3 covered how to turn words into numbers — bag-of-words, TF-IDF, Word2Vec, and transformer embeddings. Those representations are the input. Now we need the system that uses them to make decisions: text classification.
Text classification assigns a label to a document. It’s the most common NLP task in production — if your company processes text, you’re probably already using it somewhere.
Where Classification Runs
| Application | Classes | Scale |
|---|---|---|
| Email spam filtering | Spam / Not spam | Billions of emails per day (Gmail) |
| Content moderation | Safe / Toxic / Spam / Violence | Millions of posts per hour (social media) |
| Customer support routing | Billing / Technical / Returns / General | Thousands of tickets per day |
| News categorization | Politics / Sports / Tech / Business | Millions of articles per day |
| Legal document review | Relevant / Privileged / Irrelevant | Millions of documents per case |
| Medical coding | ICD codes (70,000+ categories) | Millions of records per hospital |
The Classification Pipeline
Every text classifier follows the same basic flow:
Raw text → Preprocess → Represent (vectorize) → Classify → Label
The “represent” step uses the methods from Lesson 3. The “classify” step applies a model that maps those representations to categories.
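The flow above can be sketched as three tiny functions. This is a toy illustration, not a production system: the vocabulary, class names, and hand-set weights are invented for the example, and a real classifier would learn the weights from data.

```python
# Toy sketch of the pipeline: preprocess -> represent -> classify.
# Vocabulary and weights are hand-picked for illustration, not learned.
import re
from collections import Counter

def preprocess(text):
    """Lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def represent(tokens, vocabulary):
    """Bag-of-words: count each vocabulary word in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def classify(vector, class_weights):
    """Score each class as a dot product with its weight vector."""
    scores = {label: sum(w * x for w, x in zip(weights, vector))
              for label, weights in class_weights.items()}
    return max(scores, key=scores.get)

vocabulary = ["free", "winner", "meeting", "report"]
class_weights = {                      # hand-set, not trained
    "spam":     [1.0, 1.0, -0.5, -0.5],
    "not_spam": [-0.5, -0.5, 1.0, 1.0],
}

vector = represent(preprocess("FREE prize, you are a winner!"), vocabulary)
print(classify(vector, class_weights))  # spam
```

Every model in this lesson fills in the `classify` step with something more sophisticated, but the overall shape of the pipeline stays the same.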
Classical Models
Naive Bayes — The starting point for text classification. It applies Bayes’ theorem: given a document, what’s the probability it belongs to each class? It’s “naive” because it assumes words are independent (they’re not, but it works surprisingly well).
- Strengths: fast to train, works with small datasets, good baseline
- Weakness: ignores word order and word relationships
- Best with: bag-of-words or TF-IDF representations
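A minimal Naive Bayes classifier takes only a few lines with scikit-learn. The four training sentences below are made up for illustration; a real spam filter would train on thousands of labeled emails.

```python
# Bag-of-words + Multinomial Naive Bayes (toy training set for illustration).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "win a free prize now", "claim your free money",
    "meeting rescheduled to monday", "please review the attached report",
]
train_labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer builds the bag-of-words counts; MultinomialNB applies Bayes' theorem.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["free prize inside"])[0])  # spam
```

Note that the word-independence assumption is baked in: the model multiplies per-word probabilities and never looks at word order.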
Logistic Regression — Despite the name, it’s a classifier. It learns a weight for each feature (word) and combines them to predict the class probability.
- Strengths: interpretable (you can see which words drive each prediction), fast, solid performance
- Weakness: linear decision boundary — can’t capture complex patterns
- Best with: TF-IDF representations
Support Vector Machines (SVM) — Finds the optimal boundary between classes in the feature space. Works well with high-dimensional text data.
- Strengths: handles high-dimensional data well, strong with small-to-medium datasets
- Weakness: slower than Naive Bayes, harder to interpret
- Best with: TF-IDF or Word2Vec
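An SVM drops into the same pipeline shape; the sketch below uses scikit-learn's `LinearSVC` on TF-IDF features, with an invented support-routing dataset for illustration.

```python
# TF-IDF + linear SVM for ticket routing (toy data for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = [
    "I was charged twice this month", "refund my last invoice",
    "the app crashes on startup", "error message when logging in",
]
labels = ["billing", "billing", "technical", "technical"]

# LinearSVC finds a separating hyperplane in the high-dimensional TF-IDF space.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

print(model.predict(["crashes after the update"])[0])  # technical
```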
✅ Quick Check: You need a quick baseline classifier for a new text classification project. You have 1,000 labeled examples and need results by tomorrow. What do you build? TF-IDF + Logistic Regression. It takes minutes to train, gives interpretable results (which words matter most for each class), and achieves competitive accuracy on many tasks. This is the standard “first model” in NLP — if it solves the problem, you don’t need anything fancier.
Deep Learning Models
CNNs for text — Apply convolutional filters over word sequences to capture local patterns (n-grams). A filter of size 3 captures three-word phrases; multiple filter sizes capture patterns at different scales.
- Strengths: captures phrase-level patterns, relatively fast
- Use case: sentence-level classification (sentiment, topic)
RNNs/LSTMs — Process text sequentially, maintaining a hidden state. Good for tasks where word order matters.
- Strengths: captures sequential patterns, understands word order
- Weakness: slow (sequential processing), struggles with long documents
Transformers (BERT, RoBERTa) — The current standard. Pretrained on billions of words, then fine-tuned on your specific classification task.
- Strengths: state-of-the-art accuracy, handles nuance, context-aware
- Weakness: requires GPU, slower inference, less interpretable
- The approach: take a pretrained model, add a classification head, fine-tune on your labeled data
Evaluation: Beyond Accuracy
Accuracy alone is misleading for most real-world classification problems. Here’s why, and what to use instead:
| Metric | What It Measures | When to Use |
|---|---|---|
| Accuracy | % of correct predictions | Only when classes are balanced |
| Precision | Of predicted positives, % actually positive | When false positives are costly (spam filter) |
| Recall | Of actual positives, % correctly found | When false negatives are costly (fraud, disease) |
| F1 Score | Harmonic mean of precision and recall | General-purpose balanced metric |
| AUC-ROC | Performance across all thresholds | Comparing models overall |
For fraud detection, recall matters most — missing a fraud is worse than a false alarm. For content recommendation, precision matters more — recommending bad content hurts more than missing a good article.
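To make the definitions concrete, here is a worked example with scikit-learn's metric functions on made-up fraud labels (1 = fraud, 0 = legitimate):

```python
# Precision, recall, and F1 on a toy fraud example (labels invented).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # one missed fraud, one false alarm

# TP = 3, FP = 1, FN = 1
precision = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
recall    = recall_score(y_true, y_pred)     # 3 / (3 + 1) = 0.75
f1        = f1_score(y_true, y_pred)         # 0.75 (precision == recall here)

print(precision, recall, f1)
```

Note that accuracy here is 8/10 = 0.80, which looks fine, yet the model still missed a quarter of the fraud cases; that gap is exactly why the rare class needs its own metrics.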
✅ Quick Check: A medical screening test classifies X-rays as “possible tumor” or “clear.” The hospital wants to catch 99% of actual tumors, even if it means extra follow-up scans. Which metric should they optimize? Recall — also called sensitivity. A recall of 99% means only 1% of actual tumors are missed. The trade-off: lower precision means more false alarms (healthy patients flagged for follow-up). In medical screening, this trade-off is almost always worth it — follow-up scans are inconvenient but catching cancer early saves lives.
Handling Imbalanced Data
Most real classification problems have imbalanced classes — fraud is 0.1% of transactions, toxic content is 2% of posts, urgent tickets are 5% of support volume. Standard training on imbalanced data produces models that ignore the rare class.
Solutions:
- Oversampling: Duplicate minority class examples, or use SMOTE to generate synthetic ones
- Undersampling: Remove majority class examples (fast but loses data)
- Class weights: Tell the model that minority class errors cost more
- Threshold tuning: Adjust the classification threshold to favor the important class
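Two of these fixes, class weights and threshold tuning, can be shown in a few lines. The sketch below uses synthetic data with a deliberate 95/5 imbalance; the 0.2 threshold is an illustrative choice, and in practice you would pick it from a precision-recall curve on held-out data.

```python
# Class weights and threshold tuning on an imbalanced synthetic dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 950 negatives near the origin, 50 positives shifted right: imbalanced by design.
X = np.concatenate([rng.normal(0, 1, (950, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 950 + [1] * 50)

# class_weight="balanced" scales each class's errors inversely to its frequency,
# so the 50 positives count as much as the 950 negatives during training.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Threshold tuning: lowering the cutoff below 0.5 trades precision for recall.
proba = clf.predict_proba(X)[:, 1]
preds_default = (proba >= 0.5).astype(int)
preds_tuned   = (proba >= 0.2).astype(int)

print(preds_default.sum(), preds_tuned.sum())  # the lower threshold flags more positives
```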
Key Takeaways
- Text classification assigns labels to documents — the most common NLP task in production
- Classical pipeline: TF-IDF + Logistic Regression is a strong, fast baseline
- Deep learning: fine-tuned BERT is the current standard for accuracy on nuanced tasks
- Accuracy is misleading for imbalanced classes — use precision, recall, and F1 instead
- For rare-class detection (fraud, disease): optimize for recall
- Multilingual models (mBERT, XLM-R) handle 100+ languages without separate classifiers
Up Next
Classification labels entire documents. But what about extracting specific pieces of information from within the text — names, dates, organizations, amounts? That’s named entity recognition, and it’s the subject of Lesson 5.