Text Classification
How to build systems that automatically categorize text — spam detection, topic labeling, content moderation, and the models that power them.
Sorting Text at Scale
🔄 Lesson 3 covered how to turn words into numbers — bag-of-words, TF-IDF, Word2Vec, and transformer embeddings. Those representations are the input. Now we need the system that uses them to make decisions: text classification.
Text classification assigns a label to a document. It’s the most common NLP task in production — if your company processes text, you’re probably already using it somewhere.
Where Classification Runs
| Application | Classes | Scale |
|---|---|---|
| Email spam filtering | Spam / Not spam | Billions of emails per day (Gmail) |
| Content moderation | Safe / Toxic / Spam / Violence | Millions of posts per hour (social media) |
| Customer support routing | Billing / Technical / Returns / General | Thousands of tickets per day |
| News categorization | Politics / Sports / Tech / Business | Millions of articles per day |
| Legal document review | Relevant / Privileged / Irrelevant | Millions of documents per case |
| Medical coding | ICD codes (70,000+ categories) | Millions of records per hospital |
The Classification Pipeline
Every text classifier follows the same basic flow:
Raw text → Preprocess → Represent (vectorize) → Classify → Label
The “represent” step uses the methods from Lesson 3. The “classify” step applies a model that maps those representations to categories.
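The flow above can be sketched as three tiny functions. This is a toy illustration, not a production system: the vocabulary, class names, and hand-set weights are invented for the example, and a real classifier would learn the weights from data.

```python
# Toy sketch of the pipeline: preprocess -> represent -> classify.
# Vocabulary and weights are hand-picked for illustration, not learned.
import re
from collections import Counter

def preprocess(text):
    """Lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def represent(tokens, vocabulary):
    """Bag-of-words: count each vocabulary word in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def classify(vector, class_weights):
    """Score each class as a dot product with its weight vector."""
    scores = {label: sum(w * x for w, x in zip(weights, vector))
              for label, weights in class_weights.items()}
    return max(scores, key=scores.get)

vocabulary = ["free", "winner", "meeting", "report"]
class_weights = {                      # hand-set, not trained
    "spam":     [1.0, 1.0, -0.5, -0.5],
    "not_spam": [-0.5, -0.5, 1.0, 1.0],
}

vector = represent(preprocess("FREE prize, you are a winner!"), vocabulary)
print(classify(vector, class_weights))  # spam
```

Every model in this lesson fills in the `classify` step with something more sophisticated, but the overall shape of the pipeline stays the same.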
Classical Models
Naive Bayes — The starting point for text classification. It applies Bayes’ theorem: given a document, what’s the probability it belongs to each class? It’s “naive” because it assumes words are independent (they’re not, but it works surprisingly well).
- Strengths: fast to train, works with small datasets, good baseline
- Weakness: ignores word order and word relationships
- Best with: bag-of-words or TF-IDF representations
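A minimal Naive Bayes classifier takes only a few lines with scikit-learn. The four training sentences below are made up for illustration; a real spam filter would train on thousands of labeled emails.

```python
# Bag-of-words + Multinomial Naive Bayes (toy training set for illustration).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "win a free prize now", "claim your free money",
    "meeting rescheduled to monday", "please review the attached report",
]
train_labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer builds the bag-of-words counts; MultinomialNB applies Bayes' theorem.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["free prize inside"])[0])  # spam
```

Note that the word-independence assumption is baked in: the model multiplies per-word probabilities and never looks at word order.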
Logistic Regression — Despite the name, it’s a classifier. It learns a weight for each feature (word) and combines them to predict the class probability.
- Strengths: interpretable (you can see which words drive each prediction), fast, solid performance
- Weakness: linear decision boundary — can’t capture complex patterns
- Best with: TF-IDF representations
Support Vector Machines (SVM) — Finds the optimal boundary between classes in the feature space. Works well with high-dimensional text data.
- Strengths: handles high-dimensional data well, strong with small-to-medium datasets
- Weakness: slower than Naive Bayes, harder to interpret
- Best with: TF-IDF or Word2Vec
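An SVM drops into the same pipeline shape; the sketch below uses scikit-learn's `LinearSVC` on TF-IDF features, with an invented support-routing dataset for illustration.

```python
# TF-IDF + linear SVM for ticket routing (toy data for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = [
    "I was charged twice this month", "refund my last invoice",
    "the app crashes on startup", "error message when logging in",
]
labels = ["billing", "billing", "technical", "technical"]

# LinearSVC finds a separating hyperplane in the high-dimensional TF-IDF space.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

print(model.predict(["crashes after the update"])[0])  # technical
```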
✅ Quick Check: You need a quick baseline classifier for a new text classification project. You have 1,000 labeled examples and need results by tomorrow. What do you build? TF-IDF + Logistic Regression. It takes minutes to train, gives interpretable results (which words matter most for each class), and achieves competitive accuracy on many tasks. This is the standard “first model” in NLP — if it solves the problem, you don’t need anything fancier.
Deep Learning Models
CNNs for text — Apply convolutional filters over word sequences to capture local patterns (n-grams). A filter of size 3 captures three-word phrases; multiple filter sizes capture patterns at different scales.
- Strengths: captures phrase-level patterns, relatively fast
- Use case: sentence-level classification (sentiment, topic)
RNNs/LSTMs — Process text sequentially, maintaining a hidden state. Good for tasks where word order matters.
- Strengths: captures sequential patterns, understands word order
- Weakness: slow (sequential processing), struggles with long documents
Transformers (BERT, RoBERTa) — The current standard. Pretrained on billions of words, then fine-tuned on your specific classification task.
- Strengths: state-of-the-art accuracy, handles nuance, context-aware
- Weakness: requires GPU, slower inference, less interpretable
- The approach: take a pretrained model, add a classification head, fine-tune on your labeled data
Evaluation: Beyond Accuracy
Accuracy alone is misleading for most real-world classification problems. Here’s why, and what to use instead:
| Metric | What It Measures | When to Use |
|---|---|---|
| Accuracy | % of correct predictions | Only when classes are balanced |
| Precision | Of predicted positives, % actually positive | When false positives are costly (spam filter) |
| Recall | Of actual positives, % correctly found | When false negatives are costly (fraud, disease) |
| F1 Score | Harmonic mean of precision and recall | General-purpose balanced metric |
| AUC-ROC | Performance across all thresholds | Comparing models overall |
For fraud detection, recall matters most — missing a fraud is worse than a false alarm. For content recommendation, precision matters more — recommending bad content hurts more than missing a good article.
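To make the definitions concrete, here is a worked example with scikit-learn's metric functions on made-up fraud labels (1 = fraud, 0 = legitimate):

```python
# Precision, recall, and F1 on a toy fraud example (labels invented).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # one missed fraud, one false alarm

# TP = 3, FP = 1, FN = 1
precision = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
recall    = recall_score(y_true, y_pred)     # 3 / (3 + 1) = 0.75
f1        = f1_score(y_true, y_pred)         # 0.75 (precision == recall here)

print(precision, recall, f1)
```

Note that accuracy here is 8/10 = 0.80, which looks fine, yet the model still missed a quarter of the fraud cases; that gap is exactly why the rare class needs its own metrics.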
✅ Quick Check: A medical screening test classifies X-rays as “possible tumor” or “clear.” The hospital wants to catch 99% of actual tumors, even if it means extra follow-up scans. Which metric should they optimize? Recall — also called sensitivity. A recall of 99% means only 1% of actual tumors are missed. The trade-off: lower precision means more false alarms (healthy patients flagged for follow-up). In medical screening, this trade-off is almost always worth it — follow-up scans are inconvenient but catching cancer early saves lives.
Handling Imbalanced Data
Most real classification problems have imbalanced classes — fraud is 0.1% of transactions, toxic content is 2% of posts, urgent tickets are 5% of support volume. Standard training on imbalanced data produces models that ignore the rare class.
Solutions:
- Oversampling: Duplicate minority class examples, or use SMOTE to generate synthetic ones
- Undersampling: Remove majority class examples (fast but loses data)
- Class weights: Tell the model that minority class errors cost more
- Threshold tuning: Adjust the classification threshold to favor the important class
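Two of these fixes, class weights and threshold tuning, can be shown in a few lines. The sketch below uses synthetic data with a deliberate 95/5 imbalance; the 0.2 threshold is an illustrative choice, and in practice you would pick it from a precision-recall curve on held-out data.

```python
# Class weights and threshold tuning on an imbalanced synthetic dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 950 negatives near the origin, 50 positives shifted right: imbalanced by design.
X = np.concatenate([rng.normal(0, 1, (950, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 950 + [1] * 50)

# class_weight="balanced" scales each class's errors inversely to its frequency,
# so the 50 positives count as much as the 950 negatives during training.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Threshold tuning: lowering the cutoff below 0.5 trades precision for recall.
proba = clf.predict_proba(X)[:, 1]
preds_default = (proba >= 0.5).astype(int)
preds_tuned   = (proba >= 0.2).astype(int)

print(preds_default.sum(), preds_tuned.sum())  # the lower threshold flags more positives
```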
Key Takeaways
- Text classification assigns labels to documents — the most common NLP task in production
- Classical pipeline: TF-IDF + Logistic Regression is a strong, fast baseline
- Deep learning: fine-tuned BERT is the current standard for accuracy on nuanced tasks
- Accuracy is misleading for imbalanced classes — use precision, recall, and F1 instead
- For rare-class detection (fraud, disease): optimize for recall
- Multilingual models (mBERT, XLM-R) handle 100+ languages without separate classifiers
Up Next
Classification labels entire documents. But what about extracting specific pieces of information from within the text — names, dates, organizations, amounts? That’s named entity recognition, and it’s the subject of Lesson 5.