Text Preprocessing
How to clean raw text for NLP — tokenization, stopword removal, stemming vs lemmatization, and building preprocessing pipelines with spaCy.
From Messy Text to Clean Input
Raw text is a mess. It has typos, inconsistent capitalization, punctuation that means different things in different contexts, HTML tags, emojis, and abbreviations. NLP models can’t work with this directly — they need structured, consistent input.
Text preprocessing is the bridge between raw text and usable data. And it’s where most NLP projects spend 50-80% of their time.
The Preprocessing Pipeline
A standard NLP preprocessing pipeline runs these steps in order:
Raw Text → Cleaning → Tokenization → Stopword Removal → Stemming/Lemmatization → Output
Each step makes the text a little more structured and a little more machine-friendly. Let’s walk through each one.
Step 1: Text Cleaning
Before anything else, strip the noise:
| Noise Type | Example | Action |
|---|---|---|
| HTML/XML tags | <p>Hello</p> | Strip tags, keep text |
| URLs | https://example.com | Remove or replace with [URL] |
| Special characters | @user #topic | Remove or normalize |
| Extra whitespace | hello&nbsp;&nbsp;&nbsp;world | Collapse to single space |
| Encoding issues | caf\u00e9 → café | Unicode normalization |
Cleaning is straightforward but easy to get wrong. A URL inside a tech support ticket might be important. An @mention in a social media post might be the entity you’re trying to extract. Always clean with your downstream task in mind.
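The cleaning steps in the table above can be sketched with a few regular expressions. This is a minimal, illustrative version — the function name and the `[URL]` placeholder are choices for this example, and a production cleaner would be tuned to the downstream task:

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Minimal cleaning sketch: normalize encoding, strip tags,
    replace URLs, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)        # unify encoding variants (e.g. combining accents)
    text = re.sub(r"<[^>]+>", " ", text)             # strip HTML/XML tags, keep inner text
    text = re.sub(r"https?://\S+", "[URL]", text)    # replace URLs with a placeholder token
    text = re.sub(r"[@#](\w+)", r"\1", text)         # drop @/# but keep the mention/topic word
    text = re.sub(r"\s+", " ", text).strip()         # collapse runs of whitespace
    return text

print(clean_text("<p>Hello   @user!</p> see https://example.com"))
# → "Hello user! see [URL]"
```

Note the order matters: tags are stripped before whitespace is collapsed, and keeping the word behind an @mention (rather than deleting it) is exactly the kind of task-dependent decision described above.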
Step 2: Tokenization
Tokenization splits text into individual units — tokens. These are usually words, but they can be subwords, characters, or sentences depending on the approach.
Word tokenization: “The cat sat on the mat” → ["The", "cat", "sat", "on", "the", "mat"]
Sounds trivial. It isn’t. Consider:
- “New York” — one token or two?
- “don’t” — “do” + “n’t” or “don” + “’t”?
- “state-of-the-art” — one token or five?
- “Dr.” — sentence boundary or abbreviation?
Subword tokenization (used by BERT and GPT) breaks words into smaller pieces: “unhappiness” → ["un", "##happi", "##ness"]. This handles unknown words gracefully — even if the model never saw “unhappiness” during training, it knows “un-”, “happi-”, and “-ness” as meaningful fragments.
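The greedy longest-match-first strategy behind WordPiece (BERT's subword scheme) can be shown with a toy vocabulary. This is a sketch for illustration — real tokenizers use a learned vocabulary of ~30,000 pieces, not the handful below:

```python
def wordpiece_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match-first subword split, WordPiece-style.
    Non-initial pieces are marked with a '##' prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:                    # try the longest remaining span first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark continuation pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                          # shrink the span and retry
        if piece is None:
            return ["[UNK]"]                  # no known piece covers this span
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary -- illustrative only
vocab = {"un", "##happi", "##ness", "run", "##ning"}
print(wordpiece_tokenize("unhappiness", vocab))  # → ['un', '##happi', '##ness']
print(wordpiece_tokenize("running", vocab))      # → ['run', '##ning']
```

Because every word either decomposes into known pieces or falls back to `[UNK]`, the vocabulary stays small while still covering words never seen during training.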
✅ Quick Check: Why do transformer models like BERT use subword tokenization instead of word-level tokenization? Word-level tokenization creates a fixed vocabulary — any word not in that vocabulary becomes “unknown.” Subword tokenization handles any word by breaking it into known pieces. This is critical for handling rare words, technical jargon, misspellings, and new terms the model never saw during training.
Step 3: Stopword Removal
Stopwords are common words that carry little meaning on their own: “the,” “is,” “at,” “in,” “a,” “an.” They’re frequent but not informative for many tasks.
Removing them reduces noise and dimensionality. “The quick brown fox jumps over the lazy dog” becomes “quick brown fox jumps lazy dog” — the core meaning survives.
But stopword removal isn’t always appropriate:
| Task | Remove Stopwords? | Why |
|---|---|---|
| Topic modeling | Yes | Focus on content words |
| Search indexing | Yes | Reduce index size |
| Sentiment analysis | Careful | “Not good” → “good” flips meaning |
| Machine translation | No | Grammar needs function words |
| NER | Usually no | Context helps entity detection |
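The table's task-dependence can be made concrete with a small sketch. The stopword set below is hand-rolled for illustration — real pipelines use the much larger lists shipped with NLTK or spaCy — and the `keep` whitelist shows how a sentiment pipeline would protect negations:

```python
# Tiny hand-rolled stopword list -- NLTK and spaCy ship far larger ones
STOPWORDS = {"the", "is", "at", "in", "a", "an", "on", "over", "not"}

def remove_stopwords(tokens: list[str], keep: set[str] = frozenset()) -> list[str]:
    """Drop stopwords, but let the caller whitelist words like 'not'."""
    return [t for t in tokens if t.lower() not in STOPWORDS or t.lower() in keep]

tokens = "The quick brown fox jumps over the lazy dog".split()
print(remove_stopwords(tokens))
# → ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

# For sentiment analysis, whitelist negations so "not good" survives intact
print(remove_stopwords(["not", "good"]))                  # → ['good']  (meaning flipped!)
print(remove_stopwords(["not", "good"], keep={"not"}))    # → ['not', 'good']
```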
Step 4: Stemming vs Lemmatization
Both reduce words to their base form, but differently.
Stemming chops endings using rules:
- “running” → “run” (strips -ing to “runn”; Porter’s cleanup then collapses the double consonant)
- “studies” → “studi”
- “happily” → “happili”
Fast. Crude. Often produces non-words.
Lemmatization finds the actual dictionary base form:
- “running” → “run”
- “studies” → “study”
- “better” → “good”
- “was” → “be”
Slower but accurate. Needs to know the word’s part of speech — “saw” as a verb lemmatizes to “see,” but “saw” as a noun stays “saw.”
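The difference is easy to see side by side. Below is a toy suffix-stripping stemmer (deliberately crude — it lacks Porter's cleanup rules, so it produces non-words) next to a tiny lemma dictionary; real code would use NLTK's `PorterStemmer` or spaCy's lemmatizer instead:

```python
def crude_stem(word: str) -> str:
    """Toy rule-based stemmer: chop a common suffix, no dictionary check."""
    for suffix in ("ies", "ing", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization needs a dictionary (and ideally POS tags) -- tiny sample here
LEMMAS = {"running": "run", "studies": "study", "better": "good", "was": "be"}

def lemmatize(word: str) -> str:
    return LEMMAS.get(word, word)

for w in ("running", "studies", "happily"):
    print(f"{w}: stem={crude_stem(w)}  lemma={lemmatize(w)}")
# running → stem "runn", studies → stem "stud": fast, crude, non-words
```

The stemmer is a few string operations per word; the lemmatizer needs a lookup table (and POS information to resolve cases like “saw”) — which is exactly the speed/accuracy trade-off described above.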
Modern NLP tools default to lemmatization. spaCy handles it automatically as part of its pipeline — tokenization, POS tagging, and lemmatization in a single pass across 75+ languages.
✅ Quick Check: When would you choose stemming over lemmatization? When speed matters more than precision. Search engines often use stemming to match “running,” “runs,” and “ran” to the same index entry. The slight inaccuracy doesn’t matter because search is about recall (finding relevant documents), not linguistic precision. For tasks where word meaning matters — classification, sentiment, NER — lemmatization is almost always better.
Putting It Together: spaCy Pipelines
Modern NLP libraries like spaCy combine all these steps into a single processing pipeline. You feed in raw text, and spaCy returns tokenized, lemmatized, POS-tagged text with named entities already identified.
spaCy’s pipeline runs: tokenizer → tagger → parser → NER, processing each step automatically. This is a massive productivity gain over manual preprocessing — what used to take 100 lines of custom code now takes a few lines with spaCy.
The key insight: preprocessing isn’t a one-size-fits-all recipe. Every decision — whether to lowercase, which stopwords to remove, whether to stem or lemmatize — depends on your downstream task.
Key Takeaways
- Text preprocessing bridges raw text and machine-readable input — typically 50-80% of NLP project time
- The pipeline: cleaning → tokenization → stopword removal → stemming/lemmatization
- Subword tokenization (BERT, GPT) handles unknown words by breaking them into known fragments
- Stopword removal is task-dependent — removing “not” destroys sentiment; keeping it helps classification
- Lemmatization (dictionary lookup) beats stemming (rule-based chopping) for accuracy; stemming wins on speed
- spaCy handles the entire pipeline in one pass across 75+ languages
Up Next
Preprocessed text is clean, but it’s still text — and models need numbers. Lesson 3 covers how to convert words into numerical representations: bag-of-words, TF-IDF, Word2Vec, and the transformer embeddings that power modern NLP.