Text Preprocessing
How to clean raw text for NLP — tokenization, stopword removal, stemming vs lemmatization, and building preprocessing pipelines with spaCy.
From Messy Text to Clean Input
Raw text is a mess. It has typos, inconsistent capitalization, punctuation that means different things in different contexts, HTML tags, emojis, and abbreviations. NLP models can’t work with this directly — they need structured, consistent input.
Text preprocessing is the bridge between raw text and usable data. And it’s where most NLP projects spend 50-80% of their time.
The Preprocessing Pipeline
A standard NLP preprocessing pipeline runs these steps in order:
Raw Text → Cleaning → Tokenization → Stopword Removal → Stemming/Lemmatization → Output
Each step makes the text a little more structured and a little more machine-friendly. Let’s walk through each one.
Step 1: Text Cleaning
Before anything else, strip the noise:
| Noise Type | Example | Action |
|---|---|---|
| HTML/XML tags | <p>Hello</p> | Strip tags, keep text |
| URLs | https://example.com | Remove or replace with [URL] |
| Special characters | @user #topic | Remove or normalize |
| Extra whitespace | hello&nbsp;&nbsp;&nbsp;world | Collapse to single space |
| Encoding issues | caf\u00e9 → café | Unicode normalization |
Cleaning is straightforward but easy to get wrong. A URL inside a tech support ticket might be important. An @mention in a social media post might be the entity you’re trying to extract. Always clean with your downstream task in mind.
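The cleaning steps in the table above can be sketched with a few regular expressions. This is a minimal, illustrative version — the function name and the `[URL]` placeholder are choices for this example, and a production cleaner would be tuned to the downstream task:

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Minimal cleaning sketch: normalize encoding, strip tags,
    replace URLs, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)        # unify encoding variants (e.g. combining accents)
    text = re.sub(r"<[^>]+>", " ", text)             # strip HTML/XML tags, keep inner text
    text = re.sub(r"https?://\S+", "[URL]", text)    # replace URLs with a placeholder token
    text = re.sub(r"[@#](\w+)", r"\1", text)         # drop @/# but keep the mention/topic word
    text = re.sub(r"\s+", " ", text).strip()         # collapse runs of whitespace
    return text

print(clean_text("<p>Hello   @user!</p> see https://example.com"))
# → "Hello user! see [URL]"
```

Note the order matters: tags are stripped before whitespace is collapsed, and keeping the word behind an @mention (rather than deleting it) is exactly the kind of task-dependent decision described above.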
Step 2: Tokenization
Tokenization splits text into individual units — tokens. These are usually words, but they can be subwords, characters, or sentences depending on the approach.
Word tokenization: “The cat sat on the mat” → ["The", "cat", "sat", "on", "the", "mat"]
Sounds trivial. It isn’t. Consider:
- “New York” — one token or two?
- “don’t” — “do” + “n’t” or “don” + “’t”?
- “state-of-the-art” — one token or five?
- “Dr.” — sentence boundary or abbreviation?
Subword tokenization (used by BERT and GPT) breaks words into smaller pieces: “unhappiness” → ["un", "##happi", "##ness"]. This handles unknown words gracefully — even if the model never saw “unhappiness” during training, it knows “un-”, “happi-”, and “-ness” as meaningful fragments.
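The greedy longest-match-first strategy behind WordPiece (BERT's subword scheme) can be shown with a toy vocabulary. This is a sketch for illustration — real tokenizers use a learned vocabulary of ~30,000 pieces, not the handful below:

```python
def wordpiece_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match-first subword split, WordPiece-style.
    Non-initial pieces are marked with a '##' prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:                    # try the longest remaining span first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark continuation pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                          # shrink the span and retry
        if piece is None:
            return ["[UNK]"]                  # no known piece covers this span
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary -- illustrative only
vocab = {"un", "##happi", "##ness", "run", "##ning"}
print(wordpiece_tokenize("unhappiness", vocab))  # → ['un', '##happi', '##ness']
print(wordpiece_tokenize("running", vocab))      # → ['run', '##ning']
```

Because every word either decomposes into known pieces or falls back to `[UNK]`, the vocabulary stays small while still covering words never seen during training.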
✅ Quick Check: Why do transformer models like BERT use subword tokenization instead of word-level tokenization? Word-level tokenization creates a fixed vocabulary — any word not in that vocabulary becomes “unknown.” Subword tokenization handles any word by breaking it into known pieces. This is critical for handling rare words, technical jargon, misspellings, and new terms the model never saw during training.
Step 3: Stopword Removal
Stopwords are common words that carry little meaning on their own: “the,” “is,” “at,” “in,” “a,” “an.” They’re frequent but not informative for many tasks.
Removing them reduces noise and dimensionality. “The quick brown fox jumps over the lazy dog” becomes “quick brown fox jumps lazy dog” — the core meaning survives.
But stopword removal isn’t always appropriate:
| Task | Remove Stopwords? | Why |
|---|---|---|
| Topic modeling | Yes | Focus on content words |
| Search indexing | Yes | Reduce index size |
| Sentiment analysis | Careful | “Not good” → “good” flips meaning |
| Machine translation | No | Grammar needs function words |
| NER | Usually no | Context helps entity detection |
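The table's task-dependence can be made concrete with a small sketch. The stopword set below is hand-rolled for illustration — real pipelines use the much larger lists shipped with NLTK or spaCy — and the `keep` whitelist shows how a sentiment pipeline would protect negations:

```python
# Tiny hand-rolled stopword list -- NLTK and spaCy ship far larger ones
STOPWORDS = {"the", "is", "at", "in", "a", "an", "on", "over", "not"}

def remove_stopwords(tokens: list[str], keep: set[str] = frozenset()) -> list[str]:
    """Drop stopwords, but let the caller whitelist words like 'not'."""
    return [t for t in tokens if t.lower() not in STOPWORDS or t.lower() in keep]

tokens = "The quick brown fox jumps over the lazy dog".split()
print(remove_stopwords(tokens))
# → ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

# For sentiment analysis, whitelist negations so "not good" survives intact
print(remove_stopwords(["not", "good"]))                  # → ['good']  (meaning flipped!)
print(remove_stopwords(["not", "good"], keep={"not"}))    # → ['not', 'good']
```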
Step 4: Stemming vs Lemmatization
Both reduce words to their base form, but differently.
Stemming chops endings using rules:
- “running” → “run” (strips -ing to “runn”; Porter’s cleanup then collapses the double consonant)
- “studies” → “studi”
- “happily” → “happili”
Fast. Crude. Often produces non-words.
Lemmatization finds the actual dictionary base form:
- “running” → “run”
- “studies” → “study”
- “better” → “good”
- “was” → “be”
Slower but accurate. Needs to know the word’s part of speech — “saw” as a verb lemmatizes to “see,” but “saw” as a noun stays “saw.”
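The difference is easy to see side by side. Below is a toy suffix-stripping stemmer (deliberately crude — it lacks Porter's cleanup rules, so it produces non-words) next to a tiny lemma dictionary; real code would use NLTK's `PorterStemmer` or spaCy's lemmatizer instead:

```python
def crude_stem(word: str) -> str:
    """Toy rule-based stemmer: chop a common suffix, no dictionary check."""
    for suffix in ("ies", "ing", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization needs a dictionary (and ideally POS tags) -- tiny sample here
LEMMAS = {"running": "run", "studies": "study", "better": "good", "was": "be"}

def lemmatize(word: str) -> str:
    return LEMMAS.get(word, word)

for w in ("running", "studies", "happily"):
    print(f"{w}: stem={crude_stem(w)}  lemma={lemmatize(w)}")
# running → stem "runn", studies → stem "stud": fast, crude, non-words
```

The stemmer is a few string operations per word; the lemmatizer needs a lookup table (and POS information to resolve cases like “saw”) — which is exactly the speed/accuracy trade-off described above.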
Modern NLP tools default to lemmatization. spaCy handles it automatically as part of its pipeline — tokenization, POS tagging, and lemmatization in a single pass across 75+ languages.
✅ Quick Check: When would you choose stemming over lemmatization? When speed matters more than precision. Search engines often use stemming to match “running,” “runs,” and “ran” to the same index entry. The slight inaccuracy doesn’t matter because search is about recall (finding relevant documents), not linguistic precision. For tasks where word meaning matters — classification, sentiment, NER — lemmatization is almost always better.
Putting It Together: spaCy Pipelines
Modern NLP libraries like spaCy combine all these steps into a single processing pipeline. You feed in raw text, and spaCy returns tokenized, lemmatized, POS-tagged text with named entities already identified.
spaCy’s pipeline runs: tokenizer → tagger → parser → NER, processing each step automatically. This is a massive productivity gain over manual preprocessing — what used to take 100 lines of custom code now takes a few lines with spaCy.
The key insight: preprocessing isn’t a one-size-fits-all recipe. Every decision — whether to lowercase, which stopwords to remove, whether to stem or lemmatize — depends on your downstream task.
Key Takeaways
- Text preprocessing bridges raw text and machine-readable input — typically 50-80% of NLP project time
- The pipeline: cleaning → tokenization → stopword removal → stemming/lemmatization
- Subword tokenization (BERT, GPT) handles unknown words by breaking them into known fragments
- Stopword removal is task-dependent — removing “not” destroys sentiment; keeping it helps classification
- Lemmatization (dictionary lookup) beats stemming (rule-based chopping) for accuracy; stemming wins on speed
- spaCy handles the entire pipeline in one pass across 75+ languages
Up Next
Preprocessed text is clean, but it’s still text — and models need numbers. Lesson 3 covers how to convert words into numerical representations: bag-of-words, TF-IDF, Word2Vec, and the transformer embeddings that power modern NLP.