Text Representations
How to convert words into numbers — bag-of-words, TF-IDF, Word2Vec embeddings, and transformer representations that capture meaning.
Words to Numbers
Lesson 2 covered how to preprocess raw text — cleaning, tokenizing, removing stopwords, and lemmatizing. But even preprocessed text is still just strings of characters. Machine learning models need numbers. This lesson covers how to convert words into numerical representations that capture meaning.
Bag-of-Words: The Simplest Approach
Bag-of-words (BoW) counts how many times each word appears in a document, ignoring word order entirely.
Example: Two documents:
- Doc 1: “The cat sat on the mat”
- Doc 2: “The dog sat on the log”
The vocabulary is: {the, cat, sat, on, mat, dog, log}
| | the | cat | sat | on | mat | dog | log |
|---|---|---|---|---|---|---|---|
| Doc 1 | 2 | 1 | 1 | 1 | 1 | 0 | 0 |
| Doc 2 | 2 | 0 | 1 | 1 | 0 | 1 | 1 |
Each document becomes a vector of word counts. Simple, fast, and surprisingly effective for many tasks — spam detection, topic classification, and document similarity all work well with BoW.
Limitation: It ignores word order. “Dog bites man” and “Man bites dog” produce identical vectors. And common words like “the” dominate the counts without adding meaning.
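The counting above can be sketched in a few lines of plain Python. This is a minimal illustration using `collections.Counter`, not a production vectorizer (libraries like scikit-learn provide optimized versions):

```python
from collections import Counter

def bag_of_words(docs):
    """Build count vectors over a shared vocabulary, ignoring word order."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted(set(word for tokens in tokenized for word in tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["The cat sat on the mat", "The dog sat on the log"])
print(vocab)    # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 1, 1, 1, 2], [0, 1, 1, 0, 1, 1, 2]]
```

Note that `bag_of_words(["dog bites man"])` and `bag_of_words(["man bites dog"])` produce identical count vectors — the order-blindness limitation in action.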
TF-IDF: Smarter Counting
TF-IDF (Term Frequency × Inverse Document Frequency) fixes BoW’s biggest problem: it downweights words that appear everywhere and boosts words that are distinctive to a specific document.
How it works:
- TF (Term Frequency): How often a word appears in this document
- IDF (Inverse Document Frequency): How rare the word is across all documents
- TF-IDF = TF × IDF
A word that appears frequently in one document but rarely in others gets a high score. “The” appears everywhere — low IDF, low score. “Cryptocurrency” appears in few articles — high IDF, high score.
✅ Quick Check: A search engine indexes 1 million web pages. You search for “quantum computing basics.” Why does TF-IDF return relevant results better than raw word counts? TF-IDF ranks pages highly where “quantum” and “computing” are distinctive — high TF on that page, and high IDF because the terms are rare across the corpus. Raw counts could rank any page that repeats a common word like “basics” thousands of times — even if it’s about cooking basics. TF-IDF surfaces the pages where your search terms are genuinely distinctive.
TF-IDF is still widely used. Search engines, recommendation systems, and document similarity tools rely on it because it’s fast, interpretable, and works without any training data.
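The TF × IDF recipe above can be sketched directly. This uses the textbook variant `tf × log(N/df)`; real implementations (e.g. scikit-learn's `TfidfVectorizer`) add smoothing and normalization, which are omitted here for clarity:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document TF-IDF scores (textbook variant: tf * log(N/df))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return scores

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "the cryptocurrency market"]
scores = tf_idf(docs)
# "the" appears in every document, so its IDF — and its score — is zero
print(scores[2]["the"])             # 0.0
# "cryptocurrency" appears in only one document, so it scores highest there
print(scores[2]["cryptocurrency"])  # log(3/1) ≈ 1.10
```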
Word2Vec: Words as Vectors
Bag-of-words and TF-IDF treat each word as independent — “happy” and “joyful” are as different as “happy” and “table.” Word2Vec (Google, 2013) changed this by representing words as dense vectors in a continuous space where similar words cluster together.
The insight: Words that appear in similar contexts have similar meanings. “Dog” and “cat” frequently appear near “pet,” “vet,” “cute,” “food” — so their vectors end up close together. “Dog” and “parliament” rarely share context — their vectors are far apart.
Two training approaches:
- CBOW (Continuous Bag of Words): Predict the target word from surrounding words
- Skip-gram: Predict surrounding words from the target word
The result: each word gets a vector (typically 100-300 dimensions) that encodes its semantic meaning.
The breakthrough: Vector arithmetic captures relationships.
- king - man + woman ≈ queen (gender relationship)
- Paris - France + Germany ≈ Berlin (capital-country relationship)
- walked - walking + swimming ≈ swam (tense relationship)
These relationships aren’t programmed — they emerge from patterns in billions of words of text.
Limitation: Each word gets one fixed vector regardless of context. “Bank” has the same vector whether it means a financial institution or a river bank.
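The vector arithmetic can be illustrated with hand-crafted toy vectors. The numbers below are invented for this example — real Word2Vec vectors are 100–300 dimensions and learned from billions of words — but the nearest-neighbor lookup by cosine similarity works the same way:

```python
import numpy as np

# Toy 3-d "embeddings" crafted by hand; illustrative only.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "table": np.array([0.5, 0.5, 0.1]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(vec, exclude):
    """Word whose embedding is most similar to vec, skipping query words."""
    candidates = {w: v for w, v in emb.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(vec, candidates[w]))

# king - man + woman ≈ queen
result = emb["king"] - emb["man"] + emb["woman"]
print(nearest(result, exclude={"king", "man", "woman"}))  # queen
```

To train real vectors from your own corpus, libraries such as gensim implement both the CBOW and skip-gram objectives.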
GloVe: Global Vectors
GloVe (Stanford, 2014) takes a different approach to the same goal. Instead of predicting context words, it builds a co-occurrence matrix — how often each pair of words appears near each other across the entire corpus — then factorizes it into word vectors.
The result is similar to Word2Vec: dense vectors where semantic similarity maps to vector proximity. GloVe often captures global patterns (word co-occurrence across the full dataset) better than Word2Vec, which only looks at local context windows.
In practice, Word2Vec and GloVe produce roughly comparable results. Both are called static embeddings — each word gets a single, fixed representation.
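The co-occurrence-then-factorize idea can be sketched on a toy corpus. Note the simplification: actual GloVe fits vectors with a weighted least-squares objective on log co-occurrence counts, whereas this sketch substitutes a plain truncated SVD to show the overall shape of the approach:

```python
import numpy as np

def cooccurrence_matrix(docs, window=2):
    """Count how often each word pair appears within `window` tokens."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted(set(w for toks in tokenized for w in toks))
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for toks in tokenized:
        for i, w in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if i != j:
                    M[index[w], index[toks[j]]] += 1
    return vocab, M

vocab, M = cooccurrence_matrix(["the cat sat", "the dog sat"])
# Factorize the matrix to get dense 2-d word vectors
# (stand-in for GloVe's weighted least-squares objective)
U, S, Vt = np.linalg.svd(M)
vectors = U[:, :2] * S[:2]
```

Because "cat" and "dog" share the same neighbors ("the", "sat") in this tiny corpus, their rows in `M` — and hence their factorized vectors — come out similar.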
✅ Quick Check: Why are Word2Vec and GloVe called “static” embeddings? Because each word gets exactly one vector, regardless of how it’s used. The word “bank” has the same vector in “river bank” and “bank account.” This is a problem for polysemous words (words with multiple meanings) and is the key limitation that contextual embeddings (BERT) were designed to solve.
Transformer Embeddings: Context Changes Everything
BERT (Google, 2018) introduced contextual embeddings — vectors that change based on surrounding words.
Instead of one fixed vector for “bank,” BERT generates:
- A finance-related vector for “I deposited money at the bank”
- A geography-related vector for “I sat on the river bank”
How it works: BERT reads the entire sentence at once (bidirectionally) and uses self-attention to weigh how each word relates to every other word. The resulting embedding for each word is shaped by its full context.
This is why BERT dramatically outperformed static embeddings on benchmarks like GLUE — tasks like question answering, sentiment analysis, and NER all depend on understanding which meaning of a word is intended.
Comparison Table
| Method | Year | Captures Meaning? | Context-Aware? | Speed | Use Case |
|---|---|---|---|---|---|
| Bag-of-Words | Classic | No | No | Very fast | Topic classification, spam |
| TF-IDF | Classic | Partially | No | Very fast | Search, keyword extraction |
| Word2Vec | 2013 | Yes (static) | No | Fast | Similarity, analogy tasks |
| GloVe | 2014 | Yes (static) | No | Fast | Similar to Word2Vec |
| BERT embeddings | 2018 | Yes (contextual) | Yes | Slower | NER, QA, sentiment, classification |
Each method builds on the last. You don’t need to choose just one — many NLP pipelines use TF-IDF for fast filtering and transformer embeddings for detailed analysis.
Key Takeaways
- Bag-of-words counts word frequency — simple, fast, ignores meaning and order
- TF-IDF weighs words by distinctiveness — high score means a word is important to that specific document
- Word2Vec/GloVe encode words as dense vectors where similar words cluster together — “king - man + woman ≈ queen”
- Static embeddings (Word2Vec, GloVe) give each word one fixed vector regardless of context
- Transformer embeddings (BERT) generate context-dependent vectors — “bank” gets different representations in different sentences
- The evolution: counts (BoW) → weighted counts (TF-IDF) → static vectors (Word2Vec) → contextual vectors (BERT)
Up Next
Now that you know how to turn words into numbers, it’s time to do something with those numbers. Lesson 4 covers text classification — how to train models that automatically sort documents into categories like spam/not-spam, positive/negative, or topic labels.