Text Representations
How to convert words into numbers — bag-of-words, TF-IDF, Word2Vec embeddings, and transformer representations that capture meaning.
Words to Numbers
Lesson 2 covered how to preprocess raw text — cleaning, tokenizing, removing stopwords, and lemmatizing. But even preprocessed text is still just strings of characters. Machine learning models need numbers. This lesson covers how to convert words into numerical representations that capture meaning.
Bag-of-Words: The Simplest Approach
Bag-of-words (BoW) counts how many times each word appears in a document, ignoring word order entirely.
Example: Two documents:
- Doc 1: “The cat sat on the mat”
- Doc 2: “The dog sat on the log”
The vocabulary is: {the, cat, sat, on, mat, dog, log}
| | the | cat | sat | on | mat | dog | log |
|---|---|---|---|---|---|---|---|
| Doc 1 | 2 | 1 | 1 | 1 | 1 | 0 | 0 |
| Doc 2 | 2 | 0 | 1 | 1 | 0 | 1 | 1 |
Each document becomes a vector of word counts. Simple, fast, and surprisingly effective for many tasks — spam detection, topic classification, and document similarity all work well with BoW.
Limitation: It ignores word order. “Dog bites man” and “Man bites dog” produce identical vectors. And common words like “the” dominate the counts without adding meaning.
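The counting above can be sketched in a few lines of plain Python. This is a minimal illustration using `collections.Counter`, not a production vectorizer (libraries like scikit-learn provide optimized versions):

```python
from collections import Counter

def bag_of_words(docs):
    """Build count vectors over a shared vocabulary, ignoring word order."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted(set(word for tokens in tokenized for word in tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["The cat sat on the mat", "The dog sat on the log"])
print(vocab)    # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 1, 1, 1, 2], [0, 1, 1, 0, 1, 1, 2]]
```

Note that `bag_of_words(["dog bites man"])` and `bag_of_words(["man bites dog"])` produce identical count vectors — the order-blindness limitation in action.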
TF-IDF: Smarter Counting
TF-IDF (Term Frequency × Inverse Document Frequency) fixes BoW’s biggest problem: it downweights words that appear everywhere and boosts words that are distinctive to a specific document.
How it works:
- TF (Term Frequency): How often a word appears in this document
- IDF (Inverse Document Frequency): How rare the word is across all documents
- TF-IDF = TF × IDF
A word that appears frequently in one document but rarely in others gets a high score. “The” appears everywhere — low IDF, low score. “Cryptocurrency” appears in few articles — high IDF, high score.
✅ Quick Check: A search engine indexes 1 million web pages. You search for “quantum computing basics.” Why does TF-IDF return relevant results better than raw word counts? TF-IDF ranks pages highly where “quantum” and “computing” are distinctive — high TF on that page, and high IDF because the terms are rare across the corpus. Raw counts could rank any page that repeats a common word like “basics” thousands of times — even if it’s about cooking basics. TF-IDF surfaces the pages where your search terms are genuinely distinctive.
TF-IDF is still widely used. Search engines, recommendation systems, and document similarity tools rely on it because it’s fast, interpretable, and works without any training data.
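The TF × IDF recipe above can be sketched directly. This uses the textbook variant `tf × log(N/df)`; real implementations (e.g. scikit-learn's `TfidfVectorizer`) add smoothing and normalization, which are omitted here for clarity:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document TF-IDF scores (textbook variant: tf * log(N/df))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return scores

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "the cryptocurrency market"]
scores = tf_idf(docs)
# "the" appears in every document, so its IDF — and its score — is zero
print(scores[2]["the"])             # 0.0
# "cryptocurrency" appears in only one document, so it scores highest there
print(scores[2]["cryptocurrency"])  # log(3/1) ≈ 1.10
```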
Word2Vec: Words as Vectors
Bag-of-words and TF-IDF treat each word as independent — “happy” and “joyful” are as different as “happy” and “table.” Word2Vec (Google, 2013) changed this by representing words as dense vectors in a continuous space where similar words cluster together.
The insight: Words that appear in similar contexts have similar meanings. “Dog” and “cat” frequently appear near “pet,” “vet,” “cute,” “food” — so their vectors end up close together. “Dog” and “parliament” rarely share context — their vectors are far apart.
Two training approaches:
- CBOW (Continuous Bag of Words): Predict the target word from surrounding words
- Skip-gram: Predict surrounding words from the target word
The result: each word gets a vector (typically 100-300 dimensions) that encodes its semantic meaning.
The breakthrough: Vector arithmetic captures relationships.
- king - man + woman ≈ queen (gender relationship)
- Paris - France + Germany ≈ Berlin (capital-country relationship)
- walked - walking + swimming ≈ swam (tense relationship)
These relationships aren’t programmed — they emerge from patterns in billions of words of text.
Limitation: Each word gets one fixed vector regardless of context. “Bank” has the same vector whether it means a financial institution or a river bank.
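The vector arithmetic can be illustrated with hand-crafted toy vectors. The numbers below are invented for this example — real Word2Vec vectors are 100–300 dimensions and learned from billions of words — but the nearest-neighbor lookup by cosine similarity works the same way:

```python
import numpy as np

# Toy 3-d "embeddings" crafted by hand; illustrative only.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "table": np.array([0.5, 0.5, 0.1]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(vec, exclude):
    """Word whose embedding is most similar to vec, skipping query words."""
    candidates = {w: v for w, v in emb.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(vec, candidates[w]))

# king - man + woman ≈ queen
result = emb["king"] - emb["man"] + emb["woman"]
print(nearest(result, exclude={"king", "man", "woman"}))  # queen
```

To train real vectors from your own corpus, libraries such as gensim implement both the CBOW and skip-gram objectives.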
GloVe: Global Vectors
GloVe (Stanford, 2014) takes a different approach to the same goal. Instead of predicting context words, it builds a co-occurrence matrix — how often each pair of words appears near each other across the entire corpus — then factorizes it into word vectors.
The result is similar to Word2Vec: dense vectors where semantic similarity maps to vector proximity. GloVe often captures global patterns (word co-occurrence across the full dataset) better than Word2Vec, which only looks at local context windows.
In practice, Word2Vec and GloVe produce roughly comparable results. Both are called static embeddings — each word gets a single, fixed representation.
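The co-occurrence-then-factorize idea can be sketched on a toy corpus. Note the simplification: actual GloVe fits vectors with a weighted least-squares objective on log co-occurrence counts, whereas this sketch substitutes a plain truncated SVD to show the overall shape of the approach:

```python
import numpy as np

def cooccurrence_matrix(docs, window=2):
    """Count how often each word pair appears within `window` tokens."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted(set(w for toks in tokenized for w in toks))
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for toks in tokenized:
        for i, w in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if i != j:
                    M[index[w], index[toks[j]]] += 1
    return vocab, M

vocab, M = cooccurrence_matrix(["the cat sat", "the dog sat"])
# Factorize the matrix to get dense 2-d word vectors
# (stand-in for GloVe's weighted least-squares objective)
U, S, Vt = np.linalg.svd(M)
vectors = U[:, :2] * S[:2]
```

Because "cat" and "dog" share the same neighbors ("the", "sat") in this tiny corpus, their rows in `M` — and hence their factorized vectors — come out similar.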
✅ Quick Check: Why are Word2Vec and GloVe called “static” embeddings? Because each word gets exactly one vector, regardless of how it’s used. The word “bank” has the same vector in “river bank” and “bank account.” This is a problem for polysemous words (words with multiple meanings) and is the key limitation that contextual embeddings (BERT) were designed to solve.
Transformer Embeddings: Context Changes Everything
BERT (Google, 2018) introduced contextual embeddings — vectors that change based on surrounding words.
Instead of one fixed vector for “bank,” BERT generates:
- A finance-related vector for “I deposited money at the bank”
- A geography-related vector for “I sat on the river bank”
How it works: BERT reads the entire sentence at once (bidirectionally) and uses self-attention to weigh how each word relates to every other word. The resulting embedding for each word is shaped by its full context.
This is why BERT dramatically outperformed static embeddings on benchmarks like GLUE — tasks like question answering, sentiment analysis, and NER all depend on understanding which meaning of a word is intended.
Comparison Table
| Method | Year | Captures Meaning? | Context-Aware? | Speed | Use Case |
|---|---|---|---|---|---|
| Bag-of-Words | Classic | No | No | Very fast | Topic classification, spam |
| TF-IDF | Classic | Partially | No | Very fast | Search, keyword extraction |
| Word2Vec | 2013 | Yes (static) | No | Fast | Similarity, analogy tasks |
| GloVe | 2014 | Yes (static) | No | Fast | Similar to Word2Vec |
| BERT embeddings | 2018 | Yes (contextual) | Yes | Slower | NER, QA, sentiment, classification |
Each method builds on the last. You don’t need to choose just one — many NLP pipelines use TF-IDF for fast filtering and transformer embeddings for detailed analysis.
Key Takeaways
- Bag-of-words counts word frequency — simple, fast, ignores meaning and order
- TF-IDF weighs words by distinctiveness — high score means a word is important to that specific document
- Word2Vec/GloVe encode words as dense vectors where similar words cluster together — “king - man + woman ≈ queen”
- Static embeddings (Word2Vec, GloVe) give each word one fixed vector regardless of context
- Transformer embeddings (BERT) generate context-dependent vectors — “bank” gets different representations in different sentences
- The evolution: counts (BoW) → weighted counts (TF-IDF) → static vectors (Word2Vec) → contextual vectors (BERT)
Up Next
Now that you know how to turn words into numbers, it’s time to do something with those numbers. Lesson 4 covers text classification — how to train models that automatically sort documents into categories like spam/not-spam, positive/negative, or topic labels.