Embeddings and Vector Databases
Understand how embedding models convert text to vectors, how vector similarity search works, and how to choose between Pinecone, Weaviate, Qdrant, Chroma, and pgvector.
Chunking gives you pieces of text. Embeddings give those pieces meaning — converting words into numbers that capture semantic relationships. Vector databases store and search those numbers at scale.
🔄 Quick Recall: In the previous lesson, you learned chunking strategies for splitting documents into retrievable pieces. Now you’ll learn how those pieces get converted into searchable vectors and stored in databases designed for fast similarity search.
How Embeddings Work
An embedding model converts text into a high-dimensional vector — a list of numbers (typically 768-3072 dimensions) that represents the text’s meaning.
"How do I return a product?" → [0.23, -0.45, 0.67, 0.12, ...]
"What's the return policy?" → [0.21, -0.43, 0.65, 0.14, ...]
"Today's weather forecast" → [0.89, 0.34, -0.22, 0.56, ...]
Notice: the first two vectors are nearly identical (similar meaning), while the third is very different (unrelated topic). This is how semantic search works — similar meanings produce similar vectors.
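You can verify this with a few lines of code. The sketch below computes cosine similarity over the truncated four-number vectors shown above (real embeddings have hundreds or thousands of dimensions, but the math is identical):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

returns_question = [0.23, -0.45, 0.67, 0.12]  # "How do I return a product?"
policy_question  = [0.21, -0.43, 0.65, 0.14]  # "What's the return policy?"
weather_text     = [0.89, 0.34, -0.22, 0.56]  # "Today's weather forecast"

print(cosine_similarity(returns_question, policy_question))  # ≈ 0.999 (similar meaning)
print(cosine_similarity(returns_question, weather_text))     # ≈ -0.03 (unrelated)
```

Scores near 1.0 mean "nearly the same meaning"; scores near 0 (or negative) mean "unrelated."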
Key Properties
Semantic similarity: Words with similar meanings map to nearby vectors, even with different wording (“refund” ≈ “money back” ≈ “return”).
Language agnostic: Multilingual embedding models map “return policy” (English) near “política de devoluciones” (Spanish) because they mean the same thing.
Fixed dimensions: Every text input (whether 3 words or 3 paragraphs) produces a vector of the same length. This fixed size is what makes any two vectors directly comparable.
Choosing an Embedding Model
| Model | Dimensions | Strengths | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Good quality, affordable | General purpose, budget-conscious |
| OpenAI text-embedding-3-large | 3072 | Highest quality from OpenAI | Maximum accuracy, cost is secondary |
| Cohere embed-v3 | 1024 | Strong multilingual, hybrid search | Multi-language knowledge bases |
| Voyage AI voyage-3 | 1024 | Excellent for code and technical docs | Technical documentation |
| Open-source (BGE, E5) | 768-1024 | Free, self-hosted, privacy | Data-sensitive, on-premise |
Rule of thumb: Start with OpenAI text-embedding-3-small for prototyping. Evaluate others when you need specific capabilities (multilingual, code, privacy).
✅ Quick Check: A hospital wants to build a RAG system for medical records. They cannot send patient data to external APIs for embedding. Which embedding model category should they use? (Answer: Open-source self-hosted models like BGE or E5. These run on the hospital’s own infrastructure, so patient data never leaves their network. The quality trade-off vs. commercial models is small, and the privacy guarantee is absolute. For sensitive data, self-hosted embedding is not optional — it’s required.)
How Vector Search Works
Distance Metrics
Vector databases find similar vectors using distance metrics:
| Metric | How It Works | When to Use |
|---|---|---|
| Cosine similarity | Measures angle between vectors | Most RAG use cases (default) |
| Euclidean distance | Measures straight-line distance | When vector magnitude matters |
| Dot product | Measures magnitude-weighted similarity | When some documents should rank higher |
Default choice: Cosine similarity. It’s the most common for text embeddings because it focuses on direction (meaning) rather than magnitude (length).
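The "direction vs. magnitude" distinction is easiest to see with a toy example. Below, `longer` points in exactly the same direction as `v` but is three times as long; cosine treats them as identical, while Euclidean distance and dot product do not:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

v = [0.3, 0.4, 0.5]
longer = [x * 3 for x in v]  # same direction, three times the magnitude

print(cosine(v, longer))     # ≈ 1.0 -> same "meaning" regardless of length
print(euclidean(v, longer))  # > 0  -> treated as different points in space
print(dot(v, longer))        # 3x dot(v, v) -> magnitude inflates the score
```

This is why cosine is the safe default for text: a long chunk and a short chunk about the same topic should still match.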
Approximate Nearest Neighbor (ANN)
At scale, checking every vector is too slow. ANN algorithms trade tiny accuracy losses for massive speed gains:
HNSW (Hierarchical Navigable Small World): Builds a graph structure for fast neighbor-hopping. Most popular ANN algorithm.
- Recall: 95-99% (finds the right results almost always)
- Speed: milliseconds even at millions of vectors
- Memory: requires all vectors in RAM
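To make "approximate" concrete, here is the exact brute-force search that HNSW approximates. At a few thousand vectors this O(n) scan is perfectly fine; ANN indexes exist because it stops being fine at millions. The document IDs and three-dimensional vectors are hypothetical toy data:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def exact_knn(query, vectors, k=2):
    """Brute-force nearest neighbors: score every stored vector against the query.
    This is the ground truth that an ANN index like HNSW recovers at 95-99% recall."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

vectors = {
    "returns-faq":   [0.23, -0.45, 0.67],
    "refund-policy": [0.21, -0.43, 0.65],
    "weather-page":  [0.89, 0.34, -0.22],
}

# Returns the two return-related chunks; the weather page never makes the cut.
print(exact_knn([0.22, -0.44, 0.66], vectors, k=2))
```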
Vector Database Comparison
| Database | Type | Best For | Trade-off |
|---|---|---|---|
| Chroma | Embedded (in-process) | Prototyping, small datasets | No production features |
| pgvector | PostgreSQL extension | Teams already using PostgreSQL | Performance ceiling at scale |
| Pinecone | Fully managed cloud | Fast setup, zero ops | Higher cost, vendor lock-in |
| Weaviate | Open-source + managed | Hybrid search, ML integrations | More complex setup |
| Qdrant | Open-source + managed | Performance, filtering | Smaller ecosystem |
| Milvus | Open-source + managed | Billions of vectors | Complex to operate |
Decision Framework
Starting a prototype?
→ Chroma (zero setup, pip install)
Already using PostgreSQL?
→ pgvector (add extension, no new database)
Need production without ops overhead?
→ Pinecone (fully managed)
Need hybrid search (vector + keyword)?
→ Weaviate (built-in hybrid)
Need maximum performance per dollar?
→ Qdrant (efficient, strong filtering)
Need billion-vector scale?
→ Milvus (designed for extreme scale)
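The framework above can be encoded as a first-match-wins helper. This is purely illustrative (the function name, parameters, and the 5M-vector pgvector ceiling are assumptions drawn from this lesson, not from any library):

```python
def choose_vector_db(stage="prototype", uses_postgres=False, needs_hybrid=False,
                     vector_count=0, managed_ok=True):
    """Illustrative encoding of the decision framework above -- first match wins."""
    if stage == "prototype":
        return "Chroma"        # zero setup, pip install
    if uses_postgres and vector_count < 5_000_000:
        return "pgvector"      # extension on the database you already run
    if needs_hybrid:
        return "Weaviate"      # built-in vector + keyword search
    if vector_count >= 1_000_000_000:
        return "Milvus"        # designed for extreme scale
    if managed_ok:
        return "Pinecone"      # fully managed, zero ops
    return "Qdrant"            # efficient self-hosted default

print(choose_vector_db(stage="production", uses_postgres=True, vector_count=50_000))
# -> pgvector
```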
✅ Quick Check: Your company uses PostgreSQL for everything. Your RAG knowledge base has 50,000 document chunks. A colleague suggests Pinecone. Another suggests pgvector. Who’s right? (Answer: At 50,000 chunks, pgvector is the pragmatic choice. It runs as a PostgreSQL extension — no new infrastructure, no new vendor, no data migration. pgvector handles up to 1-5 million vectors well. If you grow to millions of vectors or need sub-millisecond latency, then evaluate Pinecone or Qdrant. But don’t add complexity before you need it.)
Indexing Pipeline in Practice
Putting it all together — a complete indexing pipeline:
1. Document arrives (PDF, HTML, Markdown)
↓
2. Parse text + extract metadata
↓
3. Chunk using appropriate strategy
(fixed-size for emails, structure-aware for docs)
↓
4. Generate embeddings for each chunk
(batch processing: 100+ chunks per API call)
↓
5. Store vectors + metadata in vector database
↓
6. Verify: spot-check retrieval on test queries
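Steps 2-5 can be sketched end to end in a few functions. Everything here is a stand-in: `embed_batch` fakes the embedding API with a trivial formula, the "vector database" is a plain dict, and the character-based chunker is the crudest possible strategy — the point is the shape of the pipeline, not any one implementation:

```python
def parse(raw_doc):
    """Step 2: extract text + metadata (stubbed)."""
    return raw_doc["text"], {"source": raw_doc["source"]}

def chunk(text, size=40):
    """Step 3: fixed-size chunking (character-based, for brevity only)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed_batch(chunks):
    """Step 4: stub embedding -- a real pipeline calls an embedding API here."""
    return [[len(c) / 100.0, c.count("e") / 10.0] for c in chunks]

def index(raw_doc, store):
    """Steps 2-5: parse, chunk, embed in one batch, store vectors + metadata."""
    text, meta = parse(raw_doc)
    chunks = chunk(text)
    vectors = embed_batch(chunks)  # one call per batch, not per chunk
    for i, (c, v) in enumerate(zip(chunks, vectors)):
        store[f"{meta['source']}#{i}"] = {"text": c, "vector": v, **meta}
    return store

store = index(
    {"source": "faq.md",
     "text": "Returns are accepted within 30 days of purchase with a receipt."},
    {},
)
print(len(store))  # number of chunks stored, each with its vector and metadata
```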
Batch Processing Tip
Embedding APIs support batch requests. Instead of one API call per chunk:
❌ Slow: 150,000 API calls (one per chunk)
✅ Fast: 1,500 API calls (100 chunks each)
Batching reduces latency and often reduces cost.
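The batching itself is a one-liner. The sketch below shows the arithmetic from the tip: 150,000 chunks in batches of 100 means 1,500 requests (each `batch` would be the `input` to a single embedding API call):

```python
def batched(items, batch_size=100):
    """Split a list into successive batches for bulk embedding requests."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

chunks = [f"chunk-{i}" for i in range(150_000)]
batches = batched(chunks)

print(len(chunks))   # 150000 chunks
print(len(batches))  # 1500 API calls instead of 150000
```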
Practice Exercise
- Take 10 sentences from your domain — 5 pairs of semantically similar sentences and 5 unrelated
- Predict which pairs an embedding model would score as most similar
- Look up the vector database decision framework: which database fits your current situation?
Key Takeaways
- Embeddings convert text to numerical vectors where similar meanings produce similar numbers
- All documents and queries must use the same embedding model — mixing models produces garbage results
- Cosine similarity is the default distance metric for text RAG
- HNSW approximate search trades minimal accuracy for massive speed gains at scale
- Database choice depends on your stage: Chroma for prototyping, pgvector for PostgreSQL shops, managed services for production
- Batch embedding requests for efficiency — 100 chunks per API call instead of one
Up Next
In the next lesson, you’ll learn retrieval strategies that go beyond basic vector similarity — hybrid search, reranking, and query rewriting that dramatically improve what your system finds.