Embeddings and Vector Databases
Understand how embedding models convert text to vectors, how vector similarity search works, and how to choose between Pinecone, Weaviate, Qdrant, Chroma, and pgvector.
Chunking gives you pieces of text. Embeddings give those pieces meaning — converting words into numbers that capture semantic relationships. Vector databases store and search those numbers at scale.
🔄 Quick Recall: In the previous lesson, you learned chunking strategies for splitting documents into retrievable pieces. Now you’ll learn how those pieces get converted into searchable vectors and stored in databases designed for fast similarity search.
How Embeddings Work
An embedding model converts text into a high-dimensional vector — a list of numbers (typically 768-3072 dimensions) that represents the text’s meaning.
"How do I return a product?" → [0.23, -0.45, 0.67, 0.12, ...]
"What's the return policy?" → [0.21, -0.43, 0.65, 0.14, ...]
"Today's weather forecast" → [0.89, 0.34, -0.22, 0.56, ...]
Notice: the first two vectors are nearly identical (similar meaning), while the third is very different (unrelated topic). This is how semantic search works — similar meanings produce similar vectors.
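You can verify this with a few lines of code. The sketch below computes cosine similarity over the truncated four-number vectors shown above (real embeddings have hundreds or thousands of dimensions, but the math is identical):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

returns_question = [0.23, -0.45, 0.67, 0.12]  # "How do I return a product?"
policy_question  = [0.21, -0.43, 0.65, 0.14]  # "What's the return policy?"
weather_text     = [0.89, 0.34, -0.22, 0.56]  # "Today's weather forecast"

print(cosine_similarity(returns_question, policy_question))  # ≈ 0.999 (similar meaning)
print(cosine_similarity(returns_question, weather_text))     # ≈ -0.03 (unrelated)
```

Scores near 1.0 mean "nearly the same meaning"; scores near 0 (or negative) mean "unrelated."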
Key Properties
Semantic similarity: Words with similar meanings map to nearby vectors, even with different wording (“refund” ≈ “money back” ≈ “return”).
Language agnostic: Multilingual embedding models map “return policy” (English) near “política de devoluciones” (Spanish) because they mean the same thing.
Fixed dimensions: Every text input (whether 3 words or 3 paragraphs) produces a vector of the same length. This fixed size is what makes any two vectors directly comparable.
Choosing an Embedding Model
| Model | Dimensions | Strengths | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Good quality, affordable | General purpose, budget-conscious |
| OpenAI text-embedding-3-large | 3072 | Highest quality from OpenAI | Maximum accuracy, cost is secondary |
| Cohere embed-v3 | 1024 | Strong multilingual, hybrid search | Multi-language knowledge bases |
| Voyage AI voyage-3 | 1024 | Excellent for code and technical docs | Technical documentation |
| Open-source (BGE, E5) | 768-1024 | Free, self-hosted, privacy | Data-sensitive, on-premise |
Rule of thumb: Start with OpenAI text-embedding-3-small for prototyping. Evaluate others when you need specific capabilities (multilingual, code, privacy).
✅ Quick Check: A hospital wants to build a RAG system for medical records. They cannot send patient data to external APIs for embedding. Which embedding model category should they use? (Answer: Open-source self-hosted models like BGE or E5. These run on the hospital’s own infrastructure, so patient data never leaves their network. The quality trade-off vs. commercial models is small, and the privacy guarantee is absolute. For sensitive data, self-hosted embedding is not optional — it’s required.)
How Vector Search Works
Distance Metrics
Vector databases find similar vectors using distance metrics:
| Metric | How It Works | When to Use |
|---|---|---|
| Cosine similarity | Measures angle between vectors | Most RAG use cases (default) |
| Euclidean distance | Measures straight-line distance | When vector magnitude matters |
| Dot product | Measures magnitude-weighted similarity | When some documents should rank higher |
Default choice: Cosine similarity. It’s the most common for text embeddings because it focuses on direction (meaning) rather than magnitude (length).
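The "direction vs. magnitude" distinction is easiest to see with a toy example. Below, `longer` points in exactly the same direction as `v` but is three times as long; cosine treats them as identical, while Euclidean distance and dot product do not:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

v = [0.3, 0.4, 0.5]
longer = [x * 3 for x in v]  # same direction, three times the magnitude

print(cosine(v, longer))     # ≈ 1.0 -> same "meaning" regardless of length
print(euclidean(v, longer))  # > 0  -> treated as different points in space
print(dot(v, longer))        # 3x dot(v, v) -> magnitude inflates the score
```

This is why cosine is the safe default for text: a long chunk and a short chunk about the same topic should still match.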
Approximate Nearest Neighbor (ANN)
At scale, checking every vector is too slow. ANN algorithms trade tiny accuracy losses for massive speed gains:
HNSW (Hierarchical Navigable Small World): Builds a graph structure for fast neighbor-hopping. Most popular ANN algorithm.
- Recall: 95-99% (finds the right results almost always)
- Speed: milliseconds even at millions of vectors
- Memory: requires all vectors in RAM
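To make "approximate" concrete, here is the exact brute-force search that HNSW approximates. At a few thousand vectors this O(n) scan is perfectly fine; ANN indexes exist because it stops being fine at millions. The document IDs and three-dimensional vectors are hypothetical toy data:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def exact_knn(query, vectors, k=2):
    """Brute-force nearest neighbors: score every stored vector against the query.
    This is the ground truth that an ANN index like HNSW recovers at 95-99% recall."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

vectors = {
    "returns-faq":   [0.23, -0.45, 0.67],
    "refund-policy": [0.21, -0.43, 0.65],
    "weather-page":  [0.89, 0.34, -0.22],
}

# Returns the two return-related chunks; the weather page never makes the cut.
print(exact_knn([0.22, -0.44, 0.66], vectors, k=2))
```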
Vector Database Comparison
| Database | Type | Best For | Trade-off |
|---|---|---|---|
| Chroma | Embedded (in-process) | Prototyping, small datasets | No production features |
| pgvector | PostgreSQL extension | Teams already using PostgreSQL | Performance ceiling at scale |
| Pinecone | Fully managed cloud | Fast setup, zero ops | Higher cost, vendor lock-in |
| Weaviate | Open-source + managed | Hybrid search, ML integrations | More complex setup |
| Qdrant | Open-source + managed | Performance, filtering | Smaller ecosystem |
| Milvus | Open-source + managed | Billions of vectors | Complex to operate |
Decision Framework
Starting a prototype?
→ Chroma (zero setup, pip install)
Already using PostgreSQL?
→ pgvector (add extension, no new database)
Need production without ops overhead?
→ Pinecone (fully managed)
Need hybrid search (vector + keyword)?
→ Weaviate (built-in hybrid)
Need maximum performance per dollar?
→ Qdrant (efficient, strong filtering)
Need billion-vector scale?
→ Milvus (designed for extreme scale)
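The framework above can be encoded as a first-match-wins helper. This is purely illustrative (the function name, parameters, and the 5M-vector pgvector ceiling are assumptions drawn from this lesson, not from any library):

```python
def choose_vector_db(stage="prototype", uses_postgres=False, needs_hybrid=False,
                     vector_count=0, managed_ok=True):
    """Illustrative encoding of the decision framework above -- first match wins."""
    if stage == "prototype":
        return "Chroma"        # zero setup, pip install
    if uses_postgres and vector_count < 5_000_000:
        return "pgvector"      # extension on the database you already run
    if needs_hybrid:
        return "Weaviate"      # built-in vector + keyword search
    if vector_count >= 1_000_000_000:
        return "Milvus"        # designed for extreme scale
    if managed_ok:
        return "Pinecone"      # fully managed, zero ops
    return "Qdrant"            # efficient self-hosted default

print(choose_vector_db(stage="production", uses_postgres=True, vector_count=50_000))
# -> pgvector
```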
✅ Quick Check: Your company uses PostgreSQL for everything. Your RAG knowledge base has 50,000 document chunks. A colleague suggests Pinecone. Another suggests pgvector. Who’s right? (Answer: At 50,000 chunks, pgvector is the pragmatic choice. It runs as a PostgreSQL extension — no new infrastructure, no new vendor, no data migration. pgvector handles up to 1-5 million vectors well. If you grow to millions of vectors or need sub-millisecond latency, then evaluate Pinecone or Qdrant. But don’t add complexity before you need it.)
Indexing Pipeline in Practice
Putting it all together — a complete indexing pipeline:
1. Document arrives (PDF, HTML, Markdown)
↓
2. Parse text + extract metadata
↓
3. Chunk using appropriate strategy
(fixed-size for emails, structure-aware for docs)
↓
4. Generate embeddings for each chunk
(batch processing: 100+ chunks per API call)
↓
5. Store vectors + metadata in vector database
↓
6. Verify: spot-check retrieval on test queries
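Steps 2-5 can be sketched end to end in a few functions. Everything here is a stand-in: `embed_batch` fakes the embedding API with a trivial formula, the "vector database" is a plain dict, and the character-based chunker is the crudest possible strategy — the point is the shape of the pipeline, not any one implementation:

```python
def parse(raw_doc):
    """Step 2: extract text + metadata (stubbed)."""
    return raw_doc["text"], {"source": raw_doc["source"]}

def chunk(text, size=40):
    """Step 3: fixed-size chunking (character-based, for brevity only)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed_batch(chunks):
    """Step 4: stub embedding -- a real pipeline calls an embedding API here."""
    return [[len(c) / 100.0, c.count("e") / 10.0] for c in chunks]

def index(raw_doc, store):
    """Steps 2-5: parse, chunk, embed in one batch, store vectors + metadata."""
    text, meta = parse(raw_doc)
    chunks = chunk(text)
    vectors = embed_batch(chunks)  # one call per batch, not per chunk
    for i, (c, v) in enumerate(zip(chunks, vectors)):
        store[f"{meta['source']}#{i}"] = {"text": c, "vector": v, **meta}
    return store

store = index(
    {"source": "faq.md",
     "text": "Returns are accepted within 30 days of purchase with a receipt."},
    {},
)
print(len(store))  # number of chunks stored, each with its vector and metadata
```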
Batch Processing Tip
Embedding APIs support batch requests. Instead of one API call per chunk:
❌ Slow: 150,000 API calls (one per chunk)
✅ Fast: 1,500 API calls (100 chunks each)
Batching reduces latency and often reduces cost.
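The batching itself is a one-liner. The sketch below shows the arithmetic from the tip: 150,000 chunks in batches of 100 means 1,500 requests (each `batch` would be the `input` to a single embedding API call):

```python
def batched(items, batch_size=100):
    """Split a list into successive batches for bulk embedding requests."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

chunks = [f"chunk-{i}" for i in range(150_000)]
batches = batched(chunks)

print(len(chunks))   # 150000 chunks
print(len(batches))  # 1500 API calls instead of 150000
```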
Practice Exercise
- Take 10 sentences from your domain — 5 pairs of semantically similar sentences and 5 unrelated
- Predict which pairs an embedding model would score as most similar
- Look up the vector database decision framework: which database fits your current situation?
Key Takeaways
- Embeddings convert text to numerical vectors where similar meanings produce similar numbers
- All documents and queries must use the same embedding model — mixing models produces garbage results
- Cosine similarity is the default distance metric for text RAG
- HNSW approximate search trades minimal accuracy for massive speed gains at scale
- Database choice depends on your stage: Chroma for prototyping, pgvector for PostgreSQL shops, managed services for production
- Batch embedding requests for efficiency — 100 chunks per API call instead of one
Up Next
In the next lesson, you’ll learn retrieval strategies that go beyond basic vector similarity — hybrid search, reranking, and query rewriting that dramatically improve what your system finds.