ML Methods
Semantic Search and Embeddings
Dense vector representations for semantic similarity: bi-encoders, cross-encoders, approximate nearest neighbor search, cosine similarity geometry, and the RAG retrieval pipeline.
Prerequisites
Why This Matters
Keyword search matches tokens. Semantic search matches meaning. A query about "canine nutrition" should retrieve documents about "dog food" even though they share no words. Dense embedding models make this possible by mapping text to vectors where geometric proximity corresponds to semantic similarity. This is the retrieval backbone of RAG systems, recommendation engines, and multimodal search.
Mental Model
Encode both queries and documents into vectors in $\mathbb{R}^d$. At query time, find the nearest document vectors. The embedding model must be trained so that semantically similar pairs land close together and dissimilar pairs land far apart. The search infrastructure must handle millions or billions of vectors efficiently.
Formal Setup
Dense embedding model
A function mapping variable-length text to a fixed-dimensional vector. Typically implemented as a transformer encoder followed by pooling (CLS token or mean pooling). The dimension is commonly 384, 768, or 1024.
Cosine similarity
For vectors $u, v \in \mathbb{R}^d$:

$$\cos(u, v) = \frac{u \cdot v}{\|u\|_2 \, \|v\|_2}$$

When vectors are $\ell_2$-normalized (as most embedding models produce), cosine similarity equals the dot product $u \cdot v$, and maximizing cosine similarity is equivalent to minimizing Euclidean distance: $\|u - v\|_2^2 = 2 - 2\, u \cdot v$.
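The equivalence between cosine similarity, dot product, and Euclidean distance on unit vectors can be checked numerically. A minimal pure-Python sketch:

```python
import math

def normalize(v):
    """Scale v to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity for arbitrary non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u = normalize([3.0, 4.0, 0.0])
v = normalize([1.0, 2.0, 2.0])

# For unit vectors, cosine similarity equals the dot product...
assert abs(cosine(u, v) - dot(u, v)) < 1e-12
# ...and squared Euclidean distance equals 2 - 2 * cosine.
dist_sq = sum((a - b) ** 2 for a, b in zip(u, v))
assert abs(dist_sq - (2 - 2 * dot(u, v))) < 1e-12
```

This is why ANN libraries only need inner-product search: normalizing once at indexing time makes cosine ranking and dot-product ranking identical.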
Bi-Encoders vs Cross-Encoders
Bi-encoders encode query and document independently. The query embedding is computed once and compared against precomputed document embeddings via dot product. This allows precomputation of all document embeddings and sub-linear search at query time. Sentence-BERT (Reimers & Gurevych, 2019) is the canonical bi-encoder for text.
Cross-encoders take the concatenated pair as input and produce a relevance score directly. They are more accurate because they allow token-level interaction between query and document, but they cannot precompute document representations. A cross-encoder must run inference for every (query, document) pair.
Late-interaction models (ColBERT, Khattab and Zaharia 2020) sit between the two. They store one vector per token rather than one per document and score with a MaxSim operator $S(q, d) = \sum_i \max_j \, q_i \cdot d_j$, where $q_i$ and $d_j$ range over the query and document token embeddings. This recovers most of the cross-encoder's accuracy while keeping document-side precomputation; ColBERTv2 (Santhanam et al. 2022) and PLAID (Santhanam et al. 2022) compress the per-token vectors to make this practical at scale.
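The MaxSim scoring rule fits in a few lines. A toy sketch with hand-made 2-d token vectors (illustrative only, not the output of a real ColBERT model):

```python
def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction: for each query token vector,
    take its maximum dot product over all document token vectors,
    then sum over query tokens."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Two toy "documents", each a list of per-token unit vectors.
doc_a = [[1.0, 0.0], [0.7, 0.7]]
doc_b = [[0.0, 1.0], [-0.6, 0.8]]
query = [[1.0, 0.0]]  # single query token pointing along x

# The query token matches doc_a's first token exactly (dot = 1.0)
# and neither of doc_b's tokens well, so doc_a scores higher.
assert maxsim_score(query, doc_a) > maxsim_score(query, doc_b)
```

Note the key property: `doc_a` and `doc_b` token vectors can be precomputed offline; only the max-and-sum runs at query time.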
The standard pipeline: bi-encoder retrieves top-$k$ candidates (fast, approximate), then cross-encoder reranks them (slow, accurate).
Modern Embedding Models (2023-2025)
The Sentence-BERT recipe has been substantially superseded. The 2023-2025 generation of embedding models shares a common training pattern: large-scale weakly-supervised contrastive pretraining (hundreds of millions to billions of pairs from the web), followed by supervised fine-tuning on labeled retrieval and similarity datasets, often with hard-negative mining and multi-task objectives. The notable open and commercial families:
- E5 (Wang et al. 2022/2024, Microsoft): multistage contrastive training on the CCPairs corpus; the multilingual `multilingual-e5-large` and the instruction-tuned `e5-mistral-7b-instruct` (Wang et al. 2024) extend the family to 7B parameters.
- BGE / BGE-M3 (Xiao et al. 2024, BAAI): consistently strong on the MTEB English leaderboard; BGE-M3 unifies dense, sparse (BM25-like learned weights), and multi-vector (ColBERT-style) retrieval in a single model.
- GTE (Li et al. 2023, Alibaba) and Snowflake Arctic Embed (Merrick et al. 2024): smaller, fast, often Pareto-optimal at moderate dimensions.
- Nomic Embed (Nussbaum et al. 2024): fully open (data, weights, training code) 8192-context embedding model.
- Voyage AI (`voyage-3`, `voyage-code-3`, `voyage-law-2`): commercial domain-specialized models that frequently top general and code retrieval benchmarks.
- OpenAI `text-embedding-3-small` / `text-embedding-3-large` (2024) and Cohere `embed-v3` (2023): commercial general-purpose embeddings with built-in Matryoshka support and explicit retrieval/clustering modes.
- Mistral Embed and Jina Embeddings v3 (Sturua et al. 2024) round out the open-weights frontier, with Jina v3 introducing task-specific LoRA adapters for retrieval, classification, separation, and matching.
For non-text and cross-modal retrieval the analogous frontier is CLIP / SigLIP / SigLIP 2 (Radford et al. 2021, Zhai et al. 2023, Tschannen et al. 2025) for image-text and CLAP (Wu et al. 2023) for audio-text.
For a new retrieval system in 2025, a strong default is:
- Use a current open embedding model (BGE-M3, Nomic Embed v1.5, or `multilingual-e5-large` for multilingual). Use a commercial model (`voyage-3`, OpenAI `text-embedding-3-large`) only if you cannot host GPUs or your domain matches a specialized commercial offering.
- L2-normalize every vector and search by inner product.
- Use Matryoshka representations if available: store full-dimension vectors and truncate at search time to trade quality for cost.
- Hybrid search: combine dense scores with BM25 (or a learned sparse model like SPLADE / BGE-M3 sparse) using reciprocal rank fusion. On most real corpora the hybrid wins by 3-10 nDCG points over either signal alone.
- Rerank the top 50-200 with a cross-encoder (`bge-reranker-v2-m3`, Cohere Rerank, or a Voyage reranker) before passing to the LLM.
- Index in a vector store sized to your scale: `pgvector` for under 10M vectors and operational simplicity; Qdrant, Weaviate, Milvus, or Vespa for self-hosted scale; Pinecone or Turbopuffer for hosted scale; FAISS or DiskANN when you need a library, not a server.
- Validate every change against MTEB (Muennighoff et al. 2023) for English, MIRACL / MTEB Multilingual for multilingual, and a small in-domain eval set you actually care about.
Matryoshka Representations and Truncation
Matryoshka Representation Learning (MRL)
Kusupati et al. (2022) train an embedding $f(x) \in \mathbb{R}^d$ so that the prefix $f(x)_{1:m}$ is itself a useful embedding for every $m$ in a predefined ladder (e.g., 64, 128, 256, 512, 1024). The training objective is a sum of contrastive losses computed on each prefix length, so each prefix is forced to carry stand-alone semantic information rather than relying on the full vector. At deployment, an application can pick the prefix length that fits its memory and latency budget without retraining.
OpenAI text-embedding-3-large (3072 dimensions, truncatable down to 256
with minor quality loss), Cohere embed-v3, Nomic Embed v1.5, Jina v3, and
BGE-M3 all ship with Matryoshka heads. The practical pattern: store the
full-dimension vector once, then serve different products with different
truncations - 256-dim for first-stage ANN, full 1024- or 3072-dim for
final reranking - without ever recomputing the encoder.
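The store-once, truncate-at-serve-time pattern reduces to slicing and renormalizing. A minimal sketch (the 4-dim vector stands in for a real Matryoshka-trained embedding; the dimensions are illustrative):

```python
import math

def truncate_and_renormalize(vec, m):
    """Keep the first m dimensions of a Matryoshka embedding and
    rescale to unit norm so inner-product search still works."""
    prefix = vec[:m]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

full = [0.5, -0.5, 0.5, 0.5]               # stands in for a 1024-dim vector
short = truncate_and_renormalize(full, 2)  # e.g. a 256-dim first-stage vector

assert len(short) == 2
# Renormalization restores unit norm, so dot-product search stays valid.
assert abs(sum(x * x for x in short) - 1.0) < 1e-12
```

Renormalizing after truncation matters: the prefix of a unit vector is generally not unit-norm, and inner-product ANN indexes assume normalized inputs for cosine-equivalent ranking.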
Training Embedding Models
Embedding models are trained with contrastive losses. Given a query $q$, a positive document $d^+$, and negative documents $d_1^-, \ldots, d_n^-$:

$$\mathcal{L} = -\log \frac{\exp(s(q, d^+)/\tau)}{\exp(s(q, d^+)/\tau) + \sum_{i=1}^{n} \exp(s(q, d_i^-)/\tau)}$$

where $s(\cdot, \cdot)$ is the similarity function (typically cosine or dot product) and $\tau$ is a temperature parameter. This is the InfoNCE loss. Hard negatives (documents that are similar but not relevant) are critical for learning fine-grained distinctions.
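The loss can be written directly from its formula. A minimal pure-Python sketch operating on raw similarity scores (in practice these come from the encoder; the values below are illustrative):

```python
import math

def info_nce(pos_sim, neg_sims, tau=0.05):
    """InfoNCE: softmax cross-entropy with the positive as the target.
    pos_sim is s(q, d+); neg_sims is the list of s(q, d_i^-)."""
    logits = [pos_sim / tau] + [s / tau for s in neg_sims]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
    return -(pos_sim / tau - log_denom)

# A well-separated positive yields a near-zero loss...
easy = info_nce(0.9, [0.1, 0.0, -0.2])
# ...while a hard negative close to the positive yields a larger one.
hard = info_nce(0.9, [0.85, 0.0, -0.2])
assert easy < hard
assert easy > 0 and hard > 0
```

The `easy`/`hard` comparison shows concretely why hard negatives matter: easy negatives contribute almost no gradient, so the model learns fine distinctions only from pairs that are similar but not relevant.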
Approximate Nearest Neighbor Search
Exact nearest neighbor search in $\mathbb{R}^d$ over $n$ vectors costs $O(nd)$ per query. For $n$ in the millions, this is too slow. Approximate nearest neighbor (ANN) algorithms trade exactness for speed.
HNSW (Hierarchical Navigable Small World): builds a multi-layer graph where each node connects to nearby vectors. Search starts at the top layer (coarse) and descends to the bottom layer (fine). Query time is $O(\log n)$ in practice.
IVF (Inverted File Index): partitions the vector space into Voronoi cells using k-means. At query time, search only the $n_\text{probe}$ nearest cells. This reduces the search space by a factor of roughly $n_\text{list}/n_\text{probe}$, where $n_\text{list}$ is the number of cells.
Product Quantization (PQ): compresses each vector by splitting it into $m$ subvectors and quantizing each independently. Reduces memory and allows fast distance computation via lookup tables.
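The PQ mechanics — split, quantize per subspace, score via lookup — fit in a short sketch. The codebooks below are tiny and hand-written for illustration; real systems learn them with k-means (e.g., FAISS defaults to 8 bits, i.e., 256 centroids per subspace):

```python
def pq_encode(vec, codebooks):
    """Split vec into len(codebooks) subvectors; store, per subvector,
    the index of the nearest centroid in that subspace's codebook."""
    m = len(codebooks)
    sub_len = len(vec) // m
    codes = []
    for i, book in enumerate(codebooks):
        sub = vec[i * sub_len:(i + 1) * sub_len]
        dists = [sum((a - b) ** 2 for a, b in zip(sub, c)) for c in book]
        codes.append(dists.index(min(dists)))
    return codes

def pq_distance(query, codes, codebooks):
    """Approximate squared L2 distance: per subspace, measure from the
    query subvector to the centroid named by the stored code."""
    m = len(codebooks)
    sub_len = len(query) // m
    total = 0.0
    for i, (code, book) in enumerate(zip(codes, codebooks)):
        sub = query[i * sub_len:(i + 1) * sub_len]
        total += sum((a - b) ** 2 for a, b in zip(sub, book[code]))
    return total

# Two subspaces of 2 dims each, 2 centroids per subspace (toy scale).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
codes = pq_encode([0.9, 1.1, 0.1, 0.9], codebooks)
assert codes == [1, 0]  # nearest centroids: (1,1) then (0,1)
# The 4-dim float vector is now stored as 2 small codes.
approx = pq_distance([1.0, 1.0, 0.0, 1.0], codes, codebooks)
assert approx == 0.0  # this query coincides with the stored centroids
```

In production the per-subspace distances from the query to all centroids are precomputed once per query into lookup tables, so scoring each compressed vector is just $m$ table lookups and additions.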
ScaNN (Scalable Nearest Neighbors), Guo et al. 2020: introduces anisotropic vector quantization, which weights quantization error by its projection onto the inner-product direction rather than treating all directions equally. Empirically dominates IVF-PQ at the recall-vs-throughput frontier for inner-product search and is the indexing backbone of much of Google's production retrieval.
DiskANN / Vamana, Subramanya et al. 2019: builds a single-layer proximity graph (Vamana) tuned so that the index fits on SSD with only the hot navigation set in RAM. Enables billion-scale ANN on a single workstation and is the basis of Microsoft's production vector indexes.
LSH (Locality Sensitive Hashing), Indyk and Motwani 1998: the historically first sublinear ANN method. Hashes such that nearby points collide with high probability. Largely superseded by graph- and quantization-based methods on real data, but still useful when you need pure-CPU, no-training, streaming-friendly indexing or theoretical guarantees.
FAISS (Facebook AI Similarity Search) combines several of these: IVF for coarse partitioning, PQ (or its more accurate variants OPQ and IVF-PQ-Refine) for compression, and optional HNSW for the coarse quantizer.
Johnson-Lindenstrauss Lemma for Embeddings
Statement
For any set of $n$ points in $\mathbb{R}^d$ and any $\varepsilon \in (0, 1)$, there exists a linear map $f: \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\varepsilon^{-2} \log n)$ such that for all pairs $u, v$:

$$(1 - \varepsilon)\,\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \varepsilon)\,\|u - v\|^2$$
Intuition
Pairwise distances are approximately preserved when projecting to $k = O(\varepsilon^{-2} \log n)$ dimensions. This is why ANN search works in moderate dimensions: the intrinsic dimensionality of the point set is often much smaller than the ambient dimension $d$.
Proof Sketch
Use a random projection $f(x) = \frac{1}{\sqrt{k}} A x$ where $A \in \mathbb{R}^{k \times d}$ has i.i.d. sub-Gaussian entries. For a fixed pair, $\|f(u) - f(v)\|^2$ concentrates around $\|u - v\|^2$ by sub-Gaussian tail bounds. A union bound over all $\binom{n}{2}$ pairs gives the required $k = O(\varepsilon^{-2} \log n)$.
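The proof sketch translates directly into a numerical check. A pure-Python sketch with Gaussian entries and a fixed seed (the dimensions and the 25% distortion threshold are loose, illustrative choices):

```python
import math
import random

random.seed(0)

d, k = 64, 2048  # ambient and projected dimensions

# Random projection f(x) = (1/sqrt(k)) * A x with i.i.d. Gaussian A.
A = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]

def project(x):
    scale = 1.0 / math.sqrt(k)
    return [scale * sum(a * xi for a, xi in zip(row, x)) for row in A]

u = [random.gauss(0.0, 1.0) for _ in range(d)]
v = [random.gauss(0.0, 1.0) for _ in range(d)]

orig = sum((a - b) ** 2 for a, b in zip(u, v))
proj = sum((a - b) ** 2 for a, b in zip(project(u), project(v)))

# The projected squared distance concentrates around the original;
# the relative error has standard deviation about sqrt(2/k) ~ 3% here.
assert abs(proj - orig) / orig < 0.25
```

Note that $k$ here is larger than $d$ only because $n = 2$ points need tight concentration for the demo; the lemma pays off when $n$ is large and $d$ is much larger than $O(\varepsilon^{-2} \log n)$.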
Why It Matters
JL justifies dimensionality reduction for search. If your embeddings live in a high-dimensional $\mathbb{R}^d$ but you index $n$ documents, you can project to $k = O(\varepsilon^{-2} \log n)$ dimensions with small distortion, drastically reducing memory and search cost. PQ implicitly exploits this by quantizing in subspaces.
Failure Mode
JL preserves Euclidean distances, not cosine similarities directly. For normalized vectors, the two are monotonically related, so it works. For unnormalized vectors, cosine similarity involves division by norms, and small norm distortions can cause large cosine distortions. Always normalize before applying random projection for cosine search.
Hybrid Search: Dense + Sparse
Pure dense retrieval misses exact-match signals (product codes, names, acronyms, rare technical terms) that BM25 catches trivially. Pure BM25 misses paraphrase and concept overlap. Hybrid search combines them.
The standard combiner is Reciprocal Rank Fusion (Cormack et al. 2009): given rankings from $m$ retrievers, score each document $d$ by $\text{RRF}(d) = \sum_{i=1}^{m} \frac{1}{k + \text{rank}_i(d)}$ for a small constant $k$ (commonly $k = 60$). RRF is parameter-light and outperforms most weighted score-fusion schemes because it normalizes away the very different score scales of BM25 and cosine similarity.
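Reciprocal rank fusion is a one-function implementation. A sketch with the commonly used constant k = 60 (the document ids and rankings below are illustrative):

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum_i 1/(k + rank_i(d)),
    with ranks starting at 1. Docs missing from a list contribute nothing."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7"]   # sparse retriever output
dense_ranking = ["d1", "d9", "d3"]  # dense retriever output

fused = rrf_fuse([bm25_ranking, dense_ranking])
# d1 (ranks 2 and 1) and d3 (ranks 1 and 3) appear in both lists,
# so they outrank the single-list documents d7 and d9.
assert fused[0] == "d1"
assert set(fused[:2]) == {"d1", "d3"}
```

Because only ranks are used, no score normalization across retrievers is needed; this is exactly why RRF is robust to BM25 and cosine scores living on incompatible scales.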
Learned sparse retrievers - SPLADE (Formal et al. 2021), uniCOIL, and the sparse head of BGE-M3 - produce BM25-shaped term-weight vectors from a transformer. They preserve the inverted-index machinery (efficient on CPU, no GPU at query time) while learning expansion and re-weighting end to end. In production they are often used alongside a dense index for cheap, exact-match-friendly recall.
Vector Database Landscape
The 2024-2025 production stack splits roughly into:
- Libraries: FAISS (Facebook AI Similarity Search), DiskANN, ScaNN, hnswlib, Annoy, Voyager. No service layer, embed in your own process.
- Single-node servers: Qdrant, Weaviate, Chroma, Marqo, LanceDB. Easy to self-host, good developer experience, often Rust- or Go-based.
- Distributed databases: Milvus, Vespa, Vald. Designed for tens of billions of vectors; more operational overhead.
- OLTP extensions: `pgvector` and `pgvecto.rs` (PostgreSQL), Redis Search, Elasticsearch / OpenSearch dense_vector, MongoDB Atlas Vector Search. Trade peak performance for keeping vectors in your existing database.
- Hosted / serverless: Pinecone, Turbopuffer, Vespa Cloud, Weaviate Cloud, Qdrant Cloud, MongoDB Atlas. Pay for managed scale and operational simplicity.
Most of these expose HNSW or IVF-PQ under the hood; the differentiation is
in metadata filtering, multi-tenancy, hybrid search, replication, and
cost. For new projects, the boring-but-correct default is to start with
pgvector if you are already on Postgres, Qdrant or Weaviate if not, and
graduate to a distributed system only when index size and QPS demand it.
The RAG Retrieval Pipeline
Retrieval-Augmented Generation (RAG) follows this flow:
- Index: chunk documents, embed each chunk with a bi-encoder, store vectors in an ANN index alongside a sparse / BM25 index
- Query: embed the user query with the same bi-encoder; tokenize for the sparse index
- Search: find top-$k$ nearest chunks via ANN search and BM25; fuse the rankings (e.g., reciprocal rank fusion)
- Rerank (optional): score the top 50-200 fused candidates with a cross-encoder
- Generate: feed retrieved chunks as context to a language model
The quality bottleneck is usually retrieval, not generation. If the right documents are not in the top-$k$, no amount of generation quality will compensate.
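A small recall@k helper makes that bottleneck measurable: score the retriever against a labeled eval set before touching the generator. A sketch (document ids and labels below are illustrative):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents present in the top-k retrieved."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

retrieved = ["d4", "d1", "d8", "d2", "d5"]  # ranked retriever output
relevant = {"d1", "d2"}                     # gold labels for this query

assert recall_at_k(retrieved, relevant, 2) == 0.5  # only d1 in the top-2
assert recall_at_k(retrieved, relevant, 5) == 1.0  # both found by k = 5
```

Averaging this over a query set gives the ceiling on RAG answer quality at a given context budget: if recall@k is 0.6, at least 40% of queries cannot be answered from the retrieved context no matter how good the LLM is.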
Common Confusions
Cosine similarity is not a distance metric
Cosine similarity ranges from $-1$ to $1$ and does not satisfy the triangle inequality. Cosine distance ($1 - \cos$) does not satisfy the triangle inequality either, even for non-negative vectors. Counterexample: for $u = (1, 0)$, $v = (1, 1)$, $w = (0, 1)$, the cosine distance $d(u, w) = 1$ exceeds $d(u, v) + d(v, w) = 2 - \sqrt{2} \approx 0.59$. The angular distance $\arccos(\cos(u, v))$ is a valid metric. ANN libraries typically work with inner product or $L_2$ distance, not cosine similarity directly. Normalizing vectors converts cosine search to inner product search.
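The failure of the triangle inequality for cosine distance can be verified with three non-negative 2-d vectors:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

u, v, w = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]

# d(u, w) = 1, but d(u, v) + d(v, w) = 2 - sqrt(2) ~ 0.59:
# the triangle inequality fails even with all-non-negative vectors.
assert cosine_distance(u, w) > cosine_distance(u, v) + cosine_distance(v, w)
```

Orthogonal vectors are maximally far under cosine distance, yet a 45-degree vector is close to both: that is exactly the configuration the triangle inequality forbids for a true metric.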
More dimensions does not always mean better embeddings
Embedding dimension is a design choice with diminishing returns. Going from 128 to 768 dimensions usually helps. Going from 768 to 4096 often does not justify the increased memory and search cost. The effective intrinsic dimension of the embedding manifold is typically much smaller than the ambient dimension.
Bi-encoder similarity is not the same as relevance
Two documents can be highly similar (both about dogs) without either being relevant to a specific query (about dog allergies). Similarity is symmetric; relevance is asymmetric. Cross-encoders model relevance better because they condition on the specific query-document pair.
Summary
- Bi-encoders enable precomputation and fast search; cross-encoders are more accurate but require per-pair inference
- Cosine similarity on normalized vectors equals the dot product; maximizing it is equivalent to minimizing Euclidean distance
- ANN algorithms (HNSW, IVF, PQ) make billion-scale search practical
- The JL lemma justifies dimensionality reduction for search
- RAG quality is bottlenecked by retrieval quality, not generation quality
- Hard negatives are critical for training good embedding models
Exercises
Problem
If embedding vectors are $\ell_2$-normalized, show that maximizing cosine similarity between $u$ and $v$ is equivalent to minimizing $\|u - v\|_2$.
Problem
You have $n$ document embeddings in $\mathbb{R}^{768}$. Each vector stored in float32 takes $768 \times 4 = 3072$ bytes. What is the total memory as a function of $n$? If you use PQ with 96 subquantizers and 256 centroids per subquantizer (1 byte per subvector), what is the compressed memory, and what is the compression ratio?
References
Canonical:
- Johnson & Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," AMS (1984)
- Reimers & Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks," EMNLP 2019
Training and contrastive objectives:
- van den Oord, Li, Vinyals, "Representation Learning with Contrastive Predictive Coding" (2018). arXiv:1807.03748 (InfoNCE loss)
- Karpukhin et al., "Dense Passage Retrieval for Open-Domain Question Answering" (EMNLP 2020). arXiv:2004.04906
- Khattab & Zaharia, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction" (SIGIR 2020). arXiv:2004.12832
- Santhanam et al., "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction" (NAACL 2022). arXiv:2112.01488
- Formal, Lassance, Piwowarski, Clinchant, "SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval" (2021). arXiv:2109.10086
ANN and indexing:
- Indyk & Motwani, "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality" (STOC 1998)
- Malkov & Yashunin, "Efficient and Robust Approximate Nearest Neighbor using Hierarchical Navigable Small World Graphs" (IEEE TPAMI, 2020). arXiv:1603.09320
- Jegou, Douze, Schmid, "Product Quantization for Nearest Neighbor Search" (IEEE TPAMI, 2011)
- Guo et al., "Accelerating Large-Scale Inference with Anisotropic Vector Quantization" (ICML 2020). arXiv:1908.10396 (ScaNN)
- Subramanya et al., "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node" (NeurIPS 2019)
- Johnson, Douze, Jegou, "Billion-scale similarity search with GPUs," IEEE TBD 2021
- Cormack, Clarke, Buettcher, "Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods" (SIGIR 2009)
Modern embedding models:
- Wang et al., "Text Embeddings by Weakly-Supervised Contrastive Pre-training" (2022). arXiv:2212.03533 (E5)
- Wang et al., "Improving Text Embeddings with Large Language Models" (ACL 2024). arXiv:2401.00368 (E5-Mistral)
- Xiao et al., "C-Pack: Packed Resources For General Chinese Embeddings" (SIGIR 2024). arXiv:2309.07597 (BGE)
- Chen, Xiao, Zhang et al., "BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation" (2024). arXiv:2402.03216
- Li et al., "Towards General Text Embeddings with Multi-stage Contrastive Learning" (2023). arXiv:2308.03281 (GTE)
- Nussbaum et al., "Nomic Embed: Training a Reproducible Long Context Text Embedder" (2024). arXiv:2402.01613
- Sturua et al., "jina-embeddings-v3: Multilingual Embeddings With Task LoRA" (2024). arXiv:2409.10173
- Merrick et al., "Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models" (2024). arXiv:2405.05374
Matryoshka and dimension reduction:
- Kusupati et al., "Matryoshka Representation Learning" (NeurIPS 2022). arXiv:2205.13147
Multimodal embeddings:
- Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (ICML 2021). arXiv:2103.00020 (CLIP)
- Zhai et al., "Sigmoid Loss for Language Image Pre-Training" (ICCV 2023). arXiv:2303.15343 (SigLIP)
Current:
- Muennighoff et al., "MTEB: Massive Text Embedding Benchmark," EACL 2023. arXiv:2210.07316
Next Topics
From semantic search, the natural continuations are:
- Multimodal RAG: extending retrieval to images, tables, and code
- Context engineering: how to select and arrange retrieved context for LLMs
Last reviewed: April 26, 2026