Vector Search at Scale: Indexing, Quantization, and Hybrid Retrieval
Building fast, accurate vector search systems requires understanding index types, quantization trade-offs, and when to combine dense and sparse retrieval. Here's what works in production.
Index selection matters
HNSW offers speed, IVF offers memory efficiency, and exact search offers guarantees. Your choice depends on dataset size, query patterns, and accuracy requirements.
Quantization (reducing embedding precision) can cut memory 4x–8x with minimal quality loss. Test on your domain before deploying.
Understanding HNSW (Hierarchical Navigable Small World)
HNSW builds a multi-layer graph structure that enables fast approximate nearest neighbor search. Query time scales roughly logarithmically with dataset size, making it excellent for large datasets.
Trade-offs: higher memory usage (stores full graph), longer index build times, but excellent query performance. Best for: production search where query speed is critical.
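The core move HNSW makes on each layer is a greedy walk over a navigable graph: hop to whichever neighbor is closer to the query, stop at a local minimum. Below is a minimal single-layer NumPy sketch of that idea; the function names are illustrative, and real HNSW adds multiple layers, heuristic edge pruning, and beam search (the ef parameter) to escape local minima.

```python
import numpy as np

def build_knn_graph(vectors, k=8):
    """Toy navigable graph: connect each vector to its k nearest neighbors.
    (Real HNSW builds several layers and prunes edges heuristically.)"""
    dists = np.linalg.norm(vectors[:, None] - vectors[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # no self-edges
    return np.argsort(dists, axis=1)[:, :k]  # shape (n, k) neighbor ids

def greedy_search(vectors, graph, query, entry=0):
    """Greedy descent: move to the neighbor closest to the query,
    stop when no neighbor improves on the current node."""
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        neighbors = graph[current]
        neighbor_dists = np.linalg.norm(vectors[neighbors] - query, axis=1)
        best = neighbor_dists.argmin()
        if neighbor_dists[best] >= current_dist:
            return current, current_dist     # local minimum reached
        current, current_dist = neighbors[best], neighbor_dists[best]
```

Because the walk only ever moves to strictly closer nodes, it terminates quickly; the layered structure in real HNSW is what makes the number of hops grow slowly with dataset size.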
IVF (Inverted File Index) for memory efficiency
IVF partitions the vector space into clusters (typically via k-means) and searches only the clusters nearest to each query. This cuts per-query compute sharply, and because no neighbor graph is stored it uses less memory than HNSW; pairing it with product quantization (IVF-PQ) reduces memory further.
The key parameter is nprobe (number of clusters to search). Higher nprobe increases recall but decreases speed. Tune this based on your accuracy requirements.
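A minimal NumPy sketch of the IVF idea, with a toy k-means step standing in for a real trainer (function names are illustrative): cluster once at build time, then scan only the nprobe closest clusters at query time.

```python
import numpy as np

def build_ivf(vectors, n_clusters=16, iters=10, seed=0):
    """Toy IVF build: k-means the vectors, keep an inverted list per cluster."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(iters):
        assign = np.linalg.norm(vectors[:, None] - centroids[None], axis=-1).argmin(1)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # final assignment -> inverted lists of vector ids per cluster
    assign = np.linalg.norm(vectors[:, None] - centroids[None], axis=-1).argmin(1)
    lists = {c: np.where(assign == c)[0] for c in range(n_clusters)}
    return centroids, lists

def ivf_search(vectors, centroids, lists, query, k=5, nprobe=4):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = np.linalg.norm(centroids - query, axis=1).argsort()[:nprobe]
    cand = np.concatenate([lists[c] for c in probe])
    dists = np.linalg.norm(vectors[cand] - query, axis=1)
    return cand[dists.argsort()[:k]]
```

Setting nprobe equal to the number of clusters recovers exact search; the recall/latency trade-off lives entirely in how far below that you can go on your data.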
Quantization techniques
Product Quantization (PQ) compresses vectors by splitting each one into sub-vectors and replacing each sub-vector with the id of its nearest centroid in a small learned codebook. Scalar Quantization (SQ) reduces precision from float32 to int8, one dimension at a time. Both trade accuracy for memory.
Measure recall@k on your domain before deploying quantization. Some domains (especially those with clustered embeddings) compress better than others.
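Recall@k here means: of the true top-k neighbors from exact search, what fraction did the approximate (or quantized) index return? A minimal implementation, averaged over a batch of queries (the function name is illustrative):

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Mean fraction of the exact top-k ids that also appear in the
    approximate top-k, averaged over all queries in the batch."""
    hits = sum(len(set(a[:k]) & set(e[:k]))
               for a, e in zip(approx_ids, exact_ids))
    return hits / (k * len(exact_ids))
```

Run this over a fixed, labeled query set whenever you change index parameters or quantization settings, so regressions are caught before deployment.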
Hybrid search: combining dense and sparse
Dense embeddings capture semantic similarity but miss exact keyword matches. Sparse methods (BM25, TF-IDF) handle exact matches well but miss semantics.
Hybrid search runs both, then merges results with weighted scores. Typical weights: 0.7 dense + 0.3 sparse, but tune on your data.
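Because dense similarities and BM25 scores live on different scales, each list has to be normalized before the weighted sum. A minimal sketch using min-max normalization over score dicts keyed by document id (names and the normalization choice are assumptions; reciprocal rank fusion is a common alternative):

```python
import numpy as np

def hybrid_merge(dense_scores, sparse_scores, w_dense=0.7, w_sparse=0.3):
    """Min-max normalize each retriever's scores to [0, 1], then weighted-sum.
    A doc missing from one retriever contributes 0 from that side."""
    def normalize(scores):
        if not scores:
            return {}
        vals = np.array(list(scores.values()), dtype=float)
        lo, hi = vals.min(), vals.max()
        span = hi - lo if hi > lo else 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}
    dn, sn = normalize(dense_scores), normalize(sparse_scores)
    merged = {doc: w_dense * dn.get(doc, 0.0) + w_sparse * sn.get(doc, 0.0)
              for doc in set(dn) | set(sn)}
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```

The 0.7/0.3 defaults mirror the typical weights above; sweep them against labeled queries rather than trusting the defaults.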
Metadata filtering and pre-filtering
In production, you often need to filter by metadata (date ranges, user permissions, categories) before or during vector search.
Pre-filtering (filter first, then search) shrinks the search space but may miss relevant results if the filtered index structure degrades. Post-filtering (search first, then filter) preserves recall but wastes compute on results that get discarded. Filter-aware indexes apply the predicate during traversal itself, capturing most of both benefits.
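The two baseline strategies are easy to sketch with brute-force distance and a boolean mask over document ids (function names and the overfetch factor are illustrative):

```python
import numpy as np

def pre_filter_search(vectors, allowed_mask, query, k=5):
    """Pre-filtering: restrict to allowed ids first, then rank only those."""
    allowed = np.where(allowed_mask)[0]
    dists = np.linalg.norm(vectors[allowed] - query, axis=1)
    return allowed[dists.argsort()[:k]]

def post_filter_search(vectors, allowed_mask, query, k=5, overfetch=4):
    """Post-filtering: over-fetch k * overfetch candidates, then drop the
    disallowed ones. May return fewer than k hits if the filter is selective."""
    dists = np.linalg.norm(vectors - query, axis=1)
    cand = dists.argsort()[:k * overfetch]
    return cand[allowed_mask[cand]][:k]
```

The overfetch factor is the knob that trades wasted compute against the risk of coming back with fewer than k results; highly selective filters need a larger factor or a switch to pre-filtering.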
Benchmarking and monitoring
Track key metrics: query latency (p50, p95, p99), recall@k, index build time, memory usage, and throughput. Set alerts for degradation.
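Tail latency is what users feel, so monitor percentiles rather than averages. A minimal summary over a window of per-query latencies (the function name is illustrative):

```python
import numpy as np

def latency_summary(latencies_ms):
    """p50/p95/p99 over a window of per-query latencies, in milliseconds."""
    p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
    return {"p50": p50, "p95": p95, "p99": p99}
```

Alert on the p99 rather than the mean: a small fraction of slow queries can hide entirely inside a healthy-looking average.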
Run periodic accuracy tests with labeled query-document pairs. Vector search quality can degrade as data distribution shifts.