AQR-HNSW: Accelerating Approximate Nearest Neighbor Search via Density-aware Quantization and Multi-stage Re-ranking
- URL: http://arxiv.org/abs/2602.21600v1
- Date: Wed, 25 Feb 2026 05:58:16 GMT
- Title: AQR-HNSW: Accelerating Approximate Nearest Neighbor Search via Density-aware Quantization and Multi-stage Re-ranking
- Authors: Ganap Ashit Tewary, Nrusinga Charan Gantayat, Jeff Zhang
- Abstract summary: This paper presents Adaptive Quantization and Rerank HNSW, a novel framework that integrates three strategies to enhance HNSW scalability. AQR-HNSW introduces (1) density-aware adaptive quantization, achieving 4x compression while preserving distance relationships; (2) multi-stage re-ranking that reduces unnecessary computations by 35%; and (3) quantization-optimized SIMD implementations delivering 16-64 operations per cycle across architectures. Evaluation on standard benchmarks demonstrates 2.5-3.3x higher queries per second (QPS) than state-of-the-art HNSW implementations while maintaining over 98% recall, with 75% memory reduction for the index graph and 5x faster index construction.
- Score: 1.2690814190593385
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Approximate Nearest Neighbor (ANN) search has become fundamental to modern AI infrastructure, powering recommendation systems, search engines, and large language models across industry leaders from Google to OpenAI. Hierarchical Navigable Small World (HNSW) graphs have emerged as the dominant ANN algorithm, widely adopted in production systems due to their superior recall versus latency balance. However, as vector databases scale to billions of embeddings, HNSW faces critical bottlenecks: memory consumption expands, distance computation overhead dominates query latency, and it suffers suboptimal performance on heterogeneous data distributions. This paper presents Adaptive Quantization and Rerank HNSW (AQR-HNSW), a novel framework that synergistically integrates three strategies to enhance HNSW scalability. AQR-HNSW introduces (1) density-aware adaptive quantization, achieving 4x compression while preserving distance relationships; (2) multi-stage re-ranking that reduces unnecessary computations by 35%; and (3) quantization-optimized SIMD implementations delivering 16-64 operations per cycle across architectures. Evaluation on standard benchmarks demonstrates 2.5-3.3x higher queries per second (QPS) than state-of-the-art HNSW implementations while maintaining over 98% recall, with 75% memory reduction for the index graph and 5x faster index construction.
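The abstract names the three ingredients but not their exact mechanics. As a rough, self-contained sketch (not the paper's implementation), the following toy Python illustrates the first two ideas: a per-vector quantization step derived from local density, and a two-stage search that ranks candidates on quantized codes before re-ranking a small shortlist at full precision. All function names, the density heuristic, and the constants are illustrative assumptions.

```python
import math
import random

def quantize(vec, scale, bits=8):
    """Uniform scalar quantization of one vector; a smaller scale means finer codes."""
    q_max = (1 << (bits - 1)) - 1
    return [max(-q_max, min(q_max, round(x / scale))) for x in vec]

def local_density_scale(vec, sample, k=5):
    """Derive a per-vector step from the mean distance to its k nearest sampled
    neighbours: dense regions get a smaller step (higher fidelity)."""
    dists = sorted(math.dist(vec, s) for s in sample)
    return max(1e-6, sum(dists[:k]) / k / 64.0)

def search_with_rerank(query, base, codes, scales, shortlist=10, final_k=3):
    """Stage 1: rank everything by distance in the quantized domain.
       Stage 2: re-rank only the shortlist with full-precision distances."""
    q_codes = [quantize(query, s) for s in scales]  # asymmetric: per-vector scale
    approx = sorted(
        range(len(base)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(q_codes[i], codes[i])),
    )[:shortlist]
    return sorted(approx, key=lambda i: math.dist(query, base[i]))[:final_k]

random.seed(0)
base = [[random.gauss(0, 1) for _ in range(16)] for _ in range(200)]
sample = random.sample(base, 50)
scales = [local_density_scale(v, sample) for v in base]
codes = [quantize(v, s) for v, s in zip(base, scales)]
top = search_with_rerank(base[0], base, codes, scales)
```

In a real system the stage-1 code distances would be the part evaluated with SIMD over packed codes, which is presumably where the reported per-cycle throughput figures come from.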
Related papers
- Reranker Optimization via Geodesic Distances on k-NN Manifolds [0.0]
We propose Maniscope, a geometric reranking method that computes geodesic distances on k-nearest neighbor (k-NN) manifolds. Maniscope outperforms the HNSW graph-based baseline on the three hardest datasets. Compared to cross-encoder rerankers, Maniscope achieves within 2% accuracy at 10-45x lower latency.
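The abstract's core primitive is geodesic distance on a k-NN graph. A minimal sketch of that primitive (not Maniscope itself, whose construction the abstract does not detail) using brute-force graph building and Dijkstra's algorithm:

```python
import heapq
import math

def knn_graph(points, k=3):
    """Undirected k-NN graph: each point connects to its k nearest neighbours,
    with edges weighted by Euclidean distance."""
    adj = {i: [] for i in range(len(points))}
    for i, p in enumerate(points):
        nbrs = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)[:k]
        for d, j in nbrs:
            adj[i].append((j, d))
            adj[j].append((i, d))
    return adj

def geodesic_distances(adj, src):
    """Dijkstra shortest paths: the geodesic distance is the length of the
    shortest path along the manifold graph, not the straight-line distance."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Points on a curved arc: the straight-line distance understates path length.
pts = [(math.cos(t / 10), math.sin(t / 10)) for t in range(0, 32, 2)]
geo = geodesic_distances(knn_graph(pts), src=0)
```

Reranking would then sort retrieved candidates by `geo` instead of raw Euclidean distance, which is what lets a graph-based method respect the data manifold's curvature.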
arXiv Detail & Related papers (2026-01-26T07:55:08Z) - From HNSW to Information-Theoretic Binarization: Rethinking the Architecture of Scalable Vector Search [0.7804710977378487]
This paper analyzes the architectural limitations of the dominant "HNSW + float32 + cosine similarity" stack. We introduce and empirically evaluate an alternative information-theoretic architecture based on maximally informative binarization (MIB). Results demonstrate retrieval quality comparable to full-precision systems, while achieving substantially lower latency and maintaining constant throughput at high request rates.
arXiv Detail & Related papers (2025-12-16T23:24:37Z) - Panorama: Fast-Track Nearest Neighbors [22.201421121801218]
We present PANORAMA, a machine learning-driven approach that tackles the Approximate Nearest-Neighbor Search bottleneck. We show that PANORAMA affords a 2-30x end-to-end speedup with no recall loss.
arXiv Detail & Related papers (2025-10-01T06:38:45Z) - Scaling Linear Attention with Sparse State Expansion [62.749291436866606]
The Transformer architecture struggles with long-context scenarios due to quadratic computation and linear memory growth. We propose two key innovations to achieve more effective context compression. First, we introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information classification. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions.
arXiv Detail & Related papers (2025-07-22T13:27:31Z) - Dual-Branch HNSW Approach with Skip Bridges and LID-Driven Optimization [0.8786066051474574]
The Hierarchical Navigable Small World (HNSW) algorithm is widely used for approximate nearest neighbor search. We propose a novel algorithm that mitigates local optima and cluster disconnections while enhancing construction speed and maintaining inference speed. Experiments on various benchmarks and datasets showed that our algorithm outperforms the original HNSW in both accuracy and speed.
arXiv Detail & Related papers (2025-01-23T10:20:12Z) - FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs)
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - The Impacts of Data, Ordering, and Intrinsic Dimensionality on Recall in Hierarchical Navigable Small Worlds [0.09208007322096533]
The investigation focuses on HNSW's efficacy across a spectrum of datasets.
We discover that the recall of approximate HNSW search, in comparison to exact K Nearest Neighbours (KNN) search, is linked to the vector space's intrinsic dimensionality.
We observe that running popular benchmark datasets with HNSW instead of KNN can shift rankings by up to three positions for some models.
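The abstract links recall to the vector space's intrinsic dimensionality without saying how that quantity is measured. One standard estimator that could be used for such a study is TwoNN (an assumption on my part, not stated in the paper), which needs only each point's two nearest-neighbour distances:

```python
import math
import random

def two_nn_id(points):
    """TwoNN intrinsic-dimensionality estimate: for each point take the ratio
    mu = r2 / r1 of its two nearest-neighbour distances; the maximum-likelihood
    estimate of the dimension is N / sum(log mu)."""
    log_mu_sum = 0.0
    for i, p in enumerate(points):
        r1, r2 = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[:2]
        log_mu_sum += math.log(r2 / r1)
    return len(points) / log_mu_sum

random.seed(1)
# A 3-D Gaussian cloud embedded in 10-D space: only 3 coordinates vary,
# so the ambient dimension (10) overstates the intrinsic dimension (3).
cloud = [[random.gauss(0, 1) for _ in range(3)] + [0.0] * 7 for _ in range(400)]
est = two_nn_id(cloud)
```

The point of the example is that intrinsic dimensionality can be far below the embedding width, which is the regime where graph-based ANN recall behaves well.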
arXiv Detail & Related papers (2024-05-28T04:16:43Z) - INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient.
We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture.
We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
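A minimal sketch of the Dense-and-Sparse idea (illustrative only, not SqueezeLLM's actual kernels or its sensitivity analysis): hold out the largest-magnitude weights in a sparse map at full precision, so they no longer stretch the uniform quantization range of the dense remainder.

```python
def dense_and_sparse(weights, outlier_pct=0.5, bits=3):
    """Split a weight vector into (a) a sparse map of full-precision outliers
    and (b) a low-bit quantized dense part. Removing outliers first shrinks
    the quantization range, so the remaining weights get finer resolution."""
    n_out = max(1, int(len(weights) * outlier_pct / 100))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    sparse = {i: weights[i] for i in order[:n_out]}               # kept exactly
    dense = [0.0 if i in sparse else weights[i] for i in range(len(weights))]
    scale = max(abs(w) for w in dense) / ((1 << (bits - 1)) - 1) or 1.0
    codes = [round(w / scale) for w in dense]
    return sparse, codes, scale

def reconstruct(sparse, codes, scale):
    out = [c * scale for c in codes]
    for i, w in sparse.items():
        out[i] = w
    return out

# Two extreme weights dominate the range; holding them out lets 3-bit codes
# resolve the small weights instead of collapsing them all to zero.
w = [0.01, -0.02, 5.0, 0.03, -0.015, 0.005, 0.02, -5.5]
sparse, codes, scale = dense_and_sparse(w, outlier_pct=25, bits=3)
w_hat = reconstruct(sparse, codes, scale)
```

The threshold choice here is by magnitude only; the paper's sensitivity-based variant would instead pick outliers using second-order information.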
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - Learning to Solve Combinatorial Graph Partitioning Problems via Efficient Exploration [72.15369769265398]
Experimentally, ECORD achieves a new SOTA for RL algorithms on the Maximum Cut problem.
Compared to the nearest competitor, ECORD reduces the optimality gap by up to 73%.
arXiv Detail & Related papers (2022-05-27T17:13:10Z) - Scaling Quantum Approximate Optimization on Near-term Hardware [49.94954584453379]
We quantify scaling of the expected resource requirements by optimized circuits for hardware architectures with varying levels of connectivity.
We show that the number of measurements, and hence the total time to synthesize a solution, grows exponentially in problem size and problem graph degree.
These problems may be alleviated by increasing hardware connectivity or by recently proposed modifications to the QAOA that achieve higher performance with fewer circuit layers.
arXiv Detail & Related papers (2022-01-06T21:02:30Z) - NPAS: A Compiler-aware Framework of Unified Network Pruning and Architecture Search for Beyond Real-Time Mobile Acceleration [48.25487285358816]
We propose a compiler automatic code generation framework supporting different DNNs and different pruning schemes.
We also propose NPAS, a compiler-aware unified network pruning and architecture search framework.
Our framework achieves 6.7ms, 5.9ms, 3.9ms ImageNet inference times with 78.2%, 75% (MobileNet-V3 level), and 71% (MobileNet-V2 level) Top-1 accuracy respectively on an off-the-shelf mobile phone.
arXiv Detail & Related papers (2020-12-01T16:03:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.