Related papers: GPU-Native Approximate Nearest Neighbor Search with IVF-RaBitQ: Fast Index Build and Search

GPU-Native Approximate Nearest Neighbor Search with IVF-RaBitQ: Fast Index Build and Search

URL: http://arxiv.org/abs/2602.23999v1
Date: Fri, 27 Feb 2026 13:23:30 GMT
Title: GPU-Native Approximate Nearest Neighbor Search with IVF-RaBitQ: Fast Index Build and Search
Authors: Jifan Shi, Jianyang Gao, James Xia, Tamás Béla Fehér, Cheng Long,
Abstract summary: IVF-RaBitQ is a GPU-native ANNS solution that integrates the cluster-based method IVF with RaBitQ quantization into an efficient GPU index build/search pipeline.<n>We show that IVF-RaBitQ offers a strong performance frontier in recall, throughput, index build time, and storage footprint.
Score: 6.459073253087106
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Approximate nearest neighbor search (ANNS) on GPUs is gaining increasing popularity for modern retrieval and recommendation workloads that operate over massive high-dimensional vectors. Graph-based indexes deliver high recall and throughput but incur heavy build-time and storage costs. In contrast, cluster-based methods build and scale efficiently yet often need many probes for high recall, straining memory bandwidth and compute. Aiming to simultaneously achieve fast index build, high-throughput search, high recall, and low storage requirement for GPUs, we present IVF-RaBitQ (GPU), a GPU-native ANNS solution that integrates the cluster-based method IVF with RaBitQ quantization into an efficient GPU index build/search pipeline. Specifically, for index build, we develop a scalable GPU-native RaBitQ quantization method that enables fast and accurate low-bit encoding at scale. For search, we develop GPU-native distance computation schemes for RaBitQ codes and a fused search kernel to achieve high throughput with high recall. With IVF-RaBitQ implemented and integrated into the NVIDIA cuVS Library, experiments on cuVS Bench across multiple datasets show that IVF-RaBitQ offers a strong performance frontier in recall, throughput, index build time, and storage footprint. For Recall approximately equal to 0.95, IVF-RaBitQ achieves 2.2x higher QPS than the state-of-the-art graph-based method CAGRA, while also constructing indices 7.7x faster on average. Compared to the cluster-based method IVF-PQ, IVF-RaBitQ delivers on average over 2.7x higher throughput while avoiding accessing the raw vectors for reranking.

Related papers

GPU-Accelerated Algorithms for Graph Vector Search: Taxonomy, Empirical Study, and Research Directions [54.570944939061555]
We present a comprehensive study of GPU-accelerated graph-based vector search algorithms.<n>We establish a detailed taxonomy of GPU optimization strategies and clarify the mapping between algorithmic tasks and hardware execution units.<n>Our findings offer clear guidelines for designing scalable and robust GPU-powered approximate nearest neighbor search systems.
arXiv Detail & Related papers (2026-02-10T16:18:04Z)
GPU-Accelerated ANNS: Quantized for Speed, Built for Change [1.8419317899207142]
Current Approximate nearest neighbor search (ANNS) systems face three key limitations.<n>Current systems lack efficient quantization techniques that reduce data movement without introducing costly random memory accesses.<n>We present Jasper, a GPU-accelerated ANNS system with both high query throughput and upability.
arXiv Detail & Related papers (2026-01-11T19:51:54Z)
CTkvr: KV Cache Retrieval for Long-Context LLMs via Centroid then Token Indexing [28.184704036272787]
Long contexts pose significant challenges for inference efficiency in large language models.<n>We propose CTKVR, a novel centroid-then-token KV retrieval scheme.<n>CTKVR achieves superior performance across multiple benchmarks with less than 1% accuracy degradation.
arXiv Detail & Related papers (2025-12-17T15:56:32Z)
NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference.<n>Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead.<n>The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search.
arXiv Detail & Related papers (2025-05-28T20:43:10Z)
TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs [20.4238781638402]
Sparse convolution plays a pivotal role in emerging workloads, including point cloud processing in AR/VR, autonomous driving, and graph understanding in recommendation systems. Existing GPU libraries offer two dataflow types for sparse convolution. We introduce TorchSparse++, a new GPU library that achieves the best of both worlds.
arXiv Detail & Related papers (2023-10-25T21:02:38Z)
High Performance Computing Applied to Logistic Regression: A CPU and GPU Implementation Comparison [0.0]
We present a versatile GPU-based parallel version of Logistic Regression (LR) Our implementation is a direct translation of the parallel Gradient Descent Logistic Regression algorithm proposed by X. Zou et al. Our method is particularly advantageous for real-time prediction applications like image recognition, spam detection, and fraud detection.
arXiv Detail & Related papers (2023-08-19T14:49:37Z)
INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing [66.00729477511219]
Given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient. We introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture. We present results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively.
arXiv Detail & Related papers (2023-08-11T04:24:39Z)
Similarity search in the blink of an eye with compressed indices [3.39271933237479]
Graph-based indices are currently the best performing techniques for billion-scale similarity search. We present new techniques and systems for creating faster and smaller graph-based indices.
arXiv Detail & Related papers (2023-04-07T23:10:39Z)
Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show, that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms. We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
IRLI: Iterative Re-partitioning for Learning to Index [104.72641345738425]
Methods have to trade between obtaining high accuracy while maintaining load balance and scalability in distributed settings. We propose a novel approach called IRLI, which iteratively partitions the items by learning the relevant buckets directly from the query-item relevance data. We mathematically show that IRLI retrieves the correct item with high probability under very natural assumptions and provides superior load balancing.
arXiv Detail & Related papers (2021-03-17T23:13:25Z)
GPU-Accelerated Primal Learning for Extremely Fast Large-Scale Classification [10.66048003460524]
One of the most efficient methods to solve L2-regularized primal problems, such as logistic regression and linear support vector machine (SVM) classification, is the widely used trust region Newton algorithm, TRON. We show that using judicious GPU-optimization principles, TRON training time for different losses and feature representations may be drastically reduced.
arXiv Detail & Related papers (2020-08-08T03:40:27Z)
Faster than FAST: GPU-Accelerated Frontend for High-Speed VIO [46.20949184826173]
This work focuses on the applicability of efficient low-level, GPU hardware-specific instructions to improve on existing computer vision algorithms. Especially non-maxima suppression and the subsequent feature selection are prominent contributors to the overall image processing latency.
arXiv Detail & Related papers (2020-03-30T14:16:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.