Efficient Sketching and Nearest Neighbor Search Algorithms for Sparse Vector Sets
- URL: http://arxiv.org/abs/2509.24815v1
- Date: Mon, 29 Sep 2025 14:02:45 GMT
- Title: Efficient Sketching and Nearest Neighbor Search Algorithms for Sparse Vector Sets
- Authors: Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini,
- Abstract summary: We introduce a set of novel data structures and algorithmic methods for sparse ANNS.<n>Our contributions range from a theoretically-grounded sketching algorithm for sparse vectors to reduce their effective dimensionality.<n>Our final algorithm, dubbed Seismic, reaches sub-millisecond latency with high accuracy on a large-scale benchmark dataset.
- Score: 16.768212375976546
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Sparse embeddings of data form an attractive class due to their inherent interpretability: Every dimension is tied to a term in some vocabulary, making it easy to visually decipher the latent space. Sparsity, however, poses unique challenges for Approximate Nearest Neighbor Search (ANNS) which finds, from a collection of vectors, the k vectors closest to a query. To encourage research on this underexplored topic, sparse ANNS featured prominently in a BigANN Challenge at NeurIPS 2023, where approximate algorithms were evaluated on large benchmark datasets by throughput and accuracy. In this work, we introduce a set of novel data structures and algorithmic methods, a combination of which leads to an elegant, effective, and highly efficient solution to sparse ANNS. Our contributions range from a theoretically-grounded sketching algorithm for sparse vectors to reduce their effective dimensionality while preserving inner product-induced ranks; a geometric organization of the inverted index; and the blending of local and global information to improve the efficiency and efficacy of ANNS. Empirically, our final algorithm, dubbed Seismic, reaches sub-millisecond per-query latency with high accuracy on a large-scale benchmark dataset using a single CPU.
Related papers
- Infinity Search: Approximate Vector Search with Projections on q-Metric Spaces [94.12116458306916]
We show that in $q$-metric spaces, metric trees can leverage a stronger version of the triangle inequality to reduce comparisons for exact search.<n>We propose a novel projection method that embeds datasets with arbitrary dissimilarity measures into $q$-metric spaces.
arXiv Detail & Related papers (2025-06-06T22:09:44Z) - Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search [14.821492155733555]
We propose a graph-based ANNS algorithm for dense-sparse hybrid vectors.
We show that our algorithm achieves 8.9x$sim$11.7x throughput at equal accuracy compared to existing hybrid vector search algorithms.
arXiv Detail & Related papers (2024-10-27T09:12:51Z) - PECANN: Parallel Efficient Clustering with Graph-Based Approximate Nearest Neighbor Search [7.466687780705626]
This paper studies density-based clustering of point sets.<n>It unifies the different variants of density peaks clustering into a single framework, PECANN.<n>We implement five clustering algorithms with PECANN and evaluate them on synthetic and real-world datasets with up to 1.28 million points and up to 1024 dimensions on a 30-core machine with two-way hyper-threading.
arXiv Detail & Related papers (2023-12-06T22:43:50Z) - ParlayANN: Scalable and Deterministic Parallel Graph-Based Approximate
Nearest Neighbor Search Algorithms [5.478671305092084]
We introduce ParlayANN, a library of deterministic and parallel graph-based approximate nearest neighbor search algorithms.
We develop novel parallel implementations for four state-of-the-art graph-based ANNS algorithms that scale to billion-scale datasets.
arXiv Detail & Related papers (2023-05-07T19:28:23Z) - Efficient Convex Algorithms for Universal Kernel Learning [46.573275307034336]
An ideal set of kernels should: admit a linear parameterization (for tractability); dense in the set of all kernels (for accuracy)
Previous algorithms for optimization of kernels were limited to classification and relied on computationally complex Semidefinite Programming (SDP) algorithms.
We propose a SVD-QCQPQP algorithm which dramatically reduces the computational complexity as compared with previous SDP-based approaches.
arXiv Detail & Related papers (2023-04-15T04:57:37Z) - Linearized Wasserstein dimensionality reduction with approximation
guarantees [65.16758672591365]
LOT Wassmap is a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space.
We show that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size.
We also show how LOT Wassmap significantly reduces the computational cost when compared to algorithms that depend on pairwise distance computations.
arXiv Detail & Related papers (2023-02-14T22:12:16Z) - A Metaheuristic Algorithm for Large Maximum Weight Independent Set
Problems [58.348679046591265]
Given a node-weighted graph, find a set of independent (mutually nonadjacent) nodes whose node-weight sum is maximum.
Some of the graphs airsing in this application are large, having hundreds of thousands of nodes and hundreds of millions of edges.
We develop a new local search algorithm, which is a metaheuristic in the greedy randomized adaptive search framework.
arXiv Detail & Related papers (2022-03-28T21:34:16Z) - A density peaks clustering algorithm with sparse search and K-d tree [16.141611031128427]
Density peaks clustering algorithm with sparse search and K-d tree is developed to solve this problem.
Experiments are carried out on datasets with different distribution characteristics, by comparing with other five typical clustering algorithms.
arXiv Detail & Related papers (2022-03-02T09:29:40Z) - A Sparse Structure Learning Algorithm for Bayesian Network
Identification from Discrete High-Dimensional Data [0.40611352512781856]
This paper addresses the problem of learning a sparse structure Bayesian network from high-dimensional discrete data.
We propose a score function that satisfies the sparsity and the DAG property simultaneously.
Specifically, we use a variance reducing method in our optimization algorithm to make the algorithm work efficiently in high-dimensional data.
arXiv Detail & Related papers (2021-08-21T12:21:01Z) - Learning to Hash Robustly, with Guarantees [79.68057056103014]
In this paper, we design an NNS algorithm for the Hamming space that has worst-case guarantees essentially matching that of theoretical algorithms.
We evaluate the algorithm's ability to optimize for a given dataset both theoretically and practically.
Our algorithm has a 1.8x and 2.1x better recall on the worst-performing queries to the MNIST and ImageNet datasets.
arXiv Detail & Related papers (2021-08-11T20:21:30Z) - Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z) - Towards Optimally Efficient Tree Search with Deep Learning [76.64632985696237]
This paper investigates the classical integer least-squares problem which estimates signals integer from linear models.
The problem is NP-hard and often arises in diverse applications such as signal processing, bioinformatics, communications and machine learning.
We propose a general hyper-accelerated tree search (HATS) algorithm by employing a deep neural network to estimate the optimal estimation for the underlying simplified memory-bounded A* algorithm.
arXiv Detail & Related papers (2021-01-07T08:00:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.