Related papers: Group Testing for Accurate and Efficient Range-Based Near Neighbor Search : An Adaptive Binary Splitting Approach

Group Testing for Accurate and Efficient Range-Based Near Neighbor Search : An Adaptive Binary Splitting Approach

URL: http://arxiv.org/abs/2311.02573v1
Date: Sun, 5 Nov 2023 06:12:03 GMT
Title: Group Testing for Accurate and Efficient Range-Based Near Neighbor Search : An Adaptive Binary Splitting Approach
Authors: Kashish Mittal, Harsh Shah, Ajit Rajwade
Abstract summary: This work presents an adaptive group testing framework for the range-based high dimensional near neighbor search problem. We experimentally show that our method achieves a speed-up over exhaustive search by a factor of more than ten with an accuracy same as that of exhaustive search.
Score: 2.6764607949560593
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This work presents an adaptive group testing framework for the range-based high dimensional near neighbor search problem. The proposed method detects high-similarity vectors from an extensive collection of high dimensional vectors, where each vector represents an image descriptor. Our method efficiently marks each item in the collection as neighbor or non-neighbor on the basis of a cosine distance threshold without exhaustive search. Like other methods in the domain of large scale retrieval, our approach exploits the assumption that most of the items in the collection are unrelated to the query. Unlike other methods, it does not assume a large difference between the cosine similarity of the query vector with the least related neighbor and that with the least unrelated non-neighbor. Following the procedure of binary splitting, a multi-stage adaptive group testing algorithm, we split the set of items to be searched into half at each step, and perform dot product tests on smaller and smaller subsets, many of which we are able to prune away. We experimentally show that our method achieves a speed-up over exhaustive search by a factor of more than ten with an accuracy same as that of exhaustive search, on a variety of large datasets. We present a theoretical analysis of the expected number of distance computations per query and the probability that a pool with a certain number of members will be pruned. In this way, our method exploits very useful and practical distributional properties unlike other methods. In our method, all required data structures are created purely offline. Moreover, our method does not impose any strong assumptions on the number of true near neighbors, is adaptible to streaming settings where new vectors are dynamically added to the database, and does not require any parameter tuning.

Related papers

Infinity Search: Approximate Vector Search with Projections on q-Metric Spaces [94.12116458306916]
We show that in $q$-metric spaces, metric trees can leverage a stronger version of the triangle inequality to reduce comparisons for exact search.<n>We propose a novel projection method that embeds datasets with arbitrary dissimilarity measures into $q$-metric spaces.
arXiv Detail & Related papers (2025-06-06T22:09:44Z)
A Query-Driven Approach to Space-Efficient Range Searching [12.760453906939446]
We show that a near-linear sample of queries allows the construction of a partition tree with a near-optimal expected number of nodes visited during querying. We enhance this approach by treating node processing as a classification problem, leveraging fast classifiers like shallow neural networks to obtain experimentally efficient query times. Our algorithm, based on a sample of queries, builds a balanced tree with nodes associated with separators that minimize query stabs on expectation.
arXiv Detail & Related papers (2025-02-19T12:01:00Z)
Range Retrieval with Graph-Based Indices [0.294944680995069]
Retrieving points based on proximity in a high-dimensional vector space is a crucial step in information retrieval applications. We present a set of algorithms for range retrieval on graph-based vector indices. We find up to 100x improvement in query throughput over a naive baseline approach, with 5-10x improvement on average, and strong performance up to 100 million data points.
arXiv Detail & Related papers (2025-02-18T19:18:01Z)
Learning Cluster Representatives for Approximate Nearest Neighbor Search [0.0]
This thesis provides a comprehensive explanation of clustering-based approximate nearest neighbor search. It also introduces and delve into every aspect of our novel state-of-the-art method. The development of this intuition and applying it to maximum inner product search has led us to demonstrate that learning cluster representatives using a simple linear function significantly boosts the accuracy of clustering-based approximate nearest neighbor search.
arXiv Detail & Related papers (2024-12-08T12:31:32Z)
pEBR: A Probabilistic Approach to Embedding Based Retrieval [4.8338111302871525]
Embedding retrieval aims to learn a shared semantic representation space for both queries and items. In current industrial practice, retrieval systems typically retrieve a fixed number of items for different queries.
arXiv Detail & Related papers (2024-10-25T07:14:12Z)
Optimistic Query Routing in Clustering-based Approximate Maximum Inner Product Search [9.01394829787271]
We study the problem of routing in clustering-based maximum inner product search (MIPS) We present a new framework that incorporates the moments of the distribution of inner products within each shard to optimistically estimate the maximum inner product.
arXiv Detail & Related papers (2024-05-20T17:47:18Z)
Adaptive Retrieval and Scalable Indexing for k-NN Search with Cross-Encoders [77.84801537608651]
Cross-encoder (CE) models which compute similarity by jointly encoding a query-item pair perform better than embedding-based models (dual-encoders) at estimating query-item relevance. We propose a sparse-matrix factorization based method that efficiently computes latent query and item embeddings to approximate CE scores and performs k-NN search with the approximate CE similarity.
arXiv Detail & Related papers (2024-05-06T17:14:34Z)
Worst-case Performance of Popular Approximate Nearest Neighbor Search Implementations: Guarantees and Limitations [20.944914202453962]
We study the worst-case performance of graph-based approximate nearest neighbor search algorithms. For DiskANN, we show that its "slow preprocessing" version provably supports approximate nearest neighbor search query. We present a family of instances on which the empirical query time required to achieve a "reasonable" accuracy is linear in instance size.
arXiv Detail & Related papers (2023-10-29T19:25:48Z)
Knowledge Base Question Answering by Case-based Reasoning over Subgraphs [81.22050011503933]
We show that our model answers queries requiring complex reasoning patterns more effectively than existing KG completion algorithms. The proposed model outperforms or performs competitively with state-of-the-art models on several KBQA benchmarks.
arXiv Detail & Related papers (2022-02-22T01:34:35Z)
Approximate Nearest Neighbor Search under Neural Similarity Metric for Large-Scale Recommendation [20.42993976179691]
We propose a novel method to extend ANN search to arbitrary matching functions. Our main idea is to perform a greedy walk with a matching function in a similarity graph constructed from all items. The proposed method has been fully deployed in the Taobao display advertising platform and brings a considerable advertising revenue increase.
arXiv Detail & Related papers (2022-02-14T07:55:57Z)
Multidimensional Assignment Problem for multipartite entity resolution [69.48568967931608]
Multipartite entity resolution aims at integrating records from multiple datasets into one entity. We apply two procedures, a Greedy algorithm and a large scale neighborhood search, to solve the assignment problem. We find evidence that design-based multi-start can be more efficient as the size of databases grow large.
arXiv Detail & Related papers (2021-12-06T20:34:55Z)
Learning Query Expansion over the Nearest Neighbor Graph [94.80212602202518]
Graph Query Expansion (GQE) is presented, which is learned in a supervised manner and performs aggregation over an extended neighborhood of the query. The technique achieves state-of-the-art results over known benchmarks.
arXiv Detail & Related papers (2021-12-05T19:48:42Z)
Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering. We overcome the prohibitively large search space by combining A* with a novel emphtrellis data structure. We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
arXiv Detail & Related papers (2021-04-14T18:15:27Z)
Adversarial Examples for $k$-Nearest Neighbor Classifiers Based on Higher-Order Voronoi Diagrams [69.4411417775822]
Adversarial examples are a widely studied phenomenon in machine learning models. We propose an algorithm for evaluating the adversarial robustness of $k$-nearest neighbor classification.
arXiv Detail & Related papers (2020-11-19T08:49:10Z)
A Practical Index Structure Supporting Fr\'echet Proximity Queries Among Trajectories [1.9335262420787858]
We present a scalable approach for range and $k$ nearest neighbor queries under computationally expensive metrics. Based on clustering for metric indexes, we obtain a dynamic tree structure whose size is linear in the number of trajectories. We analyze the efficiency and effectiveness of our methods with extensive experiments on diverse synthetic and real-world data sets.
arXiv Detail & Related papers (2020-05-28T04:12:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.