Group Testing for Accurate and Efficient Range-Based Near Neighbor
Search : An Adaptive Binary Splitting Approach
- URL: http://arxiv.org/abs/2311.02573v1
- Date: Sun, 5 Nov 2023 06:12:03 GMT
- Title: Group Testing for Accurate and Efficient Range-Based Near Neighbor
Search : An Adaptive Binary Splitting Approach
- Authors: Kashish Mittal, Harsh Shah, Ajit Rajwade
- Abstract summary: This work presents an adaptive group testing framework for the range-based high dimensional near neighbor search problem.
We experimentally show that our method achieves a speed-up over exhaustive search by a factor of more than ten with an accuracy same as that of exhaustive search.
- Score: 2.6764607949560593
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work presents an adaptive group testing framework for the range-based
high dimensional near neighbor search problem. The proposed method detects
high-similarity vectors from an extensive collection of high dimensional
vectors, where each vector represents an image descriptor. Our method
efficiently marks each item in the collection as neighbor or non-neighbor on
the basis of a cosine distance threshold without exhaustive search. Like other
methods in the domain of large scale retrieval, our approach exploits the
assumption that most of the items in the collection are unrelated to the query.
Unlike other methods, it does not assume a large difference between the cosine
similarity of the query vector with the least related neighbor and that with
the least unrelated non-neighbor. Following the procedure of binary splitting,
a multi-stage adaptive group testing algorithm, we split the set of items to be
searched into half at each step, and perform dot product tests on smaller and
smaller subsets, many of which we are able to prune away. We experimentally
show that our method achieves a speed-up over exhaustive search by a factor of
more than ten with an accuracy same as that of exhaustive search, on a variety
of large datasets. We present a theoretical analysis of the expected number of
distance computations per query and the probability that a pool with a certain
number of members will be pruned. In this way, our method exploits very useful
and practical distributional properties unlike other methods. In our method,
all required data structures are created purely offline. Moreover, our method
does not impose any strong assumptions on the number of true near neighbors, is
adaptible to streaming settings where new vectors are dynamically added to the
database, and does not require any parameter tuning.
Related papers
- Efficient Retrieval with Learned Similarities [2.729516456192901]
State-of-the-art retrieval algorithms have migrated to learned similarities.
We show that Mixture-of-Logits (MoL) is a universal approximator, and can express all learned similarity functions.
MoL sets new state-of-the-art results on recommendation retrieval tasks, and our approximate top-k retrieval with learned similarities outperforms baselines by up to two orders of magnitude in latency.
arXiv Detail & Related papers (2024-07-22T08:19:34Z) - Optimistic Query Routing in Clustering-based Approximate Maximum Inner Product Search [9.01394829787271]
We study the problem of routing in clustering-based maximum inner product search (MIPS)
We present a new framework that incorporates the moments of the distribution of inner products within each shard to optimistically estimate the maximum inner product.
arXiv Detail & Related papers (2024-05-20T17:47:18Z) - Approximate Nearest Neighbor Search under Neural Similarity Metric for
Large-Scale Recommendation [20.42993976179691]
We propose a novel method to extend ANN search to arbitrary matching functions.
Our main idea is to perform a greedy walk with a matching function in a similarity graph constructed from all items.
The proposed method has been fully deployed in the Taobao display advertising platform and brings a considerable advertising revenue increase.
arXiv Detail & Related papers (2022-02-14T07:55:57Z) - Multidimensional Assignment Problem for multipartite entity resolution [69.48568967931608]
Multipartite entity resolution aims at integrating records from multiple datasets into one entity.
We apply two procedures, a Greedy algorithm and a large scale neighborhood search, to solve the assignment problem.
We find evidence that design-based multi-start can be more efficient as the size of databases grow large.
arXiv Detail & Related papers (2021-12-06T20:34:55Z) - Adaptive Sampling for Heterogeneous Rank Aggregation from Noisy Pairwise
Comparisons [85.5955376526419]
In rank aggregation problems, users exhibit various accuracy levels when comparing pairs of items.
We propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons.
We prove that our algorithm can return the true ranking of items with high probability.
arXiv Detail & Related papers (2021-10-08T13:51:55Z) - Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z) - How to Design Robust Algorithms using Noisy Comparison Oracle [12.353002222958605]
Metric based comparison operations are fundamental to studying various clustering techniques.
In this paper, we study various problems that include finding maximum, nearest/farthest neighbor search.
We give robust algorithms for k-center clustering and agglomerative hierarchical clustering.
arXiv Detail & Related papers (2021-05-12T16:58:09Z) - Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering.
We overcome the prohibitively large search space by combining A* with a novel emphtrellis data structure.
We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
arXiv Detail & Related papers (2021-04-14T18:15:27Z) - Adversarial Examples for $k$-Nearest Neighbor Classifiers Based on
Higher-Order Voronoi Diagrams [69.4411417775822]
Adversarial examples are a widely studied phenomenon in machine learning models.
We propose an algorithm for evaluating the adversarial robustness of $k$-nearest neighbor classification.
arXiv Detail & Related papers (2020-11-19T08:49:10Z) - A Practical Index Structure Supporting Fr\'echet Proximity Queries Among
Trajectories [1.9335262420787858]
We present a scalable approach for range and $k$ nearest neighbor queries under computationally expensive metrics.
Based on clustering for metric indexes, we obtain a dynamic tree structure whose size is linear in the number of trajectories.
We analyze the efficiency and effectiveness of our methods with extensive experiments on diverse synthetic and real-world data sets.
arXiv Detail & Related papers (2020-05-28T04:12:43Z) - Non-Adaptive Adaptive Sampling on Turnstile Streams [57.619901304728366]
We give the first relative-error algorithms for column subset selection, subspace approximation, projective clustering, and volume on turnstile streams that use space sublinear in $n$.
Our adaptive sampling procedure has a number of applications to various data summarization problems that either improve state-of-the-art or have only been previously studied in the more relaxed row-arrival model.
arXiv Detail & Related papers (2020-04-23T05:00:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.