Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection
- URL: http://arxiv.org/abs/2311.02573v2
- Date: Sat, 7 Sep 2024 03:12:07 GMT
- Title: Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection
- Authors: Harsh Shah, Kashish Mittal, Ajit Rajwade,
- Abstract summary: This work presents an adaptive group testing framework for the range-based high dimensional near neighbor search problem.
Our method efficiently marks each item in a database as neighbor or non-neighbor of a query point, based on a cosine distance threshold without exhaustive search.
We show that, using softmax-based features, our method achieves a more than ten-fold speed-up over exhaustive search with no loss of accuracy.
- Score: 2.3814052021083354
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work presents an adaptive group testing framework for the range-based high dimensional near neighbor search problem. Our method efficiently marks each item in a database as neighbor or non-neighbor of a query point, based on a cosine distance threshold without exhaustive search. Like other methods for large scale retrieval, our approach exploits the assumption that most of the items in the database are unrelated to the query. However, it does not assume a large difference between the cosine similarity of the query vector with the least related neighbor and that with the least unrelated non-neighbor. Following a multi-stage adaptive group testing algorithm based on binary splitting, we divide the set of items to be searched into half at each step, and perform dot product tests on smaller and smaller subsets, many of which we are able to prune away. We show that, using softmax-based features, our method achieves a more than ten-fold speed-up over exhaustive search with no loss of accuracy, on a variety of large datasets. Based on empirically verified models for the distribution of cosine distances, we present a theoretical analysis of the expected number of distance computations per query and the probability that a pool will be pruned. Our method has the following features: (i) It implicitly exploits useful distributional properties of cosine distances unlike other methods; (ii) All required data structures are created purely offline; (iii) It does not impose any strong assumptions on the number of true near neighbors; (iv) It is adaptable to streaming settings where new vectors are dynamically added to the database; and (v) It does not require any parameter tuning. The high recall of our technique makes it particularly suited to plagiarism detection scenarios where it is important to report every database item that is sufficiently similar item to the query.
Related papers
- A Query-Driven Approach to Space-Efficient Range Searching [12.760453906939446]
We show that a near-linear sample of queries allows the construction of a partition tree with a near-optimal expected number of nodes visited during querying.
We enhance this approach by treating node processing as a classification problem, leveraging fast classifiers like shallow neural networks to obtain experimentally efficient query times.
Our algorithm, based on a sample of queries, builds a balanced tree with nodes associated with separators that minimize query stabs on expectation.
arXiv Detail & Related papers (2025-02-19T12:01:00Z) - Range Retrieval with Graph-Based Indices [0.0]
Retrieving points based on proximity in a high-dimensional vector space is a crucial step in information retrieval applications.
We present a set of algorithms for range retrieval on graph-based vector indices.
We find up to 100x improvement in query throughput over a naive baseline approach, with 5-10x improvement on average, and strong performance up to 100 million data points.
arXiv Detail & Related papers (2025-02-18T19:18:01Z) - Learning Cluster Representatives for Approximate Nearest Neighbor Search [0.0]
This thesis provides a comprehensive explanation of clustering-based approximate nearest neighbor search.
It also introduces and delve into every aspect of our novel state-of-the-art method.
The development of this intuition and applying it to maximum inner product search has led us to demonstrate that learning cluster representatives using a simple linear function significantly boosts the accuracy of clustering-based approximate nearest neighbor search.
arXiv Detail & Related papers (2024-12-08T12:31:32Z) - Optimistic Query Routing in Clustering-based Approximate Maximum Inner Product Search [9.01394829787271]
This work bridges the gap by studying routing in clustering-based maximum inner product search.
We present a framework that incorporates the moments of the distribution of inner products within each shard to estimate the maximum inner product.
arXiv Detail & Related papers (2024-05-20T17:47:18Z) - Adaptive Retrieval and Scalable Indexing for k-NN Search with Cross-Encoders [77.84801537608651]
Cross-encoder (CE) models which compute similarity by jointly encoding a query-item pair perform better than embedding-based models (dual-encoders) at estimating query-item relevance.
We propose a sparse-matrix factorization based method that efficiently computes latent query and item embeddings to approximate CE scores and performs k-NN search with the approximate CE similarity.
arXiv Detail & Related papers (2024-05-06T17:14:34Z) - Knowledge Base Question Answering by Case-based Reasoning over Subgraphs [81.22050011503933]
We show that our model answers queries requiring complex reasoning patterns more effectively than existing KG completion algorithms.
The proposed model outperforms or performs competitively with state-of-the-art models on several KBQA benchmarks.
arXiv Detail & Related papers (2022-02-22T01:34:35Z) - Multidimensional Assignment Problem for multipartite entity resolution [69.48568967931608]
Multipartite entity resolution aims at integrating records from multiple datasets into one entity.
We apply two procedures, a Greedy algorithm and a large scale neighborhood search, to solve the assignment problem.
We find evidence that design-based multi-start can be more efficient as the size of databases grow large.
arXiv Detail & Related papers (2021-12-06T20:34:55Z) - Learning Query Expansion over the Nearest Neighbor Graph [94.80212602202518]
Graph Query Expansion (GQE) is presented, which is learned in a supervised manner and performs aggregation over an extended neighborhood of the query.
The technique achieves state-of-the-art results over known benchmarks.
arXiv Detail & Related papers (2021-12-05T19:48:42Z) - Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering.
We overcome the prohibitively large search space by combining A* with a novel emphtrellis data structure.
We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
arXiv Detail & Related papers (2021-04-14T18:15:27Z) - Adversarial Examples for $k$-Nearest Neighbor Classifiers Based on
Higher-Order Voronoi Diagrams [69.4411417775822]
Adversarial examples are a widely studied phenomenon in machine learning models.
We propose an algorithm for evaluating the adversarial robustness of $k$-nearest neighbor classification.
arXiv Detail & Related papers (2020-11-19T08:49:10Z) - A Practical Index Structure Supporting Fr\'echet Proximity Queries Among
Trajectories [1.9335262420787858]
We present a scalable approach for range and $k$ nearest neighbor queries under computationally expensive metrics.
Based on clustering for metric indexes, we obtain a dynamic tree structure whose size is linear in the number of trajectories.
We analyze the efficiency and effectiveness of our methods with extensive experiments on diverse synthetic and real-world data sets.
arXiv Detail & Related papers (2020-05-28T04:12:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.