Learning Cluster Representatives for Approximate Nearest Neighbor Search
- URL: http://arxiv.org/abs/2412.05921v1
- Date: Sun, 08 Dec 2024 12:31:32 GMT
- Title: Learning Cluster Representatives for Approximate Nearest Neighbor Search
- Authors: Thomas Vecchiato
- Abstract summary: This thesis provides a comprehensive explanation of clustering-based approximate nearest neighbor search.
It also introduces and delves into every aspect of our novel state-of-the-art method.
Developing this intuition and applying it to maximum inner product search has led us to demonstrate that learning cluster representatives using a simple linear function significantly boosts the accuracy of clustering-based approximate nearest neighbor search.
- Score: 0.0
- Abstract: Developing increasingly efficient and accurate algorithms for approximate nearest neighbor search is a paramount goal in modern information retrieval. A primary approach to addressing this question is clustering, which involves partitioning the dataset into distinct groups, with each group characterized by a representative data point. By this method, retrieving the top-k data points for a query requires identifying the most relevant clusters based on their representatives -- a routing step -- and then conducting a nearest neighbor search within these clusters only, drastically reducing the search space. The objective of this thesis is not only to provide a comprehensive explanation of clustering-based approximate nearest neighbor search but also to introduce and delve into every aspect of our novel state-of-the-art method, which originated from a natural observation: The routing function solves a ranking problem, making the function amenable to learning-to-rank. Developing this intuition and applying it to maximum inner product search has led us to demonstrate that learning cluster representatives using a simple linear function significantly boosts the accuracy of clustering-based approximate nearest neighbor search.
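To make the routing-then-search pipeline concrete, here is a minimal sketch in NumPy/scikit-learn: data are partitioned with k-means, a query is routed to the few clusters whose representatives score highest under the inner product, and exact maximum inner product search runs only inside those clusters. The learned-representative step (a per-cluster least-squares fit against the best in-cluster score over a set of training queries) is an illustrative assumption, not the thesis' exact training objective.

```python
# Minimal sketch (not the thesis' exact method): IVF-style clustering for MIPS with
# (a) centroid routing and (b) cluster representatives learned by a simple linear
# least-squares fit. The training queries and regression targets are assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 64))          # database vectors
train_q = rng.normal(size=(512, 64))       # held-out training queries (assumption)
K, n_probe = 64, 4                         # number of clusters / clusters probed per query

km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(X)
labels, centroids = km.labels_, km.cluster_centers_
buckets = [np.where(labels == c)[0] for c in range(K)]

# Learned representatives: for each cluster c, fit w_c so that <q, w_c> approximates
# the best inner product achievable inside c for the training queries.
reps = np.empty_like(centroids)
for c, idx in enumerate(buckets):
    target = (train_q @ X[idx].T).max(axis=1)        # best in-cluster score per query
    reps[c], *_ = np.linalg.lstsq(train_q, target, rcond=None)

def search(q, representatives, k=10):
    """Route to the n_probe highest-scoring clusters, then run exact MIPS inside them."""
    probe = np.argsort(q @ representatives.T)[::-1][:n_probe]
    cand = np.concatenate([buckets[c] for c in probe])
    scores = X[cand] @ q
    return cand[np.argsort(scores)[::-1][:k]]

q = rng.normal(size=64)
baseline_hits = search(q, centroids)   # routing with mean representatives
learned_hits = search(q, reps)         # routing with learned representatives
```

Swapping `centroids` for `reps` in the routing step is the only change between the two variants, which mirrors the plug-in role that learned cluster representatives play in the abstract above.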
Related papers
- Range Retrieval with Graph-Based Indices [0.0]
Retrieving points based on proximity in a high-dimensional vector space is a crucial step in information retrieval applications.
We present a set of algorithms for range retrieval on graph-based vector indices.
We find up to 100x improvement in query throughput over a naive baseline approach, with 5-10x improvement on average, and strong performance up to 100 million data points.
arXiv Detail & Related papers (2025-02-18T19:18:01Z)
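As a rough illustration of range retrieval on a graph index (the entry above; the paper's own algorithms are more sophisticated), the sketch below builds a plain k-NN graph, walks greedily toward the query, and then expands outward collecting every node within radius r. The graph construction, the greedy entry step, and the radius value are all assumptions for the sketch, and recall is not guaranteed.

```python
# Illustrative sketch (not the paper's algorithm): range retrieval on a k-NN graph.
# A greedy walk approaches the query, then a BFS collects nodes within radius r,
# expanding the search only through nodes that are themselves in range.
import numpy as np
from collections import deque
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = rng.normal(size=(5_000, 32))
k = 16
_, knn = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
graph = [set(row[1:]) for row in knn]      # drop the self-loop in column 0

def range_search(q, r, entry=0):
    dist = lambda i: np.linalg.norm(X[i] - q)
    cur, improved = entry, True
    while improved:                         # greedy descent toward the query
        improved = False
        for nb in graph[cur]:
            if dist(nb) < dist(cur):
                cur, improved = nb, True
                break
    seen, out, frontier = {cur}, [], deque([cur])
    while frontier:                         # bounded expansion around the landing point
        v = frontier.popleft()
        if dist(v) <= r:
            out.append(v)
            for nb in graph[v] - seen:
                seen.add(nb)
                frontier.append(nb)
    return out

hits = range_search(rng.normal(size=32), r=7.0)
```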
- On Socially Fair Low-Rank Approximation and Column Subset Selection [62.44413238556872]
Low-rank approximation and column subset selection are two fundamental and related problems that are applied across a wealth of machine learning applications.
We show that, surprisingly, even a constant-factor approximation for fair low-rank approximation requires exponential time under certain standard complexity hypotheses.
We give an algorithm for fair low-rank approximation that, for a constant number of groups and constant-factor accuracy, runs in $2^{\text{poly}(k)}$ time rather than the naïve $n^{\text{poly}(k)}$.
arXiv Detail & Related papers (2024-12-08T20:34:16Z)
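For context on the entry above, socially fair low-rank approximation is usually posed as a min-max problem over demographic groups. The formulation below is the common one and is stated here as background; the per-group normalization $1/n_i$ is one standard choice, assumed rather than taken from the entry.

```latex
% Given data matrices A_1, ..., A_\ell (one per group, with n_i rows each),
% find a rank-k projection minimizing the worst per-group reconstruction error:
\min_{\substack{V \in \mathbb{R}^{d \times k} \\ V^\top V = I_k}} \;
\max_{i \in [\ell]} \; \frac{1}{n_i} \bigl\lVert A_i - A_i V V^\top \bigr\rVert_F^2
```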
- Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection [2.3814052021083354]
This work presents an adaptive group testing framework for the range-based high dimensional near neighbor search problem.
Our method efficiently marks each item in a database as neighbor or non-neighbor of a query point, based on a cosine distance threshold without exhaustive search.
We show that, using softmax-based features, our method achieves a more than ten-fold speed-up over exhaustive search with no loss of accuracy.
arXiv Detail & Related papers (2023-11-05T06:12:03Z)
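As a rough illustration of the group-testing idea in the entry above, the sketch below prunes an entire pool of items whenever an upper bound on the best cosine similarity inside it falls below the threshold, and otherwise splits the pool and recurses. The specific bound and the recursive splitting are assumptions, not the paper's test design.

```python
# Hedged sketch: adaptive group testing for cosine-threshold near-neighbor search.
# For unit-norm q and x_i:  q·x_i <= q·c + max_i ||x_i - c||  (c = pool centroid),
# so a pool whose bound is below tau cannot contain any neighbor and is discarded.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20_000, 64))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # unit-norm database vectors

def near_neighbors(q, tau, idx=None, min_pool=32):
    """Return indices i with cosine(q, X[i]) >= tau, testing pools instead of items."""
    idx = np.arange(len(X)) if idx is None else idx
    pool = X[idx]
    c = pool.mean(axis=0)
    radius = np.linalg.norm(pool - c, axis=1).max()
    if q @ c + radius < tau:                          # no member can pass the threshold
        return np.empty(0, dtype=int)
    if len(idx) <= min_pool:                          # small pool: test items directly
        return idx[pool @ q >= tau]
    half = len(idx) // 2                              # otherwise split the pool and recurse
    return np.concatenate([near_neighbors(q, tau, idx[:half], min_pool),
                           near_neighbors(q, tau, idx[half:], min_pool)])

# Pools here are contiguous index ranges, kept simple for the sketch; in practice,
# spatially coherent pools (e.g., clusters) are what make the pruning bound effective.
q = rng.normal(size=64)
q /= np.linalg.norm(q)
hits = near_neighbors(q, tau=0.5)
```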
- Towards Personalized Preprocessing Pipeline Search [52.59156206880384]
ClusterP3S is a novel framework for Personalized Preprocessing Pipeline Search via Clustering.
We propose a hierarchical search strategy to jointly learn the clusters and search for the optimal pipelines.
Experiments on benchmark classification datasets demonstrate the effectiveness of enabling feature-wise preprocessing pipeline search.
arXiv Detail & Related papers (2023-02-28T05:45:05Z)
- Natural Hierarchical Cluster Analysis by Nearest Neighbors with Near-Linear Time Complexity [0.0]
We propose a nearest neighbor based clustering algorithm that results in a naturally defined hierarchy of clusters.
In contrast to agglomerative and divisive hierarchical clustering algorithms, our approach does not depend on the iterative workings of the algorithm.
arXiv Detail & Related papers (2022-03-15T16:03:42Z)
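The entry above builds a cluster hierarchy from nearest-neighbor relations. Below is a toy sketch of one such construction, assumed for illustration (link every active point to its nearest neighbor, merge the resulting connected components, and recurse on component centroids); the paper's actual algorithm may differ.

```python
# Hedged sketch: a nearest-neighbor-based hierarchy. At each level every active point
# is linked to its nearest neighbor, connected components become clusters, and the
# next level runs on the component centroids.
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import NearestNeighbors

def nn_hierarchy(points):
    levels, active = [], points
    while len(active) > 1:
        _, nn = NearestNeighbors(n_neighbors=2).fit(active).kneighbors(active)
        rows = np.arange(len(active))
        links = coo_matrix((np.ones(len(active)), (rows, nn[:, 1])),
                           shape=(len(active), len(active)))
        n_comp, labels = connected_components(links, directed=False)
        levels.append(labels)            # cluster id of each node at this level
        active = np.stack([active[labels == c].mean(axis=0) for c in range(n_comp)])
    return levels

rng = np.random.default_rng(3)
levels = nn_hierarchy(rng.normal(size=(500, 8)))
```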
- Hierarchical clustering by aggregating representatives in sub-minimum-spanning-trees [5.877624540482919]
We propose a novel hierarchical clustering algorithm in which, while building the clustering dendrogram, we can effectively detect the representative point.
Under our analysis, the proposed algorithm has $O(n \log n)$ time complexity and $O(\log n)$ space complexity, indicating that it scales to massive data.
arXiv Detail & Related papers (2021-11-11T07:36:55Z)
- Correlation Clustering Reconstruction in Semi-Adversarial Models [70.11015369368272]
Correlation Clustering is an important clustering problem with many applications.
We study the reconstruction version of this problem in which one is seeking to reconstruct a latent clustering corrupted by random noise and adversarial modifications.
arXiv Detail & Related papers (2021-08-10T14:46:17Z)
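For reference, correlation clustering takes a complete graph whose edges are labeled similar ($+$) or dissimilar ($-$) and asks for a partition minimizing disagreements; the reconstruction problem in the entry above perturbs such an instance with noise and adversarial changes. The objective below is the standard min-disagreement form, included as background.

```latex
% Standard correlation clustering objective: pay for every "+" edge cut by the
% clustering C and for every "-" edge kept inside a cluster.
\min_{\mathcal{C}} \;
\bigl|\{ (u,v) \in E^{+} : \mathcal{C}(u) \neq \mathcal{C}(v) \}\bigr|
\;+\;
\bigl|\{ (u,v) \in E^{-} : \mathcal{C}(u) = \mathcal{C}(v) \}\bigr|
```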
- How to Design Robust Algorithms using Noisy Comparison Oracle [12.353002222958605]
Metric-based comparison operations are fundamental to studying various clustering techniques.
In this paper, we study various problems, including finding the maximum and nearest/farthest neighbor search.
We give robust algorithms for k-center clustering and agglomerative hierarchical clustering.
arXiv Detail & Related papers (2021-05-12T16:58:09Z)
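The entry above concerns clustering when distance comparisons come from a noisy oracle. A common robustness primitive, assumed here for illustration rather than taken from the paper, is to repeat each noisy comparison and take a majority vote, which drives the error probability down exponentially in the number of repeats.

```python
# Hedged sketch: boosting a noisy comparison oracle by majority vote.
# The oracle answers "is d(a, b) < d(a, c)?" but flips the answer with probability p < 1/2;
# repeating t times and taking the majority gives error at most exp(-2 t (1/2 - p)^2).
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
p = 0.3                                              # oracle's flip probability

def noisy_compare(a, b, c):
    truth = np.linalg.norm(X[a] - X[b]) < np.linalg.norm(X[a] - X[c])
    return truth ^ (rng.random() < p)                # flip the true answer w.p. p

def robust_compare(a, b, c, t=31):
    votes = sum(noisy_compare(a, b, c) for _ in range(t))
    return votes > t / 2                             # majority over t independent calls

# Example: find an (approximate) nearest neighbor of point 0 using only robust comparisons.
best = 1
for j in range(2, len(X)):
    if robust_compare(0, j, best):                   # is j closer to point 0 than best?
        best = j
```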
- Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering.
We overcome the prohibitively large search space by combining A* with a novel trellis data structure.
We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
arXiv Detail & Related papers (2021-04-14T18:15:27Z)
- Adversarial Examples for $k$-Nearest Neighbor Classifiers Based on Higher-Order Voronoi Diagrams [69.4411417775822]
Adversarial examples are a widely studied phenomenon in machine learning models.
We propose an algorithm for evaluating the adversarial robustness of $k$-nearest neighbor classification.
arXiv Detail & Related papers (2020-11-19T08:49:10Z)
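As a simple baseline for the entry above (not the paper's Voronoi-diagram-based algorithm), an adversarial example for a $k$-NN classifier can be found by moving the input toward training points of another class and binary-searching for the smallest step that flips the prediction.

```python
# Hedged sketch: a naive adversarial-example search for a k-NN classifier.
# Walk from x toward each training point of a different class, binary-search the
# smallest interpolation that changes the predicted label, and keep the closest flip.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)

def naive_attack(x, steps=30):
    label = clf.predict([x])[0]
    best = None
    for t in X[y != label]:                          # candidate targets of another class
        if clf.predict([t])[0] == label:             # skip targets the model labels like x
            continue
        lo, hi = 0.0, 1.0                            # interpolation along the segment x -> t
        for _ in range(steps):                       # binary-search the label flip point
            mid = (lo + hi) / 2
            if clf.predict([(1 - mid) * x + mid * t])[0] == label:
                lo = mid
            else:
                hi = mid
        adv = (1 - hi) * x + hi * t                  # this point flips the prediction
        if best is None or np.linalg.norm(adv - x) < np.linalg.norm(best - x):
            best = adv
    return best                                      # closest flip found (None if no target works)

x_adv = naive_attack(X[0])
```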
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)