A general framework for distributed approximate similarity search with arbitrary distances
- URL: http://arxiv.org/abs/2405.13795v2
- Date: Wed, 07 Aug 2024 15:13:45 GMT
- Title: A general framework for distributed approximate similarity search with arbitrary distances
- Authors: Elena Garcia-Morato, Maria Jesus Algar, Cesar Alfaro, Felipe Ortega, Javier Gomez, Javier M. Moguerza,
- Abstract summary: Similarity search is a central problem in domains such as information management and retrieval or data analysis.
Many similarity search algorithms are designed or specifically adapted to metric distances.
This paper presents GDASC, a general framework for distributed approximate similarity search that accepts arbitrary distances.
- Score: 0.5030361857850012
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Similarity search is a central problem in domains such as information management and retrieval or data analysis. Many similarity search algorithms are designed or specifically adapted to metric distances. Thus, they are unsuitable for alternatives like the cosine distance, which has become quite common, for example, with embeddings and in text mining. This paper presents GDASC (General Distributed Approximate Similarity search with Clustering), a general framework for distributed approximate similarity search that accepts arbitrary distances. This framework can build a multilevel index structure, by selecting a clustering algorithm, the number of prototypes in each cluster and any arbitrary distance function. As a result, this framework effectively overcomes the limitation of using metric distances and can address situations involving cosine similarity or other non-standard similarity measures. Experimental results using k-medoids clustering in GDASC with real datasets confirm the applicability of this approach for approximate similarity search, improving the performance of extant algorithms for this purpose.
Related papers
- Measuring similarity between embedding spaces using induced neighborhood graphs [10.056989400384772]
We propose a metric to evaluate the similarity between paired item representations.
Our results show that accuracy in both analogy and zero-shot classification tasks correlates with the embedding similarity.
arXiv Detail & Related papers (2024-11-13T15:22:33Z) - GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning [51.677086019209554]
We propose a Generalized Structural Sparse to capture powerful relationships across modalities for pair-wise similarity learning.
The distance metric delicately encapsulates two formats of diagonal and block-diagonal terms.
Experiments on cross-modal and two extra uni-modal retrieval tasks have validated its superiority and flexibility.
arXiv Detail & Related papers (2024-10-20T03:45:50Z) - Enhancing Community Detection in Networks: A Comparative Analysis of Local Metrics and Hierarchical Algorithms [49.1574468325115]
This study employs the same method to evaluate the relevance of using local similarity metrics for community detection.
The efficacy of these metrics was evaluated by applying the base algorithm to several real networks with varying community sizes.
arXiv Detail & Related papers (2024-08-17T02:17:09Z) - COS-Mix: Cosine Similarity and Distance Fusion for Improved Information Retrieval [0.0]
This study proposes a novel hybrid retrieval strategy for Retrieval-Augmented Generation (RAG)
Traditional cosine similarity measure is widely used to capture the similarity between vectors in high-dimensional spaces.
We incorporate cosine distance measures to provide a complementary perspective by quantifying the dissimilarity between vectors.
arXiv Detail & Related papers (2024-06-02T06:48:43Z) - Relation-aware Ensemble Learning for Knowledge Graph Embedding [68.94900786314666]
We propose to learn an ensemble by leveraging existing methods in a relation-aware manner.
exploring these semantics using relation-aware ensemble leads to a much larger search space than general ensemble methods.
We propose a divide-search-combine algorithm RelEns-DSC that searches the relation-wise ensemble weights independently.
arXiv Detail & Related papers (2023-10-13T07:40:12Z) - Mixed-type Distance Shrinkage and Selection for Clustering via Kernel Metric Learning [0.0]
We propose a metric called KDSUM that uses mixed kernels to measure dissimilarity.
We demonstrate that KDSUM is a shrinkage method from existing mixed-type metrics to a uniform dissimilarity metric.
arXiv Detail & Related papers (2023-06-02T19:51:48Z) - Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct distance matrix between data points by Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z) - Efficient Joint-Dimensional Search with Solution Space Regularization
for Real-Time Semantic Segmentation [27.94898516315886]
We search an optimal network structure that can run in real-time for this problem.
A novel Solution Space Regularization (SSR) loss is first proposed to effectively encourage the supernet to converge to its discrete one.
A new Hierarchical and Progressive Solution Space Shrinking method is presented to further achieve high efficiency of searching.
arXiv Detail & Related papers (2022-08-10T11:07:33Z) - A Metaheuristic Algorithm for Large Maximum Weight Independent Set
Problems [58.348679046591265]
Given a node-weighted graph, find a set of independent (mutually nonadjacent) nodes whose node-weight sum is maximum.
Some of the graphs airsing in this application are large, having hundreds of thousands of nodes and hundreds of millions of edges.
We develop a new local search algorithm, which is a metaheuristic in the greedy randomized adaptive search framework.
arXiv Detail & Related papers (2022-03-28T21:34:16Z) - A density peaks clustering algorithm with sparse search and K-d tree [16.141611031128427]
Density peaks clustering algorithm with sparse search and K-d tree is developed to solve this problem.
Experiments are carried out on datasets with different distribution characteristics, by comparing with other five typical clustering algorithms.
arXiv Detail & Related papers (2022-03-02T09:29:40Z) - Exploring Complicated Search Spaces with Interleaving-Free Sampling [127.07551427957362]
In this paper, we build the search algorithm upon a complicated search space with long-distance connections.
We present a simple yet effective algorithm named textbfIF-NAS, where we perform a periodic sampling strategy to construct different sub-networks.
In the proposed search space, IF-NAS outperform both random sampling and previous weight-sharing search algorithms by a significant margin.
arXiv Detail & Related papers (2021-12-05T06:42:48Z) - Kernel distance measures for time series, random fields and other
structured data [71.61147615789537]
kdiff is a novel kernel-based measure for estimating distances between instances of structured data.
It accounts for both self and cross similarities across the instances and is defined using a lower quantile of the distance distribution.
Some theoretical results are provided for separability conditions using kdiff as a distance measure for clustering and classification problems.
arXiv Detail & Related papers (2021-09-29T22:54:17Z) - Selection-Expansion: A Unifying Framework for Motion-Planning and
Diversity Search Algorithms [69.87173070473717]
We investigate the properties of two diversity search algorithms, the Novelty Search and the Goal Exploration Process algorithms.
The relation to MP algorithms reveals that the smoothness, or lack of smoothness of the mapping between the policy parameter space and the outcome space plays a key role in the search efficiency.
arXiv Detail & Related papers (2021-04-10T13:52:27Z) - Manifold learning with approximate nearest neighbors [1.8477401359673706]
We use a broad range of approximate nearest neighbor algorithms within manifold learning algorithms and evaluate their impact on embedding accuracy.
Via a thorough empirical investigation based on the benchmark MNIST dataset, it is shown that approximate nearest neighbors lead to substantial improvements in computational time.
This application demonstrates how the proposed methods can be used to visualize and identify anomalies and uncover underlying structure within high-dimensional data.
arXiv Detail & Related papers (2021-02-22T12:04:23Z) - Fuzzy clustering algorithms with distance metric learning and entropy
regularization [0.0]
This paper proposes fuzzy clustering algorithms based on Euclidean, City-block and Mahalanobis distances and entropy regularization.
Several experiments on synthetic and real datasets, including its application to noisy image texture segmentation, demonstrate the usefulness of these adaptive clustering methods.
arXiv Detail & Related papers (2021-02-18T18:19:04Z) - Similarity-based Distance for Categorical Clustering using Space
Structure [5.543220407902113]
We have proposed a novel distance metric, similarity-based distance (SBD) to find the distance between objects of categorical data.
Our proposed distance (SBD) significantly outperforms the existing algorithms like k-modes or other SBC type algorithms when used on categorical datasets.
arXiv Detail & Related papers (2020-11-19T15:18:26Z) - Distributed Optimization, Averaging via ADMM, and Network Topology [0.0]
We study the connection between network topology and convergence rates for different algorithms on a real world problem of sensor localization.
We also show interesting connections between ADMM and lifted Markov chains besides providing an explicitly characterization of its convergence.
arXiv Detail & Related papers (2020-09-05T21:44:39Z) - Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z) - Stable and consistent density-based clustering via multiparameter
persistence [77.34726150561087]
We consider the degree-Rips construction from topological data analysis.
We analyze its stability to perturbations of the input data using the correspondence-interleaving distance.
We integrate these methods into a pipeline for density-based clustering, which we call Persistable.
arXiv Detail & Related papers (2020-05-18T19:45:04Z) - LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set
Similarity Under Skew [58.21885402826496]
All-pairs set similarity is a widely used data mining task, even for large and high-dimensional datasets.
We present a new distributed algorithm, LSF-Join, for approximate all-pairs set similarity.
We show that LSF-Join efficiently finds most close pairs, even for small similarity thresholds and for skewed input sets.
arXiv Detail & Related papers (2020-03-06T00:06:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.