Related papers: A Triangle Inequality for Cosine Similarity

A Triangle Inequality for Cosine Similarity

URL: http://arxiv.org/abs/2107.04071v1
Date: Thu, 8 Jul 2021 19:13:34 GMT
Title: A Triangle Inequality for Cosine Similarity
Authors: Erich Schubert
Abstract summary: Similarity search is a fundamental problem for many data analysis techniques. In this paper, we derive a triangle inequality for Cosine similarity that is suitable for efficient similarity search with many standard search structures.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Similarity search is a fundamental problem for many data analysis techniques. Many efficient search techniques rely on the triangle inequality of metrics, which allows pruning parts of the search space based on transitive bounds on distances. Recently, Cosine similarity has become a popular alternative choice to the standard Euclidean metric, in particular in the context of textual data and neural network embeddings. Unfortunately, Cosine similarity is not metric and does not satisfy the standard triangle inequality. Instead, many search techniques for Cosine rely on approximation techniques such as locality sensitive hashing. In this paper, we derive a triangle inequality for Cosine similarity that is suitable for efficient similarity search with many standard search structures (such as the VP-tree, Cover-tree, and M-tree); show that this bound is tight and discuss fast approximations for it. We hope that this spurs new research on accelerating exact similarity search for cosine similarity, and possible other similarity measures beyond the existing work for distance metrics.

Related papers

Infinity Search: Approximate Vector Search with Projections on q-Metric Spaces [94.12116458306916]
We show that in $q$-metric spaces, metric trees can leverage a stronger version of the triangle inequality to reduce comparisons for exact search.<n>We propose a novel projection method that embeds datasets with arbitrary dissimilarity measures into $q$-metric spaces.
arXiv Detail & Related papers (2025-06-06T22:09:44Z)
Variance-Adjusted Cosine Distance as Similarity Metric [3.776817669946595]
This study demonstrates limitations of application of cosine similarity. Traditional cosine similarity metric is valid only in the Euclidean space. When there is variance and correlation in the data, then cosine distance is not a completely accurate measure of similarity.
arXiv Detail & Related papers (2025-02-04T11:20:57Z)
COS-Mix: Cosine Similarity and Distance Fusion for Improved Information Retrieval [0.0]
This study proposes a novel hybrid retrieval strategy for Retrieval-Augmented Generation (RAG) Traditional cosine similarity measure is widely used to capture the similarity between vectors in high-dimensional spaces. We incorporate cosine distance measures to provide a complementary perspective by quantifying the dissimilarity between vectors.
arXiv Detail & Related papers (2024-06-02T06:48:43Z)
A general framework for distributed approximate similarity search with arbitrary distances [0.5030361857850012]
Similarity search is a central problem in domains such as information management and retrieval or data analysis. Many similarity search algorithms are designed or specifically adapted to metric distances. This paper presents GDASC, a general framework for distributed approximate similarity search that accepts arbitrary distances.
arXiv Detail & Related papers (2024-05-22T16:19:52Z)
Is Cosine-Similarity of Embeddings Really About Similarity? [46.75365717794515]
Cosine-similarity is the cosine of the angle between two vectors, or equivalently the dot product between their normalizations. We study embeddings derived from regularized linear models, where closed-form solutions facilitate analytical insights. We derive analytically how cosine-similarity can yield arbitrary and therefore meaningless similarities'
arXiv Detail & Related papers (2024-03-08T16:48:20Z)
Unsupervised Hashing with Similarity Distribution Calibration [127.34239817201549]
Unsupervised hashing methods aim to preserve the similarity between data points in a feature space by mapping them to binary hash codes. These methods often overlook the fact that the similarity between data points in the continuous feature space may not be preserved in the discrete hash code space. The similarity range is bounded by the code length and can lead to a problem known as similarity collapse. This paper introduces a novel Similarity Distribution (SDC) method to alleviate this problem.
arXiv Detail & Related papers (2023-02-15T14:06:39Z)
Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words [45.58634797899206]
We find that cosine similarity underestimates the similarity of frequent words with other instances of the same word or other words across contexts. We conjecture that this underestimation of similarity for high frequency words is due to differences in the representational geometry of high and low frequency words.
arXiv Detail & Related papers (2022-05-10T18:00:06Z)
Attributable Visual Similarity Learning [90.69718495533144]
This paper proposes an attributable visual similarity learning (AVSL) framework for a more accurate and explainable similarity measure between images. Motivated by the human semantic similarity cognition, we propose a generalized similarity learning paradigm to represent the similarity between two images with a graph. Experiments on the CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate significant improvements over existing deep similarity learning methods.
arXiv Detail & Related papers (2022-03-28T17:35:31Z)
Kernel distance measures for time series, random fields and other structured data [71.61147615789537]
kdiff is a novel kernel-based measure for estimating distances between instances of structured data. It accounts for both self and cross similarities across the instances and is defined using a lower quantile of the distance distribution. Some theoretical results are provided for separability conditions using kdiff as a distance measure for clustering and classification problems.
arXiv Detail & Related papers (2021-09-29T22:54:17Z)
User-friendly Comparison of Similarity Algorithms on Wikidata [2.8551587610394904]
We present a user-friendly interface that allows flexible computation of similarity between Qnodes in Wikidata. At present, the similarity interface supports four algorithms, based on: graph embeddings (TransE, ComplEx), text embeddings (BERT) and class-based similarity. We also provide a REST API that can compute most similar neighbors for any Qnode in Wikidata.
arXiv Detail & Related papers (2021-08-11T18:59:25Z)
LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set Similarity Under Skew [58.21885402826496]
All-pairs set similarity is a widely used data mining task, even for large and high-dimensional datasets. We present a new distributed algorithm, LSF-Join, for approximate all-pairs set similarity. We show that LSF-Join efficiently finds most close pairs, even for small similarity thresholds and for skewed input sets.
arXiv Detail & Related papers (2020-03-06T00:06:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.