A Triangle Inequality for Cosine Similarity
- URL: http://arxiv.org/abs/2107.04071v1
- Date: Thu, 8 Jul 2021 19:13:34 GMT
- Title: A Triangle Inequality for Cosine Similarity
- Authors: Erich Schubert
- Abstract summary: Similarity search is a fundamental problem for many data analysis techniques.
In this paper, we derive a triangle inequality for Cosine similarity that is suitable for efficient similarity search with many standard search structures.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Similarity search is a fundamental problem for many data analysis techniques.
Many efficient search techniques rely on the triangle inequality of metrics,
which allows pruning parts of the search space based on transitive bounds on
distances. Recently, Cosine similarity has become a popular alternative choice
to the standard Euclidean metric, in particular in the context of textual data
and neural network embeddings. Unfortunately, Cosine similarity is not metric
and does not satisfy the standard triangle inequality. Instead, many search
techniques for Cosine rely on approximation techniques such as locality
sensitive hashing. In this paper, we derive a triangle inequality for Cosine
similarity that is suitable for efficient similarity search with many standard
search structures (such as the VP-tree, Cover-tree, and M-tree); show that this
bound is tight and discuss fast approximations for it. We hope that this spurs
new research on accelerating exact similarity search for cosine similarity, and
possible other similarity measures beyond the existing work for distance
metrics.
Related papers
- Variance-Adjusted Cosine Distance as Similarity Metric [3.776817669946595]
This study demonstrates limitations of application of cosine similarity.
Traditional cosine similarity metric is valid only in the Euclidean space.
When there is variance and correlation in the data, then cosine distance is not a completely accurate measure of similarity.
arXiv Detail & Related papers (2025-02-04T11:20:57Z) - COS-Mix: Cosine Similarity and Distance Fusion for Improved Information Retrieval [0.0]
This study proposes a novel hybrid retrieval strategy for Retrieval-Augmented Generation (RAG)
Traditional cosine similarity measure is widely used to capture the similarity between vectors in high-dimensional spaces.
We incorporate cosine distance measures to provide a complementary perspective by quantifying the dissimilarity between vectors.
arXiv Detail & Related papers (2024-06-02T06:48:43Z) - A general framework for distributed approximate similarity search with arbitrary distances [0.5030361857850012]
Similarity search is a central problem in domains such as information management and retrieval or data analysis.
Many similarity search algorithms are designed or specifically adapted to metric distances.
This paper presents GDASC, a general framework for distributed approximate similarity search that accepts arbitrary distances.
arXiv Detail & Related papers (2024-05-22T16:19:52Z) - Is Cosine-Similarity of Embeddings Really About Similarity? [46.75365717794515]
Cosine-similarity is the cosine of the angle between two vectors, or equivalently the dot product between their normalizations.
We study embeddings derived from regularized linear models, where closed-form solutions facilitate analytical insights.
We derive analytically how cosine-similarity can yield arbitrary and therefore meaningless similarities'
arXiv Detail & Related papers (2024-03-08T16:48:20Z) - Unsupervised Hashing with Similarity Distribution Calibration [127.34239817201549]
Unsupervised hashing methods aim to preserve the similarity between data points in a feature space by mapping them to binary hash codes.
These methods often overlook the fact that the similarity between data points in the continuous feature space may not be preserved in the discrete hash code space.
The similarity range is bounded by the code length and can lead to a problem known as similarity collapse.
This paper introduces a novel Similarity Distribution (SDC) method to alleviate this problem.
arXiv Detail & Related papers (2023-02-15T14:06:39Z) - Problems with Cosine as a Measure of Embedding Similarity for High
Frequency Words [45.58634797899206]
We find that cosine similarity underestimates the similarity of frequent words with other instances of the same word or other words across contexts.
We conjecture that this underestimation of similarity for high frequency words is due to differences in the representational geometry of high and low frequency words.
arXiv Detail & Related papers (2022-05-10T18:00:06Z) - Attributable Visual Similarity Learning [90.69718495533144]
This paper proposes an attributable visual similarity learning (AVSL) framework for a more accurate and explainable similarity measure between images.
Motivated by the human semantic similarity cognition, we propose a generalized similarity learning paradigm to represent the similarity between two images with a graph.
Experiments on the CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate significant improvements over existing deep similarity learning methods.
arXiv Detail & Related papers (2022-03-28T17:35:31Z) - Kernel distance measures for time series, random fields and other
structured data [71.61147615789537]
kdiff is a novel kernel-based measure for estimating distances between instances of structured data.
It accounts for both self and cross similarities across the instances and is defined using a lower quantile of the distance distribution.
Some theoretical results are provided for separability conditions using kdiff as a distance measure for clustering and classification problems.
arXiv Detail & Related papers (2021-09-29T22:54:17Z) - LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set
Similarity Under Skew [58.21885402826496]
All-pairs set similarity is a widely used data mining task, even for large and high-dimensional datasets.
We present a new distributed algorithm, LSF-Join, for approximate all-pairs set similarity.
We show that LSF-Join efficiently finds most close pairs, even for small similarity thresholds and for skewed input sets.
arXiv Detail & Related papers (2020-03-06T00:06:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.