Document Similarity from Vector Space Densities
- URL: http://arxiv.org/abs/2009.00672v1
- Date: Tue, 1 Sep 2020 19:28:51 GMT
- Title: Document Similarity from Vector Space Densities
- Authors: Ilia Rushkin
- Abstract summary: We propose a method for estimating similarities between text documents.
The method is based on a word embedding in a high-dimensional Euclidean space and on kernel regression.
We find that the accuracy of this method is virtually the same as that of a state-of-the-art method, while the gain in speed is very substantial.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a computationally light method for estimating similarities between
text documents, which we call the density similarity (DS) method. The method is
based on a word embedding in a high-dimensional Euclidean space and on kernel
regression, and takes into account semantic relations among words. We find that
the accuracy of this method is virtually the same as that of a state-of-the-art
method, while the gain in speed is very substantial. Additionally, we introduce
generalized versions of the top-k accuracy metric and of the Jaccard metric of
agreement between similarity models.
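The abstract does not spell out the DS algorithm, so the following is only a minimal sketch of the general idea it names: smoothing each document's word vectors with a kernel and comparing the resulting densities. It is not the author's implementation. The toy 2-D embedding, the Gaussian kernel, the bandwidth value, the choice of reference points, and the function names `density_profile` and `density_similarity` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy embedding: animal words cluster together, finance words cluster together.
embedding = {
    "cat": [1.00, 0.10], "dog": [0.90, 0.20], "pet": [0.95, 0.00],
    "stock": [0.10, 1.00], "bond": [0.00, 0.90], "market": [0.20, 0.95],
}

def vectors(words):
    """Look up the embedding vector of every word in a document."""
    return np.array([embedding[w] for w in words], dtype=float)

def density_profile(doc_vectors, ref_points, bandwidth):
    """Gaussian-kernel density of a document's word vectors at each reference point."""
    diff = ref_points[:, None, :] - doc_vectors[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)            # shape (n_ref, n_words)
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return weights.mean(axis=1)                      # average over the document's words

def density_similarity(doc_a, doc_b, ref_points, bandwidth):
    """Cosine similarity between two documents' density profiles (an assumed choice)."""
    pa = density_profile(doc_a, ref_points, bandwidth)
    pb = density_profile(doc_b, ref_points, bandwidth)
    denom = np.linalg.norm(pa) * np.linalg.norm(pb)
    return float(pa @ pb / denom) if denom > 0 else 0.0

# Reference points: here simply every vector in the toy vocabulary.
ref_points = vectors(list(embedding))

doc_pets = vectors(["cat", "cat", "pet"])
doc_dog = vectors(["dog"])                # shares no word with doc_pets
doc_finance = vectors(["stock", "bond"])

print(density_similarity(doc_pets, doc_dog, ref_points, 0.3))
print(density_similarity(doc_pets, doc_finance, ref_points, 0.3))
```

In this sketch the two animal documents score as similar even though they share no words, because their word vectors lie close together in the embedding space. That is the sense in which a density-based comparison can take semantic relations among words into account.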
Related papers
- Context-Aware Palmprint Recognition via a Relative Similarity Metric [0.0]
We propose a new approach to the matching mechanism for palmprint recognition by introducing a Relative Similarity Metric (RSM).
RSM captures how a pairwise similarity compares within the context of the entire dataset.
Our method achieves a new state-of-the-art 0.000036% Equal Error Rate (EER) on the Tongji dataset, outperforming previous methods.
arXiv Detail & Related papers (2025-04-15T15:46:17Z)
- Rethinking Distance Metrics for Counterfactual Explainability [53.436414009687]
We investigate a framing for counterfactual generation methods that considers counterfactuals not as independent draws from a region around the reference, but as jointly sampled with the reference from the underlying data distribution.
We derive a distance metric, tailored for counterfactual similarity that can be applied to a broad range of settings.
arXiv Detail & Related papers (2024-10-18T15:06:50Z)
- COS-Mix: Cosine Similarity and Distance Fusion for Improved Information Retrieval [0.0]
This study proposes a novel hybrid retrieval strategy for Retrieval-Augmented Generation (RAG).
The traditional cosine similarity measure is widely used to capture the similarity between vectors in high-dimensional spaces.
We incorporate cosine distance measures to provide a complementary perspective by quantifying the dissimilarity between vectors.
arXiv Detail & Related papers (2024-06-02T06:48:43Z)
- Semantic similarity prediction is better than other semantic similarity measures [5.176134438571082]
We argue that when we are only interested in measuring the semantic similarity, it is better to directly predict the similarity using a fine-tuned model for such a task.
Using a fine-tuned model for the Semantic Textual Similarity Benchmark tasks (STS-B) from the GLUE benchmark, we define the STSScore approach and show that the resulting similarity is better aligned with our expectations on a robust semantic similarity measure than other approaches.
arXiv Detail & Related papers (2023-09-22T08:11:01Z)
- A Comparative Study of Sentence Embedding Models for Assessing Semantic Variation [0.0]
We compare several recent sentence embedding methods via time-series of semantic similarity between successive sentences and matrices of pairwise sentence similarity for multiple books of literature.
We find that most of the sentence embedding methods considered do infer highly correlated patterns of semantic similarity in a given document, but show interesting differences.
arXiv Detail & Related papers (2023-08-08T23:31:10Z)
- Attributable Visual Similarity Learning [90.69718495533144]
This paper proposes an attributable visual similarity learning (AVSL) framework for a more accurate and explainable similarity measure between images.
Motivated by the human semantic similarity cognition, we propose a generalized similarity learning paradigm to represent the similarity between two images with a graph.
Experiments on the CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate significant improvements over existing deep similarity learning methods.
arXiv Detail & Related papers (2022-03-28T17:35:31Z)
- FastKASSIM: A Fast Tree Kernel-Based Syntactic Similarity Metric [48.66580267438049]
We present FastKASSIM, a metric for utterance- and document-level syntactic similarity.
It pairs and averages the most similar dependency parse trees between a pair of documents based on tree kernels.
It runs up to 5.2 times faster than our baseline method over the documents in the r/ChangeMyView corpus.
arXiv Detail & Related papers (2022-03-15T22:33:26Z)
- Recall@k Surrogate Loss with Large Batches and Similarity Mixup [62.67458021725227]
Direct optimization, by gradient descent, of an evaluation metric is not possible when it is non-differentiable.
In this work, a differentiable surrogate loss for the recall is proposed.
The proposed method achieves state-of-the-art results in several image retrieval benchmarks.
arXiv Detail & Related papers (2021-08-25T11:09:11Z)
- Instance Similarity Learning for Unsupervised Feature Representation [83.31011038813459]
We propose an instance similarity learning (ISL) method for unsupervised feature representation.
We employ the Generative Adversarial Networks (GAN) to mine the underlying feature manifold.
Experiments on image classification demonstrate the superiority of our method compared with the state-of-the-art methods.
arXiv Detail & Related papers (2021-08-05T16:42:06Z)
- Word Rotator's Distance [50.67809662270474]
A key principle in assessing textual similarity is measuring the degree of semantic overlap between two texts by considering the word alignment.
We show that the norm of word vectors is a good proxy for word importance, and their angle is a good proxy for word similarity.
We propose a method that first decouples word vectors into their norm and direction, and then computes alignment-based similarity.
arXiv Detail & Related papers (2020-04-30T17:48:42Z)
- Style-transfer and Paraphrase: Looking for a Sensible Semantic Similarity Metric [18.313879914379005]
We show that none of the metrics widely used in the literature is close enough to human judgment in these tasks.
A number of recently proposed metrics provide comparable results, yet Word Mover Distance is shown to be the most reasonable solution.
arXiv Detail & Related papers (2020-04-10T11:52:06Z)
- Learning Flat Latent Manifolds with VAEs [16.725880610265378]
We propose an extension to the framework of variational auto-encoders, where the Euclidean metric is a proxy for the similarity between data points.
We replace the compact prior typically used in variational auto-encoders with a recently presented, more expressive hierarchical one.
We evaluate our method on a range of data-sets, including a video-tracking benchmark.
arXiv Detail & Related papers (2020-02-12T09:54:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.