Problems with Cosine as a Measure of Embedding Similarity for High
Frequency Words
- URL: http://arxiv.org/abs/2205.05092v1
- Date: Tue, 10 May 2022 18:00:06 GMT
- Title: Problems with Cosine as a Measure of Embedding Similarity for High
Frequency Words
- Authors: Kaitlyn Zhou, Kawin Ethayarajh, Dallas Card, Dan Jurafsky
- Abstract summary: We find that cosine similarity underestimates the similarity of frequent words with other instances of the same word or other words across contexts.
We conjecture that this underestimation of similarity for high frequency words is due to differences in the representational geometry of high and low frequency words.
- Score: 45.58634797899206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cosine similarity of contextual embeddings is used in many NLP tasks (e.g.,
QA, IR, MT) and metrics (e.g., BERTScore). Here, we uncover systematic ways in
which word similarities estimated by cosine over BERT embeddings are
understated and trace this effect to training data frequency. We find that
relative to human judgements, cosine similarity underestimates the similarity
of frequent words with other instances of the same word or other words across
contexts, even after controlling for polysemy and other factors. We conjecture
that this underestimation of similarity for high frequency words is due to
differences in the representational geometry of high and low frequency words
and provide a formal argument for the two-dimensional case.
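To make the measured quantity concrete, the following is a minimal sketch (not the authors' code) of cosine similarity between contextual BERT embeddings of the same word in two different contexts. The checkpoint name, example sentences, and the single-subtoken lookup are illustrative assumptions.

```python
# Minimal sketch: cosine similarity between contextual embeddings of the same
# word in two contexts (illustrative; not the paper's experimental pipeline).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_embedding(sentence: str, word: str) -> torch.Tensor:
    # Return the last-layer vector of the first subtoken of `word` in `sentence`
    # (a simplified lookup; careful analyses handle multi-subtoken words and
    # repeated occurrences explicitly).
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    idx = tokens.index(tokenizer.tokenize(word)[0])
    return hidden[idx]

def cosine(u: torch.Tensor, v: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(u, v, dim=0).item()

# Two contexts for the same high-frequency word; the paper reports that cosine
# over such pairs tends to be lower than human similarity judgements suggest.
a = word_embedding("I had a great time at the party.", "time")
b = word_embedding("The meeting took a long time.", "time")
print(f"cosine(time, time) = {cosine(a, b):.3f}")
```

Repeating this comparison over many word pairs, grouped by training-data frequency, is the kind of analysis against human judgements that the abstract describes.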
Related papers
- Solving Cosine Similarity Underestimation between High Frequency Words by L2 Norm Discounting [19.12036493733793]
We propose a method to discount the L2 norm of a contextualised word embedding by the frequency of that word in a corpus when measuring the cosine similarities between words.
Experimental results on a contextualised word similarity dataset show that our proposed discounting method accurately solves the similarity underestimation problem.
arXiv Detail & Related papers (2023-05-17T23:41:30Z)
- Relational Sentence Embedding for Flexible Semantic Matching [86.21393054423355]
We present Relational Sentence Embedding (RSE), a new paradigm for further exploring the potential of sentence embeddings.
RSE is effective and flexible in modeling sentence relations and outperforms a series of state-of-the-art embedding methods.
arXiv Detail & Related papers (2022-12-17T05:25:17Z)
- Investigating the Frequency Distortion of Word Embeddings and Its Impact on Bias Metrics [2.1374208474242815]
We systematically study the association between frequency and semantic similarity in several static word embeddings.
We find that Skip-gram, GloVe and FastText embeddings tend to produce higher semantic similarity between high-frequency words than between other frequency combinations.
arXiv Detail & Related papers (2022-11-15T15:11:06Z)
- Word Embeddings Are Capable of Capturing Rhythmic Similarity of Words [0.0]
Word embedding systems such as Word2Vec and GloVe are well-known in deep learning approaches to NLP.
In this work we investigated their usefulness in capturing rhythmic similarity of words instead.
The results show that the vectors these embeddings assign to rhyming words are more similar to each other than to the vectors of other words.
arXiv Detail & Related papers (2022-04-11T02:33:23Z)
- Comparing in context: Improving cosine similarity measures with a metric tensor [0.0]
Cosine similarity is a widely used measure of the relatedness of pre-trained word embeddings trained with a language modeling objective.
We propose instead the use of an extended cosine similarity measure to improve performance on word similarity estimation, with gains in interpretability.
We learn contextualized metrics and compare the results against baseline values obtained with the standard cosine similarity measure, finding consistent improvements (a minimal sketch of a metric-tensor cosine appears after this list).
We also train a contextualized similarity measure for both SimLex-999 and WordSim-353, comparing the results with the corresponding baselines, and using these datasets as independent test sets for the all-context similarity measure learned on
arXiv Detail & Related papers (2022-03-28T18:04:26Z)
- Attributable Visual Similarity Learning [90.69718495533144]
This paper proposes an attributable visual similarity learning (AVSL) framework for a more accurate and explainable similarity measure between images.
Motivated by how humans assess semantic similarity, we propose a generalized similarity learning paradigm that represents the similarity between two images as a graph.
Experiments on the CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate significant improvements over existing deep similarity learning methods.
arXiv Detail & Related papers (2022-03-28T17:35:31Z)
- FastKASSIM: A Fast Tree Kernel-Based Syntactic Similarity Metric [48.66580267438049]
We present FastKASSIM, a metric for utterance- and document-level syntactic similarity.
It pairs and averages the most similar dependency parse trees between a pair of documents based on tree kernels.
It runs up to 5.2 times faster than our baseline method over the documents in the r/ChangeMyView corpus.
arXiv Detail & Related papers (2022-03-15T22:33:26Z)
- Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora [54.757845511368814]
The problem of comparing two bodies of text and searching for words that differ in their usage arises often in digital humanities and computational social science.
This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large.
We propose an alternative approach that does not use vector space alignment and instead compares the nearest neighbors of each word (a minimal sketch of this neighbor-overlap idea appears after this list).
arXiv Detail & Related papers (2021-12-28T23:46:00Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions; the divergence between these predicted distributions gives a neighboring distribution divergence (NDD).
Experiments on Semantic Textual Similarity show NDD to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- Frequency-based Distortions in Contextualized Word Embeddings [29.88883761339757]
This work explores the geometric characteristics of contextualized word embeddings with two novel tools.
Words of high and low frequency differ significantly with respect to their representational geometry.
BERT-Base has more trouble differentiating between South American and African countries than North American and European ones.
arXiv Detail & Related papers (2021-04-17T06:35:48Z)
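As referenced above for "Comparing in context: Improving cosine similarity measures with a metric tensor", here is a minimal sketch of a cosine similarity generalized by a positive-definite metric tensor. The diagonal metric, dimensionality, and random vectors are hypothetical stand-ins; the paper's contextualized parameterization and training procedure are not reproduced.

```python
# Minimal sketch: cosine similarity extended with a metric tensor M.
# Setting M to the identity recovers the standard cosine similarity.
import numpy as np

def metric_cosine(u: np.ndarray, v: np.ndarray, M: np.ndarray) -> float:
    # Generalized cosine <u, v>_M / (||u||_M * ||v||_M) for positive-definite M.
    num = u @ M @ v
    den = np.sqrt((u @ M @ u) * (v @ M @ v)) + 1e-12
    return float(num / den)

# Hypothetical example: a diagonal metric that re-weights embedding dimensions.
rng = np.random.default_rng(0)
u, v = rng.normal(size=50), rng.normal(size=50)
M = np.diag(rng.uniform(0.5, 2.0, size=50))
print(metric_cosine(u, v, np.eye(50)))   # standard cosine
print(metric_cosine(u, v, M))            # metric-tensor cosine
```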
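And for "Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora", a minimal sketch of the neighbor-overlap idea: rank words by how little their nearest-neighbor sets agree across two independently trained embedding spaces. The shared vocabulary indexing, scoring function, and parameter names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: detect usage change via nearest-neighbor overlap, with no
# vector-space alignment. Assumes both embedding matrices index the same
# shared vocabulary by row.
import numpy as np

def top_k_neighbors(word_id: int, emb: np.ndarray, k: int = 10) -> set:
    # Indices of the k most cosine-similar rows to row `word_id` in `emb`.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit[word_id]
    sims[word_id] = -np.inf                     # exclude the word itself
    return set(np.argsort(-sims)[:k].tolist())

def usage_change_score(word_id: int, emb_a: np.ndarray,
                       emb_b: np.ndarray, k: int = 10) -> float:
    # Less overlap between the word's neighbor sets in the two corpora
    # implies a higher usage-change score.
    overlap = len(top_k_neighbors(word_id, emb_a, k) &
                  top_k_neighbors(word_id, emb_b, k))
    return 1.0 - overlap / k
```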
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences of its use.