Solving Cosine Similarity Underestimation between High Frequency Words
by L2 Norm Discounting
- URL: http://arxiv.org/abs/2305.10610v1
- Date: Wed, 17 May 2023 23:41:30 GMT
- Title: Solving Cosine Similarity Underestimation between High Frequency Words
by L2 Norm Discounting
- Authors: Saeth Wannasuphoprasit, Yi Zhou, Danushka Bollegala
- Abstract summary: We propose a method to discount the L2 norm of a contextualised word embedding by the frequency of that word in a corpus when measuring the cosine similarities between words.
Experimental results on a contextualised word similarity dataset show that our proposed discounting method accurately solves the similarity underestimation problem.
- Score: 19.12036493733793
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cosine similarity between two words, computed using their contextualised
token embeddings obtained from masked language models (MLMs) such as BERT, has
been shown to underestimate the actual similarity between those words (Zhou et al.,
2022). This similarity underestimation problem is particularly severe for
highly frequent words. Although this problem has been noted in prior work, no
solution has been proposed thus far. We observe that the L2 norm of
contextualised embeddings of a word correlates with its log-frequency in the
pretraining corpus. Consequently, the larger L2 norms associated with the
highly frequent words reduce the cosine similarity values measured between
them, thus underestimating the similarity scores. To solve this issue, we
propose a method to discount the L2 norm of a contextualised word embedding by
the frequency of that word in a corpus when measuring the cosine similarities
between words. We show that the so-called stop words behave differently from
the rest of the words and require special consideration during the
discounting process. Experimental results on a contextualised word similarity
dataset show that our proposed discounting method accurately solves the
similarity underestimation problem.
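The abstract describes the idea but not the exact discounting function, so the following is only a minimal sketch of the general approach: fit the observed L2 norms of contextualised embeddings against log-frequency, then divide each vector's norm by a frequency-dependent discount factor before normalising the dot product. The fitting procedure, the reference frequency, and the function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fit_norm_vs_log_frequency(embeddings, counts):
    """Fit a linear model of embedding L2 norm against log corpus frequency.

    embeddings: (n_words, dim) array with one contextualised vector per word.
    counts:     corpus frequency of each of those words.
    Returns (slope, intercept) of the fitted line.
    """
    norms = np.linalg.norm(embeddings, axis=1)
    slope, intercept = np.polyfit(np.log(counts), norms, deg=1)
    return slope, intercept

def discounted_cosine(x, y, count_x, count_y, slope, intercept, ref_count=1000):
    """Cosine-style similarity in which each vector's L2 norm is divided by a
    frequency-dependent discount factor before normalisation.

    The discount factor is the norm predicted for the word's frequency divided
    by the norm predicted at a reference frequency, so frequent words (with
    inflated norms) receive factors > 1 and their scores are pushed back up.
    This is an illustrative scheme only, not the paper's exact formulation.
    """
    def discount(count):
        return (slope * np.log(count) + intercept) / (slope * np.log(ref_count) + intercept)

    denom = (np.linalg.norm(x) / discount(count_x)) * (np.linalg.norm(y) / discount(count_y))
    return float(x @ y) / denom
```

Note that plain cosine similarity is invariant to rescaling either vector, so in this sketch the discount is applied only to the norms in the denominator; this is one way to read "discounting the L2 norm", and the paper's actual formulation may differ.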
Related papers
- Revealing the Blind Spot of Sentence Encoder Evaluation by HEROS [68.34155010428941]
It is unclear what kind of sentence pairs a sentence encoder (SE) would consider similar.
HEROS is constructed by transforming an original sentence into a new sentence based on certain rules to form a minimal pair.
By systematically comparing the performance of over 60 supervised and unsupervised SEs on HEROS, we reveal that most unsupervised sentence encoders are insensitive to negation.
arXiv Detail & Related papers (2023-06-08T10:24:02Z)
- Relational Sentence Embedding for Flexible Semantic Matching [86.21393054423355]
We present Relational Sentence Embedding (RSE), a new paradigm to further discover the potential of sentence embeddings.
RSE is effective and flexible in modeling sentence relations and outperforms a series of state-of-the-art embedding methods.
arXiv Detail & Related papers (2022-12-17T05:25:17Z)
- Investigating the Frequency Distortion of Word Embeddings and Its Impact on Bias Metrics [2.1374208474242815]
We systematically study the association between frequency and semantic similarity in several static word embeddings.
We find that Skip-gram, GloVe and FastText embeddings tend to produce higher semantic similarity between high-frequency words than between other frequency combinations.
arXiv Detail & Related papers (2022-11-15T15:11:06Z)
- Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model [0.0]
We release contextual biasing lists to accompany the Earnings21 dataset.
We show results for shallow fusion contextual biasing applied to two different decoding algorithms.
We propose an alternate spelling prediction model that improves recall of rare words by 34.7% relative.
arXiv Detail & Related papers (2022-09-02T19:30:16Z)
- Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words [45.58634797899206]
We find that cosine similarity underestimates the similarity of frequent words with other instances of the same word or other words across contexts.
We conjecture that this underestimation of similarity for high frequency words is due to differences in the representational geometry of high and low frequency words.
arXiv Detail & Related papers (2022-05-10T18:00:06Z)
- Comparing in context: Improving cosine similarity measures with a metric tensor [0.0]
Cosine similarity is a widely used measure of the relatedness of pre-trained word embeddings trained on a language modeling objective.
We propose instead an extended cosine similarity measure, based on a learned metric tensor, to improve performance on word similarity tasks, with gains in interpretability.
We learn contextualized metrics and compare the results against the baseline values obtained with the standard cosine similarity measure, observing consistent improvements.
We also train a contextualized similarity measure for both SimLex-999 and WordSim-353, comparing the results with the corresponding baselines, and use these datasets as independent test sets for the all-context similarity measure.
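As a concrete illustration of the idea (not that paper's exact parametrisation), an extended cosine similarity replaces the Euclidean inner product with one defined by a learned metric tensor M; setting M to the identity recovers the standard cosine similarity. The example values below are purely illustrative.

```python
import numpy as np

def metric_cosine(x, y, M):
    """Cosine similarity under a metric tensor M (assumed positive definite):
    sim(x, y) = (x^T M y) / sqrt((x^T M x) * (y^T M y)).
    M = I reduces this to the standard cosine similarity."""
    num = x @ M @ y
    return float(num / np.sqrt((x @ M @ x) * (y @ M @ y)))

# Example: a diagonal metric that up-weights the first dimension.
x, y = np.array([1.0, 0.2, 0.0]), np.array([0.9, -0.1, 0.3])
M = np.diag([2.0, 1.0, 1.0])
print(metric_cosine(x, y, np.eye(3)), metric_cosine(x, y, M))
```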
arXiv Detail & Related papers (2022-03-28T18:04:26Z)
- Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora [54.757845511368814]
The problem of comparing two bodies of text and searching for words that differ in their usage arises often in digital humanities and computational social science.
This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large.
We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word.
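A minimal sketch of a neighbour-based comparison along those lines (illustrative only; that paper's exact scoring may differ): for each word, compute its nearest neighbours within each corpus's own embedding space and flag words whose neighbour sets barely overlap, so no cross-corpus vector-space alignment is required.

```python
import numpy as np

def top_k_neighbors(word, vocab, vectors, k=10):
    """Return the k most cosine-similar words to `word` within one corpus's
    embedding space (vectors: (n_vocab, dim), rows aligned with vocab)."""
    idx = vocab.index(word)
    v = vectors[idx]
    sims = vectors @ v / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(v))
    sims[idx] = -np.inf  # exclude the word itself
    return {vocab[i] for i in np.argsort(-sims)[:k]}

def neighbor_overlap(word, vocab_a, vecs_a, vocab_b, vecs_b, k=10):
    """Jaccard overlap between the word's neighbour sets in corpora A and B.
    Low overlap suggests the word is used differently in the two corpora;
    only within-corpus neighbours are compared, so no alignment is needed."""
    na = top_k_neighbors(word, vocab_a, vecs_a, k)
    nb = top_k_neighbors(word, vocab_b, vecs_b, k)
    return len(na & nb) / len(na | nb)
```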
arXiv Detail & Related papers (2021-12-28T23:46:00Z)
- Frequency-based Distortions in Contextualized Word Embeddings [29.88883761339757]
This work explores the geometric characteristics of contextualized word embeddings with two novel tools.
Words of high and low frequency differ significantly with respect to their representational geometry.
BERT-Base has more trouble differentiating between South American and African countries than North American and European ones.
arXiv Detail & Related papers (2021-04-17T06:35:48Z)
- SemGloVe: Semantic Co-occurrences for GloVe from BERT [55.420035541274444]
GloVe learns word embeddings by leveraging statistical information from word co-occurrence matrices.
We propose SemGloVe, which distills semantic co-occurrences from BERT into static GloVe word embeddings.
arXiv Detail & Related papers (2020-12-30T15:38:26Z)
- Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
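For concreteness, operationalising ambiguity as the entropy of a word's meaning distribution could look like the sketch below; the meaning probabilities are hypothetical placeholders, not that paper's estimates.

```python
import numpy as np

def meaning_entropy(meaning_probs):
    """Entropy (in nats) of a word's distribution over meanings:
    H = -sum_m p(m) * log p(m).  Higher entropy means a more ambiguous word."""
    p = np.asarray(meaning_probs, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log(p)).sum())

# "bank" with two roughly balanced senses vs. a near-monosemous word
print(meaning_entropy([0.55, 0.45]), meaning_entropy([0.98, 0.02]))
```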
arXiv Detail & Related papers (2020-10-05T17:19:10Z)