Investigating the Frequency Distortion of Word Embeddings and Its Impact on Bias Metrics
- URL: http://arxiv.org/abs/2211.08203v2
- Date: Thu, 19 Oct 2023 19:07:40 GMT
- Title: Investigating the Frequency Distortion of Word Embeddings and Its Impact on Bias Metrics
- Authors: Francisco Valentini, Juan Cruz Sosa, Diego Fernandez Slezak, Edgar Altszyler
- Abstract summary: We systematically study the association between frequency and semantic similarity in several static word embeddings.
We find that Skip-gram, GloVe and FastText embeddings tend to produce higher semantic similarity between high-frequency words than between other frequency combinations.
- Score: 2.1374208474242815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has shown that static word embeddings can encode word
frequency information. However, little has been studied about this phenomenon
and its effects on downstream tasks. In the present work, we systematically
study the association between frequency and semantic similarity in several
static word embeddings. We find that Skip-gram, GloVe and FastText embeddings
tend to produce higher semantic similarity between high-frequency words than
between other frequency combinations. We show that the association between
frequency and similarity also appears when words are randomly shuffled. This
proves that the patterns found are not due to real semantic associations
present in the texts, but are an artifact produced by the word embeddings.
Finally, we provide an example of how word frequency can strongly impact the
measurement of gender bias with embedding-based metrics. In particular, we
carry out a controlled experiment that shows that biases can even change sign
or reverse their order by manipulating word frequencies.
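The frequency–similarity association the abstract describes can be illustrated with a minimal sketch: compute the average pairwise cosine similarity within a group of word vectors, once for high-frequency words and once for low-frequency words. The vectors below are toy stand-ins, not trained embeddings, and `mean_pairwise_cosine` is a hypothetical helper name introduced only for this illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    sims = [cosine(u, v)
            for i, u in enumerate(vectors)
            for v in vectors[i + 1:]]
    return sum(sims) / len(sims)

# Toy vectors standing in for embeddings of high- and low-frequency words.
high_freq = [np.array([1.0, 0.2]), np.array([0.9, 0.3]), np.array([1.1, 0.1])]
low_freq = [np.array([0.2, 1.0]), np.array([-0.5, 0.8]), np.array([0.7, -0.4])]

print(mean_pairwise_cosine(high_freq))  # tightly clustered, so high mean similarity
print(mean_pairwise_cosine(low_freq))
```

In the paper's setup the same statistic would be computed from real Skip-gram, GloVe, or FastText vectors, with words binned by corpus frequency; the shuffled-corpus control repeats the measurement after randomly permuting word order, so any surviving pattern must come from the embedding method rather than from semantics in the text.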
Related papers
- Spoken Word2Vec: Learning Skipgram Embeddings from Speech [0.8901073744693314]
We show how shallow skipgram-like algorithms fail to encode distributional semantics when the input units are acoustically correlated.
We illustrate the potential of an alternative deep end-to-end variant of the model and examine the effects on the resulting embeddings.
arXiv Detail & Related papers (2023-11-15T19:25:29Z)
- Neighboring Words Affect Human Interpretation of Saliency Explanations [65.29015910991261]
Word-level saliency explanations are often used to communicate feature-attribution in text-based models.
Recent studies found that superficial factors such as word length can distort human interpretation of the communicated saliency scores.
We investigate how the marking of a word's neighboring words affects the explainee's perception of the word's importance in the context of a saliency explanation.
arXiv Detail & Related papers (2023-05-04T09:50:25Z)
- The Undesirable Dependence on Frequency of Gender Bias Metrics Based on Word Embeddings [0.0]
We study the effect of frequency when measuring female vs. male gender bias with word embedding-based bias quantification methods.
We find that Skip-gram with negative sampling and GloVe tend to detect male bias in high frequency words, while GloVe tends to return female bias in low frequency words.
This proves that the frequency-based effect observed in unshuffled corpora stems from properties of the metric rather than from word associations.
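A common family of embedding-based gender bias metrics scores a word by the difference between its cosine similarities to two gendered anchor words. The sketch below uses that simple difference, which may not match the exact metric in the paper, and the vectors and names (`gender_bias`, `nurse`) are toy assumptions for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def gender_bias(word_vec, she_vec, he_vec):
    """Signed bias score: positive leans toward 'she', negative toward 'he'."""
    return cosine(word_vec, she_vec) - cosine(word_vec, he_vec)

# Toy anchor and target vectors; a real metric would use trained embeddings.
she = np.array([1.0, 0.0])
he = np.array([0.0, 1.0])
nurse = np.array([0.9, 0.3])  # hypothetical vector leaning toward 'she'

print(gender_bias(nurse, she, he))  # > 0
```

Because the score is built from cosine similarities, any frequency distortion in those similarities propagates directly into the bias measurement, which is how manipulating word frequencies can flip the sign of the reported bias.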
arXiv Detail & Related papers (2023-01-02T18:27:10Z)
- Boosting word frequencies in authorship attribution [0.0]
I introduce a simple method of computing relative word frequencies for authorship attribution and similar stylometric tasks.
The notion of relevant words includes synonyms and, usually, a few dozen other words that are in some way semantically similar to the word in question.
The proposed method outperforms classical most-frequent-word approaches substantially.
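One plausible reading of "relative word frequencies" here is to normalize a word's count by the total count of its semantically relevant set rather than by corpus size. The function and counts below are illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter

def relative_frequency(word, relevant_set, counts):
    """Frequency of `word` relative to the combined frequency of its
    semantically relevant set (synonyms and similar words)."""
    total = sum(counts[w] for w in relevant_set)
    return counts[word] / total if total else 0.0

# Hypothetical counts from one author's corpus.
counts = Counter({"big": 40, "large": 35, "huge": 25})

print(relative_frequency("big", {"big", "large", "huge"}, counts))  # 0.4
```

The intuition is that an author's preference for "big" over "large" is a stylistic signal, whereas the raw frequency of "big" mostly reflects topic and genre.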
arXiv Detail & Related papers (2022-11-02T17:11:35Z)
- Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words [45.58634797899206]
We find that cosine similarity underestimates the similarity of frequent words with other instances of the same word or other words across contexts.
We conjecture that this underestimation of similarity for high frequency words is due to differences in the representational geometry of high and low frequency words.
arXiv Detail & Related papers (2022-05-10T18:00:06Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlap frequently occurs between paired texts in natural language processing tasks such as text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show NDD to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
arXiv Detail & Related papers (2021-05-28T19:44:24Z)
- Frequency-based Distortions in Contextualized Word Embeddings [29.88883761339757]
This work explores the geometric characteristics of contextualized word embeddings with two novel tools.
Words of high and low frequency differ significantly with respect to their representational geometry.
BERT-Base has more trouble differentiating between South American and African countries than North American and European ones.
arXiv Detail & Related papers (2021-04-17T06:35:48Z)
- Match-Ignition: Plugging PageRank into Transformer for Long-form Text Matching [66.71886789848472]
We propose a novel hierarchical noise filtering model, namely Match-Ignition, to tackle the effectiveness and efficiency problem.
The basic idea is to plug the well-known PageRank algorithm into the Transformer, to identify and filter both sentence and word level noisy information.
Noisy sentences are usually easy to detect because the sentence is the basic unit of a long-form text, so we directly use PageRank to filter such information.
arXiv Detail & Related papers (2021-01-16T10:34:03Z)
- Dynamic Semantic Matching and Aggregation Network for Few-shot Intent Detection [69.2370349274216]
Few-shot Intent Detection is challenging due to the scarcity of available annotated utterances.
Semantic components are distilled from utterances via multi-head self-attention.
Our method provides a comprehensive matching measure to enhance representations of both labeled and unlabeled instances.
arXiv Detail & Related papers (2020-10-06T05:16:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.