Norm of Word Embedding Encodes Information Gain
- URL: http://arxiv.org/abs/2212.09663v3
- Date: Thu, 2 Nov 2023 16:01:32 GMT
- Title: Norm of Word Embedding Encodes Information Gain
- Authors: Momose Oyama, Sho Yokoi, Hidetoshi Shimodaira
- Abstract summary: We show that the squared norm of static word embedding encodes the information gain conveyed by the word.
We also demonstrate that both the KL divergence and the squared norm of embedding provide a useful metric of the informativeness of a word.
- Score: 7.934452214142754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributed representations of words encode lexical semantic information, but
what type of information is encoded and how? Focusing on the skip-gram with
negative-sampling method, we found that the squared norm of static word
embedding encodes the information gain conveyed by the word; the information
gain is defined by the Kullback-Leibler divergence of the co-occurrence
distribution of the word to the unigram distribution. Our findings are
explained by the theoretical framework of the exponential family of probability
distributions and confirmed through precise experiments that remove spurious
correlations arising from word frequency. This theory also extends to
contextualized word embeddings in language models or any neural networks with
the softmax output layer. We also demonstrate that both the KL divergence and
the squared norm of embedding provide a useful metric of the informativeness of
a word in tasks such as keyword extraction, proper-noun discrimination, and
hypernym discrimination.
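To make the key quantity concrete, the following is a minimal sketch (not the authors' code) of estimating each word's information gain as the KL divergence of its co-occurrence distribution from the unigram distribution and comparing it against the squared embedding norm; the toy co-occurrence matrix, the random stand-in vectors, and the variable names are illustrative assumptions.
```python
import numpy as np

def information_gain(cooc_row, unigram_counts, eps=1e-12):
    """KL( p(. | w) || p(.) ): divergence of word w's co-occurrence
    distribution from the corpus unigram distribution."""
    p_ctx_given_w = cooc_row / cooc_row.sum()          # p(c | w)
    p_unigram = unigram_counts / unigram_counts.sum()  # p(c)
    mask = p_ctx_given_w > 0
    return float(np.sum(p_ctx_given_w[mask] *
                        np.log((p_ctx_given_w[mask] + eps) / (p_unigram[mask] + eps))))

# Toy setup: cooc[w, c] counts how often context word c appears near target
# word w; `embeddings` stands in for trained skip-gram (SGNS) word vectors.
rng = np.random.default_rng(0)
vocab_size, dim = 1000, 100
cooc = rng.poisson(2.0, size=(vocab_size, vocab_size)).astype(float)
embeddings = rng.normal(size=(vocab_size, dim))

unigram = cooc.sum(axis=0)
kl = np.array([information_gain(cooc[w], unigram) for w in range(vocab_size)])
squared_norm = (embeddings ** 2).sum(axis=1)

# The paper's claim concerns embeddings actually trained with SGNS: there the
# two quantities correlate strongly once word-frequency effects are removed.
print("correlation:", np.corrcoef(kl, squared_norm)[0, 1])
```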
Related papers
- Zipfian Whitening [7.927385005964994]
Most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are uniform.
In reality, word frequencies follow a highly non-uniform distribution, known as Zipf's law.
We show that simply performing PCA whitening weighted by the empirical word frequency that follows Zipf's law significantly improves task performance.
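As a rough illustration of that recipe (a sketch under assumed inputs, not the paper's implementation), frequency-weighted whitening replaces the uniform mean and covariance of ordinary PCA whitening with expectations taken under the empirical Zipfian word frequencies:
```python
import numpy as np

def zipfian_whitening(embeddings, freqs, eps=1e-8):
    """PCA-whiten word vectors, weighting every expectation by the
    empirical word-frequency distribution instead of a uniform weight."""
    p = freqs / freqs.sum()                       # empirical unigram probabilities
    mu = p @ embeddings                           # frequency-weighted mean
    centered = embeddings - mu
    cov = (centered * p[:, None]).T @ centered    # frequency-weighted covariance
    eigval, eigvec = np.linalg.eigh(cov)
    whitener = eigvec / np.sqrt(eigval + eps)     # scale each principal axis to unit variance
    return centered @ whitener

# Hypothetical usage: `vectors` is a (vocab, dim) embedding matrix and `counts`
# holds corpus frequencies in the same vocabulary order.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(5000, 300))
counts = rng.zipf(1.5, size=5000).astype(float)
whitened = zipfian_whitening(vectors, counts)
```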
arXiv Detail & Related papers (2024-11-01T15:40:19Z)
- How well do distributed representations convey contextual lexical semantics: a Thesis Proposal [3.3585951129432323]
In this thesis, we examine the efficacy of distributed representations from modern neural networks in encoding lexical meaning.
We identify four sources of ambiguity based on the relatedness and similarity of meanings influenced by context.
We then aim to evaluate these sources by collecting or constructing multilingual datasets, leveraging various language models, and employing linguistic analysis tools.
arXiv Detail & Related papers (2024-06-02T14:08:51Z)
- Probing with Noise: Unpicking the Warp and Weft of Embeddings [2.9874726192215157]
We argue that it is possible for the vector norm to also carry linguistic information.
We develop a method to test this: an extension of the probing framework.
We find evidence that confirms the existence of separate information containers in English GloVe and BERT embeddings.
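The summary does not spell out the test itself, so the following is a purely illustrative sketch (not necessarily the paper's protocol; the data are synthetic and the label is constructed so the effect is visible by design) of probing the norm and the direction separately:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_accuracy(features, labels):
    """5-fold cross-validated accuracy of a linear probe."""
    return cross_val_score(LogisticRegression(max_iter=1000), features, labels, cv=5).mean()

# Synthetic stand-ins: a real experiment would use GloVe/BERT vectors and a
# genuine linguistic label instead of these.
rng = np.random.default_rng(0)
emb = rng.normal(size=(2000, 50)) * rng.lognormal(size=(2000, 1))
labels = (np.linalg.norm(emb, axis=1) > np.median(np.linalg.norm(emb, axis=1))).astype(int)

norms = np.linalg.norm(emb, axis=1, keepdims=True)   # norm only
directions = emb / norms                             # direction only, norm removed
print("norm-only probe     :", probe_accuracy(norms, labels))
print("direction-only probe:", probe_accuracy(directions, labels))
# If the norm-only probe beats chance while the direction-only probe does not,
# the vector norm is acting as a separate information container.
```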
arXiv Detail & Related papers (2022-10-21T19:33:33Z)
- Latent Topology Induction for Understanding Contextualized Representations [84.7918739062235]
We study the representation space of contextualized embeddings and gain insight into the hidden topology of large language models.
We show there exists a network of latent states that summarize linguistic properties of contextualized representations.
arXiv Detail & Related papers (2022-06-03T11:22:48Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlap frequently occurs between paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions.
Experiments on Semantic Textual Similarity show the resulting neighboring distribution divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
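Roughly, the strategy can be sketched as follows (an illustrative approximation with a generic masked language model, not the authors' implementation; the model name and the symmetrized-KL choice are assumptions): mask each shared token, read off the MLM's predictive distribution at that position in both texts, and accumulate the divergence between the two distributions.
```python
import difflib
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "bert-base-uncased"   # any masked language model would do
tok = AutoTokenizer.from_pretrained(MODEL)
mlm = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

def masked_dist(ids, pos):
    """Predictive distribution of the MLM at `pos` when that token is masked."""
    ids = ids.clone()
    ids[0, pos] = tok.mask_token_id
    with torch.no_grad():
        logits = mlm(input_ids=ids).logits[0, pos]
    return torch.softmax(logits, dim=-1)

def neighboring_divergence(text_a, text_b):
    ids_a = tok(text_a, return_tensors="pt").input_ids
    ids_b = tok(text_b, return_tensors="pt").input_ids
    seq_a, seq_b = ids_a[0].tolist(), ids_b[0].tolist()
    special = set(tok.all_special_ids)
    total = 0.0
    # Shared tokens found via the longest matching blocks of the two sequences.
    for blk in difflib.SequenceMatcher(a=seq_a, b=seq_b, autojunk=False).get_matching_blocks():
        for k in range(blk.size):
            if seq_a[blk.a + k] in special:
                continue
            p = masked_dist(ids_a, blk.a + k)
            q = masked_dist(ids_b, blk.b + k)
            # Symmetrized KL divergence between the two predicted distributions.
            total += 0.5 * (torch.sum(p * (p / q).log()) + torch.sum(q * (q / p).log())).item()
    return total

print(neighboring_divergence("the cat sat on the mat", "the dog sat on the mat"))
```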
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- LexSubCon: Integrating Knowledge from Lexical Resources into Contextual Embeddings for Lexical Substitution [76.615287796753]
We introduce LexSubCon, an end-to-end lexical substitution framework based on contextual embedding models.
This is achieved by combining contextual information with knowledge from structured lexical resources.
Our experiments show that LexSubCon outperforms previous state-of-the-art methods on LS07 and CoInCo benchmark datasets.
arXiv Detail & Related papers (2021-07-11T21:25:56Z)
- Integrating Information Theory and Adversarial Learning for Cross-modal Retrieval [19.600581093189362]
Accurately matching visual and textual data in cross-modal retrieval has been widely studied in the multimedia community.
We propose integrating Shannon information theory and adversarial learning.
To bridge the gap between the visual and textual modalities, we integrate modality classification and information entropy adversarially.
arXiv Detail & Related papers (2021-04-11T11:04:55Z)
- R$^2$-Net: Relation of Relation Learning Network for Sentence Semantic Matching [58.72111690643359]
We propose a Relation of Relation Learning Network (R2-Net) for sentence semantic matching.
We first employ BERT to encode the input sentences from a global perspective.
Then a CNN-based encoder is designed to capture keywords and phrase information from a local perspective.
To fully leverage labels for better relation information extraction, we introduce a self-supervised relation of relation classification task.
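A minimal sketch of that global-plus-local encoder (an illustration of the summary above, not the released R2-Net; the self-supervised relation-of-relation task is omitted and all hyperparameters are assumptions):
```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class GlobalLocalMatcher(nn.Module):
    """BERT supplies a global sentence-pair view; a 1-D CNN over the token
    states supplies a local keyword/phrase view; both feed one classifier."""

    def __init__(self, name="bert-base-uncased", n_classes=3, kernel=3, channels=128):
        super().__init__()
        self.bert = AutoModel.from_pretrained(name)
        hidden = self.bert.config.hidden_size
        self.conv = nn.Conv1d(hidden, channels, kernel_size=kernel, padding=kernel // 2)
        self.classifier = nn.Linear(hidden + channels, n_classes)

    def forward(self, input_ids, attention_mask):
        states = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        global_vec = states[:, 0]                                  # [CLS] summary
        local_vec = torch.relu(self.conv(states.transpose(1, 2))).max(dim=-1).values
        return self.classifier(torch.cat([global_vec, local_vec], dim=-1))

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = GlobalLocalMatcher()
batch = tok(["a premise sentence"], ["its paired hypothesis"], padding=True, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
```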
arXiv Detail & Related papers (2020-12-16T13:11:30Z)
- On the Sentence Embeddings from Pre-trained Language Models [78.45172445684126]
In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited.
We find that BERT always induces a non-smooth anisotropic semantic space of sentences, which harms its performance on semantic similarity tasks.
We propose to transform the anisotropic sentence embedding distribution to a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective.
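As a toy sketch of that idea (a small affine-coupling flow trained by maximum likelihood, not the actual BERT-flow code; the dimensions, depth, and training loop are assumptions):
```python
import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible coupling layer: half the dimensions are rescaled and
    shifted by a small network applied to the other half."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * (dim - self.half)))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)            # keep the scales numerically tame
        return torch.cat([x1, x2 * log_s.exp() + t], dim=-1), log_s.sum(dim=-1)

def negative_log_likelihood(z, log_det):
    # -log p(x) under a standard Gaussian base density plus the flow's log-det.
    log_prob = -0.5 * (z ** 2).sum(dim=-1) - 0.5 * z.shape[-1] * math.log(2 * math.pi)
    return -(log_prob + log_det).mean()

dim = 768                                    # e.g. BERT sentence-embedding size
layers = nn.ModuleList([AffineCoupling(dim) for _ in range(4)])
optimizer = torch.optim.Adam(layers.parameters(), lr=1e-3)
sentence_embeddings = torch.randn(512, dim)  # stand-in for real sentence vectors

for step in range(100):
    z, log_det = sentence_embeddings, torch.zeros(len(sentence_embeddings))
    for layer in layers:
        z, ld = layer(z)
        z = z.flip(dims=[-1])                # permute so both halves get transformed
        log_det = log_det + ld
    loss = negative_log_likelihood(z, log_det)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```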
arXiv Detail & Related papers (2020-11-02T13:14:57Z)
- Rethinking Positional Encoding in Language Pre-training [111.2320727291926]
We show that in absolute positional encoding, the addition operation applied to positional embeddings and word embeddings brings mixed correlations.
We propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE).
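A sketch of the untied idea (an illustration of the summary above, not the TUPE reference implementation; the scaling and dimensions are assumptions): attention logits become the sum of a word-word term and a position-position term, each with its own projections, instead of projecting the sum of word and positional embeddings.
```python
import math
import torch
import torch.nn as nn

class UntiedAttentionScores(nn.Module):
    """Content-content and position-position attention terms computed with
    separate projections, then added, instead of mixing the two by adding
    positional embeddings to word embeddings before a shared projection."""
    def __init__(self, dim, max_len=512):
        super().__init__()
        self.wq, self.wk = nn.Linear(dim, dim), nn.Linear(dim, dim)   # word content
        self.pq, self.pk = nn.Linear(dim, dim), nn.Linear(dim, dim)   # positions
        self.pos = nn.Embedding(max_len, dim)
        self.scale = math.sqrt(2 * dim)       # each term contributes half the variance

    def forward(self, word_emb):              # word_emb: (batch, seq, dim)
        _, t, _ = word_emb.shape
        pos = self.pos(torch.arange(t, device=word_emb.device))       # (seq, dim)
        content = self.wq(word_emb) @ self.wk(word_emb).transpose(1, 2)
        position = self.pq(pos) @ self.pk(pos).transpose(0, 1)        # (seq, seq)
        return (content + position) / self.scale

scores = UntiedAttentionScores(dim=64)(torch.randn(2, 10, 64))
print(scores.shape)                           # torch.Size([2, 10, 10])
```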
arXiv Detail & Related papers (2020-06-28T13:11:02Z)
- Neutralizing Gender Bias in Word Embedding with Latent Disentanglement and Counterfactual Generation [25.060917870666803]
We introduce a siamese auto-encoder structure with an adapted gradient reversal layer.
Our structure separates the semantic latent information and the gender latent information of a given word into disjoint latent dimensions.
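Of the pieces named above, the gradient reversal layer is the most reusable; a minimal PyTorch sketch (the siamese auto-encoder itself is omitted, and the names are illustrative) looks like this:
```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the
    backward pass, so the encoder is trained *against* the downstream
    (e.g. gender) classifier."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Toy check: gradients flowing back through the layer come out negated.
x = torch.randn(4, 8, requires_grad=True)
grad_reverse(x, lam=1.0).sum().backward()
print(torch.allclose(x.grad, -torch.ones_like(x)))   # True
```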
arXiv Detail & Related papers (2020-04-07T05:16:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.