The Impact of Word Splitting on the Semantic Content of Contextualized
Word Representations
- URL: http://arxiv.org/abs/2402.14616v1
- Date: Thu, 22 Feb 2024 15:04:24 GMT
- Title: The Impact of Word Splitting on the Semantic Content of Contextualized
Word Representations
- Authors: Aina Garí Soler, Matthieu Labeau and Chloé Clavel
- Abstract summary: The quality of representations of words that are split is often, but not always, worse than that of the embeddings of known words.
- Score: 3.4668147567693453
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When deriving contextualized word representations from language models, a
decision needs to be made on how to obtain one for out-of-vocabulary (OOV)
words that are segmented into subwords. What is the best way to represent these
words with a single vector, and are these representations of worse quality than
those of in-vocabulary words? We carry out an intrinsic evaluation of
embeddings from different models on semantic similarity tasks involving OOV
words. Our analysis reveals, among other interesting findings, that the quality
of representations of words that are split is often, but not always, worse than
that of the embeddings of known words. Their similarity values, however, must
be interpreted with caution.
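As a rough illustration of the question the abstract raises, the sketch below collapses the vectors of a split word's subword pieces into a single vector and compares it with the vector of an unsplit word. It is only a minimal sketch under stated assumptions: the function names `pool_subwords` and `cosine`, the three pooling strategies, and the toy two-dimensional vectors are hypothetical and are not taken from the paper, which may evaluate different strategies.

```python
import math


def pool_subwords(subword_vectors, strategy="mean"):
    """Collapse the vectors of a word's subword pieces into one vector.

    Hypothetical helper: the strategies here (mean, first, last) are common
    choices, assumed for illustration rather than taken from the paper.
    """
    if not subword_vectors:
        raise ValueError("need at least one subword vector")
    if strategy == "mean":
        n = len(subword_vectors)
        # Average each dimension across all subword pieces.
        return [sum(dim) / n for dim in zip(*subword_vectors)]
    if strategy == "first":
        return list(subword_vectors[0])
    if strategy == "last":
        return list(subword_vectors[-1])
    raise ValueError(f"unknown strategy: {strategy}")


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy data (made up): a word split into two pieces, and an unsplit word.
pieces = [[1.0, 0.0], [0.0, 1.0]]
known_word = [1.0, 1.0]

for strategy in ("mean", "first", "last"):
    pooled = pool_subwords(pieces, strategy)
    print(strategy, pooled, round(cosine(pooled, known_word), 4))
```

The similarity obtained for the same word pair changes with the pooling strategy, which is one reason similarity values involving split words call for careful interpretation.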
Related papers
- Investigating Idiomaticity in Word Representations [9.208145117062339]
We focus on noun compounds of varying levels of idiomaticity in two languages (English and Portuguese).
We present a dataset of minimal pairs containing human idiomaticity judgments for each noun compound at both type and token levels.
We define a set of fine-grained metrics of Affinity and Scaled Similarity to determine how sensitive the models are to perturbations that may lead to changes in idiomaticity.
arXiv Detail & Related papers (2024-11-04T21:05:01Z)
- Unsupervised Mapping of Arguments of Deverbal Nouns to Their Corresponding Verbal Labels [52.940886615390106]
Deverbal nouns are nominal forms of verbs commonly used in written English texts to describe events or actions, as well as their arguments.
The solutions that do exist for handling arguments of nominalized constructions are based on semantic annotation.
We propose to adopt a more syntactic approach, which maps the arguments of deverbal nouns to the corresponding verbal construction.
arXiv Detail & Related papers (2023-06-24T10:07:01Z)
- Neighboring Words Affect Human Interpretation of Saliency Explanations [65.29015910991261]
Word-level saliency explanations are often used to communicate feature-attribution in text-based models.
Recent studies found that superficial factors such as word length can distort human interpretation of the communicated saliency scores.
We investigate how the marking of a word's neighboring words affects the explainee's perception of the word's importance in the context of a saliency explanation.
arXiv Detail & Related papers (2023-05-04T09:50:25Z)
- Lost in Context? On the Sense-wise Variance of Contextualized Word Embeddings [11.475144702935568]
We quantify how much the contextualized embeddings of each word sense vary across contexts in typical pre-trained models.
We find that word representations are position-biased, where the first words in different contexts tend to be more similar.
arXiv Detail & Related papers (2022-08-20T12:27:25Z)
- Deriving Word Vectors from Contextualized Language Models using Topic-Aware Mention Selection [46.97185212695267]
We propose a method for learning word representations that summarizes the sentence contexts of word mentions.
We take advantage of contextualized language models (CLMs) rather than bags of word vectors to encode contexts.
We show that this simple strategy leads to high-quality word vectors, which are more predictive of semantic properties than word embeddings and existing CLM-based strategies.
arXiv Detail & Related papers (2021-06-15T08:02:42Z)
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z)
- Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z)
- Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
arXiv Detail & Related papers (2020-10-05T17:19:10Z)
- Morphological Skip-Gram: Using morphological knowledge to improve word representation [2.0129974477913457]
We propose a new method for training word embeddings by replacing the FastText bag of character n-grams with a bag of word morphemes.
The results show a competitive performance compared to FastText.
arXiv Detail & Related papers (2020-07-20T12:47:36Z)
- Comparative Analysis of Word Embeddings for Capturing Word Similarities [0.0]
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks.
Most of the natural language processing models that are based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings.
Selecting the appropriate word embeddings is a perplexing task since the projected embedding space is not intuitive to humans.
arXiv Detail & Related papers (2020-05-08T01:16:03Z)