Semantics or spelling? Probing contextual word embeddings with orthographic noise
- URL: http://arxiv.org/abs/2408.04162v1
- Date: Thu, 8 Aug 2024 02:07:25 GMT
- Title: Semantics or spelling? Probing contextual word embeddings with orthographic noise
- Authors: Jacob A. Matthews, John R. Starr, Marten van Schijndel
- Abstract summary: It remains unclear exactly what information is encoded in PLM hidden states.
Surprisingly, we find that CWEs generated by popular PLMs are highly sensitive to noise in input data.
This suggests that CWEs capture information unrelated to word-level meaning and can be manipulated through trivial modifications of input data.
- Score: 4.622165486890317
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Pretrained language model (PLM) hidden states are frequently employed as contextual word embeddings (CWE): high-dimensional representations that encode semantic information given linguistic context. Across many areas of computational linguistics research, similarity between CWEs is interpreted as semantic similarity. However, it remains unclear exactly what information is encoded in PLM hidden states. We investigate this practice by probing PLM representations using minimal orthographic noise. We expect that if CWEs primarily encode semantic information, a single character swap in the input word will not drastically affect the resulting representation, given sufficient linguistic context. Surprisingly, we find that CWEs generated by popular PLMs are highly sensitive to noise in input data, and that this sensitivity is related to subword tokenization: the fewer tokens used to represent a word at input, the more sensitive its corresponding CWE. This suggests that CWEs capture information unrelated to word-level meaning and can be manipulated through trivial modifications of input data. We conclude that these PLM-derived CWEs may not be reliable semantic proxies, and that caution is warranted when interpreting representational similarity.
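As a rough illustration of the probing setup the abstract describes, the sketch below applies a single adjacent-character swap to a target word, extracts a contextual embedding for that word before and after the swap, and compares the two by cosine similarity. This is a minimal sketch, not the authors' code: it assumes the HuggingFace `transformers` library with `bert-base-uncased`, mean-pools the final-layer hidden states over a word's subword tokens, and uses an illustrative sentence and helper names that do not come from the paper.

```python
# Minimal sketch (assumptions noted above): probe a contextual word embedding's
# sensitivity to a single-character swap.
import random
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def swap_one_char(word: str) -> str:
    """Apply minimal orthographic noise: swap two adjacent characters."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def word_embedding(sentence: str, word: str):
    """Mean-pool final-layer hidden states over the target word's subword
    tokens; also return how many subword tokens the word was split into."""
    enc = tokenizer(sentence, return_tensors="pt")
    target = sentence.split().index(word)
    positions = [i for i, w in enumerate(enc.word_ids()) if w == target]
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    return hidden[positions].mean(dim=0), len(positions)

sentence = "The committee reached a unanimous decision yesterday"
word = "unanimous"
noised = swap_one_char(word)

clean_cwe, n_clean = word_embedding(sentence, word)
noisy_cwe, n_noisy = word_embedding(sentence.replace(word, noised), noised)

cos = torch.nn.functional.cosine_similarity(clean_cwe, noisy_cwe, dim=0)
print(f"{word} -> {noised}: cosine similarity {cos.item():.3f} "
      f"({n_clean} vs. {n_noisy} subword tokens)")
```

Repeating this over many words and grouping the resulting similarities by the clean word's subword-token count would approximate the tokenization analysis reported in the abstract, in which words represented by fewer subword tokens show the largest representational shifts under noise.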
Related papers
- Tomato, Tomahto, Tomate: Measuring the Role of Shared Semantics among Subwords in Multilingual Language Models [88.07940818022468]
We take an initial step toward measuring the role of shared semantics among subwords in encoder-only multilingual language models (mLMs).
We form "semantic tokens" by merging the semantically similar subwords and their embeddings.
Inspections of the grouped subwords show that they exhibit a wide range of semantic similarities.
arXiv Detail & Related papers (2024-11-07T08:38:32Z)
- Investigating the Contextualised Word Embedding Dimensions Responsible for Contextual and Temporal Semantic Changes [30.563130208194977]
It remains unclear as to how the meaning changes are encoded in the embedding space.
We compare pre-trained CWEs and their fine-tuned versions on semantic change benchmarks.
Our results reveal several novel insights, such as: although a small number of axes is responsible for the semantic changes of words in the pre-trained CWE space, this information becomes distributed across all dimensions once the models are fine-tuned.
arXiv Detail & Related papers (2024-07-03T05:42:20Z)
- Self-Supervised Speech Representations are More Phonetic than Semantic [52.02626675137819]
Self-supervised speech models (S3Ms) have become an effective backbone for speech applications.
We seek a more fine-grained analysis of the word-level linguistic properties encoded in S3Ms.
Our study reveals that S3M representations consistently and significantly exhibit more phonetic than semantic similarity.
arXiv Detail & Related papers (2024-06-12T20:04:44Z)
- A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multiple concepts for multilingual semantic matching, liberating the model from its reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
- Can Pretrained Language Models Derive Correct Semantics from Corrupt Subwords under Noise? [9.380410177526425]
This study assesses the robustness of PLMs against various forms of segmentation disruption caused by noise.
It provides a systematic categorization of segmentation corruption under noise, along with evaluation protocols.
Experimental results indicate that PLMs are unable to accurately compute word meanings if the noise introduces completely different subwords, small subword fragments, or a large number of additional subwords.
arXiv Detail & Related papers (2023-06-27T07:51:01Z)
- Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages [18.210880703295253]
We finetune pretrained language models (PLMs) on seven languages from three different families.
We analyze their zero-shot performance on closely related, non-standardized varieties.
Overall, we find that the similarity between the percentage of words that get split into subwords in the source and target data is the strongest predictor for model performance on target data.
arXiv Detail & Related papers (2023-04-20T08:32:34Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlap frequently occurs between paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show the proposed NDD metric to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- Conditional probing: measuring usable information beyond a baseline [103.93673427217527]
A common approach suggests that a representation encodes a property if probing that representation produces higher accuracy than probing a baseline representation.
We propose conditional probing, which explicitly conditions on the information in the baseline.
In a case study, we find that after conditioning on non-contextual word embeddings, properties like part-of-speech are accessible at deeper layers of a network.
arXiv Detail & Related papers (2021-09-19T21:56:58Z)
- R$^2$-Net: Relation of Relation Learning Network for Sentence Semantic Matching [58.72111690643359]
We propose a Relation of Relation Learning Network (R2-Net) for sentence semantic matching.
We first employ BERT to encode the input sentences from a global perspective.
Then a CNN-based encoder is designed to capture keywords and phrase information from a local perspective.
To fully leverage labels for better relation information extraction, we introduce a self-supervised relation of relation classification task.
arXiv Detail & Related papers (2020-12-16T13:11:30Z)
- Picking BERT's Brain: Probing for Linguistic Dependencies in Contextualized Embeddings Using Representational Similarity Analysis [13.016284599828232]
We investigate the degree to which a verb embedding encodes the verb's subject, a pronoun embedding encodes the pronoun's antecedent, and a full-sentence representation encodes the sentence's head word.
In all cases, we show that BERT's contextualized embeddings reflect the linguistic dependency being studied, and that BERT encodes these dependencies to a greater degree than it encodes less linguistically-salient controls.
arXiv Detail & Related papers (2020-11-24T13:19:06Z)
- Using Holographically Compressed Embeddings in Question Answering [0.0]
This research employs holographic compression of pre-trained embeddings to represent a token, its part-of-speech, and named entity type.
The implementation, in a modified question answering recurrent deep learning network, shows that semantic relationships are preserved, and yields strong performance.
arXiv Detail & Related papers (2020-07-14T18:29:49Z)