Can Pretrained Language Models Derive Correct Semantics from Corrupt Subwords under Noise?
- URL: http://arxiv.org/abs/2306.15268v1
- Date: Tue, 27 Jun 2023 07:51:01 GMT
- Title: Can Pretrained Language Models Derive Correct Semantics from Corrupt Subwords under Noise?
- Authors: Xinzhe Li, Ming Liu, Shang Gao
- Abstract summary: This study assesses the robustness of PLMs against various forms of disrupted segmentation caused by noise.
It provides a systematic categorization of segmentation corruption under noise, together with evaluation protocols.
Experimental results indicate that PLMs are unable to accurately compute word meanings if the noise introduces completely different subwords, small subword fragments, or a large number of additional subwords.
- Score: 9.380410177526425
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For Pretrained Language Models (PLMs), their susceptibility to noise has
recently been linked to subword segmentation. However, it is unclear which
aspects of segmentation affect their understanding. This study assesses the
robustness of PLMs against various forms of disrupted segmentation caused by noise. An
evaluation framework for subword segmentation, named Contrastive Lexical
Semantic (CoLeS) probe, is proposed. It provides a systematic categorization of
segmentation corruption under noise and evaluation protocols by generating
contrastive datasets with canonical-noisy word pairs. Experimental results
indicate that PLMs are unable to accurately compute word meanings if the noise
introduces completely different subwords, small subword fragments, or a large
number of additional subwords, particularly when they are inserted within other
subwords.
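As a rough illustration of the probing setup described in the abstract, the sketch below pairs a canonical word with a noisy variant, inspects how the tokenizer segments each, and compares the PLM's representations of the two. The model (bert-base-uncased), mean pooling over subword vectors, and cosine similarity as the comparison metric are assumptions made for illustration, not necessarily the paper's exact protocol.

```python
# Hypothetical contrastive probe over a canonical-noisy word pair.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_vector(word):
    """Mean-pool the last hidden states of the word's subword tokens."""
    enc = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, dim)
    return hidden[1:-1].mean(dim=0)                  # drop [CLS] and [SEP]

def contrastive_score(canonical, noisy):
    """Cosine similarity between the canonical and noisy word representations."""
    return torch.cosine_similarity(word_vector(canonical),
                                   word_vector(noisy), dim=0).item()

# The typo changes the subword segmentation; the score reflects how far the
# PLM's representation drifts as a result.
print(tokenizer.tokenize("understanding"), tokenizer.tokenize("understnading"))
print(contrastive_score("understanding", "understnading"))
```

Under such a probe, noise that splits a word into completely different subwords or many small fragments would be expected to yield low similarity, which is consistent with the findings summarized above.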
Related papers
- Semantics or spelling? Probing contextual word embeddings with orthographic noise [4.622165486890317]
It remains unclear exactly what information is encoded in PLM hidden states.
Surprisingly, we find that CWEs generated by popular PLMs are highly sensitive to noise in input data.
This suggests that CWEs capture information unrelated to word-level meaning and can be manipulated through trivial modifications of input data.
arXiv Detail & Related papers (2024-08-08T02:07:25Z)
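The orthographic-noise probing described in the entry above rests on small character-level edits. The sketch below shows one hypothetical way to generate such edits; the specific operations and the fixed seed are illustrative assumptions, not the paper's noise model.

```python
# Hypothetical character-level (orthographic) noise operations.
import random
import string

def add_orthographic_noise(word, op="swap", seed=0):
    """Apply one character-level edit: swap, delete, insert, or substitute."""
    rng = random.Random(seed)
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    if op == "swap":         # transpose two adjacent characters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":       # drop one character
        return word[:i] + word[i + 1:]
    if op == "insert":       # insert a random lowercase letter
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    if op == "substitute":   # replace one character with a random letter
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i + 1:]
    return word

for op in ("swap", "delete", "insert", "substitute"):
    print(op, add_orthographic_noise("language", op))
```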
- Lexically Grounded Subword Segmentation [0.0]
We present three innovations in tokenization and subword segmentation.
First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization.
Second, we present a method for obtaining subword embeddings grounded in a word embedding space.
Third, we introduce an efficient segmentation algorithm based on a subword bigram model.
arXiv Detail & Related papers (2024-06-19T13:48:19Z)
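The third innovation above, segmentation with a subword bigram model, can be pictured as a dynamic program over split points that maximizes the total bigram log-probability. The toy vocabulary, probabilities, and unknown-bigram penalty below are made up for illustration and are not the paper's actual model.

```python
# Viterbi-style segmentation under a toy subword bigram model.
def segment(word, vocab, bigram_logp, unk_logp=-12.0):
    best = {0: (0.0, [], "<s>")}   # prefix length -> (score, subwords, last subword)
    for i in range(1, len(word) + 1):
        for j in range(i):
            piece = word[j:i]
            if j not in best or piece not in vocab:
                continue
            score, pieces, prev = best[j]
            score += bigram_logp.get((prev, piece), unk_logp)
            if i not in best or score > best[i][0]:
                best[i] = (score, pieces + [piece], piece)
    return best[len(word)][1] if len(word) in best else [word]

vocab = {"un", "break", "able", "breakable"}
logp = {("<s>", "un"): -0.9, ("un", "break"): -0.7,
        ("break", "able"): -0.5, ("un", "breakable"): -3.0}
print(segment("unbreakable", vocab, logp))   # ['un', 'break', 'able']
```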
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
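Vocabulary trimming as described above removes rare subwords so that their occurrences are re-encoded with the remaining units. The sketch below uses a simple frequency threshold and greedy longest-match re-encoding as a simplification of how a BPE model would actually re-segment.

```python
# Hypothetical vocabulary trimming followed by re-encoding with the trimmed vocabulary.
def trim_vocab(freq, threshold):
    """Keep subwords at or above the frequency threshold (plus single characters)."""
    return {tok for tok, f in freq.items() if f >= threshold or len(tok) == 1}

def reencode(token, vocab):
    """Greedily split a token into the longest subwords still in the vocabulary."""
    pieces, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            if token[i:j] in vocab:
                pieces.append(token[i:j])
                i = j
                break
        else:                          # no match: fall back to a single character
            pieces.append(token[i])
            i += 1
    return pieces

freq = {"segment": 3, "ation": 2, "seg": 80, "ment": 90, "a": 500, "tion": 120}
vocab = trim_vocab(freq, threshold=50)
print(reencode("segmentation", vocab))   # ['seg', 'ment', 'a', 'tion']
```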
- DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning [59.4644086610381]
We propose a novel denoising objective that approaches the problem from a different perspective, namely the intra-sentence perspective.
By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form.
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
arXiv Detail & Related papers (2024-01-24T17:48:45Z)
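A loose illustration of the two noise types mentioned above: discrete noise perturbs the token sequence, while continuous noise perturbs the embeddings. The drop probability and noise scale are assumptions, not the paper's settings.

```python
# Hypothetical discrete and continuous corruption of a sentence.
import random
import torch

def discrete_noise(tokens, drop_prob=0.15, seed=0):
    """Randomly drop tokens from the sentence (discrete corruption)."""
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > drop_prob]
    return kept or tokens               # never return an empty sentence

def continuous_noise(embeddings, sigma=0.1):
    """Add Gaussian noise to token embeddings (continuous corruption)."""
    return embeddings + sigma * torch.randn_like(embeddings)

print(discrete_noise("the model restores the original sentence".split()))
```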
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions.
Experiments on Semantic Textual Similarity show the resulting neighboring distribution divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
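A rough sketch of the mask-and-predict idea above: mask the same shared word in each of two overlapping texts, read the MLM's predicted distribution at the masked position, and compare the two distributions. The choice of bert-base-uncased and plain KL divergence is an assumption; the paper's NDD measure is built on distributions of this kind but is not necessarily this exact formula.

```python
# Hypothetical mask-and-predict comparison of two highly overlapped texts.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

def masked_distribution(sentence, target):
    """MLM distribution at the position where `target` is masked out."""
    enc = tok(sentence.replace(target, tok.mask_token, 1), return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**enc).logits[0]
    pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero()[0, 0]
    return torch.softmax(logits[pos], dim=-1)

def neighboring_divergence(text_a, text_b, shared_word):
    """KL divergence between the two predicted distributions at the shared word."""
    p = masked_distribution(text_a, shared_word)
    q = masked_distribution(text_b, shared_word)
    return torch.sum(p * (p.log() - q.log())).item()

print(neighboring_divergence("the cat sat on the mat",
                             "the dog sat on the mat", "sat"))
```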
- Enhancing Sindhi Word Segmentation using Subword Representation Learning and Position-aware Self-attention [19.520840812910357]
Sindhi word segmentation is a challenging task due to space omission and insertion issues.
Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features.
We propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task.
arXiv Detail & Related papers (2020-12-30T08:31:31Z)
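The sequence-labeling formulation above can be made concrete with per-character tags such as B/I/E/S (begin, inside, end, single), from which words are read off. The tagger itself would be the neural model; below, the tags are simply given, and Latin characters are used for readability.

```python
# Decoding a word segmentation from per-character B/I/E/S tags.
def decode_segmentation(chars, tags):
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):          # a word ends at this character
            words.append(current)
            current = ""
    if current:                        # flush a trailing unfinished word
        words.append(current)
    return words

print(decode_segmentation(list("thisworks"), list("BIIEBIIIE")))   # ['this', 'works']
```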
- Improving Chinese Segmentation-free Word Embedding With Unsupervised Association Measure [3.9435648520559177]
A segmentation-free word embedding model is proposed that collects its n-gram vocabulary via a novel unsupervised association measure called pointwise association with times information (PATI).
The proposed method leverages more latent information from the corpus and is thus able to collect more valid n-grams with stronger cohesion as embedding targets in unsegmented language data, such as Chinese texts.
arXiv Detail & Related papers (2020-07-05T13:55:19Z)
- Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings [28.04666950237383]
We consider segmental models for whole-word ("acoustic-to-word") speech recognition.
We describe an efficient approach for end-to-end whole-word segmental models.
We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation.
arXiv Detail & Related papers (2020-07-01T02:22:09Z)
- DenoiSeg: Joint Denoising and Segmentation [75.91760529986958]
We propose DenoiSeg, a new method that can be trained end-to-end on only a few annotated ground truth segmentations.
We achieve this by extending Noise2Void, a self-supervised denoising scheme that can be trained on noisy images alone, to also predict dense 3-class segmentations.
arXiv Detail & Related papers (2020-05-06T17:42:54Z)
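A loose sketch of a joint objective in the spirit described above: a self-supervised denoising term on every image plus a supervised 3-class segmentation term on the few annotated ones. The weighting scheme and concrete losses are assumptions; Noise2Void in particular uses masked-pixel reconstruction rather than the plain MSE shown here.

```python
# Hypothetical joint denoising + segmentation loss (not DenoiSeg's exact formulation).
import torch
import torch.nn.functional as F

def joint_loss(denoised, noisy_target, seg_logits, seg_labels, alpha=0.5):
    """Denoising term on all images; segmentation term only where labels exist."""
    denoise_term = F.mse_loss(denoised, noisy_target)
    if seg_labels is None:                                 # unannotated image
        return denoise_term
    seg_term = F.cross_entropy(seg_logits, seg_labels)     # foreground / background / border
    return alpha * denoise_term + (1 - alpha) * seg_term

pred, target = torch.rand(1, 1, 8, 8), torch.rand(1, 1, 8, 8)
logits, labels = torch.randn(1, 3, 8, 8), torch.randint(0, 3, (1, 8, 8))
print(joint_loss(pred, target, logits, labels).item())
```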
- Learning Interpretable and Discrete Representations with Adversarial Training for Unsupervised Text Classification [87.28408260725138]
TIGAN learns to encode texts into two disentangled representations: a discrete code and continuous noise.
The topical words extracted to represent the latent topics show that TIGAN learns coherent and highly interpretable topics.
arXiv Detail & Related papers (2020-04-28T02:53:59Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.