Improving Contextual Representation with Gloss Regularized Pre-training
- URL: http://arxiv.org/abs/2205.06603v1
- Date: Fri, 13 May 2022 12:50:32 GMT
- Title: Improving Contextual Representation with Gloss Regularized Pre-training
- Authors: Yu Lin, Zhecheng An, Peihao Wu, Zejun Ma
- Abstract summary: We propose adding an auxiliary gloss regularizer module to BERT pre-training (GR-BERT) to enhance word semantic similarity.
By simultaneously predicting masked words and aligning contextual embeddings with the corresponding glosses, word similarity can be modeled explicitly.
Experimental results show that the gloss regularizer benefits BERT in word-level and sentence-level semantic representation.
- Score: 9.589252392388758
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Though achieving impressive results on many NLP tasks, BERT-like masked
language models (MLMs) encounter a discrepancy between pre-training and
inference. In light of this gap, we investigate the contextual representations
of pre-training and inference from the perspective of word probability
distributions. We find that BERT risks neglecting contextual word similarity
during pre-training. To tackle this issue, we propose adding an auxiliary gloss
regularizer module to BERT pre-training (GR-BERT) to enhance word semantic
similarity. By simultaneously predicting masked words and aligning contextual
embeddings with the corresponding glosses, word similarity can be modeled
explicitly. We design two architectures for GR-BERT and evaluate the model on
downstream tasks. Experimental results show that the gloss regularizer benefits
BERT in both word-level and sentence-level semantic representation. GR-BERT
achieves a new state of the art on the lexical substitution task and
substantially improves BERT sentence representations on both unsupervised and
supervised STS tasks.
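As a rough illustration of the objective described above, the sketch below combines a standard masked-word prediction loss with a contrastive alignment between the contextual embedding of a target word and an encoding of its gloss. It is a minimal sketch under assumed tensor shapes and a hypothetical InfoNCE-style alignment term; the paper's two GR-BERT architectures are not reproduced here.

```python
import torch
import torch.nn.functional as F

def gloss_regularized_loss(mlm_logits, mlm_labels, word_vecs, gloss_vecs, tau=0.07, alpha=1.0):
    """Hypothetical joint objective: masked-word prediction plus gloss alignment.

    mlm_logits : (batch, seq_len, vocab)  MLM prediction scores
    mlm_labels : (batch, seq_len)         masked-token ids, -100 elsewhere
    word_vecs  : (n, dim)  contextual embeddings of target words
    gloss_vecs : (n, dim)  encoded glosses of those words (same order)
    """
    # Standard MLM cross-entropy over masked positions.
    mlm_loss = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1), ignore_index=-100
    )

    # InfoNCE-style alignment: each word embedding should be closest
    # to its own gloss among all glosses in the batch.
    w = F.normalize(word_vecs, dim=-1)
    g = F.normalize(gloss_vecs, dim=-1)
    sim = w @ g.t() / tau                       # (n, n) cosine similarities
    targets = torch.arange(sim.size(0))
    gloss_loss = F.cross_entropy(sim, targets)

    return mlm_loss + alpha * gloss_loss


# Toy usage with random tensors in place of model outputs.
logits = torch.randn(2, 8, 30522)
labels = torch.full((2, 8), -100); labels[0, 3] = 42; labels[1, 5] = 7
loss = gloss_regularized_loss(logits, labels, torch.randn(2, 768), torch.randn(2, 768))
```

In this sketch the alignment term pulls each word's contextual vector toward its own gloss and away from the other glosses in the batch, which is one plausible way to make word similarity explicit during pre-training.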
Related papers
- Breaking Down Word Semantics from Pre-trained Language Models through Layer-wise Dimension Selection [0.0]
This paper aims to disentangle semantic sense from BERT by applying a binary mask to intermediate outputs across the layers.
The disentangled embeddings are evaluated through binary classification to determine if the target word in two different sentences has the same meaning.
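One way to picture the masking step (a sketch with hypothetical names, not the paper's learned selection procedure): a per-layer binary mask zeroes out hidden dimensions of the target word's representations, and the masked vectors from two sentences are compared to decide whether the word is used in the same sense.

```python
import torch
import torch.nn.functional as F

def masked_sense_similarity(layer_states_a, layer_states_b, masks):
    """Compare a target word's representation in two sentences after
    applying per-layer binary masks over hidden dimensions.

    layer_states_a/b : list of (hidden_dim,) tensors, one per layer,
                       for the target word in sentence A / B
    masks            : list of (hidden_dim,) {0,1} tensors, one per layer
    Returns the mean cosine similarity of the masked vectors.
    """
    sims = []
    for h_a, h_b, m in zip(layer_states_a, layer_states_b, masks):
        sims.append(F.cosine_similarity(h_a * m, h_b * m, dim=0))
    return torch.stack(sims).mean()

# Toy usage: 12 layers, 768 dims, random masks keeping ~30% of dimensions.
layers_a = [torch.randn(768) for _ in range(12)]
layers_b = [torch.randn(768) for _ in range(12)]
masks = [(torch.rand(768) < 0.3).float() for _ in range(12)]
same_sense = masked_sense_similarity(layers_a, layers_b, masks) > 0.5  # hypothetical threshold
```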
arXiv Detail & Related papers (2023-10-08T11:07:19Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions, from which a Neighboring Distribution Divergence (NDD) is computed.
Experiments on Semantic Textual Similarity show NDD to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
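A loose approximation of the mask-and-predict idea, using shared tokens as a stand-in for the longest common sequence and a simple KL divergence between the predicted distributions; function names are hypothetical and this is not the paper's exact NDD formulation.

```python
import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def mlm_dist(token_ids, pos):
    """Log-distribution predicted by the MLM at `pos` when that token is masked."""
    ids = token_ids.clone()
    ids[0, pos] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(input_ids=ids).logits
    return F.log_softmax(logits[0, pos], dim=-1)

def neighbor_divergence(text_a, text_b):
    ids_a = tokenizer(text_a, return_tensors="pt").input_ids
    ids_b = tokenizer(text_b, return_tensors="pt").input_ids
    shared = (set(ids_a[0].tolist()) & set(ids_b[0].tolist())) - {
        tokenizer.cls_token_id, tokenizer.sep_token_id}
    divs = []
    for tok in shared:
        pa = (ids_a[0] == tok).nonzero()[0].item()
        pb = (ids_b[0] == tok).nonzero()[0].item()
        p, q = mlm_dist(ids_a, pa), mlm_dist(ids_b, pb)
        # KL divergence between the two predicted distributions at the shared position.
        divs.append(F.kl_div(q, p, log_target=True, reduction="sum"))
    return torch.stack(divs).mean() if divs else torch.tensor(0.0)

print(neighbor_divergence("the cat sat on the mat", "the cat slept on the mat"))
```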
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- On the Sentence Embeddings from Pre-trained Language Models [78.45172445684126]
In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited.
We find that BERT always induces a non-smooth, anisotropic semantic space of sentences, which harms its performance on semantic similarity tasks.
We propose to transform the anisotropic sentence embedding distribution to a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective.
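The calibration idea can be sketched with a generic normalizing flow: a small invertible (affine coupling) transform is fitted to map fixed sentence embeddings toward a standard Gaussian by maximizing the flow log-likelihood. This is an assumption-laden toy version, not the paper's BERT-flow implementation.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One affine coupling layer: half of the dimensions are transformed
    conditioned on the other half, keeping the map invertible."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                      # keep scales bounded
        z2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)
        return torch.cat([x1, z2], dim=-1), log_det

def nll(z, log_det):
    # Negative log-likelihood under a standard Gaussian base distribution.
    log_pz = -0.5 * (z ** 2 + torch.log(torch.tensor(2 * torch.pi))).sum(dim=-1)
    return -(log_pz + log_det).mean()

# Toy training loop on random "sentence embeddings".
emb = torch.randn(512, 768) * 3 + 1           # stand-in for BERT sentence vectors
flow = AffineCoupling(768)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
for _ in range(200):
    z, log_det = flow(emb)
    loss = nll(z, log_det)
    opt.zero_grad(); loss.backward(); opt.step()
# After fitting, flow(emb)[0] gives the calibrated, more isotropic embeddings.
```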
arXiv Detail & Related papers (2020-11-02T13:14:57Z)
- GiBERT: Introducing Linguistic Knowledge into BERT through a Lightweight Gated Injection Method [29.352569563032056]
We propose a novel method to explicitly inject linguistic knowledge in the form of word embeddings into a pre-trained BERT.
Our performance improvements on multiple semantic similarity datasets when injecting dependency-based and counter-fitted embeddings indicate that such information is beneficial and currently missing from the original model.
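The gated injection principle can be sketched as follows: an external (e.g., dependency-based or counter-fitted) word embedding is projected into the hidden space and added to the corresponding hidden state through a learned gate. Module names and dimensions are illustrative, not GiBERT's actual implementation.

```python
import torch
import torch.nn as nn

class GatedInjection(nn.Module):
    """Add externally trained word embeddings into contextual hidden states
    through a learned gate (illustrative sketch)."""
    def __init__(self, ext_dim, hidden_dim):
        super().__init__()
        self.project = nn.Linear(ext_dim, hidden_dim)    # map external space -> model space
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, hidden_states, ext_embeddings):
        # hidden_states:  (batch, seq, hidden)  intermediate encoder outputs
        # ext_embeddings: (batch, seq, ext_dim) aligned external word vectors
        injected = self.project(ext_embeddings)
        g = torch.sigmoid(self.gate(torch.cat([hidden_states, injected], dim=-1)))
        return hidden_states + g * injected              # gate controls how much is injected

layer = GatedInjection(ext_dim=300, hidden_dim=768)
out = layer(torch.randn(2, 16, 768), torch.randn(2, 16, 300))
```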
arXiv Detail & Related papers (2020-10-23T17:00:26Z)
- Improving Text Generation with Student-Forcing Optimal Transport [122.11881937642401]
We propose using optimal transport (OT) to match the sequences generated in training and testing modes.
An extension is also proposed to improve the OT learning, based on the structural and contextual information of the text sequences.
The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
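A rough sketch of the sequence-matching step: token embeddings of a generated sequence and a reference sequence are compared with an entropy-regularized optimal transport (Sinkhorn) cost over pairwise cosine distances, which could be added to the usual training loss. The embeddings below are placeholders, and the paper's structural/contextual extension is not modeled.

```python
import torch

def sinkhorn_ot(x, y, eps=0.1, iters=100):
    """Entropy-regularized OT cost between two sets of embeddings
    x: (n, d), y: (m, d); uniform marginals."""
    cost = 1 - torch.nn.functional.cosine_similarity(
        x.unsqueeze(1), y.unsqueeze(0), dim=-1)           # (n, m) cosine distances
    K = torch.exp(-cost / eps)
    u = torch.full((x.size(0),), 1.0 / x.size(0))
    v = torch.full((y.size(0),), 1.0 / y.size(0))
    a, b = u.clone(), v.clone()
    for _ in range(iters):                                # Sinkhorn iterations
        a = u / (K @ b)
        b = v / (K.t() @ a)
    plan = a.unsqueeze(1) * K * b.unsqueeze(0)            # transport plan
    return (plan * cost).sum()

# Toy usage: token embeddings of a generated vs. a reference sequence.
gen = torch.randn(12, 256, requires_grad=True)
ref = torch.randn(15, 256)
ot_loss = sinkhorn_ot(gen, ref)
ot_loss.backward()        # would be combined with the usual MLE training loss
```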
arXiv Detail & Related papers (2020-10-12T19:42:25Z)
- Exploring BERT's Sensitivity to Lexical Cues using Tests from Semantic Priming [8.08493736237816]
We present a case study analyzing the pre-trained BERT model with tests informed by semantic priming.
We find that BERT, too, shows "priming": it predicts a word with greater probability when the context includes a related word rather than an unrelated one.
Follow-up analysis shows BERT to be increasingly distracted by related prime words as context becomes more informative.
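A probe in the spirit of this study can be run directly with a pretrained masked LM: compare the probability assigned to a target word when the context contains a related versus an unrelated prime. The sentences and word choices below are illustrative only.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def target_probability(context, target):
    """P(target | context) at the [MASK] position of the context."""
    inputs = tokenizer(context, return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits[0, mask_pos], dim=-1)
    return probs[tokenizer.convert_tokens_to_ids(target)].item()

related   = "The doctor said the [MASK] would recover soon."    # related prime: "doctor"
unrelated = "The driver said the [MASK] would recover soon."    # unrelated prime: "driver"
print(target_probability(related, "patient"), target_probability(unrelated, "patient"))
```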
arXiv Detail & Related papers (2020-10-06T20:30:59Z)
- Syntactic Structure Distillation Pretraining For Bidirectional Encoders [49.483357228441434]
We introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining.
We distill the approximate marginal distribution over words in context from the syntactic LM.
Our findings demonstrate the benefits of syntactic biases, even in representation learners that exploit large amounts of data.
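The distillation objective can be summarized as a KL term between the syntactic teacher's word distributions and the student MLM's distributions at masked positions; the sketch below shows that term with placeholder tensors, alongside the usual MLM loss.

```python
import torch
import torch.nn.functional as F

def syntactic_distillation_loss(student_logits, teacher_probs, mlm_labels, beta=1.0):
    """KL(teacher || student) at masked positions, added to the usual MLM loss.

    student_logits : (batch, seq, vocab)  BERT MLM scores
    teacher_probs  : (batch, seq, vocab)  marginal word distributions from a syntactic LM
    mlm_labels     : (batch, seq)         masked-token ids, -100 elsewhere
    """
    mask = mlm_labels != -100
    log_q = F.log_softmax(student_logits[mask], dim=-1)   # student log-probs
    p = teacher_probs[mask]                                # teacher probs
    kl = F.kl_div(log_q, p, reduction="batchmean")
    mlm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                          mlm_labels.view(-1), ignore_index=-100)
    return mlm + beta * kl

# Toy usage with random tensors.
V = 1000
logits = torch.randn(2, 8, V)
teacher = torch.softmax(torch.randn(2, 8, V), dim=-1)
labels = torch.full((2, 8), -100); labels[0, 2] = 5; labels[1, 6] = 9
loss = syntactic_distillation_loss(logits, teacher, labels)
```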
arXiv Detail & Related papers (2020-05-27T16:44:01Z)
- BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
BURT adopts a Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
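A minimal Siamese setup in the same spirit: one shared encoder (a toy recurrent encoder here, standing in for BERT) embeds both members of a pair, and the cosine similarity is trained against pair labels. BURT's actual training data and objectives are richer than this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Shared encoder applied to both members of a pair (toy stand-in for BERT)."""
    def __init__(self, vocab=30522, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.enc = nn.GRU(dim, dim, batch_first=True)

    def encode(self, ids):
        out, _ = self.enc(self.emb(ids))
        return out.mean(dim=1)                 # mean pooling -> fixed-size vector

    def forward(self, ids_a, ids_b):
        return F.cosine_similarity(self.encode(ids_a), self.encode(ids_b))

model = SiameseEncoder()
ids_a = torch.randint(0, 30522, (4, 12))       # batch of word / phrase / sentence pairs
ids_b = torch.randint(0, 30522, (4, 12))
gold = torch.tensor([0.9, 0.1, 0.5, 0.7])      # similarity labels (e.g., scaled STS scores)
loss = F.mse_loss(model(ids_a, ids_b), gold)
loss.backward()
```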
arXiv Detail & Related papers (2020-04-29T04:01:52Z)
- Multilingual Alignment of Contextual Word Representations [49.42244463346612]
With the proposed contextual alignment procedure, multilingual BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model.
We introduce a contextual version of word retrieval and show that it correlates well with downstream zero-shot transfer.
These results support contextual alignment as a useful concept for understanding large multilingual pre-trained models.
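Contextual word retrieval can be sketched as nearest-neighbor search over contextual word embeddings: for each source-side word vector, retrieve the closest target-side vector and check whether it is the gold alignment. The tensors below are placeholders for multilingual model outputs.

```python
import torch
import torch.nn.functional as F

def retrieval_accuracy(src_vecs, tgt_vecs, gold_alignment):
    """src_vecs: (n, d) contextual embeddings of source words
    tgt_vecs: (m, d) contextual embeddings of candidate target words
    gold_alignment: (n,) index of the correct target word for each source word."""
    src = F.normalize(src_vecs, dim=-1)
    tgt = F.normalize(tgt_vecs, dim=-1)
    nearest = (src @ tgt.t()).argmax(dim=-1)      # cosine nearest neighbor
    return (nearest == gold_alignment).float().mean().item()

# Toy usage with random embeddings in place of multilingual BERT outputs.
src = torch.randn(50, 768)
tgt = torch.randn(80, 768)
gold = torch.randint(0, 80, (50,))
print(retrieval_accuracy(src, tgt, gold))
```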
arXiv Detail & Related papers (2020-02-10T03:27:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.