On the Sentence Embeddings from Pre-trained Language Models
- URL: http://arxiv.org/abs/2011.05864v1
- Date: Mon, 2 Nov 2020 13:14:57 GMT
- Title: On the Sentence Embeddings from Pre-trained Language Models
- Authors: Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, and Lei Li
- Abstract summary: In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited.
We find that BERT always induces a non-smooth, anisotropic semantic space of sentences, which harms its performance on semantic similarity tasks.
We propose to transform the anisotropic sentence embedding distribution to a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective.
- Score: 78.45172445684126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained contextual representations like BERT have achieved great success
in natural language processing. However, the sentence embeddings from the
pre-trained language models without fine-tuning have been found to poorly
capture the semantic meaning of sentences. In this paper, we argue that the
semantic information in the BERT embeddings is not fully exploited. We first
reveal the theoretical connection between the masked language model
pre-training objective and the semantic similarity task, and then
analyze the BERT sentence embeddings empirically. We find that BERT always
induces a non-smooth, anisotropic semantic space of sentences, which harms its
performance on semantic similarity tasks. To address this issue, we propose to
transform the anisotropic sentence embedding distribution to a smooth and
isotropic Gaussian distribution through normalizing flows that are learned with
an unsupervised objective. Experimental results show that our proposed
BERT-flow method obtains significant performance gains over the
state-of-the-art sentence embeddings on a variety of semantic textual
similarity tasks. The code is available at
https://github.com/bohanli/BERT-flow.
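As a rough illustration of the calibration step described in the abstract, the sketch below fits a small normalizing flow on top of frozen sentence embeddings by maximum likelihood, so that the mapped vectors are approximately standard Gaussian and similarity is computed in the latent space. It is a minimal RealNVP-style sketch in PyTorch, not the Glow-based flow released in the repository above; the `SentenceFlow` module and the random placeholder embeddings are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One RealNVP-style coupling: the first half of the vector parameterizes an
    affine transform of the second half, keeping the mapping invertible."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                                 # bounded scales for stability
        z2 = x2 * torch.exp(s) + t
        return torch.cat([x1, z2], dim=-1), s.sum(dim=-1)  # (output, log|det J|)

class SentenceFlow(nn.Module):
    """Stack of couplings with fixed permutations in between, trained by maximum
    likelihood so that mapped embeddings look approximately N(0, I)."""
    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([AffineCoupling(dim) for _ in range(n_layers)])
        self.register_buffer("perms", torch.stack(
            [torch.randperm(dim) for _ in range(n_layers)]))

    def forward(self, x):
        log_det = torch.zeros(x.size(0), device=x.device)
        for perm, layer in zip(self.perms, self.layers):
            x, ld = layer(x[:, perm])                     # mix dimensions, then couple
            log_det = log_det + ld
        return x, log_det

def nll(flow, embeddings):
    """Negative log-likelihood under a standard Gaussian base distribution."""
    z, log_det = flow(embeddings)
    log_pz = -0.5 * (z ** 2).sum(dim=-1) - 0.5 * z.size(-1) * math.log(2 * math.pi)
    return -(log_pz + log_det).mean()

# Unsupervised training loop; the encoder producing the embeddings stays frozen.
dim = 768
flow = SentenceFlow(dim)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
for step in range(200):
    emb = torch.randn(32, dim)       # placeholder for mean-pooled BERT embeddings
    loss = nll(flow, emb)
    opt.zero_grad(); loss.backward(); opt.step()
# At inference time, map two sentence embeddings through the flow and compare the
# resulting latent vectors with cosine similarity.
```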
Related papers
- DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning [59.4644086610381]
We propose a novel denoising objective that approaches sentence representation learning from a different angle, namely the intra-sentence perspective.
By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form.
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
arXiv Detail & Related papers (2024-01-24T17:48:45Z)
- Ditto: A Simple and Efficient Approach to Improve Sentence Embeddings [29.273438110694574]
Sentence embeddings from pre-trained language models suffer from a bias towards uninformative words.
We propose a simple and efficient unsupervised approach, Diagonal Attention Pooling (Ditto), which weights words with model-based importance estimations.
We show Ditto can alleviate the anisotropy problem and improve various pre-trained models on semantic textual similarity tasks.
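A minimal sketch of what diagonal-attention-style pooling could look like with Hugging Face transformers, assuming the self-attention weights of one layer are averaged over heads and each token's hidden state is weighted by the attention it assigns to itself; the layer choice and pooling details are assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

def ditto_style_embedding(sentences, layer=-1):
    enc = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    hidden = out.last_hidden_state                       # (batch, seq, hid)
    att = out.attentions[layer].mean(dim=1)              # average heads -> (batch, seq, seq)
    weights = att.diagonal(dim1=-2, dim2=-1)             # attention of each token to itself
    weights = weights * enc["attention_mask"]            # ignore padding positions
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return (hidden * weights.unsqueeze(-1)).sum(dim=1)   # (batch, hid)

emb = ditto_style_embedding(["A man is playing a guitar.",
                             "Someone plays guitar."])
print(torch.cosine_similarity(emb[0], emb[1], dim=0))
```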
arXiv Detail & Related papers (2023-05-18T07:56:40Z)
- Relational Sentence Embedding for Flexible Semantic Matching [86.21393054423355]
We present Relational Sentence Embedding (RSE), a new paradigm to further explore the potential of sentence embeddings.
RSE is effective and flexible in modeling sentence relations and outperforms a series of state-of-the-art embedding methods.
arXiv Detail & Related papers (2022-12-17T05:25:17Z)
- Improving Contextual Representation with Gloss Regularized Pre-training [9.589252392388758]
We propose adding an auxiliary gloss regularizer module to BERT pre-training (GR-BERT) to enhance word semantic similarity.
By predicting masked words and aligning contextual embeddings to corresponding glosses simultaneously, the word similarity can be explicitly modeled.
Experimental results show that the gloss regularizer benefits BERT in word-level and sentence-level semantic representation.
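A minimal sketch of a gloss-regularized loss of this kind, assuming the regularizer is a cosine-alignment term added to the standard masked-LM cross-entropy; the gloss encoder and the weighting are placeholders, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def gloss_regularized_loss(mlm_logits, target_ids, contextual_emb, gloss_emb, lam=0.1):
    """mlm_logits:     (num_masked, vocab) predictions at masked positions
    target_ids:     (num_masked,) gold token ids
    contextual_emb: (num_masked, hid) hidden states of the masked words
    gloss_emb:      (num_masked, hid) encoded glosses of those words"""
    mlm = F.cross_entropy(mlm_logits, target_ids)                         # standard MLM term
    align = 1.0 - F.cosine_similarity(contextual_emb, gloss_emb, dim=-1).mean()
    return mlm + lam * align                                              # joint objective

# toy call with random tensors, just to show the shapes
vocab, n, hid = 30522, 4, 768
loss = gloss_regularized_loss(torch.randn(n, vocab), torch.randint(0, vocab, (n,)),
                              torch.randn(n, hid), torch.randn(n, hid))
print(loss)
```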
arXiv Detail & Related papers (2022-05-13T12:50:32Z)
- An Explanation of In-context Learning as Implicit Bayesian Inference [117.19809377740188]
We study the role of the pretraining distribution on the emergence of in-context learning.
We prove that in-context learning occurs implicitly via Bayesian inference of the latent concept.
We empirically find that scaling model size improves in-context accuracy even when the pretraining loss is the same.
arXiv Detail & Related papers (2021-11-03T09:12:33Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common subsequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions.
Experiments on semantic textual similarity show the proposed neighboring distribution divergence (NDD) metric to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
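A rough sketch of the mask-and-predict strategy, assuming the shared words are masked in both texts and the masked-LM distributions at those positions are compared with a symmetric KL term; the word matching below is a crude stand-in for the longest-common-subsequence alignment, and the paper's NDD metric may aggregate the distributions differently.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

def masked_distributions(words, positions):
    """Mask the given word positions and return the MLM distribution at each mask."""
    masked = [tok.mask_token if i in positions else w for i, w in enumerate(words)]
    enc = tok(" ".join(masked), return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**enc).logits[0]
    mask_idx = (enc["input_ids"][0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
    return F.softmax(logits[mask_idx], dim=-1)            # (num_masks, vocab)

def overlap_divergence(text_a, text_b):
    a, b = text_a.split(), text_b.split()
    shared = [w for w in a if w in b]                     # crude stand-in for LCS alignment
    pos_a = {i for i, w in enumerate(a) if w in shared}
    pos_b = {i for i, w in enumerate(b) if w in shared}
    p = masked_distributions(a, pos_a)
    q = masked_distributions(b, pos_b)
    n = min(len(p), len(q))                               # naive pairing of masked slots
    # symmetric KL between the paired predicted distributions, averaged over slots
    kl = 0.5 * (F.kl_div(q[:n].log(), p[:n], reduction="batchmean")
                + F.kl_div(p[:n].log(), q[:n], reduction="batchmean"))
    return kl.item()

print(overlap_divergence("the cat sat on the mat", "the cat slept on the mat"))
```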
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- Disentangling Semantics and Syntax in Sentence Embeddings with Pre-trained Language Models [32.003787396501075]
ParaBART is a semantic sentence embedding model that learns to disentangle semantics and syntax in sentence embeddings obtained by pre-trained language models.
ParaBART is trained to perform syntax-guided paraphrasing, based on a source sentence that shares semantics with the target paraphrase, and a parse tree that specifies the target syntax.
arXiv Detail & Related papers (2021-04-11T21:34:46Z)
- GiBERT: Introducing Linguistic Knowledge into BERT through a Lightweight Gated Injection Method [29.352569563032056]
We propose a novel method to explicitly inject linguistic knowledge in the form of word embeddings into a pre-trained BERT.
Our performance improvements on multiple semantic similarity datasets when injecting dependency-based and counter-fitted embeddings indicate that such information is beneficial and currently missing from the original model.
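A minimal sketch of a gated injection layer of this kind, assuming external word embeddings (for example dependency-based or counter-fitted vectors) are projected to the model's hidden size and added through a sigmoid gate; where the layer sits inside the BERT stack and the exact form of the gate are assumptions.

```python
import torch
import torch.nn as nn

class GatedInjection(nn.Module):
    def __init__(self, hidden_size, ext_dim):
        super().__init__()
        self.proj = nn.Linear(ext_dim, hidden_size)       # project external embeddings
        self.gate = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, hidden, ext_emb):
        """hidden: (batch, seq, hidden_size) BERT hidden states
        ext_emb: (batch, seq, ext_dim) external embeddings aligned to tokens"""
        injected = torch.tanh(self.proj(ext_emb))
        g = torch.sigmoid(self.gate(torch.cat([hidden, injected], dim=-1)))
        return hidden + g * injected                      # gate controls how much is injected

layer = GatedInjection(hidden_size=768, ext_dim=300)
out = layer(torch.randn(2, 16, 768), torch.randn(2, 16, 300))
print(out.shape)
```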
arXiv Detail & Related papers (2020-10-23T17:00:26Z)
- Latte-Mix: Measuring Sentence Semantic Similarity with Latent Categorical Mixtures [0.0]
We learn a categorical variational autoencoder based on off-the-shelf pre-trained language models.
We empirically demonstrate that these finetuned models could be further improved by Latte-Mix.
arXiv Detail & Related papers (2020-10-21T23:45:18Z)
- Syntactic Structure Distillation Pretraining For Bidirectional Encoders [49.483357228441434]
We introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining.
We distill the approximate marginal distribution over words in context from the syntactic LM.
Our findings demonstrate the benefits of syntactic biases, even in representation learners that exploit large amounts of data.
arXiv Detail & Related papers (2020-05-27T16:44:01Z)
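A minimal sketch of the distillation term described in the last entry above, assuming the syntactic LM's approximate marginals over masked words are available as a dense distribution and are combined with the standard masked-LM cross-entropy through a KL term; the mixing weight and the teacher tensor are placeholders, not the paper's setup.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, target_ids, alpha=0.5):
    """student_logits: (num_masked, vocab) MLM logits at masked positions
    teacher_probs:  (num_masked, vocab) syntactic-LM marginals (assumed precomputed)
    target_ids:     (num_masked,) gold token ids for the standard MLM term"""
    ce = F.cross_entropy(student_logits, target_ids)      # usual masked-LM loss
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1), teacher_probs,
                  reduction="batchmean")                  # pull toward the syntactic teacher
    return alpha * ce + (1.0 - alpha) * kl

# toy shapes only, to show the call
vocab, n_masked = 30522, 8
logits = torch.randn(n_masked, vocab)
teacher = F.softmax(torch.randn(n_masked, vocab), dim=-1)
targets = torch.randint(0, vocab, (n_masked,))
print(distillation_loss(logits, teacher, targets))
```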
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.