SBERT-WK: A Sentence Embedding Method by Dissecting BERT-based Word
Models
- URL: http://arxiv.org/abs/2002.06652v2
- Date: Mon, 1 Jun 2020 17:39:09 GMT
- Title: SBERT-WK: A Sentence Embedding Method by Dissecting BERT-based Word
Models
- Authors: Bin Wang, C.-C. Jay Kuo
- Abstract summary: A contextualized word representation model, called BERT, achieves state-of-the-art performance in quite a few NLP tasks.
Yet, generating a high-quality sentence representation from BERT-based word models remains an open problem.
We propose a new sentence embedding method by dissecting BERT-based word models through a geometric analysis of the space spanned by the word representations.
- Score: 43.18970770343777
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sentence embedding is an important research topic in natural language
processing (NLP) since it can transfer knowledge to downstream tasks.
Meanwhile, a contextualized word representation model, called BERT, achieves
state-of-the-art performance in quite a few NLP tasks. Yet, it remains an open
problem to generate a high-quality sentence representation from BERT-based word
models. Previous studies have shown that different layers of BERT capture
different linguistic properties. This allows us to fuse information across
layers to find better sentence representations. In this work, we study the
layer-wise pattern of the word representations of deep contextualized models.
Then, we propose a new sentence embedding method by dissecting BERT-based word
models through a geometric analysis of the space spanned by the word
representations. It is called the SBERT-WK method. No further training is
required in SBERT-WK. We evaluate SBERT-WK on semantic textual similarity and
downstream supervised tasks. Furthermore, ten sentence-level probing tasks are
presented for detailed linguistic analysis. Experiments show that SBERT-WK
achieves state-of-the-art performance. Our code is publicly available.
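To make the layer-fusion idea concrete, the sketch below pools BERT's per-layer token vectors into a single sentence vector by uniform averaging over layers and tokens. It is only an illustration of fusing information across layers, not the SBERT-WK weighting scheme itself (which derives per-word weights from a geometric analysis of each word's representations across layers); the checkpoint name and uniform pooling are assumptions made for the example.

```python
# Hedged sketch: average BERT hidden states over layers and tokens to obtain a
# fixed-size sentence vector. Uniform averaging is an illustrative assumption,
# not the SBERT-WK weighting scheme.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def sentence_embedding(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states: tuple of 13 tensors (embeddings + 12 layers), each (1, T, 768)
    layers = torch.stack(outputs.hidden_states, dim=0)   # (13, 1, T, 768)
    per_token = layers.mean(dim=0).squeeze(0)            # fuse layers -> (T, 768)
    return per_token.mean(dim=0)                         # pool tokens -> (768,)

emb = sentence_embedding("Sentence embeddings transfer knowledge to downstream tasks.")
print(emb.shape)  # torch.Size([768])
```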
Related papers
- Extracting Sentence Embeddings from Pretrained Transformer Models [0.0]
Given the 110M-parameter BERT's hidden representations from multiple layers and multiple tokens, we tried various ways to extract optimal sentence representations.
All methods were tested on 8 Semantic Textual Similarity (STS), 6 short text clustering, and 12 classification tasks.
Improvements for static token-based models are very high; in particular, random embeddings almost reach the performance of BERT-based representations on STS tasks.
arXiv Detail & Related papers (2024-08-15T10:54:55Z) - Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study [68.75670223005716]
We find that pre-trained language models like BERT have a potential ability to learn sequentially, even without any sparse memory replay.
Our experiments reveal that BERT can actually generate high-quality representations for previously learned tasks over the long term, under extremely sparse replay or even no replay.
arXiv Detail & Related papers (2023-03-02T09:03:43Z) - Roof-BERT: Divide Understanding Labour and Join in Work [7.523253052992842]
Roof-BERT is a model with two underlying BERTs and a fusion layer on them.
One of the underlying BERTs encodes the knowledge resources and the other one encodes the original input sentences.
Experimental results on a QA task reveal the effectiveness of the proposed model.
arXiv Detail & Related papers (2021-12-13T15:40:54Z) - Evaluation of BERT and ALBERT Sentence Embedding Performance on
Downstream NLP Tasks [4.955649816620742]
This paper explores sentence embedding models for BERT and ALBERT.
We take a modified BERT network with Siamese and triplet network structures, called Sentence-BERT (SBERT), and replace BERT with ALBERT to create Sentence-ALBERT (SALBERT); a minimal sketch of this Siamese scoring setup appears after this list.
arXiv Detail & Related papers (2021-01-26T09:14:06Z) - Does Chinese BERT Encode Word Structure? [17.836131968160917]
Contextualized representations give significantly improved results for a wide range of NLP tasks.
Much work has been dedicated to analyzing the features captured by representative models such as BERT.
We investigate Chinese BERT using both attention weight distribution statistics and probing tasks, finding that (1) word information is captured by BERT; (2) word-level features are mostly in the middle representation layers; (3) downstream tasks make different use of word features in BERT.
arXiv Detail & Related papers (2020-10-15T12:40:56Z) - BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT-inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts a Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z) - Learning to Select Bi-Aspect Information for Document-Scale Text Content
Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset, in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z) - Incorporating BERT into Neural Machine Translation [251.54280200353674]
We propose a new algorithm, named the BERT-fused model, in which we first use BERT to extract representations for an input sequence.
We conduct experiments on supervised (including sentence-level and document-level translations), semi-supervised and unsupervised machine translation, and achieve state-of-the-art results on seven benchmark datasets.
arXiv Detail & Related papers (2020-02-17T08:13:36Z) - BERT's output layer recognizes all hidden layers? Some Intriguing
Phenomena and a simple way to boost BERT [53.63288887672302]
Bidirectional Encoder Representations from Transformers (BERT) have achieved tremendous success in many natural language processing (NLP) tasks.
We find that, surprisingly, the output layer of BERT can reconstruct the input sentence when it directly takes each hidden layer of BERT as input.
We propose a quite simple method to boost the performance of BERT.
arXiv Detail & Related papers (2020-01-25T13:35:34Z)
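As referenced in the Sentence-BERT/Sentence-ALBERT entry above, the sketch below shows the generic Siamese scoring setup used by SBERT-style models: both sentences pass through the same encoder, are mean-pooled into fixed-size vectors, and are compared with cosine similarity. The checkpoint name and masked mean pooling are illustrative assumptions, not the exact SBERT or SALBERT configuration.

```python
# Hedged sketch of a Siamese (shared-encoder) comparison: encode two sentences
# with the same BERT, mean-pool the token vectors, and score with cosine similarity.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

def encode(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state        # (1, T, 768)
    mask = inputs["attention_mask"].unsqueeze(-1).float()   # (1, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)     # masked mean pooling -> (1, 768)

a = encode("A man is playing a guitar.")
b = encode("Someone is playing an instrument.")
print(f"cosine similarity: {torch.cosine_similarity(a, b).item():.3f}")
```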