BERT's output layer recognizes all hidden layers? Some Intriguing
Phenomena and a simple way to boost BERT
- URL: http://arxiv.org/abs/2001.09309v2
- Date: Mon, 15 Feb 2021 09:54:30 GMT
- Title: BERT's output layer recognizes all hidden layers? Some Intriguing
Phenomena and a simple way to boost BERT
- Authors: Wei-Tsung Kao, Tsung-Han Wu, Po-Han Chi, Chun-Cheng Hsieh, Hung-Yi Lee
- Abstract summary: Bidirectional Encoder Representations from Transformers (BERT) have achieved tremendous success in many natural language processing (NLP) tasks.
We find that, surprisingly, the output layer of BERT can reconstruct the input sentence by directly taking each hidden layer of BERT as input.
We propose a quite simple method to boost the performance of BERT.
- Score: 53.63288887672302
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although Bidirectional Encoder Representations from Transformers (BERT) have
achieved tremendous success in many natural language processing (NLP) tasks, it
remains a black box. A variety of previous works have tried to lift the veil of
BERT and understand each layer's functionality. In this paper, we find that,
surprisingly, the output layer of BERT can reconstruct the input sentence by
directly taking each hidden layer of BERT as input, even though the output
layer has never seen any input other than the final hidden layer. This holds
across a wide variety of BERT-based models, even when some layers are
duplicated. Based on this observation, we propose a simple method to boost the
performance of BERT: by duplicating some layers of a BERT-based model to make
it deeper (no extra training is required in this step), the model obtains
better performance on downstream tasks after fine-tuning.
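To make these two points concrete, here is a minimal sketch using the Hugging Face transformers library (an assumption; the paper does not prescribe a particular code base). The first part feeds every hidden layer into BERT's masked-language-model output head to see whether the input sentence is reconstructed; the second part duplicates a couple of encoder layers, chosen arbitrarily here, to obtain a deeper model with no extra training before fine-tuning.

```python
# Minimal sketch of both ideas from the abstract, written against the Hugging
# Face `transformers` BERT implementation (an assumption; the paper does not
# tie its method to a particular code base).
import copy

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# 1) The intriguing phenomenon: feed every hidden layer to the output head
#    and check whether the input sentence comes back out.
inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model.bert(**inputs, output_hidden_states=True).hidden_states
for idx, hidden in enumerate(hidden_states):
    token_ids = model.cls(hidden).argmax(dim=-1)[0].tolist()  # MLM head on layer idx
    print(idx, tokenizer.decode(token_ids))

# 2) The simple boost: duplicate a few encoder layers to make the model deeper.
#    No extra training happens here; the deeper model is then fine-tuned as usual.
duplicate_after = {5, 11}                        # hypothetical choice of layers
new_layers = []
for i, layer in enumerate(model.bert.encoder.layer):
    new_layers.append(layer)
    if i in duplicate_after:
        new_layers.append(copy.deepcopy(layer))  # exact copy of the layer's weights
model.bert.encoder.layer = torch.nn.ModuleList(new_layers)
model.config.num_hidden_layers = len(new_layers)
# `model` is now a deeper BERT, ready for downstream fine-tuning.
```

The duplicated layers simply reuse the original weights, so building the deeper model is free; only the subsequent fine-tuning involves training.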
Related papers
- Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study [68.75670223005716]
We find that pre-trained language models like BERT have the potential to learn sequentially, even without any sparse memory replay.
Our experiments reveal that BERT can generate high-quality representations for previously learned tasks over the long term, under extremely sparse replay or even no replay.
arXiv Detail & Related papers (2023-03-02T09:03:43Z)
- PromptBERT: Improving BERT Sentence Embeddings with Prompts [95.45347849834765]
We propose a prompt-based sentence embedding method that can reduce token embedding biases and make the original BERT layers more effective.
We also propose a novel unsupervised training objective based on template denoising, which substantially narrows the performance gap between the supervised and unsupervised settings.
Our fine-tuned method outperforms the state-of-the-art method SimCSE in both unsupervised and supervised settings.
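As a rough illustration of the prompt-based idea, the sketch below wraps a sentence in a template and uses the hidden state at the [MASK] position as its embedding; the template is assumed for illustration, and the template-denoising objective is omitted.

```python
# Sketch of a prompt-based sentence embedding (template and details assumed,
# not PromptBERT's exact recipe; template denoising is omitted).
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

def prompt_embed(sentence: str) -> torch.Tensor:
    text = f'This sentence : "{sentence}" means {tok.mask_token} .'
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state             # (1, seq_len, 768)
    mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero()[0].item()
    return hidden[0, mask_pos]                             # [MASK] vector as embedding

a = prompt_embed("A man is playing a guitar.")
b = prompt_embed("Someone plays the guitar.")
print(torch.cosine_similarity(a, b, dim=0).item())
```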
arXiv Detail & Related papers (2022-01-12T06:54:21Z)
- Roof-BERT: Divide Understanding Labour and Join in Work [7.523253052992842]
Roof-BERT is a model with two underlying BERTs and a fusion layer on top of them.
One of the underlying BERTs encodes the knowledge resources and the other one encodes the original input sentences.
Experimental results on a QA task demonstrate the effectiveness of the proposed model.
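A minimal sketch of that two-encoder layout is given below; the fusion layer shown here (concatenating the two [CLS] vectors into a linear head) is a placeholder, not necessarily Roof-BERT's actual fusion module.

```python
# Sketch of the described two-encoder layout: one BERT for the knowledge
# resource, one for the input sentence, plus a placeholder fusion layer.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TwoBertFusion(nn.Module):
    def __init__(self, num_labels: int = 2, hidden: int = 768):
        super().__init__()
        self.knowledge_bert = AutoModel.from_pretrained("bert-base-uncased")
        self.sentence_bert = AutoModel.from_pretrained("bert-base-uncased")
        self.fusion = nn.Linear(2 * hidden, num_labels)    # placeholder fusion head

    def forward(self, knowledge_inputs, sentence_inputs):
        k = self.knowledge_bert(**knowledge_inputs).last_hidden_state[:, 0]
        s = self.sentence_bert(**sentence_inputs).last_hidden_state[:, 0]
        return self.fusion(torch.cat([k, s], dim=-1))      # joint prediction

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TwoBertFusion()
logits = model(tok("A relevant knowledge snippet.", return_tensors="pt"),
               tok("What is the question about?", return_tensors="pt"))
```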
arXiv Detail & Related papers (2021-12-13T15:40:54Z)
- BERT-DRE: BERT with Deep Recursive Encoder for Natural Language Sentence Matching [4.002351785644765]
This paper presents a deep neural architecture for Natural Language Sentence Matching (NLSM) that adds a deep recursive encoder to BERT.
Our analysis of model behavior shows that BERT still does not capture the full complexity of text.
On the religious dataset, BERT achieved an accuracy of 89.70%, and the BERT-DRE architecture improved this to 90.29% on the same dataset.
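One possible reading of "a deep recursive encoder on top of BERT" is sketched below, with stacked bidirectional LSTM layers re-encoding BERT's token representations before classification; this is an illustrative guess rather than the paper's exact architecture.

```python
# Illustrative guess at stacking a recurrent encoder over BERT's outputs for
# sentence matching; not BERT-DRE's exact design.
import torch
import torch.nn as nn
from transformers import AutoModel

class BertWithRecursiveEncoder(nn.Module):
    def __init__(self, num_labels: int = 2, hidden: int = 768, depth: int = 3):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.recursive = nn.LSTM(hidden, hidden // 2, num_layers=depth,
                                 bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, **inputs):
        tokens = self.bert(**inputs).last_hidden_state      # (B, T, 768)
        encoded, _ = self.recursive(tokens)                  # deep BiLSTM re-encoding
        pooled, _ = encoded.max(dim=1)                       # max-pool over tokens
        return self.classifier(pooled)
```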
arXiv Detail & Related papers (2021-11-03T12:56:13Z)
- Bertinho: Galician BERT Representations [14.341471404165349]
This paper presents a monolingual BERT model for Galician.
We release two models, built using 6 and 12 transformer layers, respectively.
We show that our models, especially the 12-layer one, outperform the results of mBERT in most tasks.
arXiv Detail & Related papers (2021-03-25T12:51:34Z)
- BERT-JAM: Boosting BERT-Enhanced Neural Machine Translation with Joint Attention [9.366359346271567]
We propose a novel BERT-enhanced neural machine translation model called BERT-JAM.
BERT-JAM uses joint-attention modules to allow the encoder/decoder layers to dynamically allocate attention between different representations.
Our experiments show that BERT-JAM achieves SOTA BLEU scores on multiple translation tasks.
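The sketch below illustrates one way such a joint-attention module could look: a layer attends to both the NMT encoder output and the BERT representation, then mixes the two with a learned gate. The gating scheme and dimensions are assumptions, not BERT-JAM's exact formulation.

```python
# Hypothetical joint-attention module: attend to NMT encoder output and to
# BERT output, then gate between the two results.
import torch
import torch.nn as nn

class JointAttention(nn.Module):
    def __init__(self, d_model: int = 512, bert_dim: int = 768, heads: int = 8):
        super().__init__()
        self.enc_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.bert_attn = nn.MultiheadAttention(d_model, heads, batch_first=True,
                                               kdim=bert_dim, vdim=bert_dim)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, queries, enc_out, bert_out):
        a, _ = self.enc_attn(queries, enc_out, enc_out)       # attend to NMT encoder
        b, _ = self.bert_attn(queries, bert_out, bert_out)    # attend to BERT output
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        return g * a + (1 - g) * b                            # dynamic allocation

layer = JointAttention()
out = layer(torch.randn(2, 10, 512), torch.randn(2, 20, 512), torch.randn(2, 20, 768))
```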
arXiv Detail & Related papers (2020-11-09T09:30:37Z)
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
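The early-exit idea can be sketched as follows: an exit classifier after every layer, with inference stopping as soon as the prediction entropy drops below a threshold. Generic transformer layers stand in for pre-trained BERT layers here, and the threshold value is arbitrary.

```python
# Sketch of entropy-based early exiting with a classifier ("off-ramp") after
# every layer; generic layers are used for brevity.
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    def __init__(self, num_layers=12, hidden=768, heads=12, num_labels=2,
                 entropy_threshold=0.2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
            for _ in range(num_layers))
        self.exits = nn.ModuleList(
            nn.Linear(hidden, num_labels) for _ in range(num_layers))
        self.entropy_threshold = entropy_threshold

    def forward(self, x):                        # x: (batch, seq, hidden) embeddings
        for layer, exit_head in zip(self.layers, self.exits):
            x = layer(x)
            logits = exit_head(x[:, 0])          # classify from the first token
            probs = logits.softmax(dim=-1)
            entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
            if entropy < self.entropy_threshold:
                return logits                    # confident: skip the remaining layers
        return logits                            # otherwise use the full depth

logits = EarlyExitEncoder()(torch.randn(1, 16, 768))
```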
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
- What's so special about BERT's layers? A closer look at the NLP pipeline in monolingual and multilingual models [18.155121103400333]
We probe a Dutch BERT-based model and the multilingual BERT model for Dutch NLP tasks.
Through a deeper analysis of part-of-speech tagging, we show that, even within a given task, information is spread over different parts of the network.
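A generic version of this layer-wise probing recipe is sketched below: BERT is frozen, a representation is taken from every hidden layer, and a simple classifier is trained per layer. Toy sentence-level labels are used for brevity; the paper probes token-level tasks such as POS tagging in Dutch and multilingual BERT.

```python
# Generic layer-wise probing sketch: one simple classifier per frozen layer.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

sentences = ["the dog barks", "a cat sleeps", "dogs bark loudly", "cats sleep a lot"]
labels = [0, 1, 0, 1]                            # toy labels: dog vs. cat sentences

# One feature matrix per layer: mean-pooled token vectors for each sentence.
per_layer = None
for s in sentences:
    enc = tok(s, return_tensors="pt")
    with torch.no_grad():
        hs = bert(**enc, output_hidden_states=True).hidden_states
    if per_layer is None:
        per_layer = [[] for _ in hs]
    for i, h in enumerate(hs):
        per_layer[i].append(h[0].mean(dim=0))

for i, feats in enumerate(per_layer):
    X = torch.stack(feats).numpy()
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    print(f"layer {i:2d}: probe train accuracy {probe.score(X, labels):.2f}")
```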
arXiv Detail & Related papers (2020-04-14T13:41:48Z)
- DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding [90.85913515409275]
Recent studies on open-domain question answering have achieved prominent performance improvement using pre-trained language models such as BERT.
We propose DC-BERT, a contextual encoding framework that has dual BERT models: an online BERT which encodes the question only once, and an offline BERT which pre-encodes all the documents and caches their encodings.
On SQuAD Open and Natural Questions Open datasets, DC-BERT achieves 10x speedup on document retrieval, while retaining most (about 98%) of the QA performance.
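The decoupling can be sketched as below: an offline BERT pre-encodes and caches all documents, an online BERT encodes each question once, and a lightweight scorer combines the two. The dot-product scorer here is a placeholder for whatever interaction component combines the cached document encodings with the question encoding.

```python
# Sketch of decoupled encoding: cached offline document encodings plus a
# single online question encoding, combined by a placeholder scorer.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
online_bert = AutoModel.from_pretrained("bert-base-uncased").eval()   # questions
offline_bert = AutoModel.from_pretrained("bert-base-uncased").eval()  # documents

docs = ["BERT is a transformer encoder.", "Paris is the capital of France."]
with torch.no_grad():   # done once, ahead of time, and cached
    doc_cache = {d: offline_bert(**tok(d, return_tensors="pt")).last_hidden_state[:, 0]
                 for d in docs}

def rank(question: str):
    with torch.no_grad():   # the question is encoded exactly once
        q = online_bert(**tok(question, return_tensors="pt")).last_hidden_state[:, 0]
    scores = {d: float(q @ v.T) for d, v in doc_cache.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(rank("What is the capital of France?"))
```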
arXiv Detail & Related papers (2020-02-28T08:18:37Z)
- Incorporating BERT into Neural Machine Translation [251.54280200353674]
We propose a new algorithm named BERT-fused model, in which we first use BERT to extract representations for an input sequence.
We conduct experiments on supervised (including sentence-level and document-level translations), semi-supervised and unsupervised machine translation, and achieve state-of-the-art results on seven benchmark datasets.
arXiv Detail & Related papers (2020-02-17T08:13:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided (including all content) and is not responsible for any consequences of its use.