CERT: Contrastive Self-supervised Learning for Language Understanding
- URL: http://arxiv.org/abs/2005.12766v2
- Date: Thu, 18 Jun 2020 12:47:18 GMT
- Title: CERT: Contrastive Self-supervised Learning for Language Understanding
- Authors: Hongchao Fang, Sicheng Wang, Meng Zhou, Jiayuan Ding, Pengtao Xie
- Abstract summary: We propose CERT: Contrastive self-supervised Representations from Transformers.
CERT pretrains language representation models using contrastive self-supervised learning at the sentence level.
We evaluate CERT on 11 natural language understanding tasks in the GLUE benchmark where CERT outperforms BERT on 7 tasks, achieves the same performance as BERT on 2 tasks, and performs worse than BERT on 2 tasks.
- Score: 20.17416958052909
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained language models such as BERT, GPT have shown great effectiveness
in language understanding. The auxiliary predictive tasks in existing
pretraining approaches are mostly defined on tokens, thus may not be able to
capture sentence-level semantics very well. To address this issue, we propose
CERT: Contrastive self-supervised Encoder Representations from Transformers,
which pretrains language representation models using contrastive
self-supervised learning at the sentence level. CERT creates augmentations of
original sentences using back-translation. Then it finetunes a pretrained
language encoder (e.g., BERT) by predicting whether two augmented sentences
originate from the same sentence. CERT is simple to use and can be flexibly
plugged into any pretraining-finetuning NLP pipeline. We evaluate CERT on 11
natural language understanding tasks in the GLUE benchmark where CERT
outperforms BERT on 7 tasks, achieves the same performance as BERT on 2 tasks,
and performs worse than BERT on 2 tasks. On the averaged score of the 11 tasks,
CERT outperforms BERT. The data and code are available at
https://github.com/UCSD-AI4H/CERT
Related papers
- Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study [68.75670223005716]
We find that pre-trained language models like BERT have a potential ability to learn sequentially, even without any sparse memory replay.
Our experiments reveal that BERT can actually generate high quality representations for previously learned tasks in a long term, under extremely sparse replay or even no replay.
arXiv Detail & Related papers (2023-03-02T09:03:43Z) - PERT: Pre-training BERT with Permuted Language Model [24.92527883997854]
PERT is an auto-encoding model (like BERT) trained with Permuted Language Model (PerLM)
We permute a proportion of the input text, and the training objective is to predict the position of the original token.
We carried out extensive experiments on both Chinese and English NLU benchmarks.
arXiv Detail & Related papers (2022-03-14T07:58:34Z) - PromptBERT: Improving BERT Sentence Embeddings with Prompts [95.45347849834765]
We propose a prompt based sentence embeddings method which can reduce token embeddings biases and make the original BERT layers more effective.
We also propose a novel unsupervised training objective by the technology of template denoising, which substantially shortens the performance gap between the supervised and unsupervised setting.
Our fine-tuned method outperforms the state-of-the-art method SimCSE in both unsupervised and supervised settings.
arXiv Detail & Related papers (2022-01-12T06:54:21Z) - ConSERT: A Contrastive Framework for Self-Supervised Sentence
Representation Transfer [19.643512923368743]
We present ConSERT, a Contrastive Framework for Self-Supervised Sentence Representation Transfer.
By making use of unlabeled texts, ConSERT solves the collapse issue of BERT-derived sentence representations.
Experiments on STS datasets demonstrate that ConSERT achieves an 8% relative improvement over the previous state-of-the-art.
arXiv Detail & Related papers (2021-05-25T08:15:01Z) - Looking for Clues of Language in Multilingual BERT to Improve
Cross-lingual Generalization [56.87201892585477]
Token embeddings in multilingual BERT (m-BERT) contain both language and semantic information.
We control the output languages of multilingual BERT by manipulating the token embeddings.
arXiv Detail & Related papers (2020-10-20T05:41:35Z) - Incorporating BERT into Parallel Sequence Decoding with Adapters [82.65608966202396]
We propose to take two different BERT models as the encoder and decoder respectively, and fine-tune them by introducing simple and lightweight adapter modules.
We obtain a flexible and efficient model which is able to jointly leverage the information contained in the source-side and target-side BERT models.
Our framework is based on a parallel sequence decoding algorithm named Mask-Predict considering the bi-directional and conditional independent nature of BERT.
arXiv Detail & Related papers (2020-10-13T03:25:15Z) - AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization [13.082435183692393]
We propose a novel pre-trained language model, referred to as AMBERT (A Multi-grained BERT)
For English, AMBERT takes both the sequence of words (fine-grained tokens) and the sequence of phrases (coarse-grained tokens) as input after tokenization.
Experiments have been conducted on benchmark datasets for Chinese and English, including CLUE, GLUE, SQuAD and RACE.
arXiv Detail & Related papers (2020-08-27T00:23:48Z) - Exploring Cross-sentence Contexts for Named Entity Recognition with BERT [1.4998865865537996]
We present a study exploring the use of cross-sentence information for NER using BERT models in five languages.
We find that adding context in the form of additional sentences to BERT input increases NER performance on all of the tested languages and models.
We propose a straightforward method, Contextual Majority Voting (CMV), to combine different predictions for sentences and demonstrate this to further increase NER performance with BERT.
arXiv Detail & Related papers (2020-06-02T12:34:52Z) - Incorporating BERT into Neural Machine Translation [251.54280200353674]
We propose a new algorithm named BERT-fused model, in which we first use BERT to extract representations for an input sequence.
We conduct experiments on supervised (including sentence-level and document-level translations), semi-supervised and unsupervised machine translation, and achieve state-of-the-art results on seven benchmark datasets.
arXiv Detail & Related papers (2020-02-17T08:13:36Z) - BERT's output layer recognizes all hidden layers? Some Intriguing
Phenomena and a simple way to boost BERT [53.63288887672302]
Bidirectional Representations from Transformers (BERT) have achieved tremendous success in many natural language processing (NLP) tasks.
We find that surprisingly the output layer of BERT can reconstruct the input sentence by directly taking each layer of BERT as input.
We propose a quite simple method to boost the performance of BERT.
arXiv Detail & Related papers (2020-01-25T13:35:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.