What's so special about BERT's layers? A closer look at the NLP pipeline in monolingual and multilingual models
- URL: http://arxiv.org/abs/2004.06499v2
- Date: Mon, 12 Oct 2020 11:51:34 GMT
- Title: What's so special about BERT's layers? A closer look at the NLP pipeline in monolingual and multilingual models
- Authors: Wietse de Vries, Andreas van Cranenburgh and Malvina Nissim
- Abstract summary: We probe a Dutch BERT-based model and the multilingual BERT model for Dutch NLP tasks.
Through a deeper analysis of part-of-speech tagging, we show that, even within a given task, information is spread over different parts of the network.
- Score: 18.155121103400333
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Peeking into the inner workings of BERT has shown that its layers resemble
the classical NLP pipeline, with progressively more complex tasks being
concentrated in later layers. To investigate to what extent these results also
hold for a language other than English, we probe a Dutch BERT-based model and
the multilingual BERT model for Dutch NLP tasks. In addition, through a deeper
analysis of part-of-speech tagging, we show that, even within a given task,
information is spread over different parts of the network and the pipeline
might not be as neat as it seems. Each layer has different specialisations, so
it may be more useful to combine information from different layers instead of
selecting a single one based on the best overall performance.
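Since the abstract hinges on reading representations out of individual layers and on combining them, here is a minimal sketch of both ideas in Python with the Hugging Face transformers library. It is not the authors' code: the Dutch checkpoint name (GroNLP/bert-base-dutch-cased, i.e. BERTje) and the ELMo-style scalar mix are assumptions used purely for illustration.

```python
# A minimal sketch (not the paper's code) of the two probing ideas above:
# (1) read out every hidden layer separately, (2) combine all layers with a
# learned scalar mix instead of committing to a single "best" layer.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "GroNLP/bert-base-dutch-cased"  # assumed Dutch BERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

sentence = "De kat zit op de mat."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Tuple of (num_layers + 1) tensors, each (batch, seq_len, hidden): the
# embedding layer plus all 12 transformer layers. Stack them for convenience.
layers = torch.stack(out.hidden_states)          # (13, 1, seq_len, 768)

# (1) Single-layer probing: a separate linear classifier per layer would be
# trained on these representations, e.g. for POS tagging.
per_layer_features = [layers[i] for i in range(layers.size(0))]


class ScalarMix(torch.nn.Module):
    """ELMo-style learned convex combination of all layers (assumed variant)."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = torch.nn.Parameter(torch.zeros(num_layers))
        self.gamma = torch.nn.Parameter(torch.ones(()))

    def forward(self, stacked):                  # (num_layers, batch, seq, hidden)
        w = torch.softmax(self.weights, dim=0)
        return self.gamma * (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)


# (2) Combine information from all layers; the learned softmax weights show
# which layers a task actually relies on, instead of picking one layer by
# overall probe accuracy.
mix = ScalarMix(num_layers=layers.size(0))
combined = mix(layers)                           # (1, seq_len, 768)
print(combined.shape)
```

A per-layer probe would train one classifier on each entry of per_layer_features; the scalar mix replaces that hard choice with task-specific weights over all layers.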
Related papers
- Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study [68.75670223005716]
We find that pre-trained language models like BERT have the potential to learn sequentially, even without any sparse memory replay.
Our experiments reveal that BERT can generate high-quality representations for previously learned tasks over the long term, under extremely sparse replay or even no replay.
arXiv Detail & Related papers (2023-03-02T09:03:43Z)
- PERT: Pre-training BERT with Permuted Language Model [24.92527883997854]
PERT is an auto-encoding model (like BERT) trained with a Permuted Language Model (PerLM) objective.
We permute a proportion of the input text, and the training objective is to predict the position of the original token (a toy sketch of this objective appears after this list).
We carried out extensive experiments on both Chinese and English NLU benchmarks.
arXiv Detail & Related papers (2022-03-14T07:58:34Z)
- BERT for Sentiment Analysis: Pre-trained and Fine-Tuned Alternatives [0.0]
BERT has revolutionized the NLP field by enabling transfer learning with large language models.
This article studies how to better cope with the different embeddings provided by the BERT output layer, and the use of language-specific rather than multilingual models.
arXiv Detail & Related papers (2022-01-10T15:05:05Z)
- Bertinho: Galician BERT Representations [14.341471404165349]
This paper presents a monolingual BERT model for Galician.
We release two models, built using 6 and 12 transformer layers, respectively.
We show that our models, especially the 12-layer one, outperform the results of mBERT in most tasks.
arXiv Detail & Related papers (2021-03-25T12:51:34Z)
- Deep Clustering of Text Representations for Supervision-free Probing of Syntax [51.904014754864875]
We consider part of speech induction (POSI) and constituency labelling (CoLab) in this work.
We find that Multilingual BERT (mBERT) contains a surprising amount of syntactic knowledge of English.
We report competitive performance of our probe on 45-tag English POSI, state-of-the-art performance on 12-tag POSI across 10 languages, and competitive results on CoLab.
arXiv Detail & Related papers (2020-10-24T05:06:29Z)
- It's not Greek to mBERT: Inducing Word-Level Translations from Multilingual BERT [54.84185432755821]
Multilingual BERT (mBERT) learns rich cross-lingual representations that allow for transfer across languages.
We study the word-level translation information embedded in mBERT and present two simple methods that expose remarkable translation capabilities with no fine-tuning.
arXiv Detail & Related papers (2020-10-16T09:49:32Z)
- Identifying Necessary Elements for BERT's Multilinguality [4.822598110892846]
Multilingual BERT (mBERT) yields high-quality multilingual representations and enables effective zero-shot transfer.
We aim to identify architectural properties of BERT and linguistic properties of languages that are necessary for BERT to become multilingual.
arXiv Detail & Related papers (2020-05-01T14:27:14Z)
- A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors for transferability.
There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)
- Incorporating BERT into Neural Machine Translation [251.54280200353674]
We propose a new algorithm named BERT-fused model, in which we first use BERT to extract representations for an input sequence.
We conduct experiments on supervised (including sentence-level and document-level translations), semi-supervised and unsupervised machine translation, and achieve state-of-the-art results on seven benchmark datasets.
arXiv Detail & Related papers (2020-02-17T08:13:36Z)
- BERT's output layer recognizes all hidden layers? Some Intriguing Phenomena and a simple way to boost BERT [53.63288887672302]
Bidirectional Encoder Representations from Transformers (BERT) have achieved tremendous success in many natural language processing (NLP) tasks.
We find that, surprisingly, the output layer of BERT can reconstruct the input sentence when each hidden layer of BERT is fed to it directly as input.
We propose a quite simple method to boost the performance of BERT.
arXiv Detail & Related papers (2020-01-25T13:35:34Z)
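As referenced in the PERT entry above, here is a toy sketch of how a permuted-language-model training example might be constructed. It is based only on the one-sentence summary given there; the function name, the 15% permutation ratio, and the -100 ignore label are assumptions, and PERT's actual pre-training recipe may differ.

```python
# Toy sketch of the permuted-language-model idea: shuffle a proportion of
# token positions and ask the model to predict, for each affected original
# position, where its token now sits in the permuted sequence.
import random


def permute_for_perlm(token_ids, permute_ratio=0.15, seed=0):
    """Return (permuted_ids, target_positions) for a single sequence.

    target_positions[i] is the index (in the permuted sequence) that holds the
    token originally at position i, or -100 (ignored by the loss) if position
    i was not part of the shuffled subset.
    """
    rng = random.Random(seed)
    n = len(token_ids)
    k = max(2, int(n * permute_ratio))           # need at least 2 to permute
    chosen = sorted(rng.sample(range(n), k))     # positions to shuffle
    shuffled = chosen[:]
    rng.shuffle(shuffled)

    permuted = list(token_ids)
    targets = [-100] * n                         # -100 = ignore in the loss
    for src, dst in zip(chosen, shuffled):
        permuted[dst] = token_ids[src]           # original token moves to dst
        targets[src] = dst                       # label: where it ended up

    return permuted, targets


ids = [101, 2023, 2003, 2019, 2742, 6251, 102]   # toy token ids
permuted, targets = permute_for_perlm(ids)
print(permuted)
print(targets)  # a position-classification head would be trained on these
```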