LV-BERT: Exploiting Layer Variety for BERT
- URL: http://arxiv.org/abs/2106.11740v1
- Date: Tue, 22 Jun 2021 13:20:14 GMT
- Title: LV-BERT: Exploiting Layer Variety for BERT
- Authors: Weihao Yu, Zihang Jiang, Fei Chen, Qibin Hou and Jiashi Feng
- Abstract summary: We introduce convolution into the layer type set, which is experimentally found beneficial to pre-trained models.
We then adopt an evolutionary algorithm guided by pre-training accuracy to find the optimal architecture.
The LV-BERT model obtained by our method outperforms BERT and its variants on various downstream tasks.
- Score: 85.27287501885807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern pre-trained language models are mostly built upon backbones stacking
self-attention and feed-forward layers in an interleaved order. In this paper,
beyond this stereotyped layer pattern, we aim to improve pre-trained models by
exploiting layer variety from two aspects: the layer type set and the layer
order. Specifically, besides the original self-attention and feed-forward
layers, we introduce convolution into the layer type set, which is
experimentally found beneficial to pre-trained models. Furthermore, beyond the
original interleaved order, we explore more layer orders to discover more
powerful architectures. However, the introduced layer variety leads to an
architecture space of billions of candidates, while training a single candidate
model from scratch already requires a huge computational cost, making it
unaffordable to search such a space by directly training large numbers of
candidate models. To solve this problem, we first pre-train a supernet from
which the weights of all candidate models can be inherited, and then adopt an
evolutionary algorithm guided by pre-training accuracy to find the optimal
architecture. Extensive experiments show that the LV-BERT model obtained by our
method outperforms BERT and its variants on various downstream tasks. For
example, LV-BERT-small achieves 78.8 on the GLUE testing set, 1.8 higher than
the strong baseline ELECTRA-small.
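The supernet-plus-evolutionary-search procedure in the abstract lends itself to a compact illustration. The sketch below is a minimal, generic evolutionary loop over layer-order sequences drawn from the {self-attention, feed-forward, convolution} type set; the `proxy_score` function is a hypothetical stand-in for pre-training accuracy measured with weights inherited from the supernet, not the paper's actual evaluation.

```python
import random

LAYER_TYPES = ["self_attention", "feed_forward", "convolution"]  # layer type set

def random_architecture(num_layers=12):
    """A candidate is simply an ordered list of layer types."""
    return [random.choice(LAYER_TYPES) for _ in range(num_layers)]

def mutate(arch, prob=0.2):
    """Flip each position to another layer type with a small probability."""
    return [random.choice(LAYER_TYPES) if random.random() < prob else t for t in arch]

def proxy_score(arch):
    """Placeholder for pre-training accuracy evaluated with weights inherited
    from the supernet (no training from scratch). Here it is only a toy
    heuristic that rewards using all three layer types."""
    return len(set(arch)) + random.random() * 0.1

def evolutionary_search(population_size=20, generations=10, num_layers=12):
    population = [random_architecture(num_layers) for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=proxy_score, reverse=True)
        parents = ranked[: population_size // 2]           # keep the top half
        children = [mutate(random.choice(parents)) for _ in parents]
        population = parents + children                    # next generation
    return max(population, key=proxy_score)

if __name__ == "__main__":
    print(evolutionary_search())
```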
Related papers
- Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification [7.005068872406135]
Recent advancements in automatic speaker verification (ASV) studies have been achieved by leveraging large-scale pretrained networks.
We present a novel approach for exploiting the multilayered nature of pretrained models for ASV.
We show how the proposed interlayer processing aids in maximizing the advantage of utilizing pretrained models.
arXiv Detail & Related papers (2024-09-12T05:55:32Z)
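The Universal Pooling entry above combines features from multiple layers of a pretrained network for speaker verification but does not spell out the interlayer processing. Below is a minimal sketch of one common pattern, learnable per-layer weights followed by mean/std statistics pooling; the module name and tensor shapes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LayerwiseStatsPooling(nn.Module):
    """Generic sketch: learn a softmax weight per encoder layer, fuse the
    per-layer frame features, then apply mean/std statistics pooling to get a
    fixed-size utterance-level embedding for speaker verification."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_feats: torch.Tensor) -> torch.Tensor:
        # layer_feats: (num_layers, batch, time, feat_dim) from a pretrained encoder
        w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        fused = (w * layer_feats).sum(dim=0)        # (batch, time, feat_dim)
        return torch.cat([fused.mean(dim=1), fused.std(dim=1)], dim=-1)

# Random features standing in for a pretrained encoder's layer outputs.
pooling = LayerwiseStatsPooling(num_layers=13)   # 12 blocks + input embedding
feats = torch.randn(13, 2, 200, 768)
print(pooling(feats).shape)                      # torch.Size([2, 1536])
```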
- Inheritune: Training Smaller Yet More Attentive Language Models [61.363259848264725]
Inheritune is a simple yet effective training recipe for developing smaller, high-performing language models.
We demonstrate that Inheritune enables the training of various sizes of GPT-2 models on datasets like OpenWebText-9B and FineWeb_edu.
arXiv Detail & Related papers (2024-04-12T17:53:34Z)
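The Inheritune summary above does not state the recipe itself; one reading of it, assumed here, is that a smaller model is initialized by inheriting the first few transformer blocks of a larger reference model before continued training. The sketch below illustrates that kind of layer inheritance with Hugging Face GPT-2 classes and should not be taken as the paper's exact procedure.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Load a larger reference model and build a smaller one with fewer blocks.
reference = GPT2LMHeadModel.from_pretrained("gpt2")          # 12 blocks
small_cfg = GPT2Config.from_pretrained("gpt2", n_layer=6)    # other dims unchanged
small = GPT2LMHeadModel(small_cfg)

# Inherit token/position embeddings and the first n_layer transformer blocks.
small.transformer.wte.load_state_dict(reference.transformer.wte.state_dict())
small.transformer.wpe.load_state_dict(reference.transformer.wpe.state_dict())
for i in range(small_cfg.n_layer):
    small.transformer.h[i].load_state_dict(reference.transformer.h[i].state_dict())
small.transformer.ln_f.load_state_dict(reference.transformer.ln_f.state_dict())

# `small` would then be trained further on the target corpus.
```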
- Layer-wise Linear Mode Connectivity [52.6945036534469]
Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models.
It is most prominently used in federated learning.
We analyse the performance of the models that result from averaging single layers, or groups of layers.
arXiv Detail & Related papers (2023-07-13T09:39:10Z)
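The Layer-wise Linear Mode Connectivity entry above concerns averaging the parameters of two independently trained models. A minimal sketch of parameter averaging follows; restricting the merged keys to a single layer or a group of layers would reproduce the layer-wise setting. The helper name and the interpolation weight `alpha` are illustrative choices, not the paper's notation.

```python
import copy
import torch
import torch.nn as nn

def average_models(model_a: nn.Module, model_b: nn.Module, alpha: float = 0.5) -> nn.Module:
    """Return a new model whose parameters are the element-wise interpolation
    alpha * A + (1 - alpha) * B, computed parameter-by-parameter (and hence
    layer-by-layer). Both models must share the same architecture."""
    averaged = copy.deepcopy(model_a)
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    merged = {k: alpha * state_a[k] + (1 - alpha) * state_b[k] for k in state_a}
    averaged.load_state_dict(merged)
    return averaged

# Two independently trained copies of the same architecture (random here).
net1 = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
net2 = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
fused = average_models(net1, net2)
print(fused(torch.randn(1, 16)).shape)  # torch.Size([1, 4])
```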
- Improving Reliability of Fine-tuning with Block-wise Optimisation [6.83082949264991]
Fine-tuning can be used to tackle domain-specific tasks by transferring knowledge from a pre-trained model.
We propose a novel block-wise optimization mechanism, which adapts the weights of a group of layers of a pre-trained model.
The proposed approaches are tested on an often-used dataset, Tf_flower.
arXiv Detail & Related papers (2023-01-15T16:20:18Z)
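The block-wise optimisation entry above adapts the weights of a group of layers of a pre-trained model. The sketch below shows the generic mechanics, freezing everything and re-enabling gradients for selected top-level blocks; the helper name and the toy backbone are assumptions, not the paper's method.

```python
import torch.nn as nn

def unfreeze_block(model: nn.Module, block_names: set) -> None:
    """Freeze every parameter, then re-enable gradients only for the named
    top-level blocks, so fine-tuning adapts one group of layers at a time."""
    for param in model.parameters():
        param.requires_grad = False
    for name, module in model.named_children():
        if name in block_names:
            for param in module.parameters():
                param.requires_grad = True

# A stand-in for a pre-trained backbone with a classification head.
backbone = nn.Sequential()
backbone.add_module("block1", nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU()))
backbone.add_module("block2", nn.Sequential(nn.Conv2d(16, 32, 3), nn.ReLU()))
backbone.add_module("pool", nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten()))
backbone.add_module("head", nn.Linear(32, 5))

unfreeze_block(backbone, {"block2", "head"})   # adapt only one block plus the head
print([n for n, p in backbone.named_parameters() if p.requires_grad])
```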
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of roughly half their sizes.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
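The bert2BERT entry above transfers a smaller pre-trained model's knowledge into a larger one. Its actual function-preserving and advanced knowledge initialization cover attention, layer norm, and depth; the sketch below shows only the core idea on a single feed-forward pair, in Net2Net style, and is a simplification rather than the paper's implementation.

```python
import torch
import torch.nn as nn

def widen_hidden(fc1: nn.Linear, fc2: nn.Linear, new_hidden: int):
    """Net2Net-style width expansion: grow the hidden size between fc1 and fc2
    by duplicating units and scaling the duplicated outgoing weights so the
    overall function is unchanged."""
    old_hidden = fc1.out_features
    mapping = torch.cat([torch.arange(old_hidden),
                         torch.randint(0, old_hidden, (new_hidden - old_hidden,))])
    counts = torch.bincount(mapping, minlength=old_hidden).float()

    wide1 = nn.Linear(fc1.in_features, new_hidden)
    wide2 = nn.Linear(new_hidden, fc2.out_features)
    with torch.no_grad():
        wide1.weight.copy_(fc1.weight[mapping])
        wide1.bias.copy_(fc1.bias[mapping])
        wide2.weight.copy_(fc2.weight[:, mapping] / counts[mapping])
        wide2.bias.copy_(fc2.bias)
    return wide1, wide2

# Sanity check: the widened pair computes the same function as the original.
fc1, fc2 = nn.Linear(8, 16), nn.Linear(16, 4)
w1, w2 = widen_hidden(fc1, fc2, new_hidden=24)
x = torch.randn(2, 8)
print(torch.allclose(fc2(torch.relu(fc1(x))), w2(torch.relu(w1(x))), atol=1e-6))
```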
- AutoBERT-Zero: Evolving BERT Backbone from Scratch [94.89102524181986]
We propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures.
We optimize both the search algorithm and evaluation of candidate models to boost the efficiency of our proposed OP-NAS.
Experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities in various downstream tasks.
arXiv Detail & Related papers (2021-07-15T16:46:01Z)
- Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models.
Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
arXiv Detail & Related papers (2021-06-18T01:03:13Z)
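The grow-and-prune entry above alternates between densifying and sparsifying parts of the model during training. The sketch below is a generic magnitude-prune/random-regrow cycle on a single layer; the 80% sparsity target matches the number quoted in the summary, but the schedule and growth rule are assumptions, not the paper's.

```python
import torch
import torch.nn as nn

def prune_by_magnitude(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a {0,1} mask that keeps the largest-magnitude entries."""
    keep = int(weight.numel() * (1.0 - sparsity))
    threshold = weight.abs().flatten().kthvalue(weight.numel() - keep + 1).values
    return (weight.abs() >= threshold).float()

def grow_random(mask: torch.Tensor, grow_fraction: float) -> torch.Tensor:
    """Re-activate a random fraction of the currently pruned positions."""
    pruned = (mask == 0)
    regrow = (torch.rand_like(mask) < grow_fraction) & pruned
    return mask + regrow.float()

layer = nn.Linear(64, 64)
mask = prune_by_magnitude(layer.weight.data, sparsity=0.8)
for _ in range(3):                       # a few grow-and-prune cycles
    mask = grow_random(mask, grow_fraction=0.05)
    layer.weight.data.mul_(mask)         # in practice, train for a while here
    mask = prune_by_magnitude(layer.weight.data, sparsity=0.8)
print(f"final sparsity: {1.0 - mask.mean().item():.2f}")
```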
- Training with Multi-Layer Embeddings for Model Reduction [0.9046327456472286]
We introduce a multi-layer embedding training architecture that trains embeddings via a sequence of linear layers.
We show that it allows reducing the embedding dimension d by 4-8X, with a corresponding improvement in memory footprint, at a given model accuracy.
arXiv Detail & Related papers (2020-06-10T02:47:40Z)
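The multi-layer embedding entry above trains embeddings through a sequence of linear layers so that the stored dimension d can shrink by 4-8X. The sketch below shows one way such a factorized table might look; the layer sizes and the absence of nonlinearities between the linear layers are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiLayerEmbedding(nn.Module):
    """Sketch of a factorized embedding: store a small per-row dimension and
    recover the target dimension with a sequence of linear layers, shrinking
    the table roughly by target_dim / small_dim."""

    def __init__(self, num_rows: int, small_dim: int, target_dim: int):
        super().__init__()
        self.table = nn.Embedding(num_rows, small_dim)
        self.project = nn.Sequential(
            nn.Linear(small_dim, target_dim // 2),
            nn.Linear(target_dim // 2, target_dim),
        )

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.project(self.table(ids))

# 1M rows stored at d=16 instead of d=128: roughly an 8x smaller table.
emb = MultiLayerEmbedding(num_rows=1_000_000, small_dim=16, target_dim=128)
print(emb(torch.tensor([3, 42, 7])).shape)  # torch.Size([3, 128])
```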