Nonparametric Variational Regularisation of Pretrained Transformers
- URL: http://arxiv.org/abs/2312.00662v1
- Date: Fri, 1 Dec 2023 15:40:30 GMT
- Title: Nonparametric Variational Regularisation of Pretrained Transformers
- Authors: Fabio Fehr, James Henderson
- Abstract summary: We propose Nonparametric Variational Information Bottleneck (NVIB) as a regulariser for training cross-attention in Transformers.
We show that changing the initialisation introduces a novel, information-theoretic post-training regularisation in the attention mechanism.
- Score: 15.313475675235843
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The current paradigm of large-scale pre-training and fine-tuning Transformer large language models has led to significant improvements across the board in natural language processing. However, such large models are susceptible to overfitting to their training data, and as a result they perform poorly when the domain changes. Moreover, because of the models' scale, fine-tuning to a new domain is costly. Nonparametric Variational Information Bottleneck (NVIB) has been proposed as a regulariser for training cross-attention in Transformers, potentially addressing the overfitting problem. We extend the NVIB framework to replace all types of attention functions in Transformers, and show that existing pretrained Transformers can be reinterpreted as Nonparametric Variational (NV) models using a proposed identity initialisation. We then show that changing the initialisation introduces a novel, information-theoretic post-training regularisation in the attention mechanism, which improves out-of-domain generalisation without any training. This success supports the hypothesis that pretrained Transformers are implicitly NV Bayesian models.
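As a rough illustration of the reinterpretation the abstract describes, the sketch below treats standard scaled dot-product attention as the special case of a "denoising" attention that also attends to a single prior pseudo-token. The prior weight `kappa_prior`, the mean-value prior vector, and the function names are assumptions for illustration only, not the paper's exact NVIB formulation: with `kappa_prior = 0` the sketch reduces to ordinary attention (the "identity initialisation" idea), while increasing it shifts attention mass toward the prior, acting as a post-training regulariser without any retraining.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def denoising_attention(Q, K, V, kappa_prior=0.0, prior_value=None):
    """Illustrative NVIB-style attention sketch (not the paper's exact maths).

    Standard attention over keys K is augmented with one 'prior' pseudo-token
    whose unnormalised weight is kappa_prior. kappa_prior = 0 recovers
    ordinary scaled dot-product attention (the 'identity initialisation');
    kappa_prior > 0 pulls attention mass toward an assumed prior value vector,
    acting as a post-training regulariser.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (n_q, n_k)
    if prior_value is None:
        prior_value = V.mean(axis=0)                   # assumed neutral prior
    # Append the prior pseudo-token's log-weight as an extra attention score.
    prior_score = np.full((scores.shape[0], 1),
                          np.log(kappa_prior) if kappa_prior > 0 else -np.inf)
    weights = softmax(np.concatenate([scores, prior_score], axis=-1))
    values = np.vstack([V, prior_value[None, :]])
    return weights @ values

# Usage: kappa_prior=0 matches ordinary attention exactly;
# raising kappa_prior smooths the output toward the prior value.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
out_plain = denoising_attention(Q, K, V, kappa_prior=0.0)
out_reg   = denoising_attention(Q, K, V, kappa_prior=1.0)
```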
Related papers
- Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained Transformers by learning a linear map from the parameters of a smaller model to initialize a larger one.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% computational cost of training from scratch.
arXiv Detail & Related papers (2023-03-02T05:21:18Z)
- Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
- On the Effect of Pre-training for Transformer in Different Modality on Offline Reinforcement Learning [0.0]
We investigate how pre-training on data of different modalities, such as language and vision, affects fine-tuning of Transformer-based models to Mujoco offline reinforcement learning tasks.
arXiv Detail & Related papers (2022-11-17T13:34:08Z)
- Leveraging Pre-trained Models for Failure Analysis Triplets Generation [0.0]
We leverage the attention mechanism of pre-trained causal language models such as the Transformer for the downstream task of generating Failure Analysis Triplets (FATs).
We observe that Generative Pre-trained Transformer 2 (GPT2) outperformed other transformer models on the failure analysis triplet generation (FATG) task.
In particular, we observe that GPT2 (with 1.5B parameters) outperforms pre-trained BERT, BART and GPT3 by a large margin on ROUGE.
arXiv Detail & Related papers (2022-10-31T17:21:15Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from the 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- Variational Transformers for Diverse Response Generation [71.53159402053392]
Variational Transformer (VT) is a variational self-attentive feed-forward sequence model.
VT combines the parallelizability and global receptive field computation of the Transformer with the variational nature of the CVAE.
We explore two types of VT: 1) modeling the discourse-level diversity with a global latent variable; and 2) augmenting the Transformer decoder with a sequence of fine-grained latent variables.
arXiv Detail & Related papers (2020-03-28T07:48:02Z)