Subformer: Exploring Weight Sharing for Parameter Efficiency in
Generative Transformers
- URL: http://arxiv.org/abs/2101.00234v1
- Date: Fri, 1 Jan 2021 13:53:22 GMT
- Title: Subformer: Exploring Weight Sharing for Parameter Efficiency in
Generative Transformers
- Authors: Machel Reid, Edison Marrese-Taylor and Yutaka Matsuo
- Abstract summary: We develop the Subformer, a parameter efficient Transformer-based model.
Experiments on machine translation, abstractive summarization, and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters.
- Score: 16.88840622945725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advent of the Transformer can arguably be described as a driving force
behind many of the recent advances in natural language processing. However,
despite their sizeable performance improvements, as recently shown, the model
is severely over-parameterized, being parameter inefficient and computationally
expensive to train. Inspired by the success of parameter-sharing in pretrained
deep contextualized word representation encoders, we explore parameter-sharing
methods in Transformers, with a specific focus on encoder-decoder models for
sequence-to-sequence tasks such as neural machine translation. We perform an
analysis of different parameter sharing/reduction methods and develop the
Subformer, a parameter efficient Transformer-based model which combines the
newly proposed Sandwich-style parameter sharing technique - designed to
overcome the deficiencies in naive cross-layer parameter sharing for generative
models - and self-attentive embedding factorization (SAFE). Experiments on
machine translation, abstractive summarization, and language modeling show that
the Subformer can outperform the Transformer even when using significantly
fewer parameters.
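A minimal PyTorch sketch of the two ideas named in the abstract, under stated assumptions: "Sandwich-style" sharing is taken to mean that the first and last layers keep their own weights while all middle layers reuse one shared layer (as opposed to naive cross-layer sharing, where every layer reuses the same weights), and SAFE is taken to mean embedding tokens in a small dimension and projecting up to the model dimension with a small self-attention block. All class and argument names are hypothetical illustrations, not the authors' implementation.

# Hedged sketch: sandwich-style layer sharing + factorized, self-attentive embeddings.
import torch
import torch.nn as nn

class SandwichSharedEncoder(nn.Module):
    """First/last layers unshared; the middle layers reuse one parameter set."""
    def __init__(self, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        def make_layer():
            return nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.first = make_layer()        # own weights
        self.shared = make_layer()       # reused num_layers - 2 times
        self.last = make_layer()         # own weights
        self.num_inner = num_layers - 2

    def forward(self, x):
        x = self.first(x)
        for _ in range(self.num_inner):  # same weights applied repeatedly
            x = self.shared(x)
        return self.last(x)

class FactorizedEmbedding(nn.Module):
    """Vocab -> small dim -> self-attention -> model dim (SAFE-like sketch)."""
    def __init__(self, vocab_size=32000, d_small=128, d_model=512, nhead=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_small)
        self.attn = nn.MultiheadAttention(d_small, nhead, batch_first=True)
        self.up = nn.Linear(d_small, d_model)

    def forward(self, tokens):
        e = self.embed(tokens)                        # (B, T, d_small)
        a, _ = self.attn(e, e, e, need_weights=False)
        return self.up(a)                             # (B, T, d_model)

if __name__ == "__main__":
    tokens = torch.randint(0, 32000, (2, 16))
    model = SandwichSharedEncoder()
    embed = FactorizedEmbedding()
    out = model(embed(tokens))
    # Parameter count grows with 3 distinct layers rather than num_layers.
    print(out.shape, sum(p.numel() for p in model.parameters()))

Keeping the outermost layers unshared reflects the abstract's claim that naive cross-layer sharing has deficiencies for generative models; the sketch only illustrates where parameters are reused, not the paper's exact configuration.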
Related papers
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in compositional generalization.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- Analyzing Transformers in Embedding Space [59.434807802802105]
We present a theoretical analysis where all parameters of a trained Transformer are interpreted by projecting them into the embedding space.
We show that parameters of both pretrained and fine-tuned models can be interpreted in embedding space.
Our findings open the door to interpretation methods that, at least in part, abstract away from model specifics and operate in the embedding space only.
arXiv Detail & Related papers (2022-09-06T14:36:57Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- Parameter Efficient Multimodal Transformers for Video Representation Learning [108.8517364784009]
This work focuses on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning.
We show that our approach reduces parameters by up to 80%, allowing us to train our model end-to-end from scratch.
To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
arXiv Detail & Related papers (2020-12-08T00:16:13Z)
- Rethinking embedding coupling in pre-trained language models [46.11201932668366]
We re-evaluate the standard practice of sharing weights between input and output embeddings in pre-trained language models (a tied-versus-decoupled sketch follows this list).
We show that decoupled embeddings provide increased modeling flexibility, allowing us to significantly improve the efficiency of parameter allocation.
We are able to train models that achieve strong performance on the XTREME benchmark without increasing the number of parameters at the fine-tuning stage.
arXiv Detail & Related papers (2020-10-24T07:43:00Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
- Variational Transformers for Diverse Response Generation [71.53159402053392]
Variational Transformer (VT) is a variational self-attentive feed-forward sequence model.
VT combines the parallelizability and global receptive field computation of the Transformer with the variational nature of the CVAE.
We explore two types of VT: 1) modeling the discourse-level diversity with a global latent variable; and 2) augmenting the Transformer decoder with a sequence of fine-grained latent variables.
arXiv Detail & Related papers (2020-03-28T07:48:02Z)
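The "Rethinking embedding coupling" entry above concerns the common practice of tying the input and output embedding matrices. As a hedged illustration (hypothetical names, PyTorch; not the setup of any paper listed here), the sketch below contrasts a tied output projection, which reuses the input embedding, with a decoupled one, which adds a second vocab x d_model matrix.

# Hedged sketch: tied vs. decoupled input/output embeddings (hypothetical names).
import torch
import torch.nn as nn

class TiedLMHead(nn.Module):
    """Output logits reuse the input embedding matrix (weight tying)."""
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)

    def logits(self, hidden):                   # hidden: (B, T, d_model)
        return hidden @ self.embed.weight.t()   # (B, T, vocab_size)

class DecoupledLMHead(nn.Module):
    """Separate output matrix: more parameters, more modeling flexibility."""
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size, bias=False)

    def logits(self, hidden):
        return self.out(hidden)

if __name__ == "__main__":
    hidden = torch.randn(2, 16, 512)
    tied, decoupled = TiedLMHead(), DecoupledLMHead()
    print(tied.logits(hidden).shape,
          sum(p.numel() for p in tied.parameters()),       # one vocab x d matrix
          sum(p.numel() for p in decoupled.parameters()))  # two vocab x d matrices

The tied head carries roughly half the embedding parameters of the decoupled one; whether the extra matrix pays off depends on how that capacity is allocated, which is the question that entry examines.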
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.